On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> Wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks
> like I totally misdiagnosed things.
>
> May I pose two questions:
> 1) How did you view all the outlinks?

bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser <local_file>

> 2) How severe is NUTCH-119? Does it occur on a lot of sites?

AFAIK, HtmlParser doesn't extract URLs with regexps. Nutch only uses a
regexp to extract outlinks from files that have no markup information
(such as plain text). See OutlinkExtractor.java.
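
To make the plain-text case concrete, here is a rough sketch of
regexp-based link extraction. It is not the code or the pattern from
OutlinkExtractor.java; the class name, method name, and pattern are all
made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; NOT Nutch's OutlinkExtractor.
final class PlainTextOutlinkSketch {

    // Hypothetical, deliberately simple pattern: an http or https scheme
    // followed by a run of characters that are not whitespace, quotes,
    // or angle brackets. Trailing punctuation is not trimmed.
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match of URL_PATTERN in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

In this sketch, a bare relative name such as about.html in the input
text is never returned, because the pattern is keyed on the scheme and
plain text carries no markup to tell a link from an ordinary word.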


>
>
> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, June 26, 2007 10:56:32 PM
> Subject: Re: NUTCH-119 :: how hard to fix
>
> On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> > I am evaluating nutch+lucene as a crawl and search solution.
> >
> > However, I am finding major bugs in nutch right off the bat.
> >
> > In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
> > discussion of it here:
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
> >
> > Most of the links off www.variety.com, one of my main test sites, have 
> > relative URLs.  It seems incredible that nutch, which is capable of 
> > mapreduce, cannot fetch these URLs.
> >
> > It could be that I would fix this bug if, for other reasons, I decide to go 
> > with nutch+lucene.  Has anyone tried fixing this problem?  Is it 
> > intractable?  Or are the developers, who are just volunteers anyway, more 
> > interested in fixing other problems?
> >
> > Could someone outline the issue for me a bit more clearly so I would know 
> > how to evaluate it?
>
> Both this one and the other site you were mentioning (sf911truth) have
> more than 100 outlinks. Nutch, by default, only stores 100 outlinks
> per page (db.max.outlinks.per.page). The link to about.html happens to
> be the 105th link or so, so nutch doesn't store it. All you have to do
> is either increase db.max.outlinks.per.page or set it to -1 (which
> means store all outlinks).
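
For reference, a minimal sketch of that override. I am assuming it goes
in conf/nutch-site.xml, the usual place for local overrides of
nutch-default.xml; adjust if your setup differs:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <!-- Default is 100; -1 means store all outlinks -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
```

A larger positive value (say 500, picked arbitrarily here) works the
same way if you would rather keep a cap.
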
>
>
>
> --
> Doğacan Güney


-- 
Doğacan Güney
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
