On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks
> like I totally mis-diagnosed things.
>
> May I pose two questions:
> 1) how did you view all the outlinks?

bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser <local_file>

> 2) how severe is NUTCH-119 - does it occur on a lot of sites?

AFAIK, HtmlParser doesn't extract URLs with regexps. Nutch uses a regexp
to extract outlinks from files that have no markup information (such as
plain text). See OutlinkExtractor.java.

> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, June 26, 2007 10:56:32 PM
> Subject: Re: NUTCH-119 :: how hard to fix
>
> On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> > I am evaluating nutch+lucene as a crawl and search solution.
> >
> > However, I am finding major bugs in nutch right off the bat.
> >
> > In particular, NUTCH-119: nutch is not crawling relative URLs. I have some
> > discussion of it here:
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
> >
> > Most of the links off www.variety.com, one of my main test sites, have
> > relative URLs. It seems incredible that nutch, which is capable of
> > mapreduce, cannot fetch these URLs.
> >
> > It could be that I would fix this bug if, for other reasons, I decide to go
> > with nutch+lucene. Has anyone tried fixing this problem? Is it
> > intractable? Or are the developers, who are just volunteers anyway, more
> > interested in fixing other problems?
> >
> > Could someone outline the issue for me a bit more clearly so I would know
> > how to evaluate it?
>
> Both this one and the other site you were mentioning (sf911truth) have
> more than 100 outlinks. Nutch, by default, stores only 100 outlinks
> per page (db.max.outlinks.per.page). The link to about.html happens to be
> the 105th link or so, so nutch doesn't store it. All you have to do is
> either increase db.max.outlinks.per.page or set it to -1 (which
> means store all outlinks).
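
For reference, a minimal sketch of the override described above. It assumes
the stock layout, where local settings go in conf/nutch-site.xml and take
precedence over conf/nutch-default.xml; only the property name and the -1
value come from this thread, the rest is the usual config boilerplate.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides (file location assumed from the stock layout) -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 means store all outlinks; a positive number raises the cap instead (default is 100) -->
    <value>-1</value>
  </property>
</configuration>
```

If storing every outlink turns out to be too much on link-heavy pages, a
larger positive value is the other option mentioned above.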
> --
> Doğacan Güney

--
Doğacan Güney

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers