On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I am evaluating nutch+lucene as a crawl and search solution.
>
> However, I am finding major bugs in nutch right off the bat.
>
> In particular, NUTCH-119: nutch is not crawling relative URLs. I have some
> discussion of it here:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
>
> Most of the links off www.variety.com, one of my main test sites, have
> relative URLs. It seems incredible that nutch, which is capable of
> mapreduce, cannot fetch these URLs.
>
> It could be that I would fix this bug if, for other reasons, I decide to go
> with nutch+lucene. Has anyone tried fixing this problem? Is it intractable?
> Or are the developers, who are just volunteers anyway, more interested in
> fixing other problems?
>
> Could someone outline the issue for me a bit more clearly so I would know how
> to evaluate it?

Both this site and the other one you mentioned (sf911truth) have more than
100 outlinks. By default, Nutch stores only 100 outlinks per page
(db.max.outlinks.per.page). The link to about.html happens to be roughly the
105th link, so Nutch doesn't store it and therefore never fetches it.

All you have to do is either increase db.max.outlinks.per.page or set it to
-1 (which means store all outlinks).

--
Doğacan Güney
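
For reference, a minimal sketch of the override, assuming the usual setup where local settings go in conf/nutch-site.xml and take precedence over nutch-default.xml. The file location and the description text are my assumptions; only the property name and the -1 value come from the explanation above.

```xml
<!-- conf/nutch-site.xml (assumed location of local overrides) -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 stores all outlinks; a larger positive number raises the limit instead -->
    <value>-1</value>
    <description>Maximum number of outlinks stored per page.</description>
  </property>
</configuration>
```

Pages that were already parsed under the old limit would presumably need to be re-fetched or re-parsed before the missing outlinks show up.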