[ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_63160 ]

Doug Cutting commented on NUTCH-7:
----------------------------------
The link analysis tool is not actively maintained. Its use is optional, so if you have problems with it, you can just stop using it.

To get some of its effects (prioritizing pages when crawling and searching) without using the analyze command, set both fetchlist.score.by.link.count and indexer.boost.by.link.count to true. This "poor man's link analysis" implementation works surprisingly well.

> analyze tool takes up all the disk space when there are circular links
> ----------------------------------------------------------------------
>
>          Key: NUTCH-7
>          URL: http://issues.apache.org/jira/browse/NUTCH-7
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: analyze runs for an excessive amount of time and creates huge
> temp files until it runs out of disk space (if you let the db grow)
>     Reporter: Phoebe Miller
>
> It is repeatable by running an instance with these seeds:
> http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
> http://www.acf.hhs.gov/programs/ofs/
> and limit it (for best effect) to just:
> *.acf.hhs.gov/*
> Let it go for about 12 cycles to build it up and the temp file size roughly
> doubles with each segment.
> ]$ ls -l /db/tmpdir2344la/
> ...
> 1503641425 Mar 10 17:42 scoreEdits.0.unsorted
> for a very small db:
> Stats for [EMAIL PROTECTED]
> -------------------------------
> Number of pages: 6916
> Number of links: 8085
> scoreEdits.0.sorted.0 contains rows of links that looked like the first seed
> url, but with more grants/ and data/ in the sub dirs.
> In the File:
> .DistributedAnalysisTool.java
> 345 if (curIndex - startIndex > extent) {
> 346 break;
> 347 }
> is the hard stop.
> Further down the score is written:
> 381 for (int i = 0; i < outLinks.length; i++) {
> ...
> 385 scoreWriter.append(outLinks[i].getURL(), score);
> Putting a check here stops the tmpdir.../scoreEdits.0 file growth
> but the links themselves should not be produced in the generation either.
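
The report says that a check placed before the scoreWriter.append call stops the scoreEdits file growth, but it does not say what the check is. One possible shape, offered only as a sketch: skip any outlink whose path repeats the same segment many times, as the first seed URL does with grants/ and data/. The class RepeatedSegmentFilter, the method looksCircular, and the threshold MAX_SEGMENT_REPEATS are all hypothetical names and values chosen for illustration; none of them is part of Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): flags URLs whose path repeats a segment many times. */
class RepeatedSegmentFilter {

    /** Assumed threshold: a path segment seen more than this many times marks the URL as suspect. */
    static final int MAX_SEGMENT_REPEATS = 3;

    /** Returns true if any single path segment occurs more than MAX_SEGMENT_REPEATS times. */
    static boolean looksCircular(String url) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are left for other filters to handle
        }
        if (path == null) {
            return false;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty pieces produced by leading or doubled slashes
            }
            int seen = counts.merge(segment, 1, Integer::sum);
            if (seen > MAX_SEGMENT_REPEATS) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two seed URLs from the report.
        String first = "http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants"
                + "/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm";
        String second = "http://www.acf.hhs.gov/programs/ofs/";
        System.out.println(first + " -> " + looksCircular(first));
        System.out.println(second + " -> " + looksCircular(second));
    }
}
```

Inside the loop at line 381, such a guard would amount to skipping the append whenever looksCircular(...) returns true for the outlink's URL. As the report notes, this only limits the temp file growth; the links would still have to be kept out of the generation step separately.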
