Is your change to the update db tool going to be in the next release? Have you tested it?
Thanks for the fix!

-----Original Message-----
From: Phoebe Miller (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 31, 2005 8:59 AM
To: [email protected]
Subject: [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

    [ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_61899 ]

Phoebe Miller commented on NUTCH-7:
-----------------------------------

I have fixed this problem by changing the update database tool: links from a page are not added if the page has already been processed and is current (same MD5). Link analysis therefore no longer runs into these infinite chains of links. Here is the diff in UpdateDatabaseTool.java:

64d63
< private IWebDBReader webdbread;
72c71
< public UpdateDatabaseTool(IWebDBWriter webdb, IWebDBReader webdbread, boolean additionsAllowed, int maxCount) {
---
> public UpdateDatabaseTool(IWebDBWriter webdb, boolean additionsAllowed, int maxCount) {
74d72
< this.webdbread = webdbread;
229,231d226
< // If the page is already in the db, so are the links,
< // This should take care of relative links and symlinks to itself.
< if (!webdbread.pageExists(newPage.getMD5())) // page not seen before
365,366c360
< IWebDBReader webdbread = new WebDBReader(nfs, root);
< UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, webdbread, additionsAllowed, max);
---
> UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, additionsAllowed, max);

> analyze tool takes up all the disk space when there are circular links
> ----------------------------------------------------------------------
>
>          Key: NUTCH-7
>          URL: http://issues.apache.org/jira/browse/NUTCH-7
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: analyze runs for an excessive amount of time and creates huge temp files until it runs out of disk space (if you let the db grow)
>     Reporter: Phoebe Miller
>
> It is repeatable by running an instance with these seeds:
> http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
> http://www.acf.hhs.gov/programs/ofs/
> and limit it (for best effect) to just:
> *.acf.hhs.gov/*
> Let it go for about 12 cycles to build it up; the temp file size roughly doubles with each segment.
> ]$ ls -l /db/tmpdir2344la/
> ...
> 1503641425 Mar 10 17:42 scoreEdits.0.unsorted
> for a very small db:
> Stats for [EMAIL PROTECTED]
> -------------------------------
> Number of pages: 6916
> Number of links: 8085
> scoreEdits.0.sorted.0 contains rows of links that look like the first seed URL, but with more grants/ and data/ in the sub-dirs.
> In the file DistributedAnalysisTool.java:
> 345             if (curIndex - startIndex > extent) {
> 346                 break;
> 347             }
> is the hard stop. Further down, the score is written:
> 381             for (int i = 0; i < outLinks.length; i++) {
>                 ...
> 385                 scoreWriter.append(outLinks[i].getURL(), score);
> Putting a check here stops the growth of the tmpdir.../scoreEdits.0 file,
> but the links themselves should not be produced during generation either.
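For anyone trying to picture where the growth comes from: the seed URL suggests the server returns the same content at any path depth (hence the "same MD5" wording in the fix), so a relative outlink on that page resolves to a longer absolute URL on every cycle. Below is a small standalone Java illustration of that resolution, not Nutch code; the relative path "grants/data/process.htm" is a guess, since the actual link text on the page is not in the report, but the resulting URLs match the pattern described for scoreEdits.0.sorted.0.

    import java.net.URI;

    // Standalone illustration of how a relative outlink on a page served at any
    // path depth resolves to an ever-longer absolute URL on each crawl cycle.
    public class CircularLinkDemo {
        public static void main(String[] args) {
            URI page = URI.create("http://www.acf.hhs.gov/programs/ofs/process.htm");
            for (int cycle = 1; cycle <= 4; cycle++) {
                // The byte-identical page is fetched again at the new URL, and its
                // relative link now resolves against the longer base path.
                page = page.resolve("grants/data/process.htm");
                System.out.println("cycle " + cycle + ": " + page);
            }
        }
    }

Each cycle discovers "new" URLs even though the content (and its MD5) never changes, which is exactly the case the MD5 check in the patch is meant to cut off.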
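As a reference for how the guard works, here is a minimal self-contained sketch of the idea behind the patch, with a HashSet standing in for the web db. In the patch itself the lookup is webdbread.pageExists(newPage.getMD5()) against an IWebDBReader; the class and method names below are placeholders, not Nutch API.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: a set of content hashes stands in for the web db, and
    // shouldAddOutlinks() plays the role of the pageExists() guard in the patch.
    public class UpdateGuardSketch {
        private final Set<String> seenContentHashes = new HashSet<String>();

        // Returns true the first time a given page content is seen; false once a
        // page with the same MD5 is already known, in which case its outlinks are
        // skipped instead of feeding the circular chain.
        public boolean shouldAddOutlinks(byte[] pageContent) throws NoSuchAlgorithmException {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(pageContent);
            String key = new BigInteger(1, md5).toString(16);
            return seenContentHashes.add(key); // add() returns false for a repeat
        }
    }

Presumably a similar check, keyed on pages whose content is already in the db, is what the report has in mind around the scoreWriter.append() loop in DistributedAnalysisTool.java so that scoreEdits.0 stops growing as well.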
