[ http://issues.apache.org/jira/browse/NUTCH-7?page=all ]
Piotr Kosiorowski updated NUTCH-7:
----------------------------------
Attachment: patch
I am attaching a patch that should fix this issue.
I tried to reproduce the bug on the URL submitted in the bug description but
failed to do so (I am not sure why). However, I ran into exactly this problem when
computing PageRank for my WebDB (~70 million pages, ~150 million links).
The most problematic site in my WebDB starts from this URL:
http://www.scotlandsheritagehotels.co.uk/darroch/special-promotions/wine-and-dine/
The majority of links on this page add one additional path component to the URL but
point to exactly the same page content.
I have over 1 million pages with this URL pattern in my development WebDB :(.
After running the PageRank computation on it I ran out of disk space (my
WebDB is about 48 GB and I had ~600 GB free before starting the computation)
and the process was terminated in the middle.
I debugged the problem and it appears that it is not caused by cyclic
links but by many pages having the same MD5.
Because all 1 million such pages in my WebDB have the same content and thus the same MD5,
all links from these pages are grouped together and returned when the PageRank
computation gathers the outlinks of the currently processed page.
So for each of the 1 million bad pages it gets over a million outlinks and tries to
write URLs and scores to a file. In my opinion each link should be used exactly
once in the PageRank computation, not as many times as there are pages with the same
MD5.
Because the PageRank computation iterates over pages sorted by MD5,
the change is simple to implement.
I have added a check so that only the first page of each set of pages sharing the same MD5
is actually processed; all others are skipped.
This guarantees that the number of processed links equals the number of
links in the WebDB (not greater, as before).
Additionally, I have added a log warning if the number of outlinks per page is
greater than some limit (10000 right now, but it can be set to a smaller value). It
makes it easy to identify some problematic sites in the WebDB, but it should be
treated as a hint only: for example, if dmoz data is used to seed the WebDB,
www.dmoz.org will have a huge number of outlinks.
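The warning can be sketched in a similarly rough way; the class name, constant name
and logger below are assumptions for illustration, not the code in the attached patch:

import java.util.logging.Logger;

public class OutlinkWarningSketch {

    private static final Logger LOG = Logger.getLogger("OutlinkWarningSketch");

    // Illustrative threshold; 10000 is the value mentioned above, but a smaller one can be used.
    static final int OUTLINK_WARN_LIMIT = 10000;

    static void checkOutlinkCount(String url, int outlinkCount) {
        if (outlinkCount > OUTLINK_WARN_LIMIT) {
            // A hint only: legitimate hubs (e.g. www.dmoz.org when the WebDB is
            // seeded from dmoz data) will also exceed the limit.
            LOG.warning(url + " has " + outlinkCount + " outlinks (limit "
                        + OUTLINK_WARN_LIMIT + ")");
        }
    }

    public static void main(String[] args) {
        checkOutlinkCount("http://www.example.com/hub", 12345);
    }
}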
This patch should improve the performance of the PageRank computation, since less data is
actually written to disk and processed. I was surprised to see many small
groups of duplicated pages in my WebDB in addition to this huge one, so I
encourage others to test whether they see performance improvements even if they
have not run out of disk space yet.
In my opinion it will also improve the quality of the results, because each link
will be used exactly once in the PageRank computation.
I also agree that we should avoid adding such pages to the WebDB in the first place; I
will investigate some ideas that may help us do so in the future.
> analyze tool takes up all the disk space when there are circular links
> ----------------------------------------------------------------------
>
> Key: NUTCH-7
> URL: http://issues.apache.org/jira/browse/NUTCH-7
> Project: Nutch
> Type: Bug
> Components: indexer
> Environment: analyze runs for an excessive amount of time and creates huge
> temp files until it runs out of disk space (if you let the db grow)
> Reporter: Phoebe Miller
> Attachments: patch
>
> It is repeatable by running an instance with these seeds:
> http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
> http://www.acf.hhs.gov/programs/ofs/
> and limit it (for best effect) to just:
> *.acf.hhs.gov/*
> Let it go for about 12 cycles to build it up and the temp file size roughly
> doubles with each segment.
> ]$ ls -l /db/tmpdir2344la/
> ...
> 1503641425 Mar 10 17:42 scoreEdits.0.unsorted
> for a very small db:
> Stats for [EMAIL PROTECTED]
> -------------------------------
> Number of pages: 6916
> Number of links: 8085
> scoreEdits.0.sorted.0 contains rows of links that looked like the first seed
> url, but with more grants/ and data/ in the sub dirs.
> In the File:
> .DistributedAnalysisTool.java
> 345 if (curIndex - startIndex > extent) {
> 346 break;
> 347 }
> is the hard stop.
> Further down the score is written:
> 381 for (int i = 0; i < outLinks.length; i++) {
> ...
> 385 scoreWriter.append(outLinks[i].getURL(), score);
> Putting a check here stops the tmpdir.../scoreEdits.0 file growth
> but the links themselves should not be produced in the generation either.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira