Sweet, I'll look into how to go about doing that later today.
On Oct 24, 2007, at 6:06 AM, Andrzej Bialecki wrote:
Ned Rockson wrote:
I recently switched from Fetcher to Fetcher2 for larger whole-web
fetches (50-100M URLs at a time). I found that the order of the
generated URLs is not optimal because they are simply randomized by
a hash comparator. In one crawl on 24 machines it took about 3 days
to crawl 30M URLs. Compared with old benchmarks I had set with the
regular Fetcher.java, this took at least three times as long.
Anyway, I realized that randomization only approximates a good
ordering; for an optimal ordering, URLs from the same host should
be as far apart in the list as possible. So I wrote a series of 2
map/reduce jobs to optimize the ordering; for a list of 25M
documents it takes about 10 minutes on our cluster. Right now I
have it in its own class, but I figured it could go in
Generator.java, with a flag in nutch-default.xml that determines
whether the user wants to use it.
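To illustrate the ordering idea (same-host URLs kept as far apart as possible), here is a minimal single-machine sketch in Python. It is not the map/reduce code described in this message, which was not posted to the list. The function name spread_by_host and the fractional-slot scheme are assumptions made for illustration: each host's URLs are assigned evenly spaced positions in [0, 1), and the whole list is sorted by position.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def spread_by_host(urls):
    """Order URLs so that each host's URLs are spaced roughly evenly.

    Hypothetical helper for illustration; not part of Nutch.
    """
    # Group URLs by host name (empty string if the URL has no host).
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    keyed = []
    for host, group in by_host.items():
        n = len(group)
        for i, url in enumerate(group):
            # A host with n URLs gets the slots (i + 0.5) / n, which are
            # evenly spaced across [0, 1).
            keyed.append(((i + 0.5) / n, host, url))

    # Sorting by slot interleaves the hosts; ties are broken by host name.
    keyed.sort()
    return [url for _, _, url in keyed]


if __name__ == "__main__":
    sample = [
        "http://a.example/1", "http://a.example/2", "http://a.example/3",
        "http://b.example/1", "http://b.example/2", "http://b.example/3",
    ]
    for url in spread_by_host(sample):
        print(url)
```

The slot key depends only on a URL's index within its host and on that host's URL count, so the same idea could plausibly be split into one pass that groups by host and a second pass that sorts by slot. When one host dominates the list, some of its URLs will still end up adjacent.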
So, should I submit the code by email? Is there some way to
change Generator.java or should I just submit the function in its
own class?
This sounds intriguing. Nutch mailing lists strip attachments - you
should create a JIRA issue, copy this description, and attach a
patch in unified diff format (svn diff will do).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com