Briggs wrote:
>> Are you running this in a distributed setup, or in "local" mode? Local
>> mode is not designed to cope with such large datasets, so it's likely
>> that you will be getting OOM errors during sorting ... I can only
>> recommend that you use a distributed setup with several machines, and
>> adjust RAM consumption with the number of reduce tasks.
>
> Currently we are running in local mode.  We do not have the setup for
> distributing. That is why I want to merge these segments.  Would that
> not help?  Instead of having potentially tens of thousands of
> segments, I want to create several large segments and index those.

Yes, it makes perfect sense, but you are probably hitting the limits of 
a single machine.

I suggest doing the merging in several passes: find by trial and error 
the largest number of segments that SegmentMerger can handle without 
running out of memory, and in the first pass merge the small segments 
into larger ones in batches of that size; then in the second pass merge 
these larger segments into the really large ones.
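
To make the multi-pass idea concrete, here is a minimal sketch in Python of 
the batching logic only. It is not part of Nutch: plan_batches, 
merge_in_passes and the merge_batch callback are hypothetical names, and 
merge_batch is assumed to wrap whatever you actually use to merge one batch 
(for example a call to SegmentMerger) and to return the path of the merged 
segment. The batch size is the number you find by trial and error.

```python
from typing import Callable, List, Sequence


def plan_batches(segments: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the segment list into consecutive batches of at most batch_size."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2, or merging never shrinks the list")
    return [list(segments[i:i + batch_size])
            for i in range(0, len(segments), batch_size)]


def merge_in_passes(
    segments: Sequence[str],
    batch_size: int,
    merge_batch: Callable[[List[str], str], str],
    target_count: int = 1,
) -> List[str]:
    """Merge segments pass by pass until at most target_count remain.

    merge_batch(batch, label) is a hypothetical callback: it should merge
    the given segments and return the path of the resulting segment.
    """
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    current = list(segments)
    pass_no = 0
    while len(current) > target_count:
        pass_no += 1
        merged = []
        for n, batch in enumerate(plan_batches(current, batch_size)):
            if len(batch) == 1:
                # A lone leftover segment is carried into the next pass as-is.
                merged.append(batch[0])
            else:
                merged.append(merge_batch(batch, f"pass{pass_no}-batch{n}"))
        current = merged
    return current
```

Setting target_count to a small number greater than one stops the loop once 
you have "several large segments" rather than forcing everything into a 
single one, which may itself be too big for one machine.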


>
> Sorry for my ignorance, but I'm not really sure how to scale Nutch
> correctly.  Do you know of a document, or some pointers as to how
> segment/index data should be stored?

Most of this information is already available on the Nutch Wiki. All I 
can say is that there is certainly a limit to what you can do using the 
"local" mode - if you need to handle large numbers of pages you will 
need to migrate to the distributed setup.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
