Briggs wrote:
Are you running this in a distributed setup, or in "local" mode? Local
mode is not designed to cope with such large datasets, so it's likely
that you are getting OOM errors during sorting ... I can only
recommend that you use a distributed setup with several machines, and
adjust RAM consumption by tuning the number of reduce tasks.
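For what it's worth, a minimal sketch of the kind of tuning meant above, assuming the hadoop-site.xml style of configuration Nutch used at the time (the values here are illustrative, and property names varied across Hadoop releases):

  <!-- hadoop-site.xml: illustrative values, not recommendations -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>   <!-- more reducers means less data sorted per task -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>   <!-- heap given to each map/reduce child JVM -->
  </property>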

Currently we are running in local mode.  We do not have the setup for
a distributed deployment.  That is why I want to merge these segments.
Would that not help?  Instead of having potentially tens of thousands
of segments, I want to create several large segments and index those.

Yes, it makes perfect sense, but you are probably hitting the limits of a single machine.

I suggest that you do the merging in several steps: by trial and error, find the maximum number of segments that doesn't blow up SegmentMerger, and do a first pass merging these small segments into larger ones; then, in a second pass, merge those larger ones into the really large ones.
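To make the two-pass idea concrete, here is a rough sketch using the SegmentMerger command line (bin/nutch mergesegs); the directory names and batch globs are made up, so substitute whatever batch size your trial runs show the machine can handle:

  # Pass 1: merge the small segments in batches sized by trial and error.
  bin/nutch mergesegs crawl/merged_pass1/batch1 crawl/segments/200801*
  bin/nutch mergesegs crawl/merged_pass1/batch2 crawl/segments/200802*

  # Pass 2: merge the intermediate segments into the really large ones.
  bin/nutch mergesegs crawl/merged_final crawl/merged_pass1/batch1/* crawl/merged_pass1/batch2/*

The -slice option (e.g. -slice 50000) can also cap the number of URLs per output segment, if a single merged segment would otherwise get too big to index.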

Sorry for my ignorance, but I'm not really sure how to scale Nutch
correctly.  Do you know of a document, or some pointers, on how
segment/index data should be stored?

Most of this information is already available on the Nutch Wiki. All I can say is that there is certainly a limit to what you can do using the "local" mode - if you need to handle large numbers of pages you will need to migrate to the distributed setup.
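For the archives, the core of that migration is pointing Nutch's Hadoop layer at a real filesystem and job tracker instead of the local defaults; a minimal sketch, assuming the hadoop-site.xml configuration style of that era (host names are placeholders):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>  <!-- HDFS instead of the local fs -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>  <!-- run jobs on the cluster, not in-process -->
  </property>

With those set (and DataNodes/TaskTrackers running), the same crawl and merge commands run across the cluster, and segment data lives in HDFS rather than on one disk.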

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

