> Are you running this in a distributed setup, or in "local" mode? Local
> mode is not designed to cope with such large datasets, so you are
> likely hitting OOM errors during sorting. I can only recommend that
> you use a distributed setup with several machines, and tune RAM
> consumption via the number of reduce tasks.

Currently we are running in local mode.  We do not have a distributed
setup, which is why I want to merge these segments.  Would that not
help?  Instead of having potentially tens of thousands of segments, I
want to create several large segments and index those.
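For what it's worth, Nutch ships a segment-merging tool that can be
invoked from the command line; a sketch of how it might be used (the
paths here are hypothetical, and the -slice size is just an example):

```shell
# Merge all segments under crawl/segments into one (or a few) new
# segments written to crawl/merged_segments.
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments

# Optionally cap the size of each output segment with -slice,
# so the merge produces several large segments instead of one:
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -slice 50000
```

Note that in local mode the merge itself runs through the same
single-JVM sort, so it may hit the same memory limits as indexing.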

Sorry for my ignorance, but I am not really sure how to scale Nutch
correctly.  Do you know of a document, or some pointers, on how
segment/index data should be stored?

<briggs />

"Concious decisions by concious minds are what make reality real"
