Are you running this in a distributed setup, or in "local" mode? Local mode is not designed to cope with such large datasets, so you will likely get OOM errors during sorting ... I can only recommend that you use a distributed setup with several machines, and tune RAM consumption by adjusting the number of reduce tasks.
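For what it's worth, in a classic (pre-YARN) Hadoop setup the reduce-task count is a job configuration property; a minimal sketch for mapred-site.xml (the value 8 is just an example, pick what fits your cluster):

```xml
<!-- mapred-site.xml: more reduce tasks means each task sorts a
     smaller slice of the data, lowering per-task memory pressure.
     Property name is for the older Hadoop API; check the name
     against the Hadoop version your Nutch build ships with. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```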
Currently we are running in local mode. We do not have the setup for distributing. That is why I want to merge these segments. Would that not help? Instead of having potentially tens of thousands of segments, I want to create several large segments and index those. Sorry for my ignorance, but I'm not really sure how to scale Nutch correctly. Do you know of a document, or some pointers, on how segment/index data should be stored? <briggs /> "Conscious decisions by conscious minds are what make reality real"
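Merging can be done with the `mergesegs` tool shipped with Nutch; a rough sketch, assuming the usual `crawl/segments` layout (paths and the `-slice` value are placeholders to adapt):

```shell
# Merge many small segments into one, then slice the result into
# pieces of ~50000 URLs each so no single segment gets too large.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -slice 50000

# After verifying the merged output, swap it in for the old segments:
rm -rf crawl/segments
mv crawl/MERGEDsegments crawl/segments
```

Note that merging reduces the segment count but not the total data volume, so it won't by itself avoid OOM during the sort phase in local mode.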
