Briggs wrote:
Has anyone written an API that can merge thousands of segments? The current segment merge tool cannot handle this much data, as there just isn't enough
RAM available on the box. So I was wondering whether there is a better,
incremental way to handle this.

Currently I have one segment for each domain that was crawled, and I want to
merge them all into several large segments, so any pointers would be appreciated. Has anyone else attempted to keep segments at this
granularity? It doesn't seem to work so well.


Are you running this in a distributed setup, or in "local" mode? Local mode is not designed to cope with datasets this large, so you are most likely hitting OOM errors during sorting ... I can only recommend that you use a distributed setup with several machines, and tune RAM consumption by adjusting the number of reduce tasks.
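Until you have a distributed setup, one workaround for the "incremental" part of the question is to merge in small batches rather than all segments at once, so each SegmentMerger pass stays within the RAM of a single box. The sketch below only echoes the `bin/nutch mergesegs` invocations instead of running them, and the directory layout, batch size, and output naming (`crawl/merged_N`) are assumptions to adapt to your crawl:

```shell
#!/bin/sh
# Sketch: incremental segment merging in batches of BATCH segments per pass.
# For demonstration, dummy segment dirs are created in a temp directory;
# in a real crawl SEG_DIR would be something like crawl/segments.
SEG_DIR=$(mktemp -d)
for i in 1 2 3 4 5; do mkdir "$SEG_DIR/seg$i"; done

BATCH=2        # segments per merge pass; tune this to available memory
batch=""
count=0
pass=0
cmds=""

for seg in "$SEG_DIR"/*; do
  batch="$batch $seg"
  count=$((count + 1))
  if [ "$count" -eq "$BATCH" ]; then
    # a real run would execute: bin/nutch mergesegs crawl/merged_$pass$batch
    cmds="${cmds}bin/nutch mergesegs crawl/merged_$pass$batch
"
    batch=""
    count=0
    pass=$((pass + 1))
  fi
done
if [ -n "$batch" ]; then
  # merge whatever segments are left over after the last full batch
  cmds="${cmds}bin/nutch mergesegs crawl/merged_$pass$batch
"
  pass=$((pass + 1))
fi

printf '%s' "$cmds"
```

The merged outputs can themselves be merged in a second round if you want to end up with just a few large segments. In a distributed setup the same idea applies, except each pass is a MapReduce job whose per-reducer memory you control through the number of reduce tasks.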

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

