Briggs wrote:
Has anyone written an API that can merge thousands of segments? The
current segment merge tool cannot handle this much data, as there just
isn't enough RAM available on the box. So, I was wondering if there was
a better, incremental way to handle this.

Currently I have one segment for each domain that was crawled, and I
want to merge them all into several large segments. So, if anyone has
any pointers, I would appreciate it. Has anyone else attempted to keep
segments at this granularity? It doesn't seem to work so well.
Are you running this in a distributed setup, or in "local" mode? Local
mode is not designed to cope with such large datasets, so it's likely
that you are getting OOM errors during sorting ... I can only
recommend that you use a distributed setup with several machines, and
control per-task RAM consumption by increasing the number of reduce
tasks.
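For example, a sketch of what that might look like, assuming a typical Hadoop/Nutch layout (the paths, task count, and slice size below are placeholders to tune for your cluster):

```shell
# Raise the number of reduce tasks in hadoop-site.xml so each reducer
# sorts a smaller slice of the data and needs less RAM (placeholder value):
#   <property>
#     <name>mapred.reduce.tasks</name>
#     <value>50</value>
#   </property>
#
# Then merge all per-domain segments under crawl/segments into several
# large output segments of at most 50000 URLs each (-slice is what splits
# the result into multiple segments instead of one huge one):
bin/nutch mergesegs crawl/merged -dir crawl/segments -slice 50000
```

With -slice set, SegmentMerger writes a series of output segments rather than a single one, which matches the "several large segments" goal.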
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__||  \|  || |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com