Briggs wrote:
> Has anyone written an API that can merge thousands of segments? The
> current segment merge tool cannot handle this much data as there just
> isn't enough RAM available on the box. So, I was wondering if there was
> a better, incremental way to handle this.
>
> Currently I have 1 segment for each domain that was crawled and I want
> to merge them all into several large segments. So, if anyone has any
> pointers I would appreciate it. Has anyone else attempted to keep
> segments at this granularity? This doesn't seem to work so well.
Are you running this in a distributed setup, or in "local" mode? Local mode is not designed to cope with such large datasets, so you are likely getting OOM errors during sorting. I can only recommend that you use a distributed setup with several machines, and control the RAM consumed per task by adjusting the number of reduce tasks.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
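[A sketch of what the suggested setup might look like. This assumes a Nutch install running on a Hadoop cluster; the segment paths and the reduce-task count are illustrative values, not recommendations, and the exact property names may differ between Nutch/Hadoop versions.]

```sh
# Raise the number of reduce tasks so each reducer sorts a smaller
# partition and needs less RAM (example value; tune for your cluster).
# In conf/hadoop-site.xml:
#   <property>
#     <name>mapred.reduce.tasks</name>
#     <value>20</value>
#   </property>

# Merge all per-domain segments under crawl/segments into new segments,
# slicing the output so no single merged segment grows too large.
# (-slice takes a URL count per output segment; value is illustrative.)
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -slice 50000
```

When this runs as a MapReduce job across several machines, the sort happens in parallel in the reducers, so the per-box RAM limit that breaks the local-mode merge no longer applies.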
