Hi,

thanks for your answers. I've configured compression:

mapred.output.compress = true
mapred.compress.map.output = true
mapred.output.compression.type = BLOCK

(in XML format in hadoop-site.xml) and it works: it uses less disk space and
the out-of-disk-space exception is gone, but merging now takes a really long
time.

My next question is simple: is segment merging a necessary step (if I don't
need everything in one segment and don't use the optional filtering), or is it
OK to proceed with indexing? I ask because many tutorials and most re-crawl
scripts include this step.

Tomislav

On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
> On Wed, Jul 15, 2009 at 20:45, MilleBii<mille...@gmail.com> wrote:
> > Are you on a single node conf ?
> > If yes I have the same problem, and some people have suggested earlier to
> > use the hadoop pseudo-distributed config on a single server.
> > Others have also suggested to use compress mode of hadoop.
>
> Yes, that's a good point. Playing around with these options may help:
>
> mapred.output.compress
> mapred.output.compression.type (BLOCK may help a lot here)
> mapred.compress.map.output
>
> > But I have not been able to make it work on my PC because I get bogged down
> > by some windows/hadoop compatibility issues.
> > If you are on Linux you may be more lucky, interested by your results by the
> > way, so I know if when moving to Linux I get those problems solved.
> >
> >
> > 2009/7/15 Doğacan Güney <doga...@gmail.com>
> >
> >> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak<tpol...@gmail.com> wrote:
> >> > Hi,
> >> > I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages on one
> >> > machine contained in 10 segments, using:
> >> >
> >> > bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
> >> >
> >> > ,but there is not enough space on 500G disk to complete this merge task
> >> > (getting java.io.IOException: No space left on device in hadoop.log)
> >> >
> >> > Shouldn't 500G be enough disk space for this merge? Is this a bug? If
> >> > this is not a bug, how much disk space is required for this merge?
> >> >
> >>
> >> A lot :)
> >>
> >> Try deleting your hadoop temporary folders. If that doesn't help you
> >> may try merging segment parts one by one. For example, move your
> >> content/ directories and try merging again. If successful you can then
> >> merge contents later and move the resulting content/ into your
> >> merge_seg dir.
> >>
> >> > Tomislav
> >>
> >> --
> >> Doğacan Güney
> >
> > --
> > -MilleBii-
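
P.S. For anyone finding this thread later, the three compression settings from
the top of this message would look roughly like this in hadoop-site.xml. This
is a minimal sketch: only the property names and values come from this thread.
The surrounding <configuration> wrapper is the usual Hadoop site-file layout,
and any other properties already in your file are left out here.

```xml
<?xml version="1.0"?>
<!-- Sketch of hadoop-site.xml with only the compression settings discussed
     in this thread; merge these into your existing file rather than
     replacing it. -->
<configuration>
  <!-- Compress the final job output. -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <!-- Compress the intermediate map output. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Compress output in blocks rather than record by record. -->
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```

As noted above, this trades disk space for time: the merge fit on the disk
with these settings, but ran much longer.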