Re: mergesegs disk space

Tomislav Poljak Tue, 21 Jul 2009 11:50:59 -0700

Hi,
Thanks for your answers. I've configured compression:

mapred.output.compress = true
mapred.compress.map.output = true
mapred.output.compression.type = BLOCK


(in XML format in hadoop-site.xml)
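
For reference, a sketch of how these three settings might look in
hadoop-site.xml, assuming the usual Hadoop <configuration>/<property>
layout. The property names and values are the ones listed above; nothing
else is set here:

```xml
<!-- hadoop-site.xml: sketch only, using the three properties listed above -->
<configuration>
  <!-- compress the final job output -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <!-- compress intermediate map output before it is written to local disk -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- compress SequenceFile output in blocks rather than per record -->
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```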

and it works (it uses less disk space, and there is no more out-of-disk-space
exception), but merging now takes a really long time. My next question
is simple:
Is segment merging a necessary step (if I don't need everything in one
segment and don't use the optional filtering), or is it OK to proceed
straight to indexing? I ask because many tutorials and most re-crawl
scripts include this step.

Tomislav


On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
> On Wed, Jul 15, 2009 at 20:45, MilleBii<mille...@gmail.com> wrote:
> > Are you on a single-node configuration?
> > If so, I have the same problem. Some people suggested earlier using
> > the Hadoop pseudo-distributed config on a single server.
> > Others have also suggested using Hadoop's compression mode.
> 
> Yes, that's a good point. Playing around with these options may help:
> 
> mapred.output.compress
> 
> mapred.output.compression.type (BLOCK may help a lot here)
>
> mapred.compress.map.output
> 
> 
> > But I have not been able to make it work on my PC because I get bogged down
> > by some Windows/Hadoop compatibility issues.
> > If you are on Linux you may have better luck. I'd be interested in your
> > results, by the way, so I know whether moving to Linux would solve those
> > problems for me.
> >
> >
> > 2009/7/15 Doğacan Güney <doga...@gmail.com>
> >
> >> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak<tpol...@gmail.com> wrote:
> >> > Hi,
> >> > I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages on one
> >> > machine contained in 10 segments, using:
> >> >
> >> > bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
> >> >
> >> > but there is not enough space on a 500G disk to complete this merge task
> >> > (getting java.io.IOException: No space left on device in hadoop.log)
> >> >
> >> > Shouldn't 500G be enough disk space for this merge? Is this a bug? If
> >> > this is not a bug, how much disk space is required for this merge?
> >> >
> >>
> >> A lot :)
> >>
> >> Try deleting your hadoop temporary folders. If that doesn't help you
> >> may try merging segment parts one by one. For example, move your
> >> content/ directories and try merging again. If successful you can then
> >> merge contents later and move the resulting content/ into your
> >> merge_seg dir.
> >>
> >> > Tomislav
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Doğacan Güney
> >>
> >
> >
> >
> > --
> > -MilleBii-
> >
> 
> 
>
