Hi Julien,

Presumably in hadoop-site.xml as a property/value?

On the other hand, I'm asking myself why merge segments at all... I don't fully understand the benefits; if someone can shed some light.

2009/6/15 Julien Nioche <lists.digitalpeb...@gmail.com>:

> Hi,
>
> Have you tried setting *mapred.compress.map.output* to true? This should
> reduce the amount of temp space required.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/6/15 czerwionka paul <czerw...@mi.fu-berlin.de>:
>
>> hi justin,
>>
>> i am running hadoop in distributed mode and having the same problem.
>>
>> merging segments just eats up much more temp space than the segments
>> would have combined.
>>
>> paul.
>>
>> On 14.06.2009, at 18:17, MilleBii wrote:
>>
>>> Same here: merging 3 segments of 100K, 100K, and 300K URLs consumed
>>> 200 GB and filled the partition after 18 hours of processing.
>>>
>>> Something strange with this segment merge.
>>>
>>> Conf: PC Dual Core, Vista, Hadoop on a single node.
>>>
>>> Can someone confirm if installing Hadoop in distributed mode will fix
>>> it? Is there a good config guide for the distributed mode?
>>>
>>> 2009/6/12 Justin Yao <jyaoj...@gmail.com>:
>>>
>>>> Hi John,
>>>>
>>>> I have no idea about that either.
>>>>
>>>> Justin
>>>>
>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
>>>> <j...@beforedawnsolutions.com> wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> I was having a similar issue: I was trying to merge the segments
>>>>> for crawls during the month of May, probably around 13-15 GB, and
>>>>> after everything had run it had used around 900 GB of tmp space.
>>>>> Doesn't seem very efficient.
>>>>>
>>>>> I will try this out and see if it changes anything.
>>>>>
>>>>> Do you know if there is any risk in using the following, as
>>>>> suggested in the article?
>>>>>
>>>>> <property>
>>>>>   <name>mapred.min.split.size</name>
>>>>>   <value>671088640</value>
>>>>> </property>
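For reference, the two temp-space settings discussed in this thread could be combined in hadoop-site.xml roughly as follows. This is a sketch using Hadoop 0.19-era property names; the 640 MB split size is simply the value quoted above, not a tested recommendation:

```xml
<!-- Sketch for hadoop-site.xml (Hadoop 0.19-era property names). -->
<configuration>
  <!-- Compress intermediate map output to reduce temp-space use
       during segment merges. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- A larger minimum split size (640 MB here, the value from the
       thread) means fewer map tasks and fewer intermediate files. -->
  <property>
    <name>mapred.min.split.size</name>
    <value>671088640</value>
  </property>
</configuration>
```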
>>>>> -John
>>>>>
>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I had the same issue before but never found a solution.
>>>>>> Here is a workaround mentioned by someone on this mailing list;
>>>>>> you may give it a try:
>>>>>>
>>>>>> "Seemingly abnormal temp space use by segment merger"
>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>
>>>>>> Regards,
>>>>>> Justin
>>>>>>
>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
>>>>>> <j...@beforedawnsolutions.com> wrote:
>>>>>>
>>>>>>> Ok, so an update to this item.
>>>>>>>
>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>> config just to test it out.
>>>>>>>
>>>>>>> It took forever to get all of the files into the DFS (it was
>>>>>>> just over 80 GB), but it is in there. So I started the
>>>>>>> SegmentMerge job, and it is working flawlessly, still a little
>>>>>>> slow though.
>>>>>>>
>>>>>>> Also, looking at the stats, the CPUs sometimes go over 20% but
>>>>>>> not by much and not often, and the disk is very lightly taxed:
>>>>>>> the peak was about 20 MB/sec, and the drives and interface are
>>>>>>> rated at 3 GB/sec, so no issue there.
>>>>>>>
>>>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but
>>>>>>> when I restarted it all it is still only using 2 and 1. Any
>>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
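One possible explanation for the "still only 2 and 1" symptom, offered as an assumption rather than a confirmed diagnosis: on 0.19-era Hadoop, mapred.map.tasks and mapred.reduce.tasks are only hints to the framework, while the number of tasks a node will run concurrently is capped by the tasktracker slot maximums, which default to 2. Raising those (and restarting the tasktrackers) might be what is needed, e.g.:

```xml
<!-- Sketch: per-node task slots on a 0.19-era cluster. These are
     assumptions about the cause, not settings confirmed in the thread. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```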
>>>>>>> -John
>>>>>>>
>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>
>>>>>>>> John Martyniak wrote:
>>>>>>>>
>>>>>>>>> Andrzej,
>>>>>>>>>
>>>>>>>>> I am a little embarrassed asking, but is there a setup guide
>>>>>>>>> for setting up Hadoop for Nutch 1.0, or is it the same
>>>>>>>>> process as setting up for Nutch 0.17 (which I think is the
>>>>>>>>> existing guide out there)?
>>>>>>>>
>>>>>>>> Basically, yes - but that guide is primarily about setting up
>>>>>>>> a Hadoop cluster using the Hadoop pieces distributed with
>>>>>>>> Nutch. As such, those instructions are already slightly
>>>>>>>> outdated, so it's best simply to install a clean Hadoop 0.19.1
>>>>>>>> according to the instructions on the Hadoop wiki, and then
>>>>>>>> build the nutch*.job file separately.
>>>>>>>>
>>>>>>>>> Also, I have Hadoop already running for some other
>>>>>>>>> applications not associated with Nutch; can I use the same
>>>>>>>>> install? I think it is the same version that Nutch 1.0 uses.
>>>>>>>>> Or is it just easier to set it up using the Nutch config?
>>>>>>>>
>>>>>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop
>>>>>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>>>>>> fact, I would strongly recommend this way, instead of the
>>>>>>>> usual "dirty" way of setting up Nutch by replicating the local
>>>>>>>> build dir ;)
>>>>>>>>
>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>
>>>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>>
>>>>>>>> where className and args are those of one of the Nutch
>>>>>>>> command-line tools.
>>>>>>>> You can also modify the bin/nutch script slightly so that you
>>>>>>>> don't have to specify fully-qualified class names.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrzej Bialecki <><
>>>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>> John Martyniak
>>>>> President
>>>>> Before Dawn Solutions, Inc.
>>>>> 9457 S. University Blvd #266
>>>>> Highlands Ranch, CO 80126
>>>>> o: 877-499-1562 x707
>>>>> f: 877-499-1562
>>>>> c: 303-522-1756
>>>>> e: j...@beforedawnsolutions.com
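Putting Andrzej's advice together with the merge being discussed, a concrete invocation on an existing cluster might look like the sketch below. The job file name and the HDFS paths are placeholders; SegmentMerger takes an output directory followed by either -dir with a directory of segments or an explicit list of segment paths:

```
# Sketch: run Nutch's segment merger on an existing Hadoop cluster
# via the job file. nutch-1.0.job and the crawl/ paths are placeholders.
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segment -dir crawl/segments
```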