Hi Julien,

Presumably in hadoop-site.xml as a property/value?

On the other hand, I'm asking myself why merge segments at all... I don't fully understand the benefits; if someone can shed some light.

2009/6/15 Julien Nioche <lists.digitalpeb...@gmail.com>:

> Hi,
>
> Have you tried setting *mapred.compress.map.output* to true? This should
> reduce the amount of temp space required.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/6/15 czerwionka paul <czerw...@mi.fu-berlin.de>:
>
>> hi justin,
>>
>> i am running hadoop in distributed mode and having the same problem.
>>
>> merging segments just eats up much more temp space than the segments
>> would have combined.
>>
>> paul.
>>
>> On 14.06.2009, at 18:17, MilleBii wrote:
>>
>>> Same here: merging 3 segments of 100K, 100K, and 300K URLs consumed
>>> 200 GB and filled the partition after 18 hours of processing.
>>>
>>> Something strange with this segment merge.
>>>
>>> Conf: PC Dual Core, Vista, Hadoop on a single node.
>>>
>>> Can someone confirm if installing Hadoop in distributed mode will fix
>>> it? Is there a good config guide for the distributed mode?
>>>
>>> 2009/6/12 Justin Yao <jyaoj...@gmail.com>:
>>>
>>>> Hi John,
>>>>
>>>> I have no idea about that either.
>>>>
>>>> Justin
>>>>
>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
>>>> <j...@beforedawnsolutions.com> wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> I was having a similar issue: I was trying to merge the segments
>>>>> for crawls during the month of May, probably around 13-15 GB, and
>>>>> after everything had run it had used around 900 GB of tmp space.
>>>>> Doesn't seem very efficient.
>>>>>
>>>>> I will try this out and see if it changes anything.
>>>>>
>>>>> Do you know if there is any risk in using the following, as
>>>>> suggested in the article?
>>>>>
>>>>> <property>
>>>>>   <name>mapred.min.split.size</name>
>>>>>   <value>671088640</value>
>>>>> </property>
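For reference, the two temp-space settings discussed in this thread could be combined in hadoop-site.xml roughly as follows. This is a sketch using Hadoop 0.19-era property names; the 640 MB split size is simply the value quoted above, not a tested recommendation:

```xml
<!-- Sketch for hadoop-site.xml (Hadoop 0.19-era property names). -->
<configuration>
  <!-- Compress intermediate map output to reduce temp-space use
       during segment merges. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- A larger minimum split size (640 MB here, the value from the
       thread) means fewer map tasks and fewer intermediate files. -->
  <property>
    <name>mapred.min.split.size</name>
    <value>671088640</value>
  </property>
</configuration>
```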
>>>>> -John
>>>>>
>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I had the same issue before but never found a solution.
>>>>>> Here is a workaround mentioned by someone on this mailing list;
>>>>>> you may give it a try:
>>>>>>
>>>>>> "Seemingly abnormal temp space use by segment merger"
>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>
>>>>>> Regards,
>>>>>> Justin
>>>>>>
>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
>>>>>> <j...@beforedawnsolutions.com> wrote:
>>>>>>
>>>>>>> Ok, so an update to this item.
>>>>>>>
>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>> config just to test it out.
>>>>>>>
>>>>>>> It took forever to get all of the files into the DFS (it was
>>>>>>> just over 80 GB), but it is in there. So I started the
>>>>>>> SegmentMerge job, and it is working flawlessly, still a little
>>>>>>> slow though.
>>>>>>>
>>>>>>> Also, looking at the stats, the CPUs sometimes go over 20% but
>>>>>>> not by much and not often, and the disk is very lightly taxed:
>>>>>>> the peak was about 20 MB/sec, and the drives and interface are
>>>>>>> rated at 3 GB/sec, so no issue there.
>>>>>>>
>>>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but
>>>>>>> when I restarted it all it is still only using 2 and 1. Any
>>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
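One possible explanation for the "still only 2 and 1" symptom, offered as an assumption rather than a confirmed diagnosis: on 0.19-era Hadoop, mapred.map.tasks and mapred.reduce.tasks are only hints to the framework, while the number of tasks a node will run concurrently is capped by the tasktracker slot maximums, which default to 2. Raising those (and restarting the tasktrackers) might be what is needed, e.g.:

```xml
<!-- Sketch: per-node task slots on a 0.19-era cluster. These are
     assumptions about the cause, not settings confirmed in the thread. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```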
>>>>>>> -John
>>>>>>>
>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>
>>>>>>>> John Martyniak wrote:
>>>>>>>>
>>>>>>>>> Andrzej,
>>>>>>>>>
>>>>>>>>> I am a little embarrassed asking, but is there a setup guide
>>>>>>>>> for setting up Hadoop for Nutch 1.0, or is it the same
>>>>>>>>> process as setting up for Nutch 0.17 (which I think is the
>>>>>>>>> existing guide out there)?
>>>>>>>>
>>>>>>>> Basically, yes - but that guide is primarily about setting up
>>>>>>>> a Hadoop cluster using the Hadoop pieces distributed with
>>>>>>>> Nutch. As such, those instructions are already slightly
>>>>>>>> outdated, so it's best simply to install a clean Hadoop 0.19.1
>>>>>>>> according to the instructions on the Hadoop wiki, and then
>>>>>>>> build the nutch*.job file separately.
>>>>>>>>
>>>>>>>>> Also, I have Hadoop already running for some other
>>>>>>>>> applications not associated with Nutch; can I use the same
>>>>>>>>> install? I think it is the same version that Nutch 1.0 uses.
>>>>>>>>> Or is it just easier to set it up using the Nutch config?
>>>>>>>>
>>>>>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop
>>>>>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>>>>>> fact, I would strongly recommend this way, instead of the
>>>>>>>> usual "dirty" way of setting up Nutch by replicating the local
>>>>>>>> build dir ;)
>>>>>>>>
>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>
>>>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>>
>>>>>>>> where className and args are those of one of the Nutch
>>>>>>>> command-line tools.
>>>>>>>> You can also modify the bin/nutch script slightly so that you
>>>>>>>> don't have to specify fully-qualified class names.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrzej Bialecki <><
>>>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>> John Martyniak
>>>>> President
>>>>> Before Dawn Solutions, Inc.
>>>>> 9457 S. University Blvd #266
>>>>> Highlands Ranch, CO 80126
>>>>> o: 877-499-1562 x707
>>>>> f: 877-499-1562
>>>>> c: 303-522-1756
>>>>> e: j...@beforedawnsolutions.com
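Putting Andrzej's advice together with the merge being discussed, a concrete invocation on an existing cluster might look like the sketch below. The job file name and the HDFS paths are placeholders; SegmentMerger takes an output directory followed by either -dir with a directory of segments or an explicit list of segment paths:

```
# Sketch: run Nutch's segment merger on an existing Hadoop cluster
# via the job file. nutch-1.0.job and the crawl/ paths are placeholders.
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segment -dir crawl/segments
```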