Went through the install as described in the URL... everything looks fine, but I get the following error:
Generator: java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:724)
        at org.apache.hadoop.ipc.Client.call(Client.java:700)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:429)
        at org.apache.nutch.crawl.Generator.run(Generator.java:618)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Generator.main(Generator.java:581)
Caused by: java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:300)
        at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:177)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:801)
        at org.apache.hadoop.ipc.Client.call(Client.java:686)

Any help out there?

2009/6/17 John Martyniak <j...@beforedawnsolutions.com>

> Here is the Hadoop quickstart for setting up Hadoop in pseudo-distributed mode:
>
> http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html
>
> Then, to run Nutch using the newly created Hadoop cluster (it can be a cluster of one), use:
>
>   bin/hadoop jar nutch*.job <className> <args ..>
>
> Hope this helps.
>
> -John

On Jun 17, 2009, at 10:57 AM, Alex Basa wrote:

> Can someone point me to a guide for creating a Hadoop DFS on a single machine? I'd like to test this out to see how much it speeds up merging. I'm using a ZFS filesystem and have no I/O waits. It seems like when the index reaches the 4 GB range, the merge time goes up drastically. My ZFS filesystem is on a SAN and is not the bottleneck.
>
> Thanks in advance!

On Mon, 6/15/09, Julien Nioche <lists.digitalpeb...@gmail.com> wrote (Re: Merge taking forever):

> Hi,
>
>> Presumably in hadoop-site.xml as a property/value?
>
> Indeed.
>
> J.

>> On the other hand, I'm asking myself why merge segments at all... I don't fully understand the benefits; maybe someone can shed some light.

2009/6/15 Julien Nioche <lists.digitalpeb...@gmail.com>

> Hi,
>
> Have you tried setting mapred.compress.map.output to true? This should reduce the amount of temp space required.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
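For reference, that setting goes in hadoop-site.xml as a property/value pair, as confirmed just above. A minimal sketch (the property name and value are from the thread; the surrounding XML is standard Hadoop config boilerplate):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

Compressing map output shrinks the intermediate data each map task spills to local disk, which is where the merge's temp space goes.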
2009/6/15 czerwionka paul <czerw...@mi.fu-berlin.de>

> Hi Justin,
>
> I am running Hadoop in distributed mode and having the same problem: merging segments just eats up much more temp space than the segments combined would occupy.
>
> Paul.

On 14.06.2009, at 18:17, MilleBii wrote:

> Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed 200 GB and filled the partition after 18 hours of processing. Something strange is going on with this segment merge.
>
> Config: PC dual core, Vista, Hadoop on a single node.
>
> Can someone confirm whether installing Hadoop in distributed mode will fix it? Is there a good config guide for distributed mode?

2009/6/12 Justin Yao <jyaoj...@gmail.com>

> Hi John,
>
> I have no idea about that either.
>
> Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <j...@beforedawnsolutions.com> wrote:

> Justin,
>
> Thanks for the response.
>
> I was having a similar issue: I was trying to merge the segments from the May crawls, probably around 13-15 GB, and after everything had run it had used around 900 GB of tmp space. Doesn't seem very efficient.
>
> I will try this out and see if it changes anything.
>
> Do you know if there is any risk in using the following, as suggested in the article?
>
>   <property>
>     <name>mapred.min.split.size</name>
>     <value>671088640</value>
>   </property>
>
> -John
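For what it's worth, 671088640 bytes is exactly 640 MB (640 x 1024 x 1024 = 671088640), i.e. ten default-sized 64 MB DFS blocks per split. Raising mapred.min.split.size makes the framework build fewer, larger map splits, so fewer map tasks and fewer batches of intermediate output; the trade-off is less parallelism, which is presumably acceptable on a small cluster.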
On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:

> Hi John,
>
> I had the same issue before but never found a solution. Here is a workaround mentioned by someone on this mailing list that you may want to try:
>
> "Seemingly abnormal temp space use by segment merger"
> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>
> Regards,
> Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <j...@beforedawnsolutions.com> wrote:

> Ok, so an update on this item.
>
> I did start running Nutch with Hadoop; I am trying a single-node config just to test it out. It took forever to get all of the files into the DFS (it was just over 80 GB), but it is in there. So I started the SegmentMerge job, and it is working flawlessly, though still a little slow.
>
> Also, looking at the stats: the CPUs sometimes go over 20%, but not by much and not often, and the disk is very lightly taxed; peak was about 20 MB/sec, and the drives and interface are rated at 3 GB/sec, so no issue there.
>
> I tried to set the map jobs to 7 and the reduce jobs to 3, but when I restarted everything it is still only using 2 and 1. Any ideas? I made that change in the hadoop-site.xml file, BTW.
>
> -John
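On the "7 maps / 3 reduces but it still uses 2 and 1" question: one plausible cause, assuming stock 0.19 defaults, is that the per-node slot limits are the binding constraint rather than the per-job hints. A tasktracker caps concurrent tasks at mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum (both default to 2), and mapred.reduce.tasks defaults to 1. A hadoop-site.xml sketch, untested, with values mirroring the numbers above:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>

The tasktracker only reads the *.maximum limits at startup, so it needs a restart after the change.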
On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:

> John Martyniak wrote:
>
>> Andrzej,
>>
>> I am a little embarrassed to ask, but is there a setup guide for setting up Hadoop for Nutch 1.0, or is it the same process as setting it up for Nutch 0.17 (which I think is what the existing guide out there covers)?
>
> Basically, yes - but that guide is primarily about setting up a Hadoop cluster using the Hadoop pieces distributed with Nutch. As such, those instructions are already slightly outdated. So it's best simply to install a clean Hadoop 0.19.1 according to the instructions on the Hadoop wiki, and then build the nutch*.job file separately.
>
>> Also, I already have Hadoop running for some other applications not associated with Nutch; can I use the same install? I think it is the same version that Nutch 1.0 uses. Or is it just easier to set it up using the Nutch config?
>
> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would strongly recommend this way, instead of the usual "dirty" way of setting up Nutch by replicating the local build dir ;)
>
> Just specify the nutch*.job file like this:
>
>   bin/hadoop jar nutch*.job <className> <args ..>
>
> where className and args name one of the Nutch command-line tools and its arguments. You can also slightly modify the bin/nutch script so that you don't have to specify fully-qualified class names.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

> John Martyniak
> President/CEO
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562
> c: 303-522-1756
> e: j...@beforedawnsolutions.com
> w: http://www.beforedawnsolutions.com

--
-MilleBii-
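Back to the ConnectException at the top of this thread: "Connection refused" on localhost:9000 almost always means nothing is listening on that port, i.e. the NameNode was never started (or fs.default.name in hadoop-site.xml doesn't point at hdfs://localhost:9000). A quick sanity check, assuming the 0.19 quickstart layout; the crawldb/segments paths on the last line are illustrative, not taken from the thread:

  bin/hadoop namenode -format   # first run only: initialise the DFS
  bin/start-all.sh              # start NameNode, DataNode, JobTracker, TaskTracker
  jps                           # the NameNode and the other daemons should be listed
  bin/hadoop jar nutch*.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments

If jps shows no NameNode, its log under logs/ usually says why.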