Went through the install as described in the URL... everything looks fine, but I get the following error:
Generator: java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:724)
        at org.apache.hadoop.ipc.Client.call(Client.java:700)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:429)
        at org.apache.nutch.crawl.Generator.run(Generator.java:618)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Generator.main(Generator.java:581)
Caused by: java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:300)
        at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:177)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:801)
        at org.apache.hadoop.ipc.Client.call(Client.java:686)

Any help out there?

2009/6/17 John Martyniak <j...@beforedawnsolutions.com>

> Here is the Hadoop quickstart for setting up Hadoop in pseudo-distributed mode:
>
> http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html
>
> Then, to run Nutch using the newly created Hadoop cluster (it can be a cluster of one), use:
>
>   bin/hadoop jar nutch*.job <className> <args ..>
>
> Hope this helps.
>
> -John

On Jun 17, 2009, at 10:57 AM, Alex Basa wrote:

> Can someone point me to a guide for creating a Hadoop DFS on a single machine? I'd like to test this out to see how much it speeds up merging. I'm using a ZFS filesystem and have no I/O waits. It seems like when the index reaches the 4 GB range, the merge time goes up drastically. My ZFS filesystem is on a SAN and is not the bottleneck.
>
> Thanks in advance!

On Mon, 6/15/09, Julien Nioche <lists.digitalpeb...@gmail.com> wrote (Re: Merge taking forever):

> Hi,
>
>> Presumably in hadoop-site.xml as a property/value?
>
> Indeed.
>
> J.

>> On the other hand, I'm asking myself why merge segments at all... I don't fully understand the benefits; maybe someone can shed some light.

2009/6/15 Julien Nioche <lists.digitalpeb...@gmail.com>

> Hi,
>
> Have you tried setting mapred.compress.map.output to true? This should reduce the amount of temp space required.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
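For reference, that setting goes in hadoop-site.xml as a property/value pair, as confirmed just above. A minimal sketch (the property name and value are from the thread; the surrounding XML is standard Hadoop config boilerplate):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

Compressing map output shrinks the intermediate data each map task spills to local disk, which is where the merge's temp space goes.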
2009/6/15 czerwionka paul <czerw...@mi.fu-berlin.de>

> Hi Justin,
>
> I am running Hadoop in distributed mode and having the same problem: merging segments just eats up much more temp space than the segments combined would occupy.
>
> Paul.

On 14.06.2009, at 18:17, MilleBii wrote:

> Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed 200 GB and filled the partition after 18 hours of processing. Something strange is going on with this segment merge.
>
> Config: PC dual core, Vista, Hadoop on a single node.
>
> Can someone confirm whether installing Hadoop in distributed mode will fix it? Is there a good config guide for distributed mode?

2009/6/12 Justin Yao <jyaoj...@gmail.com>

> Hi John,
>
> I have no idea about that either.
>
> Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <j...@beforedawnsolutions.com> wrote:

> Justin,
>
> Thanks for the response.
>
> I was having a similar issue: I was trying to merge the segments from the May crawls, probably around 13-15 GB, and after everything had run it had used around 900 GB of tmp space. Doesn't seem very efficient.
>
> I will try this out and see if it changes anything.
>
> Do you know if there is any risk in using the following, as suggested in the article?
>
>   <property>
>     <name>mapred.min.split.size</name>
>     <value>671088640</value>
>   </property>
>
> -John
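For what it's worth, 671088640 bytes is exactly 640 MB (640 x 1024 x 1024 = 671088640), i.e. ten default-sized 64 MB DFS blocks per split. Raising mapred.min.split.size makes the framework build fewer, larger map splits, so fewer map tasks and fewer batches of intermediate output; the trade-off is less parallelism, which is presumably acceptable on a small cluster.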
On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:

> Hi John,
>
> I had the same issue before but never found a solution. Here is a workaround mentioned by someone on this mailing list that you may want to try:
>
> "Seemingly abnormal temp space use by segment merger"
> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>
> Regards,
> Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <j...@beforedawnsolutions.com> wrote:

> Ok, so an update on this item.
>
> I did start running Nutch with Hadoop; I am trying a single-node config just to test it out. It took forever to get all of the files into the DFS (it was just over 80 GB), but it is in there. So I started the SegmentMerge job, and it is working flawlessly, though still a little slow.
>
> Also, looking at the stats: the CPUs sometimes go over 20%, but not by much and not often, and the disk is very lightly taxed; peak was about 20 MB/sec, and the drives and interface are rated at 3 GB/sec, so no issue there.
>
> I tried to set the map jobs to 7 and the reduce jobs to 3, but when I restarted everything it is still only using 2 and 1. Any ideas? I made that change in the hadoop-site.xml file, BTW.
>
> -John
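On the "7 maps / 3 reduces but it still uses 2 and 1" question: one plausible cause, assuming stock 0.19 defaults, is that the per-node slot limits are the binding constraint rather than the per-job hints. A tasktracker caps concurrent tasks at mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum (both default to 2), and mapred.reduce.tasks defaults to 1. A hadoop-site.xml sketch, untested, with values mirroring the numbers above:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>

The tasktracker only reads the *.maximum limits at startup, so it needs a restart after the change.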
On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:

> John Martyniak wrote:
>
>> Andrzej,
>>
>> I am a little embarrassed to ask, but is there a setup guide for setting up Hadoop for Nutch 1.0, or is it the same process as setting it up for Nutch 0.17 (which I think is what the existing guide out there covers)?
>
> Basically, yes - but that guide is primarily about setting up a Hadoop cluster using the Hadoop pieces distributed with Nutch. As such, those instructions are already slightly outdated. So it's best simply to install a clean Hadoop 0.19.1 according to the instructions on the Hadoop wiki, and then build the nutch*.job file separately.
>
>> Also, I already have Hadoop running for some other applications not associated with Nutch; can I use the same install? I think it is the same version that Nutch 1.0 uses. Or is it just easier to set it up using the Nutch config?
>
> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would strongly recommend this way, instead of the usual "dirty" way of setting up Nutch by replicating the local build dir ;)
>
> Just specify the nutch*.job file like this:
>
>   bin/hadoop jar nutch*.job <className> <args ..>
>
> where className and args name one of the Nutch command-line tools and its arguments. You can also slightly modify the bin/nutch script so that you don't have to specify fully-qualified class names.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

> John Martyniak
> President/CEO
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562
> c: 303-522-1756
> e: j...@beforedawnsolutions.com
> w: http://www.beforedawnsolutions.com

--
-MilleBii-
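Back to the ConnectException at the top of this thread: "Connection refused" on localhost:9000 almost always means nothing is listening on that port, i.e. the NameNode was never started (or fs.default.name in hadoop-site.xml doesn't point at hdfs://localhost:9000). A quick sanity check, assuming the 0.19 quickstart layout; the crawldb/segments paths on the last line are illustrative, not taken from the thread:

  bin/hadoop namenode -format   # first run only: initialise the DFS
  bin/start-all.sh              # start NameNode, DataNode, JobTracker, TaskTracker
  jps                           # the NameNode and the other daemons should be listed
  bin/hadoop jar nutch*.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments

If jps shows no NameNode, its log under logs/ usually says why.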