It may or may not fix the problem. Hadoop has changed a great deal since 0.4; I think it is now at 0.11.x. If you are upgrading an existing DFS installation that already holds content, that is something to take into consideration. That said, the changes in Hadoop between 0.4 and the present may well have fixed the error you are seeing, and to use the most recent version of Hadoop you will need to apply the NUTCH-437 patch.

Looking at your output below, though, my first thought is that the error comes from the PDF parser and not from Hadoop. Nutch uses pdfbox to parse PDF files, so you may want to take that specific file and see whether it parses correctly with pdfbox outside of Nutch.
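
Before running the file through pdfbox, it can help to rule out the simplest causes: a path that does not resolve (this one contains spaces and parentheses), an unreadable file, or a file that is not really a PDF. The following is only a minimal sketch using the standard Java library; the class name PdfHeaderCheck and its method are hypothetical, not part of Nutch or pdfbox, and the check is deliberately stricter than real PDF readers, which may tolerate a few bytes before the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Nutch or pdfbox.
class PdfHeaderCheck {
    // A well-formed PDF normally begins with this ASCII marker.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the path is a readable regular file whose first bytes are "%PDF-". */
    static boolean looksLikePdf(Path file) throws IOException {
        if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int off = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n < 0) {
                    return false; // file is shorter than the marker
                }
                off += n;
            }
            return Arrays.equals(head, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderCheck <path-to-file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + (looksLikePdf(file)
                ? ": starts with a PDF header"
                : ": missing, unreadable, or not starting with %PDF-"));
    }
}
```

If this reports a valid header, the next step is still to parse the file with pdfbox itself, since a correct header says nothing about the rest of the document.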

Dennis Kubes

Armel T. Nene wrote:
Dennis

I was wondering if this patch could fix my problem, which is, if not the
same, very similar to this one. I am using Nutch 0.8.2-dev; I checked it
out from SVN a while ago and have never updated it since. I was able to crawl
10000 XML files before with no errors whatsoever. These are the
errors that I get when fetching:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

One problem is that my Hadoop version is reported as
hadoop-0.4.0-patched. I don't know whether that means I am running the
0.4.0 version, and it is a little confusing. Once you can clarify that
for me, I will be able to apply the patch to my version.
Best Regards,

Armel

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: 13 February 2007 21:09
To: [email protected]
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually, I take it back. I don't think it is the same problem, but I do think it is the right solution.

Dennis Kubes

Dennis Kubes wrote:
This has to do with HADOOP-964. Replace the Hadoop jar files in your Nutch installation with the most recent versions from Hadoop. You will also need to apply the NUTCH-437 patch to get Nutch to work with the most recent changes to the Hadoop codebase.

Dennis Kubes

Gal Nitzan wrote:
Hi,

Does anybody use Nutch trunk?

I am running Nutch 0.9 and am unable to fetch.

After 50-60K URLs I get an NPE in
org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.

I was wondering if anyone has a workaround, or whether something is wrong with
my setup.

I have opened a new issue in JIRA for this:
http://issues.apache.org/jira/browse/hadoop-1008

Any clue?

Gal


