It may fix the problem, or it may not. There have been many changes in
Hadoop since 0.4; I think they are now on 0.11.x. So if you are
upgrading an existing DFS installation that already has content, that
is something to take into consideration. That being said, the changes
in Hadoop from 0.4 to the present may very well have fixed the error
you are seeing, and to use the most recent version of Hadoop you will
need to apply the NUTCH-437 patch.
Looking at your output below, though, my first thought is that the
error is coming from the PDF parser rather than from Hadoop. Nutch
uses the PDFBox library to parse PDF files, so you may want to take the
specific file and see whether it parses correctly outside of Nutch
using PDFBox, along the lines of the sketch below.
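A minimal standalone check might look like the following. This is only a
sketch: the class name PdfParseCheck is made up, and the imports assume the
newer Apache PDFBox 2.x API, which may differ from the older pdfbox jar
bundled with Nutch 0.8, so adjust package names to match whatever jar you
actually have on the classpath.

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Try to parse a single PDF with PDFBox directly, outside of Nutch,
// to see whether the parser itself chokes on the file.
public class PdfParseCheck {
    public static void main(String[] args) throws Exception {
        File pdf = new File(args[0]); // pass the path to the suspect PDF
        try (PDDocument doc = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(doc);
            System.out.println("Parsed OK, extracted " + text.length() + " characters");
        } catch (Exception e) {
            System.err.println("PDFBox failed to parse " + pdf + ": " + e);
        }
    }
}

If this fails on the same file, the problem is in the PDF (or in PDFBox)
rather than in Hadoop or Nutch's fetcher.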
Dennis Kubes
Armel T. Nene wrote:
Dennis,
I was wondering if this patch could fix my problem, which is, if not the
same, very similar to this one. I am using Nutch 0.8.2-dev; I made a
checkout from SVN a while ago but never updated it again. I was able to
crawl 10000 XML files before with no errors whatsoever. These are the
errors I get when fetching:
INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException
One of the problems is that my Hadoop version says the following:
hadoop-0.4.0-patched. I don't know whether that means I am running the
0.4.0 version; it seems a little confusing. Once you can clarify that
for me, I will be able to apply the patch to my version.
Best Regards,
Armel
-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: 13 February 2007 21:09
To: [email protected]
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
Actually, I take it back. I don't think it is the same problem, but I do
think it is the right solution.
Dennis Kubes
Dennis Kubes wrote:
This has to do with HADOOP-964. Replace the jar files in your Nutch
version with the most recent ones from Hadoop. You will also need
to apply the NUTCH-437 patch to get Nutch to work with the most recent
changes to the Hadoop codebase.
Dennis Kubes
Gal Nitzan wrote:
Hi,
Does anybody use Nutch trunk?
I am running Nutch 0.9 and am unable to fetch.
After 50-60K URLs I get an NPE in
org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.
I was wondering if anyone has a workaround, or maybe something is
wrong with my setup.
I have opened a new issue in JIRA for this:
http://issues.apache.org/jira/browse/hadoop-1008
Any clue?
Gal