[ 
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ] 

Richard Braman commented on NUTCH-220:
--------------------------------------

Here is an example of the error from my log file.  It seems it was fixed in 
the latest PDFBox build, per Ben Litchfield, developer of PDFBox.


060325 212856 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
        at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
060325 212856 SEVERE fetcher caught:java.lang.NullPointerException
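
For illustration only, here is a minimal sketch of guarding a parse step so that an unchecked exception or a null result is turned into an explicit "no result" before anything is written out. It assumes, which this thread does not confirm, that the NullPointerException above originates from a failed parse reaching the writer. SafeParse and parseSafely are hypothetical names, and the Function parameter is a stand-in for a real parser; none of this is Nutch, Hadoop, or PDFBox API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

public class SafeParse {

    /**
     * Runs a parser (hypothetical stand-in, not a Nutch or PDFBox type) and
     * converts both a thrown unchecked exception and a null return value
     * into Optional.empty(), so callers never pass null downstream.
     */
    static Optional<String> parseSafely(Function<byte[], String> parser, byte[] content) {
        try {
            return Optional.ofNullable(parser.apply(content));
        } catch (RuntimeException e) {
            // Log and continue instead of letting the worker thread die.
            System.err.println("parse failed: " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A parser that fails the way the log above suggests.
        Function<byte[], String> broken = bytes -> {
            throw new NullPointerException();
        };
        // A trivial parser that succeeds.
        Function<byte[], String> working = bytes -> new String(bytes, StandardCharsets.UTF_8);

        byte[] content = "sample".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseSafely(broken, content).isPresent());
        System.out.println(parseSafely(working, content).orElse("<none>"));
    }
}
```

A caller would then skip or record the document as unparsed when the Optional is empty, rather than appending a null value.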

> PDF Box can't parse document: java.lang.NullPointerException
> ------------------------------------------------------------
>
>          Key: NUTCH-220
>          URL: http://issues.apache.org/jira/browse/NUTCH-220
>      Project: Nutch
>         Type: Bug
>  Environment: PDFBox 0.7.2
>     Reporter: Richard Braman

>
> This error was fixed in the latest build of PDFBox, which should be tested 
> with Nutch.
> >> 060228 160354 fetch okay, but can't parse
> >> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> >> failed(2,0): Can't be handled as pdf document. 
> >> java.lang.NullPointerException
> Yes, the NPE should be fixed.
>  Ben
> Richard Braman wrote:
> > Hi Ben,
> >
> > We actually got to the bottom of all of them except for one. The 
> > content truncation was due to an inconsistency bug in the Nutch config.
> > The "no permission to extract text" error is actually correct: the author,
> > the NC Department of Revenue, put this restriction on all of their files
> > (I have asked them to remove it, as it hampers public accessibility).  The
> > null pointer exception is the only one left to deal with that may be due
> > to the parsing bug.  Is this the one that you are referring to?
> >
> > -----Original Message-----
> > From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, March 02, 2006 4:07 PM
> > To: Richard Braman
> > Cc: [email protected]; [email protected];
> > [EMAIL PROTECTED]
> > Subject: Re: [PDFBox-user] PDF Parse Error
> >
> >
> >
> > I believe these errors are due to a parsing bug in PDFBox that has 
> > been fixed since the 0.7.2 release.  Please give the nightly 
> > build (should be a drop-in replacement) a try from 
> > http://www.pdfbox.org/dist and let me know if you are still having 
> > issues.
> >
> > Ben



