crawl errors over pdf files

Fred Gilmore Fri, 22 Feb 2008 14:29:53 -0800

I've searched and searched the archives for any mention of this particular IO error. I suspect this is another newbie error, but most of those I've found in the archives and we've worked through. I can't find this one. Any suggestions on what I can investigate next are appreciated.

nutch-site.xml is configured to load the PDF parser plugin, and the HTTP limit has been raised to accommodate the larger file sizes encountered with PDFs.
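
A minimal sketch of what those two nutch-site.xml settings typically look like, assuming the standard Nutch property names plugin.includes and http.content.limit. The plugin list and the limit value are illustrative only and are not copied from the actual configuration:

```xml
<configuration>
  <!-- Illustrative plugin list: parse-pdf is enabled by adding "pdf"
       to the parse-(...) group. The other plugin names are assumptions. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Maximum number of bytes fetched per document over HTTP.
       The value below (10 MB) is a placeholder, not the real setting. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```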

The crawl of the website finishes but throws an error for, I think, every PDF, large or small. To rule out the obvious, we do not impose quotas on the space within which nutch is operating.


the hadoop log error:

2008-02-21 00:35:24,720 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.UnixFileSystem.createFileExclusively(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.File.checkAndCreate(File.java:1345)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.File.createTempFile(File.java:1434)
2008-02-21 00:35:24,723 WARN parse.pdf - at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.cos.COSDocument.<init>(COSDocument.java:105)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:112)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:127)
2008-02-21 00:35:24,723 WARN parse.pdf - at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,724 WARN parse.pdf - at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.BaseParser.readUntilEndStream(BaseParser.java:410)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:308)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:368)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:153)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:440)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,725 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
2008-02-21 00:35:24,725 WARN parse.ParseUtil - Unable to successfully parse content http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf of type application/pdf
2008-02-21 00:35:24,725 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,725 WARN parse.pdf - at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,725 WARN fetcher.Fetcher - Error parsing: http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf: failed(2,0): Can't be handled as pdf document. java.io.IOException: Disc quota exceeded



the crawl log error merely echoes the last line above.

We are using the crawl script (meaning it does the fetch and parse steps together) rather than a no-parse fetch. We're not getting heap or memory errors, though: the crawl and indexing finish successfully, just without the PDF files.


Fred Gilmore
Sr. Operating Systems Specialist
University of Texas at Austin Libraries
