I've searched and searched the archives for any mention of this particular IO error. I suspect this is another newbie error, but we've worked through most of those using answers I found in the archives, and I can't find this one. Any suggestions on what I can investigate next are appreciated.

nutch-site.xml is configured to load the PDF parser plugin, and the HTTP content limit has been raised to accommodate the larger file sizes encountered with PDFs.

The crawl of the website finishes but throws an error for, I think, every PDF, large or small. To rule out the obvious: we do not impose quotas on the space within which Nutch is operating.

The hadoop log error:

2008-02-21 00:35:24,720 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.UnixFileSystem.createFileExclusively(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.File.checkAndCreate(File.java:1345)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.File.createTempFile(File.java:1434)
2008-02-21 00:35:24,723 WARN parse.pdf - at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.cos.COSDocument.<init>(COSDocument.java:105)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:112)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:127)
2008-02-21 00:35:24,723 WARN parse.pdf - at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
2008-02-21 00:35:24,723 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,724 WARN parse.pdf - at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.BaseParser.readUntilEndStream(BaseParser.java:410)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:308)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:368)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:153)
2008-02-21 00:35:24,724 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:440)
2008-02-21 00:35:24,722 WARN parse.pdf - at java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,725 WARN parse.pdf - at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
2008-02-21 00:35:24,725 WARN parse.ParseUtil - Unable to successfully parse content http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf of type application/pdf
2008-02-21 00:35:24,725 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,725 WARN parse.pdf - at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,725 WARN fetcher.Fetcher - Error parsing: http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf: failed(2,0): Can't be handled as pdf document. java.io.IOException: Disc quota exceeded


The crawl log error merely echoes the last line above.
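
Since I'm only guessing that every PDF is affected, something like the following could confirm it by listing the distinct URLs that hit "Error parsing" in the log. This is a rough sketch: the class name ParseFailureCount is made up, the pattern is modelled only on the fetcher.Fetcher line quoted above, and the log path is passed on the command line because I'm not assuming where it lives.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: lists URLs whose parse failed. */
public class ParseFailureCount {

    // Modelled on the "fetcher.Fetcher - Error parsing: <url>: failed(...)" line above.
    private static final Pattern FAILED =
            Pattern.compile("fetcher\\.Fetcher - Error parsing: (\\S+): failed");

    /** Returns the distinct, sorted URLs reported as parse failures. */
    public static Set<String> failedUrls(Iterable<String> lines) {
        Set<String> urls = new TreeSet<>();
        for (String line : lines) {
            Matcher m = FAILED.matcher(line);
            // A flattened log may carry several entries on one physical line.
            while (m.find()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the hadoop log; readAllLines assumes UTF-8.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Set<String> urls = failedUrls(lines);
        for (String url : urls) {
            System.out.println(url);
        }
        System.out.println(urls.size() + " distinct URLs failed to parse");
    }
}
```

Comparing that count against the number of PDFs on the site would show whether the failure really is universal or tied to particular files.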

We are using the crawl script (meaning it does the fetch and parse together) rather than a no-parse fetch. But we're not getting heap or memory errors; the crawl and indexing finish successfully, omitting the PDF files.
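
One thing I can check on my own: the trace fails inside File.createTempFile and RandomAccessFile.write, so the parser appears to be writing scratch files somewhere, possibly the JVM's default temp directory rather than the crawl directory. Below is a minimal sketch of a probe for that, assuming the scratch files go to java.io.tmpdir (I haven't confirmed that from the PDFBox source). TempDirProbe is a made-up name, and the 1 MB write size is arbitrary.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical diagnostic, not part of Nutch or PDFBox. */
public class TempDirProbe {

    /**
     * Creates a scratch file in dir and writes the given number of bytes
     * through RandomAccessFile, the same class that fails in the trace.
     * Returns null on success, or the exception message on failure.
     */
    public static String probe(File dir, int bytes) {
        File tmp = null;
        try {
            tmp = File.createTempFile("probe", ".tmp", dir);
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                raf.write(new byte[bytes]);
            }
            return null;
        } catch (IOException e) {
            return e.getMessage();
        } finally {
            if (tmp != null) {
                tmp.delete(); // best-effort cleanup of the scratch file
            }
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("java.io.tmpdir = " + dir.getAbsolutePath());
        // What the JVM reports for the volume; may not reflect per-user quotas.
        System.out.println("usable bytes   = " + dir.getUsableSpace());
        String err = probe(dir, 1 << 20);
        System.out.println(err == null
                ? "probe write succeeded"
                : "probe write failed: " + err);
    }
}
```

This would need to run as the same user and on the same host as the crawl. If the probe fails there with the same "Disc quota exceeded" message, the quota is on the temp filesystem and not on the space Nutch itself is operating in.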

Fred Gilmore
Sr. Operating Systems Specialist
University of Texas at Austin Libraries
