I've searched and searched the archives for any mention of this
particular IO error. I suspect this is another newbie error, but we've
found and worked through most of those using the archives. I can't
find this one. Any suggestions on what I can investigate next are
appreciated.
nutch-site.xml is configured to load the pdf parser plugin, and the http
limit has been raised to accommodate the larger file sizes encountered
with pdfs.
The crawl of the website finishes but throws an error for, I think,
every PDF, large or small. To rule out the obvious, we do not impose
quotas on the space within which nutch is operating.
The hadoop log error:
2008-02-21 00:35:24,720 WARN parse.pdf - General exception in PDF
parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF
parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - General exception in PDF
parser: Disc quota exceeded
2008-02-21 00:35:24,721 WARN parse.pdf - java.io.IOException: Disc
quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc
quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - java.io.IOException: Disc
quota exceeded
2008-02-21 00:35:24,722 WARN parse.pdf - at
java.io.UnixFileSystem.createFileExclusively(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at
java.io.File.checkAndCreate(File.java:1345)
2008-02-21 00:35:24,722 WARN parse.pdf - at
java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,722 WARN parse.pdf - at
java.io.File.createTempFile(File.java:1434)
2008-02-21 00:35:24,723 WARN parse.pdf - at
java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,723 WARN parse.pdf - at
org.pdfbox.cos.COSDocument.<init>(COSDocument.java:105)
2008-02-21 00:35:24,723 WARN parse.pdf - at
org.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:112)
2008-02-21 00:35:24,723 WARN parse.pdf - at
org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:127)
2008-02-21 00:35:24,723 WARN parse.pdf - at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
2008-02-21 00:35:24,723 WARN parse.pdf - at
org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,724 WARN parse.pdf - at
java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.pdfbox.pdfparser.BaseParser.readUntilEndStream(BaseParser.java:410)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:308)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:368)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:153)
2008-02-21 00:35:24,724 WARN parse.pdf - at
org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:440)
2008-02-21 00:35:24,722 WARN parse.pdf - at
java.io.RandomAccessFile.writeBytes(Native Method)
2008-02-21 00:35:24,725 WARN parse.pdf - at
org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
2008-02-21 00:35:24,725 WARN parse.ParseUtil - Unable to successfully
parse content
http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf
of type application/pdf
2008-02-21 00:35:24,725 WARN parse.pdf - at
org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:91)
2008-02-21 00:35:24,725 WARN parse.pdf - at
java.io.RandomAccessFile.write(RandomAccessFile.java:456)
2008-02-21 00:35:24,725 WARN fetcher.Fetcher - Error parsing:
http://www.lib.utexas.edu/benson/lagovdocs/uruguay/federal/defensa/memoria2002/opp.pdf:
failed(2,0): Can't be handled as pdf document. java.io.IOException: Disc
quota exceeded
The crawl log error merely echoes the last line above.
We are using the crawl script (meaning it does the fetch and parse
together) rather than a no-parse fetch. We're not getting heap or memory
errors, though: the crawl and indexing finish successfully but omit the
pdf files.
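Reading the trace, the exception seems to surface under
java.io.File.createTempFile (called from the COSDocument constructor) and
under RandomAccessFile.write, so the space in question may be the JVM
temp directory (the java.io.tmpdir system property) rather than the crawl
directory. Below is a minimal sketch for checking that outside Nutch. It
is only an assumption that this is where the quota applies; the class
name TempDirCheck and the probe method are made up for illustration and
are not part of Nutch or PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical diagnostic, not part of Nutch or PDFBox.
class TempDirCheck {

    // Creates a temp file in the default temp directory (java.io.tmpdir),
    // writes the requested number of zero bytes through RandomAccessFile,
    // and returns the resulting file length. Any quota or permission
    // problem should show up here as an IOException.
    static long probe(int bytes) throws IOException {
        File tmp = File.createTempFile("probe", ".tmp");
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            raf.write(new byte[bytes]);
            return raf.length();
        } finally {
            tmp.delete(); // clean up whether or not the write succeeded
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        System.out.println("wrote " + probe(1 << 20) + " bytes");
    }
}
```

Run as the same user and on the same host as the crawl, this should
either print the byte count or fail with an IOException like the one in
the log. Running it again with -Djava.io.tmpdir set to a directory known
to have free space would show whether the temp location makes a
difference.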
Fred Gilmore
Sr. Operating Systems Specialist
University of Texas at Austin Libraries