Raymond Wu created TIKA-1679:
--------------------------------
Summary: Parse PDF file page by page
Key: TIKA-1679
URL: https://issues.apache.org/jira/browse/TIKA-1679
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.9
Reporter: Raymond Wu
I have a PDF file that contains 5 pages.
Page 3 cannot be parsed by PDFBox, but the remaining pages are fine.
So I tried parsing the file page by page.
To do this, I modified the method PDF2XHTML.process() in PDF2XHTML.java:
    public static void process(
            PDDocument document, ContentHandler handler, Metadata metadata,
            boolean extractAnnotationText, boolean enableAutoSpace,
            boolean suppressDuplicateOverlappingText, boolean sortByPosition)
            throws SAXException, TikaException {
        try {
            // Extract text using a dummy Writer as we override the
            // key methods to output to the given content
            // handler.
            Writer dummyWriter = new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) {
                }
                @Override
                public void flush() {
                }
                @Override
                public void close() {
                }
            };
            // Parse page by page
            int nop = document.getNumberOfPages();
            for (int i = 1; i <= nop; i++) {
                PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
                        extractAnnotationText, enableAutoSpace,
                        suppressDuplicateOverlappingText, sortByPosition);
                try {
                    pdf2XHTML.setStartPage(i);
                    pdf2XHTML.setEndPage(i);
                    pdf2XHTML.writeText(document, dummyWriter);
                } catch (Exception e) {
                    // TODO ...
                }
            }
        } catch (IOException e) {
            if (e.getCause() instanceof SAXException) {
                throw (SAXException) e.getCause();
            } else {
                throw new TikaException("Unable to extract PDF content", e);
            }
        }
    }
With this change, the method can parse PDFs in which some pages are broken.
I know it's not an optimal design,
but it is enough to solve my problem.
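
The empty catch block in the loop is where a failed page gets skipped. One possible way to fill it in is to record which pages failed instead of dropping the error silently. The following is only a rough sketch of that idea in plain Java (8 or later), not Tika code: PageExtractor and extractAll are hypothetical names, and the lambda in main merely stands in for the setStartPage/setEndPage/writeText calls.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageByPageSketch {

    /** Hypothetical stand-in for the per-page extraction call; not a Tika or PDFBox type. */
    public interface PageExtractor {
        void extract(int pageNumber) throws Exception;
    }

    /**
     * Runs the extractor on pages 1..pageCount (1-based, like the loop in the patch).
     * A failure on one page is caught so the remaining pages are still processed.
     * Returns the numbers of the pages that failed, in order.
     */
    public static List<Integer> extractAll(int pageCount, PageExtractor extractor) {
        List<Integer> failedPages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                extractor.extract(page);
            } catch (Exception e) {
                // Remember the failure rather than swallowing it.
                failedPages.add(page);
            }
        }
        return failedPages;
    }

    public static void main(String[] args) {
        StringBuilder extracted = new StringBuilder();
        // Simulate a 5-page document whose third page cannot be parsed.
        List<Integer> failed = extractAll(5, page -> {
            if (page == 3) {
                throw new IOException("simulated broken page");
            }
            extracted.append("page ").append(page).append('\n');
        });
        System.out.print(extracted);
        System.out.println("failed pages: " + failed);
    }
}
```

The caller could then decide what to do with the list of failed pages, for example log them or report them alongside the extracted text. Whether a SAXException from the content handler should be skipped in the same way or rethrown is a separate decision that this sketch does not address.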
From Tika 1.4 through 1.9, I have had to recompile every version to work around this problem.
So I'd like to see this improvement made in the parser itself.