We are using poppler for parsing and indexing scientific articles. For this purpose I wrote some bindings to poppler-cpp for the R programming language. A few questions:
- Many of our PDF files give parsing errors, such as "Failed to get object num from hint tables", "Expected the optional content group list, but wasn't able to find it", or "insufficient arguments for Marked Content". Examples of problematic PDF files are here: https://github.com/sckott/pdftoolspdfs. Are all of these files corrupted, or are these limitations in poppler? Each of these files seems to open just fine in any PDF reader.
- Is there any sensible way to extract tabular data from PDF documents in a machine-readable form (such as XML, CSV, or HTML)? I noticed that pdftotext with the -layout option does a really nice job positioning the table contents, so I suppose poppler must have picked up on the table internally?
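
To make the second question more concrete, here is a rough sketch of the kind of post-processing I can do today on the `pdftotext -layout` output. It is written in Python rather than R just to keep it short. The function names are made up for illustration, and the rule "two or more spaces separate columns" is purely my own assumption, not something poppler provides or guarantees.

```python
import csv
import io
import re
import subprocess


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Heuristic (an assumption, not poppler behaviour): a run of two or
    more whitespace characters separates columns, while single spaces
    stay inside a cell. Blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        rows.append(re.split(r"\s{2,}", stripped))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with plain newline terminators."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Something like `rows_to_csv(layout_text_to_rows(pdf_to_layout_text("article.pdf")))` works on simple tables, but it breaks on empty cells, multi-line cells, and ordinary prose that happens to contain wide gaps, which is why I am asking whether poppler exposes anything more structured.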
