I maintain R bindings called pdftools, which are mostly used for extracting text from scientific documents. The bindings wrap the C++ API; in particular, we convert PDF to text using poppler::page::text() with physical_layout.
Recently users have started to report changes in behaviour with newer versions of poppler, in particular with respect to whitespace. For example, all pages are now terminated with a '\f' character, which was not the case before. On Windows, linebreaks are now emitted as '\r\n' instead of just '\n' as before (we use mingw-w64 compilers). Also, some documents that would contain a single linebreak in e.g. poppler 0.73 now have 4 or 5 linebreaks in the same place with the latest poppler. I had a look at the changelog but couldn't find any mention of this. Are these changes expected? The new behaviour is breaking some existing pipelines, where people were using e.g. line offsets to extract fragments of the text.
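
To make the first two differences concrete, here is a minimal sketch of the kind of post-processing the bindings could apply to get the old output back. It assumes the page text has already been converted to a UTF-8 std::string; normalize_page_text is a hypothetical helper name, not part of poppler or pdftools. It deliberately leaves the extra linebreaks alone, since without knowing their cause they cannot be told apart from genuine blank lines.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of poppler or pdftools): undo the two
// unambiguous whitespace changes on one page of extracted text.
std::string normalize_page_text(const std::string& text) {
    std::string out;
    out.reserve(text.size());

    // Convert each "\r\n" pair to a single '\n'; a lone '\r' is kept as-is.
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            continue;  // skip the '\r'; the '\n' is copied on the next iteration
        }
        out.push_back(text[i]);
    }

    // Drop a single trailing form feed, if present.
    if (!out.empty() && out.back() == '\f') {
        out.pop_back();
    }
    return out;
}
```

With this, a page that comes back as "a\r\nb\f" would be reduced to "a\nb", while text that already uses plain '\n' and has no trailing '\f' passes through unchanged.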
