Larry Evans wrote:
> Well, I did see code here:
>
>   sdext/source/pdfimport/pdfparse/pdfparse.cxx
>
> but that looked like it used boost/spirit to parse the pdf file
> (about line 553):
>
>   boost::spirit::parse( pBuffer,
>                         pBuffer+nLen,
>                         aGrammar,
>                         boost::spirit::space_p );
>
That's chiefly to deal with hybrid PDF, where the importer needs to
detect early on that, instead of parsing PDF, it should load the
embedded ODF file. So for understanding real PDF import, simply ignore
that part.

> Hence, I guess Poplar/xpdf does some sophisticated
> processing that the use of boost::spirit does not do or is
> incapable of doing. Of course, I'm jumping to conclusions
> which hopefully people of the devel list will correct :)
>
Yes. Poppler does the actual PDF processing (it also powers most of
the Linux desktop PDF viewers, like Okular or Evince).

> > In general - it would be -way- better to pick up something like eg.
> > pdfium - and add a rendering front-end there to match first, the same
> > protocol (but we can do this in-process), and subsequently to simplify
> > and factor lots of that madness out =) PDFium seems to be gaining
> > traction in browsers (Chrome + Firefox) and so on.
>
> Thanks for the pointer. I'm googling for PDFium now.
>
For the import of PDF into Draw/Writer (as opposed to simply rendering
PDF as a picture), the above is a bit of a red herring. The added
complexity in terms of code for doing this in a separate process is
pretty low; the real challenge for that sort of thing is decent layout
detection. There's been a GSoC project proposal to hook up something
like Tesseract or other OCR engines to help with that, sadly with
little traction so far. ;)

> I'm trying to solve the problem I posed earlier in this
> post:
>
>   https://lists.freedesktop.org/archives/libreoffice/2014-January/059106.html
>
Ah, XFA. Well then, poppler does not have support for that; pdfium
apparently has a branch:

  https://pdfium.googlesource.com/pdfium/+/xfa

No idea how usable that is, though. And from the grapevine, XFA seems
pretty dead as an architecture?

> I've also noticed that the font sizes and location of
> letters is sometimes not correct; hence, I'd like to figure
> out how to correct that.
>
That's mostly due to prioritizing editability over accuracy. The code
to look at is in sdext/source/pdfimport/tree/drawtreevisiting.cxx,
which writes out ODF from the render tree.

Hope that helps,

-- Thorsten
_______________________________________________ LibreOffice mailing list LibreOffice@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/libreoffice