Hi Suzuki, have you noticed any problems while using the patched poppler-dump utility?
On Tue, May 8, 2018 at 2:25 AM, obsidian . <[email protected]> wrote:

> Thanks Suzuki.
>
> I was looking for something more tried, tested and "stable".
> I'm kind of surprised there's no other way to output char level
> information.
>
> On Sat, May 5, 2018 at 9:41 AM, Adam Reichold <[email protected]>
> wrote:
>
>> Hello again,
>>
>> so I obviously forgot the attachment... |:-\ Sorry for that.
>>
>> Regards,
>> Adam
>>
>> On 05.05.2018 at 08:16, Adam Reichold wrote:
>> > Hello mpsuzuki,
>> >
>> > attached is a version of your patch with some inline comments.
>> >
>> > Generally speaking, I would say that some well-defined format like JSON
>> > or YAML would be preferable to the ad-hoc encoding?
>> >
>> > Best regards,
>> > Adam
>> >
>> > On 03.05.2018 at 13:50, suzuki toshiya wrote:
>> >> The current poppler-dump (a testing tool of cpp-frontend) has no option
>> >> to demonstrate the per-character bbox feature.
>> >> The attached patch adds an option to demonstrate it (I'm not saying "this
>> >> is ready to use, please use"; I want to understand your request and whether
>> >> existing features could cover some part of your requests).
>> >>
>> >> The patched poppler-dump can work like this:
>> >>
>> >> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>> >> Page 1/1:
>> >> ---
>> >> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>> >> [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>> >> [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>> >> [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>> >> [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>> >> [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>> >> [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>> >> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>> >> [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>> >> [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>> >> [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>> >> [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>> >> [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>> >> [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>> >> [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>> >> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>> >> [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>> >> [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>> >> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>> >> [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>> >> [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>> >> [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>> >> [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>> >> ...
>> >>
>> >> Regards,
>> >> mpsuzuki
>> >>
>> >> suzuki toshiya wrote:
>> >>> Dear obsidian,
>> >>>
>> >>> Too many posts about similar issues :-)
>> >>> I'm not sure whether the poppler maintainers are interested in
>> >>> enhancing pdftotext,
>> >>> but recently Jeroen and I were working with cpp-frontend to have
>> >>> similar features.
>> >>>
>> >>> In the latest version of poppler,
>> >>> cpp-frontend has a feature to retrieve the list of words with their
>> >>> bounding boxes,
>> >>> and it can retrieve the bounding box for each glyph in a word.
>> >>>
>> >>> --
>> >>>
>> >>> Also, I proposed a patch to retrieve the font family and point size:
>> >>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>> >>>
>> >>> It might be waiting for the maintainers' review. The discussion and
>> >>> results can be found here:
>> >>> https://github.com/ropensci/pdftools/issues/29
>> >>>
>> >>> --
>> >>>
>> >>>> - style, i.e. none, bold, italic
>> >>>
>> >>> If the document producer has a bold font and used it in the document,
>> >>> such as Helvetica-Bold,
>> >>> it would be found by the family name.
>> >>> But if the document producer has no bold font and lets the word
>> >>> processor software synthesize emboldened fonts,
>> >>> it would be difficult for the PDF renderer to recognize it as an
>> >>> emboldened font,
>> >>> because the emboldening is done by drawing the same glyph again with a
>> >>> subtle shift.
>> >>> Simple PDF renderers would be unable to distinguish "normal font but
>> >>> layered" from "emboldened font".
>> >>>
>> >>> Regards,
>> >>> mpsuzuki
>> >>>
>> >>> obsidian . wrote:
>> >>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>> >>>>
>> >>>> Here's a sample line from the output:
>> >>>> <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
>> >>>>
>> >>>> Is there a way to get font information for every word like:
>> >>>> - font family, e.g. Verdana
>> >>>> - style, i.e. none, bold, italic
>> >>>> - size, e.g. font size 9
>> >>>>
>> >>>> I'm using pdftotext version 0.55.0 on Windows.
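
For anyone who wants to experiment with the glyph list quoted above, here is a rough Python sketch that turns that ad-hoc text into JSON, along the lines of the well-defined format Adam suggests. The function name parse_glyph_list is made up for illustration, and the line format is inferred only from the sample output in this thread, not from the patch itself. In particular, it assumes that an entry whose label is all digits and that follows a word is a glyph index; a word in the PDF that is itself a number would be misread, which is part of why a structured format would be preferable.

```python
import json
import re

# Matches one entry of the ad-hoc format quoted in this thread, e.g.
#   [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
# Assumption: the layout is inferred from the sample output only.
ENTRY = re.compile(
    r"\[(?P<label>.*)\]\s*@\s*\(\s*"
    r"x=(?P<x>\S+)\s+y=(?P<y>\S+)\s+w=(?P<w>\S+)\s+h=(?P<h>\S+)\s*\)"
)


def parse_glyph_list(text):
    """Group word entries and their per-glyph boxes into a list of dicts."""
    words = []
    for line in text.splitlines():
        match = ENTRY.search(line)
        if match is None:
            continue  # page headers, "---" separators, blank lines
        box = {key: float(match.group(key)) for key in ("x", "y", "w", "h")}
        label = match.group("label")
        # Assumption: an all-digit label after a word is a glyph index.
        if label.isdigit() and words:
            words[-1]["glyphs"].append(box)
        else:
            words.append({"text": label, "bbox": box, "glyphs": []})
    return words


SAMPLE = """\
Page 1/1:
---
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
[0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
[1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
"""

if __name__ == "__main__":
    print(json.dumps(parse_glyph_list(SAMPLE), indent=2))
```

Each word becomes an object with its text, its own bbox, and a list of glyph boxes in order, so the glyph index is simply the position in that list.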
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
