[poppler] Small issue with extracting prices from PDF

Jean-Sebastien Vachon Wed, 12 Dec 2018 13:26:39 -0800

Hi all,

I just started using the pdftotext Python module to extract text from PDFs,
and it really does look good, so thanks for your hard work.


The only issue I am having right now is with extracting pricing
information, such as the prices on a menu. A lot of restaurants won't use a
dot to separate dollars and cents but will rely on a slightly smaller font
size for the cents. As a result, an item listed at 4.00$ comes out as 400...

Is there any way to detect such changes in font size/color and treat the
differently styled runs as separate words?

I am not sure whether it would be better to support this on the Python side
or directly within poppler.
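
To illustrate the behaviour I have in mind, here is a minimal sketch. It
assumes the extractor could hand back each character together with its font
size; as far as I can tell the pdftotext module only returns plain text, so
the (character, size) input format, the function name split_on_font_size and
the tolerance parameter are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def split_on_font_size(
    chars: Iterable[Tuple[str, float]], tolerance: float = 0.5
) -> List[str]:
    """Group characters into words.

    A new word starts on whitespace, or whenever the font size differs
    from the previous character's size by more than `tolerance` points.
    The (character, size) pairs are a hypothetical input format.
    """
    words: List[str] = []
    current: List[str] = []
    previous_size: Optional[float] = None

    for char, size in chars:
        if char.isspace():
            # Whitespace always ends the current word.
            if current:
                words.append("".join(current))
                current = []
            previous_size = None
            continue

        size_changed = (
            previous_size is not None
            and abs(size - previous_size) > tolerance
        )
        if current and size_changed:
            # Font size jumped: treat the following run as a separate word.
            words.append("".join(current))
            current = []

        current.append(char)
        previous_size = size

    if current:
        words.append("".join(current))
    return words


# Made-up data: "400" where the cents are set in a smaller font.
price_chars = [("4", 12.0), ("0", 8.0), ("0", 8.0)]
parts = split_on_font_size(price_chars)
print(".".join(parts))
```

With the dollars and cents returned as separate words, the caller could
decide how to join them (with a dot here) instead of getting a single "400".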

Thanks

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
