On 12/07/13 12:43, Carlos Garcia Campos wrote:
Richard Wossal writes:
Hi!
I'm trying to use poppler to extract text from PDFs, and I've found
empirically that using the "raw order" option gives better results (I can
supply example files where non-raw order returns mangled text, if needed).
Yes, please. It would help to see some of those examples.
Here are some samples:
If you save the following google doc as a PDF (File->Download as):
https://docs.google.com/document/d/1U6SsDnTIce3IH-GhdKpx_uStQQSzSCsACoPkvmZtqTc/edit?usp=sharing
$ pdftotext -v
pdftotext version 0.18.4
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
$ pdftotext ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle
T iin r l x
h oma t t
ss
e
This is underlined text
$ pdftotext -raw ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle
This is normal text
This is underlined text
This is a Heading
Here’s some nonascii stuff: öäüß§
Similar effects can be observed for the title page of
http://www.farmworkerjustice.org/sites/default/files/documents/7.2.a.6%20fwj.pdf
While looking at it more closely now, it appears that sometimes
non-raw reading order gives better results, as with
http://win.niddk.nih.gov/publications/pdfs/teenblackwhite3.pdf
$ pdftotext 'pdfs/teenblackwhite3.pdf' - | head
A Guide for
Teenagers!
Take
C h a rg e
of
Your
$ pdftotext -raw 'pdfs/teenblackwhite3.pdf' - | head
TakeTake
Charge
o f
Your Health!
A Guide for
Teenagers!
A GuideT fe oen r
TakeTake
agers!
Charge
(Just to give some sense as to the magnitude: the last two are from
a random sample of 100 PDFs my users threw at me. The google doc I
wrote myself, as a test case. So it's not exactly a huge problem.)
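One rough way to choose between the two outputs automatically is to count how
many tokens are single characters, since the mangled non-raw output of the
Google Doc sample is mostly one-letter fragments. The following is only a
sketch: pdftotext_cmd, extract, single_char_ratio and pick_order are made-up
helper names, not part of poppler. The heuristic does not detect duplicated or
interleaved words such as "TakeTake", so it should not be trusted for files
like the third sample.

```python
import subprocess


def pdftotext_cmd(pdf_path, raw=False):
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")
    return cmd + [pdf_path, "-"]


def extract(pdf_path, raw=False):
    """Run pdftotext (must be installed) and return its output as text."""
    result = subprocess.run(pdftotext_cmd(pdf_path, raw),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def single_char_ratio(text):
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


def pick_order(non_raw_text, raw_text):
    """Return "raw" if the raw text has fewer one-letter fragments."""
    if single_char_ratio(raw_text) < single_char_ratio(non_raw_text):
        return "raw"
    return "non-raw"


# First lines of the Google Doc sample from this message.
NON_RAW = "This is a title\nThis is a subtitle\nT iin r l x\nh oma t t\nss\ne\n"
RAW = "This is a title\nThis is a subtitle\nThis is normal text\n"

if __name__ == "__main__":
    print(pick_order(NON_RAW, RAW))
```

With a real file, the two inputs would come from extract(path) and
extract(path, raw=True).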
As far as I can see, I could either:
* hack something so I can extract text in raw-order using the Glib-bindings
(I'd prefer staying C-only, but I don't see how this would be possible,
except by adding it to the bindings)
* or re-implement poppler_page_get_text_attributes in C++, using poppler's
private API (or take poppler's implementation)
What do you think would be the best way to go about that?
If you really need to get the text in raw order, we can add new methods to
the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, lines and words).
Being able to iterate over basically some kind of AST of the PDF
(say, chars+attributes) would be pretty nice indeed.
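To make the idea a bit more concrete, here is a sketch of what iterating over
chars+attributes with a selectable break level might look like. Everything in
it (Glyph, iter_text, the unit parameter, the placeholder font name) is
hypothetical and only illustrates the shape of such an interface; it is not an
existing or planned poppler API.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Glyph:
    """One extracted character plus attributes (hypothetical structure)."""
    char: str
    font: str
    size: float
    underlined: bool = False


def iter_text(glyphs: List[Glyph], unit: str = "char") -> Iterator[List[Glyph]]:
    """Yield groups of glyphs per character, word or line.

    `glyphs` is assumed to be in the desired order already (raw or
    reading order); this function only applies the break iterator.
    """
    if unit == "char":
        for g in glyphs:
            yield [g]
        return
    if unit == "word":
        is_break = str.isspace
    elif unit == "line":
        is_break = lambda c: c == "\n"
    else:
        raise ValueError("unit must be 'char', 'word' or 'line'")
    group: List[Glyph] = []
    for g in glyphs:
        if is_break(g.char):
            # Close the current group at a break character, skipping empty ones.
            if group:
                yield group
            group = []
        else:
            group.append(g)
    if group:
        yield group


def make_glyphs(text: str, font: str = "Arial", size: float = 11.0) -> List[Glyph]:
    """Example helper: give every character the same placeholder attributes."""
    return [Glyph(c, font, size) for c in text]


def group_strings(glyphs: List[Glyph], unit: str) -> List[str]:
    """Join each yielded group back into a string, for easy inspection."""
    return ["".join(g.char for g in grp) for grp in iter_text(glyphs, unit)]
```

The area and order options mentioned above are left out here; they would
select and sort the glyphs before this grouping step.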
For myself, I've decided to go ahead with poppler-glib's
page_get_text_* for now. The failure rate is low enough for my
application. I was initially stumped that my simple google-doc test
case wouldn't parse correctly, but it doesn't seem to be such a big
problem with PDFs in the wild.
Thanks!
Richard