Re: text extraction from pdf

Andrzej Bialecki Wed, 14 May 2008 04:35:51 -0700

Cam Bazz wrote:

Hello All,


Any suggestions for extracting text from PDF? I have tried PDFBox, and it
works nicely, but if the PDF has a structured layout it won't give good
results. For example, consider this PDF:


P1 Lorem Ipsum Bla bla                                      P3 Lorem2 Ipsum2
P1 bla bla

P2 bla bla bla
P2 bla bla bla



Above, P1, P2 and P3 are meaningful paragraphs or fields. PDFBox will
convert this to

P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla

which is not useful to me.

The Unix program pdf2text can convert while keeping the text positions, but
I wanted to ask you guys if you know of something better.

AFAIK, PDFBox has a lower-level API that allows you to get hold of text positions.
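
As a rough sketch of what one could do once the positions are available: the code below is not PDFBox code. The Fragment class, its x/y fields (y assumed to grow downward) and the columnGap threshold are all made up for illustration, and you would have to fill the fragments from whatever position objects the library hands you. It groups runs of text into columns by their left edge and then reads each column top to bottom, which would keep P1 and P2 apart from P3 in the example above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnGrouper {

    /** Hypothetical positioned run of text: left edge x, top edge y (y grows downward). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into columns. A new column starts whenever a fragment's
     * left edge lies more than columnGap to the right of the current column's
     * left edge. Each column is then sorted top to bottom.
     */
    static List<List<Fragment>> groupIntoColumns(List<Fragment> fragments, double columnGap) {
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        List<List<Fragment>> columns = new ArrayList<>();
        double columnStart = 0;
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - columnStart > columnGap) {
                columns.add(new ArrayList<>());
                columnStart = f.x;
            }
            columns.get(columns.size() - 1).add(f);
        }
        for (List<Fragment> column : columns) {
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y));
        }
        return columns;
    }

    /** Writes the columns one after another, separated by a blank line. */
    static String render(List<List<Fragment>> columns) {
        StringBuilder out = new StringBuilder();
        for (List<Fragment> column : columns) {
            for (Fragment f : column) {
                out.append(f.text).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Coordinates are invented to mimic the layout in the question.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(0, 10, "P1 Lorem Ipsum Bla bla"));
        fragments.add(new Fragment(300, 10, "P3 Lorem2 Ipsum2"));
        fragments.add(new Fragment(0, 22, "P1 bla bla"));
        fragments.add(new Fragment(0, 46, "P2 bla bla bla"));
        fragments.add(new Fragment(0, 58, "P2 bla bla bla"));
        System.out.print(render(groupIntoColumns(fragments, 100)));
    }
}
```

A fixed gap threshold is the simplest thing that could work. It assumes one fragment per run of text on a line and columns that do not overlap horizontally; real documents may need something smarter, and splitting P1 from P2 would additionally need a check on the vertical gap.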




--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

