Cam Bazz wrote:
Hello All,
Any suggestions for extracting text from PDF? I have tried PDFBox, and it
works nicely, but if the PDF is structured it won't give good results.
For example, consider this PDF:
P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla
P2 bla bla bla
P2 bla bla bla
Above, P1, P2 and P3 are meaningful paragraphs or fields. PDFBox will
convert this to:
P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla
which is not useful to me.
The Unix program pdf2text can convert while keeping the text positions, but
I wanted to ask whether any of you know of something better.
AFAIK, PDFBox has a lower-level API that allows you to get hold of text
positions.
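
Once you have the positions, something along these lines might keep the
fields apart. This is only a rough sketch, not PDFBox code: the Fragment
class, the groupByColumn method and the column boundaries are all
hypothetical names and values made up for illustration. It assumes you have
already collected each piece of text with its x/y coordinates (in recent
PDFBox versions, one way is to subclass PDFTextStripper and read the
TextPosition objects it hands you), and that y grows down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnGrouper {

    /** Hypothetical holder for one piece of text and where it sits on the page. */
    public static final class Fragment {
        final float x;
        final float y;
        final String text;

        public Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Assigns each fragment to a column by its x coordinate, then joins the
     * text of every column in top-to-bottom, left-to-right order.
     *
     * columnStarts must be sorted ascending; a fragment belongs to the last
     * column whose start is less than or equal to its x. Fragments left of
     * the first start fall into the first column.
     */
    public static List<String> groupByColumn(List<Fragment> fragments,
                                             float[] columnStarts) {
        List<List<Fragment>> columns = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            columns.add(new ArrayList<>());
        }

        for (Fragment f : fragments) {
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            columns.get(col).add(f);
        }

        // Assumption: smaller y means higher on the page.
        Comparator<Fragment> readingOrder = (a, b) ->
                a.y != b.y ? Float.compare(a.y, b.y) : Float.compare(a.x, b.x);

        List<String> result = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(readingOrder);
            StringBuilder sb = new StringBuilder();
            for (Fragment f : column) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

With your example, and made-up column starts of 0 and 250, the P1/P2 text
would end up in the first string and "P3 Lorem2 Ipsum2" in the second,
instead of P3 being glued onto the end of the first P1 line. Fixed
boundaries are the simplest choice; for documents whose layout varies you
would probably have to detect the gaps between columns instead.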
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com