Re: identifying paragraphs in PDF documents

Andreas Lehmkühler Tue, 06 Oct 2009 22:42:40 -0700

Hi,

Michael Smolyak wrote:
> Hello,
> 
> I am new to PDFBox, so I apologize ahead of time if this is not an
> appropriate forum for this sort of question.
> I have a requirement to extract text from PDF documents, breaking it into
> paragraphs. The examples of text extraction
> I saw did not make it clear whether this is possible. HTML extraction
> identifies lines and pages but not paragraphs.
> Is it possible to extract text from PDF documents one paragraph at a time?
> If so, could you supply a code sample?
For now PDFBox doesn't recognize paragraphs during extraction. But there
may be something on the way that fulfils your needs: Mel and Navendu
are working on a patch to improve the text extraction. Have a look at
[1] and [2] for further details.
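
Until such a patch is available, a rough workaround is to post-process the
plain text after extraction. The following is only a minimal sketch in plain
Java, not part of PDFBox: the class ParagraphSplitter and its method
splitParagraphs are hypothetical names, and it assumes the extracted text
separates paragraphs with at least one blank line, which many PDFs will not
satisfy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
public class ParagraphSplitter {

    /**
     * Groups already-extracted text into paragraphs.
     * Assumption: one or more blank lines mark a paragraph boundary.
     * Lines within a paragraph are joined with a single space.
     */
    public static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the paragraph collected so far, if any.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // Flush the last paragraph when the text does not end with a blank line.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String sample = "First line of one\nparagraph.\n\nSecond paragraph.";
        for (String p : splitParagraphs(sample)) {
            System.out.println(p);
        }
    }
}
```

The input string would be whatever your text extraction step returns. If the
extracted text has no blank lines between paragraphs, this heuristic returns
everything as one paragraph, and position-based detection as discussed in the
issues below would be needed instead.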



BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-521
[2] https://issues.apache.org/jira/browse/PDFBOX-533
