Hi,

Michael Smolyak wrote:
> Hello,
>
> I am new to PDFBox, so I apologize ahead of time if this is not an
> appropriate forum for this sort of question.
> I have a requirement to extract text from PDF documents, breaking it into
> paragraphs. The examples of text extraction
> I saw did not make it clear whether this is possible. HTML extraction
> identifies lines and pages but not paragraphs.
> Is it possible to extract text from PDF documents one paragraph at a time?
> If so, could you supply a code sample?

For now, PDFBox doesn't recognize paragraphs during extraction. However, something that may meet your needs is in progress: Mel and Navendu are working on a patch to improve the text extraction. Have a look at [1] and [2] for further details.
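In the meantime, one rough workaround is to post-process the plain text that extraction already produces. The following is only a minimal sketch, assuming the extracted text happens to separate paragraphs with blank lines, which many PDFs will not do. `ParagraphSplitter` and its `split` method are hypothetical names for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of PDFBox: groups already-extracted text
// into paragraphs, treating one or more blank lines as a paragraph break.
final class ParagraphSplitter {

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // \R matches any line break (\n, \r\n, \r, ...); requires Java 8+.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the paragraph collected so far, if any.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Non-blank line: join it to the current paragraph with a space.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // Flush the last paragraph when the text does not end with a blank line.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

The input string would typically be the output of `PDFTextStripper.getText(...)`. If the extracted text contains no blank lines between paragraphs, this heuristic returns everything as a single paragraph, so it is not a substitute for real paragraph detection.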

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-521
[2] https://issues.apache.org/jira/browse/PDFBOX-533