[
https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5613:
------------------------------------
Affects Version/s: 2.0.28
> incorrect paragraph split
> -------------------------
>
> Key: PDFBOX-5613
> URL: https://issues.apache.org/jira/browse/PDFBOX-5613
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.1, 2.0.28
> Reporter: Key Hutu
> Priority: Major
> Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, the paragraphs are split incorrectly.
> {code}
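> // The two classes below live in separate files; imports are listed once for brevity.
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> // Joins the lines of a paragraph with spaces and ends each paragraph with a line break.
> public class PDFParagraphTextStripper extends PDFTextStripper {
>     public PDFParagraphTextStripper() throws IOException {
>         this.setLineSeparator(" ");
>         this.setParagraphStart("");
>         this.setParagraphEnd(this.LINE_SEPARATOR);
>         this.setPageStart("");
>         this.setPageEnd("");
>         this.setArticleStart(this.LINE_SEPARATOR);
>         this.setArticleEnd(this.LINE_SEPARATOR);
>     }
> }
>
> public class PdfParser {
>     private static final String dataPath =
>         "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>
>     public static void main(String[] args) {
>         String fileName = "Daily Report.pdf";
>         try {
>             // Build the path with File so a separator is placed between directory and file name.
>             extract_pdfbox(new File(dataPath, fileName).getPath());
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static void extract_pdfbox(String filePath) throws Exception {
>         File file = new File(filePath);
>         // try-with-resources closes the document even if extraction throws.
>         try (PDDocument document = PDDocument.load(file)) {
>             PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>             String text = pdfTextStripper.getText(document);
>             System.out.println(text);
>         }
>     }
> }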
> {code}
> {noformat}
> Daily Report 1) which language is your text in? - English
> 2) some examples of sentences containing
> addresses you'd want to pick up - Data are
> contarct documents, it contains addresses in
> different formates(of different
> countries),some are comma saperated, some
> are new line saperated etc 3) perhaps
> examples of mistakes - currently en model
> of SpaCy is even not able to tag entities
> clearly 4) Are you training your own model
> or are you using a model as is? - tried as it is
> but very poor in results to need to know a
> generic approach to train own model. any
> {noformat}