[ https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-5613:
------------------------------------
    Summary: incorrect paragraph split  (was: uncorrent paragraph split)

> incorrect paragraph split
> -------------------------
>
>                 Key: PDFBOX-5613
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.28
>            Reporter: Key Hutu
>            Priority: Major
>         Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, I get incorrect paragraph information.
> {code}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> // Joins the lines of a paragraph with a space and ends each paragraph with a newline.
> public class PDFParagraphTextStripper extends PDFTextStripper {
>     public PDFParagraphTextStripper() throws IOException {
>         this.setLineSeparator(" ");
>         this.setParagraphStart("");
>         this.setParagraphEnd(this.LINE_SEPARATOR);
>         this.setPageStart("");
>         this.setPageEnd("");
>         this.setArticleStart(this.LINE_SEPARATOR);
>         this.setArticleEnd(this.LINE_SEPARATOR);
>     }
> }
>
> public class PdfParser {
>     private static final String dataPath =
>             "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>
>     public static void main(String[] args) {
>         String fileName = "Daily Report.pdf";
>         try {
>             // Build the path with File so a separator is placed between directory and file name.
>             extract_pdfbox(new File(dataPath, fileName).getPath());
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static void extract_pdfbox(String filePath) throws Exception {
>         File file = new File(filePath);
>         // try-with-resources closes the document even if extraction fails.
>         try (PDDocument document = PDDocument.load(file)) {
>             PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>             String text = pdfTextStripper.getText(document);
>             System.out.println(text);
>         }
>     }
> }
> {code}
> {noformat}
> Daily Report 1) which language is your text in?
- English
> 2) some examples of sentences containing
> addresses you'd want to pick up - Data are
> contarct documents, it contains addresses in
> different formates(of different
> countries),some are comma saperated, some
> are new line saperated etc 3) perhaps
> examples of mistakes - currently en model
> of SpaCy is even not able to tag entities
> clearly 4) Are you training your own model
> or are you using a model as is? - tried as it is
> but very poor in results to need to know a
> generic approach to train own model. any
> {noformat}
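
The issue does not state what output the reporter expects. If the goal is one paragraph per numbered item ("1)", "2)", ...), which is an assumption here, one possible workaround is to post-process the extracted text rather than rely on the stripper's paragraph detection. The sketch below uses only the Java standard library; the class NumberedItemSplitter and its split method are hypothetical names for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): regroups extracted text so that
// each numbered item such as "1) ..." becomes its own paragraph.
public class NumberedItemSplitter {

    // Split at whitespace that is directly followed by a marker like "2) ".
    // The lookahead keeps the marker at the start of the next paragraph.
    private static final Pattern ITEM_START = Pattern.compile("\\s+(?=\\d+\\)\\s)");

    public static List<String> split(String extracted) {
        // Undo the line-level breaks first: collapse all whitespace runs to one space.
        String flat = extracted.replaceAll("\\s+", " ").trim();
        List<String> paragraphs = new ArrayList<>();
        for (String part : ITEM_START.split(flat)) {
            if (!part.isEmpty()) {
                paragraphs.add(part);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Sample taken from the first lines of the output quoted above.
        String sample = "Daily Report 1) which language is your text in? - English\n"
                + "2) some examples of sentences containing\n"
                + "addresses you'd want to pick up";
        for (String paragraph : split(sample)) {
            System.out.println(paragraph);
        }
    }
}
```

This only helps when paragraphs really do start with such markers; a digit followed by ")" and a space in the middle of a sentence would also trigger a split, so it is a heuristic and not a fix for the paragraph detection itself.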