[
https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726900#comment-17726900
]
Key Hutu commented on PDFBOX-5613:
----------------------------------
[~tilman]
It is still incorrect in version 2.0.28.
I want the output paragraph by paragraph, not split line by line.
<expectoutput>
Daily Report
1) which language is your text in? - English
2) some examples of sentences containing addresses you'd want to pick up - Data
are contarct documents, it contains addresses in different formates(of
different countries),some are comma saperated, some are new line saperated etc
3) perhaps examples of mistakes - currently en model of SpaCy is even not able
to tag entities clearly 4) Are you training your own model or are you using a
model as is? - tried as it is but very poor in results to need to know a
generic approach to train own model. any
</expectoutput>
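Until the stripper itself groups these lines, one possible workaround is to regroup the extracted text afterwards. The following is only a minimal sketch, assuming every paragraph after the title starts with a numbered marker such as "1) ". The class name ParagraphRegrouper and the method regroup are hypothetical and not part of PDFBox, and the heuristic will split wrongly if a marker-like token such as "2) " appears inside a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphRegrouper {
    // Assumption: a new paragraph begins at whitespace followed by a marker like "1) ".
    private static final Pattern ITEM_START = Pattern.compile("\\s+(?=\\d+\\) )");

    /** Collapses all whitespace to single spaces, then splits before each numbered marker. */
    public static List<String> regroup(String extracted) {
        String joined = extracted.trim().replaceAll("\\s+", " ");
        List<String> paragraphs = new ArrayList<>();
        for (String part : ITEM_START.split(joined)) {
            if (!part.isEmpty()) {
                paragraphs.add(part);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Shortened sample in the shape of the stripper output quoted in this issue.
        String extracted = "Daily Report 1) which language is your text in? - English\n"
                + "2) some examples of sentences containing\n"
                + "addresses you'd want to pick up";
        regroup(extracted).forEach(System.out::println);
    }
}
```

The text returned by pdfTextStripper.getText(document) could be passed to regroup in place of the sample string.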
> uncorrent paragraph split
> -------------------------
>
> Key: PDFBOX-5613
> URL: https://issues.apache.org/jira/browse/PDFBOX-5613
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Text extraction
> Affects Versions: 2.0.1
> Reporter: Key Hutu
> Priority: Major
> Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, the paragraphs are split incorrectly.
> <code>
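> // PDFParagraphTextStripper.java
> import java.io.IOException;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PDFParagraphTextStripper extends PDFTextStripper {
>     public PDFParagraphTextStripper() throws IOException {
>         // Join the lines of a paragraph with a space and end each paragraph with a newline.
>         this.setLineSeparator(" ");
>         this.setParagraphStart("");
>         this.setParagraphEnd(this.LINE_SEPARATOR);
>         this.setPageStart("");
>         this.setPageEnd("");
>         this.setArticleStart(this.LINE_SEPARATOR);
>         this.setArticleEnd(this.LINE_SEPARATOR);
>     }
> }
>
> // PdfParser.java
> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PdfParser {
>     private static final String dataPath =
>             "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>
>     public static void main(String[] args) {
>         String fileName = "Daily Report.pdf";
>         try {
>             // dataPath has no trailing separator, so add one before the file name.
>             extract_pdfbox(dataPath + File.separator + fileName);
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static void extract_pdfbox(String filePath) throws Exception {
>         File file = new File(filePath);
>         // try-with-resources closes the document even if extraction throws.
>         try (PDDocument document = PDDocument.load(file)) {
>             PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>             String text = pdfTextStripper.getText(document);
>             System.out.println(text);
>         }
>     }
> }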
> </code>
> <output>
> Daily Report 1) which language is your text in? - English
> 2) some examples of sentences containing
> addresses you'd want to pick up - Data are
> contarct documents, it contains addresses in
> different formates(of different
> countries),some are comma saperated, some
> are new line saperated etc 3) perhaps
> examples of mistakes - currently en model
> of SpaCy is even not able to tag entities
> clearly 4) Are you training your own model
> or are you using a model as is? - tried as it is
> but very poor in results to need to know a
> generic approach to train own model. any
> </output>