[jira] [Commented] (PDFBOX-3110) Extract by beads doesn't work

Tilman Hausherr (JIRA) Mon, 16 Nov 2015 08:44:13 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006898#comment-15006898 ]


Tilman Hausherr commented on PDFBOX-3110:
-----------------------------------------

Thanks. The "some beads are a little smaller than the text" case doesn't seem to 
cause trouble, because the (0,0) glyph coordinate is still inside the bead. Cross-page 
extraction isn't supported by PDFBox yet because it works page by page; that 
is something for the distant future. But your test case shows nicely that the patch 
improves things.
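
To illustrate why a slightly undersized bead can still capture its text, here is a minimal sketch. It assumes that a glyph is assigned to a bead when its (0,0) origin point falls inside the bead rectangle; that is my reading of the remark above, not PDFBox's actual code. The class name GlyphInBead, its methods, and all numbers are hypothetical.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class GlyphInBead {
    // True if the glyph's origin point lies inside the bead rectangle.
    static boolean originInside(Rectangle2D.Float bead, float originX, float originY) {
        return bead.contains(originX, originY);
    }

    // Stricter test: true only if the whole glyph box lies inside the bead.
    static boolean boxInside(Rectangle2D.Float bead, Rectangle2D.Float glyphBox) {
        return bead.contains(glyphBox);
    }

    public static void main(String[] args) {
        // Bead in top-left based coordinates: x=50, y=150, width=200, height=100,
        // so its bottom edge is at y=250.
        Rectangle2D.Float bead = new Rectangle2D.Float(50f, 150f, 200f, 100f);
        // Glyph with its origin at (60, 245); its box reaches down to y=252,
        // i.e. it sticks out 2 units below the bead.
        Rectangle2D.Float glyphBox = new Rectangle2D.Float(60f, 233f, 8f, 19f);

        System.out.println(originInside(bead, 60f, 245f)); // origin test
        System.out.println(boxInside(bead, glyphBox));     // full-box test
    }
}
```

With these made-up numbers the origin test accepts the glyph while the full-box test rejects it, which is the situation where a bead is "a little smaller than the text".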

> Extract by beads doesn't work
> -----------------------------
>
>                 Key: PDFBOX-3110
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3110
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>              Labels: beads
>         Attachments: 003422-1-bad.txt, 003422-1-good.txt, 003422-1.pdf, 
> 003422-marked-1.png, PDFBOX-3110-poems-beads-bad.txt, 
> PDFBOX-3110-poems-beads-good.txt, poems-marked-1.png, poems-marked-2.png, 
> poems.pdf
>
>
> Text extraction by beads has never worked, or (more likely) was broken 
> years ago, when/if the code was changed so that text positions are in image 
> coordinates (y=0 is top) and not in PDF coordinates (y=0 is bottom).
> todos:
> - adjust bead rectangles (done locally)
> - adjust for cropbox (done locally)
> - separate output from different beads with a newline (will open a different 
> issue if I don't find a solution)
> - optimize
> - find a non-copyrighted test file
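
The first two todos (adjusting bead rectangles and accounting for the crop box) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch attached to the issue: the bead rectangle is taken to be given by its lower-left corner in PDF coordinates (y=0 at the bottom), the target is top-left based coordinates measured from the crop box, and page rotation is ignored. The class BeadFlip and the method toTopLeft are hypothetical names.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class BeadFlip {
    /**
     * Converts a rectangle from PDF coordinates (y grows upwards, rectangle
     * described by its lower-left corner) to coordinates whose origin is the
     * top-left corner of the crop box (y grows downwards).
     */
    static Rectangle2D.Float toTopLeft(Rectangle2D.Float bead, Rectangle2D.Float cropBox) {
        // Shift x so that it is measured from the crop box's left edge.
        float x = bead.x - cropBox.x;
        // Upper edge of the bead and of the crop box in PDF coordinates.
        float beadTop = bead.y + bead.height;
        float cropTop = cropBox.y + cropBox.height;
        // Distance from the crop box's top edge down to the bead's top edge.
        float y = cropTop - beadTop;
        return new Rectangle2D.Float(x, y, bead.width, bead.height);
    }

    public static void main(String[] args) {
        // Made-up values: crop box starting at (50, 50), 500 wide and 700 high.
        Rectangle2D.Float cropBox = new Rectangle2D.Float(50f, 50f, 500f, 700f);
        // Bead with lower-left corner (100, 500), 200 wide and 100 high.
        Rectangle2D.Float bead = new Rectangle2D.Float(100f, 500f, 200f, 100f);

        Rectangle2D.Float flipped = toTopLeft(bead, cropBox);
        System.out.println(flipped.x + " " + flipped.y); // prints "50.0 150.0"
    }
}
```

Width and height are unchanged by the conversion; only the reference corner moves from the bead's lower-left (PDF convention) to its upper-left (image convention).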



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

