[ 
https://issues.apache.org/jira/browse/PDFBOX-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009196#comment-15009196
 ] 

Tilman Hausherr commented on PDFBOX-3110:
-----------------------------------------

The missing newline problem can be seen in many places, but the most obvious is 
where the Goethe text ends and the Molière text starts:
{quote}
In seinen Armen das Kind war tot.ARGAN, seul dans sa chambre, assis, une table 
devant lui, compte des parties d'apothicaire avec 
{quote}
The solution would be to set articleEnd to LINE_SEPARATOR, but 1) this would 
result in two newlines at the end; 2) the current code specifically does this 
only when "addMoreFormatting" is set, which suggests that it wasn't wanted as 
the default at the time it was implemented. So I'll just do nothing for now, 
unless others see it differently.
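
To illustrate the double-newline concern in point 1, here is a minimal sketch 
of joining the text of consecutive beads so that exactly one separator ends up 
between them, even when a bead's text already ends with one. The class 
BeadTextJoiner and its join method are hypothetical names used only for this 
illustration; they are not part of PDFBox and this is not how PDFTextStripper 
is implemented.

```java
import java.util.List;

// Hypothetical helper for illustration only; not a PDFBox class.
final class BeadTextJoiner {
    private BeadTextJoiner() {}

    /**
     * Concatenates the bead texts in order. A separator is appended before a
     * bead only if the output so far does not already end with that separator,
     * so adjacent beads are never separated by two newlines.
     */
    static String join(List<String> beadTexts, String lineSeparator) {
        StringBuilder out = new StringBuilder();
        for (String text : beadTexts) {
            if (text == null || text.isEmpty()) {
                continue; // skip empty beads so they add no blank lines
            }
            if (out.length() > 0 && !endsWith(out, lineSeparator)) {
                out.append(lineSeparator);
            }
            out.append(text);
        }
        return out.toString();
    }

    // True if the builder's current content ends with the given suffix.
    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }
}
```

With the two beads from the quote above, the Goethe line and the Molière line 
would come out on separate lines, whether or not the first bead already ends 
with a newline.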

[[email protected]] please check whether you like the output in TIKA with 
Maruan's poem test file when using the latest PDFBox trunk. Search your output 
for ARGAN; it should be separated from the previous text.

> Extract by beads doesn't work
> -----------------------------
>
>                 Key: PDFBOX-3110
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3110
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>              Labels: beads
>         Attachments: 003422-1-bad.txt, 003422-1-good.txt, 003422-1.pdf, 
> 003422-marked-1.png, PDFBOX-3110-poems-beads-bad.txt, 
> PDFBOX-3110-poems-beads-good.txt, poems-marked-1.png, poems-marked-2.png, 
> poems.pdf
>
>
> Text extraction by beads has never worked, or (more likely) was broken 
> years ago, when/if the code was changed so that text positions are in image 
> coordinates (y=0 is top) and not in PDF coordinates (y=0 is bottom).
> todos:
> - adjust bead rectangles (done)
> - adjust for cropbox (done)
> - separate output from different beads with a newline (will open a different 
> issue if I don't find a solution)
> - optimize (done)
> - implement in 1.8.11
> - find a non-copyrighted test file (done)
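
The coordinate mismatch described in the issue text (bead rectangles in PDF 
coordinates, text positions in image coordinates) and the crop box adjustment 
can be sketched as follows. This is a generic illustration under stated 
assumptions, not the actual PDFBox change: the class BeadRectFlip and its 
method are hypothetical, rectangles are given as lower-left/upper-right 
corners, and the crop box is assumed to define the visible page area.

```java
// Hypothetical helper for illustration only; not a PDFBox class.
final class BeadRectFlip {
    private BeadRectFlip() {}

    /**
     * Converts a rectangle from PDF coordinates (origin at bottom-left, y up)
     * to image-style coordinates (origin at the crop box's top-left, y down).
     *
     * @return {x, yTop, width, height} relative to the crop box's top-left corner
     */
    static float[] toTopLeftOrigin(float llx, float lly, float urx, float ury,
                                   float cropLlx, float cropUry) {
        float x = llx - cropLlx;      // shift by the crop box's left edge
        float yTop = cropUry - ury;   // distance from the crop box's top edge
        float width = urx - llx;      // size is unchanged by the flip
        float height = ury - lly;
        return new float[] { x, yTop, width, height };
    }
}
```

For example, on a 612 x 792 page whose crop box starts at the origin, a bead 
with corners (72, 600) and (300, 720) maps to x = 72, yTop = 72, with a width 
of 228 and a height of 120.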



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
