[
https://issues.apache.org/jira/browse/PDFBOX-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235801#comment-14235801
]
Tilman Hausherr commented on PDFBOX-2545:
-----------------------------------------
1) The date "06-04-12 11:02" is also available in Adobe Reader, so this is not
an error. If you press CTRL-A you can see where it is: there's a grey rectangle.
2) The file name is within this content stream:
{code}
W n
BT
/CS2 cs 1 scn
/GS1 gs
/TT0 1 Tf
4.4875 0 0 4.4875 -178.0243 187.425 Tm
(VSN_Briefpapier_ontwerp_V03.indd 1)Tj
ET
{code}
Changing "-178.0243" to 0 (this is the horizontal translation in the text
matrix set by "Tm") makes the file name visible. (To see it, also remove "W",
which sets the clipping path.)
In other words: the file name is positioned outside of the page, and PDFBox
isn't taking that into account.
However, is this an error? Or should PDFTextStripperByArea be used instead?
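The off-page placement can be checked with a little arithmetic. The sketch below is not PDFBox code: it uses only java.awt.geom, and the class TextOriginCheck and its methods are hypothetical names made up for illustration. It assumes the current transformation matrix is the identity and that the page box is A4 with its lower-left corner at (0, 0); neither is confirmed by the snippet above. It only tests the text origin; the full extent of the string depends on glyph widths that are not known here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration only; not part of PDFBox.
class TextOriginCheck {

    // Maps the text-space origin (0, 0) through a text matrix "a b c d e f Tm".
    static Point2D textOrigin(double a, double b, double c, double d, double e, double f) {
        // AffineTransform takes its six entries in the same order as the Tm operands.
        AffineTransform tm = new AffineTransform(a, b, c, d, e, f);
        return tm.transform(new Point2D.Double(0, 0), null);
    }

    // True if the origin lies within the given page box.
    static boolean originInsidePage(Point2D origin, Rectangle2D pageBox) {
        return pageBox.contains(origin);
    }

    public static void main(String[] args) {
        // Assumption: A4 page in points, lower-left corner at (0, 0), identity CTM.
        Rectangle2D page = new Rectangle2D.Double(0, 0, 595, 842);

        // Values from the content stream, then with the translation changed to 0.
        Point2D original = textOrigin(4.4875, 0, 0, 4.4875, -178.0243, 187.425);
        Point2D modified = textOrigin(4.4875, 0, 0, 4.4875, 0, 187.425);

        System.out.println(original + " inside page: " + originInsidePage(original, page));
        System.out.println(modified + " inside page: " + originInsidePage(modified, page));
    }
}
```

Under these assumptions the original origin has a negative x coordinate and so falls to the left of the page, while the modified one starts at the left page edge.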
> ExtractText extracts filename and date
> --------------------------------------
>
> Key: PDFBOX-2545
> URL: https://issues.apache.org/jira/browse/PDFBOX-2545
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Stefan Postema
> Attachments: 07-ALS-Onvoldoende-eten.pdf
>
>
> When using PDFBox 1.8 (and also a snapshot of 2.0.0), the ExtractText command
> produces text which also contains the original Adobe InDesign file name (and
> also the date and the images used).
> Command line example:
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText 07-ALS-Onvoldoende-eten.pdf test.txt
> The first lines of this test.txt file are:
> VSN_Briefpapier_ontwerp_V03.indd 1 06-04-12 11:02
> Wat kan ik doen als het niet lukt om voldoende te eten? ALS en voeding
> Drinkvoeding
> The output should not contain the file name and date.
> When copying and pasting the text from Adobe Reader, the InDesign file name
> did not show up. The CLI tool 'pdftotext' also did not output the line with
> the file name.