On 10.10.2017 at 21:24, Allison, Timothy B. wrote:
If we're talking about the same file...same number of pages, attachments and 
common words.

However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...

The TOP_10_MORE_IN_B column in the contents report shows that there are 15 more 
0s, 15 more 1s, 11 more 2s, etc.

0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2
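
For readers unfamiliar with the report, here is a minimal sketch of how a "more in B" column like the one above could be derived from two text extractions. It is an illustration only: the tokenization rule and the function names (token_counts, top_more_in_b, format_column) are assumptions, not the actual code that produced the report.

```python
import re
from collections import Counter


def token_counts(text):
    # Hypothetical tokenizer: lowercase runs of word characters.
    # The real report may tokenize differently.
    return Counter(re.findall(r"\w+", text.lower()))


def top_more_in_b(text_a, text_b, n=10):
    # Counter subtraction keeps only tokens whose count is higher in B,
    # with the surplus as the value.
    surplus = token_counts(text_b) - token_counts(text_a)
    return surplus.most_common(n)


def format_column(pairs):
    # Render in the "token: count | token: count" style shown above.
    return " | ".join(f"{token}: {count}" for token, count in pairs)


if __name__ == "__main__":
    old_text = "name date 0"
    new_text = "name date 0 0 0 0 1 1 2"
    print(format_column(top_more_in_b(old_text, new_text)))
```

Tokens that occur equally often in both extractions, or more often in A, drop out of the result, so the column only shows what the newer version added.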

Yeah, but where do they come from? Not from the pure text extraction. In the json files, I see that there are many "0:", "1:" in the new file. I wonder if this is about AcroForm fields? It can be seen e.g. near b12c96nfdate36.

Tilman




-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Tuesday, October 10, 2017 11:47 AM
To: [email protected]
Subject: Re: 2.0.8?

On 09.10.2017 at 22:26, Allison, Timothy B. wrote:
Thank you, Andreas, for fixing the slow parse on the corrupt file so quickly!

Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz

Tim, can you please find out what we lost with 254348.pdf? It's not in the text 
extraction, so I assume it's some metadata, but I don't see where.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected] For additional 
commands, e-mail: [email protected]

