[ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080509#comment-15080509
 ] 

Tilman Hausherr edited comment on PDFBOX-3058 at 1/3/16 10:22 PM:
------------------------------------------------------------------

I haven't been THAT active in the last 10 days, but here's the result, see also 
the attachment:
- some files that could be extracted in 1.8 but not with Adobe Reader, and not 
with 2.0
- some files that had a Flate decode exception (I suspect you and PDFBox abort 
when one page doesn't work)
- files with rotated glyphs. We can't extract these anymore because of a 
refactoring some time ago, after which each glyph is handled individually. I'm 
not sure what to do with these, i.e. should a user expect that these can be 
extracted?
- truncated files
- files with beads - if the beads are incorrect, or if there is text outside of 
beads, extraction will be weird (e.g. file 032431.pdf, try running the 
DrawPrintTextLocations tool to see what I mean). This is different from 1.8.10, 
because up to that version text extraction by beads didn't work at all
- files where type 3 fonts have an incorrect BBox / Matrix; even Adobe Reader 
has trouble marking these
- I was able to fix one bug (PDFBOX-3123, not a regression)
Despite this, total improvement is 0.8%.

What I haven't done yet:
- check metadata differences; there's something I haven't understood. 
L33MUTT2SVCWGCS6UIYL5TH3PNPXHIS6 is said to have a different count of metadata, 
but I can't find the 1.8 output
- check new exceptions



> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, 
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx, 
> textLostFromACausedByNewExceptionsInB.zip
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.


