RE: PDFBox 2.0.3 TIKA comparison

Allison, Timothy B. Thu, 15 Sep 2016 09:07:12 -0700

Great.  Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Thursday, September 15, 2016 12:03 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3 TIKA comparison


Am 14.09.2016 um 20:50 schrieb Tilman Hausherr:
>
>> Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.:
>>>
>>>
>>> There are some regressions in content extraction, but overall, 
>>> content extraction looks to have improved quite a bit.  Looks like
>>> ~2 million more "common English words" via Tilman's methodology. 
>
> After some wandering around I finally looked at content extraction 
> only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
> It turned out that all files were from Delaware courts, so I've 
> decided to look only at one single file, 
> Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
> The extracted text with 2.0.2 and 2.0.3 is
>
> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
>
> in 2.0.1 and 1.8 it is
>
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>
> For 1.8 the explanation is that text extraction takes words, while in
> 2.* each character is taken alone.
>
> The bad result in 2.0.3 is because of an incorrect /W array. The space 
> has a width of 3, while other characters have widths between 200 and 
> 722. So PDFBox believes that there are spaces where there are none.

The story is different. The space width (which is 250, not 3; the table is a 
ranges array) is NOT taken from the space glyph, but from an average over all 
glyphs. It's a good thing I looked back through the history. The breaking change 
was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the 
average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a 
bug likely introduced accidentally in some refactoring), and text extraction 
replaced that 0 with a default value (1000).

Starting with rev 1744613 an average width was actually calculated, but because 
almost all entries in the /W ranges array (which runs up to CID 65534) are 0, 
the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]
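
For anyone not used to this format: every entry in the array above has the form
"first last width", so "2 3 250" gives CIDs 2 through 3 a width of 250. A
minimal Python sketch of that expansion follows. expand_w_ranges is a
hypothetical helper written for illustration, not PDFBox code, and it assumes
the array uses only the "first last width" form (the PDF format also allows
"first [w1 w2 ...]", which does not occur here).

```python
def expand_w_ranges(w):
    """Expand a flat [first, last, width, ...] list into a {cid: width} dict.

    Hypothetical helper; assumes only the 'first last width' form is used.
    """
    if len(w) % 3 != 0:
        raise ValueError("expected a sequence of (first, last, width) triples")
    widths = {}
    for i in range(0, len(w), 3):
        first, last, width = w[i], w[i + 1], w[i + 2]
        for cid in range(first, last + 1):
            widths[cid] = width
    return widths


# First four triples of the array quoted above.
w_excerpt = [1, 1, 0, 2, 3, 250, 4, 10, 0, 11, 12, 333]
widths = expand_w_ranges(w_excerpt)
print(widths[2], widths[3])  # both CIDs come from the triple "2 3 250"
```

Read this way, the "3" in "2 3 250" is the end of a CID range, not a width.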

Solution: ignore widths that are <= 0. PDFont already ignores 0 values, but 
PDCIDFont does not.

Average width before the fix: 0.52861196. After the fix: 549.8571.
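
Both figures can be reproduced from the /W array quoted above with a few lines
of Python. This is only a sketch, under the assumption that the average runs
over every CID covered by a range; average_width and its skip_nonpositive flag
are hypothetical names for illustration, not the PDFBox implementation.

```python
# The /W array from the message, as flat (first, last, width) triples.
W = [
    1, 1, 0, 2, 3, 250, 4, 10, 0, 11, 12, 333, 13, 14, 0, 15, 15, 250,
    16, 16, 333, 17, 17, 250, 18, 18, 277, 19, 19, 0, 20, 23, 500,
    24, 35, 0, 36, 36, 722, 37, 37, 666, 38, 39, 722, 40, 40, 666,
    41, 41, 610, 42, 43, 777, 44, 44, 389, 45, 45, 0, 46, 46, 777,
    47, 47, 666, 48, 48, 943, 49, 49, 722, 50, 50, 777, 51, 51, 610,
    52, 52, 0, 53, 53, 722, 54, 54, 556, 55, 55, 666, 56, 57, 722,
    59, 59, 0, 60, 60, 722, 61, 67, 0, 68, 68, 500, 69, 69, 556,
    70, 70, 443, 71, 71, 556, 72, 72, 443, 73, 73, 333, 74, 74, 500,
    75, 75, 556, 76, 76, 277, 77, 77, 0, 78, 78, 556, 79, 79, 277,
    80, 80, 833, 81, 81, 556, 82, 82, 500, 83, 84, 556, 85, 85, 443,
    86, 86, 389, 87, 87, 333, 88, 88, 556, 89, 89, 0, 90, 90, 722,
    91, 92, 500, 93, 178, 0, 179, 180, 500, 181, 181, 0, 182, 182, 333,
    183, 751, 0, 752, 752, 198, 753, 794, 0, 795, 795, 612,
    796, 1126, 0, 1127, 1127, 125, 1129, 1129, 2000, 1130, 65534, 0,
]


def average_width(w, skip_nonpositive):
    """Average width over all CIDs covered by the (first, last, width) triples.

    With skip_nonpositive=True, ranges whose width is <= 0 are left out of
    both the sum and the count (the proposed fix).
    """
    total = 0
    count = 0
    for i in range(0, len(w), 3):
        first, last, width = w[i], w[i + 1], w[i + 2]
        if skip_nonpositive and width <= 0:
            continue
        n = last - first + 1  # number of CIDs in this range
        total += n * width
        count += n
    return total / count if count else 0.0


before = average_width(W, skip_nonpositive=False)
after = average_width(W, skip_nonpositive=True)
print(before, after)
```

Under that assumption only 63 of the 65532 covered CIDs have a non-zero width,
which is why the unfiltered average collapses to roughly 0.53.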

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, 
but in 2.0.4.

Tilman

