> On 15 Sep 2016, at 09:02, Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 14.09.2016 at 20:50, Tilman Hausherr wrote:
>>
>>> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>>>
>>>>
>>>> There are some regressions in content extraction, but overall, content
>>>> extraction looks to have improved quite a bit. Looks like ~2 million more
>>>> "common English words" via Tilman's methodology.
>>
>> After some wandering around I finally looked at content extraction only, at
>> column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
>> It turned out that all files were from Delaware courts, so I've decided to
>> look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
>> The extracted text with 2.0.2 and 2.0.3 is
>>
>> IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE
>>
>> in 2.0.1 and 1.8 it is
>>
>> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>>
>> For 1.8 the explanation is that text extraction takes words, while in 2.*
>> each character is taken alone.
>>
>> The bad result in 2.0.3 is because of an incorrect /W array. The space has a
>> width of 3, while other characters have widths between 200 and 722. So
>> PDFBox believes that there are spaces where there are none.
>
> The story is different, the space width (which is 250, not 3 - the table is a
> ranges array) is NOT taken from the space glyph, but from an average of all
> glyphs.

Ok, good. I was just about to investigate that remark in your previous email
because the Widths array overrides any embedded font widths, so strictly
speaking can’t contain a “bad” width, as whatever it contains is defined to be
the width. We even stretch glyphs to fit that width (as Acrobat does).

> It's a good thing I looked past in history. The breaking change was in rev
> 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph
> width. Before rev 1744613 the averageWidth was always 0 (due to a bug likely
> accidentally introduced in some refactoring), which was corrected to a
> default value (1000) in text extraction.

I’m not convinced that we should be using average widths at all. In the absence
of justification, typographic tradition defines a space as being between 0.2
and 0.3 em (where 1em = the font size in pt). 250 would be a sensible default,
unless the font contains a space character (with an empty path, so we know it
is really a space). Perhaps this could go on the wish list for “new text
extraction”.

— John

> Starting with rev 1744613 an average width was calculated, but due to many 0
> values (over 65534) in the /W ranges array, the result was unreliable:
>
> /W [1 1 0 2 3 250 4 10 0 11
> 12 333 13 14 0 15 15 250 16 16
> 333 17 17 250 18 18 277 19 19 0
> 20 23 500 24 35 0 36 36 722 37
> 37 666 38 39 722 40 40 666 41 41
> 610 42 43 777 44 44 389 45 45 0
> 46 46 777 47 47 666 48 48 943 49
> 49 722 50 50 777 51 51 610 52 52
> 0 53 53 722 54 54 556 55 55 666
> 56 57 722 59 59 0 60 60 722 61
> 67 0 68 68 500 69 69 556 70 70
> 443 71 71 556 72 72 443 73 73 333
> 74 74 500 75 75 556 76 76 277 77
> 77 0 78 78 556 79 79 277 80 80
> 833 81 81 556 82 82 500 83 84 556
> 85 85 443 86 86 389 87 87 333 88
> 88 556 89 89 0 90 90 722 91 92
> 500 93 178 0 179 180 500 181 181 0
> 182 182 333 183 751 0 752 752 198 753
> 794 0 795 795 612 796 1126 0 1127 1127
> 125 1129 1129 2000 1130 65534 0]
>
> Solution: ignore widths that are <=0.
> 0 values are already ignored in PDFont, but not in PDCIDFont.
>
> Average width before the fix: 0.52861196. After the fix: 549.8571.
>
> I'll open an issue and commit a fix after sending this. It won't be in 2.0.3,
> but in 2.0.4.
>
> Tilman
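
For anyone who wants to check the two averages quoted above, here is a minimal sketch that recomputes them from the /W array. It is not PDFBox code: the class and method names (WidthAverageSketch, sumAndCount, average) are hypothetical. It assumes the array is purely in range form (first CID, last CID, width), as Tilman says, and that every CID in a range counts once towards the average.

```java
/** Hypothetical sketch, not PDFBox code: averages the widths of a /W ranges array. */
class WidthAverageSketch {
    // The /W array from the mail, read as (firstCID, lastCID, width) triplets.
    static final int[] W = {
        1, 1, 0,  2, 3, 250,  4, 10, 0,  11, 12, 333,  13, 14, 0,  15, 15, 250,
        16, 16, 333,  17, 17, 250,  18, 18, 277,  19, 19, 0,  20, 23, 500,  24, 35, 0,
        36, 36, 722,  37, 37, 666,  38, 39, 722,  40, 40, 666,  41, 41, 610,  42, 43, 777,
        44, 44, 389,  45, 45, 0,  46, 46, 777,  47, 47, 666,  48, 48, 943,  49, 49, 722,
        50, 50, 777,  51, 51, 610,  52, 52, 0,  53, 53, 722,  54, 54, 556,  55, 55, 666,
        56, 57, 722,  59, 59, 0,  60, 60, 722,  61, 67, 0,  68, 68, 500,  69, 69, 556,
        70, 70, 443,  71, 71, 556,  72, 72, 443,  73, 73, 333,  74, 74, 500,  75, 75, 556,
        76, 76, 277,  77, 77, 0,  78, 78, 556,  79, 79, 277,  80, 80, 833,  81, 81, 556,
        82, 82, 500,  83, 84, 556,  85, 85, 443,  86, 86, 389,  87, 87, 333,  88, 88, 556,
        89, 89, 0,  90, 90, 722,  91, 92, 500,  93, 178, 0,  179, 180, 500,  181, 181, 0,
        182, 182, 333,  183, 751, 0,  752, 752, 198,  753, 794, 0,  795, 795, 612,
        796, 1126, 0,  1127, 1127, 125,  1129, 1129, 2000,  1130, 65534, 0
    };

    /** Returns {sum of widths, number of CIDs counted}. */
    static long[] sumAndCount(int[] w, boolean skipNonPositive) {
        long sum = 0;
        long count = 0;
        for (int i = 0; i + 2 < w.length; i += 3) {
            int first = w[i];
            int last = w[i + 1];
            int width = w[i + 2];
            if (skipNonPositive && width <= 0) {
                continue; // the proposed fix: widths <= 0 do not take part
            }
            long cids = (long) last - first + 1; // each CID in the range counts once
            sum += cids * width;
            count += cids;
        }
        return new long[] { sum, count };
    }

    /** Average width, or 0 if nothing was counted. */
    static float average(int[] w, boolean skipNonPositive) {
        long[] sc = sumAndCount(w, skipNonPositive);
        return sc[1] == 0 ? 0f : (float) sc[0] / sc[1];
    }

    public static void main(String[] args) {
        System.out.println("all entries:     " + average(W, false)); // roughly 0.5286
        System.out.println("widths > 0 only: " + average(W, true));  // roughly 549.857
    }
}
```

Under those assumptions the widths sum to 34641 in both cases. Only the divisor changes: 65532 CIDs when the zero entries are counted, 63 when they are skipped, which gives the two figures above.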