>But this doesn't matter, the overall extraction of rotated pages would still look bad.
>>...it appears that there were word segmentation problems in both A and B

Y. I agree. Thank you for looking more carefully than I did!

On Sat, Apr 6, 2019 at 11:19 AM Tilman Hausherr <[email protected]> wrote:

> I looked at about 10 files... all are rotated. I suspect this is a
> result of PDFBOX-4480, that previously some rotated words came as one.
> But this doesn't matter, the overall extraction of rotated pages would
> still look bad.
>
> For example, the file you mention extracted this in 2.0.14:
>
> ...
> R
> E
> R
> M
> H
> IV
> -1
> infection
> hum
> an(B
> 8)
> [G
> oulder97c]
> ...
>
> So it had "infection" but the rest was still worthless. The same file
> extracts nicely with the "rotationMagic" option of ExtractText.
>
> Tilman
>
> On 06.04.2019 at 15:50, Tim Allison wrote:
> > http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz
> >
> > This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
> > there were no content differences btwn 2.0.13 and 2.0.14. I did not
> > apply angle detection.
> >
> > No new exceptions; 2 fixed exceptions. We're getting higher page
> > counts in a few documents, because we overrode processPages() to
> > process. Some changes in content, but overall, better, I think, based
> > on contents/common_token_comparisons_by_mime.xlsx.
> >
> > To see where content appears to degrade, open
> > contents/content_diffs_(no|with)_exceptions, and sort column M
> > ('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order. Also, look at
> > columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
> > (TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
> > frequent tokens that are unique to A or unique to B; from this, it
> > looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
> > but, generally (hand waving), it appears that there were word
> > segmentation problems in both A and B as I look at the results.
> >
> > Cheers,
> >
> > Tim
> >
> > On Fri, Apr 5, 2019 at 10:53 AM Tim Allison <[email protected]> wrote:
> >> +1 I should have regression results by tomorrow
> >>
> >> On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <[email protected]> wrote:
> >>> +1
> >>>
> >>>> On 05.04.2019 at 06:31, Andreas Lehmkuehler <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> looks like it's time for the next release. How about cutting 2.0.15 next Monday?
> >>>>
> >>>> WDYT?
> >>>>
> >>>> Andreas
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
