>But this doesn't matter, the overall extraction of rotated pages would
still look bad.

>>...it appears that there were word segmentation problems in both A and B

Yes, I agree.  Thank you for looking more carefully than I did!


On Sat, Apr 6, 2019 at 11:19 AM Tilman Hausherr <[email protected]>
wrote:

> I looked at about 10 files; all of them are rotated. I suspect this is
> a result of PDFBOX-4480: previously, some rotated words came out
> whole.
> But this doesn't matter, the overall extraction of rotated pages would
> still look bad.
>
> For example, the file you mention extracted this in 2.0.14:
>
> ...
> R
> E
> R
> M
> H
> IV
> -1
> infection
> hum
> an(B
> 8)
> [G
> oulder97c]
> ...
>
> So it had "infection" but the rest was still worthless. The same file
> extracts nicely with the "rotationMagic" option of ExtractText.
>
> Tilman
>
> On 06.04.2019 at 15:50, Tim Allison wrote:
> > http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz
> >
> > This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
> > there were no content differences between 2.0.13 and 2.0.14.  I did not
> > apply angle detection.
> >
> > No new exceptions; 2 fixed exceptions.  We're getting higher page
> > counts in a few documents, because we overrode processPages() to
> > process.  Some changes in content, but overall, better, I think, based
> > on contents/common_token_comparisons_by_mime.xlsx.
> >
> > To see where content appears to degrade, open
> > contents/content_diffs_(no|with)_exceptions, and sort column M
> > ('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order.  Also, look at
> > columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
> > (TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
> > frequent tokens that are unique to A or unique to B; from this, it
> > looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
> > but, generally (hand waving), it appears that there were word
> > segmentation problems in both A and B as I look at the results.
> >
> > Cheers,
> >
> >               Tim
> >
> > On Fri, Apr 5, 2019 at 10:53 AM Tim Allison <[email protected]> wrote:
> >> +1 I should have regression results by tomorrow
> >>
> >> On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <[email protected]> wrote:
> >>> +1
> >>>
> >>>> On 05.04.2019 at 06:31, Andreas Lehmkuehler <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Looks like it's time for the next release. How about cutting 2.0.15 next Monday?
> >>>>
> >>>> WDYT?
> >>>>
> >>>> Andreas
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
