Sorry. Never made it back to my keyboard on Friday. I just started the
comparison code. Should have reports in a few hours.

On Mon, Jun 24, 2019 at 12:36 AM Andreas Lehmkuehler <[email protected]> wrote:

> @Tim, just a friendly reminder, are there already any results available?
>
> Thanks
> Andreas
>
> On 21.06.19 at 17:27, Tim Allison wrote:
> > Sorry. I was afk. I’ll kick this off shortly.
> >
> > On Wed, Jun 19, 2019 at 2:54 AM Tilman Hausherr <[email protected]> wrote:
> >
> >> Hi Tim,
> >>
> >> Please do another one.
> >>
> >> Thanks
> >> Tilman
> >>
> >> On 15.06.2019 at 02:13, Tim Allison wrote:
> >>> http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz
> >>>
> >>> Sharing before reviewing...sorry...
> >>>
> >>> On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
> >>>> Y. Will rerun today.
> >>>>
> >>>> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
> >>>>> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
> >>>>> resulted in a large number of files without extraction
> >>>>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B).
> >>>>>
> >>>>> Tilman
> >>>>>
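A minimal sketch of the check described above: flag files where the B run kept far fewer common tokens than the A run. The column names NUM_COMMON_TOKENS_A and NUM_COMMON_TOKENS_B come from the message; the CSV layout, the FILE_PATH column, the function name, and both thresholds are assumptions made for illustration, not the actual tika-eval report format.

```python
import csv
import io

def find_extraction_drops(rows, min_ratio=0.5, min_tokens_a=10):
    """Return (path, a, b) for rows where B kept fewer than min_ratio of A's common tokens.

    min_ratio and min_tokens_a are hypothetical thresholds, not tika-eval defaults.
    """
    flagged = []
    for row in rows:
        # Treat empty cells as zero tokens.
        a = int(row["NUM_COMMON_TOKENS_A"] or 0)
        b = int(row["NUM_COMMON_TOKENS_B"] or 0)
        # Skip files that had almost no text in A to begin with.
        if a >= min_tokens_a and b < a * min_ratio:
            flagged.append((row["FILE_PATH"], a, b))
    # Largest absolute loss of common tokens first.
    flagged.sort(key=lambda t: t[1] - t[2], reverse=True)
    return flagged

# Made-up sample data standing in for an exported comparison table.
SAMPLE = """FILE_PATH,NUM_COMMON_TOKENS_A,NUM_COMMON_TOKENS_B
a.pdf,1200,0
b.pdf,300,295
c.pdf,5,0
d.pdf,800,100
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for path, a, b in find_extraction_drops(rows):
    print(path, a, b)
```

Sorting by absolute loss puts the files most likely to reflect a real extraction regression at the top of the list.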
> >>>>> On 13.06.2019 at 14:36, Tim Allison wrote:
> >>>>>> All,
> >>>>>>
> >>>>>>      On a dev branch, I replaced Optimaize with a dev version of
> >>>>>> OpenNLP's language detector, and I updated the common tokens list to
> >>>>>> cover the 120 langs covered by a dev version of OpenNLP's language
> >>>>>> model.  I changed the min token length for common words to 3 (from 4),
> >>>>>> and I'm now using 30k common tokens per lang rather than 20k.
> >>>>>>
> >>>>>>      I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> >>>>>> 2.0.16-SNAPSHOT, and the results are here:
> >>>>>>
> >>>>>> http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >>>>>>
> >>>>>>      Are there any critical problems with the updates in the contents
> >>>>>> comparison files?  Any improvements?
> >>>>>>
> >>>>>>      I notice that 'cmn' is the most common category for 'not much actual
> >>>>>> text'...we may want to require a higher confidence in language
> >>>>>> detection before reporting a detected language...
> >>>>>>
> >>>>>>      Any and all recommendations are welcomed!  Thank you!
> >>>>>>
> >>>>>>               Cheers,
> >>>>>>
> >>>>>>                           Tim
> >>>>>>
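One way to picture the suggestion above about requiring higher confidence before reporting a detected language is the sketch below. It is illustrative only: the function name, the "unknown" label, and both thresholds are assumptions, and it does not use or imitate the OpenNLP or tika-eval APIs.

```python
def reported_language(lang, confidence, num_tokens,
                      min_confidence=0.9, min_tokens=20):
    """Return lang only when the detection looks trustworthy, else "unknown".

    min_confidence and min_tokens are hypothetical values chosen for the example.
    """
    # Very short extractions give the detector too little evidence.
    if num_tokens < min_tokens:
        return "unknown"
    # Low-confidence guesses are not reported as a language.
    if confidence < min_confidence:
        return "unknown"
    return lang

# A near-empty extraction guessed as 'cmn' is no longer reported as Chinese.
print(reported_language("cmn", 0.35, 4))
print(reported_language("eng", 0.98, 500))
```

With a gate like this, files with "not much actual text" would fall into an explicit unknown bucket rather than inflating the count for whichever language the detector defaults to.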
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>>> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>>>>>>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>>>>>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>>>>>>
> >>>>>>>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>>>>>>
> >>>>>>>>> I haven't had a chance to look yet...
> >>>>>>>> I did... It's not looking good. It's probably the change in the ToUnicode stream
> >>>>>>>> parsing, I'll investigate this.
> >>>>>>> I'm going to have a look
> >>>>>>>
> >>>>>>> Andreas
> >>>>>>>> Tilman
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
> >>>>>>>>>> +1
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> looks like it's time for the next release. How about cutting 2.0.16 in about 2
> >>>>>>>>>>> weeks from now?
> >>>>>>>>>>>
> >>>>>>>>>>> WDYT?
> >>>>>>>>>>>
> >>>>>>>>>>> Andreas
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>>> For additional commands, e-mail: [email protected]
