Sorry. Never made it back to my keyboard on Friday. I just started the comparison code. Should have reports in a few hours.
On Mon, Jun 24, 2019 at 12:36 AM Andreas Lehmkuehler <[email protected]> wrote:
> @Tim, just a friendly reminder, are there already any results available?
>
> Thanks
> Andreas
>
> On 21.06.19 at 17:27, Tim Allison wrote:
> > Sorry. I was afk. I’ll kick this off shortly.
> >
> > On Wed, Jun 19, 2019 at 2:54 AM Tilman Hausherr <[email protected]> wrote:
> >
> >> Hi Tim,
> >>
> >> Please do another one.
> >>
> >> Thanks
> >> Tilman
> >>
> >> On 15.06.2019 at 02:13, Tim Allison wrote:
> >>> http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz
> >>>
> >>> Sharing before reviewing...sorry...
> >>>
> >>> On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
> >>>> Y. Will rerun today.
> >>>>
> >>>> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
> >>>>> Hi, can you run these again? The recent fixed regression in PDFBOX-4550
> >>>>> resulted in large amounts of files without extraction.
> >>>>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>> On 13.06.2019 at 14:36, Tim Allison wrote:
> >>>>>> All,
> >>>>>>
> >>>>>> On a dev branch, I replaced Optimaize with a dev version of
> >>>>>> OpenNLP's language detector, and I updated the common tokens list to
> >>>>>> cover the 120 langs covered by a dev version of OpenNLP's language
> >>>>>> model. I changed the min token length for common words to 3 (from 4),
> >>>>>> and I'm now using 30k common tokens per lang rather than 20k.
> >>>>>>
> >>>>>> I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> >>>>>> 2.0.16-SNAPSHOT, and the results are here:
> >>>>>>
> >>>>>> http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >>>>>>
> >>>>>> Are there any critical problems with the updates in the contents
> >>>>>> comparison files? Any improvements?
> >>>>>>
> >>>>>> I notice that 'cmn' is the most common category for 'not much actual
> >>>>>> text'...we may want to require a higher confidence in language
> >>>>>> detection before reporting a detected language...
> >>>>>>
> >>>>>> Any and all recommendations are welcomed! Thank you!
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
> >>>>>>
> >>>>>> On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>>> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>>>>>>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>>>>>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>>>>>>
> >>>>>>>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>>>>>>
> >>>>>>>>> I haven't had a chance to look yet...
> >>>>>>>> I did... It's not looking good. It's probably the change in the ToUnicode stream
> >>>>>>>> parsing, I'll investigate this.
> >>>>>>> I'm going to have a look
> >>>>>>>
> >>>>>>> Andreas
> >>>>>>>> Tilman
> >>>>>>>>
> >>>>>>>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
> >>>>>>>>>> +1
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> looks like it's time for the next release. How about cutting 2.0.16 in about 2
> >>>>>>>>>>> weeks from now?
> >>>>>>>>>>>
> >>>>>>>>>>> WDYT?
> >>>>>>>>>>>
> >>>>>>>>>>> Andreas
> >>>>>>>>>>>
> >>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>>>> For additional commands, e-mail: [email protected]
