Sorry. I was afk. I’ll kick this off shortly.

On Wed, Jun 19, 2019 at 2:54 AM Tilman Hausherr <[email protected]> wrote:
> Hi Tim,
>
> Please do another one.
>
> Thanks
> Tilman
>
> On 15.06.2019 at 02:13, Tim Allison wrote:
> > http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz
> >
> > Sharing before reviewing...sorry...
> >
> > On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
> >> Y. Will rerun today.
> >>
> >> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
> >>> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
> >>> resulted in large amounts of files without extraction.
> >>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
> >>>
> >>> Tilman
> >>>
> >>> On 13.06.2019 at 14:36, Tim Allison wrote:
> >>>> All,
> >>>>
> >>>> On a dev branch, I replaced Optimaize with a dev version of
> >>>> OpenNLP's language detector, and I updated the common tokens list to
> >>>> cover the 120 langs covered by a dev version of OpenNLP's language
> >>>> model. I changed the min token length for common words to 3 (from 4),
> >>>> and I'm now using 30k common tokens per lang rather than 20k.
> >>>>
> >>>> I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> >>>> 2.0.16-SNAPSHOT, and the results are here:
> >>>>
> >>>> http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >>>>
> >>>> Are there any critical problems with the updates in the contents
> >>>> comparison files? Any improvements?
> >>>>
> >>>> I notice that 'cmn' is the most common category for 'not much actual
> >>>> text'...we may want to require a higher confidence in language
> >>>> detection before reporting a detected language...
> >>>>
> >>>> Any and all recommendations are welcomed! Thank you!
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>>>>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>>>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>>>>
> >>>>>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>>>>
> >>>>>>> I haven't had a chance to look yet...
> >>>>>> I did... It's not looking good. It's probably the change in the ToUnicode stream
> >>>>>> parsing, I'll investigate this.
> >>>>> I'm going to have a look
> >>>>>
> >>>>> Andreas
> >>>>>> Tilman
> >>>>>>
> >>>>>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
> >>>>>>>> +1
> >>>>>>>>
> >>>>>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> looks like it's time for the next release. How about cutting 2.0.16 in about 2
> >>>>>>>>> weeks from now?
> >>>>>>>>>
> >>>>>>>>> WDYT?
> >>>>>>>>>
> >>>>>>>>> Andreas
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>>>>> For additional commands, e-mail: [email protected]
