http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz
Sharing before reviewing...sorry...

On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
>
> Y. Will rerun today.
>
> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]>
> wrote:
>>
>> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
>> resulted in large amounts of files without extraction.
>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
>>
>> Tilman
>>
>> On 13.06.2019 at 14:36, Tim Allison wrote:
>> > All,
>> >
>> > On a dev branch, I replaced Optimaize with a dev version of
>> > OpenNLP's language detector, and I updated the common tokens list to
>> > cover the 120 langs covered by a dev version of OpenNLP's language
>> > model. I changed the min token length for common words to 3 (from 4),
>> > and I'm now using 30k common tokens per lang rather than 20k.
>> >
>> > I reran this dev version of tika-eval on PDFBox 2.0.15 vs
>> > 2.0.16-SNAPSHOT, and the results are here:
>> >
>> > http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
>> >
>> > Are there any critical problems with the updates in the contents
>> > comparison files? Any improvements?
>> >
>> > I notice that 'cmn' is the most common category for 'not much actual
>> > text'...we may want to require a higher confidence in language
>> > detection before reporting a detected language...
>> >
>> > Any and all recommendations are welcomed! Thank you!
>> >
>> > Cheers,
>> >
>> > Tim
>> >
>> > On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]>
>> > wrote:
>> >> On 12.06.19 at 21:08, Tilman Hausherr wrote:
>> >>> On 12.06.2019 at 03:56, Tim Allison wrote:
>> >>>> Reports are available here for 2.0.16-SNAPSHOT:
>> >>>>
>> >>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
>> >>>>
>> >>>> I haven't had a chance to look yet...
>> >>>
>> >>> I did... It's not looking good. It's probably the change in the
>> >>> ToUnicode stream parsing, I'll investigate this.
>> >> I'm going to have a look
>> >>
>> >> Andreas
>> >>> Tilman
>> >>>
>> >>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
>> >>>>> +1
>> >>>>>
>> >>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]>
>> >>>>> wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> looks like it's time for the next release. How about cutting 2.0.16
>> >>>>>> in about 2 weeks from now?
>> >>>>>>
>> >>>>>> WDYT?
>> >>>>>>
>> >>>>>> Andreas
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: [email protected]
>> >>>>>> For additional commands, e-mail: [email protected]
