Again, apologies for my delay: http://162.242.228.174/reports/pdfbox_2_0_16_1861801.tgz
On Mon, Jun 24, 2019 at 6:03 AM Tim Allison <[email protected]> wrote:
>
> Sorry. Never made it back to my keyboard on Friday. I just started the
> comparison code. Should have reports in a few hours.
>
> On Mon, Jun 24, 2019 at 12:36 AM Andreas Lehmkuehler <[email protected]> wrote:
>>
>> @Tim, just a friendly reminder, are there already any results available?
>>
>> Thanks
>> Andreas
>>
>> On 21.06.19 at 17:27, Tim Allison wrote:
>> > Sorry. I was afk. I’ll kick this off shortly.
>> >
>> > On Wed, Jun 19, 2019 at 2:54 AM Tilman Hausherr <[email protected]> wrote:
>> >
>> >> Hi Tim,
>> >>
>> >> Please do another one.
>> >>
>> >> Thanks
>> >> Tilman
>> >>
>> >> On 15.06.2019 at 02:13, Tim Allison wrote:
>> >>> http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz
>> >>>
>> >>> Sharing before reviewing...sorry...
>> >>>
>> >>> On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
>> >>>> Y. Will rerun today.
>> >>>>
>> >>>> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
>> >>>>> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
>> >>>>> resulted in a large number of files without extraction.
>> >>>>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
>> >>>>>
>> >>>>> Tilman
>> >>>>>
>> >>>>> On 13.06.2019 at 14:36, Tim Allison wrote:
>> >>>>>> All,
>> >>>>>>
>> >>>>>> On a dev branch, I replaced Optimaize with a dev version of
>> >>>>>> OpenNLP's language detector, and I updated the common tokens list to
>> >>>>>> cover the 120 langs covered by a dev version of OpenNLP's language
>> >>>>>> model. I changed the min token length for common words to 3 (from 4),
>> >>>>>> and I'm now using 30k common tokens per lang rather than 20k.
>> >>>>>>
>> >>>>>> I reran this dev version of tika-eval on PDFBox 2.0.15 vs
>> >>>>>> 2.0.16-SNAPSHOT, and the results are here:
>> >>>>>>
>> >>>>>> http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
>> >>>>>>
>> >>>>>> Are there any critical problems with the updates in the contents
>> >>>>>> comparison files? Any improvements?
>> >>>>>>
>> >>>>>> I notice that 'cmn' is the most common category for 'not much actual
>> >>>>>> text'...we may want to require a higher confidence in language
>> >>>>>> detection before reporting a detected language...
>> >>>>>>
>> >>>>>> Any and all recommendations are welcome! Thank you!
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>>
>> >>>>>> Tim
>> >>>>>>
>> >>>>>> On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
>> >>>>>>> On 12.06.19 at 21:08, Tilman Hausherr wrote:
>> >>>>>>>> On 12.06.2019 at 03:56, Tim Allison wrote:
>> >>>>>>>>> Reports are available here for 2.0.16-SNAPSHOT:
>> >>>>>>>>>
>> >>>>>>>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
>> >>>>>>>>>
>> >>>>>>>>> I haven't had a chance to look yet...
>> >>>>>>>> I did... It's not looking good. It's probably the change in the
>> >>>>>>>> ToUnicode stream parsing, I'll investigate this.
>> >>>>>>> I'm going to have a look
>> >>>>>>>
>> >>>>>>> Andreas
>> >>>>>>>> Tilman
>> >>>>>>>>
>> >>>>>>>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
>> >>>>>>>>>> +1
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
>> >>>>>>>>>>> Hi,
>> >>>>>>>>>>>
>> >>>>>>>>>>> looks like it's time for the next release. How about cutting
>> >>>>>>>>>>> 2.0.16 in about 2 weeks from now?
>> >>>>>>>>>>>
>> >>>>>>>>>>> WDYT?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Andreas
>> >>>>>>>>>>>
>> >>>>>>>>>>> ---------------------------------------------------------------------
>> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>> >>>>>>>>>>> For additional commands, e-mail: [email protected]
