Yes. Will rerun today.

On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
> Hi,
>
> can you run these again? The recently fixed regression in PDFBOX-4550
> resulted in a large number of files without extraction.
> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
>
> Tilman
>
> On 13.06.2019 at 14:36, Tim Allison wrote:
> > All,
> >
> > On a dev branch, I replaced Optimaize with a dev version of
> > OpenNLP's language detector, and I updated the common tokens list to
> > cover the 120 langs covered by a dev version of OpenNLP's language
> > model. I changed the min token length for common words to 3 (from 4),
> > and I'm now using 30k common tokens per lang rather than 20k.
> >
> > I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> > 2.0.16-SNAPSHOT, and the results are here:
> >
> > http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >
> > Are there any critical problems with the updates in the contents
> > comparison files? Any improvements?
> >
> > I notice that 'cmn' is the most common category for 'not much actual
> > text'...we may want to require a higher confidence in language
> > detection before reporting a detected language...
> >
> > Any and all recommendations are welcomed! Thank you!
> >
> > Cheers,
> >
> > Tim
> >
> > On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
> >> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>
> >>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>
> >>>> I haven't had a chance to look yet...
> >>>
> >>> I did... It's not looking good. It's probably the change in the
> >>> ToUnicode stream parsing, I'll investigate this.
> >> I'm going to have a look
> >>
> >> Andreas
> >>> Tilman
> >>>
> >>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
> >>>>> +1
> >>>>>
> >>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> looks like it's time for the next release. How about cutting 2.0.16
> >>>>>> in about 2 weeks from now?
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
