All,

  On a dev branch, I replaced Optimaize with a dev version of
OpenNLP's language detector, and I updated the common tokens list to
cover the 120 languages supported by a dev version of OpenNLP's language
model.  I lowered the min token length for common words from 4 to 3,
and I'm now using 30k common tokens per language rather than 20k.
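
  For anyone who wants to see the effect of those two settings, here
is a minimal sketch of the selection step.  It is not the tika-eval
code: the function name and the Counter input are hypothetical, and
only the values 3 and 30k come from the change described above.

```python
from collections import Counter

MIN_TOKEN_LENGTH = 3         # value from this message (was 4)
MAX_COMMON_TOKENS = 30_000   # value from this message (was 20k)


def top_common_tokens(counts, min_len=MIN_TOKEN_LENGTH, limit=MAX_COMMON_TOKENS):
    """Hypothetical helper: return up to `limit` of the most frequent
    tokens that have at least `min_len` characters."""
    # Drop tokens that are shorter than the minimum length.
    eligible = Counter({tok: n for tok, n in counts.items() if len(tok) >= min_len})
    # most_common() sorts by descending frequency.
    return [tok for tok, _ in eligible.most_common(limit)]
```

  Lowering the minimum length admits short function words (3
characters) that were excluded before, which matters most for
languages with many short words.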

  I reran this dev version of tika-eval on PDFBox 2.0.15 vs
2.0.16-SNAPSHOT, and the results are here:

http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz

  Are there any critical problems with the updates in the contents
comparison files?  Any improvements?

  I notice that 'cmn' is the most common category for 'not much actual
text'.  We may want to require a higher confidence in language
detection before reporting a detected language.
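
  A minimal sketch of what such a confidence gate could look like.
This is an assumption for discussion only: the function name, the
dict-of-scores input and the 0.5 cutoff are hypothetical, and no
OpenNLP API is used here.

```python
def report_language(predictions, min_confidence=0.5):
    """Hypothetical gate: `predictions` maps language code -> confidence.

    Returns the best language code only if its confidence reaches
    `min_confidence`; otherwise returns None (no language reported).
    The 0.5 default is a placeholder, not a tuned value.
    """
    if not predictions:
        return None
    # Pick the language with the highest confidence score.
    lang, conf = max(predictions.items(), key=lambda kv: kv[1])
    return lang if conf >= min_confidence else None
```

  With a gate like this, a file with very little text and a weak
top score would be reported as "no language detected" rather than
as 'cmn'.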

  Any and all recommendations are welcomed!  Thank you!

           Cheers,

                       Tim




On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>
> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> > On 12.06.2019 at 03:56, Tim Allison wrote:
> >> Reports are available here for 2.0.16-SNAPSHOT:
> >>
> >> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>
> >> I haven't had a chance to look yet...
> >
> >
> > I did... It's not looking good. It's probably the change in the ToUnicode
> > stream parsing, I'll investigate this.
> I'm going to have a look
>
> Andreas
> >
> > Tilman
> >
> >
> >
> >>
> >> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <talli...@apache.org> wrote:
> >>> +1
> >>>
> >>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> >>>> Hi,
> >>>>
> >>>> looks like it's time for the next release. How about cutting 2.0.16 in
> >>>> about 2 weeks from now?
> >>>>
> >>>> WDYT?
> >>>>
> >>>> Andreas
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> >>>>
> >>
> >
> >
>
>
