http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz

Sharing before reviewing...sorry...
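
For anyone who wants a quick first pass over the rerun, here is a minimal sketch of the check Tilman describes below: flag files where NUM_COMMON_TOKENS_A is much larger than NUM_COMMON_TOKENS_B. It assumes the contents comparison sheet has been exported to CSV. The FILE_PATH column name, the ratio of 10, and the 50-token floor are illustrative assumptions, not values taken from tika-eval.

```python
import csv
import io


def flag_lost_extraction(rows, ratio=10.0, min_tokens_a=50):
    """Yield rows where A has many common tokens and B has far fewer.

    `ratio` and `min_tokens_a` are hypothetical thresholds; tune them
    against the real reports.
    """
    for row in rows:
        a = int(row["NUM_COMMON_TOKENS_A"] or 0)
        b = int(row["NUM_COMMON_TOKENS_B"] or 0)
        # max(b, 1) avoids treating B == 0 as an automatic match
        # when A is also tiny.
        if a >= min_tokens_a and a > ratio * max(b, 1):
            yield row


# Made-up sample standing in for an exported comparison sheet.
# FILE_PATH is an assumed column name.
sample = io.StringIO(
    "FILE_PATH,NUM_COMMON_TOKENS_A,NUM_COMMON_TOKENS_B\n"
    "a.pdf,1200,1190\n"
    "b.pdf,800,0\n"
    "c.pdf,12,0\n"
)
flagged = [r["FILE_PATH"] for r in flag_lost_extraction(csv.DictReader(sample))]
print(flagged)
```

The token floor keeps near-empty files (like c.pdf in the sample) out of the list, so only files that had real text in A and lost it in B are flagged.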

On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
>
> Y. Will rerun today.
>
> On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]> wrote:
>>
>> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
>> resulted in a large number of files without extraction
>> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B).
>>
>> Tilman
>>
>> On 13.06.2019 at 14:36, Tim Allison wrote:
>> > All,
>> >
>> >    On a dev branch, I replaced Optimaize with a dev version of
>> > OpenNLP's language detector, and I updated the common tokens list to
>> > cover the 120 langs covered by a dev version of OpenNLP's language
>> > model.  I changed the min token length for common words to 3 (from 4),
>> > and I'm now using 30k common tokens per lang rather than 20k.
>> >
>> >    I reran this dev version of tika-eval on PDFBox 2.0.15 vs
>> > 2.0.16-SNAPSHOT, and the results are here:
>> >
>> > http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
>> >
>> >    Are there any critical problems with the updates in the contents
>> > comparison files?  Any improvements?
>> >
>> >    I notice that 'cmn' is the most common category for 'not much actual
>> > text'...we may want to require a higher confidence in language
>> > detection before reporting a detected language...
>> >
>> >    Any and all recommendations are welcomed!  Thank you!
>> >
>> >             Cheers,
>> >
>> >                         Tim
>> >
>> >
>> >
>> >
>> > On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <[email protected]> wrote:
>> >> On 12.06.19 at 21:08, Tilman Hausherr wrote:
>> >>> On 12.06.2019 at 03:56, Tim Allison wrote:
>> >>>> Reports are available here for 2.0.16-SNAPSHOT:
>> >>>>
>> >>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
>> >>>>
>> >>>> I haven't had a chance to look yet...
>> >>>
>> >>> I did... It's not looking good. It's probably the change in the ToUnicode
>> >>> stream parsing, I'll investigate this.
>> >> I'm going to have a look
>> >>
>> >> Andreas
>> >>> Tilman
>> >>>
>> >>>
>> >>>
>> >>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]> wrote:
>> >>>>> +1
>> >>>>>
>> >>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <[email protected]> wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> looks like it's time for the next release. How about cutting 2.0.16
>> >>>>>> in about 2 weeks from now?
>> >>>>>>
>> >>>>>> WDYT?
>> >>>>>>
>> >>>>>> Andreas
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: [email protected]
>> >>>>>> For additional commands, e-mail: [email protected]
