@Tim, just a friendly reminder: are any results available yet?

Thanks
Andreas

On 21.06.19 at 17:27, Tim Allison wrote:
Sorry. I was afk. I’ll kick this off shortly.

On Wed, Jun 19, 2019 at 2:54 AM Tilman Hausherr <[email protected]>
wrote:

Hi Tim,

Please do another one.

Thanks
Tilman

On 15.06.2019 at 02:13, Tim Allison wrote:
http://162.242.228.174/reports/pdfbox_2_0_16_1861286.tgz

Sharing before reviewing...sorry...

On Fri, Jun 14, 2019 at 7:56 AM Tim Allison <[email protected]> wrote:
Y. Will rerun today.

On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <[email protected]>
wrote:
Hi, can you run these again? The recently fixed regression in PDFBOX-4550
resulted in a large number of files without extracted text
(NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B).
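
For anyone scanning the comparison reports for this pattern, here is a minimal sketch of one way to flag such files. Only the NUM_COMMON_TOKENS_A and NUM_COMMON_TOKENS_B column names come from the reports; the FILE_PATH column, the tab-separated layout, the sample rows, and both thresholds are assumptions made for illustration.

```python
import csv
import io


def flag_lost_text(rows, ratio=10.0, min_tokens_a=50):
    """Return the IDs of files whose common-token count dropped sharply from A to B.

    `ratio` and `min_tokens_a` are hypothetical thresholds, not tika-eval defaults.
    """
    flagged = []
    for row in rows:
        a = int(row["NUM_COMMON_TOKENS_A"])
        b = int(row["NUM_COMMON_TOKENS_B"])
        # Ignore files that had little recognizable text to begin with.
        if a < min_tokens_a:
            continue
        # A count of zero in B is always a drop; otherwise compare the ratio.
        if b == 0 or a / b >= ratio:
            flagged.append(row["FILE_PATH"])  # FILE_PATH is an assumed column name
    return flagged


# Made-up sample rows standing in for a comparison report.
SAMPLE = """FILE_PATH\tNUM_COMMON_TOKENS_A\tNUM_COMMON_TOKENS_B
a.pdf\t1200\t0
b.pdf\t800\t790
c.pdf\t10\t0
d.pdf\t500\t40
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
flagged = flag_lost_text(rows)
print(flagged)
```

The minimum on A keeps near-empty files from dominating the list; both thresholds would need tuning against a real report.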

Tilman

On 13.06.2019 at 14:36, Tim Allison wrote:
All,

     On a dev branch, I replaced Optimaize with a dev version of
OpenNLP's language detector, and I updated the common tokens list to
cover the 120 langs covered by a dev version of OpenNLP's language
model.  I changed the min token length for common words to 3 (from 4),
and I'm now using 30k common tokens per lang rather than 20k.
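
To make the two parameters concrete, here is a rough sketch of how a per-language common-token list with a minimum token length and a size cap could be built and applied. This is not the tika-eval implementation: the function names, the regex tokenizer, and the toy corpus are all assumptions, and only the values 3 and 30k come from the description above.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # assumed tokenizer, much simpler than a real analyzer


def build_common_tokens(texts, min_len=3, top_n=30000):
    """Return the top_n most frequent tokens of at least min_len characters."""
    counts = Counter()
    for text in texts:
        for tok in TOKEN_RE.findall(text.lower()):
            if len(tok) >= min_len:
                counts[tok] += 1
    return [tok for tok, _ in counts.most_common(top_n)]


def count_common_tokens(text, common, min_len=3):
    """Count the tokens in text that appear in the common-token set."""
    return sum(
        1
        for tok in TOKEN_RE.findall(text.lower())
        if len(tok) >= min_len and tok in common
    )


# Toy corpus; a real list would be built from a large per-language corpus.
corpus = ["the cat sat on the mat", "the dog and the cat"]
common = build_common_tokens(corpus, min_len=3, top_n=2)
hits = count_common_tokens("The cat is on the mat", set(common))
print(common, hits)
```

Lowering the minimum length from 4 to 3 lets short words such as "the" and "cat" count toward the total, which matters most for languages with many short words.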

     I reran this dev version of tika-eval on PDFBox 2.0.15 vs
2.0.16-SNAPSHOT, and the results are here:

http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz

     Are there any critical problems with the updates in the contents
comparison files?  Any improvements?

     I notice that 'cmn' is the most common category for 'not much
actual text'...we may want to require a higher confidence in language
detection before reporting a detected language...
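
One possible shape for such a gate, as a sketch only: report a language when the best prediction clears a confidence threshold and there is enough text to judge. The function name, the (language, confidence) pair format, and both default thresholds are hypothetical and are not taken from OpenNLP or tika-eval.

```python
def report_language(predictions, num_tokens, min_confidence=0.9, min_tokens=20):
    """Return the best language code, or None when the detection is not trustworthy.

    predictions: list of (language_code, confidence) pairs from any detector.
    num_tokens:  number of tokens the detector saw.
    Both thresholds are illustrative assumptions.
    """
    # Too little text: any detected language is likely noise.
    if not predictions or num_tokens < min_tokens:
        return None
    lang, conf = max(predictions, key=lambda p: p[1])
    return lang if conf >= min_confidence else None


# Made-up detector outputs for three documents.
print(report_language([("cmn", 0.31), ("eng", 0.29)], num_tokens=5))
print(report_language([("eng", 0.97), ("deu", 0.02)], num_tokens=400))
print(report_language([("cmn", 0.40)], num_tokens=400))
```

With a gate like this, files with little actual text would be reported without a language instead of being counted under 'cmn'.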

     Any and all recommendations are welcomed!  Thank you!

              Cheers,

                          Tim




On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <
[email protected]> wrote:
On 12.06.19 at 21:08, Tilman Hausherr wrote:
On 12.06.2019 at 03:56, Tim Allison wrote:
Reports are available here for 2.0.16-SNAPSHOT:

http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz

I haven't had a chance to look yet...
I did... It's not looking good. It's probably the change in the
ToUnicode stream parsing, I'll investigate this.
I'm going to have a look

Andreas
Tilman



On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <[email protected]>
wrote:
+1

On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <
[email protected]> wrote:
Hi,

Looks like it's time for the next release. How about cutting
2.0.16 about 2 weeks from now?

WDYT?

Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

