On 12.08.2020 at 23:21, Tim Allison wrote:
All,
   Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz

I haven't had a chance to look at the reports yet. :(

I tried to update the instructions for running the process on the vm.
Please let me know if you have any questions, or if I need to make
improvements.


Thanks. It looks good: there are no show stoppers, so thumbs up from me.

But the Tika problem is still there, although less often. It is gone for the file I mentioned (or that file wasn't in the test), but not for others, e.g.

commoncrawl3/6V/6VTB5IUKXBFA3JZPJBUPVSRY7L56K6LE

commoncrawl3/5I/5I6STZEO5W25GPETYGLCDLIB6OKQXCIG

Tilman




Thank you.

    Best,

               Tim

On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

On 31.07.20 at 08:27, Tilman Hausherr wrote:
On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
On 30.07.20 at 20:22, Tilman Hausherr wrote:
I've looked at all the files I had highlighted yesterday. All differences
except two are related to the metadata problem.

The other two have a problem with spaces, i.e. glyphs not being near
each other.
commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ

This doesn't have to be a bug, I've seen many files where the extraction is
better, so whatever change there is may have improved more things.
Thanks for the analysis. IMHO we are good to cut a new release, aren't we?

Yeah we could.

But if the bug gets solved it would be nice to have a new diff output to see
if anything else gets shown more clearly.
I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
else we have to wait for before we run the tests again, maybe some Tika fix?

Andreas

Tilman




Tilman

On 30.07.2020 at 18:05, Tilman Hausherr wrote:
Hi,

I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf

There's something wrong with the XMP metadata extraction. dc:title is empty
(or an empty line and maybe spaces) in Tika 1.25 but not in Tika 1.24.

I thought this could be related to some minor xmpbox changes, but Tika
doesn't use it. So I searched and found some changes in PDMetadataExtractor.
I'm not yet sure if that is the cause, although I played around with that one.
If it is, then it is related to

https://issues.apache.org/jira/browse/TIKA-3101

Tilman

On 30.07.2020 at 12:43, Tim Allison wrote:
Looks like there may be some issues with Japanese...don't know if this is
related to your observation?

When I sort in ascending order of NUM_COMMON_TOKENS_DIFF_IN_B, it looks like
there are quite a few jpn->jpn language pairs in the "lost common tokens".

Will look a bit more.
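
A minimal sketch of that kind of check, assuming the report rows have already been loaded as dictionaries. Only NUM_COMMON_TOKENS_DIFF_IN_B comes from the thread; the LANG_A and LANG_B keys and the sample values are hypothetical and would need to be adapted to the real report columns.

```python
from collections import Counter


def lang_pairs_among_worst(rows, n):
    """Count (language A, language B) pairs among the n rows with the
    lowest NUM_COMMON_TOKENS_DIFF_IN_B, i.e. the most lost common tokens."""
    ordered = sorted(rows, key=lambda r: int(r["NUM_COMMON_TOKENS_DIFF_IN_B"]))
    # LANG_A / LANG_B are hypothetical column names for the detected languages.
    return Counter((r["LANG_A"], r["LANG_B"]) for r in ordered[:n])


# Made-up sample rows, only to show the shape of the input.
sample_rows = [
    {"NUM_COMMON_TOKENS_DIFF_IN_B": "-120", "LANG_A": "jpn", "LANG_B": "jpn"},
    {"NUM_COMMON_TOKENS_DIFF_IN_B": "-80", "LANG_A": "jpn", "LANG_B": "jpn"},
    {"NUM_COMMON_TOKENS_DIFF_IN_B": "5", "LANG_A": "eng", "LANG_B": "eng"},
    {"NUM_COMMON_TOKENS_DIFF_IN_B": "-10", "LANG_A": "eng", "LANG_B": "eng"},
]

pairs = lang_pairs_among_worst(sample_rows, 3)
print(pairs.most_common())
```

If one language pair dominates the counts for a small n, that would support the impression from the sorted report.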

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <thaush...@t-online.de> wrote:

On 28.07.2020 at 23:51, Tim Allison wrote:
Reports are here:

https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Thank you. Besides the exceptions, there are a few cases in content
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
meaningful content. That is suspicious and needs further investigation.
Tilman
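
The condition described above could be checked mechanically along these lines. This is only a sketch: it assumes the report rows are available as dictionaries, the FILE key and the sample values are hypothetical, and "meaningful content" is approximated as "not blank".

```python
def suspicious_rows(rows):
    """Return rows where TOP_10_MORE_IN_B is blank but TOP_10_MORE_IN_A is not."""
    flagged = []
    for row in rows:
        more_in_a = (row.get("TOP_10_MORE_IN_A") or "").strip()
        more_in_b = (row.get("TOP_10_MORE_IN_B") or "").strip()
        if more_in_a and not more_in_b:
            flagged.append(row)
    return flagged


# Made-up sample rows; FILE is a hypothetical identifier column.
sample = [
    {"FILE": "a.pdf", "TOP_10_MORE_IN_A": "foo: 3 | bar: 2", "TOP_10_MORE_IN_B": ""},
    {"FILE": "b.pdf", "TOP_10_MORE_IN_A": "foo: 1", "TOP_10_MORE_IN_B": "baz: 1"},
    {"FILE": "c.pdf", "TOP_10_MORE_IN_A": "", "TOP_10_MORE_IN_B": ""},
]

flagged_files = [row["FILE"] for row in suspicious_rows(sample)]
print(flagged_files)
```

Rows flagged this way still need manual inspection, since a blank TOP_10_MORE_IN_B is not by itself proof of a regression.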



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

