How about this:
https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1-v3.tgz


On Thu, Jul 20, 2023 at 11:00 AM Tim Allison <[email protected]> wrote:

> How about these:
> https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1-v2.tgz
>
> That adds the "file_name", which is either the container file name or, in
> the case of an embedded file, Tika's best guess as to what the container
> file named the embedded file.  The "file_name" is problematic because it is
> "user defined data" and can be messy or malicious.  We should include the
> embedded file id path, which is just numbers and slashes, but that'll take
> a bit more work.
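
On the "file_name" concern quoted just above, here is a minimal sketch in Python of how a report writer might treat the two kinds of value differently. Everything in it is an assumption for illustration: the function names are made up, the id path is assumed to look like "/1/3/2", and none of it is taken from Tika's report code.

```python
import re
import unicodedata

# Assumption: an embedded id path is one or more "/<digits>" segments, e.g. "/1/3/2".
_ID_PATH = re.compile(r"(/\d+)+")


def is_plain_id_path(value: str) -> bool:
    """True if the value consists only of slash-separated numbers."""
    return _ID_PATH.fullmatch(value) is not None


def sanitize_file_name(name: str, max_len: int = 80) -> str:
    """Make a user-supplied file name safer to show in a spreadsheet cell."""
    # Drop control, format, and other "C*" category characters
    # (newlines, tabs, bidi overrides, and so on).
    cleaned = "".join(ch for ch in name if not unicodedata.category(ch).startswith("C"))
    # Prefix a quote so spreadsheet tools do not read the cell as a formula.
    if cleaned[:1] in ("=", "+", "-", "@"):
        cleaned = "'" + cleaned
    # Keep the cell short; very long names are truncated.
    return cleaned[:max_len]
```

The id path can be checked with a strict pattern and rejected outright when it does not match, while the file name can only be cleaned up on a best-effort basis, which is why the id path looks like the safer column to report.
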
>
> So, let me know if this helps any...
>
> On Thu, Jul 20, 2023 at 9:32 AM Tim Allison <[email protected]> wrote:
>
>> Yes, again, those are embedded files.  I'll add this report to the ticket
>> as well, to include the embedded resource path.  Thank you!
>>
>> On Wed, Jul 19, 2023 at 11:56 PM Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> On 19.07.2023 19:19, Tim Allison wrote:
>>> > Results are here:
>>> > https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1.tgz
>>>
>>>
>>>
>>> govdocs1/974/974098.ppt
>>>
>>> appears twice in the content_diffs_no_exceptions.xlsx file but with
>>> different content?!
>>>
>>>
>>> govdocs1/974/974098.ppt         304128  application/msword
>>> application/vnd.ms-excel        44      567     66      1421    eng
>>>  48      eng     39      -9      the:
>>> 6 | of: 5 | standard: 4 | 90: 3 | energy: 3 | 75: 2 | a: 2 | and: 2 |
>>> by: 2 | criteria: 2     330: 42 | 21008: 12 | 24371: 12 | 29977: 12 |
>>> energy: 9 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6       BASIC_LATIN: 437
>>> BASIC_LATIN: 7140       the: 6 | of: 5 | standard: 4 | a: 2 | and: 2 |
>>> by: 2
>>> | criteria: 2 | technical: 2 | 1980: 1 | adopted: 1     330: 42 | 21008:
>>> 12
>>> | 24371: 12 | 29977: 12 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6 | 13: 6
>>> the: 6 | of: 5 | standard: 4 | a: 2 | and: 2 | by: 2 | criteria: 2 |
>>> technical: 2 | 1980: 1 | adopted: 1     330: 42 | 21008: 12 | 24371: 12
>>> |
>>> 29977: 12 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6 | 13: 6       0,013
>>>  0,012
>>>
>>> govdocs1/974/974098.ppt         304128  application/vnd.ms-excel
>>> application/vnd.ms-excel        567     148     1421    262     eng
>>>  39      eng     21      -18
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | energy: 9 | lcc: 8 | 1: 6
>>> | 10: 6 | 11: 6 | 12: 6         1: 10 | 30: 7 | 25: 5 | 218.06175: 4 |
>>> 436.1235: 4 | 654.18525: 4 | 872.247: 4 | energy: 4 | national: 4 | 50:
>>> 3       BASIC_LATIN: 7140       BASIC_LATIN: 1717       330: 42 | 21008:
>>> 12 | 24371: 12
>>> | 29977: 12 | lcc: 8 | 31: 6 | 32: 6 | 33: 6 | 34: 6 | 35: 6
>>> 218.06175:
>>> 4 | 436.1235: 4 | 654.18525: 4 | 872.247: 4 | national: 4 | 90.1: 3 |
>>> ashrae: 3 | 1265.469: 2 | 141.157: 2 | 149.503: 2       330: 42 | 21008:
>>> 12 |
>>> 24371: 12 | 29977: 12 | lcc: 8 | 31: 6 | 32: 6 | 33: 6 | 34: 6 | 35: 6
>>> 1: 4 | 218.06175: 4 | 436.1235: 4 | 654.18525: 4 | 872.247: 4 |
>>> national: 4 | 90.1: 3 | ashrae: 3 | 1265.469: 2 | 141.157: 2    0,092
>>>  0,096
>>>
>>>
>>>
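
For anyone else puzzled by a path showing up twice in a report: a small sketch of grouping report rows by path to list the paths that occur more than once, which per the explanation earlier in the thread happens when rows describe embedded files. The row layout and the keys "path", "mime_a" and "mime_b" are hypothetical. Only the path and the two media types per row are copied from the rows quoted above, and reading them as the media types from the two compared runs is an assumption.

```python
from collections import defaultdict


def repeated_paths(rows):
    """Return {path: [rows]} for every path that occurs in more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["path"]].append(row)
    return {path: group for path, group in groups.items() if len(group) > 1}


# Sample rows: values copied from the quoted report lines, keys are hypothetical.
rows = [
    {
        "path": "govdocs1/974/974098.ppt",
        "mime_a": "application/msword",
        "mime_b": "application/vnd.ms-excel",
    },
    {
        "path": "govdocs1/974/974098.ppt",
        "mime_a": "application/vnd.ms-excel",
        "mime_b": "application/vnd.ms-excel",
    },
]

dupes = repeated_paths(rows)
for path, group in dupes.items():
    print(path, "appears in", len(group), "rows")
```

Without an embedded path column, rows like these cannot be told apart by path alone, which is what the embedded resource path mentioned in the thread would fix.
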
