[
https://issues.apache.org/jira/browse/TIKA-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251931#comment-17251931
]
Hudson commented on TIKA-3253:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #64 (See
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/64/])
TIKA-3253 -- add report for files missing in B (tallison:
[https://github.com/apache/tika/commit/84eb3963264818ab4f183681ed28977f6fbde32e])
* (edit) tika-eval/src/main/resources/comparison-reports.xml
> improve "attachments" tika-eval report directory
> ------------------------------------------------
>
> Key: TIKA-3253
> URL: https://issues.apache.org/jira/browse/TIKA-3253
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Affects Versions: 1.25
> Environment: W10
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: GHOSTSCRIPT-690526-0.pdf,
> container_files_missing_in_B_by_mime.xlsx
>
>
> While doing regression testing for PDFBox I found
> container_files_missing_in_B_by_mime.xlsx
> which has
> MIME_STRING      CNT
> application/pdf  4
> I could not tell which files this count refers to, and the other reports do
> not say either. I worked it out by opening the H2 database and running this
> query:
> {code}
> select pa.file_name
> from profiles_a pa
> left join profiles_b pb on pa.id=pb.id
> where pb.id is null and pa.is_embedded=false
> {code}
> and got
> GHOSTSCRIPT-690526-0.pdf
> GHOSTSCRIPT-692591-0.pdf
> GHOSTSCRIPT-692591-2.pdf
> PDFBOX-4319-0.zip-0.pdf
> So my suggestion is to add 2 files to the report directory that list the
> affected file names.
> I have attached one of the "bad" PDF files. The B extract is empty, and Tika
> runs forever. I'll investigate that separately. (Update: PDFBOX-5049. Will
> probably be solved by TIKA-3246)
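The query in the description is a left anti-join: it keeps every row of profiles_a and then retains only those rows with no matching id in profiles_b. The sketch below reproduces that idea under loud assumptions: it uses Python's sqlite3 with an in-memory database instead of H2, a toy schema with only the three columns the query touches, and made-up file names. It is not the actual tika-eval schema or data, and `is_embedded = 0` stands in for `is_embedded=false`.

```python
import sqlite3

# Toy stand-in for the tika-eval database (assumption: SQLite instead of H2,
# and only the columns used by the query; the real tables have more).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profiles_a (id INTEGER PRIMARY KEY, file_name TEXT, is_embedded INTEGER);
    CREATE TABLE profiles_b (id INTEGER PRIMARY KEY, file_name TEXT, is_embedded INTEGER);
    """
)

# Hypothetical rows: ids 1 and 4 exist in both runs, id 2 is a container
# file present only in A, id 3 is an embedded file present only in A.
conn.executemany(
    "INSERT INTO profiles_a VALUES (?, ?, ?)",
    [
        (1, "both.pdf", 0),
        (2, "missing.pdf", 0),
        (3, "embedded.png", 1),
        (4, "both2.pdf", 0),
    ],
)
conn.executemany(
    "INSERT INTO profiles_b VALUES (?, ?, ?)",
    [(1, "both.pdf", 0), (4, "both2.pdf", 0)],
)


def missing_in_b(connection):
    """Return names of container (non-embedded) files in A with no row in B."""
    rows = connection.execute(
        """
        SELECT pa.file_name
        FROM profiles_a pa
        LEFT JOIN profiles_b pb ON pa.id = pb.id
        WHERE pb.id IS NULL      -- no match in B
          AND pa.is_embedded = 0 -- container files only
        ORDER BY pa.file_name
        """
    ).fetchall()
    return [name for (name,) in rows]


print(missing_in_b(conn))
```

The embedded file with id 3 is also absent from B, but the is_embedded filter excludes it, so only container files are listed, which matches the intent of the container_files_missing_in_B report.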
--
This message was sent by Atlassian Jira
(v8.3.4#803005)