Julio J. Gomez Diaz created TIKA-3965:
-----------------------------------------
Summary: Detector for valid PDF files
Key: TIKA-3965
URL: https://issues.apache.org/jira/browse/TIKA-3965
Project: Tika
Issue Type: Bug
Components: tika-core
Affects Versions: 2.6.0
Reporter: Julio J. Gomez Diaz
Attachments: test2.pdf
If we use MagicDetector, or content-based detection via DefaultDetector, an
invalid file such as the attached one is identified as a PDF file. The file has
this simple content:
{code:java}
<script>alert(1)</script>
%PDF-1.7{code}
Is there an alternative detector in Tika that reads the whole file content, so
that a non-valid PDF file is not detected as PDF?
If there is not, would it make sense to implement one? Which would be the right
Java package location for it?
Adobe Reader and every online PDF processor we tried report this sample file as
invalid, but Tika identifies it as PDF.
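To illustrate the kind of stricter check being asked about, here is a minimal
sketch in plain Java. It is not Tika code: the class name StrictPdfCheck, the
method looksLikePdf, and the 1024-byte tail window are all hypothetical. The
rule it applies (the "%PDF-" header must sit at offset 0 and a "%%EOF" marker
must appear near the end of the file) is an assumption about what "valid"
should mean here, and it is far from a full PDF validation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: a stricter structural check than a
// magic-byte match, assuming a PDF must start with "%PDF-" at offset 0 and
// carry a "%%EOF" marker within the last TAIL_WINDOW bytes.
final class StrictPdfCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    private static final int TAIL_WINDOW = 1024; // assumed search window

    static boolean looksLikePdf(byte[] data) {
        if (!startsWith(data, HEADER)) {
            return false; // header missing or not at offset 0
        }
        // Search for the end-of-file marker only in the tail of the file.
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        return indexOf(data, EOF_MARKER, from) >= 0;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the first index >= from where needle occurs, or -1 if absent.
    private static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = from; i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sample = "<script>alert(1)</script>\n%PDF-1.7"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(sample)); // prints "false"
    }
}
```

With the attached content the sketch returns false, because the header does not
sit at offset 0. Whether such a rule belongs in a Tika Detector or in a separate
validation step is exactly the open question above.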
Thanks in advance