[ https://issues.apache.org/jira/browse/TIKA-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683856#comment-17683856 ]
Tim Allison commented on TIKA-3965:
-----------------------------------
That makes great sense. Unfortunately, PDFs don't. For PDFs, there is no
requirement that the header starts at the beginning of the file. There is
also no requirement that the file ends with %%EOF. This has led to mayhem via
polyglot files and other dastardly creations for years.
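
To make the trade-off concrete, here is a minimal sketch in plain Java of the kind of strict check the issue asks for. It is not Tika code: the class StrictPdfCheck and its methods are hypothetical names invented for this illustration, and the rule it applies (header at offset 0, %%EOF as the last non-whitespace bytes) is an assumption about what "strict" could mean, not something the PDF format or Tika mandates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Apache Tika.
class StrictPdfCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TRAILER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the first "%PDF-" marker, or -1 if there is none. */
    public static int headerOffset(byte[] data) {
        for (int i = 0; i + HEADER.length <= data.length; i++) {
            if (matchesAt(data, i, HEADER)) {
                return i;
            }
        }
        return -1;
    }

    /** True if the last non-whitespace bytes of the data are "%%EOF". */
    public static boolean endsWithEof(byte[] data) {
        int end = data.length;
        // Skip trailing spaces, tabs and line breaks.
        while (end > 0 && isWhitespace(data[end - 1])) {
            end--;
        }
        return matchesAt(data, end - TRAILER.length, TRAILER);
    }

    /** Assumed "strict" rule: header at offset 0 and %%EOF at the end. */
    public static boolean looksStrict(byte[] data) {
        return headerOffset(data) == 0 && endsWithEof(data);
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (data[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

For the attached sample, headerOffset finds the marker after the script tag rather than at offset 0, so looksStrict returns false. The catch is that the same rule would also reject any file whose header is preceded by other bytes or that lacks a trailing %%EOF, which is why a check like this cannot simply replace magic-based detection.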
> Detector for valid PDF files
> ----------------------------
>
> Key: TIKA-3965
> URL: https://issues.apache.org/jira/browse/TIKA-3965
> Project: Tika
> Issue Type: Bug
> Components: tika-core
> Affects Versions: 2.6.0
> Reporter: Julio J. Gomez Diaz
> Priority: Minor
> Attachments: test2.pdf
>
>
> If we use MagicDetector, or content-based detection via DefaultDetector,
> Tika identifies an invalid file such as the attached one as a PDF. The file
> has this simple content:
>
> {code:java}
> <script>alert(1)</script>
> %PDF-1.7{code}
>
> Is there an alternative detector in Tika that reads the whole file content
> so that a non-valid PDF file is not detected as PDF?
> If there is not, would it make sense to implement one? Which would be the
> right Java package location for it?
>
> Adobe Reader and every online PDF processor we tried reject this sample file
> as invalid, but Tika identifies it as PDF.
>
> Thanks in advance