[jira] [Commented] (TIKA-3965) Detector for valid PDF files

Tim Allison (Jira) Fri, 03 Feb 2023 04:23:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683856#comment-17683856
 ]


Tim Allison commented on TIKA-3965:
-----------------------------------

That makes great sense.  Unfortunately, PDFs don't.  For PDFs, there is no 
requirement that the header starts at the beginning of the file.   There is 
also no requirement that the file ends with %%EOF.  This has led to mayhem via 
polyglot files and other dastardly creations for years.  
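
To make the trade-off concrete, here is a minimal sketch of the kind of strict check the reporter is asking about. It is not part of Tika and does not use any Tika API. The class name StrictPdfCheck, the method looksLikeStrictPdf, and the 1024-byte tail window are all assumptions made up for illustration. It accepts data only if %PDF- is at offset 0 and %%EOF appears near the end.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a Tika class.
public class StrictPdfCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // How far back from the end to look for %%EOF. An arbitrary choice for this sketch.
    private static final int TAIL_WINDOW = 1024;

    /** True only if data starts with %PDF- and contains %%EOF in its last TAIL_WINDOW bytes. */
    public static boolean looksLikeStrictPdf(byte[] data) {
        if (!startsWith(data, HEADER)) {
            return false;
        }
        int from = Math.max(0, data.length - TAIL_WINDOW);
        return indexOf(data, EOF_MARKER, from) >= 0;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Naive byte search starting at index 'from'; returns -1 if the needle is absent.
    private static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = from; i <= data.length - needle.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same content as the attachment described in the issue.
        byte[] polyglot = "<script>alert(1)</script>\n%PDF-1.7"
                .getBytes(StandardCharsets.US_ASCII);
        // Not a real PDF; it only carries both markers in the expected places.
        byte[] markersOnly = "%PDF-1.7\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeStrictPdf(polyglot));    // false
        System.out.println(looksLikeStrictPdf(markersOnly)); // true
    }
}
```

The second sample shows the limit of this approach: the check says nothing about whether the bytes between the two markers form a valid PDF. It would also reject files that PDF readers open without complaint, for the reasons given above, so it trades false positives for false negatives.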

> Detector for valid PDF files
> ----------------------------
>
>                 Key: TIKA-3965
>                 URL: https://issues.apache.org/jira/browse/TIKA-3965
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.6.0
>            Reporter: Julio J. Gomez Diaz
>            Priority: Minor
>         Attachments: test2.pdf
>
>
> If we use MagicDetector, or content-based detection via DefaultDetector, 
> Tika identifies an invalid file such as the attached one as a PDF. The file 
> has this simple content:
>  
> {code:java}
> <script>alert(1)</script>
> %PDF-1.7{code}
>  
> Is there any alternative detector in Tika that reads the whole file content 
> so that a non-valid PDF file is not detected as PDF?
> If there is not, would it make sense to implement one? Which would be the 
> right Java package location for it?
>  
> This sample file is rejected as invalid by Adobe Reader and by every online 
> PDF processor we found, but Tika identifies it as PDF.
>  
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
