[jira] [Commented] (TIKA-3965) Detector for valid PDF files

Tim Allison (Jira) Thu, 02 Feb 2023 07:18:09 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683462#comment-17683462
 ]


Tim Allison commented on TIKA-3965:
-----------------------------------

Validation requires parsing.  Perhaps run the text extractor with a small 
write limit?  If the parse throws an exception, the file has a problem.

You could also write a simple wrapper around pdftotext or another poppler tool 
(pdfinfo?) or other open source parser, but again this would be a full parse.
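
As a rough illustration of the wrapper idea, here is a minimal sketch in plain 
Java. It is not part of Tika: the class name PdfInfoCheck and its methods are 
hypothetical, and it assumes that pdfinfo is on the PATH and signals a failed 
parse with a non-zero exit code (worth confirming against the poppler man page 
for your version). Poppler tools can also repair some damaged files, so a 
clean exit suggests "parseable" rather than "strictly valid".

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Tika: asks an external tool whether a file parses. */
final class PdfInfoCheck {

    private PdfInfoCheck() {}

    /** Builds the command line: the tool name followed by the file path. */
    static List<String> buildCommand(String tool, Path file) {
        return List.of(tool, file.toString());
    }

    /**
     * Returns true only if the tool exits with status 0 within the timeout.
     * A missing tool, a timeout, an interrupt, or a non-zero exit all count as failure.
     */
    static boolean parsesCleanly(String tool, Path file, long timeoutSeconds) {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tool, file));
        // Merge stderr into stdout and discard both so the child cannot block on a full pipe.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process;
        try {
            process = pb.start();
        } catch (IOException e) {
            // The tool is not installed or could not be started.
            return false;
        }
        try {
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) {
        Path file = Path.of(args.length > 0 ? args[0] : "test2.pdf");
        System.out.println(file + " parses cleanly: " + parsesCleanly("pdfinfo", file, 30));
    }
}
```

The tool name is a parameter, so pdftotext or another parser could be swapped 
in. Either way this is a full parse, not a cheap detection step.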

> Detector for valid PDF files
> ----------------------------
>
>                 Key: TIKA-3965
>                 URL: https://issues.apache.org/jira/browse/TIKA-3965
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.6.0
>            Reporter: Julio J. Gomez Diaz
>            Priority: Minor
>         Attachments: test2.pdf
>
>
> If we use MagicDetector, or content-based detection via DefaultDetector, 
> Tika identifies an invalid file such as the attached one as a PDF. The file 
> has this simple content:
>  
> {code:java}
> <script>alert(1)</script>
> %PDF-1.7{code}
>  
> Is there an alternative detector in Tika that reads the whole file content 
> so that a non-valid PDF file is not detected as PDF?
> If there is not, would it make sense to implement one? Which would be the 
> right Java package location for it?
>  
> This sample file is rejected as invalid by Adobe Reader and by every online 
> PDF processor we found, but Tika identified it as PDF.
>  
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
