[ 
https://issues.apache.org/jira/browse/TIKA-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683779#comment-17683779
 ] 

Julio J. Gomez Diaz commented on TIKA-3965:
-------------------------------------------

Hello [~tallison], thanks for your comment; these are valid alternatives. One of 
my questions to the Tika development team is:
 * Why does the MagicDetector for the PDF type search for the "magic bytes" at 
any file position? Wouldn't it be more correct to look for the "magic bytes" 
ONLY at the beginning of the file?

Doing that, the provided sample file would fail to identify as PDF (as the 
magic bytes are not found at the beginning of the file content). It would also 
be more performant, because you would only have to read 4 bytes instead of 
searching the whole file content.
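
To make the stricter check concrete, here is a minimal sketch in plain Java. It 
does not use Tika's MagicDetector API; the class name PdfHeaderCheck and the 
choice of the 4-byte signature "%PDF" at offset 0 are assumptions made only 
for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: accepts a file as PDF only when
// the signature sits at offset 0.
final class PdfHeaderCheck {

    // Assumption: the 4-byte signature "%PDF" is what must appear first.
    private static final byte[] PDF_MAGIC =
            "%PDF".getBytes(StandardCharsets.US_ASCII);

    // True only if the given content begins with the signature.
    static boolean startsWithPdfMagic(byte[] content) {
        return content.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(content, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Reads at most 4 bytes from the stream; the rest is never touched.
    // Note that the bytes read are consumed from the stream.
    static boolean startsWithPdfMagic(InputStream in) throws IOException {
        return startsWithPdfMagic(in.readNBytes(PDF_MAGIC.length));
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<script>alert(1)</script>\n%PDF-1.7"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] wellFormedStart = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithPdfMagic(new ByteArrayInputStream(sample)));          // false
        System.out.println(startsWithPdfMagic(new ByteArrayInputStream(wellFormedStart))); // true
    }
}
```

With this check the attached sample is rejected, because its first bytes are 
"<scr" rather than "%PDF". A file passing this check is still not guaranteed 
to be a valid PDF; it only rules out signatures found later in the content.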

 

Thanks in advance,

> Detector for valid PDF files
> ----------------------------
>
>                 Key: TIKA-3965
>                 URL: https://issues.apache.org/jira/browse/TIKA-3965
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.6.0
>            Reporter: Julio J. Gomez Diaz
>            Priority: Minor
>         Attachments: test2.pdf
>
>
> If we use MagicDetector, or content-based detection via DefaultDetector, 
> Tika identifies an invalid file such as the attached one as a PDF. The file 
> has this simple content:
>  
> {code:java}
> <script>alert(1)</script>
> %PDF-1.7{code}
>  
> Is there any alternative detector in Tika that reads the whole file content 
> so that a non-valid PDF file is not detected as PDF?
> If there is not, would it make sense to implement one? Which would be the right 
> Java package location for it?
>  
> This sample file is rejected as invalid by Adobe Reader and by every online 
> PDF processor we found, but Tika identified it as PDF.
>  
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
