[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztián Gyula Tóth updated TIKA-3554:
---------------------------------------
    Summary: Detect plain text file as application/zip based on file ext wrong  
(was: Detect plain text file as application/zip based on file ext false)

> Detect plain text file as application/zip based on file ext wrong
> -----------------------------------------------------------------
>
>                 Key: TIKA-3554
>                 URL: https://issues.apache.org/jira/browse/TIKA-3554
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, metadata, mime
>    Affects Versions: 1.26
>            Reporter: Krisztián Gyula Tóth
>            Priority: Major
>         Attachments: image-2021-09-15-10-33-33-560.png
>
>
> Given a simple plain text file with the file extension `.zip` and the 
> content `Hello World!` (example file name: "hello.txt.zip"),
> when calling the function `tika.detect()` with the file bytes read from an 
> `InputStream` using `BufferedInputStream`
> {code:java}
> // tika.detect(byte[], String) already returns a String, so it is not wrapped in Optional
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> Then it returns `application/zip` as the detected MIME type, even though the 
> file's content is plain text and only the file extension is `.zip`.
> The result is the same when uploading a file with HTML content but with 
> `.zip` as the file extension.
> Plain text is not a rare file type that is hard to detect, so I'd say this 
> is a bug in Tika.
>  
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return 
> `text/plain` for the detected mime type regardless of the file extension 
> being `.zip`.
> *Suggested solution:*
> Check the file signature in addition to the file extension when the file 
> extension is `.zip`.
> To ensure that the uploaded file really is a zip archive, it should start 
> with one of the following file signatures:
>  * 50 4B 03 04
>  * 50 4B 05 06 (empty archive)
>  * 50 4B 07 08 (spanned archive)
>  
> See magic numbers at [Wiki page for ZIP file format|https://en.wikipedia.org/wiki/ZIP_(file_format)]
>  
> *Background info:*
> We use `Tika.detect()` in a Java servlet to detect a file's mime type when 
> it is uploaded to the server, before saving it for further processing. This 
> ensures that the client-provided file has the expected mime type and that 
> only that type of file is accepted. In this context, we are working with 
> `ZIP` archives: users are only allowed to upload zip archives. However, it 
> turned out that Tika does not detect plain text files as such and still 
> recognizes them as ZIP archives if the file extension is `.zip`.
> We are currently using Tika 1.26. There are newer versions of Apache Tika, 
> but this is still an issue in the newer version.
>  
> *How I investigated this:*
> 1. Upload a valid zip archive with filename `archive.zip.txt` where the file 
> extension is `.txt`
>  - Expectation: Tika should detect the file mime type as `application/zip`
>  - Result: Provides the expected result. A valid zip archive with the `.txt` 
> file extension in its name is still detected as `application/zip` 
> successfully.
> 2. Upload a valid zip archive with a filename that does not have the `.zip` 
> file extension.
>  - Expectation: Tika should detect the file mime type as `application/zip`
>  - Result: Provides the expected result. A valid zip archive without the 
> `.zip` file extension in its name is still detected as `application/zip` 
> successfully.
> 3. Upload a common GIF file with the `.zip` file extension, e.g. 
> `something.gif.zip`
>  - Expectation: Tika should detect the file mime type as `image/gif`
>  - Result: Provides the expected result. A GIF image with the `.zip` file 
> extension is still detected as `image/gif`.
> 4. Upload any plain text file (can be `HTML` doc or `TEXT`) with filename 
> `myText.zip` where the file extension is `.zip`
>  - Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content. 
>  - Result: Tika `detect()` **fails**! Detects it as `application/zip`.
> 5. Upload any plain text file (can be `HTML` doc or plain `TEXT`) with a 
> filename that has no file extension.
>  - Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  - Result: Provides the expected result. Detects it as 
> `application/octet-stream`. (This is acceptable for a file with text content 
> and no file extension, although `text/plain` would be a perfect match.)
> | idx | Tika detect file type test case | Pass (Y/N) | Expected | Detected |
> | --- | --- | --- | --- | --- |
> | 1. | A valid ZIP archive with name, but `.txt` file ext | Y | application/zip | application/zip |
> | 2. | A valid ZIP archive with name, but without file ext | Y | application/zip | application/zip |
> | 3. | A common binary (GIF), but with `.zip` file ext | Y | image/gif | image/gif |
> | 4. | A plain text file, but with `.zip` file ext | N | application/octet-stream (or text/plain) | application/zip |
> | 5. | A plain text file, but without file ext | Y | application/octet-stream (or text/plain) | application/octet-stream |
> ...
> *Conclusion*: Tika does not detect plain text files as such and still 
> recognizes them as ZIP archives if the file extension is `.zip`.
> So I think the most significant detection issue in Tika is telling plain 
> text files apart from ZIP archives. Other known file types and binaries are 
> detected correctly regardless of the filename and file extension given by 
> the client.
>  
> *Visual proof*
> See in attachments.
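
The signature check proposed under "Suggested solution" in the description could be sketched roughly as below. This is only an illustration in plain Java, not Tika code and not how Tika performs detection: the class name `ZipSignatureCheck` and the method `hasZipSignature` are hypothetical, and the sketch assumes the first few bytes of the upload are already available as a `byte[]`. It compares the start of the data against the three signatures listed in the description.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika): checks whether the leading bytes
// of an upload match one of the ZIP signatures listed in the description.
final class ZipSignatureCheck {

    // 50 4B 03 04, 50 4B 05 06 (empty archive), 50 4B 07 08 (spanned archive)
    private static final byte[][] ZIP_SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04},
        {0x50, 0x4B, 0x05, 0x06},
        {0x50, 0x4B, 0x07, 0x08},
    };

    // Returns true if the data starts with any of the known ZIP signatures.
    static boolean hasZipSignature(byte[] prefix) {
        if (prefix == null) {
            return false;
        }
        for (byte[] signature : ZIP_SIGNATURES) {
            if (startsWith(prefix, signature)) {
                return true;
            }
        }
        return false;
    }

    // Byte-wise prefix comparison; data shorter than the signature never matches.
    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Content of the "hello.txt.zip" example from the description.
        byte[] text = "Hello World!".getBytes(StandardCharsets.US_ASCII);
        // Start of a non-empty archive (local file header signature plus two more bytes).
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println("plain text looks like ZIP: " + hasZipSignature(text));
        System.out.println("zip header looks like ZIP: " + hasZipSignature(zipStart));
    }
}
```

In the upload scenario from the background info, a check like this could be run on the uploaded bytes alongside the `tika.detect()` call, rejecting a file whose name ends in `.zip` but whose content does not start with a ZIP signature. Whether that belongs in the application or in Tika's detector is the question this issue raises.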



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
