[
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krisztián Gyula Tóth updated TIKA-3554:
---------------------------------------
Summary: Detect plain text file as application/zip based on file ext wrong
(was: Detect plain text file as application/zip based on file ext false)
> Detect plain text file as application/zip based on file ext wrong
> -----------------------------------------------------------------
>
> Key: TIKA-3554
> URL: https://issues.apache.org/jira/browse/TIKA-3554
> Project: Tika
> Issue Type: Bug
> Components: detector, metadata, mime
> Affects Versions: 1.26
> Reporter: Krisztián Gyula Tóth
> Priority: Major
> Attachments: image-2021-09-15-10-33-33-560.png
>
>
> Given a simple plain text file with the content `Hello World!` and the file
> extension `.zip` (example file name: "hello.txt.zip"),
> when calling `tika.detect()` with the file bytes read from an
> `InputStream` wrapped in a `BufferedInputStream`:
> {code:java}
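> // tika.detect(byte[], String) already returns the MIME type as a String;
> // wrapping it in Optional.of(...) would not compile against a String variable.
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());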
> {code}
> Then it returns `application/zip` as the detected MIME type, even though
> the file's content is plain text and only the file extension is `.zip`.
> The result is the same when uploading a file with HTML content but
> with `.zip` as the file extension.
> Plain text is not a rare file type that is hard to detect, so I'd say this
> is a bug in Tika.
>
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return
> `text/plain` for the detected mime type regardless of the file extension
> being `.zip`.
> *Suggested solution:*
> When the file extension is `.zip`, also check the file signature instead of
> relying on the extension alone.
> To ensure that the uploaded file is really a zip archive, its leading bytes
> should match one of the following signatures:
> * 50 4B 03 04
> * 50 4B 05 06 (empty archive)
> * 50 4B 07 08 (spanned archive)
>
> See the magic numbers on the [Wikipedia page for the ZIP file
> format|https://en.wikipedia.org/wiki/ZIP_(file_format)].
>
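A minimal sketch of the signature check suggested above, using only the Java standard library. The class `ZipSignature` and the method `looksLikeZip` are hypothetical names chosen for illustration; they are not part of Tika, and this is not how Tika's own detector is implemented.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): checks leading bytes for a ZIP signature. */
final class ZipSignature {

    // The three signatures listed in the report; all start with "PK" (0x50 0x4B).
    private static final byte[][] SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04}, // regular archive
        {0x50, 0x4B, 0x05, 0x06}, // empty archive
        {0x50, 0x4B, 0x07, 0x08}, // spanned archive
    };

    private ZipSignature() {}

    /** Returns true if the first four bytes equal one of the listed signatures. */
    static boolean looksLikeZip(byte[] prefix) {
        if (prefix == null || prefix.length < 4) {
            return false;
        }
        for (byte[] signature : SIGNATURES) {
            boolean match = true;
            for (int i = 0; i < signature.length; i++) {
                if (prefix[i] != signature[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "Hello World!".getBytes(StandardCharsets.US_ASCII);
        byte[] zip = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(looksLikeZip(text)); // false
        System.out.println(looksLikeZip(zip));  // true
    }
}
```

A caller could run this check in addition to the detected MIME type and reject the upload when the name ends in `.zip` but the signature does not match.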
> *Background info:*
> We use `Tika.detect()` in a Java servlet to detect a file's MIME type when
> it is uploaded to the server, before saving it for further processing. The
> goal is to ensure that the client-provided file has the expected MIME type
> and to accept only that type of file. In this context, we are working with
> `ZIP` archives: users are only allowed to upload zip archives. However, it
> turned out that Tika does not detect plain text files as text and instead
> recognizes them as ZIP archives if the file extension is `.zip`.
> We currently use Tika 1.26. There are newer versions of Apache Tika, but
> this is still an issue in the newer version.
>
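For the upload scenario described above, the leading bytes can be inspected without consuming the stream, so the same stream can still be saved afterwards. The following is a rough sketch under the assumption that the upload is available as an `InputStream` that supports mark/reset (for example a `BufferedInputStream`); `StreamPeek` and `peek` are hypothetical names, not Tika or servlet API.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: reads the first bytes of a stream and rewinds it. */
final class StreamPeek {

    private StreamPeek() {}

    /** Returns up to n leading bytes; the stream position is restored afterwards. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            byte[] buffer = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buffer, total, n - total);
                if (read < 0) {
                    break; // end of stream before n bytes were available
                }
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "Hello World!".getBytes(StandardCharsets.US_ASCII);
        InputStream upload = new BufferedInputStream(new ByteArrayInputStream(content));
        byte[] head = peek(upload, 4);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // Hell
        System.out.println((char) upload.read());                        // H
    }
}
```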
> *How I investigated this:*
> 1. Upload a valid zip archive with the filename `archive.zip.txt`, where the
> file extension is `.txt`.
> - Expectation: Tika should detect the file MIME type as `application/zip`.
> - Result: Provides the expected result. A valid zip archive with a `.txt`
> file extension in its name is still detected as `application/zip`.
> 2. Upload a valid zip archive with a filename, but without the `.zip` file
> extension.
> - Expectation: Tika should detect the file MIME type as `application/zip`.
> - Result: Provides the expected result. A valid zip archive without the
> `.zip` file extension in its name is still detected as `application/zip`.
> 3. Upload a common GIF file, but with the `.zip` file extension
> (`something.gif.zip`).
> - Expectation: Tika should detect the file MIME type as `image/gif`.
> - Result: Provides the expected result. A GIF image with the `.zip` file
> extension is still detected as `image/gif`.
> 4. Upload any plain text file (can be an `HTML` doc or `TEXT`) with the
> filename `myText.zip`, where the file extension is `.zip`.
> - Expectation: Tika should detect the file MIME type as
> `application/octet-stream` in general, or as `text/plain` or `text/html`
> depending on the file's content.
> - Result: Tika `detect()` **fails**! It detects the file as `application/zip`.
> 5. Upload any plain text file (can be an `HTML` doc or plain `TEXT`) with a
> filename, but without a file extension.
> - Expectation: Tika should detect the file MIME type as
> `application/octet-stream` in general, or as `text/plain` or `text/html`
> depending on the file's content.
> - Result: Provides the expected result. Detects it as
> `application/octet-stream`. (This is acceptable for a file with text content
> and no file extension, although `text/plain` would be a perfect match.)
> | idx | Tika detect file type test case | Pass (Y/N) | Expected | Detected |
> | --- | --- | --- | --- | --- |
> | 1. | A valid ZIP archive with a name, but with `.txt` file ext | Y | application/zip | application/zip |
> | 2. | A valid ZIP archive with a name, but without file ext | Y | application/zip | application/zip |
> | 3. | A common binary (GIF), but with `.zip` file ext | Y | image/gif | image/gif |
> | 4. | A plain text file, but with `.zip` file ext | N | application/octet-stream (or text/plain) | application/zip |
> | 5. | A plain text file, but without file ext | Y | application/octet-stream (or text/plain) | application/octet-stream |
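The "Expected" column above amounts to a content-first rule: the bytes decide the type and the file name is ignored. The sketch below only illustrates that rule for the three kinds of content in the table. `ContentFirst` and `expectedType` are hypothetical names, the text check is a simplistic printable-ASCII heuristic (it does not tell HTML apart from plain text), and none of this reflects how Tika actually combines magic bytes and file names.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of the "Expected" column: content decides, the name is ignored. */
final class ContentFirst {

    private ContentFirst() {}

    static String expectedType(byte[] prefix) {
        if (startsWith(prefix, 0x50, 0x4B, 0x03, 0x04)) {
            return "application/zip";
        }
        if (startsWith(prefix, 0x47, 0x49, 0x46, 0x38)) { // "GIF8", as in GIF87a / GIF89a
            return "image/gif";
        }
        if (prefix.length > 0 && isPrintableAscii(prefix)) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Simplistic assumption: text means printable ASCII plus tab, LF and CR.
    private static boolean isPrintableAscii(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (!whitespace && (c < 0x20 || c > 0x7E)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "Hello World!".getBytes(StandardCharsets.US_ASCII);
        // Case 4: the name would be myText.zip, but only the content is inspected.
        System.out.println(expectedType(text)); // text/plain
    }
}
```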
> ...
> *Conclusion*: Tika does not detect plain text files as text and instead
> recognizes them as ZIP archives if the file extension is `.zip`.
> So I think the plain text vs. ZIP archive case is the most significant
> detection issue here. Other known file types and binaries are detected
> correctly regardless of the file name and extension given by the client.
>
> *Visual proof*
> See the attached screenshot (image-2021-09-15-10-33-33-560.png).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)