[
https://issues.apache.org/jira/browse/TIKA-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932887#comment-17932887
]
Subbu edited comment on TIKA-4370 at 3/6/25 8:25 AM:
-----------------------------------------------------
I understand that CharsetDetector could be run before MimeTypes to figure out
whether the incoming file is SJIS, and if it is, MimeTypes could return a text
type. But I think CharsetDetector is in the parsers module; can we have core
depend on parsers?
_Another thought is to hardcode the text detector after MimeTypes...which I
don't like, but I'm not beyond. :D_
I couldn't follow this clearly: even if we hardcode TextDetector after
MimeTypes, unless it is able to detect SJIS the result would still be
octet-stream, wouldn't it? Let me know if I misunderstood or if you are
thinking of a better way.
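To make the concern concrete, here is a rough sketch of the kind of fallback I
have in mind. It is not Tika code: the class name SjisFallbackSketch and the
refine() method are hypothetical, and it uses only the JDK's strict
CharsetDecoder instead of Tika's CharsetDetector. The idea is that a result of
octet-stream only becomes text if the bytes actually decode as Shift_JIS.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration only; none of these names exist in Tika.
public class SjisFallbackSketch {

    static final String OCTET_STREAM = "application/octet-stream";
    static final String TEXT_PLAIN = "text/plain";

    /** True if the bytes contain no binary control bytes and decode strictly. */
    static boolean decodesCleanly(byte[] prefix, String charsetName) {
        // Shift_JIS trail bytes are never below 0x40, so any byte below 0x20
        // is a real control byte. Allow only tab, LF and CR.
        for (byte b : prefix) {
            int v = b & 0xFF;
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                return false;
            }
        }
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws on any malformed or unmappable sequence, including a
            // multi-byte character cut off at the end of the prefix.
            decoder.decode(ByteBuffer.wrap(prefix));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Refines a magic-based result; only octet-stream is ever changed. */
    static String refine(String magicResult, byte[] prefix) {
        if (!OCTET_STREAM.equals(magicResult) || prefix.length == 0) {
            return magicResult;
        }
        return decodesCleanly(prefix, "Shift_JIS") ? TEXT_PLAIN : OCTET_STREAM;
    }
}
```

Without a step like decodesCleanly(), running a text check after MimeTypes
would leave the result at octet-stream, which is the point of my question. A
real version would also have to handle a prefix that ends in the middle of a
two-byte character, which this sketch simply rejects.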
> SJIS Encoded Files Can't be Detected
> ------------------------------------
>
> Key: TIKA-4370
> URL: https://issues.apache.org/jira/browse/TIKA-4370
> Project: Tika
> Issue Type: Bug
> Reporter: Subbu
> Priority: Major
>
> When the character encoding of a file is SJIS and there is no file name in
> the metadata, the content type of most files is detected as
> application/octet-stream. Is there no support for SJIS?