[
https://issues.apache.org/jira/browse/TIKA-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924032#comment-17924032
]
Subbu edited comment on TIKA-4370 at 2/5/25 12:05 PM:
------------------------------------------------------
[~tallison] I reviewed this further and see that even UTF-8 files go
through TextDetector to determine whether they are text files.
I tried org.apache.tika.parser.txt.TXTParser separately and used the encoding
detector, which found that the incoming file is Shift_JIS. Do you see a
problem with using TXTParser to determine whether a file is Shift_JIS, and
then returning text/plain in the detector?
Sorry if my understanding is wrong.
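To make the idea concrete, here is a rough sketch of "check whether the bytes are Shift_JIS, then report text/plain". It uses only JDK classes, not Tika's TXTParser or Detector API; the class name SjisSniffer and the methods decodesAs and guessType are hypothetical names made up for illustration. It also assumes the JDK in use ships the Shift_JIS charset (standard full JDK builds do).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of Tika.
class SjisSniffer {

    // True if the bytes decode strictly in the given charset and the
    // result contains no control characters other than tab, LF and CR.
    static boolean decodesAs(byte[] prefix, String charsetName) {
        try {
            CharBuffer chars = Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(prefix));
            for (int i = 0; i < chars.length(); i++) {
                char c = chars.charAt(i);
                if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                    return false; // NUL and similar bytes suggest binary data
                }
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable byte sequence
        }
    }

    // Reports text/plain only when the bytes look like Shift_JIS text.
    static String guessType(byte[] prefix) {
        if (prefix.length == 0) {
            return "application/octet-stream";
        }
        return decodesAs(prefix, "Shift_JIS")
                ? "text/plain"
                : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sjis = "\u65e5\u672c\u8a9e".getBytes(Charset.forName("Shift_JIS"));
        byte[] binary = {0x00, 0x01, (byte) 0xFF};
        System.out.println(guessType(sjis));
        System.out.println(guessType(binary));
    }
}
```

Two caveats with this approach: a prefix that is cut in the middle of a two-byte character fails the strict decode, and some binary data may still happen to decode cleanly, so a check like this can give both false negatives and false positives.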
was (Author: JIRAUSER307746):
[~tallison] I reviewed this further and see that UTF-8 files go through
TextDetector to determine whether they are text files. I tried TXTParser
separately and used the encoding detector, which found that the incoming file
is Shift_JIS. Do you see a problem with using TXTParser to determine whether a
file is Shift_JIS, and then returning text/plain in the detector?
Sorry if my understanding is wrong.
> SJIS Encoded Files Can't be Detected
> ------------------------------------
>
> Key: TIKA-4370
> URL: https://issues.apache.org/jira/browse/TIKA-4370
> Project: Tika
> Issue Type: Bug
> Reporter: Subbu
> Priority: Major
>
> When the character encoding of a file is SJIS and no file name is given in
> the metadata, the content type of most files is detected as
> application/octet-stream. Is there no support for SJIS?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)