[jira] [Commented] (TIKA-4491) The encoding format is ansi, GB18030 txt document, and the parsed content returns an empty String

Tilman Hausherr (Jira) Fri, 19 Sep 2025 01:54:25 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021364#comment-18021364
 ]


Tilman Hausherr commented on TIKA-4491:
---------------------------------------

Please attach a sample file that reproduces the problem and include more of 
your code, plus the config file if there is one.

> The encoding format is ansi, GB18030 txt document, and the parsed content 
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4491
>                 URL: https://issues.apache.org/jira/browse/TIKA-4491
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 3.0.0
>         Environment: Tika 3.0.0
>            Reporter: yuying zhang
>            Priority: Major
>
> When I use AutoDetectParser to parse txt documents encoded as ANSI or 
> GB18030, the parsed content is an empty string. When I checked 
> AutoDetectParser calling parse(inputStream, handler, metadata, context) 
> to parse the text, I found that the detected type is 
> application/octet-stream, which is inconsistent with the text/plain 
> returned for a txt document encoded in UTF-8. I tried detecting the file 
> type through tika.detect(file) before calling parse and setting it as the 
> Content-Type of the metadata, and the problem was solved.
> Why does this problem occur? Why does detector.detect(tis, metadata) 
> return application/octet-stream, while tika.detect(file) returns 
> text/plain?
> {code:java}
> String type = tika.detect(file);
> metadata.set(Metadata.CONTENT_TYPE, type);
> autoDetectParser.parse(inputStream, handler, metadata, context);
> {code}
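
A sample file like the one requested above can be generated with the JDK 
alone. The following is a minimal sketch, not code from the report: the class 
name Gb18030Sample, the sample text, and the temp-file prefix are all 
hypothetical, and it assumes the JDK in use ships the GB18030 charset. It 
writes a short Chinese string as GB18030 and checks that the resulting bytes 
are not valid UTF-8, which is one way to confirm the file really is in the 
legacy encoding before attaching it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for producing a small GB18030-encoded test file.
class Gb18030Sample {

    // Assumed sample text (four Chinese characters plus a newline),
    // written with Unicode escapes so the source file itself stays ASCII.
    static final String TEXT = "\u4e2d\u6587\u6d4b\u8bd5\n";

    // Encodes TEXT as GB18030, writes it to target, and returns the bytes.
    static byte[] writeSample(Path target) throws IOException {
        byte[] bytes = TEXT.getBytes(Charset.forName("GB18030"));
        Files.write(target, bytes);
        return bytes;
    }

    // Returns true only if the bytes decode as strict UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("gb18030-sample", ".txt");
        byte[] bytes = writeSample(sample);
        System.out.println("wrote " + bytes.length + " bytes to " + sample);
        System.out.println("valid UTF-8: " + isValidUtf8(bytes));
    }
}
```

The generated file can then be fed to the parsing code from the report to 
see whether the empty result reproduces with it.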



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
