[
https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029507#comment-18029507
]
Tim Allison commented on TIKA-4491:
-----------------------------------
Text files are notoriously hard to detect. Tika uses the file name as a hint
if none of the magic-based checks match.
{quote}Why does this problem occur?
Why does {{detector.detect(tis, metadata)}} return
{{application/octet-stream}}, while {{tika.detect(file)}} returns
{{text/plain}}?
{quote}
When you use {{tika.detect(file)}}, Tika includes the file name as the
final hint. When you use a FileInputStream, Tika can't see the name of the file
and doesn't get that hint.
If you have a file, always use {{TikaInputStream.get(file, metadata)}}. If you
can only use an InputStream, try adding the file name (if there is one) to the
metadata as a hint before the parse:
{{metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "file_name.txt");}}
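A minimal sketch of both options, assuming tika-core 3.x is on the classpath. The class and method names ({{NameHintDetection}}, {{detectFromPath}}, {{detectFromStream}}) are hypothetical and exist only for illustration; the Tika calls themselves are the ones mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;

// Hypothetical helper class, not part of Tika.
public class NameHintDetection {

    private static final Detector DETECTOR = new DefaultDetector();

    // Option 1: a file on disk. TikaInputStream.get(Path, Metadata) records
    // the file name in the metadata, so the detector can fall back to it.
    static MediaType detectFromPath(Path path) throws IOException {
        Metadata metadata = new Metadata();
        try (TikaInputStream tis = TikaInputStream.get(path, metadata)) {
            return DETECTOR.detect(tis, metadata);
        }
    }

    // Option 2: only a stream is available. Supply the name by hand,
    // if one is known, before detection (or before the parse).
    static MediaType detectFromStream(InputStream in, String fileName) throws IOException {
        Metadata metadata = new Metadata();
        if (fileName != null) {
            metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
        }
        try (TikaInputStream tis = TikaInputStream.get(in)) {
            return DETECTOR.detect(tis, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NameHintDetection <file>");
            return;
        }
        Path path = Path.of(args[0]);
        System.out.println("From path:   " + detectFromPath(path));
        try (InputStream in = Files.newInputStream(path)) {
            String name = path.getFileName().toString();
            System.out.println("From stream: " + detectFromStream(in, name));
        }
    }
}
```

The same {{Metadata}} object can then be passed to {{parser.parse(...)}}, so that the parser's own detection step sees the name hint as well.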
> The encoding format is ansi, GB18030 txt document, and the parsed content
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
> Key: TIKA-4491
> URL: https://issues.apache.org/jira/browse/TIKA-4491
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 3.0.0
> Environment: Tika 3.0.0
> jdk21
> Reporter: yuying zhang
> Priority: Major
> Attachments: image-2025-10-12-21-21-04-527.png, test_ansi.txt
>
>
> *Problem Description:*
> When using *AutoDetectParser* to parse {{.txt}} files encoded in *ANSI* or
> *GB18030*, the parsed content is empty.
> Debugging shows that during the call:
> autoDetectParser.parse(inputStream, handler, metadata, context);
> the detected content type is:
> application/octet-stream
> !image-2025-10-12-21-21-04-527.png|width=528,height=327!
> However, {{.txt}} files encoded in *UTF-8* are correctly detected as
> {{text/plain}}.
> As a workaround, I detected the file type with {{tika.detect(file)}} before
> calling the parse function and set the result as the Content-Type in the
> metadata, and the problem was solved.
> {code:java}
> package org.example.documentparse;
>
> import org.apache.tika.Tika;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
>
> public class TikaEncodingTest {
>     public static void main(String[] args) throws Exception {
>         // 1. Prepare test file (ANSI or GB18030 encoded txt), or "sample-gb18030.txt"
>         File file = new File("D:\\javaWebLearn\\testFile\\test_ansi.txt");
>         // 2. Create Tika and AutoDetectParser
>         Tika tika = new Tika();
>         AutoDetectParser parser = new AutoDetectParser();
>         Metadata metadata = new Metadata();
>         BodyContentHandler handler = new BodyContentHandler();
>         ParseContext context = new ParseContext();
>         // 3. Parse using InputStream directly (content may be empty)
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content parsed directly: " + handler.toString());
>             System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
>         }
>         System.out.println("-----------------------------------------");
>         // 4. Use Tika.detect(file) to manually set Content-Type
>         String type = tika.detect(file);
>         metadata.set(Metadata.CONTENT_TYPE, type);
>         handler = new BodyContentHandler(); // reset handler
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content after setting Content-Type: " + handler.toString());
>             System.out.println("Metadata Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
>         }
>     }
> }
> {code}
> h3. *Question:*
> Why does this problem occur?
> Why does {{detector.detect(tis, metadata)}} return
> {{application/octet-stream}}, while {{tika.detect(file)}} returns
> {{text/plain}}?
> h3. *Expected Behavior:*
> * AutoDetectParser should correctly parse {{.txt}} files encoded in
> ANSI/GB18030 without requiring manual content type setting.