[jira] [Updated] (TIKA-4491) The encoding format is ansi, GB18030 txt document, and the parsed content returns an empty String

yuying zhang (Jira) Sat, 18 Oct 2025 13:12:58 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


yuying zhang updated TIKA-4491:
-------------------------------
    Description: 
*Problem Description:*

When using *AutoDetectParser* to parse {{.txt}} files encoded in *ANSI* or 
*GB18030*, the parsed content is empty.
Debugging shows that during the call
{{autoDetectParser.parse(inputStream, handler, metadata, context);}}
the detected content type is {{application/octet-stream}}.
!image-2025-10-12-21-21-04-527.png|width=528,height=327!

However, {{.txt}} files encoded in *UTF-8* are correctly detected as 
{{text/plain}}.
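
To make the difference between the two cases concrete, the sketch below uses only the JDK to compare the raw bytes of the same text in UTF-8 and in GB18030. It is an illustration under assumptions, not Tika's detection code: the sample string, the class name {{EncodingSniffSketch}}, and both helper methods are hypothetical. It shows that GB18030 bytes for Chinese text lie entirely outside the 7-bit ASCII range and do not form valid UTF-8, which may be why a content-only check finds little evidence that the stream is text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Tika.
class EncodingSniffSketch {

    // Returns true if the bytes decode as strict UTF-8
    // (malformed input is reported instead of being replaced).
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Fraction of bytes that fall in the 7-bit ASCII range (0..127).
    static double asciiRatio(byte[] bytes) {
        if (bytes.length == 0) {
            return 1.0;
        }
        int ascii = 0;
        for (byte b : bytes) {
            if (b >= 0) { // Java bytes are signed; high bytes are negative
                ascii++;
            }
        }
        return (double) ascii / bytes.length;
    }

    public static void main(String[] args) {
        // Assumed sample text: six Chinese characters, written as escapes
        // so the source file encoding does not matter.
        String sample = "\u4e2d\u6587\u6d4b\u8bd5\u6587\u672c";
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        byte[] gb18030 = sample.getBytes(Charset.forName("GB18030"));

        System.out.println("UTF-8 bytes are valid UTF-8:   " + isValidUtf8(utf8));
        System.out.println("GB18030 bytes are valid UTF-8: " + isValidUtf8(gb18030));
        System.out.println("GB18030 ASCII ratio:           " + asciiRatio(gb18030));
    }
}
```

Running the same checks against the first few hundred bytes of the attached {{test_ansi.txt}} would show whether that file behaves like the GB18030 sample here.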

Manually detecting the file type via {{tika.detect(file)}} and setting it as the 
{{Content-Type}} in the metadata before parsing resolves the issue:
{code:java}
package org.example.documentparse;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class TikaEncodingTest {
    public static void main(String[] args) throws Exception {
        // 1. Prepare test file (ANSI or GB18030 encoded txt)
        File file = new File("D:\\javaWebLearn\\testFile\\test_ansi.txt"); // or "sample-gb18030.txt"

        // 2. Create Tika and AutoDetectParser
        Tika tika = new Tika();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        // 3. Parse using InputStream directly (content may be empty)
        try (InputStream inputStream = new FileInputStream(file)) {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content parsed directly: " + handler.toString());
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
        }

        System.out.println("-----------------------------------------");

        // 4. Use Tika.detect(file) to manually set Content-Type
        String type = tika.detect(file);
        metadata.set(Metadata.CONTENT_TYPE, type);
        handler = new BodyContentHandler(); // reset handler
        try (InputStream inputStream = new FileInputStream(file)) {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content after setting Content-Type: " + handler.toString());
            System.out.println("Metadata Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
        }
    }
}
 {code}
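
A possible explanation for why the workaround helps, stated as an assumption rather than a confirmed diagnosis: {{tika.detect(file)}} receives a {{File}} and so can take the {{.txt}} name into account, while a bare {{FileInputStream}} exposes only bytes. The toy detector below sketches that idea with the JDK alone. The class {{HintedDetectionSketch}}, its methods, and the strict UTF-8 heuristic are hypothetical and far simpler than Tika's real detectors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical toy detector for illustration; not Tika's implementation.
class HintedDetectionSketch {
    static final String TEXT_PLAIN = "text/plain";
    static final String OCTET_STREAM = "application/octet-stream";

    // Content-only guess: treat the bytes as text only if they are strict UTF-8.
    static String detectFromContent(byte[] head) {
        try {
            // A freshly created decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(head));
            return TEXT_PLAIN;
        } catch (CharacterCodingException e) {
            return OCTET_STREAM;
        }
    }

    // Same guess, but a ".txt" file name acts as a fallback hint
    // when the content check only yields the generic type.
    static String detectWithName(byte[] head, String fileName) {
        String byContent = detectFromContent(head);
        if (!OCTET_STREAM.equals(byContent) || fileName == null) {
            return byContent;
        }
        boolean looksLikeTxt = fileName.toLowerCase(Locale.ROOT).endsWith(".txt");
        return looksLikeTxt ? TEXT_PLAIN : OCTET_STREAM;
    }

    public static void main(String[] args) {
        // Assumed sample: four Chinese characters encoded as GB18030.
        byte[] gb18030 = "\u4e2d\u6587\u6d4b\u8bd5".getBytes(Charset.forName("GB18030"));

        // Stream-like case: no name available.
        System.out.println("without name: " + detectWithName(gb18030, null));
        // File-like case: the name carries the .txt extension.
        System.out.println("with name:    " + detectWithName(gb18030, "test_ansi.txt"));
    }
}
```

In Tika, a comparable hint can be supplied by putting the file name into the metadata under {{TikaCoreProperties.RESOURCE_NAME_KEY}} before calling {{parse}}; whether that alone resolves this issue on 3.0.0 has not been verified here.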


 

  was:
When I use AutoDetectParser to parse txt documents encoded in ANSI or GB18030, 
the parsed content is an empty string. Stepping into the AutoDetectParser call 
{{parse(inputStream, handler, metadata, context)}}, I found that the detected 
type is application/octet-stream, whereas a txt document encoded in UTF-8 is 
detected as text/plain. Detecting the file type with {{tika.detect(file)}} 
before calling parse, and setting the result as the Content-Type in the 
metadata, solved the problem.
Why does this problem occur? Why does {{detector.detect(tis, metadata)}} return 
application/octet-stream, while {{tika.detect(file)}} returns text/plain?
{code:java}
String type = tika.detect(file);
metadata.set(Metadata.CONTENT_TYPE,type);
autoDetectParser.parse(inputStream,handler,metadata,context);{code}


> The encoding format is ansi, GB18030 txt document, and the parsed content 
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4491
>                 URL: https://issues.apache.org/jira/browse/TIKA-4491
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 3.0.0
>         Environment: Tika 3.0.0
>            Reporter: yuying zhang
>            Priority: Major
>         Attachments: image-2025-10-12-21-21-04-527.png, test_ansi.txt
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
