[
https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yuying zhang updated TIKA-4491:
-------------------------------
Description:
*Problem Description:*
When using *AutoDetectParser* to parse {{.txt}} files encoded in *ANSI* or
*GB18030*, the parsed content is empty.
Debugging shows that during the call:
autoDetectParser.parse(inputStream, handler, metadata, context);
the detected content type is:
application/octet-stream
!image-2025-10-12-21-21-04-527.png|width=528,height=327!
However, {{.txt}} files encoded in *UTF-8* are correctly detected as
{{text/plain}}.
Manually detecting the file type via {{tika.detect(file)}} and setting it to
metadata before parsing resolves the issue:
{code:java}
package org.example.documentparse;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class TikaEncodingTest {
    public static void main(String[] args) throws Exception {
        // 1. Prepare test file (ANSI or GB18030 encoded txt), or "sample-gb18030.txt"
        File file = new File("D:\\javaWebLearn\\testFile\\test_ansi.txt");

        // 2. Create Tika and AutoDetectParser
        Tika tika = new Tika();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();

        // 3. Parse using InputStream directly (content may be empty)
        try (InputStream inputStream = new FileInputStream(file)) {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content parsed directly: " + handler.toString());
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
        }
        System.out.println("-----------------------------------------");

        // 4. Use Tika.detect(file) to manually set Content-Type
        String type = tika.detect(file);
        metadata.set(Metadata.CONTENT_TYPE, type);
        handler = new BodyContentHandler(); // reset handler
        try (InputStream inputStream = new FileInputStream(file)) {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content after setting Content-Type: " + handler.toString());
            System.out.println("Metadata Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
        }
    }
}
{code}
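One possible explanation, offered as an assumption and not confirmed anywhere in this issue: a detector that only sees the leading bytes of a stream tends to accept them as text when they are mostly ASCII or form valid UTF-8. A file consisting mainly of Chinese characters in GB18030 (or in the Windows "ANSI" code page, which on a Chinese-locale system is usually GBK) satisfies neither condition. The sketch below uses only the JDK to show that property of the bytes. The class and method names are hypothetical and are not part of Tika's API, and the sample text is made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: checks two byte-level properties
// that a content sniffer might plausibly use to recognise plain text.
public class EncodingSniffSketch {

    // True if the bytes decode as strict UTF-8 (malformed input is reported, not replaced).
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Fraction of bytes that fall in the 7-bit ASCII range.
    public static double asciiRatio(byte[] bytes) {
        if (bytes.length == 0) {
            return 1.0;
        }
        int ascii = 0;
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                ascii++;
            }
        }
        return (double) ascii / bytes.length;
    }

    public static void main(String[] args) {
        // Made-up sample: four Chinese characters, written as escapes so the
        // source file compiles the same way regardless of its own encoding.
        String sample = "\u4e2d\u6587\u6d4b\u8bd5";
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        byte[] gb18030 = sample.getBytes(Charset.forName("GB18030"));

        System.out.println("UTF-8 bytes valid UTF-8:   " + isValidUtf8(utf8));
        System.out.println("GB18030 bytes valid UTF-8: " + isValidUtf8(gb18030));
        System.out.println("GB18030 ASCII ratio:       " + asciiRatio(gb18030));
    }
}
```

If this is what happens here, it would also be consistent with {{tika.detect(file)}} succeeding, since a call that knows the file name can fall back on the {{.txt}} extension, while a bare {{InputStream}} carries no name. That reading is likewise an assumption and would need to be checked against the Tika source.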
was:
When I use AutoDetectParser to parse .txt documents encoded in ANSI or
GB18030, the parsed content is an empty string. Stepping into the
AutoDetectParser call ??parse(inputStream, handler, metadata, context)??, I
found that the detected type is application/octet-stream, whereas a .txt
document encoded in UTF-8 is detected as text/plain. Detecting the file type
with ??tika.detect(file)?? before calling parse, and setting the result as the
Content-Type in the metadata, solved the problem.
Why does this problem occur? Why does ??detector.detect(tis, metadata)??
return application/octet-stream, while ??tika.detect(file)?? returns
text/plain?
{code:java}
String type = tika.detect(file);
metadata.set(Metadata.CONTENT_TYPE, type);
autoDetectParser.parse(inputStream, handler, metadata, context);
{code}
> The encoding format is ansi, GB18030 txt document, and the parsed content
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
> Key: TIKA-4491
> URL: https://issues.apache.org/jira/browse/TIKA-4491
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 3.0.0
> Environment: Tika 3.0.0
> Reporter: yuying zhang
> Priority: Major
> Attachments: image-2025-10-12-21-21-04-527.png, test_ansi.txt