[ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy McLain updated TIKA-1262:
--------------------------------
Description:
The code that demonstrates this bug can be found in the attachment ChineseTextExtraction.java.
This code detects 'application/octet-stream' for the Content-Type and returns
an empty string for the contents. It should detect 'text/plain' for the
Content-Type and return a Unicode string of the contents of the file. I don't
see a way to attach files here so I'll update this once I get the file uploaded
somewhere.
was:
This code detects 'application/octet-stream' for the Content-Type and returns
an empty string for the contents. It should detect 'text/plain' for the
Content-Type and return a Unicode string of the contents of the file. I don't
see a way to attach files here so I'll update this once I get the file uploaded
somewhere.
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import java.io.*;

public class ChineseTextExtraction {
    // Experimenting with getting the file meta data and a Unicode version
    // of the file's text using as little code as possible.
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        // GB2312 is a subset of GB18030 so either charset is correct for this file.
        String filepath = "GB2312.txt";
        TikaInputStream reader = TikaInputStream.get(new File(filepath));
        String contents = tika.parseToString(reader, metadata);
        reader.close();
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        FileWriter writer = new FileWriter("GB2312-converted.txt");
        writer.write(contents);
        writer.close();
    }
}
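
For comparison, the expected result (a Unicode string of the file's contents) can be sketched with the JDK alone by naming the charset explicitly instead of relying on detection. This is only an illustration of the expected outcome and of the GB2312/GB18030 remark in the code comment above. It is not part of the report and does not use Tika. The class name GbDecodeSketch, the decode helper, and the four sample bytes are assumptions made up for this sketch; the bytes stand in for the attached GB2312.txt.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the attached reproduction code.
class GbDecodeSketch {
    // Decode raw bytes with an explicitly named charset (no detection involved).
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Sample input: the two characters U+4E2D U+6587 encoded as GB2312
        // (two bytes per character). Stands in for the contents of GB2312.txt.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        String viaGb2312 = decode(raw, "GB2312");
        String viaGb18030 = decode(raw, "GB18030");

        // Both decoders are expected to agree on GB2312-encoded input.
        System.out.println("same result: " + viaGb2312.equals(viaGb18030));

        // Re-encode as UTF-8, as one would before writing a converted file.
        byte[] utf8 = viaGb18030.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8.length);
    }
}
```

If both decoders agree on the real file as well, the empty string returned by parseToString would point at the detection step rather than at the charset decoders themselves, though that is an inference and not something this sketch verifies.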
> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
> Key: TIKA-1262
> URL: https://issues.apache.org/jira/browse/TIKA-1262
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.5
> Environment: Java 1.7; Windows 7 64 bit
> Reporter: Jeremy McLain
> Attachments: ChineseTextExtraction.java, GB2312.txt