[jira] [Updated] (TIKA-1262) parseToString fails to detect content-type / charset for GB2312 text file

Jeremy McLain (JIRA) Wed, 19 Mar 2014 16:36:12 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jeremy McLain updated TIKA-1262:
--------------------------------

    Description: 
The code that demonstrates this bug can be found in attachment: 

This code detects 'application/octet-stream' for the Content-Type and returns 
an empty string for the contents. It should detect 'text/plain' for the 
Content-Type and return a Unicode string of the contents of the file. I don't 
see a way to attach files here so I'll update this once I get the file uploaded 
somewhere.



  was:
This code detects 'application/octet-stream' for the Content-Type and returns 
an empty string for the contents. It should detect 'text/plain' for the 
Content-Type and return a Unicode string of the contents of the file. I don't 
see a way to attach files here so I'll update this once I get the file uploaded 
somewhere.

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

import java.io.*;

public class ChineseTextExtraction {

    // Experimenting with getting the file metadata and a Unicode version
    // of the file's text using as little code as possible.
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();

        // GB2312 is a subset of GB18030, so either charset is correct for
        // this file.
        String filepath = "GB2312.txt";
        TikaInputStream reader = TikaInputStream.get(new File(filepath));

        String contents = tika.parseToString(reader, metadata);

        reader.close();

        for(String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

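        // Write the extracted text as UTF-8; FileWriter would use the platform
        // default charset, which may not be able to represent Chinese text.
        Writer writer = new OutputStreamWriter(
                new FileOutputStream("GB2312-converted.txt"), "UTF-8");
        writer.write(contents);
        writer.close();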
    }
}
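
As a side check on the comment in the code above about GB2312 and GB18030, here is a minimal sketch that uses only the JDK (no Tika). The class name CharsetSubsetCheck, the helper roundTrip, and the sample string are made up for illustration and are not part of the attached reproduction. It assumes the JRE ships both the GB2312 and GB18030 charsets, which standard builds normally do.

```java
import java.nio.charset.Charset;

public class CharsetSubsetCheck {

    // Encodes the text as GB2312, then decodes those same bytes as GB18030.
    // If GB18030 is a superset of GB2312, the text comes back unchanged.
    static String roundTrip(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // Hypothetical sample: four common Chinese characters, written as
        // escapes so the source file's own encoding does not matter.
        String sample = "\u4e2d\u6587\u6587\u672c";
        String decoded = roundTrip(sample);
        System.out.println("round trip preserved text: " + sample.equals(decoded));
    }
}
```

This only shows that bytes produced by a GB2312 encoder decode cleanly under GB18030; it says nothing about how Tika's detector chooses between the two.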



> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>         Attachments: ChineseTextExtraction.java, GB2312.txt
>
>
> The code that demonstrates this bug can be found in attachment: 
> This code detects 'application/octet-stream' for the Content-Type and returns 
> an empty string for the contents. It should detect 'text/plain' for the 
> Content-Type and return a Unicode string of the contents of the file. I don't 
> see a way to attach files here so I'll update this once I get the file 
> uploaded somewhere.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
