Di Dongke created TIKA-3834:
-------------------------------

             Summary: Tika-Server cannot get the text of a document encoded in GB18030.
                 Key: TIKA-3834
                 URL: https://issues.apache.org/jira/browse/TIKA-3834
             Project: Tika
          Issue Type: Bug
          Components: tika-server
    Affects Versions: 2.3.0
         Environment: Linux
            Reporter: Di Dongke
         Attachments: 111.csv, 112.csv

There are two files:

111.csv (Content-Encoding: UTF-8)

112.csv (Content-Encoding: GB18030)
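
To illustrate how the two encodings differ at the byte level, here is a minimal sketch in Python. The sample row is hypothetical (the contents of the attachments are not reproduced in this report), and `decodes_as` is a helper name made up for the illustration.

```python
# Hypothetical sample row; NOT the real contents of 111.csv or 112.csv.
sample = "姓名,城市\n张三,北京\n"

# The same text stored in the two encodings named in this report.
utf8_bytes = sample.encode("utf-8")
gb_bytes = sample.encode("gb18030")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The GB18030 bytes are not valid UTF-8, so a consumer that assumes UTF-8
# cannot read them, while decoding with the right charset round-trips.
print(decodes_as(utf8_bytes, "utf-8"))
print(decodes_as(gb_bytes, "utf-8"))
print(gb_bytes.decode("gb18030") == sample)
```

This only shows that a GB18030 file has to be decoded with the correct charset. It does not show what Tika-server does internally.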

 

Tika-app can extract the text from both files.

java -jar tika-app-1.24.1.jar -t 111.csv

java -jar tika-app-1.24.1.jar -t 112.csv

 

Tika-server can get the text of 111.csv.

curl -T 111.csv http://127.0.0.1:12000/tika --header "Accept: text/plain"

 

{color:#FF0000}But Tika-server cannot get the text of 112.csv.{color}

curl -T 112.csv http://127.0.0.1:12000/tika --header "Accept: text/plain"
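
A possible client-side workaround, sketched in Python under the assumption that the source encoding is known in advance, is to transcode the file to UTF-8 before uploading it. This is not a fix for the server. The helper names `to_utf8`, `build_put_request`, and `extract_text` are hypothetical; the URL is the one used in the commands above, and the response is assumed to be UTF-8 text.

```python
import urllib.request

TIKA_URL = "http://127.0.0.1:12000/tika"  # endpoint taken from this report


def to_utf8(data: bytes, source_encoding: str = "gb18030") -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def build_put_request(body: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the equivalent of `curl -T <file> <url> --header "Accept: text/plain"`."""
    return urllib.request.Request(
        url, data=body, method="PUT", headers={"Accept": "text/plain"}
    )


def extract_text(path: str, source_encoding: str = "gb18030") -> str:
    """Transcode the file to UTF-8, upload it, and return the server's reply."""
    with open(path, "rb") as f:
        body = to_utf8(f.read(), source_encoding)
    with urllib.request.urlopen(build_put_request(body)) as resp:
        # Assumption: the server answers with UTF-8 encoded plain text.
        return resp.read().decode("utf-8", errors="replace")
```

With a server running at that address, `extract_text("112.csv")` would send the transcoded file; whether the result matches the Tika-app output has not been verified here.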

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
