Di Dongke created TIKA-3834:
-------------------------------
Summary: Tika-Server cannot get the text of a document encoded
in GB18030.
Key: TIKA-3834
URL: https://issues.apache.org/jira/browse/TIKA-3834
Project: Tika
Issue Type: Bug
Components: tika-server
Affects Versions: 2.3.0
Environment: Linux
Reporter: Di Dongke
Attachments: 111.csv, 112.csv
There are two files:
111.csv (Content-Encoding: UTF-8)
112.csv (Content-Encoding: GB18030)
Tika-app can get the text of both files:
java -jar tika-app-1.24.1.jar -t 111.csv
java -jar tika-app-1.24.1.jar -t 112.csv
Tika-server can get the text of 111.csv:
curl -T 111.csv http://127.0.0.1:12000/tika --header "Accept: text/plain"
But Tika-server cannot get the text of 112.csv:
curl -T 112.csv http://127.0.0.1:12000/tika --header "Accept: text/plain"
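For anyone trying to reproduce this without the attachments, the following is a minimal Python sketch of the same comparison. It is not taken from the report: the sample text, the helper names (`encode_sample`, `build_tika_request`, `extract_text`) and the assumption that the server answers in UTF-8 are all illustrative. Only the URL and the `Accept: text/plain` header come from the commands above.

```python
import urllib.request

# Hypothetical sample content; the real 111.csv / 112.csv attachments are not shown here.
SAMPLE_TEXT = "id,name\n1,中文测试\n"

# Server address taken from the curl commands in this report.
TIKA_URL = "http://127.0.0.1:12000/tika"


def encode_sample(text: str, encoding: str) -> bytes:
    """Encode the sample CSV text in the given character encoding."""
    return text.encode(encoding)


def build_tika_request(body: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request similar to `curl -T`, asking for plain text."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(body: bytes, url: str = TIKA_URL) -> str:
    """Send the document to a running tika-server and return the extracted text.

    Assumes the response body is UTF-8; undecodable bytes are replaced.
    """
    with urllib.request.urlopen(build_tika_request(body, url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Same text, two encodings: the report says only the UTF-8 case works.
    for encoding in ("utf-8", "gb18030"):
        text = extract_text(encode_sample(SAMPLE_TEXT, encoding))
        print(encoding, "->", repr(text))
```

Running the script against a live tika-server should show whether the GB18030 variant comes back empty or garbled while the UTF-8 variant is returned intact.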