[ 
https://issues.apache.org/jira/browse/TIKA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577471#comment-17577471
 ] 

Tim Allison commented on TIKA-3834:
-----------------------------------

I confirmed this behavior in 2.4.2-SNAPSHOT in tika-app and tika-server.

The issue is that tika-app passes the file name along as a file type hint, 
whereas tika-server does not.  Detecting text files without a file name is 
actually hard; without one, Tika detects this file as unknown 
(application/octet-stream), so it is never sent for charset detection under 
the TextAndCSVParser.

If you add the filename to your request to tika-server, Tika correctly detects 
this as text:

{{curl -T 112.csv http://localhost:9998/tika -H "Content-Disposition: attachment; filename=foo.csv"}}
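
For clients that call tika-server from code rather than from curl, the same request can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not anything shipped with Tika: the helper names build_tika_request and extract_text are hypothetical, the base URL is the one from the curl command above, the Accept header is taken from the reporter's intended request, and the sketch assumes an ASCII file name without spaces and a UTF-8 response body.

```python
import os
import urllib.request


def build_tika_request(path, base_url="http://localhost:9998"):
    """Build a PUT request for the /tika endpoint that carries the file name
    as a detection hint, mirroring the curl command above.
    (Hypothetical helper; assumes an ASCII file name without spaces.)"""
    with open(path, "rb") as f:
        body = f.read()
    name = os.path.basename(path)
    headers = {
        # Ask for plain-text output, as the reporter intended.
        "Accept": "text/plain",
        # The file name hint that lets Tika treat the upload as text/CSV.
        "Content-Disposition": f"attachment; filename={name}",
    }
    return urllib.request.Request(
        f"{base_url}/tika", data=body, headers=headers, method="PUT"
    )


def extract_text(path, base_url="http://localhost:9998"):
    """Send the request to a running tika-server and return the response
    body as a string (assumes the server replies in UTF-8)."""
    with urllib.request.urlopen(build_tika_request(path, base_url)) as resp:
        return resp.read().decode("utf-8")
```

With a server listening on the default port, extract_text("112.csv") would send the upload with the file name hint attached; building the request separately makes it easy to inspect the headers without a server running.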


> Tika-Server can not get the text of a document encoding in GB18030.
> -------------------------------------------------------------------
>
>                 Key: TIKA-3834
>                 URL: https://issues.apache.org/jira/browse/TIKA-3834
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>    Affects Versions: 2.3.0
>         Environment: Linux
>            Reporter: Di Dongke
>            Priority: Critical
>              Labels: tika-server
>         Attachments: 111.csv, 112.csv
>
>
> There are 2 files :
> 111.csv (Content-Encoding: UTF-8)
> 112.csv (Content-Encoding: GB18030)
>  
> Tika-app can get the text of the two files.
> java -jar tika-app-1.24.1.jar -t 111.csv
> java -jar tika-app-1.24.1.jar -t 112.csv
>  
> Tika-server can get the text of 111.csv.
> curl -T 111.csv http://127.0.0.1:12000/tika --head "Accept: text/plain"
>  
> {color:#FF0000}But Tika-server can not get the text of 112.csv.{color}
> curl -T 112.csv http://127.0.0.1:12000/tika --head "Accept: text/plain"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
