[ 
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556038#comment-13556038
 ] 

Karl Wright commented on CONNECTORS-613:
----------------------------------------

I just took the step of actually seeing what cURL sends in the multipart data.  
It looks like it uses the file extension to make a guess about the content type:

{code}
002c: Content-Disposition: form-data; name="myfile"; filename="sjis.tx
006c: t"
0070: Content-Type: text/plain
008a: 
=> Send data, 49 bytes (0x31)
0000: This is a sjis text. .........{.....e.L.X.g.....B
=> Send data, 48 bytes (0x30)
0000: 
{code}

So the assumption that cURL is not providing a content type is incorrect.  When 
I rename "sjis.txt" to just plain "sjis", and post that, then I get this:

{code}
002c: Content-Disposition: form-data; name="myfile"; filename="sjis"
006c: Content-Type: application/octet-stream
0094: 
=> Send data, 49 bytes (0x31)
0000: This is a sjis text. .........{.....e.L.X.g.....B
=> Send data, 48 bytes (0x30)
0000: 
{code}

I'd really love to know whether Tika properly extracts sjis from THAT post.  
Abe-san, can you check?  Or better yet, can you tell me how you are seeing 
exactly what Tika extracts, and I can check here?
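
For reference, the extension-based guess visible in the two traces above can be sketched roughly as follows. This is only an illustration in Python, not cURL's actual code: the function name `guess_content_type` and the lookup table are hypothetical, and the table holds only the one mapping observed in the traces (".txt" to text/plain), with application/octet-stream as the fallback.

```python
import os

# Hypothetical lookup table: only the mapping observed in the traces above.
EXTENSION_TYPES = {".txt": "text/plain"}

# Fallback seen in the second trace, where the file had no extension.
DEFAULT_TYPE = "application/octet-stream"


def guess_content_type(filename):
    """Guess a Content-Type from the file extension alone (illustrative only)."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), DEFAULT_TYPE)


print(guess_content_type("sjis.txt"))  # text/plain
print(guess_content_type("sjis"))      # application/octet-stream
```

Note that the file's bytes are never inspected in this sketch, so a Shift_JIS file and a UTF-8 file with the same name get the same guess.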


                
> The content of sjis file can't be extracted
> -------------------------------------------
>
>                 Key: CONNECTORS-613
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-613
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector, Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
>         Environment: Solr 4.x (not Solr 3.x)
>            Reporter: Shinichiro Abe
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.1
>
>         Attachments: files.zip
>
>
> When an sjis text file is posted with curl, the content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@sjis.txt"
> {noformat} 
> But when this file is posted by the File system connector, the content can't 
> be extracted; the result is empty.
> The content of a utf-8 text file, by contrast, seems to be extracted correctly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
