[
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556038#comment-13556038
]
Karl Wright commented on CONNECTORS-613:
----------------------------------------
I just checked what cURL actually sends in the multipart data.
It appears to guess the content type from the file extension:
{code}
002c: Content-Disposition: form-data; name="myfile"; filename="sjis.tx
006c: t"
0070: Content-Type: text/plain
008a:
=> Send data, 49 bytes (0x31)
0000: This is a sjis text. .........{.....e.L.X.g.....B
=> Send data, 48 bytes (0x30)
0000:
{code}
So the assumption that cURL is not providing a content type is incorrect. When
I rename "sjis.txt" to just plain "sjis", and post that, then I get this:
{code}
002c: Content-Disposition: form-data; name="myfile"; filename="sjis"
006c: Content-Type: application/octet-stream
0094:
=> Send data, 49 bytes (0x31)
0000: This is a sjis text. .........{.....e.L.X.g.....B
=> Send data, 48 bytes (0x30)
0000:
{code}
I'd really love to know whether Tika properly extracts sjis from THAT post.
Abe-san, can you check? Or better yet, can you tell me how you are seeing
exactly what Tika extracts, and I can check here?
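
For anyone who wants to reproduce the two part headers without cURL, here is a minimal sketch of an extension-based content-type guess. It only mirrors what the two traces above show (".txt" gives text/plain, no extension gives application/octet-stream). The table and the function names guess_content_type and part_headers are made up for illustration; this is not cURL's actual lookup table or code.

```python
import os

# Hypothetical, deliberately tiny extension table for illustration only.
# It is NOT cURL's real table; it just covers the case seen in the trace.
EXTENSION_TYPES = {
    ".txt": "text/plain",
}

# Fallback observed in the second trace (file renamed to plain "sjis").
DEFAULT_TYPE = "application/octet-stream"


def guess_content_type(filename):
    """Return a content type chosen only from the file extension."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), DEFAULT_TYPE)


def part_headers(field_name, filename):
    """Build the per-part headers of a multipart/form-data file upload."""
    return (
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {guess_content_type(filename)}\r\n"
        "\r\n"
    )


if __name__ == "__main__":
    # Print the headers for both file names used in the experiment.
    for name in ("sjis.txt", "sjis"):
        print(part_headers("myfile", name))
```

Note that neither header carries a charset parameter, so the receiving side has to work out the encoding of the body on its own.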
> The content of sjis file can't be extracted
> -------------------------------------------
>
> Key: CONNECTORS-613
> URL: https://issues.apache.org/jira/browse/CONNECTORS-613
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector, Lucene/SOLR connector
> Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
> Environment: Solr 4.x (not Solr 3.x)
> Reporter: Shinichiro Abe
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.1
>
> Attachments: files.zip
>
>
> When an sjis text file is posted using curl, its content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@sjis.txt"
> {noformat}
> But when the same file is posted through the File system connector, its content can't be extracted;
> the result is empty.
> The content of a utf-8 text file, by contrast, does seem to be extracted.