[jira] [Commented] (CONNECTORS-613) The content of sjis file can't be extracted

Karl Wright (JIRA) Tue, 15 Jan 2013 18:23:17 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554668#comment-13554668
 ]


Karl Wright commented on CONNECTORS-613:
----------------------------------------

So, a number of differences.  (1) Curl uses a multipart form, and Solrj just 
uses a standard post.  (2) Curl does not declare a content type, and Solrj uses 
a content-type of "application/octet-stream".  As for the content itself, I 
don't actually see that in curl, but it is no doubt the binary contents of the 
file, while in Solrj it is:

{code}
DEBUG 2013-01-15 21:00:58,643 (Thread-761) - >> "31[\r][\n]"
DEBUG 2013-01-15 21:00:58,643 (Thread-761) - >> "This is a sjis text. 
[0x82][0xb1][0x82][0xea][0x82][0xcd][0x93][0xfa][0x96]{[0x8c][0xea][0x82][0xcc][0x83]e[0x83]L[0x83]X[0x83]g[0x82][0xc5][0x82][0xb7][0x81]B"
DEBUG 2013-01-15 21:00:58,643 (Thread-761) - >> "[\r][\n]"
DEBUG 2013-01-15 21:00:58,643 (Thread-761) - >> "0[\r][\n]"
DEBUG 2013-01-15 21:00:58,643 (Thread-761) - >> "[\r][\n]"
{code}
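
As a sanity check on the framing, the first line of that log looks like an HTTP chunk-size line, so "31" should be the hex length of the payload line that follows. Here is a rough sketch in Python (not something from the connector code) that counts the payload bytes, assuming each "[0xNN]" token in the wire log stands for exactly one byte and every other character is one ASCII byte. The names PAYLOAD and CHUNK_SIZE_LINE are made up for the illustration.

```python
import re

# Payload line copied from the Solrj wire log above, rejoined across the
# mail line wrap (the wrap falls right after "text. ").
PAYLOAD = (
    "This is a sjis text. "
    "[0x82][0xb1][0x82][0xea][0x82][0xcd][0x93][0xfa][0x96]{"
    "[0x8c][0xea][0x82][0xcc][0x83]e[0x83]L[0x83]X[0x83]g"
    "[0x82][0xc5][0x82][0xb7][0x81]B"
)

# Hex digits that precede the first [\r][\n] in the log.
CHUNK_SIZE_LINE = "31"

# Assumption: one "[0xNN]" token == one byte; anything else == one ASCII byte.
tokens = re.findall(r"\[0x[0-9a-f]{2}\]|.", PAYLOAD)

declared = int(CHUNK_SIZE_LINE, 16)  # chunk sizes are written in hex
print("declared chunk size:", declared)
print("bytes counted in payload:", len(tokens))
print("framing consistent:", declared == len(tokens))
```

If the two numbers agree, the chunked framing itself is fine, and the trailing "0" chunk is just the normal end-of-body marker.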

Presuming that the stuff in the square brackets ([]) is actually hex bytes, it 
seems like it would yield reasonable sjis.  The actual content of the file is:

{code}
1805:0100  54 68 69 73 20 69 73 20-61 20 73 6A 69 73 20 74   This is a sjis t
1805:0110  65 78 74 2E 20 82 B1 82-EA 82 CD 93 FA 96 7B 8C   ext. .........{.
1805:0120  EA 82 CC 83 65 83 4C 83-58 83 67 82 C5 82 B7 81   ....e.L.X.g.....
1805:0130  42 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00   B...............
{code}

... which seems like it matches.
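
To make "seems like it matches" less of an eyeball exercise, the comparison can be done mechanically. The following is only a sketch under stated assumptions: the bracketed tokens in the wire log are hex bytes, the trailing 00 bytes in the dump are padding from the dump tool rather than file content, and Python's "shift_jis" codec is a fair stand-in for whatever sjis decoder is used downstream. The helper names are hypothetical.

```python
import re

# Hex dump of the file, copied from above.
HEXDUMP = """\
1805:0100  54 68 69 73 20 69 73 20-61 20 73 6A 69 73 20 74   This is a sjis t
1805:0110  65 78 74 2E 20 82 B1 82-EA 82 CD 93 FA 96 7B 8C   ext. .........{.
1805:0120  EA 82 CC 83 65 83 4C 83-58 83 67 82 C5 82 B7 81   ....e.L.X.g.....
1805:0130  42 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00   B...............
"""

# Payload line from the Solrj wire log, rejoined across the mail line wrap.
WIRE = (
    "This is a sjis text. "
    "[0x82][0xb1][0x82][0xea][0x82][0xcd][0x93][0xfa][0x96]{"
    "[0x8c][0xea][0x82][0xcc][0x83]e[0x83]L[0x83]X[0x83]g"
    "[0x82][0xc5][0x82][0xb7][0x81]B"
)


def dump_to_bytes(dump):
    """Collect the hex column of each dump row into one bytes object."""
    out = bytearray()
    for row in dump.splitlines():
        after_address = row.split("  ", 1)[1]        # drop "1805:0100"
        hex_part = after_address.split("   ", 1)[0]  # drop the ASCII column
        out += bytes.fromhex(hex_part.replace("-", " "))
    return bytes(out)


def wire_to_bytes(text):
    """Turn "[0xNN]" tokens into raw bytes; other characters stay ASCII."""
    return re.sub(
        rb"\[0x([0-9a-f]{2})\]",
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        text.encode("ascii"),
    )


# Assumption: the trailing zero bytes are dump padding, not part of the file.
file_bytes = dump_to_bytes(HEXDUMP).rstrip(b"\x00")
wire_bytes = wire_to_bytes(WIRE)

print("same bytes:", file_bytes == wire_bytes)
decoded = file_bytes.decode("shift_jis")  # raises if the bytes are not valid sjis
print(decoded)
```

If both checks pass, the bytes Solrj puts on the wire are the same as the file and they decode cleanly as sjis, which points away from the payload and towards how the request is described (content type, multipart vs. plain post).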

So it is not clear why Tika treats one form of document delivery differently 
than another.  Is there any way to intercept the Tika pipeline and see what 
exactly the posted bytes look like, as a debug exercise?

                
> The content of sjis file can't be extracted
> -------------------------------------------
>
>                 Key: CONNECTORS-613
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-613
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector, Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
>         Environment: Solr 4.x (not Solr 3.x)
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 1.1
>
>         Attachments: files.zip
>
>
> When posting an sjis text file using curl, the content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F 
> "[email protected]"
> {noformat} 
> But when this file is posted by the File system connector, the content can't be 
> extracted; the result is empty.
> It seems that the content of a utf-8 text file can be extracted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
