[jira] [Comment Edited] (CONNECTORS-613) The content of sjis file can't be extracted

Karl Wright (JIRA) Tue, 15 Jan 2013 05:00:20 -0800

    [ https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553768#comment-13553768 ]


Karl Wright edited comment on CONNECTORS-613 at 1/15/13 12:59 PM:
------------------------------------------------------------------

ManifoldCF always reads files from the file system as raw binary, through a Java InputStream:

{code}
                  InputStream is = new FileInputStream(file);
{code}

This is a stream of bytes with no character encoding applied, so the read side cannot be the issue.
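As a small illustration of that point (a standalone sketch, not ManifoldCF code), the following reads Shift_JIS bytes through an InputStream and shows that they arrive unchanged; only a later decoding step has to pick a charset. The class name, helper method, and sample text are made up for the example, and it assumes the JDK in use ships the Shift_JIS charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawBytesDemo {
    // Copies the stream into a byte array; no character decoding happens here.
    static byte[] readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = is.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Shift_JIS is available in this JDK's charset set.
        Charset sjis = Charset.forName("Shift_JIS");
        // Sample Japanese text, written with Unicode escapes so the source
        // file's own encoding does not matter.
        String text = "\u65E5\u672C\u8A9E";
        byte[] original = text.getBytes(sjis);

        // A ByteArrayInputStream stands in for the FileInputStream here.
        try (InputStream is = new ByteArrayInputStream(original)) {
            byte[] read = readAll(is);
            // The bytes come through untouched.
            System.out.println(Arrays.equals(original, read));
            // Decoding with the right charset recovers the text.
            System.out.println(text.equals(new String(read, sjis)));
            // Decoding with the wrong charset does not.
            System.out.println(text.equals(new String(read, StandardCharsets.UTF_8)));
        }
    }
}
```

Running main should print true, true, false: the stream itself is encoding-neutral, and whoever decodes the bytes later (Tika, in this case) has to determine the charset.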

That leaves posting to Solr as the only place where a problem can occur.  In 
1.0.1 we use a custom HTTP poster, which we have used all along.  It also posts 
binary, with no encoding.  In Solr 3.x this is sufficient for Tika to work out 
what the content is.  In Solr 4.x it is not sufficient, if I understand correctly.

It is possible that curl sets certain HTTP headers when posting that neither 
SolrJ nor our old proprietary code sets.  For example, it may set a Content-Type 
header that is more helpful to Tika than just "application/octet-stream".  
ManifoldCF is not going to be able to do something like that, because it has no 
knowledge of what the actual content type is.  The best way to diagnose this 
would be to take a packet capture with Wireshark and compare a curl post with a 
ManifoldCF post.
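To make the header comparison concrete, here is a minimal sketch (not ManifoldCF or SolrJ code) that builds a multipart/form-data body by hand, with a per-part Content-Type, which is the kind of header a form upload can carry alongside the file bytes. The class name, boundary, field name, file name, and the "text/plain" value are all assumptions for illustration; what curl actually sends should be confirmed from the packet capture.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MultipartSketch {
    // Builds a single-part multipart/form-data body around the given bytes.
    // All parameter values are supplied by the caller; nothing here is taken
    // from ManifoldCF, SolrJ, or curl.
    static byte[] buildBody(String boundary, String fieldName, String fileName,
                            String partContentType, byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + partContentType + "\r\n"
                + "\r\n";
        out.write(head.getBytes(StandardCharsets.US_ASCII));
        // The file bytes are written as-is, with no character decoding.
        out.write(content);
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical values, chosen only to show the shape of the body.
        byte[] content = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] body = buildBody("XYZ", "myfile", "sample.txt", "text/plain", content);
        // ISO-8859-1 maps each byte to one char, so the body prints losslessly.
        System.out.print(new String(body, StandardCharsets.ISO_8859_1));
    }
}
```

In a capture, the thing to look for is whether the two posts differ in headers like the part-level Content-Type above, or only in how the body is framed.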


                
> The content of sjis file can't be extracted
> -------------------------------------------
>
>                 Key: CONNECTORS-613
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-613
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector, Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
>         Environment: Solr 4.x (not Solr 3.x)
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 1.1
>
>         Attachments: files.zip
>
>
> When posting a Shift_JIS (sjis) text file using curl, the content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "[email protected]"
> {noformat} 
> But when this file is posted by the File system connector, the content can't 
> be extracted; the result is empty.
> The content of a UTF-8 text file does seem to be extracted correctly.
