[
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553768#comment-13553768
]
Karl Wright edited comment on CONNECTORS-613 at 1/15/13 12:59 PM:
------------------------------------------------------------------
ManifoldCF always reads files from the file system as binary, through a Java
InputStream:
{code}
InputStream is = new FileInputStream(file);
{code}
These are raw bytes with no character encoding attached, so reading the file
cannot be the issue.
That leaves posting to Solr as the only place where a problem can arise. In
1.0.1 we use a custom HTTP poster, which we have used all along. It also posts
binary, with no encoding. With Solr 3.x that is sufficient for Tika to determine
what the content is. With Solr 4.x it is not sufficient, if I understand
correctly.
It is possible that curl sets certain HTTP headers when posting that neither
SolrJ nor our old proprietary code sets. For example, it may set a Content-Type
header that is more helpful to Tika than just "application/octet-stream".
ManifoldCF cannot do something like that because it has no knowledge of what
the actual content type is. The best way to diagnose this would be to take a
packet capture with Wireshark and compare a curl post with a ManifoldCF post.
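To illustrate the point that an InputStream carries bytes and no encoding, here
is a minimal sketch. It is not ManifoldCF code: the class name RawBytesDemo and
its helper methods are hypothetical, the sample text is made up, and it assumes
the JVM ships the Shift_JIS charset (standard JDK builds do). It shows that the
same bytes only become the right text when the reader picks the right charset,
which is the decision the receiving side has to make.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ManifoldCF.
class RawBytesDemo {

    // Copies every byte from the stream as-is; no decoding happens here.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decoding is a separate step, and the caller has to choose the charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JVM.
        Charset sjis = Charset.forName("Shift_JIS");
        // Sample Japanese text, written with escapes so the source file
        // encoding does not matter.
        String original = "\u65E5\u672C\u8A9E";

        // Stand-in for the contents of a Shift_JIS encoded file on disk.
        byte[] raw = readAll(new ByteArrayInputStream(original.getBytes(sjis)));

        // The bytes are identical in both calls; only the assumed charset differs.
        System.out.println("as Shift_JIS matches: " + decode(raw, sjis).equals(original));
        System.out.println("as UTF-8 matches:     "
                + decode(raw, StandardCharsets.UTF_8).equals(original));
    }
}
```

Decoding the bytes as Shift_JIS should give back the original text, while
decoding the same bytes as UTF-8 should not. Whether that is what actually goes
wrong on the Solr 4.x side is still something only the packet capture can
confirm.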
> The content of sjis file can't be extracted
> -------------------------------------------
>
> Key: CONNECTORS-613
> URL: https://issues.apache.org/jira/browse/CONNECTORS-613
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector, Lucene/SOLR connector
> Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
> Environment: Solr 4.x (not Solr 3.x)
> Reporter: Shinichiro Abe
> Fix For: ManifoldCF 1.1
>
> Attachments: files.zip
>
>
> When a Shift_JIS (sjis) text file is posted with curl, its content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F
> "[email protected]"
> {noformat}
> But when the same file is posted by the File system connector, the content
> cannot be extracted; the result is empty.
> The content of a UTF-8 text file, by contrast, does seem to be extracted.