[
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554871#comment-13554871
]
Karl Wright commented on CONNECTORS-613:
----------------------------------------
It seems to be as I feared: Tika behaves differently in Solr 4.0, and
apparently no longer tries to detect what kind of document it has been given
when the MIME type is "application/octet-stream". We now need to do further
research to figure out how to convince Tika to extract the content. Is there a
content type that will convince Tika to do that?
The place to experiment is in the Solr connector, in HttpPoster.java around
line 1130. There you can make the document's content stream report whatever
content type you like:
{code}
@Override
public String getContentType()
{
  return "application/octet-stream";
}
{code}
But you can't lie to Tika and always say it is text, because then it probably
won't attempt to extract content from binary documents. So returning a
content type of "text/plain" is not going to work in all cases. Things you
might try:
- "application/unknown"
- ""
- null
If any of these works for you, please let me know and I will include the fix
in MCF 1.1. Otherwise it becomes a Tika bug, or at least a Tika question.
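To cycle through those candidates without editing the method by hand each
time, the content type could be passed in rather than hard-coded. This is only
a rough sketch under stated assumptions: CandidateStream is a hypothetical
stand-in, not the actual class in HttpPoster.java, and the candidate list is
just the current value plus the values suggested above.

```java
import java.util.Arrays;
import java.util.List;

public class ContentTypeCandidates {

    /** Hypothetical stand-in for the connector's content stream (not the real class). */
    public static class CandidateStream {
        private final String contentType;

        public CandidateStream(String contentType) {
            this.contentType = contentType;
        }

        // Same method name as in the snippet above, but the value is injected
        // instead of being hard-coded to "application/octet-stream".
        public String getContentType() {
            return contentType;
        }
    }

    /** The current value followed by the candidates suggested in this comment. */
    public static List<String> candidates() {
        // Arrays.asList tolerates the null entry.
        return Arrays.asList("application/octet-stream", "application/unknown", "", null);
    }

    /** Makes null and empty values visible when logging which candidate is in use. */
    public static String describe(String contentType) {
        if (contentType == null) {
            return "<null>";
        }
        if (contentType.isEmpty()) {
            return "<empty>";
        }
        return contentType;
    }

    public static void main(String[] args) {
        for (String candidate : candidates()) {
            CandidateStream stream = new CandidateStream(candidate);
            System.out.println("content type to try: " + describe(stream.getContentType()));
        }
    }
}
```

Each candidate would still have to be tested against a real Solr 4.x instance
with both an sjis text file and a binary document; the sketch only enumerates
the values and does not say which one Tika accepts.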
> The content of sjis file can't be extracted
> -------------------------------------------
>
> Key: CONNECTORS-613
> URL: https://issues.apache.org/jira/browse/CONNECTORS-613
> Project: ManifoldCF
> Issue Type: Bug
> Components: File system connector, Lucene/SOLR connector
> Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
> Environment: Solr 4.x (not Solr 3.x)
> Reporter: Shinichiro Abe
> Fix For: ManifoldCF 1.1
>
> Attachments: files.zip
>
>
> When posting an sjis text file using curl, the content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F
> "[email protected]"
> {noformat}
> But when posting this file through the File system connector, the content
> can't be extracted; the result is empty.
> The content of a utf-8 text file, however, does seem to be extracted.