[jira] [Comment Edited] (CONNECTORS-613) The content of sjis file can't be extracted

Karl Wright (JIRA) Wed, 16 Jan 2013 01:30:22 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554871#comment-13554871
 ]


Karl Wright edited comment on CONNECTORS-613 at 1/16/13 9:29 AM:
-----------------------------------------------------------------

It seems to be as I feared: Tika is behaving differently in Solr 4.0, and 
apparently no longer tries to identify what kind of document it has if the mime 
type is "application/octet-stream".  We now need to do further research to 
figure out how to convince Tika to extract the content.  Is there a content 
type that will convince Tika to do that?

The place to experiment is in the Solr connector in HttpPoster.java around line 
1130.  You can make the document's content stream have whatever content type 
you like there:

{code}
    // Content type reported for the posted document's content stream.
    @Override
    public String getContentType()
    {
      return "application/octet-stream";
    }
{code}

But you can't lie to Tika and always say it is text, because then it probably 
won't attempt to extract content from binary documents.  So returning a 
content type of "text/plain" is not going to work in all cases.  Things you 
might try:

- "application/unknown"
- ""
- null

If any of these works for you, please let me know and I will include the fix 
in MCF 1.1.  Otherwise it becomes a Tika bug, or at least a Tika question.
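
As a way to organize that experiment, here is a minimal, self-contained sketch 
that walks through the candidate values.  Note that ContentTypeExperiment, 
TypedStream, streamWithType and describe are hypothetical names made up for 
illustration; they are not the actual classes in HttpPoster.java or SolrJ, and 
nothing here posts to Solr.  It only shows how the return value of 
getContentType() could be swapped between trials:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the connector code; not the real HttpPoster.
class ContentTypeExperiment {

    // Hypothetical minimal interface exposing only the method under discussion.
    interface TypedStream {
        String getContentType();
    }

    // Current value first, then the three alternatives suggested above.
    static List<String> candidates() {
        return Arrays.asList("application/octet-stream", "application/unknown", "", null);
    }

    // Builds a stream that reports the given content type (which may be null).
    static TypedStream streamWithType(final String contentType) {
        return new TypedStream() {
            @Override
            public String getContentType() {
                return contentType;
            }
        };
    }

    // Makes the empty-string and null candidates visible when printed.
    static String describe(String contentType) {
        if (contentType == null) {
            return "<null>";
        }
        return contentType.isEmpty() ? "<empty>" : contentType;
    }

    public static void main(String[] args) {
        // One trial per candidate; a real trial would post the sjis file here.
        for (String candidate : candidates()) {
            TypedStream stream = streamWithType(candidate);
            System.out.println("trial content type: " + describe(stream.getContentType()));
        }
    }
}
```

In the real connector the equivalent would be a one-line change to the return 
value of getContentType(), followed by re-crawling the sjis file and checking 
whether its content shows up in Solr.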


                
                  
> The content of sjis file can't be extracted
> -------------------------------------------
>
>                 Key: CONNECTORS-613
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-613
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector, Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.0.1, ManifoldCF 1.1
>         Environment: Solr 4.x (not Solr 3.x)
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 1.1
>
>         Attachments: files.zip
>
>
> When posting an sjis text file using curl, the content can be extracted.
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "[email protected]"
> {noformat}
> But when posting this file with the File system connector, the content can't 
> be extracted; the result is empty.
> It seems that the content of a utf-8 text file can be extracted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
