[ 
https://issues.apache.org/jira/browse/CONNECTORS-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070137#comment-14070137
 ] 

Karl Wright commented on CONNECTORS-984:
----------------------------------------

Hi Abe-san,

It's time to start thinking about finishing up the 1.7 release.  I think this 
work would be very important to include.  Do you think you will be able to get 
to it before we freeze the release in two weeks?  If not, can you give me some 
general instructions on how to do the things you mention in the ticket?

Thanks!

> Give Tika's metadata some hints
> -------------------------------
>
>                 Key: CONNECTORS-984
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>             Fix For: ManifoldCF 1.7
>
>
> Component: Tika connector
> Currently, in trunk code, we don't set any data in Tika's metadata object.
> We likely need to give the metadata some hints so that Tika can detect the 
> document type and extract its content:
> * resourceName
> * ContentType
> * stream size
> * charset (new feature)
> * password handling (new feature)
> Also, when a TikaException (e.g. a parsing error in PDFBox/POI) is thrown, we 
> need to decide whether or not to ignore the document being parsed. Solr Cell 
> has an 'ignoreTikaException' param. When a TikaException is thrown, if the 
> param is true, only the metadata is indexed; if it is false, Solr responds 
> with a server error and the document is not indexed.
> Reference (Solr Cell):
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142
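
For discussion, the hints and the ignore-on-failure policy described in the
ticket could be sketched roughly as below. This is only an illustration, not
the connector's code: it deliberately avoids depending on Tika, so the
Extractor interface, the Result class, and the method names are all
hypothetical. The key strings are assumed to match the names Tika uses for
these metadata entries and should be checked against the Tika version in use.
Password handling is left out of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: models the hint keys and the failure policy without using Tika. */
final class TikaHintsSketch {

    /** Hypothetical stand-in for the call that would hand the stream to Tika. */
    interface Extractor {
        String extract(Map<String, String> hints) throws Exception;
    }

    /** Outcome of one extraction attempt; body is null when only metadata is kept. */
    static final class Result {
        final Map<String, String> metadata;
        final String body;

        Result(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    /** Collects the hints listed in the ticket; unknown values are simply skipped. */
    static Map<String, String> buildHints(String resourceName, String contentType,
                                          long streamSize, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        // Key names are assumptions about Tika's metadata keys, not verified here.
        if (resourceName != null) hints.put("resourceName", resourceName);
        if (contentType != null) hints.put("Content-Type", contentType);
        if (streamSize >= 0) hints.put("Content-Length", Long.toString(streamSize));
        if (charset != null) hints.put("Content-Encoding", charset);
        return hints;
    }

    /**
     * Mirrors the behaviour the ticket attributes to Solr Cell's
     * 'ignoreTikaException': on failure, either keep the metadata only
     * or fail the whole document.
     */
    static Result extractWithPolicy(Extractor extractor, Map<String, String> hints,
                                    boolean ignoreParseFailure) {
        try {
            return new Result(hints, extractor.extract(hints));
        } catch (Exception e) {
            if (ignoreParseFailure) {
                return new Result(hints, null); // index metadata only
            }
            throw new IllegalStateException("Document could not be parsed", e);
        }
    }
}
```

With ignoreParseFailure set to true, a failing extractor yields a Result whose
body is null, so the caller can still index the metadata; with false, the
exception propagates and the document is rejected.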



--
This message was sent by Atlassian JIRA
(v6.2#6252)
