Shinichiro Abe created CONNECTORS-984:
-----------------------------------------

             Summary: Give Tika's metadata some hints
                 Key: CONNECTORS-984
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Amazon CloudSearch output connector
    Affects Versions: ManifoldCF 1.7
            Reporter: Shinichiro Abe
             Fix For: ManifoldCF 1.7


Component: Tika connector

Currently the trunk code does not set any data in Tika's Metadata object.
We probably need to give the metadata some hints so that Tika can detect the
document type and extract its content:
* resourceName
* ContentType
* stream size
* charset (new feature)
* password handling (new feature)
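
As a rough sketch of the first four hints (not the actual patch), they could be gathered like this. The class name TikaHints and the method build() are hypothetical, and the code uses a plain Map so it stands alone. The keys are, as far as I know, the string values behind Tika 1.x's Metadata.RESOURCE_NAME_KEY, CONTENT_TYPE, CONTENT_LENGTH and CONTENT_ENCODING, so each entry could be copied into the real object with metadata.set(key, value). Passwords are left out because, as I understand it, Tika takes them through a PasswordProvider in the ParseContext rather than through Metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects detection hints before calling the parser. */
final class TikaHints {
    // Assumed to match the string values of the Tika 1.x Metadata constants.
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_ENCODING = "Content-Encoding";

    private TikaHints() {}

    /**
     * Builds the hint map. Unknown values (null, empty, or a negative size)
     * are skipped so that Tika falls back to its own detection for them.
     */
    static Map<String, String> build(String fileName, String mimeType,
                                     long size, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (fileName != null && !fileName.isEmpty()) {
            hints.put(RESOURCE_NAME, fileName);
        }
        if (mimeType != null && !mimeType.isEmpty()) {
            hints.put(CONTENT_TYPE, mimeType);
        }
        if (size >= 0) {
            hints.put(CONTENT_LENGTH, Long.toString(size));
        }
        if (charset != null && !charset.isEmpty()) {
            hints.put(CONTENT_ENCODING, charset);
        }
        return hints;
    }
}
```

Skipping unknown values instead of writing empty strings is a deliberate choice here: a wrong hint is likely worse for detection than a missing one.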

Also, when a TikaException (e.g. a parsing error in PDFBox/POI) is thrown, we
need to decide whether to ignore the failure or reject the document. Solr Cell
has an 'ignoreTikaException' param for this. When a TikaException is thrown and
the param is true, only the metadata is indexed; when it is false, Solr
responds with a server error and the document is not indexed.
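
To make the two behaviours concrete, here is a minimal sketch of that policy. It is not Solr Cell's code and not a proposed patch: ExtractionPolicy, ParseFailure (a stand-in for TikaException), BodyExtractor and the ignoreParseFailure flag are all hypothetical names chosen so the example compiles without Tika on the classpath.

```java
import java.util.Map;

/** Hypothetical sketch of an 'ignore parse failure' switch. */
final class ExtractionPolicy {

    /** Stand-in for org.apache.tika.exception.TikaException. */
    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    /** Wraps the real parser call; returns the extracted body text. */
    interface BodyExtractor {
        String extract() throws ParseFailure;
    }

    /** Metadata plus body; body is null when parsing failed and was ignored. */
    static final class Result {
        final Map<String, String> metadata;
        final String body;

        Result(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    private ExtractionPolicy() {}

    /**
     * Runs the extractor. On failure, either keeps the metadata only
     * (ignoreParseFailure == true) or rethrows so the caller can reject
     * the document (ignoreParseFailure == false).
     */
    static Result extract(Map<String, String> metadata,
                          BodyExtractor extractor,
                          boolean ignoreParseFailure) throws ParseFailure {
        try {
            return new Result(metadata, extractor.extract());
        } catch (ParseFailure e) {
            if (ignoreParseFailure) {
                return new Result(metadata, null);
            }
            throw e;
        }
    }
}
```

In a connector, the rethrow branch would presumably be mapped to whatever "document rejected" status the framework expects; that mapping is outside this sketch.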

Reference (Solr Cell):
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142



--
This message was sent by Atlassian JIRA
(v6.2#6252)