Shinichiro Abe created CONNECTORS-984:
-----------------------------------------
Summary: Give Tika's metadata some hints
Key: CONNECTORS-984
URL: https://issues.apache.org/jira/browse/CONNECTORS-984
Project: ManifoldCF
Issue Type: Improvement
Components: Amazon CloudSearch output connector
Affects Versions: ManifoldCF 1.7
Reporter: Shinichiro Abe
Fix For: ManifoldCF 1.7
Component: Tika connector
Currently the trunk code does not set any data in Tika's metadata object.
We should probably give the metadata some hints to help Tika detect the document type and extract its content:
* resourceName
* ContentType
* stream size
* charset (new feature)
* password handling (new feature)
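
As a rough illustration of the first four hints, here is a minimal, dependency-free sketch. It only collects the values into a map; the class name TikaHints and the method buildHints are hypothetical, not ManifoldCF code. The key strings are meant to mirror the values of Tika's Metadata constants (RESOURCE_NAME_KEY, CONTENT_TYPE, CONTENT_LENGTH, CONTENT_ENCODING), and real connector code would presumably call metadata.set(key, value) on an org.apache.tika.metadata.Metadata instance instead. Password handling is left out on purpose, since in Tika it is typically supplied through a PasswordProvider in the ParseContext rather than through a metadata key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF or Tika.
class TikaHints {
    // Assumed to match the string values of Tika's Metadata constants.
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_ENCODING = "Content-Encoding";

    // Collects whichever hints are known; unknown values are skipped
    // so the parser is never given a misleading hint.
    static Map<String, String> buildHints(String fileName, String mimeType,
                                          long size, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (fileName != null) {
            hints.put(RESOURCE_NAME, fileName);
        }
        if (mimeType != null) {
            hints.put(CONTENT_TYPE, mimeType);
        }
        if (size >= 0) {
            hints.put(CONTENT_LENGTH, Long.toString(size));
        }
        if (charset != null) {
            hints.put(CONTENT_ENCODING, charset);
        }
        return hints;
    }
}
```

Skipping unknown values rather than writing empty strings is a design choice of this sketch: a missing hint lets Tika fall back to its own detection.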
Also, when a TikaException is thrown (e.g. a parsing error in PDFBox/POI), we
need to decide whether or not to ignore the document that failed to parse.
Solr Cell has an 'ignoreTikaException' parameter for this: when a
TikaException is thrown and the parameter is true, only the metadata is
indexed; when it is false, Solr responds with a server error and the document
is not indexed.
Reference (Solr Cell):
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142
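
The ignore-or-fail decision described above could look roughly like the following sketch. It is self-contained and does not use Tika: ParseFailure is a stand-in for TikaException, and BodyExtractor, buildDocument and the "content" field name are hypothetical names chosen for illustration, not the Solr Cell or ManifoldCF implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an "ignore parse errors" switch.
class IgnoreParseErrorSketch {
    // Stand-in for org.apache.tika.exception.TikaException.
    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    // Hypothetical extraction step; real code would call a Tika parser here.
    interface BodyExtractor {
        String extract() throws ParseFailure;
    }

    // Returns the fields to index. If extraction fails and ignoreParseFailure
    // is true, only the metadata is returned; otherwise the failure propagates
    // and nothing is indexed.
    static Map<String, String> buildDocument(Map<String, String> metadata,
                                             BodyExtractor extractor,
                                             boolean ignoreParseFailure)
            throws ParseFailure {
        Map<String, String> doc = new LinkedHashMap<>(metadata);
        try {
            doc.put("content", extractor.extract());
        } catch (ParseFailure e) {
            if (!ignoreParseFailure) {
                throw e;
            }
            // Swallow the error: the document is indexed with metadata only.
        }
        return doc;
    }
}
```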
--
This message was sent by Atlassian JIRA
(v6.2#6252)