[
https://issues.apache.org/jira/browse/CONNECTORS-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042942#comment-14042942
]
Shinichiro Abe commented on CONNECTORS-984:
-------------------------------------------
Okay, I'll try to do that in about two weeks. It may take a while to write the
code and test it with some binary files.
* For the reason to add metadata, see CONNECTORS-613.
Additional comments:
* Also add 'stream_name' if possible, for backwards compatibility.
* One idea: copy (or move?) the MIME type mapping from core to the Tika
connector's tab; see CONNECTORS-654.
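As a rough illustration of the hints listed in the issue description, here is a minimal Java sketch. It is dependency-free on purpose: it collects the values into a plain Map, and the class name TikaHintSketch and the method buildHints are hypothetical. The key strings are assumed to match the values behind Tika 1.x's Metadata.RESOURCE_NAME_KEY, Metadata.CONTENT_TYPE, Metadata.CONTENT_LENGTH and Metadata.CONTENT_ENCODING. Carrying the charset in Content-Encoding and setting 'stream_name' to the file name are assumptions as well. Password handling is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch only: gathers the proposed hints into a plain map.
 * Nothing here is taken from the ManifoldCF code base.
 */
final class TikaHintSketch {
    // Assumed to equal the string values of the corresponding Tika 1.x constants.
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_ENCODING = "Content-Encoding";
    // Extra key for backwards compatibility (assumed to carry the file name).
    static final String STREAM_NAME = "stream_name";

    /** Builds the hint map; unknown values (null, empty, negative size) are skipped. */
    static Map<String, String> buildHints(String fileName, String mimeType,
                                          long sizeInBytes, String charset) {
        Map<String, String> hints = new LinkedHashMap<>();
        if (fileName != null && !fileName.isEmpty()) {
            hints.put(RESOURCE_NAME, fileName);
            hints.put(STREAM_NAME, fileName);
        }
        if (mimeType != null && !mimeType.isEmpty()) {
            hints.put(CONTENT_TYPE, mimeType);
        }
        if (sizeInBytes >= 0) {
            hints.put(CONTENT_LENGTH, Long.toString(sizeInBytes));
        }
        if (charset != null && !charset.isEmpty()) {
            hints.put(CONTENT_ENCODING, charset);
        }
        return hints;
    }
}
```

In the connector, each entry would then be copied into the Tika Metadata object with metadata.set(key, value) before the parser is called.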
> Give Tika's metadata some hints
> -------------------------------
>
> Key: CONNECTORS-984
> URL: https://issues.apache.org/jira/browse/CONNECTORS-984
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Amazon CloudSearch output connector
> Affects Versions: ManifoldCF 1.7
> Reporter: Shinichiro Abe
> Fix For: ManifoldCF 1.7
>
>
> Component: Tika connector
> Currently, in trunk, we don't set any data in Tika's metadata object.
> We probably need to give the metadata some hints to help Tika detect and
> extract the document:
> * resourceName
> * ContentType
> * stream size
> * charset (new feature)
> * Password handling (new feature)
> Also, when a TikaException is thrown (e.g. a parsing error in PDFBox/POI), we
> need to decide whether or not to ignore the document being parsed. Solr Cell
> has an 'ignoreTikaException' parameter: when a TikaException is thrown and the
> parameter is true, only the metadata is indexed; when it is false, Solr
> responds with a server error and the document is not indexed.
> Reference-->Solr Cell:
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142
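The ignore-or-fail decision described above could look roughly like the following Java sketch. It models only the behavior the description attributes to Solr Cell and is not taken from Solr or ManifoldCF. All names are hypothetical: ParseFailure stands in for TikaException so the example has no dependencies, and Extractor stands in for the actual Tika call.

```java
import java.util.Map;

final class IgnoreParseErrorSketch {
    /** Stand-in for org.apache.tika.exception.TikaException (hypothetical). */
    static final class ParseFailure extends Exception {
        ParseFailure(String message) { super(message); }
    }

    /** Hypothetical extractor: returns the body text or throws on a parsing error. */
    interface Extractor {
        String extract(byte[] content) throws ParseFailure;
    }

    /** What would be handed to the index. */
    static final class IndexedDocument {
        final Map<String, String> metadata;
        final String body; // null when parsing failed and the error was ignored

        IndexedDocument(Map<String, String> metadata, String body) {
            this.metadata = metadata;
            this.body = body;
        }
    }

    /**
     * If parsing fails and ignoreParseException is true, only the metadata is kept.
     * If it is false, the failure propagates and the document is not indexed.
     */
    static IndexedDocument process(byte[] content, Map<String, String> metadata,
                                   Extractor extractor, boolean ignoreParseException)
            throws ParseFailure {
        try {
            return new IndexedDocument(metadata, extractor.extract(content));
        } catch (ParseFailure e) {
            if (ignoreParseException) {
                return new IndexedDocument(metadata, null);
            }
            throw e;
        }
    }
}
```

Whether the flag should default to true or false in the Tika connector is left open here.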