[jira] [Commented] (CONNECTORS-984) Give Tika's metadata some hints

Shinichiro Abe (JIRA) Tue, 22 Jul 2014 21:24:13 -0700

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071349#comment-14071349
 ]


Shinichiro Abe commented on CONNECTORS-984:
-------------------------------------------

Sorry, I won't be able to write the code in time for the next release, so I'd 
like to ask you to do it. Otherwise, I'd like to postpone this issue.

Quick-fix idea for addOrReplaceDocumentWithException():
{code}

Metadata metadata = new Metadata();
+ metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, document.getFileName());
+ metadata.add(HttpHeaders.CONTENT_TYPE, document.getMimeType());
+ metadata.add("stream_name", document.getFileName());
+ metadata.add("stream_size", String.valueOf(ds.getBinaryLength()));
 :
 :
 :
            try
            {
              parser.parse(document.getBinaryStream(), handler, metadata, pc);
            }
            catch (TikaException e)
            {
+             if(ignoreTikaException) {
+               // ignoreTikaException needs to be made configurable somewhere.
+               // If true, I'd like not to skip the next processing step, i.e. I
+               // don't think DOCUMENTSTATUS_REJECTED should be returned.
+               // Please see:
+               // http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l219
+             } else {
              resultCode = "TIKAEXCEPTION";
              description = e.getMessage();
              return handleTikaException(e);
+             }
            }

{code}
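
To make the intended control flow concrete, here is a minimal, self-contained sketch of the idea above. It does not use the real Tika or ManifoldCF classes: ParseFailure stands in for TikaException, BodyExtractor stands in for the parser.parse() call, and the Result type, the "OK" result code and the empty-body fallback are assumptions made only for illustration. The key names "resourceName" and "Content-Type" are the values behind TikaMetadataKeys.RESOURCE_NAME_KEY and HttpHeaders.CONTENT_TYPE.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: all type and method names here are hypothetical stand-ins. */
final class IgnoreTikaExceptionSketch {

  /** Stand-in for org.apache.tika.exception.TikaException. */
  static final class ParseFailure extends Exception {
    ParseFailure(String message) { super(message); }
  }

  /** Stand-in for the Tika parser call: returns the extracted body or throws. */
  interface BodyExtractor {
    String extract(byte[] content, Map<String, String> metadata) throws ParseFailure;
  }

  /** Outcome of processing one document (hypothetical shape). */
  static final class Result {
    final boolean accepted;
    final String body;
    final String resultCode;
    Result(boolean accepted, String body, String resultCode) {
      this.accepted = accepted;
      this.body = body;
      this.resultCode = resultCode;
    }
  }

  /** Collects the hints from the quick fix; every value is stored as a String. */
  static Map<String, String> hints(String fileName, String mimeType, long length) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("resourceName", fileName);
    metadata.put("Content-Type", mimeType);
    metadata.put("stream_name", fileName);
    metadata.put("stream_size", String.valueOf(length));
    return metadata;
  }

  /** Parses a document; on failure, either continues with metadata only or rejects. */
  static Result process(byte[] content, Map<String, String> metadata,
                        BodyExtractor extractor, boolean ignoreTikaException) {
    try {
      return new Result(true, extractor.extract(content, metadata), "OK");
    } catch (ParseFailure e) {
      if (ignoreTikaException) {
        // Assumption: keep the document with an empty body so that the
        // metadata alone is passed on to the next processing step.
        return new Result(true, "", "OK");
      }
      // Current behaviour in the snippet above: report and reject.
      return new Result(false, null, "TIKAEXCEPTION");
    }
  }
}
```

With a failing extractor, process(..., true) yields an accepted result with an empty body, while process(..., false) yields a rejected result with the "TIKAEXCEPTION" code, which mirrors the Solr Cell behaviour described in the issue.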

Thanks.

> Give Tika's metadata some hints
> -------------------------------
>
>                 Key: CONNECTORS-984
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>             Fix For: ManifoldCF 1.7
>
>
> Component: Tika connector
> Currently, the trunk code doesn't set any data in Tika's metadata object.
> We likely need to give the metadata some hints so that Tika can detect the 
> document type and extract its content.
> * resourceName
> * ContentType
> * stream size
> * charset(new feature)
> * Password handling(new feature)
> Also, when a TikaException (e.g. a parsing error in PDFBox/POI) is thrown, we 
> need to decide whether or not to ignore it for the document being parsed. 
> Solr Cell has an 'ignoreTikaException' param. When a TikaException is thrown 
> and the param is true, only the metadata is indexed; if it is false, Solr 
> responds with a server error and the document is not indexed.
> Reference-->Solr Cell:
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142



--
This message was sent by Atlassian JIRA
(v6.2#6252)
