[ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027579#comment-13027579 ]
Shinichiro Abe commented on SOLR-2480:
--------------------------------------

{quote}
But I think you want Solr to skip the content field because tika cannot extract it for some reasons but add meta data fields, right?
{quote}

Yes, I want to post the metadata without the content that throws the parse error. ExtractingDocumentLoader should also be fixed.
This patch expresses improvement idea (1). And I think SOLR-445 can resolve improvement idea (2).

> Text extraction of password protected files
> -------------------------------------------
>
>                 Key: SOLR-2480
>                 URL: https://issues.apache.org/jira/browse/SOLR-2480
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1
>            Reporter: Shinichiro Abe
>            Priority: Minor
>         Attachments: SOLR-2480-idea1.patch
>
>
> Proposal:
> There are password-protected files: PDFs and Office documents in 2007
> format/97 format.
> These files are posted using SolrCell.
> We cannot read these files if we do not know their reading password.
> So text may not be extractable from these files.
> My requirement is that these files should be processed normally, without
> extracting text and without throwing an exception.
> Background:
> Currently, when you post a password-protected file, Solr returns a 500
> server error.
> Solr catches the error in ExtractingDocumentLoader and throws TikaException.
> I use ManifoldCF.
> If the Solr server responds with 500, ManifoldCF judges that "this
> document should be retried because I have absolutely no idea what
> happened".
> It then retries the post many times without ever getting the password.
> In another case, my customer posts files with embedded images.
> Sometimes Solr seems to throw a TikaException of unknown cause.
> He wants to post just the metadata without extracting text, but the
> exception stops him from posting.

--
This message is automatically generated by JIRA.
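The behavior requested in this issue (keep the metadata fields and skip the content field when text extraction fails) can be sketched roughly as below. This is only an illustration of the idea, not the attached patch and not the Solr or Tika API: `TextExtractor`, `ExtractionException`, `buildDocument`, the `content` field name, and the `ignoreExtractionErrors` flag are all hypothetical stand-ins, with `ExtractionException` playing the role of TikaException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataOnlyFallback {

    /** Hypothetical stand-in for a parser failure such as TikaException. */
    static class ExtractionException extends Exception {
        ExtractionException(String message) {
            super(message);
        }
    }

    /** Hypothetical extractor interface; in Solr Cell the real work is done by Tika. */
    interface TextExtractor {
        String extract(byte[] data) throws ExtractionException;
    }

    /**
     * Builds a document from the metadata and, when possible, the extracted text.
     * If extraction fails and ignoreExtractionErrors is true, the "content" field
     * is left out and the metadata-only document is returned; otherwise the
     * exception is rethrown, which corresponds to the current 500 response.
     */
    static Map<String, String> buildDocument(byte[] data,
                                             Map<String, String> metadata,
                                             TextExtractor extractor,
                                             boolean ignoreExtractionErrors)
            throws ExtractionException {
        Map<String, String> doc = new LinkedHashMap<>(metadata);
        try {
            doc.put("content", extractor.extract(data));
        } catch (ExtractionException e) {
            if (!ignoreExtractionErrors) {
                throw e;
            }
            // Extraction failed (for example, a password-protected file):
            // index the metadata only.
        }
        return doc;
    }

    public static void main(String[] args) throws ExtractionException {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("id", "doc-1");
        metadata.put("resourcename", "protected.pdf");

        // Simulates a parser that cannot read a password-protected file.
        TextExtractor failing = data -> {
            throw new ExtractionException("document is password protected");
        };

        Map<String, String> doc = buildDocument(new byte[0], metadata, failing, true);
        System.out.println(doc); // prints {id=doc-1, resourcename=protected.pdf}
    }
}
```

With the flag set to false the exception propagates as it does today, so a client such as ManifoldCF would still see the failure; making the fallback opt-in is an assumption of this sketch, not something decided in the issue.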