Here's how I solved this problem:

In solrconfig.xml, I set stream.type to text/plain for the 
ExtractingRequestHandler so that Tika no longer has to guess at the MIME 
type (it was apparently detecting these files as email messages and then 
failing to parse them as such).  Since the fulltext field of the Solr 
document comes strictly from the TEXT bundle (I think), this should not 
cause any problems.  Any comments on this attempted solution are welcome.  
In the meantime, all of the previously missing items have been indexed 
into Solr.
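
For anyone who wants to try the same change, here is a minimal sketch of 
what the handler entry might look like.  The handler name and class are the 
stock Solr ones and are an assumption about your setup; the only line that 
matters is the stream.type default, and any other defaults already present 
in your solrconfig.xml should be left in place.

```xml
<!-- Sketch only: handler name/class assumed to match a stock Solr config -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- keep the defaults that are already here -->

    <!-- Tell Tika the content type up front so it skips MIME detection -->
    <str name="stream.type">text/plain</str>
  </lst>
</requestHandler>
```

Solr needs to be restarted (or the core reloaded) to pick up the change, and 
the affected items have to be reindexed afterwards.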

Cheers!
Bill

On Thursday, June 2, 2016 at 11:53:59 AM UTC-5, Bill T wrote:
>
> All
>
> using DSpace 5.5, I have noticed a number of records that fail to get 
> indexed by solr.  The TEXT bitstreams are extracted from pdfs that look 
> like email messages.  That is, they are memos that begin with "FROM:" or 
> they are actually email messages.  In either case, the solr indexing fails 
> with
>
>   TikaException failed to parse an email message
>
> followed by a stack trace (which I will include if anybody asks).
>
> The item is not included in the solr index, and so is not browsable or 
> searchable, and thus pretty much invisible.
>
> I can actually hand-edit the extracted text on disk, and if I add some 
> arbitrary text to the top of the file, it indexes properly.  This is a 
> pretty crude work-around though.
>
> Any better ideas on how to fix this?  If this looks like a bug, I will be 
> happy to submit it to JIRA...
>
> Cheers!
> Bill
>
