All,

Using DSpace 5.5, I have noticed a number of records that fail to get 
indexed by Solr.  The TEXT bitstreams are extracted from PDFs that look 
like email messages: they are either memos that begin with "FROM:" or 
actual email messages.  In either case, the Solr indexing fails with

  TikaException failed to parse an email message

followed by a stack trace (which I will include if anybody asks).

The item is not included in the Solr index, so it can be neither browsed 
nor searched, which makes it pretty much invisible.

I can hand-edit the extracted text on disk, and if I add some arbitrary 
text to the top of the file, it indexes properly.  This is a pretty crude 
workaround though.
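
For what it's worth, the hand edit could be scripted along these lines.  
This is only a sketch of the workaround described above, not a fix.  The 
header list, the filler line, and the function names are my own 
assumptions for illustration; they are not taken from Tika or DSpace.

```python
import re

# Assumption: header names that make plain text look like an email message.
# This list is illustrative only; it is not copied from Tika's detection rules.
EMAIL_LIKE_HEADER = re.compile(
    r"^\s*(from|to|subject|date|received|message-id)\s*:",
    re.IGNORECASE,
)

# Hypothetical filler line; any text that is not header-like should do.
GUARD_LINE = "Extracted text\n"


def looks_like_email(text: str) -> bool:
    """Return True if the first non-blank line resembles an email header."""
    for line in text.splitlines():
        if line.strip():
            return bool(EMAIL_LIKE_HEADER.match(line))
    return False


def add_guard_line(text: str) -> str:
    """Prepend the filler line only when the text starts like an email."""
    if looks_like_email(text):
        return GUARD_LINE + text
    return text
```

Running add_guard_line() twice on the same text adds the filler only once, 
because the guarded text no longer starts with a header-like line.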

Any better ideas on how to fix this?  If this looks like a bug, I will be 
happy to report it in JIRA.

Cheers!
Bill
