[Dspace-devel] [DuraSpace JIRA] (DS-1650) Extracted text from filter-media can fail to be indexed by solr in some cases

Gilleain Torrance (DuraSpace JIRA) Fri, 06 Sep 2013 05:48:56 -0700

Issue Type:	Bug
Affects Versions:	3.1
Assignee:	Unassigned
Components:	DSpace API, Solr
Created:	06/Sep/13 12:45 PM
Description:	The PDFTextExtractor (org.dspace.app.mediafilter.PDFFilter) can create text bitstreams that cannot be written to the Solr index. The stack trace is:

    23-Aug-2013 09:29:55 org.apache.solr.common.SolrException log
    SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #29472, byte #28575)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)

    Obviously this is not a big deal, since it just means no full-text search on that item, but it is a bug. Possibly the filter-media classes could be wrapped in 'filters' (uhhh...FilterFilterWrappers?) that remove non-writable characters from the extracted text before it is sent to Solr.
Project:	DSpace
Priority:	Minor
Reporter:	Gilleain Torrance
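
The wrapper idea in the Description could be sketched roughly as follows. This is only an illustration, not DSpace code: the class name XmlTextSanitizer and both method names are hypothetical, and it assumes the extracted text is available as a Java String before it is handed to Solr. It drops every code point that falls outside the XML 1.0 Char production, which includes the 0xFFFF character reported in the stack trace.

```java
// Hypothetical helper, not part of DSpace: removes characters that an
// XML 1.0 parser (such as the one behind Solr's XML update handler) rejects.
final class XmlTextSanitizer {

    private XmlTextSanitizer() {
        // static utility, no instances
    }

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with all invalid code points removed.
    // Iterates by code point so valid surrogate pairs are kept intact,
    // while unpaired surrogates are dropped.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A filter wrapper would presumably call stripInvalidXmlChars on the extracted text before the text bitstream is written, so that the indexer never sees the offending characters. Whether to drop such characters or replace them with a space is a design choice this sketch does not settle.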



