Issue Type: Bug
Affects Versions: 3.1
Assignee: Unassigned
Components: DSpace API, Solr
Created: 06/Sep/13 12:45 PM
Description:
The PDFTextExtractor (org.dspace.app.mediafilter.PDFFilter) can create text bitstreams that cannot be written to the Solr index. The stack trace is:
23-Aug-2013 09:29:55 org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #29472, byte #28575)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
Obviously this is not a big deal, since it just means no full-text search on that item, but it is a bug. Possibly the filter-media classes could be wrapped in 'filters' (uhhh...FilterFilterWrappers?) that remove non-writable characters, such as the 0xffff in the trace above, from the extracted text.
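One way the suggested character-stripping step might look is sketched below. It rests on an assumption about the cause: the stack trace shows the document going through Solr's XML loader, and U+FFFF falls outside the Char production of XML 1.0, so the sketch drops every code point that production does not allow. The class and method names (XmlTextSanitizer, stripInvalidXmlChars) are hypothetical and are not part of DSpace; where such a call would be hooked into the media filter is not shown here.

```java
// Hypothetical helper, not an existing DSpace class.
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with all code points invalid in XML 1.0 removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate is
            // returned as itself and then rejected by isValidXmlChar.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Applied to the extracted text before it is stored or indexed, a call such as XmlTextSanitizer.stripInvalidXmlChars(extractedText) would remove U+FFFF and similar characters while leaving tabs, newlines, and supplementary-plane characters intact. Dropping the characters silently is a design choice; replacing them with a space is an equally plausible alternative if word boundaries matter for search.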
Project: DSpace
Priority: Minor
Reporter: Gilleain Torrance
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel