It looks like the Tika content extraction is not actually producing valid UTF-8. I'm not sure what it is producing, but that is the underlying problem.
I'll create a ticket and look into it.

Karl


On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <[email protected]> wrote:

> Hi Abe-san,
>
> It looks to me like SolrJ, when it uses SolrInputDocument, cannot correctly
> post some kinds of characters.  The exception is coming from inside Solr
> itself -- not SolrJ.  So I think a Solr ticket would be the right thing to
> do here.
>
> Can you try leaving Tika in your pipeline, but changing your Solr
> connection to go back to using the extracting update handler?  If that
> works, then I think we have correctly diagnosed the problem.
>
> Thanks,
> Karl
>
>
> On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <[email protected]> wrote:
>
>> Hi Karl,
>>
>> The content field was garbled when posting via /update and the Tika connector.
>> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
>> My mcf-job was from filesystem:Japanese PDF,XLS to Solr.
>>
>> I was surprised that Solr threw an exception when
>> en_US end-user-documentation.pdf
>> was posted via the Tika connector. Files posted via /update/extract were not
>> garbled and did not throw exceptions.
>> Could you reproduce this?
>>
>> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>     at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
>>     at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>     at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>>     at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>>     ...
>> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>>     at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
>>     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
>>     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
>>     at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
>>     at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
>>     at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
>>     at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
>>     at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
>>     at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
>>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
>>     ... 36 more
>>
>> Thanks,
>> Shinichiro Abe
>>
>>
>> 2014-08-12 22:24 GMT+09:00 Karl Wright <[email protected]>:
>>
>> > I ran "ant rat-sources", and inspected the packages.  All looks good.
>> > The only comment is that the connector-lib area has grown by about 18MB
>> > this cycle, and of course all the images for the Chinese documentation
>> > add another 5MB, so our binary packages are now just about 200MB.  I
>> > don't think this is something we can do a lot about, though, except
>> > maybe by repackaging so we release connectors independently of the
>> > framework.
>> >
>> > I'll give a final vote after I hear more back from Erlend and Abe-san.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <[email protected]> wrote:
>> >
>> > > I request that the vote be left open at least until 8/21/2014, since
>> > > 1.7 is a major release and we want as many people to try it out as
>> > > possible before declaring it complete.  Thanks!
>> > >
>> > > Karl
>> > >
>> > >
>> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <[email protected]> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> +1 from me.
>> > >>
>> > >> -Checked SIGS, checksum by running check_signatures.sh.
>> > >> -Checked that the code signing Key of Mingchun is available online.
>> > >>
>> > >> Shinichiro Abe
>> > >>
>> > >> On 2014/08/12, at 12:13, Mingchun Zhao <[email protected]> wrote:
>> > >>
>> > >> > Hi all,
>> > >> >
>> > >> > Please vote on whether to release the ManifoldCF, version 1.7, RC0.
>> > >> >
>> > >> > You can find the artifact at:
>> > >> >
>> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
>> > >> >
>> > >> > There is also a tag at:
>> > >> >
>> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
>> > >> >
>> > >> > Vote will remain open at least 72 hours.
>> > >> >
>> > >> > Thanks!
>> > >> > Mingchun Zhao
>> > >>
>> > >
>> >
>>
>> --
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Shinichiro Abe
>> 阿部 慎一朗
>>
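
For reference, regarding the "Invalid UTF-8 character 0xffff" error in the trace above: XML 1.0 does not allow U+FFFE, U+FFFF, most control characters, or unpaired surrogates in document text, so an XML parser can reject extracted content that contains them. The following is only a minimal sketch of one possible workaround, assuming the extracted text really does contain such code points. The class and method names (XmlCharFilter, isValidXml10, sanitize) are hypothetical and are not part of ManifoldCF, Tika, Solr, or SolrJ.

```java
class XmlCharFilter {
    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text in which every code point that XML 1.0
    // forbids (including unpaired surrogates) is replaced by a space.
    static String sanitize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            sb.appendCodePoint(isValidXml10(cp) ? cp : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extracted content containing the offending U+FFFF.
        String extracted = "abc\uFFFFdef";
        System.out.println(sanitize(extracted));
    }
}
```

Replacing with a space rather than deleting keeps neighbouring words from running together in the indexed content. Whether such filtering belongs in the Tika step or in the Solr output step is a separate question that this sketch does not answer.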
