Thanks Karl,

When I post MCF's end-user-documentation.pdf (English) via the standard
update handler, Solr throws an exception. I'm not sure why.
It works when I leave Tika in my pipeline and switch the Solr connection
to the extracting update handler.
Solr's Tika version matches MCF's Tika version (1.5).
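
For reference, here is a minimal sketch of one possible workaround. It assumes
the failure is caused by characters such as U+FFFF in the extracted text,
which the XML 1.0 Char production does not allow, and strips them before the
content is sent to the standard update handler. The class and method names
(XmlTextSanitizer, stripInvalidXmlChars) are hypothetical and are not part of
MCF, SolrJ, or Solr.

```java
/** Hypothetical helper, not part of MCF or Solr. */
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** True if the code point is permitted by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with all non-XML-1.0 characters dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read whole code points so valid surrogate pairs are kept together;
            // an unpaired surrogate comes back as itself and is filtered out.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Under that assumption, the extracted content string would be passed through
stripInvalidXmlChars before it is set on the document posted to /update.
Whether the right place for such a filter is MCF or Solr is still open.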



2014-08-12 23:10 GMT+09:00 Karl Wright <[email protected]>:

> It looks like the Tika content extraction is not actually producing valid
> utf-8.  I'm not sure what it is producing, but that is the underlying
> problem.
>
> I'll create a ticket and look into it.
>
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <[email protected]> wrote:
>
> > Hi Abe-san,
> >
> > It looks to me like SolrJ when it uses SolrInputDocument cannot correctly
> > post some kinds of characters.  The exception is coming from inside Solr
> > itself -- not SolrJ.  So I think a Solr ticket would be the right thing
> > to do here.
> >
> > Can you try leaving your pipeline to include Tika, but changing your Solr
> > connection to go back to using the extracting update handler?  If that
> > works, then I think we have correctly diagnosed the problem.
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <
> > [email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> The content field was garbled when posting via /update with the Tika
> >> connector.
> >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
> >> My MCF job crawled Japanese PDF and XLS files from the filesystem
> >> into Solr.
> >>
> >> I was surprised that Solr threw an exception when the en_US
> >> end-user-documentation.pdf was posted via the Tika connector.
> >> Files posted via /update/extract were not garbled and did not
> >> throw exceptions.
> >> Could you reproduce this?
> >>
> >> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
> >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> >> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> >> ...
> >> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> >> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> >> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> >> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> >> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> >> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> >> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> >> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> >> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> >> ... 36 more
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >>
> >>
> >>
> >> 2014-08-12 22:24 GMT+09:00 Karl Wright <[email protected]>:
> >>
> >> > I ran "ant rat-sources", and inspected the packages.  All looks good.
> >> > The only comment is that the connector-lib area has grown by about
> >> > 18MB this cycle, and of course all the images for the Chinese
> >> > documentation add another 5MB, so our binary packages are now just
> >> > about 200MB.  I don't think this is something we can do a lot about,
> >> > though, except maybe by repackaging so we release connectors
> >> > independently of the framework.
> >> >
> >> > I'll give a final vote after I hear more back from Erlend and Abe-san.
> >> >
> >> > Thanks,
> >> > Karl
> >> >
> >> >
> >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <[email protected]> wrote:
> >> >
> >> > > I request that the vote be left open at least until 8/21/2014,
> >> > > since 1.7 is a major release and we want as many people to try it
> >> > > out as possible before declaring it complete.  Thanks!
> >> > >
> >> > > Karl
> >> > >
> >> > >
> >> > >
> >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <
> >> > > [email protected]> wrote:
> >> > >
> >> > >> Hi,
> >> > >>
> >> > >> +1 from me.
> >> > >>
> >> > >> -Checked SIGS, checksum by running check_signatures.sh.
> >> > >> -Checked that the code signing Key of Mingchun is available online.
> >> > >>
> >> > >> Shinichiro Abe
> >> > >>
> >> > >> On 2014/08/12, at 12:13, Mingchun Zhao <[email protected]> wrote:
> >> > >>
> >> > >> > Hi all,
> >> > >> >
> >> > >> > Please vote on whether to release ManifoldCF, version 1.7, RC0.
> >> > >> >
> >> > >> > You can find the artifact at:
> >> > >> >
> >> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
> >> > >> >
> >> > >> > There is also a tag at:
> >> > >> >
> >> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
> >> > >> >
> >> > >> > Vote will remain open at least 72 hours.
> >> > >> >
> >> > >> > Thanks!
> >> > >> > Mingchun Zhao
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> Shinichiro Abe
> >> 阿部 慎一朗
> >>
> >
> >
>



-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shinichiro Abe
阿部 慎一朗
