OK, I understand: we specify the 'text/plain;charset=utf-8' string temporarily so that we accept all kinds of MIME types.
Thanks,
Shinichiro Abe

2014-08-13 1:25 GMT+09:00 Karl Wright:

> bq. I have a question.
> What is this? -> hard-coded MIME type checks, "text/plain;charset=utf-8".
> For what? This seems to be unnecessary.
>
> http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
>
> http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
>
>
> Hi Abe-san,
>
> The idea is that the Tika extractor always confirms that the downstream
> pipeline accepts text/plain;charset=utf-8 because that is what it always
> outputs. On the upstream side, we should technically only accept documents
> that Tika knows how to extract. Right now, we accept all kinds, because I
> don't know what that list is.
>
> Karl
>
>
> On Tue, Aug 12, 2014 at 12:20 PM, Shinichiro Abe wrote:
>
> > Hi Karl,
> >
> > I also tested with the SJIS file attached to CONNECTORS-613: the file
> > was not garbled, and the tika connector extracted its content and
> > metadata properly.
> > Therefore we currently don't need to respin the RC.
> >
> > I have a question.
> > What is this? -> hard-coded MIME type checks, "text/plain;charset=utf-8".
> > For what? This seems to be unnecessary.
> >
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
> >
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
> >
> > Thanks,
> > Shinichiro Abe
> >
> >
> > 2014-08-13 1:09 GMT+09:00 Karl Wright:
> >
> > > Ok, I closed the ticket.
> > >
> > > So thanks, I think I'm now ready to vote +1.
> > >
> > > Karl
> > >
> > >
> > > On Tue, Aug 12, 2014 at 11:38 AM, Shinichiro Abe wrote:
> > >
> > > > I apologize for the mistake, I forgot to configure the tika connector
> > > > in the job. I configured documentFilter and Metadata adjuster only.
> > > > It works after adding the tika connector, there is no problem. English
> > > > pdf, Japanese pdf/xls are not garbled!
> > > > I'm sorry! So we don't have to fix CONNECTORS-1008.
> > > >
> > > > Shinichiro Abe
> > > >
> > > >
> > > > 2014-08-13 0:24 GMT+09:00 Karl Wright:
> > > >
> > > > > Ok, I've done some more experimentation, and confirmed that there is
> > > > > really only ONE problem: in SolrJ or Solr. ManifoldCF is working
> > > > > perfectly.
> > > > >
> > > > > The ticket I created, CONNECTORS-1008, will therefore be postponed to
> > > > > MCF 2.0. The workaround is to use the extracting update handler even
> > > > > when the content has already been extracted on the MCF side. So we
> > > > > should open a SOLR ticket, but there is no reason to respin the MCF
> > > > > release.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Aug 12, 2014 at 10:18 AM, Karl Wright wrote:
> > > > >
> > > > > > So there are two problems. One problem is that the Tika Extractor
> > > > > > is not doing the right thing (I think). The second problem is that
> > > > > > valid characters are not being sent to Solr when SolrInputDocument
> > > > > > is used.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 12, 2014 at 10:15 AM, Shinichiro Abe wrote:
> > > > > >
> > > > > >> Thanks Karl,
> > > > > >>
> > > > > >> When posting MCF's end-user-documentation.pdf (English) via the
> > > > > >> standard update handler,
> > > > > >> Solr throws an exception, this is a problem, I'm not sure why.
> > > > > >> It works by leaving my pipeline to include Tika and using the
> > > > > >> extracting update handler.
> > > > > >> Solr's Tika version matches MCF's Tika one(1.5).
> > > > > >>
> > > > > >>
> > > > > >> 2014-08-12 23:10 GMT+09:00 Karl Wright:
> > > > > >>
> > > > > >> > It looks like the Tika content extraction is not actually
> > > > > >> > producing valid utf-8. I'm not sure what it is producing, but
> > > > > >> > that is the underlying problem.
> > > > > >> >
> > > > > >> > I'll create a ticket and look into it.
> > > > > >> >
> > > > > >> > Karl
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright wrote:
> > > > > >> >
> > > > > >> > > Hi Abe-san,
> > > > > >> > >
> > > > > >> > > It looks to me like SolrJ when it uses SolrInputDocument cannot
> > > > > >> > > correctly post some kinds of characters. The exception is coming
> > > > > >> > > from inside Solr itself -- not SolrJ. So I think a Solr ticket
> > > > > >> > > would be the right thing to do here.
> > > > > >> > >
> > > > > >> > > Can you try leaving your pipeline to include Tika, but changing
> > > > > >> > > your Solr connection to go back to using the extracting update
> > > > > >> > > handler? If that works, then I think we have correctly diagnosed
> > > > > >> > > the problem.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Karl
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe wrote:
> > > > > >> > >
> > > > > >> > >> Hi Karl,
> > > > > >> > >>
> > > > > >> > >> The content field was garbled via /update and tika connector.
> > > > > >> > >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
> > > > > >> > >> My MCF job went from filesystem (Japanese PDF, XLS) to Solr.
> > > > > >> > >>
> > > > > >> > >> I was surprised that Solr threw an exception when
> > > > > >> > >> en_US end-user-documentation.pdf
> > > > > >> > >> was posted via tika connector. Files posted via /update/extract
> > > > > >> > >> were not garbled and did not throw exceptions.
> > > > > >> > >> Could you reproduce this?
> > > > > >> > >>
> > > > > >> > >> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > >> > >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> > > > > >> > >> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> > > > > >> > >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > > > > >> > >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > > >> > >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> > > > > >> > >> ...
> > > > > >> > >> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> > > > > >> > >> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> > > > > >> > >> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> > > > > >> > >> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> > > > > >> > >> ... 36 more
> > > > > >> > >>
> > > > > >> > >> Thanks,
> > > > > >> > >> Shinichiro Abe
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> 2014-08-12 22:24 GMT+09:00 Karl Wright:
> > > > > >> > >>
> > > > > >> > >> > I ran "ant rat-sources", and inspected the packages. All looks
> > > > > >> > >> > good. The only comment is that the connector-lib area has grown
> > > > > >> > >> > by about 18MB this cycle, and of course all the images for the
> > > > > >> > >> > Chinese documentation add another 5MB, so our binary packages
> > > > > >> > >> > are now just about 200MB. I don't think this is something we
> > > > > >> > >> > can do a lot about, though, except maybe by repackaging so we
> > > > > >> > >> > release connectors independently of the framework.
> > > > > >> > >> >
> > > > > >> > >> > I'll give a final vote after I hear more back from Erlend and
> > > > > >> > >> > Abe-san.
> > > > > >> > >> >
> > > > > >> > >> > Thanks,
> > > > > >> > >> > Karl
> > > > > >> > >> >
> > > > > >> > >> >
> > > > > >> > >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright wrote:
> > > > > >> > >> >
> > > > > >> > >> > > I request that the vote be left open at least until 8/21/2014,
> > > > > >> > >> > > since 1.7 is a major release and we want as many people to try
> > > > > >> > >> > > it out as possible before declaring it complete. Thanks!
> > > > > >> > >> > >
> > > > > >> > >> > > Karl
> > > > > >> > >> > >
> > > > > >> > >> > >
> > > > > >> > >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe wrote:
> > > > > >> > >> > >
> > > > > >> > >> > >> Hi,
> > > > > >> > >> > >>
> > > > > >> > >> > >> +1 from me.
> > > > > >> > >> > >>
> > > > > >> > >> > >> -Checked SIGS, checksum by running check_signatures.sh.
> > > > > >> > >> > >> -Checked that the code signing Key of Mingchun is available online.
> > > > > >> > >> > >>
> > > > > >> > >> > >> Shinichiro Abe
> > > > > >> > >> > >>
> > > > > >> > >> > >> On 2014/08/12, at 12:13, Mingchun Zhao wrote:
> > > > > >> > >> > >>
> > > > > >> > >> > >> > Hi all,
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > Please vote on whether to release the ManifoldCF, version 1.7, RC0.
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > You can find the artifact at:
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > There is also a tag at:
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > Vote will remain open at least 72 hours.
> > > > > >> > >> > >> >
> > > > > >> > >> > >> > Thanks!
> > > > > >> > >> > >> > Mingchun Zhao

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shinichiro Abe
阿部 慎一朗
