I tested and checked in the patch, so you can just sync up and you'll get it.

Karl
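For anyone following along: the behaviour CONNECTORS-168 asks for (discussed in the quoted messages) is roughly "try to parse the response as XML, and if that fails, report the raw body instead of only a stack trace". Below is a minimal sketch of that idea using plain JAXP. It is not the checked-in patch; the class name `ResponseSketch`, the method `describe`, and the 200-character cutoff are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not ManifoldCF code.
public class ResponseSketch {

    /** Returns a one-line summary that could go into a log or history record. */
    public static String describe(int httpStatus, String body) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(
                new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return httpStatus + " XML <" + doc.getDocumentElement().getNodeName() + ">";
        } catch (Exception e) {
            // Not well-formed XML (for example an HTML error page with an
            // unclosed <HR>): keep the start of the raw body for the report.
            String snippet = body.length() > 200 ? body.substring(0, 200) + "..." : body;
            return httpStatus + " non-XML response: " + snippet;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(200, "<response><lst name=\"responseHeader\"/></response>"));
        System.out.println(describe(500, "<html><body><HR><p>error</p></body></html>"));
    }
}
```

The point of the fallback branch is only that the text Solr (or the servlet container) actually sent ends up somewhere a person can read it.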
On Tue, Mar 15, 2011 at 4:23 AM, Karl Wright <[email protected]> wrote:
> I've done what I think is the code for CONNECTORS-168 and attached a
> patch to that ticket. Perhaps you could try it on your setup to see
> if the reporting of 500 errors improves.
>
> Karl
>
> On Tue, Mar 15, 2011 at 3:42 AM, Karl Wright <[email protected]> wrote:
>> It is hard to tell what you are seeing here because you need to also
>> mention where you are seeing it. But it is unlikely to be a result of
>> the way the POST is being done within the Solr Connector; that
>> connector does not perform any XML encoding, so that is not what is
>> failing. As I think you have discovered, it sounds like the problem
>> is that somewhere deep in Solr something is going wrong and a 500
>> error is being returned with non-XML contents. The Solr Connector
>> attempts to parse the response as XML and fails. I've looked at the
>> code; when this happens, a stack trace is dumped to stdout (which is
>> not very helpful but is better than nothing). Ideally, the connector
>> should dump the response into the log (as part of a warning), and also
>> write the raw response into the history (as part of the results of the
>> indexing attempt), so that you would be able to see the actual error in
>> the crawler UI by getting a simple history. I've opened a new ticket
>> (CONNECTORS-168) to capture this work.
>>
>> Other than that, I would hazard that there is nothing actually wrong
>> with the Solr connector at this time. There is an outstanding Jira
>> ticket to port it to SolrJ (CONNECTORS-19), but based on how unreliable
>> Solr has been of late maybe that's not such a great idea at the moment.
>> It's certainly in wide use and people have not found an actual problem
>> with it.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Mar 14, 2011 at 10:49 PM, Fuad Efendi <[email protected]> wrote:
>>>
>>> I just noticed:
>>> Currently, the default for ManifoldCF is /update/extract, which corresponds
>>> to the SOLR Cell request handler.
>>>
>>> So...
>>> It is EXTREMELY generic...
>>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>>
>>> What happens is: we submit a "field" which is an HTML snippet (inside RSS),
>>> and if that snippet is malformed... SOLR responds with an error message such
>>> as this:
>>> <u>Unexpected character '-' (code 45) in external DTD subset; expected
>>> closing '>' after ENTITY declaration at [row,col,system-id]:
>>> [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>>> from [row,col {unknown-source}]: [1,1]</u></p><p><b>description</b> <u>The
>>> request sent by the client was syntactically incorrect (Unexpected
>>> character '-' (code 45) in external DTD subset; expected closing '>' after
>>> ENTITY declaration at [row,col,system-id]:
>>> [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>>>
>>> And the SOLR response is malformed too, so that we have
>>> [Fatal Error] :7:112: The element type "HR" must be terminated by the
>>> matching end-tag "</HR>".
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>> error: The element type "HR" must be terminated by the matching end-tag
>>> "</HR>"
>>>
>>> Two exceptions:
>>> 1. at SOLR, because of malformed HTML such as
>>> <my_rss_field>>bold<BOLD>/body<</my_rss_field>
>>> 2. at ManifoldCF, because the SOLR response is malformed
>>>
>>> Using SOLR Cell for RSS feeds... we probably need a few types of SOLR
>>> Connectors, or a single type (but configurable); and it's much easier with
>>> the SOLRJ client... including troubleshooting... otherwise we should have
>>> unit tests for void writeField(OutputStream out, String fieldName, String
>>> fieldValue) etc...
>>>
>>> I want to write a new "connector" for my task, based on SOLRJ...
>>>
>>> -Fuad
>>>
>>> -----Original Message-----
>>> From: Fuad Efendi [mailto:[email protected]]
>>> Sent: March-14-11 10:34 PM
>>> To: [email protected]
>>> Subject: RE: SOLR
>>>
>>> It's not the trunk version; I use (different) trunk versions in a few
>>> production sites... In SOLR, the path "/update" is defined in
>>> solrconfig.xml (and usually the user will copy it from the "example" schema
>>> and maybe modify it):
>>>
>>>   <requestHandler name="/update"
>>>                   class="solr.XmlUpdateRequestHandler">
>>>
>>> And what does ManifoldCF expect, which kind of "update" handler?!!
>>>
>>> That's why I suggest using the SOLRJ API instead... I noticed a lot of
>>> low-level coding...
>>>
>>> What kind of SOLR protocol is expected? It is definitely not a POST of XML
>>> content:
>>>
>>>   /** Write a field */
>>>   protected static void writeField(OutputStream out, String fieldName,
>>>     String fieldValue)
>>>     throws IOException
>>>   {
>>>     writePreamble(out);
>>>     writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
>>>
>>>     byte[] tmp = fieldValue.getBytes("UTF-8");
>>>     out.write(tmp, 0, tmp.length);
>>>     writePostamble(out);
>>>   }
>>>
>>> Do you expect a "binary" handler on SOLR?
>>>   <!-- Binary Update Request Handler
>>>        http://wiki.apache.org/solr/javabin
>>>     -->
>>>   <requestHandler name="/update/javabin"
>>>                   class="solr.BinaryUpdateRequestHandler" />
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: March-14-11 7:58 PM
>>> To: [email protected]
>>> Subject: Re: SOLR
>>>
>>> The trunk version of Solr may have changed around how the extracting update
>>> request handler works. It changes daily, so there is no way I can keep up
>>> with it. Maybe it would be better to go back and use a known quantity.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <[email protected]> wrote:
>>>>
>>>> Default settings for ManifoldCF: /update/extract
>>>> http://localhost:8080/solr/update/extract?commit=true
>>>>
>>>> And using a browser, I see SOLR responds with malformed HTML containing
>>>> a non-closing <HR>...
>>>>
>>>> Fix:
>>>> Update handler: /update
>>>>
>>>> -Fuad
>>>>
>>>> -----Original Message-----
>>>> From: Fuad Efendi [mailto:[email protected]]
>>>> Sent: March-14-11 6:17 PM
>>>> To: [email protected]
>>>> Subject: RE: SOLR
>>>>
>>>> Hi Karl,
>>>>
>>>> I verified (via browser),
>>>> http://localhost:8080/solr/update?commit=true
>>>>
>>>> And the response from SOLR:
>>>> <?xml version="1.0" encoding="UTF-8"?> <response> <lst
>>>> name="responseHeader"><int name="status">0</int><int
>>>> name="QTime">15</int></lst> </response>
>>>>
>>>> The root of the problem is
>>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>>
>>>> Everything is fine except I can't understand why we have "HR" from
>>>> SOLR; do we have any multithreading issues? I believe I connect to
>>>> SOLR, port 8080 is configured via console... maybe somewhere else?
>>>>
>>>> I believe the default setting for "Update handler:" on the Connector
>>>> screen is incorrect; it is /update/extract
>>>>
>>>> -----Original Message-----
>>>> From: Karl Wright [mailto:[email protected]]
>>>> Sent: March-14-11 6:00 PM
>>>> To: [email protected]
>>>> Subject: Re: SOLR
>>>>
>>>> This is because your solr setup is incorrect. The post to "solr" is
>>>> returning HTML, not XML, so you are not actually communicating with
>>>> Solr at all.
>>>>
>>>> In order for the Solr connector to work, you need to have the solr
>>>> extracting update request handler present and configured. I am told
>>>> that the latest release of Solr makes the jar with this code optional
>>>> - it's a contrib jar that you have to separately download.
>>>> If you are building solr off of trunk, then this should not be a
>>>> problem.
>>>>
>>>> Karl
>>>>
>>>> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <[email protected]> wrote:
>>>>> This exception: the XML contains encoded HTML, and it doesn't happen
>>>>> with the standard Java 6 StAX parser:
>>>>>
>>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>>> the matching end-tag "</HR>".
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>>> error: The element type "HR" must be terminated by the matching
>>>>> end-tag "</HR>"
>>>>> .
>>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>>         at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>>>>         at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must
>>>>> be terminated by the matching end-tag "</HR>".
>>>>>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>>>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>>>>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>>>>         ... 3 more
>>>>>
>>>>> -----Original Message-----
>>>>> From: Fuad Efendi [mailto:[email protected]]
>>>>> Sent: March-14-11 5:37 PM
>>>>> To: [email protected]
>>>>> Subject: RE: SOLR
>>>>>
>>>>> Thank you very much Karl,
>>>>>
>>>>> And I have a first problem,
>>>>> Starting crawler...
>>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>>> the matching end-tag "</HR>".
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>>> error: The element type "HR" must be terminated by the matching
>>>>> end-tag "</HR>"
>>>>> .
>>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>>
>>>>> I am using the RSS connector to crawl specific XML (containing
>>>>> XML-encoded >HR< and other HTML tags). It doesn't happen with the
>>>>> standard StAX parser (Java 6)...
>>>>>
>>>>> Regarding (2), do you mean this interface method?
>>>>>   /** View specification.
>>>>>   * This method is called in the body section of a job's view page.
>>>>>   * Its purpose is to present the output specification information to
>>>>>   * the user.
>>>>>   * The coder can presume that the HTML that is output from this
>>>>>   * configuration will be within appropriate <html> and <body> tags.
>>>>>   *@param out is the output to which any HTML should be sent.
>>>>>   *@param os is the current output specification for this job.
>>>>>   */
>>>>>   public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>>>>     throws ManifoldCFException, IOException
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -----Original Message-----
>>>>> From: Karl Wright [mailto:[email protected]]
>>>>> Sent: March-14-11 5:21 PM
>>>>> To: [email protected]
>>>>> Subject: Re: SOLR
>>>>>
>>>>> Hi Fuad,
>>>>>
>>>>> (1) "Arguments" are indeed optional key/value pairs, which are sent
>>>>> to solr as part of the URL.
>>>>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that
>>>>> all jobs have; (b) tabs related to the repository connector's
>>>>> management of the document specification information; and (c) tabs
>>>>> related to the output connector's output specification information.
>>>>> The Solr output connector's output specification information includes
>>>>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <[email protected]> wrote:
>>>>>> Hi, is there any sample of how to use the SOLR connector?
>>>>>>
>>>>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>>>>
>>>>>> Some questions:
>>>>>>
>>>>>> 1. Argument. Is it optional key=value pairs which can be sent
>>>>>> to SOLR as part of the HTTP GET/POST request?
>>>>>>
>>>>>> 2. I see the code for “Connector”, and I see how to configure a SOLR
>>>>>> Output Connection. But how does a “Job” know about the <metadata> to
>>>>>> <solr> mapping; is it generic (without dependency on SOLR)?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Fuad
>>>>>
>>>>
>>>
>>
>
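As a footnote to question (1) in this thread: "arguments sent to solr as part of the URL" amounts to appending key/value pairs to the update handler path as a query string. A minimal sketch of that, assuming standard URL encoding, is below. `UpdateUrlSketch` and `buildUrl` are hypothetical names, and this is not how the Solr connector itself is necessarily written.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not ManifoldCF code.
public class UpdateUrlSketch {

    /** URL-encodes one key or value as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Appends the arguments to base + handler as a query string. */
    public static String buildUrl(String base, String handler, Map<String, String> arguments) {
        StringBuilder sb = new StringBuilder(base).append(handler);
        char separator = '?';
        for (Map.Entry<String, String> e : arguments.entrySet()) {
            sb.append(separator).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> arguments = new LinkedHashMap<>();
        arguments.put("commit", "true");
        // Prints http://localhost:8080/solr/update/extract?commit=true
        System.out.println(buildUrl("http://localhost:8080/solr", "/update/extract", arguments));
    }
}
```

With the handler set to /update instead of /update/extract, the same call produces the other URL tried earlier in the thread.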
