I've done what I think is the code for CONNECTORS-168 and attached a patch to that ticket. Perhaps you could try it on your setup to see if the reporting of 500 errors improves.
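For anyone following along, here is a rough sketch of the behavior CONNECTORS-168 asks for. It is not the attached patch. The class and method names (ResponseReporter, describeResponse) are made up for illustration, and it uses only the JDK's XML parser rather than the connector's own XMLDoc class. The idea is simply to try parsing the response as XML and, if that fails, report the raw body instead of dumping a stack trace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class ResponseReporter {
    /** Cap on how much of an unparseable body is echoed back. */
    private static final int MAX_RAW_CHARS = 500;

    /** Summarizes a response body; malformed XML is reported, not thrown. */
    static String describeResponse(int httpStatus, String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPEs so an HTML error page cannot trigger a DTD fetch.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing "[Fatal Error]".
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(
                new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return "HTTP " + httpStatus + ": XML response, root element <"
                + doc.getDocumentElement().getNodeName() + ">";
        } catch (SAXException | IOException e) {
            // Parsing failed: keep the (truncated) raw body so the real error is visible.
            String raw = body.length() > MAX_RAW_CHARS
                ? body.substring(0, MAX_RAW_CHARS) + "..."
                : body;
            return "HTTP " + httpStatus + ": non-XML response (" + e.getMessage()
                + "); raw body: " + raw;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser is misconfigured", e);
        }
    }

    public static void main(String[] args) {
        // An HTML error page with an unclosed <HR>, like the one seen in this thread.
        System.out.println(describeResponse(500,
            "<html><body><h1>HTTP Status 500</h1><HR size=\"1\"></body></html>"));
        // A well-formed Solr-style XML response.
        System.out.println(describeResponse(200,
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><lst name=\"responseHeader\"/></response>"));
    }
}
```

In a real connector the returned string would go into the log as a warning and into the job history, which is where you would look for it in the crawler UI.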
Karl

On Tue, Mar 15, 2011 at 3:42 AM, Karl Wright <[email protected]> wrote:
> It is hard to tell what you are seeing here because you need to also
> mention where you are seeing it. But it is unlikely to be a result of
> the way the POST is being done within the Solr Connector; that
> connector does not perform any XML encoding, so that is not what is
> failing. As I think you have discovered, it sounds like the problem
> is that somewhere deep in Solr something is going wrong and a 500
> error is being returned with non-XML contents. The Solr Connector
> attempts to parse the response as XML and fails. I've looked at the
> code; when this happens, a stack trace is dumped to stdout (which is
> not very helpful but is better than nothing). Ideally, the connector
> should dump the response into the log (as part of a warning), and also
> write the raw response into the history (as part of the results of the
> indexing attempt). So you should be able to see the actual error in
> the crawler UI by getting a simple history. I've opened a new ticket
> (CONNECTORS-168) to capture this work.
>
> Other than that, I would hazard that there is currently nothing
> actually wrong with the Solr connector at this time. There is an
> outstanding Jira ticket to port it to SolrJ (CONNECTORS-19), but based
> on how unreliable Solr has been of late maybe that's not such a great
> idea at the moment. It's certainly in wide use at this time and
> people have not found an actual problem with it.
>
> Thanks,
> Karl
>
> On Mon, Mar 14, 2011 at 10:49 PM, Fuad Efendi <[email protected]> wrote:
>>
>> I just noticed:
>> Currently, default for ManifoldCF is /update/extract, which corresponds to
>> SOLR Cell request handler.
>>
>> So...
>> It is EXTREMELY generic...
>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>
>> What happens is: we submit "field" which is HTML snippet (inside RSS), and
>> if that snippet is malformed...
>> SOLR responds with error message such as this:
>> <u>Unexpected character '-' (code 45) in external DTD subset; expected
>> closing '>' after ENTITY declaration at [row,col,system-id]:
>> [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>> from [row,col {unknown-source}]: [1,1]</u></p><p><b>description</b> <u>The
>> request sent by the client was syntactically incorrect (Unexpected
>> character '-' (code 45) in external DTD subset; expected closing '>' after
>> ENTITY declaration at [row,col,system-id]:
>> [81,5,"http://www.w3.org/TR/html4/strict.dtd"]
>>
>> And, SOLR response is malformed too, so that we have
>> [Fatal Error] :7:112: The element type "HR" must be terminated by the
>> matching end-tag "</HR>".
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>> error: The element type "HR" must be terminated by the matching end-tag
>> "</HR>"
>>
>> two exceptions:
>> 1. at SOLR because of malformed HTML such as
>> <my_rss_field>>bold<BOLD>/body<</my_rss_field>
>> 2. at ManifoldCF, because SOLR response is malformed
>>
>> Using SOLR Cell for RSS feeds... we probably need few types of SOLR
>> Connectors, or single type (but configurable); and it's much easier with
>> SOLRJ client... including troubleshooting... otherwise we should have unit
>> tests for void writeField(OutputStream out, String fieldName, String
>> fieldValue) and etc......
>>
>> I want to write new "connector" for my task, based on SOLRJ...
>>
>> -Fuad
>>
>> -----Original Message-----
>> From: Fuad Efendi [mailto:[email protected]]
>> Sent: March-14-11 10:34 PM
>> To: [email protected]
>> Subject: RE: SOLR
>>
>> It's not trunk version; I use (different) trunk versions in few production
>> sites...
>> in SOLR, path "/update" is defined in solrconfig.xml (and usually
>> user will copy it from "example" schema and "may be" modify):
>>
>>   <requestHandler name="/update"
>>                   class="solr.XmlUpdateRequestHandler">
>>
>> And, what ManifoldCF expects, which kind of "update" handler?!!
>>
>> That's why I suggest to use SOLRJ API instead... I noticed a lot of
>> low-level coding...
>>
>> What kind of SOLR protocol is expected? It is definitely not POST of XML
>> content:
>>
>>   /** Write a field */
>>   protected static void writeField(OutputStream out, String fieldName,
>>     String fieldValue)
>>     throws IOException
>>   {
>>     writePreamble(out);
>>     writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
>>
>>     byte[] tmp = fieldValue.getBytes("UTF-8");
>>     out.write(tmp, 0, tmp.length);
>>     writePostamble(out);
>>   }
>>
>> Do you expect "binary" handler on SOLR?
>>   <!-- Binary Update Request Handler
>>        http://wiki.apache.org/solr/javabin
>>     -->
>>   <requestHandler name="/update/javabin"
>>                   class="solr.BinaryUpdateRequestHandler" />
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:[email protected]]
>> Sent: March-14-11 7:58 PM
>> To: [email protected]
>> Subject: Re: SOLR
>>
>> The trunk version of Solr may have changed around how the extracting update
>> request handler works. It changes daily, so there is no way I can keep up
>> with it. Maybe it would be better to go back and use a known quantity.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <[email protected]> wrote:
>>>
>>> Default settings for ManifoldCF: /update/extract
>>> http://localhost:8080/solr/update/extract?commit=true
>>>
>>> And using browser, I see SOLR responds with malformed HTML containing
>>> non-closing <HR>...
>>>
>>> Fix:
>>> Update handler: /update
>>>
>>> -Fuad
>>>
>>> -----Original Message-----
>>> From: Fuad Efendi [mailto:[email protected]]
>>> Sent: March-14-11 6:17 PM
>>> To: [email protected]
>>> Subject: RE: SOLR
>>>
>>> Hi Karl,
>>>
>>> I verified (via browser),
>>> http://localhost:8080/solr/update?commit=true
>>>
>>> And response from SOLR:
>>> <?xml version="1.0" encoding="UTF-8"?> <response> <lst
>>> name="responseHeader"><int name="status">0</int><int
>>> name="QTime">15</int></lst> </response>
>>>
>>> The problem root is
>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>
>>> Everything is fine except I can't understand why we have "HR" from
>>> SOLR, do we have any multithreading issues? I believe I connect to
>>> SOLR, port 8080 is configured via console... may be somewhere else?
>>>
>>> I believe default setting for "Update handler:" at Connector screen is
>>> incorrect, it is /update/extract
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: March-14-11 6:00 PM
>>> To: [email protected]
>>> Subject: Re: SOLR
>>>
>>> This is because your solr setup is incorrect. The post to "solr" is
>>> returning HTML, not XML, so you are not actually communicating with
>>> Solr at all.
>>>
>>> In order for the Solr connector to work, you need to have the solr
>>> extracting update request handler present and configured. I am told
>>> that the latest release of Solr makes the jar with this code optional
>>> - it's a contrib jar that you have to separately download. If you are
>>> building solr off of trunk, then this should not be a problem.
>>>
>>> Karl
>>>
>>> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <[email protected]> wrote:
>>>> This exception, XML contains encoded HTML, and it doesn't happen with
>>>> standard Java 6 StAX parser:
>>>>
>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>> the matching end-tag "</HR>".
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>> error: The element type "HR" must be terminated by the matching
>>>> end-tag "</HR>".
>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>         at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>>>         at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must
>>>> be terminated by the matching end-tag "</HR>".
>>>>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>>>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>>>         ... 3 more
>>>>
>>>> -----Original Message-----
>>>> From: Fuad Efendi [mailto:[email protected]]
>>>> Sent: March-14-11 5:37 PM
>>>> To: [email protected]
>>>> Subject: RE: SOLR
>>>>
>>>> Thank you very much Karl,
>>>>
>>>> And I have first problem,
>>>> Starting crawler...
>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>> the matching end-tag "</HR>".
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>> error: The element type "HR" must be terminated by the matching
>>>> end-tag "</HR>".
>>>>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>
>>>> I am using RSS connector to crawl specific XML (containing
>>>> XML-encoded >HR< and other HTML tags). It doesn't happen with
>>>> standard StAX parser (Java 6)...
>>>>
>>>> Regarding (2), do you mean this interface method?
>>>>   /** View specification.
>>>>   * This method is called in the body section of a job's view page.
>>>>   * Its purpose is to present the output specification information to the user.
>>>>   * The coder can presume that the HTML that is output from this
>>>>   * configuration will be within appropriate <html> and <body> tags.
>>>>   *@param out is the output to which any HTML should be sent.
>>>>   *@param os is the current output specification for this job.
>>>>   */
>>>>   public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>>>     throws ManifoldCFException, IOException
>>>>
>>>> Thanks!
>>>>
>>>> -----Original Message-----
>>>> From: Karl Wright [mailto:[email protected]]
>>>> Sent: March-14-11 5:21 PM
>>>> To: [email protected]
>>>> Subject: Re: SOLR
>>>>
>>>> Hi Fuad,
>>>>
>>>> (1) "Arguments" are indeed optional key/value pairs, which are sent
>>>> to solr as part of the URL.
>>>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that
>>>> all jobs have; (b) tabs related to the repository connector's
>>>> management of the document specification information; and (c) tabs
>>>> related to the output connector's output specification information.
>>>> The Solr output connector's output specification information includes
>>>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>>>
>>>> Karl
>>>>
>>>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <[email protected]> wrote:
>>>>> Hi, any sample of how to use SOLR connector?
>>>>>
>>>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>>>
>>>>> Some questions:
>>>>>
>>>>> 1. Argument. Is it optional key=value pairs which can be sent
>>>>> to SOLR as part of HTTP GET/POST request?
>>>>>
>>>>> 2. I see code for “Connector”, and I see how to configure SOLR
>>>>> Output Connection. But how “Job” happens to know about <metadata> to
>>>>> <solr> mapping, is it generic (without dependency on SOLR)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Fuad
