Re: SOLR

Karl Wright Tue, 15 Mar 2011 08:14:59 -0700

I tested and checked in the patch, so you can just synch up and you'll get it.
Karl


On Tue, Mar 15, 2011 at 4:23 AM, Karl Wright <[email protected]> wrote:
> I've done what I think is the code for CONNECTORS-168 and attached a
> patch to that ticket.  Perhaps you could try it on your setup to see
> if the reporting of 500 errors improves.
>
> Karl
>
> On Tue, Mar 15, 2011 at 3:42 AM, Karl Wright <[email protected]> wrote:
>> It is hard to tell what you are seeing here because you need to also
>> mention where you are seeing it.  But it is unlikely to be a result of
>> the way the POST is being done within the Solr Connector; that
>> connector does not perform any XML encoding, so that is not what is
>> failing.  As I think you have discovered, it sounds like the problem
>> is that somewhere deep in Solr something is going wrong and a 500
>> error is being returned with non-XML contents.  The Solr Connector
>> attempts to parse the response as XML and fails.  I've looked at the
>> code; when this happens, a stack trace is dumped to stdout (which is
>> not very helpful but is better than nothing).  Ideally, the connector
>> should dump the response into the log (as part of a warning), and also
>> write the raw response into the history (as part of the results of the
>> indexing attempt).  So you should be able to see the actual error in
>> the crawler UI by getting a simple history.  I've opened a new ticket
>> (CONNECTORS-168) to capture this work.
>>
>> Other than that, I would hazard that there is currently nothing
>> actually wrong with the Solr connector at this time.  There is an
>> outstanding Jira ticket to port it to SolrJ (CONNECTORS-19), but based
>> on how unreliable Solr has been of late maybe that's not such a great
>> idea at the moment.  It's certainly in wide use at this time and
>> people have not found an actual problem with it.
>>
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Mon, Mar 14, 2011 at 10:49 PM, Fuad Efendi <[email protected]> wrote:
>>>
>>> I just noticed:
>>> Currently, default for ManifoldCF is /update/extract, which corresponds to
>>> SOLR Cell request handler.
>>>
>>> So...
>>> It is EXTREMELY generic...
>>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>>
>>> What happens is: we submit "field" which is HTML snippet (inside RSS), and
>>> if that snippet is malformed... SOLR responds with error message such as
>>> this:
>>> <u>Unexpected character '
>>> -' (code 45) in external DTD subset; expected closing '&gt;' after ENTITY
>>> declaration  at [row,col,system-id]:
>>> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>>>  from [row,col {unknown-source}]: [1,1]</u></p><p><b>description</b> <u>The
>>> request sent by the client was syntactically incorrect (Unexpected character
>>> '-' (code 45) in external DTD subset; expected closing '&gt;' after
>>> ENTITY declaration  at [row,col,system-id]:
>>> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>>>
>>> And, SOLR response is malformed too, so that we have
>>> [Fatal Error] :7:112: The element type "HR" must be terminated by the
>>> matching end-tag "</HR>".
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>> error: The element type "HR" must be terminated by the matching end-tag
>>> "</HR>"
>>>
>>>
>>> two exceptions:
>>> 1. at SOLR because of malformed HTML such as
>>> <my_rss_field>&gt;bold&lt;BOLD&gt/body&lt;</my_rss_field>
>>> 2. at ManifoldCF, because SOLR response is malformed
>>>
>>>
>>> Using SOLR Cell for RSS feeds... we probably need few types of SOLR
>>> Connectors, or single type (but configurable); and it's much easier with
>>> SOLRJ client... including troubleshooting... otherwise  we should have unit
>>> tests for void writeField(OutputStream out, String fieldName, String
>>> fieldValue) and etc......
>>>
>>>
>>> I want to write new "connector" for my task, based on SOLRJ...
>>>
>>>
>>> -Fuad
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Fuad Efendi [mailto:[email protected]]
>>> Sent: March-14-11 10:34 PM
>>> To: [email protected]
>>> Subject: RE: SOLR
>>>
>>>
>>> It's not trunk version; I use (different) trunk versions in few production
>>> sites... in SOLR, path "/update" is defined in solrconfig.xml (and usually
>>> user will copy it from "example" schema and "may be" modify):
>>>
>>>  <requestHandler name="/update"
>>>                  class="solr.XmlUpdateRequestHandler">
>>>
>>>
>>> And, what ManifoldCF expects, which kind of "update" handler?!!
>>>
>>> That's why I suggest to use SOLRJ API instead... I noticed a lot of
>>> low-level coding...
>>>
>>>
>>>
>>> What kind of SOLR protocol is expected? It is definitely not POST of XML
>>> content:
>>>
>>>
>>>  /** Write a field */
>>>  protected static void writeField(OutputStream out, String fieldName,
>>>      String fieldValue)
>>>    throws IOException
>>>  {
>>>    writePreamble(out);
>>>    writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
>>>
>>>    byte[] tmp = fieldValue.getBytes("UTF-8");
>>>    out.write(tmp, 0, tmp.length);
>>>    writePostamble(out);
>>>  }
>>>
>>>
>>>
>>> Do you expect "binary" handler on SOLR?
>>>  <!-- Binary Update Request Handler
>>>       http://wiki.apache.org/solr/javabin
>>>    -->
>>>  <requestHandler name="/update/javabin"
>>>                  class="solr.BinaryUpdateRequestHandler" />
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: March-14-11 7:58 PM
>>> To: [email protected]
>>> Subject: Re: SOLR
>>>
>>> The trunk version of Solr may have changed around how the extracting update
>>> request handler works.  It changes daily, so there is no way I can keep up
>>> with it.  Maybe it would be better to go back and use a known quantity.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <[email protected]> wrote:
>>>>
>>>> Default settings for ManifoldCF: /update/extract
>>>> http://localhost:8080/solr/update/extract?commit=true
>>>>
>>>> And using browser, I see SOLR responds with malformed HTML containing
>>>> non-closing <HR>...
>>>>
>>>> Fix:
>>>> Update handler:  /update
>>>>
>>>>
>>>> -Fuad
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Fuad Efendi [mailto:[email protected]]
>>>> Sent: March-14-11 6:17 PM
>>>> To: [email protected]
>>>> Subject: RE: SOLR
>>>>
>>>> Hi Karl,
>>>>
>>>> I verified (via browser),
>>>> http://localhost:8080/solr/update?commit=true
>>>>
>>>> And response from SOLR:
>>>> <?xml version="1.0" encoding="UTF-8"?> <response> <lst
>>>> name="responseHeader"><int name="status">0</int><int
>>>> name="QTime">15</int></lst> </response>
>>>>
>>>> The problem root is
>>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>>
>>>>
>>>> Everything is fine except I can't understand why we have "HR" from
>>>> SOLR, do we have any multithreading issues? I believe I connect to
>>>> SOLR, port 8080 is configured via console... may be somewhere else?
>>>>
>>>> I believe default setting for "Update handler:" at Connector screen is
>>>> incorrect, it is /update/extract
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Karl Wright [mailto:[email protected]]
>>>> Sent: March-14-11 6:00 PM
>>>> To: [email protected]
>>>> Subject: Re: SOLR
>>>>
>>>> This is because your solr setup is incorrect.  The post to "solr" is
>>>> returning HTML, not XML, so you are not actually communicating with
>>>> Solr at all.
>>>>
>>>> In order for the Solr connector to work, you need to have the solr
>>>> extracting update request handler present and configured.  I am told
>>>> that the latest release of Solr makes the jar with this code optional
>>>> - it's a contrib jar that you have to separately download.  If you are
>>>> building solr off of trunk, then this should not be a problem.
>>>>
>>>> Karl
>>>>
>>>> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <[email protected]> wrote:
>>>>> This exception, XML contains encoded HTML, and it doesn't happen with
>>>>> standard Java 6 StAX parser:
>>>>>
>>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>>> the matching end-tag "</HR>".
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>>> error: The element type "HR" must be terminated by the matching
>>>>> end-tag "</HR>".
>>>>>        at
>>>>> org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>>        at
>>>>> org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>>        at
>>>>> org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>>>>        at
>>>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must
>>>>> be terminated by the matching end-tag "</HR>".
>>>>>        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>>>        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown
>>>>> Source)
>>>>>        at
>>>>> javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>>>>        at
>>>>> org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>>>>        ... 3 more
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Fuad Efendi [mailto:[email protected]]
>>>>> Sent: March-14-11 5:37 PM
>>>>> To: [email protected]
>>>>> Subject: RE: SOLR
>>>>>
>>>>> Thank you very much Karl,
>>>>>
>>>>> And I have first problem,
>>>>> Starting crawler...
>>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>>> the matching end-tag "</HR>".
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>>> error: The element type "HR" must be terminated by the matching
>>>>> end-tag "</HR>".
>>>>>        at
>>>>> org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>>        at
>>>>> org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>>
>>>>> I am using RSS connector to crawl specific XML (containing
>>>>> XML-encoded &gt;HR&lt; and other HTML tags). It doesn't happen with
>>>>> standard StAX parser (Java 6)...
>>>>>
>>>>>
>>>>> Regarding (2), do you mean this interface method?
>>>>>  /** View specification.
>>>>>  * This method is called in the body section of a job's view page.
>>>>>  * Its purpose is to present the output specification information to the
>>>>>  * user.
>>>>>  * The coder can presume that the HTML that is output from this
>>>>>  * configuration will be within appropriate <html> and <body> tags.
>>>>>  *@param out is the output to which any HTML should be sent.
>>>>>  *@param os is the current output specification for this job.
>>>>>  */
>>>>>  public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>>>>    throws ManifoldCFException, IOException
>>>>>
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Karl Wright [mailto:[email protected]]
>>>>> Sent: March-14-11 5:21 PM
>>>>> To: [email protected]
>>>>> Subject: Re: SOLR
>>>>>
>>>>> Hi Fuad,
>>>>>
>>>>> (1) "Arguments" are indeed optional key/value pairs, which are sent
>>>>> to solr as part of the URL.
>>>>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that
>>>>> all jobs have; (b) tabs related to the repository connector's
>>>>> management of the document specification information; and (c) tabs
>>>>> related to the output connector's output specification information.
>>>>> The Solr output connector's output specification information includes
>>>>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <[email protected]> wrote:
>>>>>> Hi, any sample of how to use SOLR connector?
>>>>>>
>>>>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>>>>
>>>>>>
>>>>>>
>>>>>> Some questions:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1.       Argument. Is it optional key=value pairs which can be sent
>>>>>> to SOLR as part of HTTP GET/POST request?
>>>>>>
>>>>>> 2.       I see code for “Connector”, and I see how to configure SOLR
>>>>>> Output Connection. But how “Job” happens to know about <metadata> to
>>>>>> <solr> mapping, is it generic (without dependency on SOLR)?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Fuad
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
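
The failure discussed in this thread (Solr answering with an HTML error page that the connector then tries to parse as XML) can be illustrated with a small standalone sketch. It is not the ManifoldCF code and not the CONNECTORS-168 patch; the class name SolrResponseCheck, its methods, and the 200-character limit are hypothetical, chosen only for illustration. The idea is to check whether the body is well-formed XML first, and to keep the raw text for the log or history when it is not.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of ManifoldCF or Solr.
final class SolrResponseCheck {
  // Assumed cap on how much of a raw body is kept for reporting.
  private static final int MAX_LOGGED_CHARS = 200;

  /** Returns true if the body parses as well-formed XML. */
  static boolean isWellFormedXml(String body) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors, so failures reach the catch
      // block below instead of being handled by the parser's own handler.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(
          new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (SAXException | IOException | ParserConfigurationException e) {
      return false;
    }
  }

  /** Builds a one-line summary that keeps the raw text of non-XML bodies. */
  static String describe(int status, String body) {
    if (isWellFormedXml(body)) {
      return "HTTP " + status + " (XML body)";
    }
    String raw = body.length() > MAX_LOGGED_CHARS
        ? body.substring(0, MAX_LOGGED_CHARS) + "..."
        : body;
    return "HTTP " + status + " (non-XML body): " + raw;
  }

  public static void main(String[] args) {
    // A normal Solr-style XML response.
    System.out.println(describe(200,
        "<response><int name=\"status\">0</int></response>"));
    // An HTML error page with an unclosed <HR>, as seen in this thread.
    System.out.println(describe(500,
        "<html><body><h1>HTTP Status 500</h1><HR size=\"1\">"
        + "<p>error</p></body></html>"));
  }
}
```

With a check like this, an unclosed <HR> in a servlet container's error page shows up as a readable "non-XML body" message next to the HTTP status, rather than as an XML parsing exception.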
