CONNECTORS-171.

Fuad, I think it's perfectly reasonable to modify XMLWriterContext to
simply add proper entity-escaping of characters.  Would you like to
submit a patch?

Karl
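
For reference, a minimal sketch of the kind of entity-escaping discussed in
this thread. It is not the actual patch: the class name EscapingSketch and the
helper writeEscaped are hypothetical, and only the three characters that most
commonly break character data (&, <, >) are handled. An override of
tagContents(char[] ch, int start, int length) could presumably delegate to a
helper like this instead of calling theWriter.write(ch,start,length) directly;
the exception handling from the original method is left out here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical illustration only; not part of ManifoldCF.
public class EscapingSketch {

    /**
     * Writes ch[start .. start+length) to out, replacing the XML-special
     * characters &, < and > with their predefined entities.
     */
    static void writeEscaped(Writer out, char[] ch, int start, int length)
            throws IOException {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            switch (c) {
                case '&': out.write("&amp;"); break;
                case '<': out.write("&lt;");  break;
                case '>': out.write("&gt;");  break;
                default:  out.write(c);       // all other characters pass through
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Parsed element text from the example in this thread: the parser has
        // already turned &lt;H1&gt; into <H1> by the time it reaches the writer.
        char[] text = "<H1>Header".toCharArray();
        StringWriter out = new StringWriter();
        writeEscaped(out, text, 0, text.length);
        System.out.println(out); // prints &lt;H1&gt;Header
    }
}
```

With escaping applied on output, the temporary file contains the same escaped
text the feed had, so it stays well-formed when it is parsed again.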


On Fri, Mar 25, 2011 at 12:12 AM, Karl Wright <[email protected]> wrote:
> Hmm, no, I'd call this a bug in the current connector.  You can give
> it a feed that's perfectly valid, and if you are running in
> "dechromed" mode the description that gets indexed might well be
> corrupted.
>
> I'll create a ticket.
>
> Karl
>
>
> On Thu, Mar 24, 2011 at 10:41 PM, Fuad Efendi <[email protected]> wrote:
>> Not a bug with current RSS connector, but something probably important...
>>
>> The current RSS connector uses XMLFileContext for temporary XML(?), and here
>> problems may happen if <description> and <content> contain sub-elements...
>> but in our specific use case it is an HTML snippet, and we don't consider it
>> XML, so unescaped characters are natural...
>>
>> So I think there are no problems (with the current RSS specs), but we might
>> have problems in the future with other use cases such as:
>> <description>
>>        <sub-description-1>     &lt;H1&gt;Header </sub-description-1>
>> </description>
>>
>>
>> Output to temp. file will be malformed XML:
>>        <sub-description-1>     <H1> Header </sub-description-1>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:[email protected]]
>> Sent: March-24-11 10:26 PM
>> To: [email protected]
>> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>>
>> Ok, although I am curious whether this is a bug with a current connector?
>> Or is this something new you were trying to do?
>>
>> Karl
>>
>> On Thu, Mar 24, 2011 at 10:21 PM, Fuad Efendi <[email protected]> wrote:
>>> Hi Karl, I think initial message was improperly (re)formatted... I
>>> suspect connector-user allows HTML, and connector-dev allows only plain
>>> text.
>>>
>>> The class XMLWriterContext, method tagContents(char[] ch, int start, int
>>> length) should escape special characters before writing to Writer...
>>> beginTag and endTag already do that; obviously this class is needed to
>>> output XML.
>>> Fortunately it is easy to extend this class in "connector" plugin and
>>> override this method.
>>>
>>>
>>>  /** This method is meant to be extended by classes that extend this class */
>>>  protected void tagContents(char[] ch, int start, int length)
>>>    throws ManifoldCFException
>>>  {
>>>    try
>>>    {
>>>      theWriter.write(ch,start,length);
>>>    }
>>>    catch (java.net.SocketTimeoutException e) ... ... ...
>>>
>>>
>>> -Fuad
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: March-24-11 10:10 PM
>>> To: [email protected]
>>> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>>>
>>> Could you resend your previous message?  I don't think it made it
>>> through; perhaps you were not signed up for the list at that point.
>>> This is the first message of this thread that was posted.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Thu, Mar 24, 2011 at 7:22 PM, Fuad Efendi <[email protected]> wrote:
>>>> I just found it.
>>>>
>>>>  /** This method is meant to be extended by classes that extend this class */
>>>>  protected void tagContents(char[] ch, int start, int length)
>>>>    throws ManifoldCFException
>>>>  {
>>>>    try
>>>>    {
>>>>      theWriter.write(ch,start,length);
>>>>    }
>>>>    catch (java.net.SocketTimeoutException e)
>>>> ...
>>>>
>>>> And we are using temp files with RSS connector.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I tried to split a big feed into "entities", each stored as an XML
>>>> document, but I found that some XML-escaped characters get unescaped (for
>>>> instance, RSS may contain an HTML snippet as the value of an element).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -Fuad
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
