CONNECTORS-171. Fuad, I think it's perfectly reasonable to modify XMLWriterContext to simply add proper entity-escaping of characters. Would you like to submit a patch?
Karl

On Fri, Mar 25, 2011 at 12:12 AM, Karl Wright <[email protected]> wrote:
> Hmm, no, I'd call this a bug in the current connector. You can give
> it a feed that's perfectly valid, and if you are running in
> "dechromed" mode the description that gets indexed might well be
> corrupted.
>
> I'll create a ticket.
>
> Karl
>
>
> On Thu, Mar 24, 2011 at 10:41 PM, Fuad Efendi <[email protected]> wrote:
>> Not a bug with current RSS connector, but something probably important...
>>
>> Current RSS connector uses XMLFileContext for temporary XML(?), and here
>> problems may happen if <description> and <content> contain sub-elements...
>> but in our specific use case it is HTML snippet, and we don't consider it
>> XML, so that unescaped characters are natural...
>>
>> So I think there are no any problems (with current RSS specs), but we might
>> have problems in the future with another use cases such as:
>> <description>
>> <sub-description-1> <H1>Header </sub-description-1>
>> </description>
>>
>>
>> Output to temp. file will be malformed XML:
>> <sub-description-1> <H1> Header </sub-description-1>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:[email protected]]
>> Sent: March-24-11 10:26 PM
>> To: [email protected]
>> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>>
>> Ok, although I am curious whether this is a bug with a current connector?
>> Or is this something new you were trying to do?
>>
>> Karl
>>
>> On Thu, Mar 24, 2011 at 10:21 PM, Fuad Efendi <[email protected]> wrote:
>>> Hi Karl, I think initial message was improperly (re)formatted... I
>>> suspect connector-user allows HTML, and connector-dev allows only plain
>>> text.
>>>
>>> The class XMLWriterContext, method tagContents(char[] ch, int start,
>>> int length) should escape special characters before writing to Writer...
>>> beginTag and endTag already do that; obviously this class is needed to
>>> output XML.
>>> Fortunately it is easy to extend this class in "connector" plugin and
>>> override this method.
>>>
>>>
>>> /** This method is meant to be extended by classes that extend this
>>> class */
>>> protected void tagContents(char[] ch, int start, int length)
>>>   throws ManifoldCFException
>>> {
>>>   try
>>>   {
>>>     theWriter.write(ch,start,length);
>>>   }
>>>   catch (java.net.SocketTimeoutException e)
>>>   ... ... ...
>>>
>>>
>>> -Fuad
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:[email protected]]
>>> Sent: March-24-11 10:10 PM
>>> To: [email protected]
>>> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>>>
>>> Could you resend your previous message? I don't think it made it
>>> through; perhaps you were not signed up for the list at that point.
>>> This is the first message of this thread that was posted.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Thu, Mar 24, 2011 at 7:22 PM, Fuad Efendi <[email protected]> wrote:
>>>> I just found it.
>>>>
>>>> /** This method is meant to be extended by classes that extend this
>>>> class */
>>>> protected void tagContents(char[] ch, int start, int length)
>>>>   throws ManifoldCFException
>>>> {
>>>>   try
>>>>   {
>>>>     theWriter.write(ch,start,length);
>>>>   }
>>>>   catch (java.net.SocketTimeoutException e)
>>>>   ...
>>>>
>>>>
>>>> And we are using temp files with RSS connector.
>>>>
>>>>
>>>> I tried to split big feed on "entities", stored as an XML Documents,
>>>> but I found some XML-escaped characters will be unescaped (for
>>>> instance, RSS may contain HTML snippet as a value of an element)
>>>>
>>>>
>>>> -Fuad
>>>>
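
For reference, the entity-escaping discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: XmlEscapeSketch, escape, and writeEscaped are hypothetical names and not ManifoldCF code, and escaping all five predefined XML entities (rather than only & and <) is an assumption about what "proper" escaping should cover here.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper, NOT part of ManifoldCF: escapes XML special
// characters in a slice of a char array before it is written out.
final class XmlEscapeSketch {
  private XmlEscapeSketch() {}

  /** Returns ch[start .. start+length) with the five predefined XML entities escaped. */
  static String escape(char[] ch, int start, int length) {
    StringBuilder sb = new StringBuilder(length + 16);
    for (int i = start; i < start + length; i++) {
      char c = ch[i];
      switch (c) {
        case '&':  sb.append("&amp;");  break;
        case '<':  sb.append("&lt;");   break;
        case '>':  sb.append("&gt;");   break;
        case '"':  sb.append("&quot;"); break;
        case '\'': sb.append("&apos;"); break;
        default:   sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Stand-in for what an escaping tagContents body might do: escape, then write. */
  static void writeEscaped(Writer out, char[] ch, int start, int length) throws IOException {
    out.write(escape(ch, start, length));
  }
}
```

In a connector, an override of tagContents(char[] ch, int start, int length) might call something like escape(...) and pass the result to theWriter.write(...), keeping the existing exception handling around the write unchanged.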
