Hmm, no, I'd call this a bug in the current connector. You can give it a feed that's perfectly valid, and if you are running in "dechromed" mode the description that gets indexed might well be corrupted.
I'll create a ticket.

Karl

On Thu, Mar 24, 2011 at 10:41 PM, Fuad Efendi <[email protected]> wrote:
> Not a bug with current RSS connector, but something probably important...
>
> Current RSS connector uses XMLFileContext for temporary XML(?), and here
> problems may happen if <description> and <content> contain sub-elements...
> but in our specific use case it is HTML snippet, and we don't consider it
> XML, so that unescaped characters are natural...
>
> So I think there are no any problems (with current RSS specs), but we might
> have problems in the future with another use cases such as:
> <description>
> <sub-description-1> <H1>Header </sub-description-1>
> </description>
>
> Output to temp. file will be malformed XML:
> <sub-description-1> <H1> Header </sub-description-1>
>
> -----Original Message-----
> From: Karl Wright [mailto:[email protected]]
> Sent: March-24-11 10:26 PM
> To: [email protected]
> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>
> Ok, although I am curious whether this is a bug with a current connector?
> Or is this something new you were trying to do?
>
> Karl
>
> On Thu, Mar 24, 2011 at 10:21 PM, Fuad Efendi <[email protected]> wrote:
>> Hi Karl, I think initial message was improperly (re)formatted... I
>> suspect connector-user allows HTML, and connector-dev allows only plain
>> text.
>>
>> The class XMLWriterContext, method tagContents(char[] ch, int start,
>> int length) should escape special characters before writing to Writer...
>> beginTag and endTag already do that; obviously this class is needed to
>> output XML.
>> Fortunately it is easy to extend this class in "connector" plugin and
>> override this method.
>>
>> /** This method is meant to be extended by classes that extend this
>> class */
>> protected void tagContents(char[] ch, int start, int length)
>>   throws ManifoldCFException
>> {
>>   try
>>   {
>>     theWriter.write(ch,start,length);
>>   }
>>   catch (java.net.SocketTimeoutException e)
>>   ...
>>   ...
>>   ...
>>
>> -Fuad
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:[email protected]]
>> Sent: March-24-11 10:10 PM
>> To: [email protected]
>> Subject: Re: XMLWriterContext: tagContext doesn't escape chars
>>
>> Could you resend your previous message? I don't think it made it
>> through; perhaps you were not signed up for the list at that point.
>> This is the first message of this thread that was posted.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Mar 24, 2011 at 7:22 PM, Fuad Efendi <[email protected]> wrote:
>>> I just found it.
>>>
>>> /** This method is meant to be extended by classes that extend this
>>> class */
>>> protected void tagContents(char[] ch, int start, int length)
>>>   throws ManifoldCFException
>>> {
>>>   try
>>>   {
>>>     theWriter.write(ch,start,length);
>>>   }
>>>   catch (java.net.SocketTimeoutException e)
>>>   ...
>>>
>>> And we are using temp files with RSS connector.
>>>
>>> I tried to split big feed on "entities", stored as an XML Documents,
>>> but I found some XML-escaped characters will be unescaped (for
>>> instance, RSS may contain HTML snippet as a value of an element)
>>>
>>> -Fuad
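
For reference, a minimal sketch of the kind of escaping discussed in this thread. It assumes the fix amounts to replacing &, < and > with their entities before the characters reach the Writer. The class and method names (EscapingSketch, writeEscaped, escape) are hypothetical and not part of ManifoldCF; a real override of tagContents would also need to keep the exception handling that the quoted snippet elides.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical standalone helper; it does not extend XMLWriterContext. */
class EscapingSketch {

  /**
   * Writes ch[start .. start+length) to out, replacing the characters
   * that are unsafe in XML element content with entity references.
   */
  static void writeEscaped(Writer out, char[] ch, int start, int length)
      throws IOException {
    for (int i = start; i < start + length; i++) {
      char c = ch[i];
      switch (c) {
        case '&':
          out.write("&amp;");
          break;
        case '<':
          out.write("&lt;");
          break;
        case '>':
          out.write("&gt;");
          break;
        default:
          out.write(c);
      }
    }
  }

  /** Convenience wrapper for trying the escaping on a whole string. */
  static String escape(String s) {
    StringWriter w = new StringWriter();
    try {
      writeEscaped(w, s.toCharArray(), 0, s.length());
    } catch (IOException e) {
      // StringWriter does not perform I/O, so this is not expected.
      throw new IllegalStateException(e);
    }
    return w.toString();
  }
}
```

With something like this in place, an HTML snippet inside <description> would be written to the temp file as character data rather than as stray markup, so the file should stay well-formed when it is parsed again.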
