According to Neal Richter:
> On Wed, 19 Nov 2003, Gilles Detillieux wrote:
> > According to Neal Richter:
> > > > This error is happening in the DISPLAY of the excerpts... so it
> > > > seems like looking for &#XXX; patterns and NOT encoding them before
> > > > display is a reasonable strategy... the browser will decide how to display it.
> >
> > That would be a reasonable compromise, but note that it is a compromise.
> > For example, if an HTML document has something like "use &amp;#8482; in
> > your HTML to encode a &#8482; character", this will end up in db.excerpts
> > as "use &#8482; in your HTML to encode a &#8482; character".  At that point,
> > htsearch has no way of knowing that the first occurrence was originally
> > different than the second.  It comes down to a decision between encoding
> > both or leaving both as-is.
>
> Eh... why not explicitly look for patterns like '&#8482;' and leave
> them as-is?

Do you mean in htdig or in htsearch?  The point I'm making is that by the
time htsearch reads the excerpt, it's already too late.  You could of course
load up the SGML decoding in htdig with all sorts of exceptions, so it would
convert &amp;amp; to &amp;, but not if the &amp; is followed by #(some_number).
It comes down to a question of how elaborate you want to get with the
exceptions.

Not converting any SGML entities at all isn't really a viable option,
because then there will be problems finding matches.  We tried that in
some limited capacity in 3.1.x with the translate_amp attribute, et al.,
but ended up dropping that idea because it caused more problems than it
solved.  If you want the whole story about that, I'd recommend searching
the archives for the numerous prior discussions.

> For the purpose of excerpts.... I think we may not need to do encoding at
> all... so that there is no conflict between store and display.
>
> Encoding SGML entities is beneficial for searchability via the
> db.words.db, but I don't see how it is a benefit for db.excerpts.

Well, if you don't mind that the excerpt highlighting won't find the SGML
entities, then no, there isn't any other benefit.  We've been through this
too with 3.1.x, and decided excerpt highlighting was important enough to
get it to work consistently.  If you can find a better way than what we
worked out back then, go for it.  Just be sure to test what you develop,
because it seems you're not grasping the pitfalls I tried to point out in
my last e-mail on Wednesday.

> I don't want to go tearing up code that is there for a reason.... please
> elaborate.

I'm not sure how I can explain myself more clearly than I did on Wednesday.
I suggest you have a look at htsearch/Display.cc and htlib/StringMatch.cc
to see how the code finds the words to highlight.  It's a separate matching
mechanism from the search of db.words.db!
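
To make the "exceptions" trade-off a bit more concrete, here is a minimal
sketch in Python.  It is not ht://Dig's C++ decoder and makes no claim about
how that code is organised: the names NAMED, decode_naive and
decode_with_exception are hypothetical, the entity table is deliberately
tiny, and the sample string is only an illustration of an escaped numeric
reference sitting next to a real one.

```python
import re

# Hypothetical, deliberately tiny entity table (illustration only).
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Matches a named entity (&name;) or a decimal numeric reference (&#123;).
ENTITY = re.compile(r"&(#\d+|[A-Za-z]+);")

# Matches the tail of a numeric reference, e.g. "#8482;".
NUMERIC_TAIL = re.compile(r"#\d+;")


def decode_naive(text):
    """Decode known named entities; leave everything else untouched."""
    def repl(m):
        return NAMED.get(m.group(1), m.group(0))
    return ENTITY.sub(repl, text)


def decode_with_exception(text):
    """Like decode_naive, but keep '&amp;' encoded when decoding it would
    produce something that looks like a numeric reference."""
    def repl(m):
        if m.group(1) == "amp" and NUMERIC_TAIL.match(m.string, m.end()):
            return m.group(0)  # the exception: leave '&amp;' as stored
        return NAMED.get(m.group(1), m.group(0))
    return ENTITY.sub(repl, text)


if __name__ == "__main__":
    src = "use &amp;#8482; in your HTML to encode a &#8482; character"
    # Both occurrences come out identical, so a later reader cannot tell
    # which one was escaped in the source document.
    print(decode_naive(src))
    # The escaped occurrence stays distinguishable from the real reference.
    print(decode_with_exception(src))
```

With the exception in place the two occurrences stay distinct in what gets
stored, but the display side then has to know that a stored &amp;amp; must
be passed through rather than encoded again, and the highlighting matcher
sees the encoded form instead of a bare ampersand.  That is the sort of
knock-on effect I mean by asking how elaborate the exceptions should get.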

> > The other option would be for htdig to replace the & lead-in character
> > for undecoded entities into some other, non-ambiguous lead-in character in
> > the database, so that htsearch could always distinguish between the two.
> > But what character could we use, that wouldn't conflict with anything
> > else?
>
> For that matter we could be storing excerpts marked up via XML and
> process this XML as appropriate during display.
>
> A bigger project would be to make the entire search-query process
> produce an XML document that we could render to HTML via XSLT.  This would
> allow pretty magnificent user customization of the search results.

Sounds like a good idea, but we're talking major coding effort here.
Who's up for it?  (I don't have the time!)

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev