OK, here is what I have found: the &#XXX; entities are OK inside the db files.
They are munged improperly during htsearch display of the excerpt. ResultFetch::hilight is the driving function, and it calls HtSGMLCodec::instance()->decode(s), which does the improper conversion ™ --> &#153;

I think that this change to HtSGMLCodec.h looks important:

http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/htcommon/HtSGMLCodec.h?r1=1.1&r2=1.1.2.1

Tue Mar 28 04:06:34 2000 UTC (3 years, 7 months ago) by ghutchis
"Differentiate between codec used for &foo; and numeric form &#nnn; Make sure encoding goes through both but decoding only goes through the preferred text form."

This part of the code is pretty cheese-whizzy, so attention Geoff! Any insights? I am assuming that at some point this worked properly.

I'll pound on it some more tomorrow. It looks like the replacements array in the myTextWordCodec object of the singleton HtSGMLCodec object is populated improperly, or there is some problem in the order of calls.

HtSGMLCodec.cc

    // Similar to the HtWordCodec class. Each string may contain
    // zero or more of words from the lists. Here we need to run
    // it through two codecs because we might have two different forms
    inline String encode(const String &uncoded) const
      { return myTextWordCodec->encode(myNumWordCodec->encode(uncoded)); }

    // But we only want to decode into one form i.e. &foo; NOT &#nnn;
    String decode(const String &coded) const
      { return myTextWordCodec->decode(coded); }

Intuitively I would think that if encode is as above, decode should be the reverse of encode:

    return myNumWordCodec->decode(myTextWordCodec->decode(coded));

But this makes the problem worse! &#XXX; --> &amp;XXX;

Thanks.

Neal

On Mon, 17 Nov 2003, Neal Richter wrote:

>
> I am seeing some HTML entities show up in search result 'blurbs'.
>
> See below. Basically any entity of this form &#XXX; gets translated to &amp;#XXX;
>
> ™ --> &#153;
>
> This only happens for numbered entities below 160.
>
>   -->
> © --> ©
> ® --> ®
>
> I'm digging for this code..
> looks like
>
> Is there a fix for this in 3.1.X?? Anyone complain about this before????
>
> Thanks!
>
> Example Page:
>
> <HTML>
> <TITLE>Test page
> </TITLE>
> <BODY>
> <h1>HTDIG ™</h1>
> <h2>Use our software — to enhance your website</h2>
> <BR>
> HTDig ™ 3.2.0
> <BR>
>
> The ht://Dig system is a complete world wide web indexing and searching system
> for a domain or intranet.
>
> <BR>
> <BR>
> 1 ‹2 < 3
> <BR>
> © 2003 Neal Richter
> <BR>
> © 2003 HtDig Group
> </BODY>
> </HTML>
>
> Search results:
>
> [EMAIL PROTECTED] htdig-3.2.0b5-bin]$ cgi-bin/htsearch -c conf/htdig.conf
> Enter value for words: htdig
> Content-type: text/html
>
> Enter value for format: long
> <dl><dt><strong><a
> href="http://westfork.rightnow.com/data/test/test2.html">Test page
> </a></strong><img src="/htdig/star.gif" alt="*"><img src="/htdig/star.gif"
> alt="*"><img src="/htdig/star.gif" alt="*"><img src="/htdig/star.gif"
> alt="*">
> </dt><dd> <strong>HTDIG</strong> &#153; USE OUR SOFTWARE &#151; TO
> ENHANCE YOUR WEBSITE <strong>HTDig</strong> &#153; 3.2.0 The ht://Dig
> system is a complete world wide web indexing and searching system for a
> domain or intranet. 1 &#139;2 < 3 © 2003
> Neal Richter © 2003 <strong>HtDig</strong> Group <br>
> <em><a
> href="http://westfork.rightnow.com/data/test/test2.html">http://westfork.rightnow.com/data/test/test2.html</a></em>
> <font size="-1">11/17/03, 384 bytes</font>
> </dd></dl>
>
> Neal Richter
> Knowledgebase Developer
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: GoToMyPC
> GoToMyPC is the fast, easy and secure way to access your computer from
> any Web browser or wireless device. Click here to Try it Free!
> https://www.gotomypc.com/tr/OSDN/AW/Q4_2003/t/g22lp?Target=mm/g22lp.tmpl
> _______________________________________________
> ht://Dig Developer mailing list:
> [EMAIL PROTECTED]
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev
>

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
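
The asymmetry discussed above (encode runs through both the numeric and the text codec, decode only through the text one) can be sketched with a toy model. This is not the ht://Dig code: `translate`, `kNumTable`, `kTextTable` and the free functions `encode`/`decode` are hypothetical names, and the table contents are assumptions. In particular it assumes that entity 153 is missing from both tables and that a bare "&" is mapped to "&amp;" on decode; neither has been checked against the real HtSGMLCodec tables, so treat this only as one way the reported symptom could arise.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a word codec: (entity, character) pairs.
// Toy model only, not the ht://Dig HtWordCodec or its real tables.
using Table = std::vector<std::pair<std::string, std::string>>;

// Single left-to-right pass; text that was just substituted is never rescanned.
// forward == true maps entity -> character, false maps character -> entity.
std::string translate(const std::string &in, const Table &table, bool forward)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        for (const auto &entry : table) {
            const std::string &from = forward ? entry.first : entry.second;
            const std::string &to = forward ? entry.second : entry.first;
            if (!from.empty() && in.compare(i, from.size(), from) == 0) {
                out += to;
                i += from.size();
                matched = true;
                break;
            }
        }
        if (!matched)
            out += in[i++];  // no table entry starts here: copy one byte
    }
    return out;
}

// Assumed tables: 169 (copyright sign) is known in both forms,
// 153 is deliberately absent from both.
const std::string kCopy = "\xA9";  // Latin-1 byte for the copyright sign
const Table kNumTable = {{"&#169;", kCopy}, {"&#38;", "&"}};
const Table kTextTable = {{"&copy;", kCopy}, {"&amp;", "&"}};

// Same shape as the quoted header: encode goes through both codecs...
std::string encode(const std::string &uncoded)
{
    return translate(translate(uncoded, kNumTable, true), kTextTable, true);
}

// ...but decode only produces the text form.
std::string decode(const std::string &coded)
{
    return translate(coded, kTextTable, false);
}
```

With these assumed tables, `decode(encode("&#169; &#153;"))` returns `&copy; &amp;#153;`. The known entity comes back in its text form, while the unknown numeric entity passes through encode untouched and then has its ampersand escaped by decode, which would render as the literal text &#153; in a browser.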