Ok here is what I have found:

The &#XXX; entities are OK inside the db files

They are munged improperly during htsearch display of the excerpt.

ResultFetch::hilight is the driving function and it calls
HtSGMLCodec::instance()->decode(s) which does the improper conversion
&#153; --> &amp;#153;

I think that this change to HtSGMLCodec.h looks important:

http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/htcommon/HtSGMLCodec.h?r1=1.1&r2=1.1.2.1
Tue Mar 28 04:06:34 2000 UTC (3 years, 7 months ago) by ghutchis
"Differentiate between codec used for &foo; and numeric form &#nnn; Make
sure encoding goes through both but decoding only goes through the
preferred text form."

This part of the code is pretty cheese-whizzy, so attention Geoff!

Any insights? I am assuming that at some point this worked properly.

I'll pound on it some more tomorrow. It looks like the replacements
array populated in the myTextWordCodec object of the singleton
HtSGMLCodec object is built improperly, or there is some problem in the
order of calls.

HtSGMLCodec.cc
  // Similar to the HtWordCodec class.  Each string may contain
  // zero or more of words from the lists. Here we need to run
  // it through two codecs because we might have two different forms
  inline String encode(const String &uncoded) const
  { return myTextWordCodec->encode(myNumWordCodec->encode(uncoded)); }

  // But we only want to decode into one form i.e. &foo; NOT &#nnn;
  String decode(const String &coded) const
  { return myTextWordCodec->decode(coded); }


Intuitively I would think that if encode is as above, that decode should
be the reverse of encode:
 return myNumWordCodec->decode(myTextWordCodec->decode(coded));

  But this makes the problem worse!

&#XXX;  -->  &XXX;
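
To make the theory concrete, here is a toy model of the two-codec
arrangement. It is not the real HtWordCodec code. Everything in it is an
assumption for illustration: the Table type, the translate helper, the
table contents, the idea that the text table carries an &amp; pair, and
the idea that the numeric table only starts at 160. Under those
assumptions &#153; passes through encode untouched, and decode then
escapes its bare '&', which is the same shape as the excerpt output in
the quoted message.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one word codec: (entity form, raw form) pairs.
using Table = std::vector<std::pair<std::string, std::string>>;

// Single left-to-right pass. toRaw == true maps entity -> raw,
// toRaw == false maps raw -> entity. Replaced text is never rescanned.
static std::string translate(const std::string &in, const Table &t, bool toRaw) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        for (const auto &p : t) {
            const std::string &from = toRaw ? p.first : p.second;
            const std::string &to = toRaw ? p.second : p.first;
            if (!from.empty() && in.compare(i, from.size(), from) == 0) {
                out += to;
                i += from.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += in[i++];
    }
    return out;
}

// ASSUMPTION: the numeric table only covers codes 160 and above.
static const Table kNumTable = {{"&#160;", "\xA0"}, {"&#169;", "\xA9"}};

// ASSUMPTION: the text table includes an "&amp;" <-> "&" pair.
static const Table kTextTable = {
    {"&amp;", "&"}, {"&lt;", "<"}, {"&nbsp;", "\xA0"}, {"&copy;", "\xA9"}};

// Same shape as the quoted encode(): numeric codec first, then text codec.
std::string encode(const std::string &s) {
    return translate(translate(s, kNumTable, true), kTextTable, true);
}

// Same shape as the quoted decode(): text codec only.
std::string decode(const std::string &s) {
    return translate(s, kTextTable, false);
}
```

In this model, anything the numeric table does not know about stays in
the stored text as a literal "&#nnn;" string, so the only thing decode
can do with it is turn the leading '&' into "&amp;". If the real tables
look anything like this, the fix would be in how they are populated
rather than in the order of the decode calls.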

Thanks.

Neal

On Mon, 17 Nov 2003, Neal Richter wrote:

>
> I am seeing some HTML entities show up in search result 'blurbs'.
>
> See below.  Basically any entity of the form &#XXX; gets translated to &amp;#XXX;
>
> &#153; --> &amp;#153;
>
> This only happens for numbered entities below 160.
>
> &#160; --> &nbsp;
> &#169; --> &copy;
> &#174; --> &reg;
>
> I'm digging for this code.. looks like
>
> Is there a fix for this in 3.1.X??  Anyone complain about this before????
>
> Thanks!
>
> Example Page:
>
> <HTML>
> <TITLE>Test page
> </TITLE>
> <BODY>
> <h1>HTDIG &#153;</h1>
> <h2>Use our software &#151; to enhance your website</h2>
> <BR>
> HTDig &#153; 3.2.0
> <BR>
>
> The ht://Dig system is a complete world wide web indexing and searching system
> for a domain or intranet.
>
> <BR>
> <BR>
> 1&nbsp;&#139;2&#160;&lt;&nbsp;3
> <BR>
> &#169;&#160;2003 Neal Richter
> <BR>
> &copy;&#160;2003 HtDig Group
> </BODY>
> </HTML>
>
>
> Search results:
>
> [EMAIL PROTECTED] htdig-3.2.0b5-bin]$ cgi-bin/htsearch -c conf/htdig.conf
> Enter value for words: htdig
> Content-type: text/html
>
> Enter value for format: long
> <dl><dt><strong><a
> href="http://westfork.rightnow.com/data/test/test2.html">Test page
> </a></strong><img src="/htdig/star.gif" alt="*"><img src="/htdig/star.gif"
> alt="*"><img src="/htdig/star.gif" alt="*"><img src="/htdig/star.gif"
> alt="*">
> </dt><dd> <strong>HTDIG</strong> &amp;#153; USE OUR SOFTWARE &amp;#151; TO
> ENHANCE YOUR WEBSITE <strong>HTDig</strong> &amp;#153; 3.2.0 The ht://Dig
> system is a complete world wide web indexing and searching system for a
> domain or intranet. 1&nbsp;&amp;#139;2&nbsp;&lt;&nbsp;3 &copy;&nbsp;2003
> Neal Richter &copy;&nbsp;2003 <strong>HtDig</strong> Group <br>
> <em><a
> href="http://westfork.rightnow.com/data/test/test2.html">http://westfork.rightnow.com/data/test/test2.html</a></em>
>  <font size="-1">11/17/03, 384 bytes</font>
> </dd></dl>
>
>
> Neal Richter
> Knowledgebase Developer
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
>
> -------------------------------------------------------
> This SF. Net email is sponsored by: GoToMyPC
> GoToMyPC is the fast, easy and secure way to access your computer from
> any Web browser or wireless device. Click here to Try it Free!
> https://www.gotomypc.com/tr/OSDN/AW/Q4_2003/t/g22lp?Target=mm/g22lp.tmpl
> _______________________________________________
> ht://Dig Developer mailing list:
> [EMAIL PROTECTED]
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev
>

