[htdig] Re: xmlsearch encoding issue

Jeff Stern Thu, 16 Dec 2004 17:47:35 -0800

hey gilles and everyone..

i am having the same problem jordan is (see below). '&nbsp;'s
(and other HTML general entities) seem to come straight through
verbatim in the generated XML when using htdig in 'xmlsearch'
configuration mode. this leads to parser errors in a variety of
situations, because XML 1.0 pre-defines (natively recognizes)
only 5 entities: &lt;, &gt;, &amp;, &quot;, and &apos;. '&nbsp;'
and the rest of the (roughly 250) HTML-defined general entities
are not defined in XML 1.0.


there seem to be two solutions (at least that i have found)

1) make an enhanced htdig.dtd which will include a definition for 
'&nbsp;' and the other couple of hundred HTML general entities

i have tried this and it validates correctly, but it still
creates problems for mozilla-based browsers, which do not read
the entity definitions in the external .dtd. it also creates
problems for certain php scripts that rely on non-validating
parsers.

2) have htdig itself pre-filter its output so that all
'&'s are turned into '&amp;'s. thus, when htdig returns an
excerpt with '&nbsp;' in it, the prefilter would return the
excerpt with '&amp;nbsp;' instead, and the same for the other 250
html entities. otherwise everything would stay the same. when
parsed, the text would be converted back to '&nbsp;'.

i have experimented with this also (albeit in other contexts, such
as php scripts i've written) and found that solution 2 works quite
nicely: it's fast because it involves neither a large .dtd to
parse through nor post-translations on the receiving/parsing
end. it's also 'easier', at least for me.
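
to make solution 2 concrete, here is a minimal sketch in python. it
is not an htdig feature or switch, just a filter that could be run
over the xmlsearch output before it reaches the parser. the function
name escape_html_entities is made up for illustration. note one
assumption that differs slightly from the description above: it
leaves the five XML-predefined entities alone instead of escaping
every '&', so that output which already contains '&lt;' or '&amp;'
is not double-escaped.

```python
import re
import xml.etree.ElementTree as ET

# The five entities XML 1.0 predefines; any other name needs a DTD.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Matches a named entity reference such as &nbsp; or &eacute;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def escape_html_entities(xml_text):
    """Rewrite '&name;' as '&amp;name;' unless XML itself defines 'name'."""
    def fix(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # already valid XML, keep as-is
        return "&amp;" + name + ";"
    return _ENTITY.sub(fix, xml_text)


# Sample line taken from the xmlsearch result quoted below.
raw = "<DESCRIPTION>Local Estate Agents&nbsp;</DESCRIPTION>"
safe = escape_html_entities(raw)

# A non-validating parser now accepts the text without any DTD.
element = ET.fromstring(safe)
print(safe)
print(element.text)
```

after parsing, the element text contains the literal characters
'&nbsp;' again, so whatever renders the result as html sees the
entity it expects.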

as far as i can tell, 2 is the better solution. but i haven't seen
or figured out how to make the htdig engine pre-filter '&'s
like this. i thought there might be a nice htdig.conf switch i
could use to send a config to htsearch so that it would do this,
but i can't seem to find any (at least not listed in the docs).

is there a way to do this solution 2 with htdig/xmlsearch? it
really would make it easier for me :) :):)

or has anyone else found alternative solutions?

thanks,
jeff stern

---------- Forwarded message ----------
From: Jordan Kirby <[EMAIL PROTECTED]>
xmlsearch encoding issue  
2002-12-03 06:59

 Hi,
 
 We use xmlsearch to pull back the search results from htdig, some of our
 results come back with &nbsp; in them, which, on some occasions seem to
 throw the xml handler we use out, and as such we get errors everywhere.
 
 An extract from the results we get:
 ---------------------------------------------------------
 <RESULT>
 <TITLE>Business Finder Results</TITLE>
 
 <URL>http://www.lutterworth-online.co.uk/pp/business/results.asp?title=estate%20agent</URL>
 <SCORE>165524</SCORE>
 <PERCENT>24</PERCENT>
 <EXCERPT>---SNIPPED----</EXCERPT>
 <SIZE>7690</SIZE>
 <SIZEK>8</SIZEK>
 <DESCRIPTION>Local Estate Agents&nbsp;</DESCRIPTION>
 <DESCRIPTIONS>Local Estate Agents&nbsp;&lt;br&gt;
 </DESCRIPTIONS>
 <CURRENT>3</CURRENT>
 <MODIFIED>2002-12-03</MODIFIED>
 <HOPCOUNT>1</HOPCOUNT>
 <DOCID>28460</DOCID>
 <ANCHOR></ANCHOR>
 <BACKLINKS>2</BACKLINKS>
 </RESULT>
 ---------------------------------------------------------
 
 The error is on the <DESCRIPTION> line, I get: "Reference to undefined
 entity "nbsp"."
 
 Can anyone shed any light on how I can get round this erroring everytime?
 
 Thanks
 
 Jordan
 



_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
