hey gilles and everyone..

i am having the same problem jordan is (see below). '&nbsp;'s (and other HTML general entities) seem to come straight through verbatim in the generated XML when using htdig in 'xmlsearch' configuration mode. this leads to parser errors in a variety of situations, because XML 1.0 predefines only 5 entities (&lt;, &gt;, &amp;, &quot;, and &apos;). the rest of the (roughly 250) HTML-defined general entities are not defined in XML 1.0.
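to illustrate the failure outside htdig, here is a minimal sketch using python's standard xml.etree.ElementTree parser. the two sample strings are made up for the demo (they only imitate the shape of xmlsearch output), and try_parse is a hypothetical helper name. the parser accepts the predefined entities but rejects '&nbsp;' when no DTD defines it:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets shaped like xmlsearch output (not real htdig results).
GOOD = "<DESCRIPTION>Fish &amp; Chips &lt;local&gt;</DESCRIPTION>"
BAD = "<DESCRIPTION>Local Estate Agents&nbsp;</DESCRIPTION>"

def try_parse(xml_text):
    """Return the element text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # An entity that XML does not predefine ends up here.
        return None

good_text = try_parse(GOOD)  # predefined entities are expanded
bad_text = try_parse(BAD)    # '&nbsp;' is undefined without a DTD

print(good_text)
print(bad_text)
```

other parsers word the error differently, but a parser that does not read entity declarations from the DTD should fail on the same input.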
there seem to be two solutions (at least that i have found):

1) make an enhanced htdig.dtd which includes a definition for '&nbsp;' and the other couple of hundred HTML general entities. i have tried this and it validates correctly, but it still creates problems for mozilla-based browsers, which do not validate against the definitions in the .dtd. it also creates problems for certain php scripts that rely on non-validating parsing.

2) have htdig itself pre-filter its output so that all '&'s are turned into '&amp;'s. thus, when htdig returns an excerpt with '&nbsp;' in it, the pre-filter would emit '&amp;nbsp;' instead, and the same for the other 250 html entities. otherwise everything would stay the same. when parsed, the code would be converted back to '&nbsp;'.

i have experimented with this also (albeit in other contexts, such as php scripts i've written) and found that solution 2 works quite nicely: it's fast because there is no large .dtd to parse through and no post-translation to do on the receiving/parsing end. it's also 'easier', at least for me.

as far as i can tell, 2 is the better solution. but i haven't seen or figured out how to make the htdig engine pre-filter '&'s like this. i thought there might be a nice htdig.conf switch i could use to tell htsearch to do this, but i can't seem to find one (at least not listed in the docs).

is there a way to do solution 2 with htdig/xmlsearch? it really would make it easier for me :) or has anyone else found alternative solutions?

thanks,
jeff stern

---------- Forwarded message ----------
From: Jordan Kirby <[EMAIL PROTECTED]>
xmlsearch encoding issue
2002-12-03 06:59

Hi,

We use xmlsearch to pull back the search results from htdig. Some of our results come back with &nbsp; in them, which on some occasions seems to throw out the XML handler we use, and as such we get errors everywhere.
An extract from the results we get:

---------------------------------------------------------
<RESULT>
  <TITLE>Business Finder Results</TITLE>
  <URL>http://www.lutterworth-online.co.uk/pp/business/results.asp?title=estate%20agent</URL>
  <SCORE>165524</SCORE>
  <PERCENT>24</PERCENT>
  <EXCERPT>---SNIPPED----</EXCERPT>
  <SIZE>7690</SIZE>
  <SIZEK>8</SIZEK>
  <DESCRIPTION>Local Estate Agents&nbsp;</DESCRIPTION>
  <DESCRIPTIONS>Local Estate Agents&nbsp;<br> </DESCRIPTIONS>
  <CURRENT>3</CURRENT>
  <MODIFIED>2002-12-03</MODIFIED>
  <HOPCOUNT>1</HOPCOUNT>
  <DOCID>28460</DOCID>
  <ANCHOR></ANCHOR>
  <BACKLINKS>2</BACKLINKS>
</RESULT>
---------------------------------------------------------

The error is on the <DESCRIPTION> line. I get:

"Reference to undefined entity "nbsp"."

Can anyone shed any light on how I can get round this erroring every time?

Thanks,
Jordan
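until someone points out an htdig-side option, a receiving-end version of solution 2 can be sketched in python. this is not htdig code and assumes nothing about htsearch internals; the names escape_html_entities, XML_PREDEFINED and NAMED_ENTITY are made up for illustration. it is also a slightly narrower variant than "turn all '&'s into '&amp;'s": it only escapes named entity references that XML does not predefine, so '&amp;' and friends already in the output are left alone, as are numeric references like '&#160;'.

```python
import re
import xml.etree.ElementTree as ET

# The five entity names XML 1.0 predefines; these pass through untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Matches a named entity reference such as &nbsp; or &eacute;.
# Numeric references (&#160;) start with '#' and therefore do not match.
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def escape_html_entities(xml_text):
    """Rewrite '&name;' as '&amp;name;' unless XML predefines 'name'."""
    def fix(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        return "&amp;" + name + ";"
    return NAMED_ENTITY.sub(fix, xml_text)

# Made-up input imitating the failing <DESCRIPTION> line.
raw = "<DESCRIPTION>Local Estate Agents&nbsp;&amp; more</DESCRIPTION>"
filtered = escape_html_entities(raw)

# The filtered text is now well-formed; after parsing, the literal
# characters '&nbsp;' are back in the element text.
parsed_text = ET.fromstring(filtered).text
print(filtered)
print(parsed_text)
```

the filter runs on the raw text before any XML parser sees it. whatever finally renders the result (e.g. a browser showing it as HTML) then has to turn '&nbsp;' back into the character it stands for.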