Handling multiple character sets within the same file is still a problem. Sometimes the agent encounters a multi-language file. At times the file apparently uses overlapping character sets such as CP1252 and ISO-8859-1 (browsers tolerate this, so the source is never corrected!). The agents also encounter HTML-encoded characters (entities) in one part of a file whose other parts are encoded as CP1252 and/or ISO-8859-1. This makes the summaries difficult to build and display correctly.
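To make the problem concrete, here is a minimal sketch in Python of one tolerant decoding strategy for this CP1252/ISO-8859-1/entity mix. The function name decode_mixed and the fallback order (UTF-8, then CP1252, then ISO-8859-1) are assumptions for illustration only, not something our agents currently do, and the sketch does not identify the language of the text.

```python
import html


def decode_mixed(raw: bytes) -> str:
    """Decode bytes of uncertain encoding, then expand HTML entities.

    Hypothetical helper: the fallback order below is an assumption.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            # Strict decoding raises on byte sequences the codec cannot map.
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # ISO-8859-1 maps every byte value, so this last resort never fails.
        text = raw.decode("iso-8859-1")
    # Turn entities such as &eacute; or &amp; into the characters they name.
    return html.unescape(text)


if __name__ == "__main__":
    sample = b"caf\xe9 &amp; \x93quoted\x94 &eacute;"
    print(decode_mixed(sample))
```

CP1252 is tried before ISO-8859-1 because the two differ only in the byte range 0x80-0x9F, where CP1252 has printable characters such as curly quotes and ISO-8859-1 has control codes. That matches the way browsers treat pages labelled ISO-8859-1, but it is still a guess and will be wrong for files in other single-byte character sets.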
Does anyone have a best practice for a robot agent handling a multi-part translation file with many languages, a "Rosetta-stone-like" file? Does anyone have a best practice for encoding such a file?

-Thomas

---------- Original Text ----------

From: "Art Pollard" <[EMAIL PROTECTED]>, on 4/6/2002 9:32 AM:

At 06:43 PM 4/5/2002 -0800, you wrote:
>I'm working on a multi language spider, and I've come to a point where I'm
>not sure what assumption to make.

<BIG SNIP>

The solution to your problem is to use a language identifier. A language identifier is capable of recognizing not only what language it is but also what character set is in use. So all you need to do is to download the page and throw it at a language identifier and it will tell you what language and character set it is. Or, you could do it at a paragraph at a time just in case you are dealing with a mixed language document.

Just so happens we market one. ;-) It supports ~230 languages in a variety of different character sets in addition to UTF-8, and Unicode Big/Little Endian. You can play with a simple demo at:

http://www.languageidentifier.com/

(Though Chinese isn't included in the demo.)

We developed it originally to assist with doing language specific crawling among other things.

Interestingly enough, we are finishing up work on a Chinese text segmentation system. (This puts the spaces into Chinese text so that you can index it and search it more efficiently.)

If interested, please contact me at: [EMAIL PROTECTED]

-Art

--
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.