Handling multiple character sets within the same file is still a problem.
Sometimes the agent encounters a multi-language file.  At times the file
apparently uses overlapping character sets.  Character sets such as CP1252
and ISO8859-1 are used (and browsers tolerate it, so the source is never
corrected!).  The agents are encountering the above mixed with HTML
entities: one part of a file uses HTML-encoded characters while another
part of the same file uses raw CP1252 and/or ISO8859-1 bytes.  This makes
the summaries a bit difficult to build and display correctly.
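
To make the problem concrete, here is a minimal sketch of how an agent
might normalize one chunk of such a file.  It is only an illustration under
stated assumptions: the helper name decode_chunk is hypothetical, the input
is assumed to be the raw bytes of one chunk, and the fallback order (UTF-8,
then CP1252, then ISO8859-1) is a guess, not a proven best practice.

```python
import html


def decode_chunk(raw: bytes) -> str:
    """Decode one chunk of a page to text (hypothetical helper).

    Assumed fallback order: UTF-8, then CP1252, then ISO8859-1.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        try:
            # CP1252 agrees with ISO8859-1 for bytes 0xA0-0xFF and adds
            # printable characters (curly quotes, dashes) in 0x80-0x9F.
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in
            # CP1252; ISO8859-1 maps every byte, so this cannot fail.
            text = raw.decode("iso-8859-1")
    # Resolve HTML entities such as &eacute; or &#233; after decoding.
    return html.unescape(text)
```

With this, a raw CP1252 byte and an HTML entity for the same letter end up
as the same character, which is what the summary builder needs.  It does
not help when a single chunk is itself ambiguous between two 8-bit sets.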

Does anyone have a best practice for a robot agent handling a multi-part,
"Rosetta-stone-like" translation file?  Does anyone have a best practice
for encoding such a file?

-Thomas
---------- Original Text ----------

From: "Art Pollard" <[EMAIL PROTECTED]>, on 4/6/2002 9:32 AM:


At 06:43 PM 4/5/2002 -0800, you wrote:

>I'm working on a multi language spider, and I've come to a point where I'm
>not sure what assumption to make.

<BIG SNIP>

The solution to your problem is to use a language identifier.
A language identifier is capable of recognizing not only what
language it is but also what character set is in use.  So all you
need to do is to download the page and throw it at a language
identifier and it will tell you what language and character set
it is.  Or, you could do it a paragraph at a time just in case
you are dealing with a mixed language document.
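
The paragraph-at-a-time idea can be sketched roughly as follows.  This is
not the commercial identifier described in this message; it is a crude
stand-in that only checks which encodings decode a paragraph without
error, and it does not identify the language at all.  The function names
and the candidate list are assumptions made for illustration.

```python
import re


def guess_charset(raw: bytes) -> str:
    """Crude guess: first candidate that decodes cleanly (assumption).

    A real language identifier would use statistical models instead.
    """
    for name in ("ascii", "utf-8", "cp1252"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # ISO8859-1 maps every byte value, so it is the last resort.
    return "iso-8859-1"


def charsets_by_paragraph(page: bytes):
    """Split a page on blank lines and guess a charset per paragraph."""
    paragraphs = [p for p in re.split(rb"(?:\r?\n){2,}", page) if p.strip()]
    return [(guess_charset(p), p) for p in paragraphs]
```

Each paragraph is judged independently, so a page that switches character
sets halfway through gets one guess per paragraph rather than a single
wrong guess for the whole page.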

Just so happens we market one. ;-)  It supports ~230 languages
in a variety of different character sets in addition to UTF-8, and
Unicode Big/Little Endian.  You can play with a simple demo at:
http://www.languageidentifier.com/ (Though Chinese isn't included
in the demo.)

We developed it originally to assist with doing language specific
crawling among other things.  Interestingly enough, we are
finishing up work on a Chinese text segmentation system.
(This puts the spaces into Chinese text so that you can index it
and search it more efficiently.)

If interested, please contact me at: [EMAIL PROTECTED]

-Art
-- 
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.



