In-Reply-To: <[EMAIL PROTECTED]>
On  Mon, 29 Dec 2003 10:44:48 -0700 Neal Richter <[EMAIL PROTECTED]> 
noted:

>   There is also the largely undefined issue of Asian word-breaking.  Many
> Asian languages do not use spaces to 'break' words in text, which makes
> it very difficult to index by word.

So we need someone who reads Japanese to look at, say, 
google.jp and yahoo.jp to see how they handle the 
issue? I'm taking Japanese as the hardest example, 
what with ideograms and two syllabic scripts all 
mushed up together... 

Musing, hoping for contradiction: 

1) is it sufficient for the class to hold that all 
characters in ideogram ranges are words?

2) or (more dubiously, but possibly a workable kluge) 
that all search strings entirely consisting of syllabic 
script ranges get an implicit "*" truncator? 
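
To make the two musings concrete, here is a rough sketch in 
Python (not ht://Dig code). The Unicode ranges, the names 
index_terms and expand_query, and the "*" truncation syntax 
are all my own assumptions for illustration, and the ranges 
are certainly incomplete:

```python
import re

# Assumed ranges, for illustration only (not exhaustive):
#   CJK Unified Ideographs U+4E00-U+9FFF
#   Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF
IDEOGRAM = "\u4e00-\u9fff"
SYLLABIC = "\u3040-\u309f\u30a0-\u30ff"

TOKEN_RE = re.compile(
    "[" + IDEOGRAM + "]"                       # idea 1: each ideogram is a word
    "|[" + SYLLABIC + "]+"                     # a run of kana stays together
    "|[^\\W" + IDEOGRAM + SYLLABIC + "]+"      # any other run of word characters
)
SYLLABIC_ONLY_RE = re.compile("[" + SYLLABIC + "]+")


def index_terms(text):
    """Split text into index terms; whitespace and punctuation are dropped."""
    return TOKEN_RE.findall(text)


def expand_query(query):
    """Idea 2: a query made up only of syllabic script gets an implicit '*'."""
    if SYLLABIC_ONLY_RE.fullmatch(query):
        return query + "*"
    return query


if __name__ == "__main__":
    print(index_terms("東京タワー tower"))
    print(expand_query("さくら"))
```

Whether one ideogram per word gives usable precision is 
exactly the question for someone who reads Japanese; the 
sketch only shows that the rule is cheap to state. 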

Then we're left with the problem of indexing accurate 
transcriptions of Latin inscriptions :-)

BTW, [EMAIL PROTECTED] is currently 
discussing Python "codecs" for Japanese. I know 
nothing about the borrowability of Python 
interpreter code, but there may well be unpatented
ideas in there. 

Mike



_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
