Title: [htdig] how to define a word


This is quite an interesting problem, and it shows that "occidental" cultural conventions are not universal all over the world. (See my answers at the end.)


According to Prisda Gomutputra :

> I am currently trying to fine-tune ht://Dig to work with the Thai (8-bit)
> language more accurately. I can get it to work fine, but the search results
> are not highly relevant, since the Thai language does not use spaces to
> separate words. Spaces are only used to separate sentences.
>
> For example, the English sentence "this is teRst1. this is test2" would be
> written in Thai as follows:
>
>   "thisisteRst1. thisistest2"
>        ^^^^^^
> 1) Is there a way to tell ht://Dig to identify the words and index them
> properly?
> 2) When the words are combined together without spaces in between, it
> introduces a new problem. In the example above ("thiSISTERst1"), when a user
> searches for the word "sister", "thiSISTERst1" will be returned too. Is there
> a way to prevent this from happening?


How do you tell the difference between "thiSISTERst1" and "thisisTERST2"?
Is it the overall sense of the sentence that lets you decide how to read "thisisterst1"?
Are there ambiguous sentences, where it is difficult to decide the sense of "thisisterst"?
Is there a way to mark the difference between "thiSISTERst1" and "thisisTERST2" clearly?
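
To make the ambiguity concrete, here is a minimal Python sketch that lists every way a run of letters can be split into known words. The word list is hypothetical and chosen only so that the example string has two readings; a real Thai word list would be needed in practice.

```python
def all_segmentations(text, vocabulary):
    """Return every way to split `text` into words found in `vocabulary`."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # Keep this word and split the remainder recursively.
            for tail in all_segmentations(text[end:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical vocabulary, for illustration only.
AMBIGUOUS_VOCAB = {"this", "is", "terst", "thi", "sister", "st"}

for words in all_segmentations("thisisterst", AMBIGUOUS_VOCAB):
    print(" ".join(words))
```

With this word list the string has two valid readings, so the letters alone cannot settle which one the writer meant; some extra information (context, or an explicit marker) is required.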


I think one solution is to insert (manually or automatically) an "invisible" space between "this", "is" and "terst". I mean a character which is not displayed when you read the text, but which software (such as ht://Dig) understands as a separation between two words. (Also think about HTML- or SGML-like markup, for example: <word>this</word><word>is</word><word>terst</word>).

-- Manually: it may take a long time, and it is difficult to change the usual way of writing.
-- Automatically: you may use or build software that analyses every sentence and adds "invisible" spaces between words. I don't know whether such software exists.
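
As a rough idea of what the automatic route could look like, here is a minimal Python sketch of greedy longest-match segmentation against a word list. Everything in it is an assumption made for illustration: the function names, the tiny vocabulary, and the choice of the Unicode zero-width space (U+200B) as the "invisible" separator. An 8-bit Thai encoding has no such character, so another separator would have to be chosen there, and I have not checked how ht://Dig treats it.

```python
ZERO_WIDTH_SPACE = "\u200b"  # assumed separator; not available in 8-bit encodings


def segment_longest_match(text, vocabulary):
    """Split `text` greedily, always taking the longest known word first."""
    max_len = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            if text[i:i + size] in vocabulary:
                break
        else:
            size = 1  # no known word starts here: emit a single character
        words.append(text[i:i + size])
        i += size
    return words


def mark_boundaries(text, vocabulary, separator=ZERO_WIDTH_SPACE):
    """Rejoin the segmented words with a separator an indexer could split on."""
    return separator.join(segment_longest_match(text, vocabulary))


# Hypothetical vocabulary, for illustration only.
SAMPLE_VOCAB = {"this", "is", "terst", "test"}
print(segment_longest_match("thisisterst", SAMPLE_VOCAB))
print(mark_boundaries("thisisterst", SAMPLE_VOCAB, separator=" "))
```

Longest-match is only a heuristic: when a string has several valid readings it simply takes the first long word it finds, so it can choose the wrong split in ambiguous cases.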

Another theoretical solution, less elegant but immediately possible, is to use synonyms in ht://Dig:
"thiSISTERst" should have "this", "is" and "terst" as synonyms.


I hope you will find a solution.
Charles Népote.
Paris, France.


> Highly appreciated
> Prisda
