On Friday, June 27, 2003 3:36 PM, Jony Rosenne <[EMAIL PROTECTED]> wrote:

> For Hebrew and Arabic, add a step: Find the root, remove prefixes,
> suffixes and other grammatical artifacts and obtain the base form of
> the word. 

Removing common suffixes is a separate issue (it requires unification of lexically 
similar words, and can simply consist of adding multiple search tokens by removing 1, 2 
or 3 letters at the end of the word).
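A minimal sketch of that naive idea: for each word, emit the word itself plus variants with 1, 2 or 3 trailing letters removed, keeping only variants long enough to stay selective. The function name and the minimum-length threshold are my own illustrative assumptions, not part of any spec.

```python
def suffix_variants(word, max_strip=3, min_len=3):
    """Return the word plus truncated variants as extra search tokens."""
    tokens = {word}
    for n in range(1, max_strip + 1):
        # Only keep a truncated variant if enough of the word remains.
        if len(word) - n >= min_len:
            tokens.add(word[:-n])
    return sorted(tokens)

print(suffix_variants("searching"))
# → ['search', 'searchi', 'searchin', 'searching']
```

A very short word such as "cat" yields only itself, since stripping any letter would fall below the minimum length.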

However, I'd like to have information about the common prefixes used in Hebrew or 
Arabic, and whether they can be detected in a language-neutral way (using only the 
script information), because this has to combine with the simple suffix-removal 
technique above, and I don't want to create too many tokens from a single word.

I suppose this was clear in the rest of my message: I would like to avoid 
dictionary-based approaches, in contexts where the language is unknown and only the 
script information (i.e. the encoded plain text) is available (including for Thai), so 
that it can be documented and implemented as a minimal tokenizing algorithm, simple to 
implement across platforms and programming languages.

> Nearly nobody does it, and searches in these languages are less
> useful than parallel searches in other languages.

I looked at some related projects, like Jakarta Lucene, and this does not seem to be 
documented there; users are left to write their own "Analyzer" class to tokenize text.

I don't want a system that will create the best tokens, only a system that can produce 
reasonably good and sufficient search tokens, letting users add a few tokens for 
known variants in their plain-text searches if needed, or insert simplified keywords 
in their plain-text documents so that they become easily indexable and searchable.

Some examples: in German, compound words typically don't have any separator. Without 
an actual German dictionary, and without recognizing that the text is actually in 
German, it is quite hard to get all the best tokens from a single compound word. 
However, I expect that these words will be split somewhere else in the document (for 
example, prefixes agglutinated to an infinitive verb).

Can I reasonably expect the same thing with text in agglutinative languages like 
Finnish or Hungarian?

Now comes the real difficulty: can I reasonably split a Thai or Chinese sentence into 
tokens that may not match an actual word exactly, but that still contain enough 
information to allow filtering of relevant texts containing those sequences?

My first idea for Chinese was to split long runs of Han ideographic characters into 
overlapping sequences of 2 or 3 Han characters, taken at each position within a run of 
characters sharing the same general category and the same script property. This would 
palliate the absence of spaces, while still giving good selectivity for searches.
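A rough sketch of that splitting, under the simplifying assumption that a "run" is any maximal sequence of characters in the basic CJK Unified Ideographs block (a stand-in for the full general-category and script-property test mentioned above):

```python
def han_ngrams(text, sizes=(2, 3)):
    """Emit overlapping 2- and 3-character tokens from each run of Han ideographs."""
    def is_han(ch):
        # Basic CJK Unified Ideographs block only; a real tokenizer would
        # use the Unicode Script property instead of this range check.
        return '\u4e00' <= ch <= '\u9fff'

    tokens = []
    run = []
    for ch in text + '\0':          # '\0' sentinel flushes the final run
        if is_han(ch):
            run.append(ch)
        else:
            for n in sizes:
                for i in range(len(run) - n + 1):
                    tokens.append(''.join(run[i:i + n]))
            run = []
    return tokens

print(han_ngrams("中文分词"))
# → ['中文', '文分', '分词', '中文分', '文分词']
```

Every position in the run contributes a bigram and (where possible) a trigram, so a query containing any 2- or 3-character substring of a word can still match the document.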

Note that the documents I need to index are not extremely long (most of them will be 
below 1 KB and will consist of descriptive paragraphs for a longer document, or of an 
extract of the first 4 KB of the plain-text document, which generally contains an 
introduction and reasonably descriptive titles), so I will limit the number of 
indexable tokens to the longest ones, or the least frequent ones (with the help of a 
global statistics database where indexed tokens are hashed and counted across all 
documents of the same collection).
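An illustrative sketch of that selection step: keep only the globally rarest (then longest) tokens of each document, with a shared counter standing in for the global statistics database. The token cap and the tie-breaking rule are my assumptions for the example, not a specification.

```python
from collections import Counter

# Stand-in for the global database: token counts across the collection.
global_counts = Counter()

def select_tokens(doc_tokens, max_tokens=100):
    """Keep the rarest, then longest, tokens of a document, up to max_tokens."""
    global_counts.update(doc_tokens)
    ranked = sorted(set(doc_tokens),
                    key=lambda t: (global_counts[t], -len(t)))
    return ranked[:max_tokens]

# A frequent short token loses to a rare long one:
print(select_tokens(["the", "the", "the", "tokenizer"], max_tokens=1))
# → ['tokenizer']
```

In a real system the counter would be persisted and shared across the indexing of the whole collection, so that common tokens get pruned consistently everywhere.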
