I worked at a couple of search engine vendors (Infoseek Ultraseek and 
MarkLogic), and in my experience user dictionaries are important for 
linguistic processing. Every application has some local jargon.

For languages that don’t separate words with spaces, such as Chinese and 
Japanese, the tokenizer needs the user dictionary just to split out those 
words as separate searchable tokens. For those languages, user dictionaries 
are essential.
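
To make that concrete, here is a toy sketch in Python. It is not how the 
Lucene Japanese tokenizer works internally (that one scores candidate 
segmentations against its dictionaries rather than matching greedily). It is 
only a greedy longest-match segmenter, and the function name, the tiny word 
lists, and the sample text are all made up for illustration.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (toy example, hypothetical helper).

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as the fallback.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Assumed word lists: a tiny "standard" dictionary and one user-supplied term.
base_dictionary = {"東京", "大学"}
user_dictionary = {"スカイツリー"}

text = "東京スカイツリー"
without_user = segment(text, base_dictionary)
with_user = segment(text, base_dictionary | user_dictionary)

print(without_user)
print(with_user)
```

Without the user entry, the unknown term falls apart into single characters, 
so a query for it has nothing to match as a unit. With the entry, it comes out 
as one token.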

At LexisNexis, they have over 19,000 words in the user dictionary.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 18, 2024, at 2:33 PM, Michael Sokolov <msoko...@gmail.com> wrote:
> 
> We use it at Amazon. I can't really read it so I'm not sure, but I think
> it's used to encode terms that come up that aren't handled well by the
> standard dictionary.
> 
> On Sat, May 18, 2024 at 8:39 AM Bruno Roustant <bruno.roust...@gmail.com> 
> wrote:
>> 
>> Hi,
>> 
>> While looking at the various usages of Map with Integer keys, I found 
>> ja.dict.UserDictionary with its lookup() method where there is a TODO: can 
>> we avoid this treemap/toIndexArray?
>> 
>> I could propose something, but I would like to know how much it is used, and 
>> if it is worth improving it.
>> 
>> Thanks
>> 
>> Bruno
> 
