Leos Literak wrote: > But I need some enhancements, that are related to > my language. [...] > but we have 7 shapes for each of singular > and plural: > > pes,psa,psovi,psu,pse,psovi,psem > psi,psu,psum,psy,psi,psech,psy > > I'd like to be able to search for all of this > variants in first shape: pes > > eg. whenever I encounter "psa" index it as "pes"
How are those suffixes related to each other? It makes most sense to reduce every plural to it's singular form first. After that you can strip unnecessary suffixes. > 2\ another issue is with diacritics. > > for example l�vka (la'vka) > > some people use it, some write words without it. > > so I'd like to be able to look up both > lavka and l�vka. easy way is to index words without > diacritics, because it is common denominator. If you eleminate the diacritics for indexing and searching you have the advantage that everyone can search even if he is not familiar with czech diacritics. Or maybe is not able to use diacritics (wrong codepage, font, whatever). > so my question is: what kind of interfaces/classes > shall I implement/overwrite? i have no idea of > relations between classes in Lucene and their > purpose. so it would take me lot of time to find that. You should first look at the classes Analyzer and Filter. You need an Analyzer that pipes the content of a document trough a suitable Tokenizer (StandardTokenizer should fit for czech) and then trough the needed filters. One of those filters is yours that makes the needed grammatically changes. Greets, Gerhard -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
