Paul, there are two implementations in the compound package: one is dictionary-based, the other is hyphenation-grammar + dictionary (it restricts the decompounding candidates based on hyphenation rules). You could also subclass the compound base class and implement your own.
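To make that concrete, here is a rough sketch of wiring up both filters against the Lucene 2.9 contrib-analyzers API (constructor signatures may differ in your version; the tiny dictionary and the "de_DR.xml" hyphenation-pattern path are just placeholders, you would use a real word list and a FOP/OFFO pattern file):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class DecompoundSketch {

  public static void main(String[] args) throws Exception {
    // toy subword dictionary; in practice you would load a real German word list
    String[] dict = { "rind", "fleisch", "überwachung", "gesetz" };

    // 1) dictionary-based: looks for dictionary words inside each token
    TokenStream dictStream = new DictionaryCompoundWordTokenFilter(
        new WhitespaceTokenizer(new StringReader("rindfleischüberwachungsgesetz")),
        dict,
        5,     // minWordSize (mirrors the class default)
        2,     // minSubwordSize
        15,    // maxSubwordSize
        true); // onlyLongestMatch
    dump(dictStream);

    // 2) hyphenation grammar + dictionary: only splits at hyphenation points,
    //    then keeps the candidates that are in the dictionary
    //    ("de_DR.xml" is a placeholder for a FOP/OFFO hyphenation pattern file)
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("de_DR.xml");
    TokenStream hyphStream = new HyphenationCompoundWordTokenFilter(
        new WhitespaceTokenizer(new StringReader("rindfleischüberwachungsgesetz")),
        hyphenator, dict);
    dump(hyphStream);
  }

  // prints every token: the original compound comes through unchanged and the
  // subwords follow it with position increment 0
  private static void dump(TokenStream stream) throws Exception {
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(term.term());
    }
  }
}

With onlyLongestMatch=true the dictionary filter adds only the longest matching subword (per the javadoc) instead of every dictionary match, which cuts down on spurious short fragments.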
I haven't seen any user measures (relevance, etc.), though that would be a cool thing to see.

I'm not sure I understand your last question; can you elaborate? To improve some cases, you might want to use the onlyLongestMatch parameter:

@param onlyLongestMatch Add only the longest matching subword to the stream

For scoring, I think Lucene's scoring might help too, because the original word is left as a token without decompounding, so if you search on an exact match it should be ranked higher. (Not sure if this answers your question.)

On Wed, Oct 21, 2009 at 5:27 AM, Paul Libbrecht <p...@activemath.org> wrote:
>
> I'm interested in this analyzer... it had escaped me and it solves an old
> problem!
> Could you report on its usage:
> - did you have to feed words into a dictionary?
> - does anyone have user measures already?
> ... and a last question, for the research fun: is there any approach
> toward preferring Überwachungsgesetz as a concept over, say,
> Fleischüberwachung? (Again, that could probably be based on a dictionary.)
>
> thanks in advance
>
> paul
>
>
> On 21 Oct 2009 at 04:00, Robert Muir wrote:
>
>> hi, it will work because it will also decompound "Rindfleisch" into Rind
>> and fleisch, with posIncr=0
>>
>> so if you index Rindfleischüberwachungsgesetz, then query with
>> "Rindfleisch", it matches because Rindfleisch also gets decompounded into
>> Rind and fleisch.
>>
>> On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
>> <bbdoug...@basistech.com> wrote:
>>
>>> Hello,
>>>
>>> I've found a number of posts in different places talking about how to
>>> perform decompounding, but I haven't found many discussing how to use the
>>> results of decompounding. If anyone can answer this question or point me
>>> to an existing discussion, it would be very helpful.
>>>
>>> The description of the org.apache.lucene.analysis.compound package gives
>>> the following example:
>>>
>>> Rindfleischüberwachungsgesetz, 0, 29
>>> Rind, 0, 4, posIncr=0
>>> fleisch, 4, 11, posIncr=0
>>> überwachung, 11, 22, posIncr=0
>>> gesetz, 23, 29, posIncr=0
>>>
>>> And I see how this allows me to find single components such as "gesetz"
>>> or "Rind". But what if I want to find combinations of components such as
>>> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using
>>> posIncr=0 for all components excludes the possibility of finding
>>> sub-strings that are made up of multiple components.
>>>
>>> Any comments or thoughts would be appreciated.
>>>
>>> Ben Douglas
>>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>
>

--
Robert Muir
rcm...@gmail.com