Just add them to the dictionary; the compound filter will do this automatically.
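
To make the dictionary point concrete, here is a rough sketch in Python of dictionary-driven decompounding, using the example dictionary from the quoted message below. It is a simplified model, not Lucene's actual code: the function name decompound, the lowercasing step, and the lack of any maximum subword size are assumptions made for illustration (the real filter has word and subword size limits that are omitted here).

```python
# Simplified model of a dictionary-based compound filter (NOT Lucene's code).
# Assumption: terms are lowercased first, as the explain output below suggests.

BASIC = ["Rind", "Fleisch", "Draht", "Schere", "Gesetz", "Aufgabe", "Überwachung"]
EXPANDED = BASIC + ["Überwachungsgesetz"]


def decompound(word, dictionary, min_subword_size=2):
    """Return the original token followed by every dictionary entry found inside it."""
    lowered = word.lower()
    entries = {entry.lower() for entry in dictionary}
    tokens = [lowered]  # the original token is always kept
    for start in range(len(lowered)):
        for end in range(start + min_subword_size, len(lowered) + 1):
            candidate = lowered[start:end]
            if candidate in entries:
                tokens.append(candidate)  # emitted alongside the original token
    return tokens


word = "Rindfleischüberwachungsgesetz"
print(decompound(word, BASIC))
# ['rindfleischüberwachungsgesetz', 'rind', 'fleisch', 'überwachung', 'gesetz']
print(decompound(word, EXPANDED))
# ['rindfleischüberwachungsgesetz', 'rind', 'fleisch', 'überwachung',
#  'überwachungsgesetz', 'gesetz']
```

The only difference between the two runs is the extra dictionary entry: nothing is invented by the sketch, it only emits subwords that are listed.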

If you want to tweak it even further, you can also tell the compound filter NOT to emit the subwords when they form a bigger compound, via the onlyLongestMatch parameter I spoke of earlier. I haven't played with this option much, but I think this is what it's supposed to do:

If the dictionary is

  soft
  ball
  softball

then "softball" (or compounds containing it) won't emit "soft" and "ball", because "softball" is in the dictionary and it's the longest match.

With the option off, you'd get: softball, ball, soft

On Wed, Oct 21, 2009 at 3:09 PM, Benjamin Douglas <bbdoug...@basistech.com> wrote:

> OK, that makes sense. So I just need to add all of the sub-compounds that
> are real words at posIncr=0, even if they are combinations of other
> sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
> additional token: Überwachungsgesetz.
>
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but it
> won't just make them up.
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>   0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>     0.5 = queryWeight(field:überwachungsgesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>       1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>     0.5 = queryWeight(field:überwachung), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>       1.0 = tf(termFreq(field:überwachung)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>     0.5 = queryWeight(field:überwachungsgesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>       1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>     0.5 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>   0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>     0.2814906 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>     0.2814906 = queryWeight(field:überwachung), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>       1.0 = tf(termFreq(field:überwachung)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>   0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>     0.2814906 = queryWeight(field:fleisch), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>       1.0 = tf(termFreq(field:fleisch)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>     0.2814906 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bbdoug...@basistech.com> wrote:
>
> > Thanks for all of the answers so far!
> >
> > Paul's question is similar to another aspect I am curious about:
> >
> > Given the way the sample word is analyzed, is there anything in the
> > scoring mechanism that would rank "überwachungsgesetz" higher than
> > "gesetzüberwachung" or "fleischgesetz"?
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
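
For anyone who wants to check the numbers in the quoted explain output, the arithmetic can be reproduced with a short Python sketch. The products (queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm) come straight from the explain output. The idf and queryNorm formulas are the classic TF-IDF similarity ones as far as I know them, and the function names are made up for illustration. One assumption is needed to match queryNorm = 0.9173473 for the two lower-scoring queries: the query carries a third term that is absent from the index (presumably the undecompounded query word, docFreq=0), which contributes to the norm but not to the score.

```python
import math


def idf(doc_freq, max_docs):
    # Classic formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def query_norm(idfs):
    # 1 / sqrt(sum of squared term weights); boosts assumed to be 1
    return 1.0 / math.sqrt(sum(v * v for v in idfs))


def score(query_doc_freqs, max_docs=1, field_norm=0.375):
    """Sum of queryWeight * fieldWeight over the query terms present in the index."""
    idfs = [idf(df, max_docs) for df in query_doc_freqs]
    norm = query_norm(idfs)
    total = 0.0
    for df, term_idf in zip(query_doc_freqs, idfs):
        if df == 0:
            continue  # term not in the index: affects the norm only
        query_weight = term_idf * norm
        field_weight = 1.0 * term_idf * field_norm  # tf = 1 in this example
        total += query_weight * field_weight
    return total


# "überwachungsgesetz": four query terms, all present in the index
print(round(score([1, 1, 1, 1]), 6))  # 0.23014
# "gesetzüberwachung" / "fleischgesetz": one absent term (assumed) plus two present
print(round(score([0, 1, 1]), 6))     # 0.064782
```

Under those assumptions, the higher score for "überwachungsgesetz" comes from two things: four matching terms instead of two, and a larger queryNorm because no unmatched term inflates the sum of squared weights.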