I created a test class for rapid testing that should be runnable out of the
box, with
LUCENE-SUGGEST-5.0-SNAPSHOT (maven) dependency. (see attachment)
Because I can't subclass from the final FuzzySuggester I subclassed
AnalyzingSuggester, delegating
all 3 method calls 'convertAutomaton, getFullPrefixPaths and
getTokenStreamToAutomaton' +
constructor over an internal FuzzySuggester member.
Then I overrided AnalyzingSuggester.lookup(..) by copying it from
AnalyzingSuggester, sadly invoking
2 methods and reading some field members with reflection api because of their
private declaration
(alternative would be to copy everything). Everything worked as expected so far.
I added our slight modification - moving the getFullPrefixPaths invocation to
the first prefix path
creation.
The main class checks a simple scenario with KeywordAnalyzer, three term
dictionary and some query
term variations.
Here is the output. Sadly some (for me) unexpected results:
Dictionary: [screen, screensaver, mouse]
query: 'screan' - exact result as expected (correct). But not in any case! This
is when one letter
is changed, which is not the first or last one.
Exact results:
screen/1
All results: - double entry of 'screen'?
screen/1
screen/1
screensaver/1
query: 'screew' - last letter changed: exact result empty (incorrect).
Exact results:
All results:
screen/1
screensaver/1
query: 'wcreen' - first letter changed: nothing found at all.
Exact results:
All results:
query: 'scree' - last letter removed.
Exact results:
All results:
screen/1
screensaver/1
query: 'scren' - 5th letter removed. Same as with last removed letter.
Exact results:
All results:
screen/1
screensaver/1
query: 'sreen' - 2th letter removed. Why different?
Exact results:
screen/1
All results: - double entry of 'screen'?
screen/1
screen/1
screensaver/1
query: 'screen' - correct query: screen not found at all?
Exact results:
All results:
screensaver/1
Now, my latin is at the end (as we say in Germany ;) ). Don't know how to
proceed further, as the
deeper code starts to become very complex.
Thanks a lot!
Christian Reuschling
On 15.11.2013 18:49, Michael McCandless wrote:
> Hmm, I'm not sure offhand why that change gives you no results.
>
> The fullPrefixPaths should have been a super-set of the original
> prefix paths, since the LevA just adds further paths.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
> <[email protected]> wrote:
>> I tried it by changing the first prefixPath initialization to
>>
>> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>> FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
>> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>>
>> inside AnalyzingSuggester.lookup(..). (simply copied the line from below)
>>
>> Sadly, FuzzySuggester now gives no hits at all, even with a correct spelled
>> query.
>>
>> Correct spelled query:
>> prefixPaths size == 1
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc,
>> bytesReader)
>> (without getFullPrefixPath: non-null)
>>
>> Query within edit distance - the same:
>> prefixPaths size == 1 (without getFullPrefixPath: 0)
>> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc,
>> bytesReader)
>>
>> Query outside of edit distance:
>> prefixPaths size = 0
>>
>> Seems like the fuzziness is there, but getFullPrefixPaths kicks all
>> END_BYTEs ?
>>
>>
>>
>> On 14.11.2013 17:05, Michael McCandless wrote:
>>> On Wed, Nov 13, 2013 at 12:04 PM, Christian Reuschling
>>> <[email protected]> wrote:
>>>> We started to implement a named entity recognition on the base of
>>>> AnalyzingSuggester, which
>>>> offers the great support for Synonyms, Stopwords, etc. For this, we
>>>> slightly modified
>>>> AnalyzingSuggester.lookup() to only return the exactFirst hits
>>>> (considering the exactFirst
>>>> code block only, skipping the 'sameSurfaceForm' check and break, to get
>>>> the synonym hits
>>>> too).
>>>>
>>>> This works pretty good, and our next step would be to bring in some
>>>> fuzzyness against
>>>> spelling mistakes. For this, the idea was to do exactly the same, but with
>>>> FuzzySuggester
>>>> instead.
>>>>
>>>> Now we have the problem that 'EXCACT_FIRST' in FuzzySuggester not only
>>>> relies on sharing the
>>>> same prefix - also different/misspelled terms inside the edit distance are
>>>> considered as 'not
>>>> exact', which means we get the same results as with AnalyzingSuggester.
>>>>
>>>>
>>>> query: "screen" misspelled query: "screan" dictionary: "screen",
>>>> "screensaver"
>>>>
>>>> AnalyzingSuggester hits: screen, screensaver AnalyzingSuggester hits on
>>>> misspelled query:
>>>> <empty> AnalyzingSuggester EXACT_FIRST hits: screen AnalyzingSuggester
>>>> EXACT_FIRST hits on
>>>> misspelled query: <empty>
>>>>
>>>> FuzzySuggester hits: screen, screensaver FuzzySuggester hits on misspelled
>>>> query: screen,
>>>> screensaver FuzzySuggester EXACT_FIRST hits: screen FuzzySuggester
>>>> EXACT_FIRST hits on
>>>> misspelled query: <empty> => TARGET: screen
>>>>
>>>>
>>>> Is there a possibility to distinguish? I see that the 'exact' criteria
>>>> relies on an FST
>>>> aspect 'END_BYTE arc leaving'. Maybe these can be set differently when
>>>> building the
>>>> Levenshtein automata? I have no clue.
>>>
>>> It seems like the problem is that AnalyzingSuggester checks for exactFirst
>>> before calling
>>> .getFullPrefixPaths (which, in FuzzySuggester subclass, applies the
>>> fuzziness)?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> --------------------------------------------------------------------- To
>>> unsubscribe, e-mail:
>>> [email protected] For additional commands, e-mail:
>>> [email protected]
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
package org.apache.lucene.search.suggest.analyzing;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStreamToAutomaton;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.search.suggest.InputIterator.InputIteratorWrapper;
import org.apache.lucene.search.suggest.analyzing.FSTUtil.Path;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefIterator;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.Util;
import org.apache.lucene.util.fst.FST.BytesReader;
import org.apache.lucene.util.fst.PairOutputs.Pair;
import org.apache.lucene.util.fst.Util.MinResult;
public class FuzzySuggesterExactMatchOnlyTest extends AnalyzingSuggester
{
// we can not subclass FuzzySuggester because of its final state...
FuzzySuggester m_fuzzySuggester;
public FuzzySuggesterExactMatchOnlyTest(Analyzer analyzer)
{
super(analyzer);
m_fuzzySuggester = new FuzzySuggester(analyzer);
}
public static void main(String[] args) throws Exception
{
System.out.println("For named entity extraction, the 'exact' results are the only ones of interest.\n"
+ "For fuzzy match, the goal is to get all Levensthein-matching terms as 'exact' results too.");
FuzzySuggesterExactMatchOnlyTest testSuggester = new FuzzySuggesterExactMatchOnlyTest(new KeywordAnalyzer());
List<String> lDictionary = Arrays.asList("screen", "screensaver", "mouse");
System.out.println("\nDictionary: " + lDictionary);
final Iterator<String> termIterator = lDictionary.iterator();
testSuggester.build(new InputIteratorWrapper(new BytesRefIterator()
{
@Override
public BytesRef next() throws IOException
{
if(!termIterator.hasNext()) return null;
return new BytesRef(termIterator.next());
}
@Override
public Comparator<BytesRef> getComparator()
{
return null;
}
}));
String strQuery = "screan";
System.out.println("\nquery: '" + strQuery + "' - exact result as expected (correct). But not in any case! "
+ "This is when one letter is changed, which is not the first or last one.");
List<LookupResult> results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results: - double entry of 'screen'?");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "screew";
System.out.println("\nquery: '" + strQuery + "' - last letter changed: exact result empty (incorrect).");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results:");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "wcreen";
System.out.println("\nquery: '" + strQuery + "' - first letter changed: nothing found at all.");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results:");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "scree";
System.out.println("\nquery: '" + strQuery + "' - last letter removed.");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results:");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "scren";
System.out.println("\nquery: '" + strQuery + "' - 5th letter removed. Same as with last removed letter.");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results:");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "sreen";
System.out.println("\nquery: '" + strQuery + "' - 2th letter removed. Why different?");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results: - double entry of 'screen'?");
for (LookupResult result : results)
System.out.println(" " + result);
strQuery = "screen";
System.out.println("\nquery: '" + strQuery + "' - correct query: screen not found at all?");
results = testSuggester.lookup(strQuery, false, 10000);
System.out.println("All results:");
for (LookupResult result : results)
System.out.println(" " + result);
}
// modified lookup method (modifications with NER comments. Access to private super fields/members with reflection API)
@Override
public List<LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num)
{
// NER: copy private member references to 'somethingVisible' members for lookup. Verified: these ones won't be modified further in this
// method.
makePrivateMembersVisible();
assert num > 0;
if(onlyMorePopular)
{
throw new IllegalArgumentException("this suggester only works with onlyMorePopular=false");
}
if(m_fstVisible == null)
{
return Collections.emptyList();
}
// System.out.println("lookup key=" + key + " num=" + num);
for (int i = 0; i < key.length(); i++)
{
if(key.charAt(i) == 0x1E)
{
throw new IllegalArgumentException("lookup key cannot contain HOLE character U+001E; this character is reserved");
}
if(key.charAt(i) == 0x1F)
{
throw new IllegalArgumentException("lookup key cannot contain unit separator character U+001F; this character is reserved");
}
}
final BytesRef utf8Key = new BytesRef(key);
try
{
// NER: sadly invocation of private superclass method necessary
// Automaton lookupAutomaton = toLookupAutomaton(key);
Automaton lookupAutomaton = (Automaton) invokePrivateMethod(this, "toLookupAutomaton", new Object[] { key });
final CharsRef spare = new CharsRef();
// System.out.println(" now intersect exactFirst=" + exactFirst);
// Intersect automaton w/ suggest wFST and get all
// prefix starting nodes & their outputs:
// final PathIntersector intersector = getPathIntersector(lookupAutomaton, fst);
// System.out.println(" prefixPaths: " + prefixPaths.size());
BytesReader bytesReader = m_fstVisible.getBytesReader();
FST.Arc<Pair<Long, BytesRef>> scratchArc = new FST.Arc<Pair<Long, BytesRef>>();
final List<LookupResult> results = new ArrayList<LookupResult>();
// List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), m_fstVisible);
// NER: Modification according email from Michael McCandless
// prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fst);
List<FSTUtil.Path<Pair<Long, BytesRef>>> prefixPaths = getFullPrefixPaths(null, lookupAutomaton, m_fstVisible);
if(m_exactFirstVisible)
{
int count = 0;
for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
{
if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null)
{
// This node has END_BYTE arc leaving, meaning it's an
// "exact" match:
count++;
}
}
// Searcher just to find the single exact only
// match, if present:
Util.TopNSearcher<Pair<Long, BytesRef>> searcher =
new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, count * m_maxSurfaceFormsPerAnalyzedFormVisible, count
* m_maxSurfaceFormsPerAnalyzedFormVisible, m_weightComparatorVisible);
// NOTE: we could almost get away with only using
// the first start node. The only catch is if
// maxSurfaceFormsPerAnalyzedForm had kicked in and
// pruned our exact match from one of these nodes
// ...:
for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
{
if(m_fstVisible.findTargetArc(m_END_BYTEVisible, path.fstNode, scratchArc, bytesReader) != null)
{
// This node has END_BYTE arc leaving, meaning it's an
// "exact" match:
// NER: sadly access of private member necessary
Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output");
searcher.addStartPaths(scratchArc, m_fstVisible.outputs.add(output, scratchArc.output), false, path.input);
}
}
MinResult<Pair<Long, BytesRef>> completions[] = searcher.search();
// NOTE: this is rather inefficient: we enumerate
// every matching "exactly the same analyzed form"
// path, and then do linear scan to see if one of
// these exactly matches the input. It should be
// possible (though hairy) to do something similar
// to getByOutput, since the surface form is encoded
// into the FST output, so we more efficiently hone
// in on the exact surface-form match. Still, I
// suspect very little time is spent in this linear
// seach: it's bounded by how many prefix start
// nodes we have and the
// maxSurfaceFormsPerAnalyzedForm:
for (MinResult<Pair<Long, BytesRef>> completion : completions)
{
BytesRef output2 = completion.output.output2;
// NER: we want to have the several manifestations/synonyms in the results, so we commented out this final check
// boolean bSameSurfaceForm =
// (boolean) PrivateAccessor.invokePrivateMethod(this, "sameSurfaceForm", new Object[] { utf8Key, output2 });
// if(bSameSurfaceForm)
{
// NER: sadly invocation of private superclass method necessary
LookupResult lookupResult =
(LookupResult) invokePrivateMethod(this, "getLookupResult",
new Object[] { completion.output.output1, output2, spare });
results.add(lookupResult);
// NER: don't break - we want to have all synonym results
// break;
}
}
if(results.size() == num)
{
// That was quick:
return results;
}
}
// NER: these are the exact results. We print them out. These are the only ones of interest
System.out.println("Exact results:");
for (LookupResult result : results)
System.out.println(" " + result);
Util.TopNSearcher<Pair<Long, BytesRef>> searcher;
searcher =
new Util.TopNSearcher<Pair<Long, BytesRef>>(m_fstVisible, num - results.size(), num * m_maxAnalyzedPathsForOneInputVisible,
m_weightComparatorVisible)
{
private final Set<BytesRef> seen = new HashSet<BytesRef>();
@Override
protected boolean acceptResult(IntsRef input, Pair<Long, BytesRef> output)
{
try
{
// Dedup: when the input analyzes to a graph we
// can get duplicate surface forms:
if(seen.contains(output.output2))
{
return false;
}
seen.add(output.output2);
if(!m_exactFirstVisible)
{
return true;
}
else
{
// In exactFirst mode, don't accept any paths
// matching the surface form since that will
// create duplicate results:
// NER: sadly invocation of private superclass method necessary
boolean bSameSurfaceForm =
(boolean) invokePrivateMethod(FuzzySuggesterExactMatchOnlyTest.this, "sameSurfaceForm", new Object[] {
utf8Key, output.output2 });
if(bSameSurfaceForm)
{
// We found exact match, which means we should
// have already found it in the first search:
assert results.size() == 1;
return false;
}
else
{
return true;
}
}
}
catch (Exception e)
{
throw new RuntimeException(e);
}
}
};
// NER: since we invoked this above, we don't need this here again
// prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, m_fstVisible);
for (FSTUtil.Path<Pair<Long, BytesRef>> path : prefixPaths)
{
// NER: sadly access of private member necessary
Pair<Long, BytesRef> output = (Pair<Long, BytesRef>) getPrivateField(path, "output");
searcher.addStartPaths(path.fstNode, output, true, path.input);
}
MinResult<Pair<Long, BytesRef>> completions[] = searcher.search();
for (MinResult<Pair<Long, BytesRef>> completion : completions)
{
// NER: sadly invocation of private superclass method necessary
LookupResult result =
(LookupResult) invokePrivateMethod(this, "getLookupResult", new Object[] { completion.output.output1,
completion.output.output2, spare });
// TODO: for fuzzy case would be nice to return
// how many edits were required
// System.out.println(" result=" + result);
results.add(result);
if(results.size() == num)
{
// In the exactFirst=true case the search may
// produce one extra path
break;
}
}
return results;
}
catch (Exception bogus)
{
throw new RuntimeException(bogus);
}
}
protected FST<Pair<Long, BytesRef>> m_fstVisible;
protected boolean m_exactFirstVisible;
protected int m_maxSurfaceFormsPerAnalyzedFormVisible;
protected int m_END_BYTEVisible;
protected Comparator<Pair<Long, BytesRef>> m_weightComparatorVisible;
protected int m_maxAnalyzedPathsForOneInputVisible;
// sadly some reflection API reference copies because of private members use
@SuppressWarnings("unchecked")
protected void makePrivateMembersVisible()
{
try
{
m_fstVisible = (FST<Pair<Long, BytesRef>>) getPrivateField(this, "fst");
m_exactFirstVisible = (boolean) getPrivateField(this, "exactFirst");
m_maxSurfaceFormsPerAnalyzedFormVisible = (int) getPrivateField(this, "maxSurfaceFormsPerAnalyzedForm");
m_END_BYTEVisible = (int) getPrivateField(this, "END_BYTE");
m_weightComparatorVisible = (Comparator<Pair<Long, BytesRef>>) getPrivateField(this, "weightComparator");
m_maxAnalyzedPathsForOneInputVisible = (int) getPrivateField(this, "maxAnalyzedPathsForOneInput");
}
catch (Exception e)
{
throw new RuntimeException(e);
}
}
@SuppressWarnings("rawtypes")
public static Object invokePrivateMethod(Object alcatrazObject, String methodName, Object[] params) throws IllegalArgumentException,
IllegalAccessException, InvocationTargetException, NoSuchMethodException
{
// Check we have valid arguments...
if(alcatrazObject == null || methodName == null || params == null)
throw new IllegalStateException("Object, methodName and params must not be null");
// Go and find the private method...
Class currentClass = alcatrazObject.getClass();
while (currentClass != null)
{
final Method methods[] = currentClass.getDeclaredMethods();
for (int i = 0; i < methods.length; ++i)
{
if(methodName.equals(methods[i].getName()))
{
methods[i].setAccessible(true);
return methods[i].invoke(alcatrazObject, params);
}
}
currentClass = currentClass.getSuperclass();
}
throw new NoSuchMethodException("Method '" + methodName + "' not found");
}
@SuppressWarnings("rawtypes")
public static Object getPrivateField(Object alcatrazObject, String fieldName) throws IllegalArgumentException, IllegalAccessException,
NoSuchFieldException
{
// Check we have valid arguments...
if(alcatrazObject == null || fieldName == null) throw new IllegalStateException("Object or fieldname must not be null");
// Go and find the private field...
Class currentClass = alcatrazObject.getClass();
while (currentClass != null)
{
final Field fields[] = currentClass.getDeclaredFields();
for (int i = 0; i < fields.length; ++i)
{
if(fieldName.equals(fields[i].getName()))
{
fields[i].setAccessible(true);
return fields[i].get(alcatrazObject);
}
}
currentClass = currentClass.getSuperclass();
}
throw new NoSuchFieldException("Field '" + fieldName + "' not found");
}
// delegate methods to internal FuzzySuggester
@Override
protected Automaton convertAutomaton(Automaton a)
{
return m_fuzzySuggester.convertAutomaton(a);
}
@Override
protected List<Path<Pair<Long, BytesRef>>> getFullPrefixPaths(List<Path<Pair<Long, BytesRef>>> prefixPaths, Automaton lookupAutomaton,
FST<Pair<Long, BytesRef>> fst) throws IOException
{
return m_fuzzySuggester.getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
}
@Override
TokenStreamToAutomaton getTokenStreamToAutomaton()
{
return m_fuzzySuggester.getTokenStreamToAutomaton();
}
}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]