You are probably not getting this document returned:
list.add("strfffing_ m atcbbhing");
because both of its terms have an edit distance greater than two.
All the other documents have one or the other term, or both, within an edit
distance of 2 or less.
Your query is essentially: match a document if EITHER term matches. So, if
NEITHER matches (within an edit distance of 2), the document is not a
match.
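To make the edit-distance reasoning above concrete, here is a minimal sketch in plain Java. It is a textbook Levenshtein distance, not Lucene's own implementation, and the class and method names (EditDistanceSketch, levenshtein, matchesAny) are made up for illustration. The token lists assume the text is split on whitespace only, with underscores kept; the analyzer you index with may produce different tokens.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty string into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // OR semantics: the document matches if ANY query term is within
    // maxEdits of ANY document token.
    static boolean matchesAny(List<String> queryTerms, List<String> docTokens, int maxEdits) {
        for (String q : queryTerms) {
            for (String t : docTokens) {
                if (levenshtein(q, t) <= maxEdits) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("string", "matching");
        // Assumed tokens (whitespace split only).
        System.out.println(matchesAny(query, Arrays.asList("strang", "mutching"), 2));
        System.out.println(matchesAny(query, Arrays.asList("strfffing_", "m", "atcbbhing"), 2));
    }
}
```

Under those tokenization assumptions, "strfffing_" is four edits away from "string" and "atcbbhing" is three edits away from "matching", so neither side of the OR is within two edits.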
-- Jack Krupansky
-----Original Message-----
From: Pierre Antoine DuBoDeNa
Sent: Saturday, February 09, 2013 12:52 PM
To: java-user@lucene.apache.org
Subject: Re: fuzzy queries
With a query like string~ matching~ (without specifying a threshold) I get 14
results back.
Could it be a problem with the analyzers?
Here is the code:
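import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Class name inferred from the "new IndexProfiles()" calls below.
public class IndexProfiles {

    private File indexDir = new File("/a-directory-here");
    private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
    private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);

    public static void main(String[] args) throws Exception {
        IndexProfiles Indexer = new IndexProfiles();
        IndexWriter w = Indexer.CreateIndex();

        ArrayList<String> list = new ArrayList<String>();
        list.add("string matching");
        list.add("string123 matching");
        list.add("string matching123");
        list.add("string123 matching123");
        list.add("str4ing match2ing");
        list.add("1string 2matching");
        list.add("str_ing ma_tching");
        list.add("string_matching");
        list.add("strang mutching");
        list.add("strrring maatchinng");
        list.add("strfffing_ m atcbbhing");
        list.add("str2ing__mat3ching");
        list.add("string_m atching");
        list.add("string matching another token");
        list.add("strasding matc4hing ano23ther tok3en");
        list.add("str4ing maaatching_another 2t oken");

        // Index each string as its own single-field document.
        for (String companyname : list) {
            Indexer.addSingleField(w, companyname);
        }
        int numDocs = w.numDocs();
        System.out.println("# of Docs in Index: " + numDocs);
        w.close();

        DoIndexQuery("string~ matching~");
    }

    // Note: LoadIndex() is called here, but its definition is not included in this message.
    public static void DoIndexQuery(String query) throws IOException, ParseException {
        IndexProfiles Indexer = new IndexProfiles();
        IndexReader reader = Indexer.LoadIndex();
        Indexer.SearchIndex(reader, query, 50);
        reader.close();
    }

    public IndexWriter CreateIndex() throws IOException {
        Directory index = FSDirectory.open(indexDir);
        IndexWriter w = new IndexWriter(index, config);
        return w;
    }

    // Parses the query against the "Name" field and prints the top-k hits.
    // The parameter w is the IndexReader to search.
    public HashMap<Integer, String> SearchIndex(IndexReader w, String query, int topk)
            throws IOException, ParseException {
        Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
        IndexSearcher searcher = new IndexSearcher(w);
        TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        System.out.println("Found " + hits.length + " hits.");

        // Map each matching doc id to its stored "Name" value.
        HashMap<Integer, String> map = new HashMap<Integer, String>();
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            map.put(docId, d.get("Name"));
            System.out.println((i + 1) + ". " + d.get("Name"));
        }
        searcher.close();
        return map;
    }

    // Adds one document with a single stored, analyzed "Name" field.
    public void addSingleField(IndexWriter w, String str) throws IOException {
        Document doc = new Document();
        doc.add(new Field("Name", str, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}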
2013/2/9 Michael McCandless <luc...@mikemccandless.com>
Can you reduce your test case to indexing one document/field and
running a single FuzzyQuery (you seem to be running two at once,
OR'ing the results)?
And show the complete standalone source code (e.g., what is topk?) so we
can see how you are indexing / building the Query / searching.
The default minSim is 0.5.
Note that 0.01 is not useful in practice: it (should) match nearly all
terms. But I agree it's odd one term is not matching.
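As a rough illustration of why a threshold as low as 0.01 accepts almost anything, here is a small sketch of how a similarity score can be derived from an edit distance. It assumes the Lucene 3.x-style definition, similarity = 1 - editDistance / length of the shorter term, which is my reading of the FuzzyQuery documentation rather than a copy of Lucene's code. The class and method names (FuzzySimilaritySketch, similarity, accepted) are hypothetical.

```java
// Hypothetical helper, not part of Lucene.
final class FuzzySimilaritySketch {

    // Assumed scoring rule: 1 - editDistance / min(queryLength, termLength).
    static float similarity(int editDistance, int queryLength, int termLength) {
        return 1.0f - (float) editDistance / Math.min(queryLength, termLength);
    }

    // A term is accepted when its similarity exceeds the minimum similarity.
    static boolean accepted(float similarity, float minSim) {
        return similarity > minSim;
    }

    public static void main(String[] args) {
        // One edit between two six-letter terms, e.g. "string" vs "strang".
        float close = similarity(1, 6, 6);
        // Four edits between a six-letter query term and a ten-letter indexed term.
        float far = similarity(4, 6, 10);

        System.out.println("close: " + close + " accepted at 0.5?  " + accepted(close, 0.5f));
        System.out.println("far:   " + far + " accepted at 0.5?  " + accepted(far, 0.5f));
        System.out.println("far:   " + far + " accepted at 0.01? " + accepted(far, 0.01f));
    }
}
```

With this rule, a term four edits away from a six-letter query term scores about 0.33: rejected at 0.5 but accepted at 0.01. Only terms whose distance is about as large as the shorter term's length fall below 0.01.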
Mike McCandless
http://blog.mikemccandless.com
On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
<pad...@gmail.com> wrote:
>>
>> Hello,
>>
>> I use Lucene 3.6 and I am trying to use fuzzy queries so that I can match
>> many more results.
>>
>> I am adding for example these strings:
>>
>> list.add("string matching");
>> list.add("string123 matching");
>> list.add("string matching123");
>> list.add("string123 matching123");
>> list.add("str4ing match2ing");
>> list.add("1string 2matching");
>> list.add("str_ing ma_tching");
>> list.add("string_matching");
>> list.add("strang mutching");
>> list.add("strrring maatchinng");
>> list.add("strfffing_ m atcbbhing");
>> list.add("str2ing__mat3ching");
>> list.add("string_m atching");
>> list.add("string matching another token");
>> list.add("strasding matc4hing ano23ther tok3en");
>> list.add("str4ing maaatching_another 2t oken");
>>
>> Then I run this query:
>>
>> "string~0.01 matching~0.01"
>>
>> and I get back these results:
>>
>> Found 15 hits.
>> 1. 1string 2matching
>> 2. str_ing ma_tching
>> 3. string_m atching
>> 4. strang mutching
>> 5. str4ing match2ing
>> 6. strrring maatchinng
>> 7. string matching
>> 8. strasding matc4hing ano23ther tok3en
>> 9. string matching another token
>> 10. string matching123
>> 11. string123 matching
>> 12. strfffing_ m atcbbhing
>> 13. string123 matching123
>> 14. str4ing maaatching_another 2t oken
>> 15. string_matching
>>
>> So only 1 result is missing (with threshold 0.01): str2ing__mat3ching. Any
>> idea why? How can I extend the query to catch this one as well?
>>
>> Also, what's the default threshold for the ~ operator? Without specifying a
>> threshold I get 14 results; string_matching and str2ing__mat3ching are
>> missing this time.
>>
>> Here is the code for the queries:
>>
>> Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
>> IndexSearcher searcher = new IndexSearcher(w);
>> TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
>> searcher.search(q, collector);
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>
>> Thanks for the help.
>>
>>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org