GitHub user chenlica closed a discussion: Token based Fuzzy Match (from old 
wiki)

>From the page https://github.com/apache/texera/wiki/Token-based-Fuzzy-Match 
>(may be dangling)

===

Authors: Varun Bharill, Parag Saraogi.

Reviewers:

Ideas from paper - "Efficient Merging and Filtering Algorithms for Approximate 
String Searches"

1. Creating N-gram inverted lists.
A single document can be treated as a string and all the N grams of each 
documents need to be indexed. The final result of this task would be an 
inverted index of N-grams Vs document-ID.

Questions:
a) How do we fix "N" ?
b) For a particular gram are the document IDs stored in sorted order ?

2. T-occurance problem.
Given a Query string (q), compute all N-grams. From the index created in step 
1, we have a list of documents in which each gram of query string appears.
The T-occurance problem is to find the documents with at least T occurances. 
Available algorithms - 
a. Heap count
b. Scan Count
c. Divide skip
etc.

Questions:
Lets say, a phrase "University of california" appears in a certain document d1. 
Assume N = 5. So all the 5-grams of this phase have been indexed. Assume query 
string q = "University california". From step 2, many 5-grams of the query 
string will map to different documents. The index created in step 1 might as 
well give the location of the grams in the document. Based on the mappings of 
the 5-grams of query string, can we identify a relevant phrase in the document ?
   

3. Filtering

  We have studied various filtering mechanisms like Length filtering, Position 
filtering, Prefix filtering which can be combined in the form of a tree 
structure, where every level corresponds to a filter. Also we studied the 
sequence in which these filters should be applied to gain maximum performance 
advantages. 


Presentation on Token based fuzzy matching - 
https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259


GitHub link: https://github.com/apache/texera/discussions/3989

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]

Reply via email to