Re: [D] Token based Fuzzy Match (from old wiki) [texera]

via GitHub Mon, 20 Oct 2025 22:42:33 -0700


GitHub user chenlica closed a discussion: Token based Fuzzy Match (from old 
wiki)

>From the page https://github.com/apache/texera/wiki/Token-based-Fuzzy-Match
>(may be dangling)

===

Authors: Varun Bharill, Parag Saraogi.

Reviewers:

Ideas from paper - "Efficient Merging and Filtering Algorithms for Approximate
String Searches"

1. Creating N-gram inverted lists.
A single document can be treated as a string and all the N grams of each
documents need to be indexed. The final result of this task would be an
inverted index of N-grams Vs document-ID.

Questions:
a) How do we fix "N" ?
b) For a particular gram are the document IDs stored in sorted order ?

2. T-occurance problem.
Given a Query string (q), compute all N-grams. From the index created in step
1, we have a list of documents in which each gram of query string appears.
The T-occurance problem is to find the documents with at least T occurances.
Available algorithms -
a. Heap count
b. Scan Count
c. Divide skip
etc.

Questions:
Lets say, a phrase "University of california" appears in a certain document d1.
Assume N = 5. So all the 5-grams of this phase have been indexed. Assume query
string q = "University california". From step 2, many 5-grams of the query
string will map to different documents. The index created in step 1 might as
well give the location of the grams in the document. Based on the mappings of
the 5-grams of query string, can we identify a relevant phrase in the document ?

3. Filtering

We have studied various filtering mechanisms like Length filtering, Position
filtering, Prefix filtering which can be combined in the form of a tree
structure, where every level corresponds to a filter. Also we studied the
sequence in which these filters should be applied to gain maximum performance
advantages.

Presentation on Token based fuzzy matching -
https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259

GitHub link: https://github.com/apache/texera/discussions/3989

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to:
[email protected]

Re: [D] Token based Fuzzy Match (from old wiki) [texera]

Reply via email to