GitHub user chenlica created a discussion: Query Rewriter (from old wiki)
> From the page https://github.com/apache/texera/wiki/Query-Rewriter (may be dangling)

=====

Authors: [Shiladitya Sen](http://github.com/shiladityasen), [Kishore Narendran](http://github.com/kishore-narendran)

Reviewer: Chen Li (**DONE**)

## Synopsis

The purpose of the "QueryRewriter" operator is to correct missing-space errors in a query that can lead to incorrect tokenization. For instance, the query "newyork" can be rewritten by this operator as "new york". The operator can be used to return:

* The most likely rewritten query found using a word-frequency dictionary; or
* A set of valid rewritten queries.

## Status

As of 6/3/2016: **COMPLETED**

## Modules

`edu.uci.ics.texera.dataflow.queryrewriter`

## Related Issues

Design: Query Rewriter Issue - https://github.com/Texera/texera/issues/29

## Description

The operator inserts spaces into a query string to find likely words and rewrites the query accordingly. It has two implementations:

* A dynamic programming algorithm that uses a word-frequency dictionary to find the most likely tokenization. This algorithm was adapted from the Chinese character tokenization performed in the [Srch2 Chinese Tokenization module](https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197). The word-frequency dictionary was derived from [Google unigrams](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and the [NLTK English dictionary](http://www.nltk.org/book/ch02.html). The score the algorithm assigns to each word is the reciprocal of its frequency.
* A recursive algorithm that uses an English dictionary (possibly without word frequencies) to find all combinations of valid tokenizations of a search string.
  This algorithm can be found [here](https://github.com/Texera/texera/blob/master/texera/texera-dataflow/src/main/java/edu/uci/ics/texera/dataflow/queryrewriter/QuerySegmenter.java#L143).

## Presentation

[Query Rewriter Dynamic Programming Algorithm](https://docs.google.com/presentation/d/1-Ufi_1G2JYYdHCOWeSRxxchhKYXAP2AMHhBDOvTQyVw/pub?start=true&loop=true&delayms=10000)

GitHub link: https://github.com/apache/texera/discussions/3981
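
As a rough illustration of the dynamic programming implementation described under Description, the sketch below picks the segmentation with the lowest total score, where each word's score is the reciprocal of its frequency. It is not the Texera code: the function name `rewrite_query`, the `freq` dictionary, and the choice to sum scores and to return the query unchanged when no segmentation exists are all assumptions made for this example.

```python
def rewrite_query(query, freq):
    """Return the lowest-cost segmentation of `query`, joined by spaces.

    `freq` is a hypothetical word -> frequency mapping (frequencies > 0).
    The cost of a word is 1 / frequency; the cost of a segmentation is
    assumed here to be the sum of its word costs.
    """
    n = len(query)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost of segmenting query[:i]
    back = [-1] * (n + 1)   # back[i]: start index of the last word in that segmentation
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            word = query[start:end]
            if word in freq and best[start] != inf:
                cost = best[start] + 1.0 / freq[word]
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start

    if best[n] == inf:
        return query  # assumption: leave the query untouched if it cannot be segmented

    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(query[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))


# Made-up frequencies, for illustration only.
sample_freq = {"new": 1000, "york": 500, "newyork": 1}
print(rewrite_query("newyork", sample_freq))  # new york
```

With these made-up frequencies, "new york" costs 1/1000 + 1/500, which is lower than the cost of 1 for the single rare token "newyork", so the split form wins.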

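
The recursive implementation described under Description can be sketched in a similar way. The code below enumerates every way to split a string into dictionary words. It is a minimal sketch, not the `QuerySegmenter` implementation: the name `all_segmentations` and the plain `set` used as the dictionary are assumptions for this example.

```python
def all_segmentations(query, dictionary):
    """Return every split of `query` into words from `dictionary`.

    `dictionary` is a hypothetical set of valid words; no frequencies are needed.
    Each result is a space-joined string.
    """
    if not query:
        return [""]  # one way to segment the empty string: no words at all

    results = []
    for i in range(1, len(query) + 1):
        prefix = query[:i]
        if prefix in dictionary:
            # Segment the remainder recursively and prepend the prefix.
            for rest in all_segmentations(query[i:], dictionary):
                results.append(prefix if not rest else prefix + " " + rest)
    return results


# Made-up dictionary, for illustration only.
sample_words = {"new", "york", "newyork"}
print(all_segmentations("newyork", sample_words))  # ['new york', 'newyork']
```

The number of valid segmentations can grow exponentially with the query length, so a sketch like this is only practical for short queries unless intermediate results are cached.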