[GitHub] [pinot] jasperjiaguo opened a new issue, #9666: Need customized stop word set for LuceneTextIndex

GitBox Wed, 26 Oct 2022 13:40:47 -0700


jasperjiaguo opened a new issue, #9666:
URL: https://github.com/apache/pinot/issues/9666


   In `LuceneTextIndexCreator` we currently hardcode the stop words for the
Lucene text index:
   ```
   Arrays.asList("a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
       "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
       "that", "the", "their", "then", "than", "there", "these", "they", "this",
       "to", "was", "will", "with", "those"),
   ```
   These words get pruned out during text index generation as well as when
filtering (in `StandardAnalyzer`). The problem is that in production we found
users issuing queries like
   `SELECT ... FROM ignoreMe WHERE TEXT_MATCH(title, '"IT staff" OR "IT manager"')`
   which actually return the results matching
   `TEXT_MATCH(title, '"staff" OR "manager"')`, because `IT` is treated as the
stop word `it`. This can be easily reproduced in `TextSearchQueriesTest`.
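
To illustrate the effect, here is a minimal sketch in plain Java. It is a simplified model of a lowercase-then-drop-stop-words pipeline, not Lucene's `StandardAnalyzer` and not Pinot code; the class `StopWordDemo` and the helper `analyze` are hypothetical names made up for this example. It shows that with the hardcoded list `"IT staff"` reduces to the single term `staff`, while a customized set without `it` keeps both terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for an analyzer chain (tokenize -> lowercase -> stop filter).
// Assumption: this only mimics the behaviour described in the issue; it is not Lucene.
class StopWordDemo {
  // The hardcoded stop word list quoted in the issue.
  static final Set<String> DEFAULT_STOP_WORDS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no",
      "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they",
      "this", "to", "was", "will", "with", "those")));

  // Hypothetical helper: split on non-alphanumeric characters, lowercase each
  // token, and drop every token contained in the given stop word set.
  static List<String> analyze(String text, Set<String> stopWords) {
    List<String> tokens = new ArrayList<>();
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      if (raw.isEmpty()) {
        continue; // split() can yield an empty leading element
      }
      String token = raw.toLowerCase(Locale.ROOT);
      if (!stopWords.contains(token)) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // With the hardcoded list, "IT" is lowercased to "it" and removed.
    System.out.println(analyze("IT staff", DEFAULT_STOP_WORDS)); // prints [staff]

    // With a customized set that no longer contains "it", the term survives.
    Set<String> custom = new HashSet<>(DEFAULT_STOP_WORDS);
    custom.remove("it");
    System.out.println(analyze("IT staff", custom)); // prints [it, staff]
  }
}
```

Passing the stop word set as a parameter, rather than reading a constant, is the design choice the issue asks for: the same set would need to be applied at both index time and query time so the two sides agree on which terms exist.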
   Meanwhile, there is a TODO item for making `LUCENE_INDEX_MAX_BUFFER_SIZE_MB`
configurable. These two changes can be evaluated/made together.
   cc @Jackie-Jiang @walterddr @siddharthteotia @SabrinaZhaozyf 
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

