jerry-024 commented on code in PR #7670:
URL: https://github.com/apache/paimon/pull/7670#discussion_r3114853088


##########
paimon-python/pypaimon/globalindex/tantivy/tantivy_full_text_global_index_reader.py:
##########
@@ -152,13 +152,27 @@ def visit_full_text_search(self, full_text_search) -> Optional[ScoredGlobalIndex
 
         searcher = self._searcher
         query = self._index.parse_query(query_text, ["text"])
-        results = searcher.search(query, limit)
 
+        scored_results = searcher.search(query, limit)
+        if not scored_results.hits:
+            return DictBasedScoredIndexResult({})
+
+        addr_to_score: Dict[tuple, float] = {
+            (addr.segment_ord, addr.doc): score
+            for score, addr in scored_results.hits
+        }
+

Review Comment:
   This fallback looks like a potentially large performance regression for 
broad queries. We only need the `row_id` for the top-`limit` hits in 
`scored_results`, but this second search asks tantivy-py to collect up to 
`searcher.num_docs` matches ordered by `row_id`. For a common term on a large 
shard, a `limit=10` lookup can now degenerate into scanning/materializing 
almost the full match set just to recover 10 ids. Could we keep `row_id` stored 
until batch fast-field access is available in the shipped tantivy-py version, 
or add a direct fast-field read path instead?
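
A minimal sketch of the stored-field option, for illustration only. It assumes `row_id` stays a stored field and that the shipped tantivy-py exposes `Searcher.doc(doc_address)` and `Document.get_first(field_name)`. The helper name `row_ids_for_top_hits` is hypothetical and not part of this PR:

```python
from typing import Any, Dict, Iterable, Tuple


def row_ids_for_top_hits(
    searcher: Any,
    hits: Iterable[Tuple[float, Any]],
    row_id_field: str = "row_id",
) -> Dict[int, float]:
    """Map row_id -> score by reading only the hits already collected.

    `searcher` is assumed to behave like a tantivy-py Searcher and `hits`
    like `SearchResult.hits`, i.e. (score, doc_address) pairs.
    """
    id_to_score: Dict[int, float] = {}
    for score, addr in hits:
        # One stored-document lookup per hit, so at most `limit` lookups.
        doc = searcher.doc(addr)
        row_id = doc.get_first(row_id_field)
        if row_id is None:
            # Assumption: skip documents without a stored row_id.
            continue
        id_to_score[int(row_id)] = score
    return id_to_score
```

With this shape the work is bounded by the number of hits returned by the first search, not by the size of the match set, and the resulting dict could presumably be passed to `DictBasedScoredIndexResult` in place of the second search.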


