jerry-024 commented on code in PR #7670:
URL: https://github.com/apache/paimon/pull/7670#discussion_r3114853088
##########
paimon-python/pypaimon/globalindex/tantivy/tantivy_full_text_global_index_reader.py:
##########
@@ -152,13 +152,27 @@ def visit_full_text_search(self, full_text_search) -> Optional[ScoredGlobalIndex
searcher = self._searcher
query = self._index.parse_query(query_text, ["text"])
- results = searcher.search(query, limit)
+ scored_results = searcher.search(query, limit)
+ if not scored_results.hits:
+ return DictBasedScoredIndexResult({})
+
+ addr_to_score: Dict[tuple, float] = {
+ (addr.segment_ord, addr.doc): score
+ for score, addr in scored_results.hits
+ }
+
Review Comment:
This fallback looks like a potentially large performance regression for
broad queries. We only need the `row_id` for the top-`limit` hits in
`scored_results`, but this second search asks tantivy-py to collect up to
`searcher.num_docs` matches ordered by `row_id`. For a common term on a large
shard, a `limit=10` lookup can now degenerate into scanning/materializing
almost the full match set just to recover 10 ids. Could we keep `row_id` stored
until batch fast-field access is available in the shipped tantivy-py version,
or add a direct fast-field read path instead?
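
To make the cost concern concrete, here is a toy model in plain Python. It does not call tantivy-py at all: `row_ids_via_full_scan`, `row_ids_via_direct_lookup`, `matches`, and `row_id_of` are hypothetical names invented for this illustration, and the in-memory dict only stands in for whatever per-document `row_id` read path (stored field or fast field) ends up being used. It counts how many documents each approach touches to recover `row_id` for the top-`limit` hits.

```python
from typing import Dict, List, Tuple

DocAddr = Tuple[int, int]  # (segment_ord, doc), the key used in the diff above


def row_ids_via_full_scan(
    matches: List[Tuple[DocAddr, int]],
    top_hits: List[Tuple[float, DocAddr]],
) -> Tuple[Dict[int, float], int]:
    """Model of the fallback: walk every match to build addr -> row_id."""
    visited = 0
    addr_to_row_id: Dict[DocAddr, int] = {}
    for addr, row_id in matches:  # work grows with the full match set
        visited += 1
        addr_to_row_id[addr] = row_id
    scores = {addr_to_row_id[addr]: score for score, addr in top_hits}
    return scores, visited


def row_ids_via_direct_lookup(
    row_id_of: Dict[DocAddr, int],
    top_hits: List[Tuple[float, DocAddr]],
) -> Tuple[Dict[int, float], int]:
    """Model of a per-hit read: touch only the top-`limit` documents."""
    visited = 0
    scores: Dict[int, float] = {}
    for score, addr in top_hits:  # work grows with `limit` only
        visited += 1
        scores[row_id_of[addr]] = score
    return scores, visited


# Assumed scenario: a common term matching 100,000 docs, with limit=10.
matches = [((0, doc), 1000 + doc) for doc in range(100_000)]
row_id_of = dict(matches)
top_hits = [(1.0 / (i + 1), (0, i * 7)) for i in range(10)]

full, full_cost = row_ids_via_full_scan(matches, top_hits)
direct, direct_cost = row_ids_via_direct_lookup(row_id_of, top_hits)
print(full == direct, full_cost, direct_cost)
```

Both paths return the same `row_id -> score` mapping in this model, but the first touches every matching document while the second touches only `limit` of them. Real numbers would depend on how tantivy-py collects and orders the matches, which this sketch does not attempt to reproduce.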
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.