kaori-seasons commented on code in PR #6766: URL: https://github.com/apache/paimon/pull/6766#discussion_r2601258956
##########
paimon-python/pypaimon/manifest/index_manifest_file.py:
##########
@@ -0,0 +1,95 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from io import BytesIO
+from typing import List
+
+import fastavro
+
+from pypaimon.index.deletion_vector_meta import DeletionVectorMeta
+from pypaimon.index.index_file_meta import IndexFileMeta
+from pypaimon.manifest.index_manifest_entry import IndexManifestEntry
+from pypaimon.table.row.generic_row import GenericRowDeserializer
+
+
+class IndexManifestFile:
+    """Index manifest file reader for reading index manifest entries."""
+
+    DELETION_VECTORS_INDEX = "DELETION_VECTORS"
+
+    def __init__(self, table):
+        from pypaimon.table.file_store_table import FileStoreTable
+
+        self.table: FileStoreTable = table
+        manifest_path = table.table_path.rstrip('/')

Review Comment:
Problem: In `_to_deletion_files`, `index_path` is built directly from `self.table.table_path + '/index'` and ignores `index_file.external_path` (the `IndexFileMeta` object has an `external_path` field). Separately, when `dv_meta_record` is parsed with fastavro, field names are hardcoded (`dv_meta_record['f0']`, `'f1'`, `'f2'`, etc.).

Impact: An index file stored under an external path would be resolved to the wrong location. If the Avro schema uses different field names, the hardcoded lookups raise a `KeyError`.
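To illustrate the hardcoded-field concern, here is a minimal sketch of a tolerant lookup helper. The helper name `get_field` and the alternative field names (`file_name`, `offset`, `length`, `cardinality`, `f3`) are assumptions made up for this example, not names taken from the PR or from the Paimon Avro schema.

```python
from typing import Any, Mapping, Sequence

_MISSING = object()  # sentinel so that None can be passed as a real default


def get_field(record: Mapping[str, Any], names: Sequence[str], default: Any = _MISSING) -> Any:
    """Return the value stored under the first name in `names` that `record` contains."""
    for name in names:
        if name in record:
            return record[name]
    if default is _MISSING:
        # Fail with a message that lists what was tried and what is available.
        raise KeyError(f"none of {list(names)} found; record has {sorted(record)}")
    return default


# Illustrative records only: the named variant is hypothetical, not the Paimon schema.
positional = {"f0": "data-1.parquet", "f1": 0, "f2": 128}
named = {"file_name": "data-1.parquet", "offset": 0, "length": 128}

for record in (positional, named):
    file_name = get_field(record, ("file_name", "f0"))
    offset = get_field(record, ("offset", "f1"))
    length = get_field(record, ("length", "f2"))
    # Optional field: fall back to None instead of raising.
    cardinality = get_field(record, ("cardinality", "f3"), default=None)
    print(file_name, offset, length, cardinality)
```

A helper like this keeps the accepted names in one place and fails with a readable message when none match, which may be easier to debug than a bare `KeyError` on `'f0'`.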
Suggestion: When constructing `dv_file_path`, prefer `index_file.external_path` if it is set. When parsing `dv_meta_record`, look fields up defensively (`get`) and accept multiple naming conventions, or use schema-driven parsing with the fastavro reader.

Example:

```
def _to_deletion_files(self, index_entry) -> Dict[str, DeletionFile]:
    # Requires `Dict` (from typing) and `DeletionFile` to be imported in this module.
    deletion_files = {}
    index_file = index_entry.index_file
    if not index_file.dv_ranges:
        return deletion_files

    # prefer external_path when present
    # (assumes external_path is a directory; if it already holds the full
    # file path, use it as dv_file_path directly)
    index_path = (
        index_file.external_path
        if index_file.external_path
        else self.table.table_path.rstrip('/') + '/index'
    )
    dv_file_path = f"{index_path}/{index_file.file_name}"

    for key, dv_meta in index_file.dv_ranges.items():
        # dv_meta is expected to be DeletionVectorMeta
        deletion_file = DeletionFile(
            dv_index_path=dv_file_path,
            offset=dv_meta.offset,
            length=dv_meta.length,
            cardinality=dv_meta.cardinality
        )
        deletion_files[key] = deletion_file
    return deletion_files
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
