QuakeWang commented on PR #347: URL: https://github.com/apache/paimon-rust/pull/347#issuecomment-4589789983
> Thanks for the work here. The overall shape looks good to me: the builder validates the table/index prerequisites up front, plans shards by contiguous row-id ranges, reads with `_ROW_ID` and strict vector validation, and the index-only commit path preserves row tracking while adding overlap checks for global index files. > > A couple of follow-ups I think are worth addressing/clarifying before merge: > > 1. **Snapshot/concurrency semantics.** `LuminaIndexBuildBuilder::execute` plans and extracts vectors from the latest snapshot it observed, then commits index-only messages through `TableCommit`, which resolves against whatever snapshot is latest at commit time. The commit-side overlap check protects against another global-index commit, but it does not seem to verify that the underlying data snapshot/ranges are still the same. If a concurrent data commit rewrites/removes files for the same row ranges before the index-only commit lands, can the new index manifest be attached to a newer snapshot whose data no longer matches the built Lumina files? If this is intentionally safe because these tables are append-only + data-evolution row ranges are immutable, could you add a short comment/test or otherwise document that assumption? Otherwise we may need to carry the source snapshot id into the commit and reject conflicting data changes. > 2. **Temporary file cleanup on failure.** `build_index_file` removes the local temp file only after `copy_local_file_to_output` succeeds. If `pretrain`, `insert`, `dump`, or the copy fails, the temp file can be left behind. This is minor, but a scope guard or cleanup-on-error helper would make repeated failed builds safer. > > Also noted that the only full `execute` test is ignored because it needs the native Lumina library. That is understandable, but if we cannot enable it in CI, it would be helpful to keep as much non-native coverage as possible around the metadata/index-only commit path and to document how the ignored test is expected to be run manually. @leaves12138 Thanks for the careful review. I addressed the follow-ups in the latest local changes: 1. `LuminaIndexBuildBuilder::execute` now commits through `TableCommit::commit_if_latest_snapshot`, carrying the source snapshot id observed during planning/extraction. `TableCommit` validates the latest snapshot id on each retry before committing the index-only snapshot, so a concurrent data or index commit will reject the stale Lumina build instead of attaching it to a newer snapshot. 2. `build_index_file` now uses a small temp-file guard. The local Lumina dump file is cleaned up on failures from native `pretrain` / `insert` / `dump` / copy, and explicitly cleaned after a successful copy. 3. I kept the native execute E2E ignored, but documented the exact manual run command next to the test. I also added the ignored native builder E2E to the Linux integration CI path after `lumina-data` is installed, matching the existing optional Vortex-style CI coverage pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
