QuakeWang commented on PR #347:
URL: https://github.com/apache/paimon-rust/pull/347#issuecomment-4589789983

   > Thanks for the work here. The overall shape looks good to me: the builder 
validates the table/index prerequisites up front, plans shards by contiguous 
row-id ranges, reads with `_ROW_ID` and strict vector validation, and the 
index-only commit path preserves row tracking while adding overlap checks for 
global index files.
   > 
   > A couple of follow-ups I think are worth addressing/clarifying before 
merge:
   > 
   > 1. **Snapshot/concurrency semantics.** `LuminaIndexBuildBuilder::execute` 
plans and extracts vectors from the latest snapshot it observed, then commits 
index-only messages through `TableCommit`, which resolves against whatever 
snapshot is latest at commit time. The commit-side overlap check protects 
against another global-index commit, but it does not seem to verify that the 
underlying data snapshot/ranges are still the same. If a concurrent data commit 
rewrites/removes files for the same row ranges before the index-only commit 
lands, can the new index manifest be attached to a newer snapshot whose data no 
longer matches the built Lumina files? If this is intentionally safe because 
these tables are append-only + data-evolution row ranges are immutable, could 
you add a short comment/test or otherwise document that assumption? Otherwise 
we may need to carry the source snapshot id into the commit and reject 
conflicting data changes.
   > 2. **Temporary file cleanup on failure.** `build_index_file` removes the 
local temp file only after `copy_local_file_to_output` succeeds. If `pretrain`, 
`insert`, `dump`, or the copy fails, the temp file can be left behind. This is 
minor, but a scope guard or cleanup-on-error helper would make repeated failed 
builds safer.
   > 
   > Also noted that the only full `execute` test is ignored because it needs 
the native Lumina library. That is understandable, but if we cannot enable it 
in CI, it would be helpful to keep as much non-native coverage as possible 
around the metadata/index-only commit path and to document how the ignored test 
is expected to be run manually.
   
   @leaves12138   Thanks for the careful review.
   
   I addressed the follow-ups in the latest local changes:
   
   1. `LuminaIndexBuildBuilder::execute` now commits through 
`TableCommit::commit_if_latest_snapshot`, carrying the source snapshot id 
observed during planning/extraction. `TableCommit` validates the latest 
snapshot id on each retry before committing the index-only snapshot, so a 
concurrent data or index commit will reject the stale Lumina build instead of 
attaching it to a newer snapshot.
   
   2. `build_index_file` now uses a small temp-file guard. The local Lumina 
dump file is cleaned up on failures from native `pretrain` / `insert` / `dump` 
/ copy, and explicitly cleaned after a successful copy.
   
   3. I kept the native execute E2E ignored, but documented the exact manual 
run command next to the test. I also added the ignored native builder E2E to 
the Linux integration CI path after `lumina-data` is installed, matching the 
existing optional Vortex-style CI coverage pattern.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to