snazy commented on PR #4535:
URL: https://github.com/apache/polaris/pull/4535#issuecomment-4563754649

   I think it’s better to avoid intentional code duplication where the files 
are currently identical or nearly identical.
   
   From what I can see:
   * In `spark/src`, there are 16 byte-identical duplicated files, and 3 with a 
few differences.
   * In `integration/src`, there are 7 byte-identical duplicated files, and 1 
with a few differences.
   
   Those look like strong candidates for shared code, with 
Spark-version-specific adapters only where the APIs actually differ.
   
   I also don’t think we should copy Iceberg’s module layout blindly.
   Iceberg has a much larger Spark compatibility surface and different 
maintenance constraints.
   Polaris should still use versioned modules where that helps, but share code 
where our implementation is actually identical.
   
   On the regression tests, I agree with @dimas-b: Most of that coverage would 
be better as JUnit integration tests. Both regtest CI jobs run for more than 40 
minutes, so adding another copied regtest stack seems expensive to maintain and 
expensive for CI.
   
   Also, I’m not sure the commit-history preservation trick helps after the PR 
is squash-merged, because the copy/restore commits would be squashed away. Git 
does not record rename metadata but infers renames from file similarity. 
`--follow`, `--find-renames`, and the related config options are useful but not 
fully reliable, especially for copies/duplicates. Sharing the code instead of 
duplicating it would give us better `git log` and `git diff` behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to