snazy commented on PR #4535: URL: https://github.com/apache/polaris/pull/4535#issuecomment-4563754649
I think it’s better to avoid intentional code duplication where the files are currently identical or nearly identical. From what I can see: * In `spark/src`, there are 16 byte-identical duplicated files, and 3 with a few differences. * In `integration/src`, there are 7 byte-identical duplicated files, and 1 with a few differences. Those look like strong candidates for shared code, with Spark-version-specific adapters only where the APIs actually differ. I also don’t think we should copy Iceberg’s module layout blindly. Iceberg has a much larger Spark compatibility surface and different maintenance constraints. Polaris should still use versioned modules where that helps, but share code where our implementation is actually identical. On the regression tests, I agree with @dimas-b: Most of that coverage would be better as JUnit integration tests. Both regtest CI jobs run for more than 40 minutes, so adding another copied regtest stack seems expensive to maintain and expensive for CI. Also, I’m not sure the commit-history preservation trick helps after the PR is squash-merged, because the copy/restore commits would be squashed away. Git does not record rename metadata but infers renames from file similarity. `--follow`, `--find-renames`, and the related config options are useful but not fully reliable, especially for copies/duplicates. Sharing the code instead of duplicating it would give us better `git log` and `git diff` behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
