houqp commented on pull request #982: URL: https://github.com/apache/arrow-datafusion/pull/982#issuecomment-915835638
Good points @Dandandan. Having a lock file for datafusion definitely prevents us from catching build errors caused by the latest dependency releases. I just took a look at the cargo project itself, and it's interesting that even though they release binaries, they don't check in a Cargo.lock file. I am going to do more research on this topic.

> Next to that, do we have any users actually using the source release (e.g. with included Cargo.lock file), instead of only using datafusion in the Cargo.toml file?

I am certainly not expecting `datafusion` users to use the ASF source release. They should always use cargo to manage library dependencies, which will ignore a library's lock file. This is mainly for the python binding and ballista binaries. The problem right now is that ballista and datafusion are in the same workspace, so they share the same lock file :( But now that I have taken a second look at the ballista crates, all of them, including the scheduler and the executor, are expected to be used as library dependencies of the client crate. So we should hold off on adding a lock file for the ballista crates until we move its binaries into their own crates.

The lack of reproducible builds from the same source tree affects our development and release process too. For example, random build failures like https://github.com/apache/arrow-datafusion/issues/961 could have been avoided with a lock file. In that case, our CI broke without any code change. Another potential issue it could address is making sure the final binary wheels we publish to PyPI are built with exactly the same set of dependencies used in our automated and manual tests. The test run and the wheel release build could happen at very different times.

In short, I propose we limit the lock file to the python binding for now.

> Also, before we go ahead with this, I think it makes sense to have a plan / document how to update the cargo lock file.

Definitely something good to document in the developer doc.
I think on-demand updates would make the most sense: if a new commit adds a new dependency, Cargo.lock changes, and the lock file should be updated in that same commit, otherwise the build won't be reproducible anymore. I also found it useful to use the lock file diff to look for red flags on dependency bloat.
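
To illustrate the lock file diff idea, here is a minimal sketch, not a definitive tool. It assumes the usual Cargo.lock layout of `[[package]]` entries with `name` and `version` keys. The helper names `lock_packages` and `lock_diff` and the sample package names are hypothetical and are not part of cargo or datafusion.

```python
import re


def lock_packages(lock_text):
    """Return the set of (name, version) pairs found in a Cargo.lock body."""
    packages = set()
    # Every package entry starts with a [[package]] table header.
    for block in lock_text.split("[[package]]")[1:]:
        name = re.search(r'^name = "([^"]+)"', block, re.MULTILINE)
        version = re.search(r'^version = "([^"]+)"', block, re.MULTILINE)
        if name and version:
            packages.add((name.group(1), version.group(1)))
    return packages


def lock_diff(old_text, new_text):
    """Return (added, removed) package entries between two lock files."""
    old, new = lock_packages(old_text), lock_packages(new_text)
    return sorted(new - old), sorted(old - new)


# Made-up sample data, only to show the shape of the output.
OLD = """
[[package]]
name = "alpha"
version = "1.0.0"

[[package]]
name = "beta"
version = "0.3.1"
"""

NEW = """
[[package]]
name = "alpha"
version = "1.1.0"

[[package]]
name = "beta"
version = "0.3.1"

[[package]]
name = "gamma"
version = "2.0.0"
"""

added, removed = lock_diff(OLD, NEW)
print("added:", added)
print("removed:", removed)
```

A version bump shows up as one removed entry plus one added entry for the same name, while an entry that appears only in the added list is a brand new dependency, which is the kind of change worth a closer look in review.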
