houqp commented on pull request #982:
URL: https://github.com/apache/arrow-datafusion/pull/982#issuecomment-915835638


   Good points @Dandandan. Having a lock file for datafusion would indeed
prevent us from catching build errors caused by the latest dependency releases.
I just took a look at the cargo project itself, and it's interesting that even
though they release binaries, they don't check in a Cargo.lock file. I am going
to do more research on this topic.
   
   > Next to that, do we have any users actually using the source release (e.g. 
with included Cargo.lock file), instead of only using datafusion in the 
Cargo.toml file?
   
   I am certainly not expecting `datafusion` users to use the ASF source
release. They should always use cargo to manage library dependencies, and cargo
ignores the lock files of dependencies.
   
   This is mainly for the python binding and the ballista binaries. The problem
right now is that ballista and datafusion are in the same workspace, so they
share the same lock file :( But now that I have taken a second look at the
ballista crates, all of them, including the scheduler and the executor, are
expected to be used as libraries, i.e. as upstream dependencies of the client
crate. So we should hold off on adding a lock file for the ballista crates
until we move their binaries into their own crates.
   
   The lack of reproducible builds from the same source tree affects our
development and release process too. For example, random build failures like
https://github.com/apache/arrow-datafusion/issues/961 could have been avoided
with a lock file. In that case, our CI broke without any code change. A lock
file would also make sure the final binary wheels we publish to pypi are built
with exactly the same set of dependencies that are used in our automated and
manual tests, even though the test run and the wheel release build could happen
at very different times.
   
   In short, I propose we limit the lock file to the python binding for now.
   
   > Also, before we go ahead with this, I think it makes sense to have a plan 
/ document how to update the cargo lock file.
   
   Definitely something good to document in the developer doc. I think
on-demand updates would make the most sense: if a new commit adds a new
dependency, the Cargo.lock file changes, so it should be updated in the same
commit, otherwise the build won't be reproducible anymore. I have also found
lock file diffs useful for spotting red flags such as dependency bloat.
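   
   To illustrate the lock file diff idea, here is a minimal Python sketch that 
compares two Cargo.lock files and lists added, removed, and changed packages. 
It is not part of the project: the function names `parse_lock` and `diff_locks` 
and the sample package versions are hypothetical, and the parser only reads the 
`name` and `version` keys of `[[package]]` entries instead of doing full TOML 
parsing.

```python
def parse_lock(text):
    """Collect (name, version) pairs from the [[package]] entries of a Cargo.lock."""
    packages = set()
    name = version = None
    in_package = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new table starts: flush the package entry we were reading, if any.
            if in_package and name and version:
                packages.add((name, version))
            in_package = line == "[[package]]"
            name = version = None
        elif in_package and line.startswith("name = "):
            name = line.split("=", 1)[1].strip().strip('"')
        elif in_package and line.startswith("version = "):
            version = line.split("=", 1)[1].strip().strip('"')
    # Flush the last entry in the file.
    if in_package and name and version:
        packages.add((name, version))
    return packages


def versions_by_name(packages):
    """Group versions by package name (a lock file may pin several versions)."""
    grouped = {}
    for name, version in packages:
        grouped.setdefault(name, set()).add(version)
    return grouped


def diff_locks(old_text, new_text):
    """Return package names that were added, removed, or changed version."""
    old = versions_by_name(parse_lock(old_text))
    new = versions_by_name(parse_lock(new_text))
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


# Hypothetical lock file contents, for illustration only.
OLD_LOCK = """
[[package]]
name = "arrow"
version = "5.3.0"

[[package]]
name = "serde"
version = "1.0.130"
"""

NEW_LOCK = """
[[package]]
name = "arrow"
version = "5.3.0"

[[package]]
name = "serde"
version = "1.0.131"

[[package]]
name = "tokio"
version = "1.11.0"
dependencies = [
 "pin-project-lite",
]
"""

if __name__ == "__main__":
    print(diff_locks(OLD_LOCK, NEW_LOCK))
```

   A long "added" list for a small change would be the kind of red flag worth 
a closer look during review.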

