Hi Sean,

Thanks for the quick response! Yes, what you described about how the LICENSE file should be distributed makes sense.

I ran into this because I was trying to build spark-2.4.5-bin-custom.tgz and then distribute that build to multiple machines, so that:

1. These machines can run Spark with the build.
2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.

Step 2 fails because the licenses directory is missing.

Building pyspark out of a binary distribution is a bit unconventional, but I only did this after failing to do what the official doc recommends (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable). Taking a step back, here is what I did originally.

In the spark-2.4.5 src directory, I ran:

`./build/mvn -DskipTests clean package`

Then I went to the python directory and ran:

`python setup.py sdist`

followed by

`pip install dist/pyspark-2.4.5.tar.gz`

(as mentioned in make-distribution.sh). This ran into "error: package directory `deps/jars` does not exist". However, directly running `sudo python setup.py install` worked.

On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:

> The source distribution has the source LICENSE file. The binary
> distribution has the LICENSE-binary license file. The source release isn't
> supposed to have LICENSE-binary as it would not be accurate for that
> release; LICENSE is. If you're redistributing a build, you'll have your own
> process for modifying and building it, including modifying the LICENSE file
> as appropriate; these LICENSE files represent what the project delivers to
> you rather than what you deliver to others. You could get the
> LICENSE-binary file from the right hash commit from git, if desired, as
> part of your build.
>
> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> Hello,
>>
>> I downloaded spark-2.4.5 source from
>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>> After extracting it and running:
>>
>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>
>>
>> It creates a Spark binary distribution named:
>> spark-2.4.5-bin-custom-spark.tgz
>>
>> So this file is supposedly a ready-to-distribute Spark binary file like
>> the one you can download from
>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>
>> However, one big difference between this custom build and the official
>> build is that you do not have a LICENSE file in the custom build. I don't
>> know much about Apache license, but I would suppose a custom build
>> distribution should have one.
>>
>> The reason we are missing the file is caused by the following code in
>> make-distribution.sh:
>> [image: image.png]
>>
>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
>> therefore there will be no LICENSE file in your custom build.
>>
>> I am aware of two pull requests related to this:
>>
>> https://github.com/apache/spark/pull/22436
>> started to use LICENSE-binary instead of just the LICENSE.
>>
>> And
>> https://github.com/apache/spark/pull/22840
>> To avoid failure when there is no LICENSE-binary in spark-2.4.5 source
>> directory.
>>
>> I think we need to change make-distribution.sh to make sure that the
>> LICENSE file is copied over to its corresponding custom build distribution.
>> However, I am not ready to do a pull request, so hopefully we can discuss
>> it here first.
>> --
>> Sincerely
>> Xiangyu Li
>>
>> <yisky...@gmail.com>
>>
>

--
Sincerely
Xiangyu Li
<yisky...@gmail.com>
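
P.S. For anyone else who hits the same failure, here is a minimal sketch of a check that could be run on an unpacked custom build before attempting `python setup.py install`. It only looks for the two things discussed in this thread: a top-level LICENSE file and a licenses directory. The function name `missing_license_entries` and the example path are made up for illustration; this is not part of Spark or of make-distribution.sh, and I am assuming that these two entries are the only license-related ones that matter.

```python
import os


def missing_license_entries(dist_dir):
    """Return the license entries (hypothetical check) absent from dist_dir.

    Assumption: an unpacked binary distribution is expected to contain a
    top-level LICENSE file and a licenses directory, as discussed in this
    thread. Nothing here is taken from Spark's own build scripts.
    """
    expected = [
        ("LICENSE", os.path.isfile),   # top-level license file
        ("licenses", os.path.isdir),   # directory whose absence broke my install
    ]
    return [
        name
        for name, exists in expected
        if not exists(os.path.join(dist_dir, name))
    ]


# Example usage (the path is a placeholder, not a real location):
# missing = missing_license_entries("/opt/spark-2.4.5-bin-custom")
# if missing:
#     print("Missing from distribution: " + ", ".join(missing))
```

An empty list means both entries are present. Anything else lists what would need to be copied in (for example from the matching git commit, as Sean suggested) before distributing the build.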