To reproduce this, I just did:

    curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
    tar xzf spark-2.4.5.tgz
    cd spark-2.4.5
    ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
    mv spark-2.4.5-bin-custom-spark.tgz ../
    cd ..
    tar xzf spark-2.4.5-bin-custom-spark.tgz
    cd spark-2.4.5-bin-custom-spark/python/
    sudo python setup.py install

And here is the output:

[image: image.png]

On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:

> You wrote:
>
> "
> 2. On each machine, I can install pyspark by running `python setup.py
> install` inside the python directory.
>
> Step 2 would fail because of missing the licenses directory.
> "
>
> That shouldn't depend on the license file, and the script you showed does
> not fail when the file is not present, so I am wondering what this means.
> I'm not sure there's a JIRA here yet.
>
> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> Hmm, sorry, I don't get what part of my email you were referring to when
>> you said "the build fails?".
>>
>> So I am trying to build a custom Spark binary distribution with, say,
>> different Hadoop versions and R support.
>>
>> Then I stored this custom build on S3, so as I am building more machines
>> I can just directly download this custom build from S3. But besides
>> spark-submit and whatnot, I also wanted to install the pyspark Python
>> package on the machine I am building.
>>
>> The lack of the LICENSE file in the custom build would prevent pyspark
>> from being successfully built.
>>
>> Hopefully this answers your question.
>>
>> The second part of my last email was about building pyspark inside the
>> Spark source directory. I will raise an issue on Jira for that, as it is
>> more of a clear-cut problem with the documentation on the website and the
>> comments in make-distribution.sh.
>>
>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Hm, the build fails? You can see this is just skipped if not present,
>>> for this reason.
>>> I'm not clear why you need the file for its own sake, for your own
>>> internal modification that you don't redistribute.
>>>
>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks for the quick response!
>>>> Yes, what you described about how the LICENSE file should be
>>>> distributed makes sense.
>>>>
>>>> The reason I learned about this is that I was trying to build
>>>> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
>>>> machines, so that:
>>>>
>>>> 1. These machines can run Spark with the build.
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>>
>>>> Building pyspark out of a binary distribution is a bit unconventional,
>>>> but I did this after failing to do what the official doc recommended
>>>> (https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>> so taking a step back to describe what I did originally:
>>>>
>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>
>>>> `./build/mvn -DskipTests clean package`
>>>>
>>>> And then went to the python directory and did:
>>>>
>>>> `python setup.py sdist` followed by
>>>> `pip install dist/pyspark-2.4.5.tar.gz` (as mentioned in
>>>> make-distribution.sh).
>>>>
>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>
>>>> However, directly running
>>>>
>>>> `sudo python setup.py install`
>>>>
>>>> worked.
>>>>
>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> The source distribution has the source LICENSE file. The binary
>>>>> distribution has the LICENSE-binary license file. The source release
>>>>> isn't supposed to have LICENSE-binary as it would not be accurate for
>>>>> that release; LICENSE is. If you're redistributing a build, you'll have
>>>>> your own process for modifying and building it, including modifying the
>>>>> LICENSE file as appropriate; these LICENSE files represent what the
>>>>> project delivers to you rather than what you deliver to others.
>>>>> You could get the LICENSE-binary file from the right commit hash in
>>>>> git, if desired, as part of your build.
>>>>>
>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I downloaded the spark-2.4.5 source from
>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>> After extracting it and running:
>>>>>>
>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
>>>>>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>
>>>>>> It creates a Spark binary distribution named:
>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>
>>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>>> like the one you can download from
>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>
>>>>>> However, one big difference between this custom build and the
>>>>>> official build is that you do not have a LICENSE file in the custom
>>>>>> build. I don't know much about the Apache license, but I would suppose
>>>>>> a custom build distribution should have one.
>>>>>>
>>>>>> The file is missing because of the following code in
>>>>>> make-distribution.sh:
>>>>>> [image: image.png]
>>>>>>
>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
>>>>>> therefore there will be no LICENSE file in your custom build.
>>>>>>
>>>>>> I am aware of two pull requests related to this:
>>>>>>
>>>>>> https://github.com/apache/spark/pull/22436
>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>
>>>>>> And
>>>>>> https://github.com/apache/spark/pull/22840
>>>>>> to avoid failure when there is no LICENSE-binary in the spark-2.4.5
>>>>>> source directory.
>>>>>>
>>>>>> I think we need to change make-distribution.sh to make sure that the
>>>>>> LICENSE file is copied over to its corresponding custom build
>>>>>> distribution.
>>>>>> However, I am not ready to do a pull request, so hopefully we can
>>>>>> discuss it here first.
>>>>>>
>>>>>> --
>>>>>> Sincerely
>>>>>> Xiangyu Li
>>>>>>
>>>>>> <yisky...@gmail.com>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Sincerely
>>>> Xiangyu Li
>>>>
>>>> <yisky...@gmail.com>
>>>>
>>>
>>
>> --
>> Sincerely
>> Xiangyu Li
>>
>> <yisky...@gmail.com>
>>
>

--
Sincerely
Xiangyu Li
<yisky...@gmail.com>
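
Since the thread turns on whether the custom build actually lacks the LICENSE file and the licenses directory, a quick check of the extracted distribution before running `python setup.py install` may help narrow things down. The following is only a minimal sketch: `check_dist_license` is a hypothetical helper, not part of Spark or make-distribution.sh, and it assumes the two paths of interest are `LICENSE` and `licenses/` at the top level of the extracted distribution. It reports what is there and does not try to fix anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): report whether an extracted
# distribution directory contains the license files discussed in this thread.
# Assumption: LICENSE and licenses/ live at the top level of the directory.
check_dist_license() {
  dist_dir="$1"
  status=0

  # The top-level LICENSE file.
  if [ -f "$dist_dir/LICENSE" ]; then
    echo "LICENSE: present"
  else
    echo "LICENSE: missing"
    status=1
  fi

  # The licenses directory.
  if [ -d "$dist_dir/licenses" ]; then
    echo "licenses/: present"
  else
    echo "licenses/: missing"
    status=1
  fi

  # Return non-zero if anything was missing.
  return "$status"
}
```

For example, after extracting the tarball, `check_dist_license spark-2.4.5-bin-custom-spark` prints one line per path and returns non-zero if either is missing, so it can be run before the pyspark install step.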