To reproduce this, I ran:

curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
tar xzf spark-2.4.5.tgz
cd spark-2.4.5
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
mv spark-2.4.5-bin-custom-spark.tgz ../
cd ..
tar xzf spark-2.4.5-bin-custom-spark.tgz
cd spark-2.4.5-bin-custom-spark/python/
sudo python setup.py install

And here is the output:
[image: image.png]
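
For anyone who wants to check their own build before installing, here is a minimal sketch of a pre-install check. It assumes, as discussed in this thread, that the top-level LICENSE file and the licenses directory are the pieces the custom build lacks; `check_license_files` is a hypothetical helper name, not something shipped with Spark.

```shell
# Hypothetical helper (not part of Spark): report which license artifacts
# are absent from an unpacked distribution directory.
# Returns 0 only if both the LICENSE file and the licenses/ directory exist.
check_license_files() {
  dist_dir="$1"
  missing=0
  if [ ! -f "$dist_dir/LICENSE" ]; then
    echo "missing: $dist_dir/LICENSE"
    missing=1
  fi
  if [ ! -d "$dist_dir/licenses" ]; then
    echo "missing: $dist_dir/licenses/"
    missing=1
  fi
  return "$missing"
}
```

Running `check_license_files spark-2.4.5-bin-custom-spark` from the directory where the tarball was extracted prints whichever of the two is absent and returns a non-zero status in that case.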


On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:

> You wrote:
>
> "
> 2. On each machine, I can install pyspark by running `python setup.py
> install` inside the python directory.
>
> Step 2 would fail because of missing the licenses directory.
> "
>
> That shouldn't depend on the license file, and the script you showed does
> not fail when not present, so I am wondering what this means.
> I'm not sure there's a JIRA here yet.
>
> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>
>> Hmm, sorry, I don't follow which part of my email you were referring to
>> when you said "the build fails?".
>>
>> So I am trying to build a custom spark binary distribution with, say,
>> different Hadoop versions and R support.
>>
>> Then I stored this custom build on S3, so as I am building more machines
>> I can just directly download this custom build from S3. But besides
>> spark-submit and what not, I also wanted to install the pyspark python
>> package to the machine I am building.
>>
>> The lack of the LICENSE file in the custom build would prevent pyspark
>> from being successfully built.
>>
>> Hopefully this answers your question.
>>
>> The second part of my last email was about building pyspark inside the Spark
>> source directory. I will raise an issue on Jira for that, as it is more of
>> a clear-cut problem with the documentation on the website and the comments
>> in make-distribution.sh.
>>
>>
>>
>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Hm, the build fails? You can see this is just skipped if not present,
>>> for this reason.
>>> I'm not clear why you need the file for its own sake, for your own
>>> internal modification that you don't redistribute.
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks for the quick response! Yes, what you described about how
>>>> LICENSE file should be distributed makes sense.
>>>>
>>>> The reason I learned about this is that I was trying to build
>>>> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
>>>> machines, so that:
>>>>
>>>> 1. These machines can run Spark with the build.
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>>
>>>> Building pyspark out of a binary distribution is a bit unconventional,
>>>> but I did this after failing to do what the official doc recommended (
>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>> so taking a step back to describe what I did originally:
>>>>
>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>
>>>> `./build/mvn -DskipTests clean package`
>>>>
>>>>
>>>> And then went to the python directory and did:
>>>>
>>>>
>>>> `python setup.py sdist` followed by `pip install
>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).
>>>>
>>>>
>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>
>>>>
>>>> However, directly running
>>>>
>>>>
>>>> `sudo python setup.py install`
>>>>
>>>>
>>>> worked.
>>>>
>>>>
>>>>
>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> The source distribution has the source LICENSE file. The binary
>>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>>> process for modifying and building it, including modifying the LICENSE file
>>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>>> you rather than what you deliver to others. You could get the
>>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>>> part of your build.
>>>>>
>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I downloaded spark-2.4.5 source from
>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>> After extracting it and running:
>>>>>>
>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
>>>>>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>
>>>>>>
>>>>>> It creates a Spark binary distribution named:
>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>
>>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>>> like the one you can download from
>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>
>>>>>> However, one big difference between this custom build and the
>>>>>> official build is that you do not have a LICENSE file in the custom build.
>>>>>> I don't know much about Apache license, but I would suppose a custom
>>>>>> build distribution should have one.
>>>>>>
>>>>>> The file is missing because of the following code in
>>>>>> make-distribution.sh:
>>>>>> [image: image.png]
>>>>>>
>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
>>>>>> therefore there will be no LICENSE file in your custom build.
>>>>>>
>>>>>> I am aware of two pull requests related to this:
>>>>>>
>>>>>> https://github.com/apache/spark/pull/22436
>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>
>>>>>> And
>>>>>> https://github.com/apache/spark/pull/22840
>>>>>> avoids a failure when there is no LICENSE-binary in the spark-2.4.5
>>>>>> source directory.
>>>>>>
>>>>>> I think we need to change make-distribution.sh to make sure that the
>>>>>> LICENSE file is copied over to its corresponding custom build distribution.
>>>>>> However, I am not ready to do a pull request, so hopefully we can discuss
>>>>>> it here first.
>>>>>> --
>>>>>> Sincerely
>>>>>> Xiangyu Li
>>>>>>
>>>>>> <yisky...@gmail.com>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Sincerely
>>>> Xiangyu Li
>>>>
>>>> <yisky...@gmail.com>
>>>>
>>>
>>
>> --
>> Sincerely
>> Xiangyu Li
>>
>> <yisky...@gmail.com>
>>
>

-- 
Sincerely
Xiangyu Li

<yisky...@gmail.com>
