You wrote:

"
2. On each machine, I can install pyspark by running `python setup.py
install` inside the python directory.

Step 2 would fail because the licenses directory is missing.
"

That step shouldn't depend on the LICENSE file, and the script you showed
does not fail when the file is absent, so I'm not sure what failure you mean.
I'm not sure there's a JIRA here yet.
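
For reference, here is a minimal sketch of the skip-if-absent behavior as
described in this thread. It is a Python paraphrase, not the actual shell code
in make-distribution.sh; the function name `copy_license_if_present`, the
directory names, and the printed message are hypothetical and only for
illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_license_if_present(source_root: Path, dist_root: Path) -> bool:
    """Copy LICENSE-binary to <dist>/LICENSE; skip without failing if absent."""
    license_binary = source_root / "LICENSE-binary"
    if not license_binary.is_file():
        # Assumed behavior: the copy is skipped rather than aborting the build.
        print("Skipping copying LICENSE files")
        return False
    dist_root.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(license_binary, dist_root / "LICENSE")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source_root = Path(tmp) / "spark-src"  # hypothetical source tree
        dist_root = Path(tmp) / "dist"         # hypothetical distribution dir
        source_root.mkdir()
        dist_root.mkdir()
        # A source release ships LICENSE but no LICENSE-binary.
        (source_root / "LICENSE").write_text("source license\n")
        copied = copy_license_if_present(source_root, dist_root)
        print("copied:", copied)
        print("LICENSE in dist:", (dist_root / "LICENSE").exists())
```

Under these assumptions the build itself completes, but the resulting
distribution has no LICENSE file, which is the situation reported below.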

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:

> Hmm, sorry, I don't get which part of my email you were referring to when
> you said "the build fails?".
>
> So I am trying to build a custom Spark binary distribution with, say, a
> different Hadoop version and R support.
>
> I then store this custom build on S3 so that, as I build more machines, I
> can download it directly from S3. But besides spark-submit and so on, I
> also want to install the pyspark Python package on each machine I am
> building.
>
> The lack of the LICENSE file in the custom build would prevent pyspark
> from being successfully built.
>
> Hopefully this answers your question.
>
> The second part of my last email was about building pyspark inside the
> Spark source directory. I will raise an issue on JIRA for that, as it is
> more of a clear-cut problem with the documentation on the website and the
> comments in make-distribution.sh.
>
>
>
> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Hm, the build fails? You can see the copy is just skipped if the file is
>> not present, for this reason.
>> I'm not clear why you need the file for its own sake, for your own
>> internal modification that you don't redistribute.
>>
>>
>>
>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks for the quick response! Yes, what you described about how LICENSE
>>> file should be distributed makes sense.
>>>
>>> The reason I learned about this is that I was trying to build
>>> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
>>> machines, so that:
>>>
>>> 1. These machines can run Spark with the build.
>>> 2. On each machine, I can install pyspark by running `python setup.py
>>> install` inside the python directory.
>>>
>>> Step 2 would fail because the licenses directory is missing.
>>>
>>> Building pyspark out of a binary distribution is a bit unconventional,
>>> but I did this after failing to do what the official doc recommended (
>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>> so taking a step back to describe what I did originally:
>>>
>>> In the spark-2.4.5 src directory, I just did a simple:
>>>
>>> `./build/mvn -DskipTests clean package`
>>>
>>>
>>> And then went to the python directory and did:
>>>
>>>
>>> `python setup.py sdist` followed by `pip install
>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in make-distribution.sh).
>>>
>>>
>>> This ran into "error: package directory `deps/jars` does not exist".
>>>
>>>
>>> However, directly running
>>>
>>>
>>> `sudo python setup.py install`
>>>
>>>
>>> worked.
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> The source distribution has the source LICENSE file. The binary
>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>> process for modifying and building it, including modifying the LICENSE file
>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>> you rather than what you deliver to others. You could get the
>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>> part of your build.
>>>>
>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I downloaded spark-2.4.5 source from
>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>> After extracting it and running:
>>>>>
>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
>>>>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>
>>>>>
>>>>> It creates a Spark binary distribution named:
>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>
>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>> like the one you can download from
>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>
>>>>> However, one big difference between this custom build and the official
>>>>> build is that you do not have a LICENSE file in the custom build. I don't
>>>>> know much about Apache license, but I would suppose a custom build
>>>>> distribution should have one.
>>>>>
>>>>> The file is missing because of the following code in
>>>>> make-distribution.sh:
>>>>> [image: image.png]
>>>>>
>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
>>>>> therefore there will be no LICENSE file in your custom build.
>>>>>
>>>>> I am aware of two pull requests related to this:
>>>>>
>>>>> https://github.com/apache/spark/pull/22436
>>>>> which started to use LICENSE-binary instead of just LICENSE.
>>>>>
>>>>> And
>>>>> https://github.com/apache/spark/pull/22840
>>>>> which avoids a failure when there is no LICENSE-binary in the
>>>>> spark-2.4.5 source directory.
>>>>>
>>>>> I think we need to change make-distribution.sh to make sure that the
>>>>> LICENSE file is copied over to its corresponding custom build
>>>>> distribution. However, I am not ready to do a pull request, so
>>>>> hopefully we can discuss it here first.
>>>>> --
>>>>> Sincerely
>>>>> Xiangyu Li
>>>>>
>>>>> <yisky...@gmail.com>
>>>>>
>>>>
>>>
>>> --
>>> Sincerely
>>> Xiangyu Li
>>>
>>> <yisky...@gmail.com>
>>>
>>
>
> --
> Sincerely
> Xiangyu Li
>
> <yisky...@gmail.com>
>
