I see, that makes more sense, though I have limited knowledge of how the
pip packaging works. You don't need pip packaging, do you? Just pyspark
itself, right? Omit --pip?

On Fri, May 1, 2020 at 3:32 PM Xiangyu Li <yisky...@gmail.com> wrote:

> make-distribution.sh with --pip would run a `python setup.py sdist` within
> that make-distribution.sh script.
> I also tested `make-distribution.sh` without --pip, and the same error
> happens.
>
> Correct me if I'm wrong, but the pyspark binary has always been built
> successfully; it is the pyspark pip package that is failing.
>
> On Fri, May 1, 2020 at 4:23 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Hm, others may have to chime in here. Either that's not how you create
>> the pyspark binary from the source release (make-distribution.sh doesn't do
>> that?) or there is a small but important issue here, that the source
>> release doesn't contain one thing that the binary release script expects,
>> which is LICENSE-binary et al. If it's the latter, we could move around the
>> LICENSE bits in the source tree so that both are "source" files included in
>> the source release, so you can make the binary release with it, but, I'd
>> probably say it's easier/better to simply skip adding the license in this
>> path (if it's supposed to work this way at all) as the use case, a custom
>> derived work, doesn't need the *ASF's* license statement.
>>
>>
>> On Fri, May 1, 2020 at 3:13 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>
>>> To reproduce this, I just did
>>>
>>> curl -O http://www.trieuvan.com/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>> tar xzf spark-2.4.5.tgz
>>> cd spark-2.4.5
>>> ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7
>>> mv spark-2.4.5-bin-custom-spark.tgz ../
>>> cd ..
>>> tar xzf spark-2.4.5-bin-custom-spark.tgz
>>> cd spark-2.4.5-bin-custom-spark/python/
>>> sudo python setup.py install
>>>
>>> And here is the output:
>>> [image: image.png]
>>>
>>>
>>> On Fri, May 1, 2020 at 2:48 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> You wrote:
>>>>
>>>> "
>>>> 2. On each machine, I can install pyspark by running `python setup.py
>>>> install` inside the python directory.
>>>>
>>>> Step 2 would fail because of missing the licenses directory.
>>>> "
>>>>
>>>> That shouldn't depend on the license file, and the script you showed
>>>> does not fail when not present, so I am wondering what this means.
>>>> I'm not sure there's a JIRA here yet.
>>>>
>>>> On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>
>>>>> Hmm, sorry, I don't get which part of my email you were referring to
>>>>> when you said "the build fails?".
>>>>>
>>>>> So I am trying to build a custom spark binary distribution with, say,
>>>>> different Hadoop versions and R support.
>>>>>
>>>>> Then I stored this custom build on S3, so as I am building more
>>>>> machines I can just directly download this custom build from S3. But
>>>>> besides spark-submit and what not, I also wanted to install the pyspark
>>>>> python package to the machine I am building.
>>>>>
>>>>> The lack of the LICENSE file in the custom build would prevent pyspark
>>>>> from being successfully built.
>>>>>
>>>>> Hopefully this answers your question.
>>>>>
>>>>> The second part of my last email was about building pyspark inside
>>>>> the spark source directory. I will raise an issue on Jira for that, as it
>>>>> is more of a clear-cut problem with the documentation on the website and
>>>>> the comments in make-distribution.sh.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> Hm, the build fails? You can see this is just skipped if not present,
>>>>>> for this reason.
>>>>>> I'm not clear why you need the file for its own sake, for your own
>>>>>> internal modification that you don't redistribute.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thanks for the quick response! Yes, what you described about how
>>>>>>> the LICENSE file should be distributed makes sense.
>>>>>>>
>>>>>>> The reason I learned about this is that I was trying to build
>>>>>>> spark-2.4.5-bin-custom.tgz and then distribute this build to multiple
>>>>>>> machines, so that:
>>>>>>>
>>>>>>> 1. These machines can run spark with the build.
>>>>>>> 2. On each machine, I can install pyspark by running `python
>>>>>>> setup.py install` inside the python directory.
>>>>>>>
>>>>>>> Step 2 would fail because of missing the licenses directory.
>>>>>>>
>>>>>>> Building pyspark out of a binary distribution is a bit
>>>>>>> unconventional, but I did this after failing to do what the official doc
>>>>>>> recommended (
>>>>>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>>>>>> so taking a step back to describe what I did originally:
>>>>>>>
>>>>>>> In the spark-2.4.5 src directory, I just did a simple:
>>>>>>>
>>>>>>> `./build/mvn -DskipTests clean package`
>>>>>>>
>>>>>>>
>>>>>>> And then went to the python directory and did:
>>>>>>>
>>>>>>>
>>>>>>> `python setup.py sdist` followed by `pip install
>>>>>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in
>>>>>>> make-distribution.sh).
>>>>>>>
>>>>>>>
>>>>>>> This ran into "error: package directory `deps/jars` does not exist".
>>>>>>>
>>>>>>>
>>>>>>> However, directly running
>>>>>>>
>>>>>>>
>>>>>>> `sudo python setup.py install`
>>>>>>>
>>>>>>>
>>>>>>> worked.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The source distribution has the source LICENSE file. The binary
>>>>>>>> distribution has the LICENSE-binary license file. The source release
>>>>>>>> isn't supposed to have LICENSE-binary as it would not be accurate for
>>>>>>>> that release; LICENSE is. If you're redistributing a build, you'll have
>>>>>>>> your own process for modifying and building it, including modifying the
>>>>>>>> LICENSE file as appropriate; these LICENSE files represent what the
>>>>>>>> project delivers to you rather than what you deliver to others. You
>>>>>>>> could get the LICENSE-binary file from the right hash commit from git,
>>>>>>>> if desired, as part of your build.
>>>>>>>>
>>>>>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I downloaded spark-2.4.5 source from
>>>>>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>>>>>> After extracting it and running:
>>>>>>>>>
>>>>>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It creates a Spark binary distribution named:
>>>>>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>>>>>
>>>>>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>>>>>> like the one you can download from
>>>>>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>>>>>
>>>>>>>>> However, one big difference between this custom build and the
>>>>>>>>> official build is that you do not have a LICENSE file in the custom
>>>>>>>>> build. I don't know much about the Apache license, but I would suppose
>>>>>>>>> a custom build distribution should have one.
>>>>>>>>>
>>>>>>>>> The file is missing because of the following code in
>>>>>>>>> make-distribution.sh:
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz
>>>>>>>>> file, therefore there will be no LICENSE file in your custom build.
>>>>>>>>>
>>>>>>>>> I am aware of two pull requests related to this:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/22436
>>>>>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>>>>>
>>>>>>>>> And
>>>>>>>>> https://github.com/apache/spark/pull/22840
>>>>>>>>> To avoid failure when there is no LICENSE-binary in spark-2.4.5
>>>>>>>>> source directory.
>>>>>>>>>
>>>>>>>>> I think we need to change make-distribution.sh to make sure that
>>>>>>>>> the LICENSE file is copied over to its corresponding custom build
>>>>>>>>> distribution. However, I am not ready to do a pull request, so
>>>>>>>>> hopefully we can discuss it here first.
>>>>>>>>> --
>>>>>>>>> Sincerely
>>>>>>>>> Xiangyu Li
>>>>>>>>>
>>>>>>>>> <yisky...@gmail.com>
> --
> Sincerely
> Xiangyu Li
>
> <yisky...@gmail.com>
>
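
For reference, the behavior discussed in this thread (the license step is skipped when LICENSE-binary is absent from the source tree) and the change proposed in the original message (always end up with a LICENSE file in the custom build) can be sketched in shell. This is only a minimal sketch under stated assumptions, not the actual code in make-distribution.sh: the function name `copy_license_files` and the fallback to the source LICENSE are hypothetical, and only the file names LICENSE and LICENSE-binary come from the thread.

```shell
# Hypothetical helper, NOT part of Spark's make-distribution.sh.
# Copies a license file from a source tree into a distribution directory.
#   $1 = source tree (for example, the extracted spark-2.4.5 directory)
#   $2 = distribution directory being assembled
copy_license_files() {
  src_dir="$1"
  dist_dir="$2"
  mkdir -p "$dist_dir"
  if [ -e "$src_dir/LICENSE-binary" ]; then
    # The binary license is present (for example, in a git checkout): use it.
    cp "$src_dir/LICENSE-binary" "$dist_dir/LICENSE"
    echo "copied LICENSE-binary"
  elif [ -e "$src_dir/LICENSE" ]; then
    # Assumed fallback for illustration: use the source LICENSE so the
    # distribution is not left without any license file.
    cp "$src_dir/LICENSE" "$dist_dir/LICENSE"
    echo "copied LICENSE"
  else
    # Neither file exists: skip the step instead of failing the build.
    echo "no license file found; skipping"
  fi
}
```

As pointed out in the thread, the source LICENSE does not describe what a binary build bundles, so the fallback branch only keeps the file from being absent; anyone redistributing a build would still need to supply an appropriate license file of their own.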
