Just checked from where the script is submitted, i.e. the driver side: the
Python environments are different. The Jupyter one runs inside a virtual
environment with Python 2.7.5, while the spark-submit one uses 2.6.6. But the
executors should have the same Python version, right? I tried doing a
spark-submit from the Jupyter shell; it fails because it cannot find Python
2.7, which is not there, and hence throws an error.
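
A minimal sketch (assuming sc is the active SparkContext) to compare the
driver-side and executor-side Python versions; if they differ, the
PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON settings would be the place to look:

import sys

# Driver-side Python version
print(sys.version)
# Executor-side Python version: run a trivial task on one executor
print(sc.parallelize([0], 1).map(lambda _: sys.version).first())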

Here is the UDF which might be taking the time:

import base64
import zlib

def decompress(data):
    # Base64-decode, then decompress; 32 + MAX_WBITS makes zlib
    # auto-detect gzip/zlib headers
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')
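
For reference, this is how the function gets wired up as a UDF (a minimal
sketch; the variable name is illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Register the plain Python function as a string-returning UDF
decompress_udf = udf(decompress, StringType())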


Could this be because of the mismatch between the two Python environments on
the driver side? But doesn't the processing happen on the executor side?




*Regards,*
*Dhrub*

On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <abdealikoth...@gmail.com>
wrote:

> Maybe you can try running it in a Python shell or jupyter-console/IPython
> instead of a spark-submit and check how much time it takes there too.
>
> Compare the env variables to check that no additional env configuration is
> present in either environment.
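>
> For instance (a minimal sketch), you could dump the environment sorted by
> key in each shell and diff the two outputs:
>
> import os
>
> # Print env vars one per line so two runs can be diffed easily
> for k in sorted(os.environ):
>     print("%s=%s" % (k, os.environ[k]))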
>
> Also, is the Python environment the exact same for both? I ask because it
> looks like you're using a UDF, and if the Jupyter Python has (let's say)
> numpy compiled with BLAS, it would be faster than a numpy without it, etc.
> I.e., some library you use may be pure Python in one environment and a
> faster C extension in the other...
>
> What Python libraries are you using in the UDFs? If you don't use UDFs at
> all and use some very simple pure Spark functions, does the time difference
> still exist?
>
> Also, are you using dynamic allocation or some similar Spark config which
> could vary performance between runs because the same resources were not
> utilized on Jupyter / spark-submit?
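>
> E.g., a minimal sketch (assuming an active session named spark) to dump the
> allocation-related conf from each environment for comparison:
>
> # Compare allocation-related settings between the two environments
> for k, v in sorted(spark.sparkContext.getConf().getAll()):
>     if "dynamicAllocation" in k or "executor" in k:
>         print("%s=%s" % (k, v))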
>
>
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <java...@gmail.com> wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same Spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> On Tue, Sept 10, 2019 at 8:05 PM Dhrubajyoti Hati <
>> dhruba.w...@gmail.com> wrote:
>>
>>> No, I checked for that; hence I wrote "brand new" Jupyter notebook. Also,
>>> the times taken by the two are ~30 mins and ~3 hrs, as I am reading ~500
>>> gigs of compressed, base64-encoded text data from a Hive table and
>>> decompressing and decoding it in one of the UDFs. Also, the time compared
>>> is from the Spark UI, not how long the job actually takes after
>>> submission; it's just the running time I am comparing.
>>>
>>> As mentioned earlier, all the Spark conf params match in the two scripts,
>>> and that's why I am puzzled about what is going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the Jupyter notebook
>>>> is already connected to a running Spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
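>>>>
>>>> For example (a minimal sketch): print the application ID from each run;
>>>> if Jupyter is reusing a warm application, the ID will not change between
>>>> runs.
>>>>
>>>> # YARN application ID of the current session
>>>> print(spark.sparkContext.applicationId)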
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a Python script. Here is
>>>>> what the code looks like, mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>     # some code...
>>>>>     ...
>>>>>
>>>>> def fn2(row):
>>>>>     ...
>>>>>     # some operations
>>>>>     ...
>>>>>     return row1
>>>>>
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> # hive table is of size > 500 gigs with ~4500 partitions
>>>>> cdf = spark.read.table("xxxx")
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("coly", "colz")
>>>>>
>>>>> edf = ddf \
>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>     .toDF()
>>>>>
>>>>> # simple way for the performance test on both platforms
>>>>> print edf.count()
>>>>>
>>>>> Now, when I run the same code in a brand new Jupyter notebook, it runs
>>>>> 6x faster than when I run this Python script using spark-submit. The
>>>>> configurations are printed and compared from both platforms, and they
>>>>> are exactly the same. I even tried running this script in a single cell
>>>>> of a Jupyter notebook and still got the same performance. I need to
>>>>> understand if I am missing something in the spark-submit that is causing
>>>>> the issue. I have minimised the script above to reproduce the same
>>>>> behaviour without much code.
>>>>>
>>>>> Both are run in client mode on a YARN-based Spark cluster. The machines
>>>>> from which both are executed are also the same, and both run as the same
>>>>> user.
>>>>>
>>>>> What I found is that the median task time (from the quantile values in
>>>>> the Spark UI) for the run with Jupyter was ~1.3 mins, while for the run
>>>>> with spark-submit it was ~8.5 mins. I am not able to figure out why this
>>>>> is happening.
>>>>>
>>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>>> resolve it?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>
