If you say that libraries are not transferred by default and in my case I haven't used any --py-files then just because the driver python is different I have facing 6x speed difference ? I am using client mode to submit the program but the udfs and all are executed in the executors, then why is the difference so much?
I tried the prints For jupyter one the driver prints ../../jupyter-folder/venv and executors print /usr For spark-submit both of them print /usr The cluster is created few years back and used organisation wide. So how python 2.6.6 is installed, i honestly do not know. I copied the whole jupyter from org git repo as it was shared, so i do not know how the venv was created or python for venv was created even. The os is CentOS release 6.9 (Final) *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028* On Wed, Sep 11, 2019 at 8:22 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote: > The driver python may not always be the same as the executor python. > You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON > > The dependent libraries are not transferred by spark in any way unless you > do a --py-files or .addPyFile() > > Could you try this: > *import sys; print(sys.prefix)* > > on the driver, and also run this inside a UDF with: > > *def dummy(a):* > * import sys; raise AssertionError(sys.prefix)* > > and get the traceback exception on the driver ? > This would be the best way to get the exact sys.prefix (python path) for > both the executors and driver. > > Also, could you elaborate on what environment is this ? > Linux? - CentOS/Ubuntu/etc. ? > How was the py 2.6.6 installed ? > How was the py 2.7.5 venv created and how what the base py 2.7.5 installed > ? > > Also, how are you creating the Spark Session in jupyter ? > > > On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com> > wrote: > >> But would it be the case for multiple tasks running on the same worker >> and also both the tasks are running in client mode, so the one true is true >> for both or for neither. As mentioned earlier all the confs are same. I >> have checked and compared each conf. >> >> As Abdeali mentioned it must be because the way libraries are in both >> the environments. Also i verified by running the same script for jupyter >> environment and was able to get the same result using the normal script >> which i was running with spark-submit. >> >> Currently i am searching for the ways the python packages are transferred >> from driver to spark cluster in client mode. Any info on that topic would >> be helpful. >> >> Thanks! >> >> >> >> On Wed, 11 Sep, 2019, 7:06 PM Patrick McCarthy, <pmccar...@dstillery.com> >> wrote: >> >>> Are you running in cluster mode? A large virtualenv zip for the driver >>> sent into the cluster on a slow pipe could account for much of that eight >>> minutes. >>> >>> On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati <dhruba.w...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> I just ran the same script in a shell in jupyter notebook and find the >>>> performance to be similar. So I can confirm this is because the libraries >>>> used jupyter notebook python is different than the spark-submit python this >>>> is happening. >>>> >>>> But now I have a following question. Are the dependent libraries in a >>>> python script also transferred to the worker machines when executing a >>>> python script in spark. Because though the driver python versions are >>>> different, the workers machines will use their same python environment to >>>> run the code. If anyone can explain this part, it would be helpful. >>>> >>>> >>>> >>>> >>>> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028* >>>> >>>> >>>> On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati <dhruba.w...@gmail.com> >>>> wrote: >>>> >>>>> Just checked from where the script is submitted i.e. wrt Driver, the >>>>> python env are different. Jupyter one is running within a the virtual >>>>> environment which is Python 2.7.5 and the spark-submit one uses 2.6.6. But >>>>> the executors have the same python version right? I tried doing a >>>>> spark-submit from jupyter shell, it fails to find python 2.7 which is not >>>>> there hence throws error. >>>>> >>>>> Here is the udf which might take time: >>>>> >>>>> import base64 >>>>> import zlib >>>>> >>>>> def decompress(data): >>>>> >>>>> bytecode = base64.b64decode(data) >>>>> d = zlib.decompressobj(32 + zlib.MAX_WBITS) >>>>> decompressed_data = d.decompress(bytecode ) >>>>> return(decompressed_data.decode('utf-8')) >>>>> >>>>> >>>>> Could this because of the two python environment mismatch from Driver >>>>> side? But the processing >>>>> >>>>> happens in the executor side? >>>>> >>>>> >>>>> >>>>> >>>>> *Regards,Dhrub* >>>>> >>>>> On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari < >>>>> abdealikoth...@gmail.com> wrote: >>>>> >>>>>> Maybe you can try running it in a python shell or >>>>>> jupyter-console/ipython instead of a spark-submit and check how much time >>>>>> it takes too. >>>>>> >>>>>> Compare the env variables to check that no additional env >>>>>> configuration is present in either environment. >>>>>> >>>>>> Also is the python environment for both the exact same? I ask because >>>>>> it looks like you're using a UDF and if the Jupyter python has (let's >>>>>> say) >>>>>> numpy compiled with blas it would be faster than a numpy without it. Etc. >>>>>> I.E. Some library you use may be using pure python and another may be >>>>>> using >>>>>> a faster C extension... >>>>>> >>>>>> What python libraries are you using in the UDFs? It you don't use >>>>>> UDFs at all and use some very simple pure spark functions does the time >>>>>> difference still exist? >>>>>> >>>>>> Also are you using dynamic allocation or some similar spark config >>>>>> which could vary performance between runs because the same resources >>>>>> we're >>>>>> not utilized on Jupyter / spark-submit? >>>>>> >>>>>> >>>>>> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <java...@gmail.com> wrote: >>>>>> >>>>>>> Sounds like you have done your homework to properly compare . I'm >>>>>>> guessing the answer to the following is yes .. but in any case: are >>>>>>> they >>>>>>> both running against the same spark cluster with the same configuration >>>>>>> parameters especially executor memory and number of workers? >>>>>>> >>>>>>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati < >>>>>>> dhruba.w...@gmail.com>: >>>>>>> >>>>>>>> No, i checked for that, hence written "brand new" jupyter notebook. >>>>>>>> Also the time taken by both are 30 mins and ~3hrs as i am reading a 500 >>>>>>>> gigs compressed base64 encoded text data from a hive table and >>>>>>>> decompressing and decoding in one of the udfs. Also the time compared >>>>>>>> is >>>>>>>> from Spark UI not how long the job actually takes after submission. >>>>>>>> Its >>>>>>>> just the running time i am comparing/mentioning. >>>>>>>> >>>>>>>> As mentioned earlier, all the spark conf params even match in two >>>>>>>> scripts and that's why i am puzzled what going on. >>>>>>>> >>>>>>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, < >>>>>>>> pmccar...@dstillery.com> wrote: >>>>>>>> >>>>>>>>> It's not obvious from what you pasted, but perhaps the juypter >>>>>>>>> notebook already is connected to a running spark context, while >>>>>>>>> spark-submit needs to get a new spot in the (YARN?) queue. >>>>>>>>> >>>>>>>>> I would check the cluster job IDs for both to ensure you're >>>>>>>>> getting new cluster tasks for each. >>>>>>>>> >>>>>>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati < >>>>>>>>> dhruba.w...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I am facing a weird behaviour while running a python script. Here >>>>>>>>>> is what the code looks like mostly: >>>>>>>>>> >>>>>>>>>> def fn1(ip): >>>>>>>>>> some code... >>>>>>>>>> ... >>>>>>>>>> >>>>>>>>>> def fn2(row): >>>>>>>>>> ... >>>>>>>>>> some operations >>>>>>>>>> ... >>>>>>>>>> return row1 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> udf_fn1 = udf(fn1) >>>>>>>>>> cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs >>>>>>>>>> with ~4500 partitions >>>>>>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ >>>>>>>>>> .drop("colz") \ >>>>>>>>>> .withColumnRenamed("colz", "coly") >>>>>>>>>> >>>>>>>>>> edf = ddf \ >>>>>>>>>> .filter(ddf.colp == 'some_value') \ >>>>>>>>>> .rdd.map(lambda row: fn2(row)) \ >>>>>>>>>> .toDF() >>>>>>>>>> >>>>>>>>>> print edf.count() // simple way for the performance test in both >>>>>>>>>> platforms >>>>>>>>>> >>>>>>>>>> Now when I run the same code in a brand new jupyter notebook it >>>>>>>>>> runs 6x faster than when I run this python script using >>>>>>>>>> spark-submit. The >>>>>>>>>> configurations are printed and compared from both the platforms and >>>>>>>>>> they >>>>>>>>>> are exact same. I even tried to run this script in a single cell of >>>>>>>>>> jupyter >>>>>>>>>> notebook and still have the same performance. I need to understand >>>>>>>>>> if I am >>>>>>>>>> missing something in the spark-submit which is causing the issue. I >>>>>>>>>> tried >>>>>>>>>> to minimise the script to reproduce the same error without much code. >>>>>>>>>> >>>>>>>>>> Both are run in client mode on a yarn based spark cluster. The >>>>>>>>>> machines from which both are executed are also the same and from >>>>>>>>>> same user. >>>>>>>>>> >>>>>>>>>> What i found is the the quantile values for median for one ran >>>>>>>>>> with jupyter was 1.3 mins and one ran with spark-submit was ~8.5 >>>>>>>>>> mins. I >>>>>>>>>> am not able to figure out why this is happening. >>>>>>>>>> >>>>>>>>>> Any one faced this kind of issue before or know how to resolve >>>>>>>>>> this? >>>>>>>>>> >>>>>>>>>> *Regards,* >>>>>>>>>> *Dhrub* >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> >>>>>>>>> *Patrick McCarthy * >>>>>>>>> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering >>>>>>>>> >>>>>>>>> Dstillery >>>>>>>>> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016 >>>>>>>>> >>>>>>>> >>> >>> -- >>> >>> >>> *Patrick McCarthy * >>> >>> Senior Data Scientist, Machine Learning Engineering >>> >>> Dstillery >>> >>> 470 Park Ave South, 17th Floor, NYC 10016 >>> >>