Sounds like you have done your homework to compare the two properly. I'm guessing the answer to the following is yes, but in any case: are they both running against the same Spark cluster with the same configuration parameters, especially executor memory and number of workers?
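A quick way to double-check is to dump the effective config from each environment and diff the output. A minimal sketch, assuming a live SparkSession named spark in both the notebook and the submitted script:

    # run this at the start of both jobs, then diff the two listings
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(key, "=", value)

Diffing the two listings will catch any parameter that differs, including defaults injected by the launcher that never show up on the command line.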
On Tue, 10 Sept 2019 at 20:05, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:

> No, I checked for that; hence I wrote "brand new" Jupyter notebook. Also,
> the times taken by the two are ~30 mins and ~3 hrs, as I am reading ~500 GB
> of compressed, base64-encoded text data from a Hive table and decompressing
> and decoding it in one of the UDFs. Also, the time compared is from the
> Spark UI, not how long the job takes after submission; it is just the
> running time I am comparing.
>
> As mentioned earlier, all the Spark conf params match between the two
> scripts, which is why I am puzzled about what is going on.
>
> On Wed, 11 Sep 2019 at 12:44 AM, Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
>> It's not obvious from what you pasted, but perhaps the Jupyter notebook
>> is already connected to a running Spark context, while spark-submit needs
>> to get a new spot in the (YARN?) queue.
>>
>> I would check the cluster job IDs for both to ensure you're getting new
>> cluster tasks for each.
>>
>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am facing weird behaviour while running a Python script. Here is
>>> roughly what the code looks like:
>>>
>>> from pyspark.sql.functions import udf
>>>
>>> def fn1(ip):
>>>     # some code...
>>>     ...
>>>
>>> def fn2(row):
>>>     # some operations
>>>     # ...
>>>     return row1
>>>
>>> udf_fn1 = udf(fn1)
>>> # Hive table is > 500 GB with ~4500 partitions
>>> cdf = spark.read.table("xxxx")
>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>     .drop("colz")
>>>
>>> edf = ddf \
>>>     .filter(ddf.colp == 'some_value') \
>>>     .rdd.map(lambda row: fn2(row)) \
>>>     .toDF()
>>>
>>> # simple way to run the performance test on both platforms
>>> print(edf.count())
>>>
>>> Now when I run the same code in a brand new Jupyter notebook, it runs
>>> 6x faster than when I run this Python script using spark-submit. The
>>> configurations are printed and compared from both platforms and they are
>>> exactly the same. I even tried to run this script in a single cell of a
>>> Jupyter notebook and still got the same performance. I need to understand
>>> whether I am missing something in the spark-submit invocation which is
>>> causing the issue. I tried to minimise the script so as to reproduce the
>>> same issue without much code.
>>>
>>> Both are run in client mode on a YARN-based Spark cluster. The machines
>>> from which both are executed are also the same, and by the same user.
>>>
>>> What I found is that the median task time (from the Spark UI quantiles)
>>> for the run with Jupyter was 1.3 mins, while for the run with spark-submit
>>> it was ~8.5 mins. I am not able to figure out why this is happening.
>>>
>>> Has anyone faced this kind of issue before, or does anyone know how to
>>> resolve it?
>>>
>>> Regards,
>>> Dhrub
>>>
>>
>> --
>>
>> Patrick McCarthy
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
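For reference, a minimal sketch of the job-ID check suggested above, assuming a live SparkSession named spark: printing the YARN application ID at the start of each run shows whether every submission really gets its own application on the cluster.

    # each spark-submit or fresh notebook kernel should print a new application id
    print(spark.sparkContext.applicationId)
    # 'yarn' here confirms both runs target the same resource manager
    print(spark.sparkContext.master)

If the notebook keeps printing the same application ID across runs, it is reusing a long-lived context rather than starting a fresh one like spark-submit does.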