Re:

2017-01-20 Thread Keith Chapman
> (redirecting to users as it has nothing to do with Spark project > development) > > Monitor jobs and stages using SparkListener and submit cleanup jobs where > a condition holds. > > Jacek > > On 20 Jan 2017 3:57 a.m., "Keith Chapman" <keithgchap...@gmail.com>
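
A minimal sketch of the SparkListener approach suggested above; the listener class, the cleanup condition, and where the cleanup work goes are assumptions, not part of the thread.

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    // Hypothetical listener: watches for finished jobs and triggers
    // application-specific cleanup when some condition holds.
    class CleanupListener(sc: SparkContext) extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        if (shouldCleanUp(jobEnd.jobId)) {
          // submit whatever cleanup job the application needs here
        }
      }
      // placeholder condition; a real application would track its own state
      private def shouldCleanUp(jobId: Int): Boolean = false
    }

    // Registered on the SparkContext, e.g.:
    // sc.addSparkListener(new CleanupListener(sc))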

Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
Hi, I'm trying to read a CSV file into a Dataset but keep having compilation issues. I'm using Spark 2.1, and the following is a small program that exhibits the issue. I've searched around but not found a solution that works; I've added "import sqlContext.implicits._" as suggested

Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
x = spark.read.format("csv").load("/home/user/data.csv") > > x.show() > > } > > } > > > hope this helps. > > Diego > > On 22 Mar 2017 7:18 pm, "Keith Chapman" <keithgchap...@gmail.com> wrote: > > Hi, > > I'm
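
A rough sketch of reading a CSV into a typed Dataset on Spark 2.x; the case class, column names, and options are assumptions (the file path is the one from the reply). On 2.1 the case class typically needs to be defined at the top level so the encoder can be found.

    import org.apache.spark.sql.SparkSession

    // Hypothetical schema for the CSV columns.
    case class Record(id: Long, name: String)

    val spark = SparkSession.builder().appName("csv-to-dataset").getOrCreate()
    import spark.implicits._  // brings the encoders needed by .as[Record] into scope

    val ds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/home/user/data.csv")
      .as[Record]

    ds.show()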

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Keith Chapman
As Paul said, it really depends on what you want to do with your data; writing it to a file may be a better option, but again it depends on what you plan to do with the data you collect. Regards, Keith. http://keith-chapman.com On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern
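
For illustration, a hedged sketch of the write-to-file alternative mentioned above; df, the format, and the output path are placeholders.

    // Instead of pulling every row to the driver with collectAsList(),
    // write the DataFrame out and let the executors work in parallel.
    df.write.mode("overwrite").parquet("/tmp/output")

    // If rows really must be iterated on the driver, toLocalIterator()
    // fetches one partition at a time rather than everything at once.
    val rows = df.toLocalIterator()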

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
You could also enable it with --conf spark.logLineage=true if you do not want to change any code. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Hi Ron, > > You can try using the toDebugString method
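
Both options from this thread, sketched for an assumed RDD named myRdd:

    // Prints the RDD lineage (the chain of parent RDDs and shuffle boundaries).
    println(myRdd.toDebugString)

    // Alternatively, submitting with --conf spark.logLineage=true logs the
    // lineage of every RDD when an action runs, with no code changes.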

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
Hi Ron, You can try using the toDebugString method on the RDD; this will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez wrote: > Hi, > Can someone point me to a test case or share sample code

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
> How do I specify windowInterval and slideInterval using a raw sql string? > > On Tue, Jul 25, 2017 at 8:52 AM, Keith Chapman <keithgchap...@gmail.com> > wrote: > >> You could issue a raw sql query to spark, there is no particular >> advantage or disadvantage of doi

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
You could issue a raw sql query to Spark; there is no particular advantage or disadvantage in doing so. Spark builds a logical plan from the raw sql (or the DSL) and optimizes that. Ideally you would end up with the same physical plan, irrespective of whether it is written in raw sql or the DSL. Regards,
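
A small sketch of checking this by comparing physical plans; the view name, column, and DataFrame df are assumptions.

    import org.apache.spark.sql.functions.count

    // The same query expressed as raw SQL and as the DataFrame DSL;
    // explain() prints the physical plan so the two can be compared.
    df.createOrReplaceTempView("events")

    val viaSql = spark.sql("SELECT user, count(*) AS c FROM events GROUP BY user")
    val viaDsl = df.groupBy("user").agg(count("*").as("c"))

    viaSql.explain()
    viaDsl.explain()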

Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
Hi, I have code that does the following using RDDs, val outputPartitionCount = 300 val part = new MyOwnPartitioner(outputPartitionCount) val finalRdd = myRdd.repartitionAndSortWithinPartitions(part) where myRdd is correctly formed as key, value pairs. I am looking to convert this to use
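
The RDD-side code from the question, reconstructed as a self-contained sketch; the partitioner body is a hypothetical stand-in for MyOwnPartitioner, and myRdd is assumed to be a (key, value) pair RDD.

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: hash-partitions keys into a fixed number
    // of output partitions.
    class MyOwnPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h
      }
    }

    val outputPartitionCount = 300
    val part = new MyOwnPartitioner(outputPartitionCount)
    // myRdd is assumed, e.g. val myRdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)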

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
com> wrote: > I haven't worked with datasets but would this help https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd? > > On Jun 23, 2017 5:43 PM, "Keith Chapman" <keithgchap...@gmail.com> wrote: > >> Hi, >> >> I

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
513667/how-to-create-a-spark-dataset-from-an-rdd? >> >> On Jun 23, 2017 5:43 PM, "Keith Chapman" <keithgchap...@gmail.com> wrote: > >>> Hi, >>> >>> I have code that does the following using RDDs, >>> >>> val outputPartit

Re: GC- Yarn vs Standalone K8

2018-06-11 Thread Keith Chapman
Spark on EMR is configured to use the CMS GC, specifically the following flags: spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled

Re: update LD_LIBRARY_PATH when running apache job in a YARN cluster

2018-01-17 Thread Keith Chapman
Hi Manuel, You could use the following to add a path to the library search path: --conf spark.driver.extraLibraryPath=PathToLibFolder --conf spark.executor.extraLibraryPath=PathToLibFolder Regards, Keith. http://keith-chapman.com On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena
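
For example, at submit time (the jar and library path are placeholders):

    spark-submit \
      --conf spark.driver.extraLibraryPath=/path/to/native/libs \
      --conf spark.executor.extraLibraryPath=/path/to/native/libs \
      my-app.jar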

Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
Hi, I'm benchmarking a Spark application by running it for multiple iterations; it's a benchmark that's heavy on shuffle, and I run it on a local machine with a very large heap (~200GB). The system has an SSD. When running for 3 to 4 iterations I get into a situation where I run out of disk space on

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
difficult to tell without knowing what is your > application code doing, what kind of transformation/actions performing. > From my previous experience tuning application code which avoids > unnecessary objects reduce pressure on GC. > > > On Thu, Feb 22, 2018 at 2:13 AM

Pyspark error when converting string to timestamp in map function

2018-08-17 Thread Keith Chapman
Hi all, I'm trying to create a dataframe enforcing a schema so that I can write it to a parquet file. The schema has timestamps and I get an error with pyspark. The following is a snippet of code that exhibits the problem, df = sqlctx.range(1000) schema = StructType([StructField('a',

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Keith Chapman
Michael > Sincerely, > Michael Shtelma > > > On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman <keithgchap...@gmail.com> > wrote: > > Can you try setting spark.executor.extraJavaOptions to have > > -Djava.io.tmpdir=someValue > > > > Regards, >

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
and it is working for spark driver. > I would like to make something like this for the executors as well, so > that the setting will be used on all the nodes, where I have executors > running. > > Best, > Michael > > > On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman <

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Hi Michael, You could either set spark.local.dir through the spark conf or the java.io.tmpdir system property. Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote: > Hi everybody, > > I am running a spark job on yarn, and my problem is
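
A sketch of the two options from this thread at submit time; the paths and jar are placeholders, and on YARN the node manager's configured local dirs can still take precedence over spark.local.dir.

    spark-submit \
      --conf spark.local.dir=/large/disk/spark-tmp \
      --conf spark.driver.extraJavaOptions=-Djava.io.tmpdir=/large/disk/tmp \
      --conf spark.executor.extraJavaOptions=-Djava.io.tmpdir=/large/disk/tmp \
      my-app.jar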

Can I get my custom spark strategy to run last?

2018-03-01 Thread Keith Chapman
Hi, I'd like to write a custom Spark strategy that runs after all the existing Spark strategies are run. Looking through the Spark code it seems like the custom strategies are prepended to the list of strategies in Spark. Is there a way I could get it to run last? Regards, Keith.
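
For reference, a minimal sketch of how a custom strategy is registered; the strategy here is a no-op placeholder, and as the question notes these extra strategies are tried before the built-in ones.

    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Placeholder strategy: returning Nil means "does not apply", so the
    // planner falls through to the next strategy in the list.
    object MyStrategy extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
    }

    val spark = SparkSession.builder().appName("custom-strategy").getOrCreate()
    spark.experimental.extraStrategies = Seq(MyStrategy)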

Re: Override jars in spark submit

2019-06-19 Thread Keith Chapman
Hi Naresh, You could use "--conf spark.driver.extraClassPath=". Note that the jar will not be shipped to the executors; if it's a class that is needed on the executors as well, you should provide "--conf spark.executor.extraClassPath=". Note that if you do provide the executor extraClassPath, the jar
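
For example (paths and the application jar are placeholders); the overriding jar must already exist at that path on the driver and executor nodes, since extraClassPath does not ship it.

    spark-submit \
      --conf spark.driver.extraClassPath=/path/to/override.jar \
      --conf spark.executor.extraClassPath=/path/to/override.jar \
      my-app.jar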

Re: [pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Keith Chapman
Yes, that is correct; that would cause the computation to happen twice. If you want the computation to happen only once, you can cache the dataframe and call count and write on the cached dataframe. Regards, Keith. http://keith-chapman.com On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote: > Hi All, > > Just
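
A sketch of the cache-then-reuse pattern described above; df and the output path are placeholders.

    // cache() marks the dataframe for caching; count() materializes it,
    // and the subsequent write reuses the cached data instead of recomputing.
    df.cache()
    val n = df.count()
    df.write.mode("overwrite").parquet("/tmp/output")
    df.unpersist()  // release the cached blocks once done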

Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Keith Chapman
Hi Alex, Shuffle files in spark are deleted when the object holding a reference to the shuffle file on disk goes out of scope (is garbage collected by the JVM). Could it be the case that you are keeping these objects alive? Regards, Keith. http://keith-chapman.com On Sun, Jul 21, 2019 at

Re: Sorting tuples with byte key and byte value

2019-07-15 Thread Keith Chapman
Hi Supun, A couple of things with regard to your question. --executor-cores means the number of worker threads per VM; according to your requirement this should be set to 8. *repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations in Spark are not performant both in terms of