> (redirecting to users as it has nothing to do with Spark project
> development)
>
> Monitor jobs and stages using SparkListener and submit cleanup jobs where
> a condition holds.
>
> Jacek
>
> On 20 Jan 2017 3:57 a.m., "Keith Chapman" <keithgchap...@gmail.com>
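A rough sketch of what Jacek's SparkListener suggestion could look like; the
listener name, the cleanup condition, and the cleanup job itself are made up
for illustration, and an existing SparkContext is assumed:

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Hypothetical listener: fires after every job and submits a cleanup job
// when some application-specific condition holds.
class CleanupListener(sc: SparkContext) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    if (cleanupNeeded(jobEnd.jobId)) {
      // Keep work done from a listener callback lightweight; it runs on the
      // listener-bus thread.
      sc.parallelize(Seq(jobEnd.jobId)).foreach(id => println(s"cleanup after job $id"))
    }
  }
  // Placeholder condition; replace with whatever "a condition holds" means here.
  private def cleanupNeeded(jobId: Int): Boolean = jobId % 10 == 0
}

// Registered once, e.g. right after creating the SparkContext:
// sc.addSparkListener(new CleanupListener(sc))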
Hi,
I'm trying to read a CSV file into a Dataset but keep having compilation
issues. I'm using Spark 2.1, and the following is a small program that
exhibits the issue I'm having. I've searched around but not found a solution
that worked; I've added "import sqlContext.implicits._" as suggested
x = spark.read.format("csv").load("/home/user/data.csv")
>
> x.show()
>
> }
>
> }
>
>
> hope this helps.
>
> Diego
>
> On 22 Mar 2017 7:18 pm, "Keith Chapman" <keithgchap...@gmail.com> wrote:
>
> Hi,
>
> I'm
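For completeness, a minimal sketch of reading a CSV into a typed Dataset; the
Record case class and its fields are assumptions, only the file path comes
from the original mail:

import org.apache.spark.sql.SparkSession

// Hypothetical schema for the CSV; adjust the fields to match the actual file.
case class Record(name: String, age: Int)

val spark = SparkSession.builder().appName("csv-to-dataset").getOrCreate()
import spark.implicits._  // provides the encoder needed by .as[Record]

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/home/user/data.csv")
  .as[Record]

ds.show()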
As Paul said, it really depends on what you want to do with your data;
perhaps writing it to a file would be a better option, but again it depends
on what you want to do with the data you collect.
Regards,
Keith.
http://keith-chapman.com
On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern
You could also enable it with --conf spark.logLineage=true if you do not
want to change any code.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com>
wrote:
Hi Ron,
You can try using the toDebugString method on the RDD; this will print the
RDD lineage.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez wrote:
> Hi,
> Can someone point me to a test case or share sample code
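A small sketch of the two suggestions above; the RDD here is made up, and an
existing SparkContext named sc is assumed:

// Build an RDD with a couple of stages so the lineage is interesting.
val rdd = sc.parallelize(1 to 100)
  .map(x => (x % 10, x))
  .reduceByKey(_ + _)

// Prints the RDD lineage (its recursive dependency chain).
println(rdd.toDebugString)

// Alternatively, log the lineage of every RDD when jobs run, without code changes:
//   spark-submit --conf spark.logLineage=true ...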
> How do I specify windowInterval and slideInterval using a raw SQL string?
>
You could issue a raw SQL query to Spark; there is no particular advantage
or disadvantage to doing so. Spark builds a logical plan from the raw SQL
(or the DSL) and optimizes on that. Ideally you would end up with the same
physical plan, irrespective of whether it was written in raw SQL or the DSL.
Regards,
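As an illustration of the point above, a sketch of the same windowed
aggregation written both ways; the table and column names are invented for
the example, and an existing SparkSession named spark is assumed:

import org.apache.spark.sql.functions.{window, count, col}

// Assume eventsDf has an "event_time" timestamp column.
eventsDf.createOrReplaceTempView("events")

// Raw SQL: window duration and slide interval passed as interval strings.
val bySql = spark.sql(
  """SELECT window(event_time, '10 minutes', '5 minutes') AS w, count(*) AS cnt
    |FROM events
    |GROUP BY window(event_time, '10 minutes', '5 minutes')""".stripMargin)

// Equivalent DSL form; both should produce the same physical plan.
val byDsl = eventsDf
  .groupBy(window(col("event_time"), "10 minutes", "5 minutes").as("w"))
  .agg(count("*").as("cnt"))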
Hi,
I have code that does the following using RDDs,
val outputPartitionCount = 300
val part = new MyOwnPartitioner(outputPartitionCount)
val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
where myRdd is correctly formed as key, value pairs. I am looking to convert
this to use
com> wrote:
> I haven't worked with datasets but would this help https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd?
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman" <keithgchap...@gmail.com> wrote:
>
>> Hi,
>>
>> I
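For reference, a sketch of what a partitioner like MyOwnPartitioner from the
question above might look like; the hashing scheme is an assumption:

import org.apache.spark.Partitioner

class MyOwnPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0)
  override def numPartitions: Int = partitions
  // Simple non-negative hash-modulo placement of keys into partitions.
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % partitions
    if (h < 0) h + partitions else h
  }
}

// Usage with a pair RDD, as in the original code:
//   val finalRdd = myRdd.repartitionAndSortWithinPartitions(new MyOwnPartitioner(300))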
Spark on EMR is configured to use the CMS GC, specifically the following flags:
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled
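For anyone wanting to override these, they can be passed the usual way; the
driver-side line below is an assumption, not something taken from the EMR
defaults:

spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  ...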
Hi Manuel,
You could use the following to add a path to the library search path,
--conf spark.driver.extraLibraryPath=PathToLibFolder
--conf spark.executor.extraLibraryPath=PathToLibFolder
Thanks,
Keith.
Regards,
Keith.
http://keith-chapman.com
On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena
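A hedged example of how those two options might appear on a spark-submit
invocation; the path, class, and jar names are placeholders:

spark-submit \
  --conf spark.driver.extraLibraryPath=/opt/native/lib \
  --conf spark.executor.extraLibraryPath=/opt/native/lib \
  --class com.example.MyApp \
  my-app.jar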
Hi,
I'm benchmarking a Spark application by running it for multiple iterations.
It's a benchmark that's heavy on shuffle, and I run it on a local machine with
a very large heap (~200GB). The system has an SSD. When running for 3 to 4
iterations I get into a situation where I run out of disk space on
difficult to tell without knowing what your
> application code is doing and what kind of transformations/actions it performs.
> From my previous experience, tuning application code to avoid
> unnecessary objects reduces pressure on GC.
>
>
> On Thu, Feb 22, 2018 at 2:13 AM
Hi all,
I'm trying to create a dataframe enforcing a schema so that I can write it
to a parquet file. The schema has timestamps and I get an error with
pyspark. The following is a snippet of code that exhibits the problem,
df = sqlctx.range(1000)
schema = StructType([StructField('a',
Michael
> Sincerely,
> Michael Shtelma
>
>
> On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman <keithgchap...@gmail.com>
> wrote:
> > Can you try setting spark.executor.extraJavaOptions to have
> > -Djava.io.tmpdir=someValue
> >
> > Regards,
> >
and it is working for the Spark driver.
> I would like to make something like this for the executors as well, so
> that the setting will be used on all the nodes where I have executors
> running.
>
> Best,
> Michael
>
>
> On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman <
Hi Michael,
You could either set spark.local.dir through the Spark conf or set the
java.io.tmpdir system property.
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote:
> Hi everybody,
>
> I am running spark job on yarn, and my problem is
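Putting the two suggestions from this thread together, a sketch of a submit
command; the directory paths and jar name are placeholders:

spark-submit \
  --conf spark.local.dir=/mnt/bigdisk/spark-local \
  --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mnt/bigdisk/tmp" \
  --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/mnt/bigdisk/tmp" \
  my-app.jar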
Hi,
I'd like to write a custom Spark strategy that runs after all the existing
Spark strategies are run. Looking through the Spark code it seems like the
custom strategies are prepended to the list of strategies in Spark. Is
there a way I could get it to run last?
Regards,
Keith.
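For context, custom strategies are injected through
SparkSession.experimental.extraStrategies, and Spark does place them ahead of
its built-in ones; a minimal sketch of such a registration follows (the
strategy itself is a do-nothing placeholder):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Placeholder strategy: returning Nil makes the planner fall through to the
// remaining (built-in) strategies.
object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

val spark = SparkSession.builder().getOrCreate()
// Note: extraStrategies are tried before Spark's own strategies, which is the
// behaviour the question above is asking how to avoid.
spark.experimental.extraStrategies = Seq(MyStrategy)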
Hi Naresh,
You could use "--conf spark.driver.extraClassPath=". Note
that the jar will not be shipped to the executors, if its a class that is
needed on the executors as well you should provide "--conf
spark.executor.extraClassPath=". Note that if you do
provide executor extraclasspath the jar
Yes, that is correct; that would cause the computation to happen twice. If you
want the computation to happen only once, you can cache the dataframe and then
call count and write on the cached dataframe.
Regards,
Keith.
http://keith-chapman.com
On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote:
> Hi All,
>
> Just
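A minimal sketch of the suggestion above, assuming df is the DataFrame in
question and the output path is a placeholder:

val cached = df.cache()              // mark for caching; nothing is computed yet
val rows = cached.count()            // first action: computes df and populates the cache
cached.write.parquet("/tmp/output")  // second action: served from the cache, no recompute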
Hi Alex,
Shuffle files in Spark are deleted when the object holding a reference to
the shuffle file on disk goes out of scope (is garbage collected by the
JVM). Could it be the case that you are keeping these objects alive?
Regards,
Keith.
http://keith-chapman.com
On Sun, Jul 21, 2019 at
Hi Supun,
A couple of things with regard to your question.
--executor-cores means the number of worker threads per VM; according to
your requirement this should be set to 8.
*repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations in
Spark are not performant both in terms of