Re: Running spark with javaagent configuration

2019-05-15 Thread Akshay Bhardwaj
Hi Anton,

Do you have the option of storing the JAR file on HDFS, which can be
accessed via spark in your cluster?

Akshay Bhardwaj
+91-97111-33849


On Thu, May 16, 2019 at 12:04 AM Oleg Mazurov 
wrote:

> You can see what Uber JVM does at
> https://github.com/uber-common/jvm-profiler :
>
> --conf spark.jars=hdfs://hdfs_url/lib/jvm-profiler-1.0.0.jar
> --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar
>
>
> -- Oleg
>
> On Wed, May 15, 2019 at 6:28 AM Anton Puzanov 
> wrote:
>
>> Hi everyone,
>>
>> I want to run my Spark application with a javaagent; specifically, I want
>> to use New Relic with my application.
>>
>> When I run spark-submit I must pass --conf
>> "spark.driver.extraJavaOptions=-javaagent="
>>
>> My problem is that I can't specify the full path as I run in cluster mode
>> and I don't know the exact host which will serve as the driver.
>> *Important:* I know I can upload the jar to every node, but it seems
>> like a fragile solution as machines will be added and removed later.
>>
>> I have tried specifying the jar with --files but couldn't make it work,
>> as I didn't know where exactly I should point the javaagent.
>>
>> Any suggestions on the best practice to handle this kind of problem, and
>> what can I do?
>>
>> Thanks a lot,
>> Anton
>>
>


Re: Spark job gets hung on cloudera cluster

2019-05-15 Thread Akshay Bhardwaj
Hi Rishi,

Are you running Spark on YARN or Spark's standalone master-slave cluster?

Akshay Bhardwaj
+91-97111-33849


On Thu, May 16, 2019 at 7:15 AM Rishi Shah  wrote:

> Anyone, please?
>
> On Tue, May 14, 2019 at 11:51 PM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> At times when there's a data node failure, a running Spark job doesn't fail
>> - it gets stuck and doesn't return. Is there any setting that can help here?
>> I would ideally like the job to be terminated, or the executors running on
>> those data nodes to fail...
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


how to get spark-sql lineage

2019-05-15 Thread lk_spark
Hi all:
When I use Spark to run some SQL to do ETL, how can I get lineage
info? I found that CDH Spark has some config about lineage:
spark.lineage.enabled=true
spark.lineage.log.dir=/var/log/spark2/lineage
Do they also work for Apache Spark?

2019-05-16


lk_spark 

Re: Databricks - number of executors, shuffle.partitions etc

2019-05-15 Thread ayan guha
Well, it's a Databricks question, so it is better asked in their forum.

You can set cluster-level params when you create a new cluster, or add them
later: go to the cluster page, open a cluster, expand the additional config
section, and add your params there as key-value pairs separated by a space.
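
As a rough illustration (the exact UI labels are from memory and may differ by
Databricks version), cluster-level entries in that Spark config box are plain
key-value pairs, e.g.:

    spark.sql.shuffle.partitions 64
    spark.executor.memory 8g

In a notebook, spark.conf.set generally only affects runtime settings (mostly
spark.sql.*) for the current session, for example:

    spark.conf.set("spark.sql.shuffle.partitions", "64")

Cluster-sizing properties such as executor memory or worker count are fixed
when the cluster starts, which would explain why spark.conf.set appears to be
ignored for them.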

On Thu, 16 May 2019 at 11:46 am, Rishi Shah 
wrote:

> Hi All,
>
> Any idea?
>
> Thanks,
> -Rishi
>
> On Tue, May 14, 2019 at 11:52 PM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> How can we set Spark conf parameters in a Databricks notebook? My cluster
>> doesn't take into account any spark.conf.set properties... it creates 8
>> worker nodes (i.e., executors) but doesn't honor the supplied conf
>> parameters. Any idea?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>
-- 
Best Regards,
Ayan Guha


Re: Databricks - number of executors, shuffle.partitions etc

2019-05-15 Thread Rishi Shah
Hi All,

Any idea?

Thanks,
-Rishi

On Tue, May 14, 2019 at 11:52 PM Rishi Shah 
wrote:

> Hi All,
>
> How can we set Spark conf parameters in a Databricks notebook? My cluster
> doesn't take into account any spark.conf.set properties... it creates 8
> worker nodes (i.e., executors) but doesn't honor the supplied conf
> parameters. Any idea?
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: Spark job gets hung on cloudera cluster

2019-05-15 Thread Rishi Shah
Anyone, please?

On Tue, May 14, 2019 at 11:51 PM Rishi Shah 
wrote:

> Hi All,
>
> At times when there's a data node failure, a running Spark job doesn't fail
> - it gets stuck and doesn't return. Is there any setting that can help here?
> I would ideally like the job to be terminated, or the executors running on
> those data nodes to fail...
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-15 Thread Russell Spitzer
DataFrames describe the calculation to be done, but the underlying
implementation is an "Incremental Query". That is, the DataFrame code is
executed repeatedly, with Catalyst adjusting the final execution plan on
each run. Some parts of the plan refer to static pieces of data; others
refer to data which is pulled in on each iteration. None of this changes
the DataFrame objects themselves.
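
A minimal Structured Streaming sketch of that point (the socket source, host,
and port below are illustrative assumptions): the DataFrames are immutable
descriptions of the computation, and it is the started query that gets
re-planned and executed incrementally as new data arrives.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("incremental-query-sketch").getOrCreate()

    // Unbounded source: this is a plan node, not a growing in-memory table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Another immutable DataFrame describing an aggregation over the stream.
    val counts = lines.groupBy("value").count()

    // The running query (not the DataFrames) holds the mutable streaming state.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()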




On Wed, May 15, 2019 at 1:34 PM Sheel Pancholi  wrote:

> Hi
> Structured Streaming treats a stream as an unbounded table in the form of
> a DataFrame. Continuously flowing data from the stream keeps getting added
> to this DataFrame (which is the unbounded table), which warrants a change
> to the DataFrame and violates the very basic nature of a DataFrame, since a
> DataFrame is by its nature immutable. This sounds contradictory. Is there
> an explanation for this?
>
> Regards
> Sheel
>


Re: Running spark with javaagent configuration

2019-05-15 Thread Oleg Mazurov
You can see what Uber JVM does at
https://github.com/uber-common/jvm-profiler :

--conf spark.jars=hdfs://hdfs_url/lib/jvm-profiler-1.0.0.jar
--conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar


-- Oleg
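
A rough spark-submit sketch that combines this with the earlier HDFS suggestion
for the New Relic agent (the jar name, HDFS path, master, and application
details are assumptions; the relative -javaagent path relies on the jar being
localized into each container's working directory, which depends on the
cluster manager, so it is worth verifying for the driver in cluster mode):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.jars=hdfs:///libs/newrelic-agent.jar \
      --conf spark.driver.extraJavaOptions=-javaagent:newrelic-agent.jar \
      --conf spark.executor.extraJavaOptions=-javaagent:newrelic-agent.jar \
      --class com.example.MyApp my-app.jar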

On Wed, May 15, 2019 at 6:28 AM Anton Puzanov 
wrote:

> Hi everyone,
>
> I want to run my Spark application with a javaagent; specifically, I want
> to use New Relic with my application.
>
> When I run spark-submit I must pass --conf
> "spark.driver.extraJavaOptions=-javaagent="
>
> My problem is that I can't specify the full path as I run in cluster mode
> and I don't know the exact host which will serve as the driver.
> *Important:* I know I can upload the jar to every node, but it seems like
> a fragile solution as machines will be added and removed later.
>
> I have tried specifying the jar with --files but couldn't make it work, as
> I didn't know where exactly I should point the javaagent.
>
> Any suggestions on the best practice to handle this kind of problem, and
> what can I do?
>
> Thanks a lot,
> Anton
>


Are Spark Dataframes mutable in Structured Streaming?

2019-05-15 Thread Sheel Pancholi
Hi
Structured Streaming treats a stream as an unbounded table in the form of a
DataFrame. Continuously flowing data from the stream keeps getting added to
this DataFrame (which is the unbounded table), which warrants a change to
the DataFrame and violates the very basic nature of a DataFrame, since a
DataFrame is by its nature immutable. This sounds contradictory. Is there
an explanation for this?

Regards
Sheel


Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jason Nerothin
I did a quick Google search for "Java/Scala interoperability" and was
surprised to find very few recent results on the topic. (Has the world
given up?)

It's easy to use Java library code from Scala, but the opposite is not true.

I would think about the problem this way: Do *YOU* need to provide a Java
API in your product?

If you decide to support both, beware the Law of Leaky Abstractions and
look at what the Spark team came up with. (DataFrames in version 2.0 target
this same problem, among others: to provide a single abstraction that
works across Scala, Java, Python, and R. But what the team came up with
required the APIs you list to make it work.)

Think carefully about what new things you're trying to provide and what
things you're trying to hide beneath your abstraction.

HTH
Jason
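
For the concrete DStream question quoted below, one common pattern is to keep
the Scala API as the single definition and expose a thin Java-friendly view of
it. A minimal sketch (the class name loosely mirrors the waterdrop
BaseStreamingInput example, and getJavaDStream is an assumed helper, not part
of that project):

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.api.java.JavaDStream

    // Scala-first abstract API: Scala subclasses implement getDStream directly.
    abstract class BaseStreamingInput[T: ClassTag] {
      def getDStream: DStream[T]

      // Java callers get a JavaDStream view that wraps the same underlying
      // DStream without copying any data.
      def getJavaDStream: JavaDStream[T] = JavaDStream.fromDStream(getDStream)
    }

A Java subclass that naturally builds a JavaDStream can still satisfy the Scala
contract by returning javaDStream.dstream() from getDStream, since JavaDStream
is only a wrapper around the Scala DStream.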




On Wed, May 15, 2019 at 8:24 AM Jean-Georges Perrin  wrote:

> I see… Did you consider Structured Streaming?
>
> Otherwise, you could create a factory that will build your higher level
> object, that will return an interface defining your API,  but the
> implementation may vary based on the context.
>
> And English is not my native language as well...
>
> Jean -Georges Perrin
> j...@jgp.net
>
>
>
>
> On May 14, 2019, at 21:47, Gary Gao  wrote:
>
> Thanks for the reply, Jean
>   In my project, I'm working on a higher abstraction layer over Spark
> Streaming to build a data processing product and trying to provide a common
> API for Java and Scala developers.
>   You can see the abstract class defined here:
> https://github.com/InterestingLab/waterdrop/blob/master/waterdrop-apis/src/main/scala/io/github/interestinglab/waterdrop/apis/BaseStreamingInput.scal
> 
>
>
> There is a method, getDStream, that returns a DStream[T]. Currently a
> Scala class can extend this class and override getDStream, but I also want
> a Java class to be able to extend this class and return a JavaDStream.
> This is my real problem.
>  Tell me if the above description is not clear, because English is
> not my native language.
>
> Thanks in advance
> Gary
>
> On Tue, May 14, 2019 at 11:06 PM Jean Georges Perrin  wrote:
>
>> There are a few more than the ones you listed. Nevertheless,
>> some data types are not directly compatible between Scala and Java and
>> require conversion, so it's good not to pollute your code with plenty of
>> conversions and to focus on using the straight API.
>>
>> I don't remember off the top of my head, but if you use more Spark 2
>> features (DataFrames, Structured Streaming...) you will need fewer of
>> those Java-specific APIs.
>>
>> Do you see a problem here? What’s your take on this?
>>
>> jg
>>
>>
>> On May 14, 2019, at 10:22, Gary Gao  wrote:
>>
>> Hi all,
>>
>> I am wondering why we need Java-friendly APIs in Spark. Why can't we
>> just use the Scala APIs in Java code? What's the difference?
>>
>> Some examples of Java-Friendly APIs commented in Spark code are as
>> follows:
>>
>> JavaDStream
>> JavaInputDStream
>> JavaStreamingContext
>> JavaSparkContext
>>
>>
>

-- 
Thanks,
Jason


Running spark with javaagent configuration

2019-05-15 Thread Anton Puzanov
Hi everyone,

I want to run my Spark application with a javaagent; specifically, I want to
use New Relic with my application.

When I run spark-submit I must pass --conf
"spark.driver.extraJavaOptions=-javaagent="

My problem is that I can't specify the full path as I run in cluster mode
and I don't know the exact host which will serve as the driver.
*Important:* I know I can upload the jar to every node, but it seems like a
fragile solution as machines will be added and removed later.

I have tried specifying the jar with --files but couldn't make it work, as
I didn't know where exactly I should point the javaagent.

Any suggestions on the best practice to handle this kind of problem, and
what can I do?

Thanks a lot,
Anton


Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jean-Georges Perrin
I see… Did you consider Structured Streaming?

Otherwise, you could create a factory that will build your higher level object, 
that will return an interface defining your API,  but the implementation may 
vary based on the context.

And English is not my native language as well...

Jean -Georges Perrin
j...@jgp.net




> On May 14, 2019, at 21:47, Gary Gao  wrote:
> 
> Thanks for the reply, Jean
>   In my project, I'm working on a higher abstraction layer over Spark
> Streaming to build a data processing product and trying to provide a common
> API for Java and Scala developers.
>   You can see the abstract class defined here: 
> https://github.com/InterestingLab/waterdrop/blob/master/waterdrop-apis/src/main/scala/io/github/interestinglab/waterdrop/apis/BaseStreamingInput.scal
>  
> 
>  
> 
> There is a method, getDStream, that returns a DStream[T]. Currently a
> Scala class can extend this class and override getDStream, but I also want
> a Java class to be able to extend this class and return a JavaDStream.
> This is my real problem.
>  Tell me if the above description is not clear, because English is 
> not my native language. 
> 
> Thanks in advance
> Gary
> 
> On Tue, May 14, 2019 at 11:06 PM Jean Georges Perrin wrote:
> There are a few more than the ones you listed. Nevertheless, some
> data types are not directly compatible between Scala and Java and require
> conversion, so it's good not to pollute your code with plenty of conversions
> and to focus on using the straight API.
> 
> I don't remember off the top of my head, but if you use more Spark 2
> features (DataFrames, Structured Streaming...) you will need fewer of those
> Java-specific APIs.
> 
> Do you see a problem here? What’s your take on this?
> 
> jg
> 
> 
> On May 14, 2019, at 10:22, Gary Gao wrote:
> 
>> Hi all,
>> 
>> I am wondering why we need Java-friendly APIs in Spark. Why can't we
>> just use the Scala APIs in Java code? What's the difference?
>> 
>> Some examples of Java-Friendly APIs commented in Spark code are as follows:
>> 
>>   JavaDStream
>>   JavaInputDStream
>>   JavaStreamingContext
>>   JavaSparkContext
>> 



Re: Streaming job, catch exceptions

2019-05-15 Thread bsikander
Any help would be much appreciated.

The error and the question are quite generic; I believe most experienced
users will be able to answer.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: Spark sql insert hive table which method has the highest performance

2019-05-15 Thread Jelly Young
Hi,
The documentation of DataFrameWriter says that:
 Unlike `saveAsTable`, `insertInto` ignores the column names and just uses
position-based resolution.
For example:

    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
    scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
    scala> sql("select * from t1").show
    +---+---+
    |  i|  j|
    +---+---+
    |  5|  6|
    |  3|  4|
    |  1|  2|
    +---+---+


In the implementation details, they wrap the DataFrameWriter action into an
InsertIntoTable or CreateTable logical plan, which I think is the same thing
that happens for your INSERT SQL command.
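
For the original streaming use case, a minimal sketch of option 2 (this assumes
a Hive-enabled SparkSession, an existing Hive table named `events` whose column
order matches the DataFrame, and a DStream[Row] plus a schema provided by the
caller):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.streaming.dstream.DStream

    def writeToHive(spark: SparkSession, stream: DStream[Row], schema: StructType): Unit = {
      stream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          val df = spark.createDataFrame(rdd, schema)
          // insertInto resolves columns by position against the existing table.
          df.write.mode("append").insertInto("events")
        }
      }
    }

saveAsTable with mode("append") would instead resolve columns by name and can
create the table if it does not exist, while the temp-view + INSERT route ends
up wrapped into a similar plan, as noted above; so the main practical
difference to watch is the position-based vs name-based column resolution.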



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Spark sql insert hive table which method has the highest performance

2019-05-15 Thread 车
Hello guys,

I use Spark Streaming to receive data from Kafka and need to store the data
into Hive. I found the following ways to insert data into Hive on the Internet:

1. use tmp_table
TmpDF = spark.createDataFrame(RDD, schema)
TmpDF.createOrReplaceTempView('TmpData')
sqlContext.sql('insert overwrite table tmp_table select * from TmpData')

2. use DataFrameWriter.insertInto

3. use DataFrameWriter.saveAsTable

I didn't find many examples, and I don't know if there is any difference
between them or whether there is a better way to write into Hive. Please give
me some help.

Thank you