Re: Using Apache Kylin as data source for Spark

2018-05-25 Thread Li Yang
That is very useful~~  :-)

On Fri, May 18, 2018 at 11:56 AM, ShaoFeng Shi 
wrote:

> Hello, Kylin and Spark users,
>
> A doc was newly added to the Apache Kylin website on how to use Kylin as a
> data source in Spark;
> this can help users who want to use Spark to analyze the aggregated
> Cube data.
>
> https://kylin.apache.org/docs23/tutorial/spark.html
>
> Thanks for your attention.
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>


[Query] Weight of evidence on Spark

2018-05-25 Thread Aakash Basu
Hi guys,

What's the best way to create a feature column with Weight of Evidence (WoE)
calculated for categorical columns against a target column (both binary and
multi-class)?

Any insight?
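
For context, the usual binary-target formulation is
WoE(c) = ln( (events_c / total_events) / (non_events_c / total_non_events) ).
A minimal Spark DataFrame sketch of that calculation (the df, "category" and
"label" names are assumptions, the label is taken to be 0/1, and zero counts
would need smoothing in practice):

import org.apache.spark.sql.functions._

// Totals over the whole dataset (assumes a 0/1 "label" column).
val totalsRow = df.agg(
  sum(col("label")).cast("double").alias("totalEvents"),
  sum(lit(1) - col("label")).cast("double").alias("totalNonEvents")
).first()
val totalEvents = totalsRow.getDouble(0)
val totalNonEvents = totalsRow.getDouble(1)

// Per-category event / non-event counts and the WoE value.
val woe = df.groupBy("category")
  .agg(
    sum(col("label")).cast("double").alias("events"),
    sum(lit(1) - col("label")).cast("double").alias("nonEvents"))
  .withColumn("woe",
    log((col("events") / totalEvents) / (col("nonEvents") / totalNonEvents)))

// Join the per-category WoE back onto the data as a feature column.
val withWoe = df.join(woe.select("category", "woe"), Seq("category"), "left")

For a multi-class target, the same idea is usually applied one-vs-rest per class.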

Thanks,
Aakash.


Re: Submit many spark applications

2018-05-25 Thread yncxcw
hi, 

please try reducing the default heap size of the JVM on the machine you use to
submit applications:

For example:
export _JAVA_OPTIONS="-Xmx512M" 

The submitter, which is itself a JVM, does not need to reserve much memory.


Wei 





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Databricks 1/2 day certification course at Spark Summit

2018-05-25 Thread Sumona Routh
Hi all,
My company has just approved sending some of us to Spark Summit in SF
this year. Unfortunately, the day-long workshops on Monday are now sold out.
We are considering what we might do instead.

Have others done the 1/2 day certification course before? Is it worth
considering? Does it cover anything in particular around Spark or is it
more of an exam-prep type course? We have people with varying skill levels
-- would this be a waste for newer Spark devs or is there still some good
info to glean?

We may decide to save the budget for other types of local training, but I
wanted to see what others who may have done this course before think.

Thanks!
Sumona


Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
I already gave my recommendation in my very first reply to this thread...

On Fri, May 25, 2018 at 10:23 AM, raksja  wrote:
> OK, when to use what?
> Do you have any recommendation?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread raksja
OK, when to use what?
Do you have any recommendation?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
On Fri, May 25, 2018 at 10:18 AM, raksja  wrote:
> InProcessLauncher would just start a subprocess as you mentioned earlier.

No. As the name says, it runs things in the same process.
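
For reference, a minimal sketch of launching in-process (assumes Spark 2.3+,
with the Spark jars and a YARN configuration on the classpath; the jar path,
class name and conf values below are placeholders):

import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// Runs the submission logic inside the current JVM instead of forking a
// spark-submit subprocess; resource and class names are placeholders.
val handle: SparkAppHandle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")           // keeps the driver off the submitting JVM
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.MyApp")
  .setConf("spark.executor.memory", "2g")
  .startApplication()                 // returns immediately with a handle

// The handle can be polled or given listeners to track the application state.
println(handle.getState)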

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread raksja
When you say Spark uses it, did you mean this:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala?

InProcessLauncher would just start a subprocess as you mentioned earlier.
How about this one: does it make a REST API call to YARN?

Given my case, where I have several concurrent jobs, would you recommend the
Spark YARN client (mentioned above) over InProcessLauncher?






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
That's what Spark uses.

On Fri, May 25, 2018 at 10:09 AM, raksja  wrote:
> Thanks for the reply.
>
> Have you tried submitting a Spark job directly to YARN using YarnClient?
> https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html
>
> I'm not sure whether it's performant and scalable.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread raksja
Thanks for the reply.

Have you tried submitting a Spark job directly to YARN using YarnClient?
https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html

I'm not sure whether it's performant and scalable.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Yong Zhang
I am not sure about Redshift, but I know the target table is not partitioned.
Still, we should be able to insert into a non-partitioned remote table from 12
clients concurrently, right?


Even if Redshift doesn't allow concurrent writes, would the Spark driver
detect this and coordinate all tasks and executors the way I observed?


Yong


From: Jörn Franke 
Sent: Friday, May 25, 2018 10:50 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Why Spark JDBC Writing in a sequential order

Can your database receive the writes concurrently? I.e., do you make sure that
each executor writes into a different partition on the database side?

On 25. May 2018, at 16:42, Yong Zhang 
> wrote:


Spark version 2.2.0


We are trying to write a DataFrame to a remote relational database (AWS
Redshift). Based on the Spark JDBC documentation, we already repartition our DF
into 12 partitions and set the Spark JDBC "numPartitions" parameter to 12 for
concurrent writing.


We run the command as follows:

dataframe.repartition(12).write.mode("overwrite").option("batchsize", 
5000).option("numPartitions", 12).jdbc(url=jdbcurl, table="tableName", 
connectionProperties=connectionProps)
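
For clarity, the equivalent call in plain Scala DataFrameWriter syntax (a
sketch only; dataframe, jdbcurl and connectionProps are assumed to be the
DataFrame, URL string and java.util.Properties from the snippet above, and
"tableName" is a placeholder):

// Same write expressed with the Scala API; names reused from the snippet above.
dataframe
  .repartition(12)
  .write
  .mode("overwrite")
  .option("batchsize", "5000")
  .option("numPartitions", "12")
  .jdbc(jdbcurl, "tableName", connectionProps)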


Here is the Spark UI (screenshot not included in this text archive):

We found that the 12 tasks are obviously running in sequential order. They all
show "Running" status at the same time in the beginning, but if we check their
"Duration" and "Shuffle Read Size/Records", it is clear that they run one by
one.

For example, task 8 finished first in about 2 hours and wrote 34732 records to
the remote DB (I know the speed looks terrible, but that's not the question of
this post), while task 0 started after task 8 and took 4 hours (the first 2
hours spent waiting for task 8).

In this picture, only tasks 2 and 4 are in the running stage, but task 4 is
obviously waiting for task 2 to finish and only starts writing after that.


My question is: in the above Spark command, my understanding is that the 12
executors should open JDBC connections to the remote DB concurrently, all 12
tasks should also write concurrently, and the whole job should finish in around
2 hours overall.


Why are the 12 tasks indeed in the "RUNNING" stage but apparently waiting for
something, and only able to write to the remote DB sequentially? The 12
executors are in different JVMs on different physical nodes. Why is this
happening? What stops Spark from pushing the data truly concurrently?


Thanks


Yong



Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Jörn Franke
Can your database receive the writes concurrently? I.e., do you make sure that
each executor writes into a different partition on the database side?

> On 25. May 2018, at 16:42, Yong Zhang  wrote:
> 
> Spark version 2.2.0
> 
> 
> We are trying to write a DataFrame to a remote relational database (AWS 
> Redshift). Based on the Spark JDBC documentation, we already repartition our 
> DF into 12 partitions and set the Spark JDBC "numPartitions" parameter to 12 
> for concurrent writing.
> 
> 
> We run the command as follows:
> 
> dataframe.repartition(12).write.mode("overwrite").option("batchsize", 
> 5000).option("numPartitions", 12).jdbc(url=jdbcurl, table="tableName", 
> connectionProperties=connectionProps)
> 
> Here is the Spark UI:
> 
> 
> 
> 
> We found that the 12 tasks are obviously running in sequential order. They 
> all show "Running" status at the same time in the beginning, but if we check 
> their "Duration" and "Shuffle Read Size/Records", it is clear that they run 
> one by one.
> For example, task 8 finished first in about 2 hours and wrote 34732 records 
> to the remote DB (I know the speed looks terrible, but that's not the question 
> of this post), while task 0 started after task 8 and took 4 hours (the first 2 
> hours spent waiting for task 8). 
> In this picture, only tasks 2 and 4 are in the running stage, but task 4 is 
> obviously waiting for task 2 to finish and only starts writing after that.
> 
> My question is: in the above Spark command, my understanding is that the 12 
> executors should open JDBC connections to the remote DB concurrently, all 12 
> tasks should also write concurrently, and the whole job should finish in 
> around 2 hours overall.
> 
> Why are the 12 tasks indeed in the "RUNNING" stage but apparently waiting for 
> something, and only able to write to the remote DB sequentially? The 12 
> executors are in different JVMs on different physical nodes. Why is this 
> happening? What stops Spark from pushing the data truly concurrently?
> 
> Thanks
> 
> Yong 
> 


Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-25 Thread Chetan Khatri
Ajay, you can use Sqoop if you want to ingest data into HDFS. This is a POC
where the customer wants to prove that Spark ETL would be faster than the
C#-based raw SQL statements. That's all. There are no timestamp-based columns
in the source tables to make it an incremental load.
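
For reference, a partitioned JDBC read from SQL Server in Spark looks roughly
like this (a sketch only; the host, database, table, credentials and partition
column are placeholders, spark is an existing SparkSession, and it assumes the
Microsoft JDBC driver is on the classpath):

// Hedged sketch of a partitioned JDBC read from SQL Server; all names are placeholders.
val source = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=SourceDb")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "dbo.SourceTable")
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("MSSQL_PASSWORD", ""))
  .option("partitionColumn", "id")   // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "12")
  .load()

A similar "numPartitions" setting caps the number of concurrent JDBC
connections on the write side, which is usually where SQL Server load needs to
be throttled.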

On Thu, May 24, 2018 at 1:08 AM, ayan guha  wrote:

> Curious question: what is the reason of using spark here? Why not simple
> sql-based ETL?
>
> On Thu, May 24, 2018 at 5:09 AM, Ajay  wrote:
>
>> Do you worry about Spark overloading the SQL server? We have had this
>> issue in the past, where all Spark slaves tend to send lots of data at once
>> to SQL Server, and that increases the latency of the rest of the system. We
>> overcame this by using Sqoop and running it in a controlled environment.
>>
>> On Wed, May 23, 2018 at 7:32 AM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Super, just giving a high-level idea of what I want to do. I have one source
>>> schema, which is MS SQL Server 2008, and the target is also MS SQL Server 2008.
>>> Currently there is a C#-based ETL application which does extract, transform
>>> and load into a customer-specific schema, including indexing etc.
>>>
>>>
>>> Thanks
>>>
>>> On Wed, May 23, 2018 at 7:11 PM, kedarsdixit <
>>> kedarnath_di...@persistent.com> wrote:
>>>
 Yes.

 Regards,
 Kedar Dixit



 --
 Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>> --
>> Thanks,
>> Ajay
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
 Hi Jacek,

This is exactly what I'm looking for. Thanks!!

Also, thanks for the link. I just noticed that I can unfold the trigger entry
and see the examples in the Java and Scala languages - what a great help for a
newcomer :-)
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter

def trigger(trigger: Trigger): DataStreamWriter[T]

Set the trigger for the stream query. The default value is ProcessingTime(0)
and it will run the query as fast as possible.

Scala Example:
df.writeStream.trigger(ProcessingTime("10 seconds"))

import scala.concurrent.duration._
df.writeStream.trigger(ProcessingTime(10.seconds))

Java Example:
df.writeStream().trigger(ProcessingTime.create("10 seconds"))

import java.util.concurrent.TimeUnit
df.writeStream().trigger(ProcessingTime.create(10, TimeUnit.SECONDS))

Much appreciated!

Peter


On Fri, May 25, 2018 at 9:11 AM, Jacek Laskowski  wrote:

> Hi Peter,
>
> > Basically I need to find a way to set the batch-interval in (b), similar
> as in (a) below.
>
> That's trigger method on DataStreamWriter.
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter
>
> import org.apache.spark.sql.streaming.Trigger
> df.writeStream.trigger(Trigger.ProcessingTime("1 second"))
>
> See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Thu, May 24, 2018 at 10:14 PM, Peter Liu  wrote:
>
>> Hi there,
>>
>> from the Apache Spark streaming website (see links below),
>>
>>- the batch-interval is set when a Spark StreamingContext is
>>constructed (see example (a) quoted below)
>>- the StreamingContext is available in older and newer Spark versions
>>(v1.6, v2.2 to v2.3.0) (see
>>https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html
>>and https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html)
>>- however, example (b) below doesn't use a StreamingContext but a
>>SparkSession object to set up a streaming flow;
>>
>> What does the usage difference between (a) and (b) mean? I was wondering if
>> this implies a different streaming approach ("traditional" streaming vs.
>> structured streaming)?
>>
>> Basically I need to find a way to set the batch-interval in (b), similar
>> as in (a) below.
>>
>> It would be great if someone could please share some insights here.
>>
>> Thanks!
>>
>> Peter
>>
>> (a)
>> https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html
>>
>> import org.apache.spark._
>> import org.apache.spark.streaming._
>> val conf = new SparkConf().setAppName(appName).setMaster(master)
>> val ssc = new StreamingContext(conf, Seconds(1))
>>
>>
>> (b)
>> (from Databricks' https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html)
>>
>> val spark = SparkSession.builder()
>>   .appName(appName)
>>   .getOrCreate()
>> ...
>>
>> jsonOptions = { "timestampFormat": nestTimestampFormat }
>> parsed = spark \
>>   .readStream \
>>   .format("kafka") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("subscribe", "nest-logs") \
>>   .load() \
>>   .select(from_json(col("value").cast("string"), schema, 
>> jsonOptions).alias("parsed_value"))
>>
>>
>>
>>
>>
>


Re: help with streaming batch interval question needed

2018-05-25 Thread Jacek Laskowski
Hi Peter,

> Basically I need to find a way to set the batch-interval in (b), similar
as in (a) below.

That's trigger method on DataStreamWriter.

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter

import org.apache.spark.sql.streaming.Trigger
df.writeStream.trigger(Trigger.ProcessingTime("1 second"))

See
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, May 24, 2018 at 10:14 PM, Peter Liu  wrote:

> Hi there,
>
> from the Apache Spark streaming website (see links below),
>
>- the batch-interval is set when a Spark StreamingContext is
>constructed (see example (a) quoted below)
>- the StreamingContext is available in older and newer Spark versions
>(v1.6, v2.2 to v2.3.0) (see
>https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html
>and https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html)
>- however, example (b) below doesn't use a StreamingContext but a
>SparkSession object to set up a streaming flow;
>
> What does the usage difference between (a) and (b) mean? I was wondering if
> this implies a different streaming approach ("traditional" streaming vs.
> structured streaming)?
>
> Basically I need to find a way to set the batch-interval in (b), similar
> as in (a) below.
>
> It would be great if someone could please share some insights here.
>
> Thanks!
>
> Peter
>
> (a)
> https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html
>
> import org.apache.spark._
> import org.apache.spark.streaming._
> val conf = new SparkConf().setAppName(appName).setMaster(master)
> val ssc = new StreamingContext(conf, Seconds(1))
>
>
> (b)
> (from Databricks' https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html)
>
> val spark = SparkSession.builder()
>   .appName(appName)
>   .getOrCreate()
> ...
>
> jsonOptions = { "timestampFormat": nestTimestampFormat }
> parsed = spark \
>   .readStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("subscribe", "nest-logs") \
>   .load() \
>   .select(from_json(col("value").cast("string"), schema, 
> jsonOptions).alias("parsed_value"))
>
>
>
>
>