Re: GC- Yarn vs Standalone K8

2018-06-11 Thread Keith Chapman
Spark on EMR is configured to use CMS GC, specifically following flags,

spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
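
For anyone comparing with G1 (as discussed further down this thread), a roughly equivalent G1GC line might look like the following. The flag names are standard HotSpot options, but the occupancy value is illustrative, not an EMR default:

```
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=35
-XX:OnOutOfMemoryError='kill -9 %p'
```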


Regards,
Keith.

http://keith-chapman.com

On Mon, Jun 11, 2018 at 8:22 PM, ankit jain  wrote:

> Hi,
> Does anybody know if Yarn uses a different Garbage Collector from Spark
> standalone?
>
> We migrated our application recently from EMR to K8s (not using native Spark
> on K8s yet) and see quite a bit of performance degradation.
>
> Diving further, it seems garbage collection is running too often, up to 50%
> of task time even with a small amount of data - PFA Spark UI screenshot.
>
> I have updated the GC to G1GC and it has helped a bit - GC time has come
> down from 50% to 30%, still too high though.
>
> Also enabled -verbose:gc, so there will be many more metrics to play with,
> but any pointers meanwhile will be appreciated.
>
>
> --
> Thanks & Regards,
> Ankit.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Vamshi Talla
Aakash,

Like Jorn suggested, did you increase your test data set? If so, did you also 
update your executor-memory setting? It seems like you might be exceeding the 
executor memory threshold.
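
If memory is indeed the issue, a first thing to try on the submit command is bumping executor memory and off-heap headroom. Values here are purely illustrative; on Spark 2.3 the overhead knob is spark.executor.memoryOverhead, while older releases use spark.yarn.executor.memoryOverhead:

```
spark-submit ... \
  --executor-memory 6G \
  --conf spark.executor.memoryOverhead=1g \
  ...
```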

Thanks
Vamshi Talla

Sent from my iPhone

On Jun 11, 2018, at 8:54 AM, Aakash Basu <aakash.spark@gmail.com> wrote:

Hi Jorn/Others,

Thanks for your help. Now, data is being distributed in a proper way, but the 
challenge is, after a certain point, I'm getting this error, after which, 
everything stops moving ahead -

2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on
192.168.49.39: Remote RPC client disassociated. Likely due to containers
exceeding thresholds, or network issues. Check driver logs for WARN messages.



How to avoid this scenario?

Thanks,
Aakash.

On Mon, Jun 11, 2018 at 4:16 PM, Jörn Franke <jornfra...@gmail.com> wrote:
If it is in kB then spark will always schedule it to one node. As soon as it 
gets bigger you will see usage of more nodes.

Hence increase your testing Dataset .

On 11. Jun 2018, at 12:22, Aakash Basu <aakash.spark@gmail.com> wrote:

Jorn - The code is a series of feature engineering and model tuning operations. 
Too big to show. Yes, data volume is too low, it is in KBs, just tried to 
experiment with a small dataset before going for a large one.

Akshay - I ran with your suggested spark configurations, I get this (the node 
changed, but the problem persists) -





On Mon, Jun 11, 2018 at 3:16 PM, akshay naidu <akshaynaid...@gmail.com> wrote:
try
 --num-executors 3 --executor-cores 4 --executor-memory 2G --conf 
spark.scheduler.mode=FAIR

On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu <aakash.spark@gmail.com> wrote:
Hi,

I have submitted a job on 4 node cluster, where I see, most of the operations 
happening at one of the worker nodes and other two are simply chilling out.

Picture below puts light on that -
How to properly distribute the load?

My cluster conf (4 node cluster [1 driver; 3 slaves]) -

Cores - 6
RAM - 12 GB
HDD - 60 GB

My Spark Submit command is as follows -

spark-submit --master spark://192.168.49.37:7077 --num-executors 3
--executor-cores 5 --executor-memory 4G
/appdata/bblite-codebase/prima_diabetes_indians.py

What to do?

Thanks,
Aakash.





Re: Exception when closing SparkContext in Spark 2.3

2018-06-11 Thread umayr_nuna
My bad, it's EMR 5.14.0



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark Streaming]: How do I apply window before filter?

2018-06-11 Thread Tejas Manohar
Hey friends,

We're trying to make some batched computations run against an OLAP DB
closer to "realtime". One of our more complex computations is a trigger
when event A occurs but not event B within a given time period. Our
experience with Spark is limited, but since Spark 2.3.0 just introduced
Stream-Stream Joins and our data is already in Kafka, we thought we'd try
it out.
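
For intuition, here is a pure-Python, batch-style sketch of the trigger we're trying to express: flag IDs whose "Item Added" has no "Item Purchased" within the window. The event names and 5-second window come from the spark-shell example below; the records themselves are hypothetical.

```python
from datetime import datetime, timedelta

def a_without_b(events, window=timedelta(seconds=5)):
    """IDs whose 'Item Added' has no 'Item Purchased' within `window` after it."""
    a_times = {e["id"]: e["ts"] for e in events if e["event"] == "Item Added"}
    b_times = {e["id"]: e["ts"] for e in events if e["event"] == "Item Purchased"}
    return sorted(
        id_ for id_, ta in a_times.items()
        if id_ not in b_times or not (ta < b_times[id_] <= ta + window)
    )

t0 = datetime(2018, 6, 11, 12, 0, 0)
events = [
    {"id": "u1", "event": "Item Added", "ts": t0},
    {"id": "u1", "event": "Item Purchased", "ts": t0 + timedelta(seconds=3)},
    {"id": "u2", "event": "Item Added", "ts": t0},  # no purchase at all
    {"id": "u3", "event": "Item Added", "ts": t0},
    {"id": "u3", "event": "Item Purchased", "ts": t0 + timedelta(seconds=30)},
]
print(a_without_b(events))  # → ['u2', 'u3']
```

In streaming, the "no B yet" case is what forces the watermark into the picture: the query can only emit u2/u3 once the watermark passes the end of their window.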

That said, in our exploration, we've been running into an issue where Spark
optimizes the Kafka *watermark* to be applied *after* the *filter* in our
SQL query. This means the watermark won't move forward unless there's data
within the filtered results, and thus the trigger for "event B" won't
occur until another "event B" is triggered, which can be problematic
depending on how granular the filter is.

See the quick isolated example I set up in *spark-shell* below.

```
scala> :paste
// Entering paste mode (ctrl-D to finish)

val kafka =
spark.readStream.format("kafka").option("kafka.bootstrap.servers",
*":"*).option("subscribe", "**").option("startingOffsets",
"latest").load()

import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("message", StructType(Seq(
StructField("event", StringType),
StructField("timestamp", TimestampType)
  )))
))

val parsed = kafka.select(from_json($"value".cast(StringType), schema) as
'data).select($"data.*", $"data.message.timestamp" as
'ts).withWatermark("ts", "10 seconds")

// Exiting paste mode, now interpreting.
scala> parsed.filter("message.event = 'Item
Added'").as('a).join(parsed.filter("message.event = 'Item Purchased'") as
'b, expr("a.id = b.id AND a.ts < b.ts AND b.ts < a.ts + interval 5
seconds"), joinType="left").explain()
== Physical Plan ==
StreamingSymmetricHashJoin [id#24], [id#37], LeftOuter, condition = [
leftOnly = null, rightOnly = null, both = ((ts#23-T1ms <
ts#39-T1ms) && (ts#39-T1ms < ts#23-T1ms + interval 5 seconds)),
full = ((ts#23-T1ms < ts#39-T1ms) && (ts#39-T1ms <
ts#23-T1ms + interval 5 seconds)) ], state info [ checkpoint =
, runId = 52d0e4a5-150c-4136-8542-c2c5e4bb59c2, opId = 0, ver = 0,
numPartitions = 4], 0, state cleanup [ left value predicate:
(ts#23-T1ms <= -500), right value predicate: (ts#39-T1ms <= 0) ]
:- Exchange hashpartitioning(id#24, 4)
:  +- EventTimeWatermark ts#23: timestamp, interval 10 seconds
: +- Project [jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).id AS id#24,
jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message AS message#25,
jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message.timestamp AS ts#23]
:+- Filter (jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message.event = Item Added)
:   +- StreamingRelation kafka, [key#7, value#8, topic#9,
partition#10, offset#11L, timestamp#12, timestampType#13]
+- Exchange hashpartitioning(id#37, 4)
   +- *(1) Filter isnotnull(ts#39-T1ms)
  +- EventTimeWatermark ts#39: timestamp, interval 10 seconds
 +- Project [jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).id AS id#37,
jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message AS message#38,
jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message.timestamp AS ts#39]
+- Filter ((jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).message.event = Item Purchased) &&
isnotnull(jsontostructs(StructField(id,StringType,true),
StructField(message,StructType(StructField(event,StringType,true),
StructField(timestamp,TimestampType,true)),true), cast(value#8 as string),
Some(Etc/UTC), true).id))
   +- StreamingRelation kafka, [key#7, 
```

Exception when closing SparkContext in Spark 2.3

2018-06-11 Thread umayr_nuna
I'm running a Scala application in EMR 5.12.0 (S3, HDFS) with the following
properties:

--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory
30g --executor-cores 5 --conf spark.default.parallelism=400 --conf
spark.dynamicAllocation.enabled=true --conf
spark.dynamicAllocation.maxExecutors=20 --conf
spark.eventLog.dir=hdfs:///var/log/spark/apps --conf
spark.eventLog.enabled=true --conf
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf
spark.scheduler.listenerbus.eventqueue.capacity=2 --conf
spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400
--conf spark.yarn.maxAppAttempts=1

The application runs fine until the SparkContext is (automatically) closed, at
which point the SparkContext object throws an exception.

18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1
java.lang.InterruptedException at java.lang.Object.wait(Native Method) at
java.lang.Thread.join(Thread.java:1252) at
java.lang.Thread.join(Thread.java:1326) at
org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133)
at
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at scala.collection.Iterator$class.foreach(Iterator.scala:893) at
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at
scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at
scala.collection.AbstractIterable.foreach(Iterable.scala:54) at
org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
at
org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at
org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at
org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
at
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192) at
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

 

I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same
application), so I'm not sure which change is causing Spark 2.3 to throw.
Any ideas?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-11 Thread Marcelo Vanzin
We are happy to announce the availability of Spark 2.3.1!

Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.

To download Spark 2.3.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-1.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



re: streaming - kafka partition transition time (from state change logger)

2018-06-11 Thread Peter Liu
Hi there,

Working on the streaming processing latency based on timestamps from
Kafka, I have two quick general questions triggered by looking at the Kafka
state change log file:

(a) The partition state change from the OfflineReplica state *to
OnlinePartition* state seems to take more than 20 seconds. Would that mean
an incoming message/event into Kafka would need to go through all these state
transitions (see below) to become ready for the consumer (in this case, after
20+ seconds)?

(b) Which stage is the Kafka timestamp for?

Any help/clarification would be very much appreciated!

Thanks ...

Peter

[2018-06-08 *15:34:36*,518] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from *ReplicaDeletionIneligible *to
OfflineReplica (state.change.logger)
[2018-06-08 15:34:36,945] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from OfflineReplica to OnlineReplica
(state.change.logger)
[2018-06-08 15:34:37,025] TRACE Controller 0 epoch 53 sending
become-follower LeaderAndIsr request
(Leader:-1,ISR:,LeaderEpoch:1,ControllerEpoch:53) to broker 0 for partition
[events,79] (state.change.logger)
[2018-06-08 15:34:37,079] TRACE Broker 0 received LeaderAndIsr request
PartitionState(controllerEpoch=53, leader=-1, leaderEpoch=1, isr=[],
zkVersion=1, replicas=[0]) correlation id 1 from controller 0 epoch 53 for
partition [events,79] (state.change.logger)
[2018-06-08 15:34:38,481] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from OnlineReplica to OfflineReplica
(state.change.logger)
[2018-06-08 15:34:38,588] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from OfflineReplica to
ReplicaDeletionStarted (state.change.logger)
[2018-06-08 15:34:39,427] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from ReplicaDeletionStarted to
ReplicaDeletionSuccessful (state.change.logger)
[2018-06-08 15:34:39,560] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from ReplicaDeletionSuccessful to
NonExistentReplica (state.change.logger)
[2018-06-08 15:34:39,564] TRACE Controller 0 epoch 53 changed partition
[events,79] state from OfflinePartition to OfflinePartition
(state.change.logger)
[2018-06-08 15:34:39,571] TRACE Controller 0 epoch 53 changed partition
[events,79] state from OfflinePartition to NonExistentPartition
(state.change.logger)
[2018-06-08 15:35:01,893] TRACE Controller 0 epoch 53 changed partition
[events,79] state from NonExistentPartition to NewPartition with assigned
replicas 0 (state.change.logger)
[2018-06-08 15:35:01,960] TRACE Controller 0 epoch 53 changed state of
replica 0 for partition [events,79] from NonExistentReplica to NewReplica
(state.change.logger)
[2018-06-08 *15:35:02,02*6] TRACE Controller 0 epoch 53 changed partition
[events,79] from *NewPartition *to *OnlinePartition *with leader 0
(state.change.logger)
[2018-06-08 15:35:02,207] TRACE Controller 0 epoch 53 sending become-leader
LeaderAndIsr request (Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:53) to
broker 0 for partition [events,79] (state.change.logger)
[2018-06-08 15:35:02,219] TRACE Broker 0 received LeaderAndIsr request
PartitionState(controllerEpoch=53, leader=0, leaderEpoch=0, isr=[0],
zkVersion=0, replicas=[0]) correlation id 88 from controller 0 epoch 53 for
partition [events,79] (state.change.logger)
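
For question (a), the gap can be measured directly from the bracketed timestamps in the log, for example between the first OfflineReplica transition and the NewPartition → OnlinePartition transition. A small sketch (log lines abbreviated from the excerpt above):

```python
from datetime import datetime

def parse_ts(line):
    """Extract the bracketed timestamp at the start of a state.change.logger line."""
    return datetime.strptime(line[1:line.index("]")], "%Y-%m-%d %H:%M:%S,%f")

offline = parse_ts("[2018-06-08 15:34:36,518] TRACE Controller 0 epoch 53 ...")
online = parse_ts("[2018-06-08 15:35:02,026] TRACE Controller 0 epoch 53 ...")
print((online - offline).total_seconds())  # → 25.508
```

So the 20+ second gap in this excerpt is between the partition being deleted/recreated and coming back online, rather than per-message latency; whether a consumer actually waits that long depends on why the partition went through the delete/recreate cycle in the first place.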


Re: spark optimized pagination

2018-06-11 Thread vaquar khan
Spark is a processing engine, not storage or a cache; you can dump your results
back to Cassandra, and if you see latency you can use a cache to store the
Spark results.

In short, the answer is no: Spark doesn't provide any API that gives you
cache-like storage.

Directly reading millions of records from a dataset will cause a big delay in
the response.

Regards,
Vaquar khan

On Mon, Jun 11, 2018, 2:59 AM Teemu Heikkilä  wrote:

> So you are now providing the data on-demand through spark?
>
> I suggest you change your API to query from cassandra and store the
> results from Spark back there, that way you will have to process the whole
> dataset just once and cassandra is suitable for that kind of workloads.
>
> -T
>
> On 10 Jun 2018, at 8.12, onmstester onmstester 
> wrote:
>
> Hi,
> I'm using Spark on top of Cassandra as the backend CRUD of a RESTful
> application.
> Most of the REST APIs retrieve a huge amount of data from Cassandra and do a
> lot of aggregation on it in Spark, which takes some seconds.
>
> Problem: sometimes the output result would be a big list which makes the
> client browser throw a stop-script warning, so we should paginate the result
> at the server-side, but it would be so annoying for the user to wait some
> seconds on each page for the Cassandra-Spark processing.
>
> Current Dummy Solution: For now I was thinking about assigning a UUID to
> each request, which would be sent back and forth between the server-side and
> client-side. The first time a REST API is invoked, the result would be saved
> in a temp table, and in subsequent similar requests (requests for next pages)
> the result would be fetched from the temp table (instead of the common flow
> of retrieval from Cassandra + aggregation in Spark, which would take some
> time). On reaching the memory limit, the old results would be deleted.
>
> Is there any built-in clean caching strategy in spark to handle such
> scenarios?
>
> Sent using Zoho Mail 
>
>
>
>
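
The UUID-per-request temp-table idea quoted above can be sketched outside Spark as a plain server-side cache: compute once in Spark, stash the rows under a token, and serve subsequent pages from memory. Everything here is hypothetical illustration (class name, eviction policy, sizes), not a built-in Spark facility:

```python
import uuid
from collections import OrderedDict

class PageCache:
    """Server-side result cache keyed by a per-request UUID (hypothetical)."""
    def __init__(self, max_entries=100):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def put(self, rows):
        """Cache a fully computed result set; return the token for later pages."""
        token = str(uuid.uuid4())
        self._store[token] = rows
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the oldest result set
        return token

    def page(self, token, page_no, page_size):
        """Return one page, or None if the result was evicted (recompute then)."""
        rows = self._store.get(token)
        if rows is None:
            return None
        start = page_no * page_size
        return rows[start:start + page_size]

cache = PageCache(max_entries=50)
token = cache.put(list(range(100)))                # result of one Spark aggregation
print(cache.page(token, page_no=2, page_size=10))  # rows 20-29
```

As Teemu and Vaquar note, writing the aggregated result back to Cassandra and paginating there is the more robust version of the same idea.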


Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi Jorn/Others,

Thanks for your help. Now, data is being distributed in a proper way, but
the challenge is, after a certain point, I'm getting this error, after
which, everything stops moving ahead -

2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on
192.168.49.39: Remote RPC client disassociated. Likely due to containers
exceeding thresholds, or network issues. Check driver logs for WARN
messages.



How to avoid this scenario?

Thanks,
Aakash.

On Mon, Jun 11, 2018 at 4:16 PM, Jörn Franke  wrote:

> If it is in kB then spark will always schedule it to one node. As soon as
> it gets bigger you will see usage of more nodes.
>
> Hence increase your testing Dataset .
>
> On 11. Jun 2018, at 12:22, Aakash Basu  wrote:
>
> Jorn - The code is a series of feature engineering and model tuning
> operations. Too big to show. Yes, data volume is too low, it is in KBs,
> just tried to experiment with a small dataset before going for a large one.
>
> Akshay - I ran with your suggested spark configurations, I get this (the
> node changed, but the problem persists) -
>
> 
>
>
>
> On Mon, Jun 11, 2018 at 3:16 PM, akshay naidu 
> wrote:
>
>> try
>>  --num-executors 3 --executor-cores 4 --executor-memory 2G --conf
>> spark.scheduler.mode=FAIR
>>
>> On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu 
>> wrote:
>>
>>> Hi,
>>>
>>> I have submitted a job on* 4 node cluster*, where I see, most of the
>>> operations happening at one of the worker nodes and other two are simply
>>> chilling out.
>>>
>>> Picture below puts light on that -
>>>
>>> How to properly distribute the load?
>>>
>>> My cluster conf (4 node cluster [1 driver; 3 slaves]) -
>>>
>>> *Cores - 6*
>>> *RAM - 12 GB*
>>> *HDD - 60 GB*
>>>
>>> My Spark Submit command is as follows -
>>>
>>> *spark-submit --master spark://192.168.49.37:7077
>>>  --num-executors 3 --executor-cores 5
>>> --executor-memory 4G /appdata/bblite-codebase/prima_diabetes_indians.py*
>>>
>>> What to do?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>>
>


Visual PySpark Programming

2018-06-11 Thread srungarapu vamsi
Hi,

I have the following use case and I did not find a suitable tool which can
serve my purpose.

Use case:
Step 1,2,3 are UI driven.
*Step 1*)  A user should be able to choose data source (example HDFS) and
should be able to configure it so that it points to a file.
*Step 2*)  A user should be able to apply filters, transformations and
actions on the dataframe loaded in the previous step.
*Step 3*)  A user should be able to perform Step 2 any number of times as a
chain.
*Step 4*)  A user should be able to click a Save button which would convert
the data flow diagram into a pyspark job.

I found tools like https://seahorse.deepsense.ai/,
https://www.streamanalytix.com/product/streamanalytix/ which can do this.
However, they give a scala/java spark job instead of a pyspark job.
Moreover, these are paid products.

a) Are there any opensource solutions which can serve my need?

If not, I would like to build one. In order to build one, I would require a
workflow UI editor which i can tweak to serve my purpose.
But I did not find any free workflow UI editor which I can tweak.

b) Are there any open sourced workflow UI editor which can help me in
solving my use case?

c) Are there any other interesting approaches to solve my use case?


Thanks,
Vamsi


Re: Launch a pyspark Job From UI

2018-06-11 Thread uğur sopaoğlu
Dear Hemant,

I have built a Spark cluster using Docker containers. Can I use Apache Livy to
submit a job to the master node?

hemant singh wrote (11 Jun 2018, 13:55):

> You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy
> 
>> On Mon, Jun 11, 2018 at 3:35 PM, srungarapu vamsi  
>> wrote:
>> Hi,
>> 
>> I am looking for applications where we can trigger spark jobs from UI.
>> Are there any such applications available?
>> 
>> I have checked Spark-jobserver using which we can expose an api to submit a 
>> spark application.
>> 
>> Are there any other alternatives using which i can submit pyspark jobs from 
>> UI ?
>> 
>> Thanks,
>> Vamsi
> 


Re: Launch a pyspark Job From UI

2018-06-11 Thread Sathishkumar Manimoorthy
You can use Zeppelin as well

https://zeppelin.apache.org/docs/latest/interpreter/spark.html

Thanks,
Sathish

On Mon, Jun 11, 2018 at 4:25 PM, hemant singh  wrote:

> You can explore Livy https://dzone.com/articles/quick-start-with-
> apache-livy
>
> On Mon, Jun 11, 2018 at 3:35 PM, srungarapu vamsi <
> srungarapu1...@gmail.com> wrote:
>
>> Hi,
>>
>> I am looking for applications where we can trigger spark jobs from UI.
>> Are there any such applications available?
>>
>> I have checked Spark-jobserver using which we can expose an api to submit
>> a spark application.
>>
>> Are there any other alternatives using which i can submit pyspark jobs
>> from UI ?
>>
>> Thanks,
>> Vamsi
>>
>
>


Re: Launch a pyspark Job From UI

2018-06-11 Thread hemant singh
You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy

On Mon, Jun 11, 2018 at 3:35 PM, srungarapu vamsi 
wrote:

> Hi,
>
> I am looking for applications where we can trigger spark jobs from UI.
> Are there any such applications available?
>
> I have checked Spark-jobserver using which we can expose an api to submit
> a spark application.
>
> Are there any other alternatives using which i can submit pyspark jobs
> from UI ?
>
> Thanks,
> Vamsi
>
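
For reference, Livy's batch submission is a plain REST call: POST a JSON body with a `file` field (plus optional `args`/`conf`) to `/batches`. A minimal sketch; the host name is hypothetical, and 8998 is Livy's default port:

```python
import json

def livy_batch_payload(file_path, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file_path}  # path to the PySpark script, as Livy sees it
    if args:
        payload["args"] = args     # command-line args for the script
    if conf:
        payload["conf"] = conf     # extra Spark conf key/values
    return payload

payload = livy_batch_payload("/path/to/job.py")
print(json.dumps(payload))
# To actually submit (requires a reachable Livy server):
# import urllib.request
# req = urllib.request.Request("http://livy-host:8998/batches",
#                              data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```

Because it is just HTTP, any UI (or Zeppelin notebook, as suggested above) can drive it.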


Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
If it is in kB then Spark will always schedule it to one node. As soon as it
gets bigger you will see usage of more nodes.

Hence, increase your testing dataset.
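
The scheduling behaviour described here can be illustrated with a toy model: Spark runs one task per partition, and a KB-sized input typically collapses into a single partition, so a single worker does everything. Partition counts and node names below are hypothetical; real Spark assignment also considers data locality:

```python
# Toy model of Spark task scheduling: one task per partition, tasks dealt
# round-robin to executors.
def assign_tasks(num_partitions, executors):
    assignment = {e: [] for e in executors}
    for p in range(num_partitions):
        assignment[executors[p % len(executors)]].append(p)
    return assignment

# A KB-sized input often collapses into a single partition -> one busy node:
print(assign_tasks(1, ["node1", "node2", "node3"]))
# After an explicit repartition (e.g. df.repartition(6)), work spreads out:
print(assign_tasks(6, ["node1", "node2", "node3"]))
```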

> On 11. Jun 2018, at 12:22, Aakash Basu  wrote:
> 
> Jorn - The code is a series of feature engineering and model tuning 
> operations. Too big to show. Yes, data volume is too low, it is in KBs, just 
> tried to experiment with a small dataset before going for a large one.
> 
> Akshay - I ran with your suggested spark configurations, I get this (the node 
> changed, but the problem persists) -
> 
> 
> 
> 
> 
>> On Mon, Jun 11, 2018 at 3:16 PM, akshay naidu  
>> wrote:
>> try
>>  --num-executors 3 --executor-cores 4 --executor-memory 2G --conf 
>> spark.scheduler.mode=FAIR
>> 
>>> On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu  
>>> wrote:
>>> Hi,
>>> 
>>> I have submitted a job on 4 node cluster, where I see, most of the 
>>> operations happening at one of the worker nodes and other two are simply 
>>> chilling out.
>>> 
>>> Picture below puts light on that -
>>> 
>>> How to properly distribute the load?
>>> 
>>> My cluster conf (4 node cluster [1 driver; 3 slaves]) -
>>> 
>>> Cores - 6
>>> RAM - 12 GB
>>> HDD - 60 GB
>>> 
>>> My Spark Submit command is as follows -
>>> 
>>> spark-submit --master spark://192.168.49.37:7077 --num-executors 3 
>>> --executor-cores 5 --executor-memory 4G 
>>> /appdata/bblite-codebase/prima_diabetes_indians.py
>>> 
>>> What to do?
>>> 
>>> Thanks,
>>> Aakash.
>> 
> 


Launch a pyspark Job From UI

2018-06-11 Thread srungarapu vamsi
Hi,

I am looking for applications where we can trigger Spark jobs from a UI.
Are there any such applications available?

I have checked Spark-jobserver, with which we can expose an API to submit a
Spark application.

Are there any other alternatives with which I can submit PySpark jobs from
a UI?

Thanks,
Vamsi


Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread akshay naidu
try
 --num-executors 3 --executor-cores 4 --executor-memory 2G --conf
spark.scheduler.mode=FAIR

On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu 
wrote:

> Hi,
>
> I have submitted a job on* 4 node cluster*, where I see, most of the
> operations happening at one of the worker nodes and other two are simply
> chilling out.
>
> Picture below puts light on that -
>
> How to properly distribute the load?
>
> My cluster conf (4 node cluster [1 driver; 3 slaves]) -
>
> *Cores - 6*
> *RAM - 12 GB*
> *HDD - 60 GB*
>
> My Spark Submit command is as follows -
>
> *spark-submit --master spark://192.168.49.37:7077 --num-executors 3
> --executor-cores 5 --executor-memory 4G
> /appdata/bblite-codebase/prima_diabetes_indians.py*
>
> What to do?
>
> Thanks,
> Aakash.
>


Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
What is your code? Maybe this one does an operation which is bound to a single
host, or your data volume is too small for multiple hosts.

> On 11. Jun 2018, at 11:13, Aakash Basu  wrote:
> 
> Hi,
> 
> I have submitted a job on 4 node cluster, where I see, most of the operations 
> happening at one of the worker nodes and other two are simply chilling out.
> 
> Picture below puts light on that -
> 
> How to properly distribute the load?
> 
> My cluster conf (4 node cluster [1 driver; 3 slaves]) -
> 
> Cores - 6
> RAM - 12 GB
> HDD - 60 GB
> 
> My Spark Submit command is as follows -
> 
> spark-submit --master spark://192.168.49.37:7077 --num-executors 3 
> --executor-cores 5 --executor-memory 4G 
> /appdata/bblite-codebase/prima_diabetes_indians.py
> 
> What to do?
> 
> Thanks,
> Aakash.


Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat

Thank you very much for your answer.

Since I don't have dependent jobs I will continue to use this functionality.


On 05/06/2018 13:52, Saisai Shao wrote:
By "dependent" I mean this batch's job relies on the previous batch's 
result, so this batch should wait for the previous batch to finish. If 
you set "spark.streaming.concurrentJobs" larger than 1, then the 
current batch could start without waiting for the previous batch (if 
it is delayed), which will lead to unexpected results.



thomas lavocat wrote on Tue, Jun 5, 2018 at 7:48 PM:



On 05/06/2018 13:44, Saisai Shao wrote:

You need to read the code, this is an undocumented configuration.

I'm on it right now, but, Spark is a big piece of software.

Basically this will break the ordering of streaming jobs; AFAIK
it may get unexpected results if your streaming jobs are not
independent.

What do you mean exactly by "not independent"?
Are several sources joined together dependent?

Thanks,
Thomas


thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tue,
Jun 5, 2018 at 7:17 PM:

Hello,

Thanks for your answer.


On 05/06/2018 11:24, Saisai Shao wrote:

spark.streaming.concurrentJobs is a driver-side internal
configuration; it controls how many streaming jobs can
be submitted concurrently in one batch. Usually this should
not be configured by the user, unless you're familiar with Spark
Streaming internals and know the implications of this
configuration.


How can I find some documentation about those implications ?

I've experimented with some configurations of this parameter and
found out that my overall throughput increases in
correlation with this property.
But I'm experiencing scalability issues. With more than 16
receivers spread over 8 executors, my executors no longer
receive work from the driver and fall idle.
Is there an explanation?

Thanks,
Thomas
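
For reference, the undocumented knob being discussed is set like any other Spark conf; the value 4 below is purely illustrative:

```
spark-submit --conf spark.streaming.concurrentJobs=4 ...
```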







[Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi,

I have submitted a job on a *4 node cluster*, where I see most of the
operations happening at one of the worker nodes while the other two are
simply chilling out.

Picture below puts light on that -

How to properly distribute the load?

My cluster conf (4 node cluster [1 driver; 3 slaves]) -

*Cores - 6*
*RAM - 12 GB*
*HDD - 60 GB*

My Spark Submit command is as follows -

*spark-submit --master spark://192.168.49.37:7077 --num-executors 3
--executor-cores 5 --executor-memory 4G
/appdata/bblite-codebase/prima_diabetes_indians.py*

What to do?

Thanks,
Aakash.


Re: spark optimized pagination

2018-06-11 Thread Teemu Heikkilä
So you are now providing the data on-demand through spark?

I suggest you change your API to query from Cassandra and store the results 
from Spark back there; that way you will have to process the whole dataset 
just once, and Cassandra is suitable for that kind of workload.

-T

> On 10 Jun 2018, at 8.12, onmstester onmstester  wrote:
> 
> Hi,
> I'm using Spark on top of Cassandra as the backend CRUD of a RESTful
> application. Most of the REST APIs retrieve a huge amount of data from
> Cassandra and do a lot of aggregation on it in Spark, which takes some
> seconds.
> 
> Problem: sometimes the output result would be a big list which makes the
> client browser throw a stop-script warning, so we should paginate the result
> at the server-side, but it would be so annoying for the user to wait some
> seconds on each page for the Cassandra-Spark processing.
> 
> Current Dummy Solution: For now I was thinking about assigning a UUID to
> each request, which would be sent back and forth between the server-side and
> client-side. The first time a REST API is invoked, the result would be saved
> in a temp table, and in subsequent similar requests (requests for next pages)
> the result would be fetched from the temp table (instead of the common flow
> of retrieval from Cassandra + aggregation in Spark, which would take some
> time). On reaching the memory limit, the old results would be deleted.
> 
> Is there any built-in clean caching strategy in spark to handle such 
> scenarios?
> 
> Sent using Zoho Mail 
> 
>