Re: pySpark driver memory limit

2017-11-08 Thread Nicolas Paris
On 06 Nov 2017 at 19:56, Nicolas Paris wrote:
> Can anyone clarify the driver memory aspects of pySpark?
> According to [1], spark.driver.memory limits JVM + python memory.
> 
> In case:
> spark.driver.memory=2G
> Does that mean the user won't be able to use more than 2 GB, regardless of
> the Python code and the RDD work involved?
> 
> Thanks,
> 
> [1]: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-td17152.html
> 


After some testing, the Python driver memory is not limited by
spark.driver.memory; in fact, there is no limit at all on those Python
processes. They could, however, be constrained with cgroups.
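For reference, a minimal sketch of one such workaround (not proposed in the thread itself, and assuming a Unix driver host): since spark.driver.memory only bounds the JVM half of the driver, the Python process can cap itself with the standard resource module. The 2 GB figure is illustrative.

import resource
from pyspark.sql import SparkSession

# spark.driver.memory (e.g. --driver-memory 2g on spark-submit) bounds only
# the JVM half of the driver; the Python process is a separate, unbounded one.
spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()

# Illustrative cap on the Python driver process itself (Unix only), applied
# after the gateway JVM is up so the limit is not inherited by it.
two_gb = 2 * 1024 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (two_gb, two_gb))

# From here on, Python-side allocations that exceed the limit raise
# MemoryError instead of growing without bound.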




Why the merge method in StructType is private

2017-11-08 Thread sathy
Hi, 

The file
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
has a merge method, but it is defined private to the sql package and used for
Parquet schema merging. Is there a reason this method cannot be public? If not
this method, is there an equivalent available for merging two schemas?
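For anyone hitting the same limitation, a rough sketch of a manual merge (shown in PySpark for brevity; the function name and the naive conflict handling are made up for illustration, and the private StructType.merge is stricter about reconciling types):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def merge_schemas(left, right):
    # Naive union of fields by name, keeping the left-hand field on a name
    # clash; only an illustration, not the built-in merge semantics.
    fields = {f.name: f for f in left.fields}
    for f in right.fields:
        fields.setdefault(f.name, f)
    return StructType(list(fields.values()))

a = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
b = StructType([StructField("id", IntegerType()), StructField("age", IntegerType())])
print(merge_schemas(a, b).simpleString())   # struct<id:int,name:string,age:int>

When the schemas come from Parquet files, another existing route is spark.read.option("mergeSchema", "true").parquet(...), which performs the merge internally.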







Re: spark job paused(active stages finished)

2017-11-08 Thread Margusja
You have to deal with failed jobs, for example with a try/catch in your code.
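For illustration, a minimal PySpark sketch of that idea (the app name and input path are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retry-demo").getOrCreate()

try:
    # Actions surface task/stage failures as driver-side exceptions once
    # Spark has exhausted its own task retries.
    count = spark.read.json("/data/input/").count()
    print(f"processed {count} records")
except Exception as err:
    # React explicitly: log, clean up, resubmit, or fail fast instead of hanging.
    print(f"job failed after Spark-level retries: {err}")
    raise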

Br Margus Roo


> On 9 Nov 2017, at 05:37, bing...@iflytek.com wrote:
> 
> Dear all,
> I have a simple Spark job, as below. All tasks in stage 2 (something failed and
> was retried) have already finished, but the next stage never runs.
> 
> 
>
> Driver thread dump: attachment (thread.dump)
> Driver last log:
> 
> 
> The driver does not receive the reports for the 16 retried tasks. Thank you for
> any ideas.
> 
> 


Spark http: Not showing completed apps

2017-11-08 Thread purna pradeep
Hi,

I'm using Spark standalone on AWS EC2, and I'm using the Spark REST
API at http://:8080/json to get completed apps, but the JSON shows completed
apps as an empty array even though the job ran successfully.
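For comparison, a small sketch of querying that endpoint (the master host is a placeholder, and the activeapps/completedapps field names should be checked against your master's actual output):

import json
from urllib.request import urlopen

# Placeholder master UI address; substitute the standalone master's host.
MASTER_JSON = "http://master-host:8080/json/"

with urlopen(MASTER_JSON) as resp:
    state = json.load(resp)

# The master's JSON view lists running and finished applications; completed
# apps only appear after the master has marked the application FINISHED.
print("active:   ", [app.get("name") for app in state.get("activeapps", [])])
print("completed:", [app.get("name") for app in state.get("completedapps", [])])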


Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-08 Thread Davide.Mandrini
In this case, the only way to check the status is via REST calls to the Spark
JSON API, accessible at http://:/json/







Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-08 Thread bsikander
Thank you for the reply.

I am currently not using SparkLauncher to launch my driver. Rather, I am
using plain old spark-submit, and moving to SparkLauncher is not an
option right now.
Do I have any options in that case?







Re: [Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Michael Armbrust
The relevant config is spark.sql.shuffle.partitions.  Note that once you
start a query, this number is fixed.  The config will only affect queries
starting from an empty checkpoint.
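As a sketch of that point (in PySpark rather than the Scala in the thread, with a plain streaming aggregation standing in for the stateful map; paths and schema are made up), the setting has to be in place before the query first starts against a fresh checkpoint:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([StructField("sessid", StringType()),
                     StructField("ts", TimestampType())])

spark = (SparkSession.builder
         .appName("sessionization")
         # Takes effect only for queries that start from an empty checkpoint;
         # an existing checkpoint pins the stateful operator's partition count.
         .config("spark.sql.shuffle.partitions", "2000")
         .getOrCreate())

events = spark.readStream.schema(schema).json("/data/events/")

query = (events.groupBy("sessid").count()
         .writeStream
         .outputMode("update")
         # Point at a *new* checkpoint location so the new value applies.
         .option("checkpointLocation", "/checkpoints/sessions-v2/")
         .format("console")
         .start())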

On Wed, Nov 8, 2017 at 7:34 AM, Teemu Heikkilä  wrote:

> I have a Spark Structured Streaming job and I'm crunching through a few
> terabytes of data.
>
> I'm using the file stream reader and it works flawlessly; I can adjust its
> partitioning with spark.default.parallelism.
>
> However, I'm doing sessionization on the data after loading it, and I'm
> currently locked to just 200 partitions for that stage. I've tried to
> repartition before and after the stateful map, but that just adds a new stage,
> and so it's not very useful.
>
> Changing spark.sql.shuffle.partitions doesn't affect the count either.
>
> val sessions = streamingSource // -> spark.default.parallelism defined amount of partitions/tasks (i.e. 2000)
>   .repartition(n) // -> n partitions/tasks
>   .groupByKey(event => event.sessid) // -> stage opens / fixed 200 tasks
>   .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout())(SessionState.updateSessionEvents) // -> fixed 200 tasks / stage closes
>
>
> I tried to grep through the Spark source code but couldn't find that parameter
> anywhere.
>
>


[Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Teemu Heikkilä
I have a Spark Structured Streaming job and I'm crunching through a few terabytes
of data.

I'm using the file stream reader and it works flawlessly; I can adjust its
partitioning with spark.default.parallelism.

However, I'm doing sessionization on the data after loading it, and I'm
currently locked to just 200 partitions for that stage. I've tried to
repartition before and after the stateful map, but that just adds a new stage,
and so it's not very useful.

Changing spark.sql.shuffle.partitions doesn't affect the count either.
 
val sessions = streamingSource // -> spark.default.parallelism defined amount of partitions/tasks (i.e. 2000)
  .repartition(n) // -> n partitions/tasks
  .groupByKey(event => event.sessid) // -> stage opens / fixed 200 tasks
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout())(SessionState.updateSessionEvents) // -> fixed 200 tasks / stage closes


I tried to grep through the Spark source code but couldn't find that parameter anywhere.





Measure executor idle time

2017-11-08 Thread samar kumar
Hi,
Is there a way to measure idle time for a Spark executor? Are any metrics or
accumulators currently exposed for this?
Regards,
Samar
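
One possible starting point (not an exact idle-time metric): the application REST API's executor summary exposes per-executor cumulative task time and core counts, from which a rough busy/idle ratio can be derived. A sketch, with a placeholder driver UI address:

import json
from urllib.request import urlopen

# Placeholder driver/history-server address; adjust for your deployment.
BASE = "http://driver-host:4040/api/v1"

apps = json.load(urlopen(BASE + "/applications"))
app_id = apps[0]["id"]

executors = json.load(urlopen(f"{BASE}/applications/{app_id}/executors"))

# totalDuration is the cumulative task run time (ms) on each executor;
# comparing it with wall-clock uptime * totalCores gives a rough busy share,
# and the remainder approximates idle time.
for e in executors:
    if e["id"] == "driver":
        continue
    print(e["id"], "cores:", e["totalCores"], "task time (ms):", e["totalDuration"])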