[jira] [Resolved] (SPARK-34069) Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL

2021-01-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-34069.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31127
[https://github.com/apache/spark/pull/31127]

> Kill barrier tasks should respect SPARK_JOB_INTERRUPT_ON_CANCEL
> ---
>
> Key: SPARK-34069
> URL: https://issues.apache.org/jira/browse/SPARK-34069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> We should interrupt the task thread if the user sets the local property 
> `SPARK_JOB_INTERRUPT_ON_CANCEL` to true.
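>
> A minimal sketch of how a user opts in from the application side (assuming a
> running SparkContext `sc`; sketch only, the barrier stage itself is elided):
> {code}
> // Mark jobs in this group so that cancellation interrupts their task threads.
> sc.setJobGroup("barrier-job", "barrier stage example", interruptOnCancel = true)
> // ... launch the barrier stage from another thread ...
> sc.cancelJobGroup("barrier-job") // with this fix, barrier tasks get interrupted too
> {code}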



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32917) Add support for executors to push shuffle blocks after successful map task completion

2021-01-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32917:
---

Assignee: Chandni Singh

> Add support for executors to push shuffle blocks after successful map task 
> completion
> -
>
> Key: SPARK-32917
> URL: https://issues.apache.org/jira/browse/SPARK-32917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> This is the shuffle write path for push-based shuffle, where the executors 
> would leverage the RPC protocol to push shuffle blocks to remote shuffle 
> services.
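>
> As a rough illustration of the idea only (the names below are hypothetical, not 
> the actual Spark classes):
> {code}
> // Hypothetical sketch: after a map task writes its shuffle output, group the
> // produced blocks by target shuffle service and push each group over RPC.
> case class ShuffleBlock(shuffleId: Int, mapId: Long, reduceId: Int, data: Array[Byte])
>
> trait PushClient {
>   def pushBlocks(host: String, port: Int, blocks: Seq[ShuffleBlock]): Unit
> }
>
> def pushAfterMapTask(blocks: Seq[ShuffleBlock],
>                      targetFor: ShuffleBlock => (String, Int),
>                      client: PushClient): Unit = {
>   blocks.groupBy(targetFor).foreach { case ((host, port), group) =>
>     client.pushBlocks(host, port, group) // best effort; failures fall back to regular fetch
>   }
> }
> {code}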



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32917) Add support for executors to push shuffle blocks after successful map task completion

2021-01-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32917.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30312
[https://github.com/apache/spark/pull/30312]

> Add support for executors to push shuffle blocks after successful map task 
> completion
> -
>
> Key: SPARK-32917
> URL: https://issues.apache.org/jira/browse/SPARK-32917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 3.2.0
>
>
> This is the shuffle write path for push-based shuffle, where the executors 
> would leverage the RPC protocol to push shuffle blocks to remote shuffle 
> services.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Recovering SparkR on CRAN?

2020-12-22 Thread Mridul Muralidharan
I agree. Is there something we can do to ensure the CRAN publish goes through
consistently and predictably?
If possible, it would be good to continue supporting it.

Regards,
Mridul

On Tue, Dec 22, 2020 at 7:48 PM Felix Cheung  wrote:

> Ok - it took many years to get it first published, so it was hard to get
> there.
>
>
> On Tue, Dec 22, 2020 at 5:45 PM Hyukjin Kwon  wrote:
>
>> Adding @Shivaram Venkataraman  and @Felix
>> Cheung  FYI
>>
>> On Wed, Dec 23, 2020 at 9:22 AM, Michael Heuer wrote:
>>
>>> Anecdotally, as a project downstream of Spark, we've been prevented from
>>> pushing to CRAN because of this
>>>
>>> https://github.com/bigdatagenomics/adam/issues/1851
>>>
>>> We've given up and marked it as WontFix.
>>>
>>>michael
>>>
>>>
>>> On Dec 22, 2020, at 5:14 PM, Dongjoon Hyun 
>>> wrote:
>>>
>>> Given the current circumstance, I'm thinking of dropping it officially
>>> from the community release scope.
>>>
>>> It's because
>>>
>>> - It turns out that our CRAN check is insufficient to guarantee the
>>> availability of SparkR on CRAN.
>>>   Apache Spark 3.1.0 may not be available on CRAN either.
>>>
>>> - In daily CIs, the CRAN check has frequently been broken due to issues on both
>>> our side and CRAN's side. Currently, branch-2.4 is broken.
>>>
>>> - It also has the side effect of delaying the official release announcement
>>> after an RC passes, because each release manager takes a look at whether
>>> he/she can recover it for that release.
>>>
>>> If we are unable to support SparkR on CRAN in a sustainable way, what
>>> about dropping it officially instead?
>>>
>>> Then, it will alleviate the burden on release managers and improve daily
>>> CI stability by removing the CRAN check.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Dec 21, 2020 at 7:09 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 The last `SparkR` package of Apache Spark in CRAN is `2.4.6`.


 https://cran-archive.r-project.org/web/checks/2020/2020-07-10_check_results_SparkR.html

 The latest three Apache Spark distributions (2.4.7/3.0.0/3.0.1) are not
 published to CRAN and the lack of SparkR on CRAN has been considered a
 non-release blocker.

 I'm wondering if we are aiming to recover it in Apache Spark 3.1.0.

 Bests,
 Dongjoon.

>>>
>>>


[jira] [Assigned] (SPARK-33669) Wrong error message from YARN application state monitor when sc.stop in yarn client mode

2020-12-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-33669:
---

Assignee: Su Qilong

> Wrong error message from YARN application state monitor when sc.stop in yarn 
> client mode
> 
>
> Key: SPARK-33669
> URL: https://issues.apache.org/jira/browse/SPARK-33669
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Assignee: Su Qilong
>Priority: Minor
> Fix For: 3.1.0
>
>
> For YarnClient mode, when stopping YarnClientSchedulerBackend, it first tries 
> to interrupt the Yarn application monitor thread. In MonitorThread.run() it 
> catches InterruptedException to respond gracefully to the stop request.
> But the client.monitorApplication method can also throw InterruptedIOException 
> while a Hadoop RPC call is in progress. In that case, MonitorThread does not know it 
> was interrupted; a YARN app failure is reported, and "Failed to contact YARN 
> for application x;  YARN application has exited unexpectedly with state 
> x" is logged at error level, which confuses users a lot.
> We should also handle InterruptedIOException here to give it the same 
> behavior as InterruptedException.
> {code:java}
> private class MonitorThread extends Thread {
>   private var allowInterrupt = true
>   override def run() {
> try {
>   val YarnAppReport(_, state, diags) =
> client.monitorApplication(appId.get, logApplicationReport = false)
>   logError(s"YARN application has exited unexpectedly with state $state! 
> " +
> "Check the YARN application logs for more details.")
>   diags.foreach { err =>
> logError(s"Diagnostics message: $err")
>   }
>   allowInterrupt = false
>   sc.stop()
> } catch {
>   case e: InterruptedException => logInfo("Interrupting monitor thread")
> }
>   }
>   
> {code}
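>
> For illustration only, here is a minimal sketch of the proposed handling (the 
> `poll` function below stands in for client.monitorApplication; this is a sketch, 
> not the actual patch):
> {code}
> // Sketch: both exception types indicate the monitor thread was interrupted by
> // sc.stop(), so neither should be reported as an unexpected YARN app failure.
> import java.io.InterruptedIOException
>
> def monitorOnce(poll: () => String,
>                 logError: String => Unit,
>                 logInfo: String => Unit): Unit = {
>   try {
>     val state = poll() // blocking Hadoop RPC call, may throw InterruptedIOException
>     logError(s"YARN application has exited unexpectedly with state $state!")
>   } catch {
>     case _: InterruptedException | _: InterruptedIOException =>
>       logInfo("Interrupting monitor thread")
>   }
> }
> {code}
>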
> {code:java}
> // wrong error message
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: 
> org.apache.spark.deploy.yarn.Client(91) - Failed to contact YARN for 
> application application_1605868815011_1154961. 
> java.io.InterruptedIOException: Call interrupted
> at org.apache.hadoop.ipc.Client.call(Client.java:1466)
> at org.apache.hadoop.ipc.Client.call(Client.java:1409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at com.sun.proxy.$Proxy38.getApplicationReport(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)
> at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy39.getApplicationReport(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)
> at 
> org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:327)
> at 
> org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1039)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:116)
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - YARN 
> application has exited unexpectedly with state FAILED! Check the YARN 
> application logs for more details. 
> 2020-12-05 03:06:58,001 ERROR [YARN application state monitor]: 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - 
> Diagnostics message: Failed to contact YARN for application 
> application_1605868815011_1154961.
> {code}
>  
> {code:java}
> // hadoop ipc code
> public Writable call(RPC.RpcKind rpcKind, Writable rpcRequest,
> ConnectionId remoteId, int serviceClass,
> AtomicBoolean fallbackToSimpleAuth) throws IOException {
>   final Call call = createCall(rpcKind, rpcRequest

[jira] [Resolved] (SPARK-33669) Wrong error message from YARN application state monitor when sc.stop in yarn client mode

2020-12-08 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-33669.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30617
[https://github.com/apache/spark/pull/30617]

> Wrong error message from YARN application state monitor when sc.stop in yarn 
> client mode
> 
>
> Key: SPARK-33669
> URL: https://issues.apache.org/jira/browse/SPARK-33669
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
> Fix For: 3.1.0
>
>
> For YarnClient mode, when stopping YarnClientSchedulerBackend, it first tries 
> to interrupt the Yarn application monitor thread. In MonitorThread.run() it 
> catches InterruptedException to respond gracefully to the stop request.
> But the client.monitorApplication method can also throw InterruptedIOException 
> while a Hadoop RPC call is in progress. In that case, MonitorThread does not know it 
> was interrupted; a YARN app failure is reported, and "Failed to contact YARN 
> for application x;  YARN application has exited unexpectedly with state 
> x" is logged at error level, which confuses users a lot.
> We should also handle InterruptedIOException here to give it the same 
> behavior as InterruptedException.
> {code:java}
> private class MonitorThread extends Thread {
>   private var allowInterrupt = true
>   override def run() {
> try {
>   val YarnAppReport(_, state, diags) =
> client.monitorApplication(appId.get, logApplicationReport = false)
>   logError(s"YARN application has exited unexpectedly with state $state! 
> " +
> "Check the YARN application logs for more details.")
>   diags.foreach { err =>
> logError(s"Diagnostics message: $err")
>   }
>   allowInterrupt = false
>   sc.stop()
> } catch {
>   case e: InterruptedException => logInfo("Interrupting monitor thread")
> }
>   }
>   
> {code}
> {code:java}
> // wrong error message
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: 
> org.apache.spark.deploy.yarn.Client(91) - Failed to contact YARN for 
> application application_1605868815011_1154961. 
> java.io.InterruptedIOException: Call interrupted
> at org.apache.hadoop.ipc.Client.call(Client.java:1466)
> at org.apache.hadoop.ipc.Client.call(Client.java:1409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at com.sun.proxy.$Proxy38.getApplicationReport(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)
> at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy39.getApplicationReport(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)
> at 
> org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:327)
> at 
> org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1039)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:116)
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - YARN 
> application has exited unexpectedly with state FAILED! Check the YARN 
> application logs for more details. 
> 2020-12-05 03:06:58,001 ERROR [YARN application state monitor]: 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - 
> Diagnostics message: Failed to contact YARN for application 
> application_1605868815011_1154961.
> {code}
>  
> {code:java}
> // hadoop ipc code
> public Writable call(RPC.RpcKind rpcKind, Writable rpcRequest,
> ConnectionId remoteId, int serviceClass,
> AtomicBoolean fallbackToSimpleAuth) throws IO

[jira] [Resolved] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode

2020-11-30 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-33185.
-
Resolution: Fixed

Issue resolved by pull request 30450
[https://github.com/apache/spark/pull/30450]

> YARN: Print direct links to driver logs alongside application report in 
> cluster mode
> 
>
> Key: SPARK-33185
> URL: https://issues.apache.org/jira/browse/SPARK-33185
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} 
> will print out the application report into the logs, to be easily viewed by 
> users. For example:
> {code}
> INFO yarn.Client: 
>client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>diagnostics: N/A
>ApplicationMaster host: X.X.X.X
>ApplicationMaster RPC port: 0
>queue: default
>start time: 1602782566027
>final status: UNDEFINED
>tracking URL: http://hostname:/proxy/application_/
>user: xkrogen
> {code}
> Typically, the tracking URL can be used to find the logs of the 
> ApplicationMaster/driver while the application is running. Later, the Spark 
> History Server can be used to track this information down, using the 
> stdout/stderr links on the Executors page.
> However, in the situation when the driver crashed _before_ writing out a 
> history file, the SHS may not be aware of this application, and thus does not 
> contain links to the driver logs. When this situation arises, it can be 
> difficult for users to debug further, since they can't easily find their 
> driver logs.
> It is possible to reach the logs by using the {{yarn logs}} commands, but the 
> average Spark user isn't aware of this and shouldn't have to be.
> I propose adding, alongside the application report, some additional lines 
> like:
> {code}
>  Driver Logs (stdout): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096
>  Driver Logs (stderr): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096
> {code}
> With this information available, users can quickly jump to their driver logs, 
> even if it crashed before the SHS became aware of the application. This has 
> the additional benefit of providing a quick way to access driver logs, which 
> often contain useful information, in a single click (instead of navigating 
> through the Spark UI).
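>
> For illustration, a small sketch of how such links could be assembled (the helper 
> and its inputs are hypothetical; in practice the node HTTP address, container id, 
> and user come from the YARN application/container report):
> {code}
> // Hypothetical sketch: build NodeManager log links like the ones proposed above.
> def driverLogLinks(nodeHttpAddress: String, containerId: String, user: String): Seq[String] = {
>   val base = s"http://$nodeHttpAddress/node/containerlogs/$containerId/$user"
>   Seq("stdout", "stderr").map(stream => s"Driver Logs ($stream): $base/$stream?start=-4096")
> }
> {code}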



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-11-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32918:
---

Assignee: Ye Zhou

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Ye Zhou
>Priority: Major
> Fix For: 3.1.0
>
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications 
> to external shuffle services to finalize the shuffle block merge for a given 
> shuffle are carried through this RPC. It also returns the metadata about 
> a merged shuffle partition to the caller.
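>
> As an illustration only (these case classes are hypothetical stand-ins, not the 
> actual protocol messages):
> {code}
> // Hypothetical sketch of the control-plane exchange: the driver asks each external
> // shuffle service to finalize the merge for a shuffle and receives back metadata
> // describing the merged partitions that service hosts.
> case class FinalizeMergeRequest(appId: String, shuffleId: Int)
> case class MergedPartitionMeta(reduceId: Int, size: Long, numChunks: Int)
> case class MergeMetadataResponse(shuffleId: Int, partitions: Seq[MergedPartitionMeta])
> {code}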



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-11-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32918.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30163
[https://github.com/apache/spark/pull/30163]

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 3.1.0
>
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications 
> to external shuffle services to finalize the shuffle block merge for a given 
> shuffle are carried through this RPC. It also returns the metadata about 
> a merged shuffle partition to the caller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-11-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32919:
---

Assignee: Venkata krishnan Sowrirajan

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this 
> would be behind a pluggable interface so that we can potentially leverage 
> information tracked outside of a Spark application for better load balancing 
> or for a disaggregated deployment environment.
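>
> A rough sketch of the idea (hypothetical helper, not the actual implementation): 
> pick distinct hosts from current and past executor locations, up to the desired 
> number of mergers:
> {code}
> // Hypothetical sketch: choose merger locations from known executor hosts,
> // preferring hosts that currently run executors.
> def selectMergers(currentExecutorHosts: Seq[String],
>                   pastExecutorHosts: Seq[String],
>                   numMergersNeeded: Int): Seq[String] = {
>   (currentExecutorHosts ++ pastExecutorHosts).distinct.take(numMergersNeeded)
> }
> {code}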



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2020-11-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32919.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30164
[https://github.com/apache/spark/pull/30164]

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this 
> would be behind a pluggable interface so that we can potentially leverage 
> information tracked outside of a Spark application for better load balancing 
> or for a disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-11-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-31069:
---

Assignee: angerszhu

> high cpu caused by chunksBeingTransferred in external shuffle service
> -
>
> Key: SPARK-31069
> URL: https://issues.apache.org/jira/browse/SPARK-31069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> "shuffle-chunk-fetch-handler-2-40" #250 daemon prio=5 os_prio=0 
> tid=0x02ac nid=0xb9b3 runnable [0x7ff20a1af000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:184)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
> "shuffle-chunk-fetch-handler-2-48" #235 daemon prio=5 os_prio=0 
> tid=0x7ff2302ec800 nid=0xb9ad runnable [0x7ff20a7b4000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:186)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
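>
> For illustration only, the kind of guard that avoids the per-request scan (a sketch 
> under the assumption that the configured limit is usually unlimited; not necessarily 
> the exact change in the PR):
> {code}
> // Sketch: only walk the stream map to count in-flight chunks when a finite limit
> // is actually configured; with an unlimited setting the expensive scan is skipped.
> def shouldRejectFetch(maxChunksBeingTransferred: Long,
>                       countInFlightChunks: () => Long): Boolean = {
>   maxChunksBeingTransferred != Long.MaxValue &&
>     countInFlightChunks() >= maxChunksBeingTransferred
> }
> {code}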



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-11-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-31069:
---

Assignee: angerszhu  (was: angerszhu)

> high cpu caused by chunksBeingTransferred in external shuffle service
> -
>
> Key: SPARK-31069
> URL: https://issues.apache.org/jira/browse/SPARK-31069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> "shuffle-chunk-fetch-handler-2-40" #250 daemon prio=5 os_prio=0 
> tid=0x02ac nid=0xb9b3 runnable [0x7ff20a1af000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:184)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
> "shuffle-chunk-fetch-handler-2-48" #235 daemon prio=5 os_prio=0 
> tid=0x7ff2302ec800 nid=0xb9ad runnable [0x7ff20a7b4000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:186)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-11-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-31069.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30139
[https://github.com/apache/spark/pull/30139]

> high cpu caused by chunksBeingTransferred in external shuffle service
> -
>
> Key: SPARK-31069
> URL: https://issues.apache.org/jira/browse/SPARK-31069
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
> Fix For: 3.1.0
>
>
> "shuffle-chunk-fetch-handler-2-40" #250 daemon prio=5 os_prio=0 
> tid=0x02ac nid=0xb9b3 runnable [0x7ff20a1af000]
>java.lang.Thread.State: RUNNABLE
> at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:184)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
> "shuffle-chunk-fetch-handler-2-48" #235 daemon prio=5 os_prio=0 
> tid=0x7ff2302ec800 nid=0xb9ad runnable [0x7ff20a7b4000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.network.server.OneForOneStreamManager.chunksBeingTransferred(OneForOneStreamManager.java:186)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:85)
> at 
> org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
> at 
> org.spark_project.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at 
> org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Mridul Muralidharan
I try to follow the second option.
In general, when multiple reviewers are looking at the code, addressing
review comments can sometimes open up other avenues of
discussion, optimization, or design changes: at least in core, I have seen
this happen often.

A day or so delay is worth the increased scrutiny and better design/reduced
bugs.

Regards,
Mridul

On Sat, Nov 14, 2020 at 1:47 AM Jungtaek Lim 
wrote:

> I see some voices saying there isn't enough context to understand the topic. Let me
> elaborate a bit more.
>
> 1. There're multiple reviewers reviewing the PR. (Say, A, B, C, D)
> 2. A and B leave review comments on the PR, but no one explicitly
> indicates that these review comments are the final ones.
> 3. The author of the PR addresses the review comments.
> 4. C checks that the review comments from A and B are addressed, and
> merges the PR. In parallel (or a bit later), A is trying to check whether
> the review comments are addressed (or even more, A could provide more
> review comments afterwards), and realized the PR is already merged.
>
> Again, there's "technically" nothing incorrect here. Let me give another
> example of what I called a "trade-off".
>
> 1. There're multiple reviewers reviewing the PR. (Say, A, B, C, D)
> 2. A and B leave review comments on the PR, but no one explicitly
> indicates that these review comments are the final ones.
> 3. The author of the PR addresses the review comments.
> 4. C checks that the review comments from A and B are addressed, and asks
> A and B to confirm whether there's no further review comments, with the
> condition that it will be merged in a few days later if there's no further
> feedback.
> 5. If A and B confirm, or provide no new feedback in that
> period, C merges the PR. If A or B provides new feedback, go back to 3 and
> reset the waiting period.
>
> This is what we tend to comment as "@A @B I'll leave this a few days more
> to see if anyone has further comments. Otherwise I'll merge this.".
>
> I see both approaches used across various PRs, so it's not really something I
> want to blame anyone for. I just want us to think about which approach
> would be the ideal one to prefer.
>
>
> On Sat, Nov 14, 2020 at 3:46 PM Jungtaek Lim 
> wrote:
>
>> Oh sorry that was gone with flame (please just consider it as my fault)
>> and I just removed all comments.
>>
>> Btw, when I initiate discussions, I really prefer to start the
>> discussion "without" specific instances, which tend to turn into blaming each
>> other. I understand it's not easy to discuss without concrete examples, but
>> I'll try to explain the situation as best I can instead. Please let me know if
>> anything is ambiguous or unclear.
>>
>> On Sat, Nov 14, 2020 at 3:41 PM Sean Owen  wrote:
>>
>>> I am sure you are referring to some specific instances but I have not
>>> followed enough to know what they are. Can you point them out? I think that
>>> is most productive for everyone to understand.
>>>
>>> On Fri, Nov 13, 2020 at 10:16 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi devs,

 I know this is a super sensitive topic and at a risk of flame, but just
 like to try this. My apologies first.
 Assuming we all know the ASF policy on code commits, and given that I don't
 see the Spark project having any explicit bylaws, it's technically possible
 for committers to do anything during merging.

 Sometimes this gets a bit depressing for reviewers, regardless of the
 intention, when the merger decides on their own to merge while the
 reviewers are still in the review phase. I have observed this practice used
 frequently, given that we have post-review to address further
 comments later.

 I know about the concern that requiring the merger to gather consensus
 from reviewers can unintentionally block progress, but we also have another
 practice of holding the merge for a couple of days and asking reviewers
 whether they have further comments or not,
 which I think is a good trade-off.

 Excluding the cases where we're in release-blocker mode, would we really be
 hurt that much if we ask the merger to respect the practice of notifying
 reviewers that the merge will happen soon and waiting a day or so? I feel
 the post-review opens the possibility for reviewers late to the party
 to review later, but it's over-used if it is taken as a judgement that the
 merger can merge at any time and reviewers can still continue reviewing.
 Reviewers would feel a broken flow - that is not the same experience as
 having more time to finalize reviewing before merging.

 Again, I know it's super hard to reconsider an ongoing practice when
 the project has come such a long way (10 years), but I just wanted to hear
 some voices about this.

 Thanks,
 Jungtaek Lim (HeartSaVioR)

>>>


[jira] [Assigned] (SPARK-32915) RPC implementation to support pushing and merging shuffle blocks

2020-11-09 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32915:
---

Assignee: Min Shen

> RPC implementation to support pushing and merging shuffle blocks
> 
>
> Key: SPARK-32915
> URL: https://issues.apache.org/jira/browse/SPARK-32915
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
> Fix For: 3.1.0
>
>
> RPC implementation for the basic functionality in the network-common and 
> network-shuffle modules to enable pushing blocks on the client side and 
> merging received blocks on the server side.
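>
> As a conceptual sketch only (hypothetical types, not the real network-shuffle 
> classes), the server-side half can be pictured as appending each pushed block 
> to a per-partition merged buffer:
> {code}
> // Hypothetical sketch of server-side merging: blocks pushed for the same
> // (shuffleId, reduceId) pair are appended into a single merged buffer.
> import scala.collection.mutable
>
> class MergedShuffleBuffers {
>   private val buffers = mutable.Map.empty[(Int, Int), mutable.ArrayBuffer[Byte]]
>
>   def onBlockPushed(shuffleId: Int, reduceId: Int, data: Array[Byte]): Unit = synchronized {
>     buffers.getOrElseUpdate((shuffleId, reduceId), mutable.ArrayBuffer.empty[Byte]) ++= data
>   }
> }
> {code}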



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32916) Add support for external shuffle service in YARN deployment mode to leverage push-based shuffle

2020-11-09 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32916.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30062
[https://github.com/apache/spark/pull/30062]

> Add support for external shuffle service in YARN deployment mode to leverage 
> push-based shuffle
> ---
>
> Key: SPARK-32916
> URL: https://issues.apache.org/jira/browse/SPARK-32916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.1.0
>
>
> Integration needed to bootstrap external shuffle service in YARN deployment 
> mode. Properly create the necessary dirs and initialize the relevant 
> server-side components in the RPC layer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32916) Add support for external shuffle service in YARN deployment mode to leverage push-based shuffle

2020-11-09 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32916:
---

Assignee: Chandni Singh

> Add support for external shuffle service in YARN deployment mode to leverage 
> push-based shuffle
> ---
>
> Key: SPARK-32916
> URL: https://issues.apache.org/jira/browse/SPARK-32916
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
>
> Integration needed to bootstrap external shuffle service in YARN deployment 
> mode. Properly create the necessary dirs and initialize the relevant 
> server-side components in the RPC layer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode

2020-11-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-33185:
---

Assignee: Erik Krogen

> YARN: Print direct links to driver logs alongside application report in 
> cluster mode
> 
>
> Key: SPARK-33185
> URL: https://issues.apache.org/jira/browse/SPARK-33185
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} 
> will print out the application report into the logs, to be easily viewed by 
> users. For example:
> {code}
> INFO yarn.Client: 
>client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>diagnostics: N/A
>ApplicationMaster host: X.X.X.X
>ApplicationMaster RPC port: 0
>queue: default
>start time: 1602782566027
>final status: UNDEFINED
>tracking URL: http://hostname:/proxy/application_/
>user: xkrogen
> {code}
> Typically, the tracking URL can be used to find the logs of the 
> ApplicationMaster/driver while the application is running. Later, the Spark 
> History Server can be used to track this information down, using the 
> stdout/stderr links on the Executors page.
> However, in the situation when the driver crashed _before_ writing out a 
> history file, the SHS may not be aware of this application, and thus does not 
> contain links to the driver logs. When this situation arises, it can be 
> difficult for users to debug further, since they can't easily find their 
> driver logs.
> It is possible to reach the logs by using the {{yarn logs}} commands, but the 
> average Spark user isn't aware of this and shouldn't have to be.
> I propose adding, alongside the application report, some additional lines 
> like:
> {code}
>  Driver Logs (stdout): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096
>  Driver Logs (stderr): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096
> {code}
> With this information available, users can quickly jump to their driver logs, 
> even if it crashed before the SHS became aware of the application. This has 
> the additional benefit of providing a quick way to access driver logs, which 
> often contain useful information, in a single click (instead of navigating 
> through the Spark UI).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode

2020-11-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-33185.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30096
[https://github.com/apache/spark/pull/30096]

> YARN: Print direct links to driver logs alongside application report in 
> cluster mode
> 
>
> Key: SPARK-33185
> URL: https://issues.apache.org/jira/browse/SPARK-33185
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} 
> will print out the application report into the logs, to be easily viewed by 
> users. For example:
> {code}
> INFO yarn.Client: 
>client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>diagnostics: N/A
>ApplicationMaster host: X.X.X.X
>ApplicationMaster RPC port: 0
>queue: default
>start time: 1602782566027
>final status: UNDEFINED
>tracking URL: http://hostname:/proxy/application_/
>user: xkrogen
> {code}
> Typically, the tracking URL can be used to find the logs of the 
> ApplicationMaster/driver while the application is running. Later, the Spark 
> History Server can be used to track this information down, using the 
> stdout/stderr links on the Executors page.
> However, in the situation when the driver crashed _before_ writing out a 
> history file, the SHS may not be aware of this application, and thus does not 
> contain links to the driver logs. When this situation arises, it can be 
> difficult for users to debug further, since they can't easily find their 
> driver logs.
> It is possible to reach the logs by using the {{yarn logs}} commands, but the 
> average Spark user isn't aware of this and shouldn't have to be.
> I propose adding, alongside the application report, some additional lines 
> like:
> {code}
>  Driver Logs (stdout): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096
>  Driver Logs (stderr): 
> http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096
> {code}
> With this information available, users can quickly jump to their driver logs, 
> even if it crashed before the SHS became aware of the application. This has 
> the additional benefit of providing a quick way to access driver logs, which 
> often contain useful information, in a single click (instead of navigating 
> through the Spark UI).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-04 Thread Mridul Muralidharan
+1

Regards,
Mridul

On Wed, Nov 4, 2020 at 12:41 PM Xinyi Yu  wrote:

> Hi all,
>
> We had the discussion of SPIP: Standardize Spark Exception Messages at
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Standardize-Spark-Exception-Messages-td30341.html
> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Standardize-Spark-Exception-Messages-td30341.html>
>
> . The SPIP document link is at
>
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
> <
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing>
>
> . We want to have the vote on this, for 72 hours.
>
> Please vote before November 7th at noon:
>
> [ ] +1: Accept this SPIP proposal
> [ ] -1: Do not agree to standardize Spark exception messages, because ...
>
>
> Thanks for your time and feedback!
>
> --
> Xinyi
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-11-01 Thread Mridul Muralidharan
I like the idea of consistent messages; it makes understanding errors
easier.
Having said that, Exception messages themselves are not part of the exposed
contract to users; and are subject to change.
We should leave that flexibility open to Spark developers ... I am
currently viewing this proposal as an internal standardization exercise
within the Spark codebase, and not as a public contract with users.
Is that aligned with the objectives ? Or are we looking at this as a public
contract to users ?
I am not in favour of the latter.

Regards,
Mridul


On Sun, Oct 25, 2020 at 7:05 PM Xinyi Yu  wrote:

> Hi all,
>
> We like to post a SPIP of Standardize Exception Messages in Spark. Here is
> the document link:
>
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
> <
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing>
>
>
> This SPIP aims to standardize the exception messages in Spark. It has three
> major focuses:
> 1. Group exception messages in dedicated files for easy maintenance and
> auditing.
> 2. Establish an error message guideline for developers.
> 3. Improve error message quality.
>
> Thanks for your time and patience. Looking forward to your feedback!
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events

2020-10-15 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-33088.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29977
[https://github.com/apache/spark/pull/29977]

> Enhance ExecutorPlugin API to include methods for task start and end events
> ---
>
> Key: SPARK-33088
> URL: https://issues.apache.org/jira/browse/SPARK-33088
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Samuel Souza
>Priority: Major
> Fix For: 3.1.0
>
>
> On [SPARK-24918|https://issues.apache.org/jira/browse/SPARK-24918]'s 
> [SPIP|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#],
>  it was raised to potentially add methods to the ExecutorPlugin interface on task 
> start and end:
> {quote}The basic interface can just be a marker trait, as that allows a 
> plugin to monitor general characteristics of the JVM (eg. monitor memory or 
> take thread dumps).   Optionally, we could include methods for task start and 
> end events.   This would allow more control on monitoring – eg., you could 
> start polling thread dumps only if there was a task from a particular stage 
> that had been taking too long. But anything task related is a bit trickier to 
> decide the right api. Should the task end event also get the failure reason? 
> Should those events get called in the same thread as the task runner, or in 
> another thread?
> {quote}
> The ask is to add exactly that. I've put up a draft PR [in our fork of 
> spark|https://github.com/palantir/spark/pull/713] and I'm happy to push it 
> upstream. Also happy to receive comments on what's the right interface to 
> expose - not opinionated on that front, tried to expose the simplest 
> interface for now.
> The main reason for this ask is to propagate tracing information from the 
> driver to the executors 
> ([SPARK-21962|https://issues.apache.org/jira/browse/SPARK-21962] has some 
> context). On 
> [HADOOP-15566|https://issues.apache.org/jira/browse/HADOOP-15566] I see we're 
> discussing how to add tracing to the Apache ecosystem, but my problem is 
> slightly different: I want to use this interface to propagate tracing 
> information to my framework of choice. If the Hadoop issue gets solved we'll 
> have a framework to communicate tracing information inside the Apache 
> ecosystem, but it's highly unlikely that all Spark users will use the same 
> common framework. Therefore we should still provide plugin interfaces where 
> the tracing information can be propagated appropriately.
> To give more color, in our case the tracing information is [stored in a 
> thread 
> local|https://github.com/palantir/tracing-java/blob/4.9.0/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61],
>  therefore it needs to be set in the same thread which is executing the task. 
> [*]
> While our framework is specific, I imagine such an interface could be useful 
> in general. Happy to hear your thoughts about it.
> [*] Something I did not mention was how to propagate the tracing information 
> from the driver to the executors. For that I intend to use 1. the driver's 
> localProperties, which 2. will be eventually propagated to the executors' 
> TaskContext, which 3. I'll be able to access from the methods above.
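>
> A minimal sketch of the propagation path in [*] (assuming a running SparkContext 
> `sc`; the "trace.id" key is just an example, and the plugin hook itself is out of 
> scope here):
> {code}
> // Sketch: set a trace id as a driver local property; it is propagated with the
> // tasks and is readable from TaskContext inside the task thread, which is where
> // a task-start plugin hook would hand it to a thread-local tracer.
> sc.setLocalProperty("trace.id", "abc123")
> sc.parallelize(1 to 4).foreach { _ =>
>   val traceId = org.apache.spark.TaskContext.get().getLocalProperty("trace.id")
>   println(s"task sees trace id: $traceId") // a real plugin would register this with its tracer
> }
> {code}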



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32915) RPC implementation to support pushing and merging shuffle blocks

2020-10-15 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32915.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29855
[https://github.com/apache/spark/pull/29855]

> RPC implementation to support pushing and merging shuffle blocks
> 
>
> Key: SPARK-32915
> URL: https://issues.apache.org/jira/browse/SPARK-32915
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 3.1.0
>
>
> RPC implementation for the basic functionality in network-common and 
> network-shuffle modules to enable pushing blocks on the client side and 
> merging received blocks on the server side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Mridul Muralidharan
+1 on pushing the branch cut for increased dev time to match previous
releases.

Regards,
Mridul

On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:

> Thank you for your updates.
>
> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of
> the 3.1 branch cut, the feature development time window is less than 5
> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>
> Below are three highly desirable feature work I am watching. Hopefully, we
> can finish them before the branch cut.
>
>- Support push-based shuffle to improve shuffle efficiency:
>https://issues.apache.org/jira/browse/SPARK-30602
>- Unify create table syntax:
>https://issues.apache.org/jira/browse/SPARK-31257
>- Bloom filter join: https://issues.apache.org/jira/browse/SPARK-32268
>
> Thanks,
>
> Xiao
>
>
> Hyukjin Kwon  wrote on Sat, Oct 3, 2020 at 5:41 PM:
>
>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>> dropped R 3.5 and below at branch 2.4 as well.
>>
>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun,  wrote:
>>
>>> Hi, All.
>>>
>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>> According to the 3.1.0 release window, branch-3.1 will be
>>> created on November 1st and enters QA period.
>>>
>>> Here are some notable updates I've been monitoring.
>>>
>>> *Language*
>>> 01. SPARK-25075 Support Scala 2.13
>>>   - Since SPARK-32926, Scala 2.13 build test has
>>> become a part of GitHub Action jobs.
>>>   - After SPARK-33044, Scala 2.13 test will be
>>> a part of Jenkins jobs.
>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>   - 7 of 16 issues are resolved.
>>> 04. SPARK-32073 Drop R < 3.5 support
>>>   - This is done for Spark 3.0.1 and 3.1.0.
>>>
>>> *Dependency*
>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>   - This changes the default dist. for better cloud support
>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>   - This will remove Hive 1.2.1 from source code
>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>
>>> *Core*
>>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>>   - 11 of 15 issues are resolved
>>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>>   - 8 of 14 issues are resolved
>>>
>>> *Resource Manager*
>>> 11. SPARK-33005 Kubernetes GA preparation
>>>   - It is on the way and we are waiting for more feedback.
>>>
>>> *SQL*
>>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>>   to JSON/Avro
>>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>>   - 11 of 17 issues are resolved
>>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>>   and added more features in 3.1 but still we missed
>>>   - All built-in DataSource v2 write paths are disabled
>>> and v1 write is used instead.
>>>   - Support partition pruning with subqueries
>>>   - Support bucketing
>>>
>>> We still have one month before the feature freeze
>>> and starting QA. If you are working for 3.1,
>>> please consider the timeline and share your schedule
>>> with the Apache Spark community. For the other stuff,
>>> we can put it into 3.2 release scheduled in June 2021.
>>>
>>> Last but not least, I want to emphasize (7) once again.
>>> We need to remove the forked unofficial Hive eventually.
>>> Please let us know your reasons if you need to build
>>> from Apache Spark 3.1 source code for Hive 1.2.
>>>
>>> https://github.com/apache/spark/pull/29936
>>>
>>> As I wrote in the above PR description, for old releases,
>>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>>> Hive 1.2-based distribution.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>



[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-32738:

Fix Version/s: 2.4.8

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. Now if any fatal error happens during `Inbox.process`, 
> 'numActiveThreads' is not reduced. Then other threads cannot process 
> messages in that inbox, which causes the endpoint to "hang".
> This problem is more serious in previous Spark 2.x versions since the driver, 
> executor and block manager endpoints are all thread-safe endpoints.
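
For illustration, a simplified, self-contained sketch of the pattern described above (not the actual Spark Inbox code): the active-thread counter has to be released even when a fatal error escapes message processing.

// Simplified sketch of the issue/fix pattern; not the actual Spark Inbox code.
class Inbox {
  private val messages = new java.util.LinkedList[String]()
  private var numActiveThreads = 0

  def post(msg: String): Unit = synchronized { messages.add(msg) }

  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      // A fatal error (e.g. OutOfMemoryError) thrown while handling a message
      // used to skip the decrement below, leaving the inbox permanently "busy"
      // so no other thread would ever process it again.
      var msg = synchronized { messages.poll() }
      while (msg != null) {
        handle(msg)
        msg = synchronized { messages.poll() }
      }
    } finally {
      // Always release the slot, even on fatal errors.
      synchronized { numActiveThreads -= 1 }
    }
  }

  private def handle(msg: String): Unit = println(s"handling $msg")
}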



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[RESULT] [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-18 Thread Mridul Muralidharan
Hi,

  The vote passed with 16 +1's (6 binding) and no -1's

+1s (* = binding):

Xingbo Jiang
Venkatakrishnan Sowrirajan
Tom Graves (*)
Chandni Singh
DB Tsai (*)
Xiao Li (*)
Angers Zhu
Joseph Torres
Kalyan
Dongjoon Hyun (*)
Wenchen Fan (*)
Yi Wu
叶先进 
郑瑞峰 
Takeshi Yamamuro
Mridul Muralidharan (*)

Thanks,
Mridul


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-18 Thread Mridul Muralidharan
Adding my +1 as well, before closing the vote.

Regards,
Mridul

On Sun, Sep 13, 2020 at 9:59 PM Mridul Muralidharan 
wrote:

> Hi,
>
> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
> shuffle to improve shuffle efficiency.
> Please take a look at:
>
>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>- SPIP doc:
>
> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>- POC against master and results summary :
>
> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>
> Active discussions on the jira and SPIP document have settled.
>
> I will leave the vote open until Friday (the 18th September 2020), 5pm
> CST.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
>
> Thanks,
> Mridul
>


[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error

2020-09-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-32738:

Fix Version/s: 3.0.2

> thread safe endpoints may hang due to fatal error
> -
>
> Key: SPARK-32738
> URL: https://issues.apache.org/jira/browse/SPARK-32738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in 
> `Inbox`. Now if any fatal error happens during `Inbox.process`, 
> 'numActiveThreads' is not reduced. Then other threads cannot process 
> messages in that inbox, which causes the endpoint to "hang".
> This problem is more serious in previous Spark 2.x versions since the driver, 
> executor and block manager endpoints are all thread-safe endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-13 Thread Mridul Muralidharan
Hi,

I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
shuffle to improve shuffle efficiency.
Please take a look at:

   - SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
   - SPIP doc:
   
https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
   - POC against master and results summary :
   
https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit

Active discussions on the jira and SPIP document have settled.

I will leave the vote open until Friday (the 18th September 2020), 5pm CST.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...


Thanks,
Mridul


Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-09 Thread Mridul Muralidharan
I imported our KEYS file locally [1] to validate ... did not use external
keyserver.

Regards,
Mridul

[1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg --import

On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan  wrote:

> I checked
> https://repository.apache.org/content/repositories/orgapachespark-1361/ ,
> it says the Signature Validation failed.
>
> Prashant, can you double-check your gpg key and make sure it's uploaded to
> public key servers like the following?
> http://pool.sks-keyservers.net:11371
> http://keyserver.ubuntu.com:11371
>
>
> On Wed, Sep 9, 2020 at 6:12 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes
>>
>> Thanks,
>> Mridul
>>
>>
>> On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.7.
>>>
>>> The vote is open until Sep 11th at 9AM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.7
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> There are currently no issues targeting 2.4.7 (try project = SPARK AND
>>> "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))
>>>
>>> The tag to be voted on is v2.4.7-rc3 (commit
>>> 14211a19f53bd0f413396582c8970e3e0a74281d):
>>> https://github.com/apache/spark/tree/v2.4.7-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1361/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/
>>>
>>> The list of bug fixes going into 2.4.7 can be found at the following URL:
>>> https://s.apache.org/spark-v2.4.7-rc3
>>>
>>> This release is using the release script of the tag v2.4.7-rc3.
>>>
>>> FAQ
>>>
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.4.7?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.4.7 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.7
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>


Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-08 Thread Mridul Muralidharan
+1

Signatures, digests, etc check out fine.
Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
-Phive-thriftserver -Pmesos -Pkubernetes

Thanks,
Mridul


On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.4.7.
>
> The vote is open until Sep 11th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.7
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.7 (try project = SPARK AND
> "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.7-rc3 (commit
> 14211a19f53bd0f413396582c8970e3e0a74281d):
> https://github.com/apache/spark/tree/v2.4.7-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1361/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/
>
> The list of bug fixes going into 2.4.7 can be found at the following URL:
> https://s.apache.org/spark-v2.4.7-rc3
>
> This release is using the release script of the tag v2.4.7-rc3.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.7?
> ===
>
> The current list of open tickets targeted at 2.4.7 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.7
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-08-24 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183622#comment-17183622
 ] 

Mridul Muralidharan commented on SPARK-30602:
-

SPIP proposal document: 
[https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit|https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit]

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
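
To make the block-size arithmetic described above concrete, here is a back-of-the-envelope illustration (the numbers below are hypothetical and chosen only to show the trend; they are not taken from the paper):

// Hypothetical numbers, purely to illustrate the quadratic block-count growth.
val shuffledBytes = 1024L * 1024 * 1024 * 1024   // 1 TiB of shuffled data
val mappers       = 10000                        // grows roughly linearly with data size
val reducers      = 10000
val numBlocks     = mappers.toLong * reducers    // mappers * reducers = 100 million blocks
val avgBlockBytes = shuffledBytes / numBlocks    // ~11 KB per block
println(s"$numBlocks blocks, ~${avgBlockBytes / 1024} KiB per block")

At this scale the average block lands in the 10s-of-KBs range mentioned above, which is where random reads on HDD become the bottleneck.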



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Push-based shuffle SPIP

2020-08-24 Thread Mridul Muralidharan
Hi,

  Thanks for sending out the proposal Min !
For the SPIP requirements, I am willing to act as the shepherd for this
proposal.

The jira + paper + proposal provide the high-level design and
implementation details.
The VLDB paper discusses the performance gains in detail for the in-house
deployment of push-based shuffle.

Would be great to get feedback from our community on this feature before
we go to voting.


Regards,
Mridul



On Mon, Aug 24, 2020 at 4:32 PM mshen  wrote:

> We raised this SPIP ticket in
> https://issues.apache.org/jira/browse/SPARK-30602 earlier this year.
> Since then, we have progressed in multiple fronts, including:
>
> * Our work is published in VLDB 2020. The final version of the paper is
> attached in the SPIP ticket.
> * We have further enhanced and productionized this work at LinkedIn, and
> have enabled production flows adopting the new push-based shuffle
> mechanism,
> with good results.
> * We have recently also ported our push-based shuffle changes to OSS Spark
> master branch, so other people can potentially try it out. Details of this
> branch are in this doc
> <
> https://docs.google.com/document/d/16yOfI8P_O3V6hx_FnWT22jeDIItgXuXfaDAV0fJDTqQ/edit#>
>
> * The  SPIP doc
> <
> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit>
>
> is also further updated reflecting more recent designs.
> * We have also discussed this with multiple companies who share a similar interest
> in this work.
>
> We would like to resume the discussion of this SPIP in the community, and
> push for a vote on this.
>
>
>
>
> -
> Min Shen
> Staff Software Engineer
> LinkedIn
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Assigned] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32663:
---

Assignee: Attila Zsolt Piros

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> It seems like this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]
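
A minimal sketch of the anti-pattern being described, with illustrative names only (this is not the actual ExternalBlockStoreClient code):

// Illustrative sketch only; not the actual ExternalBlockStoreClient code.
import scala.concurrent.{ExecutionContext, Future}

class CachedRpcClient {
  @volatile private var closed = false
  def sendRpc(msg: String): Future[String] = {
    require(!closed, s"client already closed, cannot send: $msg")
    Future.successful(s"ack: $msg")
  }
  def close(): Unit = { closed = true }
}

// The client passed in is cached and shared across in-flight requests.
def removeBlocks(client: CachedRpcClient)(implicit ec: ExecutionContext): Future[Unit] =
  client.sendRpc("removeBlocks").map { _ =>
    // BUG: closing the shared, cached client here breaks every other request
    // still outstanding on the same connection; the fix is simply not to close it.
    client.close()
  }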



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32663.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29492
[https://github.com/apache/spark/pull/29492]

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding requests to the shuffle service, so it should not 
> be closed after processing a response. 
> It seems like this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32119.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28939
[https://github.com/apache/spark/pull/28939]

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin can't work with Standalone Cluster and Kubernetes
>  when a jar that contains the plugins, and files used by the plugins, are added via 
> the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: LiveListenerBus is occupying most of the Driver Memory and frequent GC is degrading the performance

2020-08-11 Thread Mridul Muralidharan
Hi,

  50% of driver time being spent in gc just for listenerbus sounds very
high in a 30G heap.
Did you try to take a heap dump and see what is occupying so much memory ?

This will help us determine whether the memory usage is due to some user
code/library holding references to large objects/graphs of objects, or whether
the memory usage is actually in listener-related code.
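
For reference, one way to capture such a dump programmatically from inside the driver JVM (equivalent to running jmap against the driver pid); this uses a standard JDK API, not a Spark API:

// Standard JDK API for capturing a heap dump from inside the running JVM.
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

def dumpHeap(path: String): Unit = {
  val bean = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
  // live = true limits the dump to reachable objects, which is what matters
  // when checking what the listener bus is actually retaining.
  bean.dumpHeap(path, true)
}

// e.g. dumpHeap("/tmp/driver-heap.hprof")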

Regards,
Mridul


On Tue, Aug 11, 2020 at 8:14 AM Teja  wrote:

> We have ~120 executors with 5 cores each, for a very long-running job which
> crunches ~2.5 TB of data and has many filters in the query. Currently, we
> have ~30k partitions, which makes ~90MB per partition.
>
> We are using Spark v2.2.2 as of now. The major problem we are facing is due
> to GC on the driver. All of the driver memory (30G) is getting filled and
> GC
> is very active, which is taking more than 50% of the runtime for Full GC
> Evacuation. The heap dump indicates that 80% of the memory is being
> occupied
> by LiveListenerBus and it's not being cleared by GC. Frequent GC runs are
> clearing newly created objects only.
>
> From the Jira tickets, I learned that memory consumption by
> LiveListenerBus has been addressed in v2.3 (not sure of the specifics). But
> until we evaluate migrating to v2.3, is there any quick fix or workaround,
> either to prevent the various listener events from piling up in the driver's
> memory or to identify and disable the listener which is causing the delay in
> processing events?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-31 Thread Mridul Muralidharan
+1

Thanks,
Mridul

On Thu, Jul 30, 2020 at 4:49 PM Holden Karau  wrote:

> Hi Spark Developers,
>
> After the discussion of the proposal to amend Spark committer guidelines,
> it appears folks are generally in agreement on policy clarifications. (See
> https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
> as well as some on the private@ list for PMC.) Therefore, I am calling
> for a majority VOTE, which will last at least 72 hours. See the ASF voting
> rules for procedural changes at
> https://www.apache.org/foundation/voting.html.
>
> The proposal is to add a new section entitled “When to Commit” to the
> Spark committer guidelines, currently at
> https://spark.apache.org/committers.html.
>
> ** START OF CHANGE **
>
> PRs shall not be merged during active, on-topic discussion unless they
> address issues such as critical security fixes of a public vulnerability.
> Under extenuating circumstances, PRs may be merged during active, off-topic
> discussion and the discussion directed to a more appropriate venue. Time
> should be given prior to merging for those involved with the conversation
> to explain if they believe they are on-topic.
>
> Lazy consensus requires giving time for discussion to settle while
> understanding that people may not be working on Spark as their full-time
> job and may take holidays. It is believed that by doing this, we can limit
> how often people feel the need to exercise their veto.
>
> All -1s with justification merit discussion.  A -1 from a non-committer
> can be overridden only with input from multiple committers, and suitable
> time must be offered for any committer to raise concerns. A -1 from a
> committer who cannot be reached requires a consensus vote of the PMC under
> ASF voting rules to determine the next steps within the ASF guidelines for
> code vetoes ( https://www.apache.org/foundation/voting.html ).
>
> These policies serve to reiterate the core principle that code must not be
> merged with a pending veto or before a consensus has been reached (lazy or
> otherwise).
>
> It is the PMC’s hope that vetoes continue to be infrequent, and when they
> occur, that all parties will take the time to build consensus prior to
> additional feature work.
>
> Being a committer means exercising your judgement while working in a
> community of people with diverse views. There is nothing wrong in getting a
> second (or third or fourth) opinion when you are uncertain. Thank you for
> your dedication to the Spark project; it is appreciated by the developers
> and users of Spark.
>
> It is hoped that these guidelines do not slow down development; rather, by
> removing some of the uncertainty, the goal is to make it easier for us to
> reach consensus. If you have ideas on how to improve these guidelines or
> other Spark project operating procedures, you should reach out on the dev@
> list to start the discussion.
>
> ** END OF CHANGE TEXT **
>
> I want to thank everyone who has been involved with the discussion leading
> to this proposal and those of you who take the time to vote on this. I look
> forward to our continued collaboration in building Apache Spark.
>
> I believe we share the goal of creating a welcoming community around the
> project. On a personal note, it is my belief that consistently applying
> this policy around commits can help to make a more accessible and welcoming
> community.
>
> Kind Regards,
>
> Holden
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-29 Thread Mridul Muralidharan
I agree, that would be a new feature; and unless compelling reason (like
security concerns) would not qualify.

Regards,
Mridul

On Wed, Jul 15, 2020 at 11:46 AM Wenchen Fan  wrote:

> Supporting Python 3.8.0 sounds like a new feature, and doesn't qualify a
> backport. But I'm open to other opinions.
>
> On Wed, Jul 15, 2020 at 11:24 PM Ismaël Mejía  wrote:
>
>> Any chance that SPARK-29536 PySpark does not work with Python 3.8.0
>> can be backported to 2.4.7 ?
>> This was not done for Spark 2.4.6 because it was too late on the vote
>> process but it makes perfect sense to have this in 2.4.7.
>>
>> On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan  wrote:
>> >
>> > Yea I think 2.4.7 is good to go. Let's start!
>> >
>> > On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma 
>> wrote:
>> >>
>> >> Hi Folks,
>> >>
>> >> So, I am back, and searched the JIRAS with target version as "2.4.7"
>> and Resolved, found only 2 jiras. So, are we good to go, with just a couple
>> of jiras fixed ? Shall I proceed with making a RC?
>> >>
>> >> Thanks,
>> >> Prashant
>> >>
>> >> On Thu, Jul 2, 2020 at 5:23 PM Prashant Sharma 
>> wrote:
>> >>>
>> >>> Thank you, Holden.
>> >>>
>> >>> Folks, My health has gone down a bit. So, I will start working on
>> this in a few days. If this needs to be published sooner, then maybe
>> someone else has to help out.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Jul 2, 2020 at 10:11 AM Holden Karau 
>> wrote:
>> 
>>  I’m happy to have Prashant do 2.4.7 :)
>> 
>>  On Wed, Jul 1, 2020 at 9:40 PM Xiao Li 
>> wrote:
>> >
>> > +1 on releasing both 3.0.1 and 2.4.7
>> >
>> > Great! Three committers volunteer to be a release manager. Ruifeng,
>> Prashant and Holden. Holden just helped release Spark 2.4.6. This time,
>> maybe, Ruifeng and Prashant can be the release manager of 3.0.1 and 2.4.7
>> respectively.
>> >
>> > Xiao
>> >
>> > On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-32148 was reported
>> yesterday, and if the report is valid it looks to be a blocker. I'll try to
>> take a look sooner.
>> >>
>> >> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>>
>> >>> Thanks Holden -- it would be great to also get 2.4.7 started
>> >>>
>> >>> Thanks
>> >>> Shivaram
>> >>>
>> >>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau <
>> hol...@pigscanfly.ca> wrote:
>> >>> >
>> >>> > I can take care of 2.4.7 unless someone else wants to do it.
>> >>> >
>> >>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>> jason.mo...@quantium.com.au> wrote:
>> >>> >>
>> >>> >> Hi all,
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Could I get some input on the severity of this one that I
>> found yesterday?  If that’s a correctness issue, should it block this
>> patch?  Let me know under the ticket if there’s more info that I can
>> provide to help.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> https://issues.apache.org/jira/browse/SPARK-32136
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Thanks,
>> >>> >>
>> >>> >> Jason.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> From: Jungtaek Lim 
>> >>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>> >>> >> To: Shivaram Venkataraman 
>> >>> >> Cc: Prashant Sharma , 郑瑞峰 <
>> ruife...@foxmail.com>, Gengliang Wang ,
>> gurwls223 , Dongjoon Hyun ,
>> Jules Damji , Holden Karau ,
>> Reynold Xin , Yuanjian Li ,
>> "dev@spark.apache.org" , Takeshi Yamamuro <
>> linguin@gmail.com>
>> >>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> SPARK-32130 [1] looks to be a performance regression
>> introduced in Spark 3.0.0, which is ideal to look into before releasing
>> another bugfix version.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>> >>
>> >>> >> Hi all
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> I just wanted to ping this thread to see if all the
>> outstanding blockers for 3.0.1 have been fixed. If so, it would be great if
>> we can get the release going. The CRAN team sent us a note that the version
>> SparkR available on CRAN for the current R version (4.0.2) is broken and
>> hence we need to update the package soon --  it will be great to do it with
>> 3.0.1.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Thanks
>> >>> >>
>> >>> >> Shivaram
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma <
>> scrapco...@gmail.com> wrote:
>> >>> >>
>> >>> >> +1 for 3.0.1 release.
>> >>> >>
>> >>> >> I too can he

Re: [DISCUSS] Amend the commiter guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-23 Thread Mridul Muralidharan
Thanks Holden, this version looks good to me.
+1

Regards,
Mridul


On Thu, Jul 23, 2020 at 3:56 PM Imran Rashid  wrote:

> Sure, that sounds good to me.  +1
>
> On Wed, Jul 22, 2020 at 1:50 PM Holden Karau  wrote:
>
>>
>>
>> On Wed, Jul 22, 2020 at 7:39 AM Imran Rashid < iras...@apache.org >
>> wrote:
>>
>>> Hi Holden,
>>>
>>> thanks for leading this discussion, I'm in favor in general.  I have one
>>> specific question -- these two sections seem to contradict each other
>>> slightly:
>>>
>>> > If there is a -1 from a non-committer, multiple committers or the PMC
>>> should be consulted before moving forward.
>>> >
>>> >If the original person who cast the veto can not be reached in a
>>> reasonable time frame given likely holidays, it is up to the PMC to decide
>>> the next steps within the guidelines of the ASF. This must be decided by a
>>> consensus vote under the ASF voting rules.
>>>
>>> I think the intent here is that if a *committer* gives a -1, then the
>>> PMC has to have a consensus vote?  And if a non-committer gives a -1, then
>>> multiple committers should be consulted?  How about combining those two
>>> into something like
>>>
>>> "All -1s with justification merit discussion.  A -1 from a non-committer
>>> can be overridden only with input from multiple committers.  A -1 from a
>>> committer requires a consensus vote of the PMC under ASF voting rules".
>>>
>> I can work with that although it wasn’t quite what I was originally going
>> for. I didn’t intend to have committer -1s be eligible for override. I
>> believe committers have demonstrated sufficient merit; they are the same as
>> PMC member -1s in our project.
>>
>> My aim was just if something weird happens (like say I had a pending -1
>> before my motorcycle crash last year) we go to the PMC and take a binding
>> vote on what to do, and most likely someone on the PMC will reach out to
>> the ASF for understanding around the guidelines.
>>
>> What about:
>>
>> All -1s with justification merit discussion.  A -1 from a non-committer
>> can be overridden only with input from multiple committers and suitable
>> time for any committer to raise concerns.  A -1 from a committer who can
>> not be reached requires a consensus vote of the PMC under ASF voting rules
>> to determine the next steps within the ASF guidelines for vetos.
>>
>>>
>>>
>>> thanks,
>>> Imran
>>>
>>>
>>> On Tue, Jul 21, 2020 at 3:41 PM Holden Karau 
>>> wrote:
>>>
 Hi Spark Developers,

 There has been a rather active discussion regarding the specific vetoes
 that occured during Spark 3. From that I believe we are now mostly in
 agreement that it would be best to clarify our rules around code vetoes &
 merging in general. Personally I believe this change is important to help
 improve the appearance of a level playing field in the project.

 Once discussion settles I'll run this by a copy editor, my grammar
 isn't amazing, and bring forward for a vote.

 The current Spark committer guide is at https://spark.apache.org/
 committers.html. I am proposing we add a section on when it is OK to
 merge PRs directly above the section on how to merge PRs. The text I am
 proposing to amend our committer guidelines with is:

 PRs shall not be merged during active, on-topic discussion except for
 issues like critical security fixes of a public vulnerability. Under
 extenuating circumstances PRs may be merged during active, off-topic
 discussion and the discussion directed to a more appropriate venue. Time
 should be given prior to merging for those involved with the conversation
 to explain if they believe they are on topic.

 Lazy consensus requires giving time for discussion to settle, while
 understanding that people may not be working on Spark as their full time
 job and may take holidays. It is believed that by doing this we can limit
 how often people feel the need to exercise their veto.

 For the purposes of a -1 on code changes, a qualified voter includes
 all PMC members and committers in the project. For a -1 to be a valid veto
 it must include a technical reason. The reason can include things like the
 change may introduce a maintenance burden or is not the direction of Spark.

 If there is a -1 from a non-committer, multiple committers or the PMC
 should be consulted before moving forward.


 If the original person who cast the veto can not be reached in a
 reasonable time frame given likely holidays, it is up to the PMC to decide
 the next steps within the guidelines of the ASF. This must be decided by a
 consensus vote under the ASF voting rules.

 These policies serve to reiterate the core principle that code must not
 be merged with a pending veto or before a consensus has been reached (lazy
 or otherwise).

 It is the PMC’s hope that vetoes continue to be infrequent, and when
 they occur, that all parties will take the time to build consensus prior to
 additional feature work.

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Mridul Muralidharan
Congratulations !

Regards,
Mridul

On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new roles! The new committers are:
>
> - Huaxin Gao
> - Jungtaek Lim
> - Dilip Biswal
>
> All three of them contributed to Spark 3.0 and we’re excited to have them
> join the project.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-25594) OOM in long running applications even with UI disabled

2020-07-02 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150729#comment-17150729
 ] 

Mridul Muralidharan commented on SPARK-25594:
-

Given the regression in functionality if this were merged, closing the bug.
See comment: https://github.com/apache/spark/pull/22609#issuecomment-426405757

> OOM in long running applications even with UI disabled
> --
>
> Key: SPARK-25594
> URL: https://issues.apache.org/jira/browse/SPARK-25594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>    Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Major
>
> Typically, for long-running applications with a large number of tasks, it is 
> common to disable the UI to minimize overhead at the driver.
> Earlier, with the Spark UI disabled, only stage/job information was kept as part 
> of JobProgressListener.
> As part of the history server scalability fixes, particularly SPARK-20643, 
> in spite of disabling the UI, task information continues to be maintained in 
> memory.
> In our long-running tests against the Spark Thrift Server, this eventually 
> results in OOM.
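
For context, the standard knobs below bound how much job/stage/task state the driver retains in its status store; the values are illustrative only, and capping retention is a mitigation rather than the fix this ticket was aiming at:

// Illustrative values only: caps on retained job/stage/task state at the driver.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.enabled", "false")        // UI disabled, as described above
  .set("spark.ui.retainedJobs", "100")     // cap retained job entries
  .set("spark.ui.retainedStages", "200")   // cap retained stage entries
  .set("spark.ui.retainedTasks", "10000")  // cap retained task entries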



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25594) OOM in long running applications even with UI disabled

2020-07-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-25594.
-
Resolution: Won't Fix

> OOM in long running applications even with UI disabled
> --
>
> Key: SPARK-25594
> URL: https://issues.apache.org/jira/browse/SPARK-25594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>    Reporter: Mridul Muralidharan
>    Assignee: Mridul Muralidharan
>Priority: Major
>
> Typically, for long-running applications with a large number of tasks, it is 
> common to disable the UI to minimize overhead at the driver.
> Earlier, with the Spark UI disabled, only stage/job information was kept as part 
> of JobProgressListener.
> As part of the history server scalability fixes, particularly SPARK-20643, 
> in spite of disabling the UI, task information continues to be maintained in 
> memory.
> In our long-running tests against the Spark Thrift Server, this eventually 
> results in OOM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Mridul Muralidharan
+1

Thanks,
Mridul

On Wed, Jul 1, 2020 at 6:36 PM Hyukjin Kwon  wrote:

> +1
>
> On Thu, Jul 2, 2020 at 10:08 AM, Marcelo Vanzin wrote:
>
>> I reviewed the docs and PRs from way before an SPIP was explicitly
>> asked, so I'm comfortable with giving a +1 even if I haven't really
>> fully read the new document,
>>
>> On Wed, Jul 1, 2020 at 6:05 PM Holden Karau  wrote:
>> >
>> > Hi Spark Devs,
>> >
>> > I think discussion has settled on the SPIP doc at
>> https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
>> , design doc at
>> https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
>> or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
>> received a request to put the SPIP up for a VOTE quickly. The discussion
>> thread on the mailing list is at
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
>> .
>> >
>> > Normally this vote would be open for 72 hours, however since it's a
>> long weekend in the US where many of the PMC members are, this vote will
>> not close before July 6th at noon pacific time.
>> >
>> > The SPIP procedures are documented at:
>> https://spark.apache.org/improvement-proposals.html. The ASF's voting
>> guide is at https://www.apache.org/foundation/voting.html.
>> >
>> > Please vote before July 6th at noon:
>> >
>> > [ ] +1: Accept the proposal as an official SPIP
>> > [ ] +0
>> > [ ] -1: I don't think this is a good idea because ...
>> >
>> > I will start the voting off with a +1 from myself.
>> >
>> > Cheers,
>> >
>> > Holden
>>
>>
>>
>> --
>> Marcelo Vanzin
>> van...@gmail.com
>> "Life's too short to drink cheap beer"
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-28 Thread Mridul Muralidharan
  Thanks for shepherding this Holden !
I left a few comments, but overall it looks good to me.

Regards,
Mridul


On Sat, Jun 27, 2020 at 9:34 PM Holden Karau  wrote:

> There’s been some comments & a few additions in the doc, but it seems like
> the folks taking a look generally agree on the design. If there are no
> other issues I will bring this to a vote late next week.
>
> On Thu, Jun 25, 2020 at 7:43 PM Holden Karau  wrote:
>
>> Thanks for looping in more folks :)
>>
>> On Thu, Jun 25, 2020 at 7:41 PM Hyukjin Kwon  wrote:
>>
>>> Thank you so much, Holden.
>>>
>>> PS: I cc'ed some people who might be interested in this too FYI.
>>>
>>> On Fri, Jun 26, 2020 at 11:26 AM, Holden Karau wrote:
>>>
 At the recommendation of Hyukjin, I'm converting the graceful
 decommissioning work to an SPIP. The SPIP document is at
 https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
 and the associated JIRA is at
 https://issues.apache.org/jira/browse/SPARK-20624. This work dates
 back to 2017 when an earlier design was brought up. Now in 2019 I've
 updated the design
 https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing
 .

 From the SPIP requirements, I am willing to act as the shepherd &
 committer on this proposal, and have already been reviewing and creating
 PRs here. There are several folks who have contributed to design (you can
 see the discussion in the 2019 & 2017 documents) and I'm very thankful for
 those contributions, as well as the other code reviewers & PR authors who
 have been participating from a variety of vendors/users.

 Given the existence of multiple vendors' proprietary implementations
 (Databricks & AWS) of this feature, I think the user need is very clear.

 This begins the discussion phase of the SPIP process where the goal is
 to determine the need for the change and the general design outline. Once
 the discussion settles, we move on the voting stage. While there are WIP
 PRS attached to the design document to illustrate the implementation,
 approval of the SPIP does not necessarily mean those are the specific
 implementations that we will use.

 For more information about SPIPs in general see
 https://spark.apache.org/improvement-proposals.html

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Mridul Muralidharan
Great job everyone ! Congratulations :-)

Regards,
Mridul

On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:

> Hi all,
>
> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many
> of the innovations from Spark 2.x, bringing new ideas as well as continuing
> long-term projects that have been in development. This release resolves
> more than 3400 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-0.html
>
>
>
>



Re: [vote] Apache Spark 3.0 RC3

2020-06-07 Thread Mridul Muralidharan
+1

Regards,
Mridul

On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin  wrote:

> Apologies for the mistake. The vote is open till 11:59pm Pacific time on
> Mon June 9th.
>
> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0.
>>
>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
>> are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-rc3 (commit
>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> This release is using the release script of the tag v3.0.0-rc3.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>>
>> The current list of open tickets targeted at 3.0.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>


Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Mridul Muralidharan
  Is this a behavior change in 2.4.x from an earlier version?
Or are we proposing to introduce functionality to help with adoption?
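
For reference, a minimal sketch of setting the config programmatically (this is my own illustration, not from the RC; it assumes spark-hive is on the classpath and the warehouse path is a placeholder):

  import org.apache.spark.sql.SparkSession

  object WarehouseDirCheck {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("warehouse-dir-check")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
        .enableHiveSupport()
        .getOrCreate()

      // If the config is honoured, the table directory should be created
      // under /tmp/spark-warehouse rather than file:/user/hive/warehouse.
      spark.sql("CREATE TABLE t1 (col1 INT)")
      spark.sql("SET spark.sql.warehouse.dir").show(truncate = false)
      spark.stop()
    }
  }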

Regards,
Mridul


On Wed, Jun 3, 2020 at 10:32 AM Xiao Li  wrote:

> Yes. Spark 3.0 RC2 works well.
>
> I think the current behavior in Spark 2.4 affects the adoption, especially
> for the new users who want to try Spark in their local environment.
>
> It impacts all our built-in clients, like Scala Shell and PySpark. Should
> we consider back-porting it to 2.4?
>
> Although this fixes the bug, it will also introduce the behavior change.
> We should publicly document it and mention it in the release note. Let us
> review it more carefully and understand the risk and impact.
>
> Thanks,
>
> Xiao
>
> Nicholas Chammas  wrote on Wed, Jun 3, 2020 at 10:12 AM:
>
>> I believe that was fixed in 3.0 and there was a decision not to backport
>> the fix: SPARK-31170 
>>
>> On Wed, Jun 3, 2020 at 1:04 PM Xiao Li  wrote:
>>
>>> Just downloaded it in my local macbook. Trying to create a table using
>>> the pre-built PySpark. It sounds like the conf "spark.sql.warehouse.dir"
>>> does not take an effect. It is trying to create a directory in
>>> "file:/user/hive/warehouse/t1". I have not done any investigation yet. Have
>>> any of you hit the same issue?
>>>
>>> C02XT0U7JGH5:bin lixiao$ ./pyspark --conf
>>> spark.sql.warehouse.dir="/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6"
>>>
>>> Python 2.7.16 (default, Jan 27 2020, 04:46:15)
>>>
>>> [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin
>>>
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> 20/06/03 09:56:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>>
>>> Setting default log level to "WARN".
>>>
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>>
>>> Welcome to
>>>
>>>     __
>>>
>>>  / __/__  ___ _/ /__
>>>
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>
>>>/__ / .__/\_,_/_/ /_/\_\   version 2.4.6
>>>
>>>   /_/
>>>
>>>
>>> Using Python version 2.7.16 (default, Jan 27 2020 04:46:15)
>>>
>>> SparkSession available as 'spark'.
>>>
>>> >>> spark.sql("set spark.sql.warehouse.dir").show(truncate=False)
>>>
>>>
>>> +---+-+
>>>
>>> |key|value
>>>   |
>>>
>>>
>>> +---+-+
>>>
>>>
>>> |spark.sql.warehouse.dir|/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6|
>>>
>>>
>>> +---+-+
>>>
>>>
>>> >>> spark.sql("create table t1 (col1 int)")
>>>
>>> 20/06/03 09:56:29 WARN HiveMetaStore: Location:
>>> file:/user/hive/warehouse/t1 specified for non-external table:t1
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "", line 1, in 
>>>
>>>   File
>>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/session.py",
>>> line 767, in sql
>>>
>>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>>
>>>   File
>>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>> line 1257, in __call__
>>>
>>>   File
>>> "/Users/lixiao/Downloads/spark-2.4.6-bin-hadoop2.6/python/pyspark/sql/utils.py",
>>> line 69, in deco
>>>
>>> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
>>>
>>> pyspark.sql.utils.AnalysisException:
>>> u'org.apache.hadoop.hive.ql.metadata.HiveException:
>>> MetaException(message:file:/user/hive/warehouse/t1 is not a directory or
>>> unable to create one);'
>>>
>>> Dongjoon Hyun  wrote on Wed, Jun 3, 2020 at 9:18 AM:
>>>
 +1

 Bests,
 Dongjoon

 On Wed, Jun 3, 2020 at 5:59 AM Tom Graves 
 wrote:

>  +1
>
> Tom
>
> On Sunday, May 31, 2020, 06:47:09 PM CDT, Holden Karau <
> hol...@pigscanfly.ca> wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.4.6.
>
> The vote is open until June 5th at 9AM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.6
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In 
> Progress"))
>
> The tag to be voted on is v2.4.6-rc8 (commit
> 807e0a484d1de767d1f02bd8a622da6450bdf940):
> https://github.com/apache/spark/tree/v2.4.6-rc8
>
> The release files, including signatures, digests, etc. can be found 

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-02 Thread Mridul Muralidharan
+1 (binding)


Thanks,
Mridul


On Sun, May 31, 2020 at 4:47 PM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.4.6.
>
> The vote is open until June 5th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.6
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.6-rc8 (commit
> 807e0a484d1de767d1f02bd8a622da6450bdf940):
> https://github.com/apache/spark/tree/v2.4.6-rc8
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1349/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-docs/
>
> The list of bug fixes going into 2.4.6 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>
> This release is using the release script of the tag v2.4.6-rc8.
>
> FAQ
>
> =
> What happened to the other RCs?
> =
>
> The parallel maven build caused some flakiness so I wasn't comfortable
> releasing them. I backported the fix from the 3.0 branch for this release.
> I've got a proposed change to the build script so that we only push tags
> when once the build is a success for the future, but it does not block this
> release.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.6?
> ===
>
> The current list of open tickets targeted at 2.4.6 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.6
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled

2020-04-17 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086124#comment-17086124
 ] 

Mridul Muralidharan commented on SPARK-29302:
-

I agree with [~feiwang]: it looks like newTaskTempFile is not robust to 
speculative execution and task failures when dynamicPartitionOverwrite is 
enabled, IMO.
This will need to be fixed - it currently uses the same path irrespective 
of which attempt is writing.
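
As an illustration only (not the actual newTaskTempFile change), this is the kind of attempt-scoped staging path that would avoid the collision:

  import org.apache.spark.TaskContext

  def stagingPath(baseDir: String, partitionDir: String, fileName: String): String = {
    // taskAttemptId is unique within the application, so two attempts of the
    // same partition (e.g. a speculative copy) write to different paths.
    val attemptId = TaskContext.get().taskAttemptId()
    s"$baseDir/$partitionDir/attempt-$attemptId/$fileName"
  }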

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Now, for a dynamic partition overwrite operation,  the filename of a task 
> output is determinable.
> So, if speculation is enabled,  would a task conflict with  its relative 
> speculation task?
> Would the two tasks concurrent write a same file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29302) dynamic partition overwrite with speculation enabled

2020-04-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-29302:

Comment: was deleted

(was: Drive by observations:

* Speculative execution does not run the speculative task concurrently on the 
same node running the task (so definitely not same executor).
* Do we have additional details about the tasks which are failing ? Did they 
fail before (for unrelated reasons) before failing for path conflict ?
)

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Now, for a dynamic partition overwrite operation,  the filename of a task 
> output is determinable.
> So, if speculation is enabled,  would a task conflict with  its relative 
> speculation task?
> Would the two tasks concurrent write a same file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled

2020-04-17 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086117#comment-17086117
 ] 

Mridul Muralidharan commented on SPARK-29302:
-

Drive-by observations:

* Speculative execution does not run the speculative task concurrently on the 
same node running the original task (so definitely not the same executor).
* Do we have additional details about the tasks which are failing? Did they 
fail earlier (for unrelated reasons) before failing on the path conflict?


> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Now, for a dynamic partition overwrite operation,  the filename of a task 
> output is determinable.
> So, if speculation is enabled,  would a task conflict with  its relative 
> speculation task?
> Would the two tasks concurrent write a same file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Mridul Muralidharan
I agree with what Sean detailed.
The only place where I can see some amount of investigation being required
would be for security issues or correctness issues.
Knowing the affected versions, particularly if an earlier supported version
does not have the bug, will help users understand the broken/insecure
versions.

Regards,
Mridul


On Wed, Apr 1, 2020 at 6:12 PM Sean Owen  wrote:

> I think we discussed this briefly on a PR.
>
> It's not as clear what it means for an Improvement to 'affect a
> version'. Certainly, an improvement to a feature introduced in 1.2.3
> can't affect anything earlier, and implicitly affects everything
> after. It's not wrong to say it affects the latest version, at least.
> And I believe we require it in JIRA because we can't require an
> Affects Version for one type of issue but not another. So, just asking
> people to default to 'latest version' there is no burden.
>
> I would not ask someone to figure out all and earliest versions that
> an Improvement applies to; it just isn't that useful. We aren't
> generally going to back-port improvements anyway.
>
> Even for bugs, we don't really need to know that a bug in master
> affects 2.4.5, 2.4.4, 2.4.3, ... 2.3.6, 2.3.5, etc. It doesn't hurt to
> at least say it affects the latest 2.4.x, 2.3.x releases, if known,
> because it's possible it should be back-ported. Again even where this
> is significantly more useful, I'm not in favor of telling people they
> must test the bug report vs previous releases.
>
> So, if you're asserting that the current guidance is OK, I generally agree.
> Is there a particular context where this was questioned? maybe we
> should examine the particulars of that situation. As in all things,
> context matters.
>
> Sean
>
> On Wed, Apr 1, 2020 at 7:34 PM Jungtaek Lim
>  wrote:
> >
> > Hi devs,
> >
> > I know we're busy with making Spark 3.0 be out, but I think the topic is
> good to discuss at any time and actually be better to be resolved sooner
> than later.
> >
> > In the page "Contributing to Spark", we describe the guide of "affects
> version" as "For Bugs, assign at least one version that is known to exhibit
> the problem or need the change".
> >
> > For me, that sentence clearly describes minimal requirement of affects
> version via:
> >
> > * For the type of bug, assign one valid version
> > * For other types, there's no requirement
> >
> > but I'm seeing the requests more than the requirement which makes me
> think there might be different understanding of the sentence. Maybe there's
> more, but to summarize on such requests:
> >
> > 1) add affects version as same as master branch for improvement/new
> feature
> > 2) check with older versions to fill up affects version for bug
> >
> > I don't see any point on doing 1). It might give some context if we
> don't update the affect version (so that it can say which version was
> considered when filing JIRA issue) but we also update the affect version
> when we bump the master branch, which is no longer informational as the
> version should have been always the same as master branch.
> >
> > I agree it's ideal to do 2) but I think the reason the guide doesn't
> enforce is that it requires pretty much efforts to check with old versions
> (sometimes even more than origin work).
> >
> > Suppose the happy case we have UT to verify the bugfix which fails
> without the patch and passes with the patch. To check with older versions
> we have to checkout the tag, and apply the UT, and "rebuild", and run UT to
> verify which is pretty much time-consuming. What if there's a conflict
> indeed? That's still a happy case, and in worse case (there's no such UT)
> we should do E2E manual verification which I would give up.
> >
> > There should have some balance/threshold, and the balance should be the
> thing the community has a consensus.
> >
> > Would like to hear everyone's voice on this.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Mridul Muralidharan
I am in broad agreement with the proposal; like any developer, I prefer
stable, well-designed APIs :-)

Can we tie the proposal to the stability guarantees given by Spark and the
reasonable expectations users can draw from them?
In my opinion, an unstable or evolving API could change, while an
experimental API which has been around for ages should be handled more
conservatively.
Which brings up the question of how the stability guarantees specified by
our annotations interact with the proposal.

Also, can we expand on 'when' an API change can occur, since we are
proposing to diverge from semver?
Patch release? Minor release? Only major release? Based on the 'impact'
of the API? Stability guarantees?
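
To make the annotation point concrete, an illustration (the class names are made up for the example; the annotations are the ones in org.apache.spark.annotation):

  import org.apache.spark.annotation.{DeveloperApi, Experimental}

  @Experimental
  class NewColumnarReader   // could change or go away in any release

  @DeveloperApi
  class ShuffleHook         // advanced-user API with a weaker compatibility promise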

Regards,
Mridul



On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust  wrote:
>
> I'll start off the vote with a strong +1 (binding).
>
> On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust  
> wrote:
>>
>> I propose to add the following text to Spark's Semantic Versioning policy 
>> and adopt it as the rubric that should be used when deciding to break APIs 
>> (even at major versions such as 3.0).
>>
>>
>> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
>> procedural vote, the measure will pass if there are more favourable votes 
>> than unfavourable ones. PMC votes are binding, but the community is 
>> encouraged to add their voice to the discussion.
>>
>>
>> [ ] +1 - Spark should adopt this policy.
>>
>> [ ] -1  - Spark should not adopt this policy.
>>
>>
>> 
>>
>>
>> Considerations When Breaking APIs
>>
>> The Spark project strives to avoid breaking APIs or silently changing 
>> behavior, even at major versions. While this is not always possible, the 
>> balance of the following factors should be considered before choosing to 
>> break an API.
>>
>>
>> Cost of Breaking an API
>>
>> Breaking an API almost always has a non-trivial cost to the users of Spark. 
>> A broken API means that Spark programs need to be rewritten before they can 
>> be upgraded. However, there are a few considerations when thinking about 
>> what the cost will be:
>>
>> Usage - an API that is actively used in many different places, is always 
>> very costly to break. While it is hard to know usage for sure, there are a 
>> bunch of ways that we can estimate:
>>
>> How long has the API been in Spark?
>>
>> Is the API common even for basic programs?
>>
>> How often do we see recent questions in JIRA or mailing lists?
>>
>> How often does it appear in StackOverflow or blogs?
>>
>> Behavior after the break - How will a program that works today, work after 
>> the break? The following are listed roughly in order of increasing severity:
>>
>> Will there be a compiler or linker error?
>>
>> Will there be a runtime exception?
>>
>> Will that exception happen after significant processing has been done?
>>
>> Will we silently return different answers? (very hard to debug, might not 
>> even notice!)
>>
>>
>> Cost of Maintaining an API
>>
>> Of course, the above does not mean that we will never break any APIs. We 
>> must also consider the cost both to the project and to our users of keeping 
>> the API in question.
>>
>> Project Costs - Every API we have needs to be tested and needs to keep 
>> working as other parts of the project changes. These costs are significantly 
>> exacerbated when external dependencies change (the JVM, Scala, etc). In some 
>> cases, while not completely technically infeasible, the cost of maintaining 
>> a particular API can become too high.
>>
>> User Costs - APIs also have a cognitive cost to users learning Spark or 
>> trying to understand Spark programs. This cost becomes even higher when the 
>> API in question has confusing or undefined semantics.
>>
>>
>> Alternatives to Breaking an API
>>
>> In cases where there is a "Bad API", but where the cost of removal is also 
>> high, there are alternatives that should be considered that do not hurt 
>> existing users but do address some of the maintenance costs.
>>
>>
>> Avoid Bad APIs - While this is a bit obvious, it is an important point. 
>> Anytime we are adding a new interface to Spark we should consider that we 
>> might be stuck with this API forever. Think deeply about how new APIs relate 
>> to existing ones, as well as how you expect them to evolve over time.
>>
>> Deprecation Warnings - All deprecation warnings should point to a clear 
>> alternative and should never just say that an API is deprecated.
>>
>> Updated Docs - Documentation should point to the "best" recommended way of 
>> performing a given task. In the cases where we maintain legacy 
>> documentation, we should clearly point to newer APIs and suggest to users 
>> the "right" way.
>>
>> Community Work - Many people learn Spark by reading blogs and other sites 
>> such as StackOverflow. However, many of these resources are out of date. 
>> Update them, to reduce the cost of eventually removing deprecated APIs.
>>
>>
>> 

-
To unsubscrib

Re: Is RDD thread safe?

2019-11-25 Thread Mridul Muralidharan
Very well put, Imran. This is a variant of executor failure after an RDD has
been computed (including caching). In general, non-determinism in Spark is
going to lead to inconsistency.
The only reasonable solution for us, at the time, was to make the
pseudo-randomness repeatable and checkpoint afterwards so that recomputation
becomes deterministic.
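
Roughly what that looked like, as a sketch: seed the generator from the partition index so any recomputation of a partition produces the same values, then checkpoint to cut the lineage (the checkpoint directory is a placeholder):

  import scala.util.Random
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("repeatable-random").getOrCreate()
  val sc = spark.sparkContext
  sc.setCheckpointDir("/tmp/checkpoints")

  val baseSeed = 42L
  val sampled = sc.parallelize(1 to 1000000, 100).mapPartitionsWithIndex {
    (partitionIndex, iter) =>
      val rng = new Random(baseSeed + partitionIndex)   // deterministic per partition
      iter.filter(_ => rng.nextDouble() < 0.1)
  }
  sampled.checkpoint()   // cut the lineage once the deterministic result exists
  sampled.count()        // action materializes the checkpoint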


Regards,
Mridul

On Mon, Nov 25, 2019 at 9:30 AM Imran Rashid 
wrote:

> I think Chang is right, but I also think this only comes up in limited
> scenarios.  I initially thought it wasn't a bug, but after some more
> thought I have some concerns in light of the issues we've had w/
> nondeterministic RDDs, eg. repartition().
>
> Say I have code like this:
>
> val cachedRDD = sc.textFile(...).cache()
> (0 until 200).par.foreach { idx => cachedRDD.doSomeAction(idx) }
>
> that is, my cached rdd is referenced by many threads concurrently before
> the RDD has been cached.
>
> When one of those tasks gets to cachedRDD.getOrCompute(), there are a few
> possible scenarios:
>
> 1) the partition has never been referenced before.
> BlockManager.getOrCompute() will say the block doesn't exist, so it will
> get recomputed (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L360
> )
>
> 2) The partition has been fully materialized by another task, the
> blockmanagermaster on the driver already knows about it, so
> BlockManager.getOrCompute() will return a pointer to the cached block
> (perhaps on another node)
>
> 3) The partition is actively being computed by another task on the same
> executor.  Then BlockManager.getOrCompute() will not know about that other
> version of the task (it only knows about blocks that are fully
> materialized, IIUC).  But eventually, when the tasks try to actually write
> the data, they'll try to get a write lock for the block:
> https://github.com/apache/spark/blob/f09c1a36c4b0ca1fb450e274b22294dca590d8f8/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1218
> one task will get the write lock first; the other task will block on the
> other task, and then realize the block exists and just return those values.
>
> 4) The partition is actively being compute by another task on a
> *different* executor.  IIUC, Spark doesn't try to do anything to prevent
> both tasks from computing the block themselves in this case.  (To do so
> would require extra coordination in driver before writing every single
> block.)  Those locks in BlockManager and BlockInfoManager don't stop this
> case, because this is happening in entirely independent JVMs.
> There normally won't be any problem here -- if the RDD is totally
> deterministic, then you'll just end up with an extra copy of the data.  In
> a way, this is good, the cached RDD is in high demand, so having an extra
> copy isn't so bad.
> OTOH, if the RDD is non-deterministic, you've now got two copies with
> different values.  Then again, RDD cache is not resilient in general, so
> you've always got to be able to handle an RDD getting recomputed if its
> evicted from the cache.  So this should be pretty similar.
>
> On Mon, Nov 25, 2019 at 2:29 AM Weichen Xu 
> wrote:
>
>> emmm, I haven't check code, but I think if an RDD is referenced in
>> several places, the correct behavior should be: when this RDD data is
>> needed, it will be computed and then cached only once, otherwise it should
>> be treated as a bug. If you are suspicious there's a race condition, you
>> could create a jira ticket.
>>
>> On Mon, Nov 25, 2019 at 12:21 PM Chang Chen  wrote:
>>
>>> Sorry I did't describe clearly,  RDD id itself is thread-safe, how about
>>> cached data?
>>>
>>> See codes from BlockManager
>>>
>>> def getOrElseUpdate(...)   = {
>>>   get[T](blockId)(classTag) match {
>>>case ...
>>>case _ =>  // 1. no data is
>>> cached.
>>> // Need to compute the block
>>>  }
>>>  // Initially we hold no locks on this block
>>>  doPutIterator(...) match{..}
>>> }
>>>
>>> Considering  two DAGs (contain the same cached RDD ) runs
>>> simultaneously,  if both returns none  when they get same block from
>>> BlockManager(i.e. #1 above), then I guess the same data would be cached
>>> twice.
>>>
>>> If the later cache could override the previous data, and no memory is
>>> waste, then this is OK
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>> Weichen Xu  wrote on Mon, Nov 25, 2019 at 11:52 AM:
>>>
 Rdd id is immutable and when rdd object created, the rdd id is
 generated. So why there is race condition in "rdd id" ?

 On Mon, Nov 25, 2019 at 11:31 AM Chang Chen 
 wrote:

> I am wonder the concurrent semantics for reason about the correctness.
> If the two query simultaneously run the DAGs which use the same cached
> DF\RDD,but before cache data actually happen, what will happen?
>
> By looking into code a litter, I suspect they have different BlockID
> for same Dataset which is unexpected behavior, but there is no race
> 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Mridul Muralidharan
Just for completeness' sake, Spark is not version-neutral to Hadoop;
particularly in YARN mode, there is a minimum version requirement
(though a fairly generous one, I believe).

I agree with Steve: it is a long-standing pain that we are bundling a
positively ancient version of Hive.
Having said that, we should decouple the Hive artifact question from
the Hadoop version question - though they might be related currently.

Regards,
Mridul

On Tue, Nov 19, 2019 at 2:40 PM Cheng Lian  wrote:
>
> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version matters 
> except for the spark-hadoop-cloud module, which is only meaningful under the 
> hadoop-3.2 profile. All  the other spark-* artifacts published to Maven 
> central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark 
> distribution. Right now, we only publish PySpark artifacts prebuilt with 
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 
> is feasible for PySpark users. Or maybe we should publish PySpark prebuilt 
> with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the 
> proposed hive-2.3 profile, I personally don't have a preference over having 
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing the 
> release management work, in case we decided to publish other spark-* Maven 
> artifacts from a Hadoop 2.7 build, we can still special case 
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun  wrote:
>>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and 
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung  
>> wrote:
>>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It 
>>> is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people are 
>>> concerned?
>>>
>>> 
>>> From: Steve Loughran 
>>> Sent: Sunday, November 17, 2019 9:22:09 AM
>>> To: Cheng Lian 
>>> Cc: Sean Owen ; Wenchen Fan ; 
>>> Dongjoon Hyun ; dev ; Yuming 
>>> Wang 
>>> Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which 
>>> spark has historically bundled (the org.spark-project one) is an orphan 
>>> project put together to deal with Hive's shading issues and a source of 
>>> unhappiness in the Hive project. What ever get shipped should do its best 
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest 
>>> move from a risk minimisation perspective. If something has broken then it 
>>> is you can start with the assumption that it is in the o.a.s packages 
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if 
>>> there are problems with the hadoop / hive dependencies those teams will 
>>> inevitably ignore filed bug reports for the same reason spark team will 
>>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the 
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that 
>>> in mind. It's not been tested, it has dependencies on artifacts we know are 
>>> incompatible, and as far as the Hadoop project is concerned: people should 
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts (a) 
>>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x. 
>>> That way people doing things with their own projects will get up-to-date 
>>> dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever" 
>>> branch-2 release and then declare its predecessors EOL; 2.10 will be the 
>>> transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I 
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which 
>>> seemed risky, and therefore we only introduced Hive 2.3 under the 
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong 
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that 
>>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not 
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11 
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>>>
>>> I'd pref

Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Mridul Muralidharan
It makes more sense to drop support for zstd, assuming the fix is not
something at Spark's end (configuration, etc.).
It does not make sense to try to detect deadlocks in the codec.
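
In the meantime, a sketch of the obvious workaround: keep event log compression but switch the event log codec away from zstd (using the configs Jungtaek listed):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.compress", "true")
    .set("spark.eventLog.compression.codec", "lz4")   // any codec other than zstd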

Regards,
Mridul

On Tue, Oct 1, 2019 at 8:39 PM Jungtaek Lim
 wrote:
>
> Hi devs,
>
> I've discovered an issue with event logger, specifically reading incomplete 
> event log file which is compressed with 'zstd' - the reader thread got stuck 
> on reading that file.
>
> This is very easy to reproduce: setting configuration as below
>
> - spark.eventLog.enabled=true
> - spark.eventLog.compress=true
> - spark.eventLog.compression.codec=zstd
>
> and start Spark application. While the application is running, load the 
> application in SHS webpage. It may succeed to replay the event log, but high 
> likely it will be stuck and loading page will be also stuck.
>
> Please refer SPARK-29322 for more details.
>
> As the issue only occurs with 'zstd', the simplest approach is dropping 
> support of 'zstd' for event log. More general approach would be introducing 
> timeout on reading event log file, but it should be able to differentiate 
> thread being stuck vs thread busy with reading huge event log file.
>
> Which approach would be preferred in Spark community, or would someone 
> propose better ideas for handling this?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Mridul Muralidharan
Add a +1 from me as well.
Just managed to finish going over it.

Thanks, Bobby, for leading this effort!

Regards,
Mridul

On Wed, May 29, 2019 at 2:51 PM Tom Graves  wrote:
>
> Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 
> binding) and 1 +0 and no -1's.
>
> Tom
>
> On Monday, May 27, 2019, 3:25:14 PM CDT, Felix Cheung 
>  wrote:
>
>
> +1
>
> I’d prefer to see more of the end goal and how that could be achieved (such 
> as ETL or SPARK-24579). However given the rounds and months of discussions we 
> have come down to just the public API.
>
> If the community thinks a new set of public API is maintainable, I don’t see 
> any problem with that.
>
> 
> From: Tom Graves 
> Sent: Sunday, May 26, 2019 8:22:59 AM
> To: hol...@pigscanfly.ca; Reynold Xin
> Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
> Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> Processing Support
>
> More feedback would be great, this has been open a long time though, let's 
> extend til Wednesday the 29th and see where we are at.
>
> Tom
>
>
>
> Sent from Yahoo Mail on Android
>
> On Sat, May 25, 2019 at 6:28 PM, Holden Karau
>  wrote:
> Same I meant to catch up after kubecon but had some unexpected travels.
>
> On Sat, May 25, 2019 at 10:56 PM Reynold Xin  wrote:
>
> Can we push this to June 1st? I have been meaning to read it but 
> unfortunately keeps traveling...
>
> On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
>
> +1 on exposing the APIs for columnar processing support.
>
> I understand that the scope of this SPIP doesn't cover AI / ML
> use-cases. But I saw a good performance gain when I converted data
> from rows to columns to leverage on SIMD architectures in a POC ML
> application.
>
> With the exposed columnar processing support, I can imagine that the
> heavy lifting parts of ML applications (such as computing the
> objective functions) can be written as columnar expressions that
> leverage on SIMD architectures to get a good speedup.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
> >
> > It would allow for the columnar processing to be extended through the 
> > shuffle.  So if I were doing say an FPGA accelerated extension it could 
> > replace the ShuffleExechangeExec with one that can take a ColumnarBatch as 
> > input instead of a Row. The extended version of the ShuffleExchangeExec 
> > could then do the partitioning on the incoming batch and instead of 
> > producing a ShuffleRowRDD for the exchange they could produce something 
> > like a ShuffleBatchRDD that would let the serializing and deserializing 
> > happen in a column based format for a faster exchange, assuming that 
> > columnar processing is also happening after the exchange. This is just like 
> > providing a columnar version of any other catalyst operator, except in this 
> > case it is a bit more complex of an operator.
> >
> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid 
> >  wrote:
> >>
> >> sorry I am late to the discussion here -- the jira mentions using this 
> >> extensions for dealing with shuffles, can you explain that part?  I don't 
> >> see how you would use this to change shuffle behavior at all.
> >>
> >> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
> >>>
> >>> Thanks for replying, I'll extend the vote til May 26th to allow your
> >>> and other people feedback who haven't had time to look at it.
> >>>
> >>> Tom
> >>>
> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
> >>> >
> >>> > I’d like to ask this vote period to be extended, I’m interested but I 
> >>> > don’t have the cycles to review it in detail and make an informed vote 
> >>> > until the 25th.
> >>> >
> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  
> >>> > wrote:
> >>> >>
> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
> >>> >> feel strongly about it. I would still suggest doing the following:
> >>> >>
> >>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick 
> >>> >> check. Beside ColumnarBatch and ColumnarVector, we also need to make 
> >>> >> the following public. People who are familiar with SQL internals 
> >>> >> should help assess the risk.
> >>> >> * ColumnarArray
> >>> >> * ColumnarMap
> >>> >> * unsafe.types.CaledarInterval
> >>> >> * ColumnarRow
> >>> >> * UTF8String
> >>> >> * ArrayData
> >>> >> * ...
> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match 
> >>> >> the purpose of this SPIP. It does make some code cleaner. But I guess 
> >>> >> for ETL use cases, it won't bri

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-05-08 Thread Mridul Muralidharan
Unfortunately I do not have the bandwidth to do a detailed review, but a few
things come to mind after a quick read:

- While it might be tactically beneficial to align with the existing
implementation, a clean design which does not tie into the existing shuffle
implementation would be preferable (if it can be done without
over-engineering). The shuffle implementation can change, and there are custom
implementations and experiments which differ quite a bit from what comes
with Apache Spark.


- Please keep speculative execution in mind while designing the interfaces:
in Spark, implicitly due to task scheduler logic, you won't have conflicts
at an executor for the (shuffleId, mapId) and (shuffleId, mapId, reducerId)
tuples.
When you externalize storage, there can be conflicts: passing a way to
distinguish different task attempts for the same partition would be necessary
for nontrivial implementations (a sketch follows below).


This would be a welcome and much-needed enhancement to Spark - looking
forward to its progress!
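
Not the proposed API itself - just a sketch of the kind of key I think an external storage backend needs so that two attempts of the same map task (e.g. a speculative copy) can be told apart (all names here are hypothetical):

  case class ShuffleBlockKey(
      shuffleId: Int,
      mapId: Int,
      reduceId: Int,
      attemptNumber: Int)   // the extra piece that matters once storage is externalized

  trait ShuffleBlockWriter {
    def write(key: ShuffleBlockKey, data: Array[Byte]): Unit
    def commit(key: ShuffleBlockKey): Unit   // make this attempt's output visible
    def abort(key: ShuffleBlockKey): Unit    // discard the output of a losing attempt
  }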


Regards,
Mridul



On Wed, May 8, 2019 at 11:24 AM Yifei Huang (PD) 
wrote:

> Hi everyone,
>
> For the past several months, we have been working on an API for pluggable
> storage of shuffle data. In this SPIP, we describe the proposed API, its
> implications, and how it fits into other work being done in the Spark
> shuffle space. If you're interested in Spark shuffle, and especially if you
> have done some work in this area already, please take a look at the SPIP
> and give us your thoughts and feedback.
>
> Jira Ticket: https://issues.apache.org/jira/browse/SPARK-25299
> SPIP:
> https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit
>
> Thank you!
>
> Yifei Huang and Matt Cheah
>
>
>


Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mridul Muralidharan
  I am -1 on this vote, for pretty much all the reasons that Mark mentioned.
A major version change gives us an opportunity to remove deprecated
interfaces, stabilize experimental/developer APIs, drop support for
outdated functionality/platforms, and evolve the project with a vision
for the foreseeable future.
IMO the primary focus should be on interface evolution, stability, and
lowering tech debt, which might result in breaking changes.

Which is not to say DSv2 should not be part of 3.0.
Along with a lot of other exciting features also being added, it can
be one more important enhancement.

But I am not for delaying the release simply to accommodate a specific feature.
Features can be added in subsequent releases as well - I have yet to hear a
good reason why it must make it into 3.0, to the point of needing a VOTE thread.

Regards,
Mridul

On Thu, Feb 28, 2019 at 10:44 AM Mark Hamstra  wrote:
>
> I agree that adding new features in a major release is not forbidden, but 
> that is just not the primary goal of a major release. If we reach the point 
> where we are happy with the new public API before some new features are in a 
> satisfactory state to be merged, then I don't want there to be a prior 
> presumption that we cannot complete the primary goal of the major release. If 
> at that point you want to argue that it is worth waiting for some new 
> feature, then that would be fine and may have sufficient merits to warrant 
> some delay.
>
> Regardless of whether significant new public API comes into a major release 
> or a feature release, it should come in with an experimental annotation so 
> that we can make changes without requiring a new major release.
>
> If you want to argue that some new features that are currently targeting 
> 3.0.0 are significant enough that one or more of them should justify an 
> accelerated 3.1.0 release schedule if it is not ready in time for the 3.0.0 
> release, then I can much more easily get behind that kind of commitment; but 
> I remain opposed to the notion of promoting any new features to the status of 
> blockers of 3.0.0 at this time.
>
> On Thu, Feb 28, 2019 at 10:23 AM Ryan Blue  wrote:
>>
>> Mark, I disagree. Setting common goals is a critical part of getting things 
>> done.
>>
>> This doesn't commit the community to push out the release if the goals 
>> aren't met, but does mean that we will, as a community, seriously consider 
>> it. This is also an acknowledgement that this is the most important feature 
>> in the next release (whether major or minor) for many of us. This has been 
>> in limbo for a very long time, so I think it is important for the community 
>> to commit to getting it to a functional state.
>>
>> It sounds like your objection is to this commitment for 3.0, but remember 
>> that 3.0 is the next release so that we can remove deprecated APIs. It does 
>> not mean that we aren't adding new features in that release and aren't 
>> considering other goals.
>>
>> On Thu, Feb 28, 2019 at 10:12 AM Mark Hamstra  
>> wrote:
>>>
>>> Then I'm -1. Setting new features as blockers of major releases is not 
>>> proper project management, IMO.
>>>
>>> On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue  wrote:

 Mark, if this goal is adopted, "we" is the Apache Spark community.

 On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra  
 wrote:
>
> Who is "we" in these statements, such as "we should consider a functional 
> DSv2 implementation a blocker for Spark 3.0"? If it means those 
> contributing to the DSv2 effort want to set their own goals, milestones, 
> etc., then that is fine with me. If you mean that the Apache Spark 
> project should officially commit to the lack of a functional DSv2 
> implementation being a blocker for the release of Spark 3.0, then I'm -1. 
> A major release is just not about adding new features. Rather, it is 
> about making changes to the existing public API. As such, I'm opposed to 
> any new feature or any API addition being considered a blocker of the 
> 3.0.0 release.
>
>
> On Thu, Feb 28, 2019 at 9:09 AM Matt Cheah  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> Are identifiers and namespaces going to be rolled under one of those six 
>> points?
>>
>>
>>
>> From: Ryan Blue 
>> Reply-To: "rb...@netflix.com" 
>> Date: Thursday, February 28, 2019 at 8:39 AM
>> To: Spark Dev List 
>> Subject: [VOTE] Functional DataSourceV2 in Spark 3.0
>>
>>
>>
>> I’d like to call a vote for committing to getting DataSourceV2 in a 
>> functional state for Spark 3.0.
>>
>> For more context, please see the discussion thread, but here is a quick 
>> summary about what this commitment means:
>>
>> · We think that a “functional DSv2” is an achievable goal for 
>> the Spark 3.0 release
>>
>> · We will consider this a blocker for Spark 3.0, and take 
>> reasonable steps to make it happen
>

[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-23 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750477#comment-16750477
 ] 

Mridul Muralidharan commented on SPARK-26688:
-

If this is a legitimate use case, we should get the YARN team to enhance node label 
support.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748915#comment-16748915
 ] 

Mridul Muralidharan commented on SPARK-26688:
-

What is the use case for this?
As others have mentioned on the mailing list, node labels are the typical way to 
prevent (or require) allocation requests to satisfy some constraint.
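
For reference, the node-label route I have in mind, as a sketch (label names are placeholders and must already be configured on the YARN cluster):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.yarn.am.nodeLabelExpression", "spark_am")
    .set("spark.yarn.executor.nodeLabelExpression", "healthy_nodes")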

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Automated formatting

2018-11-22 Thread Mridul Muralidharan
Is this handling only Scala, or Java as well?

Regards,
Mridul

On Thu, Nov 22, 2018 at 9:11 AM Cody Koeninger  wrote:

> Plugin invocation is ./build/mvn mvn-scalafmt_2.12:format
>
> It takes about 5 seconds, and errors out on the first different file
> that doesn't match formatting.
>
> I made a shell wrapper so that contributors can just run
>
> ./dev/scalafmt
>
> to actually format in place the files that have changed (or pass
> through commandline args if they want to do something different)
>
> On Wed, Nov 21, 2018 at 3:36 PM Sean Owen  wrote:
> >
> > I know the PR builder runs SBT, but I presume this would just be a
> > separate mvn job that runs. If it doesn't take long and only checks
> > the right diff, seems worth a shot. What's the invocation that Shane
> > could add (after this change goes in)
> > On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger 
> wrote:
> > >
> > > There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
> > > should be runnable from the PR builder
> > >
> > > Super basic example with a minimal config that's close to current
> > > style guide here:
> > >
> > > https://github.com/apache/spark/compare/master...koeninger:scalafmt
> > >
> > > I imagine tracking down the corner cases in the config, especially
> > > around interactions with scalastyle, may take a bit of work.  Happy to
> > > do it, but not if there's significant concern about style related
> > > changes in PRs.
> > > On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
> > > >
> > > > Yeah fair, maybe mostly consistent in broad strokes but not in the
> details.
> > > > Is this something that can be just run in the PR builder? if the
> rules
> > > > are simple and not too hard to maintain, seems like a win.
> > > > On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger 
> wrote:
> > > > >
> > > > > Definitely not suggesting a mass reformat, just on a per-PR basis.
> > > > >
> > > > > scalafmt --diff  will reformat only the files that differ from git
> head
> > > > > scalafmt --test --diff won't modify files, just throw an exception
> if
> > > > > they don't match format
> > > > >
> > > > > I don't think code is consistently formatted now.
> > > > > I tried scalafmt on the most recent PR I looked at, and it caught
> > > > > stuff as basic as newlines before curly brace in existing code.
> > > > > I've had different reviewers for PRs that were literal backports or
> > > > > cut & paste of each other come up with different formatting nits.
> > > > >
> > > > >
> > > > > On Wed, Nov 21, 2018 at 12:03 PM Sean Owen 
> wrote:
> > > > > >
> > > > > > I think reformatting the whole code base might be too much. If
> there
> > > > > > are some more targeted cleanups, sure. We do have some links to
> style
> > > > > > guides buried somewhere in the docs, although the conventions are
> > > > > > pretty industry standard.
> > > > > >
> > > > > > I *think* the code is pretty consistently formatted now, and
> would
> > > > > > expect contributors to follow formatting they see, so ideally the
> > > > > > surrounding code alone is enough to give people guidance. In
> practice,
> > > > > > we're always going to have people format differently no matter
> what I
> > > > > > think so it's inevitable.
> > > > > >
> > > > > > Is there a way to just check style on PR changes? that's fine.
> > > > > > On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger <
> c...@koeninger.org> wrote:
> > > > > > >
> > > > > > > Is there any appetite for revisiting automating formatting?
> > > > > > >
> > > > > > > I know over the years various people have expressed opposition
> to it
> > > > > > > as unnecessary churn in diffs, but having every new contributor
> > > > > > > greeted with "nit: 4 space indentation for argument lists"
> isn't very
> > > > > > > welcoming.
> > > > > > >
> > > > > > >
> -
> > > > > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > > > > >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651301#comment-16651301
 ] 

Mridul Muralidharan commented on SPARK-25732:
-

[~vanzin] With long-running applications (not necessarily streaming) needing 
access (read/write) to various data sources (not just HDFS), is there a way to 
do this even assuming the Livy RPC was augmented to support it? For example, the Livy 
server would not know which data sources to fetch tokens for (since that will 
be part of the user application's jars/config).

For the specific use case [~mgaido] detailed, the proxy principal (foo)/keytab would 
be present and distinct from the Zeppelin or Livy principal/keytab.
The 'proxy' part would simply be for Livy to submit the application as the 
proxied user 'foo' - once the application comes up, it will behave as though it was 
submitted by the user 'foo' with the specified keytab (from HDFS) - acquiring/renewing 
tokens for user 'foo' from its keytab.


[~tgraves] I do share your concern; unfortunately, for the use case Marco is 
targeting, there does not seem to be an alternative; the Livy server is the man in the 
middle here (w.r.t. the submitting client).

Having said that, if there is an alternative, I would definitely prefer that 
over sharing keytabs - even if it is over secured HDFS.
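
For context, a rough sketch of the impersonation path a server like Livy takes today via Hadoop's UGI - it covers submission as 'foo', but not the keytab 'foo' would need later for token renewal:

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  val proxyUgi = UserGroupInformation.createProxyUser(
    "foo", UserGroupInformation.getLoginUser())   // impersonate user 'foo'

  proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      // submit the application as user 'foo'; any delegation tokens obtained
      // here eventually expire unless something renews them with foo's keytab
    }
  })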


> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, application submitted with proxy-user fail after 2 week due to the 
> lack of token renewal. In order to enable it, we need the the 
> keytab/principal of the impersonated user to be specified, in order to have 
> them available for the token renewal.
> This JIRA proposes to add two parameters {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, and the last letting a keytab being specified also 
> in a distributed FS, so that applications can be submitted by servers (eg. 
> Livy, Zeppelin) without needing all users' principals being on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25594) OOM in long running applications even with UI disabled

2018-10-02 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635122#comment-16635122
 ] 

Mridul Muralidharan commented on SPARK-25594:
-

Task-level information is required only when the Spark UI is enabled, and it tends to 
be expensive to maintain (TaskDataWrapper was occupying the bulk of the heap at 
OOM).
We can avoid maintaining task information when it is a live UI (not the history 
server) and the Spark UI has been disabled (spark.ui.enabled = false).
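
A sketch of how these long-running applications are typically configured - the combination the fix should respect:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.ui.enabled", "false")        // no live UI for this application
    .set("spark.eventLog.enabled", "true")   // the history server can still be used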

> OOM in long running applications even with UI disabled
> --
>
> Key: SPARK-25594
> URL: https://issues.apache.org/jira/browse/SPARK-25594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>    Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Major
>
> Typically for long running applications with large number of tasks it is 
> common to disable UI to minimize overhead at driver.
> Earlier, with spark ui disabled, only stage/job information was kept as part 
> of JobProgressListener.
> As part of history server scalability fixes, particularly SPARK-20643, 
> inspite of disabling UI - task information continues to be maintained in 
> memory.
> In our long running tests against spark thrift server, this eventually 
> results in OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25594) OOM in long running applications even with UI disabled

2018-10-02 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-25594:
---

 Summary: OOM in long running applications even with UI disabled
 Key: SPARK-25594
 URL: https://issues.apache.org/jira/browse/SPARK-25594
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.4.0
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan


Typically, for long running applications with a large number of tasks, it is common 
to disable the UI to minimize overhead at the driver.
Earlier, with the Spark UI disabled, only stage/job information was kept as part of 
JobProgressListener.

As part of the history server scalability fixes, particularly SPARK-20643, in spite 
of disabling the UI, task information continues to be maintained in memory.
In our long running tests against the Spark thrift server, this eventually results 
in OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: data source api v2 refactoring

2018-09-01 Thread Mridul Muralidharan
Is it only me, or are all others getting Wenchen’s mails? (Obviously Ryan
did :-) )
I did not see it in the mail thread I received or in the archives ... [1]
Wondering which other senders were getting dropped (if yes).

Regards
Mridul

[1]
http://apache-spark-developers-list.1001551.n3.nabble.com/data-source-api-v2-refactoring-td24848.html


On Sat, Sep 1, 2018 at 8:58 PM Ryan Blue  wrote:

> Thanks for clarifying, Wenchen. I think that's what I expected.
>
> As for the abstraction, here's the way that I think about it: there are
> two important parts of a scan: the definition of what will be read, and
> task sets that actually perform the read. In batch, there's one definition
> of the scan and one task set so it makes sense that there's one scan object
> that encapsulates both of these concepts. For streaming, we need to
> separate the two into the definition of what will be read (the stream or
> streaming read) and the task sets that are run (scans). That way, the
> streaming read behaves like a factory for scans, producing scans that
> handle the data either in micro-batches or using continuous tasks.
>
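A rough sketch of the levels described above - the trait and method names are
placeholders for discussion, not the actual DataSourceV2 interfaces:

trait ReadTask                                    // one partition's worth of reading
trait ScanConfig                                  // "what will be read", incl. pushdown result
trait ScanConfigBuilder { def build(): ScanConfig }

trait Scan {                                      // a task set that performs the read
  def planTasks(): Seq[ReadTask]
}

trait Stream {                                    // the streaming read: a factory for scans,
  def nextScan(): Scan                            // one per micro-batch, or a fresh continuous
}                                                 // scan when the source is reconfigured

trait Table {
  def newScanConfigBuilder(): ScanConfigBuilder   // pushdown is negotiated here
  def newBatchScan(config: ScanConfig): Scan      // batch: one definition -> one scan
  def newStream(config: ScanConfig): Stream       // streaming: definition -> factory of scans
}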
> To address Jungtaek's question, I think that this does work with
> continuous. In continuous mode, the query operators keep running and send
> data to one another directly. The API still needs a streaming read layer
> because it may still produce more than one continuous scan. That would
> happen when the underlying source changes and Spark needs to reconfigure. I
> think the example here is when partitioning in a Kafka topic changes and
> Spark needs to re-map Kafka partitions to continuous tasks.
>
> rb
>
> On Fri, Aug 31, 2018 at 5:12 PM Wenchen Fan 
> wrote:
>
>> Hi Ryan,
>>
>> Sorry, I may have used the wrong wording. The pushdown is done with ScanConfig,
>> which is not table/stream/scan, but something between them. The table
>> creates the ScanConfigBuilder, and the table creates the stream/scan with the
>> ScanConfig. For streaming sources, the stream is the one to take care of the
>> pushdown result. For batch sources, it's the scan.
>>
>> It's a little tricky because the stream is an abstraction for streaming
>> sources only. Better ideas are welcome!
>>
>
>> On Sat, Sep 1, 2018 at 7:26 AM Ryan Blue  wrote:
>>
>>> Thanks, Reynold!
>>>
>>> I think your API sketch looks great. I appreciate having the Table level
>>> in the abstraction to plug into as well. I think this makes it clear what
>>> everything does, particularly having the Stream level that represents a
>>> configured (by ScanConfig) streaming read and can act as a factory for
>>> individual batch scans or for continuous scans.
>>>
>>> Wenchen, I'm not sure what you mean by doing pushdown at the table
>>> level. It seems to mean that pushdown is specific to a batch scan or
>>> streaming read, which seems to be what you're saying as well. Wouldn't the
>>> pushdown happen to create a ScanConfig, which is then used as Reynold
>>> suggests? Looking forward to seeing this PR when you get it posted. Thanks
>>> for all of your work on this!
>>>
>>> rb
>>>
>>> On Fri, Aug 31, 2018 at 3:52 PM Wenchen Fan 
>>> wrote:
>>>
 Thank Reynold for writing this and starting the discussion!

 Data source v2 was started with batch only, so we didn't pay much
 attention to the abstraction and just follow the v1 API. Now we are
 designing the streaming API and catalog integration, the abstraction
 becomes super important.

 I like this proposed abstraction and have successfully prototyped it to
 make sure it works.

 During prototyping, I have to work around the issue that the current
 streaming engine does query optimization/planning for each micro batch.
 With this abstraction, the operator pushdown is only applied once
 per-query. In my prototype, I do the physical planning up front to get the
 pushdown result, and
 add a logical linking node that wraps the resulting physical plan node
 for the data source, and then swap that logical linking node into the
 logical plan for each batch. In the future we should just let the streaming
 engine do query optimization/planning only once.

 About pushdown, I think we should do it at the table level. The table
 should create a new pushdown handler to apply operator pushdown for each
 scan/stream, and create the scan/stream with the pushdown result. The
 rationale is, a table should have the same pushdown behavior regardless of the
 scan node.

 Thanks,
 Wenchen





 On Fri, Aug 31, 2018 at 2:00 PM Reynold Xin 
 wrote:

> I spent some time last week looking at the current data source v2
> apis, and I thought we should be a bit more buttoned up in terms of the
> abstractions and the guarantees Spark provides. In particular, I feel we
> need the following levels of "abstractions", to fit the use cases in 
> Spark,
> from batch, to streaming.
>
> Please don't focus on the naming at this s

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-29 Thread Mridul Muralidharan
+1
I left a couple of comments in NiharS's PR, but this is very useful to
have in Spark!

Regards,
Mridul
On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
 wrote:
>
> I'd like to propose adding a plugin api for Executors, primarily for 
> instrumentation and debugging 
> (https://issues.apache.org/jira/browse/SPARK-24918).  The changes are small, 
> but as its adding a new api, it might be spip-worthy.  I mentioned it as well 
> in a recent email I sent about memory monitoring
>
> The spip proposal is here (and attached to the jira as well): 
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>
> There are already some comments on the jira and pr, and I hope to get more 
> thoughts and opinions on it.
>
> thanks,
> Imran
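For concreteness, a minimal sketch of what such a plugin might look like under
this proposal - the interface name, lifecycle methods and config key follow the
SPIP discussion and may differ from what finally lands:

// Sketch only: a skeleton instrumentation plugin loaded into each executor JVM.
class MemoryMonitorPlugin extends org.apache.spark.ExecutorPlugin {
  override def init(): Unit = {
    // start a polling thread, register OOM hooks, etc.
  }
  override def shutdown(): Unit = {
    // stop the poller and flush anything buffered
  }
}
// enabled via e.g. --conf spark.executor.plugins=com.example.MemoryMonitorPlugin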

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Resolved] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-06 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-24948.
-
Resolution: Fixed

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.4.0
>
>
> SHS filters out the event logs it doesn't have permissions to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, other permissions). For instance, if ACLs 
> are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. they can consider spark as a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
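For illustration, one way to avoid re-deriving the POSIX bits is to delegate the 
check to the filesystem itself - a sketch only, not necessarily what the fix for 
this ticket does:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Ask the filesystem (which understands ACLs and FS-specific policies such as
// superusers) whether the SHS user can read the event log, instead of checking
// only the user/group/other bits.
def canRead(fs: FileSystem, eventLog: Path): Boolean =
  try { fs.access(eventLog, FsAction.READ); true }
  catch { case _: AccessControlException => false }
{code}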



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-06 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-24948:

Fix Version/s: 2.4.0

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.4.0
>
>
> SHS filters out the event logs it doesn't have permissions to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, other permissions). For instance, if ACLs 
> are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. they can consider spark as a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Mridul Muralidharan
A Spark user’s expectation would be that any closure which worked in 2.11
will continue to work in 2.12 (exhibiting the same behavior w.r.t. functionality,
serializability, etc).
If there are behavioral changes, we will need to understand what they are -
but the expectation would be that they require minimal (if any) source changes for
users/libraries - requiring otherwise would be very detrimental to adoption.

Do we know the root cause here? I am not sure how well we test the
corner cases in the cleaner - if this was not caught by the suite, perhaps we should
augment it ...

Regards
Mridul

On Mon, Aug 6, 2018 at 1:08 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Closure cleaner's initial purpose AFAIK is to clean the dependencies
> brought in with outer pointers (compiler's side effect). With LMFs in
> Scala 2.12 there are no outer pointers, that is why in the new design
> document we kept the implementation minimal focusing on the return
> statements (it was intentional). Also the majority of the generated
> closures AFAIK are of type LMF.
> Regarding references in the LMF body that was not part of the doc since we
> expect the user not to point to non-serializable objects etc.
> In all these cases you know you are adding references you shouldn't.
> If users were used to another UX we can try fix it, not sure how well this
> worked in the past though and if covered all cases.
>
> Regards,
> Stavros
>
> On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
> wrote:
>
>> I agree, we should not work around the testcase but rather understand
>> and fix the root cause.
>> Closure cleaner should have null'ed out the references and allowed it
>> to be serialized.
>>
>> Regards,
>> Mridul
>>
>> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
>> >
>> > It seems to me that the closure cleaner fails to clean up something.
>> The failed test case defines a serializable class inside the test case, and
>> the class doesn't refer to anything in the outer class. Ideally it can be
>> serialized after cleaning up the closure.
>> >
>> > This is somehow a very weird way to define a class, so I'm not sure how
>> serious the problem is.
>> >
>> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>> >>
>> >> Makes sense, not sure if closure cleaning is related to the last one
>> for example or others. The last one is a bit weird, unless I am missing
>> something about the LegacyAccumulatorWrapper logic.
>> >>
>> >> Stavros
>> >>
>> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
>> >>>
>> >>> Yep that's what I did. There are more failures with different
>> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
>> changes are all reasonable, and not an artifact of missing something about
>> closure cleaning in 2.12.
>> >>>
>> >>> In the meantime having a 2.12 build up and running for master will
>> just help catch these things.
>> >>>
>> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>> >>>>
>> >>>> Hi Sean,
>> >>>>
>> >>>> I run a quick build so the failing tests seem to be:
>> >>>>
>> >>>> - SPARK-17644: After one stage is aborted for too many failed
>> attempts, subsequent stagesstill behave correctly on fetch failures ***
>> FAILED ***
>> >>>>   A job with one fetch failure should eventually succeed
>> (DAGSchedulerSuite.scala:2422)
>> >>>>
>> >>>>
>> >>>> - LegacyAccumulatorWrapper with AccumulatorParam that has no
>> equals/hashCode *** FAILED ***
>> >>>>   java.io.NotSerializableException:
>> org.scalatest.Assertions$AssertionsHelper
>> >>>> Serialization stack:
>> >>>> - object not serializable (class:
>> org.scalatest.Assertions$AssertionsHelper, value:
>> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
>> >>>>
>> >>>>
>> >>>> The last one can be fixed easily if you set class `MyData(val i:
>> Int) extends Serializable `outside of the test suite. For some reason
>> outers (not removed) are capturing
>> >>>> the Scalatest stuff in 2.12.
>> >>>>
>> >>>> Let me know if we see the same failures.
>> >>>>
>> >>>> Stavros
>> >>>>
>> >>>> On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
>> >>>>>
>> >>>>> Shane et al - could we get a test job in Jenkins to test the Scala
>> 2.12 build? I don't think I have the access or expertise for it, though I
>> could probably copy and paste a job. I think we just need to clone the,
>> say, master Maven Hadoop 2.7 job, and add two steps: run
>> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
>> profiles that are enabled.
>> >>>>>
>> >>>>> I can already see two test failures for the 2.12 build right now
>> and will try to fix those, but this should help verify whether the failures
>> are 'real' and detect them going forward.
>> >>>>>
>> >>>>>
>> >>>>
>> >>
>> >>
>> >>
>>
>
>
>
>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-05 Thread Mridul Muralidharan
I agree, we should not work around the test case but rather understand
and fix the root cause.
The closure cleaner should have nulled out the references and allowed it
to be serialized.

Regards,
Mridul

On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
>
> It seems to me that the closure cleaner fails to clean up something. The 
> failed test case defines a serializable class inside the test case, and the 
> class doesn't refer to anything in the outer class. Ideally it can be 
> serialized after cleaning up the closure.
>
> This is somehow a very weird way to define a class, so I'm not sure how 
> serious the problem is.
>
> On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos 
>  wrote:
>>
>> Makes sense, not sure if closure cleaning is related to the last one for 
>> example or others. The last one is a bit weird, unless I am missing 
>> something about the LegacyAccumulatorWrapper logic.
>>
>> Stavros
>>
>> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
>>>
>>> Yep that's what I did. There are more failures with different resolutions. 
>>> I'll open a JIRA and PR and ping you, to make sure that the changes are all 
>>> reasonable, and not an artifact of missing something about closure cleaning 
>>> in 2.12.
>>>
>>> In the meantime having a 2.12 build up and running for master will just 
>>> help catch these things.
>>>
>>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos 
>>>  wrote:

 Hi Sean,

 I run a quick build so the failing tests seem to be:

 - SPARK-17644: After one stage is aborted for too many failed attempts, 
 subsequent stagesstill behave correctly on fetch failures *** FAILED ***
   A job with one fetch failure should eventually succeed 
 (DAGSchedulerSuite.scala:2422)


 - LegacyAccumulatorWrapper with AccumulatorParam that has no 
 equals/hashCode *** FAILED ***
   java.io.NotSerializableException: 
 org.scalatest.Assertions$AssertionsHelper
 Serialization stack:
 - object not serializable (class: 
 org.scalatest.Assertions$AssertionsHelper, value: 
 org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)


 The last one can be fixed easily if you set class `MyData(val i: Int) 
 extends Serializable `outside of the test suite. For some reason outers 
 (not removed) are capturing
 the Scalatest stuff in 2.12.

 Let me know if we see the same failures.

 Stavros
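For anyone following along, the workaround described above amounts to something
like this (a sketch, not the actual suite code):

// Defined at the top level: no outer pointer, so instances serialize cleanly.
class MyData(val i: Int) extends Serializable

class MySuite extends org.scalatest.FunSuite {
  // Defining MyData in here (inside the suite or the test body) can leave an
  // outer reference to the suite, dragging org.scalatest.Assertions$AssertionsHelper
  // into the serialized graph and failing with NotSerializableException on 2.12.
  test("LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode") {
    val d = new MyData(1)   // fine now that MyData lives at the top level
  }
}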

 On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
>
> Shane et al - could we get a test job in Jenkins to test the Scala 2.12 
> build? I don't think I have the access or expertise for it, though I 
> could probably copy and paste a job. I think we just need to clone the, 
> say, master Maven Hadoop 2.7 job, and add two steps: run 
> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to 
> the profiles that are enabled.
>
> I can already see two test failures for the 2.12 build right now and will 
> try to fix those, but this should help verify whether the failures are 
> 'real' and detect them going forward.
>
>

>>
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Comment Edited] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-08-04 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569363#comment-16569363
 ] 

Mridul Muralidharan edited comment on SPARK-24375 at 8/5/18 4:42 AM:
-

{quote} We've thought hard on the issue and don't feel we can make it unless we 
force users to explicitly set a number in a barrier() call (actually it's not a 
good idea because it brings more burden to manage the code).{quote}

I am not sure where the additional burden exists.

Make it an optional param to barrier:
* If not defined, it would be analogous to what exists right now.
* If specified, fail the stage if different tasks in the stage end up waiting on 
different barrier names (or some have a name and others don't).

In the example use cases I have seen, there are usually partition-specific code 
paths (if partition 0, do some initialization/teardown, etc.) - which results in 
divergent code paths, and so increases the potential for this issue.

It will be very difficult to reason about the state when this happens.


was (Author: mridulm80):
{quote} We've thought hard on the issue and don't feel we can make it unless we 
force users to explicitly set a number in a barrier() call (actually it's not a 
good idea because it brings more borden to manage the code).{quote}

I am not sure where the additional burden exists.

Make it an optional param to barrier.
* If not defined, it would be analogous to what exists right now.
* If specified, fail the stage if different tasks in stage end up waiting on 
different barrier names (or some have a name and others dont).

In example usecases I have seen, there is usually partition specific code paths 
(if partition 0, do some initialization/teardown, etc) - which results in 
divergent codepaths : and so increases potential for this issue.

It will be very difficult to reason about the state what happens.

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-08-04 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569363#comment-16569363
 ] 

Mridul Muralidharan commented on SPARK-24375:
-

{quote} We've thought hard on the issue and don't feel we can make it unless we 
force users to explicitly set a number in a barrier() call (actually it's not a 
good idea because it brings more burden to manage the code).{quote}

I am not sure where the additional burden exists.

Make it an optional param to barrier:
* If not defined, it would be analogous to what exists right now.
* If specified, fail the stage if different tasks in the stage end up waiting on 
different barrier names (or some have a name and others don't).

In the example use cases I have seen, there are usually partition-specific code 
paths (if partition 0, do some initialization/teardown, etc.) - which results in 
divergent code paths, and so increases the potential for this issue.

It will be very difficult to reason about the state when this happens.
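A rough sketch of what the optional parameter could look like - purely 
illustrative, barrier() takes no arguments today:

{code}
import scala.util.control.NonFatal
import org.apache.spark.BarrierTaskContext

// Illustrative only - NOT an existing API. With names, the scheduler can fail
// the stage fast when tasks of the same stage wait on differently named
// barriers, instead of letting them hang.
def runTask(context: BarrierTaskContext): Unit = {
  try {
    // ... snippet A ...
    context.barrier()       // proposed: context.barrier(name = Some("after-A"))
    // ... snippet B ...
  } catch {
    case NonFatal(_) =>     // a failing task skips "after-A" entirely
  }
  // ... snippet C ...
  context.barrier()         // proposed: context.barrier(name = Some("after-C"))
}
{code}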

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-08-03 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568664#comment-16568664
 ] 

Mridul Muralidharan commented on SPARK-24375:
-

{quote}
It's not desired behavior to catch exception thrown by TaskContext.barrier() 
silently. However, in case this really happens, we can detect that because we 
have `epoch` both in driver side and executor side, more details will go to the 
design doc of BarrierTaskContext.barrier() SPARK-24581
{quote}

The current 'barrier' function does not identify 'which' barrier it is from a 
user point of view.

Here, due to exceptions raised (not necessarily from barrier(), but could be 
from user code as well), different tasks are waiting on different barriers.

{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()
{code}

T1 waits on Barrier 1, while T2 could have raised an exception in snippet A and 
ends up waiting on Barrier 2 (having never seen Barrier 1).
In this scenario, how does Spark make progress?
(And of course, what happens when T1 reaches Barrier 2 after T2 has already moved 
past it.)
I did not see this clarified in the design or in the implementation.


> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-20 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551198#comment-16551198
 ] 

Mridul Muralidharan commented on SPARK-24615:
-

[~tgraves] This was indeed a recurring issue - the ability to modulate asks to the 
RM based on current requirements.
What you bring out is an excellent point - changing resource requirements would 
be very useful - particularly for applications with heterogeneous resource 
needs. Even currently, when executor_memory/executor_cores do not align well 
with stage requirements, we end up with OOMs - resulting in over-provisioned 
memory and hence suboptimal use. A GPU/accelerator-aware scheduler is 
an extension of the same idea - where we have other resources to consider.

I agree with [~tgraves] that a more general way to model this would look at all 
resources (when declaratively specified, of course) and use the information to 
allocate resources (from the RM) and for task scheduling (within Spark).
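To make the "declaratively specified" part concrete, the kind of specification 
being discussed could look like the following - the config keys are purely 
illustrative, not existing settings:

{code}
import org.apache.spark.SparkConf

// Hypothetical, for illustration only: a per-executor ask to the RM plus a
// per-task scheduling constraint, expressed declaratively.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "2")   // what to request from the RM
  .set("spark.task.resource.gpu.amount", "1")       // what the task scheduler must honor
{code}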



> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> The current Spark scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are present in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24755) Executor loss can cause task to not be resubmitted

2018-07-08 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536016#comment-16536016
 ] 

Mridul Muralidharan commented on SPARK-24755:
-

Go for it - thanks [~hthuynh2] !

> Executor loss can cause task to not be resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>    Reporter: Mridul Muralidharan
>Priority: Major
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, exec-1 if/when goes MIA.
> executorLost will no longer schedule task for P1 - since 
> killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343
> Essentially, SPARK-22074 causes a regression (which I dont usually observe 
> due to shuffle service, sigh) - and as such the fix is broken IMO.
> I dont have a PR handy for this, so if anyone wants to pick it up, please do 
> feel free !
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24755) Executor loss can cause task to not be resubmitted

2018-07-07 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-24755:

Description: 
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if task needs to be resubmitted for partition.

Consider following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being speculative task)

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
We also end up killing task T2.

Now, if/when exec-1 goes MIA, executorLost will no longer schedule a task for 
P1 - since killedByOtherAttempt(P1) == true; even though P1's only successful 
output was on exec-1 and there is no other copy of P1 around (T2 was killed when 
T1 succeeded).


I noticed this bug as part of reviewing PR# 21653 for SPARK-13343

Essentially, SPARK-22074 causes a regression (which I don't usually observe due 
to the shuffle service, sigh) - and as such the fix is broken IMO.

I don't have a PR handy for this, so if anyone wants to pick it up, please 
feel free!
+CC [~XuanYuan] who fixed SPARK-22074 initially.

  was:
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if task needs to be resubmitted for partition.

Consider following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being speculative task)

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
We also end up killing task T2.

Now, exec-1 if/when goes MIA.
executorLost will no longer schedule task for P1 - since 
killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there is 
no other copy of P1 around (T2 was killed when T1 succeeded).


I noticed this bug as part of reviewing PR# 21653 for SPARK-13343

Essentially, SPARK-22074 causes a regression (which I dont usually observe due 
to shuffle service, sigh) - and as such the fix is broken IMO : I believe it 
got introduced as part of the review (the original change looked fine to me - 
but I did not look at it in detail).

I dont have a PR handy for this, so if anyone wants to pick it up, please do 
feel free !
+CC [~XuanYuan] who fixed SPARK-22074 initially.


> Executor loss can cause task to not be resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, exec-1 if/when goes MIA.
> executorLost will no longer schedule task for P1 - since 
> killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343
> Essentially, SPARK-22074 causes a regression (which I dont usually observe 
> due to shuffle service, sigh) - and as such the fix is broken IMO.
> I dont have a PR handy for this, so if anyone wants to pick it up, please do 
> feel free !
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24755) Executor loss can cause task to not be resubmitted

2018-07-07 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-24755:

Summary: Executor loss can cause task to not be resubmitted  (was: Executor 
loss can cause task to be not resubmitted)

> Executor loss can cause task to not be resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>    Reporter: Mridul Muralidharan
>Priority: Major
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, exec-1 if/when goes MIA.
> executorLost will no longer schedule task for P1 - since 
> killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343
> Essentially, SPARK-22074 causes a regression (which I dont usually observe 
> due to shuffle service, sigh) - and as such the fix is broken IMO : I believe 
> it got introduced as part of the review (the original change looked fine to 
> me - but I did not look at it in detail).
> I dont have a PR handy for this, so if anyone wants to pick it up, please do 
> feel free !
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24755) Executor loss can cause task to be not resubmitted

2018-07-07 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-24755:

Description: 
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if task needs to be resubmitted for partition.

Consider following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being speculative task)

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
We also end up killing task T2.

Now, if/when exec-1 goes MIA, executorLost will no longer schedule a task for 
P1 - since killedByOtherAttempt(P1) == true; even though P1's only successful 
output was on exec-1 and there is no other copy of P1 around (T2 was killed when 
T1 succeeded).


I noticed this bug as part of reviewing PR# 21653 for SPARK-13343

Essentially, SPARK-22074 causes a regression (which I don't usually observe due 
to the shuffle service, sigh) - and as such the fix is broken IMO: I believe it 
got introduced as part of the review (the original change looked fine to me - 
but I did not look at it in detail).

I don't have a PR handy for this, so if anyone wants to pick it up, please 
feel free!
+CC [~XuanYuan] who fixed SPARK-22074 initially.

  was:
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if task needs to be resubmitted for partition.

Consider following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being speculative task)

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
We also end up killing task T2.

Now, exec-1 if/when goes MIA.
executorLost will no longer schedule task for P1 - since 
killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there is 
no other copy of P1 around (T2 was killed, not T1 - which was successful).


I noticed this bug as part of reviewing PR# 21653 for SPARK-13343

Essentially, SPARK-22074 causes a regression (which I dont usually observe due 
to shuffle service, sigh) - and as such the fix is broken IMO : I believe it 
got introduced as part of the review (the original change looked fine to me - 
but I did not look at it in detail).

I dont have a PR handy for this, so if anyone wants to pick it up, please do 
feel free !
+CC [~XuanYuan] who fixed SPARK-22074 initially.


> Executor loss can cause task to be not resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, exec-1 if/when goes MIA.
> executorLost will no longer schedule task for P1 - since 
> killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343
> Essentially, SPARK-22074 causes a regression (which I dont usually observe 
> due to shuffle service, sigh) - and as such the fix is broken IMO : I believe 
> it got introduced as part of the review (the original change looked fine to 
> me - but I did not look at it in detail).
> I dont have a PR handy for this, so if anyone wants to pick it up, please do 
> feel free !
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24755) Executor loss can cause task to be not resubmitted

2018-07-07 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-24755:
---

 Summary: Executor loss can cause task to be not resubmitted
 Key: SPARK-24755
 URL: https://issues.apache.org/jira/browse/SPARK-24755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Mridul Muralidharan


As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if a task needs to be resubmitted for a partition.

Consider the following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being a speculative task).

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to the running T2.
We also end up killing task T2.

Now, if/when exec-1 goes MIA, executorLost will no longer schedule a task for 
P1 - since killedByOtherAttempt(P1) == true; even though P1's only successful 
output was on exec-1 and there is no other copy of P1 around (T2 was killed, 
not T1 - which was successful).

I noticed this bug as part of reviewing PR# 21653 for SPARK-13343.

Essentially, SPARK-22074 causes a regression (which I don't usually observe due 
to the shuffle service, sigh) - and as such the fix is broken IMO: I believe it 
got introduced as part of the review (the original change looked fine to me - 
but I did not look at it in detail).

I don't have a PR handy for this, so if anyone wants to pick it up, please 
feel free!
+CC [~XuanYuan] who fixed SPARK-22074 initially.
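To spell out why the current guard misfires, an illustrative trace (not actual 
scheduler code):

{code}
// Illustrative only - names follow the description above, not the real scheduler state.
val successful = Array(true)             // successful(P1): T1 finished on exec-1
val killedByOtherAttempt = Array(true)   // set because T2 was killed after T1 succeeded
val p = 0                                // partition index of P1
// exec-1 is then lost: the only copy of P1's output is gone. executorLost evaluates:
if (successful(p) && !killedByOtherAttempt(p)) {
  // resubmit p -- never reached, because killedByOtherAttempt(p) is true,
  // even though the lost executor held the only successful output for p.
}
{code}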



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-18 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516411#comment-16516411
 ] 

Mridul Muralidharan edited comment on SPARK-24375 at 6/18/18 10:17 PM:
---

[~jiangxb1987] A couple of comments based on the document and your elaboration 
above:

* Is the 'barrier' logic pluggable? Instead of only being a global sync point.
* Dynamic resource allocation (DRA) triggers allocation of additional resources 
based on pending tasks - hence the comment _We may add a check of total 
available slots before scheduling tasks from a barrier stage taskset._ does not 
necessarily work in that context.
* Currently DRA in Spark uniformly allocates resources - are we envisioning 
changes as part of this effort to allocate heterogeneous executor resources 
based on pending tasks (at least initially for barrier support for GPUs)?
* How is fault tolerance handled w.r.t. waiting on incorrect barriers? Any way 
to identify the barrier? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()

{code}
** In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.
* Can you elaborate more on leveraging TaskContext.localProperties? Is it 
expected to be synced after 'barrier' returns? What guarantees are we 
expecting to provide?


was (Author: mridulm80):
[~jiangxb1987] A couple of comments based on the document and your elaboration 
above:

* Is the 'barrier' logic pluggable ? Instead of only being a global sync point.
* Dynamic resource allocation (dra) triggers allocation of additional resources 
based on pending tasks - hence the comment _We may add a check of total 
available slots before scheduling tasks from a barrier stage taskset._ does not 
necessarily work in that context.
* Currently DRA in spark uniformly allocates resources - are we envisioning 
changes as part of this effort to allocate heterogenous executor resources 
based on pending tasks (atleast initially for barrier support for gpu's) ?
* How is fault tolerance handled w.r.t waiting on incorrect barriers ? Any way 
to identify the barrier ? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()

{code}
** In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.
*

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-18 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516411#comment-16516411
 ] 

Mridul Muralidharan commented on SPARK-24375:
-


[~jiangxb1987] A couple of comments based on the document and your elaboration 
above:

* Is the 'barrier' logic pluggable ? Instead of only being a global sync point.
* Dynamic resource allocation (dra) triggers allocation of additional resources 
based on pending tasks - hence *We may add a check of total available slots 
before scheduling tasks from a barrier stage taskset.* does not necessarily 
work in that context.
* Currently DRA in spark uniformly allocates resources - are we envisioning 
changes as part of this effort to allocate heterogenous executor resources 
based on pending tasks (atleast initially for barrier support for gpu's) ?
* How is fault tolerance handled w.r.t waiting on incorrect barriers ? Any way 
to identify the barrier ? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()

{code}
** In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.
*

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-18 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516411#comment-16516411
 ] 

Mridul Muralidharan edited comment on SPARK-24375 at 6/18/18 10:15 PM:
---

[~jiangxb1987] A couple of comments based on the document and your elaboration 
above:

* Is the 'barrier' logic pluggable ? Instead of only being a global sync point.
* Dynamic resource allocation (dra) triggers allocation of additional resources 
based on pending tasks - hence the comment _We may add a check of total 
available slots before scheduling tasks from a barrier stage taskset._ does not 
necessarily work in that context.
* Currently DRA in spark uniformly allocates resources - are we envisioning 
changes as part of this effort to allocate heterogenous executor resources 
based on pending tasks (atleast initially for barrier support for gpu's) ?
* How is fault tolerance handled w.r.t waiting on incorrect barriers ? Any way 
to identify the barrier ? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()

{code}
** In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.
*


was (Author: mridulm80):

[~jiangxb1987] A couple of comments based on the document and your elaboration 
above:

* Is the 'barrier' logic pluggable ? Instead of only being a global sync point.
* Dynamic resource allocation (dra) triggers allocation of additional resources 
based on pending tasks - hence *We may add a check of total available slots 
before scheduling tasks from a barrier stage taskset.* does not necessarily 
work in that context.
* Currently DRA in spark uniformly allocates resources - are we envisioning 
changes as part of this effort to allocate heterogenous executor resources 
based on pending tasks (atleast initially for barrier support for gpu's) ?
* How is fault tolerance handled w.r.t waiting on incorrect barriers ? Any way 
to identify the barrier ? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()

{code}
** In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.
*

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-06-15 Thread Mridul Muralidharan
I agree, I don't see a pressing need for a major version bump either.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra  wrote:
>
> Changing major version numbers is not about new features or a vague notion 
> that it is time to do something that will be seen to be a significant 
> release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>
>> Dear all:
>>
>> It have been 2 months since this topic being proposed. Any progress now? 
>> 2018 has been passed about 1/2.
>>
>> I agree with that the new version should be some exciting new feature. How 
>> about this one:
>>
>> 6. ML/DL framework to be integrated as core component and feature. (Such as 
>> Angel / BigDL / ……)
>>
>> 3.0 is a very important version for an good open source project. It should 
>> be better to drift away the historical burden and focus in new area. Spark 
>> has been widely used all over the world as a successful big data framework. 
>> And it can be better than that.
>>
>> Andy
>>
>>
>> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>>>
>>> There was a discussion thread on scala-contributors about Apache Spark not 
>>> yet supporting Scala 2.12, and that got me to think perhaps it is about 
>>> time for Spark to work towards the 3.0 release. By the time it comes out, 
>>> it will be more than 2 years since Spark 2.0.
>>>
>>> For contributors less familiar with Spark’s history, I want to give more 
>>> context on Spark releases:
>>>
>>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If 
>>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 
>>> in 2018.
>>>
>>> 2. Spark’s versioning policy promises that Spark does not break stable APIs 
>>> in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
>>> necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 
>>> 3.0).
>>>
>>> 3. That said, a major version isn’t necessarily the playground for 
>>> disruptive API changes to make it painful for users to update. The main 
>>> purpose of a major release is an opportunity to fix things that are broken 
>>> in the current API and remove certain deprecated APIs.
>>>
>>> 4. Spark as a project has a culture of evolving architecture and developing 
>>> major new features incrementally, so major releases are not the only time 
>>> for exciting new features. For example, the bulk of the work in the move 
>>> towards the DataFrame API was done in Spark 1.3, and Continuous Processing 
>>> was introduced in Spark 2.3. Both were feature releases rather than major 
>>> releases.
>>>
>>>
>>> You can find more background in the thread discussing Spark 2.0: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>
>>>
>>> The primary motivating factor IMO for a major version bump is to support 
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
>>> Similar to Spark 2.0, I think there are also opportunities for other 
>>> changes that we know have been biting us for a long time but can’t be 
>>> changed in feature releases (to be clear, I’m actually not sure they are 
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>> 1. Support Scala 2.12.
>>>
>>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 
>>> 2.x.
>>>
>>> 3. Shade all dependencies.
>>>
>>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, 
>>> to prevent users from shooting themselves in the foot, e.g. “SELECT 2 
>>> SECOND” -- is “SECOND” an interval unit or an alias? To make it less 
>>> painful for users to upgrade here, I’d suggest creating a flag for backward 
>>> compatibility mode.
>>>
>>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard 
>>> compliant, and have a flag for backward compatibility.
>>>
>>> 6. Miscellaneous other small changes documented in JIRA already (e.g. 
>>> “JavaPairRDD flatMapValues requires function returning Iterable, not 
>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>
>>>
>>> Now the reality of a major version bump is that the world often thinks in 
>>> terms of what exciting features are coming. I do think there are a number 
>>> of major changes happening already that can be part of the 3.0 release, if 
>>> they make it in:
>>>
>>> 1. Scala 2.12 support (listing it twice)
>>> 2. Continuous Processing non-experimental
>>> 3. Kubernetes support non-experimental
>>> 4. A more flushed out version of data source API v2 (I don’t think it is 
>>> realistic to stabilize that in one release)
>>> 5. Hadoop 3.0 support
>>> 6. ...
>>>
>>>
>>>
>>> Similar to the 2.0 discussion, this thread should focus on the framework 
>>> and whether it’d make sense to create Spark 3.0 as the next release, rather 
>>> than the individual feature requests. Those are important b

Re: Hadoop 3 support

2018-04-02 Thread Mridul Muralidharan
Specifically, to run Spark with Hadoop 3 docker support, I have filed a
few JIRAs tracked under [1].

Regards,
Mridul

[1] https://issues.apache.org/jira/browse/SPARK-23717


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin  wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-03-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418029#comment-16418029
 ] 

Mridul Muralidharan commented on YARN-7935:
---

[~eyang] I think there is some confusion here.
Spark does not require user defined networks - I dont think it was mentioned 
that this was required.

Taking a step back:

With "host" networking mode, we get it to work without any changes to the 
application code at all - giving us all the benefits of isolation without any 
loss in existing functionality (modulo specifying the env variables required 
ofcourse).

When used with bridge/overlay/user defined networks/etc, the container hostname 
passed to spark AM via allocation request is that of nodemanager, and not the 
actual container hostname used in the docker container.
This patch exposes the container hostname as an env variable - just as we have 
other container and node specific env variables exposed to the container 
(CONTAINER_ID, NM_HOST, etc).

Do you see any concern with exposing this variable? I want to make sure I am 
not missing something here.

What Spark (or any other application) does with this variable is its 
implementation detail; I can go into the details of why this is required in 
the case of Spark specifically if needed, but that might digress from the jira.
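
For illustration only (none of this is from the patch), an application running 
inside the container could consume these variables roughly as follows, falling 
back to the JVM's own view of the hostname when the new variable is absent, 
e.g. in host networking mode:

{code:scala}
// Illustrative sketch only - not from the YARN-7935 patch or from Spark.
// CONTAINER_ID and NM_HOST already exist; CONTAINER_HOSTNAME is the variable
// this patch proposes to add. The case class and field names are made up.
case class ContainerEnv(
    containerId: Option[String],   // the YARN container id, if set
    nodeHost: Option[String],      // physical host running the NodeManager
    containerHost: String)         // hostname visible inside the container

def readContainerEnv(): ContainerEnv = ContainerEnv(
  containerId = sys.env.get("CONTAINER_ID"),
  nodeHost = sys.env.get("NM_HOST"),
  containerHost = sys.env.getOrElse("CONTAINER_HOSTNAME",
    java.net.InetAddress.getLocalHost.getHostName))
{code}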


> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch, YARN-7935.3.patch
>
>
> Some applications (like Spark) have a need to bind to the container's 
> hostname, which is different from the NodeManager's hostname (NM_HOST, which 
> is available as an env variable during container launch) when launched 
> through the Docker runtime. The container's hostname can be exposed to 
> applications via an env variable, CONTAINER_HOSTNAME. Another potential 
> candidate is the container's IP, but this can be addressed in a separate jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-03-27 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415223#comment-16415223
 ] 

Mridul Muralidharan commented on YARN-7935:
---

[~eyang] Using YARN Service to run the Spark AM is not feasible; IMO, use of 
YARN Service must not be a requirement for leveraging Docker support in YARN.

> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch, YARN-7935.3.patch
>
>
> Some applications (like Spark) have a need to bind to the container's 
> hostname, which is different from the NodeManager's hostname (NM_HOST, which 
> is available as an env variable during container launch) when launched 
> through the Docker runtime. The container's hostname can be exposed to 
> applications via an env variable, CONTAINER_HOSTNAME. Another potential 
> candidate is the container's IP, but this can be addressed in a separate jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (SPARK-23721) Enhance BlockManagerId to include container's underlying host machine hostname

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23721:

Summary: Enhance BlockManagerId to include container's underlying host 
machine hostname  (was: Enhance BlockManagerId to include container's 
underlying host machie hostname)

> Enhance BlockManagerId to include container's underlying host machine hostname
> --
>
> Key: SPARK-23721
> URL: https://issues.apache.org/jira/browse/SPARK-23721
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> In Spark, host and rack locality computation is based on BlockManagerId's 
> hostname - which is the container's hostname.
> When running in containerized environments like Kubernetes, Docker support 
> in Hadoop 3, Mesos Docker support, etc., the hostname reported by the 
> container is not the actual host the container is running on.
> This results in Spark getting affected in multiple ways.
> h3. Suboptimal schedules
> Due to the hostname mismatch between different containers on the same 
> physical host, Spark will treat all containers as running on their own hosts.
> Effectively, there is no host-local scheduling at all due to this.
> In addition, depending on how sophisticated the locality script is, it can 
> also lead to anything from suboptimal rack locality computation to no 
> rack-locality scheduling at all.
> Hence the performance degradation in the scheduler can be significant - only 
> PROCESS_LOCAL schedules are unaffected.
> h3. HDFS reads
> This is closely related to "suboptimal schedules" above.
> Block locations for HDFS files refer to the datanode hostnames - and not the 
> container's hostname.
> This effectively results in Spark ignoring HDFS data placement entirely when 
> scheduling tasks - resulting in very heavy cross-node/cross-rack data 
> movement.
> h3. Speculative execution
> Spark schedules speculative tasks on a different host - in order to minimize 
> the cost of node failures for expensive tasks.
> This gets effectively disabled, resulting in speculative tasks potentially 
> running on the same actual host.
> h3. Block replication
> Similar to "speculative execution" above, block replication minimizes the 
> potential cost of node loss by typically leveraging another host, which gets 
> effectively disabled in this case.
> The solution for the above is to enhance BlockManagerId to also include the 
> node's actual hostname via 'nodeHostname' - which should be used for the use 
> cases above instead of the container hostname ('host').
> When not relevant, nodeHostname == hostname, which should ensure all 
> existing functionality continues to work as expected without regressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23721) Enhance BlockManagerId to include container's underlying host machie hostname

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23721:

Summary: Enhance BlockManagerId to include container's underlying host 
machie hostname  (was: Enhance BlockManagerId to include container's underlying 
host name)

> Enhance BlockManagerId to include container's underlying host machie hostname
> -
>
> Key: SPARK-23721
> URL: https://issues.apache.org/jira/browse/SPARK-23721
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> In Spark, host and rack locality computation is based on BlockManagerId's 
> hostname - which is the container's hostname.
> When running in containerized environments like Kubernetes, Docker support 
> in Hadoop 3, Mesos Docker support, etc., the hostname reported by the 
> container is not the actual host the container is running on.
> This results in Spark getting affected in multiple ways.
> h3. Suboptimal schedules
> Due to the hostname mismatch between different containers on the same 
> physical host, Spark will treat all containers as running on their own hosts.
> Effectively, there is no host-local scheduling at all due to this.
> In addition, depending on how sophisticated the locality script is, it can 
> also lead to anything from suboptimal rack locality computation to no 
> rack-locality scheduling at all.
> Hence the performance degradation in the scheduler can be significant - only 
> PROCESS_LOCAL schedules are unaffected.
> h3. HDFS reads
> This is closely related to "suboptimal schedules" above.
> Block locations for HDFS files refer to the datanode hostnames - and not the 
> container's hostname.
> This effectively results in Spark ignoring HDFS data placement entirely when 
> scheduling tasks - resulting in very heavy cross-node/cross-rack data 
> movement.
> h3. Speculative execution
> Spark schedules speculative tasks on a different host - in order to minimize 
> the cost of node failures for expensive tasks.
> This gets effectively disabled, resulting in speculative tasks potentially 
> running on the same actual host.
> h3. Block replication
> Similar to "speculative execution" above, block replication minimizes the 
> potential cost of node loss by typically leveraging another host, which gets 
> effectively disabled in this case.
> The solution for the above is to enhance BlockManagerId to also include the 
> node's actual hostname via 'nodeHostname' - which should be used for the use 
> cases above instead of the container hostname ('host').
> When not relevant, nodeHostname == hostname, which should ensure all 
> existing functionality continues to work as expected without regressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23721) Enhance BlockManagerId to include container's underlying host name

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23721:

Summary: Enhance BlockManagerId to include container's underlying host name 
 (was: Use actual node's hostname for host and rack locality computation)

> Enhance BlockManagerId to include container's underlying host name
> --
>
> Key: SPARK-23721
> URL: https://issues.apache.org/jira/browse/SPARK-23721
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> In Spark, host and rack locality computation is based on BlockManagerId's 
> hostname - which is the container's hostname.
> When running in containerized environments like Kubernetes, Docker support 
> in Hadoop 3, Mesos Docker support, etc., the hostname reported by the 
> container is not the actual host the container is running on.
> This results in Spark getting affected in multiple ways.
> h3. Suboptimal schedules
> Due to the hostname mismatch between different containers on the same 
> physical host, Spark will treat all containers as running on their own hosts.
> Effectively, there is no host-local scheduling at all due to this.
> In addition, depending on how sophisticated the locality script is, it can 
> also lead to anything from suboptimal rack locality computation to no 
> rack-locality scheduling at all.
> Hence the performance degradation in the scheduler can be significant - only 
> PROCESS_LOCAL schedules are unaffected.
> h3. HDFS reads
> This is closely related to "suboptimal schedules" above.
> Block locations for HDFS files refer to the datanode hostnames - and not the 
> container's hostname.
> This effectively results in Spark ignoring HDFS data placement entirely when 
> scheduling tasks - resulting in very heavy cross-node/cross-rack data 
> movement.
> h3. Speculative execution
> Spark schedules speculative tasks on a different host - in order to minimize 
> the cost of node failures for expensive tasks.
> This gets effectively disabled, resulting in speculative tasks potentially 
> running on the same actual host.
> h3. Block replication
> Similar to "speculative execution" above, block replication minimizes the 
> potential cost of node loss by typically leveraging another host, which gets 
> effectively disabled in this case.
> The solution for the above is to enhance BlockManagerId to also include the 
> node's actual hostname via 'nodeHostname' - which should be used for the use 
> cases above instead of the container hostname ('host').
> When not relevant, nodeHostname == hostname, which should ensure all 
> existing functionality continues to work as expected without regressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23721) Use actual node's hostname for host and rack locality computation

2018-03-16 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-23721:
---

 Summary: Use actual node's hostname for host and rack locality 
computation
 Key: SPARK-23721
 URL: https://issues.apache.org/jira/browse/SPARK-23721
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Mridul Muralidharan


In Spark, host and rack locality computation is based on BlockManagerId's 
hostname - which is the container's hostname.
When running in containerized environments like Kubernetes, Docker support in 
Hadoop 3, Mesos Docker support, etc., the hostname reported by the container is 
not the actual host the container is running on.
This results in Spark getting affected in multiple ways.

h3. Suboptimal schedules

Due to the hostname mismatch between different containers on the same physical 
host, Spark will treat all containers as running on their own hosts.
Effectively, there is no host-local scheduling at all due to this.

In addition, depending on how sophisticated the locality script is, it can also 
lead to anything from suboptimal rack locality computation to no rack-locality 
scheduling at all.

Hence the performance degradation in the scheduler can be significant - only 
PROCESS_LOCAL schedules are unaffected.

h3. HDFS reads

This is closely related to "suboptimal schedules" above.
Block locations for HDFS files refer to the datanode hostnames - and not the 
container's hostname.
This effectively results in Spark ignoring HDFS data placement entirely when 
scheduling tasks - resulting in very heavy cross-node/cross-rack data movement.

h3. Speculative execution

Spark schedules speculative tasks on a different host - in order to minimize 
the cost of node failures for expensive tasks.
This gets effectively disabled, resulting in speculative tasks potentially 
running on the same actual host.

h3. Block replication

Similar to "speculative execution" above, block replication minimizes potential 
cost of node loss by typically leveraging another host; which gets effectively 
disabled in this case.




The solution for the above is to enhance BlockManagerId to also include the 
node's actual hostname via 'nodeHostname' - which should be used for the use 
cases above instead of the container hostname ('host').
When not relevant, nodeHostname == hostname, which should ensure all existing 
functionality continues to work as expected without regressions.
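
As a rough, hypothetical sketch of the shape this proposal describes (the names 
below are illustrative, not the actual Spark class, which is constructed 
differently):

{code:scala}
// Hypothetical illustration of the proposal - not the actual BlockManagerId class.
case class NodeAwareBlockManagerId(
    executorId: String,
    host: String,                         // container hostname ('host' above)
    port: Int,
    nodeHostname: Option[String] = None)  // underlying node's hostname, when it differs

// Hostname to use for host/rack locality, HDFS placement, speculative execution
// and block replication decisions; identical to 'host' when the container does
// not have its own hostname.
def localityHost(id: NodeAwareBlockManagerId): String =
  id.nodeHostname.getOrElse(id.host)
{code}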





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23720) Leverage shuffle service when running in non-host networking mode in hadoop 3 docker support

2018-03-16 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-23720:
---

 Summary: Leverage shuffle service when running in non-host 
networking mode in hadoop 3 docker support
 Key: SPARK-23720
 URL: https://issues.apache.org/jira/browse/SPARK-23720
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Mridul Muralidharan



In the current external shuffle service integration, the hostname of the 
executor and the shuffle service is the same while the ports are different 
(shuffle service port vs block manager port).
When running in non-host networking mode under Docker on YARN, the shuffle 
service runs on the NM_HOST while the Docker container runs under its own 
(ephemeral, generated) hostname.

We should make use of the container's host machine's hostname for the shuffle 
service, and not the container hostname, when external shuffle is enabled.
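
A minimal sketch of the intended behaviour (illustrative names only, not the 
actual change):

{code:scala}
// Illustrative only. With the external shuffle service enabled, shuffle state
// should be registered against the NodeManager's host (where the service runs),
// not the executor's container hostname.
def shuffleServiceAddress(
    externalShuffleEnabled: Boolean,
    shuffleServicePort: Int,
    executorHost: String,          // container hostname
    blockManagerPort: Int): (String, Int) = {
  if (externalShuffleEnabled) {
    // NM_HOST is the host machine's hostname exposed by the NodeManager.
    (sys.env.getOrElse("NM_HOST", executorHost), shuffleServicePort)
  } else {
    (executorHost, blockManagerPort)
  }
}
{code}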



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23718) Document using docker in host networking mode in hadoop 3

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23718:

Issue Type: Documentation  (was: Task)

> Document using docker in host networking mode in hadoop 3
> -
>
> Key: SPARK-23718
> URL: https://issues.apache.org/jira/browse/SPARK-23718
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>    Reporter: Mridul Muralidharan
>Priority: Major
>
> Document the configuration options that need to be specified to run an Apache 
> Spark application with Hadoop 3 Docker support in host networking mode.
> There are no code changes required to leverage this, giving us package 
> isolation with all other functionality at par with what currently exists.
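
For illustration, the kind of configuration such documentation would cover 
could look roughly like the following; the YARN_CONTAINER_RUNTIME_* variables 
come from Hadoop 3's Docker support, spark.yarn.appMasterEnv.* and 
spark.executorEnv.* are existing Spark settings, and the image name is a 
placeholder:

{code:scala}
import org.apache.spark.SparkConf

// Rough example only - the exact variables depend on the cluster's Docker setup.
object DockerHostNetworkingExample {
  val conf: SparkConf = new SparkConf()
    .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "my-spark-image:latest")
    .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK", "host")
    .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "my-spark-image:latest")
    .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK", "host")
}
{code}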



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23719) Use correct hostname in non-host networking mode in hadoop 3 docker support

2018-03-16 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-23719:
---

 Summary: Use correct hostname in non-host networking mode in 
hadoop 3 docker support
 Key: SPARK-23719
 URL: https://issues.apache.org/jira/browse/SPARK-23719
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 2.4.0
Reporter: Mridul Muralidharan



The hostname (node-id's hostname field) specified by the RM in allocated 
containers is the NM_HOST, and not the hostname that will be used by the 
container when running under the Docker container runtime: the actual container 
hostname is generated at runtime.

Due to this, Spark executors are unable to launch in non-host networking mode 
when leveraging Docker support in Hadoop 3 - they fail to bind because the 
hostname they are trying to bind to is that of the host machine and not the 
container.

When launching executors, we can leverage YARN-7935 to fetch the container's 
hostname (when available), and otherwise fall back to the existing mechanism.
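
A hedged sketch of that fallback (the parameter name is hypothetical; this is 
not the actual patch):

{code:scala}
// CONTAINER_HOSTNAME is the env variable proposed in YARN-7935;
// 'rmAssignedHost' stands for the NodeManager hostname that came with the
// container allocation (hypothetical parameter name).
def executorBindHost(rmAssignedHost: String): String =
  sys.env.get("CONTAINER_HOSTNAME")
    .filter(_.nonEmpty)
    .getOrElse(rmAssignedHost)   // existing behaviour, e.g. host networking mode
{code}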



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23718) Document using docker in host networking mode in hadoop 3

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23718:

Issue Type: Improvement  (was: Documentation)

> Document using docker in host networking mode in hadoop 3
> -
>
> Key: SPARK-23718
> URL: https://issues.apache.org/jira/browse/SPARK-23718
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>    Reporter: Mridul Muralidharan
>Priority: Major
>
> Document the configuration options that need to be specified to run an Apache 
> Spark application with Hadoop 3 Docker support in host networking mode.
> There are no code changes required to leverage this, giving us package 
> isolation with all other functionality at par with what currently exists.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23718) Document using docker in host networking mode in hadoop 3

2018-03-16 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-23718:

Issue Type: Task  (was: Improvement)

> Document using docker in host networking mode in hadoop 3
> -
>
> Key: SPARK-23718
> URL: https://issues.apache.org/jira/browse/SPARK-23718
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.4.0
>    Reporter: Mridul Muralidharan
>Priority: Major
>
> Document the configuration options that need to be specified to run an Apache 
> Spark application with Hadoop 3 Docker support in host networking mode.
> There are no code changes required to leverage this, giving us package 
> isolation with all other functionality at par with what currently exists.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


