[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-07 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149315115
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala 
---
@@ -110,7 +120,31 @@ case class RenameTableEvent(
   extends TableEvent
 
 /**
- * Event fired when a function is created, dropped or renamed.
+ * Enumeration to indicate which part of table is altered. If a plain alterTable API is called, then
+ * type will generally be Table.
+ */
+object AlterTableKind extends Enumeration {
+  val Table, DataSchema, Stats = Value
--- End diff --

I'm OK with using a String, but I'd prefer a strong type to avoid nasty issues.
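
To illustrate why (a standalone sketch, not the PR code): with the enumeration the compiler rejects an invalid kind, whereas with a raw String a typo only shows up at runtime.

```scala
object AlterTableKind extends Enumeration {
  val Table, DataSchema, Stats = Value
}

// Callers can only pass one of the defined values:
def describe(kind: AlterTableKind.Value): String = kind match {
  case AlterTableKind.Table      => "whole table altered"
  case AlterTableKind.DataSchema => "data schema altered"
  case AlterTableKind.Stats      => "table statistics altered"
}

// describe("Tabel")   // does not compile; with a String-typed kind this typo would slip through
```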


---




[GitHub] spark pull request #19663: [SPARK-21888][YARN][SQL][Hive]add hadoop/hive/hba...

2017-11-06 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19663#discussion_r149017279
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
---
@@ -705,6 +705,19 @@ private[spark] class Client(
   }
 }
 
+val confDir =
+  sys.env.getOrElse("SPARK_CONF_DIR", sys.env("SPARK_HOME") + File.separator + "conf")
+val dir = new File(confDir)
+if (dir.isDirectory) {
+  val files = dir.listFiles(new FileFilter {
+override def accept(pathname: File): Boolean = {
+  pathname.isFile && pathname.getName.endsWith("xml")
--- End diff --

Yes, I understand. My question is whether we need to explicitly check for the 
expected file names, rather than blindly matching any xml file.
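
For illustration, a minimal sketch of what explicit matching could look like (the allowlist below is hypothetical; the exact set of expected files is the open question):

```scala
import java.io.{File, FileFilter}

// Hypothetical allowlist of the config files we actually expect to pick up from the conf dir.
val expectedConfFiles = Set("core-site.xml", "hdfs-site.xml", "hive-site.xml", "hbase-site.xml")

val filter = new FileFilter {
  override def accept(pathname: File): Boolean =
    pathname.isFile && expectedConfFiles.contains(pathname.getName)
}
```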


---




[GitHub] spark issue #19663: [SPARK-21888][SQL][Hive]add hadoop/hive/hbase/etc config...

2017-11-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19663
  
Please also add the [YARN] tag to the PR title; this is actually a YARN problem.


---




[GitHub] spark pull request #19663: [SPARK-21888][SQL][Hive]add hadoop/hive/hbase/etc...

2017-11-06 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19663#discussion_r149015858
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
---
@@ -705,6 +705,19 @@ private[spark] class Client(
   }
 }
 
+val confDir =
+  sys.env.getOrElse("SPARK_CONF_DIR", sys.env("SPARK_HOME") + File.separator + "conf")
+val dir = new File(confDir)
+if (dir.isDirectory) {
+  val files = dir.listFiles(new FileFilter {
+override def accept(pathname: File): Boolean = {
+  pathname.isFile && pathname.getName.endsWith("xml")
--- End diff --

Shall we explicitly match the file names, like "hive-site.xml"? Only checking that 
the file name ends with "xml" will also pick up other, unwanted files.


---




[GitHub] spark issue #18791: [SPARK-21571][Scheduler] Spark history server leaves inc...

2017-11-06 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/18791
  
@ericvandenbergfb please also fix the PR title, thanks.


---




[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-05 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149009631
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogEventSuite.scala
 ---
@@ -104,6 +109,8 @@ class ExternalCatalogEventSuite extends SparkFunSuite {
   tableType = CatalogTableType.MANAGED,
   storage = storage,
   schema = new StructType().add("id", "long"))
+val tableDefWithSparkVersion =
--- End diff --

Sorry this was from my original code, will update it.


---




[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-05 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149004933
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 ---
@@ -158,7 +173,13 @@ abstract class ExternalCatalog
* @param table Name of table to alter schema for
* @param newDataSchema Updated data schema to be used for the table.
*/
-  def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
+  final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
+postToAll(AlterTableSchemaPreEvent(db, table))
--- End diff --

I don't think it is necessary for the event to carry the new schema; we can query 
the catalog by `db` and `table` to get the newly set schema.
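
For illustration, a minimal sketch (using only APIs that already exist on `ExternalCatalog`) of how a consumer would recover the schema from just `(db, table)`:

```scala
// Sketch: the event only needs to carry (db, table); the freshly written schema
// can be read back from the catalog itself.
import org.apache.spark.sql.catalyst.catalog.ExternalCatalog
import org.apache.spark.sql.types.StructType

def schemaAfterAlter(catalog: ExternalCatalog, db: String, table: String): StructType = {
  // getTable returns the CatalogTable as currently stored, i.e. with the newly set data schema
  catalog.getTable(db, table).schema
}
```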


---




[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-05 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19649#discussion_r149003803
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 ---
@@ -147,7 +154,15 @@ abstract class ExternalCatalog
* Note: If the underlying implementation does not support altering a 
certain field,
* this becomes a no-op.
*/
-  def alterTable(tableDefinition: CatalogTable): Unit
+  final def alterTable(tableDefinition: CatalogTable): Unit = {
--- End diff --

@cloud-fan, since we now expose the `alterTable` interface for other components to 
leverage, not tracking it would leave a gap in the `ExternalCatalogEvent`s. I think 
we can add this `AlterTableEvent` for now; later on, if we remove this method, we 
can make the event a no-op (kept only for compatibility). What do you think?

@wzhfy, I was thinking of adding partition-related events, but I'm not sure why 
this whole piece is missing, or whether partition-related events are really 
necessary. If we reach agreement on such events, I'm OK with adding them.
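
To make the proposal concrete, here is a rough sketch of the pattern I have in mind, following the names in the diffs quoted above (the exact event fields, and the protected `doAlterTable` hook, are still up for discussion):

```scala
// Sketch only: wrap the implementation-specific hook with pre/post events,
// tagging them with AlterTableKind so listeners can tell the entry points apart.
final def alterTable(tableDefinition: CatalogTable): Unit = {
  val db = tableDefinition.database
  val name = tableDefinition.identifier.table
  postToAll(AlterTablePreEvent(db, name, AlterTableKind.Table))
  doAlterTable(tableDefinition)   // assumed protected hook implemented by concrete catalogs
  postToAll(AlterTableEvent(db, name, AlterTableKind.Table))
}
```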


---




[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent

2017-11-02 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/19649

[SPARK-22405][SQL] Add more ExternalCatalogEvent

## What changes were proposed in this pull request?

We're building a data lineage tool in which we need to monitor the metadata 
changes in ExternalCatalog. The current ExternalCatalog already provides several 
useful events, like "CreateDatabaseEvent", for a custom SparkListener to use. But 
some events are still missing, such as an alter database event and an alter table 
event. So this PR proposes to add new ExternalCatalogEvents.

## How was this patch tested?

Enriched the current UT and tested on a local cluster.

CC @hvanhovell, please let me know your comments on the current proposal, thanks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-22405

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19649.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19649


commit 5c628be6b6838b224a27e06731f686a5182e1bad
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-11-03T01:48:48Z

Add more ExternalCatalogEvent




---




[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-01 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19586
  
I tend to agree with @cloud-fan. I think you can implement your own serializer 
outside of Spark that is more specialized for your application; that will 
definitely be more efficient than the built-in one. But Spark's default solution 
should be general enough to cover all cases, and setting a flag or a configuration 
is not intuitive enough, from my understanding.

And for ML, can you please provide an example of how this could be improved with 
your approach? From my understanding, your approach is more useful when leveraging 
custom class definitions, like `Person` in your example. But for ML/SQL cases, all 
the types should be predefined or primitive; will that improve things a lot?


---




[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...

2017-11-01 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19396
  
Sorry I didn't notice it, will double-check next time.


---




[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...

2017-11-01 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19396
  
OK, let me merge to master branch.


---




[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19586
  
Using configurations doesn't seem very elegant; also, a configuration is 
application-scoped, so how would you turn this feature on/off at runtime? Sorry I 
cannot give you good advice; maybe Kryo's solution is the best option for the 
general case.


---




[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19396
  
I'm OK with the current changes.


---




[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19586
  
@ConeyLiu what about the example below? Does your implementation support it?

```scala
trait Base { val name: String }
case class A(name: String) extends Base
case class B(name: String) extends Base

sc.parallelize(Seq(A("a"), B("b"))).map { i => (i, 1) }.reduceByKey(_ + _).collect()
```

Here not all the elements have the same class type; does your PR support such a 
scenario?



---




[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19580
  
Jenkins, retest this please.


---




[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19580
  
jenkins, retest this please.


---




[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...

2017-10-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19580#discussion_r147325260
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -267,6 +267,10 @@ private[spark] class ExecutorAllocationManager(
 (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
   }
 
+  private def totalRunningTasks(): Int = synchronized {
--- End diff --

I'm not sure why we need to add a method that is only used by unit tests. If you 
want to verify the behavior of `totalRunningTasks`, I think 
`maxNumExecutorsNeeded` can also be used indirectly for verification.
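
For reference, the arithmetic behind `maxNumExecutorsNeeded` (copied from the diff above), which is why asserting on it exercises the running-task count without a new accessor:

```scala
// Mirrors the formula in maxNumExecutorsNeeded: ceil(runningOrPendingTasks / tasksPerExecutor)
def executorsNeeded(numRunningOrPendingTasks: Int, tasksPerExecutor: Int): Int =
  (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor

assert(executorsNeeded(numRunningOrPendingTasks = 5, tasksPerExecutor = 2) == 3)
```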


---




[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...

2017-10-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19580#discussion_r147304200
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -267,6 +267,10 @@ private[spark] class ExecutorAllocationManager(
 (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
   }
 
+  private def totalRunningTasks(): Int = synchronized {
--- End diff --

Looks like no one invokes this method?


---




[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...

2017-10-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19580#discussion_r147303973
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -678,7 +679,9 @@ private[spark] class ExecutorAllocationManager(
   val executorId = taskStart.taskInfo.executorId
 
   allocationManager.synchronized {
-numRunningTasks += 1
+if (stageIdToNumRunningTask.contains(stageId)) {
+  stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) + 1
--- End diff --

nit: this can be changed to `stageIdToNumRunningTask(stageId) += 1`


---




[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...

2017-10-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19580#discussion_r147304306
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -709,7 +712,9 @@ private[spark] class ExecutorAllocationManager(
   val taskIndex = taskEnd.taskInfo.index
   val stageId = taskEnd.stageId
   allocationManager.synchronized {
-numRunningTasks -= 1
+if (stageIdToNumRunningTask.contains(stageId)) {
+  stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) - 1
--- End diff --

ditto.


---




[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...

2017-10-26 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19519
  
LGTM, merging to master.


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia would you please reopen this PR? I think the second issue I fixed 
before is no longer valid, and for the first issue my fix is no different from the 
one here.


---




[GitHub] spark pull request #11205: [SPARK-11334][Core] Handle maximum task failure s...

2017-10-25 Thread jerryshao
Github user jerryshao closed the pull request at:

https://github.com/apache/spark/pull/11205


---




[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/11205
  
Verified again; it looks like the 2nd bullet is not valid anymore. I cannot 
reproduce it on the latest master branch, so it might have already been fixed by 
SPARK-13054.

So only the first issue still exists, and I think @sitalkedia's PR is enough to 
handle it. I'm going to close this one. @sitalkedia, would you please reopen your 
PR? Sorry for bringing in noise.


---




[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/11205
  
@vanzin, in the current code `stageIdToTaskIndices` cannot be used to track the 
number of running tasks, because this structure doesn't remove a task index when 
the task finishes successfully.

Yes, `isExecutorIdle` is used to take care of executor idleness, but the way it 
identifies whether an executor is idle is not robust enough. In this scenario, 
when a stage is aborted because of max task failures, some task end events will be 
missing, so using the number of tasks per executor leads to residual data and 
makes the executor look busy forever.
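
A minimal, self-contained sketch of the failure mode I'm describing (simplified bookkeeping, not the actual ExecutorAllocationManager code):

```scala
import scala.collection.mutable

// Per-executor running-task bookkeeping, as maintained from listener events.
val executorIdToTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()

def onTaskStart(executorId: String, taskId: Long): Unit =
  executorIdToTaskIds.getOrElseUpdate(executorId, mutable.HashSet.empty[Long]) += taskId

def onTaskEnd(executorId: String, taskId: Long): Unit =
  executorIdToTaskIds.get(executorId).foreach(_ -= taskId)

def isExecutorIdle(executorId: String): Boolean =
  executorIdToTaskIds.get(executorId).forall(_.isEmpty)

// If a stage is aborted after max task failures, onTaskEnd is never called for its
// remaining tasks, the residual entries never drain, and isExecutorIdle stays false forever.
```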




---




[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19458
  
retest this please.


---




[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19458
  
There's a UT failure 
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport/junit/org.apache.spark.storage/BlockIdSuite/test_bad_deserialization/).
 @superbobry please fix this failure.


---




[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...

2017-10-24 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19519#discussion_r146737263
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a 
no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class 
with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, 
so running multiple
+ * of these in the same JVM may lead to undefined behavior due to 
configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends 
SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+if (!Modifier.isStatic(mainMethod.getModifiers)) {
+  throw new IllegalStateException("The main method in the given main 
class must be static")
+}
+
+val sysProps = conf.getAll.toMap
+sysProps.foreach { case (k, v) =>
+  sys.props(k) = v
+}
+
+mainMethod.invoke(null, args)
+  }
--- End diff --

I see, thanks for the explanation. I cannot figure out a solution that doesn't 
break the current semantics of `SparkConf`; this might be the only choice.


---




[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...

2017-10-24 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19519#discussion_r146734075
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a 
no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class 
with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, 
so running multiple
+ * of these in the same JVM may lead to undefined behavior due to 
configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends 
SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+if (!Modifier.isStatic(mainMethod.getModifiers)) {
+  throw new IllegalStateException("The main method in the given main 
class must be static")
+}
+
+val sysProps = conf.getAll.toMap
+sysProps.foreach { case (k, v) =>
+  sys.props(k) = v
+}
+
+mainMethod.invoke(null, args)
+  }
--- End diff --

But based on your comment, "allow multiple applications to be started in the 
same JVM", won't these system properties contaminate follow-up applications? 
Sorry if I misunderstood your scenario.
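
For illustration, a sketch of the kind of isolation I'm wondering about (not proposing this exact code); it snapshots the JVM's system properties around the wrapped `main` call, reusing `conf`, `mainMethod`, and `args` from the diff above:

```scala
// Hypothetical variant of JavaMainApplication.start: restore system properties afterwards
// so that a second application launched in the same JVM does not inherit this one's config.
val before = sys.props.toMap
try {
  conf.getAll.foreach { case (k, v) => sys.props(k) = v }
  mainMethod.invoke(null, args)
} finally {
  // drop keys that were added, then restore any values that were overwritten
  (sys.props.keySet.toSet -- before.keySet).foreach(sys.props.remove)
  before.foreach { case (k, v) => sys.props(k) = v }
}
```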


---




[GitHub] spark pull request #18492: [SPARK-19326] Speculated task attempts do not get...

2017-10-23 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18492#discussion_r146190420
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -373,8 +373,14 @@ private[spark] class ExecutorAllocationManager(
 // If our target has not changed, do not send a message
 // to the cluster manager and reset our exponential growth
 if (delta == 0) {
-  numExecutorsToAdd = 1
-  return 0
+  // Check if there is any speculative jobs pending
+  if (listener.pendingTasks == 0 && listener.pendingSpeculativeTasks > 0) {
+numExecutorsTarget =
+  math.max(math.min(maxNumExecutorsNeeded + 1, maxNumExecutors), minNumExecutors)
--- End diff --

@janewangfb would you please explain why we add `+ 1` here when there are pending 
speculative tasks? Shouldn't the number of executors be calculated based on the 
number of pending tasks? Thanks!


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-23 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
@sjrand would you please close this PR? It has already been merged to branch 2.2.


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-23 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
Thanks, merging to branch 2.2.


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
ok to test.


---




[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
Can you please add the `[BACKPORT-2.2]` tag to the PR title?


---




[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
ok to test.


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia I'm OK with either.


---




[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...

2017-10-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19519#discussion_r146154530
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a 
no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class 
with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, 
so running multiple
+ * of these in the same JVM may lead to undefined behavior due to 
configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends 
SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+if (!Modifier.isStatic(mainMethod.getModifiers)) {
+  throw new IllegalStateException("The main method in the given main 
class must be static")
+}
+
+val sysProps = conf.getAll.toMap
+sysProps.foreach { case (k, v) =>
+  sys.props(k) = v
+}
+
+mainMethod.invoke(null, args)
+  }
--- End diff --

@vanzin , do we need to remove all the system properties after `mainMethod` 
is finished?


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
@sjrand, can you please create another PR against branch-2.2? This one is not 
auto-mergeable, thanks!


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
LGTM, merging to master and branch 2.2.


---




[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...

2017-10-20 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19519
  
LGTM.


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
I think branch-2.2 also has a similar issue when fetching resources from 
remote secure HDFS.


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
ok to test.


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
Thanks for the fix! I didn't test on a secure cluster when I added glob path 
support, so I didn't realize there was such an issue.


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia I have a very old, similar PR, #11205; maybe you can refer to it.


---




[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19469
  
@felixcheung As you can see, there are a bunch of configurations that need to be 
added here, as in https://github.com/apache-spark-on-k8s/spark/pull/516; that's 
why I'm asking for a general solution to this kind of issue.

I'm OK with merging this PR. But I suspect similar PRs will still be created in 
the future, since these issues are quite scenario-specific; users have different 
scenarios and can hit different issues related to this. So I'm just wondering if 
we could have a better solution for it.


---




[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19519
  
@vanzin, how do we leverage this new trait? Would you please explain a bit more? 
Thanks!


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia would you please fix the PR title? It seems to be broken now.


---




[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19509
  
LGTM, merging to master.


---




[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19469
  
@ssaavedra, yes, I think so. With the pull-in of k8s support, I would guess more 
configurations will need to be added to the exclusion rule. With the current 
solution, one PR per configuration doesn't make much sense; we should either 
figure out a general solution or refactor this part.

Besides, as we move to Structured Streaming, do we need to spend more effort on 
these issues? @zsxwing


---




[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

2017-10-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19509
  
LGTM, just one minor comment.


---




[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

2017-10-18 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19509#discussion_r145329972
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala 
---
@@ -347,6 +347,10 @@ package object config {
 .timeConf(TimeUnit.MILLISECONDS)
 .createWithDefault(Long.MaxValue)
 
+  private[spark] val KERBEROS_RELOGIN_PERIOD = 
ConfigBuilder("spark.yarn.kerberos.relogin.period")
+.timeConf(TimeUnit.SECONDS)
+.createWithDefaultString("1m")
--- End diff --

I think we should put this into the docs. Also, is once per minute too frequent a relogin interval?
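
For example, if the default does turn out to be too chatty, a documented entry would at least let users tune it; a minimal usage sketch:

```scala
// Hypothetical tuning example: stretch the Kerberos relogin check to every five minutes.
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.kerberos.relogin.period", "5m")
```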


---




[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

2017-10-18 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19509
  
I see, thanks for the explanation. I hadn't thought about such a scenario.


---




[GitHub] spark issue #19263: [SPARK-22050][CORE] Allow BlockUpdated events to be opti...

2017-10-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19263
  
@vanzin, do you have other comments?


---




[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19469
  
@ChenjunZou did you get a chance to look at the comment I left?


---




[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

2017-10-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19509
  
>The effect of this change is that now it's possible to initialize multiple,
non-concurrent SparkContext instances in the same JVM.

@vanzin, do we support this now? As I remember, it was not supported before.


---




[GitHub] spark issue #19476: [SPARK-22062][CORE] Spill large block to disk in BlockMa...

2017-10-17 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19476
  
Jenkins, retest this please.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145013312
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -653,15 +663,34 @@ private[spark] class BlockManager(
 require(blockId != null, "BlockId is null")
 var runningFailureCount = 0
 var totalFailureCount = 0
-val locations = getLocations(blockId)
+
+// Because all the remote blocks are registered in driver, so it is 
not necessary to ask
+// all the slave executors to get block status.
+val locationAndStatus = master.getLocationsAndStatus(blockId)
+
+val blockSize = locationAndStatus._2.map { status =>
+  // Disk size and mem size cannot co-exist, so it's ok to sum them 
together to get block size.
+  status.diskSize + status.memSize
--- End diff --

I get your point; I'm also thinking of using `Math.max` instead.
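
Concretely, the alternative I'm considering (same `locationAndStatus` value as in the diff above):

```scala
// Today diskSize and memSize cannot both be non-zero, so sum and max agree;
// max is just more defensive if that invariant ever changes.
val blockSize = locationAndStatus._2.map { status =>
  math.max(status.diskSize, status.memSize)
}.getOrElse(0L)
```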


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145011923
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -509,11 +508,10 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 val bmId1 = BlockManagerId("id1", localHost, 1)
 val bmId2 = BlockManagerId("id2", localHost, 2)
 val bmId3 = BlockManagerId("id3", otherHost, 3)
-when(bmMaster.getLocations(mc.any[BlockId])).thenReturn(Seq(bmId1, 
bmId2, bmId3))
--- End diff --

Agreed, will revert it back.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145011775
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -684,7 +713,7 @@ private[spark] class BlockManager(
   // take a significant amount of time. To get rid of these stale 
entries
   // we refresh the block locations after a certain number of 
fetch failures
   if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
-locationIterator = getLocations(blockId).iterator
+locationIterator = 
sortLocations(master.getLocationsAndStatus(blockId)._1).iterator
--- End diff --

Agreed, will change it.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145010440
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -653,15 +663,34 @@ private[spark] class BlockManager(
 require(blockId != null, "BlockId is null")
 var runningFailureCount = 0
 var totalFailureCount = 0
-val locations = getLocations(blockId)
+
+// Because all the remote blocks are registered in driver, so it is 
not necessary to ask
+// all the slave executors to get block status.
+val locationAndStatus = master.getLocationsAndStatus(blockId)
+
+val blockSize = locationAndStatus._2.map { status =>
+  // Disk size and mem size cannot co-exist, so it's ok to sum them 
together to get block size.
+  status.diskSize + status.memSize
--- End diff --

Are you saying we need to check `StorageLevel` to decide whether to use 
diskSize or memSize as block size?


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145009567
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -662,7 +662,9 @@ private[spark] object SparkConf extends Logging {
 "spark.yarn.jars" -> Seq(
   AlternateConfig("spark.yarn.jar", "2.0")),
 "spark.yarn.access.hadoopFileSystems" -> Seq(
-  AlternateConfig("spark.yarn.access.namenodes", "2.2"))
+  AlternateConfig("spark.yarn.access.namenodes", "2.2")),
+"spark.maxRemoteBlockSizeFetchToMem" -> Seq(
--- End diff --

Yes, I think so. `SparkConf` will print out a warning log if we add it here.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r145009167
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -653,15 +663,34 @@ private[spark] class BlockManager(
 require(blockId != null, "BlockId is null")
 var runningFailureCount = 0
 var totalFailureCount = 0
-val locations = getLocations(blockId)
+
+// Because all the remote blocks are registered in driver, so it is 
not necessary to ask
+// all the slave executors to get block status.
+val locationAndStatus = master.getLocationsAndStatus(blockId)
+
+val blockSize = locationAndStatus._2.map { status =>
+  // Disk size and mem size cannot co-exist, so it's ok to sum them 
together to get block size.
+  status.diskSize + status.memSize
--- End diff --

@jiangxb1987 would you please explain more? I'm not quite following your 
comment. Are you referring to the below line `  }.getOrElse(0L)`?


---




[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-16 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19419
  
LGTM.


---




[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...

2017-10-16 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19396
  
Sorry for the late response. I understand your purpose now. I think such a 
behavior discrepancy is not a big problem.

I guess the reason why the NM still runs after the exception is that the NM 
doesn't serve only Spark but also MR/Tez, so a failure of the Spark external 
service should not affect MR's.

Based on your comment above, I don't have a strong preference either way; I think 
both are OK. Maybe you can ping others to get their feedback.


---




[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144775941
  
--- Diff: docs/security.md ---
@@ -186,7 +186,54 @@ configure those ports.
   
 
 
+### HTTP Security Headers
+
+Apache Spark can be configured to include HTTP Headers which aids in 
preventing Cross 
+Site Scripting (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing and also 
enforces HTTP 
+Strict Transport Security.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
--- End diff --

I think in Spark we follow a 2-space indent for HTML code. You can refer to the 
other docs.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144775481
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1552,4 +1582,65 @@ private[spark] object BlockManager {
 override val metricRegistry = new MetricRegistry
 metricRegistry.registerAll(metricSet)
   }
+
+  class RemoteBlockTempFileManager(blockManager: BlockManager)
+  extends TempFileManager with Logging {
+
+private class ReferenceWithCleanup(file: File, referenceQueue: 
JReferenceQueue[File])
+extends WeakReference[File](file, referenceQueue) {
+  private val filePath = file.getAbsolutePath
+
+  def cleanUp(): Unit = {
+logDebug(s"Clean up file $filePath")
+
+if (!new File(filePath).delete()) {
+  logDebug(s"Fail to delete file $filePath")
+}
+  }
+}
--- End diff --

My concern is that, for the shuffle part, since there's an explicit `cleanup` API 
for temp files, it's not really necessary to also track them with weak references. 
Weak references are only processed on GC, and shuffle operations are usually much 
more frequent and heavier, so using weak references to track temp shuffle files 
would probably increase GC overhead. Whereas, compared to shuffle, fetching remote 
blocks happens only occasionally, when a block is not cached locally, so using 
weak references there should not increase the overhead much.
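
For reference, a standalone sketch of the weak-reference-plus-reference-queue cleanup pattern being discussed (simplified; not the PR's `RemoteBlockTempFileManager`):

```scala
import java.io.File
import java.lang.ref.{ReferenceQueue, WeakReference}
import java.util.concurrent.ConcurrentHashMap

class FileCleaner {
  private val queue = new ReferenceQueue[File]()
  // Keep the references strongly reachable so they are enqueued instead of being collected.
  private val pending = new ConcurrentHashMap[WeakReference[File], String]()

  def track(file: File): Unit =
    pending.put(new WeakReference(file, queue), file.getAbsolutePath)

  // Called periodically (the PR uses a background polling thread); deletion only
  // happens after GC has cleared the File reference, hence the GC-overhead concern above.
  def reapOnce(): Unit = {
    var ref = queue.poll()
    while (ref != null) {
      val path = pending.remove(ref)
      if (path != null) new File(path).delete()
      ref = queue.poll()
    }
  }
}
```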


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144770999
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1552,4 +1582,65 @@ private[spark] object BlockManager {
 override val metricRegistry = new MetricRegistry
 metricRegistry.registerAll(metricSet)
   }
+
+  class RemoteBlockTempFileManager(blockManager: BlockManager)
+  extends TempFileManager with Logging {
+
+private class ReferenceWithCleanup(file: File, referenceQueue: 
JReferenceQueue[File])
+extends WeakReference[File](file, referenceQueue) {
+  private val filePath = file.getAbsolutePath
+
+  def cleanUp(): Unit = {
+logDebug(s"Clean up file $filePath")
+
+if (!new File(filePath).delete()) {
+  logDebug(s"Fail to delete file $filePath")
+}
+  }
+}
--- End diff --

Yes, that's what I mean. Whether it returns false (the caller deletes the file) or 
true (`ShuffleBlockFetcherIterator` deletes it), the file will be deleted, so my 
question is why there is still a file leak issue.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144769226
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1552,4 +1582,65 @@ private[spark] object BlockManager {
 override val metricRegistry = new MetricRegistry
 metricRegistry.registerAll(metricSet)
   }
+
+  class RemoteBlockTempFileManager(blockManager: BlockManager)
+  extends TempFileManager with Logging {
+
+private class ReferenceWithCleanup(file: File, referenceQueue: 
JReferenceQueue[File])
+extends WeakReference[File](file, referenceQueue) {
+  private val filePath = file.getAbsolutePath
+
+  def cleanUp(): Unit = {
+logDebug(s"Clean up file $filePath")
+
+if (!new File(filePath).delete()) {
+  logDebug(s"Fail to delete file $filePath")
+}
+  }
+}
--- End diff --

But here in `ShuffleBlockFetcherIterator#registerTempFileToClean`, the caller will 
delete the file if it returns false; does it still have a file leak problem?


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144765076
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1552,4 +1582,65 @@ private[spark] object BlockManager {
 override val metricRegistry = new MetricRegistry
 metricRegistry.registerAll(metricSet)
   }
+
+  class RemoteBlockTempFileManager(blockManager: BlockManager)
+  extends TempFileManager with Logging {
+
+private class ReferenceWithCleanup(file: File, referenceQueue: 
JReferenceQueue[File])
+extends WeakReference[File](file, referenceQueue) {
+  private val filePath = file.getAbsolutePath
+
+  def cleanUp(): Unit = {
+logDebug(s"Clean up file $filePath")
+
+if (!new File(filePath).delete()) {
+  logDebug(s"Fail to delete file $filePath")
+}
+  }
+}
--- End diff --

I think the overhead is not big, but I'm not sure why there's a file leak issue 
here in `ShuffleBlockFetcherIterator` (the implementation of `TempFileManager`). 
From the code, all the temp files are tracked in `shuffleFilesSet` and will be 
deleted during `cleanup`; can you please elaborate?


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144764761
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -355,11 +355,21 @@ package object config {
   .doc("The blocks of a shuffle request will be fetched to disk when 
size of the request is " +
 "above this threshold. This is to avoid a giant request takes too 
much memory. We can " +
 "enable this config by setting a specific value(e.g. 200m). Note 
that this config can " +
-"be enabled only when the shuffle shuffle service is newer than 
Spark-2.2 or the shuffle" +
+"be enabled only when the shuffle service is newer than Spark-2.2 
or the shuffle" +
 " service is disabled.")
   .bytesConf(ByteUnit.BYTE)
   .createWithDefault(Long.MaxValue)
 
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM =
+ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem")
+  .doc("Remote block will be fetched to disk when size of the block is 
" +
+"above this threshold. This is to avoid a giant request takes too 
much memory. We can " +
+"enable this config by setting a specific value(e.g. 200m). Note 
this configuration will " +
+"affect both shuffle fetch and block manager remote block fetch. 
For users who " +
+"enabled external shuffle service, this feature can only be worked 
when external shuffle" +
+" service is newer than Spark 2.2.")
+  .fallbackConf(REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM)
--- End diff --

Thanks, let me check it.


---




[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-16 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144763884
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -355,11 +355,21 @@ package object config {
   .doc("The blocks of a shuffle request will be fetched to disk when 
size of the request is " +
 "above this threshold. This is to avoid a giant request takes too 
much memory. We can " +
 "enable this config by setting a specific value(e.g. 200m). Note 
that this config can " +
-"be enabled only when the shuffle shuffle service is newer than 
Spark-2.2 or the shuffle" +
+"be enabled only when the shuffle service is newer than Spark-2.2 
or the shuffle" +
 " service is disabled.")
   .bytesConf(ByteUnit.BYTE)
   .createWithDefault(Long.MaxValue)
 
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM =
+ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem")
+  .doc("Remote block will be fetched to disk when size of the block is 
" +
+"above this threshold. This is to avoid a giant request takes too 
much memory. We can " +
+"enable this config by setting a specific value(e.g. 200m). Note 
this configuration will " +
+"affect both shuffle fetch and block manager remote block fetch. 
For users who " +
+"enabled external shuffle service, this feature can only be worked 
when external shuffle" +
+" service is newer than Spark 2.2.")
+  .fallbackConf(REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM)
--- End diff --

From my understanding of the current code, it will not fall back to the 
deprecated config if we're using the API `SparkConf#get[T](entry: 
ConfigEntry[T])`, unless we specifically add a `fallbackConf` definition. This is 
different from `SparkConf#getOption(key: String)`.
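
A short sketch of what I mean (reusing the entry names from the diff above; the comments just describe the two lookup paths):

```scala
// With a typed entry, conf.get(entry) only resolves the entry's own key plus
// whatever fallback is declared on the entry itself:
private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM =
  ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem")
    .fallbackConf(REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM)   // explicit fallback needed

// conf.get(MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM)          // falls back via the line above
// conf.getOption("spark.maxRemoteBlockSizeFetchToMem")  // plain string lookup; does not
//                                                       // know about the typed fallback
```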




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-13 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19419
  

>/home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/internal/config/package.scala:440:0:
 Whitespace at end of line

Please fix the style issue.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-13 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19419
  
ok to test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144488504
  
--- Diff: docs/configuration.md ---
@@ -2013,7 +2013,62 @@ Apart from these, the following properties are also 
available, and may be useful
 
 
 
+### HTTP Security Headers
--- End diff --

I think this could move to `security.md` as security-related advanced 
configuration. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144472362
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], 
numSlices: Int,
+  outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): 
Unit = {
+  val dataRDD = sc.parallelize(data, numSlices)
+  val output = new File(tempDir, "output" + outputSuffix)
+  dataRDD.saveAsHadoopFile[TextOutputFormat[String, 
String]](output.getPath)
+  assert(new File(output, checkPart).exists() === true)
+  val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
+  assert(hadoopRDD.partitions.length === expectedPartitionNum)
+}
+
+// Ensure that if all of the splits are empty, we remove the splits 
correctly
+testIgnoreEmptySplits(
+  data = Array.empty[Tuple2[String, String]],
+  numSlices = 1,
+  outputSuffix = 0,
+  checkPart = "part-0",
+  expectedPartitionNum = 0)
+
+// Ensure that if no split is empty, we don't lose any splits
+testIgnoreEmptySplits(
+  data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")),
+  numSlices = 2,
+  outputSuffix = 1,
+  checkPart = "part-1",
+  expectedPartitionNum = 2)
+
+// Ensure that if part of the splits are empty, we remove the splits 
correctly
+testIgnoreEmptySplits(
+  data = Array(("key1", "a"), ("key2", "a")),
+  numSlices = 5,
+  outputSuffix = 2,
+  checkPart = "part-4",
+  expectedPartitionNum = 2)
+  }
+
+  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], 
numSlices: Int,
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-13 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144472244
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], 
numSlices: Int,
--- End diff --

nit: one argument per line.
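
i.e. something like (formatting illustration only):

```scala
def testIgnoreEmptySplits(
    data: Array[(String, String)],
    numSlices: Int,
    outputSuffix: Int,
    checkPart: String,
    expectedPartitionNum: Int): Unit = {
  // ...
}
```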


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144456817
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
 .toSequence
 .createOptional
 
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

I would prefer to use `spark.storage.maxRemoteBlockSizeFetchToMemory`, 
since the driver-side block manager will also leverage this feature.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19476#discussion_r144453507
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
 .toSequence
 .createOptional
 
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

I was thinking about this, but the configuration name behind 
`REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM` seems too shuffle-specific; maybe we 
should rename it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19476: [SPARK-22062][CORE] Spill large block to disk in BlockMa...

2017-10-12 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19476
  
@cloud-fan @jiangxb1987 @jinxing64 would you please help review this when you 
have time? Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19458#discussion_r144450997
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, 
deleteFilesOnStop: Boolea
 
   /** List all the blocks currently stored on disk by the disk manager. */
   def getAllBlocks(): Seq[BlockId] = {
-getAllFiles().map(f => BlockId(f.getName))
+getAllFiles().flatMap { f =>
+  val blockId = BlockId.guess(f.getName)
--- End diff --

It doesn't seem necessary to define a new `guess` method just for this one 
use. I think we can still use `apply` here and catch/log the exception. In 
other words, we can simply change `apply()` and use it here; defining a new 
`guess` method isn't really needed.
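
Roughly what I have in mind (just a sketch, assuming the surrounding `DiskBlockManager` context with `getAllFiles()` and `logWarning` available):

```scala
import scala.util.control.NonFatal

def getAllBlocks(): Seq[BlockId] = {
  getAllFiles().flatMap { f =>
    try {
      Some(BlockId(f.getName))
    } catch {
      case NonFatal(e) =>
        // skip files whose names cannot be parsed instead of failing the caller
        logWarning(s"Skipping unparseable file in block manager directory: ${f.getName}", e)
        None
    }
  }
}
```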


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19263: [SPARK-22050][CORE] Allow BlockUpdated events to ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19263#discussion_r19147
  
--- Diff: docs/configuration.md ---
@@ -714,6 +714,13 @@ Apart from these, the following properties are also 
available, and may be useful
 
 Property NameDefaultMeaning
 
+  spark.eventLog.blockUpdates
--- End diff --

I think it would be better to change the configuration name to end with 
".enabled", to reflect that this is a boolean configuration.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19263: [SPARK-22050][CORE] Allow BlockUpdated events to ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19263#discussion_r18138
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -41,6 +41,22 @@ package object config {
 .bytesConf(ByteUnit.MiB)
 .createWithDefaultString("1g")
 
+  private[spark] val EVENT_LOG_COMPRESS =
+
ConfigBuilder("spark.eventLog.compress").booleanConf.createWithDefault(false)
+
+  private[spark] val EVENT_LOG_BLOCK_UPDATES =
+
ConfigBuilder("spark.eventLog.blockUpdates").booleanConf.createWithDefault(false)
+
+  private[spark] val EVENT_LOG_TESTING =
+
ConfigBuilder("spark.eventLog.testing").booleanConf.createWithDefault(false)
--- End diff --

I think this configuration should be marked as `internal()`.
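
i.e. something like:

```scala
private[spark] val EVENT_LOG_TESTING =
  ConfigBuilder("spark.eventLog.testing")
    .internal()
    .booleanConf
    .createWithDefault(false)
```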


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144286844
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -79,6 +79,9 @@ private[spark] object JettyUtils extends Logging {
 val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom")
 val xFrameOptionsValue =
   allowFramingFrom.map(uri => s"ALLOW-FROM 
$uri").getOrElse("SAMEORIGIN")
+val xXssProtectionValue = conf.getOption("spark.ui.xXssProtection")
--- End diff --

It follows the newer code convention; newly added configurations should be 
changed to use it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-12 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144283398
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -89,6 +92,13 @@ private[spark] object JettyUtils extends Logging {
 val result = servletParams.responder(request)
 response.setHeader("Cache-Control", "no-cache, no-store, 
must-revalidate")
 response.setHeader("X-Frame-Options", xFrameOptionsValue)
+
xXssProtectionValue.foreach(response.setHeader("X-XSS-Protection", _))
+if (xContentTypeOptionsValue.get.equalsIgnoreCase("true")) {
+  response.setHeader("X-Content-Type-Options", "nosniff")
+}
+if (conf.get("spark.ssl.enabled").equalsIgnoreCase("true")) {
--- End diff --

I think you can check whether `request.scheme` is "https" or not instead?
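
A rough sketch of what I'm suggesting (assuming `request` is the `HttpServletRequest` already in scope, and `strictTransportSecurityValue` is a made-up name for the configured header value as an `Option[String]`):

```scala
// Only emit the HSTS header when the request actually came in over HTTPS.
if (request.getScheme == "https") {
  strictTransportSecurityValue.foreach { value =>
    response.setHeader("Strict-Transport-Security", value)
  }
}
```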


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-12 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19458
  
Yes, I agree that in any case it should not throw an exception. But in this PR 
you filtered out temp shuffle/local blocks. Do you consider those blocks valid 
or not? Are they blocks?

So I would prefer not to filter out those blocks, but instead to add two 
parsing rules for them, and for any other illegal files (which cannot be 
parsed) to catch and log the exception.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

2017-10-11 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19458
  
Instead of filtering out temp blocks, why not add parsing rules for 
`TempLocalBlockId` and `TempShuffleBlockId`? That would also solve the problem. 
Since `DiskBlockManager#getAllFiles` doesn't filter out temp shuffle/local 
files, wouldn't it be better to keep the same behavior for 
`DiskBlockManager#getAllBlocks`?
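
A rough sketch of the extra parsing rules in `BlockId.apply` (the regexes are my guess based on how `TempLocalBlockId`/`TempShuffleBlockId` build their names):

```scala
// Inside object BlockId:
val TEMP_LOCAL = "temp_local_([-A-Fa-f0-9]+)".r
val TEMP_SHUFFLE = "temp_shuffle_([-A-Fa-f0-9]+)".r

def apply(name: String): BlockId = name match {
  // ... existing cases ...
  case TEMP_LOCAL(uuid) => TempLocalBlockId(java.util.UUID.fromString(uuid))
  case TEMP_SHUFFLE(uuid) => TempShuffleBlockId(java.util.UUID.fromString(uuid))
  case _ => throw new IllegalStateException("Unrecognized BlockId: " + name)
}
```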


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r144186220
  
--- Diff: conf/spark-defaults.conf.template ---
@@ -25,3 +25,10 @@
 # spark.serializer 
org.apache.spark.serializer.KryoSerializer
 # spark.driver.memory  5g
 # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value 
-Dnumbers="one two three"
+
+# spark.ui.allowFramingFrom https://www.example.com/
+# spark.ui.xXssProtection   1; mode=block
+# spark.ui.xContentType.options nosniff
+
+# Enable below only when Spark is running on HTTPS
+# spark.ui.strictTransportSecurity  max-age=31536000
--- End diff --

What's the meaning of this specific number "31536000"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...

2017-10-11 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19419
  
@vanzin @tgravescs @ajbozarth  what is your opinion on this PR? Is it a 
necessary fix for Spark? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-10-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r144182701
  
--- Diff: pom.xml ---
@@ -2649,6 +2649,13 @@
 
 
 
+  kubernetes
--- End diff --

We should also change the sbt build file so that this works when building with sbt.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

2017-10-11 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19464
  
IIUC this issue also exists in `NewHadoopRDD` and possibly `FileScanRDD`; we'd 
better fix those as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144181321
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -196,7 +196,10 @@ class HadoopRDD[K, V](
 // add the credentials here as this can be called before SparkContext 
initialized
 SparkHadoopUtil.get.addCredentials(jobConf)
 val inputFormat = getInputFormat(jobConf)
-val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+if 
(sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
--- End diff --

I would suggest not using a name that starts with "spark.hadoop": 
configurations with that prefix are treated as Hadoop configuration and copied 
into the Hadoop `Configuration`, so it might be better to choose another name.
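
To illustrate why the prefix matters, this is roughly (in simplified form) what Spark does with "spark.hadoop.*" keys when building the Hadoop configuration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Simplified version of the copying behavior: every "spark.hadoop.*" key is
// pushed into the Hadoop Configuration with the prefix stripped.
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  conf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
  }
}
```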


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...

2017-10-11 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/19476

[SPARK-22062][CORE] Spill large block to disk in BlockManager's remote 
fetch to avoid OOM

## What changes were proposed in this pull request?

Currently, BlockManager's `getRemoteBytes` calls 
`BlockTransferService#fetchBlockSync` to get a remote block. In 
`fetchBlockSync`, Spark allocates a temporary `ByteBuffer` to store the 
whole fetched block. This can lead to OOM if the block size is too big 
or several blocks are fetched simultaneously in the same executor.

So, leveraging the idea of shuffle fetch, this change spills large blocks to 
local disk before they are consumed by upstream code. The behavior is controlled by 
a newly added configuration: if the block size is smaller than the threshold, the 
block is kept in memory; otherwise it is first spilled to disk 
and then read back from the disk file.
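
A simplified sketch of that decision (names and helper signatures here are illustrative, not the exact PR code):

```scala
import java.io.File

sealed trait FetchedBlock
case class InMemoryBlock(bytes: Array[Byte]) extends FetchedBlock
case class OnDiskBlock(file: File) extends FetchedBlock

def fetchRemote(
    blockId: String,
    blockSize: Long,
    threshold: Long,
    fetchToMemory: String => Array[Byte],   // existing in-memory fetch path
    fetchToFile: (String, File) => Unit,    // streams bytes straight into a file
    newTempFile: () => File): FetchedBlock = {
  if (blockSize < threshold) {
    InMemoryBlock(fetchToMemory(blockId))   // small block: keep in memory
  } else {
    val tmp = newTempFile()                 // tracked by the TempFileManager
    fetchToFile(blockId, tmp)               // large block: never fully held in memory
    OnDiskBlock(tmp)
  }
}
```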

To achieve this feature, what I did is:

1. Rename `TempShuffleFileManager` to `TempFileManager`, since it is no 
longer used only by shuffle.
2. Add a new `TempFileManager` implementation to manage the files of fetched remote 
blocks; the files are tracked by weak references and are deleted once they are no 
longer in use.

## How was this patch tested?

This was tested by adding unit tests, as well as manual verification in a local 
test that triggered GC to clean up the files.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-22062

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19476


commit f50a7b75c303bd2cf261dfb1b4fe74fa5498ca4b
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-10-12T01:47:35Z

Spill large blocks to disk during remote fetches in BlockManager




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-10 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19469
  
There's a similar PR, #19427. I was wondering if we can provide a general 
solution for such issues, e.g. using a configuration to specify all the confs 
that need to be reloaded, like `spark.streaming.confsToReload = 
spark.yarn.jars,spark.xx.xx`, so that we don't need to fix related issues again 
and again. What do you think?
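
A very rough sketch of the idea (the configuration name and the plumbing here are purely hypothetical):

```scala
import org.apache.spark.SparkConf

// On recovery, overwrite the values restored from the checkpoint with the
// freshly submitted ones, for every key the user listed in the new setting.
def reloadUserListedConfs(restored: SparkConf, current: SparkConf): SparkConf = {
  val keys = current.get("spark.streaming.confsToReload", "")
    .split(",").map(_.trim).filter(_.nonEmpty)
  keys.foreach { key =>
    current.getOption(key).foreach(value => restored.set(key, value))
  }
  restored
}
```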


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19466: [SPARK-22237] [CORE] Fix spark submit file download for ...

2017-10-10 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19466
  
Would you please show us an example of how it breaks? The code here, which 
assigns all resources to local ones, might work, but it hides which line is 
really broken. Can you please describe it in more detail? Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page

2017-10-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19399
  
I agree with @squito that the criteria for defining an application's success 
should be carefully considered. In your current code, the application is marked as 
successful only if all of its jobs are successful. Isn't that too strict, since 
it doesn't allow any failure and retry? Besides, if an application 
runs all of its Spark jobs successfully but fails in its own code (e.g. saving 
to a DB) and exits with a non-zero code, should we mark the 
application as succeeded or failed?

Also, the structure that tracks all the jobs, `jobToStatus`, will grow 
memory usage unboundedly in long-running applications.

Besides, with your changes the page loading time will 
increase; for applications that have many jobs (like Spark Streaming) 
the problem will be severe.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-10-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19287
  
LGTM, merging to master. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...

2017-10-08 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19077#discussion_r143380706
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -116,9 +116,10 @@ private [sql] object GenArrayData {
s"final ArrayData $arrayDataName = new 
$genericArrayClass($arrayName);",
arrayDataName)
 } else {
+  val numBytes = elementType.defaultSize * numElements
   val unsafeArraySizeInBytes =
 UnsafeArrayData.calculateHeaderPortionInBytes(numElements) +
-
ByteArrayMethods.roundNumberOfBytesToNearestWord(elementType.defaultSize * 
numElements)
+ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes).toInt
--- End diff --

Minor: why don't we inline this instead of creating a new variable?
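
i.e. (reusing the identifiers from the diff above):

```scala
val unsafeArraySizeInBytes =
  UnsafeArrayData.calculateHeaderPortionInBytes(numElements) +
    ByteArrayMethods.roundNumberOfBytesToNearestWord(
      elementType.defaultSize * numElements).toInt
```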


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-10-08 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19287
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...

2017-10-08 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19419#discussion_r143377794
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -79,6 +79,9 @@ private[spark] object JettyUtils extends Logging {
 val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom")
 val xFrameOptionsValue =
   allowFramingFrom.map(uri => s"ALLOW-FROM 
$uri").getOrElse("SAMEORIGIN")
+val xXssProtectionValue = conf.getOption("spark.ui.xXssProtection")
--- End diff --

Please use `ConfigEntry` for newly added configurations; you can refer to 
`org.apache.spark.internal.config`.
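
For example (entry name and placement are just suggestions; keeping it optional preserves the current `getOption` behavior):

```scala
// In org.apache.spark.internal.config:
private[spark] val UI_X_XSS_PROTECTION =
  ConfigBuilder("spark.ui.xXssProtection")
    .doc("Value for the HTTP X-XSS-Protection response header.")
    .stringConf
    .createOptional

// Then in JettyUtils:
// val xXssProtectionValue = conf.get(UI_X_XSS_PROTECTION)
```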


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


