svn commit: r31831 - in /dev/spark/2.4.1-SNAPSHOT-2019_01_08_23_01-6277a9f-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-08 Thread pwendell
Author: pwendell
Date: Wed Jan  9 07:16:32 2019
New Revision: 31831

Log:
Apache Spark 2.4.1-SNAPSHOT-2019_01_08_23_01-6277a9f docs


[This commit notification would consist of 1476 parts, which exceeds the limit of 50, so it was shortened to this summary.]




svn commit: r31829 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_08_20_55-dbbba80-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-08 Thread pwendell
Author: pwendell
Date: Wed Jan  9 05:08:50 2019
New Revision: 31829

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_08_20_55-dbbba80 docs


[This commit notification would consist of 1775 parts, which exceeds the limit of 50, so it was shortened to this summary.]




[spark] branch master updated: [SPARK-26549][PYSPARK] Fix Python worker reuse not taking effect for parallelize over a lazy iterable range

2019-01-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dbbba80  [SPARK-26549][PYSPARK] Fix Python worker reuse not taking effect for parallelize over a lazy iterable range
dbbba80 is described below

commit dbbba80b3cb319b147dcf82a69963eee662e289f
Author: Yuanjian Li 
AuthorDate: Wed Jan 9 11:55:12 2019 +0800

[SPARK-26549][PYSPARK] Fix Python worker reuse not taking effect for parallelize over a lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) on the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happens because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates data from a lazy iterable range and does not use the passed-in iterator. That breaks the end-of-stream check in the Python worker and ultimately prevents worker reuse from taking effect. See more details in [SPARK-26549](https://issue [...]

We fix this by forcing the use of the passed-in iterator.
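
As a minimal sketch of the check (it mirrors the new unit test added below and is not part of the patch itself): with `spark.python.worker.reuse` enabled (the default), a second job over the same parallelized range should run in the worker processes left over from the first job.

```
import os
from pyspark import SparkContext

sc = SparkContext("local[4]", "worker-reuse-sketch")
rdd = sc.parallelize(range(20), 8)  # xrange(20) on Python 2
first_pids = rdd.map(lambda _: os.getpid()).collect()
second_pids = rdd.map(lambda _: os.getpid()).collect()
# With the fix, the second run reuses the workers from the first run.
print(set(second_pids).issubset(set(first_pids)))
sc.stop()
```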

## How was this patch tested?
New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/context.py   |  8 
 python/pyspark/tests/test_worker.py | 12 +++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 64178eb..316fbc8 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -498,6 +498,14 @@ class SparkContext(object):
 return start0 + int((split * size / numSlices)) * step
 
 def f(split, iterator):
+# it's an empty iterator here but we need this line for triggering the
+# logic of signal handling in FramedSerializer.load_stream, for instance,
+# SpecialLengths.END_OF_DATA_SECTION in _read_with_length. Since
+# FramedSerializer.load_stream produces a generator, the control should
+# at least be in that function once. Here we do it by explicitly converting
+# the empty iterator to a list, thus make sure worker reuse takes effect.
+# See more details in SPARK-26549.
+assert len(list(iterator)) == 0
 return xrange(getStart(split), getStart(split + 1), step)
 
 return self.parallelize([], numSlices).mapPartitionsWithIndex(f)
diff --git a/python/pyspark/tests/test_worker.py b/python/pyspark/tests/test_worker.py
index a33b77d..a4f108f 100644
--- a/python/pyspark/tests/test_worker.py
+++ b/python/pyspark/tests/test_worker.py
@@ -22,7 +22,7 @@ import time
 
 from py4j.protocol import Py4JJavaError
 
-from pyspark.testing.utils import ReusedPySparkTestCase, QuietTest
+from pyspark.testing.utils import ReusedPySparkTestCase, PySparkTestCase, QuietTest
 
 if sys.version_info[0] >= 3:
 xrange = range
@@ -145,6 +145,16 @@ class WorkerTests(ReusedPySparkTestCase):
 self.sc.pythonVer = version
 
 
+class WorkerReuseTest(PySparkTestCase):
+
+def test_reuse_worker_of_parallelize_xrange(self):
+rdd = self.sc.parallelize(xrange(20), 8)
+previous_pids = rdd.map(lambda x: os.getpid()).collect()
+current_pids = rdd.map(lambda x: os.getpid()).collect()
+for pid in current_pids:
+self.assertTrue(pid in previous_pids)
+
+
 if __name__ == "__main__":
 import unittest
 from pyspark.tests.test_worker import *





[spark] branch branch-2.4 updated: [SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat

2019-01-08 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 6277a9f  [SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat
6277a9f is described below

commit 6277a9f8f9e8f024110056c8d12eb7d205d6d1f4
Author: Gengliang Wang 
AuthorDate: Wed Jan 9 10:18:33 2019 +0800

[SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat

## What changes were proposed in this pull request?

Currently a Spark table maintains the Hive catalog storage format so that Hive clients can read it. In `HiveSerDe.scala`, Spark uses a mapping from its data sources to Hive SerDes. The mapping is outdated; we need to update it with the latest canonical names of the Parquet and Orc FileFormat classes.

Otherwise the following queries will result in a wrong SerDe value in the Hive table (default value `org.apache.hadoop.mapred.SequenceFileInputFormat`), and the Hive client will fail to read the output table:
```
df.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").saveAsTable(..)
```

```
df.write.format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat").saveAsTable(..)
```

This minor PR is to fix the mapping.
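
As an illustrative check (not part of the patch; it assumes an existing `spark` session created with Hive support, and the table name is arbitrary), a table saved with the canonical ParquetFileFormat class name should now report the Parquet SerDe instead of the SequenceFile default:

```
df = spark.range(10)
df.write.format(
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
).saveAsTable("serde_check_parquet")
# Expect Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
spark.sql("DESCRIBE FORMATTED serde_check_parquet").show(100, truncate=False)
```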

## How was this patch tested?

Unit test.

Closes #23491 from gengliangwang/fixHiveSerdeMap.

Authored-by: Gengliang Wang 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 311f32f37fbeaebe9dfa0b8dc2a111ee99b583b7)
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/internal/HiveSerDe.scala  |  2 ++
 .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 18 ++
 .../spark/sql/hive/orc/HiveOrcSourceSuite.scala| 29 --
 3 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
index eca612f..bd25a64 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
@@ -74,8 +74,10 @@ object HiveSerDe {
   def sourceToSerDe(source: String): Option[HiveSerDe] = {
 val key = source.toLowerCase(Locale.ROOT) match {
   case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
+  case s if s.startsWith("org.apache.spark.sql.execution.datasources.parquet") => "parquet"
   case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
   case s if s.startsWith("org.apache.spark.sql.hive.orc") => "orc"
+  case s if s.startsWith("org.apache.spark.sql.execution.datasources.orc") 
=> "orc"
   case s if s.equals("orcfile") => "orc"
   case s if s.equals("parquetfile") => "parquet"
   case s if s.equals("avrofile") => "avro"
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
index 688b619..5c9261c 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
@@ -159,10 +159,28 @@ class DataSourceWithHiveMetastoreCatalogSuite
   "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
 )),
 
+"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat" -> 
((
+  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+  "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+)),
+
 "orc" -> ((
   "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
   "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
   "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
+)),
+
+"org.apache.spark.sql.hive.orc" -> ((
+  "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
+)),
+
+"org.apache.spark.sql.execution.datasources.orc.OrcFileFormat" -> ((
+  "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
 ))
   ).foreach { case (provider, (inputFormat, outputFormat, serde)) =>
 test(s"Persist non-partitioned $provider relation into metastore as 
managed table") {
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala
index c1ae2f6..c0bf181 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/or

[spark] branch master updated: [SPARK-26529] Add debug logs for confArchive when preparing local resource

2019-01-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eb42bb49 [SPARK-26529] Add debug logs for confArchive when preparing local resource
eb42bb49 is described below

commit eb42bb493b1d7c79e9516660b71aec66bdde5d51
Author: Liupengcheng 
AuthorDate: Wed Jan 9 10:39:25 2019 +0800

[SPARK-26529] Add debug logs for confArchive when preparing local resource

## What changes were proposed in this pull request?

Currently, `Client#createConfArchive` does not handle IOException, and some detailed info is not provided in the logs. Sometimes this can delay locating the root cause of an I/O error.
This PR adds debug logs for the confArchive when preparing local resources.

## How was this patch tested?

unittest

Closes #23444 from liupc/Add-logs-for-IOException-when-preparing-local-resource.

Authored-by: Liupengcheng 
Signed-off-by: Hyukjin Kwon 
---
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala| 1 +
 1 file changed, 1 insertion(+)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 9f09dc0..8492180 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -713,6 +713,7 @@ private[spark] class Client(
   new File(Utils.getLocalDir(sparkConf)))
 val confStream = new ZipOutputStream(new FileOutputStream(confArchive))
 
+logDebug(s"Creating an archive with the config files for distribution at 
$confArchive.")
 try {
   confStream.setLevel(0)
 





[spark] branch master updated: [SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat

2019-01-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 311f32f  [SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat
311f32f is described below

commit 311f32f37fbeaebe9dfa0b8dc2a111ee99b583b7
Author: Gengliang Wang 
AuthorDate: Wed Jan 9 10:18:33 2019 +0800

[SPARK-26571][SQL] Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat

## What changes were proposed in this pull request?

Currently a Spark table maintains the Hive catalog storage format so that Hive clients can read it. In `HiveSerDe.scala`, Spark uses a mapping from its data sources to Hive SerDes. The mapping is outdated; we need to update it with the latest canonical names of the Parquet and Orc FileFormat classes.

Otherwise the following queries will result in a wrong SerDe value in the Hive table (default value `org.apache.hadoop.mapred.SequenceFileInputFormat`), and the Hive client will fail to read the output table:
```
df.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").saveAsTable(..)
```

```
df.write.format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat").saveAsTable(..)
```

This minor PR is to fix the mapping.

## How was this patch tested?

Unit test.

Closes #23491 from gengliangwang/fixHiveSerdeMap.

Authored-by: Gengliang Wang 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/internal/HiveSerDe.scala  |  2 ++
 .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 18 ++
 .../spark/sql/hive/orc/HiveOrcSourceSuite.scala| 29 --
 3 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
index eca612f..bd25a64 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala
@@ -74,8 +74,10 @@ object HiveSerDe {
   def sourceToSerDe(source: String): Option[HiveSerDe] = {
 val key = source.toLowerCase(Locale.ROOT) match {
   case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
+  case s if s.startsWith("org.apache.spark.sql.execution.datasources.parquet") => "parquet"
   case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
   case s if s.startsWith("org.apache.spark.sql.hive.orc") => "orc"
+  case s if s.startsWith("org.apache.spark.sql.execution.datasources.orc") 
=> "orc"
   case s if s.equals("orcfile") => "orc"
   case s if s.equals("parquetfile") => "parquet"
   case s if s.equals("avrofile") => "avro"
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
index 688b619..5c9261c 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala
@@ -159,10 +159,28 @@ class DataSourceWithHiveMetastoreCatalogSuite
   "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
 )),
 
+"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat" -> 
((
+  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+  "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+)),
+
 "orc" -> ((
   "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
   "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
   "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
+)),
+
+"org.apache.spark.sql.hive.orc" -> ((
+  "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
+)),
+
+"org.apache.spark.sql.execution.datasources.orc.OrcFileFormat" -> ((
+  "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
+  "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
 ))
   ).foreach { case (provider, (inputFormat, outputFormat, serde)) =>
 test(s"Persist non-partitioned $provider relation into metastore as 
managed table") {
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala
index 7fefaf5..c46512b 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala
@@ 

svn commit: r31820 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_08_12_33-32515d2-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-08 Thread pwendell
Author: pwendell
Date: Tue Jan  8 20:46:06 2019
New Revision: 31820

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_08_12_33-32515d2 docs


[This commit notification would consist of 1775 parts, which exceeds the limit of 50, so it was shortened to this summary.]




[spark] branch master updated: [SPARK-26349][PYSPARK] Forbid insecure py4j gateways

2019-01-08 Thread cutlerb
This is an automated email from the ASF dual-hosted git repository.

cutlerb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32515d2  [SPARK-26349][PYSPARK] Forbid insecure py4j gateways
32515d2 is described below

commit 32515d205a4de4d8838226fa5e5c4e4f66935193
Author: Imran Rashid 
AuthorDate: Tue Jan 8 11:26:36 2019 -0800

[SPARK-26349][PYSPARK] Forbid insecure py4j gateways

Spark always creates secure Py4J connections between Java and Python, but it also allows users to pass in their own connection. This change ensures that even passed-in connections are secure.

Added test cases verifying the failure with a (mocked) insecure gateway.

This is closely related to SPARK-26019, but this entirely forbids the
insecure connection, rather than creating the "escape-hatch".
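
A short sketch of the resulting behavior (it mirrors the added test case; the namedtuples only stand in for a real py4j gateway object): constructing a SparkContext with a gateway that carries no auth token now fails fast with a ValueError.

```
from collections import namedtuple
from pyspark import SparkContext

# Build a minimal object that looks like a py4j gateway without an auth token.
parameters = namedtuple('MockGatewayParameters', 'auth_token')(None)
insecure_gateway = namedtuple('MockJavaGateway', 'gateway_parameters')(parameters)

try:
    SparkContext(gateway=insecure_gateway)
except ValueError as e:
    print(e)  # "You are trying to pass an insecure Py4j gateway to Spark. ..."
```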

Closes #23441 from squito/SPARK-26349.

Authored-by: Imran Rashid 
Signed-off-by: Bryan Cutler 
---
 python/pyspark/context.py|  5 +
 python/pyspark/tests/test_context.py | 10 ++
 2 files changed, 15 insertions(+)

diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 6137ed2..64178eb 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -115,6 +115,11 @@ class SparkContext(object):
 ValueError:...
 """
 self._callsite = first_spark_call() or CallSite(None, None, None)
+if gateway is not None and gateway.gateway_parameters.auth_token is None:
+raise ValueError(
+"You are trying to pass an insecure Py4j gateway to Spark. This"
+" is not allowed as it is a security risk.")
+
 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
 try:
 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
diff --git a/python/pyspark/tests/test_context.py b/python/pyspark/tests/test_context.py
index 201baf4..18d9cd4 100644
--- a/python/pyspark/tests/test_context.py
+++ b/python/pyspark/tests/test_context.py
@@ -20,6 +20,7 @@ import tempfile
 import threading
 import time
 import unittest
+from collections import namedtuple
 
 from pyspark import SparkFiles, SparkContext
 from pyspark.testing.utils import ReusedPySparkTestCase, PySparkTestCase, QuietTest, SPARK_HOME
@@ -246,6 +247,15 @@ class ContextTests(unittest.TestCase):
 with SparkContext() as sc:
 self.assertGreater(sc.startTime, 0)
 
+def test_forbid_insecure_gateway(self):
+# Fail immediately if you try to create a SparkContext
+# with an insecure gateway
+parameters = namedtuple('MockGatewayParameters', 'auth_token')(None)
+mock_insecure_gateway = namedtuple('MockJavaGateway', 'gateway_parameters')(parameters)
+with self.assertRaises(ValueError) as context:
+SparkContext(gateway=mock_insecure_gateway)
+self.assertIn("insecure Py4j gateway", str(context.exception))
+
 
 if __name__ == "__main__":
 from pyspark.tests.test_context import *





[spark] branch master updated: [SPARK-24920][CORE] Allow sharing Netty's memory pool allocators

2019-01-08 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e103c4a  [SPARK-24920][CORE] Allow sharing Netty's memory pool allocators
e103c4a is described below

commit e103c4a5e72bab8862ff49d6d4c1e62e642fc412
Author: “attilapiros” 
AuthorDate: Tue Jan 8 13:11:11 2019 -0600

[SPARK-24920][CORE] Allow sharing Netty's memory pool allocators

## What changes were proposed in this pull request?

This PR introduces shared pooled ByteBuf allocators.
The feature can be enabled via the "spark.network.sharedByteBufAllocators.enabled" configuration.

When it is enabled, only two pooled ByteBuf allocators are created:
- one for transport servers, where caching is allowed, and
- one for transport clients, where caching is disabled.

This way the cache allowance remains as before.
Both shareable pools are created with the numCores parameter set to 0 (which defaults to the available processors), because conf.serverThreads() and conf.clientThreads() are module dependent and lazy creation of these allocators would lead to unpredictable behaviour.

When "spark.network.sharedByteBufAllocators.enabled" is false, a new allocator is created for every transport client and server separately, as before this PR.
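
For reference, a minimal sketch of enabling this from an application (the configuration key comes from this change; the master, app name, and the use of PySpark here are illustrative only):

```
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("shared-bytebuf-allocators")
        # Share the pooled ByteBuf allocators across transport clients and servers.
        .set("spark.network.sharedByteBufAllocators.enabled", "true"))
sc = SparkContext(conf=conf)
```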

## How was this patch tested?

Existing unit tests.

Closes #23278 from attilapiros/SPARK-24920.

Authored-by: “attilapiros” 
Signed-off-by: Sean Owen 
---
 .../network/client/TransportClientFactory.java | 11 +++--
 .../spark/network/server/TransportServer.java  | 17 +---
 .../org/apache/spark/network/util/NettyUtils.java  | 48 ++
 .../apache/spark/network/util/TransportConf.java   | 18 
 .../spark/network/netty/SparkTransportConf.scala   | 25 +--
 docs/configuration.md  | 10 +
 6 files changed, 97 insertions(+), 32 deletions(-)

diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
index 16d242d..a8e2715 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
@@ -84,7 +84,7 @@ public class TransportClientFactory implements Closeable {
 
   private final Class socketChannelClass;
   private EventLoopGroup workerGroup;
-  private PooledByteBufAllocator pooledAllocator;
+  private final PooledByteBufAllocator pooledAllocator;
   private final NettyMemoryMetrics metrics;
 
   public TransportClientFactory(
@@ -103,8 +103,13 @@ public class TransportClientFactory implements Closeable {
 ioMode,
 conf.clientThreads(),
 conf.getModuleName() + "-client");
-this.pooledAllocator = NettyUtils.createPooledByteBufAllocator(
-  conf.preferDirectBufs(), false /* allowCache */, conf.clientThreads());
+if (conf.sharedByteBufAllocators()) {
+  this.pooledAllocator = NettyUtils.getSharedPooledByteBufAllocator(
+  conf.preferDirectBufsForSharedByteBufAllocators(), false /* allowCache */);
+} else {
+  this.pooledAllocator = NettyUtils.createPooledByteBufAllocator(
+  conf.preferDirectBufs(), false /* allowCache */, conf.clientThreads());
+}
 this.metrics = new NettyMemoryMetrics(
   this.pooledAllocator, conf.getModuleName() + "-client", conf);
   }
diff --git a/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java b/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
index eb5f10a..a0ecde2 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java
@@ -54,6 +54,7 @@ public class TransportServer implements Closeable {
   private ServerBootstrap bootstrap;
   private ChannelFuture channelFuture;
   private int port = -1;
+  private final PooledByteBufAllocator pooledAllocator;
   private NettyMemoryMetrics metrics;
 
   /**
@@ -69,6 +70,13 @@ public class TransportServer implements Closeable {
 this.context = context;
 this.conf = context.getConf();
 this.appRpcHandler = appRpcHandler;
+if (conf.sharedByteBufAllocators()) {
+  this.pooledAllocator = NettyUtils.getSharedPooledByteBufAllocator(
+  conf.preferDirectBufsForSharedByteBufAllocators(), true /* allowCache */);
+} else {
+  this.pooledAllocator = NettyUtils.createPooledByteBufAllocator(
+  conf.preferDirectBufs(), true /* allowCache */, conf.serverThreads());
+}
 this.bootstraps = 
Li

[spark] branch master updated: [SPARK-24522][UI] Create filter to apply HTTP security checks consistently.

2019-01-08 Thread irashid
This is an automated email from the ASF dual-hosted git repository.

irashid pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2783e4c  [SPARK-24522][UI] Create filter to apply HTTP security checks consistently.
2783e4c is described below

commit 2783e4c45f55f4fc87748d1c4a454bfdf3024156
Author: Marcelo Vanzin 
AuthorDate: Tue Jan 8 11:25:33 2019 -0600

[SPARK-24522][UI] Create filter to apply HTTP security checks consistently.

Currently there is code scattered in a bunch of places to do different
things related to HTTP security, such as access control, setting
security-related headers, and filtering out bad content. This makes it
really easy to miss these things when writing new UI code.

This change creates a new filter that does all of those things, and
makes sure that all servlet handlers that are attached to the UI get
the new filter and any user-defined filters consistently. The extent
of the actual features should be the same as before.

The new filter is added at the end of the filter chain, because 
authentication
is done by custom filters and thus needs to happen first. This means that
custom filters see unfiltered HTTP requests - which is actually the current
behavior anyway.

As a side-effect of some of the code refactoring, handlers added after
the initial set also get wrapped with a GzipHandler, which didn't happen
before.

Tested with added unit tests and in a history server with SPNEGO auth
configured.

Closes #23302 from vanzin/SPARK-24522.

Authored-by: Marcelo Vanzin 
Signed-off-by: Imran Rashid 
---
 .../apache/spark/deploy/history/HistoryPage.scala  |   5 +-
 .../spark/deploy/history/HistoryServer.scala   |   8 +-
 .../spark/deploy/master/ui/ApplicationPage.scala   |   3 +-
 .../apache/spark/deploy/master/ui/MasterPage.scala |   6 +-
 .../apache/spark/deploy/worker/ui/LogPage.scala|  28 ++--
 .../spark/deploy/worker/ui/WorkerWebUI.scala   |   1 -
 .../apache/spark/metrics/sink/MetricsServlet.scala |   2 +-
 .../spark/status/api/v1/SecurityFilter.scala   |  36 -
 .../org/apache/spark/ui/HttpSecurityFilter.scala   | 116 +++
 .../scala/org/apache/spark/ui/JettyUtils.scala | 154 +---
 .../main/scala/org/apache/spark/ui/UIUtils.scala   |  21 ---
 .../src/main/scala/org/apache/spark/ui/WebUI.scala |  15 +-
 .../spark/ui/exec/ExecutorThreadDumpPage.scala |   4 +-
 .../org/apache/spark/ui/jobs/AllJobsPage.scala |  16 +--
 .../scala/org/apache/spark/ui/jobs/JobPage.scala   |   3 +-
 .../scala/org/apache/spark/ui/jobs/JobsTab.scala   |   4 +-
 .../scala/org/apache/spark/ui/jobs/PoolPage.scala  |   3 +-
 .../scala/org/apache/spark/ui/jobs/StagePage.scala |  19 ++-
 .../org/apache/spark/ui/jobs/StageTable.scala  |  15 +-
 .../scala/org/apache/spark/ui/jobs/StagesTab.scala |   4 +-
 .../org/apache/spark/ui/storage/RDDPage.scala  |  11 +-
 .../apache/spark/ui/HttpSecurityFilterSuite.scala  | 157 +
 .../test/scala/org/apache/spark/ui/UISuite.scala   | 147 +--
 .../scala/org/apache/spark/ui/UIUtilsSuite.scala   |  39 -
 .../apache/spark/deploy/mesos/ui/DriverPage.scala  |   3 +-
 .../scheduler/cluster/YarnSchedulerBackend.scala   |  35 -
 .../cluster/YarnSchedulerBackendSuite.scala|  59 +++-
 .../spark/sql/execution/ui/AllExecutionsPage.scala |  19 +--
 .../spark/sql/execution/ui/ExecutionPage.scala |   3 +-
 .../thriftserver/ui/ThriftServerSessionPage.scala  |   3 +-
 .../org/apache/spark/streaming/ui/BatchPage.scala  |   8 +-
 31 files changed, 609 insertions(+), 338 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
index 00ca4ef..7a8ab7f 100644
--- a/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
@@ -27,9 +27,8 @@ import org.apache.spark.ui.{UIUtils, WebUIPage}
 private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") {
 
   def render(request: HttpServletRequest): Seq[Node] = {
-// stripXSS is called first to remove suspicious characters used in XSS attacks
-val requestedIncomplete =
-  Option(UIUtils.stripXSS(request.getParameter("showIncomplete"))).getOrElse("false").toBoolean
+val requestedIncomplete = Option(request.getParameter("showIncomplete"))
+  .getOrElse("false").toBoolean
 
 val displayApplications = parent.getApplicationList()
   .exists(isApplicationCompleted(_) != requestedIncomplete)
diff --git a/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala b/core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
index b930338

[spark] branch master updated (b711382 -> c101182b)

2019-01-08 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from b711382  [MINOR][WEBUI] Modify the name of the column named "shuffle spill" in the StagePage
 add c101182b [SPARK-26002][SQL] Fix day of year calculation for Julian calendar days

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/util/DateTimeUtils.scala| 56 +-
 .../sql/catalyst/util/DateTimeUtilsSuite.scala | 30 
 .../test/resources/sql-tests/inputs/datetime.sql   |  2 +
 .../resources/sql-tests/results/datetime.sql.out   | 10 +++-
 4 files changed, 86 insertions(+), 12 deletions(-)





[spark] branch master updated (72a572f -> b711382)

2019-01-08 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 72a572f  [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any
 add b711382  [MINOR][WEBUI] Modify the name of the column named "shuffle spill" in the StagePage

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/org/apache/spark/ui/static/stagepage.js   | 8 
 .../resources/org/apache/spark/ui/static/stagespage-template.html | 8 
 core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala  | 8 
 3 files changed, 12 insertions(+), 12 deletions(-)





svn commit: r31815 - in /dev/spark/3.0.0-SNAPSHOT-2019_01_08_08_12-72a572f-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2019-01-08 Thread pwendell
Author: pwendell
Date: Tue Jan  8 16:25:31 2019
New Revision: 31815

Log:
Apache Spark 3.0.0-SNAPSHOT-2019_01_08_08_12-72a572f docs


[This commit notification would consist of 1774 parts, which exceeds the limit of 50, so it was shortened to this summary.]




[spark] branch master updated: [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any

2019-01-08 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 72a572f  [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any
72a572f is described below

commit 72a572ffd6e156243b13f9243ed296f6d77b4241
Author: Wenchen Fan 
AuthorDate: Tue Jan 8 22:44:33 2019 +0800

[SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any

## What changes were proposed in this pull request?

For Scala UDFs, when checking input nullability, we skip inputs of type `Any` and only check the inputs that provide nullability info.

We should do the same when checking input types.

## How was this patch tested?

new tests

Closes #23275 from cloud-fan/udf.

Authored-by: Wenchen Fan 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  13 +-
 .../spark/sql/catalyst/expressions/ScalaUDF.scala  |   4 +-
 .../apache/spark/sql/types/AbstractDataType.scala  |   2 +-
 .../org/apache/spark/sql/UDFRegistration.scala | 216 +
 .../sql/expressions/UserDefinedFunction.scala  |  57 +++---
 .../scala/org/apache/spark/sql/functions.scala |  52 +++--
 .../test/scala/org/apache/spark/sql/UDFSuite.scala |  15 ++
 7 files changed, 175 insertions(+), 184 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
index b19aa50..13cc9b9 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@@ -882,7 +882,18 @@ object TypeCoercion {
 
   case udf: ScalaUDF if udf.inputTypes.nonEmpty =>
 val children = udf.children.zip(udf.inputTypes).map { case (in, expected) =>
-  implicitCast(in, udfInputToCastType(in.dataType, expected)).getOrElse(in)
+  // Currently Scala UDF will only expect `AnyDataType` at top level, so this trick works.
+  // In the future we should create types like `AbstractArrayType`, so that Scala UDF can
+  // accept inputs of array type of arbitrary element type.
+  if (expected == AnyDataType) {
+in
+  } else {
+implicitCast(
+  in,
+  udfInputToCastType(in.dataType, expected.asInstanceOf[DataType])
+).getOrElse(in)
+  }
+
 }
 udf.withNewChildren(children)
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
index a23aaa3..fae1119 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
@@ -21,7 +21,7 @@ import org.apache.spark.SparkException
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection}
 import org.apache.spark.sql.catalyst.expressions.codegen._
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.{AbstractDataType, DataType}
 
 /**
  * User-defined function.
@@ -48,7 +48,7 @@ case class ScalaUDF(
 dataType: DataType,
 children: Seq[Expression],
 inputsNullSafe: Seq[Boolean],
-inputTypes: Seq[DataType] = Nil,
+inputTypes: Seq[AbstractDataType] = Nil,
 udfName: Option[String] = None,
 nullable: Boolean = true,
 udfDeterministic: Boolean = true)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala
index 5367ce2..d2ef088 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala
@@ -96,7 +96,7 @@ private[sql] object TypeCollection {
 /**
  * An `AbstractDataType` that matches any concrete data types.
  */
-protected[sql] object AnyDataType extends AbstractDataType {
+protected[sql] object AnyDataType extends AbstractDataType with Serializable {
 
   // Note that since AnyDataType matches any concrete types, defaultConcreteType should never
   // be invoked.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala b/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala
index 5a3f556..fe5d1af 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala
+++ b/sql/core/src/main/scala/org/apa