[spark] branch master updated (31da907 -> 39542bb)

2021-03-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 31da907  [SPARK-34820][K8S][R] add apt-update before gnupg install
 add 39542bb  [SPARK-34790][CORE] Disable fetching shuffle blocks in batch 
when io encryption is enabled

No new revisions were added by this update.

Summary of changes:
 .../spark/shuffle/BlockStoreShuffleReader.scala|  6 --
 .../org/apache/spark/sql/internal/SQLConf.scala|  4 ++--
 .../execution/CoalesceShufflePartitionsSuite.scala | 25 +-
 3 files changed, 30 insertions(+), 5 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-34790][CORE] Disable fetching shuffle blocks in batch when io encryption is enabled

2021-03-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 461d508  [SPARK-34790][CORE] Disable fetching shuffle blocks in batch 
when io encryption is enabled
461d508 is described below

commit 461d508e23c79c1b6d71eba5cf51b8e96093cf73
Author: hezuojiao 
AuthorDate: Mon Mar 22 13:06:12 2021 -0700

[SPARK-34790][CORE] Disable fetching shuffle blocks in batch when io 
encryption is enabled

### What changes were proposed in this pull request?

This patch proposes to disable fetching shuffle blocks in batch when io
encryption is enabled. Adaptive Query Execution fetches contiguous shuffle
blocks for the same map task in batch to reduce IO and improve performance.
However, we found that batch fetching is incompatible with io encryption.
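
In essence (a minimal sketch, not the exact patch; the real change lives in
`BlockStoreShuffleReader.scala` and `SQLConf.scala`), batch fetching is now
additionally gated on the encryption setting:

```
// Hedged sketch: batch fetching of contiguous shuffle blocks is only safe
// when the blocks can be concatenated and read back as one stream, which
// io encryption breaks (blocks are encrypted independently).
import org.apache.spark.SparkConf

def canBatchFetchShuffleBlocks(conf: SparkConf): Boolean = {
  val batchFetchRequested =
    conf.getBoolean("spark.sql.adaptive.fetchShuffleBlocksInBatch", defaultValue = true)
  val ioEncryptionEnabled =
    conf.getBoolean("spark.io.encryption.enabled", defaultValue = false)
  // Fall back to per-block fetching whenever encryption is on.
  batchFetchRequested && !ioEncryptionEnabled
}
```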

### Why are the changes needed?
Before this patch, if we set `spark.io.encryption.enabled` to true and then ran
queries whose partitions were coalesced by AQE, we might get the following
error message:
```
14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
    ... 25 more

)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests.

Closes #31898 from hezuojiao/fetch_shuffle_in_batch.

Authored-by: hezuojiao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 39542bb81f8570219770bb6533c077f44f6cbd2a)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/shuffle/BlockStoreShuffleReader.scala|  6 --
 .../org/apache/spark/sql/internal/SQLConf.scala|  4 ++--
 .../execution/CoalesceShufflePartitionsSuite.scala | 25 +-
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git 

[spark] branch branch-3.1 updated: [SPARK-34820][K8S][R] add apt-update before gnupg install

2021-03-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 3f2ba77  [SPARK-34820][K8S][R] add apt-update before gnupg install
3f2ba77 is described below

commit 3f2ba77dfe58d67f0d7eba73247b5ea362157152
Author: Yikun Jiang 
AuthorDate: Mon Mar 22 10:13:31 2021 -0700

[SPARK-34820][K8S][R] add apt-update before gnupg install

### What changes were proposed in this pull request?
We added the gnupg installation in
https://github.com/apache/spark/pull/30130; we should run apt-get update before
the gnupg installation, otherwise we will get a fetch error when a package has
been updated and the cached index is stale.

See more in:
[1] 
http://apache-spark-developers-list.1001551.n3.nabble.com/K8s-Integration-test-is-unable-to-run-because-of-the-unavailable-libs-td30986.html

### Why are the changes needed?
Add an apt-get update command before the gnupg installation to avoid an
invalid package cache list.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
K8s Integration test passed

Closes #31923 from Yikun/SPARK-34820.

Authored-by: Yikun Jiang 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 31da90762efbcebc0fcdde885635612a5bcd5f6d)
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
index f63f2d0..2dd4d8c 100644
--- 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
+++ 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
@@ -27,8 +27,9 @@ RUN mkdir ${SPARK_HOME}/R
 
 # Install R 3.6.3 (http://cloud.r-project.org/bin/linux/debian/)
 RUN \
-  echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list && \
+  apt-get update && \
   apt install -y gnupg && \
+  echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> 
/etc/apt/sources.list && \
   (apt-key adv --keyserver keys.gnupg.net --recv-key 
'E19F5F87128899B192B1A2C2AD5F960A256A04AF' || apt-key adv --keyserver 
keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF') && \
   apt-get update && \
   apt install -y -t buster-cran35 r-base r-base-dev && \

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (85581f6 -> 31da907)

2021-03-22 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 85581f6  [SPARK-33925][CORE][FOLLOW-UP] Remove the unused variables 
'secMgr'
 add 31da907  [SPARK-34820][K8S][R] add apt-update before gnupg install

No new revisions were added by this update.

Summary of changes:
 .../kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-33925][CORE][FOLLOW-UP] Remove the unused variables 'secMgr'

2021-03-22 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 85581f6  [SPARK-33925][CORE][FOLLOW-UP] Remove the unused variables 
'secMgr'
85581f6 is described below

commit 85581f6dac5c116d3d83101044cc3a85bfca41bf
Author: PengLei <18066542...@189.cn>
AuthorDate: Mon Mar 22 12:02:25 2021 -0500

[SPARK-33925][CORE][FOLLOW-UP] Remove the unused variables 'secMgr'

### What changes were proposed in this pull request?
Remove the unused variable 'secMgr' in SparkSubmit.scala and
DriverWrapper.scala. In JIRA https://issues.apache.org/jira/browse/SPARK-33925,
the last usage of SecurityManager in Utils.fetchFile was removed, so we don't
need the variable anymore.

### Why are the changes needed?
For better readability of the code.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Manually compiled. GitHub Actions and Jenkins builds should test it out as
well.

Closes #31928 from Peng-Lei/rm_secMgr.

Authored-by: PengLei <18066542...@189.cn>
Signed-off-by: Sean Owen 
---
 core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala  | 1 -
 core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala | 1 -
 2 files changed, 2 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 2a89078..e5fd027 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -366,7 +366,6 @@ private[spark] class SparkSubmit extends Logging {
 args.pyFiles = Option(args.pyFiles).map(resolveGlobPaths(_, 
hadoopConf)).orNull
 args.archives = Option(args.archives).map(resolveGlobPaths(_, 
hadoopConf)).orNull
 
-lazy val secMgr = new SecurityManager(sparkConf)
 
 // In client mode, download remote files.
 var localPrimaryResource: String = null
diff --git 
a/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala 
b/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
index 61fb929..9176897 100644
--- a/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
@@ -74,7 +74,6 @@ object DriverWrapper extends Logging {
 
   private def setupDependencies(loader: MutableURLClassLoader, userJar: 
String): Unit = {
 val sparkConf = new SparkConf()
-val secMgr = new SecurityManager(sparkConf)
 val hadoopConf = SparkHadoopUtil.newConfiguration(sparkConf)
 
 val ivyProperties = DependencyUtils.getIvyProperties()

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated (ce58e05 -> 5685d84)

2021-03-22 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ce58e05  [SPARK-34719][SQL][2.4] Correctly resolve the view query with 
duplicated column names
 add 5685d84  [SPARK-34726][SQL][2.4] Fix collectToPython timeouts

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/api/python/PythonRDD.scala|  5 ++-
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  7 ++--
 .../scala/org/apache/spark/sql/DatasetSuite.scala  | 41 --
 3 files changed, 46 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-34719][SQL][2.4] Correctly resolve the view query with duplicated column names

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new ce58e05  [SPARK-34719][SQL][2.4] Correctly resolve the view query with 
duplicated column names
ce58e05 is described below

commit ce58e05714591e11d851f3604ade190fe550a8d5
Author: Wenchen Fan 
AuthorDate: Sat Mar 20 11:09:50 2021 +0900

[SPARK-34719][SQL][2.4] Correctly resolve the view query with duplicated 
column names

Backport of https://github.com/apache/spark/pull/31811 to branch-2.4.

For permanent views (and the new SQL temp view in Spark 3.1), we store the 
view SQL text and re-parse/analyze the view SQL text when reading the view. In 
the case of `SELECT * FROM ...`, we want to avoid view schema change (e.g. the 
referenced table changes its schema) and will record the view query output 
column names when creating the view, so that when reading the view we can add a 
`SELECT recorded_column_names FROM ...` to retain the original view query 
schema.

In Spark 3.1 and before, the final SELECT is added after the analysis 
phase: 
https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala#L67

If the view query has duplicated output column names, we always pick the 
first column when reading a view. A simple repro:
```
scala> sql("create view c(x, y) as select 1 a, 2 a")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from c").show
+---+---+
|  x|  y|
+---+---+
|  1|  1|
+---+---+
```

In the master branch, we fail at view reading time due to
https://github.com/apache/spark/commit/b891862fb6b740b103d5a09530626ee4e0e8f6e3,
which adds the final SELECT during analysis, so the query fails with
`Reference 'a' is ambiguous`.

This PR proposes to resolve the view query output column names from the 
matching attributes by ordinal.

For example, for `create view c(x, y) as select 1 a, 2 a`, the view query
output column names are `[a, a]`. When we read the view, there are 2 matching
attributes (e.g. `[a#1, a#2]`) and we can simply match them by ordinal.

A negative example is
```
create table t(a int)
create view v as select *, 1 as col from t
replace table t(a int, col int)
```
When reading the view, the view query output column names are `[a, col]`, but
there are two matching attributes for `col`, so we should fail the query. See
the tests for details.
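
The ordinal-matching idea, as a hypothetical standalone sketch (`Attr` stands
in for Catalyst's `Attribute`; the real rule is `EliminateView` and also
honors `conf.resolver` for name comparison):

```
case class Attr(name: String, exprId: Int)

// Match recorded view column names to the query output by ordinal: the i-th
// occurrence of a name picks the i-th output attribute with that name.
def resolveByOrdinal(recorded: Seq[String], output: Seq[Attr]): Seq[Attr] = {
  val nextIndex = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  recorded.map { name =>
    val matches = output.filter(_.name == name)
    // If the duplication counts differ (the negative example above), the
    // reference is genuinely ambiguous and the query should fail.
    require(matches.length == recorded.count(_ == name),
      s"Reference '$name' is ambiguous")
    val i = nextIndex(name)
    nextIndex(name) = i + 1
    matches(i)
  }
}

// resolveByOrdinal(Seq("a", "a"), Seq(Attr("a", 1), Attr("a", 2)))
//   => List(Attr("a", 1), Attr("a", 2))  -- matched by position, no ambiguity
```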

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

New test.

Closes #31894 from cloud-fan/backport.

Authored-by: Wenchen Fan 
Signed-off-by: Takeshi Yamamuro 
---
 .../apache/spark/sql/catalyst/analysis/view.scala  | 44 ++---
 .../apache/spark/sql/execution/SQLViewSuite.scala  | 46 ++
 2 files changed, 84 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
index 6a94f51..44f4a63 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala
@@ -17,7 +17,10 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.expressions.Alias
+import java.util.Locale
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute}
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.internal.SQLConf
@@ -60,15 +63,44 @@ object EliminateView extends Rule[LogicalPlan] with 
CastSupport {
 // The child has the different output attributes with the View operator. 
Adds a Project over
 // the child of the view.
 case v @ View(desc, output, child) if child.resolved && output != 
child.output =>
+  // Use the stored view query output column names to find the matching 
attributes. The column
+  // names may have duplication, e.g. `CREATE VIEW v(x, y) AS SELECT 1 
col, 2 col`. We need to
+  // make sure that the matching attributes have the same number of 
duplications, and pick the
+  // corresponding attribute by ordinal.
   val resolver = conf.resolver
   val queryColumnNames = desc.viewQueryColumnNames
   val queryOutput = if (queryColumnNames.nonEmpty) {
-// Find the attribute that has the expected attribute name from an 
attribute list, the names
-// are compared using conf.resolver.
-// `CheckAnalysis` already guarantees the expected attribute can be 
found for sure.
-desc.viewQueryColumnNames.map { colName =>
-  

[spark] branch master updated (ddfc75e -> 51cf0ca)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ddfc75e  [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas 
or pyarrow fail to import
 add 51cf0ca  [SPARK-34812][SQL] RowNumberLike and RankLike should not be 
nullable

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/windowExpressions.scala   |   2 +
 .../approved-plans-v1_4/q47.sf100/explain.txt  |  82 +++---
 .../approved-plans-v1_4/q47.sf100/simplified.txt   |  36 ++-
 .../approved-plans-v1_4/q47/explain.txt|  78 +++---
 .../approved-plans-v1_4/q47/simplified.txt |  36 ++-
 .../approved-plans-v1_4/q57.sf100/explain.txt  |  82 +++---
 .../approved-plans-v1_4/q57.sf100/simplified.txt   |  36 ++-
 .../approved-plans-v1_4/q57/explain.txt|  78 +++---
 .../approved-plans-v1_4/q57/simplified.txt |  36 ++-
 .../approved-plans-v2_7/q47.sf100/explain.txt  |  82 +++---
 .../approved-plans-v2_7/q47.sf100/simplified.txt   |  36 ++-
 .../approved-plans-v2_7/q47/explain.txt|  78 +++---
 .../approved-plans-v2_7/q47/simplified.txt |  36 ++-
 .../approved-plans-v2_7/q51a.sf100/explain.txt | 300 ++---
 .../approved-plans-v2_7/q51a.sf100/simplified.txt  | 234 
 .../approved-plans-v2_7/q51a/explain.txt   | 288 +---
 .../approved-plans-v2_7/q51a/simplified.txt| 224 ---
 .../approved-plans-v2_7/q57.sf100/explain.txt  |  82 +++---
 .../approved-plans-v2_7/q57.sf100/simplified.txt   |  36 ++-
 .../approved-plans-v2_7/q57/explain.txt|  78 +++---
 .../approved-plans-v2_7/q57/simplified.txt |  36 ++-
 21 files changed, 904 insertions(+), 1072 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import

2021-03-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 8236c5f  [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas 
or pyarrow fail to import
8236c5f is described below

commit 8236c5f71cd1502cb8e54a8dc2c4801bcb319764
Author: John Ayad 
AuthorDate: Mon Mar 22 23:29:28 2021 +0900

[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow 
fail to import

### What changes were proposed in this pull request?

Pass the raised `ImportError` on failing to import pandas/pyarrow. This 
will help the user identify whether pandas/pyarrow are indeed not in the 
environment or if they threw a different `ImportError`.

### Why are the changes needed?

This can already happen in pandas, for example, which could throw an
`ImportError` on its initialisation path if `dateutil` doesn't satisfy a
certain version requirement:
https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438

### Does this PR introduce _any_ user-facing change?

Yes, it will now show the root cause of the exception when pandas or arrow 
is missing during import.

### How was this patch tested?

Manually tested.

```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

After:

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
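
For JVM readers: the `raise ... from` pattern above is exception chaining. A
hedged Scala analogue, not part of this Python patch, would pass the original
failure as the cause so the root error stays visible in the trace:

```
// Hypothetical analogue of Python's `raise ImportError(...) from error`:
// chain the underlying failure as the cause of the friendlier error.
def loadOrExplain[A](requirement: String)(load: => A): A =
  try load catch {
    case e: ClassNotFoundException =>
      throw new RuntimeException(
        s"$requirement must be installed; however, it was not found.", e)
  }
```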

Closes #31902 from johnhany97/jayad/spark-34803.

Lead-authored-by: John Ayad 
Co-authored-by: John H. Ayad 
Co-authored-by: HyukjinKwon 
Signed-off-by: HyukjinKwon 
(cherry picked from commit ddfc75ec648d57f92f474d5820d03c37f20403dc)
Signed-off-by: HyukjinKwon 
---
 python/pyspark/sql/pandas/utils.py | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/pandas/utils.py 
b/python/pyspark/sql/pandas/utils.py
index 9b97676..b22603a 100644
--- a/python/pyspark/sql/pandas/utils.py
+++ b/python/pyspark/sql/pandas/utils.py
@@ -26,11 +26,12 @@ def require_minimum_pandas_version():
 try:
 import pandas
 have_pandas = True
-except ImportError:
+except ImportError as error:
 have_pandas = False
+raised_error = error
 if not have_pandas:
 raise ImportError("Pandas >= %s must be installed; however, "
-  "it was not found." % minimum_pandas_version)
+  "it was not found." % minimum_pandas_version) from 
raised_error
 if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
 raise ImportError("Pandas >= %s must be installed; however, "
   "your version was %s." % (minimum_pandas_version, 
pandas.__version__))
@@ -47,11 +48,12 @@ def require_minimum_pyarrow_version():
 try:
 import pyarrow
 have_arrow = True
-except ImportError:
+except ImportError as error:
 have_arrow = False
+raised_error = error
 if not have_arrow:
 raise ImportError("PyArrow >= %s must be installed; however, "
-  "it was not found." % minimum_pyarrow_version)
+  "it was not found." % minimum_pyarrow_version) from 
raised_error
 if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
 raise ImportError("PyArrow >= %s must be installed; however, "
   "your version was %s." % (minimum_pyarrow_version, 
pyarrow.__version__))

-

[spark] branch master updated (8a552bf -> ddfc75e)

2021-03-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 8a552bf  [SPARK-34778][BUILD] Upgrade to Avro 1.10.2
 add ddfc75e  [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas 
or pyarrow fail to import

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/pandas/utils.py | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-34778][BUILD] Upgrade to Avro 1.10.2

2021-03-22 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a552bf  [SPARK-34778][BUILD] Upgrade to Avro 1.10.2
8a552bf is described below

commit 8a552bfc767dff987be41f7f463db17395b74e6f
Author: Ismaël Mejía 
AuthorDate: Mon Mar 22 19:30:14 2021 +0800

[SPARK-34778][BUILD] Upgrade to Avro 1.10.2

### What changes were proposed in this pull request?
Update the Avro version to 1.10.2.

### Why are the changes needed?
To stay up to date with upstream and catch compatibility issues with zstd

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #31866 from iemejia/SPARK-27733-upgrade-avro-1.10.2.

Authored-by: Ismaël Mejía 
Signed-off-by: Yuming Wang 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 docs/sql-data-sources-avro.md   | 4 ++--
 .../avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala | 4 ++--
 pom.xml | 2 +-
 project/SparkBuild.scala| 2 +-
 project/plugins.sbt | 2 +-
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e6619d2..2f17c11 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -22,9 +22,9 @@ arrow-memory-netty/2.0.0//arrow-memory-netty-2.0.0.jar
 arrow-vector/2.0.0//arrow-vector-2.0.0.jar
 audience-annotations/0.5.0//audience-annotations-0.5.0.jar
 automaton/1.11-8//automaton-1.11-8.jar
-avro-ipc/1.10.1//avro-ipc-1.10.1.jar
-avro-mapred/1.10.1//avro-mapred-1.10.1.jar
-avro/1.10.1//avro-1.10.1.jar
+avro-ipc/1.10.2//avro-ipc-1.10.2.jar
+avro-mapred/1.10.2//avro-mapred-1.10.2.jar
+avro/1.10.2//avro-1.10.2.jar
 bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
 breeze-macros_2.12/1.0//breeze-macros_2.12-1.0.jar
 breeze_2.12/1.0//breeze_2.12-1.0.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index ea595a0..ea44748 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -17,9 +17,9 @@ arrow-memory-netty/2.0.0//arrow-memory-netty-2.0.0.jar
 arrow-vector/2.0.0//arrow-vector-2.0.0.jar
 audience-annotations/0.5.0//audience-annotations-0.5.0.jar
 automaton/1.11-8//automaton-1.11-8.jar
-avro-ipc/1.10.1//avro-ipc-1.10.1.jar
-avro-mapred/1.10.1//avro-mapred-1.10.1.jar
-avro/1.10.1//avro-1.10.1.jar
+avro-ipc/1.10.2//avro-ipc-1.10.2.jar
+avro-mapred/1.10.2//avro-mapred-1.10.2.jar
+avro/1.10.2//avro-1.10.2.jar
 bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
 breeze-macros_2.12/1.0//breeze-macros_2.12-1.0.jar
 breeze_2.12/1.0//breeze_2.12-1.0.jar
diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md
index 928b3d0..ab1163a 100644
--- a/docs/sql-data-sources-avro.md
+++ b/docs/sql-data-sources-avro.md
@@ -378,7 +378,7 @@ applications. Read the [Advanced Dependency 
Management](https://spark.apache
 Submission Guide for more details. 
 
 ## Supported types for Avro -> Spark SQL conversion
-Currently Spark supports reading all [primitive 
types](https://avro.apache.org/docs/1.10.1/spec.html#schema_primitive) and 
[complex types](https://avro.apache.org/docs/1.10.1/spec.html#schema_complex) 
under records of Avro.
+Currently Spark supports reading all [primitive 
types](https://avro.apache.org/docs/1.10.2/spec.html#schema_primitive) and 
[complex types](https://avro.apache.org/docs/1.10.2/spec.html#schema_complex) 
under records of Avro.
 
   Avro typeSpark SQL type
   
@@ -442,7 +442,7 @@ In addition to the types listed above, it supports reading 
`union` types. The fo
 3. `union(something, null)`, where something is any supported Avro type. This 
will be mapped to the same Spark SQL type as that of something, with nullable 
set to true.
 All other union types are considered complex. They will be mapped to 
StructType where field names are member0, member1, etc., in accordance with 
members of the union. This is consistent with the behavior when converting 
between Avro and Parquet.
 
-It also supports reading the following Avro [logical 
types](https://avro.apache.org/docs/1.10.1/spec.html#Logical+Types):
+It also supports reading the following Avro [logical 
types](https://avro.apache.org/docs/1.10.2/spec.html#Logical+Types):
 
 
   Avro logical typeAvro typeSpark 
SQL type
diff --git 
a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala 

[spark] branch master updated (7953fcd -> f44608a)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 7953fcd  [SPARK-34700][SQL] SessionCatalog's temporary view related 
APIs should take/return more concrete types
 add f44608a  [SPARK-34800][SQL] Use fine-grained lock in 
SessionCatalog.tableExists

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (e4bb975 -> 7953fcd)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from e4bb975  [SPARK-34089][CORE] HybridRowQueue should respect the 
configured memory mode
 add 7953fcd  [SPARK-34700][SQL] SessionCatalog's temporary view related 
APIs should take/return more concrete types

No new revisions were added by this update.

Summary of changes:
 .../catalyst/catalog/GlobalTempViewManager.scala   |  9 ++--
 .../sql/catalyst/catalog/SessionCatalog.scala  | 59 ++
 .../spark/sql/catalyst/analysis/AnalysisTest.scala | 52 ---
 .../catalyst/analysis/DecimalPrecisionSuite.scala  |  4 --
 .../sql/catalyst/catalog/SessionCatalogSuite.scala | 52 +--
 .../apache/spark/sql/execution/command/views.scala | 15 +++---
 .../apache/spark/sql/internal/CatalogImpl.scala|  9 ++--
 .../execution/command/PlanResolutionSuite.scala|  2 +-
 .../apache/spark/sql/internal/CatalogSuite.scala   |  9 ++--
 .../thriftserver/SparkGetColumnsOperation.scala|  4 +-
 .../apache/spark/sql/hive/ListTablesSuite.scala| 10 ++--
 11 files changed, 115 insertions(+), 110 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (ec70467 -> e4bb975)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from ec70467  [SPARK-34815][SQL] Update CSVBenchmark
 add e4bb975  [SPARK-34089][CORE] HybridRowQueue should respect the 
configured memory mode

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/memory/MemoryConsumer.java|  4 +-
 .../apache/spark/util/collection/Spillable.scala   |  2 +-
 .../expressions/RowBasedKeyValueBatch.java |  2 +-
 .../spark/sql/execution/joins/HashedRelation.scala |  2 +-
 .../spark/sql/execution/python/RowQueue.scala  |  2 +-
 .../spark/sql/execution/python/RowQueueSuite.scala | 90 --
 6 files changed, 54 insertions(+), 48 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (121883b -> ec70467)

2021-03-22 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 121883b  [SPARK-34383][SS] Optimize WAL commit phase via reducing cost 
of filesystem operations
 add ec70467  [SPARK-34815][SQL] Update CSVBenchmark

No new revisions were added by this update.

Summary of changes:
 sql/core/benchmarks/CSVBenchmark-jdk11-results.txt | 88 +++---
 sql/core/benchmarks/CSVBenchmark-results.txt   | 88 +++---
 2 files changed, 88 insertions(+), 88 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (f8838fe -> 121883b)

2021-03-22 Thread gaborgsomogyi
This is an automated email from the ASF dual-hosted git repository.

gaborgsomogyi pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from f8838fe  [SPARK-34708][SQL] Code-gen for left semi/anti broadcast 
nested loop join (build right side)
 add 121883b  [SPARK-34383][SS] Optimize WAL commit phase via reducing cost 
of filesystem operations

No new revisions were added by this update.

Summary of changes:
 .../streaming/CheckpointFileManager.scala  |  1 -
 .../sql/execution/streaming/HDFSMetadataLog.scala  | 29 --
 .../sql/execution/streaming/OffsetSeqLog.scala | 18 ++
 3 files changed, 40 insertions(+), 8 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c7bf8ad -> f8838fe)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c7bf8ad  [SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide 
at PySpark documentation
 add f8838fe  [SPARK-34708][SQL] Code-gen for left semi/anti broadcast 
nested loop join (build right side)

No new revisions were added by this update.

Summary of changes:
 .../joins/BroadcastNestedLoopJoinExec.scala| 69 +++---
 .../sql/execution/WholeStageCodegenSuite.scala | 36 +++
 .../sql/execution/joins/ExistenceJoinSuite.scala   |  2 +-
 3 files changed, 97 insertions(+), 10 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide at PySpark documentation

2021-03-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c7bf8ad  [SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide 
at PySpark documentation
c7bf8ad is described below

commit c7bf8adc387b7520bad3fe442ac672efd5a0bd1e
Author: HyukjinKwon 
AuthorDate: Mon Mar 22 15:53:39 2021 +0900

[SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide at PySpark 
documentation

### What changes were proposed in this pull request?

This PR proposes to reorder the items in the User Guide in the PySpark
documentation in order to place general guides first and advanced ones later.

### Why are the changes needed?

For users to more easily follow.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the order of the items in the documentation.

### How was this patch tested?

Manually verified the documentation after building:

https://user-images.githubusercontent.com/6477701/111945072-5537d680-8b1c-11eb-9f43-02f3ad63a509.png

FWIW, the current page: 
https://spark.apache.org/docs/latest/api/python/user_guide/index.html

Closes #31922 from HyukjinKwon/SPARK-34818.

Authored-by: HyukjinKwon 
Signed-off-by: HyukjinKwon 
---
 python/docs/source/user_guide/index.rst | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/python/docs/source/user_guide/index.rst 
b/python/docs/source/user_guide/index.rst
index 3897ab2..383effb 100644
--- a/python/docs/source/user_guide/index.rst
+++ b/python/docs/source/user_guide/index.rst
@@ -20,17 +20,8 @@
 User Guide
 ==
 
-This page is the guide for PySpark users which contains PySpark specific 
topics.
-
-.. toctree::
-:maxdepth: 2
-
-arrow_pandas
-python_packaging
-
-
-There are more guides shared with other languages in Programming Guides
-at `the Spark documentation 
`_.
+There are basic guides shared with other languages in Programming Guides
+at `the Spark documentation 
`_ as 
below:
 
 - `RDD Programming Guide 
`_
 - `Spark SQL, DataFrames and Datasets Guide 
`_
@@ -38,3 +29,10 @@ at `the Spark documentation 
`_
 - `Machine Learning Library (MLlib) Guide 
`_
 
+PySpark specific user guide is as follows:
+
+.. toctree::
+:maxdepth: 2
+
+python_packaging
+arrow_pandas

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3bef2dc -> 45235ac)

2021-03-22 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 3bef2dc  Revert "[SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT 
dependencies in spark-submit"
 add 45235ac  [SPARK-34748][SS] Create a rule of the analysis logic for 
streaming write

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/streaming/WriteToStream.scala}| 23 +++--
 .../streaming/WriteToStreamStatement.scala | 61 ++
 .../execution/streaming/MicroBatchExecution.scala  | 17 ++--
 .../execution/streaming/ResolveWriteToStream.scala | 97 ++
 .../sql/execution/streaming/StreamExecution.scala  |  2 +-
 .../streaming/continuous/ContinuousExecution.scala | 18 ++--
 .../sql/internal/BaseSessionStateBuilder.scala |  2 +
 .../sql/streaming/StreamingQueryManager.scala  | 81 --
 .../spark/sql/hive/HiveSessionStateBuilder.scala   |  2 +
 9 files changed, 211 insertions(+), 92 deletions(-)
 copy 
sql/{core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/WriteToContinuousDataSource.scala
 => 
catalyst/src/main/scala/org/apache/spark/sql/catalyst/streaming/WriteToStream.scala}
 (65%)
 create mode 100644 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/streaming/WriteToStreamStatement.scala
 create mode 100644 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ResolveWriteToStream.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (0734101 -> 3bef2dc)

2021-03-22 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0734101  [SPARK-34225][CORE] Don't encode further when a URI form 
string is passed to addFile or addJar
 add 3bef2dc  Revert "[SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT 
dependencies in spark-submit"

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/SparkSubmit.scala|  6 +-
 .../apache/spark/deploy/SparkSubmitUtilsSuite.scala| 18 --
 2 files changed, 1 insertion(+), 23 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org