[jira] [Created] (SPARK-31850) DetermineTableStats rule computes stats multiple times

2020-05-27 Thread Karuppayya (Jira)
Karuppayya created SPARK-31850:
--

 Summary: DetermineTableStats rule computes stats multiple times
 Key: SPARK-31850
 URL: https://issues.apache.org/jira/browse/SPARK-31850
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Karuppayya


The DetermineTableStats rule computes stats multiple times.

A few analyzer rules invoke

org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext
 * This method causes the logical plan to go through the analysis phase again.
 * It is invoked from rule(s) that belong to a batch which runs until a *Fixed 
point*, which means the analysis phase can run multiple times.
 * Example: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138]
 * This is a cause for concern especially for the *DetermineTableStats* rule, 
where computing stats can be expensive for a large table (and the computation 
happens multiple times due to the fixed-point nature of the batch from which 
the analysis is triggered).

 

Repro steps
{code:java}
 
spark.sql("create table c(id INT, name STRING) STORED AS PARQUET")
val df = spark.sql("select count(id) id from c group by name order by id " )
df.queryExecution.analyzed
{code}
 Note:
 * There is no log line in DetermineTableStats to indicate that stats 
computation happened; add a log line or use a debugger to observe it (see the 
sketch below).
 * The above can be reproduced with the first query on a freshly created table.
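To make the repeated application visible without a debugger, a trivial counting 
rule can be registered through SparkSessionExtensions. The sketch below is 
illustrative only (CountingRule and CountingExtension are made-up names, and it 
is not DetermineTableStats itself); it simply shows that any Rule[LogicalPlan] 
injected into the analyzer is re-applied on every iteration of a fixed-point 
batch, which is where an expensive stats computation would be paid repeatedly.
{code:java}
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule that only counts how often the analyzer applies it.
// A rule that computes HDFS table sizes would pay that cost on every application.
object CountingRule extends Rule[LogicalPlan] {
  @volatile var invocations = 0
  override def apply(plan: LogicalPlan): LogicalPlan = {
    invocations += 1
    plan
  }
}

// Register it as an extra resolution rule (e.g. via spark.sql.extensions),
// run the repro query above, then inspect CountingRule.invocations to see how
// many times the batch re-applied it.
class CountingExtension extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectResolutionRule(_ => CountingRule)
}
{code}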

 

 






[jira] [Assigned] (SPARK-31849) Improve Python exception messages to be more Pythonic

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31849:


Assignee: (was: Apache Spark)

> Improve Python exception messages to be more Pythonic
> -
>
> Key: SPARK-31849
> URL: https://issues.apache.org/jira/browse/SPARK-31849
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Current PySpark exceptions are pretty ugly, and probably the most frequently 
> reported complaint from PySpark users.
> For example, a simple UDF that throws a {{ZeroDivisionError}}:
> {code}
> from pyspark.sql.functions import udf
> @udf
> def divide_by_zero(v):
> raise v / 0
> spark.range(1).select(divide_by_zero("id")).show()
> {code}
> shows a long JVM stack trace that is very hard for Python users to 
> understand:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 
> 380, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 
> in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 
> (TID 11, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 377, in main
> process()
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 372, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 352, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 142, in dump_stream
> for obj in iterator:
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 341, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 85, in 
> return lambda *a: f(*a)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
> return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> 

[jira] [Assigned] (SPARK-31849) Improve Python exception messages to be more Pythonic

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31849:


Assignee: Apache Spark

> Improve Python exception messages to be more Pythonic
> -
>
> Key: SPARK-31849
> URL: https://issues.apache.org/jira/browse/SPARK-31849
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Current PySpark exceptions are pretty ugly, and probably the most frequently 
> reported complaint from PySpark users.
> For example, a simple UDF that throws a {{ZeroDivisionError}}:
> {code}
> from pyspark.sql.functions import udf
> @udf
> def divide_by_zero(v):
> raise v / 0
> spark.range(1).select(divide_by_zero("id")).show()
> {code}
> shows a long JVM stack trace that is very hard for Python users to 
> understand:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 
> 380, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 
> in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 
> (TID 11, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 377, in main
> process()
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 372, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 352, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 142, in dump_stream
> for obj in iterator:
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 341, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 85, in 
> return lambda *a: f(*a)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
> return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> 

[jira] [Commented] (SPARK-31849) Improve Python exception messages to be more Pythonic

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118319#comment-17118319
 ] 

Apache Spark commented on SPARK-31849:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28661

> Improve Python exception messages to be more Pythonic
> -
>
> Key: SPARK-31849
> URL: https://issues.apache.org/jira/browse/SPARK-31849
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Current PySpark exceptions are pretty ugly, and probably the most frequently 
> reported complaint from PySpark users.
> For example, a simple UDF that throws a {{ZeroDivisionError}}:
> {code}
> from pyspark.sql.functions import udf
> @udf
> def divide_by_zero(v):
> raise v / 0
> spark.range(1).select(divide_by_zero("id")).show()
> {code}
> shows a long JVM stack trace that is very hard for Python users to 
> understand:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 
> 380, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 
> in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 
> (TID 11, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 377, in main
> process()
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 372, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 352, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 142, in dump_stream
> for obj in iterator:
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 341, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", 
> line 85, in 
> return lambda *a: f(*a)
>   File 
> "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
> return f(*args, **kwargs)
>   File "", line 3, in divide_by_zero
> ZeroDivisionError: division by zero
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
>   at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  

[jira] [Created] (SPARK-31849) Improve Python exception messages to be more Pythonic

2020-05-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31849:


 Summary: Improve Python exception messages to be more Pythonic
 Key: SPARK-31849
 URL: https://issues.apache.org/jira/browse/SPARK-31849
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Current PySpark exceptions are pretty ugly, and probably the most frequently 
reported complaint from PySpark users.

For example, a simple UDF that throws a {{ZeroDivisionError}}:

{code}
from pyspark.sql.functions import udf
@udf
def divide_by_zero(v):
raise v / 0

spark.range(1).select(divide_by_zero("id")).show()
{code}

shows a long JVM stack trace that is very hard for Python users to 
understand:

{code}
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 
380, in show
print(self._jdf.showString(n, 20, vertical))
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1257, in __call__
  File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, 
in deco
return f(*a, **kw)
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 (TID 
11, localhost, executor driver): org.apache.spark.api.python.PythonException: 
Traceback (most recent call last):
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 
377, in main
process()
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 
372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", 
line 352, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", 
line 142, in dump_stream
for obj in iterator:
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", 
line 341, in _batched
for item in iterator:
  File "", line 1, in 
  File 
"/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 
85, in 
return lambda *a: f(*a)
  File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", 
line 99, in wrapper
return f(*args, **kwargs)
  File "", line 3, in divide_by_zero
ZeroDivisionError: division by zero

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
at 
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at 
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 

[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution

2020-05-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118287#comment-17118287
 ] 

Yuming Wang commented on SPARK-31841:
-

I tried to fix this issue before: https://github.com/apache/spark/pull/27986

> Dataset.repartition leverage adaptive execution
> ---
>
> Key: SPARK-31841
> URL: https://issues.apache.org/jira/browse/SPARK-31841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark branch-3.0 from may 1 this year
>Reporter: koert kuipers
>Priority: Minor
>
> hello,
> we are very happy users of adaptive query execution. It's a great feature to 
> not have to think about and tune the number of partitions in a shuffle 
> anymore.
> I noticed that Dataset.groupBy consistently uses adaptive execution when it's 
> enabled (e.g. I don't see the default 200 partitions), but when I do 
> Dataset.repartition it seems I am back to a hardcoded number of partitions.
> Should adaptive execution also be used for repartition? It would be nice to 
> be able to repartition without having to think about the optimal number of 
> partitions.
> An example:
> {code:java}
> $ spark-shell --conf spark.sql.adaptive.enabled=true --conf 
> spark.sql.adaptive.advisoryPartitionSizeInBytes=10
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val x = (1 to 100).toDF
> x: org.apache.spark.sql.DataFrame = [value: int]
> scala> x.rdd.getNumPartitions
> res0: Int = 2
> scala> x.repartition($"value").rdd.getNumPartitions
> res1: Int = 200
> scala> x.groupBy("value").count.rdd.getNumPartitions
> res2: Int = 67
> {code}
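For reference, a small sketch contrasting the two repartition overloads in 
spark-shell (the values in the comments assume a session like the one above): 
repartition(cols) falls back to spark.sql.shuffle.partitions, while the 
overload with an explicit partition count avoids the 200 default; this is not a 
fix for the issue, only an illustration of today's behavior.
{code:java}
// spark-shell sketch; implicits ($, toDF) are already in scope there.
val x = (1 to 100).toDF

// Falls back to spark.sql.shuffle.partitions (200 by default), as in the repro.
x.repartition($"value").rdd.getNumPartitions

// Explicitly requested partition count: returns 10, no 200 default involved.
x.repartition(10, $"value").rdd.getNumPartitions
{code}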






[jira] [Created] (SPARK-31847) DAGSchedulerSuite: Rewrite the test framework to cover most of the existing major features of the Spark Scheduler, mock the necessary part wisely, and make the test fram

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31847:
--

 Summary: DAGSchedulerSuite: Rewrite the test framework to cover 
most of the existing major features of the Spark Scheduler, mock the necessary 
parts wisely, and make the test framework more extensible.
 Key: SPARK-31847
 URL: https://issues.apache.org/jira/browse/SPARK-31847
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-31848) Break down the very large test files: each test suite should focus on one or several major features, not all the related behaviors

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31848:
--

 Summary: Break down the very large test files: each test suite 
should focus on one or several major features, not all the related behaviors
 Key: SPARK-31848
 URL: https://issues.apache.org/jira/browse/SPARK-31848
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-31846) DAGSchedulerSuite: For the pattern of cancel + assert, extract the general method

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31846:
--

 Summary: DAGSchedulerSuite: For the pattern of cancel + assert, 
extract the general method
 Key: SPARK-31846
 URL: https://issues.apache.org/jira/browse/SPARK-31846
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite contains many test cases (e.g. trivial job cancellation, job 
cancellation no-kill backend) that follow the pattern of cancel + assert, such as:

{code:java}
  test("trivial job cancellation") {
val rdd = new MyRDD(sc, 1, Nil)
val jobId = submit(rdd, Array(0))
cancel(jobId)
assert(failure.getMessage === s"Job $jobId cancelled ")
assert(sparkListener.failedStages === Seq(0))
assertDataStructuresEmpty()
  }
{code}
We should extract a general helper method such as cancelAndCheck(jobId: Int, predicate), for example:
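A minimal sketch of such a helper (the name and shape are illustrative; it 
reuses the suite's existing cancel and assertDataStructuresEmpty helpers):
{code:java}
// Hypothetical helper: factors out the repeated "cancel the job, then assert"
// pattern; the caller passes its test-specific assertions as a block.
private def cancelAndCheck(jobId: Int)(check: => Unit): Unit = {
  cancel(jobId)                 // existing DAGSchedulerSuite helper
  check                         // caller-supplied assertions
  assertDataStructuresEmpty()   // common post-condition across these tests
}

// Usage in "trivial job cancellation":
// cancelAndCheck(jobId) {
//   assert(failure.getMessage === s"Job $jobId cancelled ")
//   assert(sparkListener.failedStages === Seq(0))
// }
{code}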







[jira] [Created] (SPARK-31845) DAGSchedulerSuite: Improve and reuse completeNextStageWithFetchFailure

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31845:
--

 Summary: DAGSchedulerSuite: Improve and reuse 
completeNextStageWithFetchFailure
 Key: SPARK-31845
 URL: https://issues.apache.org/jira/browse/SPARK-31845
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite provides completeNextStageWithFetchFailure to make the next 
stage fail with a fetch failure, but many test cases call complete directly, as 
follows:

{code:java}
complete(taskSets(2), Seq(
  (FetchFailed(BlockManagerId("hostA-exec2", "hostA", 12345),
firstShuffleId, 0L, 0, 0, "ignored"),
null)
))
{code}

We should improve completeNextStageWithFetchFailure and reuse it in these tests, for example:
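For illustration, a parameterized variant could look like the sketch below 
(the name, defaults and exact signature are assumptions; it reuses the suite's 
complete and makeBlockManagerId helpers):
{code:java}
// Hypothetical generalization: fail every task of the given task set with a
// FetchFailed pointing at `host` and `shuffleId`, so tests no longer need to
// spell out the result tuples by hand.
private def completeStageWithFetchFailure(
    taskSetIndex: Int,
    shuffleId: Int,
    host: String = "hostA"): Unit = {
  val taskSet = taskSets(taskSetIndex)
  complete(taskSet, taskSet.tasks.indices.map { reduceId =>
    (FetchFailed(makeBlockManagerId(host), shuffleId, 0L, 0, reduceId, "ignored"), null)
  })
}
{code}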






[jira] [Created] (SPARK-31844) DAGSchedulerSuite: For the pattern of failed + assert, extract the general method

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31844:
--

 Summary: DAGSchedulerSuite: For the pattern of failed + assert, 
extract the general method
 Key: SPARK-31844
 URL: https://issues.apache.org/jira/browse/SPARK-31844
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite contains many test cases (e.g. trivial job failure, run shuffle 
with map stage failure) that follow the pattern of failed + assert, such as:

{code:java}
  test("trivial job failure") {
submit(new MyRDD(sc, 1, Nil), Array(0))
failed(taskSets(0), "some failure")
assert(failure.getMessage === "Job aborted due to stage failure: some 
failure")
assert(sparkListener.failedStages === Seq(0))
assertDataStructuresEmpty()
  }
{code}
We should extract a general helper method such as failedAndCheck(taskSet: TaskSet, 
message: String, predicate), for example:
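A minimal sketch of such a helper (illustrative names; it reuses the suite's 
failed, failure and assertDataStructuresEmpty members):
{code:java}
// Hypothetical helper: fail a task set, check the standard abort message,
// then run any extra test-specific assertions.
private def failedAndCheck(taskSet: TaskSet, message: String)(check: => Unit): Unit = {
  failed(taskSet, message)      // existing DAGSchedulerSuite helper
  assert(failure.getMessage === s"Job aborted due to stage failure: $message")
  check                         // caller-supplied assertions
  assertDataStructuresEmpty()
}

// Usage in "trivial job failure":
// failedAndCheck(taskSets(0), "some failure") {
//   assert(sparkListener.failedStages === Seq(0))
// }
{code}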







[jira] [Created] (SPARK-31843) DAGSchedulerSuite: For the pattern of complete + assert, extract the general method

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31843:
--

 Summary: DAGSchedulerSuite: For the pattern of complete + assert, 
extract the general method
 Key: SPARK-31843
 URL: https://issues.apache.org/jira/browse/SPARK-31843
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite contains many test cases (e.g. run trivial job / run trivial 
job w/ dependency / cache location preferences w/ dependency) that follow the 
pattern of complete + assert, such as:

{code:java}
  test("run trivial job") {
submit(new MyRDD(sc, 1, Nil), Array(0))
complete(taskSets(0), List((Success, 42)))
assert(results === Map(0 -> 42))
assertDataStructuresEmpty()
  }
{code}
We should extract a general helper method such as checkAnswer(job, expected), for example:
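A minimal sketch of the proposed checkAnswer-style helper (name and signature 
are illustrative):
{code:java}
// Hypothetical helper: complete a task set with the given results and verify
// the job's answer plus the suite's standard post-conditions.
private def completeAndCheckAnswer(
    taskSet: TaskSet,
    taskResults: Seq[(TaskEndReason, Any)],
    expected: Map[Int, Any]): Unit = {
  complete(taskSet, taskResults)  // existing DAGSchedulerSuite helper
  assert(results === expected)
  assertDataStructuresEmpty()
}

// Usage in "run trivial job":
// submit(new MyRDD(sc, 1, Nil), Array(0))
// completeAndCheckAnswer(taskSets(0), List((Success, 42)), Map(0 -> 42))
{code}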







[jira] [Created] (SPARK-31842) DAGSchedulerSuite: For the pattern of runEvent + assert, extract the general method

2020-05-27 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31842:
--

 Summary: DAGSchedulerSuite: For the pattern of runEvent + assert, 
extract the general method
 Key: SPARK-31842
 URL: https://issues.apache.org/jira/browse/SPARK-31842
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite contains many test cases (e.g. late fetch failures / task events 
always posted / ignore late map task completions) that follow the pattern of 
runEvent + assert, such as:

{code:java}
runEvent(makeCompletionEvent(
  taskSets(1).tasks(0),
  FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"),
  null))
assert(sparkListener.failedStages.contains(1))
{code}
We should extract a general helper method such as runEventAndCheck(events*, predicate), for example:
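A minimal sketch of such a helper (illustrative; DAGSchedulerEvent and runEvent 
are the type and helper the suite already uses):
{code:java}
// Hypothetical helper: post one or more scheduler events, then run the
// caller-supplied assertions once they have been processed.
private def runEventAndCheck(events: DAGSchedulerEvent*)(check: => Unit): Unit = {
  events.foreach(runEvent)      // existing DAGSchedulerSuite helper
  check                         // caller-supplied assertions
}

// Usage in the late-fetch-failure test:
// runEventAndCheck(makeCompletionEvent(
//     taskSets(1).tasks(0),
//     FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"),
//     null)) {
//   assert(sparkListener.failedStages.contains(1))
// }
{code}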








[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118248#comment-17118248
 ] 

Apache Spark commented on SPARK-31764:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28660

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.
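For context, a minimal json4s sketch of what writing the flag involves (the 
"Barrier" field name and the shape of the record are assumptions made for this 
illustration; it is not the actual JsonProtocol code):
{code:java}
import org.json4s.JsonAST.JValue
import org.json4s.JsonDSL._

// Illustrative only: an RDDInfo-like record serialized with json4s, including
// the barrier flag that the write path currently omits. The read path would
// extract the same (assumed) field name symmetrically.
def rddInfoLikeToJson(id: Int, name: String, isBarrier: Boolean): JValue =
  ("RDD ID" -> id) ~
  ("Name" -> name) ~
  ("Barrier" -> isBarrier)
{code}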






[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118247#comment-17118247
 ] 

Apache Spark commented on SPARK-31764:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28660

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118244#comment-17118244
 ] 

Jungtaek Lim commented on SPARK-31764:
--

Thanks for confirming. :)

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118242#comment-17118242
 ] 

Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:34 AM:
--

As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

(I don't remember why I marked the label "bug". Maybe, it's a mistake.)

I'll make a backporting PR.


was (Author: sarutak):
As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

(I don't remember why I mark the label "bug". Maybe, it's a mistake.)

I'll make a backporting PR.

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118242#comment-17118242
 ] 

Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:34 AM:
--

As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

(I don't remember why I mark the label "bug". Maybe, it's a mistake.)

I'll make a backporting PR.


was (Author: sarutak):
As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

I'll make a backporting PR.

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118242#comment-17118242
 ] 

Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:33 AM:
--

As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

I'll make a backporting PR.


was (Author: sarutak):
As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Updated] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-31764:
---
Issue Type: Bug  (was: Improvement)

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118242#comment-17118242
 ] 

Kousuke Saruta commented on SPARK-31764:


As you say, it's more proper to mark the label "bug" rather than "improvement" 
so I think it's better to go to 3.0.

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Assigned] (SPARK-28879) Kubernetes node selector should be configurable

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28879:


Assignee: Apache Spark

> Kubernetes node selector should be configurable
> ---
>
> Key: SPARK-28879
> URL: https://issues.apache.org/jira/browse/SPARK-28879
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Franco
>Assignee: Apache Spark
>Priority: Major
>
> Similar to SPARK-25220.
>  
> Having to create pod templates is not an answer because:
> a) It's not available until Spark 3.0 which is unlikely to be deployed to 
> platforms like EMR for a while as there will be major impacts to dependent 
> libraries.
> b) It is overkill for something as fundamental as selecting different 
> nodes for your workers than for your drivers, e.g. for Spot instances.
>  
> I would request that we add this feature in the 2.4 timeframe.
>  






[jira] [Assigned] (SPARK-28879) Kubernetes node selector should be configurable

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28879:


Assignee: (was: Apache Spark)

> Kubernetes node selector should be configurable
> ---
>
> Key: SPARK-28879
> URL: https://issues.apache.org/jira/browse/SPARK-28879
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Franco
>Priority: Major
>
> Similar to SPARK-25220.
>  
> Having to create pod templates is not an answer because:
> a) It's not available until Spark 3.0 which is unlikely to be deployed to 
> platforms like EMR for a while as there will be major impacts to dependent 
> libraries.
> b) It is overkill for something as fundamental as selecting different 
> nodes for your workers than for your drivers, e.g. for Spot instances.
>  
> I would request that we add this feature in the 2.4 timeframe.
>  






[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118203#comment-17118203
 ] 

Xingbo Jiang commented on SPARK-31764:
--

The issue is reported as "Improvement" and the affected version is "3.1.0", so 
it's merged to master only. [~sarutak] Do you think this should go to 3.0?

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118201#comment-17118201
 ] 

Jungtaek Lim commented on SPARK-31764:
--

For me this looks like a bug - the description of the PR states the same. Is there 
a reason this issue is marked as an "improvement" and the fix only landed in the 
master branch? 

The bug is also present in branch-3.0: 
https://github.com/apache/spark/blob/branch-3.0/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol reads RDDInfo#isBarrier but doesn't write it.






[jira] [Assigned] (SPARK-31839) delete duplicate code

2020-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31839:


Assignee: philipse

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
>
> there is duplicate code; we can remove it to improve test quality






[jira] [Resolved] (SPARK-31839) delete duplicate code

2020-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31839.
--
Fix Version/s: 3.0.0
   2.4.6
   Resolution: Fixed

Issue resolved by pull request 28655
[https://github.com/apache/spark/pull/28655]

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>
> there is duplicate code; we can remove it to improve test quality






[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118191#comment-17118191
 ] 

Apache Spark commented on SPARK-25351:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/28659

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.1.0
>
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}






[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118189#comment-17118189
 ] 

Apache Spark commented on SPARK-25351:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/28659

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.1.0
>
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}






[jira] [Resolved] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31763.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28652
[https://github.com/apache/spark/pull/28652]

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Assignee: Rakesh Raushan
>Priority: Major
> Fix For: 3.1.0
>
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*
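For comparison, the Scala/Java API the reporter mentions exposes this directly; 
a minimal Scala sketch (the format and S3 path are placeholders):
{code:java}
// Scala side: Dataset.inputFiles returns a best-effort snapshot of the files
// backing the plan. Bucket and prefix below are placeholders.
val files: Array[String] = spark.read
  .format("parquet")
  .load("s3a://some-bucket/some-prefix/")
  .inputFiles

files.foreach(println)
{code}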






[jira] [Assigned] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31763:


Assignee: Rakesh Raushan

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Assignee: Rakesh Raushan
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*






[jira] [Commented] (SPARK-31730) Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118183#comment-17118183
 ] 

Apache Spark commented on SPARK-31730:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/28658

> Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite
> --
>
> Key: SPARK-31730
> URL: https://issues.apache.org/jira/browse/SPARK-31730
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122655/testReport/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/668/testReport/
> BarrierTaskContextSuite.support multiple barrier() call within a single task
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1031 
> was not less than or equal to 1000
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$15(BarrierTaskContextSuite.scala:157)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:58)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1124)
>   at org.scalatest.Suite.run$(Suite.scala:1106)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Resolved] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25351.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 26585
[https://github.com/apache/spark/pull/26585]

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.1.0
>
>
> Category types need some handling when calling {{createDataFrame}} with Arrow 
> enabled, or when they appear in the return value of a {{pandas_udf}}. Without 
> Arrow, Spark casts each element to its category value. For example:
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}
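
Until a release containing this fix is available, one workaround on the Python side is to strip the pandas category dtype before handing the frame to Spark, so the Arrow converter never sees a dictionary type. A minimal sketch, assuming the {{pdf}} built in the example above and an active {{spark}} session (not taken from the pull request):

{code:python}
import pandas as pd

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype("category")

# Convert every categorical column back to its underlying values so that
# createDataFrame with Arrow enabled only sees plain value types.
plain = pdf.copy()
for col in plain.select_dtypes(include="category").columns:
    plain[col] = plain[col].astype(plain[col].cat.categories.dtype)

spark.conf.set("spark.sql.execution.arrow.enabled", True)
df = spark.createDataFrame(plain)  # no "Unsupported type ... dictionary" error
df.printSchema()
{code}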



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25351:


Assignee: Jalpan Randeri

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
>
> Category types need some handling when calling {{createDataFrame}} with Arrow 
> enabled, or when they appear in the return value of a {{pandas_udf}}. Without 
> Arrow, Spark casts each element to its category value. For example:
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31730) Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite

2020-05-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-31730.
--
Fix Version/s: 3.0.0
 Assignee: Xingbo Jiang
   Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/28584

> Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite
> --
>
> Key: SPARK-31730
> URL: https://issues.apache.org/jira/browse/SPARK-31730
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122655/testReport/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/668/testReport/
> BarrierTaskContextSuite.support multiple barrier() call within a single task
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1031 
> was not less than or equal to 1000
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$15(BarrierTaskContextSuite.scala:157)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:58)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1124)
>   at org.scalatest.Suite.run$(Suite.scala:1106)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well

2020-05-27 Thread Xingcan Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingcan Cui updated SPARK-31838:

Description: 
{{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
operator well (it only assumes a linear query plan). To temporarily fix the 
broken semantics before we have a complete improvement plan for the output 
mode, the validator should do the following.

1. Allow multiple aggregations from different branches of a union operator.
2. For complete output mode, disallow a union on streams with and without 
aggregations.

  was:
{{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
operator well (it only assumes a linear query plan). To temporarily fix the 
broken semantics before we have a complete improvement plan for the output 
mode, the validator should do the following.

1. Allow multiple aggregations from different branches of the last union 
operator.
2. For complete output mode, disallow a union on streams with and without 
aggregations.


> The streaming output mode validator didn't treat union well
> ---
>
> Key: SPARK-31838
> URL: https://issues.apache.org/jira/browse/SPARK-31838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xingcan Cui
>Priority: Major
>
> {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
> operator well (it only assumes a linear query plan). To temporarily fix the 
> broken semantics before we have a complete improvement plan for the output 
> mode, the validator should do the following.
> 1. Allow multiple aggregations from different branches of a union operator.
> 2. For complete output mode, disallow a union on streams with and without 
> aggregations.
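
For concreteness, a minimal sketch (not from the ticket) of the kind of plan point 1 is about: two independent streaming aggregations whose results are unioned. It assumes a rate source and a console sink; as described above, the current checker rejects this even though each aggregation sits in its own branch.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each branch contains exactly one streaming aggregation.
agg1 = spark.readStream.format("rate").load().groupBy("value").count()
agg2 = spark.readStream.format("rate").load().groupBy("value").count()

unioned = agg1.union(agg2)

# With the proposed change, the validator should accept this in complete mode,
# since the aggregations come from different branches of the union.
query = (unioned.writeStream
         .outputMode("complete")
         .format("console")
         .start())
{code}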



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-05-27 Thread James Boylan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118140#comment-17118140
 ] 

James Boylan commented on SPARK-31800:
--

Good catch. I'll be back in the office tomorrow and I will test this scenario. 
Thank you for pointing that out. I'll update and close this ticket if that 
resolves it.

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit a process to Kubernetes using spark-submit through 
> --master, it returns the exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called
> 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8
>  {code}
> No changes in settings appear to be able to disable Kerberos. This is when 
> running a simple execution of the SparkPi on our lab cluster. The command 
> being used is
> {code:java}
> ./bin/spark-submit --master k8s://https://{api_hostname} --deploy-mode 
> cluster --name spark-test --class org.apache.spark.examples.SparkPi --conf 
> 

[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well

2020-05-27 Thread Xingcan Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingcan Cui updated SPARK-31838:

Description: 
{{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
operator well (it only assumes a linear query plan). To temporarily fix the 
broken semantics before we have a complete improvement plan for the output 
mode, the validator should do the following.

1. Allow multiple aggregations from different branches of the last union 
operator.
2. For complete output mode, disallow a union on streams with and without 
aggregations.

  was:
{{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
operator well (it only assumes a linear query plan). To temporarily fix the 
broken semantics before we have a complete improvement plan for the output 
mode, the validator should do the following.

1. Allow multiple (two for now) aggregations from different branches of the 
last union operator.
2. For complete output mode, disallow a union on streams with and without 
aggregations.


> The streaming output mode validator didn't treat union well
> ---
>
> Key: SPARK-31838
> URL: https://issues.apache.org/jira/browse/SPARK-31838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xingcan Cui
>Priority: Major
>
> {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
> operator well (it only assumes a linear query plan). To temporarily fix the 
> broken semantics before we have a complete improvement plan for the output 
> mode, the validator should do the following.
> 1. Allow multiple aggregations from different branches of the last union 
> operator.
> 2. For complete output mode, disallow a union on streams with and without 
> aggregations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution

2020-05-27 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118120#comment-17118120
 ] 

koert kuipers commented on SPARK-31841:
---

That is right, it's a feature request I believe (unless I misunderstood what's 
happening). Should I delete it here?

> Dataset.repartition leverage adaptive execution
> ---
>
> Key: SPARK-31841
> URL: https://issues.apache.org/jira/browse/SPARK-31841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark branch-3.0 from may 1 this year
>Reporter: koert kuipers
>Priority: Minor
>
> Hello,
> We are very happy users of adaptive query execution. It's a great feature to 
> not have to think about and tune the number of partitions in a shuffle anymore.
> I noticed that Dataset.groupBy consistently uses adaptive execution when it's 
> enabled (e.g. I don't see the default 200 partitions), but when I do 
> Dataset.repartition it seems I am back to a hardcoded number of partitions.
> Should adaptive execution also be used for repartition? It would be nice to 
> be able to repartition without having to think about the optimal number of 
> partitions.
> An example:
> {code:java}
> $ spark-shell --conf spark.sql.adaptive.enabled=true --conf 
> spark.sql.adaptive.advisoryPartitionSizeInBytes=10
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val x = (1 to 100).toDF
> x: org.apache.spark.sql.DataFrame = [value: int]
> scala> x.rdd.getNumPartitions
> res0: Int = 2
> scala> x.repartition($"value").rdd.getNumPartitions
> res1: Int = 200
> scala> x.groupBy("value").count.rdd.getNumPartitions
> res2: Int = 67
> {code}
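
For reference, the 200 in res1 is simply the spark.sql.shuffle.partitions default: repartition with a column expression but no explicit count falls back to it, and in the session above AQE did not coalesce that user-requested shuffle, while it did coalesce the groupBy shuffle (res2 = 67). A PySpark sketch mirroring the same session (not from the ticket; assumes the same AQE settings):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "10")
         .getOrCreate())

x = spark.range(1, 101).toDF("value")

# Falls back to spark.sql.shuffle.partitions (200 by default).
print(x.repartition("value").rdd.getNumPartitions())

# The shuffle introduced by the aggregation is coalesced by AQE.
print(x.groupBy("value").count().rdd.getNumPartitions())
{code}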



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-31764:
-
Fix Version/s: 3.1.0

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> JsonProtocol read RDDInfo#isBarrier but doesn't write it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-31764.
--
Resolution: Fixed

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> JsonProtocol read RDDInfo#isBarrier but doesn't write it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier

2020-05-27 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118118#comment-17118118
 ] 

Xingbo Jiang commented on SPARK-31764:
--

Resolved by https://github.com/apache/spark/pull/28583

> JsonProtocol doesn't write RDDInfo#isBarrier
> 
>
> Key: SPARK-31764
> URL: https://issues.apache.org/jira/browse/SPARK-31764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> JsonProtocol read RDDInfo#isBarrier but doesn't write it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution

2020-05-27 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118114#comment-17118114
 ] 

Jungtaek Lim commented on SPARK-31841:
--

It sounds like a question/feature request, which would be better moved to the 
user@ or dev@ mailing list.

> Dataset.repartition leverage adaptive execution
> ---
>
> Key: SPARK-31841
> URL: https://issues.apache.org/jira/browse/SPARK-31841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark branch-3.0 from may 1 this year
>Reporter: koert kuipers
>Priority: Minor
>
> Hello,
> We are very happy users of adaptive query execution. It's a great feature to 
> not have to think about and tune the number of partitions in a shuffle anymore.
> I noticed that Dataset.groupBy consistently uses adaptive execution when it's 
> enabled (e.g. I don't see the default 200 partitions), but when I do 
> Dataset.repartition it seems I am back to a hardcoded number of partitions.
> Should adaptive execution also be used for repartition? It would be nice to 
> be able to repartition without having to think about the optimal number of 
> partitions.
> An example:
> {code:java}
> $ spark-shell --conf spark.sql.adaptive.enabled=true --conf 
> spark.sql.adaptive.advisoryPartitionSizeInBytes=10
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val x = (1 to 100).toDF
> x: org.apache.spark.sql.DataFrame = [value: int]
> scala> x.rdd.getNumPartitions
> res0: Int = 2
> scala> x.repartition($"value").rdd.getNumPartitions
> res1: Int = 200
> scala> x.groupBy("value").count.rdd.getNumPartitions
> res2: Int = 67
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31841) Dataset.repartition leverage adaptive execution

2020-05-27 Thread koert kuipers (Jira)
koert kuipers created SPARK-31841:
-

 Summary: Dataset.repartition leverage adaptive execution
 Key: SPARK-31841
 URL: https://issues.apache.org/jira/browse/SPARK-31841
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
 Environment: spark branch-3.0 from may 1 this year
Reporter: koert kuipers


Hello,

We are very happy users of adaptive query execution. It's a great feature to not 
have to think about and tune the number of partitions in a shuffle anymore.

I noticed that Dataset.groupBy consistently uses adaptive execution when it's 
enabled (e.g. I don't see the default 200 partitions), but when I do 
Dataset.repartition it seems I am back to a hardcoded number of partitions.

Should adaptive execution also be used for repartition? It would be nice to be 
able to repartition without having to think about the optimal number of partitions.

An example:
{code:java}
$ spark-shell --conf spark.sql.adaptive.enabled=true --conf 
spark.sql.adaptive.advisoryPartitionSizeInBytes=10
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val x = (1 to 100).toDF
x: org.apache.spark.sql.DataFrame = [value: int]
scala> x.rdd.getNumPartitions
res0: Int = 2
scala> x.repartition($"value").rdd.getNumPartitions
res1: Int = 200
scala> x.groupBy("value").count.rdd.getNumPartitions
res2: Int = 67
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31797.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28534

> Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
> ---
>
> Key: SPARK-31797
> URL: https://issues.apache.org/jira/browse/SPARK-31797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: [PR link|https://github.com/apache/spark/pull/28534]
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Major
> Fix For: 3.1.0
>
>
> 1. Add and register three new functions: {{TIMESTAMP_SECONDS}}, 
> {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}}.
> Reference: 
> [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds]
> 2. People will have a convenient way to get timestamps from seconds, milliseconds 
> and microseconds.
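
For reference, a minimal usage sketch of the three functions through Spark SQL (assuming a build that contains this change; the rendered values depend on spark.sql.session.timeZone):

{code:python}
# All three expressions denote one second after the Unix epoch.
spark.sql("""
    SELECT timestamp_seconds(1)       AS from_seconds,
           timestamp_millis(1000)     AS from_millis,
           timestamp_micros(1000000)  AS from_micros
""").show(truncate=False)
{code}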



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31797:
-
Comment: was deleted

(was: Don't set Fix/Target version)

> Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
> ---
>
> Key: SPARK-31797
> URL: https://issues.apache.org/jira/browse/SPARK-31797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: [PR link|https://github.com/apache/spark/pull/28534]
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Major
>
> 1. Add and register three new functions: {{TIMESTAMP_SECONDS}}, 
> {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}}.
> Reference: 
> [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds]
> 2. People will have a convenient way to get timestamps from seconds, milliseconds 
> and microseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31827) fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form

2020-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31827:

Summary: fail datetime parsing/formatting if detect the Java 8 bug of 
stand-alone form  (was: better error message for the JDK bug of stand-alone 
form)

> fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
> -
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31797:
-
   Fix Version/s: (was: 3.1.0)
   Flags:   (was: Patch)
Target Version/s:   (was: 3.1.0)
  Labels:   (was: features)

Don't set Fix/Target version

> Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
> ---
>
> Key: SPARK-31797
> URL: https://issues.apache.org/jira/browse/SPARK-31797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: [PR link|https://github.com/apache/spark/pull/28534]
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Major
>
> 1. Add and register three new functions: {{TIMESTAMP_SECONDS}}, 
> {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}}.
> Reference: 
> [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds]
> 2. People will have a convenient way to get timestamps from seconds, milliseconds 
> and microseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table

2020-05-27 Thread Ankit Dsouza (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117992#comment-17117992
 ] 

Ankit Dsouza commented on SPARK-25185:
--

Hey [~zhenhuawang], can this be prioritized and fixed (if required) this year? 
We are using Spark 2.3.

> CBO rowcount statistics doesn't work for partitioned parquet external table
> ---
>
> Key: SPARK-25185
> URL: https://issues.apache.org/jira/browse/SPARK-25185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
> Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode 
> reading data from local file system
>Reporter: Amit
>Priority: Major
>
> Created dummy partitioned data with a string-typed partition column (col1=a 
> and col1=b):
> added CSV data -> read it through Spark -> created a partitioned external table -> 
> ran MSCK REPAIR TABLE to load the partitions. Ran ANALYZE on all columns, 
> including the partition column.
> ~println(spark.sql("select * from test_p where 
> e='1a'").queryExecution.toStringWithStats)~
>  ~val op = spark.sql("select * from test_p where 
> e='1a'").queryExecution.optimizedPlan~
> // e is the partitioned column
>  ~val stat = op.stats(spark.sessionState.conf)~
>  ~print(stat.rowCount)~
>  
> Created the same table in Parquet: the row count comes up correctly in the CSV 
> case, but for Parquet it shows as None.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()

2020-05-27 Thread Sudharshann D. (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110729#comment-17110729
 ] 

Sudharshann D. edited comment on SPARK-31579 at 5/27/20, 5:59 PM:
--

Please see my proof of concept 
[https://github.com/Sudhar287/spark/pull/1/files|https://github.com/Sudhar287/spark/pull/1/files]


was (Author: suddhuasf):
Please see my proof of concept 
[https://github.com/Sudhar287/spark/pull/1/files|http://example.com]

> Replace floorDiv by / in localRebaseGregorianToJulianDays()
> ---
>
> Key: SPARK-31579
> URL: https://issues.apache.org/jira/browse/SPARK-31579
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>  Labels: starter
>
> Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check 
> that for all available time zones in the range of [0001, 2100] years with the 
> step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv 
> can be replaced by /, and this should improve performance of 
> RebaseDateTime.localRebaseGregorianToJulianDays.
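
The reason the remainder hypothesis matters: floor division and truncating division disagree only when the dividend is negative and not an exact multiple of the divisor. A small illustration of the arithmetic in Python (not Spark code; Java's integer {{/}} truncates toward zero, {{Math.floorDiv}} floors):

{code:python}
MILLIS_PER_DAY = 24 * 60 * 60 * 1000

millis = -1  # one millisecond before the epoch, not a multiple of MILLIS_PER_DAY
print(millis // MILLIS_PER_DAY)      # -1, what Math.floorDiv would return
print(int(millis / MILLIS_PER_DAY))  #  0, what Java's truncating '/' would return

millis = -MILLIS_PER_DAY             # an exact multiple: both forms agree
print(millis // MILLIS_PER_DAY, int(millis / MILLIS_PER_DAY))  # -1 -1
{code}

So if utcCal.getTimeInMillis % MILLIS_PER_DAY is always 0, the two divisions always agree and the cheaper one can be used.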



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31835:
---

Assignee: Kent Yao

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31835.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28653
[https://github.com/apache/spark/pull/28653]

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31840) Add instance weight support in LogisticRegressionSummary

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31840:


Assignee: (was: Apache Spark)

> Add instance weight support in LogisticRegressionSummary
> 
>
> Key: SPARK-31840
> URL: https://issues.apache.org/jira/browse/SPARK-31840
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add instance weight support in LogisticRegressionSummary
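
For context, a hedged sketch of where the weights enter (toy data made up for illustration; {{weightCol}} is the existing estimator parameter, and the point of this ticket, as described, is that the summary metrics should account for it; assumes an active {{spark}} session):

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1.0, 3.0, Vectors.dense(1.0, 0.5)),
     (0.0, 1.0, Vectors.dense(-1.0, -0.5)),
     (1.0, 2.0, Vectors.dense(0.8, 0.4)),
     (0.0, 1.0, Vectors.dense(-0.7, -0.3))],
    ["label", "weight", "features"])

model = LogisticRegression(weightCol="weight").fit(df)

# Metrics exposed by the training summary.
print(model.summary.accuracy, model.summary.weightedPrecision)
{code}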



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31840) Add instance weight support in LogisticRegressionSummary

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31840:


Assignee: Apache Spark

> Add instance weight support in LogisticRegressionSummary
> 
>
> Key: SPARK-31840
> URL: https://issues.apache.org/jira/browse/SPARK-31840
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Add instance weight support in LogisticRegressionSummary



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31840) Add instance weight support in LogisticRegressionSummary

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117923#comment-17117923
 ] 

Apache Spark commented on SPARK-31840:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28657

> Add instance weight support in LogisticRegressionSummary
> 
>
> Key: SPARK-31840
> URL: https://issues.apache.org/jira/browse/SPARK-31840
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add instance weight support in LogisticRegressionSummary



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31840) Add instance weight support in LogisticRegressionSummary

2020-05-27 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31840:
--

 Summary: Add instance weight support in LogisticRegressionSummary
 Key: SPARK-31840
 URL: https://issues.apache.org/jira/browse/SPARK-31840
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add instance weight support in LogisticRegressionSummary



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2020-05-27 Thread Edgar Klerks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117887#comment-17117887
 ] 

Edgar Klerks commented on SPARK-23443:
--

Working on this. Can I open work-in-progress PRs?

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31719) Refactor JoinSelection

2020-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31719.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28540
[https://github.com/apache/spark/pull/28540]

> Refactor JoinSelection
> --
>
> Key: SPARK-31719
> URL: https://issues.apache.org/jira/browse/SPARK-31719
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.1.0
>
>
> This PR extracts the logic for selecting the planned join type out of the 
> `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. This 
> change both cleans up the code in `JoinSelection` and allows the logic to be 
> in one place and be used from other rules that need to make decisions based on 
> the join type before planning time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31719) Refactor JoinSelection

2020-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31719:
---

Assignee: Ali Afroozeh

> Refactor JoinSelection
> --
>
> Key: SPARK-31719
> URL: https://issues.apache.org/jira/browse/SPARK-31719
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
>
> This PR extracts the logic for selecting the planned join type out of the 
> `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. This 
> change both cleans up the code in `JoinSelection` and allows the logic to be 
> in one place and be used from other rules that need to make decisions based on 
> the join type before planning time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31839) delete duplicate code

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117852#comment-17117852
 ] 

Apache Spark commented on SPARK-31839:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/28655

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Priority: Minor
>
> There is duplicate code; we can remove it to improve test quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31839) delete duplicate code

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31839:


Assignee: (was: Apache Spark)

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Priority: Minor
>
> There is duplicate code; we can remove it to improve test quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31839) delete duplicate code

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117851#comment-17117851
 ] 

Apache Spark commented on SPARK-31839:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/28655

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Priority: Minor
>
> There is duplicate code; we can remove it to improve test quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31839) delete duplicate code

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31839:


Assignee: Apache Spark

> delete  duplicate code
> --
>
> Key: SPARK-31839
> URL: https://issues.apache.org/jira/browse/SPARK-31839
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.5
>Reporter: philipse
>Assignee: Apache Spark
>Priority: Minor
>
> There is duplicate code; we can remove it to improve test quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117847#comment-17117847
 ] 

Apache Spark commented on SPARK-31837:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28656

> Shift to the new highest locality level if there is when recomputeLocality
> --
>
> Key: SPARK-31837
> URL: https://issues.apache.org/jira/browse/SPARK-31837
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> In the new version of delay scheduling, if a task set is submitted before any 
> executors are added to TaskScheduler, the task set will always schedule tasks 
> at ANY locality level. We should improve this corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31837:


Assignee: (was: Apache Spark)

> Shift to the new highest locality level if there is when recomputeLocality
> --
>
> Key: SPARK-31837
> URL: https://issues.apache.org/jira/browse/SPARK-31837
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> In the new version of delay scheduling, if a task set is submitted before any 
> executors are added to TaskScheduler, the task set will always schedule tasks 
> at ANY locality level. We should improve this corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31837:


Assignee: Apache Spark

> Shift to the new highest locality level if there is when recomputeLocality
> --
>
> Key: SPARK-31837
> URL: https://issues.apache.org/jira/browse/SPARK-31837
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> In the new version of delay scheduling, if a task set is submitted before any 
> executors are added to TaskScheduler, the task set will always schedule tasks 
> at ANY locality level. We should improve this corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31839) delete duplicate code

2020-05-27 Thread philipse (Jira)
philipse created SPARK-31839:


 Summary: delete  duplicate code
 Key: SPARK-31839
 URL: https://issues.apache.org/jira/browse/SPARK-31839
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.5
Reporter: philipse


There is duplicate code; we can remove it to improve test quality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117848#comment-17117848
 ] 

Apache Spark commented on SPARK-31837:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28656

> Shift to the new highest locality level if there is when recomputeLocality
> --
>
> Key: SPARK-31837
> URL: https://issues.apache.org/jira/browse/SPARK-31837
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> In the new version of delay scheduling, if a task set is submitted before any 
> executors are added to TaskScheduler, the task set will always schedule tasks 
> at ANY locality level. We should improve this corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well

2020-05-27 Thread Xingcan Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingcan Cui updated SPARK-31838:

Parent: SPARK-31724
Issue Type: Sub-task  (was: Bug)

> The streaming output mode validator didn't treat union well
> ---
>
> Key: SPARK-31838
> URL: https://issues.apache.org/jira/browse/SPARK-31838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xingcan Cui
>Priority: Major
>
> {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
> operator well (it only assumes a linear query plan). To temporarily fix the 
> broken semantics before we have a complete improvement plan for the output 
> mode, the validator should do the following.
> 1. Allow multiple (two for now) aggregations from different branches of the 
> last union operator.
> 2. For complete output mode, disallow a union on streams with and without 
> aggregations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31838) The streaming output mode validator didn't treat union well

2020-05-27 Thread Xingcan Cui (Jira)
Xingcan Cui created SPARK-31838:
---

 Summary: The streaming output mode validator didn't treat union 
well
 Key: SPARK-31838
 URL: https://issues.apache.org/jira/browse/SPARK-31838
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Xingcan Cui


{{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union 
operator well (it only assumes a linear query plan). To temporarily fix the 
broken semantics before we have a complete improvement plan for the output 
mode, the validator should do the following.

1. Allow multiple (two for now) aggregations from different branches of the 
last union operator.
2. For complete output mode, disallow a union on streams with and without 
aggregations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality

2020-05-27 Thread wuyi (Jira)
wuyi created SPARK-31837:


 Summary: Shift to the new highest locality level if there is when 
recomputeLocality
 Key: SPARK-31837
 URL: https://issues.apache.org/jira/browse/SPARK-31837
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi


In the new version of delay scheduling, if a task set is submitted before any 
executors are added to TaskScheduler, the task set will always schedule tasks 
at ANY locality level. We should improve this corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31795) Stream Data with API to ServiceNow

2020-05-27 Thread Dominic Wetenkamp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Wetenkamp resolved SPARK-31795.
---
Resolution: Workaround

> Stream Data with API to ServiceNow
> --
>
> Key: SPARK-31795
> URL: https://issues.apache.org/jira/browse/SPARK-31795
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Dominic Wetenkamp
>Priority: Major
>
> 1) Create class
> 2) Instantiate class 
> 3) Setup stream
> 4) Write stream. Here I get a pickling error; I really don't know how to 
> get it to work without the error.
>  
>  
> class CMDB:
>  #Public Properties
>  @property
>  def streamDF(self):
>  return spark.readStream.table(self.__source_table)
>  
>  #Constructor
>  def __init__(self, destination_table, source_table):
>  self.__destination_table = destination_table
>  self.__source_table = source_table
> #Private Methodes 
>  def __processRow(self, row):
>  #API connection info
>  url = 'https://foo.service-now.com/api/now/table/' + 
> self.__destination_table + '?sysparm_display_value=true'
>  user = 'username'
>  password = 'psw'
>  
>  headers = \{"Content-Type":"application/json","Accept":"application/json"}
>  response = requests.post(url, auth=(user, password), headers=headers, data = 
> json.dumps(row.asDict()))
>  
>  return response
> #Public Methodes
>  def uploadStreamDF(self, df):
>  return df.writeStream.foreach(self.__processRow).trigger(once=True).start()
>  
> 
>  
> cmdb = CMDB('destination_table_name','source_table_name')
> streamDF = (cmdb.streamDF
>  .withColumn('object_id',col('source_column_id'))
>  .withColumn('name',col('source_column_name'))
>  ).select('object_id','name')
> #set stream works, able to display data
> cmdb.uploadStreamDF(streamDF)
> #cmdb.uploadStreamDF(streamDF) fails with error: PicklingError: Could not 
> serialize object: Exception: It appears that you are attempting to reference 
> SparkContext from a broadcast variable, action, or transformation. 
> SparkContext can only be used on the driver, not in code that it run on 
> workers. For more information, see SPARK-5063. See exception below:
> '''
> Exception Traceback (most recent call last)
> /databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
>  704 try:
> --> 705 return cloudpickle.dumps(obj, 2)
>  706 except pickle.PickleError:
> /databricks/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol)
>  862 cp = CloudPickler(file,protocol)
> --> 863 cp.dump(obj)
>  864 return file.getvalue()
> /databricks/spark/python/pyspark/cloudpickle.py in dump(self, obj)
>  259 try:
> --> 260 return Pickler.dump(self, obj)
>  261 except RuntimeError as e:
> /databricks/python/lib/python3.7/pickle.py in dump(self, obj)
>  436 self.framer.start_framing()
> --> 437 self.save(obj)
>  438 self.write(STOP)
> '''
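
A minimal sketch of one common way to avoid this kind of PicklingError, implied 
by the "Workaround" resolution but not spelled out in the ticket: keep the 
per-row HTTP call in a plain module-level function so that the {{foreach}} 
closure never captures an object whose methods reference the SparkSession. It 
assumes a Databricks/PySpark environment where {{spark}} is already defined and 
{{readStream.table}} is available (as in the report); the URL, credentials, and 
table/column names are placeholders taken from the report.

{code:python}
import json
import requests
from pyspark.sql.functions import col

DESTINATION_TABLE = "destination_table_name"  # placeholder

def post_row(row):
    # Only plain Python objects are referenced here, so the closure pickles cleanly.
    url = ("https://foo.service-now.com/api/now/table/"
           + DESTINATION_TABLE + "?sysparm_display_value=true")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return requests.post(url, auth=("username", "psw"), headers=headers,
                         data=json.dumps(row.asDict()))

# Build the stream on the driver; do not wrap it in a class whose methods touch spark.
streamDF = (spark.readStream.table("source_table_name")
            .withColumn("object_id", col("source_column_id"))
            .withColumn("name", col("source_column_name"))
            .select("object_id", "name"))

(streamDF.writeStream
    .foreach(post_row)
    .trigger(once=True)
    .start())
{code}

Another common option is {{foreachBatch}}, which hands each micro-batch to a 
function as a regular DataFrame and avoids per-row pickling concerns entirely.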



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31803) Make sure instance weight is not negative

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31803.
--
Fix Version/s: 3.1.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28621

> Make sure instance weight is not negative
> -
>
> Key: SPARK-31803
> URL: https://issues.apache.org/jira/browse/SPARK-31803
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> add checks to make sure instance weight is not negative.
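
A minimal PySpark sketch (not from the ticket, assuming an active {{spark}} 
session) of where the new validation applies: estimators that accept a 
{{weightCol}} should fail fast when a row carries a negative weight. The 
estimator choice and data below are illustrative assumptions.

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(1.0,  1.0, Vectors.dense([0.0])),
     (0.0, -1.0, Vectors.dense([1.0]))],   # negative weight: should be rejected
    ["label", "weight", "features"])

lr = LogisticRegression(weightCol="weight")
# With the added checks this is expected to raise an error instead of
# silently fitting with a negative instance weight.
model = lr.fit(train)
{code}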



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31638) Clean code for pagination for all pages

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31638.
--
Fix Version/s: 3.1.0
 Assignee: Rakesh Raushan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28448

> Clean code for pagination for all pages
> ---
>
> Key: SPARK-31638
> URL: https://issues.apache.org/jira/browse/SPARK-31638
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.1.0
>
>
> Clean code for pagination for different pages of spark webUI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages

2020-05-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31638:
-
Issue Type: Improvement  (was: Bug)

> Clean code for pagination for all pages
> ---
>
> Key: SPARK-31638
> URL: https://issues.apache.org/jira/browse/SPARK-31638
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Clean code for pagination for different pages of spark webUI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions

2020-05-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117771#comment-17117771
 ] 

Yuming Wang commented on SPARK-31809:
-


{noformat}
hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
Map Reduce Local Work
  Alias -> Map Local Tables:
$hdt$_0:t1
  Fetch Operator
limit: -1
  Alias -> Map Local Operator Tree:
$hdt$_0:t1
  TableScan
alias: t1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: COALESCE(c1,c2) is not null (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
expressions: c1 (type: string), c2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
HashTable Sink Operator
  keys:
0 COALESCE(_col0,_col1) (type: string)
1 _col0 (type: string)

  Stage: Stage-3
Map Reduce
  Map Operator Tree:
  TableScan
alias: t2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: c1 is not null (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
expressions: c1 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0 COALESCE(_col0,_col1) (type: string)
1 _col0 (type: string)
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
  File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
table:
input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Execution mode: vectorized
  Local Work:
Map Reduce Local Work

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink

Time taken: 1.281 seconds, Fetched: 67 row(s)
{noformat}


> Infer IsNotNull for all children of NullIntolerant expressions
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 

[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case

2020-05-27 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117759#comment-17117759
 ] 

Sandeep Katta commented on SPARK-28054:
---

[~hyukjin.kwon] Is there any reason why this PR was not backported to branch-2.4?

> Unable to insert partitioned table dynamically when partition name is upper 
> case
> 
>
> Key: SPARK-28054
> URL: https://issues.apache.org/jira/browse/SPARK-28054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: ChenKai
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> -- create sql and column name is upper case
> CREATE TABLE src (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)
> -- insert sql
> INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds
> {code}
> The error is:
> {code:java}
> Error in query: 
> org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: 
> Partition spec {ds=, DS=1} contains non-partition columns;
> {code}
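
A hedged sketch of a workaround on 2.4.x (not from the ticket), assuming an 
active {{spark}} session with Hive support: declare the partition column in 
lowercase so the dynamic partition spec in the INSERT matches it and the 
{ds=, DS=1} case mismatch never arises.

{code:python}
# May be needed for a fully dynamic partition insert.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Lowercase partition column in the DDL avoids the duplicated partition key.
spark.sql("CREATE TABLE src (key STRING, value STRING) PARTITIONED BY (ds STRING)")
spark.sql("INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds")
{code}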



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

2020-05-27 Thread Wesley Hildebrandt (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117726#comment-17117726
 ] 

Wesley Hildebrandt commented on SPARK-31836:


One last note: with only four (or fewer) files this doesn't happen, I suspect 
because PySpark uses four executors, so each one takes only a single input file.

> input_file_name() gives wrong value following Python UDF usage
> --
>
> Key: SPARK-31836
> URL: https://issues.apache.org/jira/browse/SPARK-31836
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wesley Hildebrandt
>Priority: Major
>
> I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.
> The following commands demonstrate that the input_file_name() function 
> sometimes returns the wrong filename following usage of a Python UDF:
> $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done
> $ pyspark
> >>> import pyspark.sql.functions as F
> >>> spark.readStream.text('file:///tmp/test-file-*', 
> >>> wholetext=True).withColumn('file1', 
> >>> F.input_file_name()).withColumn('udf', F.udf(lambda 
> >>> x:x)('value')).withColumn('file2', 
> >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda 
> >>> df,_: df.select('file1','file2').show(truncate=False, 
> >>> vertical=True)).start().awaitTermination()
> A few notes about this bug:
>  * It happens with many different files, so it's not related to the file 
> contents
>  * It also happens loading files from HDFS, so storage location is not a 
> factor
>  * It also happens using .csv() to read the files instead of .text(), so 
> input format is not a factor
>  * I have not been able to cause the error without using readStream, so it 
> seems to be related to streaming
>  * The bug also happens using spark-submit to send a job to my cluster
>  * I haven't tested an older version, but is it possible that the pull requests 
> 24958 and 25321 ([https://github.com/apache/spark/pull/24958], 
> [https://github.com/apache/spark/pull/25321]), which fix SPARK-28153 
> (https://issues.apache.org/jira/browse/SPARK-28153), introduced this bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

2020-05-27 Thread Wesley Hildebrandt (Jira)
Wesley Hildebrandt created SPARK-31836:
--

 Summary: input_file_name() gives wrong value following Python UDF 
usage
 Key: SPARK-31836
 URL: https://issues.apache.org/jira/browse/SPARK-31836
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Wesley Hildebrandt


I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.

The following commands demonstrate that the input_file_name() function 
sometimes returns the wrong filename following usage of a Python UDF:

$ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done

$ pyspark

>>> import pyspark.sql.functions as F

>>> spark.readStream.text('file:///tmp/test-file-*', 
>>> wholetext=True).withColumn('file1', F.input_file_name()).withColumn('udf', 
>>> F.udf(lambda x:x)('value')).withColumn('file2', 
>>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda 
>>> df,_: df.select('file1','file2').show(truncate=False, 
>>> vertical=True)).start().awaitTermination()

A few notes about this bug:
 * It happens with many different files, so it's not related to the file 
contents
 * It also happens loading files from HDFS, so storage location is not a factor
 * It also happens using .csv() to read the files instead of .text(), so input 
format is not a factor
 * I have not been able to cause the error without using readStream, so it 
seems to be related to streaming
 * The bug also happens using spark-submit to send a job to my cluster
 * I haven't tested an older version, but is it possible that the pull requests 24958 
and 25321 ([https://github.com/apache/spark/pull/24958], 
[https://github.com/apache/spark/pull/25321]), which fix SPARK-28153 
(https://issues.apache.org/jira/browse/SPARK-28153), introduced this bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117689#comment-17117689
 ] 

Apache Spark commented on SPARK-31834:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28654

> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Zhu, Lipeng
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}
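
For reference, a hedged sketch (not from the ticket) of how the statement above 
can be made to pass the store-assignment check, run through PySpark's SQL 
interface and assuming an active {{spark}} session: the string literal is cast 
explicitly to the column type.

{code:python}
spark.sql("create table SPARK_31834(a int) using parquet")

# An explicit cast satisfies the check that produced the
# "Cannot safely cast 'a': StringType to IntegerType" error above.
spark.sql("insert into SPARK_31834 select cast('1' as int)")
{code}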



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31834:


Assignee: Zhu, Lipeng  (was: Apache Spark)

> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Zhu, Lipeng
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31834:


Assignee: Apache Spark  (was: Zhu, Lipeng)

> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117688#comment-17117688
 ] 

Apache Spark commented on SPARK-31834:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28654

> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Zhu, Lipeng
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-05-27 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117613#comment-17117613
 ] 

Gabor Somogyi commented on SPARK-31815:
---

[~agrim] Why do you think an external Kerberos connector for Hive is needed? 
Could you mention at least one use case?

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-31834:
---

Assignee: Zhu, Lipeng

> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Zhu, Lipeng
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-27 Thread Puviarasu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117569#comment-17117569
 ] 

Puviarasu commented on SPARK-31754:
---

Hello [~kabhwan], 

Please find below our updates for testing with Spark 2.4.5 and Spark 
3.0.0preview2. 
 * *Spark 2.4.5:* The application fails with the same 
java.lang.NullPointerException as we were getting in Spark 2.4.0
 * *Spark 3.0.0preview2:* The application fails with the below exception, somewhat 
related to https://issues.apache.org/jira/browse/SPARK-27780. Full exception 
stack along with the logical plan: [^Excpetion-3.0.0Preview2.txt]

{code:java}
org.apache.spark.shuffle.FetchFailedException: 
java.lang.IllegalArgumentException: Unknown message type: 10
{code}
Regarding the input and checkpoint, they are production data, so sharing 
them as such is very difficult. We are looking for options to anonymize the 
data before sharing, and even then we require approvals from stakeholders. 

Thank you. 

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at 

[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-27 Thread Puviarasu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Puviarasu updated SPARK-31754:
--
Attachment: Excpetion-3.0.0Preview2.txt

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream)
> at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583)
> 

[jira] [Commented] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117557#comment-17117557
 ] 

Apache Spark commented on SPARK-31835:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28653

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117555#comment-17117555
 ] 

Apache Spark commented on SPARK-31835:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28653

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31835:


Assignee: Apache Spark

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31835:


Assignee: (was: Apache Spark)

> Add zoneId to codegen related test in DateExpressionsSuite
> --
>
> Key: SPARK-31835
> URL: https://issues.apache.org/jira/browse/SPARK-31835
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> The formatter will fail earlier, before the codegen check happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite

2020-05-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-31835:


 Summary: Add zoneId to codegen related test in DateExpressionsSuite
 Key: SPARK-31835
 URL: https://issues.apache.org/jira/browse/SPARK-31835
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


The formatter will fail earlier, before the codegen check happens.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31763:


Assignee: (was: Apache Spark)

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*
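
A hedged sketch of an interim workaround (not from the report), assuming an 
active {{spark}} session: until {{inputFiles()}} is exposed on the Python 
DataFrame, the underlying Java Dataset can be reached through the private 
{{_jdf}} handle. The bucket and prefix below are placeholders.

{code:python}
# _jdf is an internal attribute of pyspark.sql.DataFrame; py4j converts the
# returned String[] into a Python list of file paths.
df = (spark.read
      .format("parquet")
      .load("s3a://my-bucket/source-prefix/"))

files = list(df._jdf.inputFiles())
print(files)
{code}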



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117526#comment-17117526
 ] 

Apache Spark commented on SPARK-31763:
--

User 'iRakson' has created a pull request for this issue:
https://github.com/apache/spark/pull/28652

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31763:


Assignee: Apache Spark

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Assignee: Apache Spark
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31795) Stream Data with API to ServiceNow

2020-05-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117480#comment-17117480
 ] 

Hyukjin Kwon commented on SPARK-31795:
--

Please take a look for https://spark.apache.org/community.html

> Stream Data with API to ServiceNow
> --
>
> Key: SPARK-31795
> URL: https://issues.apache.org/jira/browse/SPARK-31795
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Dominic Wetenkamp
>Priority: Major
>
> 1) Create class
> 2) Instantiate class 
> 3) Setup stream
> 4) Write stream. Here I get a pickling error; I really don't know how to 
> get it to work without the error.
>  
>  
> class CMDB:
>  #Public Properties
>  @property
>  def streamDF(self):
>  return spark.readStream.table(self.__source_table)
>  
>  #Constructor
>  def __init__(self, destination_table, source_table):
>  self.__destination_table = destination_table
>  self.__source_table = source_table
> #Private Methodes 
>  def __processRow(self, row):
>  #API connection info
>  url = 'https://foo.service-now.com/api/now/table/' + 
> self.__destination_table + '?sysparm_display_value=true'
>  user = 'username'
>  password = 'psw'
>  
>  headers = \{"Content-Type":"application/json","Accept":"application/json"}
>  response = requests.post(url, auth=(user, password), headers=headers, data = 
> json.dumps(row.asDict()))
>  
>  return response
> #Public Methodes
>  def uploadStreamDF(self, df):
>  return df.writeStream.foreach(self.__processRow).trigger(once=True).start()
>  
> 
>  
> cmdb = CMDB('destination_table_name','source_table_name')
> streamDF = (cmdb.streamDF
>  .withColumn('object_id',col('source_column_id'))
>  .withColumn('name',col('source_column_name'))
>  ).select('object_id','name')
> #set stream works, able to display data
> cmdb.uploadStreamDF(streamDF)
> #cmdb.uploadStreamDF(streamDF) fails with error: PicklingError: Could not 
> serialize object: Exception: It appears that you are attempting to reference 
> SparkContext from a broadcast variable, action, or transformation. 
> SparkContext can only be used on the driver, not in code that it run on 
> workers. For more information, see SPARK-5063. See exception below:
> '''
> Exception Traceback (most recent call last)
> /databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
>  704 try:
> --> 705 return cloudpickle.dumps(obj, 2)
>  706 except pickle.PickleError:
> /databricks/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol)
>  862 cp = CloudPickler(file,protocol)
> --> 863 cp.dump(obj)
>  864 return file.getvalue()
> /databricks/spark/python/pyspark/cloudpickle.py in dump(self, obj)
>  259 try:
> --> 260 return Pickler.dump(self, obj)
>  261 except RuntimeError as e:
> /databricks/python/lib/python3.7/pickle.py in dump(self, obj)
>  436 self.framer.start_framing()
> --> 437 self.save(obj)
>  438 self.write(STOP)
> '''



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31834:

Description: 

{code:java}
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table 
'`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
spark-sql>
{code}


> Improve error message for incompatible data types
> -
>
> Key: SPARK-31834
> URL: https://issues.apache.org/jira/browse/SPARK-31834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:java}
> spark-sql> create table SPARK_31834(a int) using parquet;
> spark-sql> insert into SPARK_31834 select '1';
> Error in query: Cannot write incompatible data to table 
> '`default`.`spark_31834`':
> - Cannot safely cast 'a': StringType to IntegerType;
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31834) Improve error message for incompatible data types

2020-05-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31834:
---

 Summary: Improve error message for incompatible data types
 Key: SPARK-31834
 URL: https://issues.apache.org/jira/browse/SPARK-31834
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31795) Stream Data with API to ServiceNow

2020-05-27 Thread Dominic Wetenkamp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Wetenkamp reopened SPARK-31795:
---

?

> Stream Data with API to ServiceNow
> --
>
> Key: SPARK-31795
> URL: https://issues.apache.org/jira/browse/SPARK-31795
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Dominic Wetenkamp
>Priority: Major
>
> 1) Create class
> 2) Instantiate class 
> 3) Setup stream
> 4) Write stream. Here I get a pickling error; I really don't know how to 
> get it to work without the error.
>  
>  
> class CMDB:
>  #Public Properties
>  @property
>  def streamDF(self):
>  return spark.readStream.table(self.__source_table)
>  
>  #Constructor
>  def __init__(self, destination_table, source_table):
>  self.__destination_table = destination_table
>  self.__source_table = source_table
> #Private Methodes 
>  def __processRow(self, row):
>  #API connection info
>  url = 'https://foo.service-now.com/api/now/table/' + 
> self.__destination_table + '?sysparm_display_value=true'
>  user = 'username'
>  password = 'psw'
>  
>  headers = \{"Content-Type":"application/json","Accept":"application/json"}
>  response = requests.post(url, auth=(user, password), headers=headers, data = 
> json.dumps(row.asDict()))
>  
>  return response
> #Public Methodes
>  def uploadStreamDF(self, df):
>  return df.writeStream.foreach(self.__processRow).trigger(once=True).start()
>  
> 
>  
> cmdb = CMDB('destination_table_name','source_table_name')
> streamDF = (cmdb.streamDF
>  .withColumn('object_id',col('source_column_id'))
>  .withColumn('name',col('source_column_name'))
>  ).select('object_id','name')
> #set stream works, able to display data
> cmdb.uploadStreamDF(streamDF)
> #cmdb.uploadStreamDF(streamDF) fails with error: PicklingError: Could not 
> serialize object: Exception: It appears that you are attempting to reference 
> SparkContext from a broadcast variable, action, or transformation. 
> SparkContext can only be used on the driver, not in code that it run on 
> workers. For more information, see SPARK-5063. See exception below:
> '''
> Exception Traceback (most recent call last)
> /databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
>  704 try:
> --> 705 return cloudpickle.dumps(obj, 2)
>  706 except pickle.PickleError:
> /databricks/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol)
>  862 cp = CloudPickler(file,protocol)
> --> 863 cp.dump(obj)
>  864 return file.getvalue()
> /databricks/spark/python/pyspark/cloudpickle.py in dump(self, obj)
>  259 try:
> --> 260 return Pickler.dump(self, obj)
>  261 except RuntimeError as e:
> /databricks/python/lib/python3.7/pickle.py in dump(self, obj)
>  436 self.framer.start_framing()
> --> 437 self.save(obj)
>  438 self.write(STOP)
> '''



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31795) Stream Data with API to ServiceNow

2020-05-27 Thread Dominic Wetenkamp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117465#comment-17117465
 ] 

Dominic Wetenkamp commented on SPARK-31795:
---

Hello Hyukjin Kwon,

Where can I get an answer or some support? I don't understand what you mean by 
"mailing list".

> Stream Data with API to ServiceNow
> --
>
> Key: SPARK-31795
> URL: https://issues.apache.org/jira/browse/SPARK-31795
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Dominic Wetenkamp
>Priority: Major
>
> 1) Create class
> 2) Instantiate class 
> 3) Setup stream
> 4) Write stream. Here I get a pickling error; I really don't know how to 
> get it to work without the error.
>  
>  
> class CMDB:
>  #Public Properties
>  @property
>  def streamDF(self):
>  return spark.readStream.table(self.__source_table)
>  
>  #Constructor
>  def __init__(self, destination_table, source_table):
>  self.__destination_table = destination_table
>  self.__source_table = source_table
> #Private Methodes 
>  def __processRow(self, row):
>  #API connection info
>  url = 'https://foo.service-now.com/api/now/table/' + 
> self.__destination_table + '?sysparm_display_value=true'
>  user = 'username'
>  password = 'psw'
>  
>  headers = \{"Content-Type":"application/json","Accept":"application/json"}
>  response = requests.post(url, auth=(user, password), headers=headers, data = 
> json.dumps(row.asDict()))
>  
>  return response
> #Public Methodes
>  def uploadStreamDF(self, df):
>  return df.writeStream.foreach(self.__processRow).trigger(once=True).start()
>  
> 
>  
> cmdb = CMDB('destination_table_name','source_table_name')
> streamDF = (cmdb.streamDF
>  .withColumn('object_id',col('source_column_id'))
>  .withColumn('name',col('source_column_name'))
>  ).select('object_id','name')
> #set stream works, able to display data
> cmdb.uploadStreamDF(streamDF)
> #cmdb.uploadStreamDF(streamDF) fails with error: PicklingError: Could not 
> serialize object: Exception: It appears that you are attempting to reference 
> SparkContext from a broadcast variable, action, or transformation. 
> SparkContext can only be used on the driver, not in code that it run on 
> workers. For more information, see SPARK-5063. See exception below:
> '''
> Exception Traceback (most recent call last)
> /databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
>  704 try:
> --> 705 return cloudpickle.dumps(obj, 2)
>  706 except pickle.PickleError:
> /databricks/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol)
>  862 cp = CloudPickler(file,protocol)
> --> 863 cp.dump(obj)
>  864 return file.getvalue()
> /databricks/spark/python/pyspark/cloudpickle.py in dump(self, obj)
>  259 try:
> --> 260 return Pickler.dump(self, obj)
>  261 except RuntimeError as e:
> /databricks/python/lib/python3.7/pickle.py in dump(self, obj)
>  436 self.framer.start_framing()
> --> 437 self.save(obj)
>  438 self.write(STOP)
> '''



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


