[spark] branch master updated (04f66bf -> e9337f5)

2020-06-06 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 04f66bf  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
 add e9337f5  [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

No new revisions were added by this update.

Summary of changes:
 .../resources/org/apache/spark/ui/static/webui.js  |   3 +-
 .../spark/streaming/ui/AllBatchesTable.scala   | 282 +++--
 .../apache/spark/streaming/ui/StreamingPage.scala  | 113 ++---
 .../apache/spark/streaming/UISeleniumSuite.scala   |  45 +++-
 4 files changed, 267 insertions(+), 176 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

2020-06-06 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e9337f5  [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page
e9337f5 is described below

commit e9337f505b737f4501d4173baa9c5739626b06a8
Author: iRakson 
AuthorDate: Sun Jun 7 13:08:50 2020 +0900

[SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

### What changes were proposed in this pull request?
* Pagination support is added to all tables of the streaming page in the Spark web UI, reusing the existing paged-table classes from #7399 (see the sketch after this list).
* Earlier, the streaming page had two tables, `Active Batches` and `Completed Batches`. Now there are three tables: `Running Batches`, `Waiting Batches` and `Completed Batches`. With a large number of waiting and running batches, keeping track of them in a single table is difficult, and other pages already use a separate table for each type of data.
* Earlier, empty tables were shown as well. Now only non-empty tables are shown. The `Active Batches` table used to show the details of waiting batches followed by running batches.
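For readers unfamiliar with the paged-table approach, the following is a minimal, self-contained sketch of the underlying idea (slice the full list of batch rows into one page). The class and field names are illustrative only and are not the actual `PagedTable`/`PagedDataSource` API used by this commit:

```scala
// Illustrative sketch only: the slicing idea behind a paged table,
// not the real org.apache.spark.ui.PagedTable / PagedDataSource traits.
case class BatchRow(batchTime: Long, numRecords: Long)

class SimplePagedSource(data: Seq[BatchRow], pageSize: Int) {
  def dataSize: Int = data.size

  // Return only the rows belonging to the requested 1-based page.
  def pageData(page: Int): Seq[BatchRow] = {
    val from = (page - 1) * pageSize
    data.slice(from, math.min(from + pageSize, dataSize))
  }
}

object SimplePagedSourceDemo {
  def main(args: Array[String]): Unit = {
    val rows = (1 to 25).map(i => BatchRow(batchTime = i * 1000L, numRecords = i * 10L))
    val source = new SimplePagedSource(rows, pageSize = 10)
    println(source.pageData(3)) // the last page holds the remaining 5 rows
  }
}
```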

### Why are the changes needed?
Pagination allows users to analyse the tables in a much better way. All Spark web UI pages support pagination apart from the streaming page, so this adds consistency as well. It may also fix the potential OOM errors that can arise.

### Does this PR introduce _any_ user-facing change?
Yes. The `Active Batches` table is split into two tables, `Running Batches` and `Waiting Batches`. Pagination support is added to all the tables. Every other functionality is unchanged.

### How was this patch tested?
Manually.

Before changes:
https://user-images.githubusercontent.com/15366835/80915680-8fb44b80-8d71-11ea-9957-c4a3769b8b67.png

After changes:
https://user-images.githubusercontent.com/15366835/80915694-a9ee2980-8d71-11ea-8fc5-246413a4951d.png

Closes #28439 from iRakson/streamingPagination.

Authored-by: iRakson 
Signed-off-by: Kousuke Saruta 
---
 .../resources/org/apache/spark/ui/static/webui.js  |   3 +-
 .../spark/streaming/ui/AllBatchesTable.scala   | 282 +++--
 .../apache/spark/streaming/ui/StreamingPage.scala  | 113 ++---
 .../apache/spark/streaming/UISeleniumSuite.scala   |  45 +++-
 4 files changed, 267 insertions(+), 176 deletions(-)

diff --git a/core/src/main/resources/org/apache/spark/ui/static/webui.js b/core/src/main/resources/org/apache/spark/ui/static/webui.js
index 4f8409c..bb37256 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/webui.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/webui.js
@@ -87,7 +87,8 @@ $(function() {
   collapseTablePageLoad('collapse-aggregated-poolActiveStages','aggregated-poolActiveStages');
   collapseTablePageLoad('collapse-aggregated-tasks','aggregated-tasks');
   collapseTablePageLoad('collapse-aggregated-rdds','aggregated-rdds');
-  collapseTablePageLoad('collapse-aggregated-activeBatches','aggregated-activeBatches');
+  collapseTablePageLoad('collapse-aggregated-waitingBatches','aggregated-waitingBatches');
+  collapseTablePageLoad('collapse-aggregated-runningBatches','aggregated-runningBatches');
   collapseTablePageLoad('collapse-aggregated-completedBatches','aggregated-completedBatches');
   collapseTablePageLoad('collapse-aggregated-runningExecutions','aggregated-runningExecutions');
   collapseTablePageLoad('collapse-aggregated-completedExecutions','aggregated-completedExecutions');
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala b/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
index 1e443f6..c0eec0e 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
@@ -17,30 +17,41 @@
 
 package org.apache.spark.streaming.ui
 
-import scala.xml.Node
-
-import org.apache.spark.ui.{UIUtils => SparkUIUtils}
+import java.net.URLEncoder
+import java.nio.charset.StandardCharsets.UTF_8
+import javax.servlet.http.HttpServletRequest
 
-private[ui] abstract class BatchTableBase(tableId: String, batchInterval: Long) {
+import scala.xml.Node
 
-  protected def columns: Seq[Node] = {
-    <th>Batch Time</th>
-      <th>Records</th>
-      <th>Scheduling Delay
-        {SparkUIUtils.tooltip("Time taken by Streaming scheduler to submit jobs of a batch", "top")}
-      </th>
-      <th>Processing Time
-        {SparkUIUtils.tooltip("Time taken to process all jobs of a batch", "top")}</th>
-  }
+import org.apache.spark.ui.{PagedDataSource, PagedTable, UIUtils => SparkUIUtils}
+
+private[ui] class StreamingPagedTable(
+    request: HttpServletRequest,
+    tableTag: String,
+    batches: 

svn commit: r39960 - in /dev/spark/v3.0.0-rc3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/parqu

2020-06-06 Thread rxin
Author: rxin
Date: Sat Jun  6 14:03:25 2020
New Revision: 39960

Log:
Apache Spark v3.0.0-rc3 docs


[This commit notification would consist of 1920 parts, which exceeds the limit of 50, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r39959 - /dev/spark/v3.0.0-rc3-bin/

2020-06-06 Thread rxin
Author: rxin
Date: Sat Jun  6 13:35:40 2020
New Revision: 39959

Log:
Apache Spark v3.0.0-rc3

Added:
dev/spark/v3.0.0-rc3-bin/
dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz   (with props)
dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc
dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512
dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz   (with props)
dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc
dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.sha512
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz   (with props)
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.asc
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.sha512
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz   (with props)
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz.asc
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz.sha512
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz   (with props)
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz.asc
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz.sha512
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz   (with props)
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz.asc
dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz.sha512
dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz   (with props)
dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz.asc
dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz.sha512

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc
==
--- dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc (added)
+++ dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc Sat Jun  6 13:35:40 2020
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl7bh3gQHHJ4aW5AYXBh
+Y2hlLm9yZwAKCRDeqWPi6TR9ZjGIEACG3gsdARN8puRHS2YL+brOmjbrS4wVY/Av
+l+ZR59moZ7QuwjYoixyqNnztIKgIyleYJq9DL5TqqMxFgGpuoDrnuWVqI+8MngVA
+gau/QDmYINabZsJxFfDn1IjxxSQBsgf6pwfqQbB+fGSjLSPnDq+u3DIWr3fRMh4X
+DrTuATNewKiiBIwQHUKAtPMAbsdDvXv0DRL7CGTiIJri43opAntQzHec3sP9hgRU
+J5J2HnjOlamgv58S7zrUw/Wo1xPLmz2PGIsP0aq9DRRw0bLnesrtEaWAKFp2HL5E
+QlbjfboaDQz/X+meruW57/sO/DDwA90/XvF44z4Gu6kbS8nRuTsU5wVfZ/1iyWZk
+PLP2nFoWl7O85k/DLB5ADYgce3e6k2qD2obKxzsEx0nr0Wu13cxCR2+IBQmv05jb
+4Kwi7iE0iKIxt3cESDH6j9GqZoTrcxt6Jb88KSQ+YM2TBNUr1ZZNmkjgYdmLvm7a
+wH6vLtdpZzUKIGd6bt1grEwoQJBMnQjkoDYxhx+ugjbs8CwwxcdUNd2Q5xz0WaSn
+p443ZlMR5lbGf6D6U4PUigaIrdD8d+ef/rRTDtXdoDqC+FdNuepyS9+2+dUZGErx
+N2IMNunKIdKw57GZGcILey1hY45SSuQFw5JAe+nWqCAzCmFX72ulkv9The7rLdlE
+YdLu6XQIBA==
+=HhHH
+-END PGP SIGNATURE-

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512
==
--- dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512 (added)
+++ dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512 Sat Jun  6 13:35:40 2020
@@ -0,0 +1,3 @@
+SparkR_3.0.0.tar.gz: 394DCFEB 4E202A8E 5C58BF94 A77548FD 79A00F92 34538535
+ B0242E1B 96068E3E 80F78188 D71831F8 4A350224 41AA14B1
+ D72AE704 F2390842 DBEAB41F 5AC9859A

Added: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc
==
--- dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc (added)
+++ dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc Sat Jun  6 13:35:40 2020
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl7bh3oQHHJ4aW5AYXBh
+Y2hlLm9yZwAKCRDeqWPi6TR9ZvhPD/9Vyrywk4kYFaUo36Kg6nickTvWMm+yDfGZ
+tKrUo3dAoMta7OwUjT6W0Roo8a4BBgumaDv47Dm6mlquF2DuLuBrFsqFo8c5VNA/
+jT1tdSdHiTzjq7LfY9GQDn8Wkgp1gyIKON70XFdZifduW0gcFDkJ+FjhPYWcA6jy
+GGOGK5qboCdi9C+KowUVj4VB9bbxPbWvW7FVF3+VlcrKvkmNx+EmqmIrqsh72w8O
+EL70za2uBRUUiFcaOpY/wpmEN1raCAkMzQ+dPl7p1PFgmLFrMN9RaRXJ1stF+fXO
+rDLBLNPqb85TvvOOHpcr4PSP38GrdZvDAvljCOEbBzacF719bewu/IVRcNi9lPZE
+HDPUcZLgnocNIF6kafykrm3JhagzmPIhQ8d4DFTuH6ePxgWqdUa9lWKQL54z3mjU
+LT2CJ8gMDY0Wz5zSKc/sI/ZwL+Q6U8xiIGYSzQgT9yPztbhDd5AM2DgohJkZSD4b
+jOrEsSyNRJiwwRAHlbeOOVPb4UNYzsx1USPbPEBeXTt8X8VUb8jsU84o/RhXexk9
+EMJjxz/aChB+NefbmUjBZmXSaa/zYubprJrWnUgPw7hFxAnmtgIUdjSWSNIOJ6bp
+EV1M6xwuvrmGhOa3D0C+lYyAuYZca2FQrcAtzNiL6iOMQ6USFZvzjxGWQiV2CDGQ

svn commit: r39958 - /dev/spark/v3.0.0-rc3-bin/

2020-06-06 Thread rxin
Author: rxin
Date: Sat Jun  6 11:18:32 2020
New Revision: 39958

Log:
remove 3.0 rc3 binary

Removed:
dev/spark/v3.0.0-rc3-bin/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-2.4 updated: [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI

2020-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 476010a  [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI
476010a is described below

commit 476010aedd101e1a807c202d71328415109660ae
Author: Takuya UESHIN 
AuthorDate: Sat Jun 6 16:50:40 2020 +0900

[SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI

### What changes were proposed in this pull request?

This is a backport of #28730.

In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts that thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics.

We should call `serveToStream` first, then `withAction` inside it (see the sketch after this paragraph).
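A minimal, self-contained sketch of why the nesting order matters; `withAction` and `serveToStream` below are simplified stand-ins, not Spark's real methods:

```scala
object NestingSketch {
  // Stand-in for withAction: "collects metrics" only for work done inside the body.
  def withAction[T](name: String)(body: => T): T = {
    val start = System.nanoTime()
    try body
    finally println(f"$name ran for ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  // Stand-in for serveToStream: hands the body to another thread and returns immediately.
  def serveToStream(body: => Unit): Unit = new Thread(() => body).start()

  def main(args: Array[String]): Unit = {
    // Old order: withAction returns as soon as the serving thread is started,
    // so the reported time excludes the real work (the 500 ms sleep).
    withAction("outer") { serveToStream { Thread.sleep(500) } }

    // New order: the measurement wraps the work itself, inside the serving thread.
    serveToStream { withAction("inner") { Thread.sleep(500) } }
  }
}
```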

### Why are the changes needed?

When calling toPandas, usually Query UI shows each plan node's metric:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-05 at 10 58 30 AM](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a725-11ea-88c0-de83a833b05c.png)

but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-05 at 10 58 42 AM](https://user-images.githubusercontent.com/506656/83914127-782c0200-a725-11ea-84e4-74d861d5c20a.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

![Screen Shot 2020-06-05 at 11 29 48 AM](https://user-images.githubusercontent.com/506656/83914142-7e21e300-a725-11ea-8925-edc22df16388.png)

Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN 
Signed-off-by: HyukjinKwon 
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 6bc0d0b..a755a6f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3283,8 +3283,8 @@ class Dataset[T] private[sql](
   private[sql] def collectAsArrowToPython(): Array[Any] = {
     val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone
 
-    withAction("collectAsArrowToPython", queryExecution) { plan =>
-      PythonRDD.serveToStreamWithSync("serve-Arrow") { out =>
+    PythonRDD.serveToStreamWithSync("serve-Arrow") { out =>
+      withAction("collectAsArrowToPython", queryExecution) { plan =>
         val batchWriter = new ArrowBatchStreamWriter(schema, out, timeZoneId)
         val arrowBatchRdd = toArrowBatchRdd(plan)
         val numPartitions = arrowBatchRdd.partitions.length


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [MINOR][SS][DOCS] fileNameOnly parameter description re-unite

2020-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new c5683fc  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
c5683fc is described below

commit c5683fce6f43dd561677830d74af196bc6c97134
Author: Gabor Somogyi 
AuthorDate: Sat Jun 6 16:49:48 2020 +0900

[MINOR][SS][DOCS] fileNameOnly parameter description re-unite

### What changes were proposed in this pull request?
The `fileNameOnly` parameter description was split into two pieces in [this](https://github.com/apache/spark/commit/dbb8143501ab87865d6e202c17297b9a73a0b1c3) commit. This PR re-unites it.

### Why are the changes needed?
The parameter description is split across two places in the doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #28739 from gaborgsomogyi/datasettxtfix.

Authored-by: Gabor Somogyi 
Signed-off-by: HyukjinKwon 
(cherry picked from commit 04f66bfd4eb863253ac9c30594055b8d5997c321)
Signed-off-by: HyukjinKwon 
---
 docs/structured-streaming-programming-guide.md | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 1776d23..69d744d 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -540,12 +540,13 @@ Here are the details of all the sources in Spark.
 
 fileNameOnly: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
 
-maxFileAge: Maximum age of a file that can be found in this directory, before it is ignored. For the first batch all files will be considered valid. If latestFirst is set to `true` and maxFilesPerTrigger is set, then this parameter will be ignored, because old files that are valid, and should be processed, may be ignored. The max age is specified with respect to the timestamp of the latest file, and not the timestamp of the current system.(d [...]
-
 "file:///dataset.txt"
 "s3://a/dataset.txt"
 "s3n://a/b/dataset.txt"
-"s3a://a/b/c/dataset.txt"
+"s3a://a/b/c/dataset.txt"
+
+maxFileAge: Maximum age of a file that can be found in this directory, before it is ignored. For the first batch all files will be considered valid. If latestFirst is set to `true` and maxFilesPerTrigger is set, then this parameter will be ignored, because old files that are valid, and should be processed, may be ignored. The max age is specified with respect to the timestamp of the latest file, and not the timestamp of the current system.(d [...]
+
 cleanSource: option to clean up completed files after processing.
 Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".
 When "archive" is provided, additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must not match with source pattern in depth (the number of directories from the root directory), where the depth is minimum of depth on both paths. This will ensure archived files are never included as new source files.
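For context, the options documented in this diff are ordinary source options on a streaming file read. A minimal Scala sketch under assumed inputs (the input directory below is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object FileSourceOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-source-options-sketch")
      .master("local[*]")
      .getOrCreate()

    // The text source has a fixed single-column schema, so no schema is needed here.
    val lines = spark.readStream
      .option("fileNameOnly", "true")   // compare files by filename only, not full path
      .option("maxFileAge", "7d")       // ignore files older than 7 days
      .option("cleanSource", "delete")  // remove input files once they are processed
      .text("/tmp/streaming-input")     // illustrative input directory

    lines.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```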


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (5079831 -> 04f66bf)

2020-06-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 5079831  [SPARK-31904][SQL] Fix case sensitive problem of char and varchar partition columns
 add 04f66bf  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite

No new revisions were added by this update.

Summary of changes:
 docs/structured-streaming-programming-guide.md | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org


