[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19086
  
Sure, the current behavior is Hive's behavior.





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19086
  
Please hold on. That means it is a behavior change. Let me consider it more.





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19086
  
yes, correct





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19086
  
> use db2;
> alter table db1.t2 rename to t1;

After this PR, it is renamed to `db2.t1`, right?

Before this PR, it is renamed to `db1.t1`, right?
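
A minimal sketch of the scenario above, with the table and database names taken from the example and the before/after outcomes as stated in this thread:

```scala
// Assumes a SparkSession `spark`, plus an existing database db2 and table db1.t2.
spark.sql("USE db2")
spark.sql("ALTER TABLE db1.t2 RENAME TO t1")
// Before this PR: the unqualified `t1` inherits the database of `oldName`, so the result is `db1.t1`.
// After this PR:  the unqualified `t1` resolves against the current database, so the result is `db2.t1`.
```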





[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-03 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19115
  
Please create a new PR against the master branch and close this one. If the 
issue doesn't exist in the master branch, then consider backporting that fix to the 
2.2 branch.





[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...

2017-09-03 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19086#discussion_r136747432
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
* Rename a table.
*
-   * If a database is specified in `oldName`, this will rename the table 
in that database.
+   * If the database specified in `newName` is different from the one 
specified in `oldName`,
+   * It will result in moving table across databases.
+   *
* If no database is specified, this will first attempt to rename a 
temporary table with
-   * the same name, then, if that does not exist, rename the table in the 
current database.
+   * the same name, then, if that does not exist, current database will be 
used for locating
+   * `oldName` or `newName`.
--- End diff --

So should I leave the comment here unchanged?





[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19086#discussion_r136747159
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -418,6 +439,42 @@ abstract class SessionCatalogSuite extends 
AnalysisTest {
 }
   }
 
+  test("rename global temp view") {
+withBasicCatalog { catalog =>
+  val globalTempTable = Range(1, 10, 2, 10)
+  catalog.createGlobalTempView("tbl1", globalTempTable, 
overrideIfExists = false)
+  assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable))
+  assert(catalog.externalCatalog.listTables("db2").toSet == 
Set("tbl1", "tbl2"))
+  // If database is not specified, global temp view will not be renamed
+  catalog.setCurrentDatabase("db1")
+  val thrown1 = intercept[AnalysisException] {
+catalog.renameTable(TableIdentifier("tbl1"), 
TableIdentifier("tblone"))
+  }
+  assert(thrown1.getMessage.contains("Table or view 'tbl1' not found 
in database 'db1'"))
+  catalog.setCurrentDatabase("db2")
+  catalog.renameTable(TableIdentifier("tbl1"), 
TableIdentifier("tblone"))
+  assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable))
+  assert(catalog.externalCatalog.listTables("db2").toSet == 
Set("tblone", "tbl2"))
+  // Moving global temp view to another database is forbidden
+  val thrown2 = intercept[AnalysisException] {
+catalog.renameTable(
+  TableIdentifier("tbl1", Some("global_temp")), 
TableIdentifier("tbl3", Some("db2")))
+  }
+  assert(thrown2.getMessage.contains("Cannot change database of table 
'tbl1'"))
+  // Moving table from regular database to be a global temp view is 
forbidden
+  val thrown3 = intercept[AnalysisException] {
+catalog.renameTable(
+  TableIdentifier("tbl2", Some("db2")), TableIdentifier("tbltwo", 
Some("global_temp")))
+  }
--- End diff --

 Normally, we put `.getMessage` here. Thus, we can make the next line 
shorter. 
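
A minimal sketch of the suggested style, reusing the identifiers and the expected message from the test above (not necessarily the exact final code):

```scala
val message = intercept[AnalysisException] {
  catalog.renameTable(
    TableIdentifier("tbl1", Some("global_temp")), TableIdentifier("tbl3", Some("db2")))
}.getMessage
assert(message.contains("Cannot change database of table 'tbl1'"))
```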





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17014
  
**[Test build #81372 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)**
 for PR 17014 at commit 
[`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17014
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81372/
Test FAILed.





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17014
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136746833
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala ---
@@ -89,6 +89,8 @@ case class ExpandExec(
 child.asInstanceOf[CodegenSupport].inputRDDs()
   }
 
+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

Yea, probably we need to describe more about the exceptional cases where we 
can't use this optimization, like `HashAggregateExec`, in 
https://github.com/apache/spark/pull/18931/files#diff-28cb12941b992ff680c277c651b59aa0R204





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17014
  
**[Test build #81372 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)**
 for PR 17014 at commit 
[`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001).





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136746468
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala ---
@@ -177,6 +177,8 @@ case class SortExec(
  """.stripMargin.trim
   }
 
+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

yea, other guys might give good suggestions on the naming...





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19086
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19086
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81370/
Test FAILed.





[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19086
  
**[Test build #81370 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)**
 for PR 19086 at commit 
[`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19086#discussion_r136745765
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
* Rename a table.
*
-   * If a database is specified in `oldName`, this will rename the table 
in that database.
+   * If the database specified in `newName` is different from the one 
specified in `oldName`,
+   * It will result in moving table across databases.
+   *
* If no database is specified, this will first attempt to rename a 
temporary table with
-   * the same name, then, if that does not exist, rename the table in the 
current database.
+   * the same name, then, if that does not exist, current database will be 
used for locating
+   * `oldName` or `newName`.
--- End diff --

This is wrong, right? This is for `temp view`





[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19116
  
cc @srowen and @vanzin, could you take a look please when you have some 
time?





[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19116#discussion_r136745412
  
--- Diff: scalastyle-config.xml ---
@@ -268,10 +268,7 @@ This file is divided into 3 sections:
   
 
   
-  
-^Override$
-override modifier should be used instead of 
@java.lang.Override.
-  
+  
--- End diff --

This removes the workaround we used and uses 
`org.scalastyle.scalariform.OverrideJavaChecker` instead - 
https://github.com/scalastyle/scalastyle/issues/214.

I manually tested with `@Override`:

Before:

```
[error] ...:435:3: override modifier should be used instead of 
@java.lang.Override.
```

After:


```
[error] ...:435:2: override keyword should be used instead of 
@java.lang.Override
```






[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19116#discussion_r136745459
  
--- Diff: project/plugins.sbt ---
@@ -7,8 +7,7 @@ addSbtPlugin("com.typesafe.sbteclipse" % 
"sbteclipse-plugin" % "5.1.0")
 // sbt 1.0.0 support: 
https://github.com/jrudolph/sbt-dependency-graph/issues/134
 addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")
 
-// need to make changes to uptake sbt 1.0 support in "org.scalastyle" %% 
"scalastyle-sbt-plugin" % "0.9.0"
--- End diff --

Looks like we are okay to remove this - 
https://github.com/scalastyle/scalastyle-sbt-plugin/blob/v1.0.0/README.md





[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19116#discussion_r136745559
  
--- Diff: project/SparkBuild.scala ---
@@ -163,14 +163,15 @@ object SparkBuild extends PomBuild {
 val configUrlV = scalastyleConfigUrl.in(config).value
 val streamsV = streams.in(config).value
 val failOnErrorV = true
+val failOnWarningV = true
 val scalastyleTargetV = scalastyleTarget.in(config).value
 val configRefreshHoursV = 
scalastyleConfigRefreshHours.in(config).value
 val targetV = target.in(config).value
 val configCacheFileV = 
scalastyleConfigUrlCacheFile.in(config).value
 
 logger.info(s"Running scalastyle on ${name.value} in 
${config.name}")
-Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, 
scalaSourceV, scalastyleTargetV,
-  streamsV, configRefreshHoursV, targetV, configCacheFileV)
+Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, 
failOnWarningV, scalaSourceV,
--- End diff --

This change looks required - 
https://github.com/scalastyle/scalastyle-sbt-plugin/commit/cbbb9c9d1c33b9e90bf187edb0dd2d6b4eceaa21





[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19086#discussion_r136745597
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
* Rename a table.
*
-   * If a database is specified in `oldName`, this will rename the table 
in that database.
+   * If the database specified in `newName` is different from the one 
specified in `oldName`,
+   * It will result in moving table across databases.
--- End diff --

`It will result` -> `it results`





[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19116
  
**[Test build #81371 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81371/testReport)**
 for PR 19116 at commit 
[`2447fd0`](https://github.com/apache/spark/commit/2447fd0e152ace4dd074a92bd0d3cdc638b09b1a).





[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

2017-09-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19116

[SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

## What changes were proposed in this pull request?

1.0.0 fixes issues with import order, explicit types for public methods, 
the line length limitation, and comment validation:

```
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16:
 Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49:
 File line length exceeds 100 characters
[error] 
.../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21:
 Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15:
 Public method must have explicit type
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2:
 Insert a space after the start of the comment
[error] 
.../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43:
 JavaDStream should come before JavaDStreamLike.
```
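
For reference, the wrapping that the println rule asks for looks like this; a minimal sketch in which the object name and message are purely illustrative (the real call sites are the files listed above):

```scala
object PrintlnExample {
  def main(args: Array[String]): Unit = {
    // scalastyle:off println
    println("intentional console output")  // excluded from the println check by the off/on markers
    // scalastyle:on println
  }
}
```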

This PR also replaces the workaround added in SPARK-16877 with the 
`org.scalastyle.scalariform.OverrideJavaChecker` feature, available since 0.9.0.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark scalastyle-1.0.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19116


commit 2447fd0e152ace4dd074a92bd0d3cdc638b09b1a
Author: hyukjinkwon 
Date:   2017-09-04T04:26:19Z

Upgrade scalastyle to 1.0.0.







[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18902
  
Hmm... that's interesting. So I found a performance gap between the DataFrame 
codegen aggregation and the simple RDD aggregation. I will discuss this with the SQL 
team later. Thanks!





[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/18902
  
@WeichenXu123 No, I only cache the DataFrame. And the RDD version is 
[here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465).
I use the same test suite above to test those implementations.





[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19113
  
How about the other popular open source projects? Do you know which 
projects are using Univocity 2.5?





[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/17014
  
@zhengruifeng `KMeans` is regarded as a bugfix (SPARK-21799) because the 
double-caching issue was introduced in 2.2 and causes a perf regression.
Other algos also have the same issue, but the issue has existed in those algos 
for a long time and they are related to `Predictor`, so it is not so easy to fix; we 
can leave them here and do more discussion.
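
As a hedged illustration of the double-caching pattern being discussed (this is not the actual KMeans/Predictor code, just the general shape of the problem):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a SparkSession `spark`.
val df = spark.range(0, 1000).toDF("value").cache()  // the caller already caches the input DataFrame
val instances = df.rdd.map(_.getLong(0).toDouble)    // the algorithm converts it to an RDD internally...
instances.persist(StorageLevel.MEMORY_AND_DISK)      // ...and caches it again, so the same data is held twice
```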





[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19113
  
Any performance measurements from 2.2 to 2.5?





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18865
  
Yea, it is, and that's what we discussed - 
https://github.com/apache/spark/pull/18865#discussion_r131887972. In that way, 
I was thinking the case in 
https://github.com/apache/spark/pull/18865#issuecomment-326858521 could 
make sense, as no columns are selected.





[GitHub] spark pull request #18869: [SPARK-21654][SQL] Complement SQL predicates expr...

2017-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18869





[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...

2017-09-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18869
  
Thanks @HyukjinKwon @gatorsmile 





[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18869
  
LGTM





[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18869
  
Thanks! Merging to master.





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18865
  
But isn't this behavior, where the `_corrupt_record` content depends on the 
selected JSON fields, by design in the first place?





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18865
  
> the usage can cause weird results too

> `_corrupt_record` should return all the records that Spark SQL fail to 
parse

I think another point should be that this issue still exists even if we 
disallow selecting `_corrupt_record` alone here. If we select a few columns 
together with `_corrupt_record`, this case will still exist and the results won't 
be consistent. For example,

```
echo '{"fieldA": 1, "fieldB": 2}
 {"fieldA": 3, "fieldB": 4}
 {"fieldA": "5", "fieldB": 6}' >/tmp/sample.json
```

```scala
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema("fieldA BYTE, fieldB BYTE, _corrupt_record STRING").json(file)
dfFromFile.select($"fieldA", $"_corrupt_record").show()
dfFromFile.select($"fieldB", $"_corrupt_record").show()
```

```
scala> dfFromFile.select($"fieldA", $"_corrupt_record").show()
+--++
|fieldA| _corrupt_record|
+--++
| 1|null|
| 3|null|
|  null| {"fieldA": "5", ...|
+--++


scala> dfFromFile.select($"fieldB", $"_corrupt_record").show()
+--+---+
|fieldB|_corrupt_record|
+--+---+
| 2|   null|
| 4|   null|
| 6|   null|
+--+---+
```

If we should disallow it, I think we should rather deprecate this option first 
with some warnings, or explain this behaviour in the documentation.





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136743435
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala ---
@@ -177,6 +177,8 @@ case class SortExec(
  """.stripMargin.trim
   }
 
+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

I revised this variable name several times, but didn't find a good name to 
convey its meaning.





[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...

2017-09-03 Thread awarrior
Github user awarrior commented on the issue:

https://github.com/apache/spark/pull/19115
  
@jerryshao hi~ I have modified this PR. But this patch only works in 2.2.0 
(some changes apply now). I want to confirm whether I need to create a new PR. 
Thanks!





[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19113
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19113
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81368/
Test PASSed.





[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19113
  
**[Test build #81368 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)**
 for PR 19113 at commit 
[`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18865
  
BTW, this change should be put into the migration guide of Spark SQL.





[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18865
  
From the viewpoint of the end users of Spark, 
`dfFromFile.select($"_corrupt_record").show()` might not return all the 
expected records. `_corrupt_record` should return all the records that Spark 
SQL fails to parse. If we are unable to output the expected results, we need to 
disallow users from doing it. 

cc @cloud-fan 





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136742515
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala ---
@@ -89,6 +89,8 @@ case class ExpandExec(
 child.asInstanceOf[CodegenSupport].inputRDDs()
   }
 
+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

The `doConsume` produces something like:

   |for (int $i = 0; $i < ${projections.length}; $i ++) {
   |  switch ($i) {
   |${cases.mkString("\n").trim}
   |  }
   |  $numOutput.add(1);
   |  ${consume(ctx, outputColumns)}
   |}

So the consume logic of its parent node is actually wrapped in a local 
for-loop. It has the same effect as not chaining the next consume.





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136742331
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan {
 
 ctx.freshNamePrefix = parent.variablePrefix
 val evaluated = evaluateRequiredVariables(output, inputVars, 
parent.usedInputs)
+
+// Under certain conditions, we can put the logic to consume the rows 
of this operator into
+// another function. So we can prevent a generated function too long 
to be optimized by JIT.
+// The conditions:
+// 1. The parent uses all variables in output. we can't defer variable 
evaluation when consume
+//in another function.
+// 2. The output variables are not empty. If it's empty, we don't 
bother to do that.
+// 3. We don't use row variable. The construction of row uses deferred 
variable evaluation. We
+//can't do it.
+// 4. The number of output variables must less than maximum number of 
parameters in Java method
+//declaration.
+val requireAllOutput = output.forall(parent.usedInputs.contains(_))
+val consumeFunc =
+  if (row == null && outputVars.nonEmpty && requireAllOutput && 
outputVars.length < 255) {
+constructDoConsumeFunction(ctx, inputVars)
+  } else {
+parent.doConsume(ctx, inputVars, rowVar)
+  }
 s"""
|${ctx.registerComment(s"CONSUME: ${parent.simpleString}")}
|$evaluated
-   |${parent.doConsume(ctx, inputVars, rowVar)}
+   |$consumeFunc
+ """.stripMargin
+  }
+
+  /**
+   * To prevent concatenated function growing too long to be optimized by 
JIT. Instead of inlining,
+   * we may put the consume logic of parent operator into a function and 
set this flag to `true`.
+   * The parent operator can know if its consume logic is inlined or in 
separated function.
+   */
+  private var doConsumeInFunc: Boolean = false
+
+  /**
+   * Returning true means we have at least one consume logic from child 
operator or this operator is
+   * separated in a function. If this is `true`, this operator shouldn't 
use `continue` statement to
+   * continue on next row, because its generated codes aren't enclosed in 
main while-loop.
+   *
+   * For example, we have generated codes for a query plan like:
+   *   Op1Exec
+   * Op2Exec
+   *   Op3Exec
+   *
+   * If we put the consume code of Op2Exec into a separated function, the 
generated codes are like:
+   *   while (...) {
+   * ... // logic of Op3Exec.
+   * Op2Exec_doConsume(...);
+   *   }
+   *   private boolean Op2Exec_doConsume(...) {
+   * ... // logic of Op2Exec to consume rows.
+   *   }
+   * For now, `doConsumeInChainOfFunc` of Op2Exec will be `true`.
+   *
+   * Notice for some operators like `HashAggregateExec`, it doesn't chain 
previous consume functions
+   * but begins with its produce framework. We should override 
`doConsumeInChainOfFunc` to return
+   * `false`.
+   */
+  protected def doConsumeInChainOfFunc: Boolean = {
+val codegenChildren = children.map(_.asInstanceOf[CodegenSupport])
+doConsumeInFunc || codegenChildren.exists(_.doConsumeInChainOfFunc)
+  }
+
+  /**
+   * The actual java statement this operator should use if there is a need 
to continue on next row
+   * in its `doConsume` codes.
+   *
+   *   while (...) {
+   * ...   // logic of Op3Exec.
+   * Op2Exec_doConsume(...);
+   *   }
+   *   private boolean Op2Exec_doConsume(...) {
+   * ...   // logic of Op2Exec to consume rows.
+   * continue; // Wrong. We can't use continue with the while-loop.
+   *   }
+   * In above code, we can't use `continue` in `Op2Exec_doConsume`.
+   *
+   * Instead, we do something like:
+   *   while (...) {
+   * ...  // logic of Op3Exec.
+   * boolean continueForLoop = Op2Exec_doConsume(...);
+   * if (continueForLoop) continue;
+   *   }
+   *   private boolean Op2Exec_doConsume(...) {
+   * ...  // logic of Op2Exec to consume rows.
+   * return true; // When we need to do continue, we return true.
+   *   }
+   */
+  protected def continueStatementInDoConsume: String = if 
(doConsumeInChainOfFunc) {
+"return true;";
--- End diff --

Thanks. I'll fix it.



[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136742282
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan {
 
 ctx.freshNamePrefix = parent.variablePrefix
 val evaluated = evaluateRequiredVariables(output, inputVars, 
parent.usedInputs)
+
+// Under certain conditions, we can put the logic to consume the rows 
of this operator into
+// another function. So we can prevent a generated function too long 
to be optimized by JIT.
+// The conditions:
+// 1. The parent uses all variables in output. we can't defer variable 
evaluation when consume
+//in another function.
+// 2. The output variables are not empty. If it's empty, we don't 
bother to do that.
+// 3. We don't use row variable. The construction of row uses deferred 
variable evaluation. We
+//can't do it.
+// 4. The number of output variables must less than maximum number of 
parameters in Java method
+//declaration.
+val requireAllOutput = output.forall(parent.usedInputs.contains(_))
+val consumeFunc =
+  if (row == null && outputVars.nonEmpty && requireAllOutput && 
outputVars.length < 255) {
+constructDoConsumeFunction(ctx, inputVars)
--- End diff --

I was thinking of checking it. But whole-stage codegen is a non-breaking 
process in which produce/consume calls are embedded together; you don't have a 
break point at which to check the function length here.

Actually I think always splitting the consume functions should have no negative 
effect. From the benchmarking numbers, it looks like there is no harm to 
normal queries.





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136742019
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan {
 
 ctx.freshNamePrefix = parent.variablePrefix
 val evaluated = evaluateRequiredVariables(output, inputVars, 
parent.usedInputs)
+
+// Under certain conditions, we can put the logic to consume the rows 
of this operator into
+// another function. So we can prevent a generated function too long 
to be optimized by JIT.
+// The conditions:
+// 1. The parent uses all variables in output. we can't defer variable 
evaluation when consume
+//in another function.
+// 2. The output variables are not empty. If it's empty, we don't 
bother to do that.
+// 3. We don't use row variable. The construction of row uses deferred 
variable evaluation. We
--- End diff --

The same reason as above. The variables used to evaluate the row can be out 
of scope because row construction is deferred until it is actually used.





[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r136741637
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan {
 
 ctx.freshNamePrefix = parent.variablePrefix
 val evaluated = evaluateRequiredVariables(output, inputVars, 
parent.usedInputs)
+
+// Under certain conditions, we can put the logic to consume the rows 
of this operator into
+// another function. So we can prevent a generated function too long 
to be optimized by JIT.
+// The conditions:
+// 1. The parent uses all variables in output. we can't defer variable 
evaluation when consume
--- End diff --

E.g., a `ProjectExec` node doesn't necessarily evaluate all its output 
variables before continuing with `doConsume` of its parent node. It can defer the 
evaluation until the variables are needed in the parent node's consume logic.

Once a variable's evaluation is deferred, and if we create a separate consume 
function, the variable will be evaluated inside that function. But then the 
references to this variable are out of scope.





[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...

2017-09-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19050
  
ping @cloud-fan @hvanhovell This blocks the #18956 going forward, can you 
help review this change? Thanks.





[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...

2017-09-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18869
  
@HyukjinKwon @gatorsmile Any more comments on this? The added tests should 
be enough.





[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136740379
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with 
TestHiveSingleton with Bef
   }
 }
   }
+
+  test("insert overwrite to dir from hive metastore table") {
+withTempDir { dir =>
+  val path = dir.toURI.getPath
+
+  sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src 
where key < 10")
+
+  sql(
+s"""
+   |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+   |STORED AS orc
+   |SELECT * FROM src where key < 10
+ """.stripMargin)
+
+  // use orc data source to check the data of path is right.
+  withTempView("orc_source") {
+sql(
+  s"""
+ |CREATE TEMPORARY VIEW orc_source
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${dir.getCanonicalPath}'
+ |)
+   """.stripMargin)
+
+checkAnswer(
+  sql("select * from orc_source"),
+  sql("select * from src where key < 10").collect())
+  }
+}
+  }
+
+  test("insert overwrite to local dir from temp table") {
+withTempView("test_insert_table") {
+  spark.range(10).selectExpr("id", "id AS 
str").createOrReplaceTempView("test_insert_table")
+
+  withTempDir { dir =>
+val path = dir.toURI.getPath
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |STORED AS orc
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+// use orc data source to check the data of path is right.
+withTempView("orc_source") {
+  sql(
+s"""
+   |CREATE TEMPORARY VIEW orc_source
+   |USING org.apache.spark.sql.hive.orc
+   |OPTIONS (
+   |  PATH '${dir.getCanonicalPath}'
+   |)
+ """.stripMargin)
+
+  checkAnswer(
+sql("select * from orc_source"),
+sql("select * from test_insert_table").collect())
+}
+  }
+}
+  }
+
+  test("insert overwrite to dir from temp table") {
+withTempView("test_insert_table") {
+  spark.range(10).selectExpr("id", "id AS 
str").createOrReplaceTempView("test_insert_table")
+
+  withTempDir { dir =>
+val pathUri = dir.toURI
+
+sql(
+  s"""
+ |INSERT OVERWRITE DIRECTORY '${pathUri}'
+ |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+sql(
+  s"""
+ |INSERT OVERWRITE DIRECTORY '${pathUri}'
+ |STORED AS orc
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+// use orc data source to check the data of path is right.
+withTempView("orc_source") {
+  sql(
+s"""
+   |CREATE TEMPORARY VIEW orc_source
+   |USING org.apache.spark.sql.hive.orc
+   |OPTIONS (
+   |  PATH '${dir.getCanonicalPath}'
+   |)
+ """.stripMargin)
+
+  checkAnswer(
+sql("select * from orc_source"),
--- End diff --

You do not need the temp view `orc_source`. Below is an example.
```Scala
checkAnswer(
  spark.read.orc(dir.getCanonicalPath),
  sql("select * from test_insert_table"))
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19114: Update PairRDDFunctions.scala

2017-09-03 Thread awarrior
Github user awarrior closed the pull request at:

https://github.com/apache/spark/pull/19114


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136739979
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with 
TestHiveSingleton with Bef
   }
 }
   }
+
+  test("insert overwrite to dir from hive metastore table") {
+withTempDir { dir =>
+  val path = dir.toURI.getPath
+
+  sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src 
where key < 10")
+
+  sql(
+s"""
+   |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+   |STORED AS orc
+   |SELECT * FROM src where key < 10
+ """.stripMargin)
+
+  // use orc data source to check the data of path is right.
+  withTempView("orc_source") {
+sql(
+  s"""
+ |CREATE TEMPORARY VIEW orc_source
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${dir.getCanonicalPath}'
+ |)
+   """.stripMargin)
+
+checkAnswer(
+  sql("select * from orc_source"),
+  sql("select * from src where key < 10").collect())
+  }
+}
+  }
+
+  test("insert overwrite to local dir from temp table") {
+withTempView("test_insert_table") {
+  spark.range(10).selectExpr("id", "id AS 
str").createOrReplaceTempView("test_insert_table")
+
+  withTempDir { dir =>
+val path = dir.toURI.getPath
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |STORED AS orc
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+// use orc data source to check the data of path is right.
+withTempView("orc_source") {
+  sql(
+s"""
+   |CREATE TEMPORARY VIEW orc_source
+   |USING org.apache.spark.sql.hive.orc
+   |OPTIONS (
+   |  PATH '${dir.getCanonicalPath}'
+   |)
+ """.stripMargin)
+
+  checkAnswer(
+sql("select * from orc_source"),
+sql("select * from test_insert_table").collect())
--- End diff --

The same here.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136739964
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with 
TestHiveSingleton with Bef
   }
 }
   }
+
+  test("insert overwrite to dir from hive metastore table") {
+withTempDir { dir =>
+  val path = dir.toURI.getPath
+
+  sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src 
where key < 10")
+
+  sql(
+s"""
+   |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+   |STORED AS orc
+   |SELECT * FROM src where key < 10
+ """.stripMargin)
+
+  // use orc data source to check the data of path is right.
+  withTempView("orc_source") {
+sql(
+  s"""
+ |CREATE TEMPORARY VIEW orc_source
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${dir.getCanonicalPath}'
+ |)
+   """.stripMargin)
+
+checkAnswer(
+  sql("select * from orc_source"),
+  sql("select * from src where key < 10").collect())
--- End diff --

Nit: `.collect()` can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136739990
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with 
TestHiveSingleton with Bef
   }
 }
   }
+
+  test("insert overwrite to dir from hive metastore table") {
+withTempDir { dir =>
+  val path = dir.toURI.getPath
+
+  sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src 
where key < 10")
+
+  sql(
+s"""
+   |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+   |STORED AS orc
+   |SELECT * FROM src where key < 10
+ """.stripMargin)
+
+  // use orc data source to check the data of path is right.
+  withTempView("orc_source") {
+sql(
+  s"""
+ |CREATE TEMPORARY VIEW orc_source
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${dir.getCanonicalPath}'
+ |)
+   """.stripMargin)
+
+checkAnswer(
+  sql("select * from orc_source"),
+  sql("select * from src where key < 10").collect())
+  }
+}
+  }
+
+  test("insert overwrite to local dir from temp table") {
+withTempView("test_insert_table") {
+  spark.range(10).selectExpr("id", "id AS 
str").createOrReplaceTempView("test_insert_table")
+
+  withTempDir { dir =>
+val path = dir.toURI.getPath
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+sql(
+  s"""
+ |INSERT OVERWRITE LOCAL DIRECTORY '${path}'
+ |STORED AS orc
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+// use orc data source to check the data of path is right.
+withTempView("orc_source") {
+  sql(
+s"""
+   |CREATE TEMPORARY VIEW orc_source
+   |USING org.apache.spark.sql.hive.orc
+   |OPTIONS (
+   |  PATH '${dir.getCanonicalPath}'
+   |)
+ """.stripMargin)
+
+  checkAnswer(
+sql("select * from orc_source"),
+sql("select * from test_insert_table").collect())
+}
+  }
+}
+  }
+
+  test("insert overwrite to dir from temp table") {
+withTempView("test_insert_table") {
+  spark.range(10).selectExpr("id", "id AS 
str").createOrReplaceTempView("test_insert_table")
+
+  withTempDir { dir =>
+val pathUri = dir.toURI
+
+sql(
+  s"""
+ |INSERT OVERWRITE DIRECTORY '${pathUri}'
+ |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+sql(
+  s"""
+ |INSERT OVERWRITE DIRECTORY '${pathUri}'
+ |STORED AS orc
+ |SELECT * FROM test_insert_table
+   """.stripMargin)
+
+// use orc data source to check the data of path is right.
+withTempView("orc_source") {
+  sql(
+s"""
+   |CREATE TEMPORARY VIEW orc_source
+   |USING org.apache.spark.sql.hive.orc
+   |OPTIONS (
+   |  PATH '${dir.getCanonicalPath}'
+   |)
+ """.stripMargin)
+
+  checkAnswer(
+sql("select * from orc_source"),
+sql("select * from test_insert_table").collect())
--- End diff --

The same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18902
  
+1 for using the Dataframe-based version of the code.

@zhengruifeng One thing I want to confirm: I checked your testing code, and 
both the RDD-based version and the Dataframe-based version pay the 
deserialization cost:
```
...
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()
...
// do `imputer.fit`
```
When running `imputer.fit`, it extracts the required columns from the cached 
input dataframe. You then compare the perf between `RDD.aggregate` and 
`dataframe avg`; both need to deserialize the data from the input and then do 
the computation, and `dataframe avg` will take advantage of codegen and should 
be faster. But the test here shows that the RDD version is slower than the 
Dataframe version, which does not look reasonable to me, so I want to confirm: 
in your RDD version testing, do you cache again when you get the `RDD` from 
the input `Dataframe`?
If not, your testing has no problem, and I would guess there is some other 
performance issue in the SQL layer; cc @cloud-fan to take a look.
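
A hedged sketch of how the RDD side of such a comparison could cache the 
extracted RDD before timing, so both sides pay the deserialization from the 
cached input only once (the column name, data size and aggregate below are 
illustrative assumptions, not the benchmark code from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg
import org.apache.spark.storage.StorageLevel

object ImputerPerfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("imputer-perf").getOrCreate()
    import spark.implicits._

    // Illustrative input: one double column named `value`.
    val df = spark.range(0, 1000000).select($"id".cast("double").as("value"))
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  // materialize the cached DataFrame, as in the test setup

    // DataFrame-based mean: benefits from whole-stage codegen.
    val dfMean = df.agg(avg($"value")).first().getDouble(0)

    // RDD-based mean: cache the extracted RDD separately, so deserialization
    // from the cached DataFrame is paid once instead of on every pass.
    val rdd = df.select("value").rdd.map(_.getDouble(0))
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()
    val (sum, cnt) = rdd.aggregate((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    println(s"dataframe avg = $dfMean, rdd avg = ${sum / cnt}")

    spark.stop()
  }
}
```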




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19086
  
**[Test build #81370 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)**
 for PR 19086 at commit 
[`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-03 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19086
  
@gatorsmile 
I updated; let me know if there are still any unresolved comments.
Thanks again for the review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18865
  
@HyukjinKwon 's provided use case looks pretty fair. The corrupt record there 
is a whole line that doesn't follow the json format. It is kind of different 
from the corrupt-record case where some json fields can't be correctly 
converted to the desired data type.

These two kinds of corrupt records can be mixed in one json file. E.g.,

echo '{"field": 1
 {"field" 2}
 {"field": 3}
 {"field": "4"}' >/tmp/sample.json

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|null |{"field": 1    |
|null | {"field" 2}   |
|3    |null           |
|null |{"field": "4"} |
+-----+---------------+

scala> dfFromFile.select($"_corrupt_record").show()
+---------------+
|_corrupt_record|
+---------------+
|    {"field": 1|
|    {"field" 2}|
|           null|
|           null|
+---------------+


At least we should clearly explain the difference in the error message. Maybe 
something like: after optimization, the query to execute requires only 
`_corrupt_record`. When there are corrupt records caused by json field 
conversion errors, those corrupt records might not be generated correctly in 
the end, because no other json fields are actually required any more. In order 
to obtain the most accurate result, we recommend users to cache or save the 
dataset before running queries that require only `_corrupt_record`.
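
A minimal sketch of that recommended workaround, assuming an existing 
SparkSession named `spark` (with `spark.implicits._` imported) and the 
/tmp/sample.json file from the example above:

```scala
val dfFromFile = spark.read
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/sample.json")

dfFromFile.cache()   // materializes every field, so corrupt records are kept consistently
dfFromFile.count()

// A query touching only `_corrupt_record` now sees the same corrupt rows
// as the full-row query above.
dfFromFile.select($"_corrupt_record").show()
```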




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136739126
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala
 ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.language.existentials
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.hive.common.FileUtils
+import org.apache.hadoop.hive.ql.plan.TableDesc
+import org.apache.hadoop.hive.serde.serdeConstants
+import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
+import org.apache.hadoop.mapred._
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, 
CatalogTable}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.hive.client.HiveClientImpl
+
+/**
+ * Command for writing the results of `query` to file system.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   INSERT OVERWRITE [LOCAL] DIRECTORY
+ *   path
+ *   [ROW FORMAT row_format]
+ *   [STORED AS file_format]
+ *   SELECT ...
+ * }}}
+ *
+ * @param isLocal whether the path specified in `storage` is a local 
directory
+ * @param storage storage format used to describe how the query result is 
stored.
+ * @param query the logical plan representing data to write to
+ * @param overwrite whthere overwrites existing directory
+ */
+case class InsertIntoHiveDirCommand(
+isLocal: Boolean,
+storage: CatalogStorageFormat,
+query: LogicalPlan,
+overwrite: Boolean) extends SaveAsHiveFile {
+
+  override def children: Seq[LogicalPlan] = query :: Nil
+
+  override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): 
Seq[Row] = {
+assert(children.length == 1)
+assert(storage.locationUri.nonEmpty)
+
+//val Array(cols, types) = children.head.output.foldLeft(Array("", 
"")) { case (r, a) =>
+//  r(0) = r(0) + a.name + ","
+//  r(1) = r(1) + a.dataType.catalogString + ":"
+//  r
+//}
+//
+//val properties = new Properties()
+//properties.put("columns", cols.dropRight(1))
+//properties.put("columns.types", types.dropRight(1))
+//properties.put(serdeConstants.SERIALIZATION_LIB,
+//  storage.serde.getOrElse(classOf[LazySimpleSerDe].getName))
+//
+//import scala.collection.JavaConverters._
+//properties.putAll(storage.properties.asJava)
+//
+//val tableDesc = new TableDesc(
+//  Utils.classForName(storage.inputFormat.get).asInstanceOf[Class[_ 
<: InputFormat[_, _]]],
+//  Utils.classForName(storage.outputFormat.get),
+//  properties
+//)
--- End diff --

Yes. The following dummy table looks fine to me. Could you remove the above 
lines? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...

2017-09-03 Thread lgrcyanny
Github user lgrcyanny closed the pull request at:

https://github.com/apache/spark/pull/19079


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19102: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...

2017-09-03 Thread lgrcyanny
Github user lgrcyanny commented on a diff in the pull request:

https://github.com/apache/spark/pull/19102#discussion_r136738525
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -496,7 +496,7 @@ object SparkSubmit extends CommandLineUtils with 
Logging {
 sysProp = "spark.executor.memory"),
   OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, 
ALL_DEPLOY_MODES,
 sysProp = "spark.cores.max"),
-  OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, 
ALL_DEPLOY_MODES,
+  OptionAssigner(args.files, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
--- End diff --

For yarn-client mode, --files are already added to "spark.yarn.dist.files". I 
agree with you that we can just addFile in SparkContext for yarn-client mode 
based on "spark.yarn.dist.files". BTW, I will fix the doc for Spark-Submit as 
well.
Thanks @vanzin 
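
A hedged sketch of that idea (not the PR's actual change): re-register the 
files listed in "spark.yarn.dist.files" through SparkContext.addFile so that 
SparkFiles.get can resolve them on the driver as well.

```scala
import org.apache.spark.SparkContext

object DistFilesSketch {
  // Re-registers the files listed in spark.yarn.dist.files through
  // SparkContext.addFile, making them reachable via SparkFiles.get.
  def addYarnDistFiles(sc: SparkContext): Unit = {
    sc.getConf.getOption("spark.yarn.dist.files")
      .toSeq
      .flatMap(_.split(","))
      .map(_.trim)
      .filter(_.nonEmpty)
      .foreach(sc.addFile)
  }
}
```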


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19115: Update PairRDDFunctions.scala

2017-09-03 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19115
  
@awarrior please follow the 
[doc](https://spark.apache.org/contributing.html) to submit a patch.

You need to change the PR title like other PRs by adding the JIRA id and a 
component tag.

Add the details of your problem in the PR description, not just a simple JIRA 
link.

Submit the PR against the master branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136738263
  
--- Diff: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -241,11 +241,21 @@ query
 : ctes? queryNoWith
 ;
 
-insertInto
+insertIntoTable
--- End diff --

These can be combined to

```
insertInto
    : INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?        #insertIntoTable
    | INSERT INTO TABLE? tableIdentifier partitionSpec?                                #insertIntoTable
    | INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat?       #insertOverwriteHiveDir
    | INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)?   #insertOverwriteDir
;
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136738182
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -178,11 +179,50 @@ class AstBuilder(conf: SQLConf) extends 
SqlBaseBaseVisitor[AnyRef] with Logging
   }
 
   /**
-   * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the 
logical plan.
+   * Parameters used for writing query to a directory: (isLocal, 
CatalogStorageFormat, provider).
+   */
+  type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String])
+
+  /**
+   * Add an
+   *   INSERT INTO [TABLE] or
+   *   INSERT OVERWRITE TABLE or
+   *   INSERT OVERWRITE [LOCAL] DIRECTORY
+   * operation to logical plan
*/
   private def withInsertInto(
   ctx: InsertIntoContext,
   query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
+assert(ctx.children.size == 1)
--- End diff --

It can be rewritten to 
```Scala
  private def withInsertInto(
      ctx: InsertIntoContext,
      query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
    ctx match {
      case c: InsertIntoTableContext =>
        withInsertIntoTable(c, query)
      case dir: InsertOverwriteDirContext =>
        val (isLocal, storage, provider) = visitInsertOverwriteDir(dir)
        InsertIntoDir(isLocal, storage, provider, query, overwrite = true)
      case hiveDir: InsertOverwriteHiveDirContext =>
        val (isLocal, storage, provider) = visitInsertOverwriteHiveDir(hiveDir)
        InsertIntoDir(isLocal, storage, provider, query, overwrite = true)
    }
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/17014
  
@WeichenXu123 @jkbradley I am curious why `ml.KMeans` is so special that it 
needs a separate PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18538
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81369/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18538
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18538
  
**[Test build #81369 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)**
 for PR 18538 at commit 
[`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19115: Update PairRDDFunctions.scala

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19115
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19114: Update PairRDDFunctions.scala

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19114
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19115: Update PairRDDFunctions.scala

2017-09-03 Thread awarrior
GitHub user awarrior opened a pull request:

https://github.com/apache/spark/pull/19115

Update PairRDDFunctions.scala

https://issues.apache.org/jira/browse/SPARK-21882

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/awarrior/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19115


commit a096970b2f2cfa497a96870ebd26f83a106b4e07
Author: Jarvis 
Date:   2017-09-04T02:48:35Z

Update PairRDDFunctions.scala

spark-21882




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/17014#discussion_r136737427
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -304,16 +304,14 @@ class KMeans @Since("1.5.0") (
   override def fit(dataset: Dataset[_]): KMeansModel = {
 transformSchema(dataset.schema, logging = true)
 
-val handlePersistence = dataset.rdd.getStorageLevel == 
StorageLevel.NONE
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 val instances: RDD[OldVector] = 
dataset.select(col($(featuresCol))).rdd.map {
   case Row(point: Vector) => OldVectors.fromML(point)
 }
 
-if (handlePersistence) {
-  instances.persist(StorageLevel.MEMORY_AND_DISK)
-}
+if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

OK, I will revert those small changes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19114: Update PairRDDFunctions.scala

2017-09-03 Thread awarrior
GitHub user awarrior opened a pull request:

https://github.com/apache/spark/pull/19114

Update PairRDDFunctions.scala

https://issues.apache.org/jira/browse/SPARK-21882


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/awarrior/spark branch-1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19114.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19114


commit e7e42802b07c5148ba02761af1edd2ee81d6ef95
Author: Jarvis 
Date:   2017-09-04T02:52:01Z

Update PairRDDFunctions.scala

spark-21882




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18957: [SPARK-21744][CORE] Add retry logic for new broad...

2017-09-03 Thread caneGuy
Github user caneGuy closed the pull request at:

https://github.com/apache/spark/pull/18957


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19111
  
I found `NaiveBayes` may also fail. Found here in #18538. Hope this can work!


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console

```
2. Error: spark.naiveBayes (@test_mllib_classification.R#482) 
--
java.lang.IllegalArgumentException: requirement failed: The input column 
stridx_5a965aeed344 should have at least two distinct values.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.ml.feature.OneHotEncoder$$anonfun$5.apply(OneHotEncoder.scala:113)
...
1: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0) at 
/home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/tests/fulltests/test_mllib_classification.R:482
2: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0)
3: .local(data, formula, ...)
4: callJStatic("org.apache.spark.ml.r.NaiveBayesWrapper", "fit", formula, 
data@sdf, 
   smoothing, handleInvalid)
5: invokeJava(isStatic = TRUE, className, methodName, ...)
6: handleErrors(returnStatus, conn)
7: stop(readString(conn))

DONE 
===
Error: Test failures
Execution halted
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18538
  
**[Test build #81369 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)**
 for PR 18538 at commit 
[`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18538
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on driver ...

2017-09-03 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19079
  
Please close this PR @lgrcyanny thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19113
  
**[Test build #81368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)**
 for PR 19113 at commit 
[`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19113: [SPARK-20978][SQL] Bump up Univocity version to 2...

2017-09-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19113

[SPARK-20978][SQL] Bump up Univocity version to 2.5.4

## What changes were proposed in this pull request?

There was a bug in Univocity Parser that caused the issue in SPARK-20978. 
This was fixed, as shown below:

```scala
val df = spark.read.schema("a string, b string, unparsed 
string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()
```

**Before**

```
java.lang.NullPointerException
at 
scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...
```

**After**

```
+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+
```

It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to 
upgrade this.

## How was this patch tested?

Unit test added in `CSVSuite.scala`.
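
A hedged sketch of what such a regression test could look like, reusing the 
suite's existing helpers (`checkAnswer`, `testImplicits`); the test name and 
assertion are assumptions based on the Before/After output above, not the 
actual patch:

```scala
test("SPARK-20978: corrupt record column with fewer tokens than the schema") {
  val df = spark.read
    .schema("a string, b string, unparsed string")
    .option("columnNameOfCorruptRecord", "unparsed")
    .csv(Seq("a").toDS())
  // Matches the "After" output: a = "a", b = null, unparsed = "a".
  checkAnswer(df, Row("a", null, "a"))
}
```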

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark bump-up-univocity

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19113


commit fa7eb514cfd91bb405fd74680b08d5865911e3f0
Author: hyukjinkwon 
Date:   2017-08-26T03:52:40Z

Bump up Univocity version to 2.5.4




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17862
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81367/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #81367 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81367/testReport)**
 for PR 17862 at commit 
[`cec628b`](https://github.com/apache/spark/commit/cec628ba094fcbaad2fad4a7c5957aa9f5d3d698).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #81367 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81367/testReport)**
 for PR 17862 at commit 
[`cec628b`](https://github.com/apache/spark/commit/cec628ba094fcbaad2fad4a7c5957aa9f5d3d698).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18975
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18975
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81366/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/18975
  
There is a difference in Hive's semantics vs what this PR is doing. In Hive, 
the query execution writes to a staging location, and the destination location 
is cleared and re-populated after the query execution finishes (it happens in 
`MoveTask`). This PR will first wipe the destination location and then perform 
the query execution to populate the destination location with the desired 
data.

I like the Hive model because:
- If the query execution fails, you at least still have the old data. Insert 
overwrite to a table does the same thing. This PR will leave the output 
location empty.
- Hive achieves atomicity by using a staging dir. With this PR, I am not sure 
what happens to the output location if some tasks have written out their final 
data while the rest are still working. Would it have partial output? (A 
minimal sketch of the staging-dir pattern follows below.)
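
A minimal sketch of that staging-directory pattern, assuming Hadoop's 
FileSystem API; the helper name and paths are illustrative, not the PR's or 
Hive's code:

```scala
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StagingDirSketch {
  // Writes into a hidden staging directory first, and only clears and
  // re-populates the destination once the write has succeeded.
  def overwriteViaStaging(destDir: String)(write: Path => Unit): Unit = {
    val dest = new Path(destDir)
    val fs: FileSystem = dest.getFileSystem(new Configuration())
    val staging = new Path(dest.getParent, s".staging-${UUID.randomUUID()}")
    try {
      write(staging)                                 // "query execution" writes here
      if (fs.exists(dest)) fs.delete(dest, true)     // old data survives until this point
      fs.rename(staging, dest)                       // publish the new data
    } finally {
      if (fs.exists(staging)) fs.delete(staging, true)  // clean up leftovers on failure
    }
  }
}
```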


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18975
  
**[Test build #81366 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81366/testReport)**
 for PR 18975 at commit 
[`2ec9947`](https://github.com/apache/spark/commit/2ec9947af571691966e979e17aec12d3d683decf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136727120
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala
 ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import scala.language.existentials
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.hive.common.FileUtils
+import org.apache.hadoop.hive.ql.plan.TableDesc
+import org.apache.hadoop.hive.serde.serdeConstants
+import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
+import org.apache.hadoop.mapred._
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, 
CatalogTable}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.hive.client.HiveClientImpl
+
+/**
+ * Command for writing the results of `query` to file system.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   INSERT OVERWRITE [LOCAL] DIRECTORY
+ *   path
+ *   [ROW FORMAT row_format]
+ *   [STORED AS file_format]
+ *   SELECT ...
+ * }}}
+ *
+ * @param isLocal whether the path specified in `storage` is a local 
directory
+ * @param storage storage format used to describe how the query result is 
stored.
+ * @param query the logical plan representing data to write to
+ * @param overwrite whthere overwrites existing directory
+ */
+case class InsertIntoHiveDirCommand(
+isLocal: Boolean,
+storage: CatalogStorageFormat,
+query: LogicalPlan,
+overwrite: Boolean) extends SaveAsHiveFile {
+
+  override def children: Seq[LogicalPlan] = query :: Nil
+
+  override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): 
Seq[Row] = {
+assert(children.length == 1)
+assert(storage.locationUri.nonEmpty)
+
+//val Array(cols, types) = children.head.output.foldLeft(Array("", 
"")) { case (r, a) =>
--- End diff --

deadcode ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136727065
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/InsertIntoDataSourceDirCommand.scala
 ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.datasources._
+
+/**
+ * A command used to write the result of a query to a directory.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   INSERT OVERWRITE DIRECTORY (path=STRING)?
+ *   USING format OPTIONS ([option1_name "option1_value", option2_name 
"option2_value", ...])
+ *   SELECT ...
+ * }}}
+ *
+ * @param storage storage format used to describe how the query result is 
stored.
+ * @param provider the data source type to be used
+ * @param query the logical plan representing data to write to
+ * @param overwrite whether to overwrite the existing directory
+ */
+case class InsertIntoDataSourceDirCommand(
+storage: CatalogStorageFormat,
+provider: String,
+query: LogicalPlan,
+overwrite: Boolean) extends RunnableCommand {
+
+  override def innerChildren: Seq[LogicalPlan] = Seq(query)
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+assert(innerChildren.length == 1)
+assert(storage.locationUri.nonEmpty, "Directory path is required")
+assert(!provider.isEmpty, "Data source is required")
+
+// Create the relation based on the input logical plan: `query`.
+val pathOption = storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
+
+val dataSource = DataSource(
+  sparkSession,
+  className = provider,
+  options = storage.properties ++ pathOption,
+  catalogTable = None)
+
+val isFileFormat = 
classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
+if (!isFileFormat) {
+  throw new SparkException(
+"Only Data Sources providing FileFormat are supported.")
--- End diff --

nit: you could also include `dataSource.providingClass` in the message to make it 
easier to reason about / debug.
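
A minimal sketch of that suggestion against the `run` method quoted above; only the error message changes, and it is an assumption rather than the PR's final code:

```scala
// Include the resolved provider class in the error so failures are easier to debug.
val isFileFormat = classOf[FileFormat].isAssignableFrom(dataSource.providingClass)
if (!isFileFormat) {
  throw new SparkException(
    s"Only Data Sources providing FileFormat are supported: ${dataSource.providingClass}.")
}
```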


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136726991
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -178,11 +179,50 @@ class AstBuilder(conf: SQLConf) extends 
SqlBaseBaseVisitor[AnyRef] with Logging
   }
 
   /**
-   * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the 
logical plan.
+   * Parameters used for writing query to a directory: (isLocal, 
CatalogStorageFormat, provider).
+   */
+  type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String])
+
+  /**
+   * Add an
+   *   INSERT INTO [TABLE] or
+   *   INSERT OVERWRITE TABLE or
+   *   INSERT OVERWRITE [LOCAL] DIRECTORY
+   * operation to logical plan
*/
   private def withInsertInto(
   ctx: InsertIntoContext,
   query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
+assert(ctx.children.size == 1)
--- End diff --

@janewangfb : thanks for the explanation. I was confused about whether this would 
affect multi-inserts, i.e.

```
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
```

One can even mix in `INSERT OVERWRITE TABLE ...` in Hive. You should have 
a test case for that.
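
A hedged sketch of such a test case, mixing a table target and a directory target in one Hive-style multi-insert; the table and path names are hypothetical:

```scala
// Mirrors the multi-insert template above; `src`, `dst_low`, and the output path are assumptions.
spark.sql(
  """
    |FROM src
    |INSERT OVERWRITE TABLE dst_low SELECT key, value WHERE key < 10
    |INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dir_high' SELECT key, value WHERE key >= 10
  """.stripMargin)
```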


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19090: [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows ...

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19090
  
Thanks for the thorough testing. Yeah, looks fine. I will take a few more looks 
myself.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136726921
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1509,4 +1509,86 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
   query: LogicalPlan): LogicalPlan = {
 RepartitionByExpression(expressions, query, conf.numShufflePartitions)
   }
+
+  /**
+   * Return the parameters for [[InsertIntoDir]] logical plan.
+   *
+   * Expected format:
+   * {{{
+   *   INSERT OVERWRITE DIRECTORY
+   *   [path]
+   *   [OPTIONS table_property_list]
+   *   select_statement;
+   * }}}
+   */
+  override def visitInsertOverwriteDir(
+  ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) {
+if (ctx.LOCAL != null) {
+  throw new ParseException(
+"LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data 
source", ctx)
+}
+
+val options = 
Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)
+var storage = DataSource.buildStorageFormatFromOptions(options)
+
+val path = Option(ctx.path).map(string).getOrElse("")
+
+if (!path.isEmpty && storage.locationUri.isDefined) {
+  throw new ParseException(
+"Directory path and 'path' in OPTIONS are both used to indicate 
the directory path, " +
+  "you can only specify one of them.", ctx)
+}
+if (path.isEmpty && storage.locationUri.isEmpty) {
--- End diff --

You could collapse this check and the one above into a single XOR-style check.
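
A sketch of that suggestion against the parser code quoted above: exactly one of the two ways of giving the path must be used, so the two error branches collapse into one equality check. This is an assumption, not the PR's final code:

```scala
// Error out when the inline path and 'path' in OPTIONS are both given or both missing.
if (path.nonEmpty == storage.locationUri.isDefined) {
  throw new ParseException(
    "Directory path must be specified either inline or via 'path' in OPTIONS, " +
      "but not both.", ctx)
}
```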


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-09-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18975#discussion_r136726866
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1509,4 +1509,86 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
   query: LogicalPlan): LogicalPlan = {
 RepartitionByExpression(expressions, query, conf.numShufflePartitions)
   }
+
+  /**
+   * Return the parameters for [[InsertIntoDir]] logical plan.
+   *
+   * Expected format:
+   * {{{
+   *   INSERT OVERWRITE DIRECTORY
+   *   [path]
+   *   [OPTIONS table_property_list]
+   *   select_statement;
+   * }}}
+   */
+  override def visitInsertOverwriteDir(
+  ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) {
+if (ctx.LOCAL != null) {
+  throw new ParseException(
+"LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data 
source", ctx)
+}
+
+val options = 
Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty)
+var storage = DataSource.buildStorageFormatFromOptions(options)
+
+val path = Option(ctx.path).map(string).getOrElse("")
+
+if (!path.isEmpty && storage.locationUri.isDefined) {
+  throw new ParseException(
+"Directory path and 'path' in OPTIONS are both used to indicate 
the directory path, " +
+  "you can only specify one of them.", ctx)
+}
+if (path.isEmpty && storage.locationUri.isEmpty) {
+  throw new ParseException(
+"You need to specify directory path or 'path' in OPTIONS, but not 
both", ctx)
--- End diff --

nit: `but not both` does not match the check, since this branch is only reached 
when neither `path` nor `locationUri` is specified.
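
If the two checks are kept separate, one possible rewording of this branch's message (a sketch, not the PR's final text) could be:

```scala
// This branch is reached when neither the inline path nor 'path' in OPTIONS is given.
throw new ParseException(
  "You need to specify the directory path either inline or via 'path' in OPTIONS.", ctx)
```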


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...

2017-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19112#discussion_r136726682
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala ---
@@ -177,11 +179,11 @@ class SourceProgress protected[sql](
 }
 
 ("description" -> JString(description)) ~
-  ("startOffset" -> tryParse(startOffset)) ~
-  ("endOffset" -> tryParse(endOffset)) ~
-  ("numInputRows" -> JInt(numInputRows)) ~
-  ("inputRowsPerSecond" -> safeDoubleToJValue(inputRowsPerSecond)) ~
-  ("processedRowsPerSecond" -> 
safeDoubleToJValue(processedRowsPerSecond))
--- End diff --

Yep. I guess those are the less common style.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18865
  
A use case might be:

```
echo '{"field": 1
 {"field" 2}
 {"field": 3}' >/tmp/sample.json
```


```scala
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema("field BYTE, _corrupt_record 
STRING").json(file)
dfFromFile.show(false)
dfFromFile.filter($"_corrupt_record".isNotNull).count()
dfFromFile.filter($"_corrupt_record".isNull).count()
dfFromFile.select($"_corrupt_record").show()
```

```
scala> dfFromFile.show(false)
+-+---+
|field|_corrupt_record|
+-+---+
|null |{"field": 1|
|null | {"field" 2}   |
|3|null   |
+-+---+


scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 2

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 1

scala> dfFromFile.select($"_corrupt_record").show()
+---+
|_corrupt_record|
+---+
|{"field": 1|
|{"field" 2}|
|   null|
+---+
```

I was thinking of avoiding blacklisting a case that previously worked without 
an exception, but it looks like we (at least I) are less sure whether it is a bug.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19060
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...

2017-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19060
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81365/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...

2017-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19060
  
**[Test build #81365 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81365/testReport)**
 for PR 19060 at commit 
[`fb72c89`](https://github.com/apache/spark/commit/fb72c89478149da99bd7ac402257ab7468156f9d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


