[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21482
  
How is this done in other databases? I don't think we want to invent new 
ways on these basic primitives.


---




[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21482#discussion_r193230476
  
--- Diff: R/pkg/NAMESPACE ---
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
   "initcap",
   "input_file_name",
   "instr",
+  "isInf",
+  "isinf",
--- End diff --

the functions are case-insensitive so I don't think we need both?


---




[GitHub] spark issue #21448: [SPARK-24408][SQL][DOC] Move abs, bitwiseNOT, isnan, nan...

2018-05-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21448
  
I'd only move abs and nothing else.



---




[GitHub] spark issue #21459: [SPARK-24420][Build] Upgrade ASM to 6.1 to support JDK9+

2018-05-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21459
  
What's driving this (is it Java 9)? I'm in general scared by core library 
updates like this. Maybe Spark 3.0 is a good time (and we should just do it 
this year).



---




[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21453
  
Jenkins, add to whitelist.



---




[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21453
  
Jenkins, test this please.



---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...

2018-05-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21416
  
LGTM (I didn't look that carefully though)


---




[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191306678
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -392,9 +396,97 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
 val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
 
-intercept[AnalysisException] {
+val e = intercept[AnalysisException] {
   df2.filter($"a".isin($"b"))
 }
+Seq("cannot resolve", "due to data type mismatch: Arguments must be 
same type but were")
+  .foreach { s =>
+
assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+  }
+  }
+
+  test("isInCollection: Scala Collection") {
+val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
+checkAnswer(df.filter($"a".isInCollection(Seq(1, 2))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 2))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 1))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+// Auto casting should work with mixture of different types in 
collections
+checkAnswer(df.filter($"a".isInCollection(Seq(1.toShort, "2"))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq("3", 2.toLong))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, "1"))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+checkAnswer(df.filter($"b".isInCollection(Seq("y", "x"))),
+  df.collect().toSeq.filter(r => r.getString(1) == "y" || 
r.getString(1) == "x"))
+checkAnswer(df.filter($"b".isInCollection(Seq("z", "x"))),
+  df.collect().toSeq.filter(r => r.getString(1) == "z" || 
r.getString(1) == "x"))
+checkAnswer(df.filter($"b".isInCollection(Seq("z", "y"))),
+  df.collect().toSeq.filter(r => r.getString(1) == "z" || 
r.getString(1) == "y"))
+
+// Test with different types of collections
+checkAnswer(df.filter($"a".isInCollection(Seq(1, 2).toSet)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 2).toArray)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 1).toList)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
+
+val e = intercept[AnalysisException] {
+  df2.filter($"a".isInCollection(Seq($"b")))
+}
+Seq("cannot resolve", "due to data type mismatch: Arguments must be 
same type but were")
+  .foreach { s =>
+
assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+  }
+  }
+
+  test("isInCollection: Java Collection") {
+val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
--- End diff --

same thing here. just run a single test case.


---




[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191306654
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -392,9 +396,97 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 
 val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
 
-intercept[AnalysisException] {
+val e = intercept[AnalysisException] {
   df2.filter($"a".isin($"b"))
 }
+Seq("cannot resolve", "due to data type mismatch: Arguments must be 
same type but were")
+  .foreach { s =>
+
assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+  }
+  }
+
+  test("isInCollection: Scala Collection") {
--- End diff --

Can we simplify the test cases? You are just testing this API as a wrapper; 
you don't need to run so many queries for type coercion.
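
A minimal sketch of the kind of slimmed-down check meant here, reusing the df from the PR's test and the suite's `checkAnswer` helper and implicits (illustrative, not the PR's final test):

```scala
test("isInCollection: Scala Collection") {
  val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
  // One query is enough to show the new API is a thin wrapper over isin;
  // type coercion is already covered by the existing isin tests.
  checkAnswer(df.filter($"a".isInCollection(Seq(1, 2))), Seq(Row(1, "x"), Row(2, "y")))
}
```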


---




[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

2018-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21427
  
If we can fix it without breaking existing behavior that would be awesome.

On Fri, May 25, 2018 at 9:59 AM Bryan Cutler <notificati...@github.com>
wrote:

> I've been thinking about this and came to the same conclusion as
> @icexelloss <https://github.com/icexelloss> here #21427 (comment)
> <https://github.com/apache/spark/pull/21427#issuecomment-392070950> that
> we could really support both names and position, and fix this without
> changing behavior.
>
> When the user defines as grouped map udf, the StructType has field names
> so if the returned DataFrame has column names they should match. If the
> user returned a DataFrame with positional columns only, pandas will name
> the columns with an integer index (not an integer string). We could change
> the logic to do the following:
>
> Assign columns by name, catching a KeyError exception
> If the column names are all integers, then fallback to assign by position
> Else raise the KeyError (most likely the user has a typo in the column 
name)
>
> I think that will solve this issue and not change the behavior, but I
> would need check that this will hold for different pandas versions. How
> does that sound?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21427#issuecomment-392119306>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPMCqb9uccM8coTBel1PxwCReedS4ks5t2DiCgaJpZM4UM2oZ>
> .
>



---




[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

2018-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21427
  
On the config part, I haven’t looked at the code but can’t we just 
reorder
the columns on the JVM side? Why do we need to reorder them on the Python
side?

On Fri, May 25, 2018 at 12:31 AM Hyukjin Kwon <notificati...@github.com>
wrote:

> I believe it was just a mistake to correct - we forget this to mark it
> experimental. It's pretty unstable and many JIRAs are being open.
> @BryanCutler <https://github.com/BryanCutler> mind if I ask to go ahead
> if you find some time? if you are busy will do it by myself.
>
> cc @vanzin <https://github.com/vanzin> FYI.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21427#issuecomment-391967423>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPD8iBRMXvmS7vVSIidwnZxK1BaQ4ks5t17NlgaJpZM4UM2oZ>
> .
>



---




[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

2018-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21427
  
I agree it should have started as experimental. It is pretty weird to mark
something experimental after the fact, though.

On Fri, May 25, 2018 at 12:23 AM Hyukjin Kwon <notificati...@github.com>
wrote:

> BTW, what do you think about adding a blocker to set this feature as
> experimental @rxin <https://github.com/rxin>? I think it's pretty new
> feature and it should be reasonable to call it experimental.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21427#issuecomment-391965470>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPI2-nftoelNAPqgn19vurlYolkG8ks5t17FjgaJpZM4UM2oZ>
> .
>



---




[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

2018-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21427
  
Why is it difficult?

On Fri, May 25, 2018 at 12:03 AM Hyukjin Kwon <notificati...@github.com>
wrote:

> but as I said it's difficult to have a configuration there. Shall we just
> target 3.0.0 and mark this as experimental as I suggested in the first
> place? That should be the safest way.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21427#issuecomment-391961189>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPJ-ym2CEM9e_hHJxlvOwTlE-UADIks5t16yxgaJpZM4UM2oZ>
> .
>



---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r190803873
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
  name | Bob
 """
 if isinstance(truncate, bool) and truncate:
-print(self._jdf.showString(n, 20, vertical))
+print(self._jdf.showString(n, 20, vertical, False))
 else:
-print(self._jdf.showString(n, int(truncate), vertical))
+print(self._jdf.showString(n, int(truncate), vertical, False))
--- End diff --

use named arguments for boolean flags




---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r190803855
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
  name | Bob
 """
 if isinstance(truncate, bool) and truncate:
-print(self._jdf.showString(n, 20, vertical))
+print(self._jdf.showString(n, 20, vertical, False))
--- End diff --

use named arguments for boolean flags


---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r190803772
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.sql.repl.eagerEval.enabled
+  false
+  
+Enable eager evaluation or not. If true and the REPL you are using supports eager evaluation,
+the dataframe will be run automatically and an HTML table will render the queries the user has defined
+(see SPARK-24215 at https://issues.apache.org/jira/browse/SPARK-24215 for more details).
+  
+
+
+  spark.sql.repl.eagerEval.showRows
+  20
+  
+Default number of rows in HTML table.
+  
+
+
+  spark.sql.repl.eagerEval.truncate
--- End diff --

Maybe he wants to follow what dataframe.show does, which truncates each cell to a 
set number of characters. That's useful for console output, but not so much for 
notebooks.
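
For reference, a sketch of the existing cell-truncation behavior in the Scala API (`df` is any existing DataFrame; the numbers are illustrative):

```scala
// Dataset.show(numRows, truncate): strings longer than `truncate` characters are
// cut off and cells are right-aligned -- handy in a console, less so in an HTML table.
df.show(20, 10)   // 20 rows, each cell truncated to 10 characters
df.show(20, 0)    // truncate = 0 disables truncation
```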



---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r190803641
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.sql.repl.eagerEval.enabled
+  false
+  
+Enable eager evaluation or not. If true and the REPL you are using supports eager evaluation,
+the dataframe will be run automatically and an HTML table will render the queries the user has defined
+(see SPARK-24215 at https://issues.apache.org/jira/browse/SPARK-24215 for more details).
+  
+
+
+  spark.sql.repl.eagerEval.showRows
--- End diff --

maxNumRows


---




[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...

2018-05-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21427
  
If this has been released you can't just change it like this; it will break 
users' programs immediately. At the very least introduce a flag so it can be 
set by the user to avoid breaking their code.
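
A rough sketch of the kind of escape hatch meant here; the config name below is hypothetical and purely for illustration:

```scala
// Hypothetical flag name -- not an existing Spark config. The idea is that users
// whose grouped-map UDFs rely on positional columns could switch the old
// behavior back on instead of having their jobs break on upgrade.
spark.conf.set("spark.sql.execution.pandas.groupedMap.assignColumnsByName", "false")
```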




---




[GitHub] spark issue #21242: [SPARK-23657][SQL] Document and expose the internal data...

2018-05-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21242
  
Thanks Ryan. I'm not a fan of just exposing internal classes like this. The 
APIs haven't really been designed or audited for the purpose of external 
consumption. If we want to expose the internal APIs, we should revisit their 
APIs to make sure they are good, and potentially narrow down the exposure.



---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189669772
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
--- End diff --

btw the config flag isn't jupyter specific.



---




[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21370
  
Can we also do something a bit more generic that works for non-Jupyter 
notebooks as well? For example, in IPython or just plain Python REPL.



---




[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...

2018-05-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21329
  
Why are we cleaning up stuff like this?



---




[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-05-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21192
  
My point is that I don't consider a sequence of chars an array to begin 
with. It is not natural to me.

I'd want an array if it is a different set of separators.



---




[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-05-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21192
  
Eh, I actually think the separated form makes it much simpler to look at, compared 
with an array. Why complicate the API and require users to understand how to 
specify an array (in all languages)?

One question I have is whether we'd want to support multiple separators, where 
each separator can be a multi-character sequence as well. In that case an array might 
make more sense to specify the multiple separators, and each separator is just 
space-delimited for chars.
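
For context, a sketch of how the existing single-string form of the option reads today (path and separator are illustrative):

```scala
// Read records separated by CRLF instead of the default newline, assuming the
// lineSep option already supported by the text reader in this branch.
val lines = spark.read
  .option("lineSep", "\r\n")
  .text("/data/logs.txt")
```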


---




[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...

2018-05-15 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21318
  
It's still going to fail because I haven't updated it yet. Will do tomorrow.



---




[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...

2018-05-14 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21316#discussion_r188104204
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

Why would we want to deprecate it?
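
For context, reduce is an eager action on Dataset, which is why wrapping it in withNewExecutionId (as in the diff above) makes it show up like other actions; a minimal usage sketch:

```scala
import spark.implicits._

// Triggers a job immediately and returns the sum 1 + 2 + ... + 100 = 5050.
val total = spark.range(1, 101).as[Long].reduce(_ + _)
```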


---




[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...

2018-05-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21318
  
Hm the failure doesn't look like it's caused by this PR. Do you guys know 
what's going on?



---




[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...

2018-05-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21318
  
cc @gatorsmile @HyukjinKwon 


---




[GitHub] spark pull request #21318: [minor] Update docs for functions.scala to make i...

2018-05-13 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/21318

[minor] Update docs for functions.scala to make it clear not all the 
built-in functions are defined there

The title summarizes the change.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark functions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21318.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21318


commit 83c191fbbe82bf49c81a860f4f1ebde7a4076f00
Author: Reynold Xin <rxin@...>
Date:   2018-05-14T05:15:56Z

[minor] Update docs for functions.scala




---




[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...

2018-05-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21316#discussion_r187838099
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

cc @zsxwing 


---




[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...

2018-05-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21309
  
Better compile time error. Plus a lot of people are already using these.

On Fri, May 11, 2018 at 7:35 PM Hyukjin Kwon <notificati...@github.com>
wrote:

> Yup, then why not just deprecate other functions in other APIs for 3.0.0,
> and promote the usage of expr?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21309#issuecomment-388524092>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPNbOEidl-IwkRFVW0kVpVjEPKoOgks5txkpdgaJpZM4T8LX4>
> .
>



---




[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...

2018-05-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21309
  
Adding it to sql would allow it to be available everywhere (through expr)
right?

On Fri, May 11, 2018 at 7:30 PM Hyukjin Kwon <notificati...@github.com>
wrote:

> Thing is, I am a bit confused when to add it to other APIs. I thought if
> it's expected to be less commonly used, it shouldn't be added at the first
> place. We have UDFs.
>
> I have been a bit confused of some functions specifically not added into
> other APIs. If that's worth being added in an API, I thought it makes 
sense
> to add it to other APIs too. Is there a reason to add them to SQL side
> specifically?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21309#issuecomment-388523839>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPJx8IcRSIpAHmk2APbxDMm4wf4E8ks5txkkngaJpZM4T8LX4>
> .
>



---




[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...

2018-05-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21309
  
Btw it's always been the case that the less commonly used functions are not
part of this file. There is just a lot of overhead to maintaining all of
them.

I’m not even sure if the regr_* expressions should be added in the first
place.

On Fri, May 11, 2018 at 7:20 PM Hyukjin Kwon <notificati...@github.com>
wrote:

    > @rxin <https://github.com/rxin>, how about splitting up this file by the
> group or something, or deprecating all the functions that can be called 
via
> expr for 3.0.0? To me, it looked a bit odd when some functions exist and
> some did not. It was an actual use case and I had to check which function
> exists or not every time.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21309#issuecomment-388523458>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPKznGyNtcF57sol08PGgzbhth-4_ks5txkcKgaJpZM4T8LX4>
> .
>



---




[GitHub] spark issue #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21054
  
There is not a single function that can’t be called by expr. It mainly 
adds
some type safety.

On Fri, May 11, 2018 at 7:18 PM Hyukjin Kwon <notificati...@github.com>
wrote:

> *@HyukjinKwon* commented on this pull request.
> --
>
> In sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> <https://github.com/apache/spark/pull/21054#discussion_r187761743>:
>
> > @@ -775,6 +775,178 @@ object functions {
> */
>def var_pop(columnName: String): Column = var_pop(Column(columnName))
>
> +  /**
> +   * Aggregate function: returns the number of non-null pairs.
> +   *
> +   * @group agg_funcs
> +   * @since 2.4.0
> +   */
> +  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
>
> @rxin <https://github.com/rxin>, how about splitting up this file by the
> group or something, or deprecating all the functions that can be called 
via
> expr for 3.0.0? To me, it looked a bit odd when some functions exist and
> some did not. It was an actual use case and I had to check which function
> exists or not every time.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21054#discussion_r187761743>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPMuMFZtp285MrttmJfITKM6WS0pcks5txkZ0gaJpZM4TSBOu>
> .
>
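
A minimal sketch of the point above: any function registered on the SQL side stays reachable from the typed API through expr, so a dedicated wrapper in functions.scala is optional. This assumes a DataFrame df with columns g, y and x, and that the regr_count expression is registered on the SQL side as in #21054:

```scala
import org.apache.spark.sql.functions.{col, expr}

// No entry in functions.scala is required to call the SQL function.
val agg = df.groupBy(col("g")).agg(expr("regr_count(y, x)").as("n"))
```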



---




[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...

2018-05-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21309
  
cc @gatorsmile @mgaido91 


---




[GitHub] spark pull request #21309: [SPARK-23907] Removes regr_* functions in functio...

2018-05-11 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/21309

[SPARK-23907] Removes regr_* functions in functions.scala

## What changes were proposed in this pull request?
This patch removes the various regr_* functions in functions.scala. They 
are so uncommon that I don't think they deserve real estate in functions.scala. 
We can consider adding them later if more users need them.

## How was this patch tested?
Removed the associated test case as well.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-23907

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21309


commit ce2c305169d90c4d7803338d85d2d4c92a8e1d3c
Author: Reynold Xin <rxin@...>
Date:   2018-05-11T23:24:15Z

[SPARK-23907] Removes regr_* functions in functions.scala




---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r187751801
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -775,6 +775,178 @@ object functions {
*/
   def var_pop(columnName: String): Column = var_pop(Column(columnName))
 
+  /**
+   * Aggregate function: returns the number of non-null pairs.
+   *
+   * @group agg_funcs
+   * @since 2.4.0
+   */
+  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
--- End diff --

do we need all of these? seems like users can just invoke expr to do them. 
this file is getting very long.



---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-05-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21121
  
@lokm01 wouldn't @ueshin's suggestion on adding a second parameter to 
transform work for you? You can just do something similar to `transform(x, 
(entry, index) -> struct(entry, index))`. Perhaps zip_with_index is just an 
alias for that.
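
A sketch of the suggested spelling, assuming the two-argument (element, index) form of the transform higher-order function discussed for 2.4:

```scala
// Pairs each array element with its index -- essentially what a
// zip_with_index builtin would return.
spark.sql("SELECT transform(array('a', 'b', 'c'), (x, i) -> struct(x, i)) AS zipped").show(false)
```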



---




[GitHub] spark pull request #21187: [SPARK-24035][SQL] SQL syntax for Pivot

2018-04-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21187#discussion_r185084802
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/PivotSuite.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
--- End diff --

can we use the infra for SQLQueryTestSuite?



---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184596334
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema.
--- End diff --

what's a nested empty schema?
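
One way to read "nested empty schema" (my assumption, for illustration): a schema whose field is itself a struct with zero fields, e.g.:

```scala
import org.apache.spark.sql.types._

// The top-level field "a" exists, but its type is a struct containing no fields.
val nestedEmpty = StructType(Seq(StructField("a", StructType(Nil))))
```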



---




[GitHub] spark issue #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer

2018-04-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20560
  
Just saw this - this seems like a somewhat awkward way to do it by just 
matching on filter / project. Is the main thing lacking a way to do back 
propagation of properties? (We can only do forward propagation of properties at 
the moment, so we can't eliminate a subtree's sort based on the parent's sort.)
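
For illustration, the shape of query the rule targets (table and column names are hypothetical): the inner sort's ordering is immediately overwritten by the outer one, so removing it safely requires reasoning from the parent down to the child.

```scala
// The ORDER BY inside the subquery is wasted work once the outer ORDER BY runs.
val df = spark.sql("""
  SELECT * FROM (SELECT * FROM events ORDER BY ts) t
  ORDER BY user_id
""")
```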


---




[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark

2018-04-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21071
  
@devaraj-kavali can you close this PR first?

Looks like there isn't any reason to really use htrace anymore ...



---




[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
@kiszk do you have more data now?



---




[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
OK thanks please do that. Does TPC-DS even trigger 2 call sites? E.g. 
ByteArrayMemoryBlock and OnHeapMemoryBlock. Even there it might introduce a 
conditional branch after JIT that could lead to perf degradation.

I also really worry about off-heap mode, in which all three callsites can 
exist and lead to massive degradation.



---




[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
Sorry this thread is too long for me to follow. I might be bringing up a 
point that has been brought up before.

@kiszk did your perf tests take into account megamorphic callsites? It 
seems to me, from a quick cursory look, that the benchmark result might not be 
accurate for real workloads if there is only one implementation of 
MemoryBlock loaded.
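
A toy sketch of the concern (not Spark's actual MemoryBlock API): with a single hot implementation the call is monomorphic and gets inlined; once two or three subclasses hit the same callsite, as in off-heap mode, it can go bi-/megamorphic and lose the inlining the benchmark measured.

```scala
abstract class Block { def get(i: Int): Long }
final class OnHeap(a: Array[Long]) extends Block { def get(i: Int): Long = a(i) }
final class Wrapped(bytes: Array[Byte]) extends Block { def get(i: Int): Long = bytes(i).toLong }

// If only OnHeap instances ever reach this loop, the JIT can devirtualize and
// inline b.get; mixing Block subclasses here changes the callsite's profile.
def sum(b: Block, n: Int): Long = {
  var s = 0L
  var i = 0
  while (i < n) { s += b.get(i); i += 1 }
  s
}
```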



---




[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...

2018-04-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19881
  
Thanks @jcuquemelle 


---




[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark

2018-04-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21071
  
This probably deserves its own SPIP. Also unclear whether we should just 
support htrace, or have an extension api so users can plug in whatever they 
want.



---




[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

2018-04-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21060
  
It looks to me like this is a bug fix that merits backporting, as 
QueryExecutionListener is also marked as experimental.

In this case, I think @gatorsmile is worried one might have written a 
listener that enumerates the possible function names, and that listener will 
fail now with a new action name. I feel this is quite unlikely, but I also 
appreciate @gatorsmile's concern for backward compatibility, and I've certainly 
been wrong before when our fixes break existing workloads.

(On the spectrum of being extremely conservative to extremely liberal, I 
think I'm in general more on the middle, whereas @gatorsmile probably leans 
more to the conservative side. There isn't really anything wrong with this, and 
it's good to have balancing forces in a project.)

How about this, @HyukjinKwon -- for the 2.3.x backport, add a config so it is 
possible to turn this off in production, in case somebody actually has 
their job fail because of this? It's a small delta from what this PR already 
does, and that should alleviate the concerns @gatorsmile has. I'd also change 
the function doc for onSuccess/onFailure to make it clear that we will add new 
function names in the future, and users shouldn't expect a fixed list of 
function names.
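
To make the doc suggestion concrete, a sketch of a listener written against that contract, handling the action names it knows and ignoring ones added later (class name and handling are illustrative):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

class TolerantListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    funcName match {
      case "collect" | "count" => // actions this listener cares about
      case _                   => // silently ignore actions added in newer releases
    }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

spark.listenerManager.register(new TolerantListener)
```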






---




[GitHub] spark issue #20992: [SPARK-23779][SQL] TaskMemoryManager and UnsafeSorter re...

2018-04-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20992
  
What are the performance improvements? Without additional data this seems 
like just an invasive change without any real benefits ...



---




[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function

2018-04-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21031
  
If there is already size, why do we need to create a new implementation? 
Why can't we just rewrite cardinality to size? 

Also I wouldn't add any programming API for this, since there is already 
size.
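
For reference, size already covers both the SQL and the typed surface, so cardinality could simply resolve to it rather than get a second implementation (a sketch; a df with an array column "arr" is assumed):

```scala
import org.apache.spark.sql.functions.{col, size}

spark.sql("SELECT size(array(1, 2, 3)) AS n").show()   // prints 3
df.select(size(col("arr")))                            // same function via the typed API
```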



---




[GitHub] spark pull request #21056: [SPARK-23849][SQL] Tests for samplingRatio of jso...

2018-04-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21056#discussion_r181530121
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -2128,38 +2128,60 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 }
   }
 
-  test("SPARK-23849: schema inferring touches less data if samplingRation 
< 1.0") {
-val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+  val sampledTestData = (row: Row) => {
+val value = row.getLong(0)
+val predefinedSample = Set[Long](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
   57, 62, 68, 72)
-withTempPath { path =>
-  val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
-StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
-  for (i <- 0 until 100) {
-if (predefinedSample.contains(i)) {
-  writer.write(s"""{"f1":${i.toString}}""" + "\n")
-} else {
-  writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
-}
-  }
-  writer.close()
+if (predefinedSample.contains(value)) {
+  s"""{"f1":${value.toString}}"""
+} else {
+  s"""{"f1":${(value.toDouble + 0.1).toString}}"""
+}
+  }
 
-  val ds = spark.read.option("samplingRatio", 
0.1).json(path.getCanonicalPath)
+  test("SPARK-23849: schema inferring touches less data if samplingRatio < 
1.0") {
+// Set default values for the DataSource parameters to make sure
+// that whole test file is mapped to only one partition. This will 
guarantee
+// reliable sampling of the input file.
+withSQLConf(
+  "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
+  "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
+)(withTempPath { path =>
+  val rdd = spark.sqlContext.range(0, 100, 1, 1).map(sampledTestData)
+  rdd.write.text(path.getAbsolutePath)
+
+  val ds = spark.read
+.option("inferSchema", true)
+.option("samplingRatio", 0.1)
+.json(path.getCanonicalPath)
   assert(ds.schema == new StructType().add("f1", LongType))
-}
+})
   }
 
-  test("SPARK-23849: usage of samplingRation while parsing of dataset of 
strings") {
-val dstr = spark.sparkContext.parallelize(0 until 100, 1).map { i =>
-  val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
-57, 62, 68, 72)
-  if (predefinedSample.contains(i)) {
-s"""{"f1":${i.toString}}""" + "\n"
-  } else {
-s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n"
-  }
-}.toDS()
-val ds = spark.read.option("samplingRatio", 0.1).json(dstr)
+  test("SPARK-23849: usage of samplingRatio while parsing a dataset of 
strings") {
+val rdd = spark.sqlContext.range(0, 100, 1, 1).map(sampledTestData)
+val ds = spark.read
+  .option("inferSchema", true)
+  .option("samplingRatio", 0.1)
+  .json(rdd)
 
 assert(ds.schema == new StructType().add("f1", LongType))
   }
+
+  test("SPARK-23849: samplingRatio is out of the range (0, 1.0]") {
+val dstr = spark.sparkContext.parallelize(0 until 100, 1).map(_.toString).toDS()
--- End diff --

can you just use spark.range?
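
A sketch of the suggested simplification, assuming the suite's spark.implicits._ are in scope for the String encoder:

```scala
// spark.range already gives a single-partition Dataset here, so
// parallelize + toDS is unnecessary.
val dstr = spark.range(0, 100, 1, 1).map(_.toString)
```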


---




[GitHub] spark pull request #21053: [SPARK-23924][SQL] Add element_at function

2018-04-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21053#discussion_r181529978
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -413,6 +413,78 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
 )
   }
 
+  test("element at function") {
--- End diff --

also the function is element_at, not "element at" ...



---




[GitHub] spark pull request #21053: [SPARK-23924][SQL] Add element_at function

2018-04-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21053#discussion_r181529901
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -413,6 +413,78 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
 )
   }
 
+  test("element at function") {
--- End diff --

Why do we need so many test cases here? This is just to verify the API 
works end to end.


---




[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20933#discussion_r181529318
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcDataSourceV2.scala
 ---
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.orc
+
+import java.net.URI
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.orc.{OrcConf, OrcFile}
+import org.apache.orc.mapred.OrcStruct
+import org.apache.orc.mapreduce.OrcInputFormat
+
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, JoinedRow}
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.execution.datasources._
+import 
org.apache.spark.sql.execution.datasources.orc.{OrcColumnarBatchReader, 
OrcDeserializer, OrcFilters, OrcUtils}
+import 
org.apache.spark.sql.execution.datasources.v2.ColumnarBatchFileSourceReader
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
ReadSupport, ReadSupportWithSchema}
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.{AtomicType, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+class OrcDataSourceV2 extends DataSourceV2 with ReadSupport with 
ReadSupportWithSchema {
+  override def createReader(options: DataSourceOptions): DataSourceReader 
= {
+new OrcDataSourceReader(options, None)
+  }
+
+  override def createReader(schema: StructType, options: 
DataSourceOptions): DataSourceReader = {
+new OrcDataSourceReader(options, Some(schema))
+  }
+}
+
+case class OrcDataSourceReader(options: DataSourceOptions, 
userSpecifiedSchema: Option[StructType])
+  extends ColumnarBatchFileSourceReader
+  with SupportsPushDownCatalystFilters {
+
+  override def inferSchema(files: Seq[FileStatus]): Option[StructType] = {
+OrcUtils.readSchema(sparkSession, files)
+  }
+
+  private var pushedFiltersArray: Array[Expression] = Array.empty
+
+  override def readFunction: PartitionedFile => Iterator[InternalRow] = {
--- End diff --

Btw I think it's also OK if we know what we want in the final version, and 
the intermediate change tries to minimize code changes (I haven't looked at the 
PR at all, so don't interpret this comment as endorsing or not endorsing the PR 
design).


---




[1/2] spark-website git commit: Update text/wording to more "modern" Spark and more consistent.

2018-04-12 Thread rxin
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 91b561749 -> 658467248


http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/strata-exercises-now-available-online.html
--
diff --git a/site/news/strata-exercises-now-available-online.html 
b/site/news/strata-exercises-now-available-online.html
index 916f242..4f250a3 100644
--- a/site/news/strata-exercises-now-available-online.html
+++ b/site/news/strata-exercises-now-available-online.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/submit-talks-to-spark-summit-2014.html
--
diff --git a/site/news/submit-talks-to-spark-summit-2014.html 
b/site/news/submit-talks-to-spark-summit-2014.html
index 4f43c23..18f2642 100644
--- a/site/news/submit-talks-to-spark-summit-2014.html
+++ b/site/news/submit-talks-to-spark-summit-2014.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/submit-talks-to-spark-summit-2016.html
--
diff --git a/site/news/submit-talks-to-spark-summit-2016.html 
b/site/news/submit-talks-to-spark-summit-2016.html
index 3163bab..3766932 100644
--- a/site/news/submit-talks-to-spark-summit-2016.html
+++ b/site/news/submit-talks-to-spark-summit-2016.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/submit-talks-to-spark-summit-east-2016.html
--
diff --git a/site/news/submit-talks-to-spark-summit-east-2016.html 
b/site/news/submit-talks-to-spark-summit-east-2016.html
index 1984db7..b4a51a7 100644
--- a/site/news/submit-talks-to-spark-summit-east-2016.html
+++ b/site/news/submit-talks-to-spark-summit-east-2016.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/submit-talks-to-spark-summit-eu-2016.html
--
diff --git a/site/news/submit-talks-to-spark-summit-eu-2016.html 
b/site/news/submit-talks-to-spark-summit-eu-2016.html
index 8e33a17..940bc6f 100644
--- a/site/news/submit-talks-to-spark-summit-eu-2016.html
+++ b/site/news/submit-talks-to-spark-summit-eu-2016.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/two-weeks-to-spark-summit-2014.html
--
diff --git a/site/news/two-weeks-to-spark-summit-2014.html 
b/site/news/two-weeks-to-spark-summit-2014.html
index 3863298..d4e993a 100644
--- a/site/news/two-weeks-to-spark-summit-2014.html
+++ b/site/news/two-weeks-to-spark-summit-2014.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/news/video-from-first-spark-development-meetup.html
--
diff --git a/site/news/video-from-first-spark-development-meetup.html 
b/site/news/video-from-first-spark-development-meetup.html
index 2be7f50..04151a8 100644
--- a/site/news/video-from-first-spark-development-meetup.html
+++ b/site/news/video-from-first-spark-development-meetup.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/powered-by.html
--
diff --git a/site/powered-by.html b/site/powered-by.html
index 3449782..b303df0 100644
--- a/site/powered-by.html
+++ b/site/powered-by.html
@@ -66,7 +66,7 @@
   
   
-  Lightning-fast cluster computing
+  Lightning-fast unified analytics engine
   
 
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/65846724/site/release-process.html
--
diff --git a/site/release-process.html b/site/release-process.html
index 

[2/2] spark-website git commit: Update text/wording to more "modern" Spark and more consistent.

2018-04-12 Thread rxin
Update text/wording to more "modern" Spark and more consistent.

1. Use DataFrame examples.

2. Reduce explicit comparison with MapReduce, since the topic does not really 
come up.

3. More focus on analytics rather than "cluster compute".

4. Update committer affiliation.

5. Make it more clear Spark runs in diverse environments (especially on MLlib 
page).

There is a lot that needs to be done that I don't have time for today, e.g. referring 
to Structured Streaming.


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/65846724
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/65846724
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/65846724

Branch: refs/heads/asf-site
Commit: 658467248b278b109bc3d2594b0ef08ff0c727cb
Parents: 91b5617
Author: Reynold Xin 
Authored: Thu Apr 12 12:56:05 2018 -0700
Committer: Reynold Xin 
Committed: Thu Apr 12 12:56:05 2018 -0700

--
 _layouts/global.html|   2 +-
 committers.md   |  22 +-
 index.md|  34 +--
 mllib/index.md  |  18 +-
 site/committers.html|  24 +-
 site/community.html |   2 +-
 site/contributing.html  |   2 +-
 site/developer-tools.html   |   2 +-
 site/documentation.html |   2 +-
 site/downloads.html |   2 +-
 site/examples.html  |   2 +-
 site/faq.html   |   2 +-
 site/history.html   |   2 +-
 site/improvement-proposals.html |   2 +-
 site/index.html |  36 +--
 site/mailing-lists.html |   4 +-
 site/mllib/index.html   |  18 +-
 site/news/amp-camp-2013-registration-ope.html   |   2 +-
 .../news/announcing-the-first-spark-summit.html |   2 +-
 .../news/fourth-spark-screencast-published.html |   2 +-
 site/news/index.html|   2 +-
 site/news/nsdi-paper.html   |   2 +-
 site/news/one-month-to-spark-summit-2015.html   |   2 +-
 .../proposals-open-for-spark-summit-east.html   |   2 +-
 ...registration-open-for-spark-summit-east.html |   2 +-
 .../news/run-spark-and-shark-on-amazon-emr.html |   2 +-
 site/news/spark-0-6-1-and-0-5-2-released.html   |   2 +-
 site/news/spark-0-6-2-released.html |   2 +-
 site/news/spark-0-7-0-released.html |   2 +-
 site/news/spark-0-7-2-released.html |   2 +-
 site/news/spark-0-7-3-released.html |   2 +-
 site/news/spark-0-8-0-released.html |   2 +-
 site/news/spark-0-8-1-released.html |   2 +-
 site/news/spark-0-9-0-released.html |   2 +-
 site/news/spark-0-9-1-released.html |   2 +-
 site/news/spark-0-9-2-released.html |   2 +-
 site/news/spark-1-0-0-released.html |   2 +-
 site/news/spark-1-0-1-released.html |   2 +-
 site/news/spark-1-0-2-released.html |   2 +-
 site/news/spark-1-1-0-released.html |   2 +-
 site/news/spark-1-1-1-released.html |   2 +-
 site/news/spark-1-2-0-released.html |   2 +-
 site/news/spark-1-2-1-released.html |   2 +-
 site/news/spark-1-2-2-released.html |   2 +-
 site/news/spark-1-3-0-released.html |   2 +-
 site/news/spark-1-4-0-released.html |   2 +-
 site/news/spark-1-4-1-released.html |   2 +-
 site/news/spark-1-5-0-released.html |   2 +-
 site/news/spark-1-5-1-released.html |   2 +-
 site/news/spark-1-5-2-released.html |   2 +-
 site/news/spark-1-6-0-released.html |   2 +-
 site/news/spark-1-6-1-released.html |   2 +-
 site/news/spark-1-6-2-released.html |   2 +-
 site/news/spark-1-6-3-released.html |   2 +-
 site/news/spark-2-0-0-released.html |   2 +-
 site/news/spark-2-0-1-released.html |   2 +-
 site/news/spark-2-0-2-released.html |   2 +-
 site/news/spark-2-1-0-released.html |   2 +-
 site/news/spark-2-1-1-released.html |   2 +-
 site/news/spark-2-1-2-released.html |   2 +-
 site/news/spark-2-2-0-released.html |   2 +-
 site/news/spark-2-2-1-released.html |   2 +-
 site/news/spark-2-3-0-released.html |   2 +-
 site/news/spark-2.0.0-preview.html  |   2 +-
 .../spark-accepted-into-apache-incubator.html   |   2 +-
 site/news/spark-and-shark-in-the-news.html  |   2 +-
 site/news/spark-becomes-tlp.html|   2 +-
 

[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...

2018-04-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19881
  
I thought about this more, and I actually think something like this makes 
more sense: `executorAllocationRatio`. Basically it is just a ratio that 
determines how aggressively we want Spark to request the full number of executors. 
A ratio of 1.0 means fill up everything; a ratio of 0.5 means only request half of the executors.
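
As a rough sketch of the idea (illustrative names and numbers, not the actual 
ExecutorAllocationManager code; the config key is the proposed name):

```scala
import org.apache.spark.SparkConf

// Minimal sketch: scale the executor count needed for full parallelism by the ratio.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.task.cpus", "1")
  .set("spark.dynamicAllocation.executorAllocationRatio", "0.5")  // proposed config name

val pendingTasks = 100                                            // hypothetical task backlog
val tasksPerExecutor = conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)
val fullTarget = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt  // executors for full parallelism
val ratio = conf.getDouble("spark.dynamicAllocation.executorAllocationRatio", 1.0)
val target = math.ceil(fullTarget * ratio).toInt  // ratio 1.0 -> 25 executors, 0.5 -> 13
```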

What do you think?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...

2018-04-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19881
  
SGTM on divisor.

Do we need "full" there in the config?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20045: [Spark-22360][SQL][TEST] Add unit tests for Window Speci...

2018-04-09 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20045
  
Can we add them to the file based test suites instead?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...

2018-04-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19881
  
Maybe instead of "divisor", we just have a "rate" or "factor" that can be 
floating point value, and use multiplication rather than division? This way 
people can also make it even more aggressive.
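
For example (a rough sketch with illustrative numbers only):

```scala
// With a divisor, 100 needed executors and divisor 2 -> request 50; users can only scale down.
val withDivisor  = math.ceil(100.0 / 2).toInt     // 50
// With a multiplicative factor, 0.5 gives the same result, but 1.5 can over-provision if desired.
val conservative = math.ceil(100.0 * 0.5).toInt   // 50
val aggressive   = math.ceil(100.0 * 1.5).toInt   // 150
```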



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...

2018-04-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20937
  
Seems fine to me ...



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20959
  
I'm good with having this option given the data @MaxGekk posted. (I haven't 
reviewed the code - somebody else should do that before merging).

`val sampledSchema = spark.read.option("inferSchema", 
true).csv(ds.sample(false, 0.7)).schema` is a bit clunky compared with an 
option that applies to all the sources.
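
For comparison, a rough sketch of what the option form would look like (assuming 
the option is named `samplingRatio`, as in the PR title, and an existing SparkSession `spark`; 
the path is hypothetical):

```scala
// Sketch only: infer the CSV schema from roughly 70% of the rows via an option.
val sampledSchema = spark.read
  .option("inferSchema", true)
  .option("samplingRatio", 0.7)     // proposed option
  .csv("/path/to/data.csv")         // hypothetical path
  .schema
```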



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...

2018-03-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19881
  
Can you wait another day? I just find the name pretty weird. Do we have
other configs that use the “divisor” suffix?

On Wed, Mar 28, 2018 at 7:23 AM Tom Graves <notificati...@github.com> wrote:

> I'll leave this a bit longer but then I'm going to merge it later today
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/19881#issuecomment-376905017>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPOFekjRxMQwLNeHMCtxZt92Fv3YGks5ti5z8gaJpZM4Q1Frd>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...

2018-03-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20877
  
We can also change both if they haven’t been released yet.

On Sun, Mar 25, 2018 at 10:37 AM Maxim Gekk <notificati...@github.com>
wrote:

> @gatorsmile <https://github.com/gatorsmile> The PR has been already
> submitted: #20885 <https://github.com/apache/spark/pull/20885> . Frankly
> speaking I would prefer another name for the option like we discussed
> before: MaxGekk#1 <https://github.com/MaxGekk/spark-1/pull/1> but similar
> PR for text datasource had been merged already: #20727
> <https://github.com/apache/spark/pull/20727> . And I think it is more
> important to have the same option across all datasource. That's why I
> renamed *recordDelimiter* to *lineSep* in #20885
    > <https://github.com/apache/spark/pull/20885> / cc @rxin
> <https://github.com/rxin>
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/20877#issuecomment-375988424>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPKz5R1mF_QZcR0qPO-OBRoGZ3vIEks5th9XQgaJpZM4S2jpk>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20731: [SPARK-23579][Documentation] Added context model image a...

2018-03-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20731
  
Yea we gotta be careful with adding commercial vendor logos here. It's part 
of the complexity we need to navigate being hosted at the Apache Software 
Foundation. The project needs to be very vendor neutral.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20774: [SPARK-23549][SQL] Cast to timestamp when compari...

2018-03-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20774#discussion_r175335072
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -479,6 +479,15 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
 
+  val HIVE_COMPARE_DATE_TIMESTAMP_IN_TIMESTAMP =
+buildConf("spark.sql.hive.compareDateTimestampInTimestamp")
+  .doc("When true (default), compare Date with Timestamp after 
converting both sides to " +
+"Timestamp. This behavior is compatible with Hive 2.2 or later. 
See HIVE-15236. " +
+"When false, restore the behavior prior to Spark 2.4. Compare Date 
with Timestamp after " +
+"converting both sides to string.")
+.booleanConf
--- End diff --

perhaps mention this config will be removed in spark 3.0.

(on a related note we should look at those configs for backward 
compatibility and consider removing them in 3.0)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20774: [SPARK-23549][SQL] Cast to timestamp when compari...

2018-03-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20774#discussion_r175334948
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql ---
@@ -39,3 +43,4 @@ select 2.0 <= '2.2';
 select 0.5 <= '1.5';
 select to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
 select to_date('2009-07-30 04:17:52') <= '2009-07-30 04:17:52';
+select to_date('2017-03-01') <= to_timestamp('2017-03-01 00:00:01');
--- End diff --

+1 it is really confusing to look at the diff


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[2/2] spark-website git commit: Squashed commit of the following:

2018-03-16 Thread rxin
Squashed commit of the following:

commit 8e2dd71cf5613be6f019bb76b46226771422a40e
Merge: 8bd24fb6d 01f0b4e0c
Author: Reynold Xin 
Date:   Fri Mar 16 10:24:54 2018 -0700

Merge pull request #104 from mateiz/history

Add a project history page

commit 01f0b4e0c1fe77781850cf994058980664201bce
Author: Matei Zaharia 
Date:   Wed Mar 14 23:29:01 2018 -0700

Add a project history page


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/a1d84bcb
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/a1d84bcb
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/a1d84bcb

Branch: refs/heads/asf-site
Commit: a1d84bcbf53099be51c39914528bea3f4e2735a0
Parents: 8bd24fb
Author: Reynold Xin 
Authored: Fri Mar 16 10:26:14 2018 -0700
Committer: Reynold Xin 
Committed: Fri Mar 16 10:26:14 2018 -0700

--
 _layouts/global.html|   1 +
 community.md|  24 +-
 history.md  |  29 +++
 index.md|  16 +-
 site/committers.html|   1 +
 site/community.html |  24 +-
 site/contributing.html  |   1 +
 site/developer-tools.html   |   1 +
 site/documentation.html |   1 +
 site/downloads.html |   1 +
 site/examples.html  |   1 +
 site/faq.html   |   1 +
 site/graphx/index.html  |   1 +
 site/history.html   | 235 +++
 site/improvement-proposals.html |   1 +
 site/index.html |  17 +-
 site/mailing-lists.html |   1 +
 site/mllib/index.html   |   1 +
 site/news/amp-camp-2013-registration-ope.html   |   1 +
 .../news/announcing-the-first-spark-summit.html |   1 +
 .../news/fourth-spark-screencast-published.html |   1 +
 site/news/index.html|   1 +
 site/news/nsdi-paper.html   |   1 +
 site/news/one-month-to-spark-summit-2015.html   |   1 +
 .../proposals-open-for-spark-summit-east.html   |   1 +
 ...registration-open-for-spark-summit-east.html |   1 +
 .../news/run-spark-and-shark-on-amazon-emr.html |   1 +
 site/news/spark-0-6-1-and-0-5-2-released.html   |   1 +
 site/news/spark-0-6-2-released.html |   1 +
 site/news/spark-0-7-0-released.html |   1 +
 site/news/spark-0-7-2-released.html |   1 +
 site/news/spark-0-7-3-released.html |   1 +
 site/news/spark-0-8-0-released.html |   1 +
 site/news/spark-0-8-1-released.html |   1 +
 site/news/spark-0-9-0-released.html |   1 +
 site/news/spark-0-9-1-released.html |   1 +
 site/news/spark-0-9-2-released.html |   1 +
 site/news/spark-1-0-0-released.html |   1 +
 site/news/spark-1-0-1-released.html |   1 +
 site/news/spark-1-0-2-released.html |   1 +
 site/news/spark-1-1-0-released.html |   1 +
 site/news/spark-1-1-1-released.html |   1 +
 site/news/spark-1-2-0-released.html |   1 +
 site/news/spark-1-2-1-released.html |   1 +
 site/news/spark-1-2-2-released.html |   1 +
 site/news/spark-1-3-0-released.html |   1 +
 site/news/spark-1-4-0-released.html |   1 +
 site/news/spark-1-4-1-released.html |   1 +
 site/news/spark-1-5-0-released.html |   1 +
 site/news/spark-1-5-1-released.html |   1 +
 site/news/spark-1-5-2-released.html |   1 +
 site/news/spark-1-6-0-released.html |   1 +
 site/news/spark-1-6-1-released.html |   1 +
 site/news/spark-1-6-2-released.html |   1 +
 site/news/spark-1-6-3-released.html |   1 +
 site/news/spark-2-0-0-released.html |   1 +
 site/news/spark-2-0-1-released.html |   1 +
 site/news/spark-2-0-2-released.html |   1 +
 site/news/spark-2-1-0-released.html |   1 +
 site/news/spark-2-1-1-released.html |   1 +
 site/news/spark-2-1-2-released.html |   1 +
 site/news/spark-2-2-0-released.html |   1 +
 site/news/spark-2-2-1-released.html |   1 +
 site/news/spark-2-3-0-released.html |   1 +
 site/news/spark-2.0.0-preview.html  |   1 +
 .../spark-accepted-into-apache-incubator.html   |   1 +
 site/news/spark-and-shark-in-the-news.html  |   1 +
 site/news/spark-becomes-tlp.html|   1 +
 

[1/2] spark-website git commit: Squashed commit of the following:

2018-03-16 Thread rxin
Repository: spark-website
Updated Branches:
  refs/heads/asf-site 8bd24fb6d -> a1d84bcbf


http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2016-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2016-agenda-posted.html 
b/site/news/spark-summit-june-2016-agenda-posted.html
index ce68829..7947354 100644
--- a/site/news/spark-summit-june-2016-agenda-posted.html
+++ b/site/news/spark-summit-june-2016-agenda-posted.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2017-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2017-agenda-posted.html 
b/site/news/spark-summit-june-2017-agenda-posted.html
index 5d2df4b..e4055c3 100644
--- a/site/news/spark-summit-june-2017-agenda-posted.html
+++ b/site/news/spark-summit-june-2017-agenda-posted.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2018-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2018-agenda-posted.html 
b/site/news/spark-summit-june-2018-agenda-posted.html
index 17c284f..9b2f739 100644
--- a/site/news/spark-summit-june-2018-agenda-posted.html
+++ b/site/news/spark-summit-june-2018-agenda-posted.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-tips-from-quantifind.html
--
diff --git a/site/news/spark-tips-from-quantifind.html 
b/site/news/spark-tips-from-quantifind.html
index bfbac1d..00c71c2 100644
--- a/site/news/spark-tips-from-quantifind.html
+++ b/site/news/spark-tips-from-quantifind.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-user-survey-and-powered-by-page.html
--
diff --git a/site/news/spark-user-survey-and-powered-by-page.html 
b/site/news/spark-user-survey-and-powered-by-page.html
index 67935a9..c015e5c 100644
--- a/site/news/spark-user-survey-and-powered-by-page.html
+++ b/site/news/spark-user-survey-and-powered-by-page.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-version-0-6-0-released.html
--
diff --git a/site/news/spark-version-0-6-0-released.html 
b/site/news/spark-version-0-6-0-released.html
index 3f670d7..d9120b0 100644
--- a/site/news/spark-version-0-6-0-released.html
+++ b/site/news/spark-version-0-6-0-released.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-cloudsort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-cloudsort-100tb-benchmark.html 
b/site/news/spark-wins-cloudsort-100tb-benchmark.html
index b498034..8bef605 100644
--- a/site/news/spark-wins-cloudsort-100tb-benchmark.html
+++ b/site/news/spark-wins-cloudsort-100tb-benchmark.html
@@ -123,6 +123,7 @@
   https://issues.apache.org/jira/browse/SPARK;>Issue 
Tracker
   Powered By
   Project Committers
+  Project History
 
   
   

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html 
b/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
index 18646f4..32f53e9 100644
--- 

[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset

2018-03-14 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20800
  
So the API looks useful, but I don't know if this is the right 
implementation. How important is it to add this? It seems like the value is not 
super high either.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet

2018-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20800#discussion_r174016939
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
*/
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
--- End diff --

Dataset


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20674: [SPARK-23465][SQL] Introduce new function to rename colu...

2018-03-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20674
  
I personally wouldn't include this since it's a simple function users can 
write ...
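
For illustration, the kind of helper users can write themselves (a sketch; column 
names are hypothetical and an existing SparkSession `spark` is assumed):

```scala
// Rename several columns by folding over the existing single-column API.
val df = spark.range(3).selectExpr("id as old_a", "id * 2 as old_b")
val renames = Map("old_a" -> "a", "old_b" -> "b")
val renamed = renames.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }
renamed.printSchema()   // columns are now a, b
```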



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20706: [SPARK-23550][core] Cleanup `Utils`.

2018-03-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20706#discussion_r171666996
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -267,44 +264,20 @@ private[spark] object Utils extends Logging {
 }
   }
 
-  /**
-   * JDK equivalent of `chmod 700 file`.
-   *
-   * @param file the file whose permissions will be modified
-   * @return true if the permissions were successfully changed, false 
otherwise.
-   */
-  def chmod700(file: File): Boolean = {
-file.setReadable(false, false) &&
-file.setReadable(true, true) &&
-file.setWritable(false, false) &&
-file.setWritable(true, true) &&
-file.setExecutable(false, false) &&
-file.setExecutable(true, true)
-  }
-
   /**
* Create a directory inside the given parent directory. The directory 
is guaranteed to be
* newly created, and is not marked for automatic deletion.
*/
   def createDirectory(root: String, namePrefix: String = "spark"): File = {
-var attempts = 0
-val maxAttempts = MAX_DIR_CREATION_ATTEMPTS
-var dir: File = null
-while (dir == null) {
-  attempts += 1
-  if (attempts > maxAttempts) {
-throw new IOException("Failed to create a temp directory (under " 
+ root + ") after " +
-  maxAttempts + " attempts!")
-  }
-  try {
-dir = new File(root, namePrefix + "-" + UUID.randomUUID.toString)
-if (dir.exists() || !dir.mkdirs()) {
-  dir = null
-}
-  } catch { case e: SecurityException => dir = null; }
+val prefix = namePrefix + "-"
--- End diff --

was there a reason for rewriting this?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fall back to non-Arr...

2018-02-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20567
  
A quick bit: fallback is a single word. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...

2018-02-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20490#discussion_r167137165
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
 ---
@@ -62,6 +62,16 @@
*/
   DataWriterFactory createWriterFactory();
 
+  /**
+   * Returns whether Spark should use the commit coordinator to ensure 
that only one attempt for
--- End diff --

This is actually not a guarantee, is it?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....

2018-02-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20499
  
I'd fix this in 2.3, and 2.2.1 as well.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...

2018-02-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20535#discussion_r166701501
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java 
---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. 
This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach 
its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source 
should or should not have.
+ * Instead Spark defines some standard options that data sources can 
optionally adopt. It's possible
+ * that some options are very common and many data sources use them. 
However different data
+ * sources may define the common options(key and meaning) differently, 
which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ * 
+ *   
+ * Option key
+ * Option value
+ *   
+ *   
+ * path
+ * A comma separated paths string of the data files/directories, 
like
+ * path1,/absolute/file2,path3/*. Each path can either be 
relative or absolute,
+ * points to either file or directory, and can contain wildcards. This 
option is commonly used
+ * by file-based data sources.
+ *   
+ *   
+ * table
+ * A table name string representing the table name directly 
without any interpretation.
--- End diff --

what do you mean by "without any interpretation"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20491: [SQL] Minor doc update: Add an example in DataFrameReade...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20491
  
This should also go into branch-2.3.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20491: [SQL] Minor doc update: Add an example in DataFra...

2018-02-02 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/20491

[SQL] Minor doc update: Add an example in DataFrameReader.schema

## What changes were proposed in this pull request?
This patch adds a small example to the schema-string form of the schema 
function. It isn't obvious how to use it, so an example would be useful.
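
For instance, the DDL-formatted string form looks roughly like this (a sketch; 
assumes an existing SparkSession `spark`, and the file path is hypothetical):

```scala
// Define the schema as a DDL string instead of building a StructType by hand.
val people = spark.read
  .schema("name STRING, age INT")   // DDL-formatted schema string
  .csv("/path/to/people.csv")       // hypothetical path
```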

## How was this patch tested?
N/A - doc only.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark schema-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20491.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20491


commit 69193dbd64e9e0002abd9a8cd6fe60c1c87bc471
Author: Reynold Xin <rxin@...>
Date:   2018-02-02T23:00:39Z

[SQL] Minor doc update: Add an example in DataFrameReader.schema

commit e5e5e0b44e22f58736dd27e5c048395670574f18
Author: Reynold Xin <rxin@...>
Date:   2018-02-02T23:02:26Z

fix typo




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16793
  
Also the implementation doesn't match what was proposed in 
https://issues.apache.org/jira/browse/SPARK-19454

Having null value as the default in a function called replace is too risky 
and error prone.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...

2018-02-02 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16793
  
Sorry, I object to this change. Why would we put null as the default replace 
value, in a function called replace? That seems very counterintuitive and error 
prone.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection

2018-01-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20219
  
But it is possible to generate NullType data right?
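
For example, a quick sketch of one way NullType data can show up (assumes an 
existing SparkSession `spark`):

```scala
import org.apache.spark.sql.functions.lit

// An untyped null literal has NullType, so the resulting column's data type is NullType.
val df = spark.range(1).select(lit(null).as("c"))
df.printSchema()   // c: null (nullable = true)
```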


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20152: [SPARK-22957] ApproxQuantile breaks if the number of row...

2018-01-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20152
  
cc @gatorsmile @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...

2018-01-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20072#discussion_r159573530
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
+"org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
--- End diff --

shouldn't we call this something like compressionFactor?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2017-12-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20076
  
Thanks for the PR. Why are we complicating the PR by doing the rename? Does 
this actually gain anything other than minor cosmetic changes? It makes the 
simple PR pretty long ...



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-22648][K8S] Spark on Kubernetes - Documentation

2017-12-21 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 7beb375bf -> 7ab165b70


[SPARK-22648][K8S] Spark on Kubernetes - Documentation

What changes were proposed in this pull request?

This PR contains documentation on the usage of Kubernetes scheduler in Spark 
2.3, and a shell script to make it easier to build docker images required to 
use the integration. The changes detailed here are covered by 
https://github.com/apache/spark/pull/19717 and 
https://github.com/apache/spark/pull/19468 which have merged already.

How was this patch tested?
The script has been in use for releases on our fork. Rest is documentation.

cc rxin mateiz (shepherd)
k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 
erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko
reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. 
(https://github.com/apache/spark/pull/20007)
- [x] Change references to docker to instead say "container" 
(https://github.com/apache/spark/pull/19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of 
int (#20032)

Author: foxish <ramanath...@google.com>

Closes #19946 from foxish/update-k8s-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ab165b7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ab165b7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ab165b7

Branch: refs/heads/master
Commit: 7ab165b7061d9acc26523227076056e94354d204
Parents: 7beb375
Author: foxish <ramanath...@google.com>
Authored: Thu Dec 21 17:21:11 2017 -0800
Committer: Reynold Xin <r...@databricks.com>
Committed: Thu Dec 21 17:21:11 2017 -0800

--
 docs/_layouts/global.html|   1 +
 docs/building-spark.md   |   6 +-
 docs/cluster-overview.md |   7 +-
 docs/configuration.md|   2 +
 docs/img/k8s-cluster-mode.png| Bin 0 -> 55538 bytes
 docs/index.md|   3 +-
 docs/running-on-kubernetes.md| 578 ++
 docs/running-on-yarn.md  |   4 +-
 docs/submitting-applications.md  |  16 +
 sbin/build-push-docker-images.sh |  68 
 10 files changed, 677 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 67b05ec..e5af5ae 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -99,6 +99,7 @@
 Spark 
Standalone
 Mesos
 YARN
+Kubernetes
 
 
 

http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 98f7df1..c391255 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the
 to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
-Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
-Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
 
 This will build Spark distribution along with Python pip and R packages. For 
more information on usage, run `./dev/make-distribution.sh --help`
 
@@ -90,6 +90,10 @@ like ZooKeeper and Hadoop itself.
 ## Building with Mesos support
 
 ./build/mvn -Pmesos -DskipTests clean package
+
+## Building with Kubernetes support
+
+./build/mvn -Pkubernetes -DskipTests clean package
 
 ## Building with Kafka 0.8 support
 

http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/cluster-overview.md
--
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index c42bb4b..658e67f 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -52,11 +52,8 @@ The system currently supports three cluster managers:
 * [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can 
also run Hadoop MapReduce
   and service applications.
 * [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.
-* [Kubernetes (experimental)](https://github.com/apac

[GitHub] spark issue #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documentation

2017-12-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19946
  
Merging in master.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19973: [SPARK-22779] FallbackConfigEntry's default value...

2017-12-21 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/19973


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...

2017-12-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19946#discussion_r158205893
  
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by 
the
 to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --

Yea I don't think you need to block this pr with this.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...

2017-12-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/20014
  
Overall change lgtm.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryErr...

2017-12-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20014#discussion_r157673852
  
--- Diff: 
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.memory;
+
+/**
+ * This exception is thrown when a task can not acquire memory from the 
Memory manager.
+ * Instead of throwing {@link OutOfMemoryError}, which kills the executor,
+ * we should use throw this exception, which will just kill the current 
task.
+ */
+public final class SparkOutOfMemoryError extends OutOfMemoryError {
--- End diff --

is this an internal class? if yes perhaps we should label it.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
@vanzin you got a min to submit a patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - D...

2017-12-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19946#discussion_r156821519
  
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by 
the
 to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+./dev/make-distribution.sh --name custom-spark --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --

should we use k8s? I kept bringing this up and that's because I can never 
spell Kubernetes properly. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
That's what the "default" is, isn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
The issue is in

```
  /**
   * Return the `string` value of Spark SQL configuration property for the 
given key. If the key is
   * not set yet, return `defaultValue`.
   */
  def getConfString(key: String, defaultValue: String): String = {
if (defaultValue != null && defaultValue != "") {
  val entry = sqlConfEntries.get(key)
  if (entry != null) {
// Only verify configs in the SQLConf object
entry.valueConverter(defaultValue)
  }
}
Option(settings.get(key)).getOrElse(defaultValue)
  }
```

The value converter gets applied to this generated string, which is not a 
real value, and will fail.
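
A minimal sketch of the failure mode (hypothetical key and converter, not the 
real config objects):

```scala
// A fallback conf reports a human-readable placeholder instead of a concrete value.
val defaultValueString = "<value of spark.buffer.size>"
val valueConverter: String => Int = _.toInt   // converter of the conf being looked up

// getConfString runs the converter on that placeholder for validation,
// so this throws NumberFormatException instead of returning the fallback's value.
valueConverter(defaultValueString)
```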


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19973: [SPARK-22779] ConfigEntry's default value should ...

2017-12-13 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/19973

[SPARK-22779] ConfigEntry's default value should actually be a value

## What changes were proposed in this pull request?
ConfigEntry's default value right now shows a human-readable message. In 
some places in SQL we actually rely on the default value being a real value when setting 
values.

## How was this patch tested?
Tested manually.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-22779

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19973


commit 385c300c14a382654c2a1f94ccd2813487dbe26a
Author: Reynold Xin <r...@databricks.com>
Date:   2017-12-13T22:43:55Z

[SPARK-22779] ConfigEntry's default value should actually be a value




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...

2017-12-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19973
  
cc @vanzin @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693977
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
+val mapping = cs.getConfigMapping.asScala
+val validOptions = cs.getValidOptions
+require(validOptions != null, "The valid options list cann't be null.")
--- End diff --

double n


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r155693966
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that 
matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that 
should be propagated to
+   * the data source.
+   */
+  def withSessionConfig(
+  cs: ConfigSupport,
+  source: String,
+  conf: SQLConf): immutable.Map[String, String] = {
+val prefixes = cs.getConfigPrefixes
+require(prefixes != null, "The config key-prefixes cann't be null.")
--- End diff --

double n


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19905: [SPARK-22710] ConfigBuilder.fallbackConf should trigger ...

2017-12-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19905
  
cc @vanzin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


