[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21482 How is this done in other databases? I don't think we want to invent new ways on these basic primitives.
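[Editor's note: for readers wondering what the builtin would buy — without it, an infinity test on a double column has to be spelled out by hand. A minimal sketch in the Scala DataFrame API, assuming a DataFrame `df` with a double column `x` (names are illustrative):]

    import org.apache.spark.sql.functions.col

    // Hand-rolled equivalent of the proposed isinf(x): true for +Infinity
    // and -Infinity, false for NaN and ordinary doubles, null for nulls.
    val flagged = df.withColumn("x_is_inf",
      col("x") === Double.PositiveInfinity || col("x") === Double.NegativeInfinity)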
[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21482#discussion_r193230476

--- Diff: R/pkg/NAMESPACE ---
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
               "initcap",
               "input_file_name",
               "instr",
+              "isInf",
+              "isinf",
--- End diff --

the functions are case insensitive so i don't think we need both?
[GitHub] spark issue #21448: [SPARK-24408][SQL][DOC] Move abs, bitwiseNOT, isnan, nan...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21448 I'd only move abs and nothing else.
[GitHub] spark issue #21459: [SPARK-24420][Build] Upgrade ASM to 6.1 to support JDK9+
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21459 What's driving this (is it Java 9)? I'm generally scared of core library updates like this. Maybe Spark 3.0 is a good time (and we should just do it this year).
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, add to whitelist.
[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21453 Jenkins, test this please.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21416 LGTM (I didn't look that carefully though)
[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21416#discussion_r191306678

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -392,9 +396,97 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
     val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")

-    intercept[AnalysisException] {
+    val e = intercept[AnalysisException] {
       df2.filter($"a".isin($"b"))
     }
+    Seq("cannot resolve", "due to data type mismatch: Arguments must be same type but were")
+      .foreach { s =>
+        assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+      }
+  }
+
+  test("isInCollection: Scala Collection") {
+    val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
+    checkAnswer(df.filter($"a".isInCollection(Seq(1, 2))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 2))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 1))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+    // Auto casting should work with mixture of different types in collections
+    checkAnswer(df.filter($"a".isInCollection(Seq(1.toShort, "2"))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq("3", 2.toLong))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, "1"))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+    checkAnswer(df.filter($"b".isInCollection(Seq("y", "x"))),
+      df.collect().toSeq.filter(r => r.getString(1) == "y" || r.getString(1) == "x"))
+    checkAnswer(df.filter($"b".isInCollection(Seq("z", "x"))),
+      df.collect().toSeq.filter(r => r.getString(1) == "z" || r.getString(1) == "x"))
+    checkAnswer(df.filter($"b".isInCollection(Seq("z", "y"))),
+      df.collect().toSeq.filter(r => r.getString(1) == "z" || r.getString(1) == "y"))
+
+    // Test with different types of collections
+    checkAnswer(df.filter($"a".isInCollection(Seq(1, 2).toSet)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 2).toArray)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 1).toList)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+    val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
+
+    val e = intercept[AnalysisException] {
+      df2.filter($"a".isInCollection(Seq($"b")))
+    }
+    Seq("cannot resolve", "due to data type mismatch: Arguments must be same type but were")
+      .foreach { s =>
+        assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+      }
+  }
+
+  test("isInCollection: Java Collection") {
+    val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
--- End diff --

same thing here. just run a single test case.
[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21416#discussion_r191306654

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -392,9 +396,97 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
     val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")

-    intercept[AnalysisException] {
+    val e = intercept[AnalysisException] {
       df2.filter($"a".isin($"b"))
     }
+    Seq("cannot resolve", "due to data type mismatch: Arguments must be same type but were")
+      .foreach { s =>
+        assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+      }
+  }
+
+  test("isInCollection: Scala Collection") {
--- End diff --

can we simplify the test cases? you are just testing this api as a wrapper. you don't need to run so many queries for type coercion.
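[Editor's note: for illustration, the kind of trimmed-down test this suggestion points at might look like the following — a sketch, not the code that was ultimately merged:]

    test("isInCollection: Scala Collection") {
      val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
      // A single query is enough to show the new API delegates to isin;
      // type coercion is already exercised by the existing isin tests.
      checkAnswer(
        df.filter($"a".isInCollection(Seq(1, 2))),
        Seq(Row(1, "x"), Row(2, "y")))
    }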
[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21427

If we can fix it without breaking existing behavior that would be awesome.

On Fri, May 25, 2018 at 9:59 AM Bryan Cutler <notificati...@github.com> wrote:
> I've been thinking about this and came to the same conclusion as
> @icexelloss <https://github.com/icexelloss> here #21427 (comment)
> <https://github.com/apache/spark/pull/21427#issuecomment-392070950> that
> we could really support both names and position, and fix this without
> changing behavior.
>
> When the user defines a grouped map udf, the StructType has field names,
> so if the returned DataFrame has column names they should match. If the
> user returned a DataFrame with positional columns only, pandas will name
> the columns with an integer index (not an integer string). We could change
> the logic to do the following:
>
> 1. Assign columns by name, catching a KeyError exception.
> 2. If the column names are all integers, then fall back to assigning by position.
> 3. Else raise the KeyError (most likely the user has a typo in the column name).
>
> I think that will solve this issue and not change the behavior, but I
> would need to check that this will hold for different pandas versions. How
> does that sound?
[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21427

On the config part, I haven't looked at the code, but can't we just reorder the columns on the JVM side? Why do we need to reorder them on the Python side?

On Fri, May 25, 2018 at 12:31 AM Hyukjin Kwon <notificati...@github.com> wrote:
> I believe it was just a mistake to correct - we forgot to mark this as
> experimental. It's pretty unstable and many JIRAs are being opened.
> @BryanCutler <https://github.com/BryanCutler> mind if I ask to go ahead
> if you find some time? if you are busy will do it by myself.
>
> cc @vanzin <https://github.com/vanzin> FYI.
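[Editor's note: to make the JVM-side alternative concrete — once the declared output schema is known, the frame built from the Python worker's output could in principle be realigned with a single projection. A rough sketch under that assumption; `returned` and `expected` are hypothetical names, not Spark internals:]

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Reorder the returned columns to match the field order of the
    // schema the user declared for the grouped map UDF.
    def alignToSchema(returned: DataFrame, expected: StructType): DataFrame =
      returned.select(expected.fieldNames.map(col): _*)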
[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21427

I agree it should have started experimental. It is pretty weird to mark something experimental after the fact, though.

On Fri, May 25, 2018 at 12:23 AM Hyukjin Kwon <notificati...@github.com> wrote:
> BTW, what do you think about adding a blocker to set this feature as
> experimental @rxin <https://github.com/rxin>? I think it's a pretty new
> feature and it should be reasonable to call it experimental.
[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21427

Why is it difficult?

On Fri, May 25, 2018 at 12:03 AM Hyukjin Kwon <notificati...@github.com> wrote:
> but as I said it's difficult to have a configuration there. Shall we just
> target 3.0.0 and mark this as experimental as I suggested from the first
> place? That should be the safest way.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190803873

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
         name | Bob
         """
         if isinstance(truncate, bool) and truncate:
-            print(self._jdf.showString(n, 20, vertical))
+            print(self._jdf.showString(n, 20, vertical, False))
         else:
-            print(self._jdf.showString(n, int(truncate), vertical))
+            print(self._jdf.showString(n, int(truncate), vertical, False))
--- End diff --

use named arguments for boolean flags
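[Editor's note: the style being asked for, illustrated in Scala. The Python caller above goes through Py4J, which cannot pass keyword arguments to JVM methods, so there an inline comment is the usual substitute; parameter names below are illustrative:]

    // Positional booleans are opaque at the call site:
    df.showString(20, 20, false, false)

    // Named flags document themselves:
    df.showString(numRows = 20, truncate = 20, vertical = false, html = false)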
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190803855

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
         name | Bob
         """
         if isinstance(truncate, bool) and truncate:
-            print(self._jdf.showString(n, 20, vertical))
+            print(self._jdf.showString(n, 20, vertical, False))
--- End diff --

use named arguments for boolean flags
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190803772

--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful
     from JVM to Python worker for every task.

+  spark.sql.repl.eagerEval.enabled
+  false
+
+    Enable eager evaluation or not. If true and the repl you're using supports eager evaluation,
+    the dataframe will be run automatically and an html table will feed back the queries the user has defined
+    (see SPARK-24215 at https://issues.apache.org/jira/browse/SPARK-24215 for more details).
+
+  spark.sql.repl.eagerEval.showRows
+  20
+
+    Default number of rows in HTML table.
+
+  spark.sql.repl.eagerEval.truncate
--- End diff --

maybe he wants to follow what dataframe.show does, which truncates the number of characters within a cell. That's useful for console output, but not so much for notebooks.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190803641

--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful
     from JVM to Python worker for every task.

+  spark.sql.repl.eagerEval.enabled
+  false
+
+    Enable eager evaluation or not. If true and the repl you're using supports eager evaluation,
+    the dataframe will be run automatically and an html table will feed back the queries the user has defined
+    (see SPARK-24215 at https://issues.apache.org/jira/browse/SPARK-24215 for more details).
+
+  spark.sql.repl.eagerEval.showRows
--- End diff --

maxNumRows
[GitHub] spark issue #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF should assi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21427 If this has been released you can't just change it like this; it will break users' programs immediately. At the very least introduce a flag so it can be set by the user to avoid breaking their code.
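[Editor's note: one common shape for such an escape hatch is a boolean SQL conf defaulting to the new behavior, with the old behavior reachable in production. A hedged sketch of the pattern as it might appear in SQLConf — the conf name here is made up for illustration and is not what the PR ultimately used:]

    // Inside object SQLConf (sketch):
    val PANDAS_GROUPED_MAP_ASSIGN_BY_POSITION =
      buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
        .doc("When true, columns of the DataFrame returned by a grouped map " +
          "Pandas UDF are assigned by position rather than by name, matching " +
          "the pre-fix behavior.")
        .booleanConf
        .createWithDefault(false)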
[GitHub] spark issue #21242: [SPARK-23657][SQL] Document and expose the internal data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21242 Thanks Ryan. I'm not a fan of just exposing internal classes like this. The APIs haven't really been designed or audited for the purpose of external consumption. If we want to expose the internal APIs, we should revisit their APIs to make sure they are good, and potentially narrow down the exposure.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189669772

--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful
     from JVM to Python worker for every task.

+  spark.jupyter.eagerEval.enabled
--- End diff --

btw the config flag isn't jupyter specific.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21370 Can we also do something a bit more generic that works for non-Jupyter notebooks as well? For example, in IPython or just plain Python REPL.
[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21329 Why are we cleaning up stuff like this?
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21192 my point is that i don't consider a sequence of chars an array to begin with. it is not natural to me. I'd want an array if it is a different set of separators.
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21192 eh I actually think space-separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)? One question I have is whether we'd want to support multiple separators, where each separator can itself be a multi-character sequence. In that case an array might make more sense to specify the multiple separators, and each separator is just space delimited for chars.
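[Editor's note: to ground the comparison, here is roughly how the two candidate encodings would read from the Scala API. Both forms are illustrative only — the PR had not settled on a format:]

    // A single separator, possibly several characters long, passed as-is:
    spark.read.option("lineSep", "\r\n").text(path)

    // A hypothetical array form -- only worth the extra ceremony if we ever
    // support multiple distinct separators at once:
    spark.read.option("lineSep", """["|||", "\r\n"]""").text(path)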
[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21318 It's still going to fail because I haven't updated it yet. Will do tomorrow.
[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21316#discussion_r188104204

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

Why would we want to deprecate it?
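[Editor's note: for context, the wrapper in the diff is what ties an action to a SQL execution ID, so that runs of `reduce` show up in the SQL tab of the UI and are reported to QueryExecutionListener. A small usage sketch:]

    import spark.implicits._

    // Before this change, the reduce below executed as a bare RDD job with
    // no SQL execution attached; with withNewExecutionId it is tracked like
    // collect() and count().
    val total = spark.range(1, 101).as[Long].reduce(_ + _)  // 5050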
[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21318 Hm the failure doesn't look like it's caused by this PR. Do you guys know what's going on?
[GitHub] spark issue #21318: [minor] Update docs for functions.scala to make it clear...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21318 cc @gatorsmile @HyukjinKwon
[GitHub] spark pull request #21318: [minor] Update docs for functions.scala to make i...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/21318

[minor] Update docs for functions.scala to make it clear not all the built-in functions are defined there

The title summarizes the change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark functions

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21318.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21318

commit 83c191fbbe82bf49c81a860f4f1ebde7a4076f00
Author: Reynold Xin <rxin@...>
Date: 2018-05-14T05:15:56Z

    [minor] Update docs for functions.scala
[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21316#discussion_r187838099

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

cc @zsxwing
[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21309

Better compile-time errors. Plus a lot of people are already using these.

On Fri, May 11, 2018 at 7:35 PM Hyukjin Kwon <notificati...@github.com> wrote:
> Yup, then why not just deprecate other functions in other APIs for 3.0.0,
> and promote the usage of expr?
[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21309

Adding it to sql would allow it to be available everywhere (through expr) right?

On Fri, May 11, 2018 at 7:30 PM Hyukjin Kwon <notificati...@github.com> wrote:
> Thing is, I am a bit confused about when to add it to other APIs. I thought if
> it's expected to be less commonly used, it shouldn't be added in the first
> place. We have UDFs.
>
> I have been a bit confused by some functions specifically not being added into
> other APIs. If something is worth being added in an API, I thought it makes sense
> to add it to other APIs too. Is there a reason to add them to the SQL side
> specifically?
[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21309

Btw it's always been the case that the less commonly used functions are not part of this file. There is just a lot of overhead to maintaining all of them. I'm not even sure the regr_* expressions should have been added in the first place.

On Fri, May 11, 2018 at 7:20 PM Hyukjin Kwon <notificati...@github.com> wrote:
> @rxin <https://github.com/rxin>, how about splitting up this file by
> group or something, or deprecating all the functions that can be called via
> expr for 3.0.0? To me, it looks a bit odd that some functions exist and
> some don't. It was an actual use case and I had to check which function
> exists or not every time.
[GitHub] spark issue #21054: [SPARK-23907][SQL] Add regr_* functions
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21054

There is not a single function that can't be called by expr. The dedicated function mainly adds some type safety.

On Fri, May 11, 2018 at 7:18 PM Hyukjin Kwon <notificati...@github.com> wrote:
> @HyukjinKwon commented on this pull request.
>
> In sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> <https://github.com/apache/spark/pull/21054#discussion_r187761743>:
>
> @@ -775,6 +775,178 @@ object functions {
>     */
>    def var_pop(columnName: String): Column = var_pop(Column(columnName))
>
> +  /**
> +   * Aggregate function: returns the number of non-null pairs.
> +   *
> +   * @group agg_funcs
> +   * @since 2.4.0
> +   */
> +  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
>
> @rxin <https://github.com/rxin>, how about splitting up this file by
> group or something, or deprecating all the functions that can be called via
> expr for 3.0.0? To me, it looks a bit odd that some functions exist and
> some don't. It was an actual use case and I had to check which function
> exists or not every time.
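[Editor's note: the trade-off in concrete terms, assuming the `regr_count` wrapper from this PR (which #21309 later removed):]

    import org.apache.spark.sql.functions._

    // Always available, but mistakes only surface at analysis time:
    df.select(expr("regr_count(y, x)"))

    // The dedicated wrapper is checked by the compiler -- a misspelled
    // function name or wrong arity fails at compile time instead:
    df.select(regr_count(col("y"), col("x")))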
[GitHub] spark issue #21309: [SPARK-23907] Removes regr_* functions in functions.scal...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21309 cc @gatorsmile @mgaido91
[GitHub] spark pull request #21309: [SPARK-23907] Removes regr_* functions in functio...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/21309

[SPARK-23907] Removes regr_* functions in functions.scala

## What changes were proposed in this pull request?

This patch removes the various regr_* functions in functions.scala. They are so uncommon that I don't think they deserve real estate in functions.scala. We can consider adding them later if more users need them.

## How was this patch tested?

Removed the associated test case as well.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-23907

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21309.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21309

commit ce2c305169d90c4d7803338d85d2d4c92a8e1d3c
Author: Reynold Xin <rxin@...>
Date: 2018-05-11T23:24:15Z

    [SPARK-23907] Removes regr_* functions in functions.scala
[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21054#discussion_r187751801

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -775,6 +775,178 @@ object functions {
    */
   def var_pop(columnName: String): Column = var_pop(Column(columnName))

+  /**
+   * Aggregate function: returns the number of non-null pairs.
+   *
+   * @group agg_funcs
+   * @since 2.4.0
+   */
+  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
--- End diff --

do we need all of these? seems like users can just invoke expr to do them. this file is getting very long.
[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21121 @lokm01 wouldn't @ueshin's suggestion on adding a second parameter to transform work for you? You can just do something similar to `transform(x, (entry, index) -> struct(entry, index))`. Perhaps zip_with_index is just an alias for that.
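[Editor's note: spelled out via `selectExpr`, assuming an array column `x` and the proposed two-argument lambda form of `transform`:]

    // Pairs each array element with its index -- which is everything
    // zip_with_index would have provided.
    df.selectExpr("transform(x, (entry, i) -> struct(entry, i)) AS indexed")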
[GitHub] spark pull request #21187: [SPARK-24035][SQL] SQL syntax for Pivot
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21187#discussion_r185084802

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/PivotSuite.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
--- End diff --

can we use the infra for SQLQueryTestSuite?
[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21169#discussion_r184596334

--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
-  - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
-  - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, a column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
-  - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
-  - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promoting both sides to TIMESTAMP. Setting `spark.sql.hive.compareDateTimestampInTimestamp` to `false` restores the previous behavior. This option will be removed in Spark 3.0.
-  - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
-  - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, a column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
--- End diff --

what's a nested empty schema?
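[Editor's note: to answer the question inline — an empty schema is a struct type with no fields, and a nested empty schema is one whose fields bottom out in field-less structs. In Scala:]

    import org.apache.spark.sql.types._

    val emptySchema = StructType(Nil)           // no columns at all
    val nestedEmpty = StructType(Seq(           // one column, itself a field-less struct
      StructField("a", StructType(Nil))))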
[GitHub] spark issue #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20560 Just saw this - this seems like a somewhat awkward way to do it by just matching on filter / project. Is the main thing lacking a way to do back propagation for properties? (We can only do forward propagation at the moment on properties so we can't eliminate subtree's sort based on the parent's sort).
[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21071 @devaraj-kavali can you close this PR first? Looks like there isn't any reason to really use htrace anymore ...
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 @kiszk do you have more data now?
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 OK thanks please do that. Does TPC-DS even trigger 2 call sites? E.g. ByteArrayMemoryBlock and OnHeapMemoryBlock. Even there it might introduce a conditional branch after JIT that could lead to perf degradation. I also really worry about off-heap mode, in which all three callsites can exist and lead to massive degradation.
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 Sorry this thread is too long for me to follow. I might be bringing up a point that has been brought up before. @kiszk did your perf tests take into account megamorphic callsites? It seems to me from a quick cursory look that the benchmark result might not be accurate for real workloads if only one implementation of MemoryBlock is loaded.
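[Editor's note: a self-contained illustration of the concern; the names are made up, not Spark's actual classes. A virtual call site that only ever sees one receiver type can be inlined by the JIT, but once two or three implementations are observed it degrades to polymorphic and then megamorphic dispatch — which a single-implementation microbenchmark never measures:]

    trait MemBlock { def getLong(i: Int): Long }

    final class OnHeapBlock(a: Array[Long]) extends MemBlock {
      def getLong(i: Int): Long = a(i)
    }
    final class ByteArrayBlock(a: Array[Byte]) extends MemBlock {
      def getLong(i: Int): Long = a(i).toLong
    }
    final class OffHeapBlock(base: Long) extends MemBlock {
      def getLong(i: Int): Long = base + i  // stand-in for an Unsafe read
    }

    // If a benchmark only ever passes OnHeapBlock here, the getLong call site
    // stays monomorphic and inlines. Mix all three classes at runtime (as
    // off-heap mode would) and the same loop pays megamorphic dispatch costs.
    def sum(b: MemBlock, n: Int): Long = {
      var s = 0L
      var i = 0
      while (i < n) { s += b.getLong(i); i += 1 }
      s
    }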
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19881 Thanks @jcuquemelle
[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21071 This probably deserves its own SPIP. Also unclear whether we should just support htrace, or have an extension api so users can plug in whatever they want.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21060

It looks to me this is a bug fix that can merit backporting, as QueryExecutionListener is also marked as experimental. In this case, I think @gatorsmile is worried one might have written a listener that enumerates the possible function names, and that listener will fail now with a new action name. I feel this is quite unlikely, but I also appreciate @gatorsmile's concern for backward compatibility, and I've certainly been wrong before when our fixes break existing workloads. (On the spectrum of being extremely conservative to extremely liberal, I think I'm in general more in the middle, whereas @gatorsmile probably leans more to the conservative side. There isn't really anything wrong with this, and it's good to have balancing forces in a project.)

How about this, @HyukjinKwon -- for the 2.3.x backport, add a config so it is possible to turn this off in production, if somebody actually has their job fail because of this? It's a small delta from what this PR already does, and that should alleviate the concerns @gatorsmile has. I'd also change the function doc for onSuccess/onFailure to make it clear that we will add new function names in the future, and users shouldn't expect a fixed list of function names.
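[Editor's note: for reference, the fragile pattern being worried about looks like the sketch below; the doc change proposed at the end would tell users to always keep the catch-all branch:]

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    class TimingListener extends QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        funcName match {
          case "collect" | "count" | "save" =>
            println(s"$funcName took ${durationNs / 1e6} ms")
          // New action names may appear in future releases, so never assume
          // the set above is closed -- handle unknown names gracefully:
          case other =>
            println(s"$other took ${durationNs / 1e6} ms")
        }

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }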
[GitHub] spark issue #20992: [SPARK-23779][SQL] TaskMemoryManager and UnsafeSorter re...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20992 What are the performance improvements? Without additional data this seems like just an invasive change without any real benefits ...
[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21031 If there is already size, why do we need to create a new implementation? Why can't we just rewrite cardinality to size? Also I wouldn't add any programming API for this, since there is already size.
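[Editor's note: what "rewrite cardinality to size" amounts to — register the new name so it builds the existing Size expression instead of introducing a second implementation. A toy sketch of the aliasing idea; the real FunctionRegistry wiring is more involved:]

    import org.apache.spark.sql.catalyst.expressions.{Expression, Size}

    // Both names resolve to the same expression, so SELECT cardinality(arr)
    // and SELECT size(arr) share one implementation.
    val builders: Map[String, Seq[Expression] => Expression] = Map(
      "size"        -> (args => Size(args.head)),
      "cardinality" -> (args => Size(args.head))
    )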
[GitHub] spark pull request #21056: [SPARK-23849][SQL] Tests for samplingRatio of jso...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21056#discussion_r181530121

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2128,38 +2128,60 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }

-  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
-    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+  val sampledTestData = (row: Row) => {
+    val value = row.getLong(0)
+    val predefinedSample = Set[Long](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
       57, 62, 68, 72)
-    withTempPath { path =>
-      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
-        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
-      for (i <- 0 until 100) {
-        if (predefinedSample.contains(i)) {
-          writer.write(s"""{"f1":${i.toString}}""" + "\n")
-        } else {
-          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
-        }
-      }
-      writer.close()
+    if (predefinedSample.contains(value)) {
+      s"""{"f1":${value.toString}}"""
+    } else {
+      s"""{"f1":${(value.toDouble + 0.1).toString}}"""
+    }
+  }

-      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
+  test("SPARK-23849: schema inferring touches less data if samplingRatio < 1.0") {
+    // Set default values for the DataSource parameters to make sure
+    // that whole test file is mapped to only one partition. This will guarantee
+    // reliable sampling of the input file.
+    withSQLConf(
+      "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
+      "spark.sql.files.openCostInBytes" -> (4 * 1024 * 1024).toString
+    )(withTempPath { path =>
+      val rdd = spark.sqlContext.range(0, 100, 1, 1).map(sampledTestData)
+      rdd.write.text(path.getAbsolutePath)
+
+      val ds = spark.read
+        .option("inferSchema", true)
+        .option("samplingRatio", 0.1)
+        .json(path.getCanonicalPath)
       assert(ds.schema == new StructType().add("f1", LongType))
-    }
+    })
   }

-  test("SPARK-23849: usage of samplingRation while parsing of dataset of strings") {
-    val dstr = spark.sparkContext.parallelize(0 until 100, 1).map { i =>
-      val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
-        57, 62, 68, 72)
-      if (predefinedSample.contains(i)) {
-        s"""{"f1":${i.toString}}""" + "\n"
-      } else {
-        s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n"
-      }
-    }.toDS()
-    val ds = spark.read.option("samplingRatio", 0.1).json(dstr)
+  test("SPARK-23849: usage of samplingRatio while parsing a dataset of strings") {
+    val rdd = spark.sqlContext.range(0, 100, 1, 1).map(sampledTestData)
+    val ds = spark.read
+      .option("inferSchema", true)
+      .option("samplingRatio", 0.1)
+      .json(rdd)
     assert(ds.schema == new StructType().add("f1", LongType))
   }
+
+  test("SPARK-23849: samplingRatio is out of the range (0, 1.0]") {
+    val dstr = spark.sparkContext.parallelize(0 until 100, 1).map(_.toString).toDS()
--- End diff --

can you just use spark.range?
[GitHub] spark pull request #21053: [SPARK-23924][SQL] Add element_at function
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21053#discussion_r181529978

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -413,6 +413,78 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
     )
   }

+  test("element at function") {
--- End diff --

also the function is element_at, not "element at" ...
[GitHub] spark pull request #21053: [SPARK-23924][SQL] Add element_at function
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21053#discussion_r181529901

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -413,6 +413,78 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
     )
   }

+  test("element at function") {
--- End diff --

why do we need so many test cases here? this is just to verify the api works end to end.
[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r181529318

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcDataSourceV2.scala ---
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.orc
+
+import java.net.URI
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.orc.{OrcConf, OrcFile}
+import org.apache.orc.mapred.OrcStruct
+import org.apache.orc.mapreduce.OrcInputFormat
+
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, JoinedRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.execution.datasources.orc.{OrcColumnarBatchReader, OrcDeserializer, OrcFilters, OrcUtils}
+import org.apache.spark.sql.execution.datasources.v2.ColumnarBatchFileSourceReader
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.types.{AtomicType, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+class OrcDataSourceV2 extends DataSourceV2 with ReadSupport with ReadSupportWithSchema {
+  override def createReader(options: DataSourceOptions): DataSourceReader = {
+    new OrcDataSourceReader(options, None)
+  }
+
+  override def createReader(schema: StructType, options: DataSourceOptions): DataSourceReader = {
+    new OrcDataSourceReader(options, Some(schema))
+  }
+}
+
+case class OrcDataSourceReader(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])
+  extends ColumnarBatchFileSourceReader
+  with SupportsPushDownCatalystFilters {
+
+  override def inferSchema(files: Seq[FileStatus]): Option[StructType] = {
+    OrcUtils.readSchema(sparkSession, files)
+  }
+
+  private var pushedFiltersArray: Array[Expression] = Array.empty
+
+  override def readFunction: PartitionedFile => Iterator[InternalRow] = {
--- End diff --

btw i think it's also ok if we know what we want in the final version, and the intermediate change tries to minimize code changes (i haven't looked at the pr at all so don't interpret this comment as endorsing or not endorsing the pr design)
[1/2] spark-website git commit: Update text/wording to more "modern" Spark and more consistent.
Repository: spark-website
Updated Branches: refs/heads/asf-site 91b561749 -> 658467248

[Part 1 of 2: repetitive diff hunks replacing the tagline "Lightning-fast cluster computing" with "Lightning-fast unified analytics engine" across the generated site pages, including site/news/strata-exercises-now-available-online.html, site/news/submit-talks-to-spark-summit-2014.html, site/news/submit-talks-to-spark-summit-2016.html, site/news/submit-talks-to-spark-summit-east-2016.html, site/news/submit-talks-to-spark-summit-eu-2016.html, site/news/two-weeks-to-spark-summit-2014.html, site/news/video-from-first-spark-development-meetup.html, site/powered-by.html, and site/release-process.html.]
[2/2] spark-website git commit: Update text/wording to more "modern" Spark and more consistent.
Update text/wording to more "modern" Spark and more consistent.

1. Use DataFrame examples.
2. Reduce explicit comparison with MapReduce, since the topic does not really come up.
3. More focus on analytics rather than "cluster compute".
4. Update committer affiliation.
5. Make it more clear Spark runs in diverse environments (especially on the MLlib page).

There is a lot more that needs to be done that I don't have time for today, e.g. refer to Structured Streaming.

Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/65846724
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/65846724
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/65846724
Branch: refs/heads/asf-site
Commit: 658467248b278b109bc3d2594b0ef08ff0c727cb
Parents: 91b5617
Author: Reynold Xin
Authored: Thu Apr 12 12:56:05 2018 -0700
Committer: Reynold Xin
Committed: Thu Apr 12 12:56:05 2018 -0700

[Diffstat: one- and two-line wording changes to _layouts/global.html, committers.md, index.md, mllib/index.md, and the corresponding generated pages under site/ (committers, community, contributing, developer-tools, documentation, downloads, examples, faq, history, improvement-proposals, index, mailing-lists, mllib, and the site/news/* release and event pages).]
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19881 I thought about this more, and I actually think something like this makes more sense: `executorAllocationRatio`. Basically it is just a ratio that determines how aggressively we want Spark to request executors. A ratio of 1.0 means fill up everything; a ratio of 0.5 means only request half of the executors. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
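A rough sketch of the ratio idea for readers skimming the thread; the function and variable names below are illustrative rather than Spark internals (the setting eventually landed along the lines of `spark.dynamicAllocation.executorAllocationRatio`):

```scala
// Illustrative only: scale the max-needed executor count by a ratio.
def targetExecutors(pendingTasks: Int, tasksPerExecutor: Int, ratio: Double): Int = {
  // Without the ratio, dynamic allocation would request enough executors
  // to run every pending task at once.
  val maxNeeded = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
  // ratio = 1.0 fills up everything; 0.5 requests half; floor at one executor.
  math.max(1, math.ceil(maxNeeded * ratio).toInt)
}
```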
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19881 SGTM on divisor. Do we need "full" there in the config? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20045: [Spark-22360][SQL][TEST] Add unit tests for Window Speci...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20045 Can we add them to the file based test suites instead? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19881 Maybe instead of "divisor", we just have a "rate" or "factor" that can be a floating-point value, and use multiplication rather than division? This way people can also make it even more aggressive. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
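The two semantics under discussion, as a quick sketch (all names and values here are made up for illustration):

```scala
val maxNeeded = 100  // executors needed to run all pending tasks at once
val divisor = 2.0    // divisor-style config: can only shrink the target
val factor = 0.5     // factor-style config: can shrink, or grow with values > 1.0
val targetWithDivisor = math.ceil(maxNeeded / divisor).toInt  // 50
val targetWithFactor  = math.ceil(maxNeeded * factor).toInt   // 50, but factor = 2.0 would give 200
```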
[GitHub] spark issue #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support cus...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20937 Seems fine to me ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20959 I'm good with having this option given the data @MaxGekk posted. (I haven't reviewed the code - somebody else should do that before merging). `val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema` is a bit clunky compared with an option that applies to all the sources. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
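Side by side, the two approaches look roughly like this, assuming a SparkSession `spark` and a `Dataset[String]` `ds` holding the CSV lines as in the comment; the option form follows the PR's proposal and is hedged accordingly:

```scala
// Manual workaround: sample the input yourself, infer, then reuse the schema.
val sampledSchema = spark.read
  .option("inferSchema", true)
  .csv(ds.sample(withReplacement = false, fraction = 0.7))
  .schema

// Proposed samplingRatio option: let the reader sample during inference.
val df = spark.read
  .option("inferSchema", true)
  .option("samplingRatio", 0.7)  // fraction of input used for schema inference
  .csv("/path/to/data.csv")
```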
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add a fullExecutorAllocationDivisor ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19881 Can you wait another day? I just find the name pretty weird. Do we have other configs that use the "divisor" suffix? On Wed, Mar 28, 2018 at 7:23 AM Tom Graves <notificati...@github.com> wrote: > I'll leave this a bit longer but then I'm going to merge it later today > > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/19881#issuecomment-376905017>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPOFekjRxMQwLNeHMCtxZt92Fv3YGks5ti5z8gaJpZM4Q1Frd> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20877 We can also change both if they haven't been released yet. On Sun, Mar 25, 2018 at 10:37 AM Maxim Gekk <notificati...@github.com> wrote: > @gatorsmile <https://github.com/gatorsmile> The PR has been already > submitted: #20885 <https://github.com/apache/spark/pull/20885> . Frankly > speaking I would prefer another name for the option like we discussed > before: MaxGekk#1 <https://github.com/MaxGekk/spark-1/pull/1> but a similar > PR for the text datasource had been merged already: #20727 > <https://github.com/apache/spark/pull/20727> . And I think it is more > important to have the same option across all datasources. That's why I > renamed *recordDelimiter* to *lineSep* in #20885 > <https://github.com/apache/spark/pull/20885> / cc @rxin > <https://github.com/rxin> > > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/20877#issuecomment-375988424>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPKz5R1mF_QZcR0qPO-OBRoGZ3vIEks5th9XQgaJpZM4S2jpk> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20731: [SPARK-23579][Documentation] Added context model image a...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20731 Yea we gotta be careful with adding commercial vendor logos here. It's part of the complexity we need to navigate being hosted at the Apache Software Foundation. The project needs to be very vendor neutral. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20774: [SPARK-23549][SQL] Cast to timestamp when compari...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20774#discussion_r175335072
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -479,6 +479,15 @@
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
       .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
 
+  val HIVE_COMPARE_DATE_TIMESTAMP_IN_TIMESTAMP =
+    buildConf("spark.sql.hive.compareDateTimestampInTimestamp")
+      .doc("When true (default), compare Date with Timestamp after converting both sides to " +
+        "Timestamp. This behavior is compatible with Hive 2.2 or later. See HIVE-15236. " +
+        "When false, restore the behavior prior to Spark 2.4. Compare Date with Timestamp after " +
+        "converting both sides to string.")
+      .booleanConf
--- End diff --
perhaps mention this config will be removed in spark 3.0. (on a related note we should look at those configs for backward compatibility and consider removing them in 3.0) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20774: [SPARK-23549][SQL] Cast to timestamp when compari...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20774#discussion_r175334948
--- Diff: sql/core/src/test/resources/sql-tests/inputs/predicate-functions.sql ---
@@ -39,3 +43,4 @@
 select 2.0 <= '2.2';
 select 0.5 <= '1.5';
 select to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
 select to_date('2009-07-30 04:17:52') <= '2009-07-30 04:17:52';
+select to_date('2017-03-01') <= to_timestamp('2017-03-01 00:00:01');
--- End diff --
+1 it is really confusing to look at the diff --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[2/2] spark-website git commit: Squashed commit of the following:
Squashed commit of the following:

commit 8e2dd71cf5613be6f019bb76b46226771422a40e
Merge: 8bd24fb6d 01f0b4e0c
Author: Reynold Xin
Date: Fri Mar 16 10:24:54 2018 -0700

    Merge pull request #104 from mateiz/history

    Add a project history page

commit 01f0b4e0c1fe77781850cf994058980664201bce
Author: Matei Zaharia
Date: Wed Mar 14 23:29:01 2018 -0700

    Add a project history page

Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/a1d84bcb
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/a1d84bcb
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/a1d84bcb
Branch: refs/heads/asf-site
Commit: a1d84bcbf53099be51c39914528bea3f4e2735a0
Parents: 8bd24fb
Author: Reynold Xin
Authored: Fri Mar 16 10:26:14 2018 -0700
Committer: Reynold Xin
Committed: Fri Mar 16 10:26:14 2018 -0700
--
 _layouts/global.html                            |   1 +
 community.md                                    |  24 +-
 history.md                                      |  29 +++
 index.md                                        |  16 +-
 site/committers.html                            |   1 +
 site/community.html                             |  24 +-
 site/contributing.html                          |   1 +
 site/developer-tools.html                       |   1 +
 site/documentation.html                         |   1 +
 site/downloads.html                             |   1 +
 site/examples.html                              |   1 +
 site/faq.html                                   |   1 +
 site/graphx/index.html                          |   1 +
 site/history.html                               | 235 +++
 site/improvement-proposals.html                 |   1 +
 site/index.html                                 |  17 +-
 site/mailing-lists.html                         |   1 +
 site/mllib/index.html                           |   1 +
 site/news/amp-camp-2013-registration-ope.html   |   1 +
 .../news/announcing-the-first-spark-summit.html |   1 +
 .../news/fourth-spark-screencast-published.html |   1 +
 site/news/index.html                            |   1 +
 site/news/nsdi-paper.html                       |   1 +
 site/news/one-month-to-spark-summit-2015.html   |   1 +
 .../proposals-open-for-spark-summit-east.html   |   1 +
 ...registration-open-for-spark-summit-east.html |   1 +
 .../news/run-spark-and-shark-on-amazon-emr.html |   1 +
 site/news/spark-0-6-1-and-0-5-2-released.html   |   1 +
 site/news/spark-0-6-2-released.html             |   1 +
 site/news/spark-0-7-0-released.html             |   1 +
 site/news/spark-0-7-2-released.html             |   1 +
 site/news/spark-0-7-3-released.html             |   1 +
 site/news/spark-0-8-0-released.html             |   1 +
 site/news/spark-0-8-1-released.html             |   1 +
 site/news/spark-0-9-0-released.html             |   1 +
 site/news/spark-0-9-1-released.html             |   1 +
 site/news/spark-0-9-2-released.html             |   1 +
 site/news/spark-1-0-0-released.html             |   1 +
 site/news/spark-1-0-1-released.html             |   1 +
 site/news/spark-1-0-2-released.html             |   1 +
 site/news/spark-1-1-0-released.html             |   1 +
 site/news/spark-1-1-1-released.html             |   1 +
 site/news/spark-1-2-0-released.html             |   1 +
 site/news/spark-1-2-1-released.html             |   1 +
 site/news/spark-1-2-2-released.html             |   1 +
 site/news/spark-1-3-0-released.html             |   1 +
 site/news/spark-1-4-0-released.html             |   1 +
 site/news/spark-1-4-1-released.html             |   1 +
 site/news/spark-1-5-0-released.html             |   1 +
 site/news/spark-1-5-1-released.html             |   1 +
 site/news/spark-1-5-2-released.html             |   1 +
 site/news/spark-1-6-0-released.html             |   1 +
 site/news/spark-1-6-1-released.html             |   1 +
 site/news/spark-1-6-2-released.html             |   1 +
 site/news/spark-1-6-3-released.html             |   1 +
 site/news/spark-2-0-0-released.html             |   1 +
 site/news/spark-2-0-1-released.html             |   1 +
 site/news/spark-2-0-2-released.html             |   1 +
 site/news/spark-2-1-0-released.html             |   1 +
 site/news/spark-2-1-1-released.html             |   1 +
 site/news/spark-2-1-2-released.html             |   1 +
 site/news/spark-2-2-0-released.html             |   1 +
 site/news/spark-2-2-1-released.html             |   1 +
 site/news/spark-2-3-0-released.html             |   1 +
 site/news/spark-2.0.0-preview.html              |   1 +
 .../spark-accepted-into-apache-incubator.html   |   1 +
 site/news/spark-and-shark-in-the-news.html      |   1 +
 site/news/spark-becomes-tlp.html                |   1 +
[1/2] spark-website git commit: Squashed commit of the following:
Repository: spark-website
Updated Branches: refs/heads/asf-site 8bd24fb6d -> a1d84bcbf

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2016-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2016-agenda-posted.html b/site/news/spark-summit-june-2016-agenda-posted.html
index ce68829..7947354 100644
--- a/site/news/spark-summit-june-2016-agenda-posted.html
+++ b/site/news/spark-summit-june-2016-agenda-posted.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2017-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2017-agenda-posted.html b/site/news/spark-summit-june-2017-agenda-posted.html
index 5d2df4b..e4055c3 100644
--- a/site/news/spark-summit-june-2017-agenda-posted.html
+++ b/site/news/spark-summit-june-2017-agenda-posted.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-summit-june-2018-agenda-posted.html
--
diff --git a/site/news/spark-summit-june-2018-agenda-posted.html b/site/news/spark-summit-june-2018-agenda-posted.html
index 17c284f..9b2f739 100644
--- a/site/news/spark-summit-june-2018-agenda-posted.html
+++ b/site/news/spark-summit-june-2018-agenda-posted.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-tips-from-quantifind.html
--
diff --git a/site/news/spark-tips-from-quantifind.html b/site/news/spark-tips-from-quantifind.html
index bfbac1d..00c71c2 100644
--- a/site/news/spark-tips-from-quantifind.html
+++ b/site/news/spark-tips-from-quantifind.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-user-survey-and-powered-by-page.html
--
diff --git a/site/news/spark-user-survey-and-powered-by-page.html b/site/news/spark-user-survey-and-powered-by-page.html
index 67935a9..c015e5c 100644
--- a/site/news/spark-user-survey-and-powered-by-page.html
+++ b/site/news/spark-user-survey-and-powered-by-page.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-version-0-6-0-released.html
--
diff --git a/site/news/spark-version-0-6-0-released.html b/site/news/spark-version-0-6-0-released.html
index 3f670d7..d9120b0 100644
--- a/site/news/spark-version-0-6-0-released.html
+++ b/site/news/spark-version-0-6-0-released.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-cloudsort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-cloudsort-100tb-benchmark.html b/site/news/spark-wins-cloudsort-100tb-benchmark.html
index b498034..8bef605 100644
--- a/site/news/spark-wins-cloudsort-100tb-benchmark.html
+++ b/site/news/spark-wins-cloudsort-100tb-benchmark.html
@@ -123,6 +123,7 @@
 https://issues.apache.org/jira/browse/SPARK Issue Tracker
 Powered By
 Project Committers
+Project History

http://git-wip-us.apache.org/repos/asf/spark-website/blob/a1d84bcb/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
--
diff --git a/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html b/site/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
index 18646f4..32f53e9 100644
---
[GitHub] spark issue #20800: [SPARK-23627][SQL] Provide isEmpty in Dataset
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20800 So the API looks useful, but I don't know if this is the right implementation. How important is it to add this? It seems like the value is not super high either. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
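For what it's worth, a minimal way to get the semantics without a full count, using only the public API (a sketch, not necessarily the implementation the PR ended up with):

```scala
import org.apache.spark.sql.Dataset

// Fetch at most one row; emptiness never requires scanning the whole Dataset.
def isEmpty[T](ds: Dataset[T]): Boolean = ds.limit(1).collect().isEmpty
```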
[GitHub] spark pull request #20800: [SPARK-23627][SQL] Provide isEmpty in DataSet
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20800#discussion_r174016939
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -511,6 +511,14 @@ class Dataset[T] private[sql](
    */
   def isLocal: Boolean = logicalPlan.isInstanceOf[LocalRelation]
 
+  /**
+   * Returns true if the `DataSet` is empty
--- End diff --
Dataset --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20674: [SPARK-23465][SQL] Introduce new function to rename colu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20674 I personally wouldn't include this since it's a simple function users can write ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20706: [SPARK-23550][core] Cleanup `Utils`.
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20706#discussion_r171666996
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -267,44 +264,20 @@ private[spark] object Utils extends Logging {
     }
   }
 
-  /**
-   * JDK equivalent of `chmod 700 file`.
-   *
-   * @param file the file whose permissions will be modified
-   * @return true if the permissions were successfully changed, false otherwise.
-   */
-  def chmod700(file: File): Boolean = {
-    file.setReadable(false, false) &&
-    file.setReadable(true, true) &&
-    file.setWritable(false, false) &&
-    file.setWritable(true, true) &&
-    file.setExecutable(false, false) &&
-    file.setExecutable(true, true)
-  }
-
   /**
    * Create a directory inside the given parent directory. The directory is guaranteed to be
   * newly created, and is not marked for automatic deletion.
    */
   def createDirectory(root: String, namePrefix: String = "spark"): File = {
-    var attempts = 0
-    val maxAttempts = MAX_DIR_CREATION_ATTEMPTS
-    var dir: File = null
-    while (dir == null) {
-      attempts += 1
-      if (attempts > maxAttempts) {
-        throw new IOException("Failed to create a temp directory (under " + root + ") after " +
-          maxAttempts + " attempts!")
-      }
-      try {
-        dir = new File(root, namePrefix + "-" + UUID.randomUUID.toString)
-        if (dir.exists() || !dir.mkdirs()) {
-          dir = null
-        }
-      } catch { case e: SecurityException => dir = null; }
-    }
+    val prefix = namePrefix + "-"
--- End diff --
was there a reason you're rewriting this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fall back to non-Arr...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20567 A quick bit: fallback is a single word. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167137165
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java ---
@@ -62,6 +62,16 @@
    */
   DataWriterFactory createWriterFactory();
 
+  /**
+   * Returns whether Spark should use the commit coordinator to ensure that only one attempt for
--- End diff --
This is actually not a guarantee, is it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
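The shape of the hook being reviewed, rendered as a hedged Scala sketch (the real interface is Java, and the default shown is an assumption based on the diff):

```scala
trait DataWriterFactory  // stand-in for Spark's factory type, for the sketch only

trait DataSourceWriter {
  def createWriterFactory(): DataWriterFactory

  // When true, Spark consults the driver-side commit coordinator so that at
  // most one attempt per task is authorized to commit; as the review comment
  // notes, this is best effort rather than an absolute guarantee.
  def useCommitCoordinator: Boolean = true
}
```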
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20499 I'd fix this in 2.3, and 2.2.1 as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20535#discussion_r166701501
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java ---
@@ -27,6 +27,39 @@
 /**
  * An immutable string-to-string map in which keys are case-insensitive. This is used to represent
  * data source options.
+ *
+ * Each data source implementation can define its own options and teach its users how to set them.
+ * Spark doesn't have any restrictions about what options a data source should or should not have.
+ * Instead Spark defines some standard options that data sources can optionally adopt. It's possible
+ * that some options are very common and many data sources use them. However different data
+ * sources may define the common options (key and meaning) differently, which is quite confusing to
+ * end users.
+ *
+ * The standard options defined by Spark:
+ *
+ *   path: A comma separated paths string of the data files/directories, like
+ *     path1,/absolute/file2,path3/*. Each path can either be relative or absolute,
+ *     points to either file or directory, and can contain wildcards. This option is commonly used
+ *     by file-based data sources.
+ *
+ *   table: A table name string representing the table name directly without any interpretation.
--- End diff --
what do you mean by "without any interpretation"? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
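In practice a source would read these standard keys off the case-insensitive map, something like the sketch below; the `get` returning a `java.util.Optional[String]` matches the class as shown in the diff, but treat the details as approximate:

```scala
import org.apache.spark.sql.sources.v2.DataSourceOptions

// Keys are case-insensitive, so "PATH" and "path" resolve identically.
def describeSource(options: DataSourceOptions): String = {
  val path = options.get("path")    // java.util.Optional[String]
  val table = options.get("table")
  if (path.isPresent) s"file-based source at ${path.get}"
  else if (table.isPresent) s"table-backed source: ${table.get}"
  else "source without a standard location option"
}
```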
[GitHub] spark issue #20491: [SQL] Minor doc update: Add an example in DataFrameReade...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20491 This should also go into branch-2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20491: [SQL] Minor doc update: Add an example in DataFra...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/20491 [SQL] Minor doc update: Add an example in DataFrameReader.schema

## What changes were proposed in this pull request?

This patch adds a small example to the schema string definition of the schema function. It isn't obvious how to use it, so an example would be useful.

## How was this patch tested?

N/A - doc only.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark schema-doc

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20491.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20491

commit 69193dbd64e9e0002abd9a8cd6fe60c1c87bc471
Author: Reynold Xin <rxin@...>
Date: 2018-02-02T23:00:39Z

    [SQL] Minor doc update: Add an example in DataFrameReader.schema

commit e5e5e0b44e22f58736dd27e5c048395670574f18
Author: Reynold Xin <rxin@...>
Date: 2018-02-02T23:02:26Z

    fix typo

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
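The gist of the added example: a DDL-formatted schema string instead of a hand-built StructType. A hedged sketch, where the column names and path are placeholders and `spark` is assumed to be an active SparkSession:

```scala
// Pass the schema as a DDL string rather than constructing a StructType.
val df = spark.read
  .schema("a INT, b STRING, c DOUBLE")
  .csv("/path/to/data.csv")
```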
[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16793 Also the implementation doesn't match what was proposed in https://issues.apache.org/jira/browse/SPARK-19454 Having a null value as the default in a function called replace is too risky and error-prone. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16793 Sorry, I object to this change. Why would we put null as the default replace value, in a function called replace? That seems very counterintuitive and error-prone. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
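To make the objection concrete, an analogy in the Scala API, assuming a DataFrame `df` with an `age` column (the PR itself concerns PySpark; the mapping below is illustrative):

```scala
// Explicit form: every replacement value is spelled out.
val cleaned = df.na.replace("age", Map(10 -> 20))

// The objection: if a replace call defaulted its omitted replacement value to
// null, matching values would silently become null, which is easy to misread
// as a no-op and hard to catch in review.
```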
[GitHub] spark issue #20219: [SPARK-23025][SQL] Support Null type in scala reflection
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20219 But it is possible to generate NullType data, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
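It is; a quick illustration, assuming a SparkSession `spark`:

```scala
// A bare SQL NULL literal produces a column whose data type is NullType.
val df = spark.sql("SELECT null AS n")
df.printSchema()
// root
//  |-- n: null (nullable = true)
```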
[GitHub] spark issue #20152: [SPARK-22957] ApproxQuantile breaks if the number of row...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20152 cc @gatorsmile @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20072#discussion_r159573530
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
+    "org.apache.spark.sql.execution.datasources.fileDataSizeFactor")
--- End diff --
shouldn't we call this something like compressionFactor? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
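For orientation, the factor is applied when estimating the size of file-based relations, roughly as sketched below; the eventual config name (something like spark.sql.sources.fileCompressionFactor, defaulting to 1.0) is recalled here rather than verified:

```scala
// On-disk bytes understate in-memory size for compressed/columnar files,
// which can mislead the planner (e.g. into broadcast joins it shouldn't pick).
def estimatedRelationSize(totalFileBytes: Long, compressionFactor: Double): Long =
  (totalFileBytes * compressionFactor).toLong
```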
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20076 Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-22648][K8S] Spark on Kubernetes - Documentation
Repository: spark
Updated Branches: refs/heads/master 7beb375bf -> 7ab165b70

[SPARK-22648][K8S] Spark on Kubernetes - Documentation

What changes were proposed in this pull request?

This PR contains documentation on the usage of the Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build the docker images required to use the integration. The changes detailed here are covered by https://github.com/apache/spark/pull/19717 and https://github.com/apache/spark/pull/19468 which have merged already.

How was this patch tested?

The script has been in use for releases on our fork. The rest is documentation.

cc rxin mateiz (shepherd) k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. (https://github.com/apache/spark/pull/20007)
- [x] Change references to docker to instead say "container" (https://github.com/apache/spark/pull/19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (#20032)

Author: foxish <ramanath...@google.com>

Closes #19946 from foxish/update-k8s-docs.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ab165b7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ab165b7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ab165b7
Branch: refs/heads/master
Commit: 7ab165b7061d9acc26523227076056e94354d204
Parents: 7beb375
Author: foxish <ramanath...@google.com>
Authored: Thu Dec 21 17:21:11 2017 -0800
Committer: Reynold Xin <r...@databricks.com>
Committed: Thu Dec 21 17:21:11 2017 -0800
--
 docs/_layouts/global.html        |   1 +
 docs/building-spark.md           |   6 +-
 docs/cluster-overview.md         |   7 +-
 docs/configuration.md            |   2 +
 docs/img/k8s-cluster-mode.png    | Bin 0 -> 55538 bytes
 docs/index.md                    |   3 +-
 docs/running-on-kubernetes.md    | 578 ++
 docs/running-on-yarn.md          |   4 +-
 docs/submitting-applications.md  |  16 +
 sbin/build-push-docker-images.sh |  68
 10 files changed, 677 insertions(+), 8 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 67b05ec..e5af5ae 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -99,6 +99,7 @@
 Spark Standalone
 Mesos
 YARN
+Kubernetes

http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 98f7df1..c391255 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the
 to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
 
 This will build Spark distribution along with Python pip and R packages. For more information on usage, run `./dev/make-distribution.sh --help`
@@ -90,6 +90,10 @@ like ZooKeeper and Hadoop itself.
 
 ## Building with Mesos support
 
     ./build/mvn -Pmesos -DskipTests clean package
+
+## Building with Kubernetes support
+
+    ./build/mvn -Pkubernetes -DskipTests clean package
 
 ## Building with Kafka 0.8 support

http://git-wip-us.apache.org/repos/asf/spark/blob/7ab165b7/docs/cluster-overview.md
--
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index c42bb4b..658e67f 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -52,11 +52,8 @@ The system currently supports three cluster managers:
 * [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce and service applications.
 * [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.
-* [Kubernetes (experimental)](https://github.com/apac
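For readers of the archive, cluster-mode submission against Kubernetes per these docs looks roughly as follows; the API server address and container image are placeholders, and docs/running-on-kubernetes.md from the PR is the authoritative form:

```
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```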
[GitHub] spark issue #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documentation
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19946 Merging in master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19973: [SPARK-22779] FallbackConfigEntry's default value...
Github user rxin closed the pull request at: https://github.com/apache/spark/pull/19973 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158205893
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the
 to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --
Yea I don't think you need to block this pr with this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryError in c...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20014 Overall change lgtm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20014: [SPARK-22827][CORE] Avoid throwing OutOfMemoryErr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20014#discussion_r157673852
--- Diff: core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryError.java ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.memory;
+
+/**
+ * This exception is thrown when a task can not acquire memory from the Memory manager.
+ * Instead of throwing {@link OutOfMemoryError}, which kills the executor,
+ * we should use throw this exception, which will just kill the current task.
+ */
+public final class SparkOutOfMemoryError extends OutOfMemoryError {
--- End diff --
is this an internal class? if yes perhaps we should label it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
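The intended usage pattern, sketched below; the String constructor is assumed from OutOfMemoryError's own, since only the class declaration appears in the diff:

```scala
import org.apache.spark.memory.SparkOutOfMemoryError

// Task-side memory acquisition: failing here should kill the task, not the executor.
def acquireMemory(required: Long, available: Long): Long = {
  if (required > available) {
    throw new SparkOutOfMemoryError(
      s"Unable to acquire $required bytes of memory, got $available")
  }
  required
}
```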
[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19973 @vanzin you got a min to submit a patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [Scheduler] Spark on Kubernetes - D...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r156821519
--- Diff: docs/building-spark.md ---
@@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the
 to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured
 with Maven profile settings and so on like the direct Maven build. Example:
 
-    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn
+    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
--- End diff --
should we use k8s? I kept bringing this up and that's because I can never spell Kubernetes properly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19973: [SPARK-22779] FallbackConfigEntry's default value should...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19973 That's what the "default" is, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19973 The issue is in
```
/**
 * Return the `string` value of Spark SQL configuration property for the given key. If the key is
 * not set yet, return `defaultValue`.
 */
def getConfString(key: String, defaultValue: String): String = {
  if (defaultValue != null && defaultValue != "") {
    val entry = sqlConfEntries.get(key)
    if (entry != null) {
      // Only verify configs in the SQLConf object
      entry.valueConverter(defaultValue)
    }
  }
  Option(settings.get(key)).getOrElse(defaultValue)
}
```
The value converter gets applied on this generated string, which is not a real value, and will fail. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
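Concretely, the failure mode is along these lines (an illustration; the placeholder text is made up but mirrors the human-readable defaults fallback confs carry):

```scala
// A fallback conf's "default" is a display string, not a parseable value.
val placeholder = "<value of spark.buffer.size>"
val valueConverter: String => Int = _.toInt  // a typical converter for an int conf
valueConverter(placeholder)                  // throws java.lang.NumberFormatException
```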
[GitHub] spark pull request #19973: [SPARK-22779] ConfigEntry's default value should ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/19973 [SPARK-22779] ConfigEntry's default value should actually be a value

## What changes were proposed in this pull request?

A ConfigEntry's config value right now shows a human-readable message. In some places in SQL we actually rely on the default value being a real, parseable value.

## How was this patch tested?

Tested manually.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-22779

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19973.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19973

commit 385c300c14a382654c2a1f94ccd2813487dbe26a
Author: Reynold Xin <r...@databricks.com>
Date: 2017-12-13T22:43:55Z

    [SPARK-22779] ConfigEntry's default value should actually be a value

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19973: [SPARK-22779] ConfigEntry's default value should actuall...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19973 cc @vanzin @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r155693977
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that should be propagated to
+   *         the data source.
+   */
+  def withSessionConfig(
+      cs: ConfigSupport,
+      source: String,
+      conf: SQLConf): immutable.Map[String, String] = {
+    val prefixes = cs.getConfigPrefixes
+    require(prefixes != null, "The config key-prefixes cann't be null.")
+    val mapping = cs.getConfigMapping.asScala
+    val validOptions = cs.getValidOptions
+    require(validOptions != null, "The valid options list cann't be null.")
--- End diff --
double n --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r155693966
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that should be propagated to
+   *         the data source.
+   */
+  def withSessionConfig(
+      cs: ConfigSupport,
+      source: String,
+      conf: SQLConf): immutable.Map[String, String] = {
+    val prefixes = cs.getConfigPrefixes
+    require(prefixes != null, "The config key-prefixes cann't be null.")
--- End diff --
double n --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
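Stripped of the SQLConf plumbing, the propagation being reviewed amounts to a filter plus a key re-mapping; a standalone sketch over plain collections, with all names invented for illustration:

```scala
// Keep session confs whose keys match a declared prefix, then rename keys
// into data source option names via the source-provided mapping.
def propagateSessionConfigs(
    sessionConfs: Map[String, String],
    prefixes: Seq[String],
    mapping: Map[String, String]): Map[String, String] = {
  sessionConfs
    .filter { case (key, _) => prefixes.exists(key.startsWith) }
    .map { case (key, value) => (mapping.getOrElse(key, key), value) }
}
```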
[GitHub] spark issue #19905: [SPARK-22710] ConfigBuilder.fallbackConf should trigger ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19905 cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org