[GitHub] [spark-website] zhengruifeng commented on pull request #478: [SPARK-45195][FOLLOWUP] Update python example
zhengruifeng commented on PR #478: URL: https://github.com/apache/spark-website/pull/478#issuecomment-1724871330 thank you guys -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark-website] branch asf-site updated: [SPARK-45195][FOLLOWUP] Update python example (#478)
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new c578724e88 [SPARK-45195][FOLLOWUP] Update python example (#478) c578724e88 is described below commit c578724e8818a2ebd5f49e0a9e300bb86c8000cb Author: Ruifeng Zheng AuthorDate: Tue Sep 19 13:53:42 2023 +0800 [SPARK-45195][FOLLOWUP] Update python example (#478) * init * address comments address comments address comments --- index.md| 7 +-- site/index.html | 7 +-- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/index.md b/index.md index ada6242742..188ced0d6c 100644 --- a/index.md +++ b/index.md @@ -88,12 +88,15 @@ navigation: Run now -Install with 'pip' or try offical image +Install with 'pip' $ pip install pyspark $ pyspark -$ + +Use the official Docker image + + $ docker run -it --rm spark:python3 /opt/spark/bin/pyspark diff --git a/site/index.html b/site/index.html index 3ccc7104ce..e3da4b2a5b 100644 --- a/site/index.html +++ b/site/index.html @@ -213,12 +213,15 @@ Run now -Install with 'pip' or try offical image +Install with 'pip' $ pip install pyspark $ pyspark -$ + +Use the official Docker image + + $ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
[GitHub] [spark-website] zhengruifeng merged pull request #478: [SPARK-45195][FOLLOWUP] Update python example
zhengruifeng merged PR #478: URL: https://github.com/apache/spark-website/pull/478
[spark] branch branch-3.1 updated: [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.1 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.1 by this push: new 4a418a448ea [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) 4a418a448ea is described below commit 4a418a448eab6e1007927db92ecacad6594397c8 Author: Hyukjin Kwon AuthorDate: Tue Sep 19 14:51:27 2023 +0900 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to restore the ability to switch languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the jQuery code. Previously the selector was the string `.nav-link tab_python`, which does not comply with multiple-class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard form `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there is only a single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210.
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 796d8785c61e09d1098350657fd44707763687db) Signed-off-by: Hyukjin Kwon --- docs/js/main.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/js/main.js b/docs/js/main.js index 968097c8041..07df2541f74 100755 --- a/docs/js/main.js +++ b/docs/js/main.js @@ -63,7 +63,7 @@ function codeTabs() { // while retaining the scroll position e.preventDefault(); var scrollOffset = $(this).offset().top - $(document).scrollTop(); -$("." + $(this).attr('class')).tab('show'); +$("." + $(this).attr('class').split(" ").join(".")).tab('show'); $(document).scrollTop($(this).offset().top - scrollOffset); }); }
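The one-line fix builds a compound CSS selector (`.nav-link.tab_python`) out of a multi-class `class` attribute instead of the old, broken descendant-style string (`.nav-link tab_python`). A minimal Python sketch of the same string transformation (the real code is the jQuery expression in main.js above; the function name here is illustrative):

```python
def to_compound_selector(class_attr: str) -> str:
    # Mirrors `"." + $(this).attr('class').split(" ").join(".")` from main.js:
    # prefix a dot and replace each space with a dot.
    return "." + ".".join(class_attr.split(" "))

# Old behaviour was `"." + class_attr`, yielding ".nav-link tab_python",
# a descendant selector that matches no element on the docs pages.
print(to_compound_selector("nav-link tab_python"))  # .nav-link.tab_python
```

With a single class (the case PR 42657 already handled), both the old and new forms produce the same selector, which is why the bug only surfaced for tabs carrying multiple classes.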
[spark] branch branch-3.2 updated: [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new e428fe902bb [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) e428fe902bb is described below commit e428fe902bb1f12cea973de7fe4b885ae69fd6ca Author: Hyukjin Kwon AuthorDate: Tue Sep 19 14:51:27 2023 +0900 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to restore the ability to switch languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the jQuery code. Previously the selector was the string `.nav-link tab_python`, which does not comply with multiple-class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard form `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there is only a single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210.
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 796d8785c61e09d1098350657fd44707763687db) Signed-off-by: Hyukjin Kwon --- docs/js/main.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/js/main.js b/docs/js/main.js index 968097c8041..07df2541f74 100755 --- a/docs/js/main.js +++ b/docs/js/main.js @@ -63,7 +63,7 @@ function codeTabs() { // while retaining the scroll position e.preventDefault(); var scrollOffset = $(this).offset().top - $(document).scrollTop(); -$("." + $(this).attr('class')).tab('show'); +$("." + $(this).attr('class').split(" ").join(".")).tab('show'); $(document).scrollTop($(this).offset().top - scrollOffset); }); }
[spark] branch branch-3.3 updated: [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 578233e8e0a [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) 578233e8e0a is described below commit 578233e8e0a05213ce2c3f9f49525075c2da801f Author: Hyukjin Kwon AuthorDate: Tue Sep 19 14:51:27 2023 +0900 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to restore the ability to switch languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the jQuery code. Previously the selector was the string `.nav-link tab_python`, which does not comply with multiple-class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard form `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there is only a single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210.
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 796d8785c61e09d1098350657fd44707763687db) Signed-off-by: Hyukjin Kwon --- docs/js/main.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/js/main.js b/docs/js/main.js index 968097c8041..07df2541f74 100755 --- a/docs/js/main.js +++ b/docs/js/main.js @@ -63,7 +63,7 @@ function codeTabs() { // while retaining the scroll position e.preventDefault(); var scrollOffset = $(this).offset().top - $(document).scrollTop(); -$("." + $(this).attr('class')).tab('show'); +$("." + $(this).attr('class').split(" ").join(".")).tab('show'); $(document).scrollTop($(this).offset().top - scrollOffset); }); }
[spark] branch branch-3.4 updated: [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 796d8785c61 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) 796d8785c61 is described below commit 796d8785c61e09d1098350657fd44707763687db Author: Hyukjin Kwon AuthorDate: Tue Sep 19 14:51:27 2023 +0900 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to restore the ability to switch languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the jQuery code. Previously the selector was the string `.nav-link tab_python`, which does not comply with multiple-class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard form `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there is only a single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210.
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- docs/js/main.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/js/main.js b/docs/js/main.js index 968097c8041..07df2541f74 100755 --- a/docs/js/main.js +++ b/docs/js/main.js @@ -63,7 +63,7 @@ function codeTabs() { // while retaining the scroll position e.preventDefault(); var scrollOffset = $(this).offset().top - $(document).scrollTop(); -$("." + $(this).attr('class')).tab('show'); +$("." + $(this).attr('class').split(" ").join(".")).tab('show'); $(document).scrollTop($(this).offset().top - scrollOffset); }); }
[spark] branch branch-3.4 updated: [SPARK-44910][SQL][3.4] Encoders.bean does not support superclasses with generic type arguments
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new e5d82875463 [SPARK-44910][SQL][3.4] Encoders.bean does not support superclasses with generic type arguments e5d82875463 is described below commit e5d8287546306f31d3888662669970e46895c155 Author: Giambattista Bloisi AuthorDate: Mon Sep 18 21:38:42 2023 -0700 [SPARK-44910][SQL][3.4] Encoders.bean does not support superclasses with generic type arguments ### What changes were proposed in this pull request? This pull request adds Encoders.bean support for beans having a superclass declared with generic type arguments. For example: ``` class JavaBeanWithGenericsA<T> { public T getPropertyA() { return null; } public void setPropertyA(T a) { } } class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> { } Encoders.bean(JavaBeanWithGenericBase.class); // Exception ``` That feature was meant to be part of [PR 42327](https://github.com/apache/spark/commit/1f5d78b5952fcc6c7d36d3338a5594070e3a62dd) but was missing, as I was focusing on nested beans only (hvanhovell) ### Why are the changes needed? JavaTypeInference.encoderFor did not resolve TypeVariable objects for superclasses, so in a case like the example above an exception was thrown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests have been extended, new specific tests have been added ### Was this patch authored or co-authored using generative AI tooling? No hvanhovell this is the branch-3.4 port of [PR-42634](https://github.com/apache/spark/pull/42634) Closes #42914 from gbloisi-openaire/SPARK-44910-branch-3.4.
Authored-by: Giambattista Bloisi Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/JavaTypeInference.scala | 6 ++- ...thGenerics.java => JavaTypeInferenceBeans.java} | 52 +++--- .../sql/catalyst/JavaTypeInferenceSuite.scala | 41 +++-- 3 files changed, 88 insertions(+), 11 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala index 75aca3ccbdd..a0341c0d9c8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala @@ -131,10 +131,13 @@ object JavaTypeInference { // TODO: we should only collect properties that have getter and setter. However, some tests // pass in scala case class as java bean class which doesn't have getter and setter. val properties = getJavaBeanReadableProperties(c) + // add type variables from inheritance hierarchy of the class + val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++ +typeVariables // Note that the fields are ordered by name. val fields = properties.map { property => val readMethod = property.getReadMethod -val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, typeVariables) +val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV) // The existence of `javax.annotation.Nonnull`, means this field is not nullable. 
val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull]) EncoderField( @@ -158,4 +161,3 @@ object JavaTypeInference { .filter(_.getReadMethod != null) } } - diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java old mode 100755 new mode 100644 similarity index 54% rename from sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java rename to sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java index b84a3122cf8..8438b75c762 --- a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java +++ b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java @@ -17,25 +17,65 @@ package org.apache.spark.sql.catalyst; -class JavaBeanWithGenerics { +public class JavaTypeInferenceBeans { + + static class JavaBeanWithGenericsA { +public T getPropertyA() { + return null; +} + +public void setPropertyA(T a) { + +} + } + + static class JavaBeanWithGenericsAB extends JavaBeanWithGenericsA { +public T getPropertyB() { + return null; +} + +public void setPropertyB(T a) { + +} + } + + static class
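The core of the Scala fix above is a single map merge: type-variable bindings collected from the superclass chain (via `JavaTypeUtils.getTypeArguments`) are combined with the bindings already being tracked, and Scala's `++` lets the right-hand operand win on conflicts. A hedged Python sketch of that merge semantics (the binding names and values here are made up for illustration, not taken from Spark):

```python
# Hypothetical bindings: "T" is discovered from the superclass declaration
# (e.g. JavaBeanWithGenericsA<String>), while "U" was already tracked when
# walking an enclosing generic type.
inherited = {"T": "String", "U": "Integer"}
tracked = {"U": "Long"}

# Scala's `inherited ++ tracked` gives precedence to `tracked`,
# which matches Python's left-to-right dict merge below.
class_tv = {**inherited, **tracked}
print(class_tv)  # {'T': 'String', 'U': 'Long'}
```

With this merged map passed to `encoderFor`, a property whose getter returns `T` can be resolved to the concrete type the subclass supplied, instead of failing on an unresolved `TypeVariable`.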
[spark] branch branch-3.5 updated: [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6a498087361 [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments 6a498087361 is described below commit 6a498087361ecbd653821fc283b9ea0fa703c820 Author: Giambattista Bloisi AuthorDate: Mon Sep 18 21:37:09 2023 -0700 [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments ### What changes were proposed in this pull request? This pull request adds Encoders.bean support for beans having a superclass declared with generic type arguments. For example: ``` class JavaBeanWithGenericsA<T> { public T getPropertyA() { return null; } public void setPropertyA(T a) { } } class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> { } Encoders.bean(JavaBeanWithGenericBase.class); // Exception ``` That feature was meant to be part of [PR 42327](https://github.com/apache/spark/commit/1f5d78b5952fcc6c7d36d3338a5594070e3a62dd) but was missing, as I was focusing on nested beans only (hvanhovell) ### Why are the changes needed? JavaTypeInference.encoderFor did not resolve TypeVariable objects for superclasses, so in a case like the example above an exception was thrown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests have been extended, new specific tests have been added ### Was this patch authored or co-authored using generative AI tooling? No Closes #42634 from gbloisi-openaire/SPARK-44910.
Lead-authored-by: Giambattista Bloisi Co-authored-by: gbloisi-openaire <141144100+gbloisi-opena...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun (cherry picked from commit 7e14c8cc33f0ed0a9c53a888e8a3b17dd2a5d493) Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/JavaTypeInference.scala | 5 ++- ...thGenerics.java => JavaTypeInferenceBeans.java} | 51 +++--- .../sql/catalyst/JavaTypeInferenceSuite.scala | 41 +++-- 3 files changed, 88 insertions(+), 9 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala index 3d536b735db..191ccc52544 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala @@ -130,10 +130,13 @@ object JavaTypeInference { // TODO: we should only collect properties that have getter and setter. However, some tests // pass in scala case class as java bean class which doesn't have getter and setter. val properties = getJavaBeanReadableProperties(c) + // add type variables from inheritance hierarchy of the class + val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++ +typeVariables // Note that the fields are ordered by name. val fields = properties.map { property => val readMethod = property.getReadMethod -val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, typeVariables) +val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV) // The existence of `javax.annotation.Nonnull`, means this field is not nullable. 
val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull]) EncoderField( diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java similarity index 54% rename from sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java rename to sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java index b84a3122cf8..cc3540717ee 100644 --- a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java +++ b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java @@ -17,25 +17,66 @@ package org.apache.spark.sql.catalyst; -class JavaBeanWithGenerics { +public class JavaTypeInferenceBeans { + + static class JavaBeanWithGenericsA { +public T getPropertyA() { + return null; +} + +public void setPropertyA(T a) { + +} + } + + static class JavaBeanWithGenericsAB extends JavaBeanWithGenericsA { +public T getPropertyB() { + return null; +} + +public void setPropertyB(T a) { + +} + } + + static class JavaBeanWithGenericsABC extends JavaBeanWithGenericsAB { +p
[spark] branch master updated: [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7e14c8cc33f [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments 7e14c8cc33f is described below commit 7e14c8cc33f0ed0a9c53a888e8a3b17dd2a5d493 Author: Giambattista Bloisi AuthorDate: Mon Sep 18 21:37:09 2023 -0700 [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments ### What changes were proposed in this pull request? This pull request adds Encoders.bean support for beans having a superclass declared with generic type arguments. For example: ``` class JavaBeanWithGenericsA<T> { public T getPropertyA() { return null; } public void setPropertyA(T a) { } } class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> { } Encoders.bean(JavaBeanWithGenericBase.class); // Exception ``` That feature was meant to be part of [PR 42327](https://github.com/apache/spark/commit/1f5d78b5952fcc6c7d36d3338a5594070e3a62dd) but was missing, as I was focusing on nested beans only (hvanhovell) ### Why are the changes needed? JavaTypeInference.encoderFor did not resolve TypeVariable objects for superclasses, so in a case like the example above an exception was thrown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests have been extended, new specific tests have been added ### Was this patch authored or co-authored using generative AI tooling? No Closes #42634 from gbloisi-openaire/SPARK-44910.
Lead-authored-by: Giambattista Bloisi Co-authored-by: gbloisi-openaire <141144100+gbloisi-opena...@users.noreply.github.com> Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/JavaTypeInference.scala | 5 ++- ...thGenerics.java => JavaTypeInferenceBeans.java} | 51 +++--- .../sql/catalyst/JavaTypeInferenceSuite.scala | 41 +++-- 3 files changed, 88 insertions(+), 9 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala index 3d536b735db..191ccc52544 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala @@ -130,10 +130,13 @@ object JavaTypeInference { // TODO: we should only collect properties that have getter and setter. However, some tests // pass in scala case class as java bean class which doesn't have getter and setter. val properties = getJavaBeanReadableProperties(c) + // add type variables from inheritance hierarchy of the class + val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++ +typeVariables // Note that the fields are ordered by name. val fields = properties.map { property => val readMethod = property.getReadMethod -val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, typeVariables) +val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV) // The existence of `javax.annotation.Nonnull`, means this field is not nullable. 
val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull]) EncoderField( diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java similarity index 54% rename from sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java rename to sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java index b84a3122cf8..cc3540717ee 100644 --- a/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaBeanWithGenerics.java +++ b/sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/JavaTypeInferenceBeans.java @@ -17,25 +17,66 @@ package org.apache.spark.sql.catalyst; -class JavaBeanWithGenerics { +public class JavaTypeInferenceBeans { + + static class JavaBeanWithGenericsA { +public T getPropertyA() { + return null; +} + +public void setPropertyA(T a) { + +} + } + + static class JavaBeanWithGenericsAB extends JavaBeanWithGenericsA { +public T getPropertyB() { + return null; +} + +public void setPropertyB(T a) { + +} + } + + static class JavaBeanWithGenericsABC extends JavaBeanWithGenericsAB { +public T getPropertyC() { + return null; +} + +public void setPropertyC(T a) { + +} + } + + stati
[spark] branch master updated: [SPARK-43654][CONNECT][PS][TESTS] Enable `InternalFrameParityTests.test_from_pandas`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2697a9426e1 [SPARK-43654][CONNECT][PS][TESTS] Enable `InternalFrameParityTests.test_from_pandas` 2697a9426e1 is described below commit 2697a9426e174e30c797a27660f20e42cf2faefc Author: Haejoon Lee AuthorDate: Mon Sep 18 21:26:02 2023 -0700 [SPARK-43654][CONNECT][PS][TESTS] Enable `InternalFrameParityTests.test_from_pandas` ### What changes were proposed in this pull request? This PR proposes to enable `InternalFrameParityTests.test_from_pandas` ### Why are the changes needed? To improve the test coverage for Pandas API with Spark Connect. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT for Spark Connect. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42956 from itholic/enable_from_pandas. 
Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- python/pyspark/pandas/tests/connect/test_parity_internal.py | 4 +--- python/pyspark/pandas/utils.py | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/test_parity_internal.py b/python/pyspark/pandas/tests/connect/test_parity_internal.py index d586fec57f7..d90b2b260f4 100644 --- a/python/pyspark/pandas/tests/connect/test_parity_internal.py +++ b/python/pyspark/pandas/tests/connect/test_parity_internal.py @@ -24,9 +24,7 @@ from pyspark.testing.pandasutils import PandasOnSparkTestUtils class InternalFrameParityTests( InternalFrameTestsMixin, PandasOnSparkTestUtils, ReusedConnectTestCase ): -@unittest.skip("TODO(SPARK-43654): Enable InternalFrameParityTests.test_from_pandas.") -def test_from_pandas(self): -super().test_from_pandas() +pass if __name__ == "__main__": diff --git a/python/pyspark/pandas/utils.py b/python/pyspark/pandas/utils.py index ebeb1d69d1b..b647697edf9 100644 --- a/python/pyspark/pandas/utils.py +++ b/python/pyspark/pandas/utils.py @@ -954,7 +954,7 @@ def spark_column_equals(left: Column, right: Column) -> bool: error_class="NOT_COLUMN", message_parameters={"arg_name": "right", "arg_type": type(right).__name__}, ) -return repr(left) == repr(right) +return repr(left).replace("`", "") == repr(right).replace("`", "") else: return left._jc.equals(right._jc)
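The `spark_column_equals` change in the diff above makes the `repr`-based comparison insensitive to backtick quoting around column names. A standalone sketch of just that comparison logic (the `Column<...>` strings below are illustrative stand-ins, not output captured from Spark):

```python
def reprs_equal(left_repr: str, right_repr: str) -> bool:
    # Strip backticks before comparing, mirroring the one-line change in
    # pyspark/pandas/utils.py, so a backtick-quoted name matches a plain one.
    return left_repr.replace("`", "") == right_repr.replace("`", "")

print(reprs_equal("Column<'`a`'>", "Column<'a'>"))  # True
print(reprs_equal("Column<'a'>", "Column<'b'>"))    # False
```

Without the stripping, the same logical column could compare unequal purely because one side rendered its name quoted, which is what kept the parity test skipped.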
[GitHub] [spark-website] gatorsmile commented on a diff in pull request #478: [SPARK-45195] Update python example
gatorsmile commented on code in PR #478: URL: https://github.com/apache/spark-website/pull/478#discussion_r1329548341 ## index.md: ## @@ -88,13 +88,17 @@ navigation: Run now -Install with 'pip' or try offical image +Install with 'pip' $ pip install pyspark $ pyspark -$ + +Try offical image Review Comment: Typo: offical => official ``` Use the official Docker image. ```
[GitHub] [spark-website] gatorsmile commented on a diff in pull request #478: [SPARK-45195] Update python example
gatorsmile commented on code in PR #478: URL: https://github.com/apache/spark-website/pull/478#discussion_r1329547154 ## index.md: ## @@ -88,13 +88,17 @@ navigation: Run now -Install with 'pip' or try offical image +Install with 'pip' $ pip install pyspark $ pyspark -$ + +Try offical image + + $ docker run -it --rm spark:python3 /opt/spark/bin/pyspark +$ Review Comment: remove this line?
[GitHub] [spark-website] zhengruifeng opened a new pull request, #478: [SPARK-45195] Update python example
zhengruifeng opened a new pull request, #478: URL: https://github.com/apache/spark-website/pull/478 To address https://github.com/apache/spark-website/pull/477#discussion_r1329098596 ![image](https://github.com/apache/spark-website/assets/7322292/691cd53b-c5dd-43eb-a5fa-be1611239194)
[spark] branch master updated: [SPARK-45202][BUILD] Fix lint-js tool and js format
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e818e9cefa0 [SPARK-45202][BUILD] Fix lint-js tool and js format e818e9cefa0 is described below commit e818e9cefa012037a38a99819498f331697d3a50 Author: Kent Yao AuthorDate: Tue Sep 19 11:05:41 2023 +0800 [SPARK-45202][BUILD] Fix lint-js tool and js format ### What changes were proposed in this pull request? This PR fixes lint-js tool, which covers only js files in the core module, and also fixes the js format of the rest modules.

```
dev/lint-js
/Users/kentyao/spark/sql/core/src/main/resources/org/apache/spark/sql/execution/ui/static/spark-sql-viz.js
   29:10  error  'renderPlanViz' is defined but never used  no-unused-vars
   31:18  error  'd3' is not defined  no-undef
   35:11  error  'graphlibDot' is not defined  no-undef
   37:22  error  'dagreD3' is not defined  no-undef
   46:27  error  '$' is not defined  no-undef
   59:38  error  'd3' is not defined  no-undef
   66:21  error  'd3' is not defined  no-undef
   67:3   error  'd3' is not defined  no-undef
   68:20  error  'd' is defined but never used. Allowed unused args must match /^_ignored_.*/u  no-unused-vars
   69:21  error  'd3' is not defined  no-undef
   70:7   error  '$' is not defined  no-undef
  104:55  error  'i' is defined but never used. Allowed unused args must match /^_ignored_.*/u  no-unused-vars
  108:1   error  Expected indentation of 10 spaces but found 12  indent
  109:1   error  Expected indentation of 10 spaces but found 12  indent
  115:1   error  Expected indentation of 4 spaces but found 6  indent
  116:1   error  Expected indentation of 6 spaces but found 8  indent
  116:16  error  'd3' is not defined  no-undef
  117:1   error  Expected indentation of 4 spaces but found 6  indent
  118:1   error  Expected indentation of 2 spaces but found 4  indent
  129:13  error  'd3' is not defined  no-undef
  130:34  error  'd3' is not defined  no-undef
  133:13  error  'd3' is not defined  no-undef
  134:34  error  'd3' is not defined  no-undef
  137:13  error  'd3' is not defined  no-undef
  138:15  error  'd3' is not defined  no-undef
  142:13  error  'd3' is not defined  no-undef
  143:15  error  'd3' is not defined  no-undef
  180:11  error  'd3' is not defined  no-undef
  196:3   error  '$' is not defined  no-undef
  198:26  error  '$' is not defined  no-undef
  199:1   error  Expected indentation of 6 spaces but found 8  indent
  200:1   error  Expected indentation of 8 spaces but found 10  indent
  201:1   error  Expected indentation of 8 spaces but found 10  indent
  201:28  error  'd3' is not defined  no-undef
  201:41  error  '$' is not defined  no-undef
  202:1   error  Expected indentation of 8 spaces but found 10  indent
  203:1   error  Expected indentation of 8 spaces but found 10  indent
  204:1   error  Expected indentation of 8 spaces but found 10
```
[spark] branch branch-3.5 updated: Revert "Revert "[SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site""
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 555c8def51e Revert "Revert "[SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site"" 555c8def51e is described below commit 555c8def51e5951c7bf5165a332795e9e330ec9d Author: Hyukjin Kwon AuthorDate: Tue Sep 19 10:18:18 2023 +0900 Revert "Revert "[SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site"" This reverts commit bbe12e148eb1f289cfb1f4412525f4c4381c10a9. --- python/docs/source/_static/css/pyspark.css | 13 python/docs/source/_static/versions.json | 22 +++ .../docs/source/_templates/version-switcher.html | 77 ++ python/docs/source/conf.py | 9 ++- 4 files changed, 120 insertions(+), 1 deletion(-) diff --git a/python/docs/source/_static/css/pyspark.css b/python/docs/source/_static/css/pyspark.css index 89b7c65f27a..ccfe60f2bca 100644 --- a/python/docs/source/_static/css/pyspark.css +++ b/python/docs/source/_static/css/pyspark.css @@ -95,3 +95,16 @@ u.bd-sidebar .nav>li>ul>.active:hover>a,.bd-sidebar .nav>li>ul>.active>a { .spec_table tr, td, th { border-top: none!important; } + +/* Styling to the version dropdown */ +#version-button { + padding-left: 0.2rem; + padding-right: 3.2rem; +} + +#version_switcher { + height: auto; + max-height: 300px; + width: 165px; + overflow-y: auto; +} diff --git a/python/docs/source/_static/versions.json b/python/docs/source/_static/versions.json new file mode 100644 index 000..3d0bd148180 --- /dev/null +++ b/python/docs/source/_static/versions.json @@ -0,0 +1,22 @@ +[ +{ +"name": "3.4.1", +"version": "3.4.1" +}, +{ +"name": "3.4.0", +"version": "3.4.0" +}, +{ +"name": "3.3.2", +"version": "3.3.2" +}, +{ +"name": "3.3.1", +"version": "3.3.1" +}, +{ +"name": "3.3.0", +"version": "3.3.0" +} +] diff --git 
a/python/docs/source/_templates/version-switcher.html b/python/docs/source/_templates/version-switcher.html new file mode 100644 index 000..16c443229f4 --- /dev/null +++ b/python/docs/source/_templates/version-switcher.html @@ -0,0 +1,77 @@ + + + + +{{ release }} + + + + + + + + +// Function to construct the target URL from the JSON components +function buildURL(entry) { +var template = "{{ switcher_template_url }}"; // supplied by jinja +template = template.replace("{version}", entry.version); +return template; +} + +// Function to check if corresponding page path exists in other version of docs +// and, if so, go there instead of the homepage of the other docs version +function checkPageExistsAndRedirect(event) { +const currentFilePath = "{{ pagename }}.html", + otherDocsHomepage = event.target.getAttribute("href"); +let tryUrl = `${otherDocsHomepage}${currentFilePath}`; +$.ajax({ +type: 'HEAD', +url: tryUrl, +// if the page exists, go there +success: function() { +location.href = tryUrl; +} +}).fail(function() { +location.href = otherDocsHomepage; +}); +return false; +} + +// Function to populate the version switcher +(function () { +// get JSON config +$.getJSON("{{ switcher_json_url }}", function(data, textStatus, jqXHR) { +// create the nodes first (before AJAX calls) to ensure the order is +// correct (for now, links will go to doc version homepage) +$.each(data, function(index, entry) { +// if no custom name specified (e.g., "latest"), use version string +if (!("name" in entry)) { +entry.name = entry.version; +} +// construct the appropriate URL, and add it to the dropdown +entry.url = buildURL(entry); +const node = document.createElement("a"); +node.setAttribute("class", "list-group-item list-group-item-action py-1"); +node.setAttribute("href", `${entry.url}`); +node.textContent = `${entry.name}`; +node.onclick = checkPageExistsAndRedirect; +$("#version_switcher").append(node); +}); +}); +})(); + diff --git a/python/docs/source/conf.py 
b/python/docs/source/conf.py index 38c331048e7..0f57cb37cee 100644 --- a/python/docs/source/conf.py +++ b/python/docs/source/conf.py @@ -177,10 +177,17 @@ autosummary_generate = True # a list of builtin themes. html_theme = 'pydata_sphinx_theme' +html_context = { +"switcher_json_url": "_static/versions.json", +"switcher_template_url": "https://spark.apache.org/docs/{version}/api/python/index.html";, +} + # Theme options are theme-specific an
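The `checkPageExistsAndRedirect` JavaScript in the version-switcher template above issues a HEAD request to decide between the same page path under another docs version and that version's homepage. The fallback logic can be sketched in plain Python, with a predicate standing in for the AJAX call (all names in this sketch are illustrative, not part of the template):

```python
def resolve_switch_target(other_docs_home: str, current_page: str, page_exists) -> str:
    # Try the same relative page path under the other docs version first;
    # fall back to that version's homepage if the page does not exist there.
    candidate = f"{other_docs_home}{current_page}"
    return candidate if page_exists(candidate) else other_docs_home

home = "https://spark.apache.org/docs/3.4.1/api/python/"
existing = {home + "reference/index.html"}
target = resolve_switch_target(home, "reference/index.html", lambda url: url in existing)
print(target)  # the version-specific page, since it "exists"
```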
[GitHub] [spark-website] zhengruifeng commented on a diff in pull request #477: [SPARK-45195] Update examples with docker official image
zhengruifeng commented on code in PR #477: URL: https://github.com/apache/spark-website/pull/477#discussion_r1329402279 ## index.md: ## @@ -88,11 +88,13 @@ navigation: Run now -Installing with 'pip' +Install with 'pip' or try offical image $ pip install pyspark $ pyspark +$ +$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark Review Comment: make sense, will send a follow up pr
[spark] branch master updated: [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Add sorting to the example of `collect_set/collect_list` to ensure stable results
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 464a3c19f51 [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Add sorting to the example of `collect_set/collect_list` to ensure stable results 464a3c19f51 is described below commit 464a3c19f51081914a09d4394820da059cb2ee47 Author: yangjie01 AuthorDate: Tue Sep 19 08:48:03 2023 +0900 [SPARK-45113][PYTHON][DOCS][FOLLOWUP] Add sorting to the example of `collect_set/collect_list` to ensure stable results ### What changes were proposed in this pull request? This PR adds a `sort_array` for the output results of `collect_set` and `collect_list` to ensure the output is stable. ### Why are the changes needed? When executing the Example of `collect_set` and `collect_list` with different versions of Scala, the output results may differ, resulting in the failure of daily tests on Scala 2.13: - https://github.com/apache/spark/actions/runs/6209111340/job/16856005714 ``` ** File "/__w/spark/spark/python/pyspark/sql/connect/functions.py", line 1030, in pyspark.sql.connect.functions.collect_set Failed example: df.select(sf.collect_set('age')).show() Expected: ++ |collect_set(age)| ++ | [5, 2]| ++ Got: ++ |collect_set(age)| ++ | [2, 5]| ++ ** 1 of 9 in pyspark.sql.connect.functions.collect_set ***Test Failed*** 1 failures. Had test failures in pyspark.sql.connect.functions with python3.9; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-connect --parallelism=1 ; received return code 255 Error: Process completed with exit code 19. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #42968 from LuciferYang/SPARK-45113-FOLLOWUP. 
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/functions.py | 96 - 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 54bd330ebc0..5474873df7b 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -3671,39 +3671,39 @@ def collect_list(col: "ColumnOrName") -> Column: Examples -Example 1: Collect values from a single column DataFrame +Example 1: Collect values from a DataFrame and sort the result in ascending order >>> from pyspark.sql import functions as sf ->>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',)) ->>> df.select(sf.collect_list('age')).show() -+-+ -|collect_list(age)| -+-+ -|[2, 5, 5]| -+-+ - -Example 2: Collect values from a DataFrame with multiple columns - ->>> from pyspark.sql import functions as sf ->>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name")) ->>> df.groupBy("name").agg(sf.collect_list('id')).show() -+++ -|name|collect_list(id)| -+++ -|John| [1, 2]| -| Ana| [3]| -+++ +>>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',)) +>>> df.select(sf.sort_array(sf.collect_list('value')).alias('sorted_list')).show() ++---+ +|sorted_list| ++---+ +| [1, 2, 2]| ++---+ -Example 3: Collect values from a DataFrame and sort the result +Example 2: Collect values from a DataFrame and sort the result in descending order >>> from pyspark.sql import functions as sf ->>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',)) ->>> df.select(sf.array_sort(sf.collect_list('value')).alias('sorted_list')).show() +>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',)) +>>> df.select(sf.sort_array(sf.collect_list('age'), asc=False).alias('sorted_list')).show() +---+ |sorted_list| +---+ -| [1, 2, 2]| +| [5, 5, 2]| +---+ + +Example 3: Collect values from a DataFrame with multiple columns and sort the result + +>>> from pyspark.sql import 
functions as sf +>>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")],
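The flakiness this doctest fix addresses is not Spark-specific: any test that prints a set-like aggregate directly depends on iteration order. A plain-Python illustration (not PySpark) of why sorting first makes the expected output deterministic:

```python
ages = [2, 5, 5]
collected = set(ages)       # like collect_set: duplicates dropped, order unspecified
stable = sorted(collected)  # like wrapping in sort_array: order now deterministic
print(stable)  # [2, 5] on every run, regardless of set iteration order
```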
[spark] branch branch-3.5 updated: [SPARK-45167][CONNECT][PYTHON][FOLLOW-UP] Use lighter threading Rlock, and use the existing eventually util function
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 2a9dd2b3968 [SPARK-45167][CONNECT][PYTHON][FOLLOW-UP] Use lighter threading Rlock, and use the existing eventually util function 2a9dd2b3968 is described below commit 2a9dd2b3968da7c2e96c502aaf4c158ee782e5f4 Author: Hyukjin Kwon AuthorDate: Mon Sep 18 13:46:34 2023 +0900 [SPARK-45167][CONNECT][PYTHON][FOLLOW-UP] Use lighter threading Rlock, and use the existing eventually util function This PR is a followup of https://github.com/apache/spark/pull/42929 that: - Use lighter threading `Rlock` instead of multithreading `Rlock`. Multiprocessing does not work with PySpark due to the ser/de problem for socket connections, and many others. - Use the existing eventually util function `pyspark.testing.eventually` instead of `assertEventually` to deduplicate code. Mainly for code clean-up. No. Existing tests should pass them. No. Closes #42965 from HyukjinKwon/SPARK-45167-followup. 
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit d5ff04da217df483d27011f6e38417df2eaa42bd) Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/client/reattach.py | 5 ++--- .../sql/tests/connect/client/test_client.py| 23 +- 2 files changed, 7 insertions(+), 21 deletions(-) diff --git a/python/pyspark/sql/connect/client/reattach.py b/python/pyspark/sql/connect/client/reattach.py index e58864b965b..6addb5bd2c6 100644 --- a/python/pyspark/sql/connect/client/reattach.py +++ b/python/pyspark/sql/connect/client/reattach.py @@ -18,12 +18,11 @@ from pyspark.sql.connect.utils import check_dependencies check_dependencies(__name__) +from threading import RLock import warnings import uuid from collections.abc import Generator from typing import Optional, Dict, Any, Iterator, Iterable, Tuple, Callable, cast, Type, ClassVar -from multiprocessing import RLock -from multiprocessing.synchronize import RLock as RLockBase from multiprocessing.pool import ThreadPool import os @@ -56,7 +55,7 @@ class ExecutePlanResponseReattachableIterator(Generator): """ # Lock to manage the pool -_lock: ClassVar[RLockBase] = RLock() +_lock: ClassVar[RLock] = RLock() _release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8) @classmethod diff --git a/python/pyspark/sql/tests/connect/client/test_client.py b/python/pyspark/sql/tests/connect/client/test_client.py index cf43fb16df7..93b7006799b 100644 --- a/python/pyspark/sql/tests/connect/client/test_client.py +++ b/python/pyspark/sql/tests/connect/client/test_client.py @@ -25,6 +25,7 @@ import grpc from pyspark.sql.connect.client import SparkConnectClient, ChannelBuilder import pyspark.sql.connect.proto as proto from pyspark.testing.connectutils import should_test_connect, connect_requirement_message +from pyspark.testing.utils import eventually from pyspark.sql.connect.client.core import Retrying from pyspark.sql.connect.client.reattach import ( @@ -152,20 +153,6 @@ class 
SparkConnectClientReattachTestCase(unittest.TestCase): attach_ops=ResponseGenerator(attach) if attach is not None else None, ) -def assertEventually(self, callable, timeout_ms=1000): -"""Helper method that will continuously evaluate the callable to not raise an -exception.""" -import time - -limit = time.monotonic_ns() + timeout_ms * 1000 * 1000 -while time.monotonic_ns() < limit: -try: -callable() -break -except Exception: -time.sleep(0.1) -callable() - def test_basic_flow(self): stub = self._stub_with([self.response, self.finished]) ite = ExecutePlanResponseReattachableIterator(self.request, stub, self.policy, []) @@ -178,7 +165,7 @@ class SparkConnectClientReattachTestCase(unittest.TestCase): self.assertEqual(1, stub.release_calls) self.assertEqual(1, stub.execute_calls) -self.assertEventually(check_all, timeout_ms=1000) +eventually(timeout=1, catch_assertions=True)(check_all)() def test_fail_during_execute(self): def fatal(): @@ -196,7 +183,7 @@ class SparkConnectClientReattachTestCase(unittest.TestCase): self.assertEqual(1, stub.release_until_calls) self.assertEqual(1, stub.execute_calls) -self.assertEventually(check, timeout_ms=1000) +eventually(timeout=1, catch_assertions=True)(check)() def test_fail_and_retry_during_execute(self): def non_fatal(): @@ -215,7 +202,7 @@ class SparkConnectClientReattachTestCase(unittest.TestCase): se
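The deleted `assertEventually` helper and the shared `pyspark.testing.utils.eventually` it is replaced with both implement the same retry-until-timeout idea. A condensed standalone sketch of that idea (the real `eventually` is a decorator factory with a different signature; this function is illustrative only):

```python
import time

def eventually(check, timeout=1.0, interval=0.05):
    # Re-run `check` until it stops raising or the timeout elapses;
    # once time runs out, let the last failure propagate.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

calls = {"n": 0}

def flaky_check():
    # Fails twice before the condition becomes true, like an async
    # server-side counter that has not been updated yet.
    calls["n"] += 1
    if calls["n"] < 3:
        raise AssertionError("not ready yet")
    return "ok"

print(eventually(flaky_check))  # ok
```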
[spark] branch branch-3.5 updated: [SPARK-45167][CONNECT][PYTHON][3.5] Python client must call `release_all`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 60073f31831 [SPARK-45167][CONNECT][PYTHON][3.5] Python client must call `release_all` 60073f31831 is described below commit 60073f318313ab2329ea1504ef7538641433852e Author: Martin Grund AuthorDate: Tue Sep 19 08:32:21 2023 +0900 [SPARK-45167][CONNECT][PYTHON][3.5] Python client must call `release_all` ### What changes were proposed in this pull request? Cherry-pick of https://github.com/apache/spark/pull/42929 Previously the Python client would not call `release_all` after fetching all results and leaving the query dangling. The query would then be removed after the five minute timeout. This patch adds proper testing for calling release all and release until. In addition it fixes a test race condition where we would close the SparkSession which would in turn close the GRPC channel which might have dangling async release calls hanging. ### Why are the changes needed? Stability ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #42973 from grundprinzip/SPARK-45167-3.5. Authored-by: Martin Grund Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/client/core.py | 1 + python/pyspark/sql/connect/client/reattach.py | 37 +++- .../sql/tests/connect/client/test_client.py| 195 - 3 files changed, 226 insertions(+), 7 deletions(-) diff --git a/python/pyspark/sql/connect/client/core.py b/python/pyspark/sql/connect/client/core.py index 7b3299d123b..7b1aafbefeb 100644 --- a/python/pyspark/sql/connect/client/core.py +++ b/python/pyspark/sql/connect/client/core.py @@ -1005,6 +1005,7 @@ class SparkConnectClient(object): """ Close the channel. 
""" +ExecutePlanResponseReattachableIterator.shutdown() self._channel.close() self._closed = True diff --git a/python/pyspark/sql/connect/client/reattach.py b/python/pyspark/sql/connect/client/reattach.py index 7e1e722d5fd..e58864b965b 100644 --- a/python/pyspark/sql/connect/client/reattach.py +++ b/python/pyspark/sql/connect/client/reattach.py @@ -21,7 +21,9 @@ check_dependencies(__name__) import warnings import uuid from collections.abc import Generator -from typing import Optional, Dict, Any, Iterator, Iterable, Tuple, Callable, cast +from typing import Optional, Dict, Any, Iterator, Iterable, Tuple, Callable, cast, Type, ClassVar +from multiprocessing import RLock +from multiprocessing.synchronize import RLock as RLockBase from multiprocessing.pool import ThreadPool import os @@ -53,7 +55,30 @@ class ExecutePlanResponseReattachableIterator(Generator): ReleaseExecute RPCs that instruct the server to release responses that it already processed. """ -_release_thread_pool = ThreadPool(os.cpu_count() if os.cpu_count() else 8) +# Lock to manage the pool +_lock: ClassVar[RLockBase] = RLock() +_release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8) + +@classmethod +def shutdown(cls: Type["ExecutePlanResponseReattachableIterator"]) -> None: +""" +When the channel is closed, this method will be called before, to make sure all +outstanding calls are closed. +""" +with cls._lock: +if cls._release_thread_pool is not None: +cls._release_thread_pool.close() +cls._release_thread_pool.join() +cls._release_thread_pool = None + +@classmethod +def _initialize_pool_if_necessary(cls: Type["ExecutePlanResponseReattachableIterator"]) -> None: +""" +If the processing pool for the release calls is None, initialize the pool exactly once. 
+""" +with cls._lock: +if cls._release_thread_pool is None: +cls._release_thread_pool = ThreadPool(os.cpu_count() if os.cpu_count() else 8) def __init__( self, @@ -62,6 +87,7 @@ class ExecutePlanResponseReattachableIterator(Generator): retry_policy: Dict[str, Any], metadata: Iterable[Tuple[str, str]], ): +ExecutePlanResponseReattachableIterator._initialize_pool_if_necessary() self._request = request self._retry_policy = retry_policy if request.operation_id: @@ -111,7 +137,6 @@ class ExecutePlanResponseReattachableIterator(Generator): self._last_returned_response_id = ret.response_id if ret.HasField("result_complete"): -self._result_complete = True self._release_all() else:
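The `shutdown`/`_initialize_pool_if_necessary` pair in the patch above guards a class-level `ThreadPool` with a lock, so the pool can be drained when the channel closes and recreated exactly once on the next use. A minimal standalone sketch of that pattern, using `threading.RLock` as the later follow-up commit does (class and method names here are illustrative):

```python
import os
import threading
from multiprocessing.pool import ThreadPool

class ReleasePoolHolder:
    # A class-level lock guards the shared pool so that shutdown() and
    # lazy re-initialization cannot race across threads.
    _lock = threading.RLock()
    _pool = None

    @classmethod
    def _initialize_pool_if_necessary(cls):
        with cls._lock:
            if cls._pool is None:
                cls._pool = ThreadPool(os.cpu_count() or 8)

    @classmethod
    def shutdown(cls):
        # Called before the channel closes: drain outstanding work,
        # then drop the pool so a later use re-creates it.
        with cls._lock:
            if cls._pool is not None:
                cls._pool.close()
                cls._pool.join()
                cls._pool = None

ReleasePoolHolder._initialize_pool_if_necessary()
result = ReleasePoolHolder._pool.apply(lambda: 21 * 2)
ReleasePoolHolder.shutdown()
print(result)  # 42
```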
[spark] branch master updated: [SPARK-45133][CONNECT][TESTS][FOLLOWUP] Add test that queries transition to FINISHED
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8323c0c48de [SPARK-45133][CONNECT][TESTS][FOLLOWUP] Add test that queries transition to FINISHED 8323c0c48de is described below commit 8323c0c48de7f498ef2452059f737a167586b98d Author: Juliusz Sompolski AuthorDate: Tue Sep 19 08:31:15 2023 +0900 [SPARK-45133][CONNECT][TESTS][FOLLOWUP] Add test that queries transition to FINISHED ### What changes were proposed in this pull request? Add test checking that queries (also special case: local relations) transition to FINISHED state, even if the client does not consume the results. ### Why are the changes needed? Add test for SPARK-45133. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This adds tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42910 from juliuszsompolski/SPARK-45133-followup. 
Authored-by: Juliusz Sompolski Signed-off-by: Hyukjin Kwon --- .../spark/sql/connect/SparkConnectServerTest.scala | 29 - .../service/SparkConnectServiceE2ESuite.scala | 48 ++ 2 files changed, 76 insertions(+), 1 deletion(-) diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/SparkConnectServerTest.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/SparkConnectServerTest.scala index eddd1c6be72..7b02377f484 100644 --- a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/SparkConnectServerTest.scala +++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/SparkConnectServerTest.scala @@ -16,14 +16,19 @@ */ package org.apache.spark.sql.connect -import java.util.UUID +import java.util.{TimeZone, UUID} +import scala.reflect.runtime.universe.TypeTag + +import org.apache.arrow.memory.RootAllocator import org.scalatest.concurrent.{Eventually, TimeLimits} import org.scalatest.time.Span import org.scalatest.time.SpanSugar._ import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.ScalaReflection import org.apache.spark.sql.connect.client.{CloseableIterator, CustomSparkConnectBlockingStub, ExecutePlanResponseReattachableIterator, GrpcRetryHandler, SparkConnectClient, WrappedCloseableIterator} +import org.apache.spark.sql.connect.client.arrow.ArrowSerializer import org.apache.spark.sql.connect.common.config.ConnectCommon import org.apache.spark.sql.connect.config.Connect import org.apache.spark.sql.connect.dsl.MockRemoteSession @@ -43,6 +48,8 @@ trait SparkConnectServerTest extends SharedSparkSession { val eventuallyTimeout = 30.seconds + val allocator = new RootAllocator() + override def beforeAll(): Unit = { super.beforeAll() // Other suites using mocks leave a mess in the global executionManager, @@ -60,6 +67,7 @@ trait SparkConnectServerTest extends SharedSparkSession { override def afterAll(): Unit = { SparkConnectService.stop() +allocator.close() 
super.afterAll() } @@ -127,6 +135,19 @@ trait SparkConnectServerTest extends SharedSparkSession { proto.Plan.newBuilder().setRoot(dsl.sql(query)).build() } + protected def buildLocalRelation[A <: Product: TypeTag](data: Seq[A]) = { +val encoder = ScalaReflection.encoderFor[A] +val arrowData = + ArrowSerializer.serialize(data.iterator, encoder, allocator, TimeZone.getDefault.getID) +val localRelation = proto.LocalRelation + .newBuilder() + .setData(arrowData) + .setSchema(encoder.schema.json) + .build() +val relation = proto.Relation.newBuilder().setLocalRelation(localRelation).build() +proto.Plan.newBuilder().setRoot(relation).build() + } + protected def getReattachableIterator( stubIterator: CloseableIterator[proto.ExecutePlanResponse]) = { // This depends on the wrapping in CustomSparkConnectBlockingStub.executePlanReattachable: @@ -188,6 +209,12 @@ trait SparkConnectServerTest extends SharedSparkSession { executions.head } + protected def eventuallyGetExecutionHolder: ExecuteHolder = { +Eventually.eventually(timeout(eventuallyTimeout)) { + getExecutionHolder +} + } + protected def withClient(f: SparkConnectClient => Unit): Unit = { val client = SparkConnectClient .builder() diff --git a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectServiceE2ESuite.scala b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectServiceE2ESuite.scala new file mode 100644 index 000..14ecc9a2e95 --- /dev/null +++ b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/
[spark] branch master updated: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f83e5ec202b [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 f83e5ec202b is described below commit f83e5ec202be68d6640f9f9403d96a39ef993f82 Author: Haejoon Lee AuthorDate: Mon Sep 18 16:27:09 2023 -0700 [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0 ### What changes were proposed in this pull request? This PR proposes to support pandas 2.1.0 for PySpark. See [What's new in 2.1.0](https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html) for more detail. ### Why are the changes needed? We should follow the latest version of pandas. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing CI should passed with Pandas 2.1.0 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42793 from itholic/pandas_2.1.0. 
Lead-authored-by: Haejoon Lee Co-authored-by: Ruifeng Zheng Signed-off-by: Dongjoon Hyun --- dev/infra/Dockerfile | 4 +- .../source/migration_guide/pyspark_upgrade.rst | 3 + .../docs/source/reference/pyspark.pandas/frame.rst | 1 + python/pyspark/pandas/base.py | 2 - python/pyspark/pandas/frame.py | 143 +++-- python/pyspark/pandas/generic.py | 42 ++ python/pyspark/pandas/groupby.py | 50 ++- python/pyspark/pandas/indexes/base.py | 6 +- python/pyspark/pandas/indexes/datetimes.py | 18 +++ python/pyspark/pandas/indexes/timedelta.py | 7 + python/pyspark/pandas/namespace.py | 21 ++- python/pyspark/pandas/plot/matplotlib.py | 2 +- python/pyspark/pandas/series.py| 83 +++- python/pyspark/pandas/supported_api_gen.py | 2 +- .../pandas/tests/computation/test_corrwith.py | 5 +- python/pyspark/pandas/tests/test_stats.py | 31 - python/pyspark/pandas/typedef/typehints.py | 15 ++- python/pyspark/pandas/window.py| 4 + python/pyspark/sql/connect/session.py | 3 +- python/pyspark/sql/pandas/conversion.py| 16 ++- python/pyspark/sql/pandas/serializers.py | 8 +- python/pyspark/sql/pandas/types.py | 14 +- .../sql/tests/pandas/test_pandas_cogrouped_map.py | 2 +- 23 files changed, 415 insertions(+), 67 deletions(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 8e5a3cb7c05..d196d0e97c5 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -84,8 +84,8 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht # See more in SPARK-39735 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library" -RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib -RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' +RUN pypy3 -m pip install numpy 'pandas<=2.1.0' scipy coverage matplotlib +RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.1.0' scipy unittest-xml-reporting 
plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' # Add Python deps for Spark Connect. RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 'protobuf==3.20.3' 'googleapis-common-protos==1.56.4' diff --git a/python/docs/source/migration_guide/pyspark_upgrade.rst b/python/docs/source/migration_guide/pyspark_upgrade.rst index 3513f0a878e..d081275dc83 100644 --- a/python/docs/source/migration_guide/pyspark_upgrade.rst +++ b/python/docs/source/migration_guide/pyspark_upgrade.rst @@ -22,6 +22,7 @@ Upgrading PySpark Upgrading from PySpark 3.5 to 4.0 - +* In Spark 4.0, it is recommended to use Pandas version 2.0.0 or above with PySpark for optimal compatibility. * In Spark 4.0, the minimum supported version for Pandas has been raised from 1.0.5 to 1.4.4 in PySpark. * In Spark 4.0, the minimum supported version for Numpy has been raised from 1.15 to 1.21 in PySpark. * In Spark 4.0, ``Int64Index`` and ``Float64Index`` have been removed from pandas API on Spark, ``Index`` should be used directly. @@ -44,6 +45,8 @@ Upgrading from PySpark 3.5 to 4.0 * In Spark 4.0, ``squeeze`` parameter from ``ps.read_csv`` and ``ps.read_excel`` has been removed from pandas API on Spark. * In Spark 4.0, ``null_counts`` parameter from ``DataFrame.info`` has been removed from pandas API on Spark, use ``show_counts`` instea
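Raising the supported Pandas floor, as the migration-guide entries above describe, is typically enforced at import time by a version gate. A hypothetical sketch of such a check (PySpark's real helper, `require_minimum_pandas_version` in `pyspark.sql.pandas.utils`, is implemented differently):

```python
def require_minimum_version(installed: str, minimum: str, name: str = "pandas") -> None:
    # Compare dotted versions numerically, not lexically ("1.10" > "1.4").
    def parse(version: str):
        return tuple(int(part) for part in version.split(".")[:3])
    if parse(installed) < parse(minimum):
        raise ImportError(f"{name} >= {minimum} must be installed; found {installed}")

require_minimum_version("2.1.0", "1.4.4")  # new floor: passes
try:
    require_minimum_version("1.0.5", "1.4.4")  # old minimum: rejected
except ImportError as exc:
    print(exc)
```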
[spark] branch master updated: [SPARK-45188][SQL][DOCS] Update error messages related to parameterized `sql()`
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 981312284f0 [SPARK-45188][SQL][DOCS] Update error messages related to parameterized `sql()` 981312284f0 is described below commit 981312284f0776ca847c8d21411f74a72c639b22 Author: Max Gekk AuthorDate: Tue Sep 19 00:22:43 2023 +0300 [SPARK-45188][SQL][DOCS] Update error messages related to parameterized `sql()` ### What changes were proposed in this pull request? In the PR, I propose to update some error formats and comments regarding `sql()` parameters - maps, arrays and struct might be used as `sql()` parameters. New behaviour has been added by https://github.com/apache/spark/pull/42752. ### Why are the changes needed? To inform users about recent changes introduced by SPARK-45033. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suite: ``` $ build/sbt "core/testOnly *SparkThrowableSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42957 from MaxGekk/clean-ClientE2ETestSuite. 
Authored-by: Max Gekk Signed-off-by: Max Gekk --- .../src/main/resources/error/error-classes.json | 4 ++-- .../scala/org/apache/spark/sql/SparkSession.scala| 11 +++ docs/sql-error-conditions.md | 4 ++-- python/pyspark/pandas/sql_formatter.py | 3 ++- python/pyspark/sql/session.py| 3 ++- .../spark/sql/catalyst/analysis/parameters.scala | 14 +- .../scala/org/apache/spark/sql/SparkSession.scala| 20 ++-- 7 files changed, 34 insertions(+), 25 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 4740ed72f89..186e7b4640d 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -1892,7 +1892,7 @@ }, "INVALID_SQL_ARG" : { "message" : [ - "The argument of `sql()` is invalid. Consider to replace it by a SQL literal." + "The argument of `sql()` is invalid. Consider to replace it either by a SQL literal or by collection constructor functions such as `map()`, `array()`, `struct()`." ] }, "INVALID_SQL_SYNTAX" : { @@ -2768,7 +2768,7 @@ }, "UNBOUND_SQL_PARAMETER" : { "message" : [ - "Found the unbound parameter: . Please, fix `args` and provide a mapping of the parameter to a SQL literal." + "Found the unbound parameter: . Please, fix `args` and provide a mapping of the parameter to either a SQL literal or collection constructor functions such as `map()`, `array()`, `struct()`." 
], "sqlState" : "42P02" }, diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala index 8788e34893e..5aa8c5a2bd5 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -235,8 +235,9 @@ class SparkSession private[sql] ( * An array of Java/Scala objects that can be converted to SQL literal expressions. See https://spark.apache.org/docs/latest/sql-ref-datatypes.html";> Supported Data * Types for supported value types in Scala/Java. For example: 1, "Steven", - * LocalDate.of(2023, 4, 2). A value can be also a `Column` of literal expression, in that - * case it is taken as is. + * LocalDate.of(2023, 4, 2). A value can be also a `Column` of a literal or collection + * constructor functions such as `map()`, `array()`, `struct()`, in that case it is taken as + * is. * * @since 3.5.0 */ @@ -272,7 +273,8 @@ class SparkSession private[sql] ( * expressions. See https://spark.apache.org/docs/latest/sql-ref-datatypes.html";> * Supported Data Types for supported value types in Scala/Java. For example, map keys: * "rank", "name", "birthdate"; map values: 1, "Steven", LocalDate.of(2023, 4, 2). Map value - * can be also a `Column` of literal expression, in that case it is taken as is. + * can be also a `Column` of a literal or collection constructor functions such as `map()`, + * `array()`, `struct()`, in that case it is taken as is. * * @since 3.4.0 */ @@ -292,7 +294,8 @@ class SparkSession private[sql] ( * expressions. See https://spark.apache.org/docs/latest/sql-ref-datatypes.html";> * Supported Data Types for supported value types in Scala/Java. For example, map keys: * "rank", "
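To make the updated error messages concrete: a `sql()` parameter may now be bound either to a SQL literal or to a `map()`/`array()`/`struct()` constructor call. The toy binder below illustrates that idea in pure Python. It is emphatically not Spark's implementation (Spark binds parameters in the analyzed plan, not by string rewriting), and both function names are invented for the example:

```python
import re


def to_sql_expr(value):
    """Render a Python value as a SQL literal or a constructor call:
    scalars become literals, lists become array(...), dicts become
    map(key, value, ...) -- mirroring the error-message wording above."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (list, tuple)):
        return "array(" + ", ".join(to_sql_expr(v) for v in value) + ")"
    if isinstance(value, dict):
        parts = []
        for k, v in value.items():
            parts += [to_sql_expr(k), to_sql_expr(v)]
        return "map(" + ", ".join(parts) + ")"
    raise TypeError(f"INVALID_SQL_ARG: {type(value).__name__}")


def bind_named_params(query, args):
    """Substitute :name placeholders; an unknown name raises, echoing
    the UNBOUND_SQL_PARAMETER condition."""
    def repl(match):
        name = match.group(1)
        if name not in args:
            raise KeyError(f"UNBOUND_SQL_PARAMETER: :{name}")
        return to_sql_expr(args[name])
    return re.sub(r":(\w+)", repl, query)
```

In actual PySpark the equivalent call shape is `spark.sql("SELECT ... WHERE id = :id", args={"id": 7})`, with Spark itself performing the binding.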
[spark] branch branch-3.4 updated: [SPARK-45081][SQL][3.4] Encoders.bean does no longer work with read-only properties
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 66f5f964f82 [SPARK-45081][SQL][3.4] Encoders.bean does no longer work with read-only properties 66f5f964f82 is described below commit 66f5f964f8213e263e8aefb38a7e733753836995 Author: Giambattista Bloisi AuthorDate: Mon Sep 18 13:39:54 2023 -0700 [SPARK-45081][SQL][3.4] Encoders.bean does no longer work with read-only properties ### What changes were proposed in this pull request? This PR re-enables Encoders.bean to be called against beans having read-only properties, that is, properties that have only getters and no setter method. Beans with read-only properties are even used in internal tests. Setter methods of a Java bean encoder are stored within an Option wrapper because they are missing in case of read-only properties. When a Java bean has to be initialized, setter methods for the bean properties have to be called: this PR filters out read-only properties from that process. ### Why are the changes needed? The changes are required to avoid an exception being thrown by getting the value of a None option object. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? An additional regression test has been added ### Was this patch authored or co-authored using generative AI tooling? No hvanhovell this is 3.4 branch port of [PR-42829](https://github.com/apache/spark/pull/42829) Closes #42913 from gbloisi-openaire/SPARK-45081-branch-3.4.
Authored-by: Giambattista Bloisi Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/catalyst/ScalaReflection.scala | 4 +++- .../java/test/org/apache/spark/sql/JavaDatasetSuite.java | 16 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index 2e03f32a58d..b18613bdad3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -345,7 +345,9 @@ object ScalaReflection extends ScalaReflection { CreateExternalRow(convertedFields, enc.schema)) case JavaBeanEncoder(tag, fields) => - val setters = fields.map { f => + val setters = fields +.filter(_.writeMethod.isDefined) +.map { f => val newTypePath = walkedTypePath.recordField( f.enc.clsTag.runtimeClass.getName, f.name) diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java index 228b7855142..6a9ffef6991 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java @@ -1711,6 +1711,22 @@ public class JavaDatasetSuite implements Serializable { Assert.assertEquals(1, df.collectAsList().size()); } + public static class ReadOnlyPropertyBean implements Serializable { +public boolean isEmpty() { + return true; +} + } + + @Test + public void testReadOnlyPropertyBean() { +ReadOnlyPropertyBean bean = new ReadOnlyPropertyBean(); +List data = Arrays.asList(bean); +Dataset df = spark.createDataset(data, +Encoders.bean(ReadOnlyPropertyBean.class)); +Assert.assertEquals(1, df.schema().length()); +Assert.assertEquals(1, df.collectAsList().size()); + } + public class CircularReference1Bean implements Serializable { private 
CircularReference2Bean child; - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
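The fix above boils down to skipping fields whose `writeMethod` is absent when initializing the bean (`fields.filter(_.writeMethod.isDefined)` in the Scala diff). The same filtering idea, transplanted to Python properties purely for illustration — the class and function names here are invented, and Spark's real logic lives in `ScalaReflection`:

```python
import inspect


class Bean:
    """Toy stand-in for a Java bean with one read-write and one
    read-only property (the latter mirrors ReadOnlyPropertyBean.isEmpty)."""

    def __init__(self):
        self._name = None

    @property
    def name(self):          # read-write: has a setter
        return self._name

    @name.setter
    def name(self, value):
        self._name = value

    @property
    def empty(self):         # read-only: no setter defined
        return True


def writable_properties(cls):
    """Names of properties that have a setter -- the analogue of
    keeping only fields whose writeMethod is defined."""
    return [
        name
        for name, prop in inspect.getmembers(cls, lambda m: isinstance(m, property))
        if prop.fset is not None
    ]
```

Initialization code that only touches `writable_properties(Bean)` never trips over the read-only `empty`, which is exactly the failure mode the patch removes.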
[spark] branch master updated: [SPARK-45153][BUILD][PS] Rebalance testing time for `pyspark-pandas-connect-part1`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 29acdf755cd [SPARK-45153][BUILD][PS] Rebalance testing time for `pyspark-pandas-connect-part1` 29acdf755cd is described below commit 29acdf755cd0a5c88c5cc5c8c16947a5b8e840f9 Author: Haejoon Lee AuthorDate: Mon Sep 18 12:02:28 2023 -0700 [SPARK-45153][BUILD][PS] Rebalance testing time for `pyspark-pandas-connect-part1` ### What changes were proposed in this pull request? This PR proposes to rebalance the tests for `pyspark-pandas-connect-part1`. ### Why are the changes needed? We rebalance the CI by splitting slow tests into multiple parts, but `pyspark-pandas-connect-part1` takes almost an hour than other splitted `pyspark-pandas-connect-partx` tests as below: |pyspark-pandas-connect-part0|pyspark-pandas-connect-part1|pyspark-pandas-connect-part2| |--|--|--| |1h 51m|2h 55m|1h 54m| ### Does this PR introduce _any_ user-facing change? No, this PR is proposed to improve the build infra. ### How was this patch tested? We should manually check the CI from GitHub Actions after PR is opened. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42909 from itholic/SPARK-45153. 
Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- .github/workflows/build_and_test.yml | 2 ++ dev/sparktestsupport/modules.py | 46 dev/sparktestsupport/utils.py| 14 +-- 3 files changed, 40 insertions(+), 22 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 9c5d25d30af..39fd2796015 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -372,6 +372,8 @@ jobs: pyspark-pandas-connect-part1 - >- pyspark-pandas-connect-part2 + - >- +pyspark-pandas-connect-part3 env: MODULES_TO_TEST: ${{ matrix.modules }} HADOOP_PROFILE: ${{ inputs.hadoop }} diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 0a751052491..a3bfa288383 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -965,7 +965,6 @@ pyspark_pandas_connect_part0 = Module( "pyspark.pandas.tests.connect.test_parity_utils", "pyspark.pandas.tests.connect.test_parity_window", "pyspark.pandas.tests.connect.indexes.test_parity_base", -"pyspark.pandas.tests.connect.indexes.test_parity_datetime", "pyspark.pandas.tests.connect.indexes.test_parity_align", "pyspark.pandas.tests.connect.indexes.test_parity_indexing", "pyspark.pandas.tests.connect.indexes.test_parity_reindex", @@ -982,7 +981,10 @@ pyspark_pandas_connect_part0 = Module( "pyspark.pandas.tests.connect.computation.test_parity_describe", "pyspark.pandas.tests.connect.computation.test_parity_eval", "pyspark.pandas.tests.connect.computation.test_parity_melt", -"pyspark.pandas.tests.connect.computation.test_parity_pivot", +"pyspark.pandas.tests.connect.groupby.test_parity_stat", +"pyspark.pandas.tests.connect.frame.test_parity_attrs", +"pyspark.pandas.tests.connect.diff_frames_ops.test_parity_dot_frame", +"pyspark.pandas.tests.connect.diff_frames_ops.test_parity_dot_series", ], excluded_python_implementations=[ "PyPy" # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and @@ 
-999,7 +1001,6 @@ pyspark_pandas_connect_part1 = Module( ], python_test_goals=[ # pandas-on-Spark unittests -"pyspark.pandas.tests.connect.frame.test_parity_attrs", "pyspark.pandas.tests.connect.frame.test_parity_constructor", "pyspark.pandas.tests.connect.frame.test_parity_conversion", "pyspark.pandas.tests.connect.frame.test_parity_reindexing", @@ -1012,21 +1013,12 @@ pyspark_pandas_connect_part1 = Module( "pyspark.pandas.tests.connect.groupby.test_parity_aggregate", "pyspark.pandas.tests.connect.groupby.test_parity_apply_func", "pyspark.pandas.tests.connect.groupby.test_parity_cumulative", -"pyspark.pandas.tests.connect.groupby.test_parity_describe", -"pyspark.pandas.tests.connect.groupby.test_parity_groupby", -"pyspark.pandas.tests.connect.groupby.test_parity_head_tail", -"pyspark.pandas.tests.connect.groupby.test_parity_index", "pyspark.pandas.tests.connect.groupby.test_parity_missing_data", "pyspark.pandas.tests.connect.groupby.test_parity_split_apply", -"pyspark.pandas.tests.connect.groupby.test_parity_stat", "pyspark.pandas.tests.
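Spark's `dev/sparktestsupport/modules.py` rebalances by moving test names between hand-maintained part lists, as the diff shows. For intuition, the goal — equalizing per-part wall-clock time (2h 55m vs ~1h 50m in the table above) — is what the classic longest-processing-time heuristic automates. A small sketch; the module names and durations below are made up:

```python
import heapq


def balance(tests, parts):
    """Greedily assign (name, minutes) tests to `parts` buckets, always
    placing the next-longest test into the currently lightest bucket
    (the LPT heuristic for makespan balancing). Illustrative only."""
    heap = [(0.0, i) for i in range(parts)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(parts)]
    for name, minutes in sorted(tests, key=lambda t: -t[1]):
        load, i = heapq.heappop(heap)        # lightest bucket so far
        buckets[i].append(name)
        heapq.heappush(heap, (load + minutes, i))
    return buckets


tests = [("part_a", 50), ("part_b", 30), ("part_c", 20),
         ("part_d", 20), ("part_e", 10)]
print(balance(tests, 2))  # two buckets with near-equal total minutes
```

Hand-curated lists stay easier to review in CI config, which is presumably why the project moves entries manually instead of automating the split.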
[spark] branch branch-3.4 updated: [SPARK-45078][SQL][3.4] Fix `array_insert` ImplicitCastInputTypes not work
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 464811a243c [SPARK-45078][SQL][3.4] Fix `array_insert` ImplicitCastInputTypes not work 464811a243c is described below commit 464811a243cbf1502a26d4440c484cfce13d4ddc Author: Jia Fan AuthorDate: Mon Sep 18 11:55:23 2023 -0700 [SPARK-45078][SQL][3.4] Fix `array_insert` ImplicitCastInputTypes not work ### What changes were proposed in this pull request? This is a backport PR for https://github.com/apache/spark/pull/42951, to fix `array_insert` ImplicitCastInputTypes not work. ### Why are the changes needed? Fix error behavior in `array_insert` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42960 from Hisoka-X/arrayinsert-fix-3.4. 
Authored-by: Jia Fan Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/expressions/collectionOperations.scala | 1 - sql/core/src/test/resources/sql-tests/inputs/array.sql| 1 + sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out | 8 sql/core/src/test/resources/sql-tests/results/array.sql.out | 8 4 files changed, 17 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index fd4249f4776..629ae0499b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -4642,7 +4642,6 @@ case class ArrayInsert( } case (e1, e2, e3) => Seq.empty } -Seq.empty } override def checkInputDataTypes(): TypeCheckResult = { diff --git a/sql/core/src/test/resources/sql-tests/inputs/array.sql b/sql/core/src/test/resources/sql-tests/inputs/array.sql index 606ed14cbe0..b3834b2e816 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/array.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/array.sql @@ -141,6 +141,7 @@ select array_insert(array(1, 2, 3, NULL), cast(NULL as INT), 4); select array_insert(array(1, 2, 3, NULL), 4, cast(NULL as INT)); select array_insert(array(2, 3, NULL, 4), 5, 5); select array_insert(array(2, 3, NULL, 4), -5, 1); +select array_insert(array(1), 2, cast(2 as tinyint)); set spark.sql.legacy.negativeIndexInArrayInsert=true; select array_insert(array(1, 3, 4), -2, 2); diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out index 6d17904271d..2aa818a48ef 100644 --- a/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out @@ -659,6 +659,14 @@ 
struct> [1,2,3,null,4] +-- !query +select array_insert(array(1), 2, cast(2 as tinyint)) +-- !query schema +struct> +-- !query output +[1,2] + + -- !query set spark.sql.legacy.negativeIndexInArrayInsert=true -- !query schema diff --git a/sql/core/src/test/resources/sql-tests/results/array.sql.out b/sql/core/src/test/resources/sql-tests/results/array.sql.out index 8943c36ea42..322bff9a886 100644 --- a/sql/core/src/test/resources/sql-tests/results/array.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/array.sql.out @@ -540,6 +540,14 @@ struct> [1,2,3,null,4] +-- !query +select array_insert(array(1), 2, cast(2 as tinyint)) +-- !query schema +struct> +-- !query output +[1,2] + + -- !query set spark.sql.legacy.negativeIndexInArrayInsert=true -- !query schema
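For readers following the new test case `array_insert(array(1), 2, cast(2 as tinyint))` → `[1,2]`: `array_insert` positions are 1-based, and inserting past the end pads with nulls. A pure-Python sketch of just that positive-index behavior — negative positions and the `spark.sql.legacy.negativeIndexInArrayInsert` flag are deliberately left out, and this is not Spark's implementation:

```python
def array_insert(arr, pos, item):
    """Insert `item` at 1-based position `pos`, padding with None when
    pos lies beyond the end (sketch of the documented positive-index
    behavior of Spark's array_insert)."""
    if pos == 0:
        raise ValueError("array_insert position is 1-based; 0 is invalid")
    if pos < 0:
        raise NotImplementedError("negative positions omitted in this sketch")
    idx = pos - 1
    out = list(arr)
    if idx > len(out):
        out.extend([None] * (idx - len(out)))  # pad with nulls
    out.insert(idx, item)
    return out
```

Checked against the `.sql.out` results quoted above: `array_insert([1], 2, 2)` gives `[1, 2]` and `array_insert([2, 3, None, 4], 5, 5)` gives `[2, 3, None, 4, 5]`.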
[spark] branch master updated: [SPARK-45203][SQL][TEST] Join tests in KeyGroupedPartitioningSuite should use merge join hint
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new eadb591f37a [SPARK-45203][SQL][TEST] Join tests in KeyGroupedPartitioningSuite should use merge join hint eadb591f37a is described below commit eadb591f37a118096bab637e4b6ca913c2753a6b Author: Wenchen Fan AuthorDate: Mon Sep 18 11:50:16 2023 -0700 [SPARK-45203][SQL][TEST] Join tests in KeyGroupedPartitioningSuite should use merge join hint ### What changes were proposed in this pull request? It's more robust to use join hints to enforce sort-merge join in the tests, instead of setting configs which may be ineffective after more advanced optimizations in the future. ### Why are the changes needed? make tests more future-proof ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? no Closes #42983 from cloud-fan/minor. 
Authored-by: Wenchen Fan Signed-off-by: Dongjoon Hyun --- .../connector/KeyGroupedPartitioningSuite.scala| 246 - 1 file changed, 93 insertions(+), 153 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala index ffd1c8e31e9..4cb5457b66b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.connector import java.util.Collections +import org.apache.spark.SparkConf import org.apache.spark.sql.{DataFrame, Row} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.{Literal, TransformExpression} @@ -44,25 +45,9 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase { UnboundBucketFunction, UnboundTruncateFunction) - private var originalV2BucketingEnabled: Boolean = false - private var originalAutoBroadcastJoinThreshold: Long = -1 - - override def beforeAll(): Unit = { -super.beforeAll() -originalV2BucketingEnabled = conf.getConf(V2_BUCKETING_ENABLED) -conf.setConf(V2_BUCKETING_ENABLED, true) -originalAutoBroadcastJoinThreshold = conf.getConf(AUTO_BROADCASTJOIN_THRESHOLD) -conf.setConf(AUTO_BROADCASTJOIN_THRESHOLD, -1L) - } - - override def afterAll(): Unit = { -try { - super.afterAll() -} finally { - conf.setConf(V2_BUCKETING_ENABLED, originalV2BucketingEnabled) - conf.setConf(AUTO_BROADCASTJOIN_THRESHOLD, originalAutoBroadcastJoinThreshold) -} - } + override def sparkConf: SparkConf = super.sparkConf +.set(V2_BUCKETING_ENABLED, true) +.set(AUTO_BROADCASTJOIN_THRESHOLD, -1L) before { functions.foreach { f => @@ -261,6 +246,25 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase { .add("order_amount", DoubleType) .add("customer_id", LongType) + 
private def selectWithMergeJoinHint(t1: String, t2: String): String = { +s"SELECT /*+ MERGE($t1, $t2) */ " + } + + private def createJoinTestDF( + keys: Seq[(String, String)], + extraColumns: Seq[String] = Nil, + joinType: String = ""): DataFrame = { +val extraColList = if (extraColumns.isEmpty) "" else extraColumns.mkString(", ", ", ", "") +sql( + s""" + |${selectWithMergeJoinHint("i", "p")} + |id, name, i.price as purchase_price, p.price as sale_price $extraColList + |FROM testcat.ns.$items i $joinType JOIN testcat.ns.$purchases p + |ON ${keys.map(k => s"i.${k._1} = p.${k._2}").mkString(" AND ")} + |ORDER BY id, purchase_price, sale_price $extraColList + |""".stripMargin) + } + private def testWithCustomersAndOrders( customers_partitions: Array[Transform], orders_partitions: Array[Transform], @@ -273,9 +277,13 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase { sql(s"INSERT INTO testcat.ns.$orders VALUES " + s"(100.0, 1), (200.0, 1), (150.0, 2), (250.0, 2), (350.0, 2), (400.50, 3)") -val df = sql("SELECT customer_name, customer_age, order_amount " + -s"FROM testcat.ns.$customers c JOIN testcat.ns.$orders o " + -"ON c.customer_id = o.customer_id ORDER BY c.customer_id, order_amount") +val df = sql( + s""" +|${selectWithMergeJoinHint("c", "o")} +|customer_name, customer_age, order_amount +|FROM testcat.ns.$customers c JOIN testcat.ns.$orders o +|ON c.customer_id = o.customer_id ORDER BY c.customer_id, order_amou
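The helper added above centralizes the hint text so every test query enforces a sort-merge join regardless of future optimizer defaults. A compact Python analogue of `selectWithMergeJoinHint` plus the query assembly, for readers skimming the Scala diff — the table and column names are placeholders:

```python
def select_with_merge_join_hint(t1: str, t2: str) -> str:
    """SELECT prefix carrying a sort-merge join hint, as in the suite."""
    return f"SELECT /*+ MERGE({t1}, {t2}) */ "


def join_query(items: str, purchases: str, keys, join_type: str = "") -> str:
    """Assemble join SQL in the shape of the suite's createJoinTestDF:
    hinted SELECT, aliased price columns, configurable join type."""
    on = " AND ".join(f"i.{a} = p.{b}" for a, b in keys)
    join = f"{join_type} JOIN" if join_type else "JOIN"
    return (
        select_with_merge_join_hint("i", "p")
        + "id, name, i.price AS purchase_price, p.price AS sale_price "
        + f"FROM {items} i {join} {purchases} p ON {on} "
        + "ORDER BY id, purchase_price, sale_price"
    )
```

Embedding the `/*+ MERGE(i, p) */` hint in the text keeps the intent visible in the query itself, unlike the old approach of flipping `AUTO_BROADCASTJOIN_THRESHOLD` in `beforeAll`/`afterAll`.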
[GitHub] [spark-website] allisonwang-db commented on a diff in pull request #477: [SPARK-45195] Update examples with docker official image
allisonwang-db commented on code in PR #477: URL: https://github.com/apache/spark-website/pull/477#discussion_r1329098596 ## index.md: ## @@ -88,11 +88,13 @@ navigation: Run now -Installing with 'pip' +Install with 'pip' or try offical image $ pip install pyspark $ pyspark +$ +$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark Review Comment: @zhengruifeng I feel this "try official image" is a bit unclear. Maybe we can reformat to this? ``` Install with pip Install with docker ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (8d363c6e2c8 -> 0dda75f824d)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8d363c6e2c8 [SPARK-45196][PYTHON][DOCS] Refine docstring of `array/array_contains/arrays_overlap` add 0dda75f824d [SPARK-45137][CONNECT] Support map/array parameters in parameterized `sql()` No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/SparkSession.scala | 6 +- .../org/apache/spark/sql/ClientE2ETestSuite.scala | 7 + .../src/main/protobuf/spark/connect/commands.proto | 12 +- .../main/protobuf/spark/connect/relations.proto| 12 +- .../sql/connect/planner/SparkConnectPlanner.scala | 26 +- python/pyspark/sql/connect/proto/commands_pb2.py | 164 +++-- python/pyspark/sql/connect/proto/commands_pb2.pyi | 60 - python/pyspark/sql/connect/proto/relations_pb2.py | 268 +++-- python/pyspark/sql/connect/proto/relations_pb2.pyi | 60 - 9 files changed, 396 insertions(+), 219 deletions(-)
[spark] branch master updated: [SPARK-45196][PYTHON][DOCS] Refine docstring of `array/array_contains/arrays_overlap`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8d363c6e2c8 [SPARK-45196][PYTHON][DOCS] Refine docstring of `array/array_contains/arrays_overlap` 8d363c6e2c8 is described below commit 8d363c6e2c84c0dbbb51b9376bb4c2a4d1be3acf Author: yangjie01 AuthorDate: Mon Sep 18 09:08:38 2023 -0700 [SPARK-45196][PYTHON][DOCS] Refine docstring of `array/array_contains/arrays_overlap` ### What changes were proposed in this pull request? This pr refine docstring of `array/array_contains/arrays_overlap` and add some new examples. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #42972 from LuciferYang/collect-1. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/functions.py | 191 ++-- 1 file changed, 164 insertions(+), 27 deletions(-) diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 3c65e8d9162..54bd330ebc0 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -11686,7 +11686,8 @@ def array(__cols: Union[List["ColumnOrName_"], Tuple["ColumnOrName_", ...]]) -> def array( *cols: Union["ColumnOrName", Union[List["ColumnOrName_"], Tuple["ColumnOrName_", ...]]] ) -> Column: -"""Creates a new array column. +""" +Collection function: Creates a new array column from the input columns or column names. .. versionadded:: 1.4.0 @@ -11696,25 +11697,63 @@ def array( Parameters -- cols : :class:`~pyspark.sql.Column` or str -column names or :class:`~pyspark.sql.Column`\\s that have -the same data type. +Column names or :class:`~pyspark.sql.Column` objects that have the same data type. 
Returns --- :class:`~pyspark.sql.Column` -a column of array type. +A new Column of array type, where each value is an array containing the corresponding values +from the input columns. Examples +Example 1: Basic usage of array function with column names. + +>>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age")) ->>> df.select(array('age', 'age').alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] ->>> df.select(array([df.age, df.age]).alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] ->>> df.select(array('age', 'age').alias("col")).printSchema() -root - |-- col: array (nullable = false) - ||-- element: long (containsNull = true) +>>> df.select(sf.array('name', 'age').alias("arr")).show() ++--+ +| arr| ++--+ +|[Alice, 2]| +| [Bob, 5]| ++--+ + +Example 2: Usage of array function with Column objects. + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age")) +>>> df.select(sf.array(df.name, df.age).alias("arr")).show() ++--+ +| arr| ++--+ +|[Alice, 2]| +| [Bob, 5]| ++--+ + +Example 3: Single argument as list of column names. + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age")) +>>> df.select(sf.array(['name', 'age']).alias("arr")).show() ++--+ +| arr| ++--+ +|[Alice, 2]| +| [Bob, 5]| ++--+ + +Example 4: array function with a column containing null values. 
+ +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([("Alice", None), ("Bob", 5)], ("name", "age")) +>>> df.select(sf.array('name', 'age').alias("arr")).show() ++-+ +| arr| ++-+ +|[Alice, NULL]| +| [Bob, 5]| ++-+ """ if len(cols) == 1 and isinstance(cols[0], (list, set)): cols = cols[0] # type: ignore[assignment] @@ -11724,8 +11763,9 @@ def array( @_try_remote_functions def array_contains(col: "ColumnOrName", value: Any) -> Column: """ -Collection function: returns null if the array is null, true if the array contains the -given value, and false otherwise. +Collection function: This function returns a boolean indicating whether the array +contains the given value, returning null if the array is null, true if the array +contains the given value, and false otherwise. .. versionadded:: 1.5.0 @@ -11735,22 +11775
[spark] branch master updated: [SPARK-43628][SPARK-43629][CONNECT][PS][TESTS] Clear message for JVM dependent tests
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 02bbd2867be [SPARK-43628][SPARK-43629][CONNECT][PS][TESTS] Clear message for JVM dependent tests 02bbd2867be is described below commit 02bbd2867be52187239074f13617ef0ba2acf09e Author: Haejoon Lee AuthorDate: Mon Sep 18 09:02:52 2023 -0700 [SPARK-43628][SPARK-43629][CONNECT][PS][TESTS] Clear message for JVM dependent tests ### What changes were proposed in this pull request? This PR proposes to correct the message for JVM only tests from Spark Connect, and enable the tests when possible to workaround without JVM features. ### Why are the changes needed? Among JVM-dependent tests, there are tests that can be replaced without using JVM features, while there are some edge tests that can only be tested using JVM features. We need to be clearer about why these cannot be tested. ### Does this PR introduce _any_ user-facing change? No. it's test-only ### How was this patch tested? Updated the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42955 from itholic/connect_jvm_tests. 
Authored-by: Haejoon Lee Signed-off-by: Dongjoon Hyun --- .../pandas/tests/connect/groupby/test_parity_apply_func.py| 2 +- .../tests/connect/plot/test_parity_frame_plot_matplotlib.py | 2 +- .../pandas/tests/connect/plot/test_parity_frame_plot_plotly.py| 2 +- .../tests/connect/plot/test_parity_series_plot_matplotlib.py | 2 +- .../pandas/tests/connect/plot/test_parity_series_plot_plotly.py | 2 +- python/pyspark/pandas/tests/connect/test_parity_frame_spark.py| 8 6 files changed, 9 insertions(+), 9 deletions(-) diff --git a/python/pyspark/pandas/tests/connect/groupby/test_parity_apply_func.py b/python/pyspark/pandas/tests/connect/groupby/test_parity_apply_func.py index 2c0d49ebb5c..4daf84bd1b5 100644 --- a/python/pyspark/pandas/tests/connect/groupby/test_parity_apply_func.py +++ b/python/pyspark/pandas/tests/connect/groupby/test_parity_apply_func.py @@ -24,7 +24,7 @@ from pyspark.testing.pandasutils import PandasOnSparkTestUtils class GroupbyParityApplyFuncTests( GroupbyApplyFuncMixin, PandasOnSparkTestUtils, ReusedConnectTestCase ): -@unittest.skip("TODO(SPARK-43629): Enable RDD dependent tests with Spark Connect.") +@unittest.skip("Test depends on SparkContext which is not supported from Spark Connect.") def test_apply_with_side_effect(self): super().test_apply_with_side_effect() diff --git a/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_matplotlib.py b/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_matplotlib.py index 98da8858a03..3f615326f2b 100644 --- a/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_matplotlib.py +++ b/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_matplotlib.py @@ -28,7 +28,7 @@ class DataFramePlotMatplotlibParityTests( def test_hist_plot(self): super().test_hist_plot() -@unittest.skip("TODO(SPARK-43629): Enable RDD with Spark Connect.") +@unittest.skip("TODO(SPARK-44372): Enable KernelDensity within Spark Connect.") def test_kde_plot(self): super().test_kde_plot() diff --git 
a/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_plotly.py b/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_plotly.py index 7a3efee06df..16b97d6814e 100644 --- a/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_plotly.py +++ b/python/pyspark/pandas/tests/connect/plot/test_parity_frame_plot_plotly.py @@ -32,7 +32,7 @@ class DataFramePlotPlotlyParityTests( def test_hist_plot(self): super().test_hist_plot() -@unittest.skip("TODO(SPARK-43629): Enable RDD with Spark Connect.") +@unittest.skip("TODO(SPARK-44372): Enable KernelDensity within Spark Connect.") def test_kde_plot(self): super().test_kde_plot() diff --git a/python/pyspark/pandas/tests/connect/plot/test_parity_series_plot_matplotlib.py b/python/pyspark/pandas/tests/connect/plot/test_parity_series_plot_matplotlib.py index 975d78f17c6..9e264e76229 100644 --- a/python/pyspark/pandas/tests/connect/plot/test_parity_series_plot_matplotlib.py +++ b/python/pyspark/pandas/tests/connect/plot/test_parity_series_plot_matplotlib.py @@ -32,7 +32,7 @@ class SeriesPlotMatplotlibParityTests( def test_hist_plot(self): super().test_hist_plot() -@unittest.skip("TODO(SPARK-43629): Enable RDD with Spark Connect.") +@unittest.skip("TODO(SPARK-44372): Enable KernelDensity within Spark Connect.") def test_kde_plot(self): super().test_kde_plot() diff --git a/python/pyspark/
[spark] branch master updated (82d0823c2f9 -> 1ae8e018164)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 82d0823c2f9 [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Move `from_xml`/`schema_of_xml` to `Xml Functions` add 1ae8e018164 [SPARK-45056][CONNECT][TESTS] Change the tests related to Python in `SparkConnectSessionHodlerSuite` to assume shouldTestPandasUDFs No new revisions were added by this update. Summary of changes: .../spark/sql/connect/service/SparkConnectSessionHodlerSuite.scala| 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated: [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Move `from_xml`/`schema_of_xml` to `Xml Functions`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 82d0823c2f9 [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Move `from_xml`/`schema_of_xml` to `Xml Functions`

82d0823c2f9 is described below

commit 82d0823c2f9b5b1d60c8799cf6bf7924c24acbe1
Author: Ruifeng Zheng
AuthorDate: Mon Sep 18 08:58:25 2023 -0700

    [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Move `from_xml`/`schema_of_xml` to `Xml Functions`

    ### What changes were proposed in this pull request?
    Move `from_xml`/`schema_of_xml` to `Xml Functions`

    ### Why are the changes needed?
    there is a dedicated function group for xml functions

    ### Does this PR introduce _any_ user-facing change?
    yes

    ### How was this patch tested?
    ci

    ### Was this patch authored or co-authored using generative AI tooling?
    no

    Closes #42977 from zhengruifeng/update_group.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Dongjoon Hyun
---
 python/docs/source/reference/pyspark.sql/functions.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst
index 6e00c859da4..e21c05343da 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -264,8 +264,6 @@ Collection Functions
     str_to_map
     to_csv
     try_element_at
-    from_xml
-    schema_of_xml
 
 
 Partition Transformation Functions
@@ -531,6 +529,8 @@ Xml Functions
 .. autosummary::
     :toctree: api/
 
+    from_xml
+    schema_of_xml
     xpath
     xpath_boolean
     xpath_double
[spark] branch master updated: [SPARK-45148][BUILD] Upgrade scalatest related dependencies to the 3.2.17 series
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c1c58698d3d [SPARK-45148][BUILD] Upgrade scalatest related dependencies to the 3.2.17 series

c1c58698d3d is described below

commit c1c58698d3d6b1447045fad592f8dfb0395989d1
Author: yangjie01
AuthorDate: Mon Sep 18 10:01:47 2023 -0500

    [SPARK-45148][BUILD] Upgrade scalatest related dependencies to the 3.2.17 series

    ### What changes were proposed in this pull request?
    This PR aims to upgrade the `scalatest`-related test dependencies to 3.2.17:
    - scalatest: upgrade scalatest to 3.2.17
    - scalatestplus
      - scalacheck: upgrade `scalacheck-1-17` to 3.2.17.0
      - mockito: upgrade `mockito-4-11` to 3.2.17.0
      - selenium: upgrade `selenium-4-12` to 3.2.17.0, `selenium-java` to 4.12.1, `htmlunit-driver` to 4.12.0, and byte-buddy and byte-buddy-agent to 1.14.5

    ### Why are the changes needed?
    The release notes are as follows:
    - scalatest: https://github.com/scalatest/scalatest/releases/tag/release-3.2.17
    - scalatestplus
      - scalacheck-1-17: https://github.com/scalatest/scalatestplus-scalacheck/releases/tag/release-3.2.17.0-for-scalacheck-1.17
      - mockito-4-11: https://github.com/scalatest/scalatestplus-mockito/releases/tag/release-3.2.17.0-for-mockito-4.11
      - selenium-4-12: https://github.com/scalatest/scalatestplus-selenium/releases/tag/release-3.2.17.0-for-selenium-4.12

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    - Pass GitHub Actions
    - Manual test:
      - ChromeUISeleniumSuite
      - RocksDBBackendChromeUIHistoryServerSuite

    ```
    build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/Users/yangjie01/Tools/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.ui.ChromeUISeleniumSuite"
    build/sbt -Dguava.version=32.1.2-jre -Dspark.test.webdriver.chrome.driver=/Users/yangjie01/Tools/chromedriver -Dtest.default.exclude.tags="" -Phive -Phive-thriftserver "core/testOnly org.apache.spark.deploy.history.RocksDBBackendChromeUIHistoryServerSuite"
    ```

    ```
    [info] ChromeUISeleniumSuite:
    [info] - SPARK-31534: text for tooltip should be escaped (1 second, 809 milliseconds)
    [info] - SPARK-31882: Link URL for Stage DAGs should not depend on paged table. (604 milliseconds)
    [info] - SPARK-31886: Color barrier execution mode RDD correctly (252 milliseconds)
    [info] - Search text for paged tables should not be saved (1 second, 309 milliseconds)
    [info] Run completed in 6 seconds, 116 milliseconds.
    [info] Total number of tests run: 4
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```

    ```
    [info] RocksDBBackendChromeUIHistoryServerSuite:
    [info] - ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase) (1 second, 615 milliseconds)
    [info] Run completed in 5 seconds, 130 milliseconds.
    [info] Total number of tests run: 1
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 27 s, completed 2023-9-14 11:29:27
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #42906 from LuciferYang/SPARK-45148.
Lead-authored-by: yangjie01 Co-authored-by: YangJie Signed-off-by: Sean Owen --- pom.xml | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/pom.xml b/pom.xml index 779f9e64f1d..971cb07ea40 100644 --- a/pom.xml +++ b/pom.xml @@ -214,8 +214,8 @@ 4.9.3 1.1 -4.9.1 -4.9.1 +4.12.1 +4.12.0 2.70.0 3.1.0 1.1.0 @@ -413,7 +413,7 @@ org.scalatestplus - selenium-4-9_${scala.binary.version} + selenium-4-12_${scala.binary.version} test @@ -1137,25 +1137,25 @@ org.scalatest scalatest_${scala.binary.version} -3.2.16 +3.2.17 test org.scalatestplus scalacheck-1-17_${scala.binary.version} -3.2.16.0 +3.2.17.0 test org.scalatestplus mockito-4-11_${scala.binary.version} -3.2.16.0 +3.2.17.0 test org.scalatestplus -selenium-4-9_${scala.binary.version} -3.2.16.0 +selenium-4-12_${scala.binary.version} +3.2.17.0 test @@ -1173,13 +1173,13 @@ net.bytebuddy byte-buddy -1.14.4 +1.1
[spark-website] branch asf-site updated: [SPARK-45195] Update examples with docker official image
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 6b10f7fd85 [SPARK-45195] Update examples with docker official image

6b10f7fd85 is described below

commit 6b10f7fd85327f97cc12bede9ce5c60a744d9063
Author: Ruifeng Zheng
AuthorDate: Mon Sep 18 07:31:46 2023 -0500

    [SPARK-45195] Update examples with docker official image

    1. add `docker run` commands for PySpark and SparkR;
    2. switch to the docker official image for SQL, Scala and Java; refer to https://hub.docker.com/_/spark

    also manually checked all the commands, e.g.:

    ```
    ruifeng.zhengx:~$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
    Python 3.8.10 (default, May 26 2023, 14:05:08)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    23/09/18 06:02:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
          /_/

    Using Python version 3.8.10 (default, May 26 2023 14:05:08)
    Spark context Web UI available at http://4861f70118ab:4040
    Spark context available as 'sc' (master = local[*], app id = local-1695016951087).
    SparkSession available as 'spark'.
    >>> spark.range(0, 10).show()
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+
    ```

Author: Ruifeng Zheng

Closes #477 from zhengruifeng/offical_image.
---
 index.md        | 12 +++-
 site/index.html | 12 +++-
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/index.md b/index.md
index a41e4c9e81..ada6242742 100644
--- a/index.md
+++ b/index.md
@@ -88,11 +88,13 @@ navigation:
 Run now
 
-Installing with 'pip'
+Install with 'pip' or try offical image
 
 $ pip install pyspark
 $ pyspark
+$
+$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
 
 Run now
 
-$ docker run -it --rm apache/spark /opt/spark/bin/spark-sql
+$ docker run -it --rm spark /opt/spark/bin/spark-sql
 spark-sql>
 
@@ -175,7 +177,7 @@ FROM json.`logs.json`
 Run now
 
-$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell
+$ docker run -it --rm spark /opt/spark/bin/spark-shell
 scala>
 
@@ -193,7 +195,7 @@ df.where("age > 21")
 Run now
 
-$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell
+$ docker run -it --rm spark /opt/spark/bin/spark-shell
 scala>
 
@@ -210,7 +212,7 @@ df.where("age > 21")
 Run now
 
-$ SPARK-HOME/bin/sparkR
+$ docker run -it --rm spark:r /opt/spark/bin/sparkR
 >
 
diff --git a/site/index.html b/site/index.html
index e1b0b7e416..3ccc7104ce 100644
--- a/site/index.html
+++ b/site/index.html
@@ -213,11 +213,13 @@
 Run now
 
-Installing with 'pip'
+Install with 'pip' or try offical image
 
 $ pip install pyspark
 $ pyspark
+$
+$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
 
@@ -273,7 +275,7 @@
 Run now
 
-$ docker run -it --rm apache/spark /opt/spark/bin/spark-sql
+$ docker run -it --rm spark /opt/spark/bin/spark-sql
 spark-sql>
 
@@ -293,7 +295,7 @@
 Run now
 
-$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell
+$ docker run -it --rm spark /opt/spark/bin/spark-shell
 sc
[GitHub] [spark-website] srowen closed pull request #477: [SPARK-45195] Update examples with docker official image
srowen closed pull request #477: [SPARK-45195] Update examples with docker official image
URL: https://github.com/apache/spark-website/pull/477

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark] branch master updated: [SPARK-45197][CORE] Make `StandaloneRestServer` add `JavaModuleOptions` to drivers
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 360adc2250b [SPARK-45197][CORE] Make `StandaloneRestServer` add `JavaModuleOptions` to drivers

360adc2250b is described below

commit 360adc2250bccdb0fbe559dcd1fc4b6b4c7c1d7a
Author: Dongjoon Hyun
AuthorDate: Mon Sep 18 04:10:22 2023 -0700

    [SPARK-45197][CORE] Make `StandaloneRestServer` add `JavaModuleOptions` to drivers

    ### What changes were proposed in this pull request?
    This PR aims to make `StandaloneRestServer` add `JavaModuleOptions` to drivers by default.

    ### Why are the changes needed?
    Since Apache Spark 3.3.0 (SPARK-36796, #34153), `SparkContext` adds `JavaModuleOptions` by default. We had better add `JavaModuleOptions` when `StandaloneRestServer` receives submissions via the REST API, too. Otherwise, it fails like the following if the users don't set it manually.

    **SUBMISSION**
    ```bash
    $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" sbin/start-master.sh
    $ curl -s -k -XPOST http://yourserver:6066/v1/submissions/create \
      --header "Content-Type:application/json;charset=UTF-8" \
      --data '{
        "appResource": "",
        "sparkProperties": {
          "spark.master": "local[2]",
          "spark.app.name": "Test 1",
          "spark.submit.deployMode": "cluster",
          "spark.jars": "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/examples/jars/spark-examples_2.12-3.5.0.jar"
        },
        "clientSparkVersion": "",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "environmentVariables": {},
        "action": "CreateSubmissionRequest",
        "appArgs": []
      }'
    ```

    **DRIVER `stderr` LOG**
    ```
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    ...
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
    Caused by: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module 0x6d7a93c9) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module 0x6d7a93c9
    ```

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass the CIs with the newly added test case.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #42975 from dongjoon-hyun/SPARK-45197.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/deploy/rest/StandaloneRestServer.scala    | 14 +-
 .../spark/deploy/rest/StandaloneRestSubmitSuite.scala      | 12 
 .../java/org/apache/spark/launcher/JavaModuleOptions.java  | 8 
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala
index c060ef9da8c..a298e4f6dbf 100644
--- a/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala
@@ -24,7 +24,7 @@
 import org.apache.spark.{SPARK_VERSION => sparkVersion, SparkConf}
 import org.apache.spark.deploy.{Command, DeployMessages, DriverDescription}
 import org.apache.spark.deploy.ClientArguments._
 import org.apache.spark.internal.config
-import org.apache.spark.launcher.SparkLauncher
+import org.apache.spark.launcher.{JavaModuleOptions, SparkLauncher}
 import org.apache.spark.resource.ResourceUtils
 import org.apache.spark.rpc.RpcEndpointRef
 import org.apache.spark.util.Utils
@@ -124,7 +124,10 @@ private[rest] class StandaloneSubmitRequestServlet(
    * fields used by python applications since python is not supported in standalone
    * cluster mode yet.
   */
-  private def buildDriverDescription(request: CreateSubmissionRequest): DriverDescription = {
+  private[rest] def buildDriverDescription(
+      request: CreateSubmissionRequest,
+      masterUrl: String,
+      masterRestPort: Int): DriverDescription = {
     // Required fields, including the main class because python is not yet supported
     val appResource = Option(request.appResource).getOrElse {
       throw new SubmitRestMissingFieldException("Application jar is missing.")
@@ -149,7 +152,6 @@ private[rest] class StandaloneSubmitRequestServlet(
     // the driver.
     val masters = sparkProperties.get("spark.master")
     val (_, masterPort) = Utils.extractHostPortFromSparkUrl(masterUrl)
-    val masterRestPort = this.conf.get(config.MASTER_REST_SERVER_PORT)
     val updatedMasters = masters.map(
       _.replace(s":$masterRes
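The JSON payload shown in the SPARK-45197 submission example can also be assembled programmatically before POSTing it to the master's REST port. A minimal sketch in Python (the helper name is mine, not a Spark API; only the field names from the curl example above are assumed):

```python
import json

def create_submission_request(app_resource, main_class, spark_properties, app_args=()):
    # Mirrors the body POSTed to /v1/submissions/create in the example above.
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_resource,
        "clientSparkVersion": "",
        "mainClass": main_class,
        "sparkProperties": dict(spark_properties),
        "environmentVariables": {},
        "appArgs": list(app_args),
    }

payload = create_submission_request(
    app_resource="",
    main_class="org.apache.spark.examples.SparkPi",
    spark_properties={"spark.master": "local[2]", "spark.submit.deployMode": "cluster"},
)
# Serialize for the HTTP body; any HTTP client can send it with
# Content-Type: application/json;charset=UTF-8.
body = json.dumps(payload)
print(payload["action"])
```

Building the dict first makes it easy to validate required fields (`appResource`, `mainClass`) before the request ever leaves the client.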
[spark] branch master updated: [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Add from_xml and schema_of_xml into PySpark documentation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5b89bc156cb [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Add from_xml and schema_of_xml into PySpark documentation

5b89bc156cb is described below

commit 5b89bc156cb50c3ddc6bf4174e481e4f2a3f4491
Author: Hyukjin Kwon
AuthorDate: Mon Sep 18 02:02:26 2023 -0700

    [SPARK-44788][PYTHON][DOCS][FOLLOW-UP] Add from_xml and schema_of_xml into PySpark documentation

    ### What changes were proposed in this pull request?
    This PR adds `from_xml` and `schema_of_xml` into PySpark documentation

    ### Why are the changes needed?
    For users to know how to use them.

    ### Does this PR introduce _any_ user-facing change?
    Yes, it adds both `from_xml` and `schema_of_xml` into Python documentation

    ### How was this patch tested?
    Manually tested.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #42974 from HyukjinKwon/SPARK-44788-followup.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Dongjoon Hyun
---
 python/docs/source/reference/pyspark.sql/functions.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst b/python/docs/source/reference/pyspark.sql/functions.rst
index 6896efd4fb4..6e00c859da4 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -264,6 +264,8 @@ Collection Functions
     str_to_map
     to_csv
     try_element_at
+    from_xml
+    schema_of_xml
 
 
 Partition Transformation Functions
[spark] branch master updated: [SPARK-45193][PS][CONNECT][TESTS] Refactor `test_mode` to be compatible with Spark Connect
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5db824aa6f9 [SPARK-45193][PS][CONNECT][TESTS] Refactor `test_mode` to be compatible with Spark Connect

5db824aa6f9 is described below

commit 5db824aa6f9203b24668c4e0135fab50d20831da
Author: Ruifeng Zheng
AuthorDate: Mon Sep 18 17:00:35 2023 +0800

    [SPARK-45193][PS][CONNECT][TESTS] Refactor `test_mode` to be compatible with Spark Connect

    ### What changes were proposed in this pull request?
    Refactor `test_mode` to be compatible with Spark Connect

    ### Why are the changes needed?
    for test parity

    ### Does this PR introduce _any_ user-facing change?
    No, test-only

    ### How was this patch tested?
    CI, and manually check:

    ```
    In [5]: import pandas as pd

    In [6]: def func(iterator):
       ...:     for pdf in iterator:
       ...:         if len(pdf) > 0:
       ...:             if pdf["partition"][0] == 3:
       ...:                 yield pd.DataFrame({"num" : ["0", "1", "2", "3", "4"] })
       ...:             else:
       ...:                 yield pd.DataFrame({"num" : ["3", "3", "3", "3", "4"]} )
       ...:

    In [7]: df = spark.range(0, 4, 1, 4).select(sf.spark_partition_id().alias("partition"))

    In [8]: df.mapInPandas(func, "num string").withColumn("p", sf.spark_partition_id()).show()
    /Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:239: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
    /Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:239: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
    /Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:239: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
    /Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:239: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
    +---+---+
    |num|  p|
    +---+---+
    |  3|  0|
    |  3|  0|
    |  3|  0|
    |  3|  0|
    |  4|  0|
    |  3|  1|
    |  3|  1|
    |  3|  1|
    |  3|  1|
    |  4|  1|
    |  3|  2|
    |  3|  2|
    |  3|  2|
    |  3|  2|
    |  4|  2|
    |  0|  3|
    |  1|  3|
    |  2|  3|
    |  3|  3|
    |  4|  3|
    +---+---+
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #42970 from zhengruifeng/py_test_mode.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 .../pandas/tests/computation/test_compute.py       | 44 +-
 .../connect/computation/test_parity_compute.py     | 30 +--
 2 files changed, 35 insertions(+), 39 deletions(-)

diff --git a/python/pyspark/pandas/tests/computation/test_compute.py b/python/pyspark/pandas/tests/computation/test_compute.py
index 9a29cb236a8..dc145601fca 100644
--- a/python/pyspark/pandas/tests/computation/test_compute.py
+++ b/python/pyspark/pandas/tests/computation/test_compute.py
@@ -15,11 +15,11 @@
 # limitations under the License.
 #
 import unittest
-from distutils.version import LooseVersion
 
 import numpy as np
 import pandas as pd
 
+from pyspark.sql import functions as sf
 from pyspark import pandas as ps
 from pyspark.testing.pandasutils import ComparisonTestBase
 from pyspark.testing.sqlutils import SQLTestUtils
@@ -101,16 +101,40 @@ class FrameComputeMixin:
         with self.assertRaises(ValueError):
             psdf.mode(axis=2)
 
-        def f(index, iterator):
-            return ["3", "3", "3", "3", "4"] if index == 3 else ["0", "1", "2", "3", "4"]
+        def func(iterator):
+            for pdf in iterator:
+                if len(pdf) > 0:
+                    if pdf["partition"][0] == 3:
+                        yield pd.DataFrame(
+                            {
+                                "num": [
+                                    "3",
+                                    "3",
+                                    "3",
+                                    "3",
+                                    "4",
+                                ]
+                            }
+                        )
+                    else:
+                        yield pd.DataFrame(
+                            {
+                                "num": [
+                                    "0",
+
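The refactor replaces an index-based callable with a `mapInPandas`-style function that receives an iterator of batches. Stripped of Spark and pandas, the control flow being exercised in the manual check above looks roughly like this (batches are modeled as plain lists of row-dicts; all names here are illustrative, not Spark APIs):

```python
def func(iterator):
    # Each element stands in for one pandas DataFrame batch; a "partition"
    # column on the first row tells the function which partition it came from.
    for batch in iterator:
        if batch and batch[0]["partition"] == 3:
            # Partition 3 yields the distinct values, as in the demo above.
            yield [{"num": n} for n in ["0", "1", "2", "3", "4"]]
        else:
            yield [{"num": n} for n in ["3", "3", "3", "3", "4"]]

# Four single-row "partitions", mirroring spark.range(0, 4, 1, 4).
batches = [[{"partition": p}] for p in range(4)]
out = [row["num"] for batch in func(iter(batches)) for row in batch]
print(out)
```

The key point of the refactor is that the function sees only the batch contents (including the `partition` column), not a partition index argument, which is what makes it expressible through Spark Connect.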
[spark] branch master updated: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 89041a4a8c7 [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function
89041a4a8c7 is described below

commit 89041a4a8c7b7787fa10f090d4324f20447c4dd3
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Mon Sep 18 16:35:22 2023 +0900

    [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

    ### What changes were proposed in this pull request?
    Add from_xml and schema_of_xml to pyspark, spark connect and sql function

    ### Why are the changes needed?
    from_xml parses XML data nested in a `Column` into a struct. schema_of_xml infers a schema from XML data in a `Column`. This PR adds these two functions to pyspark, spark connect and the SQL function registry. It is one of a series of PRs adding native support for the [XML File Format](https://issues.apache.org/jira/browse/SPARK-44265) in Spark.

    ### Does this PR introduce _any_ user-facing change?
    Yes, it adds from_xml and schema_of_xml to pyspark, spark connect and sql function

    ### How was this patch tested?
    - Added new unit tests
    - Github Action

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #42938 from sandip-db/from_xml-master.
Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- .../scala/org/apache/spark/sql/functions.scala | 51 +++- .../org/apache/spark/sql/FunctionTestSuite.scala | 10 + python/pyspark/errors/error_classes.py | 5 + python/pyspark/sql/connect/functions.py| 46 +++ python/pyspark/sql/functions.py| 129 + .../sql/tests/connect/test_connect_function.py | 106 +++ python/pyspark/sql/tests/test_functions.py | 28 +- .../sql/catalyst/analysis/FunctionRegistry.scala | 6 +- .../sql/catalyst/expressions/xmlExpressions.scala | 26 +- .../scala/org/apache/spark/sql/functions.scala | 60 +++- .../sql-functions/sql-expression-schema.md | 4 +- .../analyzer-results/xml-functions.sql.out | 271 + .../resources/sql-tests/inputs/xml-functions.sql | 50 .../sql-tests/results/xml-functions.sql.out| 319 + .../apache/spark/sql/DataFrameFunctionsSuite.scala | 5 +- .../sql/execution/datasources/xml/XmlSuite.scala | 2 +- 16 files changed, 1093 insertions(+), 25 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 83f0ee64501..b94a33007b1 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -26,6 +26,7 @@ import org.apache.spark.sql.api.java._ import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.PrimitiveLongEncoder import org.apache.spark.sql.connect.common.LiteralValueProtoConverter._ import org.apache.spark.sql.connect.common.UdfUtils +import org.apache.spark.sql.errors.DataTypeErrors import org.apache.spark.sql.expressions.{ScalarUserDefinedFunction, UserDefinedFunction} import org.apache.spark.sql.types.{DataType, StructType} import org.apache.spark.sql.types.DataType.parseTypeWithFallback @@ -7311,15 +7312,59 @@ object functions { * 
"https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option";> Data * Source Option in the version you use. * @group collection_funcs + * @since 4.0.0 + */ + // scalastyle:on line.size.limit + def from_xml(e: Column, schema: StructType, options: java.util.Map[String, String]): Column = +from_xml(e, lit(schema.json), options.asScala.toIterator) + + // scalastyle:off line.size.limit + + /** + * (Java-specific) Parses a column containing a XML string into a `StructType` with the + * specified schema. Returns `null`, in the case of an unparseable string. * + * @param e + * a string column containing XML data. + * @param schema + * the schema as a DDL-formatted string. + * @param options + * options to control how the XML is parsed. accepts the same options and the xml data source. + * See https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option";> Data + * Source Option in the version you use. + * @group collection_funcs * @since 4.0.0 */ // scalastyle:on line.size.limit - def from_xml(e: Column, schema: StructType, options: Map[String, String]): Column = -from_xml(e, lit
[spark] branch master updated: [SPARK-44823][PYTHON] Update black to 23.9.1 and fix erroneous check
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5299e5423f4 [SPARK-44823][PYTHON] Update black to 23.9.1 and fix erroneous check

5299e5423f4 is described below

commit 5299e5423f439ce702d14666c79f9d42eb446943
Author: panbingkun
AuthorDate: Mon Sep 18 16:34:34 2023 +0900

    [SPARK-44823][PYTHON] Update black to 23.9.1 and fix erroneous check

    ### What changes were proposed in this pull request?
    This PR updates black from 22.6.0 to 23.9.1 and fixes an erroneous check.

    ### Why are the changes needed?
    The 22.6.0 release is dated Jun 28, 2022, more than a year ago (https://pypi.org/project/black/23.7.0/#history). The update brings PySpark's code style in line with the latest version of black's requirements.

    Release notes:
    - 23.9.1: https://github.com/psf/black/blob/main/CHANGES.md#2391
    - 23.9.0: https://github.com/psf/black/blob/main/CHANGES.md#2390
    - 23.7.0: https://github.com/psf/black/blob/main/CHANGES.md#2370
    - 23.3.0: https://github.com/psf/black/blob/main/CHANGES.md#2330
    - 23.1.0: https://github.com/psf/black/blob/main/CHANGES.md#2310
    - 22.12.0: https://github.com/psf/black/blob/main/CHANGES.md#22120
    - 22.10.0: https://github.com/psf/black/blob/main/CHANGES.md#22100
    - 22.8.0: https://github.com/psf/black/blob/main/CHANGES.md#2280
    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass GA.

    Closes #42507 from panbingkun/SPARK-44823.

    Authored-by: panbingkun
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_and_test.yml               | 2 +-
 dev/pyproject.toml                                 | 2 +-
 dev/reformat-python                                | 2 +-
 dev/requirements.txt                               | 2 +-
 dev/run-tests.py                                   | 3 +-
 python/pyspark/conf.py                             | 4 +-
 python/pyspark/context.py                          | 2 +-
 python/pyspark/instrumentation_utils.py            | 3 -
 python/pyspark/join.py                             | 8 +-
 python/pyspark/ml/connect/evaluation.py            | 1 -
 .../pyspark/ml/deepspeed/deepspeed_distributor.py  | 1 -
 python/pyspark/ml/linalg/__init__.py               | 2 -
 python/pyspark/ml/param/__init__.py                | 1 -
 python/pyspark/ml/tests/test_algorithms.py         | 7 --
 python/pyspark/ml/tests/test_dl_util.py            | 1 -
 python/pyspark/ml/tests/test_feature.py            | 1 -
 python/pyspark/ml/tests/test_linalg.py             | 2 -
 python/pyspark/ml/torch/distributor.py             | 1 -
 python/pyspark/ml/tuning.py                        | 2 +-
 python/pyspark/ml/util.py                          | 1 -
 python/pyspark/ml/wrapper.py                       | 1 -
 python/pyspark/mllib/linalg/__init__.py            | 2 -
 python/pyspark/mllib/tests/test_feature.py         | 1 -
 python/pyspark/mllib/tests/test_linalg.py          | 2 -
 python/pyspark/pandas/frame.py                     | 7 +-
 python/pyspark/pandas/groupby.py                   | 1 -
 python/pyspark/pandas/indexing.py                  | 1 -
 python/pyspark/pandas/numpy_compat.py              | 1 -
 python/pyspark/pandas/plot/matplotlib.py           | 1 -
 python/pyspark/pandas/spark/functions.py           | 4 -
 python/pyspark/pandas/sql_processor.py             | 2 +-
 python/pyspark/pandas/strings.py                   | 1 +
 .../tests/connect/groupby/test_parity_aggregate.py | 1 -
 .../tests/connect/groupby/test_parity_describe.py  | 1 -
 .../tests/connect/groupby/test_parity_head_tail.py | 1 -
 .../tests/connect/groupby/test_parity_index.py     | 1 -
 .../tests/connect/groupby/test_parity_stat.py      | 1 -
 .../tests/connect/series/test_parity_all_any.py    | 1 -
 .../tests/connect/series/test_parity_as_type.py    | 1 -
 .../tests/connect/series/test_parity_conversion.py | 1 -
 .../tests/connect/series/test_parity_sort.py       | 1 -
 .../pandas/tests/data_type_ops/test_string_ops.py  | 1 -
 .../pyspark