msummersgill opened a new pull request #31744:
URL: https://github.com/apache/spark/pull/31744


   
   _**Before finalization, I'm investigating whether the Arrow optimization for any of `BinaryType`, `ArrayType`, `StructType`, or `MapType` can be added as well.**_
   
   ***
   
   ### What changes were proposed in this pull request?
   I deleted several error handlers from the SparkR package's `types.R` file; the R `arrow` package now supports float types, so checks like the one below are no longer needed.
   
   This was brought to my attention by Neal Richardson, maintainer of the R Arrow package, in the comments on https://issues.apache.org/jira/browse/ARROW-3783:
   
   ```r
     if (any(field_strings == "FloatType")) {
       stop("Arrow optimization in R does not support float type yet.")
     }
   ```
   
   
   ### Why are the changes needed?
   
   This change allows SparkR users to take advantage of the continued development of the R `arrow` package. Since that package now supports `float` types, the check only forces an unnecessary fallback: collecting a `float` column with Arrow optimization enabled currently falls back to the slower non-Arrow conversion path with a warning.
   
   ```r
   str(collect(SparkR::sql("SELECT float('1') AS x;"))$x[[1]])
   ##  num 1
   
   ## Warning message:
   ## In value[[3L]](cond) :
   ##   The conversion from Spark DataFrame to R DataFrame was attempted with Arrow
   ##   optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to
   ##   true; however, failed, attempting non-optimization. Reason:
   ##   Error in checkSchemaInArrow(schema(x)):
   ##   Arrow optimization in R does not support float type yet.
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Float types will now be returned with Arrow optimization (when `spark.sql.execution.arrow.sparkr.enabled` is `"true"` and the R `arrow` package is available in the executing environment).
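   
   For reference, a minimal way to opt in from a SparkR session (a sketch; assumes a working Spark installation with SparkR on the driver):
   
   ```r
   library(SparkR)
   
   # Enable Arrow-based conversion for collect()/createDataFrame() in SparkR.
   # The R 'arrow' package must also be installed in the driver's R library.
   sparkR.session(
     sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true")
   )
   
   # With this PR, a float column collects via Arrow without warnings.
   df <- sql("SELECT float('1.5') AS x")
   str(collect(df)$x)
   ```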
   
   
   ### How was this patch tested?
   
   I built a copy of the SparkR package locally under R 3.6.0 using this 
branch, connected to a Databricks cluster running Databricks runtime version 
7.3 LTS (Spark 3.0.1, Scala 2.12), and executed the following without error.
   
   ```r
   str(collect(SparkR::sql("SELECT float('-9999999999999999.9999999999999') AS 
x1,
                           float('-1.0') AS x2,
                           float('-0.00001') AS x3,
                           float('0') AS x4,
                           float('0.00001') AS x5,
                           float('1.0') AS x6,
                           float('9999999999999999.9999999999999') AS x7;")))
   
   # 'data.frame':      1 obs. of  7 variables:
   #  $ x1: num -1e+16
   #  $ x2: num -1
   #  $ x3: num -1e-05
   #  $ x4: num 0
   #  $ x5: num 1e-05
   #  $ x6: num 1
   #  $ x7: num 1e+16
   ```
   

