[GitHub] HyukjinKwon opened a new pull request #23760: [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame

GitBox Mon, 11 Feb 2019 22:49:18 -0800

HyukjinKwon opened a new pull request #23760: [SPARK-26762][SQL][R] Arrow 
optimization for conversion from Spark DataFrame to R DataFrame
URL: https://github.com/apache/spark/pull/23760
 
 
   ## What changes were proposed in this pull request?
   
   This PR adds Arrow optimization for converting a Spark DataFrame to an R 
DataFrame.
   As on the PySpark side, it falls back to the non-optimized code path when 
Arrow optimization cannot be used.
   
   This can be tested as below:
   
   ```bash
   $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
   ```
   
   ```r
   collect(createDataFrame(mtcars))
   ```
   
   ### Requirements
     - R 3.5.x 
     - Arrow package 0.12+
       ```bash
       Rscript -e 'remotes::install_github("apache/[email protected]", subdir = "r")'
       ```
   
   **Note:** the Arrow R package is not on CRAN yet. Please take a look at 
ARROW-3204.
   **Note:** the Arrow R package does not currently seem to support Windows. 
Please take a look at ARROW-3204.
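   
   The requirements above can be checked from an R session before enabling the 
option. The following is a minimal sketch, not part of this PR: 
`arrow_requirements_met` is a hypothetical helper, and it assumes the listed 
versions are minimums (R 3.5.0 or later, Arrow package 0.12.0 or later).
   
   ```r
   # Hypothetical helper (not part of SparkR): returns TRUE when this R session
   # satisfies the version requirements, treating them as minimum versions.
   arrow_requirements_met <- function(min_r = "3.5.0", min_arrow = "0.12.0") {
     r_ok <- getRversion() >= min_r
     # requireNamespace() returns FALSE instead of failing when the package
     # is not installed, so the version is only read when arrow is available.
     arrow_ok <- requireNamespace("arrow", quietly = TRUE) &&
       utils::packageVersion("arrow") >= min_arrow
     r_ok && arrow_ok
   }
   
   if (!arrow_requirements_met()) {
     message("Arrow requirements not met; the non-optimized path would be used.")
   }
   ```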
   
   
   ### Benchmarks
   
   **Shell**
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
   ```
   
   ```bash
   sync && sudo purge
   ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
   ```
   
   **R code**
   
   ```r
   df <- cache(createDataFrame(read.csv("500000.csv")))
   count(df)
   
   test <- function() {
     options(digits.secs = 6) # print fractional seconds (up to microseconds)
     start.time <- Sys.time()
     collect(df)
     end.time <- Sys.time()
     time.taken <- end.time - start.time
     print(time.taken)
   }
   
   test()
   ```
   
   **Data (350 MB):**
   
   ```r
   object.size(read.csv("500000.csv"))
   350379504 bytes
   ```
   
   Source: the "500000 Records" file from 
http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/
   
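   **Results**
   
   Without Arrow optimization (`spark.sql.execution.arrow.enabled=false`):
   
   ```
   Time difference of 221.32014 secs
   ```
   
   With Arrow optimization (`spark.sql.execution.arrow.enabled=true`):
   
   ```
   Time difference of 8.579493 secs
   ```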
   
   With Arrow, `collect` was about 25.8 times faster (221.32 s vs. 8.58 s); 
the non-Arrow run took around **2579%** of the Arrow run's time.
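   
   For reference, the ratio can be recomputed from the two timings reported 
under Results. This is only arithmetic on those two numbers; the variable 
names are illustrative.
   
   ```r
   # Timings copied from the Results section, in seconds.
   without_arrow_secs <- 221.32014
   with_arrow_secs <- 8.579493
   
   # How many times faster the Arrow run was.
   speedup <- without_arrow_secs / with_arrow_secs
   
   # Share of the original wall-clock time that was saved, in percent.
   time_saved_pct <- (1 - with_arrow_secs / without_arrow_secs) * 100
   
   print(round(speedup, 1))
   print(round(time_saved_pct, 1))
   ```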
   
   ### Limitations
   
   - For now, Arrow optimization with R does not support `raw` data or 
columns for which the user explicitly specifies the float type in the schema. 
Both cases produce corrupt values, so in these cases we fall back to the 
non-optimized code path.
   
   - Due to ARROW-4512, batches cannot be sent and received one by one. All 
batches have to be sent at once in Arrow stream format. This should be 
improved later.
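   
   The fallback decision in the first limitation can be illustrated as a check 
over column types. This is a sketch under stated assumptions, not the code in 
this PR: `needs_fallback` is a hypothetical helper, its input is assumed to 
have the shape returned by SparkR's `dtypes()` (a list of 
`c(column_name, type_name)` character vectors), and R `raw` columns are 
assumed to appear as the Spark type `binary`.
   
   ```r
   # Hypothetical helper: TRUE when any column has a type that the Arrow path
   # cannot handle, so the non-optimized path should be used instead.
   needs_fallback <- function(type_pairs, unsupported = c("binary", "float")) {
     # Extract the type name (second element) from each c(name, type) pair.
     types <- vapply(type_pairs, function(p) p[[2]], character(1))
     any(types %in% unsupported)
   }
   
   # Example with a hand-written type list; with SparkR this would be
   # needs_fallback(dtypes(df)).
   example_types <- list(c("id", "int"), c("score", "float"))
   print(needs_fallback(example_types))
   ```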
   
   ## How was this patch tested?
   
   Existing tests for Arrow optimization cover this change. I also fixed a 
test title.

