HyukjinKwon opened a new pull request #23787: [SPARK-26830][SQL][R] Vectorized R dapply() implementation
URL: https://github.com/apache/spark/pull/23787

## What changes were proposed in this pull request?

This PR adds a vectorized `dapply()` implementation in R, based on the Arrow optimization. It can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
collect(dapply(df,
               function(rdf) {
                 data.frame(rdf$gear + 1)
               },
               structType("gear double")))
```

### Requirements

- R 3.5.x
- Arrow package 0.12+

```bash
Rscript -e 'remotes::install_github("apache/[email protected]", subdir = "r")'
```

**Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.

**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
rdf <- read.csv("500000.csv")
rdf <- rdf[, c("First.Name", "Month.of.Joining")]  # We're only interested in the key and values to calculate.
df <- cache(createDataFrame(rdf))
count(df)

test <- function() {
  options(digits.secs = 6)  # milliseconds
  start.time <- Sys.time()
  count(dapply(df,
               function(rdf) {
                 rdf$Month_of_Joining <- rdf$Month_of_Joining + 1
                 rdf
               },
               structType("First_Name string, Month_of_Joining double")))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

With Arrow optimization disabled:

```
Time difference of 92.78868 secs
```

With Arrow optimization enabled:

```
Time difference of 1.997686 secs
```

The performance improvement was around **4644%** (the short calculation at the end of this description shows how that figure is derived).

### Limitations

- For now, Arrow optimization with R does not support cases where the data is `raw` or where the user explicitly specifies a float type in the schema; both produce corrupt values (a short illustration appears at the end of this description).
- Due to ARROW-4512, batches cannot be sent and received one by one; all batches have to be sent at once in the Arrow stream format. This needs improvement later.

## How was this patch tested?

Unit tests were added, and the change was manually tested.
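For reference, a minimal R calculation (not part of the patch) showing where the quoted ~4644% figure comes from: it is the raw speedup ratio of the two measured times expressed as a percentage, assuming the first result above is the non-Arrow run and the second the Arrow run.

```r
# Wall-clock times copied from the benchmark results above.
baseline <- 92.78868  # seconds, spark.sql.execution.arrow.enabled=false (assumed ordering)
arrow    <- 1.997686  # seconds, spark.sql.execution.arrow.enabled=true (assumed ordering)

speedup <- baseline / arrow  # ~46.4x faster
speedup * 100                # ~4644.8, i.e. the "around 4644%" figure quoted above
```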
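As a small illustration of the float limitation noted above (again not part of the patch; it reuses the `dapply()` example from the top of this description and assumes a running SparkR shell with Arrow optimization enabled), a `double` field in the output schema is the supported case, while an explicit `float` field is the documented unsupported case that can produce corrupt values, so it is left commented out here.

```r
df <- createDataFrame(mtcars)

# Supported under Arrow optimization: double type in the output schema.
collect(dapply(df, function(rdf) {
  data.frame(rdf$gear + 1)
}, structType("gear double")))

# Not supported under Arrow optimization: an explicit float type in the
# schema is one of the documented limitations and can yield corrupt values.
# collect(dapply(df, function(rdf) {
#   data.frame(rdf$gear + 1)
# }, structType("gear float")))
```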
