HyukjinKwon commented on a change in pull request #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#discussion_r255298526
##########
File path: R/pkg/R/deserialize.R
##########
@@ -231,6 +231,26 @@ readMultipleObjectsWithKeys <- function(inputCon) {
  list(keys = keys, data = data) # this is a list of keys and corresponding data
}
+readDeserializeInArrow <- function(inputCon) {
+  # This is a hack to avoid the CRAN check. Arrow is not uploaded to CRAN yet. See ARROW-3204.
+  requireNamespace1 <- requireNamespace
+  requireNamespace1("arrow", quietly = TRUE)
+
+  # Currently, there seems to be no way to read batch by batch over a socket connection
+  # on the R side. See ARROW-4512. Therefore, it reads the whole Arrow streaming-formatted
+  # binary at once for now.
+  dataLen <- readInt(inputCon)
+  arrowData <- readBin(inputCon, raw(), as.integer(dataLen), endian = "big")
+  batches <- arrow::RecordBatchStreamReader(arrowData)$batches()
+
+  # Read all grouped batches. Tibble -> data.frame is cheap.
+  data <- lapply(batches, function(batch) as.data.frame(arrow::as_tibble(batch)))
Review comment:
Arrow seems to provide only an API that converts an Arrow batch to a tibble (not a `data.frame`). They are similar but not identical. I converted the result with `as.data.frame` to stay consistent with the existing `gapply`. The cost looks very cheap, since the documentation suggests `as.data.frame` works like a thin wrapper in this case.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services