bkietz commented on a change in pull request #10191:
URL: https://github.com/apache/arrow/pull/10191#discussion_r625161974



##########
File path: r/tests/testthat/test-dplyr-mutate.R
##########
@@ -344,20 +344,21 @@ test_that("print a mutated table", {
       select(int) %>%
       mutate(twice = int * 2) %>%
       print(),
-'Table (query)
+'InMemoryDataset (query)
 int: int32
 twice: expr
 
 See $.data for the source Arrow object',
   fixed = TRUE)
 
   # Handling non-expressions/edge cases
+  skip("InMemoryDataset$Project() doesn't accept array (or could it?)")
   expect_output(
     Table$create(tbl) %>%
       select(int) %>%
       mutate(again = 1:10) %>%

Review comment:
       This isn't supported because (for most datasets) we don't know the 
number of rows before scanning, so it's unclear if augmenting the table with an 
explicit column is valid. It's *possible* to add support for projections like 
this to Expression but it'd mean they were no longer strictly scalar (since 
they'd only be valid in the case where the result of the scan happened to have 
the expected number of rows). This would complicate scanning drastically, since 
we'd have to accommodate array literals in each scanned batch
   
   I'd expect that augmenting with an explicit array is only done with small, 
cheap scans? If so, one way to work around this would be to auto-`collect()` 
the preceding steps and handle the explicit array with `Table$AddColumn`. In 
any case it might be worthwhile to maintain an `is_scan_of_in_memory_data` flag 
of the dplyr query as a hint of whether auto `collect()`ion is reasonable
   
   @westonpace what do you think?

##########
File path: r/tests/testthat/test-dplyr.R
##########
@@ -501,6 +503,7 @@ test_that("explicit type conversions with as.*()", {
 })
 
 test_that("as.factor()/dictionary_encode()", {
+  skip("ExecuteScalarExpression cannot Execute non-scalar expression 
{x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}")

Review comment:
       Adding support for dictionary encoding is possible but it will be subtle 
since it will be the first Expression with mutable state to share between 
threads. I've added https://issues.apache.org/jira/browse/ARROW-12632 to track 
this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to