nealrichardson commented on PR #13415:
URL: https://github.com/apache/arrow/pull/13415#issuecomment-1164434904

   I did some exploration of what's happening, at least to illustrate where the 
issue is. This takes an example from our test suite and uses the nul-removal 
warning as a marker of when data is being converted from Arrow to R:
   
   ```
   >   raws <- structure(list(
   +     as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e)),
   +     as.raw(c(0x77, 0x6f, 0x6d, 0x61, 0x6e)),
   +     as.raw(c(0x6d, 0x61, 0x00, 0x6e)), # <-- there's your nul, 0x00
   +     as.raw(c(0x66, 0x00, 0x00, 0x61, 0x00, 0x6e)), # multiple nuls
   +     as.raw(c(0x63, 0x61, 0x6d, 0x65, 0x72, 0x61)),
   +     as.raw(c(0x74, 0x76))
   +   ),
   +   class = c("arrow_binary", "vctrs_vctr", "list")
   +   )
   >   array_with_nul <- Array$create(raws)$cast(utf8())
   > v <- as.vector(array_with_nul)
   > options(arrow.skip_nul = TRUE)
   > v[]
   [1] "person" "woman"  "man"    "fan"    "camera" "tv"    
   Warning message:
   Stripping '\0' (nul) from character vector 
   > v[]
   [1] "person" "woman"  "man"    "fan"    "camera" "tv"  
   # See no warning the second time because the vector was materialized.
   
   # But: single element access:
   > v2 <- as.vector(Array$create(raws)$cast(utf8()))
   > v2[3]
   [1] "man"
   Warning message:
   Stripping '\0' (nul) from character vector 
   > v2[3]
   [1] "man"
   Warning message:
   Stripping '\0' (nul) from character vector 
   # Touching an element doesn't materialize, so it has to re-convert each time
   ```
   
   You can see how this blows up with `unique()`: 2 of the 6 cells have nul in 
them, but we see the warning 10 times:
   
   ```
   > unique(v)
   [1] "person" "woman"  "man"    "fan"    "camera" "tv"    
   Warning messages:
   1: Stripping '\0' (nul) from character vector 
   2: Stripping '\0' (nul) from character vector 
   3: Stripping '\0' (nul) from character vector 
   4: Stripping '\0' (nul) from character vector 
   5: Stripping '\0' (nul) from character vector 
   6: Stripping '\0' (nul) from character vector 
   7: Stripping '\0' (nul) from character vector 
   8: Stripping '\0' (nul) from character vector 
   9: Stripping '\0' (nul) from character vector 
   10: Stripping '\0' (nul) from character vector 
   
   # Access the whole vector so that it materializes
   > v[]
   [1] "person" "woman"  "man"    "fan"    "camera" "tv"    
   Warning message:
   Stripping '\0' (nul) from character vector 
   > unique(v)
   [1] "person" "woman"  "man"    "fan"    "camera" "tv"
   # No conversion happening this time
   ```
   
   (`is.na()` is less bad: there is only 1 hit per element. That makes me think 
the algorithm in `base::unique()` is inefficient, but that's a separate issue.)
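   
   To get a count of conversions without reading numbered warnings off the 
console, something like the following base-R sketch could be used. 
`count_warnings()` is a hypothetical helper, not part of arrow, and it counts 
every warning the expression raises, not only the nul-stripping one:
   
   ```r
   # Hypothetical helper (not part of arrow): count how many warnings an
   # expression raises while it is evaluated.
   count_warnings <- function(expr) {
     n <- 0L
     withCallingHandlers(
       expr,  # the lazily evaluated argument is forced here, inside the handler
       warning = function(w) {
         n <<- n + 1L                    # tally each warning as it is signalled
         invokeRestart("muffleWarning")  # suppress it and resume evaluation
       }
     )
     n
   }
   ```
   
   With `v` from the session above, `count_warnings(unique(v))` and 
`count_warnings(is.na(v))` should make the two functions easy to compare, 
assuming `options(arrow.skip_nul = TRUE)` is still set.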
   
   So IIUC it's a tradeoff between memory consumption (duplicating the data at 
`Materialize()`) vs. performance. On my machine, I'm seeing an even bigger hit 
on performance, and unique/is.na are clearly not materializing the whole array 
because the perf isn't better if you do it again:
   
   ```
   > system.time(unique(df1$x))
      user  system elapsed 
     0.025   0.001   0.026 
   > system.time(unique(df2$x))
      user  system elapsed 
     4.790   1.397   6.781 
   > system.time(unique(df2$x))
      user  system elapsed 
     4.901   1.187   6.233 
   > system.time(is.na(df1$x))
      user  system elapsed 
     0.002   0.000   0.004 
   > system.time(is.na(df2$x))
      user  system elapsed 
     0.729   0.098   0.827 
   > system.time(is.na(df2$x))
      user  system elapsed 
     0.723   0.096   0.817
   ```
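   
   The "is it materializing?" check above can be wrapped up as a rough sketch. 
`time_repeated()` is a hypothetical helper, not part of arrow; it times the same 
call on the same object several times, so a large drop after the first run 
would suggest the first call materialized the vector:
   
   ```r
   # Hypothetical helper (not part of arrow): time f(x) `times` times on the
   # same object and return the elapsed seconds of each run.
   time_repeated <- function(f, x, times = 2L) {
     vapply(
       seq_len(times),
       function(i) system.time(f(x))[["elapsed"]],
       numeric(1)
     )
   }
   ```
   
   For example, `time_repeated(unique, df2$x)` would reproduce the comparison 
above in one call. Based on the `v[]` behavior shown earlier, accessing the 
whole vector first (`x <- df2$x; x[]`) should trade the memory cost of 
materializing for faster later calls, but I haven't measured that here.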

