nealrichardson edited a comment on pull request #8650:
URL: https://github.com/apache/arrow/pull/8650#issuecomment-754955102


   Since the latest commits aren't compiling, I did some benchmarking on 
https://github.com/apache/arrow/pull/8650/commits/bcb1be733697b0e7ca86534a6700b5816e0dad46.
 Summary of findings:
   
   * Character to string conversion is usually (but not always) faster with the 
new code, by around 20-30%. Because string conversion is generally slower than 
conversion of other types, even a small percentage improvement can be significant.
   * Integer and integer64 conversion is slower by an order of magnitude or 
more in the new code.
   * `bench::mark` didn't report results for numeric vectors because the 
results of the two expressions were not equal.
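   
   For the numeric case, one way to still get timings might be to pass 
`check = FALSE` to `bench::mark()` and compare the results separately. A minimal 
sketch, using hypothetical stand-in functions `convert_old()` and 
`convert_new()` rather than the actual Arrow converters:
   
   ```r
   library(bench)

   # Hypothetical stand-ins for the two conversion paths (not the Arrow
   # functions); the second deliberately returns a slightly different result.
   convert_old <- function(v) v * 2
   convert_new <- function(v) v * 2 + 1e-3

   x <- runif(1e5)

   # check = FALSE skips the equality check, so timings are reported
   # even though the two results differ.
   timings <- bench::mark(convert_old(x), convert_new(x), check = FALSE)
   print(timings)

   # Inspect the mismatch separately instead of letting it abort the benchmark.
   print(all.equal(convert_old(x), convert_new(x)))
   ```
   
   With the default `check = TRUE`, the same call would raise an error, which 
is what the `try()` in the benchmark loop would swallow.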
   
   Not sure where things will stand with the latest changes, but I think this 
suggests that the (numpy-like) special handling for vector types that can 
simply be copied/moved to Arrow is important where appropriate. Otherwise, the 
string results suggest that there is some performance gain to be had with this 
work, and if the new approach will handle chunking and parallelization, we can 
do even better.
   
   Code:
   
   ```r
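   library(arrow)

   # Download the Fannie Mae loan performance data and read it with Arrow
   download.file("https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz",
     "fanniemae.csv.gz")
   df <- read_delim_arrow("fanniemae.csv.gz", delim="|", col_names=FALSE)
   dim(df)
   ## [1] 22180168       31
   # Benchmark the two conversion functions column by column;
   # try() keeps the loop going if bench::mark() errors
   for (n in names(df)) {
     print(n)
     print(class(df[[n]]))
     try(print(bench::mark(arrow:::Array__from_vector(df[[n]], NULL),
       arrow:::vec_to_arrow(df[[n]], NULL))))
   }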
   ```
   
   There's also a NYC taxi CSV at 
https://ursa-qa.s3.amazonaws.com/nyctaxi/yellow_tripdata_2010-01.csv.gz that you 
can test with (just use `read_csv_arrow()`; the file has column names).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

