OfekShilon opened a new issue, #15271:
URL: https://github.com/apache/arrow/issues/15271
### Describe the bug, including details regarding any error messages, version, and platform.
Test script that measures R/arrow load time for various sizes:
```r
colnums <- c(10, 20, 30, 100, 150, 200, 300, 500)
rownums <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 100, 200, 300, 400,
             500, 1000, 2000, 3000, 4000, 5000, 10000)

# Generate files: one .RData file and one Feather file per table size
for (colnum in colnums) {
  for (rownum in rownums) {
    fn.robj <- paste0("~/tmp/robj.", rownum, "x", colnum)
    fn.arrow <- paste0("~/tmp/arrow.", rownum, "x", colnum)
    dat <- as.data.frame(matrix(runif(rownum * colnum), nrow = rownum,
                                ncol = colnum))
    save(dat, file = fn.robj)
    arrow::write_feather(x = dat, sink = fn.arrow)
  }
}

# Load times in seconds, one cell per (rows, cols) combination
times.robj <- matrix(0, nrow = length(rownums), ncol = length(colnums))
rownames(times.robj) <- paste(rownums, "rows")
colnames(times.robj) <- paste(colnums, "cols")
times.arrow <- times.robj

for (i in seq_along(rownums)) {
  for (j in seq_along(colnums)) {
    rownum <- rownums[i]
    colnum <- colnums[j]
    fn.robj <- paste0("~/tmp/robj.", rownum, "x", colnum)
    fn.arrow <- paste0("~/tmp/arrow.", rownum, "x", colnum)

    # Measure the 2nd load so that cold caches do not skew the timing
    load(fn.robj)
    start <- Sys.time()
    load(fn.robj)
    times.robj[i, j] <- as.numeric(difftime(Sys.time(), start, units = "secs"))

    tst <- arrow::read_feather(fn.arrow)
    start <- Sys.time()
    tst <- arrow::read_feather(fn.arrow)
    times.arrow[i, j] <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  }
}
```
Results (ratio of `read_feather` time to `load` time; values above 1 mean arrow is slower):
```
> times.arrow / times.robj
               10 cols     20 cols     30 cols    100 cols    150 cols    200 cols    300 cols    500 cols
1 rows      16.1439951  19.7020075  25.1108247  51.1643757  77.1529228  91.3080397 111.3643533 149.3513743
2 rows      15.0277094  21.2175810  22.2626322  48.8661710  68.6573327 650.6486486 134.8991050 130.5041691
3 rows      14.6777409  20.1436969  20.9700806  47.7467603  63.9312016  68.5315315  98.5874855 119.4731097
4 rows      13.2236921  17.4342891  20.9966044  43.8189867  57.1619048  64.3601299  94.4213217 118.8271915
5 rows      12.6945607  14.8067084  18.7377778  36.4182165  49.6366695  56.7033511  73.2449044 115.0325528
10 rows     13.1203008  16.9616537  16.7252696  37.5056129  47.2363992  56.1606467  76.4436374  86.6117791
20 rows     12.4548896 774.0376940  17.5051370  32.4073774  35.6958398  39.4063311  46.5070936  51.8869215
30 rows     10.2758259  12.8381764  15.6813459  25.9489239  30.6835476  31.7596519  35.4976311  41.5393059
40 rows     10.8671210   7.8244697  15.1399804  23.4805764  29.2812743  26.6662289  31.4367649  42.6152522
50 rows     11.3902007  12.6833417  15.2992519  25.2068532  27.2051708  28.9717248  32.0606809  36.8470872
60 rows     10.9138495  14.1022129  16.6385948  22.7227723  26.6038445  27.9418484  28.5083841  33.9032176
70 rows     10.7040650  12.1799904  13.2777314  19.7737738  20.8106306  21.8470504  22.5418507  27.6593520
100 rows    10.7567132  11.7838963  12.8056854  15.0082676  28.4549343  18.1499451  21.5192503  22.0708589
200 rows     9.5018797  10.1656687  10.6434257  12.3456125  12.0490603  12.5274870  13.1872241  14.6434862
300 rows     9.6111111   8.9652621   8.9622146   9.3272070   9.1396644  10.0647620  10.6045769  12.0662228
400 rows     8.7160494   9.3873540   8.3236041   7.2730971   7.9281412   7.4078140   7.4032556   7.9848605
500 rows     7.1358811   6.4100263   6.4007276   6.0777437   6.6235458   6.2249675   6.3370181   6.9172020
1000 rows    5.3677043   4.4564087   4.1116463   3.6105644   3.2333922   3.2778293   3.2759320   3.4308380
2000 rows    3.5031858   2.5319266   2.4289314   1.8577107   1.7995663   1.7371557   1.7497375   1.8541778
3000 rows    2.5769010   6.3183501   1.7323371   1.3046406   1.2342389   1.2235438   1.3174136   1.2508460
4000 rows    2.0956563   1.4165296   1.8561829   0.9478190   0.8863266   1.2302510   0.8732958   0.8928616
5000 rows    1.6759777   1.2119986   1.1039393   0.8229102   1.3977869   0.9786898   0.9761781   0.8342817
10000 rows   0.9136646   0.6621193   0.5184357   0.4271505   0.3822572   0.3574329   0.3735044   0.4495687
```
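Is this a known overhead? It seems rather large: up to about 100 rows, `read_feather` is roughly 10x to 150x slower than `load`, and the two only reach parity at around 4000 to 5000 rows.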
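
The table contains a few isolated spikes (for example 650.6 at 2 rows x 200 cols and 774.0 at 20 rows x 20 cols), which suggests that a single timed call per cell is noisy. A minimal sketch of a steadier measurement, assuming a hypothetical helper `median_elapsed` (not part of the script above) that repeats each load and takes the median; it writes to `tempfile()` paths rather than `~/tmp`, and the arrow part only runs if the package is installed:

```r
# Hypothetical helper (an assumption, not in the original script): call a
# zero-argument function `reps` times and return the median wall-clock
# time in seconds.
median_elapsed <- function(f, reps = 21) {
  f()  # warm-up call, so the timed calls hit warm caches
  elapsed <- vapply(seq_len(reps), function(k) {
    start <- Sys.time()
    f()
    as.numeric(difftime(Sys.time(), start, units = "secs"))
  }, numeric(1))
  median(elapsed)
}

# One small table, same construction as in the script above
dat <- as.data.frame(matrix(runif(20 * 20), nrow = 20, ncol = 20))

fn.robj <- tempfile("robj.")
save(dat, file = fn.robj)
# Load into a throwaway environment so `dat` is not overwritten
t.robj <- median_elapsed(function() load(fn.robj, envir = new.env()))

if (requireNamespace("arrow", quietly = TRUE)) {
  fn.arrow <- tempfile("arrow.")
  arrow::write_feather(x = dat, sink = fn.arrow)
  t.arrow <- median_elapsed(function() arrow::read_feather(fn.arrow))
  print(t.arrow / t.robj)  # ratio comparable to one cell of the table
}
```

The median is used instead of the mean so that one slow call (a GC pause or a scheduler hiccup) does not dominate the cell. This only reduces noise; it does not by itself explain the overhead for small tables.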
### Component(s)
R