ala commented on PR #37228: URL: https://github.com/apache/spark/pull/37228#issuecomment-1213402734
@sadikovi The cost of reading the row_index column is in the same ballpark as the other metadata columns:

```
[info] Vectorized Parquet:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] no metadata columns                                 332            370          15         15.1          66.3       1.0X
[info] _metadata.file_path                                 436            491          33         11.5          87.1       0.8X
[info] _metadata.file_name                                 440            479          20         11.4          88.0       0.8X
[info] _metadata.file_size                                 377            420          24         13.3          75.4       0.9X
[info] _metadata.file_modification_time                    391            420          19         12.8          78.1       0.8X
[info] _metadata.row_index                                 434            489          27         11.5          86.7       0.8X
[info] _metadata                                           676            766          34          7.4         135.2       0.5X

[info] Parquet-mr:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] no metadata columns                                1250           1447          78          4.0         250.0       1.0X
[info] _metadata.file_path                                1688           1898         116          3.0         337.6       0.7X
[info] _metadata.file_name                                1678           1867          87          3.0         335.6       0.7X
[info] _metadata.file_size                                1518           1711          79          3.3         303.6       0.8X
[info] _metadata.file_modification_time                   1596           1701          60          3.1         319.3       0.8X
[info] _metadata.row_index                                1526           1725          79          3.3         305.3       0.8X
[info] _metadata                                          2268           2578         134          2.2         453.5       0.6X
```

And these numbers are in the same ballpark as for the vanilla `master` branch:

```
[info] Vectorized Parquet:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] no metadata columns                                 346            411          31         14.5          69.1       1.0X
[info] _metadata.file_path                                 452            524          49         11.1          90.5       0.8X
[info] _metadata.file_name                                 446            489          24         11.2          89.2       0.8X
[info] _metadata.file_size                                 389            436          38         12.9          77.8       0.9X
[info] _metadata.file_modification_time                    387            421          19         12.9          77.4       0.9X
[info] _metadata                                           592            672          30          8.4         118.4       0.6X

[info] Parquet-mr:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] no metadata columns                                1209           1351          73          4.1         241.8       1.0X
[info] _metadata.file_path                                1595           1807         112          3.1         318.9       0.8X
[info] _metadata.file_name                                1592           1777         100          3.1         318.3       0.8X
[info] _metadata.file_size                                1493           1692         102          3.3         298.7       0.8X
[info] _metadata.file_modification_time                   1507           1688          87          3.3         301.5       0.8X
[info] _metadata                                          1998           2238         107          2.5         399.6       0.6X
```
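For anyone who wants to sanity-check the `Relative` column, here is a rough Python sketch that recomputes it from the Best Time(ms) values of the vectorized run on this PR branch. It assumes `Relative` is the baseline best time divided by each case's best time, rounded to one decimal; that assumption is consistent with the printed values but is not taken from the benchmark source. The names `best_ms` and `relative` are made up for illustration.

```python
# Best Time (ms) for the vectorized Parquet reader on the PR branch,
# copied from the benchmark output above.
best_ms = {
    "no metadata columns": 332,
    "_metadata.file_path": 436,
    "_metadata.file_name": 440,
    "_metadata.file_size": 377,
    "_metadata.file_modification_time": 391,
    "_metadata.row_index": 434,
    "_metadata": 676,
}


def relative(times, baseline="no metadata columns"):
    """Return baseline_time / case_time per case, rounded to one decimal.

    Assumption: this mirrors how the Relative column is derived.
    """
    base = times[baseline]
    return {name: round(base / ms, 1) for name, ms in times.items()}


rel = relative(best_ms)
for name, ratio in rel.items():
    print(f"{name:<35} {ratio:.1f}X")
```

Under that assumption, `_metadata.row_index` lands at the same ratio as `_metadata.file_path` and `_metadata.file_name`, which is the "same ballpark" claim in numbers.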
