[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585804 +1, thanks for fixing this @emkornfield! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585468 FWIW gcc benchmarks (sse4.2) on my machine (ubuntu 18.04 on i9-9960X)

```
$ archery benchmark diff --cc=gcc-8 --cxx=g++-8 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
    benchmark                          baseline        contender  change %  counters
35  BM_ReadColumn/75/1          267.930 MiB/sec  275.083 MiB/sec     2.670  {'iterations': 2}
 6  BM_ReadColumn/99/0          238.381 MiB/sec  244.286 MiB/sec     2.477  {'iterations': 3}
10  BM_ReadColumn/10/50         270.775 MiB/sec  275.249 MiB/sec     1.652  {'iterations': 2}
31  BM_ReadColumn/25/5          171.549 MiB/sec  174.237 MiB/sec     1.567  {'iterations': 2}
29  BM_ReadColumn/-1/1          856.194 MiB/sec  866.945 MiB/sec     1.256  {'iterations': 11}
 7  BM_WriteColumn               31.593 MiB/sec   31.845 MiB/sec     0.797  {'iterations': 1}
44  BM_ReadIndividualRowGroups  849.469 MiB/sec  856.209 MiB/sec     0.793  {'iterations': 6}
36  BM_ReadColumn/-1/0          408.388 MiB/sec  411.490 MiB/sec     0.759  {'iterations': 3}
 3  BM_ReadColumn/50/0          151.600 MiB/sec  152.399 MiB/sec     0.527  {'iterations': 2}
32  BM_ReadColumn/30/10         268.512 MiB/sec  269.849 MiB/sec     0.498  {'iterations': 2}
 8  BM_ReadColumn/-1/20         772.878 MiB/sec  776.716 MiB/sec     0.497  {'iterations': 5}
17  BM_ReadColumn/25/25         277.705 MiB/sec  278.957 MiB/sec     0.451  {'iterations': 2}
23  BM_ReadColumn/-1/1            1.472 GiB/sec    1.478 GiB/sec     0.414  {'iterations': 13}
30  BM_ReadColumn/1/1           235.809 MiB/sec  236.730 MiB/sec     0.391  {'iterations': 3}
45  BM_WriteColumn               45.067 MiB/sec   45.232 MiB/sec     0.366  {'iterations': 1}
41  BM_ReadColumn/45/25         241.545 MiB/sec  242.427 MiB/sec     0.365  {'iterations': 2}
40  BM_ReadColumn/-1/10         710.485 MiB/sec  712.181 MiB/sec     0.239  {'iterations': 5}
22  BM_ReadColumn/-1/0          211.939 MiB/sec  212.433 MiB/sec     0.233  {'iterations': 15}
39  BM_ReadColumn/-1/50         629.632 MiB/sec  630.955 MiB/sec     0.210  {'iterations': 9}
15  BM_ReadColumn/1/20          145.490 MiB/sec  145.701 MiB/sec     0.145  {'iterations': 10}
 9  BM_ReadColumn/-1/1          162.373 MiB/sec  162.497 MiB/sec     0.077  {'iterations': 4}
11  BM_ReadColumn/-1/0          257.190 MiB/sec  257.278 MiB/sec     0.034  {'iterations': 3}
13  BM_WriteColumn               32.378 MiB/sec   32.386 MiB/sec     0.025  {'iterations': 1}
37  BM_ReadColumn/99/50         238.918 MiB/sec  238.975 MiB/sec     0.024  {'iterations': 3}
20  BM_WriteColumn               54.200 MiB/sec   54.209 MiB/sec     0.017  {'iterations': 1}
19  BM_ReadColumn/1/1           369.522 MiB/sec  369.199 MiB/sec    -0.087  {'iterations': 2}
 0  BM_ReadColumn/-1/50           1.144 GiB/sec    1.143 GiB/sec    -0.090  {'iterations': 10}
34  BM_ReadColumn/35/10         262.271 MiB/sec  261.890 MiB/sec    -0.145  {'iterations': 2}
21  BM_ReadColumn/50/1          242.510 MiB/sec  242.067 MiB/sec    -0.183  {'iterations': 2}
 5  BM_ReadColumn/50/50         152.344 MiB/sec  152.000 MiB/sec    -0.225  {'iterations': 2}
 1  BM_ReadColumn/10/10         179.063 MiB/sec  178.583 MiB/sec    -0.268  {'iterations': 2}
28  BM_WriteColumn               64.162 MiB/sec   63.990 MiB/sec    -0.269  {'iterations': 1}
38  BM_WriteColumn               69.641 MiB/sec   69.446 MiB/sec    -0.280  {'iterations': 1}
18  BM_WriteColumn               26.643 MiB/sec   26.557 MiB/sec    -0.324  {'iterations': 2}
16  BM_ReadColumn/-1/0            2.291 GiB/sec    2.281 GiB/sec    -0.397  {'iterations': 20}
43  BM_WriteColumn               75.453 MiB/sec   75.050 MiB/sec    -0.534  {'iterations': 1}
33  BM_ReadColumn/5/10          115.534 MiB/sec  114.574 MiB/sec    -0.832  {'iterations': 3}
14  BM_ReadColumn/50/50         242.071 MiB/sec  240.005 MiB/sec    -0.854  {'iterations': 2}
 4  BM_ReadMultipleRowGroups    826.026 MiB/sec  818.709 MiB/sec    -0.886  {'iterations': 5}
24  BM_ReadColumn/-1/10         386.383 MiB/sec  382.213 MiB/sec    -1.079  {'iterations': 6}
12  BM_ReadColumn/-1/0          410.877 MiB/sec  404.700 MiB/sec    -1.503  {'iterations': 3}
25  BM_ReadColumn/10/5          289.612 MiB/sec  284.777 MiB/sec    -1.670  {'iterations': 2}
 2  BM_ReadColumn/25/10         277.045 MiB/sec  271.925 MiB/sec    -1.848  {'iterations': 2}
26  BM_ReadColumn/5/5           311.137 MiB/sec  295.823 MiB/sec    -4.922  {'iterations': 2}
42  BM_ReadColumn/99/0          371.803 MiB/sec  347.872 MiB/sec    -6.436  {'iterations': 2}
27  BM_ReadColumn/99/50         381.029 MiB/sec  351.329 MiB/sec    -7.795  {'iterations': 3}
```

Here's clang-11

```
$ archery benchmark diff --cc=clang-11 --cxx=clang++-11 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
    benchmark  baseline  contender  change %
```
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-648454941 The bug seems to be in the BitRunReader ![image](https://user-images.githubusercontent.com/329591/85470398-95a9fa00-b574-11ea-99c4-3f06db4a0179.png)
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-648453423 @emkornfield I'm sort of at a dead end here, hopefully the above gives you some clues about where there might be a problem
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-648451899 There seems to be a situation where the bit run has more values than are needed to fulfill the call to `GetSpaced` ![image](https://user-images.githubusercontent.com/329591/85469465-50d19380-b573-11ea-8129-14da0a68dc5a.png)
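One common way to guard against a bit run that carries more values than the caller still needs is to clamp each run to the remaining count. The following is only a minimal sketch of that idea: `Run` and `CountValidClamped` are hypothetical names invented for illustration, and this is not the PR's `BitRunReader` or `GetSpaced` code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical run type for illustration; not Arrow's actual definition.
struct Run {
  std::int64_t length;  // number of consecutive bits in the run
  bool set;             // true if the bits are set (values are non-null)
};

// Walks the runs but never consumes more than num_values positions in total.
// Returns how many of the first num_values positions are non-null.
inline std::int64_t CountValidClamped(const std::vector<Run>& runs,
                                      std::int64_t num_values) {
  std::int64_t remaining = num_values;
  std::int64_t valid = 0;
  for (const Run& run : runs) {
    if (remaining <= 0) break;
    // Clamp: a run may extend past the end of what was requested.
    const std::int64_t take = std::min(run.length, remaining);
    if (run.set) valid += take;
    remaining -= take;
  }
  return valid;
}
```

With runs of 3 set, 2 unset and 10 set bits and a request for 8 values, the last run is cut to 3, so 6 values count as valid instead of 13.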
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-648446373 I'm able to reproduce the error in VS and set breakpoints. I got far enough to see that `GetBatchWithDictSpaced` has decoded more values than it was asked to ![image](https://user-images.githubusercontent.com/329591/85467944-6e9df900-b571-11ea-82f8-5f7efade8f2e.png)
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-648230676 I'm not sure what the MSVC failure is about but I'll debug locally
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647699240 Here's with clang, even less of an issue there it seems

```
    benchmark                   baseline        contender  change %  regression
32  BM_ReadColumn/1/1      1.656 GiB/sec    2.689 GiB/sec    62.384  False
15  BM_ReadColumn/99/0     2.861 GiB/sec    4.184 GiB/sec    46.245  False
21  BM_ReadColumn/99/50    1.890 GiB/sec    2.744 GiB/sec    45.220  False
20  BM_ReadColumn/99/50    2.898 GiB/sec    4.199 GiB/sec    44.891  False
 6  BM_ReadColumn/50/1     1.530 GiB/sec    2.210 GiB/sec    44.402  False
30  BM_ReadColumn/50/50    1.586 GiB/sec    2.233 GiB/sec    40.787  False
28  BM_ReadColumn/50/0   881.102 MiB/sec    1.196 GiB/sec    39.025  False
22  BM_ReadColumn/45/25    1.598 GiB/sec    2.221 GiB/sec    38.951  False
14  BM_ReadColumn/1/1      2.967 GiB/sec    4.122 GiB/sec    38.919  False
29  BM_ReadColumn/50/50  887.206 MiB/sec    1.203 GiB/sec    38.854  False
35  BM_ReadColumn/99/0     1.928 GiB/sec    2.661 GiB/sec    38.045  False
17  BM_ReadColumn/35/10    1.656 GiB/sec    2.248 GiB/sec    35.689  False
 0  BM_ReadColumn/-1/50    2.765 GiB/sec    3.750 GiB/sec    35.628  False
26  BM_ReadColumn/30/10    1.692 GiB/sec    2.221 GiB/sec    31.244  False
19  BM_ReadColumn/75/1     1.712 GiB/sec    2.217 GiB/sec    29.505  False
 2  BM_ReadColumn/25/5   991.022 MiB/sec    1.229 GiB/sec    27.031  False
31  BM_ReadColumn/25/10    1.744 GiB/sec    2.164 GiB/sec    24.091  False
16  BM_ReadColumn/-1/50    5.892 GiB/sec    7.252 GiB/sec    23.084  False
24  BM_ReadColumn/10/10    1.003 GiB/sec    1.225 GiB/sec    22.073  False
23  BM_ReadColumn/25/25    1.790 GiB/sec    2.159 GiB/sec    20.659  False
10  BM_ReadColumn/5/5      1.995 GiB/sec    2.371 GiB/sec    18.853  False
 5  BM_ReadColumn/10/5     1.796 GiB/sec    2.122 GiB/sec    18.128  False
 3  BM_ReadColumn/-1/20    3.488 GiB/sec    4.062 GiB/sec    16.453  False
 1  BM_ReadColumn/-1/10    2.740 GiB/sec    3.055 GiB/sec    11.497  False
11  BM_ReadColumn/-1/10    1.362 GiB/sec    1.486 GiB/sec     9.086  False
 9  BM_ReadColumn/-1/0    12.984 GiB/sec   13.310 GiB/sec     2.509  False
13  BM_ReadColumn/5/10   515.781 MiB/sec  523.838 MiB/sec     1.562  False
 7  BM_ReadColumn/-1/1     4.505 GiB/sec    4.559 GiB/sec     1.198  False
34  BM_ReadColumn/1/20   400.254 MiB/sec  403.780 MiB/sec     0.881  False
25  BM_ReadColumn/-1/0   870.109 MiB/sec  875.827 MiB/sec     0.657  False
18  BM_ReadColumn/-1/1     8.237 GiB/sec    8.045 GiB/sec    -2.331  False
27  BM_ReadColumn/-1/1   816.651 MiB/sec  793.763 MiB/sec    -2.803  False
 8  BM_ReadColumn/10/50    1.649 GiB/sec    1.587 GiB/sec    -3.792  False
12  BM_ReadColumn/-1/0     3.044 GiB/sec    1.432 GiB/sec   -52.958  True
33  BM_ReadColumn/-1/0     3.161 GiB/sec    1.479 GiB/sec   -53.209  True
 4  BM_ReadColumn/-1/0     1.863 GiB/sec  754.619 MiB/sec   -60.448  True
```
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647694302 Here's what I see with gcc-8:

```
    benchmark                    baseline         contender  change %  regression
 1  BM_ReadColumn/1/1       1.336 GiB/sec     1.880 GiB/sec    40.701  False
17  BM_ReadColumn/1/1       2.224 GiB/sec     2.926 GiB/sec    31.567  False
14  BM_ReadColumn/99/0      1.599 GiB/sec     2.001 GiB/sec    25.128  False
23  BM_ReadColumn/99/50     1.601 GiB/sec     1.990 GiB/sec    24.300  False
21  BM_ReadColumn/50/50  1002.325 MiB/sec     1.206 GiB/sec    23.171  False
12  BM_ReadColumn/50/50   619.919 MiB/sec   758.845 MiB/sec    22.410  False
33  BM_ReadColumn/45/25  1016.908 MiB/sec     1.212 GiB/sec    22.095  False
30  BM_ReadColumn/50/0    620.068 MiB/sec   757.008 MiB/sec    22.085  False
34  BM_ReadColumn/5/5       1.586 GiB/sec     1.935 GiB/sec    21.981  False
27  BM_ReadColumn/99/0      2.539 GiB/sec     3.081 GiB/sec    21.308  False
13  BM_ReadColumn/10/5      1.361 GiB/sec     1.644 GiB/sec    20.791  False
18  BM_ReadColumn/50/1      1.002 GiB/sec     1.204 GiB/sec    20.174  False
26  BM_ReadColumn/75/1      1.182 GiB/sec     1.419 GiB/sec    20.005  False
15  BM_ReadColumn/99/50     2.532 GiB/sec     3.036 GiB/sec    19.890  False
11  BM_ReadColumn/30/10     1.123 GiB/sec     1.322 GiB/sec    17.754  False
35  BM_ReadColumn/25/10     1.187 GiB/sec     1.396 GiB/sec    17.558  False
25  BM_ReadColumn/25/5    741.612 MiB/sec   870.530 MiB/sec    17.384  False
 3  BM_ReadColumn/10/10   859.032 MiB/sec  1005.688 MiB/sec    17.072  False
31  BM_ReadColumn/35/10     1.095 GiB/sec     1.254 GiB/sec    14.528  False
20  BM_ReadColumn/-1/1    712.571 MiB/sec   808.480 MiB/sec    13.460  False
 5  BM_ReadColumn/25/25     1.244 GiB/sec     1.399 GiB/sec    12.397  False
 9  BM_ReadColumn/5/10    460.741 MiB/sec   484.654 MiB/sec     5.190  False
 7  BM_ReadColumn/1/20    397.103 MiB/sec   412.517 MiB/sec     3.882  False
29  BM_ReadColumn/-1/0    984.767 MiB/sec   994.483 MiB/sec     0.987  False
19  BM_ReadColumn/-1/0     13.609 GiB/sec    13.738 GiB/sec     0.944  False
28  BM_ReadColumn/-1/1      4.910 GiB/sec     4.702 GiB/sec    -4.236  False
 8  BM_ReadColumn/-1/1      8.631 GiB/sec     8.000 GiB/sec    -7.310  True
 6  BM_ReadColumn/-1/10     2.872 GiB/sec     2.329 GiB/sec   -18.912  True
32  BM_ReadColumn/-1/10     1.457 GiB/sec     1.099 GiB/sec   -24.615  True
22  BM_ReadColumn/10/50     1.433 GiB/sec   964.501 MiB/sec   -34.267  True
24  BM_ReadColumn/-1/0      2.325 GiB/sec     1.400 GiB/sec   -39.804  True
 2  BM_ReadColumn/-1/0      2.412 GiB/sec     1.422 GiB/sec   -41.059  True
10  BM_ReadColumn/-1/0      1.454 GiB/sec   864.019 MiB/sec   -41.959  True
16  BM_ReadColumn/-1/20     3.826 GiB/sec     2.184 GiB/sec   -42.899  True
 0  BM_ReadColumn/-1/50     4.669 GiB/sec     1.865 GiB/sec   -60.055  True
 4  BM_ReadColumn/-1/50     3.184 GiB/sec   862.581 MiB/sec   -73.546  True
```

I agree that alternating NA/not-NA is pretty pathological, so IMHO this performance regression isn't a big deal.
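To illustrate why an alternating NA/not-NA pattern is a worst case for any run-based reader: every bit starts a new run, so whatever fixed cost is paid per run is paid once per value. A small C++ sketch follows, using a hypothetical `CountRuns` helper that is not part of Arrow and says nothing about how the benchmark parameters above map to null patterns.

```cpp
#include <cstddef>
#include <vector>

// Counts maximal runs of identical bits in a validity sequence.
// Hypothetical helper for illustration only.
inline std::size_t CountRuns(const std::vector<bool>& bits) {
  if (bits.empty()) return 0;
  std::size_t runs = 1;
  for (std::size_t i = 1; i < bits.size(); ++i) {
    // A new run starts wherever the bit value flips.
    if (bits[i] != bits[i - 1]) ++runs;
  }
  return runs;
}
```

For eight values, an alternating pattern yields eight runs while an all-valid pattern yields one, so the per-run overhead differs by the full length of the input.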
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647690792 I rebased. I'm going to look at the benchmarks locally and then probably merge this so further performance work (including comparing with BitBlockCounter) can be pursued as follow up
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647674585 @emkornfield I'm sorry that I've been neglecting this PR. I will try to rebase this and investigate the perf questions a little bit
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-639566354 @emkornfield I can take a look at this once https://github.com/apache/arrow/pull/7356 is merged, which adds a single-word function to `BitBlockCounter`
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-639184070 I could try to adapt the Parquet code to use BitBlockCounter and see what the benchmarks look like?
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-638884684 My anecdotal experience suggests that 1% is a higher-than-average rate of nulls, but I don't know how accurate that is. The popcount strategy could be adapted to do 64 bits at a time (instead of the 256 that's implemented now), which would mean a higher incidence of no-nulls blocks in the 1% case.
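As a rough illustration of the block-size point, assume nulls are independent at a 1% rate (an assumption for the arithmetic, not a measurement): a 64-bit block is then null-free with probability 0.99^64 ≈ 0.53, versus 0.99^256 ≈ 0.08 for a 256-bit block. A minimal C++ sketch of classifying a validity bitmap one 64-bit word at a time by popcount is shown below; `BlockSummary` and `SummarizeWords` are hypothetical names and this is not Arrow's `BitBlockCounter` API.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary type for illustration; not part of Arrow.
struct BlockSummary {
  std::size_t all_valid = 0;  // words with all 64 bits set (fast path)
  std::size_t all_null = 0;   // words with no bits set (fast path)
  std::size_t mixed = 0;      // words that need a bit-by-bit pass
};

// Classifies each 64-bit word of a validity bitmap by its popcount.
inline BlockSummary SummarizeWords(const std::vector<std::uint64_t>& words) {
  BlockSummary out;
  for (std::uint64_t word : words) {
    const std::size_t set_bits = std::bitset<64>(word).count();
    if (set_bits == 64) {
      ++out.all_valid;
    } else if (set_bits == 0) {
      ++out.all_null;
    } else {
      ++out.mixed;
    }
  }
  return out;
}
```

The smaller the block, the more often a sparse-null bitmap lands in the `all_valid` bucket, at the cost of more blocks to classify.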
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-638534173 OK I adapted the benchmark here to use the `BitmapScanner` from #7346 https://github.com/wesm/arrow/tree/bit-runner

```
--------------------------------------------------------------------------------
Benchmark                          Time           CPU   Iterations
--------------------------------------------------------------------------------
BitRunReader/-1                 9890 ns       9890 ns        70683   49.3705MB/s
BitRunReader/0                   108 ns        108 ns      6250693   4.423GB/s
BitRunReader/10                 2101 ns       2101 ns       334686   232.36MB/s
BitRunReader/25                 4072 ns       4072 ns       173114   119.915MB/s
BitRunReader/50                 5221 ns       5221 ns       133040   93.5178MB/s
BitRunReader/60                 5042 ns       5042 ns       138099   96.8386MB/s
BitRunReader/75                 3933 ns       3933 ns       179857   124.152MB/s
BitRunReader/99                  291 ns        291 ns      2412105   1.6376GB/s
BitRunReaderWithScanner/-1        47 ns         47 ns     15059881   10.2331GB/s
BitRunReaderWithScanner/0         46 ns         46 ns     15078363   10.2704GB/s
BitRunReaderWithScanner/10        47 ns         47 ns     15118172   10.2299GB/s
BitRunReaderWithScanner/25        47 ns         47 ns     15033144   10.2528GB/s
BitRunReaderWithScanner/50        47 ns         47 ns     14947443   10.1964GB/s
BitRunReaderWithScanner/60        47 ns         47 ns     14668505   10.0837GB/s
BitRunReaderWithScanner/75        46 ns         46 ns     15045334   10.2918GB/s
BitRunReaderWithScanner/99        46 ns         46 ns     14961067   10.2813GB/s
BitRunReaderScalar/-1          13089 ns      13088 ns        51449   37.3063MB/s
BitRunReaderScalar/0            3844 ns       3844 ns       176221   127.024MB/s
BitRunReaderScalar/10           6621 ns       6621 ns       104648   73.7517MB/s
BitRunReaderScalar/25          12397 ns      12397 ns        55998   39.388MB/s
BitRunReaderScalar/50          17099 ns      17099 ns        41378   28.556MB/s
BitRunReaderScalar/60          16606 ns      16606 ns        42580   29.4046MB/s
BitRunReaderScalar/75          11431 ns      11431 ns        61744   42.7165MB/s
BitRunReaderScalar/99           4265 ns       4265 ns       167402   114.484MB/s
```
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-638530324 From a brief look this is a lot more complicated than #7346. Does it have additional features? I can try to set up benchmarks to compare the two performance-wise
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-638528872 I need to take a closer look to compare the API of this versus what is in #7346, but for the specific use case in #7346, that code is going to be much faster. It would be nice to have some apples-to-apples benchmarks
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-634107353
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-632463807 Cool, I will try to look at this tomorrow and kick the tires a bit with benchmarks
[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-626408844 A fair chunk of RLE-related code came out of Impala originally, so it might not be a bad idea to peek at what's in apache/impala to see if it has gotten worked on perf-wise since the beginning of 2016. In any case, we wouldn't want to take on a change that causes a perf regression in cases with many short null/non-null runs.