[GitHub] [arrow] kou commented on issue #15280: [C++] Move Acero out of libarrow
kou commented on issue #15280: URL: https://github.com/apache/arrow/issues/15280#issuecomment-1376855728

I prefer `libacero`, because Gandiva uses `libgandiva`, not `libarrow_gandiva`. ADBC also uses `libadbc_*`, not `libarrow_adbc_*`.
[GitHub] [arrow] kou commented on issue #15279: 10.0.1 Ruby gems are not released to RubyGems
kou commented on issue #15279: URL: https://github.com/apache/arrow/issues/15279#issuecomment-1376852622

Oh, I forgot to push them! I've pushed them. BTW, what is your platform? I want to see the full log of the problem situation. I think that red-arrow 10.0.0 requires Apache Arrow C GLib 10.0.0 or later. So red-arrow 10.0.0 should not try to install Apache Arrow C GLib 10.0.1 when Apache Arrow C GLib 10.0.1 is already installed.
[GitHub] [arrow-datafusion] ursabot commented on pull request #4863: Support for SQL Natural Join
ursabot commented on PR #4863: URL: https://github.com/apache/arrow-datafusion/pull/4863#issuecomment-1376808500

Benchmark runs are scheduled for baseline = 1eb46df49fcd2479235500810c0562b10da77c90 and contender = 556282a8b6da6cb7d41d8c311211ae49b7ed82a7. 556282a8b6da6cb7d41d8c311211ae49b7ed82a7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4699ea4135f74164bb6e16b6a9875991...93a4e515b1044fa39afc952dbcac6259/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cb2a0250846d4dedabcc14e082e9b60f...a56bac486f544116ae63385428b85d01/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/e46e16de48b1454d8a23337944dbfc30...d3917c7d9226415fb513f6db83e315ca/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/baed42e3723c48c4afd60fcff464b6fe...33d9c9cf6ea24b03be7872af81148197/)

Buildkite builds:

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] ursabot commented on pull request #4852: Covariance single row input & null skipping
ursabot commented on PR #4852: URL: https://github.com/apache/arrow-datafusion/pull/4852#issuecomment-1376808478

Benchmark runs are scheduled for baseline = 292eb954fc0bad3a1febc597233ba26cb60bda3e and contender = 1eb46df49fcd2479235500810c0562b10da77c90. 1eb46df49fcd2479235500810c0562b10da77c90 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7fa5429ba5fc4323b33a306c3ea156f3...4699ea4135f74164bb6e16b6a9875991/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/337f5b7a8b8944bda0f0425ea232e627...cb2a0250846d4dedabcc14e082e9b60f/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2e4be53365ae4c42a27f8f84b96d17bf...e46e16de48b1454d8a23337944dbfc30/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bad9c40adbb34927a3668ca46355549e...baed42e3723c48c4afd60fcff464b6fe/)

Buildkite builds:

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] ygf11 commented on issue #4844: Incorrect results for join condition against current master branch
ygf11 commented on issue #4844: URL: https://github.com/apache/arrow-datafusion/issues/4844#issuecomment-1376804032

> I agree -- I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?

Of course @alamb, I will submit a PR today or tomorrow.
[GitHub] [arrow-datafusion] alamb merged pull request #4863: Support for SQL Natural Join
alamb merged PR #4863: URL: https://github.com/apache/arrow-datafusion/pull/4863
[GitHub] [arrow-datafusion] alamb closed issue #117: Add SQL support for NATURAL JOIN
alamb closed issue #117: Add SQL support for NATURAL JOIN URL: https://github.com/apache/arrow-datafusion/issues/117
[GitHub] [arrow-datafusion] alamb merged pull request #4852: Covariance single row input & null skipping
alamb merged PR #4852: URL: https://github.com/apache/arrow-datafusion/pull/4852
[GitHub] [arrow-datafusion] alamb commented on pull request #4852: Covariance single row input & null skipping
alamb commented on PR #4852: URL: https://github.com/apache/arrow-datafusion/pull/4852#issuecomment-1376802487 Thanks @korowa and @ozankabak
[GitHub] [arrow-datafusion] ursabot commented on pull request #4848: Support using var/var_pop/stddev/stddev_pop in window expressions with custom frames
ursabot commented on PR #4848: URL: https://github.com/apache/arrow-datafusion/pull/4848#issuecomment-1376801429

Benchmark runs are scheduled for baseline = 13fb42efec4b5ab7f9aa251f1705fdcf89057d23 and contender = 292eb954fc0bad3a1febc597233ba26cb60bda3e. 292eb954fc0bad3a1febc597233ba26cb60bda3e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/597674ce0c554b6c87821768e1a80ede...7fa5429ba5fc4323b33a306c3ea156f3/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5d9e1bc44e204f3a982e1786fd6bcc0b...337f5b7a8b8944bda0f0425ea232e627/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/89f47298cf844a66bbb6ea7f935e19d3...2e4be53365ae4c42a27f8f84b96d17bf/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/89d3d1c9c5af48c39a55df97f307a4c4...bad9c40adbb34927a3668ca46355549e/)

Buildkite builds:

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] alamb commented on pull request #4852: Covariance single row input & null skipping
alamb commented on PR #4852: URL: https://github.com/apache/arrow-datafusion/pull/4852#issuecomment-1376800585

> Had we migrated to Rust 1.65, I would suggest the much simpler:
> @alamb, any information as to when we will make the switch?

I am not sure what you are asking here, @ozankabak. The CI checks use the latest stable Rust version. I just verified that CI is currently using Rust `1.66`: https://github.com/apache/arrow-datafusion/actions/runs/3878200568/jobs/6614116891#step:4:97
[GitHub] [arrow-datafusion] alamb merged pull request #4848: Support using var/var_pop/stddev/stddev_pop in window expressions with custom frames
alamb merged PR #4848: URL: https://github.com/apache/arrow-datafusion/pull/4848
[GitHub] [arrow-datafusion] alamb commented on pull request #4848: Support using var/var_pop/stddev/stddev_pop in window expressions with custom frames
alamb commented on PR #4848: URL: https://github.com/apache/arrow-datafusion/pull/4848#issuecomment-1376798281 Nice -- thank you @jonmmease and @ozankabak
[GitHub] [arrow-datafusion] ursabot commented on pull request #4846: Implement retract_batch for AvgAccumulator
ursabot commented on PR #4846: URL: https://github.com/apache/arrow-datafusion/pull/4846#issuecomment-1376794889

Benchmark runs are scheduled for baseline = 42f7dd50913c4f6b6d830b88d55dfd4a8d16d44f and contender = 13fb42efec4b5ab7f9aa251f1705fdcf89057d23. 13fb42efec4b5ab7f9aa251f1705fdcf89057d23 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/faa29f51a2314df2957bc566903cb458...597674ce0c554b6c87821768e1a80ede/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a3b2097585e04c51bc17dfaba323e635...5d9e1bc44e204f3a982e1786fd6bcc0b/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/76c4557b831b456184c5e76d92fb0cd7...89f47298cf844a66bbb6ea7f935e19d3/)
[Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/7b369e15214b45baa689fef51fe6e0d3...89d3d1c9c5af48c39a55df97f307a4c4/)

Buildkite builds:

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-datafusion] alamb commented on issue #4844: Incorrect results for join condition against current master branch
alamb commented on issue #4844: URL: https://github.com/apache/arrow-datafusion/issues/4844#issuecomment-1376793437

> IMO it would be a good idea to fix this before releasing,

I agree -- I have mentioned it on the mailing list vote thread. @ygf11 can you work on a PR for this issue soon?
[GitHub] [arrow-datafusion] alamb commented on pull request #4846: Implement retract_batch for AvgAccumulator
alamb commented on PR #4846: URL: https://github.com/apache/arrow-datafusion/pull/4846#issuecomment-1376789846 Sorry for the delay @jonmmease -- thanks again for the contributions!
[GitHub] [arrow-datafusion] alamb closed issue #4845: Support custom window frame with AVG aggregate function
alamb closed issue #4845: Support custom window frame with AVG aggregate function URL: https://github.com/apache/arrow-datafusion/issues/4845
[GitHub] [arrow-datafusion] alamb merged pull request #4846: Implement retract_batch for AvgAccumulator
alamb merged PR #4846: URL: https://github.com/apache/arrow-datafusion/pull/4846
[GitHub] [arrow-datafusion] alamb commented on pull request #4855: Minor: Move test data into `datafusion/core/tests/data`
alamb commented on PR #4855: URL: https://github.com/apache/arrow-datafusion/pull/4855#issuecomment-1376789486 Thanks @jackwener
[GitHub] [arrow] ursabot commented on pull request #15268: GH-14976: [Python] Avoid dependency on exec plan in Table.sort_by to fix minimal tests
ursabot commented on PR #15268: URL: https://github.com/apache/arrow/pull/15268#issuecomment-1376781511

Benchmark runs are scheduled for baseline = 5f3df1255e6e0b7d2d4e197afec447769335f087 and contender = 154de48f0b6c16ae5acc384ed8383704253afd96. 154de48f0b6c16ae5acc384ed8383704253afd96 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6913a328cb6941b2b4ea823560eec9b3...69352a735ab14742bb5463c31c4e7d1d/)
[Finished :arrow_down:0.58% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1240994dbd664119a06041753ed50fb7...a093b2196b8e4f7784a166efb4c3f781/)
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bd6091cf591947b6adc113fadc2ef40e...4cbb0be6cd444cc9bd920d85bb5b6c42/)
[Finished :arrow_down:0.1% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9c9c618cfdc04f7bb7a7704a0d8c2cc5...b3cd26fdcfe247faa32546f8e3ba484a/)

Buildkite builds:
[Finished] [`154de48f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2139)
[Finished] [`154de48f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2162)
[Finished] [`154de48f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2134)
[Finished] [`154de48f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2154)
[Finished] [`5f3df125` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2138)
[Finished] [`5f3df125` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2161)
[Finished] [`5f3df125` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2133)
[Finished] [`5f3df125` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2153)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow-ballista] Ted-Jiang closed pull request #285: [arrow-ballista] enable delete log periodically (#280)
Ted-Jiang closed pull request #285: [arrow-ballista] enable delete log periodically (#280) URL: https://github.com/apache/arrow-ballista/pull/285
[GitHub] [arrow] lukehsiao commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely
lukehsiao commented on issue #15233: URL: https://github.com/apache/arrow/issues/15233#issuecomment-1376741426

> I'm taking a look at this. I have one debugging question for you in the meantime however. When I look at _upload_to_uri_with_exclude it appears that it is a recursive copy. Is there a possibility that you've included the current directory "." or parent directory ".." and that's why the operation never finishes?

In our case, `exclude` is None, so that code path to `_upload_to_uri_with_exclude` is actually never hit (and you'll notice it doesn't show in the flamegraph). So, it's essentially just the call to `pyarrow.fs.copy_files`.
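[Editor's note] For readers following along, a minimal sketch of the direct call under discussion. The filesystems, region, and paths below are hypothetical placeholders, not taken from the issue; `pyarrow.fs.copy_files` recurses over the source tree.

```
import pyarrow.fs

local = pyarrow.fs.LocalFileSystem()
s3 = pyarrow.fs.S3FileSystem(region="us-east-1")  # region is an assumption

# This is the kind of call reported to hang: a recursive directory copy
# from a local filesystem to S3.
pyarrow.fs.copy_files(
    "/tmp/checkpoint",        # source directory (placeholder)
    "my-bucket/checkpoint",   # destination path (placeholder)
    source_filesystem=local,
    destination_filesystem=s3,
)
```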
[GitHub] [arrow] mapleFU commented on pull request #15241: GH-14923: [C++][Parquet] Fix DELTA_BINARY_PACKED problem on reading the last block with malformed bit-width
mapleFU commented on PR #15241: URL: https://github.com/apache/arrow/pull/15241#issuecomment-1376739203

@pitrou @wjones127 Mind taking a look? I made a lot of naming changes (described in https://github.com/apache/arrow/pull/15241#issue-1523751523) and updated the hex testing format. I don't know whether that's OK.
[GitHub] [arrow] cyb70289 merged pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 merged PR #15201: URL: https://github.com/apache/arrow/pull/15201
[GitHub] [arrow] cyb70289 commented on pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 commented on PR #15201: URL: https://github.com/apache/arrow/pull/15201#issuecomment-1376733663

Not related to this PR, but the benchmark result is not obvious to me. Why is `double` faster than `int64`? I suppose integer rounding does nothing.
```
Ceil/size:1048576/inverse_null_proportion:0   323253 ns   323251 ns   2161   bytes_per_second=3.02107G/s items_per_second=405.481M/s null_percent=0 size=1048.58k
Ceil/size:1048576/inverse_null_proportion:0    89834 ns    89832 ns   7795   bytes_per_second=10.871G/s items_per_second=1.45908G/s null_percent=0 size=1048.58k
```
[GitHub] [arrow] cyb70289 commented on pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 commented on PR #15201: URL: https://github.com/apache/arrow/pull/15201#issuecomment-1376730524 Travis CI error is not related.
[GitHub] [arrow] github-actions[bot] commented on pull request #15288: GH-15284: [C++] Use DeclarationToExecBatches in Acero plan tests
github-actions[bot] commented on PR #15288: URL: https://github.com/apache/arrow/pull/15288#issuecomment-1376724815 * Closes: #15284
[GitHub] [arrow] vibhatha opened a new pull request, #15288: GH-15284: [C++] Use DeclarationToExecBatches in Acero plan tests
vibhatha opened a new pull request, #15288: URL: https://github.com/apache/arrow/pull/15288

This PR changes the existing test cases in `plan_test.cc` for Acero-based plan evaluation. At the moment, the tests use the `StartAndCollect` method, manually pass the `FutureMatcher` and the expected results, and validate the results obtained from `StartAndCollect`. Most of the test cases updated in this PR use a common format; replacing that format with `DeclarationToExecBatches` makes the code more readable and easier to debug.
[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #4863: Support for SQL Natural Join
jackwener commented on code in PR #4863: URL: https://github.com/apache/arrow-datafusion/pull/4863#discussion_r1065311448

## datafusion/sql/src/relation/join.rs:

```
@@ -142,10 +142,29 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 .build()
             }
             JoinConstraint::Natural => {
-                // https://issues.apache.org/jira/browse/ARROW-10727
-                Err(DataFusionError::NotImplemented(
-                    "NATURAL JOIN is not supported (https://issues.apache.org/jira/browse/ARROW-10727)".to_string(),
-                ))
+                // TODO: what if no common join cols? -> becomes cross join
+                // TODO: limit join_type to specific set?
+                let left_cols: HashSet<&String> = left
+                    .schema()
+                    .fields()
+                    .iter()
+                    .map(|f| f.field().name())
+                    .collect();
+                let keys: Vec<Column> = right
+                    .schema()
+                    .fields()
+                    .iter()
+                    .map(|f| f.field().name())
+                    .filter(|f| left_cols.contains(f))
+                    .map(Column::from_name)
+                    .collect();
```

Review Comment:
> I'm not sure I follow, are you suggesting to consider the qualifier when checking for common columns? My understanding is that when joining two different tables they should have different qualifiers anyway.

Looks like the current implementation is OK: `field.name()` doesn't use the qualifier, so this problem doesn't exist.
[GitHub] [arrow] EpsilonPrime commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely
EpsilonPrime commented on issue #15233: URL: https://github.com/apache/arrow/issues/15233#issuecomment-1376686021

I'm taking a look at this. I have one debugging question for you in the meantime, however. When I look at `_upload_to_uri_with_exclude`, it appears to be a recursive copy. Is there a possibility that you've included the current directory "." or the parent directory ".." and that's why the operation never finishes? One way of determining this would be to insert a log/print at https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L238 to output the paths that are going to be copied from.
[GitHub] [arrow] ursabot commented on pull request #15269: ARROW-16728: [Python] ParquetDataset to still take legacy code path when old filesystem is passed
ursabot commented on PR #15269: URL: https://github.com/apache/arrow/pull/15269#issuecomment-1376682079

Benchmark runs are scheduled for baseline = 0121ae73f4852ee3c5a6870ed5583916662a35be and contender = 5f3df1255e6e0b7d2d4e197afec447769335f087. 5f3df1255e6e0b7d2d4e197afec447769335f087 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e2790950792344dea8837e8cccbd1c74...6913a328cb6941b2b4ea823560eec9b3/)
[Finished :arrow_down:0.91% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/1a2c8751870940f0ae99d4a02268d742...1240994dbd664119a06041753ed50fb7/)
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7fd5af8e17ec442fa199339edcbabd07...bd6091cf591947b6adc113fadc2ef40e/)
[Finished :arrow_down:0.07% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ebf1d499d58a4016b08dff1a603c6d0d...9c9c618cfdc04f7bb7a7704a0d8c2cc5/)

Buildkite builds:
[Finished] [`5f3df125` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2138)
[Finished] [`5f3df125` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2161)
[Finished] [`5f3df125` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2133)
[Finished] [`5f3df125` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2153)
[Finished] [`0121ae73` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2137)
[Finished] [`0121ae73` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2160)
[Finished] [`0121ae73` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2132)
[Finished] [`0121ae73` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2152)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] cyb70289 commented on a diff in pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 commented on code in PR #15201: URL: https://github.com/apache/arrow/pull/15201#discussion_r1065303435

## cpp/src/arrow/compute/kernels/scalar_round_benchmark.cc:

```
@@ -0,0 +1,125 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include <vector>
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+
+// Use a fixed hash to ensure consistent results from run to run.
+constexpr auto kSeed = 0x94378165;
+
+template
+static void RoundArrayBenchmark(benchmark::State& state, const std::string& func_name) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(CType);
+  auto rand = random::RandomArrayGenerator(kSeed);
+
+  // Choose values so as to avoid overflow on all ops and types.
+  auto min = static_cast<CType>(6);
+  auto max = static_cast<CType>(min + 15);
+  auto val = std::static_pointer_cast<NumericArray<ArrowType>>(
+      rand.Numeric<ArrowType>(array_size, min, max, args.null_proportion));
+  RoundOptions options;
+  options.round_mode = static_cast<RoundMode>(Mode);
+
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {val}, &options));
+  }
+  state.SetItemsProcessed(state.iterations() * array_size);
+}
+
+void SetRoundArgs(benchmark::internal::Benchmark* bench) {
+  bench->ArgNames({"size", "inverse_null_proportion"});
+
+  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
+    bench->Args({static_cast<int64_t>(kL2Size), inverse_null_proportion});
+  }
+}
+
+template
+static void Ceil(benchmark::State& state) {
+  RoundArrayBenchmark(state, "ceil");
+}
+
+template
+static void Floor(benchmark::State& state) {
+  RoundArrayBenchmark(state, "floor");
```

Review Comment:
We can leave decimals as a follow-up task. Can you open an issue to track it? Thanks.
[GitHub] [arrow] EpsilonPrime commented on pull request #15283: GH-15282: [C++,CI] Temporarily fix continuous integration issue with clang-format v14.
EpsilonPrime commented on PR #15283: URL: https://github.com/apache/arrow/pull/15283#issuecomment-1376681342

This attempt appears to change the outer configuration of the Docker test build and not the interior. As such, it doesn't address the underlying problem.
[GitHub] [arrow] EpsilonPrime commented on a diff in pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
EpsilonPrime commented on code in PR #15201: URL: https://github.com/apache/arrow/pull/15201#discussion_r1065302588

## cpp/src/arrow/compute/kernels/scalar_round_benchmark.cc:

```
@@ -0,0 +1,125 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include <vector>
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+
+// Use a fixed hash to ensure consistent results from run to run.
+constexpr auto kSeed = 0x94378165;
+
+template
+static void RoundArrayBenchmark(benchmark::State& state, const std::string& func_name) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(CType);
+  auto rand = random::RandomArrayGenerator(kSeed);
+
+  // Choose values so as to avoid overflow on all ops and types.
+  auto min = static_cast<CType>(6);
+  auto max = static_cast<CType>(min + 15);
+  auto val = std::static_pointer_cast<NumericArray<ArrowType>>(
+      rand.Numeric<ArrowType>(array_size, min, max, args.null_proportion));
+  RoundOptions options;
+  options.round_mode = static_cast<RoundMode>(Mode);
+
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {val}, &options));
+  }
+  state.SetItemsProcessed(state.iterations() * array_size);
+}
+
+void SetRoundArgs(benchmark::internal::Benchmark* bench) {
+  bench->ArgNames({"size", "inverse_null_proportion"});
+
+  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
+    bench->Args({static_cast<int64_t>(kL2Size), inverse_null_proportion});
+  }
+}
+
+template
+static void Ceil(benchmark::State& state) {
+  RoundArrayBenchmark(state, "ceil");
+}
+
+template
+static void Floor(benchmark::State& state) {
+  RoundArrayBenchmark(state, "floor");
```

Review Comment:
Good catch, fixed (although I had to remove the Decimal128 benchmarks for now).
[GitHub] [arrow] cyb70289 commented on a diff in pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 commented on code in PR #15201: URL: https://github.com/apache/arrow/pull/15201#discussion_r1065293955

## cpp/src/arrow/compute/kernels/scalar_round_benchmark.cc:

```
@@ -0,0 +1,125 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include <vector>
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+
+// Use a fixed hash to ensure consistent results from run to run.
+constexpr auto kSeed = 0x94378165;
+
+template
+static void RoundArrayBenchmark(benchmark::State& state, const std::string& func_name) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(CType);
+  auto rand = random::RandomArrayGenerator(kSeed);
+
+  // Choose values so as to avoid overflow on all ops and types.
+  auto min = static_cast<CType>(6);
+  auto max = static_cast<CType>(min + 15);
+  auto val = std::static_pointer_cast<NumericArray<ArrowType>>(
+      rand.Numeric<ArrowType>(array_size, min, max, args.null_proportion));
+  RoundOptions options;
+  options.round_mode = static_cast<RoundMode>(Mode);
+
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {val}, &options));
+  }
+  state.SetItemsProcessed(state.iterations() * array_size);
+}
+
+void SetRoundArgs(benchmark::internal::Benchmark* bench) {
+  bench->ArgNames({"size", "inverse_null_proportion"});
+
+  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
+    bench->Args({static_cast<int64_t>(kL2Size), inverse_null_proportion});
+  }
+}
+
+template
+static void Ceil(benchmark::State& state) {
+  RoundArrayBenchmark(state, "ceil");
+}
+
+template
+static void Floor(benchmark::State& state) {
+  RoundArrayBenchmark(state, "floor");
```

Review Comment:
`FloatType` looks wrong. Should it be `ArrowType`? Same for Round and Trunc.
[GitHub] [arrow-rs] JayjeetAtGithub commented on issue #3483: Use `array_value_to_string` in `arrow-csv`
JayjeetAtGithub commented on issue #3483: URL: https://github.com/apache/arrow-rs/issues/3483#issuecomment-1376649197

@alamb @tustvold Continuing the discussion from apache/arrow-rs#3417: currently the `arrow-csv` writer allows users to pass custom datetime formats besides the defaults. By using `array_value_to_string`, we would enforce the format and disable that customization. Otherwise, if we still want customization, we can keep both, where we basically fall back to `array_value_to_string` when no custom format is supplied by the user. Looking forward to hearing your thoughts on this. Thanks.
[GitHub] [arrow-ballista] yahoNanJing commented on pull request #576: Use lz4 for shuffle (stream) compression
yahoNanJing commented on PR #576: URL: https://github.com/apache/arrow-ballista/pull/576#issuecomment-1376640134

> LZ4 is much more light on the CPU than ZSTD, so would be great if we could configure this too!

I think by default the compression is not enabled, right? So at the least, it would be better for us to add a config option to indicate whether to enable the compression feature.
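[Editor's note] For readers unfamiliar with the knob under discussion: shuffle data travels as Arrow IPC stream batches, and compression is selected at the IPC layer. Ballista's own configuration is Rust-side and still being decided in this thread; purely as a cross-language illustration, the equivalent setting in pyarrow looks like this:

```
import pyarrow as pa

schema = pa.schema([("x", pa.int64())])
sink = pa.BufferOutputStream()

# "lz4" selects the LZ4 frame codec; "zstd" is the heavier alternative
# compared in the thread.
opts = pa.ipc.IpcWriteOptions(compression="lz4")
with pa.ipc.new_stream(sink, schema, options=opts) as writer:
    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], schema=schema)
    writer.write_batch(batch)
```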
[GitHub] [arrow] cyb70289 merged pull request #15250: GH-15249: [Documentation] Add PR template
cyb70289 merged PR #15250: URL: https://github.com/apache/arrow/pull/15250
[GitHub] [arrow] westonpace commented on issue #14917: [C++] Issues with GTest when building Arrow C++ on MacOS
westonpace commented on issue #14917: URL: https://github.com/apache/arrow/issues/14917#issuecomment-1376623935

> @AlenkaF does this happen after using the Conda instructions in the Python development docs?

@mhconradt Yes. It appears that conda's recipe has the same issue. I have verified this manually using the reproducer in https://github.com/Homebrew/homebrew-core/issues/111810. I think conda just uses the CMake file as-is, and clang defaults to C++14 if not specified.
[GitHub] [arrow] cyb70289 commented on a diff in pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
cyb70289 commented on code in PR #15201: URL: https://github.com/apache/arrow/pull/15201#discussion_r1065267123

## cpp/src/arrow/compute/kernels/scalar_round_benchmark.cc:

```
@@ -0,0 +1,111 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include <vector>
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+
+// Use a fixed hash to ensure consistent results from run to run.
+constexpr auto kSeed = 0x94378165;
+
+template
+static void RoundArrayBenchmark(benchmark::State& state, const std::string& func_name) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(CType);
+  auto rand = random::RandomArrayGenerator(kSeed);
+
+  // Choose values so as to avoid overflow on all ops and types.
+  auto min = static_cast<CType>(6);
+  auto max = static_cast<CType>(min + 15);
+  auto val = std::static_pointer_cast<NumericArray<ArrowType>>(
+      rand.Numeric<ArrowType>(array_size, min, max, args.null_proportion));
+  RoundOptions options;
+  options.round_mode = static_cast<RoundMode>(Mode);
+
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {val}, &options));
+  }
+  state.SetItemsProcessed(state.iterations() * array_size);
+}
+
+void SetRoundArgs(benchmark::internal::Benchmark* bench) {
+  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
```

Review Comment:
Ah, that's okay.
[GitHub] [arrow] boshek commented on issue #15056: [R][R Shiny] Deployed/published Rshiny app using arrow to query an AWS-hosted dataset causes intermittent "stack imbalance", "segfault", "memory not
boshek commented on issue #15056: URL: https://github.com/apache/arrow/issues/15056#issuecomment-1376600272

The feather files are rather large, so for something like this, where so much is going across the wire, csv or parquet can actually be quicker. That said, it _should_ still work regardless of file format.
```
microbenchmark::microbenchmark(
  arrow = open_dataset(arrow_bucket, format = 'arrow', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone', sample_duration == '1 HOUR', poc == 1) %>%
    collect(),
  csv = open_dataset(csv_bucket, format = 'csv', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone', sample_duration == '1 HOUR', poc == 1) %>%
    collect(),
  parquet = open_dataset(parquet_bucket, format = 'parquet', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone', sample_duration == '1 HOUR', poc == 1) %>%
    collect(),
  times = 3L
)
Unit: seconds
    expr       min        lq      mean    median        uq       max neval
   arrow 16.984894 17.353667 17.822625 17.722439 18.241490 18.760542     3
     csv 10.173464 10.474577 10.876690 10.775691 11.228303 11.680915     3
 parquet  8.599491  8.630459  8.705268  8.661427  8.758156  8.854885     3
```
The reason I suggested Linux is that RSC and shinyapps run on Linux machines, so maybe this isn't related to RStudio Connect at all. That would be good to test. I will see what I can do.
[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?
westonpace commented on issue #15220: URL: https://github.com/apache/arrow/issues/15220#issuecomment-1376594167

Reading json and writing parquet, purely in pyarrow, should look something like:
```
import pyarrow.parquet
import pyarrow.json

table = pyarrow.json.read_json('/tmp/foo.jsonl')
pyarrow.parquet.write_table(table, '/tmp/foo.parquet')
```
Separating disk write from encoding would be trickier (there is an in-memory filesystem in C++ but I don't think we expose it in python).
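[Editor's note] One way to approximate separating encoding time from disk I/O without the unexposed C++ in-memory filesystem is to encode into a `BufferOutputStream` first. The timing scaffolding below is an illustration, not from the thread:

```
import time
import pyarrow as pa
import pyarrow.json
import pyarrow.parquet as pq

table = pyarrow.json.read_json("/tmp/foo.jsonl")

t0 = time.perf_counter()
sink = pa.BufferOutputStream()   # in-memory sink: Parquet encoding only
pq.write_table(table, sink)
encoded = sink.getvalue()        # pyarrow.Buffer holding the file bytes
print("encode:", time.perf_counter() - t0)

t0 = time.perf_counter()
with open("/tmp/foo.parquet", "wb") as f:
    f.write(encoded)             # raw disk write, no encoding involved
print("write:", time.perf_counter() - t0)
```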
[GitHub] [arrow] wjones127 commented on issue #15231: [Benchmarking][C++] Track memory usage in C++ microbenchmarks
wjones127 commented on issue #15231: URL: https://github.com/apache/arrow/issues/15231#issuecomment-1376592183

> Though I do wonder how many memory issues would be revealed at the micro-benchmark level

It's a good question. One thing I have noticed when I do measure memory usage is that it tends to be very deterministic. So it's a lot easier to detect regressions and improvements for memory usage than it is for time-based metrics (which can be very noisy). But it's also the case that memory issues can happen due to the interaction between systems, rather than just within isolated components.

If we are using Google Benchmark for any macro-benchmarks, I'm happy to apply it there as well.
[GitHub] [arrow] westonpace commented on issue #15232: [Documentation] Provide guidance to contributors on getting reviews
westonpace commented on issue #15232: URL: https://github.com/apache/arrow/issues/15232#issuecomment-1376587209

I'm in favor of more users reading the contributing guidelines / docs (they're great). I don't know how effective assigning reviewers (or something like code owners) will be. On the one hand, I suppose I could stop looking at all notifications, which is what I do today. On the other hand, I fear I would become blind to anything that wasn't a modification of files I was assigned to (I assume we would just self-assign?) Have other projects had success with this?

Another suggestion might be to make sure PRs are assigned to at least one reviewer, ping reviewers when they do not respond in a reasonable amount of time, and, if too much time passes, assign a different reviewer or close. Not sure if that can be automated or would require some kind of human intervention.
[GitHub] [arrow] westonpace commented on issue #15226: [Python] dictionary_encode support duration types
westonpace commented on issue #15226: URL: https://github.com/apache/arrow/issues/15226#issuecomment-1376578323

Casting from duration to `int64` should be a zero-copy operation:
```
>>> import pyarrow as pa
>>> import datetime
>>> arr = pa.array([datetime.timedelta(days=1), datetime.timedelta(hours=1)])
>>> arr
[
  86400000000,
  3600000000
]
>>> arr.cast(pa.int64())
[
  86400000000,
  3600000000
]
```
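[Editor's note] A sketch of the workaround this zero-copy cast enables for the issue's request: dictionary-encode the `int64` view, then restore the duration type on the dictionary values if needed. The round-trip is illustrative, assuming `int64` casts back to the duration type (the issue itself only confirms the forward cast):

```
import datetime
import pyarrow as pa

arr = pa.array([datetime.timedelta(days=1), datetime.timedelta(hours=1)])

# Encode via the zero-copy int64 view, since dictionary_encode does not
# yet accept duration types directly.
encoded = arr.cast(pa.int64()).dictionary_encode()

# Put the duration type back on the dictionary values.
restored = pa.DictionaryArray.from_arrays(
    encoded.indices, encoded.dictionary.cast(arr.type)
)
```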
[GitHub] [arrow] westonpace commented on issue #15161: [C++] Inplace renaming of columns in c++ record batch.
westonpace commented on issue #15161: URL: https://github.com/apache/arrow/issues/15161#issuecomment-1376574396

> I don't see this function in the record batch

Correct, one does not exist. I was suggesting that someone create one. If you create one yourself, I'd encourage you to submit a PR.

> in what instance, is there a copy of the buffer.

Normally you have to go out of your way to copy a buffer. For example, create an array builder, add all the values from the old array, and then build an array from that. However, many of the compute functions (take, filter, sort, cast, etc.) will do this (some casts are zero-copy, but not all).
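[Editor's note] To illustrate the zero-copy rename pattern being suggested (shown through pyarrow; a C++ helper would do the same thing with `Field::WithName` and `RecordBatch::Make`): wrap the existing column arrays in a new batch that carries a new schema, so only schema metadata changes and no buffers are copied.

```
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Same arrays, new names: only the schema is rebuilt.
renamed = pa.RecordBatch.from_arrays(batch.columns, names=["col_a", "col_b"])

# The underlying data buffers are shared, not copied.
assert (renamed.column(0).buffers()[1].address
        == batch.column(0).buffers()[1].address)
```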
[GitHub] [arrow] westonpace commented on issue #15231: [Benchmarking][C++] Track memory usage in C++ microbenchmarks
westonpace commented on issue #15231: URL: https://github.com/apache/arrow/issues/15231#issuecomment-1376572429

This would be great. Though I do wonder how many memory issues would be revealed at the micro-benchmark level vs. at the macro-benchmark level (that being said, features like this make Google Benchmark convenient for macro-benchmarks as well).
[GitHub] [arrow] ursabot commented on pull request #15222: MINOR: [Java] Doc & javadoc improvements
ursabot commented on PR #15222: URL: https://github.com/apache/arrow/pull/15222#issuecomment-1376565366

Benchmark runs are scheduled for baseline = 11d286eafb72b630baf897e619c84ecdfc6b723f and contender = 0121ae73f4852ee3c5a6870ed5583916662a35be. 0121ae73f4852ee3c5a6870ed5583916662a35be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
[Finished :arrow_down:25.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/452345e6de07436e8b72b4bbb2eab4e2...e2790950792344dea8837e8cccbd1c74/)
[Failed :arrow_down:0.55% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/eddda79cf6f14eff87ad495619c1ced5...1a2c8751870940f0ae99d4a02268d742/)
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/eddc366f198743749dd78cfbc7d7d1ae...7fd5af8e17ec442fa199339edcbabd07/)
[Finished :arrow_down:0.07% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/cf52e7a7a2744c8d8201e5b095e7d558...ebf1d499d58a4016b08dff1a603c6d0d/)

Buildkite builds:
[Finished] [`0121ae73` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2137)
[Finished] [`0121ae73` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2160)
[Finished] [`0121ae73` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2132)
[Finished] [`0121ae73` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2152)
[Finished] [`11d286ea` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2136)
[Failed] [`11d286ea` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2159)
[Finished] [`11d286ea` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2131)
[Finished] [`11d286ea` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2151)

Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] vibhatha commented on issue #15284: [C++] Use DeclarationToExecBatches in Acero plan tests
vibhatha commented on issue #15284: URL: https://github.com/apache/arrow/issues/15284#issuecomment-1376559887 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] cgostic commented on issue #15056: [R][R Shiny] Deployed/published Rshiny app using arrow to query an AWS-hosted dataset causes intermittent "stack imbalance", "segfault", "memory not
cgostic commented on issue #15056: URL: https://github.com/apache/arrow/issues/15056#issuecomment-1376547224 Thanks so much for checking in! I do see that error intermittently as well, though I more frequently get the one listed in the logs. From my observations, the SIGPIPE issue occurs on the first use of the app after publishing, but it usually shifts to the "stack imbalance" error with continued use.

(1) I can definitely check writing the partitioned data as .csvs and get back to you! However, we're really hoping to take advantage of the query efficiency offered by Feather files.

(2) I have not tried this on a Linux machine. Our RStudio Connect is hosted on an EC2 instance, and shinyapps is also on AWS. These are the only two platforms I have access to, unfortunately.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] dongjoon-hyun commented on pull request #15267: GH-15265: [Java] Publish SBOM artifacts
dongjoon-hyun commented on PR #15267: URL: https://github.com/apache/arrow/pull/15267#issuecomment-1376546296 Thank you, @assignUser. The `.json` and `.xml` patterns are added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] comphead commented on issue #4644: Comparing a `Timestamp` to a `Date32` fails
comphead commented on issue #4644: URL: https://github.com/apache/arrow-datafusion/issues/4644#issuecomment-1376537326 I found something weird. First, the logical plan fails silently on the cast. Second, the arrow cast does not behave consistently on the same inputs. If I run the first test:

```
[datafusion/expr/src/expr_schema.rs:268] _type = Date32
[datafusion/expr/src/expr_schema.rs:269] _to_type = Timestamp(
    Nanosecond,
    None,
)
[datafusion/expr/src/expr_schema.rs:270] can_cast_types(_type, cast_to_type) = true
```

but if I run the second test:

```
[datafusion/expr/src/expr_schema.rs:268] _type = Date32
[datafusion/expr/src/expr_schema.rs:269] _to_type = Timestamp(
    Nanosecond,
    None,
)
[datafusion/expr/src/expr_schema.rs:270] can_cast_types(_type, cast_to_type) = false
```

I'm looking into what in arrow-rs can cause such non-deterministic behavior. @liukun4515 @crepererum @alamb

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
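For readers following along, the disagreement above can be probed directly against the cast support table. A minimal, self-contained sketch, assuming the arrow-cast crate from the arrow-rs crate split (in the monolithic arrow crate the same function is exposed as `arrow::compute::can_cast_types`); the repeated call only emphasizes that the answer should not vary between invocations:

```rust
use arrow_cast::cast::can_cast_types;
use arrow_schema::{DataType, TimeUnit};

fn main() {
    let from = DataType::Date32;
    let to = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // A consistent cast-support table must return the same answer on every
    // call; differing results between two tests point at test-order or
    // feature-flag effects rather than at the inputs themselves.
    println!("Date32 -> Timestamp(ns): {}", can_cast_types(&from, &to));
    println!("Date32 -> Timestamp(ns): {}", can_cast_types(&from, &to));
}
```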
[GitHub] [arrow] boshek commented on issue #15056: [R][R Shiny] Deployed/published Rshiny app using arrow to query an AWS-hosted dataset causes intermittent "stack imbalance", "segfault", "memory not
boshek commented on issue #15056: URL: https://github.com/apache/arrow/issues/15056#issuecomment-1376532373 I can reproduce this issue in RStudio Connect. This is the error that I see (which is different from what you are reporting):

log
```
2023/01/09 23:47:44.920714517 [1] 3785
2023/01/09 23:47:45.194163992 [1] 3785
2023/01/09 23:47:45.197808737 [1] "pinged"
2023/01/09 23:48:15.466341119 [1] 3785
2023/01/09 23:48:45.750669396 [1] 3785
2023/01/09 23:49:16.046789291 [1] 3785
2023/01/09 23:49:46.366873116 [1] 3785
2023/01/09 23:50:16.639291193 [1] 3785
2023/01/09 23:50:46.902676880 [1] 3785
2023/01/09 23:51:17.211774091 [1] 3785
2023/01/09 23:51:20.593104281 Warning: Error in collect: ignoring SIGPIPE signal
2023/01/09 23:51:20.606683408 212:
2023/01/09 23:51:20.606694399 211: signalCondition
2023/01/09 23:51:20.606724639 210: signal_abort
2023/01/09 23:51:20.606743769 209: abort
2023/01/09 23:51:20.606782270 208: augment_io_error_msg
2023/01/09 23:51:20.606783790 207: value[[3L]]
2023/01/09 23:51:20.606797460 206: tryCatchOne
2023/01/09 23:51:20.606798670 205: tryCatchList
2023/01/09 23:51:20.606891601 204: tryCatch
2023/01/09 23:51:20.606893811 203: collect.arrow_dplyr_query
2023/01/09 23:51:20.606907001 202: collect
2023/01/09 23:51:20.606907791 201: %>%
2023/01/09 23:51:20.606914991 200: [/opt/rstudio-connect/mnt/app/app.R#50]
2023/01/09 23:51:20.606927181 198: .func
2023/01/09 23:51:20.606945612 195: contextFunc
2023/01/09 23:51:20.606947502 194: env$runWith
2023/01/09 23:51:20.606972372 187: ctx$run
2023/01/09 23:51:20.606973622 186: self$.updateValue
2023/01/09 23:51:20.606983322 184: plot_data
2023/01/09 23:51:20.606984412 182: renderPlot [/opt/rstudio-connect/mnt/app/app.R#62]
2023/01/09 23:51:20.606993812 180: func
2023/01/09 23:51:20.606995122 140: drawPlot
2023/01/09 23:51:20.607005132 126:
2023/01/09 23:51:20.607006152 110: drawReactive
2023/01/09 23:51:20.607019292 97: renderFunc
2023/01/09 23:51:20.607022683 96: output$TS
2023/01/09 23:51:20.607032143 15:
2023/01/09 23:51:20.607033293 13: fn
2023/01/09 23:51:20.607046283 8: retry
2023/01/09 23:51:20.607047413 7: connect$retryingStartServer
2023/01/09 23:51:20.607056303 6: eval
2023/01/09 23:51:20.607058203 5: eval
2023/01/09 23:51:20.607072323 4: eval
2023/01/09 23:51:20.607073053 3: eval
2023/01/09 23:51:20.607080343 2: eval.parent
2023/01/09 23:51:20.607081113 1: local
```

I tried a few troubleshooting steps, including:
- Moving the `collect` call around to see where exactly the issue was. The only thing that "worked" was to pull all the data in right away, which obviously isn't helpful.
- Trying both parquet and CSV.
- Using CSV seems to resolve this issue, though that does not solve the actual thing you wanted.

Two additional questions:
- Does writing the partitioned data as CSV and using them as CSVs work?
- Have you tried this on a Linux machine? I don't have access to a Linux machine where I can deploy this. It is very strange that this works locally but not in RStudio Connect, so trying this on Linux would be a useful point of information.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] askoa commented on a diff in pull request #3501: Add a function to get memory size of array slice
askoa commented on code in PR #3501: URL: https://github.com/apache/arrow-rs/pull/3501#discussion_r1065214401

## arrow-data/src/data.rs: ##

```
@@ -463,6 +463,59 @@ impl ArrayData {
         size
     }

+    pub fn get_slice_memory_size(&self) -> Result<usize, ArrowError> {
+        let mut result: usize = 0;
+        let layout = layout(&self.data_type);
+
+        for spec in layout.buffers.iter() {
+            match spec {
+                BufferSpec::FixedWidth { byte_width } => {
+                    let buffer_size = self
+                        .len
+                        .checked_mul(*byte_width)
+                        .expect("integer overflow computing buffer size");
```

Review Comment: Thanks. Done!

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] EpsilonPrime commented on pull request #15283: GH-15282: [C++,CI] Fix continuous integration issue with clang-format v14.
EpsilonPrime commented on PR #15283: URL: https://github.com/apache/arrow/pull/15283#issuecomment-1376509214 An ideal solution would add the missing package to the dependencies we install, but I haven't found that yet. This change seems like it would work, but for some reason changing the Travis CI configuration isn't enough to get the CI to run and verify the change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #15283: GH-15282: [C++,CI] Fix continuous integration issue with clang-format v14.
github-actions[bot] commented on PR #15283: URL: https://github.com/apache/arrow/pull/15283#issuecomment-1376507396 * Closes: #15282 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #15283: GH-15282: [C++,CI] Fix continuous integration issue with clang-format v14.
github-actions[bot] commented on PR #15283: URL: https://github.com/apache/arrow/pull/15283#issuecomment-1376507420 :warning: GitHub issue #15282 **has been automatically assigned in GitHub** to PR creator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] EpsilonPrime opened a new pull request, #15283: GH-15282: [C++,CI] Fix continuous integration issue with clang-format v14.
EpsilonPrime opened a new pull request, #15283: URL: https://github.com/apache/arrow/pull/15283 …ready on Docker for version 14. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] felipecrv commented on a diff in pull request #15083: ARROW-18403: [C++] Add support for nullary and n-ary aggregate functions
felipecrv commented on code in PR #15083: URL: https://github.com/apache/arrow/pull/15083#discussion_r1064818385

## cpp/src/arrow/compute/exec/aggregate_node.cc: ##

```
@@ -46,14 +46,26 @@ namespace {

 void AggregatesToString(std::stringstream* ss, const Schema& input_schema,
                         const std::vector<Aggregate>& aggs,
-                        const std::vector<int>& target_field_ids, int indent = 0) {
+                        const std::vector<std::vector<int>>& target_fieldsets,
+                        int indent = 0) {
   *ss << "aggregates=[" << std::endl;
   for (size_t i = 0; i < aggs.size(); i++) {
     for (int j = 0; j < indent; ++j) *ss << " ";
-    *ss << '\t' << aggs[i].function << '('
-        << input_schema.field(target_field_ids[i])->name();
+    *ss << '\t' << aggs[i].function << '(';
+    const auto& target = target_fieldsets[i];
+    if (target.size() == 0) {
+      *ss << "*";
+    } else {
+      *ss << input_schema.field(target[0])->name();
+      for (size_t k = 1; k < target.size(); k++) {
+        *ss << ", " << input_schema.field(target[k])->name();
+      }
+    }
     if (aggs[i].options) {
-      *ss << ", " << aggs[i].options->ToString();
+      auto* options_type = aggs[i].options->options_type();
+      if (options_type->num_properties() > 0) {
```

Review Comment: This function is sometimes (maybe always) called after a NULL `Aggregate::options` gets replaced by a pointer to an instance of the default options for that aggregate function. That happens here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/aggregate_node.cc#L113 I ended up adding the first function that needs no options at all, but since providing default options is required, I created an options type with zero properties. That works as a placeholder for the future, when this function might gain properties. I could drop all this and make it render the string `count_all(*, {})` instead.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #15267: GH-15265: [Java] Publish SBOM artifacts
github-actions[bot] commented on PR #15267: URL: https://github.com/apache/arrow/pull/15267#issuecomment-1376494798

Revision: de44a621698683f1891e0e1886a23569d2998eb3

Submitted crossbow builds: [ursacomputing/crossbow @ actions-588eb7d929](https://github.com/ursacomputing/crossbow/branches/all?query=actions-588eb7d929)

|Task|Status|
|----|------|
|java-jars|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-588eb7d929-github-java-jars)](https://github.com/ursacomputing/crossbow/actions/runs/3878713201/jobs/6615146612)|

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] assignUser commented on pull request #15267: GH-15265: [Java] Publish SBOM artifacts
assignUser commented on PR #15267: URL: https://github.com/apache/arrow/pull/15267#issuecomment-1376494116 To have them uploaded to nightlies.apache.org the nightly java job needs to be updated to: https://github.com/apache/arrow/blob/master/.github/workflows/java_nightly.yml#L110 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-adbc] zeroshade commented on pull request #322: feat(go/adbc/driver/flightsql): Native Golang ADBC Flight SQL driver
zeroshade commented on PR #322: URL: https://github.com/apache/arrow-adbc/pull/322#issuecomment-1376493589 Updated with further implementations; all that is left is to write tests. I should have tests written out and the PR updated tomorrow. Then I'll mark this as ready for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] assignUser commented on pull request #15267: GH-15265: [Java] Publish SBOM artifacts
assignUser commented on PR #15267: URL: https://github.com/apache/arrow/pull/15267#issuecomment-1376492670 @github-actions crossbow submit java-jars -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] EpsilonPrime commented on pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
EpsilonPrime commented on PR #15201: URL: https://github.com/apache/arrow/pull/15201#issuecomment-1376486116 I've updated the arithmetic benchmarks to also have labels. Thanks for the suggestion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] viirya commented on a diff in pull request #3501: Add a function to get memory size of array slice
viirya commented on code in PR #3501: URL: https://github.com/apache/arrow-rs/pull/3501#discussion_r1065198385

## arrow-data/src/data.rs: ##

```
@@ -463,6 +463,59 @@ impl ArrayData {
         size
     }

+    pub fn get_slice_memory_size(&self) -> Result<usize, ArrowError> {
```

Review Comment: For a public API, it is better to add a doc comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] viirya commented on a diff in pull request #3501: Add a function to get memory size of array slice
viirya commented on code in PR #3501: URL: https://github.com/apache/arrow-rs/pull/3501#discussion_r1065197952

## arrow-data/src/data.rs: ##

```
@@ -463,6 +463,59 @@ impl ArrayData {
         size
     }

+    pub fn get_slice_memory_size(&self) -> Result<usize, ArrowError> {
+        let mut result: usize = 0;
+        let layout = layout(&self.data_type);
+
+        for spec in layout.buffers.iter() {
+            match spec {
+                BufferSpec::FixedWidth { byte_width } => {
+                    let buffer_size = self
+                        .len
+                        .checked_mul(*byte_width)
+                        .expect("integer overflow computing buffer size");
```

Review Comment: Instead of panicking, it would be better to return `Err`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
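The shape of the fix being requested above can be sketched in isolation. This is a minimal, self-contained sketch, not the PR's actual code: the helper name `slice_buffer_size` and the error message are illustrative. The idea is to replace the panicking `expect` with `checked_mul(...).ok_or_else(...)` so overflow surfaces as an `ArrowError`:

```rust
use arrow_schema::ArrowError;

/// Illustrative stand-in for the reviewed code path: computes
/// `len * byte_width`, surfacing overflow as an `Err` instead of panicking.
fn slice_buffer_size(len: usize, byte_width: usize) -> Result<usize, ArrowError> {
    len.checked_mul(byte_width).ok_or_else(|| {
        ArrowError::ComputeError(format!(
            "integer overflow computing buffer size: {len} * {byte_width}"
        ))
    })
}

fn main() {
    // Normal case succeeds; an overflowing case now returns an error.
    assert_eq!(slice_buffer_size(8, 4).unwrap(), 32);
    assert!(slice_buffer_size(usize::MAX, 2).is_err());
}
```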
[GitHub] [arrow] felipecrv commented on a diff in pull request #15083: ARROW-18403: [C++] Add support for nullary and n-ary aggregate functions
felipecrv commented on code in PR #15083: URL: https://github.com/apache/arrow/pull/15083#discussion_r1065195728

## cpp/src/arrow/compute/exec.cc: ##

```
@@ -163,6 +163,34 @@ Result<ExecBatch> ExecBatch::Make(std::vector<Datum> values) {
       continue;
     }

+    if (length != value.length()) {
+      // all the arrays should have the same length
+      return -1;
+    }
```

Review Comment: I added a commit that adds a reusable function for this.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] askoa opened a new pull request, #3501: Add a function to get memory size of array slice
askoa opened a new pull request, #3501: URL: https://github.com/apache/arrow-rs/pull/3501

# Which issue does this PR close?

Closes #3407

# Rationale for this change

See the issue description.

# What changes are included in this PR?

A function to calculate the buffer size of an array slice.

# Are there any user-facing changes?

Users get a new API that calculates the buffer size of an array slice.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
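To make the intent of the new API concrete, here is a hedged usage sketch. It assumes the method lands on `ArrayData` with the signature discussed in the review thread above (`get_slice_memory_size(&self) -> Result<usize, ArrowError>`); the exact byte counts depend on the array's buffer layout:

```rust
use arrow::array::{Array, Int32Array};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let array = Int32Array::from((0..1024).collect::<Vec<i32>>());
    let slice = array.slice(100, 32);

    // Existing API: counts the full parent buffers backing the slice.
    let full = slice.data().get_buffer_memory_size();
    // API proposed in this PR: counts only the bytes the slice covers.
    let sliced = slice.data().get_slice_memory_size()?;

    assert!(sliced < full);
    println!("parent buffers: {full} B, slice only: {sliced} B");
    Ok(())
}
```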
[GitHub] [arrow] westonpace commented on a diff in pull request #13669: ARROW-16389: [C++][Acero] Add spilling for hash join
westonpace commented on code in PR #13669: URL: https://github.com/apache/arrow/pull/13669#discussion_r1064813860

## cpp/src/arrow/compute/exec/accumulation_queue.cc: ##

```
@@ -39,20 +37,180 @@ void AccumulationQueue::Concatenate(AccumulationQueue&& that) {
   this->batches_.reserve(this->batches_.size() + that.batches_.size());
   std::move(that.batches_.begin(), that.batches_.end(),
             std::back_inserter(this->batches_));
-  this->row_count_ += that.row_count_;
   that.Clear();
 }

 void AccumulationQueue::InsertBatch(ExecBatch batch) {
-  row_count_ += batch.length;
   batches_.emplace_back(std::move(batch));
 }

+void AccumulationQueue::SetBatch(size_t idx, ExecBatch batch) {
+  ARROW_DCHECK(idx < batches_.size());
+  batches_[idx] = std::move(batch);
+}
+
+size_t AccumulationQueue::CalculateRowCount() const {
+  size_t count = 0;
+  for (const ExecBatch& b : batches_) count += static_cast<size_t>(b.length);
+  return count;
+}
+
 void AccumulationQueue::Clear() {
-  row_count_ = 0;
   batches_.clear();
 }
-ExecBatch& AccumulationQueue::operator[](size_t i) { return batches_[i]; }
-}  // namespace util
+
+Status SpillingAccumulationQueue::Init(QueryContext* ctx) {
+  ctx_ = ctx;
+  partition_locks_.Init(ctx_->max_concurrency(), kNumPartitions);
+  for (size_t ipart = 0; ipart < kNumPartitions; ipart++) {
+    task_group_read_[ipart] = ctx_->RegisterTaskGroup(
+        [this, ipart](size_t thread_index, int64_t batch_index) {
+          return read_back_fn_[ipart](thread_index, static_cast<size_t>(batch_index),
+                                      std::move(queues_[ipart][batch_index]));
+        },
+        [this, ipart](size_t thread_index) { return on_finished_[ipart](thread_index); });
+  }
+  return Status::OK();
+}
+
+Status SpillingAccumulationQueue::InsertBatch(size_t thread_index, ExecBatch batch) {
+  Datum& hash_datum = batch.values.back();
+  const uint64_t* hashes =
+      reinterpret_cast<const uint64_t*>(hash_datum.array()->buffers[1]->data());
+  // `permutation` stores the indices of rows in the input batch sorted by partition.
+  std::vector<uint16_t> permutation(batch.length);
+  uint16_t part_starts[kNumPartitions + 1];
+  PartitionSort::Eval(
+      batch.length, kNumPartitions, part_starts,
+      /*partition_id=*/[&](int64_t i) { return partition_id(hashes[i]); },
+      /*output_fn=*/
+      [&](int64_t input_pos, int64_t output_pos) {
+        permutation[output_pos] = static_cast<uint16_t>(input_pos);
+      });
+
+  int unprocessed_partition_ids[kNumPartitions];
+  RETURN_NOT_OK(partition_locks_.ForEachPartition(
+      thread_index, unprocessed_partition_ids,
+      /*is_prtn_empty=*/
+      [&](int part_id) { return part_starts[part_id + 1] == part_starts[part_id]; },
+      /*partition=*/
```

Review Comment:
```suggestion
      /*process_prtn_fn=*/
```

## cpp/src/arrow/compute/exec/accumulation_queue.h: ##

```
@@ -42,16 +45,88 @@ class AccumulationQueue {
   void Concatenate(AccumulationQueue&& that);
   void InsertBatch(ExecBatch batch);
-  int64_t row_count() { return row_count_; }
-  size_t batch_count() { return batches_.size(); }
+  void SetBatch(size_t idx, ExecBatch batch);
+  size_t batch_count() const { return batches_.size(); }
   bool empty() const { return batches_.empty(); }
+  size_t CalculateRowCount() const;
+
+  // Resizes the accumulation queue to contain size batches. The
+  // new batches will be empty and have length 0, but they will be
+  // usable (useful for concurrent modification of the AccumulationQueue
+  // of separate elements).
+  void Resize(size_t size) { batches_.resize(size); }
   void Clear();
-  ExecBatch& operator[](size_t i);
+  ExecBatch& operator[](size_t i) { return batches_[i]; };
+  const ExecBatch& operator[](size_t i) const { return batches_[i]; };

  private:
-  int64_t row_count_;
   std::vector<ExecBatch> batches_;
 };
-}  // namespace util
+
+class SpillingAccumulationQueue {
```

Review Comment:
```suggestion
/// Accumulates batches in a queue that can be spilled to disk if needed
///
/// Each batch is partitioned by the lower bits of the hash column (which must be present) and
/// rows are initially accumulated in batch builders (one per partition). As a batch builder fills
/// up the completed batch is put into an in-memory accumulation queue (per partition).
///
/// When memory pressure is encountered the spilling queue's "spill cursor" can be advanced. This
/// will cause a partition to be spilled to disk. Any future data arriving for that partition will go immediately
/// to disk (after accumulating a full batch in the batch builder).
///
/// Later, data is retrieved one partition at a time. Partitions that are in-memory will be delivered immediately
/// in new thread tasks. Partitions that are on disk will be read from disk and delivered as they arrive.
class SpillingAccumulationQueue {
```
Other things that might be nice to mention: * This is a "write
[GitHub] [arrow] wjones127 commented on a diff in pull request #15201: GH-15200: [C++] Created benchmarks for round kernels.
wjones127 commented on code in PR #15201: URL: https://github.com/apache/arrow/pull/15201#discussion_r1065174034

## cpp/src/arrow/compute/kernels/scalar_round_benchmark.cc: ##

```
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include <vector>
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/benchmark_util.h"
+
+namespace arrow {
+namespace compute {
+
+// Use a fixed hash to ensure consistent results from run to run.
+constexpr auto kSeed = 0x94378165;
+
+template <typename CType, RoundMode Mode>
+static void RoundArrayBenchmark(benchmark::State& state, const std::string& func_name) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(CType);
+  auto rand = random::RandomArrayGenerator(kSeed);
+
+  // Choose values so as to avoid overflow on all ops and types.
+  auto min = static_cast<CType>(6);
+  auto max = static_cast<CType>(min + 15);
+  auto val = std::static_pointer_cast<NumericArray<typename CTypeTraits<CType>::ArrowType>>(
+      rand.Numeric<typename CTypeTraits<CType>::ArrowType>(array_size, min, max,
+                                                           args.null_proportion));
+  RoundOptions options;
+  options.round_mode = static_cast<RoundMode>(Mode);
+
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {val}, &options));
+  }
+  state.SetItemsProcessed(state.iterations() * array_size);
+}
+
+void SetRoundArgs(benchmark::internal::Benchmark* bench) {
+  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
```

Review Comment: Would be nice to label these args:
```suggestion
  bench->ArgNames({"size", "inverse_null_proportion"});
  for (const auto inverse_null_proportion : std::vector<int64_t>({100, 0})) {
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones127 commented on issue #15280: [C++] Move Acero out of libarrow
wjones127 commented on issue #15280: URL: https://github.com/apache/arrow/issues/15280#issuecomment-1376425281 `libarrow_dataset` depends on Acero and is tightly integrated with it. Should it be moved into `libacero`, or renamed `libacero_dataset`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones127 commented on a diff in pull request #15244: GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64
wjones127 commented on code in PR #15244: URL: https://github.com/apache/arrow/pull/15244#discussion_r1065167634

## cpp/src/parquet/properties.h: ##

```
@@ -199,7 +199,8 @@ class PARQUET_EXPORT WriterProperties {
         pagesize_(kDefaultDataPageSize),
         version_(ParquetVersion::PARQUET_2_4),
         data_page_version_(ParquetDataPageVersion::V1),
-        created_by_(DEFAULT_CREATED_BY) {}
+        created_by_(DEFAULT_CREATED_BY),
+        integer_annotate_decimal_(false) {}
```

Review Comment: I understand the desire for stability, but I would like to have a pathway for releasing improved defaults. It's unfortunate that we always do a major release while still trying to respect stability, as there's never a clearly good time to make breaking changes like this.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones127 commented on a diff in pull request #15244: GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64
wjones127 commented on code in PR #15244: URL: https://github.com/apache/arrow/pull/15244#discussion_r1065082767

## cpp/src/parquet/arrow/arrow_reader_writer_test.cc: ##

```
@@ -4727,5 +4726,74 @@ std::vector GenerateMapFilteredTestCases() {
 INSTANTIATE_TEST_SUITE_P(MapFilteredReads, TestNestedSchemaFilteredReader,
                          ::testing::ValuesIn(GenerateMapFilteredTestCases()));

+template <typename TestType>
+class TestIntegerAnnotateDecimalTypeParquetIO : public TestParquetIO<TestType> {
+ public:
+  template <typename ArrayType>
+  void WriteColumn(const std::shared_ptr<ArrayType>& values) {
+    auto arrow_schema = ::arrow::schema({::arrow::field("a", values->type())});
+
+    parquet::WriterProperties::Builder builder;
+    // Enforce integer type to annotate decimal type
+    auto writer_properties = builder.enable_integer_annotate_decimal()->build();
+    std::shared_ptr<SchemaDescriptor> parquet_schema;
+    ASSERT_OK_NO_THROW(ToParquetSchema(arrow_schema.get(), *writer_properties,
+                                       *default_arrow_writer_properties(),
+                                       &parquet_schema));
+
+    this->sink_ = CreateOutputStream();
+    auto schema_node = std::static_pointer_cast<GroupNode>(parquet_schema->schema_root());
+
+    std::unique_ptr<FileWriter> writer;
+    ASSERT_OK_NO_THROW(FileWriter::Make(
+        ::arrow::default_memory_pool(),
+        ParquetFileWriter::Open(this->sink_, schema_node, writer_properties),
+        arrow_schema, default_arrow_writer_properties(), &writer));
+    ASSERT_OK_NO_THROW(writer->NewRowGroup(values->length()));
+    ASSERT_OK_NO_THROW(writer->WriteColumnChunk(*values));
+    ASSERT_OK_NO_THROW(writer->Close());
+  }
+
+  void ReadAndCheckSingleDecimalColumnFile(const Array& values) {
+    std::shared_ptr<Array> out;
+    std::unique_ptr<FileReader> reader;
+    this->ReaderFromSink(&reader);
+    this->ReadSingleColumnFile(std::move(reader), &out);
+
+    // Reader always reads values as DECIMAL128 type
+    ASSERT_EQ(out->type()->id(), ::arrow::Type::DECIMAL128);
+
+    if (values.type()->id() == ::arrow::Type::DECIMAL128) {
+      AssertArraysEqual(values, *out);
+    } else {
+      auto& expected_values = dynamic_cast<const ::arrow::Decimal256Array&>(values);
+      auto& read_values = dynamic_cast<const ::arrow::Decimal128Array&>(*out);
+      ASSERT_EQ(expected_values.null_count(), read_values.null_count());
+      ASSERT_EQ(0, read_values.null_count());
+      ASSERT_EQ(expected_values.length(), read_values.length());
+      for (int64_t i = 0; i < expected_values.length(); ++i) {
+        ASSERT_EQ(::arrow::Decimal256(expected_values.Value(i)).ToString(0),
+                  ::arrow::Decimal128(read_values.Value(i)).ToString(0));
+      }
+    }
+  }
+};
+
+typedef ::testing::Types<
+    DecimalWithPrecisionAndScale<1>, DecimalWithPrecisionAndScale<5>,
+    DecimalWithPrecisionAndScale<10>, DecimalWithPrecisionAndScale<18>,
+    Decimal256WithPrecisionAndScale<1>, Decimal256WithPrecisionAndScale<5>,
+    Decimal256WithPrecisionAndScale<10>, Decimal256WithPrecisionAndScale<18>>
+    DecimalTestTypes;
+
+TYPED_TEST_SUITE(TestIntegerAnnotateDecimalTypeParquetIO, DecimalTestTypes);
+
+TYPED_TEST(TestIntegerAnnotateDecimalTypeParquetIO, SingleDecimalColumn) {
```

Review Comment: Could we also have a test with null values?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] ursabot commented on pull request #4836: Remove tests from sql_integration that were ported to sqllogictest
ursabot commented on PR #4836: URL: https://github.com/apache/arrow-datafusion/pull/4836#issuecomment-1376413190 Benchmark runs are scheduled for baseline = c4f4dffec79d3ecc7e177a3d07fcf60312d300db and contender = 42f7dd50913c4f6b6d830b88d55dfd4a8d16d44f. 42f7dd50913c4f6b6d830b88d55dfd4a8d16d44f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9009bae4a21a47829264f427c6230153...faa29f51a2314df2957bc566903cb458/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b947a659471042a78e8c8d92cee5024b...a3b2097585e04c51bc17dfaba323e635/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/c4f2e2bbe9654e26a93359c9ad85e6e4...76c4557b831b456184c5e76d92fb0cd7/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6ee2e6779d214ca0bd019da00a596d47...7b369e15214b45baa689fef51fe6e0d3/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] ursabot commented on pull request #4849: Fix push_down_projection through a distinct
ursabot commented on PR #4849: URL: https://github.com/apache/arrow-datafusion/pull/4849#issuecomment-1376413168 Benchmark runs are scheduled for baseline = ceff6cb44ad621f89c9c4e9c0bd34cb204246910 and contender = c4f4dffec79d3ecc7e177a3d07fcf60312d300db. c4f4dffec79d3ecc7e177a3d07fcf60312d300db is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c249b4ad51a6465abd46020854cd2e34...9009bae4a21a47829264f427c6230153/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ba12a8737fb488698d66ee4f4d171e2...b947a659471042a78e8c8d92cee5024b/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/e8e8d1f4c4604eae8057adb1be8f20fa...c4f2e2bbe9654e26a93359c9ad85e6e4/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/5d38382ce34649a9bf35c8dd1f16033d...6ee2e6779d214ca0bd019da00a596d47/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #15255: GH-15254: [GLib] garrow_execute_plain_wait() checks the finished status
ursabot commented on PR #15255: URL: https://github.com/apache/arrow/pull/15255#issuecomment-1376403051 ['Python', 'R'] benchmarks have high level of regressions. [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b54f1f820ca14747952ca20d32cf645d...eddc366f198743749dd78cfbc7d7d1ae/) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] ursabot commented on pull request #4839: Orthogonalize distribution and sort enforcement rules into `EnforceDistribution` and `EnforceSorting`
ursabot commented on PR #4839: URL: https://github.com/apache/arrow-datafusion/pull/4839#issuecomment-1376402673 Benchmark runs are scheduled for baseline = c5e2594e99b01c12d4f6903cb998a62a5479455c and contender = ceff6cb44ad621f89c9c4e9c0bd34cb204246910. ceff6cb44ad621f89c9c4e9c0bd34cb204246910 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ff07244654914621b553179cf4dc6e08...c249b4ad51a6465abd46020854cd2e34/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5008572ef5df4997b4eac505ba058bc5...9ba12a8737fb488698d66ee4f4d171e2/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/404617774739495385493fef750351b6...e8e8d1f4c4604eae8057adb1be8f20fa/) [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2a988505448f44f0bcd851583feafa20...5d38382ce34649a9bf35c8dd1f16033d/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #15255: GH-15254: [GLib] garrow_execute_plain_wait() checks the finished status
ursabot commented on PR #15255: URL: https://github.com/apache/arrow/pull/15255#issuecomment-1376402654 Benchmark runs are scheduled for baseline = 4d31b1ef70be330356ed9119f63931f8fd90e6d1 and contender = 11d286eafb72b630baf897e619c84ecdfc6b723f. 11d286eafb72b630baf897e619c84ecdfc6b723f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f31d7340d4c47ab9fdb35a75dad1063...452345e6de07436e8b72b4bbb2eab4e2/) [Failed :arrow_down:0.23% :arrow_up:0.1%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/00ad91f5d080474c88331d3823083a33...eddda79cf6f14eff87ad495619c1ced5/) [Finished :arrow_down:0.26% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b54f1f820ca14747952ca20d32cf645d...eddc366f198743749dd78cfbc7d7d1ae/) [Finished :arrow_down:0.24% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ca66e67dd0a844578a789d0cebbb0e44...cf52e7a7a2744c8d8201e5b095e7d558/) Buildkite builds: [Finished] [`11d286ea` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2136) [Failed] [`11d286ea` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2159) [Finished] [`11d286ea` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2131) [Finished] [`11d286ea` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2151) [Finished] [`4d31b1ef` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2135) [Failed] [`4d31b1ef` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2158) [Finished] [`4d31b1ef` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2130) [Finished] [`4d31b1ef` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2150) Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #4836: Remove tests from sql_integration that were ported to sqllogictest
alamb merged PR #4836: URL: https://github.com/apache/arrow-datafusion/pull/4836 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #4498: Remove tests from `sql_integration` that were ported to `sqllogictest`
alamb closed issue #4498: Remove tests from `sql_integration` that were ported to `sqllogictest` URL: https://github.com/apache/arrow-datafusion/issues/4498 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #4667: `SELECT ... FROM (tbl1 UNION tbl2)` wrongly works like `SELECT DISTINCT ... FROM (tbl1 UNION tbl2)`
alamb closed issue #4667: `SELECT ... FROM (tbl1 UNION tbl2)` wrongly works like `SELECT DISTINCT ... FROM (tbl1 UNION tbl2)` URL: https://github.com/apache/arrow-datafusion/issues/4667 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #4849: Fix push_down_projection through a distinct
alamb merged PR #4849: URL: https://github.com/apache/arrow-datafusion/pull/4849 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4862: Rewrite coerce_plan_expr_for_schema to fix union type coercion
alamb commented on PR #4862: URL: https://github.com/apache/arrow-datafusion/pull/4862#issuecomment-1376400480 Thanks @ygf11 -- I hope to review this PR more carefully tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4839: Orthogonalize distribution and sort enforcement rules into `EnforceDistribution` and `EnforceSorting`
alamb commented on PR #4839: URL: https://github.com/apache/arrow-datafusion/pull/4839#issuecomment-1376398444 Thanks everyone -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #4839: Orthogonalize distribution and sort enforcement rules into `EnforceDistribution` and `EnforceSorting`
alamb merged PR #4839: URL: https://github.com/apache/arrow-datafusion/pull/4839 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4856: Remove unused ExecutionPlan::relies_input_order (has been replaced with `required_input_ordering`)
alamb commented on PR #4856: URL: https://github.com/apache/arrow-datafusion/pull/4856#issuecomment-1376397449

> LGTM. I remember originally there was logic inside the Repartition rule to check `relies_on_input_order()`; it seems this logic was removed recently.

Yes, I believe it was removed in https://github.com/apache/arrow-datafusion/pull/4714/files#diff-4f2e98e6466bc2e0f66c0fdb844f46ef25c6b5b559f9c51610dd9273da2b2ec1L859 from @yahoNanJing

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
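For context, the design point behind replacing `relies_on_input_order` with `required_input_ordering` can be paraphrased in a hypothetical, simplified sketch. The trait and the `SortExpr` type below are stand-ins, not DataFusion's actual definitions; the point is that a per-child `Option` of ordering requirements carries more information than a single boolean:

```rust
/// Stand-in for DataFusion's physical sort expression type (hypothetical).
#[derive(Clone)]
pub struct SortExpr {
    pub column: String,
    pub descending: bool,
}

/// Paraphrased shape of `required_input_ordering`: one entry per child
/// input, where `None` means "any order is acceptable". Enforcement rules
/// can then insert sorts only where a requirement is not already met,
/// instead of reacting to a coarse `relies_on_input_order()` flag.
pub trait RequiresOrdering {
    fn required_input_ordering(&self) -> Vec<Option<Vec<SortExpr>>>;
}

struct SortMergeJoin;

impl RequiresOrdering for SortMergeJoin {
    fn required_input_ordering(&self) -> Vec<Option<Vec<SortExpr>>> {
        // Both children must arrive sorted on the join key.
        let key = vec![SortExpr { column: "a".into(), descending: false }];
        vec![Some(key.clone()), Some(key)]
    }
}

fn main() {
    // One ordering requirement per child input.
    assert_eq!(SortMergeJoin.required_input_ordering().len(), 2);
}
```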
[GitHub] [arrow] lafiona commented on a diff in pull request #14533: ARROW-17777: [Dev] Update the pull request merge script to work with master or main
lafiona commented on code in PR #14533: URL: https://github.com/apache/arrow/pull/14533#discussion_r1065152006

## dev/merge_arrow_pr.py: ##

```
@@ -110,10 +111,32 @@ def strip_ci_directives(commit_message):
     return _REGEX_CI_DIRECTIVE.sub('', commit_message)

+def git_default_branch_name():
+    default_branch_name = os.getenv("MERGE_SCRIPT_DEFAULT_BRANCH_NAME")
+
+    if default_branch_name is None:
+        try:
+            default_reference = run_cmd(
+                "git rev-parse --abbrev-ref origin/HEAD")
+            default_branch_name = default_reference.lstrip("origin/")
```

Review Comment: Thank you for your suggestion @kou. There is a newline character, so I've added an `rstrip()` call to remove it.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] matthewwillian commented on issue #4850: Support `select .. from 'data.parquet'` files in SQL from any `SessionContext` (optionally)
matthewwillian commented on issue #4850: URL: https://github.com/apache/arrow-datafusion/issues/4850#issuecomment-1376393999 I'm happy to take on this issue if no one else is already working on it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #4842: Simplify the break rule of logical optimizer passes
alamb commented on PR #4842: URL: https://github.com/apache/arrow-datafusion/pull/4842#issuecomment-1376393249

> Why don't we return None if the rule can't optimize the plan in this scenario?

I don't think there is a theoretical reason we don't do this (and it would probably be more efficient), but I don't think it is how the code works today. I would definitely be supportive of updating the passes so they had this property, however.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
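The `None`-on-no-change convention being discussed would let the optimizer driver skip plans a rule could not improve. A minimal sketch with hypothetical types (`Plan` and `Rule` are stand-ins, not DataFusion's real traits):

```rust
struct Plan;

trait Rule {
    /// Returns `Some(new_plan)` if the rule made progress, or `None` if it
    /// could not optimize the plan at all.
    fn try_optimize(&self, plan: &Plan) -> Option<Plan>;
}

fn run_pass(rules: &[Box<dyn Rule>], mut plan: Plan) -> Plan {
    for rule in rules {
        match rule.try_optimize(&plan) {
            Some(new_plan) => plan = new_plan, // progress: keep the rewrite
            None => {} // no change: skip re-validating this plan entirely
        }
    }
    plan
}

fn main() {
    let rules: Vec<Box<dyn Rule>> = vec![];
    let _plan = run_pass(&rules, Plan);
}
```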
[GitHub] [arrow] save-buffer commented on a diff in pull request #15253: ARROW-17381: [C++] Cleanup error handling
save-buffer commented on code in PR #15253: URL: https://github.com/apache/arrow/pull/15253#discussion_r1065140241

## cpp/src/arrow/compute/exec/map_node.cc: ##

```
@@ -36,22 +36,16 @@ namespace compute {

 MapNode::MapNode(ExecPlan* plan, std::vector<ExecNode*> inputs,
```

Review Comment: Would it make sense, while we're at it, to remove MapNode?

## cpp/src/arrow/compute/exec/query_context.cc: ##

```
@@ -60,26 +60,24 @@ Result<Future<>> QueryContext::BeginExternalTask() {
   return Future<>{};
 }

-Status QueryContext::ScheduleTask(std::function<Status()> fn) {
+void QueryContext::ScheduleTask(std::function<Status()> fn) {
   ::arrow::internal::Executor* exec = executor();
   // Adds a task which submits fn to the executor and tracks its progress. If we're
   // already stopping then the task is ignored and fn is not executed.
   async_scheduler_->AddSimpleTask([exec, fn]() { return exec->Submit(std::move(fn)); });
-  return Status::OK();
 }

-Status QueryContext::ScheduleTask(std::function<Status(size_t)> fn) {
+void QueryContext::ScheduleTask(std::function<Status(size_t)> fn) {
   std::function<Status()> indexed_fn = [this, fn]() {
     size_t thread_index = GetThreadIndex();
     return fn(thread_index);
   };
   return ScheduleTask(std::move(indexed_fn));
```

Review Comment: Can probably remove the `return` from here.

## cpp/src/arrow/compute/exec/exec_plan.h: ##

```
@@ -121,18 +119,20 @@ class ARROW_EXPORT ExecNode {
   virtual const char* kind_name() const = 0;

-  // The number of inputs/outputs expected by this node
+  // The number of inputs expected by this node
   int num_inputs() const { return static_cast<int>(inputs_.size()); }
-  int num_outputs() const { return num_outputs_; }

   /// This node's predecessors in the exec plan
   const NodeVector& inputs() const { return inputs_; }

+  /// True if the plan has no output schema (is a sink)
+  bool is_sink() const { return !output_schema_; }
```

Review Comment: Why does a sink not have an output schema? Sinks still output batches; I think it would be more correct to define a node to be a sink if its `output_` is null.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] ursabot commented on pull request #3494: Fix: Added support to cast string without time
ursabot commented on PR #3494: URL: https://github.com/apache/arrow-rs/pull/3494#issuecomment-1376388781 Benchmark runs are scheduled for baseline = 592d7a3601b1b7876ab5753abde66113f1a9dc23 and contender = fb36dd980b398deabe5547af114982326b97e078. fb36dd980b398deabe5547af114982326b97e078 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b0233fb6237e4d088115a00737ea4e25...98fb0125f2fa4a90ac771613696efdd1/) [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/fc661025284e43cf8709646fd0dc4968...a5081adbdf8c49ce8b208de98f7890c2/) [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2dea8f6d176b44b5aa4c47d9e33c4147...5c2a7e0ad00d4a909c0763a85c65fc69/) [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f4255d24d18c4d409af58ec9f97663cc...9a9d04e866824bcd84d91688ccbb8a50/) Buildkite builds: Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a diff in pull request #3494: Fix: Added support to cast string without time
alamb commented on code in PR #3494: URL: https://github.com/apache/arrow-rs/pull/3494#discussion_r1065144942

## arrow-cast/src/parse.rs: ##

```
@@ -121,6 +122,14 @@ pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {
         return Ok(ts.timestamp_nanos());
     }

+    // without a timezone specifier as a local time, only date
+    // Example: 2020-09-08
+    if let Ok(dt) = NaiveDate::parse_from_str(s, "%Y-%m-%d") {
+        if let Some(ts) = dt.and_hms_opt(0, 0, 0) {
```

Review Comment:

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
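A usage sketch of the behavior this diff adds, assuming the function is called via arrow-cast's public parse API and that a bare date takes the same timezone-less local-time path as an explicit "T00:00:00" string (as the new branch suggests):

```rust
use arrow_cast::parse::string_to_timestamp_nanos;
use arrow_schema::ArrowError;

fn main() -> Result<(), ArrowError> {
    // With this change, a date-only string parses as local midnight...
    let date_only = string_to_timestamp_nanos("2020-09-08")?;
    // ...matching an explicit timezone-less midnight timestamp.
    let midnight = string_to_timestamp_nanos("2020-09-08T00:00:00")?;
    assert_eq!(date_only, midnight);
    Ok(())
}
```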
[GitHub] [arrow-rs] alamb closed issue #3492: Support casting strings like `'2001-01-01'` to timestamp
alamb closed issue #3492: Support casting strings like `'2001-01-01'` to timestamp URL: https://github.com/apache/arrow-rs/issues/3492 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #3494: Fix: Added support to cast string without time
alamb merged PR #3494: URL: https://github.com/apache/arrow-rs/pull/3494 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on issue #3407: ArrayData`get_slice_memory_size` or similar
alamb commented on issue #3407: URL: https://github.com/apache/arrow-rs/issues/3407#issuecomment-1376383261 Thank you @askoa -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #3452: Remove multiversion dependency
alamb commented on PR #3452: URL: https://github.com/apache/arrow-rs/pull/3452#issuecomment-1376374773

> It's OS specific (dlopen etc) and precludes users from making statically linked executables

I guess what I had in mind was to literally create builds of the influxdb_iox binary for each architecture (well, really probably "AVX512" and "SSE2", and maybe a third one in the middle). As you note, Rust doesn't have a stable ABI (and will therefore recompile almost everything from source each time anyway). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fatemehp commented on a diff in pull request #14603: PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback
fatemehp commented on code in PR #14603: URL: https://github.com/apache/arrow/pull/14603#discussion_r1065132773

## cpp/src/parquet/statistics.h: ##

@@ -216,6 +216,14 @@ class PARQUET_EXPORT Statistics {
       bool has_distinct_count, ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
 
+  // Helper function to convert EncodedStatistics to Statistics.
+  // EncodedStatistics does not contain the number of non-null values; it can
+  // be passed using the num_values parameter.
+  static std::shared_ptr<Statistics> Make(
+      const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics,
+      int64_t num_values = -1,

Review Comment: We decided to add a convenient way to convert an EncodedStatistics to Statistics. However, EncodedStatistics is missing the number of non-null values, which is required for making Statistics, so I added an optional argument for it. Please see here: https://github.com/apache/arrow/pull/14603#discussion_r1048484566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
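A minimal sketch of how this new overload might be called, assuming it lands as shown in the diff above; the wrapper function and its name are hypothetical, only the `Statistics::Make` signature comes from the change, and any trailing defaulted parameters are omitted because the quoted diff is truncated:

```cpp
#include <memory>

#include "parquet/schema.h"      // parquet::ColumnDescriptor
#include "parquet/statistics.h"  // parquet::Statistics, parquet::EncodedStatistics

// Hypothetical helper: decode the raw statistics carried in a page header into
// a typed Statistics object. EncodedStatistics does not record the non-null
// value count itself, so the caller supplies it; -1 (the default) means unknown.
std::shared_ptr<parquet::Statistics> DecodePageStats(
    const parquet::ColumnDescriptor* descr,
    const parquet::EncodedStatistics* encoded, int64_t num_non_null) {
  return parquet::Statistics::Make(descr, encoded, /*num_values=*/num_non_null);
}
```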
[GitHub] [arrow] fatemehp commented on a diff in pull request #14603: PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback
fatemehp commented on code in PR #14603: URL: https://github.com/apache/arrow/pull/14603#discussion_r1065131471

## cpp/src/parquet/file_deserialize_test.cc: ##

@@ -177,6 +201,217 @@ TEST_F(TestPageSerde, DataPageV1) {
   ASSERT_NO_FATAL_FAILURE(CheckDataPageHeader(data_page_header_, current_page.get()));
 }
 
+// Templated test class to test page filtering for both format::DataPageHeader
+// and format::DataPageHeaderV2.
+template <typename T>
+class PageFilterTest : public TestPageSerde {
+ public:
+  const int kNumPages = 10;
+  void WriteStream();
+
+ protected:
+  std::vector<T> data_page_headers_;
+  int total_rows_ = 0;
+};
+
+template <>
+void PageFilterTest<format::DataPageHeader>::WriteStream() {
+  for (int i = 0; i < kNumPages; ++i) {
+    // Vary the number of rows to produce different headers.
+    int32_t num_rows = i + 100;
+    total_rows_ += num_rows;
+    int data_size = i + 1024;
+    this->data_page_header_.__set_num_values(num_rows);
+    this->data_page_header_.statistics.__set_min_value("A");
+    this->data_page_header_.statistics.__set_max_value("Z");
+    this->data_page_header_.statistics.__set_null_count(0);
+    this->data_page_header_.__isset.statistics = true;
+    ASSERT_NO_FATAL_FAILURE(
+        this->WriteDataPageHeader(/*max_serialized_len=*/1024, data_size, data_size));
+    data_page_headers_.push_back(this->data_page_header_);
+    // Also write data, to make sure we skip the data correctly.
+    std::vector<uint8_t> faux_data(data_size);
+    ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  }
+  this->EndStream();
+}
+
+template <>
+void PageFilterTest<format::DataPageHeaderV2>::WriteStream() {
+  for (int i = 0; i < kNumPages; ++i) {
+    // Vary the number of rows to produce different headers.
+    int32_t num_rows = i + 100;
+    total_rows_ += num_rows;
+    int data_size = i + 1024;
+    this->data_page_header_v2_.__set_num_values(num_rows);
+    this->data_page_header_v2_.__set_num_rows(num_rows);
+    this->data_page_header_v2_.statistics.__set_min_value("A");
+    this->data_page_header_v2_.statistics.__set_max_value("Z");
+    this->data_page_header_v2_.statistics.__set_null_count(0);
+    this->data_page_header_v2_.__isset.statistics = true;
+    ASSERT_NO_FATAL_FAILURE(
+        this->WriteDataPageHeaderV2(/*max_serialized_len=*/1024, data_size, data_size));
+    data_page_headers_.push_back(this->data_page_header_v2_);
+    // Also write data, to make sure we skip the data correctly.
+    std::vector<uint8_t> faux_data(data_size);
+    ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  }
+  this->EndStream();
+}
+
+using DataPageHeaderTypes =
+    ::testing::Types<format::DataPageHeader, format::DataPageHeaderV2>;
+TYPED_TEST_SUITE(PageFilterTest, DataPageHeaderTypes);
+
+// Creates a number of pages and skips some of them with the page filter callback.
+TYPED_TEST(PageFilterTest, TestPageFilterCallback) {
+  this->WriteStream();
+
+  {  // Read all pages.
+    auto stream = std::make_shared<::arrow::io::BufferReader>(this->out_buffer_);
+    this->page_reader_ =
+        PageReader::Open(stream, this->total_rows_, Compression::UNCOMPRESSED);
+
+    // This callback will always return false.
+    auto read_all_pages = [](const DataPageStats& stats) -> bool { return false; };
+
+    this->page_reader_->set_data_page_filter(read_all_pages);
+    for (int i = 0; i < this->kNumPages; ++i) {
+      std::shared_ptr<Page> current_page = this->page_reader_->NextPage();
+      ASSERT_NE(current_page, nullptr);
+      ASSERT_NO_FATAL_FAILURE(
+          CheckDataPageHeader(this->data_page_headers_[i], current_page.get()));
+    }
+    ASSERT_EQ(this->page_reader_->NextPage(), nullptr);
+  }
+  {  // Skip all pages.
+    auto stream = std::make_shared<::arrow::io::BufferReader>(this->out_buffer_);
+    this->page_reader_ =
+        PageReader::Open(stream, this->total_rows_, Compression::UNCOMPRESSED);
+
+    auto skip_all_pages = [](const DataPageStats& stats) -> bool { return true; };
+
+    this->page_reader_->set_data_page_filter(skip_all_pages);
+    std::shared_ptr<Page> current_page = this->page_reader_->NextPage();
+    ASSERT_EQ(this->page_reader_->NextPage(), nullptr);
+  }
+
+  {  // Skip every other page.
+    auto stream = std::make_shared<::arrow::io::BufferReader>(this->out_buffer_);
+    this->page_reader_ =
+        PageReader::Open(stream, this->total_rows_, Compression::UNCOMPRESSED);
+
+    // Skip pages with even number of values.
+    auto skip_even_pages = [](const DataPageStats& stats) -> bool {
+      if (stats.num_values % 2 == 0) return true;
+      return false;
+    };
+
+    this->page_reader_->set_data_page_filter(skip_even_pages);
+
+    for (int i = 0; i < this->kNumPages; ++i) {
+      // Only pages with odd number of values are read.
+      if (i % 2 != 0) {
+        std::shared_ptr<Page> current_page = this->page_reader_->NextPage();
+        ASSERT_NE(current_page, nullptr);
+        ASSERT_NO_FATAL_FAILURE(
+            CheckDataPageHeader(this->data_page_headers_[i], current_page.get()));
+      }
+    }
+
[GitHub] [arrow] fatemehp commented on a diff in pull request #14603: PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback
fatemehp commented on code in PR #14603: URL: https://github.com/apache/arrow/pull/14603#discussion_r1065130195

## cpp/src/parquet/file_deserialize_test.cc: ##

@@ -177,6 +201,305 @@ TEST_F(TestPageSerde, DataPageV1) {
   ASSERT_NO_FATAL_FAILURE(CheckDataPageHeader(data_page_header_, current_page.get()));
 }
 
+// Templated test class to test page filtering for both format::DataPageHeader
+// and format::DataPageHeaderV2.
+template <typename T>
+class PageFilterTest : public TestPageSerde {
+ public:
+  const int kNumPages = 10;
+  void WriteStream();
+  void WritePageWithoutStats();
+  void CheckNumRows(std::optional<int32_t> num_rows, const T& header);
+
+ protected:
+  std::vector<T> data_page_headers_;
+  int total_rows_ = 0;
+};
+
+template <>
+void PageFilterTest<format::DataPageHeader>::WriteStream() {
+  for (int i = 0; i < kNumPages; ++i) {
+    // Vary the number of rows to produce different headers.
+    int32_t num_rows = i + 100;
+    total_rows_ += num_rows;
+    int data_size = i + 1024;
+    this->data_page_header_.__set_num_values(num_rows);
+    this->data_page_header_.statistics.__set_min_value("A" + std::to_string(i));
+    this->data_page_header_.statistics.__set_max_value("Z" + std::to_string(i));
+    this->data_page_header_.statistics.__set_null_count(0);
+    this->data_page_header_.statistics.__set_distinct_count(num_rows);
+    this->data_page_header_.__isset.statistics = true;
+    ASSERT_NO_FATAL_FAILURE(
+        this->WriteDataPageHeader(/*max_serialized_len=*/1024, data_size, data_size));
+    data_page_headers_.push_back(this->data_page_header_);
+    // Also write data, to make sure we skip the data correctly.
+    std::vector<uint8_t> faux_data(data_size);
+    ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  }
+  this->EndStream();
+}
+
+template <>
+void PageFilterTest<format::DataPageHeaderV2>::WriteStream() {
+  for (int i = 0; i < kNumPages; ++i) {
+    // Vary the number of rows to produce different headers.
+    int32_t num_rows = i + 100;
+    total_rows_ += num_rows;
+    int data_size = i + 1024;
+    this->data_page_header_v2_.__set_num_values(num_rows);
+    this->data_page_header_v2_.__set_num_rows(num_rows);
+    this->data_page_header_v2_.statistics.__set_min_value("A" + std::to_string(i));
+    this->data_page_header_v2_.statistics.__set_max_value("Z" + std::to_string(i));
+    this->data_page_header_v2_.statistics.__set_null_count(0);
+    this->data_page_header_v2_.statistics.__set_distinct_count(num_rows);
+    this->data_page_header_v2_.__isset.statistics = true;
+    ASSERT_NO_FATAL_FAILURE(
+        this->WriteDataPageHeaderV2(/*max_serialized_len=*/1024, data_size, data_size));
+    data_page_headers_.push_back(this->data_page_header_v2_);
+    // Also write data, to make sure we skip the data correctly.
+    std::vector<uint8_t> faux_data(data_size);
+    ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  }
+  this->EndStream();
+}
+
+template <>
+void PageFilterTest<format::DataPageHeader>::WritePageWithoutStats() {
+  int32_t num_rows = 100;
+  total_rows_ += num_rows;
+  int data_size = 1024;
+  this->data_page_header_.__set_num_values(num_rows);
+  ASSERT_NO_FATAL_FAILURE(
+      this->WriteDataPageHeader(/*max_serialized_len=*/1024, data_size, data_size));
+  data_page_headers_.push_back(this->data_page_header_);
+  std::vector<uint8_t> faux_data(data_size);
+  ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  this->EndStream();
+}
+
+template <>
+void PageFilterTest<format::DataPageHeaderV2>::WritePageWithoutStats() {
+  int32_t num_rows = 100;
+  total_rows_ += num_rows;
+  int data_size = 1024;
+  this->data_page_header_v2_.__set_num_values(num_rows);
+  this->data_page_header_v2_.__set_num_rows(num_rows);
+  ASSERT_NO_FATAL_FAILURE(
+      this->WriteDataPageHeaderV2(/*max_serialized_len=*/1024, data_size, data_size));
+  data_page_headers_.push_back(this->data_page_header_v2_);
+  std::vector<uint8_t> faux_data(data_size);
+  ASSERT_OK(this->out_stream_->Write(faux_data.data(), data_size));
+  this->EndStream();
+}
+
+template <>
+void PageFilterTest<format::DataPageHeader>::CheckNumRows(
+    std::optional<int32_t> num_rows, const format::DataPageHeader& header) {
+  ASSERT_EQ(num_rows, std::nullopt);
+}
+
+template <>
+void PageFilterTest<format::DataPageHeaderV2>::CheckNumRows(
+    std::optional<int32_t> num_rows, const format::DataPageHeaderV2& header) {
+  ASSERT_EQ(*num_rows, header.num_rows);
+}
+
+using DataPageHeaderTypes =
+    ::testing::Types<format::DataPageHeader, format::DataPageHeaderV2>;
+TYPED_TEST_SUITE(PageFilterTest, DataPageHeaderTypes);
+
+// Test that the returned encoded_statistics is nullptr when there are no
+// statistics in the page header.
+TYPED_TEST(PageFilterTest, TestPageWithoutStatistics) {
+  this->WritePageWithoutStats();
+
+  auto stream = std::make_shared<::arrow::io::BufferReader>(this->out_buffer_);
+  this->page_reader_ =
+      PageReader::Open(stream, this->total_rows_, Compression::UNCOMPRESSED);
+
+  int num_pages = 0;
+  bool is_stats_null = false;
+  auto read_all_pages = [&](const DataPageStats& stats) -> bool {
+    is_stats_null = stats.encoded_statistics == nullptr;
+
[GitHub] [arrow] fatemehp commented on a diff in pull request #14603: PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback
fatemehp commented on code in PR #14603: URL: https://github.com/apache/arrow/pull/14603#discussion_r1065129943

## cpp/src/parquet/column_reader.h: ##

@@ -55,6 +56,27 @@
 static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
 
 // 16 KB is the default expected page header size
 static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;
 
+// \brief DataPageStats stores encoded statistics and number of values/rows for
+// a page.
+struct PARQUET_EXPORT DataPageStats {
+  DataPageStats(const EncodedStatistics* encoded_statistics, int32_t num_values,
+                std::optional<int32_t> num_rows)
+      : encoded_statistics(encoded_statistics),
+        num_values(num_values),
+        num_rows(num_rows) {}
+
+  // Encoded statistics extracted from the page header.
+  // Nullptr if there are no statistics in the page header.
+  const EncodedStatistics* encoded_statistics;
+  // Number of values stored in the page. Filled for both V1 and V2 data pages.
+  // For repeated fields, this can be greater than number of rows. For
+  // non-repeated fields, this will be the same as the number of rows.
+  int32_t num_values;
+  // Number of rows stored in the page. std::nullopt for V1 data pages since

Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
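To make the intended use of this struct concrete, here is a minimal sketch of a filter built on it; `set_data_page_filter` and the `DataPageStats` fields come from the diffs quoted above, while the wrapper function, its name, and the all-null heuristic are illustrative assumptions rather than part of the PR:

```cpp
#include "parquet/column_reader.h"  // parquet::DataPageStats, parquet::PageReader

// Install a callback that skips any data page whose header statistics prove it
// is entirely null. Returning true from the callback means "skip this page".
void InstallAllNullPageFilter(parquet::PageReader* page_reader) {
  auto skip_all_null_pages = [](const parquet::DataPageStats& stats) -> bool {
    // No header statistics: keep the page, since nothing can be proven about it.
    if (stats.encoded_statistics == nullptr) return false;
    if (!stats.encoded_statistics->has_null_count) return false;
    // A page in which every value is null cannot contribute to a
    // `col IS NOT NULL` style predicate, so it is safe to skip outright.
    return stats.encoded_statistics->null_count == stats.num_values;
  };
  page_reader->set_data_page_filter(skip_all_null_pages);
}
```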