[GitHub] [arrow-datafusion] Jimexist opened a new pull request #715: Use pytest in integration test
Jimexist opened a new pull request #715: URL: https://github.com/apache/arrow-datafusion/pull/715 # Which issue does this PR close? Closes #673 # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] p5a0u9l edited a comment on issue #10531: How to resolve `pyarrow.deserialize` FutureWarning
p5a0u9l edited a comment on issue #10531: URL: https://github.com/apache/arrow/issues/10531#issuecomment-877982140 I mean, ok @wesm. Super appreciative of all you and contributors create here, but I rely on Plasma for a production system. An in-memory object store that supports hugepages, exposes simple put/get API and lets me achieve well over 20 Gb/s constantly streaming data, which is easily more than loopback w/ TCP with sufficiently large streaming records - which they are, being raw sensor data. Can someone point me in the recommended direction within the arrow ecosphere if Plasma is deprecated, given the above use case? I'm moving byte blobs mostly, but they are sometimes lightly wrapped as Python objects. I am aware of pyarrow IPC, but didn't see how it fit this rather niche case. would love to be corrected! tia
[GitHub] [arrow-datafusion] Jimexist commented on pull request #704: replace once iter chain with array::IntoIter
Jimexist commented on pull request #704: URL: https://github.com/apache/arrow-datafusion/pull/704#issuecomment-877979537 Can you remove the `IntoIter` wrapping, since arrays have it implemented by default?
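The suggestion above — arrays implement `IntoIterator` by value since Rust 1.53, so naming the `std::array::IntoIter` wrapper is unnecessary — can be sketched as follows (a minimal illustration, not the PR's actual code):

```rust
// Collect an array's elements by value via the IntoIterator impl for arrays
// (Rust 1.53+). Calling the trait method explicitly avoids the pre-2021-edition
// method-resolution quirk where `.into_iter()` on an array yields references.
fn array_values() -> Vec<i32> {
    IntoIterator::into_iter([1, 2]).collect()
}

fn main() {
    // Equivalent to chaining one-element iterators, but shorter:
    let chained: Vec<i32> = std::iter::once(1).chain(std::iter::once(2)).collect();
    assert_eq!(chained, array_values());
}
```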
[GitHub] [arrow] cyb70289 commented on a change in pull request #10686: ARROW-13289: [C++] Accept integer args in trig/log functions via promotion to double
cyb70289 commented on a change in pull request #10686: URL: https://github.com/apache/arrow/pull/10686#discussion_r667592918

## File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc

```diff
@@ -1042,6 +1042,26 @@ TEST(TestUnaryArithmetic, DispatchBest) {
   for (std::string name : {"negate", "negate_checked", "abs", "abs_checked"}) {
     CheckDispatchFails(name, {null()});
   }
+
+  for (std::string name :
+       {"ln", "log2", "log10", "log1p", "sin", "cos", "tan", "asin", "acos"}) {
+    for (std::string suffix : {"", "_checked"}) {
+      name += suffix;
+
+      CheckDispatchBest(name, {int32()}, {float64()});
+      CheckDispatchBest(name, {uint8()}, {float64()});
+
+      CheckDispatchBest(name, {dictionary(int8(), int64())}, {float64()});
+    }
+  }
+
+  CheckDispatchBest("atan", {int32()}, {float64()});
+  CheckDispatchBest("atan2", {int32(), float64()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {int32(), uint8()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {int32(), null()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {float32(), float64()}, {float64(), float64()});
+  // Integer always promotes to double
+  CheckDispatchBest("atan2", {float32(), int8()}, {float64(), float64()});
```

Review comment: Is it necessary to add some integer tests besides the implicit cast?

## File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc

```diff
@@ -1076,6 +1076,38 @@ struct ArithmeticFunction : ScalarFunction {
   }
 };
 
+/// An ArithmeticFunction that promotes integer arguments to double.
+struct ArithmeticFloatingPointFunction : public ArithmeticFunction {
+  using ArithmeticFunction::ArithmeticFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+    RETURN_NOT_OK(CheckDecimals(values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    // Only promote types for binary functions
```

Review comment: This comment doesn't look correct.
[GitHub] [arrow] cyb70289 commented on a change in pull request #10690: ARROW-13283: [Developer Tools] Support passing CPU/memory limits to Docker
cyb70289 commented on a change in pull request #10690: URL: https://github.com/apache/arrow/pull/10690#discussion_r667586087 ## File path: docker-compose.yml ## @@ -62,6 +62,14 @@ x-ccache: CCACHE_MAXSIZE: 500M CCACHE_DIR: /ccache +# CPU/memory limits to pass to Docker. These values emulate GitHub Actions: +# https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners +x-limits: + # Note we use cpuset and not cpus since Ninja only detects and limits parallelism Review comment: Tested `--cpus` and `--cpuset-cpus`. With `--cpuset-cpus=0,1`, ninja spawns 2 jobs, as expected. With `--cpus=2`, ninja spawns 32 jobs on my 32 core machine, though only 2 host cores can be used.
[GitHub] [arrow-rs] silathdiir commented on pull request #537: Implement `RecordBatch::concat`
silathdiir commented on pull request #537: URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-877912374 Hi @alamb, referring to @jorgecarleitao's latest comment [#461 (comment)](https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997), it seems that `arrow::compute::concat::concat_batches` may be better (even if not good enough for the threading model). I will follow the discussion and try to update this PR.
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #539: generate parquet schema from rust struct
codecov-commenter edited a comment on pull request #539: URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877857352

# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539) Report

> Merging [#539](https://codecov.io/gh/apache/arrow-rs/pull/539) (e13b9e7) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a) (f1fb2b1) will **decrease** coverage by `0.14%`.
> The diff coverage is `22.01%`.

```diff
@@            Coverage Diff             @@
##           master     #539      +/-   ##
==========================================
- Coverage   82.60%   82.45%   -0.15%
==========================================
  Files         167      167
  Lines       45984    46084     +100
==========================================
+ Hits        37984    37999      +15
- Misses       8000     8085      +85
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| parquet\_derive/src/lib.rs | `0.00% <0.00%> (ø)` | |
| parquet\_derive/src/parquet\_field.rs | `66.16% <12.50%> (-17.17%)` | :arrow_down: |
| parquet\_derive\_test/src/lib.rs | `100.00% <100.00%> (ø)` | |

> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`

Last update [f1fb2b1...e13b9e7](https://codecov.io/gh/apache/arrow-rs/pull/539).
[GitHub] [arrow] bkietz commented on pull request #10691: ARROW-13296: [C++] Provide a reflection compatible enum replacement
bkietz commented on pull request #10691: URL: https://github.com/apache/arrow/pull/10691#issuecomment-877881536 I recommend avoiding macros whenever possible, but for completeness I'll note that we could replace

```c++
// color.h
struct Color : EnumType {
  using EnumType::EnumType;
  static constexpr const char* kValues = R"(red green blue)";
};

// color.cc
constexpr const char* Color::kValues;
```

with

```c++
ARROW_ENUM(Color, red green blue);
```
[GitHub] [arrow-datafusion] houqp commented on issue #705: Any reason why logical plan is optimized twice?
houqp commented on issue #705: URL: https://github.com/apache/arrow-datafusion/issues/705#issuecomment-877873156 That's a good point. Perhaps the better structure would be to push optimize into query evaluation/materialization methods like `save` and `collect`. This way, we can make sure the engine always executes with an optimized plan, regardless of whether it was built from SQL or the plan builder.
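The restructuring proposed here can be sketched in miniature — hypothetical names, not DataFusion's actual API — with optimization pushed into the materialization path so it runs exactly once per execution:

```rust
// Toy model: the plan tracks whether it has been optimized; collect()
// guarantees optimization happens before execution, for SQL- and
// builder-built plans alike.
#[derive(Clone)]
struct LogicalPlan {
    optimized: bool,
}

fn optimize(_plan: &LogicalPlan) -> LogicalPlan {
    // Stand-in for the real optimizer passes
    LogicalPlan { optimized: true }
}

// Each materialization method (collect, save, ...) calls optimize() itself,
// so callers can no longer execute an unoptimized plan.
fn collect(plan: &LogicalPlan) -> LogicalPlan {
    let optimized = optimize(plan);
    // ... execute `optimized` and gather record batches ...
    optimized
}

fn main() {
    let plan = LogicalPlan { optimized: false };
    assert!(collect(&plan).optimized);
}
```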
[GitHub] [arrow-datafusion] andygrove commented on a change in pull request #714: Ballista: Shuffle write bug fix
andygrove commented on a change in pull request #714: URL: https://github.com/apache/arrow-datafusion/pull/714#discussion_r667545951

## File path: ballista/rust/core/src/execution_plans/shuffle_writer.rs

```diff
@@ -254,13 +254,13 @@ impl ExecutionPlan for ShuffleWriterExec {
             // write batch out
             let start = Instant::now();
-            match writers[num_output_partition] {
+            match writers[output_partition_num] {
                 Some(w) => {
                     w.write(_batch)?;
                 }
                 None => {
                     let mut path = path.clone();
-                    path.push(format!("{}", partition));
```

Review comment: This is the fix
[GitHub] [arrow-datafusion] andygrove opened a new pull request #714: shuffle write bug fix
andygrove opened a new pull request #714: URL: https://github.com/apache/arrow-datafusion/pull/714 # Which issue does this PR close? Closes #713 . # Rationale for this change Simple bug fix. Wrong variable was being used for partition number. # What changes are included in this PR? # Are there any user-facing changes? No
[GitHub] [arrow-datafusion] andygrove opened a new issue #713: Ballista: ShuffleWriterExec writing to wrong partition
andygrove opened a new issue #713: URL: https://github.com/apache/arrow-datafusion/issues/713 **Describe the bug** ShuffleWriterExec writes all output partitions to the same file due to a simple bug where the wrong variable is used for the partition number. **To Reproduce** Implement a unit test **Expected behavior** Use the correct partition number and produce multiple files **Additional context** None
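The bug class is easy to reproduce in miniature: keying the per-partition writers by the wrong variable sends every batch to one output. A minimal sketch of the corrected routing (hypothetical names, not the actual Ballista code):

```rust
use std::collections::HashMap;

// Route batches to per-partition outputs, keyed by the *output* partition
// number. The bug keyed by a different variable, collapsing all output
// into a single file.
fn route_batches(batches: &[(usize, &str)]) -> HashMap<usize, Vec<String>> {
    let mut outputs: HashMap<usize, Vec<String>> = HashMap::new();
    for &(output_partition_num, batch) in batches {
        outputs
            .entry(output_partition_num) // fixed: was the wrong variable
            .or_default()
            .push(batch.to_string());
    }
    outputs
}

fn main() {
    let routed = route_batches(&[(0, "a"), (1, "b"), (0, "c")]);
    // Two distinct output partitions, not one file containing everything
    assert_eq!(routed.len(), 2);
    assert_eq!(routed[&0], vec!["a", "c"]);
}
```

A unit test asserting that multiple output partitions produce multiple files (as the issue suggests) would have caught this.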
[GitHub] [arrow-datafusion] Dandandan commented on issue #705: Any reason why logical plan is optimized twice?
Dandandan commented on issue #705: URL: https://github.com/apache/arrow-datafusion/issues/705#issuecomment-877869950 I think it's not an oversight: `collect` is not necessarily called when executing queries, for example when writing the results to disk. As `collect` loads the result into memory in one thread/partition, this is not always what you want to do. We might do something to avoid optimizing twice when calling both functions, but maybe it's not a big deal for now?
[GitHub] [arrow] github-actions[bot] commented on pull request #10697: ARROW-13304: [C++] Unable to install nightly on Ubuntu 21.04 due to day of week options
github-actions[bot] commented on pull request #10697: URL: https://github.com/apache/arrow/pull/10697#issuecomment-877865260 https://issues.apache.org/jira/browse/ARROW-13304
[GitHub] [arrow] rok opened a new pull request #10697: ARROW-13304: [C++] Unable to install nightly on Ubuntu 21.04 due to day of week options
rok opened a new pull request #10697: URL: https://github.com/apache/arrow/pull/10697 To address [ARROW-13304](https://issues.apache.org/jira/browse/ARROW-13304).
[GitHub] [arrow-rs] jhorstmann commented on pull request #389: make slice work for nested types
jhorstmann commented on pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-877864847 This looks good and makes sense to me. StructArray (and probably union too) are a bit of a special case in how offsets are propagated. Since we currently do not push down offsets into the buffers, the offset of an array applies to the direct buffers of the array (including the validity buffer). That is still the case for StructArray, which has no other buffers, so Arrays are still working consistently in that regard. A test for concatenation of sliced StructArrays would be nice to have.
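The offset semantics described here — slicing moves an offset that applies to an array's direct buffers, validity included, while leaving the buffers themselves untouched — can be illustrated with a toy model (not arrow-rs's actual types):

```rust
// Toy model of an array with a value buffer, a validity buffer, and an offset.
#[derive(Clone)]
struct ToyArray {
    values: Vec<i32>,
    validity: Vec<bool>,
    offset: usize,
    len: usize,
}

impl ToyArray {
    // Slicing is zero-copy: only offset and length change, and the offset
    // applies to the value buffer *and* the validity buffer alike.
    fn slice(&self, start: usize, len: usize) -> ToyArray {
        assert!(start + len <= self.len);
        ToyArray { offset: self.offset + start, len, ..self.clone() }
    }

    fn value(&self, i: usize) -> Option<i32> {
        let j = self.offset + i;
        if self.validity[j] { Some(self.values[j]) } else { None }
    }
}

fn main() {
    let a = ToyArray {
        values: vec![10, 20, 30, 40],
        validity: vec![true, false, true, true],
        offset: 0,
        len: 4,
    };
    let s = a.slice(1, 2);
    assert_eq!(s.value(0), None);     // validity respects the offset
    assert_eq!(s.value(1), Some(30)); // values respect the offset
}
```

For a StructArray the only direct buffer is the validity buffer, which is why the same rule still applies consistently there.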
[GitHub] [arrow-rs] bryantbiggs commented on pull request #539: generate parquet schema from rust struct
bryantbiggs commented on pull request #539: URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877862396 nice!!!
[GitHub] [arrow] pachadotdev commented on a change in pull request #10624: ARROW-12992: [R] bindings for substr(), substring(), str_sub()
pachadotdev commented on a change in pull request #10624: URL: https://github.com/apache/arrow/pull/10624#discussion_r667533252

## File path: r/R/dplyr-functions.R

```diff
@@ -280,6 +280,45 @@ nse_funcs$str_trim <- function(string, side = c("both", "left", "right")) {
   Expression$create(trim_fun, string)
 }
 
+nse_funcs$substr <- function(string, start, stop) {
+  assert_that(
+    length(start) == 1,
+    msg = "Start of length != 1 not supported in Arrow"
+  )
+  assert_that(
+    length(end) == 1,
+    msg = "End of length != 1 not supported in Arrow"
+  )
+
+  if (start > stop) {
+    start <- 0
+    stop <- 0
+  } else {
+    start <- max(0, start - 1)
+    stop <- max(0, stop)
+    start_stop <- c(min(start, stop), max(start, stop))
+    start <- start_stop[1]
+    stop <- start_stop[2]
+  }
```

Review comment: Hmm, maybe it's just a bad attempt on my part to avoid situations where start > end. I'll remove this.
[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions
Dandandan commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877858042 > Maybe we do need two separate configs after all as per @Dandandan PR. Maybe with concurrency renamed to something more specific to readers. For now, something specific to readers seems fine to me. For the future, I think we also need a setting for the maximum number of threads when running multiple executors / DataFusion processes on one node (using the builder). By default this uses the number of CPU cores, but if you run multiple ballista workers on one node it may be better to lower it. https://docs.rs/tokio/1.8.1/tokio/runtime/struct.Builder.html#
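The knob being discussed — cap the worker count below the CPU count when several workers share a node — can be sketched with the standard library alone (the linked tokio `Builder` exposes `worker_threads` for the real thing; the helper name here is hypothetical):

```rust
use std::thread;

// Pick a worker-thread count: default to the machine's CPU count, but allow
// an explicit override so co-located workers don't oversubscribe one node.
fn worker_thread_count(override_n: Option<usize>) -> usize {
    override_n.unwrap_or_else(|| {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    })
}

fn main() {
    // E.g. two ballista workers on one box might each take 2 of the cores:
    assert_eq!(worker_thread_count(Some(2)), 2);
    assert!(worker_thread_count(None) >= 1);
}
```

With tokio, the override would be passed to `runtime::Builder::new_multi_thread().worker_threads(n)`.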
[GitHub] [arrow] github-actions[bot] commented on pull request #10696: ARROW-13303: [JS] Revise bundles
github-actions[bot] commented on pull request #10696: URL: https://github.com/apache/arrow/pull/10696#issuecomment-877858022 https://issues.apache.org/jira/browse/ARROW-13303
[GitHub] [arrow] domoritz opened a new pull request #10696: ARROW-13303: [JS] Revise bundles
domoritz opened a new pull request #10696: URL: https://github.com/apache/arrow/pull/10696 Merge after #10673
[GitHub] [arrow-rs] codecov-commenter commented on pull request #539: generate parquet schema from rust struct
codecov-commenter commented on pull request #539: URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877857352

# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539) Report

> Merging [#539](https://codecov.io/gh/apache/arrow-rs/pull/539) (1984adf) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a) (f1fb2b1) will **decrease** coverage by `0.14%`.
> The diff coverage is `6.97%`.

```diff
@@            Coverage Diff             @@
##           master     #539      +/-   ##
==========================================
- Coverage   82.60%   82.45%   -0.15%
==========================================
  Files         167      167
  Lines       45984    46066      +82
==========================================
+ Hits        37984    37986       +2
- Misses       8000     8080      +80
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| parquet\_derive/src/lib.rs | `0.00% <0.00%> (ø)` | |
| parquet\_derive/src/parquet\_field.rs | `66.23% <0.00%> (-17.10%)` | :arrow_down: |
| parquet\_derive\_test/src/lib.rs | `100.00% <100.00%> (ø)` | |

> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`

Last update [f1fb2b1...1984adf](https://codecov.io/gh/apache/arrow-rs/pull/539).
[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions
Dandandan commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877857197 > Maybe we do need two separate configs after all as per @Dandandan PR. Maybe with concurrency renamed to something more specific to readers. I believe that's correct. Might not be a problem when reading from one source, but if multiple sources come in, one task that waits on results from one source prevents other tasks from being started, which might limit the amount of both CPU and IO.
[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions
andygrove commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877856405 Maybe we do need two separate configs after all as per @Dandandan PR. Maybe with concurrency renamed to something more specific to readers.
[GitHub] [arrow] kou merged pull request #10694: MINOR: [C++] Fix a typo
kou merged pull request #10694: URL: https://github.com/apache/arrow/pull/10694
[GitHub] [arrow-rs] nevi-me commented on pull request #539: generate parquet schema from rust struct
nevi-me commented on pull request #539: URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877855410 Might be of interest to @xrl and @bryantbiggs as you were the original authors
[GitHub] [arrow-rs] nevi-me opened a new pull request #539: generate parquet schema from rust struct
nevi-me opened a new pull request #539: URL: https://github.com/apache/arrow-rs/pull/539 # Which issue does this PR close? None # Rationale for this change Users of `parquet_derive` currently have to write the schema of their Rust struct by hand. This is inconvenient, and can be generated for them. # What changes are included in this PR? Adds `parquet::record::RecordWriter::schema()` trait member, and an implementation in `parquet_derive` to generate a schema for the user. # Are there any user-facing changes? Yes, new API. The breakage is probably irrelevant as I don't think that there are many users of `parquet_derive`.
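A rough sketch of the API shape the PR describes — a `schema()` member on the writer trait, returning what users would otherwise write by hand. The trait and schema string below are simplified stand-ins, not the crate's exact definitions:

```rust
// Simplified stand-in for parquet::record::RecordWriter with a schema() member.
trait RecordWriter {
    fn schema(&self) -> String;
}

struct ACompleteRecord {
    pub id: i64,
    pub name: String,
}

// Roughly what a derive macro would generate, written out by hand here:
impl RecordWriter for ACompleteRecord {
    fn schema(&self) -> String {
        "message ACompleteRecord { required int64 id; required binary name (UTF8); }"
            .to_string()
    }
}

fn main() {
    let rec = ACompleteRecord { id: 1, name: "x".to_string() };
    assert_eq!(rec.id, 1);
    assert_eq!(rec.name, "x");
    assert!(rec.schema().contains("required int64 id"));
}
```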
[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #706: Rename concurrency to default_partitions
jorgecarleitao commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877851471 > What's the ideal design going forward? If the end goal is to create one async task for each partition and let tokio thread pool to manage the parallelism (threads), then I think the proper change we want to introduce should be getting rid of `spawn_blocking` and make read_files fully async. All our readers are blocking; if we do not run them on `spawn_blocking`, won't they block Tokio's runtime?
[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions
andygrove commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877850054 @Dandandan Ah, ok. I had not fully understood what was happening there. I think the goal is to have one async task per partition and to remove the `spawn_blocking`, as @houqp suggested.
[GitHub] [arrow-datafusion] houqp commented on pull request #706: Rename concurrency to default_partitions
houqp commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877847808 What's the ideal design going forward? If the end goal is to create one async task for each partition and let the tokio thread pool manage the parallelism (threads), then I think the proper change we want to introduce is getting rid of `spawn_blocking` and making read_files fully async.
[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions
Dandandan commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877844585 > > One concern I have is that the current config also sets the maximum number of threads during reading of parquet files. > > Is this still true though? I know we were creating threads at one point in time, but we are using Tokio/async now, so we are not creating threads. Increasing the partition count will increase the number of async tasks that we run in the thread pool but won't increase the number of threads. We now run the tasks with `spawn_blocking`, which will still create a number of extra threads to execute the tasks on. This is set to create a maximum of 512(!) threads by default. Based on `max_concurrency` we still split the files into multiple parallel readers, so increasing this value will considerably increase the number of extra threads (and allocated data) we use, as far as I can see.
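The behaviour Dandandan describes can be sketched without Tokio itself. The stand-in for `spawn_blocking` below gives every blocking job its own OS thread (Tokio draws such threads from a pool capped at 512 by default), so the number of live threads grows with the number of parallel readers; the pool model is deliberately simplified and the names are illustrative, not Tokio's API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Simplified stand-in for Tokio's `spawn_blocking`: each blocking job gets
// its own OS thread, instead of sharing a fixed-size async worker pool.
fn spawn_blocking_like<F: FnOnce() + Send + 'static>(job: F) -> thread::JoinHandle<()> {
    thread::spawn(job)
}

// Launch `jobs` blocking tasks and report the peak number running at once.
pub fn run_blocking_jobs(jobs: usize) -> usize {
    let live = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (live, peak) = (Arc::clone(&live), Arc::clone(&peak));
            spawn_blocking_like(move || {
                let now = live.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(50)); // simulate a blocking read
                live.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    // Unlike a fixed async worker pool sized to the CPU count, the thread
    // count here scales with the number of blocking jobs submitted.
    println!("peak concurrent blocking threads: {}", run_blocking_jobs(8));
}
```

This is why splitting files into more parallel blocking readers increases thread count (and memory) rather than just task count.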
[GitHub] [arrow-datafusion] andygrove commented on pull request #709: Ballista: Fix shuffle writes
andygrove commented on pull request #709: URL: https://github.com/apache/arrow-datafusion/pull/709#issuecomment-877843644 @edrevo fyi
[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions
andygrove commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877843494 > One concern I have is that the current config also sets the maximum number of threads during reading of parquet files. Is this still true though? I know we were creating threads at one point in time, but we are using Tokio/async now, so we are not creating threads. Increasing the partition count will increase the number of async tasks that we run in the thread pool but won't increase the number of threads.
[GitHub] [arrow-datafusion] andygrove opened a new pull request #712: Implement serde for ShuffleWriterExec
andygrove opened a new pull request #712: URL: https://github.com/apache/arrow-datafusion/pull/712 # Which issue does this PR close? Closes #710. # Rationale for this change We need this to complete the shuffle implementation. # What changes are included in this PR? # Are there any user-facing changes? No
[GitHub] [arrow-datafusion] andygrove opened a new issue #711: Ballista: Executor must return statistics in CompletedTask / CompletedJob
andygrove opened a new issue #711: URL: https://github.com/apache/arrow-datafusion/issues/711 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** We cannot fix the shuffle mechanism until we have partition stats, or ShuffleReaderExec will attempt to read empty partitions, causing an error. **Describe the solution you'd like** The scheduler should receive partition stats and only try to read from non-empty shuffle partitions. **Describe alternatives you've considered** As a workaround we could write empty shuffle files for empty partitions. **Additional context** None
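The scheduler-side filtering proposed in this issue can be sketched as follows. The `PartitionStats` shape is hypothetical, for illustration only; the real Ballista protobuf messages differ:

```rust
// Hypothetical stats an executor might report in CompletedTask/CompletedJob.
pub struct PartitionStats {
    pub partition_id: usize,
    pub num_rows: usize,
}

// The scheduler would hand ShuffleReaderExec only the non-empty partitions,
// so it never tries to fetch a shuffle file that was never written.
pub fn readable_partitions(stats: &[PartitionStats]) -> Vec<usize> {
    stats
        .iter()
        .filter(|s| s.num_rows > 0)
        .map(|s| s.partition_id)
        .collect()
}

fn main() {
    let stats = vec![
        PartitionStats { partition_id: 0, num_rows: 100 },
        PartitionStats { partition_id: 1, num_rows: 0 },
        PartitionStats { partition_id: 2, num_rows: 7 },
    ];
    // Partition 1 is empty and is skipped.
    assert_eq!(readable_partitions(&stats), vec![0, 2]);
}
```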
[GitHub] [arrow-datafusion] andygrove opened a new issue #710: Ballista: Implement physical plan serde for ShuffleWriterExec
andygrove opened a new issue #710: URL: https://github.com/apache/arrow-datafusion/issues/710 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** We can't get shuffle writes working correctly until we implement serde for this operator. **Describe the solution you'd like** Implement physical plan serde for ShuffleWriterExec **Describe alternatives you've considered** None **Additional context** None
[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #706: Rename concurrency to default_partitions
Dandandan edited a comment on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877831879 I think this is mostly good. One concern I have is that the current config also sets the maximum number of threads during reading of parquet files. For Ballista, I also think it makes sense to configure the number of threads Tokio uses. When using multiple executors per machine/node you'll likely want to limit the number of (Tokio) threads to keep memory usage down. This could be set when creating / configuring the Tokio Runtime.
[GitHub] [arrow-datafusion] houqp commented on pull request #704: replace once iter chain with array::IntoIter
houqp commented on pull request #704: URL: https://github.com/apache/arrow-datafusion/pull/704#issuecomment-877836703 @jorgecarleitao yes, this is a relatively new API that landed in stable Rust 1.51.
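The API in question, `std::array::IntoIter`, replaces chains of `iter::once` with a by-value array iterator. A minimal before/after sketch (the values are illustrative, not the PR's actual expressions):

```rust
use std::iter;

// The pattern the PR replaces: chaining `once` iterators for fixed values.
fn chained() -> Vec<i32> {
    iter::once(1).chain(iter::once(2)).chain(iter::once(3)).collect()
}

// The replacement: a by-value array iterator, stabilized in Rust 1.51.
// (On the 2021 edition, plain `[1, 2, 3].into_iter()` does the same.)
#[allow(deprecated)]
fn from_array() -> Vec<i32> {
    std::array::IntoIter::new([1, 2, 3]).collect()
}

fn main() {
    // Both forms yield the same sequence of owned values.
    assert_eq!(chained(), from_array());
    println!("{:?}", from_array()); // [1, 2, 3]
}
```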
[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions
Dandandan commented on pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877831879 I think this is mostly good. One concern I have is that the current config also sets the maximum number of threads during reading of parquet files. For Ballista, I also think it makes sense to configure the number of threads Tokio uses. When using multiple executors per machine/node you'll likely want to avoid having too many (Tokio) threads, to keep memory usage down.
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #706: Rename concurrency to default_partitions
Dandandan commented on a change in pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706#discussion_r667508175 ## File path: datafusion/src/logical_plan/builder.rs ## @@ -147,10 +147,10 @@ impl LogicalPlanBuilder { pub fn scan_parquet_with_name( path: impl Into, projection: Option>, -max_concurrency: usize, +max_partitions: usize, table_name: impl Into, ) -> Result { -let provider = Arc::new(ParquetTable::try_new(path, max_concurrency)?); +let provider = Arc::new(ParquetTable::try_new(path, max_partitions)?); Review comment: One concern: when increasing the partitions, we also increase the maximum number of threads while reading parquet. I think this should be decoupled.
[GitHub] [arrow-datafusion] andygrove opened a new pull request #709: Ballista: Fix shuffle writes
andygrove opened a new pull request #709: URL: https://github.com/apache/arrow-datafusion/pull/709 # Which issue does this PR close? Closes #707. # Rationale for this change Shuffles were broken. The executor always ran the shuffle writes with partitioning of `None` instead of the hash partitioning they were supposed to use. This information was never sent as part of the protobuf. Queries still worked correctly, but this did not scale since there was always a single partition. # What changes are included in this PR? # Are there any user-facing changes? No
[GitHub] [arrow-datafusion] andygrove commented on issue #707: Ballista: Avoid pointless shuffle write of Parquet partitions
andygrove commented on issue #707: URL: https://github.com/apache/arrow-datafusion/issues/707#issuecomment-877826983 It turns out there is a fundamental bug here. I am working on fixing this.
[GitHub] [arrow-datafusion] andygrove commented on pull request #683: Introduce (default) number of partitions option, use it in DataFusion/Ballista
andygrove commented on pull request #683: URL: https://github.com/apache/arrow-datafusion/pull/683#issuecomment-877817032 Please take a look at my proposal in https://github.com/apache/arrow-datafusion/pull/706
[GitHub] [arrow-datafusion] andygrove opened a new issue #708: Ballista: Remove hard-coded concurrency from logical plan serde code
andygrove opened a new issue #708: URL: https://github.com/apache/arrow-datafusion/issues/708 **Describe the bug**

```rust
LogicalPlanBuilder::scan_parquet_with_name(
    ,
    projection,
    24,
    _name,
)? //TODO concurrency
```
[GitHub] [arrow-datafusion] andygrove opened a new issue #707: Ballista: Avoid pointless shuffle write of Parquet partitions
andygrove opened a new issue #707: URL: https://github.com/apache/arrow-datafusion/issues/707 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** When running TPC-H query 12 I see query plans like this being executed:

```
ShuffleWriterExec: None
  ParquetExec: ...
```

This means we are reading a Parquet partition and immediately streaming it to disk in IPC format so that it can be read again later. This introduces a redundant write and read of this data. **Describe the solution you'd like** TBD **Describe alternatives you've considered** TBD **Additional context** None
[GitHub] [arrow-datafusion] andygrove opened a new pull request #706: Rename concurrency to default_partitions
andygrove opened a new pull request #706: URL: https://github.com/apache/arrow-datafusion/pull/706 # Which issue does this PR close? Closes #685. # Rationale for this change We originally used the `concurrency` config setting to determine how many threads to launch in certain parts of the code base, but we now use Tokio to manage task concurrency. The `concurrency` setting is only used to determine the number of partitions to use when repartitioning the plan. We should therefore rename this config setting. # What changes are included in this PR? Introduce a new `with_default_partitions` method and leave `with_concurrency` in place for now so that we don't break the API. # Are there any user-facing changes? If users were accessing the `concurrency` field in `ExecutionContext` they will now need to use `default_partitions` instead. Users should not be accessing these attributes directly, though.
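The backward-compatible renaming strategy described in the PR can be sketched as follows: keep the old builder method as a thin deprecated wrapper that delegates to the new one, so existing callers keep compiling. The `ExecutionConfig` type and field names here are illustrative stand-ins, not the actual DataFusion API:

```rust
// Hypothetical config type modelling the rename.
#[derive(Debug)]
pub struct ExecutionConfig {
    pub default_partitions: usize,
}

impl ExecutionConfig {
    pub fn new() -> Self {
        // Default to the machine's parallelism, falling back to 1.
        let n = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        Self { default_partitions: n }
    }

    /// New name: the number of partitions used when repartitioning a plan.
    pub fn with_default_partitions(mut self, n: usize) -> Self {
        self.default_partitions = n;
        self
    }

    /// Old name, kept so the API does not break; delegates to the new method.
    #[deprecated(note = "use with_default_partitions")]
    pub fn with_concurrency(self, n: usize) -> Self {
        self.with_default_partitions(n)
    }
}

fn main() {
    #[allow(deprecated)]
    let cfg = ExecutionConfig::new().with_concurrency(8);
    // The old entry point lands in the renamed field.
    assert_eq!(cfg.default_partitions, 8);
}
```

The `#[deprecated]` attribute nudges callers toward the new name while keeping old code compiling.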
[GitHub] [arrow-rs] ritchie46 commented on issue #461: Implement RecordBatch::concat
ritchie46 commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877811760 I think a lean arrow-core crate would be beneficial. We already started some work by feature-gating IO logic. Wouldn't the same be achieved by `impl FromIterator for RecordBatch`? Then users can use the `concat` kernel and choose how to iter the columns. They also don't pay the compilation penalty until they actually implement that iterator + collect.
[GitHub] [arrow-rs] jorgecarleitao commented on issue #461: Implement RecordBatch::concat
jorgecarleitao commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997 Let me try to explain my reasoning atm. All methods exposed on `Array` are `O(1)`. In particular, `.slice` is `O(1)` over the array, and thus `O(c)` over the record where `c` is the number of fields. `concat` over `RecordBatch` seems rather simple but is `O(c * n * r)` where c is the number of columns, r the number of records, and `n` the average length of the records. Since `c` is trivially parallelizable, I would say that the natural implementation is to actually rayon it, i.e. `columns().iter_par()...`. Generally, I consider non-parallel iterations over a record to be an anti-pattern, since parallelism over columns is one of the hallmarks of columnar formats. Imo the decision of how to iterate over columns does not belong to `arrow-rs`, but to Polars, DataFusion and the like. `DataFusion` uses `iter`; polars uses `iter_par` for the most part. We do have some methods in compute that `iter` over the `RecordBatch` that follow this pattern (`filter` and `sort` I believe). So, in this context, I would be more inclined to place `concat_record` at the same level as them: methods that are not `O(1)` over the arrays' length that some may use when they do not want to commit to a multi-threaded execution. But again, imo this is an anti-pattern that we should not promote, as it enforces a specific threading model over columns. The reasoning to have it in `compute` is to not drag compute dependencies to the core modules (I see them as being the datatypes, array, buffer, RecordBatch, alloc). The reason being that `compute` has a massive compile time when compared to the rest of the crate, and keeping these separated makes it easier to split `arrow` in two crates (`arrow` and `arrow-compute`) to reduce compile and/or binary size. This is minor and can be solved by moving `impl RecordBatch` to the `compute::kernels::concat` if the time arises. 
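The column-parallel concatenation jorgecarleitao describes can be illustrated with plain standard-library threads. The toy "record batch" below is just a `Vec` of columns, not the arrow type, and a real implementation would use rayon's parallel iterators rather than one scoped thread per column:

```rust
use std::thread;

// Toy stand-in for a record batch: each inner Vec is one column.
// Concatenation is O(c * n * r) overall, but each of the `c` columns can be
// concatenated independently, which is why parallelizing over columns is
// natural for a columnar format.
fn concat_columns(batches: &[Vec<Vec<i32>>]) -> Vec<Vec<i32>> {
    let num_cols = batches[0].len();
    thread::scope(|s| {
        // One scoped thread per column; rayon's `par_iter` would be the
        // production-grade equivalent.
        let handles: Vec<_> = (0..num_cols)
            .map(|c| {
                s.spawn(move || {
                    batches
                        .iter()
                        .flat_map(|b| b[c].iter().copied())
                        .collect::<Vec<i32>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let b1 = vec![vec![1, 2], vec![10, 20]];
    let b2 = vec![vec![3], vec![30]];
    let out = concat_columns(&[b1, b2]);
    // Each column is concatenated across batches independently.
    assert_eq!(out, vec![vec![1, 2, 3], vec![10, 20, 30]]);
}
```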
[GitHub] [arrow-datafusion] andygrove commented on pull request #683: Introduce (default) number of partitions option, use it in DataFusion/Ballista
andygrove commented on pull request #683: URL: https://github.com/apache/arrow-datafusion/pull/683#issuecomment-877808503 I spent some more time reviewing the codebase this morning and it appears that we no longer use the `concurrency` config to determine concurrency as in number of threads, but only use it to determine partition counts when repartitioning, so I think we should just rename this field, rather than introduce a new config. Tokio is managing concurrency level automatically.
[GitHub] [arrow-rs] nevi-me commented on issue #334: Cast of utf8 types and list container types don't respect offset
nevi-me commented on issue #334: URL: https://github.com/apache/arrow-rs/issues/334#issuecomment-877794567 Can be closed
[GitHub] [arrow-rs] nevi-me closed issue #334: Cast of utf8 types and list container types don't respect offset
nevi-me closed issue #334: URL: https://github.com/apache/arrow-rs/issues/334
[GitHub] [arrow-rs] nevi-me commented on issue #518: Parquet list reader not reading correct element lengths
nevi-me commented on issue #518: URL: https://github.com/apache/arrow-rs/issues/518#issuecomment-877794108 @mcassels I've opened this issue to track the bug that you identified
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: Minimal MapArray support
codecov-commenter edited a comment on pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694 # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) Report > Merging [#491](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (f52d13e) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (f1fb2b1) will **increase** coverage by `0.02%`. > The diff coverage is `80.10%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/491/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #491 +/- ## == + Coverage 82.60% 82.62% +0.02% == Files 167 168 +1 Lines 45984 46717 +733 == + Hits 37984 38599 +615 - Misses 8000 8118 +118 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy93cml0ZXIucnM=) | `86.68% <ø> (ø)` | | | 
[arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsX2pzb24ucnM=) | `88.69% <33.33%> (-2.52%)` | :arrow_down: | | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `51.36% <45.00%> (-0.47%)` | :arrow_down: | | [arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL21vZC5ycw==) | `93.78% <50.00%> (-0.16%)` | :arrow_down: | | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.08% <50.00%> (-0.74%)` | :arrow_down: | | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `82.65% <65.38%> (-0.43%)` | :arrow_down: | | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `77.70% <77.67%> (+0.25%)` | :arrow_up: | | 
[arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `85.83% <81.44%> (-0.29%)` | :arrow_down: | | [arrow/src/util/integration\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL3V0aWwvaW50ZWdyYXRpb25fdXRpbC5ycw==) | `70.10% <81.81%> (+0.54%)` | :arrow_up: | | [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `87.35% <81.88%> (-1.18%)` | :arrow_down: | | ... and [21
[GitHub] [arrow-rs] nevi-me commented on pull request #491: Minimal MapArray support
nevi-me commented on pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-877791486 @alamb @jorgecarleitao this is now ready for review
[GitHub] [arrow-rs] nevi-me opened a new issue #538: DataType::Map tracking issue
nevi-me opened a new issue #538: URL: https://github.com/apache/arrow-rs/issues/538 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** We are adding support for the map data type, which is a list of key-value pairs, like a hashmap. This issue tracks outstanding functionality without opening multiple issues for each at once. The outstanding work after #395 and #491 is:

- [ ] Integration testing (https://issues.apache.org/jira/browse/ARROW-13300)
- [ ] Non-canonical map support (where we don't check that the schema names match exactly)
- [ ] Nested maps [arrow, parquet (write)]
- [ ] Improving map builder support to behave like a hashmap
- [ ] JSON writer
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: Minimal MapArray support
codecov-commenter edited a comment on pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694 # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) Report > Merging [#491](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (85cda50) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (f1fb2b1) will **increase** coverage by `0.02%`. > The diff coverage is `80.20%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/491/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #491 +/- ## == + Coverage 82.60% 82.62% +0.02% == Files 167 168 +1 Lines 45984 46717 +733 == + Hits 37984 38599 +615 - Misses 8000 8118 +118 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy93cml0ZXIucnM=) | `86.68% <ø> (ø)` | | | 
[arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsX2pzb24ucnM=) | `88.69% <33.33%> (-2.52%)` | :arrow_down: | | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `51.36% <45.00%> (-0.47%)` | :arrow_down: | | [arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL21vZC5ycw==) | `93.78% <50.00%> (-0.16%)` | :arrow_down: | | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.08% <50.00%> (-0.74%)` | :arrow_down: | | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `82.65% <65.38%> (-0.43%)` | :arrow_down: | | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `77.70% <77.67%> (+0.25%)` | :arrow_up: | | 
[arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `85.83% <81.44%> (-0.29%)` | :arrow_down: | | [arrow/src/util/integration\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL3V0aWwvaW50ZWdyYXRpb25fdXRpbC5ycw==) | `70.10% <81.81%> (+0.54%)` | :arrow_up: | | [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `87.35% <81.88%> (-1.18%)` | :arrow_down: | | ... and [21
[GitHub] [arrow-rs] alamb commented on a change in pull request #532: Add tests for building applications using arrow with different feature flags
alamb commented on a change in pull request #532: URL: https://github.com/apache/arrow-rs/pull/532#discussion_r667468941

## File path: .github/workflows/rust.yml

```diff
@@ -357,15 +353,22 @@ jobs:
       uses: actions/cache@v2
       with:
         path: /github/home/target
-        key: ${{ runner.os }}-${{ matrix.arch }}-target-wasm32-cache-${{ matrix.rust }}
+        # this key equals the ones on `linux-build-lib` for re-use
+        key: ${{ runner.os }}-${{ matrix.arch }}-target-cache-${{ matrix.rust }}
     - name: Setup Rust toolchain
       run: |
         rustup toolchain install ${{ matrix.rust }}
         rustup override set ${{ matrix.rust }}
         rustup component add rustfmt
-    - name: Build arrow crate
+    - name: Build with default features
       run: |
         export CARGO_HOME="/github/home/.cargo"
         export CARGO_TARGET_DIR="/github/home/target"
-        cd arrow
-        cargo check --all-targets --no-default-features
```

Review comment: I think the reason CI isn't catching issues with feature flags is the cargo semantics of a `workspace`, where cargo calculates the dependency needs of the workspace as a whole. Since some crates in the workspace (like `arrow-flight`) bring in additional dependencies, those additional dependencies can obscure build problems with other crates (like `arrow`) when they are used alone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #532: Add tests for building applications using arrow with different feature flags
alamb commented on a change in pull request #532: URL: https://github.com/apache/arrow-rs/pull/532#discussion_r667468650

## File path: arrow/test/dependency/no-default-features/Cargo.toml

```toml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

[package]
name = "no-default-features"
description = "Models a user application of arrow that specifies no-default-features=true"
version = "0.1.0"
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
arrow = { path = "../../../../arrow", version = "5.0.0-SNAPSHOT", default-features = false }

# Workaround for https://github.com/apache/arrow-rs/issues/529
rand = { version = "0.8" }

[workspace]
```

Review comment: Actually, this is required (you get a cargo build error without it). Somehow this signals to cargo that this project is not part of the workspace rooted at `arrow-rs/Cargo.toml`.
[GitHub] [arrow-rs] alamb commented on pull request #537: Implement `RecordBatch::concat`
alamb commented on pull request #537: URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-877785281 Although @nevi-me seems to like `RecordBatch::concat` -- https://github.com/apache/arrow-rs/issues/461#issuecomment-877784891
[GitHub] [arrow-rs] alamb commented on issue #461: Implement RecordBatch::concat
alamb commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877785205 > I think this would aid discoverability, in the sense that a person asking "what can I do with a record batch?" would look at `impl RecordBatch` and discover that easily. I am convinced by this argument.
[GitHub] [arrow-rs] nevi-me commented on issue #461: Implement RecordBatch::concat
nevi-me commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877784891 I'd lean towards `RecordBatch::filter` and `RecordBatch::concat` as long as that ends up being a convenience that iterates through all columns, and calls the underlying compute or other function. Similar to a `RecordBatch::slice` that calls `array.slice` internally. I think this would aid discoverability, in the sense that a person asking "what can I do with a record batch?" would look at `impl RecordBatch` and discover that easily.
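The convenience-method pattern nevi-me describes — a method on the batch type that just iterates the columns and delegates to an existing per-column kernel — can be sketched without the arrow crate. `Batch`, `concat_columns`, and the plain `i32` columns below are illustrative stand-ins, not the real arrow-rs API:

```rust
// Illustrative model: a "record batch" is a list of equally-long columns.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<Vec<i32>>,
}

// Stand-in for a per-column compute kernel (like `concat` on arrays).
fn concat_columns(parts: &[&Vec<i32>]) -> Vec<i32> {
    parts.iter().flat_map(|c| c.iter().copied()).collect()
}

impl Batch {
    // The convenience method: iterate columns, delegate each one to the kernel.
    fn concat(batches: &[Batch]) -> Option<Batch> {
        let ncols = batches.first()?.columns.len();
        let columns = (0..ncols)
            .map(|i| {
                let parts: Vec<&Vec<i32>> =
                    batches.iter().map(|b| &b.columns[i]).collect();
                concat_columns(&parts)
            })
            .collect();
        Some(Batch { columns })
    }
}

fn main() {
    let a = Batch { columns: vec![vec![1, 2], vec![10, 20]] };
    let b = Batch { columns: vec![vec![3, 4], vec![30, 40]] };
    let merged = Batch::concat(&[a, b]).unwrap();
    assert_eq!(merged.columns, vec![vec![1, 2, 3, 4], vec![10, 20, 30, 40]]);
    println!("{:?}", merged.columns);
}
```

The method carries no compute logic of its own, which is what makes it a discoverability aid rather than a new kernel.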
[GitHub] [arrow-rs] alamb commented on a change in pull request #537: Implement `RecordBatch::concat`
alamb commented on a change in pull request #537: URL: https://github.com/apache/arrow-rs/pull/537#discussion_r667467518

## File path: arrow/src/record_batch.rs

```rust
#[test]
fn concat_empty_record_batches() {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::concat(&schema, &[]).unwrap();
    assert_eq!(batch.schema().as_ref(), schema.as_ref());
    assert_eq!(0, batch.num_rows());
}

#[test]
fn concat_record_batches_of_same_schema() {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch1 = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2]))],
    )
    .unwrap();
    let batch2 = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![3, 4]))],
    )
    .unwrap();
    let new_batch = RecordBatch::concat(&schema, &[batch1, batch2]).unwrap();
    assert_eq!(new_batch.schema().as_ref(), schema.as_ref());
    assert_eq!(4, new_batch.num_rows());
}

#[test]
fn concat_record_batches_of_similar_schemas() {
```

Review comment: I would suggest this case be treated as an error -- if a user wants this behavior, I think they should explicitly have to pick out the columns they want and create record batches with the same schema.

## File path: arrow/src/record_batch.rs

```rust
/// Concatenates `batches` together into a single record batch.
pub fn concat(schema: &SchemaRef, batches: &[Self]) -> Result<Self> {
    if batches.is_empty() {
        return Ok(RecordBatch::new_empty(schema.clone()));
    }
    let field_num = schema.fields().len();
    if let Some((i, _)) = batches
        .iter()
        .enumerate()
        .find(|&(_, batch)| batch.num_columns() < field_num)
```

Review comment: I think if the batches don't have the same schema, an error should result, even if there is some plausible behavior. My rationale is that it is likely a user error if the schemas don't match, so failing fast will help the programmer identify the problem more easily. So perhaps this function should check `batch.schema() != schema` (aka compare the entire schema for each record batch).

## File path: arrow/src/record_batch.rs

```rust
#[test]
fn concat_record_batches_of_similar_schemas() {
    let schema1 =
        Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let schema2 = Arc::new(Schema::new(vec![
        Field::new("b", DataType::Int32, false),
        Field::new("c", DataType::Utf8, false),
    ]));
    let batch1 = RecordBatch::try_new(
        schema1.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2]))],
    )
    .unwrap();
    let batch2 = RecordBatch::try_new(
        schema2,
        vec![
            Arc::new(Int32Array::from(vec![3, 4])),
            Arc::new(StringArray::from(vec!["a", "b"])),
        ],
    )
    .unwrap();
    let new_batch = RecordBatch::concat(&schema1, &[batch1, batch2]).unwrap();
    assert_eq!(new_batch.schema().as_ref(), schema1.as_ref());
    assert_eq!(4, new_batch.num_rows());
}

#[test]
fn concat_record_batches_of_different_column_num() {
    let
```
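The fail-fast behavior alamb is arguing for can be illustrated without the arrow crate. The `Schema` and `Batch` types below are simplified stand-ins (field names only, no arrays), not the real arrow-rs types:

```rust
// Simplified stand-ins, just to illustrate rejecting mismatched schemas up front.
#[derive(Debug, Clone, PartialEq)]
struct Schema(Vec<&'static str>); // field names only

#[derive(Debug)]
struct Batch {
    schema: Schema,
    num_rows: usize,
}

// Fail fast: if any input batch's schema differs from the target schema,
// return an error instead of guessing at a plausible merge.
fn concat(schema: &Schema, batches: &[Batch]) -> Result<Batch, String> {
    if let Some((i, bad)) = batches
        .iter()
        .enumerate()
        .find(|(_, b)| &b.schema != schema)
    {
        return Err(format!(
            "batch {} has schema {:?}, expected {:?}",
            i, bad.schema, schema
        ));
    }
    Ok(Batch {
        schema: schema.clone(),
        num_rows: batches.iter().map(|b| b.num_rows).sum(),
    })
}

fn main() {
    let s = Schema(vec!["a"]);
    let ok = concat(&s, &[Batch { schema: s.clone(), num_rows: 2 }]).unwrap();
    assert_eq!(ok.num_rows, 2);
    let bad = concat(&s, &[Batch { schema: Schema(vec!["b"]), num_rows: 2 }]);
    assert!(bad.is_err());
    println!("mismatch rejected: {:?}", bad);
}
```

The error names the offending batch index, which is exactly the "help the programmer identify the problem" payoff of checking the whole schema rather than just the column count.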
[GitHub] [arrow-rs] alamb commented on issue #461: Implement RecordBatch::concat
alamb commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877783516 > Wouldn't it make more sense to add this to compute as a normal function? Implementing it on the RecordBatch interface breaks the separation between compute and data that we enforce with the compute module. @jorgecarleitao I think that comes down to "is concat a construction operation or a compute operation"? It could probably be classified as either. I don't have a strong opinion about where the "concat RecordBatch" function goes. The only other prior art of using `RecordBatch`es in compute kernels I found is `pub fn filter_record_batch` in `filter.rs`: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/filter.rs#L279-L298 We could perhaps add a `concat_batches` function in `compute/kernels/concat.rs`.
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #389: make slice work for nested types
nevi-me commented on a change in pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#discussion_r667463293

## File path: arrow/src/array/transform/structure.rs

```diff
@@ -26,13 +26,10 @@
 pub(super) fn build_extend(array: &ArrayData) -> Extend {
          index: usize,
          start: usize,
          len: usize| {
-            mutable.child_data.iter_mut().for_each(|child| {
-                child.extend(
-                    index,
-                    array.offset() + start,
-                    array.offset() + start + len,
-                )
-            })
+            mutable
```

Review comment: This reverses 9f96561
[GitHub] [arrow-rs] nevi-me commented on pull request #389: make slice work for nested types
nevi-me commented on pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-877780039 @jorgecarleitao @jhorstmann @bjchambers this is ready for review, and partially fixes #514
[GitHub] [arrow-rs] codecov-commenter commented on pull request #389: make slice work for nested types
codecov-commenter commented on pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-88131

# Codecov Report

> Merging #389 (f72c86e) into master (f1fb2b1) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.

```diff
@@           Coverage Diff           @@
##           master     #389   +/-  ##
======================================
  Coverage   82.60%   82.60%
======================================
  Files         167      167
  Lines       45984    45986    +2
======================================
+ Hits        37984    37986    +2
  Misses       8000     8000
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| arrow/src/array/array\_struct.rs | `89.11% <100.00%> (-0.14%)` | :arrow_down: |
| arrow/src/array/data.rs | `73.60% <100.00%> (+0.93%)` | :arrow_up: |
| arrow/src/array/transform/structure.rs | `80.00% <100.00%> (-3.34%)` | :arrow_down: |
| parquet/src/encodings/encoding.rs | `94.85% <0.00%> (-0.20%)` | :arrow_down: |

> **Legend**: `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`. Last update f1fb2b1...f72c86e.
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #389: make slice work for nested types
nevi-me commented on a change in pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#discussion_r667459042 ## File path: arrow/src/array/array_struct.rs ## @@ -85,12 +85,7 @@ impl From for StructArray { fn from(data: ArrayData) -> Self { let mut boxed_fields = vec![]; for cd in data.child_data() { -let child_data = if data.offset() != 0 || data.len() != cd.len() { Review comment: @jhorstmann @jorgecarleitao the problem was here. We take `data`, which has children with their own offsets, then slice that data correctly to populate `boxed_fields`, but then we still use the `data` with the child arrays that aren't offset. The result is that the struct is correct when looking at it through `boxed_fields` (e.g. the print utility uses this), but once you need to do anything with `data`, it's as if the child values were never sliced. The lightbulb turned on while I was trying to figure out why a test for #491 was failing. There's a change that @bjchambers authored (https://github.com/apache/arrow-rs/commit/9f965612626cdf31187ce07ba5dbecc3503077b8) which addressed the child offsets, but my tests were failing because I needed to revert what they did.
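The class of bug nevi-me describes — a sliced parent view over child buffers that were never re-offset — can be modeled in a few lines. The `StructView` type below is an illustrative toy, not the arrow-rs `StructArray`:

```rust
// Toy model of a struct array: a parent (offset, len) view over child values.
struct StructView {
    offset: usize,
    len: usize,
    child: Vec<i32>, // child values are NOT sliced; they keep the full buffer
}

impl StructView {
    // Slicing only moves the parent's offset; the child buffer is untouched.
    fn slice(&self, offset: usize, len: usize) -> StructView {
        StructView {
            offset: self.offset + offset,
            len,
            child: self.child.clone(),
        }
    }

    // Correct access goes through the parent's offset.
    fn value(&self, i: usize) -> i32 {
        assert!(i < self.len);
        self.child[self.offset + i]
    }
}

fn main() {
    let arr = StructView { offset: 0, len: 4, child: vec![10, 20, 30, 40] };
    let sliced = arr.slice(2, 2);
    // Reading through the view honors the offset:
    assert_eq!(sliced.value(0), 30);
    // But code that reads `sliced.child` directly, ignoring `sliced.offset`,
    // sees the data "as if the child values were never sliced":
    assert_eq!(sliced.child[0], 10);
    println!("view: {}, raw child: {}", sliced.value(0), sliced.child[0]);
}
```

Any consumer that walks the raw child data (rather than going through the offset-aware accessor) reproduces the symptom from the PR: printing looks right, but downstream use of the underlying data is wrong.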
[GitHub] [arrow] domoritz edited a comment on pull request #10673: ARROW-13277: [JS] Add declaration maps for TypeScript and refactor testing infrastructure
domoritz edited a comment on pull request #10673: URL: https://github.com/apache/arrow/pull/10673#issuecomment-877747590 ~~Hmm, can it be that test times for the docker build went from 20 minutes to 50 minutes? (https://github.com/apache/arrow/runs/3005016709 vs https://github.com/apache/arrow/runs/3037021755).~~ The last run was faster again. Maybe the CI is just inconsistent.
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: [DRAFT] Minimal MapArray support
codecov-commenter edited a comment on pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694

# Codecov Report

> Merging #491 (b5efa36) into master (f1fb2b1) will **decrease** coverage by `0.04%`.
> The diff coverage is `75.48%`.

```diff
@@            Coverage Diff             @@
##           master     #491      +/-  ##
=========================================
- Coverage   82.60%   82.56%    -0.05%
=========================================
  Files         167      168       +1
  Lines       45984    46629     +645
=========================================
+ Hits        37984    38498     +514
- Misses       8000     8131     +131
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| arrow/src/array/data.rs | `72.23% <0.00%> (-0.44%)` | :arrow_down: |
| arrow/src/ipc/writer.rs | `86.68% <ø> (ø)` | |
| arrow/src/array/equal\_json.rs | `88.69% <33.33%> (-2.52%)` | :arrow_down: |
| arrow/src/array/array\_map.rs | `44.77% <44.77%> (ø)` | |
| arrow/src/datatypes/field.rs | `51.36% <45.00%> (-0.47%)` | :arrow_down: |
| arrow/src/array/equal/mod.rs | `93.78% <50.00%> (-0.16%)` | :arrow_down: |
| arrow/src/datatypes/datatype.rs | `66.08% <50.00%> (-0.74%)` | :arrow_down: |
| parquet/src/arrow/levels.rs | `82.65% <65.38%> (-0.43%)` | :arrow_down: |
| parquet/src/arrow/array\_reader.rs | `77.70% <77.67%> (+0.25%)` | :arrow_up: |
| arrow/src/array/builder.rs | `85.83% <81.44%> (-0.29%)` | :arrow_down: |
| ... and [18
[GitHub] [arrow] domoritz commented on pull request #10673: ARROW-13277: [JS] Add declaration maps for TypeScript and refactor testing infrastructure
domoritz commented on pull request #10673: URL: https://github.com/apache/arrow/pull/10673#issuecomment-877747590 Hmm, can it be that test times for the docker build went from 20 minutes to 50 minutes? (https://github.com/apache/arrow/runs/3005016709 vs https://github.com/apache/arrow/runs/3037021755).