[GitHub] [arrow-datafusion] Jimexist opened a new pull request #715: Use pytest in integration test

2021-07-11 Thread GitBox


Jimexist opened a new pull request #715:
URL: https://github.com/apache/arrow-datafusion/pull/715


   # Which issue does this PR close?
   
   Closes #673
   
# Rationale for this change
   
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] p5a0u9l edited a comment on issue #10531: How to resolve `pyarrow.deserialize` FutureWarning

2021-07-11 Thread GitBox


p5a0u9l edited a comment on issue #10531:
URL: https://github.com/apache/arrow/issues/10531#issuecomment-877982140


   I mean, ok @wesm. I'm super appreciative of all that you and the 
contributors create here, but I rely on Plasma for a production system. 
   
   It's an in-memory object store that supports hugepages, exposes a simple 
put/get API, and lets me achieve well over 20 Gb/s of constantly streaming 
data, which is easily more than loopback TCP with sufficiently large streaming 
records - which they are, being raw sensor data. 
   
   Can someone point me in the recommended direction within the Arrow ecosystem 
if Plasma is deprecated, given the above use case? I'm moving byte blobs 
mostly, but they are sometimes lightly wrapped as Python objects.  
   
   I am aware of pyarrow IPC, but didn't see how it fits this rather niche 
case. Would love to be corrected!
   
   tia






[GitHub] [arrow-datafusion] Jimexist commented on pull request #704: replace once iter chain with array::IntoIter

2021-07-11 Thread GitBox


Jimexist commented on pull request #704:
URL: https://github.com/apache/arrow-datafusion/pull/704#issuecomment-877979537


   Can you remove the `IntoIter` wrapping, since arrays have it implemented by 
default?






[GitHub] [arrow] cyb70289 commented on a change in pull request #10686: ARROW-13289: [C++] Accept integer args in trig/log functions via promotion to double

2021-07-11 Thread GitBox


cyb70289 commented on a change in pull request #10686:
URL: https://github.com/apache/arrow/pull/10686#discussion_r667592918



##
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##
@@ -1042,6 +1042,26 @@ TEST(TestUnaryArithmetic, DispatchBest) {
   for (std::string name : {"negate", "negate_checked", "abs", "abs_checked"}) {
 CheckDispatchFails(name, {null()});
   }
+
+  for (std::string name :
+   {"ln", "log2", "log10", "log1p", "sin", "cos", "tan", "asin", "acos"}) {
+for (std::string suffix : {"", "_checked"}) {
+  name += suffix;
+
+  CheckDispatchBest(name, {int32()}, {float64()});
+  CheckDispatchBest(name, {uint8()}, {float64()});
+
+  CheckDispatchBest(name, {dictionary(int8(), int64())}, {float64()});
+}
+  }
+
+  CheckDispatchBest("atan", {int32()}, {float64()});
+  CheckDispatchBest("atan2", {int32(), float64()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {int32(), uint8()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {int32(), null()}, {float64(), float64()});
+  CheckDispatchBest("atan2", {float32(), float64()}, {float64(), float64()});
+  // Integer always promotes to double
+  CheckDispatchBest("atan2", {float32(), int8()}, {float64(), float64()});

Review comment:
   Is it necessary to add some integer tests besides the implicit cast?

##
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##
@@ -1076,6 +1076,38 @@ struct ArithmeticFunction : ScalarFunction {
   }
 };
 
+/// An ArithmeticFunction that promotes integer arguments to double.
+struct ArithmeticFloatingPointFunction : public ArithmeticFunction {
+  using ArithmeticFunction::ArithmeticFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+RETURN_NOT_OK(CheckArity(*values));
+RETURN_NOT_OK(CheckDecimals(values));
+
+using arrow::compute::detail::DispatchExactImpl;
+if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+EnsureDictionaryDecoded(values);
+
+// Only promote types for binary functions

Review comment:
   This comment doesn't look correct.








[GitHub] [arrow] cyb70289 commented on a change in pull request #10690: ARROW-13283: [Developer Tools] Support passing CPU/memory limits to Docker

2021-07-11 Thread GitBox


cyb70289 commented on a change in pull request #10690:
URL: https://github.com/apache/arrow/pull/10690#discussion_r667586087



##
File path: docker-compose.yml
##
@@ -62,6 +62,14 @@ x-ccache: &ccache
   CCACHE_MAXSIZE: 500M
   CCACHE_DIR: /ccache
 
+# CPU/memory limits to pass to Docker. These values emulate GitHub Actions:
+# 
https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners
+x-limits:
+  # Note we use cpuset and not cpus since Ninja only detects and limits 
parallelism

Review comment:
   Tested `--cpus` and `--cpuset-cpus`.
   With `--cpuset-cpus=0,1`, ninja spawns 2 jobs, as expected.
   With `--cpus=2`, ninja spawns 32 jobs on my 32 core machine, though only 2 
host cores can be used.
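   As a sketch of the takeaway (the service name, image, and command below are 
made up, not the Arrow compose file), a compose fragment that pins a build 
container to two cores so Ninja's CPU detection sees only those:

```yaml
# Hypothetical docker-compose fragment: `cpuset` hides all but the listed
# cores from the container, so tools like Ninja that size their job pool from
# the visible CPUs spawn only two jobs.  A `cpus: 2` limit would instead
# throttle CPU time while leaving every host core visible to nproc and Ninja.
services:
  cpp-build:
    image: ubuntu:20.04
    cpuset: "0,1"
    command: nproc   # reports 2 inside the container
```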








[GitHub] [arrow-rs] silathdiir commented on pull request #537: Implement `RecordBatch::concat`

2021-07-11 Thread GitBox


silathdiir commented on pull request #537:
URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-877912374


   Hi @alamb, referring to @jorgecarleitao's latest comment [#461 
(comment)](https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997),
 it seems that `arrow::compute::concat::concat_batches` may be better (even if 
not good enough for the threading model).
   I will follow the discussion and try to update this PR.






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #539: generate parquet schema from rust struct

2021-07-11 Thread GitBox


codecov-commenter edited a comment on pull request #539:
URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877857352


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539) Report
   > Merging [#539](https://codecov.io/gh/apache/arrow-rs/pull/539) (e13b9e7) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a) (f1fb2b1) will **decrease** coverage by `0.14%`.
   > The diff coverage is `22.01%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/539/graphs/tree.svg)](https://codecov.io/gh/apache/arrow-rs/pull/539)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #539      +/-   ##
   ==========================================
   - Coverage   82.60%   82.45%   -0.15%     
   ==========================================
     Files         167      167              
     Lines       45984    46084     +100     
   ==========================================
   + Hits        37984    37999      +15     
   - Misses       8000     8085      +85     
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/539) | Coverage Δ | |
   |---|---|---|
   | [parquet\_derive/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `0.00% <0.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `66.16% <12.50%> (-17.17%)` | :arrow_down: |
   | [parquet\_derive\_test/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `100.00% <100.00%> (ø)` | |
   
   [Continue to review the full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539). Last update [f1fb2b1...e13b9e7](https://codecov.io/gh/apache/arrow-rs/pull/539).
   Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).






[GitHub] [arrow] bkietz commented on pull request #10691: ARROW-13296: [C++] Provide a reflection compatible enum replacement

2021-07-11 Thread GitBox


bkietz commented on pull request #10691:
URL: https://github.com/apache/arrow/pull/10691#issuecomment-877881536


   I recommend avoiding macros whenever possible, but for completeness I'll 
note that we could replace
   ```c++
   // color.h
   struct Color : EnumType {
 using EnumType::EnumType;
 static constexpr const char* kValues = R"(red green blue)";
   };
   // color.cc
   constexpr const char* Color::kValues;
   ```
   
   with
   ```c++
   ARROW_ENUM(Color, red green blue);
   ```






[GitHub] [arrow-datafusion] houqp commented on issue #705: Any reason why logical plan is optimized twice?

2021-07-11 Thread GitBox


houqp commented on issue #705:
URL: 
https://github.com/apache/arrow-datafusion/issues/705#issuecomment-877873156


   That's a good point. Perhaps the better structure would be to push optimize 
into query evaluation/materialization methods like `save` and `collect`. This 
way, we can make sure the engine always executes with an optimized plan 
regardless of whether it's built from SQL or the plan builder.
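   A minimal sketch of that structure (every name below is invented for 
illustration, not DataFusion's real API): the optimizer call lives inside the 
materialization methods, so a plan is optimized before execution no matter how 
it was built:

```rust
// Toy model of the proposed structure; `LogicalPlan`, `DataFrame`, and the
// method bodies are stand-ins, not DataFusion's types.
#[derive(Clone)]
pub struct LogicalPlan {
    pub optimized: bool,
}

pub struct DataFrame {
    pub plan: LogicalPlan,
}

impl DataFrame {
    fn optimize(plan: &LogicalPlan) -> LogicalPlan {
        let _ = plan; // stand-in for the real optimizer passes
        LogicalPlan { optimized: true }
    }

    /// Materialize results in memory; always optimizes first.
    pub fn collect(&self) -> LogicalPlan {
        Self::optimize(&self.plan)
    }

    /// Persist results to disk; also funnels through the optimizer.
    pub fn save(&self) -> LogicalPlan {
        Self::optimize(&self.plan)
    }
}
```

Because both entry points funnel through `optimize`, a caller can no longer 
execute an unoptimized plan by picking one method over the other.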






[GitHub] [arrow-datafusion] andygrove commented on a change in pull request #714: Ballista: Shuffle write bug fix

2021-07-11 Thread GitBox


andygrove commented on a change in pull request #714:
URL: https://github.com/apache/arrow-datafusion/pull/714#discussion_r667545951



##
File path: ballista/rust/core/src/execution_plans/shuffle_writer.rs
##
@@ -254,13 +254,13 @@ impl ExecutionPlan for ShuffleWriterExec {
 
 // write batch out
 let start = Instant::now();
-match &mut writers[num_output_partition] {
+match &mut writers[output_partition_num] {
 Some(w) => {
 w.write(&output_batch)?;
 }
 None => {
 let mut path = path.clone();
-path.push(&format!("{}", partition));

Review comment:
   This is the fix








[GitHub] [arrow-datafusion] andygrove opened a new pull request #714: shuffle write bug fix

2021-07-11 Thread GitBox


andygrove opened a new pull request #714:
URL: https://github.com/apache/arrow-datafusion/pull/714


   # Which issue does this PR close?
   
   
   
   Closes #713 .
   
# Rationale for this change
   
   
   Simple bug fix. The wrong variable was being used for the partition number.
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   No
   
   
   






[GitHub] [arrow-datafusion] andygrove opened a new issue #713: Ballista: ShuffleWriterExec writing to wrong partition

2021-07-11 Thread GitBox


andygrove opened a new issue #713:
URL: https://github.com/apache/arrow-datafusion/issues/713


   **Describe the bug**
   ShuffleWriterExec writes all output partitions to the same file due to a 
simple bug where the wrong variable is used for the partition number.
   
   **To Reproduce**
   Implement a unit test
   
   **Expected behavior**
   Use the correct partition number and produce multiple files
   
   **Additional context**
   None
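   To illustrate the bug class only (the helper below is invented, not the 
Ballista code): the output file path must be derived from the output partition 
number, otherwise every partition's batches land in the same file:

```rust
// Hypothetical helper: each output partition gets its own file path.  The bug
// described above amounts to passing the same (input) partition value here
// for every output partition, collapsing all outputs into one file.
fn shuffle_file_path(base: &str, output_partition_num: usize) -> String {
    format!("{}/partition-{}", base, output_partition_num)
}
```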
   






[GitHub] [arrow-datafusion] Dandandan commented on issue #705: Any reason why logical plan is optimized twice?

2021-07-11 Thread GitBox


Dandandan commented on issue #705:
URL: 
https://github.com/apache/arrow-datafusion/issues/705#issuecomment-877869950


   I think it's not an oversight: `collect` is not necessarily called when 
executing queries, for example when writing the results to disk. As `collect` 
loads the result into memory in one thread/partition, this is not always what 
you want to do.
   We might do something to avoid optimizing twice when calling both functions, 
but maybe it's not a big deal for now?






[GitHub] [arrow] github-actions[bot] commented on pull request #10697: ARROW-13304: [C++] Unable to install nightly on Ubuntu 21.04 due to day of week options

2021-07-11 Thread GitBox


github-actions[bot] commented on pull request #10697:
URL: https://github.com/apache/arrow/pull/10697#issuecomment-877865260


   https://issues.apache.org/jira/browse/ARROW-13304






[GitHub] [arrow] rok opened a new pull request #10697: ARROW-13304: [C++] Unable to install nightly on Ubuntu 21.04 due to day of week options

2021-07-11 Thread GitBox


rok opened a new pull request #10697:
URL: https://github.com/apache/arrow/pull/10697


   To address [ARROW-13304](https://issues.apache.org/jira/browse/ARROW-13304).






[GitHub] [arrow-rs] jhorstmann commented on pull request #389: make slice work for nested types

2021-07-11 Thread GitBox


jhorstmann commented on pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-877864847


   This looks good and makes sense to me. StructArray (and probably union too) 
is a bit of a special case in how offsets are propagated. Since we currently 
do not push down offsets into the buffers, the offset of an array applies to 
the direct buffers of the array (including the validity buffer). That is still 
the case for StructArray, which has no other buffers, so arrays still behave 
consistently in that regard.
   
   A test for concatenation of sliced StructArrays would be nice to have.
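   A toy model of the offset rule (plain structs, not the arrow-rs types): a 
slice records a single offset on the parent, and values are read through that 
offset rather than rewriting any buffer:

```rust
// Toy model, not arrow-rs: the parent's offset applies when reading through
// to the child's values; slicing only adjusts offset and length (zero-copy
// in spirit), standing in for how the validity buffer is also addressed.
#[derive(Clone)]
struct ChildArray {
    values: Vec<i64>,
}

#[derive(Clone)]
struct StructArray {
    offset: usize,
    len: usize,
    child: ChildArray,
}

impl StructArray {
    // Slicing composes offsets; no data is copied or shifted.
    fn slice(&self, offset: usize, len: usize) -> StructArray {
        assert!(offset + len <= self.len);
        StructArray {
            offset: self.offset + offset,
            len,
            child: self.child.clone(),
        }
    }

    fn value(&self, i: usize) -> i64 {
        assert!(i < self.len);
        self.child.values[self.offset + i]
    }
}
```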






[GitHub] [arrow-rs] bryantbiggs commented on pull request #539: generate parquet schema from rust struct

2021-07-11 Thread GitBox


bryantbiggs commented on pull request #539:
URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877862396


   nice!!!  






[GitHub] [arrow] pachadotdev commented on a change in pull request #10624: ARROW-12992: [R] bindings for substr(), substring(), str_sub()

2021-07-11 Thread GitBox


pachadotdev commented on a change in pull request #10624:
URL: https://github.com/apache/arrow/pull/10624#discussion_r667533252



##
File path: r/R/dplyr-functions.R
##
@@ -280,6 +280,45 @@ nse_funcs$str_trim <- function(string, side = c("both", 
"left", "right")) {
   Expression$create(trim_fun, string)
 }
 
+nse_funcs$substr <- function(string, start, stop) {
+  assert_that(
+length(start) == 1,
+msg = "Start of length != 1 not supported in Arrow"
+  )
+  assert_that(
+length(end) == 1,
+msg = "End of length != 1 not supported in Arrow"
+  )
+
+  if (start > stop) {
+start <- 0
+stop <- 0
+  } else {
+start <- max(0, start - 1)
+stop <- max(0, stop)
+start_stop <- c(min(start, stop), max(start, stop))
+start <- start_stop[1]
+stop <- start_stop[2]
+  }

Review comment:
   hmmm maybe it's just a bad attempt from me to avoid situations where 
start > end
   I'll remove this








[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877858042


   > Maybe we do need two separate configs after all as per @Dandandan PR. 
Maybe with concurrency renamed to something more specific to readers.
   
   For now, something specific to readers seems fine to me.
   
   For the future, I think we also need something for setting the max number 
of threads when running multiple executors / DataFusion processes on one node 
(using the builder). By default this uses the number of CPU cores, but if you 
run multiple Ballista workers on one node it would be better to lower this.
   
   https://docs.rs/tokio/1.8.1/tokio/runtime/struct.Builder.html#






[GitHub] [arrow] github-actions[bot] commented on pull request #10696: ARROW-13303: [JS] Revise bundles

2021-07-11 Thread GitBox


github-actions[bot] commented on pull request #10696:
URL: https://github.com/apache/arrow/pull/10696#issuecomment-877858022


   https://issues.apache.org/jira/browse/ARROW-13303






[GitHub] [arrow] domoritz opened a new pull request #10696: ARROW-13303: [JS] Revise bundles

2021-07-11 Thread GitBox


domoritz opened a new pull request #10696:
URL: https://github.com/apache/arrow/pull/10696


   Merge after #10673 






[GitHub] [arrow-rs] codecov-commenter commented on pull request #539: generate parquet schema from rust struct

2021-07-11 Thread GitBox


codecov-commenter commented on pull request #539:
URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877857352


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539) Report
   > Merging [#539](https://codecov.io/gh/apache/arrow-rs/pull/539) (1984adf) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a) (f1fb2b1) will **decrease** coverage by `0.14%`.
   > The diff coverage is `6.97%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/539/graphs/tree.svg)](https://codecov.io/gh/apache/arrow-rs/pull/539)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #539      +/-   ##
   ==========================================
   - Coverage   82.60%   82.45%   -0.15%     
   ==========================================
     Files         167      167              
     Lines       45984    46066      +82     
   ==========================================
   + Hits        37984    37986       +2     
   - Misses       8000     8080      +80     
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/539) | Coverage Δ | |
   |---|---|---|
   | [parquet\_derive/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `0.00% <0.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `66.23% <0.00%> (-17.10%)` | :arrow_down: |
   | [parquet\_derive\_test/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/539/diff) | `100.00% <100.00%> (ø)` | |
   
   [Continue to review the full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/539). Last update [f1fb2b1...1984adf](https://codecov.io/gh/apache/arrow-rs/pull/539).
   Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).






[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877857197


   > Maybe we do need two separate configs after all as per @Dandandan PR. 
Maybe with concurrency renamed to something more specific to readers.
   
   I believe that's correct. It might not be a problem when reading from one 
source, but if multiple sources come in, one task that waits on results from 
one source prevents other tasks from being started, which might limit the 
utilization of both CPU and IO.






[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


andygrove commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877856405


   Maybe we do need two separate configs after all as per @Dandandan PR. Maybe 
with concurrency renamed to something more specific to readers.






[GitHub] [arrow] kou merged pull request #10694: MINOR: [C++] Fix a typo

2021-07-11 Thread GitBox


kou merged pull request #10694:
URL: https://github.com/apache/arrow/pull/10694


   






[GitHub] [arrow-rs] nevi-me commented on pull request #539: generate parquet schema from rust struct

2021-07-11 Thread GitBox


nevi-me commented on pull request #539:
URL: https://github.com/apache/arrow-rs/pull/539#issuecomment-877855410


   Might be of interest to @xrl and @bryantbiggs as you were the original 
authors






[GitHub] [arrow-rs] nevi-me opened a new pull request #539: generate parquet schema from rust struct

2021-07-11 Thread GitBox


nevi-me opened a new pull request #539:
URL: https://github.com/apache/arrow-rs/pull/539


   # Which issue does this PR close?
   
   None
   
   # Rationale for this change

   Users of `parquet_derive` currently have to write the schema of their Rust 
struct by hand.
   This is inconvenient; the schema can be generated for them.
   
   # What changes are included in this PR?
   
   Adds `parquet::record::RecordWriter::schema()` trait member, and an 
implementation in `parquet_derive` to generate a schema for the user.
   
   # Are there any user-facing changes?
   
   Yes, new API. The breakage is probably irrelevant as I don't think that 
there are many users of `parquet_derive`.






[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


jorgecarleitao commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877851471


   > What's the ideal design going forward? If the end goal is to create one 
async task for each partition and let the Tokio thread pool manage the 
parallelism (threads), then I think the proper change we want to introduce 
should be getting rid of `spawn_blocking` and making `read_files` fully async.
   
   All our readers are blocking; if we do not run them on `spawn_blocking`, 
won't they block Tokio's runtime?






[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


andygrove commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877850054


   @Dandandan Ah, ok. I had not understood fully what was happening there. I 
think the goal is to have one async task per partition and to remove the 
spawn_blocking, as @houqp suggested.






[GitHub] [arrow-datafusion] houqp commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


houqp commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877847808


   What's the ideal design going forward? If the end goal is to create one 
async task for each partition and let the Tokio thread pool manage the 
parallelism (threads), then I think the proper change we want to introduce 
should be getting rid of `spawn_blocking` and making `read_files` fully async.






[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877844585


   > > One concern I have is that the current config also sets the number of 
maximum threads during reading parquet files.
   > 
   > Is this still true though? I know we were creating threads at one point in 
time but we are using Tokio/async now, so we are not creating threads. 
Increasing partition count will increase the number of async tasks that we run 
in the thread pool but won't increase the number of threads.
   
   We now run the tasks with `spawn_blocking`, which still creates a number 
of extra threads to execute them on; Tokio's blocking pool is configured to 
create up to 512(!) threads by default. Based on `max_concurrency` we still 
split the files across multiple parallel readers, so increasing this value 
considerably increases the number of extra threads (and allocated data) we 
use, as far as I can see.
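The interaction between partition count and the blocking-thread pool can be illustrated with a std-only sketch. This is a hedged analogue, not Tokio's or DataFusion's actual API (`run_blocking_jobs` and its job type are invented for illustration): a *bounded* pool of workers drains a queue of blocking "read" jobs, so the thread count stays fixed no matter how many partitions are queued, unlike spawning one blocking thread per partition.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative only: `n_threads` workers drain a shared queue of blocking
// jobs. Each job just doubles a number, standing in for a blocking file read.
fn run_blocking_jobs(n_threads: usize, jobs: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    let queue = Arc::new(Mutex::new(jobs));
    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let tx = tx.clone();
        let queue = Arc::clone(&queue);
        handles.push(thread::spawn(move || loop {
            // Pop one job; the lock guard is dropped before the "read" runs.
            let job = queue.lock().unwrap().pop();
            match job {
                Some(j) => tx.send(j * 2).unwrap(), // stand-in for blocking IO
                None => break,
            }
        }));
    }
    drop(tx); // close the channel once only the workers hold senders
    let total: u64 = rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    // Eight "partitions" read by at most two threads at a time.
    let total = run_blocking_jobs(2, (1..=8).collect());
    assert_eq!(total, 72); // 2 * (1 + 2 + ... + 8)
}
```

The point of the bound is the one raised above: without it, the number of blocking threads grows with the partition count.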






[GitHub] [arrow-datafusion] andygrove commented on pull request #709: Ballista: Fix shuffle writes

2021-07-11 Thread GitBox


andygrove commented on pull request #709:
URL: https://github.com/apache/arrow-datafusion/pull/709#issuecomment-877843644


   @edrevo fyi






[GitHub] [arrow-datafusion] andygrove commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


andygrove commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877843494


   > One concern I have is that the current config also sets the number of 
maximum threads during reading parquet files.
   
   Is this still true though? I know we were creating threads at one point in 
time but we are using Tokio/async now, so we are not creating threads. 
Increasing partition count will increase the number of async tasks that we run 
in the thread pool but won't increase the number of threads.






[GitHub] [arrow-datafusion] andygrove opened a new pull request #712: Implement serde for ShuffleWriterExec

2021-07-11 Thread GitBox


andygrove opened a new pull request #712:
URL: https://github.com/apache/arrow-datafusion/pull/712


   # Which issue does this PR close?
   
   
   
   Closes #710 .
   
# Rationale for this change
   
   
   We need this to complete the shuffle implementation.
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   No
   
   
   






[GitHub] [arrow-datafusion] andygrove opened a new issue #711: Ballista: Executor must return statistics in CompletedTask / CompletedJob

2021-07-11 Thread GitBox


andygrove opened a new issue #711:
URL: https://github.com/apache/arrow-datafusion/issues/711


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   We cannot fix the shuffle mechanism until we have partition stats, or 
ShuffleReaderExec will attempt to read empty partitions, causing an error.
   
   **Describe the solution you'd like**
   The scheduler should receive partition stats and only try to read from 
non-empty shuffle partitions.
   
   **Describe alternatives you've considered**
   As a workaround we could write empty shuffle files for empty partitions.
   
   **Additional context**
   None
   






[GitHub] [arrow-datafusion] andygrove opened a new issue #710: Ballista: Implement physical plan serde for ShuffleWriterExec

2021-07-11 Thread GitBox


andygrove opened a new issue #710:
URL: https://github.com/apache/arrow-datafusion/issues/710


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   We can't get shuffle writes working correctly until we implement serde for 
this operator.
   
   **Describe the solution you'd like**
   Implement physical plan serde for ShuffleWriterExec
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   






[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan edited a comment on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877831879


   I think this is mostly good.
   
   One concern I have is that the current config also sets the number of 
maximum threads during reading parquet files.
   
   For Ballista, I also think it makes sense to configure the number of threads 
Tokio uses. When using multiple executors per machine/node you'll likely want 
to limit this to avoid too many (Tokio) threads and the higher memory usage 
they bring. This could be set when creating/configuring the Tokio Runtime.
   






[GitHub] [arrow-datafusion] houqp commented on pull request #704: replace once iter chain with array::IntoIter

2021-07-11 Thread GitBox


houqp commented on pull request #704:
URL: https://github.com/apache/arrow-datafusion/pull/704#issuecomment-877836703


   @jorgecarleitao yes, this is a relatively new API that landed in stable Rust 
1.51.






[GitHub] [arrow-datafusion] Dandandan commented on pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan commented on pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#issuecomment-877831879


   I think this is mostly good.
   
   One concern I have is that the current config also sets the number of 
maximum threads during reading parquet files.
   
   For Ballista, I also think it makes sense to configure the number of threads 
Tokio uses. When using multiple executors per machine/node you'll likely want 
to avoid having too many (Tokio) threads, to avoid higher memory usage.
   






[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


Dandandan commented on a change in pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706#discussion_r667508175



##
File path: datafusion/src/logical_plan/builder.rs
##
@@ -147,10 +147,10 @@ impl LogicalPlanBuilder {
 pub fn scan_parquet_with_name(
 path: impl Into<String>,
 projection: Option<Vec<usize>>,
-max_concurrency: usize,
+max_partitions: usize,
 table_name: impl Into<String>,
 ) -> Result<Self> {
-let provider = Arc::new(ParquetTable::try_new(path, max_concurrency)?);
+let provider = Arc::new(ParquetTable::try_new(path, max_partitions)?);

Review comment:
   One concern: when increasing the partitions, we also increase the maximum 
number of threads used while reading Parquet. I think this should be 
decoupled.








[GitHub] [arrow-datafusion] andygrove opened a new pull request #709: Ballista: Fix shuffle writes

2021-07-11 Thread GitBox


andygrove opened a new pull request #709:
URL: https://github.com/apache/arrow-datafusion/pull/709


   # Which issue does this PR close?
   
   
   
   Closes #707 .
   
# Rationale for this change
   
   
   Shuffles were broken. The executor always ran the shuffle writes with 
partitioning of `None` instead of the hash partitioning they were supposed to 
use. This information was never sent as part of the protobuf. Queries still 
worked correctly, but this did not scale since there was always a single 
partition.
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   No
   
   
   






[GitHub] [arrow-datafusion] andygrove commented on issue #707: Ballista: Avoid pointless shuffle write of Parquet partitions

2021-07-11 Thread GitBox


andygrove commented on issue #707:
URL: 
https://github.com/apache/arrow-datafusion/issues/707#issuecomment-877826983


   It turns out there is a fundamental bug here. I am working on fixing this.






[GitHub] [arrow-datafusion] andygrove commented on pull request #683: Introduce (default) number of partitions option, use it in DataFusion/Ballista

2021-07-11 Thread GitBox


andygrove commented on pull request #683:
URL: https://github.com/apache/arrow-datafusion/pull/683#issuecomment-877817032


   Please take a look at my proposal in 
https://github.com/apache/arrow-datafusion/pull/706






[GitHub] [arrow-datafusion] andygrove opened a new issue #708: Ballista: Remove hard-coded concurrency from logical plan serde code

2021-07-11 Thread GitBox


andygrove opened a new issue #708:
URL: https://github.com/apache/arrow-datafusion/issues/708


   **Describe the bug**
   
   ```rust
   LogicalPlanBuilder::scan_parquet_with_name(
       &path,
       projection,
       24,
       &table_name,
   )? //TODO concurrency
   ```
   






[GitHub] [arrow-datafusion] andygrove opened a new issue #707: Ballista: Avoid pointless shuffle write of Parquet partitions

2021-07-11 Thread GitBox


andygrove opened a new issue #707:
URL: https://github.com/apache/arrow-datafusion/issues/707


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   When running TPC-H query 12 I see query plans like this being executed:
   
   ```
   ShuffleWriterExec: None
 ParquetExec: ...
   ```
   
   This means we are reading a Parquet partition and immediately streaming it 
to disk in IPC format so that it can be read again later. This is introducing a 
redundant write and read of this data.
   
   **Describe the solution you'd like**
   TBD
   
   **Describe alternatives you've considered**
   TBD
   
   **Additional context**
   None
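A hedged sketch of the direction the issue describes, using illustrative types only (this is not Ballista's actual `Partitioning` enum or executor API): when a shuffle stage requests no repartitioning, the write step it triggers is the redundant one.

```rust
// Toy model of the decision: a stage with no output partitioning does not
// need its input rewritten to disk in IPC format; it could be served
// directly. All names here are invented for illustration.
#[derive(Debug, PartialEq)]
enum Partitioning {
    None,        // pass-through: no repartitioning requested
    Hash(usize), // repartition into `n` output partitions
}

fn needs_shuffle_write(p: &Partitioning) -> bool {
    match p {
        // `ShuffleWriterExec: None` over a ParquetExec is the redundant
        // write + read described above.
        Partitioning::None => false,
        Partitioning::Hash(_) => true,
    }
}

fn main() {
    assert!(!needs_shuffle_write(&Partitioning::None));
    assert!(needs_shuffle_write(&Partitioning::Hash(4)));
}
```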
   






[GitHub] [arrow-datafusion] andygrove opened a new pull request #706: Rename concurrency to default_partitions

2021-07-11 Thread GitBox


andygrove opened a new pull request #706:
URL: https://github.com/apache/arrow-datafusion/pull/706


   # Which issue does this PR close?
   
   
   
   Closes #685 .
   
# Rationale for this change
   
   
   We originally used the `concurrency` config setting to determine how many 
threads to launch in certain parts of the code base but we now use Tokio to 
manage task concurrency. The `concurrency` setting is only used to determine 
the number of partitions to use when repartitioning the plan. We should 
therefore rename this config setting.
   
   # What changes are included in this PR?
   
   
   Introduce new `with_default_partitions` method and leave `with_concurrency` 
in place for now so that we don't break the API.
   
   # Are there any user-facing changes?
   
   
   If users were accessing the `concurrency` field in `ExecutionContext` then 
they will now need to use `default_partitions` instead. Users should not be 
accessing these attributes directly though.
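The rename-with-compatibility pattern described above (a new builder method plus a thin alias for the old name) can be sketched as follows. `Config` and its field are illustrative stand-ins, not DataFusion's actual `ExecutionConfig`:

```rust
// Hypothetical sketch of keeping the old builder method as a deprecated
// alias so existing callers keep compiling.
#[derive(Debug, Clone)]
struct Config {
    default_partitions: usize,
}

impl Config {
    fn new() -> Self {
        Config { default_partitions: num_cpus_guess() }
    }

    /// New name: number of partitions used when repartitioning a plan.
    fn with_default_partitions(mut self, n: usize) -> Self {
        self.default_partitions = n;
        self
    }

    /// Old name kept for API compatibility; delegates to the new method.
    #[deprecated(note = "use with_default_partitions")]
    fn with_concurrency(self, n: usize) -> Self {
        self.with_default_partitions(n)
    }
}

fn num_cpus_guess() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    #[allow(deprecated)]
    let old_style = Config::new().with_concurrency(16);
    assert_eq!(old_style.default_partitions, 16);
    let new_style = Config::new().with_default_partitions(8);
    assert_eq!(new_style.default_partitions, 8);
}
```

The `#[deprecated]` attribute gives callers a migration warning without breaking them, which matches the "leave `with_concurrency` in place for now" approach above.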
   
   
   






[GitHub] [arrow-rs] ritchie46 commented on issue #461: Implement RecordBatch::concat

2021-07-11 Thread GitBox


ritchie46 commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877811760


   I think a lean arrow-core crate would be beneficial. We already started some 
work there by feature-gating IO logic. 
   
   Wouldn't the same be achieved by `impl FromIterator for 
RecordBatch`? Then users can use the `concat` kernel and choose how to iterate 
the columns. They also don't pay the compilation penalty until they actually 
implement that iterator + collect.
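The `FromIterator` idea can be sketched on a toy batch type. This is a simplified illustration, not the arrow-rs API: plain `Vec<i64>` columns stand in for Arrow arrays, and a real `RecordBatch` version would also have to check schemas.

```rust
// Toy batch: a list of equal-length columns.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<Vec<i64>>,
}

impl FromIterator<Batch> for Batch {
    fn from_iter<I: IntoIterator<Item = Batch>>(iter: I) -> Self {
        let mut out: Vec<Vec<i64>> = Vec::new();
        for batch in iter {
            if out.is_empty() {
                out = batch.columns;
            } else {
                // Concatenate column-wise; assumes matching column counts.
                for (dst, src) in out.iter_mut().zip(batch.columns) {
                    dst.extend(src);
                }
            }
        }
        Batch { columns: out }
    }
}

fn main() {
    let a = Batch { columns: vec![vec![1, 2], vec![10, 20]] };
    let b = Batch { columns: vec![vec![3], vec![30]] };
    // The caller drives the iteration; `collect` does the concatenation.
    let merged: Batch = vec![a, b].into_iter().collect();
    assert_eq!(merged.columns, vec![vec![1, 2, 3], vec![10, 20, 30]]);
}
```

As the comment argues, the concatenation cost is only paid by callers who actually collect, and they remain free to parallelize per column instead.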






[GitHub] [arrow-rs] jorgecarleitao commented on issue #461: Implement RecordBatch::concat

2021-07-11 Thread GitBox


jorgecarleitao commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997


   Let me try to explain my reasoning atm.
   
   All methods exposed on `Array` are `O(1)`. In particular, `.slice` is `O(1)` 
over the array, and thus `O(c)` over the record where `c` is the number of 
fields.
   
   `concat` over `RecordBatch` seems rather simple but is `O(c * n * r)`, where 
`c` is the number of columns, `r` the number of records, and `n` the average 
length of the records. Since `c` is trivially parallelizable, I would say that 
the natural implementation is to actually rayon it, i.e. `columns().iter_par()...`.
   
   Generally, I consider non-parallel iterations over a record to be an 
anti-pattern, since parallelism over columns is one of the hallmarks of 
columnar formats. Imo the decision of how to iterate over columns does not 
belong to `arrow-rs`, but to Polars,  DataFusion and the like. `DataFusion` 
uses `iter`; polars uses `iter_par` for the most part.
   
   We do have some methods in compute that `iter` over the `RecordBatch` that 
follow this pattern (`filter` and `sort` I believe). So, in this context, I 
would be more inclined to place `concat_record` at the same level as them: 
methods that are not `O(1)` over the arrays' length that some may use when they 
do not want to commit to a multi-threaded execution. But again, imo this is an 
anti-pattern that we should not promote, as it enforces a specific threading 
model over columns.
   
   The reasoning to have it in `compute` is to not drag compute dependencies to 
the core modules (I see them as being the datatypes, array, buffer, 
RecordBatch, alloc). The reason being that `compute` has a massive compile time 
when compared to the rest of the crate, and keeping these separated makes it 
easier to split `arrow` in two crates (`arrow` and `arrow-compute`) to reduce 
compile and/or binary size. This is minor and can be solved by moving `impl 
RecordBatch` to `compute::kernels::concat` if the need arises.
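The "parallel over columns" point above can be illustrated with a std-only sketch (toy `Vec<i64>` columns rather than Arrow arrays; `concat_columns` is an invented name): each output column is built independently, so the `O(c * n * r)` work splits naturally across `c` threads.

```rust
use std::thread;

// Concatenate a list of batches (each a list of columns) column-wise,
// spawning one scoped thread per column. Assumes every batch has the
// same number of columns.
fn concat_columns(batches: &[Vec<Vec<i64>>]) -> Vec<Vec<i64>> {
    let n_cols = batches.first().map_or(0, |b| b.len());
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_cols)
            .map(|c| {
                s.spawn(move || {
                    // Build output column `c` across all batches.
                    batches
                        .iter()
                        .flat_map(|b| b[c].iter().copied())
                        .collect::<Vec<i64>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let batches = vec![
        vec![vec![1, 2], vec![10, 20]],
        vec![vec![3], vec![30]],
    ];
    assert_eq!(concat_columns(&batches), vec![vec![1, 2, 3], vec![10, 20, 30]]);
}
```

Scoped threads stand in here for the rayon `iter_par` mentioned above; the threading model is the caller's choice either way, which is the comment's point about not baking it into arrow-rs.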






[GitHub] [arrow-datafusion] andygrove commented on pull request #683: Introduce (default) number of partitions option, use it in DataFusion/Ballista

2021-07-11 Thread GitBox


andygrove commented on pull request #683:
URL: https://github.com/apache/arrow-datafusion/pull/683#issuecomment-877808503


   I spent some more time reviewing the codebase this morning and it appears 
that we no longer use the `concurrency` config to determine concurrency as in 
number of threads, but only use it to determine partition counts when 
repartitioning, so I think we should just rename this field, rather than 
introduce a new config. Tokio is managing concurrency level automatically.






[GitHub] [arrow-rs] nevi-me commented on issue #334: Cast of utf8 types and list container types don't respect offset

2021-07-11 Thread GitBox


nevi-me commented on issue #334:
URL: https://github.com/apache/arrow-rs/issues/334#issuecomment-877794567


   Can be closed






[GitHub] [arrow-rs] nevi-me closed issue #334: Cast of utf8 types and list container types don't respect offset

2021-07-11 Thread GitBox


nevi-me closed issue #334:
URL: https://github.com/apache/arrow-rs/issues/334


   






[GitHub] [arrow-rs] nevi-me commented on issue #518: Parquet list reader not reading correct element lengths

2021-07-11 Thread GitBox


nevi-me commented on issue #518:
URL: https://github.com/apache/arrow-rs/issues/518#issuecomment-877794108


   @mcassels I've opened this issue to track the bug that you identified






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: Minimal MapArray support

2021-07-11 Thread GitBox


codecov-commenter edited a comment on pull request #491:
URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/491) Report
   > Merging [#491](https://codecov.io/gh/apache/arrow-rs/pull/491) (f52d13e) 
into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a) 
(f1fb2b1) will **increase** coverage by `0.02%`.
   > The diff coverage is `80.10%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #491      +/-   ##
   ==========================================
   + Coverage   82.60%   82.62%   +0.02%
   ==========================================
     Files         167      168       +1
     Lines       45984    46717     +733
   ==========================================
   + Hits        37984    38599     +615
   - Misses       8000     8118     +118
   ```
   
   | Impacted Files | Coverage Δ | |
   |---|---|---|
   | arrow/src/ipc/writer.rs | `86.68% <ø> (ø)` | |
   | arrow/src/array/equal_json.rs | `88.69% <33.33%> (-2.52%)` | :arrow_down: |
   | arrow/src/datatypes/field.rs | `51.36% <45.00%> (-0.47%)` | :arrow_down: |
   | arrow/src/array/equal/mod.rs | `93.78% <50.00%> (-0.16%)` | :arrow_down: |
   | arrow/src/datatypes/datatype.rs | `66.08% <50.00%> (-0.74%)` | :arrow_down: |
   | parquet/src/arrow/levels.rs | `82.65% <65.38%> (-0.43%)` | :arrow_down: |
   | parquet/src/arrow/array_reader.rs | `77.70% <77.67%> (+0.25%)` | :arrow_up: |
   | arrow/src/array/builder.rs | `85.83% <81.44%> (-0.29%)` | :arrow_down: |
   | arrow/src/util/integration_util.rs | `70.10% <81.81%> (+0.54%)` | :arrow_up: |
   | parquet/src/arrow/schema.rs | `87.35% <81.88%> (-1.18%)` | :arrow_down: |
   | ... and 21 more | | |

[GitHub] [arrow-rs] nevi-me commented on pull request #491: Minimal MapArray support

2021-07-11 Thread GitBox


nevi-me commented on pull request #491:
URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-877791486


   @alamb @jorgecarleitao this is now ready for review






[GitHub] [arrow-rs] nevi-me opened a new issue #538: DataType::Map tracking issue

2021-07-11 Thread GitBox


nevi-me opened a new issue #538:
URL: https://github.com/apache/arrow-rs/issues/538


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   We are adding support for the map data type, which is a list of key-value 
pairs à la hashmap.
   This issue tracks outstanding functionality without opening multiple issues 
for each at once.
   
   The outstanding work after #395 and #491 are:
   - [ ] Integration testing (https://issues.apache.org/jira/browse/ARROW-13300)
   - [ ] Non-canonical map support (where we don't check that the schema names 
match exactly)
   - [ ] Nested maps [arrow, parquet (write)]
   - [ ] improving map builder support to behave like a hashmap
   - [ ] JSON writer
   






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: Minimal MapArray support

2021-07-11 Thread GitBox


codecov-commenter edited a comment on pull request #491:
URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#491](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (85cda50) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (f1fb2b1) will **increase** coverage by `0.02%`.
   > The diff coverage is `80.20%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/491/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #491  +/-   ##
   ==
   + Coverage   82.60%   82.62%   +0.02% 
   ==
 Files 167  168   +1 
 Lines   4598446717 +733 
   ==
   + Hits3798438599 +615 
   - Misses   8000 8118 +118 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy93cml0ZXIucnM=)
 | `86.68% <ø> (ø)` | |
   | 
[arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsX2pzb24ucnM=)
 | `88.69% <33.33%> (-2.52%)` | :arrow_down: |
   | 
[arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==)
 | `51.36% <45.00%> (-0.47%)` | :arrow_down: |
   | 
[arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL21vZC5ycw==)
 | `93.78% <50.00%> (-0.16%)` | :arrow_down: |
   | 
[arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==)
 | `66.08% <50.00%> (-0.74%)` | :arrow_down: |
   | 
[parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz)
 | `82.65% <65.38%> (-0.43%)` | :arrow_down: |
   | 
[parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz)
 | `77.70% <77.67%> (+0.25%)` | :arrow_up: |
   | 
[arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=)
 | `85.83% <81.44%> (-0.29%)` | :arrow_down: |
   | 
[arrow/src/util/integration\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL3V0aWwvaW50ZWdyYXRpb25fdXRpbC5ycw==)
 | `70.10% <81.81%> (+0.54%)` | :arrow_up: |
   | 
[parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz)
 | `87.35% <81.88%> (-1.18%)` | :arrow_down: |
   | ... and [21 

[GitHub] [arrow-rs] alamb commented on a change in pull request #532: Add tests for building applications using arrow with different feature flags

2021-07-11 Thread GitBox


alamb commented on a change in pull request #532:
URL: https://github.com/apache/arrow-rs/pull/532#discussion_r667468941



##
File path: .github/workflows/rust.yml
##
@@ -357,15 +353,22 @@ jobs:
 uses: actions/cache@v2
 with:
   path: /github/home/target
-  key: ${{ runner.os }}-${{ matrix.arch }}-target-wasm32-cache-${{ matrix.rust }}
+  # this key equals the ones on `linux-build-lib` for re-use
+  key: ${{ runner.os }}-${{ matrix.arch }}-target-cache-${{ matrix.rust }}
   - name: Setup Rust toolchain
 run: |
   rustup toolchain install ${{ matrix.rust }}
   rustup override set ${{ matrix.rust }}
   rustup component add rustfmt
-  - name: Build arrow crate
+  - name: Build with default features
 run: |
   export CARGO_HOME="/github/home/.cargo"
   export CARGO_TARGET_DIR="/github/home/target"
-  cd arrow
-  cargo check --all-targets --no-default-features

Review comment:
   I think the reason CI isn't catching issues with feature flags is due to 
the cargo semantics of a `workspace` -- where Cargo tries to calculate the 
dependency needs of the workspace as a whole. Since we have some crates in the 
workspace (like `arrow-flight`) that bring in additional dependencies, these 
additional dependencies can obscure build problems with other crates (like 
`arrow`) when used alone 








[GitHub] [arrow-rs] alamb commented on a change in pull request #532: Add tests for building applications using arrow with different feature flags

2021-07-11 Thread GitBox


alamb commented on a change in pull request #532:
URL: https://github.com/apache/arrow-rs/pull/532#discussion_r667468650



##
File path: arrow/test/dependency/no-default-features/Cargo.toml
##
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "no-default-features"
+description = "Models a user application of arrow that specifies no-default-features=true"
+version = "0.1.0"
+edition = "2018"
+
+# See more keys and their definitions at 
https://doc.rust-lang.org/cargo/reference/manifest.html
+
+[dependencies]
+arrow = { path = "../../../../arrow", version = "5.0.0-SNAPSHOT", default-features = false }
+
+# Workaround for https://github.com/apache/arrow-rs/issues/529
+rand = { version = "0.8" }
+
+[workspace]

Review comment:
   Actually, this is required (you get a cargo build error without it). 
Somehow this signals to cargo that this project is not part of the workspace 
rooted at `arrow-rs/Cargo.toml`








[GitHub] [arrow-rs] alamb commented on pull request #537: Implement `RecordBatch::concat`

2021-07-11 Thread GitBox


alamb commented on pull request #537:
URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-877785281


   Although @nevi-me seems to like `RecordBatch::concat` -- 
https://github.com/apache/arrow-rs/issues/461#issuecomment-877784891






[GitHub] [arrow-rs] alamb commented on issue #461: Implement RecordBatch::concat

2021-07-11 Thread GitBox


alamb commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877785205


   > I think this would aid discoverability, in the sense that a person asking 
"what can I do with a record batch?" would look at impl RecordBatch and 
discover that easily.
   
   
   I am convinced by this argument  






[GitHub] [arrow-rs] nevi-me commented on issue #461: Implement RecordBatch::concat

2021-07-11 Thread GitBox


nevi-me commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877784891


   I'd lean towards `RecordBatch::filter` and `RecordBatch::concat` as long as 
that ends up being a convenience that iterates through all columns, and calls 
the underlying compute or other function.
   
   Similar to a `RecordBatch::slice` that calls `array.slice` internally.
   
   I think this would aid discoverability, in the sense that a person asking 
"what can I do with a record batch?" would look at `impl RecordBatch` and 
discover that easily.






[GitHub] [arrow-rs] alamb commented on a change in pull request #537: Implement `RecordBatch::concat`

2021-07-11 Thread GitBox


alamb commented on a change in pull request #537:
URL: https://github.com/apache/arrow-rs/pull/537#discussion_r667467518



##
File path: arrow/src/record_batch.rs
##
@@ -639,4 +669,108 @@ mod tests {
 assert_eq!(batch.column(0).as_ref(), boolean.as_ref());
 assert_eq!(batch.column(1).as_ref(), int.as_ref());
 }
+
+#[test]
+fn concat_empty_record_batches() {
+let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
+let batch = RecordBatch::concat(&schema, &[]).unwrap();
+assert_eq!(batch.schema().as_ref(), schema.as_ref());
+assert_eq!(0, batch.num_rows());
+}
+
+#[test]
+fn concat_record_batches_of_same_schema() {
+let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
+let batch1 = RecordBatch::try_new(
+schema.clone(),
+vec![Arc::new(Int32Array::from(vec![1, 2]))],
+)
+.unwrap();
+let batch2 = RecordBatch::try_new(
+schema.clone(),
+vec![Arc::new(Int32Array::from(vec![3, 4]))],
+)
+.unwrap();
+let new_batch = RecordBatch::concat(&schema, &[batch1, batch2]).unwrap();
+assert_eq!(new_batch.schema().as_ref(), schema.as_ref());
+assert_eq!(4, new_batch.num_rows());
+}
+
+#[test]
+fn concat_record_batches_of_similar_schemas() {

Review comment:
   I would suggest this case be treated as an error -- if a user wants this 
behavior I think they should explicitly have to specifically pick out the 
columns they want and create record batches with the same schema

##
File path: arrow/src/record_batch.rs
##
@@ -353,6 +354,35 @@ impl RecordBatch {
 let schema = Arc::new(Schema::new(fields));
 RecordBatch::try_new(schema, columns)
 }
+
+/// Concatenates `batches` together into a single record batch.
+pub fn concat(schema: &SchemaRef, batches: &[Self]) -> Result<Self> {
+if batches.is_empty() {
+return Ok(RecordBatch::new_empty(schema.clone()));
+}
+let field_num = schema.fields().len();
+if let Some((i, _)) = batches
+.iter()
+.enumerate()
+.find(|&(_, batch)| batch.num_columns() < field_num)

Review comment:
   I think if the batches don't have the same schema, an error should 
result, even if there is some plausible behavior.
   
   My rationale is that it is likely a user error if the schemas don't match, 
so failing fast will help the programmer identify the problem more easily
   
   So perhaps this function should check `batch.schema() != schema` (aka 
compare the entire schema for each record batch)
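   A sketch of that fail-fast check with stand-in types (hypothetical `Schema`/`Batch` models, not the arrow-rs API): compare each batch's whole schema against the target and error on the first mismatch, rather than only counting columns.

   ```rust
   // Stand-in types for illustration only; not arrow-rs definitions.
   #[derive(PartialEq, Debug, Clone)]
   struct Schema(Vec<String>); // field names only

   struct Batch {
       schema: Schema,
       rows: usize,
   }

   fn concat(schema: &Schema, batches: &[Batch]) -> Result<Batch, String> {
       // Comparing the entire schema (not just column counts) surfaces
       // user errors like mismatched field names early.
       if let Some((i, _)) = batches
           .iter()
           .enumerate()
           .find(|(_, b)| &b.schema != schema)
       {
           return Err(format!("batch {} has a different schema", i));
       }
       Ok(Batch {
           schema: schema.clone(),
           rows: batches.iter().map(|b| b.rows).sum(),
       })
   }

   fn main() {
       let s = Schema(vec!["a".into()]);
       let ok = concat(
           &s,
           &[Batch { schema: s.clone(), rows: 2 }, Batch { schema: s.clone(), rows: 2 }],
       );
       assert_eq!(ok.unwrap().rows, 4);
       let bad = concat(&s, &[Batch { schema: Schema(vec!["b".into()]), rows: 1 }]);
       assert!(bad.is_err());
   }
   ```

   With this shape, the "similar schemas" test above would become an error case instead of silently adopting the first schema.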

##
File path: arrow/src/record_batch.rs
##
@@ -639,4 +669,108 @@ mod tests {
 assert_eq!(batch.column(0).as_ref(), boolean.as_ref());
 assert_eq!(batch.column(1).as_ref(), int.as_ref());
 }
+
+#[test]
+fn concat_empty_record_batches() {
+let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
+let batch = RecordBatch::concat(&schema, &[]).unwrap();
+assert_eq!(batch.schema().as_ref(), schema.as_ref());
+assert_eq!(0, batch.num_rows());
+}
+
+#[test]
+fn concat_record_batches_of_same_schema() {
+let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
+let batch1 = RecordBatch::try_new(
+schema.clone(),
+vec![Arc::new(Int32Array::from(vec![1, 2]))],
+)
+.unwrap();
+let batch2 = RecordBatch::try_new(
+schema.clone(),
+vec![Arc::new(Int32Array::from(vec![3, 4]))],
+)
+.unwrap();
+let new_batch = RecordBatch::concat(&schema, &[batch1, batch2]).unwrap();
+assert_eq!(new_batch.schema().as_ref(), schema.as_ref());
+assert_eq!(4, new_batch.num_rows());
+}
+
+#[test]
+fn concat_record_batches_of_similar_schemas() {
+let schema1 = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
+let schema2 = Arc::new(Schema::new(vec![
+Field::new("b", DataType::Int32, false),
+Field::new("c", DataType::Utf8, false),
+]));
+let batch1 = RecordBatch::try_new(
+schema1.clone(),
+vec![Arc::new(Int32Array::from(vec![1, 2]))],
+)
+.unwrap();
+let batch2 = RecordBatch::try_new(
+schema2,
+vec![
+Arc::new(Int32Array::from(vec![3, 4])),
+Arc::new(StringArray::from(vec!["a", "b"])),
+],
+)
+.unwrap();
+let new_batch = RecordBatch::concat(&schema1, &[batch1, batch2]).unwrap();
+assert_eq!(new_batch.schema().as_ref(), schema1.as_ref());
+assert_eq!(4, new_batch.num_rows());
+}
+
+#[test]
+fn concat_record_batches_of_different_column_num() {
+let 

[GitHub] [arrow-rs] alamb commented on issue #461: Implement RecordBatch::concat

2021-07-11 Thread GitBox


alamb commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877783516


   > Wouldn't it make it more sense to add this to compute as a normal 
function? Implementing it on the RecordBatch interface is breaking the 
separation between compute and data that we enforce with the compute module.
   
   @jorgecarleitao I think that comes down to "is concat a construction 
operation or a compute operation"? It could probably be classified as either. I 
don't have a strong opinion about where the "concat RecordBatch" function goes. 
   
   The only other prior art of using `RecordBatch`es in compute kernels I found 
is `pub fn filter_record_batch` in filter.rs:
   
https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/filter.rs#L279-L298
   
   We could perhaps add a `concat_batches` function in  
compute/kernels/concat.rs
   






[GitHub] [arrow-rs] nevi-me commented on a change in pull request #389: make slice work for nested types

2021-07-11 Thread GitBox


nevi-me commented on a change in pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#discussion_r667463293



##
File path: arrow/src/array/transform/structure.rs
##
@@ -26,13 +26,10 @@ pub(super) fn build_extend(array: &ArrayData) -> Extend {
   index: usize,
   start: usize,
   len: usize| {
-mutable.child_data.iter_mut().for_each(|child| {
-child.extend(
-index,
-array.offset() + start,
-array.offset() + start + len,
-)
-})
+mutable

Review comment:
   This reverses 9f96561








[GitHub] [arrow-rs] nevi-me commented on pull request #389: make slice work for nested types

2021-07-11 Thread GitBox


nevi-me commented on pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-877780039


   @jorgecarleitao @jhorstmann @bjchambers this is ready for review, and 
partially fixes #514 






[GitHub] [arrow-rs] codecov-commenter commented on pull request #389: make slice work for nested types

2021-07-11 Thread GitBox


codecov-commenter commented on pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-88131


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#389](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (f72c86e) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (f1fb2b1) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/389/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@   Coverage Diff   @@
   ##   master #389   +/-   ##
   ===
 Coverage   82.60%   82.60%   
   ===
 Files 167  167   
 Lines   4598445986+2 
   ===
   + Hits3798437986+2 
 Misses   8000 8000   
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow-rs/pull/389/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X3N0cnVjdC5ycw==)
 | `89.11% <100.00%> (-0.14%)` | :arrow_down: |
   | 
[arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/389/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=)
 | `73.60% <100.00%> (+0.93%)` | :arrow_up: |
   | 
[arrow/src/array/transform/structure.rs](https://codecov.io/gh/apache/arrow-rs/pull/389/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9zdHJ1Y3R1cmUucnM=)
 | `80.00% <100.00%> (-3.34%)` | :arrow_down: |
   | 
[parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/389/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz)
 | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[f1fb2b1...f72c86e](https://codecov.io/gh/apache/arrow-rs/pull/389?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-rs] nevi-me commented on a change in pull request #389: make slice work for nested types

2021-07-11 Thread GitBox


nevi-me commented on a change in pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#discussion_r667459042



##
File path: arrow/src/array/array_struct.rs
##
@@ -85,12 +85,7 @@ impl From for StructArray {
 fn from(data: ArrayData) -> Self {
 let mut boxed_fields = vec![];
 for cd in data.child_data() {
-let child_data = if data.offset() != 0 || data.len() != cd.len() {

Review comment:
   @jhorstmann @jorgecarleitao the problem was here. We take `data` which 
has children and their own offsets, then slice that data correctly to populate 
`boxed_fields`, but then we still use the `data` with the child arrays that 
aren't offset.
   
   The result is that the struct is correct when looking at it through `boxed_fields` (e.g. the print utility uses this), but once you need to do anything with `data`, it's as if the child values were never sliced.
   
   The lightbulb turned on while I was trying to figure out why a test for #491 was failing.
   
   There's a change that @bjchambers authored 
(https://github.com/apache/arrow-rs/commit/9f965612626cdf31187ce07ba5dbecc3503077b8)
 which addressed the child offsets, but my tests were failing because I needed 
to revert what they did.
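   A minimal sketch of the offset propagation in question (toy `StructData` with assumed names, not arrow-rs internals): the sliced parent stores an (offset, len) window over unsliced child buffers, so any read through the raw child data must apply the parent offset; skipping it reproduces the "as if never sliced" symptom.

   ```rust
   // Toy model of a sliced struct array: the child buffer is stored
   // unsliced, and the parent window selects the visible region.
   struct StructData {
       offset: usize,
       len: usize,
       child: Vec<i32>, // one child column, full (unsliced) storage
   }

   impl StructData {
       // Correct view: apply the parent's offset to the child buffer.
       fn child_values(&self) -> &[i32] {
           &self.child[self.offset..self.offset + self.len]
       }
   }

   fn main() {
       let data = StructData { offset: 2, len: 2, child: vec![10, 20, 30, 40] };
       // A reader that ignored `offset` would see [10, 20] here instead.
       assert_eq!(data.child_values(), [30, 40]);
   }
   ```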








[GitHub] [arrow] domoritz edited a comment on pull request #10673: ARROW-13277: [JS] Add declaration maps for TypeScript and refactor testing infrastructure

2021-07-11 Thread GitBox


domoritz edited a comment on pull request #10673:
URL: https://github.com/apache/arrow/pull/10673#issuecomment-877747590


   ~~Hmm, can it be that test times for the docker build went from 20 minutes 
to 50 minutes? (https://github.com/apache/arrow/runs/3005016709 vs 
https://github.com/apache/arrow/runs/3037021755).~~ The last run was faster 
again. Maybe the CI is just inconsistent. 






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #491: [DRAFT] Minimal MapArray support

2021-07-11 Thread GitBox


codecov-commenter edited a comment on pull request #491:
URL: https://github.com/apache/arrow-rs/pull/491#issuecomment-873369694


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#491](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (b5efa36) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/f1fb2b11bbd6350365de010d3e1d676a27602d3a?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (f1fb2b1) will **decrease** coverage by `0.04%`.
   > The diff coverage is `75.48%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/491/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #491  +/-   ##
   ==
   - Coverage   82.60%   82.56%   -0.05% 
   ==
 Files 167  168   +1 
 Lines   4598446629 +645 
   ==
   + Hits3798438498 +514 
   - Misses   8000 8131 +131 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/491?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=)
 | `72.23% <0.00%> (-0.44%)` | :arrow_down: |
   | 
[arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy93cml0ZXIucnM=)
 | `86.68% <ø> (ø)` | |
   | 
[arrow/src/array/equal\_json.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsX2pzb24ucnM=)
 | `88.69% <33.33%> (-2.52%)` | :arrow_down: |
   | 
[arrow/src/array/array\_map.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X21hcC5ycw==)
 | `44.77% <44.77%> (ø)` | |
   | 
[arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==)
 | `51.36% <45.00%> (-0.47%)` | :arrow_down: |
   | 
[arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL21vZC5ycw==)
 | `93.78% <50.00%> (-0.16%)` | :arrow_down: |
   | 
[arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==)
 | `66.08% <50.00%> (-0.74%)` | :arrow_down: |
   | 
[parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz)
 | `82.65% <65.38%> (-0.43%)` | :arrow_down: |
   | 
[parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz)
 | `77.70% <77.67%> (+0.25%)` | :arrow_up: |
   | 
[arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/491/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=)
 | `85.83% <81.44%> (-0.29%)` | :arrow_down: |
   | ... and [18 

[GitHub] [arrow] domoritz commented on pull request #10673: ARROW-13277: [JS] Add declaration maps for TypeScript and refactor testing infrastructure

2021-07-11 Thread GitBox


domoritz commented on pull request #10673:
URL: https://github.com/apache/arrow/pull/10673#issuecomment-877747590


   Hmm, can it be that test times for the docker build went from 20 minutes to 
50 minutes? (https://github.com/apache/arrow/runs/3005016709 vs 
https://github.com/apache/arrow/runs/3037021755).

