[GitHub] [arrow-rs] sunchao commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

2021-06-14 Thread GitBox


sunchao commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r651247944



##
File path: parquet/src/data_type.rs
##
@@ -661,8 +661,15 @@ pub(crate) mod private {
 _: &mut W,
 bit_writer: &mut BitWriter,
 ) -> Result<()> {
+if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {

Review comment:
   Hmm, I'm sorry, but can you elaborate on why the unit of `values` is also bytes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-datafusion] alamb commented on a change in pull request #561: Fix pruning on not equal predicate

2021-06-14 Thread GitBox


alamb commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651282910



##
File path: datafusion/src/physical_optimizer/pruning.rs
##
@@ -553,12 +553,14 @@ fn build_predicate_expression(
 let corrected_op = expr_builder.correct_operator(op);
 let statistics_expr = match corrected_op {
 Operator::NotEq => {
-// column != literal => (min, max) = literal => min > literal || literal > max
+// column != literal => (min, max) = literal =>
+// !(min != literal && max != literal) ==>
+// min != literal || literal != max
 let min_column_expr = expr_builder.min_column_expr()?;
 let max_column_expr = expr_builder.max_column_expr()?;
 min_column_expr
-.gt(expr_builder.scalar_expr().clone())
-.or(expr_builder.scalar_expr().clone().gt(max_column_expr))
+.not_eq(expr_builder.scalar_expr().clone())
+.or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))

Review comment:
   The way I think about it: the `!=` predicate is effectively `!(col == min && col == max)`, which when you apply De Morgan's law becomes `(col != min) || (col != max)`.








[GitHub] [arrow] lidavidm commented on pull request #10520: ARROW-12709: [C++] Add binary_join_element_wise

2021-06-14 Thread GitBox


lidavidm commented on pull request #10520:
URL: https://github.com/apache/arrow/pull/10520#issuecomment-861035926


   The null handling behavior never affects the separator: it only describes 
how to handle the other values. It's intended to let you mimic libcudf.
   
   ```
   binary_join_element_wise(['a', 'b', 'c', 'h'], ['d', null, 'e', 'i'], ['f', 'g', null, 'j'], [';', '-', ',', null])
   EMIT_NULL: ['a;d;f', null, null, null]
   SKIP: ['a;d;f', 'b-g', 'c,e', null] # b and g are joined; the middle null value is ignored
   REPLACE(replacement=foo): ['a;d;f', 'b-foo-g', 'c,e,foo', null]
   ```
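
The semantics above can be modeled in a few lines of plain Python (a hedged sketch of the behavior described in the comment; the real kernel is implemented in Arrow C++, and this helper's name and signature are only illustrative):

```python
def binary_join_element_wise(*columns, separators, null_handling="EMIT_NULL", replacement=""):
    # Pure-Python model of the described semantics. A null (None) separator
    # always yields null; null_handling only governs the joined values.
    out = []
    for i, sep in enumerate(separators):
        vals = [col[i] for col in columns]
        if sep is None:
            out.append(None)
        elif null_handling == "EMIT_NULL":
            out.append(None if any(v is None for v in vals) else sep.join(vals))
        elif null_handling == "SKIP":
            out.append(sep.join(v for v in vals if v is not None))
        else:  # "REPLACE"
            out.append(sep.join(replacement if v is None else v for v in vals))
    return out
```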






[GitHub] [arrow] ianmcook commented on pull request #10520: ARROW-12709: [C++] Add binary_join_element_wise

2021-06-14 Thread GitBox


ianmcook commented on pull request #10520:
URL: https://github.com/apache/arrow/pull/10520#issuecomment-861046053


   Oh, I see; I was confused about what happens when the separator is an array. Thank you!






[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #520: Implement window functions with `order_by` clause

2021-06-14 Thread GitBox


Jimexist commented on a change in pull request #520:
URL: https://github.com/apache/arrow-datafusion/pull/520#discussion_r651361200



##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -156,31 +162,72 @@ impl WindowExpr for BuiltInWindowExpr {
 self.window.expressions()
 }
 
-fn create_accumulator(&self) -> Result<Box<dyn WindowAccumulator>> {
-self.window.create_accumulator()
+fn partition_by(&self) -> &[Arc<dyn PhysicalExpr>] {
+&self.partition_by
+}
+
+fn order_by(&self) -> &[PhysicalSortExpr] {
+&self.order_by
+}
+
+fn evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
+// FIXME, for now we assume all the rows belong to the same partition, which will not be the
+// case when partition_by is supported, in which case we'll parallelize the calls.
+// See https://github.com/apache/arrow-datafusion/issues/299
+let values = self.evaluate_args(batch)?;
+self.window.evaluate(batch.num_rows(), &values)
 }
 }
 
 /// A window expr that takes the form of an aggregate function
 #[derive(Debug)]
 pub struct AggregateWindowExpr {
 aggregate: Arc<dyn AggregateExpr>,
+partition_by: Vec<Arc<dyn PhysicalExpr>>,
+order_by: Vec<PhysicalSortExpr>,
+window_frame: Option<WindowFrame>,
 }
 
-#[derive(Debug)]
-struct AggregateWindowAccumulator {
-accumulator: Box<dyn Accumulator>,
-}
+impl AggregateWindowExpr {
+/// the aggregate window function operates based on window frame, and by default the mode is
+/// "range".
+fn evaluation_mode(&self) -> WindowFrameUnits {
+self.window_frame.unwrap_or_default().units
+}
 
-impl WindowAccumulator for AggregateWindowAccumulator {
-fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {
-self.accumulator.update(values)?;
-Ok(None)
+/// create a new accumulator based on the underlying aggregation function
+fn create_accumulator(&self) -> Result<AggregateWindowAccumulator> {
+let accumulator = self.aggregate.create_accumulator()?;
+Ok(AggregateWindowAccumulator { accumulator })
 }
 
-/// returns its value based on its current state.
-fn evaluate(&self) -> Result<Option<ScalarValue>> {
-Ok(Some(self.accumulator.evaluate()?))
+/// peer based evaluation based on the fact that batch is pre-sorted given the sort columns
+/// and then per partition point we'll evaluate the peer group (e.g. SUM or MAX gives the same
+/// results for peers) and concatenate the results.
+fn peer_based_evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {

Review comment:
   Since this is a private function, I guess I can leave the naming part for later changes.
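
The peer-group idea in the doc comment above can be sketched outside DataFusion (a hypothetical helper, not the actual implementation): rows that compare equal on the pre-sorted sort columns are peers, and a default RANGE-mode aggregate gives every row in a peer group the value of the frame ending at that group.

```python
def peer_based_evaluate(sort_keys, values, agg):
    # sort_keys must be pre-sorted; consecutive equal keys form a peer group.
    # For the default RANGE frame, each row gets the aggregate of all values
    # from the start of the partition through the end of its peer group, so
    # e.g. SUM or MAX gives the same result for all peers.
    out = []
    i = 0
    while i < len(sort_keys):
        j = i
        while j < len(sort_keys) and sort_keys[j] == sort_keys[i]:
            j += 1  # advance past the current peer group
        out.extend([agg(values[:j])] * (j - i))
        i = j
    return out
```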








[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #558: Implement window functions with `partition_by` clause

2021-06-14 Thread GitBox


codecov-commenter edited a comment on pull request #558:
URL: https://github.com/apache/arrow-datafusion/pull/558#issuecomment-860569767


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/558?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#558](https://codecov.io/gh/apache/arrow-datafusion/pull/558?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (4a7a499) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/e3e7e293c4482af1475406bf4e922d5c99e7a792?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (e3e7e29) will **decrease** coverage by `0.03%`.
   > The diff coverage is `81.01%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/558/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/558?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #558      +/-   ##
   ==========================================
   - Coverage   76.12%   76.08%   -0.04%     
   ==========================================
     Files         156      156              
     Lines       27074    27121      +47     
   ==========================================
   + Hits        20609    20635      +26     
   - Misses       6465     6486      +21     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/558?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[datafusion/src/physical\_plan/window\_functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dfZnVuY3Rpb25zLnJz)
 | `86.42% <ø> (+0.71%)` | :arrow_up: |
   | 
[datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3BsYW5uZXIucnM=)
 | `84.75% <ø> (ø)` | |
   | 
[datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wbGFubmVyLnJz)
 | `79.84% <33.33%> (+2.30%)` | :arrow_up: |
   | 
[datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9tb2QucnM=)
 | `80.00% <70.96%> (+0.90%)` | :arrow_up: |
   | 
[datafusion/src/physical\_plan/windows.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dzLnJz)
 | `82.59% <75.00%> (-3.88%)` | :arrow_down: |
   | 
[...afusion/src/physical\_plan/expressions/nth\_value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9udGhfdmFsdWUucnM=)
 | `79.41% <75.67%> (-11.07%)` | :arrow_down: |
   | 
[datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvZXhlY3V0aW9uL2NvbnRleHQucnM=)
 | `92.13% <100.00%> (+0.13%)` | :arrow_up: |
   | 
[...fusion/src/physical\_plan/expressions/row\_number.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9yb3dfbnVtYmVyLnJz)
 | `94.28% <100.00%> (+13.03%)` | :arrow_up: |
   | 
[datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/558/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2FnZ3JlZ2F0ZS5ycw==)
 | `86.54% <100.00%> (ø)` | |
   | 

[GitHub] [arrow-datafusion] Jimexist commented on pull request #520: Implement window functions with `order_by` clause

2021-06-14 Thread GitBox


Jimexist commented on pull request #520:
URL: https://github.com/apache/arrow-datafusion/pull/520#issuecomment-861088442


   > This looks very nice @Jimexist -- I went over the code and saw only 
goodness :)
   > 
   > All that this PR needs to be mergeable in my opinion is to reset the Cargo 
`arrow*` references (now that arrow 4.3.0 has been released)
   
   Thank you for taking the time to review. The changes to arrow are now reverted.






[GitHub] [arrow-datafusion] Jimexist edited a comment on pull request #520: Implement window functions with `order_by` clause

2021-06-14 Thread GitBox


Jimexist edited a comment on pull request #520:
URL: https://github.com/apache/arrow-datafusion/pull/520#issuecomment-861088442


   > This looks very nice @Jimexist -- I went over the code and saw only 
goodness :)
   > 
   > All that this PR needs to be mergeable in my opinion is to reset the Cargo 
`arrow*` references (now that arrow 4.3.0 has been released)
   
   Thank you for taking the time to review. The changes to the arrow references are now reverted.






[GitHub] [arrow] rok commented on a change in pull request #10457: ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware

2021-06-14 Thread GitBox


rok commented on a change in pull request #10457:
URL: https://github.com/apache/arrow/pull/10457#discussion_r651290261



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##
@@ -143,39 +142,202 @@ TEST(ScalarTemporalTest, 
TestTemporalComponentExtractionWithDifferentUnits) {
 CheckScalarUnary("quarter", unit, times, int64(), quarter);
 CheckScalarUnary("hour", unit, times, int64(), hour);
 CheckScalarUnary("minute", unit, times, int64(), minute);
-CheckScalarUnary("second", unit, times, float64(), second);
+CheckScalarUnary("second", unit, times, int64(), second);
 CheckScalarUnary("millisecond", unit, times, int64(), zeros);
 CheckScalarUnary("microsecond", unit, times, int64(), zeros);
 CheckScalarUnary("nanosecond", unit, times, int64(), zeros);
 CheckScalarUnary("subsecond", unit, times, float64(), zeros);
   }
 }
 
-TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
-  std::string timezone = "Asia/Kolkata";
-  const char* times = R"(["1970-01-01T00:00:59", null])";
+TEST(ScalarTemporalTest, TestZoned1) {
+  const char* times =
+  R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+  "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0",
+  "2020-01-01T01:05:05.001", "2019-12-31T02:10:10.002",
+  "2019-12-30T03:15:15.003", "2009-12-31T04:20:20.004132",
+  "2010-01-01T05:25:25.005321", "2010-01-03T06:30:30.006163",
+  "2010-01-04T07:35:35", "2006-01-01T08:40:40", "2005-12-31T09:45:45",
+  "2008-12-28", "2008-12-29", "2012-01-01 01:02:03", null])";
+  auto unit = timestamp(TimeUnit::NANO, "Pacific/Marquesas");
+  auto iso_calendar_type =
+  struct_({field("iso_year", int64()), field("iso_week", int64()),
+   field("iso_day_of_week", int64())});
+  auto year =
+  "[1969, 2000, 1898, 2033, 2019, 2019, 2019, 2009, 2009, 2010, 2010, 2005, 2005, "
+  "2008, 2008, 2011, null]";
+  auto month = "[12, 2, 12, 5, 12, 12, 12, 12, 12, 1, 1, 12, 12, 12, 12, 12, null]";
+  auto day = "[31, 29, 31, 17, 31, 30, 29, 30, 31, 2, 3, 31, 31, 27, 28, 31, null]";
+  auto day_of_week = "[2, 1, 5, 1, 1, 0, 6, 2, 3, 5, 6, 5, 5, 5, 6, 5, null]";
+  auto day_of_year =
+  "[365, 60, 365, 137, 365, 364, 363, 364, 365, 2, 3, 365, 365, 362, 363, 365, null]";
+  auto iso_year =
+  "[1970, 2000, 1898, 2033, 2020, 2020, 2019, 2009, 2009, 2009, 2009, 2005, 2005, "
+  "2008, 2008, 2011, null]";
+  auto iso_week = "[1, 9, 52, 20, 1, 1, 52, 53, 53, 53, 53, 52, 52, 52, 52, 52, null]";
+  auto iso_calendar =
+  ArrayFromJSON(iso_calendar_type,
+R"([{"iso_year": 1970, "iso_week": 1, "iso_day_of_week": 3},
+{"iso_year": 2000, "iso_week": 9, "iso_day_of_week": 2},
+{"iso_year": 1898, "iso_week": 52, "iso_day_of_week": 6},
+{"iso_year": 2033, "iso_week": 20, "iso_day_of_week": 2},
+{"iso_year": 2020, "iso_week": 1, "iso_day_of_week": 2},
+{"iso_year": 2020, "iso_week": 1, "iso_day_of_week": 1},
+{"iso_year": 2019, "iso_week": 52, "iso_day_of_week": 7},
+{"iso_year": 2009, "iso_week": 53, "iso_day_of_week": 3},
+{"iso_year": 2009, "iso_week": 53, "iso_day_of_week": 4},
+{"iso_year": 2009, "iso_week": 53, "iso_day_of_week": 6},
+{"iso_year": 2009, "iso_week": 53, "iso_day_of_week": 7},
+{"iso_year": 2005, "iso_week": 52, "iso_day_of_week": 6},
+{"iso_year": 2005, "iso_week": 52, "iso_day_of_week": 6},
+{"iso_year": 2008, "iso_week": 52, "iso_day_of_week": 6},
+{"iso_year": 2008, "iso_week": 52, "iso_day_of_week": 7},
+{"iso_year": 2011, "iso_week": 52, "iso_day_of_week": 6}, null])");
+  auto quarter = "[4, 1, 4, 2, 4, 4, 4, 4, 4, 1, 1, 4, 4, 4, 4, 4, null]";
+  auto hour = "[14, 13, 15, 18, 15, 16, 17, 18, 19, 21, 22, 23, 0, 14, 14, 15, null]";
+  auto minute = "[30, 53, 41, 3, 35, 40, 45, 50, 55, 0, 5, 10, 15, 30, 30, 32, null]";
+  auto second = "[59, 23, 20, 20, 5, 10, 15, 20, 25, 30, 35, 40, 45, 0, 0, 3, null]";
+  auto millisecond = "[123, 999, 1, 0, 1, 2, 3, 4, 5, 6, 0, 0, 0, 0, 0, 0, null]";
+  auto microsecond = "[456, 999, 1, 0, 0, 0, 0, 132, 321, 163, 0, 0, 0, 0, 0, 0, null]";
+  auto nanosecond = "[789, 999, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, null]";
+  auto subsecond =
+  "[0.123456789, 0.9, 0.001001001, 0.0, 0.001, 0.002, 0.003, 0.004132, "
+  "0.005321, 0.006163, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, null]";
 
-  for (auto u : internal::AllTimeUnits()) {
-auto unit = timestamp(u, timezone);
-auto timestamps = ArrayFromJSON(unit, times);
-
-ASSERT_RAISES(Invalid, Year(timestamps));
-

[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #520: Implement window functions with `order_by` clause

2021-06-14 Thread GitBox


Jimexist commented on a change in pull request #520:
URL: https://github.com/apache/arrow-datafusion/pull/520#discussion_r651360294



##
File path: datafusion/src/physical_plan/expressions/nth_value.rs
##
@@ -113,54 +111,32 @@ impl BuiltInWindowFunctionExpr for NthValue {
 
 }
 
-fn create_accumulator(&self) -> Result<Box<dyn WindowAccumulator>> {
-Ok(Box::new(NthValueAccumulator::try_new(
-self.kind,
-self.data_type.clone(),
-)?))
-}
-}
-
-#[derive(Debug)]
-struct NthValueAccumulator {
-kind: NthValueKind,
-offset: u32,
-value: ScalarValue,
-}
-
-impl NthValueAccumulator {
-/// new count accumulator
-pub fn try_new(kind: NthValueKind, data_type: DataType) -> Result<Self> {
-Ok(Self {
-kind,
-offset: 0,
-// null value of that data_type by default
-value: ScalarValue::try_from(&data_type)?,
-})
-}
-}
-
-impl WindowAccumulator for NthValueAccumulator {
-fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {
-self.offset += 1;
-match self.kind {
-NthValueKind::Last => {
-self.value = values[0].clone();
-}
-NthValueKind::First if self.offset == 1 => {
-self.value = values[0].clone();
-}
-NthValueKind::Nth(n) if self.offset == n => {
-self.value = values[0].clone();
-}
-_ => {}
+fn evaluate(&self, num_rows: usize, values: &[ArrayRef]) -> Result<ArrayRef> {
+if values.is_empty() {
+return Err(DataFusionError::Execution(format!(
+"No arguments supplied to {}",
+self.name()
+)));
 }
-
-Ok(None)
-}
-
-fn evaluate(&self) -> Result<Option<ScalarValue>> {
-Ok(Some(self.value.clone()))
+let value = &values[0];
+if value.len() != num_rows {
+return Err(DataFusionError::Execution(format!(
+"Invalid data supplied to {}, expect {} rows, got {} rows",
+self.name(),
+num_rows,
+value.len()
+)));
+}
+if num_rows == 0 {

Review comment:
   But you are right, this would _not_ be passed with 0-length input; this check is just being pedantic.
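
For reference, the NthValue kinds reduce to a small rule (a hedged sketch with 1-based `n` as in SQL's NTH_VALUE; the helper names are hypothetical, not the DataFusion API): FIRST is `n == 1`, LAST is `n == len(column)`, and an out-of-range `n` yields null, with the single result broadcast to every row of the window.

```python
def nth_value(column, n):
    # FIRST_VALUE is n == 1, LAST_VALUE is n == len(column); an n outside
    # [1, len(column)] yields null (None).
    if not 1 <= n <= len(column):
        return None
    return column[n - 1]

def evaluate(column, num_rows, n):
    # Broadcast the single result to every row, after the same row-count
    # sanity check as the snippet above.
    if len(column) != num_rows:
        raise ValueError(f"expect {num_rows} rows, got {len(column)} rows")
    return [nth_value(column, n)] * num_rows
```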








[GitHub] [arrow] rok commented on a change in pull request #10457: ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware

2021-06-14 Thread GitBox


rok commented on a change in pull request #10457:
URL: https://github.com/apache/arrow/pull/10457#discussion_r651241888



##
File path: python/pyarrow/tests/test_compute.py
##
@@ -1212,6 +1212,80 @@ def test_strptime():
 assert got == expected
 
 
+def _check_datetime_components(ts):
+tsa = [pa.array(ts)]
+subseconds = ((ts.microsecond * 10**3 + ts.nanosecond) * 10**-9).values.round(9)
+iso_calendar_fields = [
+pa.field('iso_year', pa.int64()),
+pa.field('iso_week', pa.int64()),
+pa.field('iso_day_of_week', pa.int64())
+]
+iso_calendar = pa.StructArray.from_arrays([
+ts.isocalendar()["year"].astype(int),
+ts.isocalendar()["week"].astype(int),
+ts.isocalendar()["day"].astype(int)],
+fields=iso_calendar_fields)
+
+assert pc.call_function("year", tsa).equals(pa.array(ts.year))

Review comment:
   Done.








[GitHub] [arrow] github-actions[bot] commented on pull request #10530: ARROW-13072: [C++] Add bit-wise arithmetic kernels

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10530:
URL: https://github.com/apache/arrow/pull/10530#issuecomment-860958622


   https://issues.apache.org/jira/browse/ARROW-13072






[GitHub] [arrow] rok commented on pull request #10457: ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware

2021-06-14 Thread GitBox


rok commented on pull request #10457:
URL: https://github.com/apache/arrow/pull/10457#issuecomment-861007346


   > I think there was no opposition for this one (i.e. that the field 
extraction should yield local hour/minute/etc), so I don't think there is a 
need to close the PR.
   
   Huh, I forget exactly what I was thinking there and you appear to be right.
   
   I think the main remaining task is the tz db issue. I'll take a look at it 
tomorrow.






[GitHub] [arrow] rok commented on a change in pull request #10476: ARROW-12499: [C++][Compute] Add ScalarAggregateOptions to Any and All kernels

2021-06-14 Thread GitBox


rok commented on a change in pull request #10476:
URL: https://github.com/apache/arrow/pull/10476#discussion_r651318115



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -166,32 +168,48 @@ struct BooleanAnyImpl : public ScalarAggregator {
   Status MergeFrom(KernelContext*, KernelState&& src) override {
 const auto& other = checked_cast(src);
 this->any |= other.any;
+this->has_nulls |= other.has_nulls;
 return Status::OK();
   }
 
-  Status Finalize(KernelContext*, Datum* out) override {
-out->value = std::make_shared(this->any);
+  Status Finalize(KernelContext* ctx, Datum* out) override {
+if (!options.skip_nulls && !this->any && this->has_nulls) {

Review comment:
   I documented the new behavior in 
[compute.rst](https://github.com/apache/arrow/pull/10476/files#diff-ce5b94577014735990903d3d03bd4ea4b8c8e6d32f5227592e60b7dd6a912d59)
 and 
[api_aggregate.h](https://github.com/apache/arrow/pull/10476/files#diff-694a635f185cf0d22dd6eefd91acedb48ed0bf71b15fd591aa6e7c508ba4cb09).








[GitHub] [arrow-rs] Jimexist commented on pull request #448: Use partition for bool sort

2021-06-14 Thread GitBox


Jimexist commented on pull request #448:
URL: https://github.com/apache/arrow-rs/pull/448#issuecomment-861078094


   I guess 1024 is too short an array, so the benchmark is dominated by memory allocation rather than sorting.






[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #561: Fix pruning on not equal predicate

2021-06-14 Thread GitBox


Dandandan commented on a change in pull request #561:
URL: https://github.com/apache/arrow-datafusion/pull/561#discussion_r651243037



##
File path: datafusion/src/physical_optimizer/pruning.rs
##
@@ -553,12 +553,14 @@ fn build_predicate_expression(
 let corrected_op = expr_builder.correct_operator(op);
 let statistics_expr = match corrected_op {
 Operator::NotEq => {
-// column != literal => (min, max) = literal => min > literal || literal > max
+// column != literal => (min, max) = literal =>
+// !(min != literal && max != literal) ==>
+// min != literal || literal != max
 let min_column_expr = expr_builder.min_column_expr()?;
 let max_column_expr = expr_builder.max_column_expr()?;
 min_column_expr
-.gt(expr_builder.scalar_expr().clone())
-.or(expr_builder.scalar_expr().clone().gt(max_column_expr))
+.not_eq(expr_builder.scalar_expr().clone())
+.or(expr_builder.scalar_expr().clone().not_eq(max_column_expr))

Review comment:
   Shouldn't this be `and`?








[GitHub] [arrow-datafusion] adsharma edited a comment on issue #533: Add extension plugin to parse SQL into logical plan

2021-06-14 Thread GitBox


adsharma edited a comment on issue #533:
URL: 
https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860999601


   I don't have much context about the proposal. Trying to understand things 
better. Please bear with me.
   
   The reason why SQL doesn't have `select * from t version as of n` is that it 
violates the isolation property in ACID, by allowing a query to read data from 
two different points in time (or logical sequence numbers if you prefer).
   
   Instead, they prefer a flow such as:
   
   ```
   BEGIN // implicitly selects a version and all queries following would read 
from that version
   select * from t1
   union all
   select * from t2;
   COMMIT
   ```
   
   I can also imagine a variant such as:
   
   ```
   BEGIN TRANSACTION @n1
   ...
   COMMIT
   ```
   
   which has the same effect. The benefit of these variants is that it makes it 
harder to write SQL that violates isolation properties.






[GitHub] [arrow-datafusion] adsharma commented on issue #533: Add extension plugin to parse SQL into logical plan

2021-06-14 Thread GitBox


adsharma commented on issue #533:
URL: 
https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860999601


   I don't have much context about the proposal. Trying to understand things 
better. Please bear with me.
   
   The reason why SQL doesn't have `select * from t version as of n` is that it 
violates the isolation property in ACID, by allowing a query to read data from 
two different points in time (or logical sequence numbers if you prefer).
   
   Instead, they prefer a flow such as:
   
   ```
   BEGIN // implicitly selects a version and all queries following would read from that version
   select * from t1
   union all
   select * from t2;
   COMMIT
   ```
   
   I can also imagine a variant such as:
   
   ```
   BEGIN TRANSACTION @n1
   ...
   COMMIT
   ```
   
   which has the same effect. The benefit of these variants is that it makes it 
harder to write SQL that violates isolation properties.






[GitHub] [arrow] edponce commented on issue #10502: AttributeError: module 'pyarrow.lib' has no attribute '_Weakrefable'

2021-06-14 Thread GitBox


edponce commented on issue #10502:
URL: https://github.com/apache/arrow/issues/10502#issuecomment-861009813


   @bhargav-inthezone Kaggle notebook [installs latest pyarrow by 
default](https://github.com/Kaggle/docker-python/blob/main/Dockerfile#L347) but 
it seems the Docker image was created with Arrow 3.0.0. What version of Arrow 
do you have installed? Does the issue persist with Arrow 4.0.1?
   I installed ``vaex`` in a local Jupyter notebook and a Google Colab notebook and it works fine. I have not been able to test in a Kaggle notebook as I do not have an active account (...yet).






[GitHub] [arrow] bhargav-inthezone commented on issue #10502: AttributeError: module 'pyarrow.lib' has no attribute '_Weakrefable'

2021-06-14 Thread GitBox


bhargav-inthezone commented on issue #10502:
URL: https://github.com/apache/arrow/issues/10502#issuecomment-861147363


   I had to uninstall pyarrow before installing Vaex previously, but I think Kaggle fixed the problem. Now it's working after installing just Vaex in the notebook.






[GitHub] [arrow] p5a0u9l opened a new issue #10531: How to resolve `pyarrow.deserialize` FutureWarning

2021-06-14 Thread GitBox


p5a0u9l opened a new issue #10531:
URL: https://github.com/apache/arrow/issues/10531


   I'd like to resolve this warning 
   ```
   /usr/lib/python3.7/asyncio/events.py:88: FutureWarning: 'pyarrow.deserialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
   ```
   I'm calling `pyarrow.deserialize` on a bytes-like object that was initially 
a dictionary, but then stored in the plasma store with `client.put` and 
retrieved with 
   
   ```python
   buff = plasma_client.get_buffers([object_id])[0]
   reader = pyarrow.BufferReader(buff)
   data = reader.read()
   ```
   
   Starting with the bytes in `data`, is there a different method than 
`pyarrow.deserialize` to retrieve the original dictionary?
   
   Thank you for arrow! Happy to move this to JIRA if you'd prefer. 
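
Since the deprecation warning itself suggests pickle or the pyarrow IPC functionality, one possible replacement is to pickle the dictionary before putting it into the plasma store and unpickle on retrieval. This is a sketch under the assumption that you control the writing side too; the plasma `client.put`/`get_buffers` round-trip is simulated here with plain bytes.

```python
import pickle

# Write side: instead of pyarrow.serialize, pickle the dictionary before
# storing its bytes in the plasma store.
original = {"sensor": "a", "samples": [1, 2, 3]}
data = pickle.dumps(original)   # the bytes you would put into plasma

# Read side: `data` plays the role of reader.read() from the question;
# pickle.loads replaces pyarrow.deserialize.
restored = pickle.loads(data)
```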






[GitHub] [arrow] BryanCutler commented on pull request #10513: ARROW-13044: [Java] Change UnionVector and DenseUnionVector to extend AbstractContainerVector

2021-06-14 Thread GitBox


BryanCutler commented on pull request #10513:
URL: https://github.com/apache/arrow/pull/10513#issuecomment-861157094


   merged to master






[GitHub] [arrow] BryanCutler closed pull request #10513: ARROW-13044: [Java] Change UnionVector and DenseUnionVector to extend AbstractContainerVector

2021-06-14 Thread GitBox


BryanCutler closed pull request #10513:
URL: https://github.com/apache/arrow/pull/10513


   






[GitHub] [arrow-rs] jorgecarleitao commented on issue #455: Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow

2021-06-14 Thread GitBox


jorgecarleitao commented on issue #455:
URL: https://github.com/apache/arrow-rs/issues/455#issuecomment-861181763


   This may be related to which schema we give priority when converting: the 
parquet schema or the arrow schema. I would expect pyarrow to write them in a 
consistent manner though, so, as @nevi-me mentioned, an arrow schema in ns with 
a parquet schema in ms does seem odd.






[GitHub] [arrow-datafusion] Jimexist opened a new issue #562: publish datafusion-cli to brew repo so that macOS users can install easily

2021-06-14 Thread GitBox


Jimexist opened a new issue #562:
URL: https://github.com/apache/arrow-datafusion/issues/562


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   
   publish datafusion-cli to brew repo so that macOS users can install easily
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   






[GitHub] [arrow] cyb70289 commented on a change in pull request #10530: ARROW-13072: [C++] Add bit-wise arithmetic kernels

2021-06-14 Thread GitBox


cyb70289 commented on a change in pull request #10530:
URL: https://github.com/apache/arrow/pull/10530#discussion_r651405645



##
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##
@@ -397,6 +397,130 @@ struct PowerChecked {
   }
 };
 
+// Bitwise operations
+
+struct BitWiseNot {
+  template <typename T, typename Arg>
+  static T Call(KernelContext*, Arg arg, Status*) {
+    return ~arg;
+  }
+};
+
+struct BitWiseAnd {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs & rhs;
+  }
+};
+
+struct BitWiseOr {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs | rhs;
+  }
+};
+
+struct BitWiseXor {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs ^ rhs;
+  }
+};
+
+struct ShiftLeftLogical {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    static_assert(std::is_same<T, Arg0>::value, "");
+    return lhs << rhs;
+  }
+};
+
+// See SEI CERT C Coding Standard rule INT34-C
+struct ShiftLeftLogicalChecked {
+  template <typename T, typename Arg0, typename Arg1>
+  static enable_if_unsigned_integer<T> Call(KernelContext*, Arg0 lhs, Arg1 rhs,
+                                            Status* st) {
+    static_assert(std::is_same<T, Arg0>::value, "");
+    if (ARROW_PREDICT_FALSE(rhs < 0)) {
+      *st = Status::Invalid("Both operands must be non-negative");
+      return lhs;
+    }
+    if (ARROW_PREDICT_FALSE(rhs >= std::numeric_limits<T>::digits)) {
+      *st = Status::Invalid("overflow");
+      return lhs;
+    }
+    return lhs << rhs;
+  }
+
+  template <typename T, typename Arg0, typename Arg1>
+  static enable_if_signed_integer<T> Call(KernelContext*, Arg0 lhs, Arg1 rhs,
+                                          Status* st) {
+    static_assert(std::is_same<T, Arg0>::value, "");
+    if (ARROW_PREDICT_FALSE(lhs < 0 || rhs < 0)) {
+      *st = Status::Invalid("Both operands must be non-negative");
+      return lhs;
+    }
+    if (ARROW_PREDICT_FALSE(rhs >= std::numeric_limits<T>::digits)) {
+      *st = Status::Invalid("overflow");
+      return lhs;
+    }
+    if (ARROW_PREDICT_FALSE(lhs > (std::numeric_limits<T>::max() >> rhs))) {
+      *st = Status::Invalid("overflow");
+      return lhs;
+    }
+    return lhs << rhs;
+  }
+};
+
+struct ShiftRightArithmetic {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    using Signed = typename std::make_signed<T>::type;
+    static_assert(std::is_same<T, Arg0>::value, "");
+    // N.B. this is implementation-defined but GCC and MSVC document
+    // this as arithmetic right shift. References:
+    // https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html#Integers-implementation
+    // https://docs.microsoft.com/en-us/cpp/cpp/left-shift-and-right-shift-operators-input-and-output?view=msvc-160#right-shifts
+    // Clang doesn't document their behavior.
+    return static_cast<T>(static_cast<Signed>(lhs) >> rhs);

Review comment:
   For unsigned char 128, an arithmetic right shift by 1 bit yields 192, which 
looks a bit surprising.
   It seems arithmetic right shift is more meaningful for signed integers, and 
logical right shift for unsigned integers.
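   The case cyb70289 raises can be checked directly. As a hedged illustration 
(not part of the patch), Rust makes the two shift flavours explicit by type: 
`>>` on an unsigned integer is a logical shift, while on a signed integer it is 
arithmetic, so reinterpreting the bits reproduces the 128 -> 192 result:

   ```rust
   fn main() {
       let v: u8 = 128;

       // Logical right shift: `>>` on unsigned types shifts zeros in.
       assert_eq!(v >> 1, 64);

       // Arithmetic right shift: reinterpret as signed, shift (the sign bit
       // is replicated), reinterpret back. 128u8 is -128 as i8;
       // -128 >> 1 == -64, whose unsigned bit pattern is 192.
       assert_eq!(((v as i8) >> 1) as u8, 192);
   }
   ```

   This is why the comment suggests reserving arithmetic right shift for signed 
integer types.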

##
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic.cc
##
@@ -397,6 +397,130 @@ struct PowerChecked {
   }
 };
 
+// Bitwise operations
+
+struct BitWiseNot {
+  template <typename T, typename Arg>
+  static T Call(KernelContext*, Arg arg, Status*) {
+    return ~arg;
+  }
+};
+
+struct BitWiseAnd {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs & rhs;
+  }
+};
+
+struct BitWiseOr {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs | rhs;
+  }
+};
+
+struct BitWiseXor {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    return lhs ^ rhs;
+  }
+};
+
+struct ShiftLeftLogical {
+  template <typename T, typename Arg0, typename Arg1>
+  static T Call(KernelContext*, Arg0 lhs, Arg1 rhs, Status*) {
+    static_assert(std::is_same<T, Arg0>::value, "");
+    return lhs << rhs;

Review comment:
   Though it's not a checked kernel, I think it's still necessary to make 
sure no UB can happen.








[GitHub] [arrow-rs] emkornfield commented on issue #455: Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow

2021-06-14 Thread GitBox


emkornfield commented on issue #455:
URL: https://github.com/apache/arrow-rs/issues/455#issuecomment-861130175


   > @emkornfield do you know what we should do on the Rust side to roundtrip 
the file correctly?
   
   Sorry, I would have to dig into the code to get a better understanding. In 
general we try to avoid int96 because it is a deprecated type. If this is an 
issue with pyarrow, can we open a separate bug to track it?






[GitHub] [arrow] frmnboi commented on issue #10488: Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11.

2021-06-14 Thread GitBox


frmnboi commented on issue #10488:
URL: https://github.com/apache/arrow/issues/10488#issuecomment-861180963


   After running python with debug symbols in GDB, Here is the relevant part of 
the GDB backtrace with directory names redacted:
   
   ```
   #0  __memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:314
   #1  0x7fffcf18a9d2 in arrow::BufferBuilder::UnsafeAppend 
(this=0x7fffd100, data=0x7fffcff0, length=8)
   at 
SOURCEPATH/Python_venv_3.8.5/lib/python3.8/site-packages/pyarrow/include/arrow/buffer_builder.h:136
   #2  0x7fffcf195871 in arrow::TypedBufferBuilder::UnsafeAppend (this=0x7fffd100, 
   value=0.10351288056206089)
   at 
SOURCEPATH/Python_venv_3.8.5/lib/python3.8/site-packages/pyarrow/include/arrow/buffer_builder.h:232
   #3  0x7fffcf1902c0 in 
arrow::NumericBuilder::UnsafeAppend (this=0x7fffd070, 
   val=0.10351288056206089)
   at 
SOURCEPATH/Python_venv_3.8.5/lib/python3.8/site-packages/pyarrow/include/arrow/array/builder_primitive.h:265
   #4  0x7fffcf18b455 in vol_adj_close (close=..., volume=...)
   at CODEPATH/Database/helperfuncs.cpp:94
   ```
   
   in addition to:
   
   
   ```
   Thread 1 "python" received signal SIGSEGV, Segmentation fault.
   __memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:314
   314  ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file 
or directory.
   
   ```
   
   As suspected, the issue is in **UnsafeAppend**, but I'm not sure why this 
file is missing on my install/system. Do I need to build pyarrow from source to 
get it installed? It looks like a reference to an x86 AVX-512 SIMD routine, as 
referenced 
[here](https://arrow.apache.org/docs/format/Columnar.html?highlight=avx). The 
missing file in question can presumably be found publicly online at places like 
[this](https://sources.debian.org/src/glibc/2.28-10/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S/).
  I think part of the problem may be that I have a 3rd gen Ryzen processor, 
which according to some reports does not support AVX-512. I can't really tell 
if it isn't working because of hardware limitations, or because the capability 
and file exist but I am not linking all the dependencies I need.






[GitHub] [arrow-rs] alippai commented on pull request #453: Add C data interface for decimal128

2021-06-14 Thread GitBox


alippai commented on pull request #453:
URL: https://github.com/apache/arrow-rs/pull/453#issuecomment-860251015










[GitHub] [arrow-rs] alippai opened a new issue #454: Add missing kernels for decimal128

2021-06-14 Thread GitBox


alippai opened a new issue #454:
URL: https://github.com/apache/arrow-rs/issues/454


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Decimal128 type doesn't have the basic kernels like add, take, filter
   
   **Describe the solution you'd like**
   A basic set of operations available






[GitHub] [arrow-datafusion] alamb merged pull request #539: Refactor hash aggregates's planner building code

2021-06-14 Thread GitBox


alamb merged pull request #539:
URL: https://github.com/apache/arrow-datafusion/pull/539


   






[GitHub] [arrow-rs] codecov-commenter commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

2021-06-14 Thread GitBox


codecov-commenter commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860050867


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#443](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (eb11b2b) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/0c0077697e55eb154dbfcf3127a3f39e63be2df8?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (0c00776) will **increase** coverage by `0.00%`.
   > The diff coverage is `81.81%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/443/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@   Coverage Diff   @@
   ##   master #443   +/-   ##
   ===
 Coverage   82.71%   82.71%   
   ===
 Files 163  163   
 Lines   4479544805   +10 
   ===
   + Hits3705137061   +10 
 Misses   7744 7744   
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz)
 | `77.64% <60.00%> (-0.27%)` | :arrow_down: |
   | 
[parquet/src/util/bit\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9iaXRfdXRpbC5ycw==)
 | `93.14% <100.00%> (+0.07%)` | :arrow_up: |
   | 
[parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz)
 | `95.04% <0.00%> (+0.19%)` | :arrow_up: |
   | 
[arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz)
 | `84.61% <0.00%> (+7.69%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[0c00776...eb11b2b](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-datafusion] Jimexist opened a new issue #551: support for automatic value promotion for aggregation functions

2021-06-14 Thread GitBox


Jimexist opened a new issue #551:
URL: https://github.com/apache/arrow-datafusion/issues/551


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   
   for now,
   
   ```sql
   select sum(c5) from test;
   ```
   
   will err:
   
   ```
   > select sum(c5) from test;
   thread 'tokio-runtime-worker' panicked at 'attempt to add with overflow', 
/Users/jiayu_liu/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/arith.rs:107:1
   ArrowError(ExternalError(Canceled))
   ```
   
   even avg will do the same because it depends on sum.
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   
   have automatic value promotion into wider type
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
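   The reported panic and the requested widening can be sketched in plain Rust 
(a hedged illustration only; treating `c5` as an `i32` column is an assumption, 
and this is not DataFusion's aggregation code):

   ```rust
   fn main() {
       // Summing i32 values whose total exceeds i32::MAX.
       let c5: Vec<i32> = vec![i32::MAX, 1];

       // A naive `c5.iter().sum::<i32>()` panics in debug builds with
       // "attempt to add with overflow", matching the reported error.

       // Automatic value promotion: accumulate in a wider type (i64).
       let promoted: i64 = c5.iter().map(|&x| i64::from(x)).sum();
       assert_eq!(promoted, i32::MAX as i64 + 1);

       // Alternatively, detect the overflow instead of widening.
       let checked = c5.iter().try_fold(0i32, |acc, &x| acc.checked_add(x));
       assert_eq!(checked, None);
   }
   ```

   Widening keeps the sum exact up to i64 range; checked arithmetic instead 
surfaces the overflow as a recoverable error rather than a panic.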
   






[GitHub] [arrow-datafusion] Dandandan opened a new pull request #559: Push down filter through UNION

2021-06-14 Thread GitBox


Dandandan opened a new pull request #559:
URL: https://github.com/apache/arrow-datafusion/pull/559


   # Which issue does this PR close?
   
   
   
   Closes #557
   
# Rationale for this change
   
   
   # What changes are included in this PR?
   
   Filter is pushed down through union (all), so it can be pushed down further 
towards other operations like table scans.
   
   # Are there any user-facing changes?
   
   
   
   
   FYI @nevi-me 






[GitHub] [arrow-rs] nevi-me closed issue #193: JSON reader does not implement iterator

2021-06-14 Thread GitBox


nevi-me closed issue #193:
URL: https://github.com/apache/arrow-rs/issues/193


   






[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #552: mv register_schema() to impl

2021-06-14 Thread GitBox


codecov-commenter edited a comment on pull request #552:
URL: https://github.com/apache/arrow-datafusion/pull/552#issuecomment-860214243










[GitHub] [arrow-datafusion] Jimexist opened a new pull request #545: turn on clippy rule for needless borrow

2021-06-14 Thread GitBox


Jimexist opened a new pull request #545:
URL: https://github.com/apache/arrow-datafusion/pull/545


   # Which issue does this PR close?
   
   
   
   Closes #.
   
# Rationale for this change
   
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   
   






[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #548: reuse code for now function expr creation

2021-06-14 Thread GitBox


codecov-commenter commented on pull request #548:
URL: https://github.com/apache/arrow-datafusion/pull/548#issuecomment-860131843


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#548](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (b38ef9a) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/8f4078d83f7ea0348fa43906d26156bf8a95de4c?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (8f4078d) will **decrease** coverage by `0.00%`.
   > The diff coverage is `73.33%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/548/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #548  +/-   ##
   ==
   - Coverage   76.09%   76.08%   -0.01% 
   ==
 Files 156  156  
 Lines   2704827041   -7 
   ==
   - Hits2058120574   -7 
 Misses   6467 6467  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/548/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9mdW5jdGlvbnMucnM=)
 | `92.66% <73.33%> (-0.04%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[8f4078d...b38ef9a](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-datafusion] houqp edited a comment on pull request #55: Support qualified columns in queries

2021-06-14 Thread GitBox


houqp edited a comment on pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-860150632


   @andygrove @Dandandan @jorgecarleitao @alamb @Jimexist this PR is now ready 
for review. We are now able to pass both tpch-7 and tpch-8.
   
   I filed https://github.com/apache/arrow-datafusion/issues/549 to track 
compound column support as a fast follow task since this PR is already very 
large and keeps growing every time I rebase on top of the latest master. If you 
all think I should include compound column support in this PR, I am more than 
happy to add it.
   
   Output field names are not yet 100% conforming to 
https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md.
 The remaining differences are all minor formatting issues, like converting 
function names to lower case, so I decided to leave that to follow-up PRs to 
reduce the diff.






[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #533: Add extension plugin to parse SQL into logical plan

2021-06-14 Thread GitBox


jorgecarleitao commented on issue #533:
URL: 
https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860156321


   I also do not see the issue with the example above, but I would say that, in 
general, custom SQL parsers effectively modify the SQL dialect being used. The 
responsibility to document variations, including any limitations they may 
introduce relative to the "default" postgres dialect, therefore lies with the 
applications that use/install custom parsers.






[GitHub] [arrow-rs] jorgecarleitao commented on a change in pull request #439: [WIP] FFI bridge for Schema, Field and DataType

2021-06-14 Thread GitBox


jorgecarleitao commented on a change in pull request #439:
URL: https://github.com/apache/arrow-rs/pull/439#discussion_r650633827



##
File path: arrow/src/ffi.rs
##
@@ -208,15 +213,25 @@ impl FFI_ArrowSchema {
 pub fn name(&self) -> &str {
     assert!(!self.name.is_null());
     // safe because the lifetime of `self.name` equals `self`
-    unsafe { CStr::from_ptr(self.name) }.to_str().unwrap()
+    unsafe { CStr::from_ptr(self.name) }
+        .to_str()
+        .expect("The external API has a non-utf8 as name")
+}
+
+pub fn flags(&self) -> Option<Flags> {
+    Flags::from_bits(self.flags)
 }
 
 pub fn child(&self, index: usize) -> &Self {
     assert!(index < self.n_children as usize);
-    assert!(!self.name.is_null());
+    // assert!(!self.name.is_null());

Review comment:
   why was this commented?








[GitHub] [arrow-datafusion] Jimexist commented on pull request #550: Upgrade arrow 430

2021-06-14 Thread GitBox


Jimexist commented on pull request #550:
URL: https://github.com/apache/arrow-datafusion/pull/550#issuecomment-860197215


   > I plan to release arrow 4.3 to crates.io tomorrow (process is underway -- 
see 
https://lists.apache.org/thread.html/r5bfdcf0e069b5df14477b1ac6cd3584e5ee0b0be2d0d4bd18db8f9ab%40%3Cdev.arrow.apache.org%3E)
   > 
   > Once I release to crates.io this change is not necessary (any local 
checkout will just need to do `cargo update` and will update to arrow 4.3.0) 
and other datafusion users can use whatever version of arrow that is in 
Cargo.lock
   > 
   > Sorry that this is confusing / slow.
   
   It's okay, not slow at all. I was just trying to see if things would break. 
Thanks for the work of releasing!






[GitHub] [arrow-rs] nevi-me commented on issue #455: Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow

2021-06-14 Thread GitBox


nevi-me commented on issue #455:
URL: https://github.com/apache/arrow-rs/issues/455#issuecomment-860299977


   I'd have expected the ns timestamp to be written to `int96` as that is the 
legacy nanosecond timestamp format. I'll also dig into this to see, maybe if 
the format is `1.0` then we should coerce the timestamp to a different 
resolution, but I'd also see this as a pyarrow quirk that could be documented 
somewhere.
   
   @emkornfield do you know what we should do on the Rust side to roundtrip the 
file correctly?






[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #520: Implement window functions with `order_by` clause

2021-06-14 Thread GitBox


Dandandan commented on a change in pull request #520:
URL: https://github.com/apache/arrow-datafusion/pull/520#discussion_r650546802



##
File path: ballista/rust/core/Cargo.toml
##
@@ -40,7 +40,7 @@ tokio = "1.0"
 tonic = "0.4"
 uuid = { version = "0.8", features = ["v4"] }
 
-arrow-flight = { version = "4.0"  }
+arrow-flight = { git = "https://github.com/apache/arrow-rs.git", rev = 
"e5cda312b697c3d610637b28c58b6f1b104b41cc" }

Review comment:
   For this we need to wait on 4.3.0. This includes the required changes, 
no?
   
   https://github.com/apache/arrow-rs/pull/444
   








[GitHub] [arrow] projjal commented on pull request #10501: ARROW-13032: [Java] Update guava version

2021-06-14 Thread GitBox


projjal commented on pull request #10501:
URL: https://github.com/apache/arrow/pull/10501#issuecomment-860475551


   > Is there any reason not to use the latest version 
https://mvnrepository.com/artifact/com.google.guava/guava/30.1.1-jre ?
   
   Not really, I just updated to the patched version.
   I will update the PR with the latest version.






[GitHub] [arrow] kiszk commented on pull request #10525: ARROW-12709: [CI] Use LLVM 10 for s390x

2021-06-14 Thread GitBox


kiszk commented on pull request #10525:
URL: https://github.com/apache/arrow/pull/10525#issuecomment-860477976


   Yes, I will try to merge myself






[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #546: parallelize window function evaluations

2021-06-14 Thread GitBox


Dandandan edited a comment on pull request #546:
URL: https://github.com/apache/arrow-datafusion/pull/546#issuecomment-860194116


   > maybe a better way is to use rayon
   
   In my opinion we should try to stay away from Rayon, and probably also from 
introducing parallelism at a low level (per expression), as it will likely add 
quite some overhead (in terms of extra threads, memory allocations held on to, 
and Rayon scheduling overhead).
   I added some more thoughts in this discussion
   
https://the-asf.slack.com/archives/C01QUFS30TD/p1623571458296000?thread_ts=1623523766.291500=C01QUFS30TD
   
   Or we should have some convincing benchmarks / reasoning why at a certain 
part it is reasonable to introduce parallelism using tool x.
   
   I think for other parts of the code we could use Tokio's `spawn_blocking` 
function at the moment, while we design a better way to do parallelism.
   
   On the general level I think we mostly should check whether we perform 
parallelism at a partition / file level and introduce enough parallelism in the 
query plan (e.g. by using partitioning 
https://medium.com/@danilheres/increasing-the-level-of-parallelism-in-datafusion-4-0-d2a15b5a2093
 ).
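
   A minimal illustration of the partition-level approach described above, 
using only std threads (all names here, such as `sum_partition`, are 
hypothetical and not DataFusion APIs): each partition is processed on its own 
thread, so parallelism lives at the plan/partition level rather than inside 
individual expressions.

```rust
use std::thread;

// Hypothetical per-partition work: here just a sum over the partition's rows.
fn sum_partition(part: &[i64]) -> i64 {
    part.iter().sum()
}

// Spawn one thread per partition, then combine the partial results.
fn sum_all_partitions(partitions: Vec<Vec<i64>>) -> i64 {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || sum_partition(&p)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let parts = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    assert_eq!(sum_all_partitions(parts), 21);
    println!("ok");
}
```

   The same shape applies when a runtime like Tokio is used instead of raw 
threads; the key point is that the unit of parallelism is a whole partition.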
   






[GitHub] [arrow-datafusion] alamb commented on a change in pull request #547: support table alias in join clause

2021-06-14 Thread GitBox


alamb commented on a change in pull request #547:
URL: https://github.com/apache/arrow-datafusion/pull/547#discussion_r650511150



##
File path: datafusion/src/sql/planner.rs
##
@@ -424,17 +424,24 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         ctes: &mut HashMap<String, LogicalPlan>,
     ) -> Result<LogicalPlan> {
         match relation {
-            TableFactor::Table { name, .. } => {
+            TableFactor::Table { name, alias, .. } => {
                 let table_name = name.to_string();
                 let cte = ctes.get(&table_name);
                 match (
                     cte,
                     self.schema_provider.get_table_provider(name.try_into()?),
                 ) {
                     (Some(cte_plan), _) => Ok(cte_plan.clone()),
-                    (_, Some(provider)) => {
-                        LogicalPlanBuilder::scan(&table_name, provider, None)?.build()
-                    }
+                    (_, Some(provider)) => LogicalPlanBuilder::scan(
+                        // take alias into account to support `JOIN table1 as table2`
+                        alias
+                            .as_ref()
+                            .map(|a| a.name.value.as_str())
+                            .unwrap_or(&table_name),
+                        provider,
+                        None,
+                    )?
+                    .build(),
                     (_, None) => Err(DataFusionError::Plan(format!(

Review comment:
   ```suggestion
   (None, None) => Err(DataFusionError::Plan(format!(
   ```
   Might make it clearer that now both branches are covered?
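
   A minimal, hypothetical illustration of why the `(None, None)` arm reads 
more clearly (the `describe` function below is made up for this sketch): the 
compiler still proves the match exhaustive, while the last arm now names the 
one remaining case explicitly instead of hiding it behind a wildcard.

```rust
// Mirrors the (cte, provider) match shape from the planner snippet above.
fn describe(cte: Option<&str>, provider: Option<&str>) -> String {
    match (cte, provider) {
        (Some(c), _) => format!("cte: {}", c),
        (_, Some(p)) => format!("provider: {}", p),
        // Spelling this as (None, None) instead of (_, None) documents that
        // it is exactly the "neither branch matched" case; rustc verifies
        // the three arms together are exhaustive.
        (None, None) => "not found".to_string(),
    }
}

fn main() {
    assert_eq!(describe(Some("t"), None), "cte: t");
    assert_eq!(describe(None, Some("p")), "provider: p");
    assert_eq!(describe(None, None), "not found");
    println!("ok");
}
```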








[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #414: Doctests for DecimalArray.

2021-06-14 Thread GitBox


codecov-commenter edited a comment on pull request #414:
URL: https://github.com/apache/arrow-rs/pull/414#issuecomment-855393423


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#414](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (fda84cf) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/e21f576e5a01577f63f7d19a0e921417eb14b8db?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (e21f576) will **decrease** coverage by `0.00%`.
   > The diff coverage is `n/a`.
   
   > :exclamation: Current head fda84cf differs from pull request most recent 
head 2233ab2. Consider uploading reports for the commit 2233ab2 to get more 
accurate results
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/414/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #414  +/-   ##
   ==
   - Coverage   82.64%   82.64%   -0.01% 
   ==
 Files 164  164  
 Lines   4550845508  
   ==
   - Hits3760937608   -1 
   - Misses   7899 7900   +1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/414/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==)
 | `90.13% <ø> (ø)` | |
   | 
[parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/414/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz)
 | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[e21f576...2233ab2](https://codecov.io/gh/apache/arrow-rs/pull/414?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-datafusion] alamb commented on pull request #545: turn on clippy rule for needless borrow

2021-06-14 Thread GitBox


alamb commented on pull request #545:
URL: https://github.com/apache/arrow-datafusion/pull/545#issuecomment-860193523


   Thank you @Jimexist !






[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

2021-06-14 Thread GitBox


Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-860225216


   > Actually let's park this pull request for a while - I plan to implement 
sort and partition first and then window frame, after which the window shift 
approach might not be relevant.
   
   now that #520 is implemented, this PR is ready






[GitHub] [arrow] pitrou commented on issue #10488: Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11.

2021-06-14 Thread GitBox


pitrou commented on issue #10488:
URL: https://github.com/apache/arrow/issues/10488#issuecomment-860451738


   I would suggest debugging these crashes using gdb.






[GitHub] [arrow-rs] alamb merged pull request #438: use iterator for partition kernel instead of generating vec

2021-06-14 Thread GitBox


alamb merged pull request #438:
URL: https://github.com/apache/arrow-rs/pull/438


   






[GitHub] [arrow-rs] yordan-pavlov commented on issue #200: Use iterators to increase performance of creating Arrow arrays

2021-06-14 Thread GitBox


yordan-pavlov commented on issue #200:
URL: https://github.com/apache/arrow-rs/issues/200#issuecomment-860275967


   I finally had some time to check how the new `ArrowArrayReader` affects 
TPC-H benchmark results - for queries which use string columns (queries 1 and 
12), there is a performance improvement of about 30%; other queries that I 
tested, which mostly use non-string columns, are unaffected. This makes sense, 
as the new `ArrowArrayReader` is currently only enabled for string arrays.
   
   Here are the results:
   
   **before new ArrowArrayReader:**
   Query 1 avg time: 822.14 ms
   Query 3 avg time: 432.85 ms
   Query 5 avg time: 698.90 ms
   Query 6 avg time: 319.38 ms
   Query 12 avg time: 682.50 ms
   
   **after new ArrowArrayReader:**
   Query 1 avg time: 514.88 ms
   Query 3 avg time: 441.08 ms
   Query 5 avg time: 702.91 ms
   Query 6 avg time: 324.05 ms
   Query 12 avg time: 425.38 ms
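
   As a quick sanity check on the quoted timings, a small sketch (the 
`improvement_pct` helper is hypothetical, not part of the benchmark harness) 
computing the relative improvement for the two string-heavy queries from the 
numbers above:

```rust
// Relative wall-clock improvement, in percent, from before/after timings.
fn improvement_pct(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

fn main() {
    // Query 1: 822.14 ms -> 514.88 ms; Query 12: 682.50 ms -> 425.38 ms
    let q1 = improvement_pct(822.14, 514.88);
    let q12 = improvement_pct(682.50, 425.38);
    // Both string-heavy queries land in roughly the 35-40% range.
    assert!(q1 > 30.0 && q1 < 45.0);
    assert!(q12 > 30.0 && q12 < 45.0);
    println!("q1: {:.1}%, q12: {:.1}%", q1, q12);
}
```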
   
   






[GitHub] [arrow-rs] alamb merged pull request #419: Remove DictionaryArray::keys_array method

2021-06-14 Thread GitBox


alamb merged pull request #419:
URL: https://github.com/apache/arrow-rs/pull/419


   






[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #559: Filter push down for Union

2021-06-14 Thread GitBox


codecov-commenter commented on pull request #559:
URL: https://github.com/apache/arrow-datafusion/pull/559#issuecomment-860438515


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#559](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (1385342) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/d3828541a61b5681b93590a47e22d63715949136?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (d382854) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/559/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@   Coverage Diff   @@
   ##   master #559   +/-   ##
   ===
 Coverage   76.09%   76.10%   
   ===
 Files 156  156   
 Lines   2704727056+9 
   ===
   + Hits2058120590+9 
 Misses   6466 6466   
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/559/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL2ZpbHRlcl9wdXNoX2Rvd24ucnM=)
 | `97.84% <100.00%> (+0.04%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[d382854...1385342](https://codecov.io/gh/apache/arrow-datafusion/pull/559?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #548: reuse code for now function expr creation

2021-06-14 Thread GitBox


codecov-commenter edited a comment on pull request #548:
URL: https://github.com/apache/arrow-datafusion/pull/548#issuecomment-860131843


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#548](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (5083dfe) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/d3828541a61b5681b93590a47e22d63715949136?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (d382854) will **decrease** coverage by `0.00%`.
   > The diff coverage is `73.33%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/548/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #548  +/-   ##
   ==
   - Coverage   76.09%   76.08%   -0.01% 
   ==
 Files 156  156  
 Lines   2704727040   -7 
   ==
   - Hits2058120574   -7 
 Misses   6466 6466  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/548/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9mdW5jdGlvbnMucnM=)
 | `92.66% <73.33%> (-0.04%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Last update 
[d382854...5083dfe](https://codecov.io/gh/apache/arrow-datafusion/pull/548?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
 Read the [comment 
docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
   






[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #414: Doctests for DecimalArray.

2021-06-14 Thread GitBox


novemberkilo commented on a change in pull request #414:
URL: https://github.com/apache/arrow-rs/pull/414#discussion_r650718790



##
File path: arrow/src/array/array_binary.rs
##
@@ -613,6 +613,32 @@ impl Array for FixedSizeBinaryArray {
 }
 
 /// A type of `DecimalArray` whose elements are binaries.
+///
+/// # Examples
+///
+/// ```
+///use arrow::array::{ArrayData, Array, DecimalArray};
+///use arrow::buffer::Buffer;
+///use arrow::datatypes::DataType;
+///// let val_8887: [u8; 16] = [192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+///// let val_neg_8887: [u8; 16] = [64, 36, 75, 238, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255];
+///let values: [u8; 32] = [
+///192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 36, 75, 238, 253,
+///255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+///];
+///let array_data = ArrayData::builder(DataType::Decimal(23, 6))
+///.len(2)
+///.add_buffer(Buffer::from(&values[..]))
+///.build();
+///let decimal_array = DecimalArray::from(array_data);
+///assert_eq!(2, decimal_array.len());
+///assert_eq!(8_887_000_000, decimal_array.value(0));
+///assert_eq!(-8_887_000_000, decimal_array.value(1));
+///assert_eq!(16, decimal_array.value_length());
+///assert_eq!(23, decimal_array.precision());
+///assert_eq!(6, decimal_array.scale());

Review comment:
   Thanks @alippai - I have added those doctests per your suggestion. // 
@alamb 








[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10507: ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, minute, and second

2021-06-14 Thread GitBox


jorisvandenbossche commented on a change in pull request #10507:
URL: https://github.com/apache/arrow/pull/10507#discussion_r650738642



##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",
+  "second" = "second"

Review comment:
   Agreed. I would also say that just because lubridate does not support 
nanoseconds, it does not follow that *if* you actually have nanoseconds (which 
is possible, since arrow supports them) those should be discarded.








[GitHub] [arrow-rs] garyanaplan commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

2021-06-14 Thread GitBox


garyanaplan commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650746454



##
File path: parquet/src/data_type.rs
##
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {

Review comment:
   I think this calculation is entirely in terms of bytes, so units should 
all be correct as is.








[GitHub] [arrow-rs] garyanaplan commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

2021-06-14 Thread GitBox


garyanaplan commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650750468



##
File path: parquet/src/data_type.rs
##
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {
+                bit_writer.extend(256);
+            }
             for value in values {
-                bit_writer.put_value(*value as u64, 1);
+                if !bit_writer.put_value(*value as u64, 1) {

Review comment:
   I found it hard to think of a good way to test this with the fix in 
place.
   
   I preferred the "don't auto expand memory at the point of failure" approach 
because I'm fairly conservative and didn't want to make a change that was too 
wide in impact without a better understanding of the code. i.e.: my fix 
specifically targeted the error I reported and made it possible to detect in 
other locations.
   
   I think a better fix would be to (somehow) pre-size the vector or avoid 
having to size a vector for all the bytes that could be written, but that would 
be a much bigger scope to the fix.
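
   A toy sketch of the design choice discussed above (this is illustrative 
only - `ToyBitWriter` and `write_bools` are made-up names, not the actual 
parquet `BitWriter` API): the writer's put reports failure instead of silently 
dropping bits, and the caller surfaces that as an error rather than 
auto-expanding the buffer.

```rust
// Fixed-capacity bit writer whose put reports failure instead of panicking
// or silently dropping bits.
struct ToyBitWriter {
    buf: Vec<u8>,
    bit_pos: usize,
}

impl ToyBitWriter {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0; bytes], bit_pos: 0 }
    }

    /// Returns false when the buffer is full.
    fn put_bit(&mut self, bit: bool) -> bool {
        if self.bit_pos >= self.buf.len() * 8 {
            return false;
        }
        if bit {
            self.buf[self.bit_pos / 8] |= 1u8 << (self.bit_pos % 8);
        }
        self.bit_pos += 1;
        true
    }
}

// Caller turns the put failure into a reported error, mirroring the
// error-on-encoding-fail approach in the PR.
fn write_bools(w: &mut ToyBitWriter, values: &[bool]) -> Result<(), String> {
    for v in values {
        if !w.put_bit(*v) {
            return Err("unable to put boolean value".to_string());
        }
    }
    Ok(())
}

fn main() {
    let mut w = ToyBitWriter::with_capacity(1); // room for 8 bits
    assert!(write_bools(&mut w, &[true; 8]).is_ok());
    assert!(write_bools(&mut w, &[true]).is_err()); // 9th bit fails loudly
    println!("ok");
}
```

   Pre-sizing the buffer up front, as suggested, would avoid hitting the 
failure path at all; the error return then remains as a safety net.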








[GitHub] [arrow] kou edited a comment on pull request #10525: ARROW-12709: [CI] Use LLVM 10 for s390x

2021-06-14 Thread GitBox


kou edited a comment on pull request #10525:
URL: https://github.com/apache/arrow/pull/10525#issuecomment-860263925


   @kiszk Do you want to merge this yourself?






[GitHub] [arrow] jonkeane closed pull request #10508: ARROW-13041: [C++] Ensure unary kernels zero-initialize data behind null entries

2021-06-14 Thread GitBox


jonkeane closed pull request #10508:
URL: https://github.com/apache/arrow/pull/10508


   






[GitHub] [arrow] anthonylouisbsb commented on pull request #10195: ARROW-12595: [C++][Gandiva] Implement TO_HEX([binary] field), HEX, UNHEX and FROM_HEX([string]field] functions

2021-06-14 Thread GitBox


anthonylouisbsb commented on pull request #10195:
URL: https://github.com/apache/arrow/pull/10195#issuecomment-859521494


   @projjal The functions related to `HEX` and `UNHEX` were added inside this 
PR too, so I think you need to review the PR again.






[GitHub] [arrow-datafusion] andygrove commented on a change in pull request #541: WIP: ShuffleReaderExec now supports multiple locations per partition

2021-06-14 Thread GitBox


andygrove commented on a change in pull request #541:
URL: https://github.com/apache/arrow-datafusion/pull/541#discussion_r650138952



##
File path: ballista/rust/client/src/context.rs
##
@@ -74,32 +71,6 @@ impl BallistaContextState {
 }
 }
 
-struct WrappedStream {
-    stream: Pin<Box<dyn Stream<Item = ArrowResult<RecordBatch>> + Send + Sync>>,
-    schema: SchemaRef,
-}
-
-impl RecordBatchStream for WrappedStream {
-    fn schema(&self) -> SchemaRef {
-        self.schema.clone()
-    }
-}
-
-impl Stream for WrappedStream {
-    type Item = ArrowResult<RecordBatch>;
-
-    fn poll_next(
-        mut self: Pin<&mut Self>,
-        cx: &mut std::task::Context<'_>,
-    ) -> std::task::Poll<Option<Self::Item>> {
-        self.stream.poll_next_unpin(cx)
-    }
-
-    fn size_hint(&self) -> (usize, Option<usize>) {
-        self.stream.size_hint()
-    }
-}
-

Review comment:
   This is moved to ballista-core utils

##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
 
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))
             .await
-            .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))
+            .into_iter()
+            .collect::<Result<Vec<_>>>()?;
+
+        let result = WrappedStream::new(
+            Box::pin(futures::stream::iter(result).flatten()),
+            Arc::new(self.schema.as_ref().clone()),
+        );
+        Ok(Box::pin(result))
     }

Review comment:
   This is the main change

##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
 
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))
             .await
-            .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))
+            .into_iter()
+            .collect::<Result<Vec<_>>>()?;
+
+        let result = WrappedStream::new(
+            Box::pin(futures::stream::iter(result).flatten()),

Review comment:
   Once the basic shuffle mechanism is implemented, there will be a lot of 
optimization work to follow








[GitHub] [arrow-datafusion] edrevo commented on a change in pull request #541: ShuffleReaderExec now supports multiple locations per partition

2021-06-14 Thread GitBox


edrevo commented on a change in pull request #541:
URL: https://github.com/apache/arrow-datafusion/pull/541#discussion_r650183206



##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
 
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))

Review comment:
   nit: if you change `fetch_partition` to take a reference, you can avoid 
the `.clone()`:
   
   ```suggestion
           let partition_locations = &self.partition[partition];
           let result = future::join_all(partition_locations.iter().map(fetch_partition))
   ```

##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
 
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))
             .await
-            .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))
+            .into_iter()
+            .collect::<Result<Vec<_>>>()?;
+
+        let result = WrappedStream::new(
+            Box::pin(futures::stream::iter(result).flatten()),

Review comment:
   > I wonder if it will serialize them all (aka not start reading from the 
second until the first is entirely consumed)?
   
   Yes, that's exactly what it does.
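
   This point can be seen with plain iterators, since flattening a stream of 
streams behaves analogously to `Iterator::flatten` here: the first inner 
sequence is fully drained before the second is touched (the partition names 
below are made up for illustration).

```rust
fn main() {
    // Two "partitions" of results, as if fetched from two executors.
    let partitions = vec![vec!["a1", "a2"], vec!["b1"]];

    // flatten() yields all of the first partition before starting the
    // second, i.e. the reads are serialized rather than interleaved.
    let order: Vec<&str> = partitions.into_iter().flatten().collect();
    assert_eq!(order, vec!["a1", "a2", "b1"]);
    println!("{:?}", order);
}
```

   Interleaving the partitions concurrently would need something like a 
select/merge combinator instead of flatten.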








[GitHub] [arrow-rs] Jimexist opened a new pull request #448: Use partition for bool sort

2021-06-14 Thread GitBox


Jimexist opened a new pull request #448:
URL: https://github.com/apache/arrow-rs/pull/448


   # Which issue does this PR close?
   
   
   
   Closes #447
   
   # Rationale for this change


   
   # What changes are included in this PR?
   
   
   
   # Are there any user-facing changes?
   
   
   
   
   
   






[GitHub] [arrow-datafusion] andygrove opened a new issue #540: Ballista ShuffleReaderExec should be able to read from multiple locations per partition

2021-06-14 Thread GitBox


andygrove opened a new issue #540:
URL: https://github.com/apache/arrow-datafusion/issues/540


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   As a step towards implementing true shuffle, we need `ShuffleReaderExec` to 
be able to read from multiple locations per partition instead of from a single 
location.
   
   **Describe the solution you'd like**
   `ShuffleReaderExec` should accept `Vec<Vec<PartitionLocation>>` instead of `Vec<PartitionLocation>`.
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
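A rough sketch of the proposed nested shape, using a hypothetical `Location` stand-in for `PartitionLocation`: the outer index is the output partition, and each partition may map to several source locations:

```rust
// Hypothetical stand-in for PartitionLocation; the real type carries
// executor metadata and partition ids.
#[derive(Debug, Clone, PartialEq)]
struct Location {
    executor: String,
    stage_id: usize,
}

// Old shape: one location per partition (Vec<Location>).
// New shape: any number of locations per partition (Vec<Vec<Location>>).
fn locations_for(partitions: &[Vec<Location>], partition: usize) -> &[Location] {
    &partitions[partition]
}

fn main() {
    let partitions: Vec<Vec<Location>> = vec![
        vec![
            Location { executor: "exec-1".to_string(), stage_id: 1 },
            Location { executor: "exec-2".to_string(), stage_id: 1 },
        ],
        vec![Location { executor: "exec-1".to_string(), stage_id: 1 }],
    ];
    // Partition 0 is now served by two source locations instead of one.
    assert_eq!(locations_for(&partitions, 0).len(), 2);
    assert_eq!(locations_for(&partitions, 1).len(), 1);
}
```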
   






[GitHub] [arrow] nealrichardson commented on a change in pull request #10507: ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, minute, and second func

2021-06-14 Thread GitBox


nealrichardson commented on a change in pull request #10507:
URL: https://github.com/apache/arrow/pull/10507#discussion_r650076060



##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",
+  "second" = "second"

Review comment:
   IDK that we should truncate the data like this. A slight (technical) API 
difference is probably better than throwing away precision.

##
File path: r/tests/testthat/test-dplyr-lubridate.R
##
@@ -0,0 +1,119 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+library(lubridate)
+library(dplyr)
+
+test_date <- ymd_hms("1987-10-09 23:00:00", tz = NULL)
+test_df <- tibble::tibble(date = test_date)
+
+test_that("extract datetime components from date", {
+  expect_dplyr_equal(
+input %>%
+  mutate(x = year(date)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_dplyr_equal(
+input %>%
+  mutate(x = isoyear(date)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_dplyr_equal(
+input %>%
+  mutate(x = quarter(date)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_dplyr_equal(
+input %>%
+  mutate(x = month(date)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_dplyr_equal(
+input %>%
+  mutate(x = wday(date)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_dplyr_equal(
+input %>%
+  mutate(x = wday(date, week_start = 3)) %>%
+  collect(),
+test_df
+  )
+  
+  expect_warning(
+test_df %>%
+  Table$create() %>%
+  mutate(x = wday(date, label = TRUE)) %>%
+  collect(),
+regexp = "Label argument not supported by Arrow; pulling data into R"
+  )
+  
+  expect_warning(
+test_df %>%
+  Table$create() %>%
+  mutate(x = wday(date, locale = Sys.getlocale("LC_TIME"))) %>%
+  collect(),
+regexp = 'Expression wday(date, locale = Sys.getlocale("LC_TIME")) not 
supported in Arrow; pulling data into R',
+fixed = TRUE
+  )

Review comment:
   Yeah I think we can do this more unit-testy version where appropriate; 
we test elsewhere that the errors we throw in the nse_funcs get handled by 
filter/mutate correctly.

##
File path: r/R/dplyr-functions.R
##
@@ -442,3 +442,37 @@ nse_funcs$strptime <- function(x, format = "%Y-%m-%d 
%H:%M:%S", tz = NULL, unit
 
   Expression$create("strptime", x, options = list(format = format, unit = 
unit))
 }
+
+nse_funcs$wday <- function(x, label = FALSE, abbr = TRUE, week_start = 
getOption("lubridate.week.start", 7)) {
+  if (label) {
+arrow_not_supported("Label argument")

Review comment:
   What is the label argument? Is it something we should support? I think 
it's good to document these unsupported exceptions in comments, to note either 
why Arrow should behave differently, or what Jira issue exists to remove the 
exception.

##
File path: r/R/dplyr-functions.R
##
@@ -442,3 +442,37 @@ nse_funcs$strptime <- function(x, format = "%Y-%m-%d 
%H:%M:%S", tz = NULL, unit
 
   Expression$create("strptime", x, options = list(format = format, unit = 
unit))
 }
+
+nse_funcs$wday <- function(x, label = FALSE, abbr = TRUE, week_start = 
getOption("lubridate.week.start", 7)) {
+  if (label) {
+arrow_not_supported("Label argument")
+  }
+  offset <- get_date_offset(week_start)
+  Expression$create("add", Expression$create("day_of_week", x), 
Expression$scalar(offset))
+}
+
+#' Get date offset
+#' 
+#' Arrow's `day_of_week` kernel counts from 0 (Monday) to 6 (Sunday), whereas
+#' `lubridate::wday` counts from 1 to 7, and allows users to specify which day
+#' of the week is first (Sunday by default).  This function converts the 
returned

Review comment:
   Please leave a comment with that JIRA issue somewhere in this code that 
we expect to remove.
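The offset conversion described in the quoted docstring can be sketched numerically. Assumptions as stated there: Arrow's `day_of_week` counts 0 = Monday through 6 = Sunday, while lubridate's `wday` is 1-based with a configurable first day of the week:

```rust
// Convert Arrow's day_of_week (0 = Monday .. 6 = Sunday) to lubridate's
// wday (1 .. 7), where `week_start` names the weekday counted as day 1
// (1 = Monday .. 7 = Sunday; lubridate's default is 7).
fn wday(arrow_dow: i32, week_start: i32) -> i32 {
    // rem_euclid keeps the result in 0..=6 even when the difference is negative.
    (arrow_dow - (week_start - 1)).rem_euclid(7) + 1
}

fn main() {
    // lubridate default week_start = 7 (week begins on Sunday):
    assert_eq!(wday(6, 7), 1); // Sunday  -> 1
    assert_eq!(wday(0, 7), 2); // Monday  -> 2
    // week_start = 3 (week begins on Wednesday), as in the test file above:
    assert_eq!(wday(2, 3), 1); // Wednesday -> 1
    assert_eq!(wday(0, 3), 6); // Monday    -> 6
}
```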





[GitHub] [arrow] kou closed pull request #10515: ARROW-13030: [CI][Go] Setup Arm64 golang CI

2021-06-14 Thread GitBox


kou closed pull request #10515:
URL: https://github.com/apache/arrow/pull/10515


   






[GitHub] [arrow] github-actions[bot] commented on pull request #10517: ARROW-13050: [C++][Gandiva] Implement SPACE Hive function on Gandiva

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10517:
URL: https://github.com/apache/arrow/pull/10517#issuecomment-859490703










[GitHub] [arrow] AlenkaF commented on a change in pull request #10519: ARROW-12867: [R] Bindings for abs()

2021-06-14 Thread GitBox


AlenkaF commented on a change in pull request #10519:
URL: https://github.com/apache/arrow/pull/10519#discussion_r650068914



##
File path: r/R/dplyr-functions.R
##
@@ -108,6 +108,10 @@ nse_funcs$is.infinite <- function(x) {
   is_inf & !nse_funcs$is.na(is_inf)
 }
 
+nse_funcs$abs <- function(x){
+  Expression$create("abs_checked", x)

Review comment:
   I used `abs_checked` to be in sync with [ARROW-12575: [R] Use unary 
negative kernel](https://github.com/apache/arrow/pull/10196), see 
[discussion](https://github.com/apache/arrow/pull/10196/files/288e06ed681c170de587591b88c6b37f96a3eddd#r625885713).








[GitHub] [arrow-rs] codecov-commenter commented on pull request #449: remove clippy unnecessary wraps in cast kernel

2021-06-14 Thread GitBox


codecov-commenter commented on pull request #449:
URL: https://github.com/apache/arrow-rs/pull/449#issuecomment-859619048


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/449) Report
   > Merging [#449](https://codecov.io/gh/apache/arrow-rs/pull/449) (e1c2dc4) into [master](https://codecov.io/gh/apache/arrow-rs/commit/0c0077697e55eb154dbfcf3127a3f39e63be2df8) (0c00776) will **decrease** coverage by `0.05%`.
   > The diff coverage is `n/a`.
   
   ```diff
   @@Coverage Diff @@
   ##   master #449  +/-   ##
   ==
   - Coverage   82.71%   82.65%   -0.06% 
   ==
 Files 163  164   +1 
 Lines   4479545468 +673 
   ==
   + Hits3705137581 +530 
   - Misses   7744 7887 +143 
   ```
   
   
   | Impacted Files | Coverage Δ | |
   |---|---|---|
   | arrow/src/compute/kernels/cast.rs | `94.53% <ø> (ø)` | |
   | parquet/src/util/test_common/page_util.rs | `91.00% <0.00%> (-0.67%)` | :arrow_down: |
   | parquet/src/arrow/record_reader.rs | `93.44% <0.00%> (-0.54%)` | :arrow_down: |
   | parquet/src/column/reader.rs | `74.36% <0.00%> (-0.38%)` | :arrow_down: |
   | arrow/src/array/transform/mod.rs | `86.06% <0.00%> (-0.09%)` | :arrow_down: |
   | parquet/src/column/page.rs | `98.68% <0.00%> (ø)` | |
   | parquet/src/schema/types.rs | `88.07% <0.00%> (ø)` | |
   | parquet/src/arrow/arrow_array_reader.rs | `78.12% <0.00%> (ø)` | |
   | parquet/src/file/serialized_reader.rs | `94.98% <0.00%> (+0.02%)` | :arrow_up: |
   | arrow/src/compute/kernels/filter.rs | `91.98% <0.00%> (+0.07%)` | :arrow_up: |
   | ... and 4 more | | |

[GitHub] [arrow] github-actions[bot] commented on pull request #10521: ARROW-13065: [Packaging][RPM] Add missing required LZ4 version information

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10521:
URL: https://github.com/apache/arrow/pull/10521#issuecomment-859921976










[GitHub] [arrow-datafusion] viirya commented on issue #531: `cargo build` cannot build the project

2021-06-14 Thread GitBox


viirya commented on issue #531:
URL: 
https://github.com/apache/arrow-datafusion/issues/531#issuecomment-859727700










[GitHub] [arrow] kou opened a new pull request #10514: ARROW-13045: [Packaging][RPM][deb] Don't install system utf8proc if it's old

2021-06-14 Thread GitBox


kou opened a new pull request #10514:
URL: https://github.com/apache/arrow/pull/10514


   See also:
   
 * #10477 30f52a202d0a2f6393366ea1e4a8e5182077c72a
 * ARROW-13002






[GitHub] [arrow] AlenkaF opened a new pull request #10519: ARROW-12867: [R] Bindings for abs()

2021-06-14 Thread GitBox


AlenkaF opened a new pull request #10519:
URL: https://github.com/apache/arrow/pull/10519


   






[GitHub] [arrow] github-actions[bot] commented on pull request #10515: ARROW-13030: [CI][Go] Setup Arm64 golang CI

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10515:
URL: https://github.com/apache/arrow/pull/10515#issuecomment-859308213


   https://issues.apache.org/jira/browse/ARROW-13030






[GitHub] [arrow-datafusion] Dandandan closed issue #131: Implement hash partitioning

2021-06-14 Thread GitBox


Dandandan closed issue #131:
URL: https://github.com/apache/arrow-datafusion/issues/131


   






[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #539: Refactor hash aggregates's planner building code

2021-06-14 Thread GitBox


Jimexist commented on a change in pull request #539:
URL: https://github.com/apache/arrow-datafusion/pull/539#discussion_r649920230



##
File path: datafusion/src/physical_plan/mod.rs
##
@@ -343,7 +343,8 @@ pub enum Partitioning {
 RoundRobinBatch(usize),
 /// Allocate rows based on a hash of one of more expressions and the 
specified
 /// number of partitions
-/// This partitioning scheme is not yet fully supported. See 
[ARROW-11011](https://issues.apache.org/jira/browse/ARROW-11011)
+/// FIXME: This partitioning scheme is not yet fully supported.

Review comment:
   done








[GitHub] [arrow-datafusion] edrevo commented on a change in pull request #532: reuse datafusion physical planner in ballista building from protobuf

2021-06-14 Thread GitBox


edrevo commented on a change in pull request #532:
URL: https://github.com/apache/arrow-datafusion/pull/532#discussion_r649800622



##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -61,24 +63,27 @@ pub struct WindowAggExec {
 /// Create a physical expression for window function
 pub fn create_window_expr(
 fun: ,
+name: String,
 args: &[Arc],
+_partition_by: &[Arc],
+_order_by: &[PhysicalSortExpr],
+_window_frame: WindowFrame,

Review comment:
   I just realized that this PR changes the behavior when there is a non-empty partition_by/order_by/window_frame: before (at least in ballista) it would error, whereas now it is silently ignored. Maybe it is worth erroring if they aren't empty, to make it explicit that there is no support?
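A hedged sketch of the guard being suggested (simplified signature with `String` placeholders; the real `create_window_expr` takes physical expressions and sort expressions):

```rust
// Fail loudly when a clause we cannot honor yet is present, instead of
// silently dropping it and computing a wrong result.
fn create_window_expr(
    partition_by: &[String],
    order_by: &[String],
) -> Result<String, String> {
    if !partition_by.is_empty() || !order_by.is_empty() {
        return Err("window partition by / order by not yet supported".to_string());
    }
    Ok("window expression".to_string())
}

fn main() {
    assert!(create_window_expr(&[], &[]).is_ok());
    // A non-empty PARTITION BY now produces an explicit error
    // rather than being ignored.
    assert!(create_window_expr(&["c1".to_string()], &[]).is_err());
}
```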

##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -61,24 +63,27 @@ pub struct WindowAggExec {
 /// Create a physical expression for window function
 pub fn create_window_expr(
 fun: ,
+name: String,
 args: &[Arc],
+_partition_by: &[Arc],
+_order_by: &[PhysicalSortExpr],
+_window_frame: WindowFrame,

Review comment:
   nit: once this is merged, it would be cool to add a link to this code in 
the respective issues that are tracking these missing features to make it 
easier for new contributors to start contributing.









[GitHub] [arrow] edponce commented on issue #10502: AttributeError: module 'pyarrow.lib' has no attribute '_Weakrefable'

2021-06-14 Thread GitBox


edponce commented on issue #10502:
URL: https://github.com/apache/arrow/issues/10502#issuecomment-859270240










[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10507: ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, minute, and second

2021-06-14 Thread GitBox


jorisvandenbossche commented on a change in pull request #10507:
URL: https://github.com/apache/arrow/pull/10507#discussion_r649723619



##
File path: r/R/dplyr-functions.R
##
@@ -442,3 +442,37 @@ nse_funcs$strptime <- function(x, format = "%Y-%m-%d 
%H:%M:%S", tz = NULL, unit
 
   Expression$create("strptime", x, options = list(format = format, unit = 
unit))
 }
+
+nse_funcs$wday <- function(x, label = FALSE, abbr = TRUE, week_start = 
getOption("lubridate.week.start", 7)) {
+  if (label) {
+arrow_not_supported("Label argument")
+  }
+  offset <- get_date_offset(week_start)
+  Expression$create("add", Expression$create("day_of_week", x), 
Expression$scalar(offset))
+}
+
+#' Get date offset
+#' 
+#' Arrow's `day_of_week` kernel counts from 0 (Monday) to 6 (Sunday), whereas
+#' `lubridate::wday` counts from 1 to 7, and allows users to specify which day
+#' of the week is first (Sunday by default).  This function converts the 
returned

Review comment:
   I was going to mention that we actually also have an "ISO weekday" in the C++ kernels, but only as a field in the `iso_calendar` kernel, and not as a separate one. We could also add `iso_day_of_week` if that helps, or even just make `day_of_week` follow the ISO conventions, because the 0 (Monday) - 6 (Sunday) convention might be something Python specific.
   
   But then looked at what C++ does, and there the default is actually 0 
(sunday) - 6 (saturday), so yet something else ;) (although I see that also 
Postgres uses that for `dow`)
   
   In the end, once you have it in one form, it's an easy conversion to any of 
the other of course, so it might not matter that much.
   
   

##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",

Review comment:
   "hour" is missing here?

##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",
+  "second" = "second"

Review comment:
   Note that second might do something different. I think "second" in 
lubridate is the equivalent of "second + subsecond" in arrow

##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",
+  "second" = "second"

Review comment:
   What do you mean exactly with rounding? A quick try gives me:
   
   ```
   > second(ymd_hms("2011-06-04 12:00:01.123456"))
   [1] 1.123456
   ```
   
   which seems to give all decimals

##
File path: r/R/expression.R
##
@@ -28,8 +28,17 @@
   # stringr spellings of those
   "str_length" = "utf8_length",
   "str_to_lower" = "utf8_lower",
-  "str_to_upper" = "utf8_upper"
+  "str_to_upper" = "utf8_upper",
   # str_trim is defined in dplyr.R
+  "year" = "year",
+  "isoyear" = "iso_year",
+  "quarter" = "quarter",
+  "month" = "month",
+  "day" = "day",
+  "yday" = "day_of_year",
+  "isoweek" = "iso_week",
+  "minute" = "minute",
+  "second" = "second"

Review comment:
   I don't think lubridate / R supports nanosecond resolution








[GitHub] [arrow-datafusion] andygrove merged pull request #535: Make BallistaContext::collect streaming

2021-06-14 Thread GitBox


andygrove merged pull request #535:
URL: https://github.com/apache/arrow-datafusion/pull/535


   






[GitHub] [arrow-rs] Jimexist opened a new issue #446: sort kernel has a lot of unnecessary wrapping

2021-06-14 Thread GitBox


Jimexist opened a new issue #446:
URL: https://github.com/apache/arrow-rs/issues/446


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   
   we can get rid of the wrapping and remove the suppression on clippy warning 
as well
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
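A hedged before/after sketch of the kind of wrapping the issue refers to (hypothetical `sorted` helpers, not the actual sort-kernel functions):

```rust
// Before: the helper can never fail, yet wraps its result in Ok(..),
// forcing every caller to unwrap or propagate an impossible error.
// This is what clippy's `unnecessary_wraps` lint flags.
fn sorted_wrapped(mut v: Vec<i32>) -> Result<Vec<i32>, String> {
    v.sort();
    Ok(v)
}

// After: dropping the wrapper removes the dead error path and the
// need to suppress the clippy warning.
fn sorted(mut v: Vec<i32>) -> Vec<i32> {
    v.sort();
    v
}

fn main() {
    assert_eq!(sorted_wrapped(vec![3, 1, 2]).unwrap(), sorted(vec![3, 1, 2]));
    assert_eq!(sorted(vec![3, 1, 2]), vec![1, 2, 3]);
}
```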
   






[GitHub] [arrow-rs] Jimexist opened a new issue #447: sort kernel boolean sort can be O(n)

2021-06-14 Thread GitBox


Jimexist opened a new issue #447:
URL: https://github.com/apache/arrow-rs/issues/447


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   A clear and concise description of what the problem is. Ex. I'm always 
frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for 
this feature, in addition to  the *what*)
   
   we can use partition to speed up boolean sort
   
   **Describe the solution you'd like**
   A clear and concise description of what you want to happen.
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
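One way to realize the O(n) idea — a sketch, not the arrow-rs implementation: partition the indices into false-valued and true-valued groups instead of running a comparison sort:

```rust
// O(n) ascending "sort" of booleans: emit the indices of false values
// first, then the indices of true values. Each group keeps its original
// order, so this is a stable partition, not a comparison sort.
fn sort_bool_to_indices(values: &[bool]) -> Vec<u32> {
    let mut indices = Vec::with_capacity(values.len());
    // First pass: all the false positions, in order.
    indices.extend(
        values.iter().enumerate().filter(|(_, v)| !**v).map(|(i, _)| i as u32),
    );
    // Second pass: all the true positions, in order.
    indices.extend(
        values.iter().enumerate().filter(|(_, v)| **v).map(|(i, _)| i as u32),
    );
    indices
}

fn main() {
    let values = [true, false, true, false, false];
    // false at 1, 3, 4 comes before true at 0, 2.
    assert_eq!(sort_bool_to_indices(&values), vec![1, 3, 4, 0, 2]);
}
```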
   






[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #299: Support window functions with PARTITION BY clause

2021-06-14 Thread GitBox


Dandandan edited a comment on issue #299:
URL: 
https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859516691










[GitHub] [arrow] jpedroantunes opened a new pull request #10516: ARROW-13049: [C++][Gandiva] Implement BIN Hive function on Gandiva

2021-06-14 Thread GitBox


jpedroantunes opened a new pull request #10516:
URL: https://github.com/apache/arrow/pull/10516


   Implement BIN Hive function on Gandiva






[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #532: reuse datafusion physical planner in ballista building from protobuf

2021-06-14 Thread GitBox


Jimexist commented on a change in pull request #532:
URL: https://github.com/apache/arrow-datafusion/pull/532#discussion_r649843237



##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -61,24 +63,27 @@ pub struct WindowAggExec {
 /// Create a physical expression for window function
 pub fn create_window_expr(
 fun: ,
+name: String,
 args: &[Arc],
+_partition_by: &[Arc],
+_order_by: &[PhysicalSortExpr],
+_window_frame: WindowFrame,

Review comment:
   it'll be further developed in 
https://github.com/apache/arrow-datafusion/pull/520

##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -61,24 +63,27 @@ pub struct WindowAggExec {
 /// Create a physical expression for window function
 pub fn create_window_expr(
 fun: ,
+name: String,
 args: &[Arc],
+_partition_by: &[Arc],
+_order_by: &[PhysicalSortExpr],
+_window_frame: WindowFrame,

Review comment:
   but that's a good point let me add guard here.

##
File path: datafusion/src/physical_plan/windows.rs
##
@@ -61,24 +63,27 @@ pub struct WindowAggExec {
 /// Create a physical expression for window function
 pub fn create_window_expr(
 fun: ,
+name: String,
 args: &[Arc],
+_partition_by: &[Arc],
+_order_by: &[PhysicalSortExpr],
+_window_frame: WindowFrame,

Review comment:
   this is fixed now








[GitHub] [arrow-rs] Jimexist opened a new pull request #449: remove clippy unnecessary wraps

2021-06-14 Thread GitBox


Jimexist opened a new pull request #449:
URL: https://github.com/apache/arrow-rs/pull/449










[GitHub] [arrow] github-actions[bot] commented on pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10518:
URL: https://github.com/apache/arrow/pull/10518#issuecomment-859529990


   https://issues.apache.org/jira/browse/ARROW-13052






[GitHub] [arrow-rs] Jimexist closed pull request #445: remove unnecessary wraps in sort

2021-06-14 Thread GitBox


Jimexist closed pull request #445:
URL: https://github.com/apache/arrow-rs/pull/445


   






[GitHub] [arrow-datafusion] andygrove commented on pull request #543: Ballista: Implement map-side shuffle

2021-06-14 Thread GitBox


andygrove commented on pull request #543:
URL: https://github.com/apache/arrow-datafusion/pull/543#issuecomment-859939472


   @edrevo fyi






[GitHub] [arrow] domoritz closed pull request #10500: ARROW-13031: [JS] Support arm in closure compiler on macOS

2021-06-14 Thread GitBox


domoritz closed pull request #10500:
URL: https://github.com/apache/arrow/pull/10500


   






[GitHub] [arrow] kiszk commented on pull request #10501: ARROW-13032: [Java] Update guava version

2021-06-14 Thread GitBox


kiszk commented on pull request #10501:
URL: https://github.com/apache/arrow/pull/10501#issuecomment-859690248


   Is there any reason not to use the latest version https://mvnrepository.com/artifact/com.google.guava/guava/30.1.1-jre ?






[GitHub] [arrow] github-actions[bot] commented on pull request #10519: ARROW-12867: [R] Bindings for abs()

2021-06-14 Thread GitBox


github-actions[bot] commented on pull request #10519:
URL: https://github.com/apache/arrow/pull/10519#issuecomment-859534414


   https://issues.apache.org/jira/browse/ARROW-12867






[GitHub] [arrow] kou commented on pull request #10510: ARROW-13043: [GLib][Ruby] Add GArrowEqualOptions

2021-06-14 Thread GitBox


kou commented on pull request #10510:
URL: https://github.com/apache/arrow/pull/10510#issuecomment-859913376


   +1






[GitHub] [arrow] nirandaperera commented on pull request #10487: ARROW-13010: [C++][Compute] Support outputting to slices from kleene kernels

2021-06-14 Thread GitBox


nirandaperera commented on pull request #10487:
URL: https://github.com/apache/arrow/pull/10487#issuecomment-859187060


   @github-actions autotune 






[GitHub] [arrow] jonkeane closed pull request #10506: ARROW-13039: [R] Fix error message handling

2021-06-14 Thread GitBox


jonkeane closed pull request #10506:
URL: https://github.com/apache/arrow/pull/10506


   






[GitHub] [arrow-datafusion] alamb commented on a change in pull request #541: ShuffleReaderExec now supports multiple locations per partition

2021-06-14 Thread GitBox


alamb commented on a change in pull request #541:
URL: https://github.com/apache/arrow-datafusion/pull/541#discussion_r650205163



##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
-
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))
             .await
-            .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))
+            .into_iter()
+            .collect::<Result<Vec<_>>>()?;
+
+        let result = WrappedStream::new(
+            Box::pin(futures::stream::iter(result).flatten()),

Review comment:
   this is a clever way of flattening the streams (though I wonder if it will serialize them all, aka not start reading from the second until the first is entirely consumed)?
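The serialization concern can be illustrated with a small std-only sketch (illustrative names, not the Ballista code): `flatten()` drains each inner source completely before touching the next, which mirrors how `futures::stream`'s `flatten()` polls one inner stream to completion before moving on, so the partitions would not produce in parallel.

```rust
// Sketch (illustrative, not the Ballista code): a source that records when
// it yields, so the consumption order after flatten() is observable.
use std::cell::RefCell;

struct LoggedSource<'a> {
    name: &'static str,
    items: Vec<i32>,
    pos: usize,
    log: &'a RefCell<Vec<String>>,
}

impl<'a> Iterator for LoggedSource<'a> {
    type Item = i32;
    fn next(&mut self) -> Option<i32> {
        if self.pos < self.items.len() {
            let v = self.items[self.pos];
            self.pos += 1;
            self.log.borrow_mut().push(format!("{} yields {}", self.name, v));
            Some(v)
        } else {
            None
        }
    }
}

fn main() {
    let log = RefCell::new(Vec::new());
    let a = LoggedSource { name: "a", items: vec![1, 2], pos: 0, log: &log };
    let b = LoggedSource { name: "b", items: vec![3], pos: 0, log: &log };
    // flatten() exhausts `a` before pulling anything from `b`.
    let collected: Vec<i32> = vec![a, b].into_iter().flatten().collect();
    assert_eq!(collected, vec![1, 2, 3]);
    // The log confirms strictly sequential consumption.
    assert_eq!(
        log.borrow().as_slice(),
        ["a yields 1", "a yields 2", "b yields 3"]
    );
}
```

Reading the partitions concurrently would need something like `futures::stream::select_all` instead of sequential flattening.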

##
File path: ballista/rust/client/src/context.rs
##
@@ -74,32 +71,6 @@ impl BallistaContextState {
 }
 }
 
-struct WrappedStream {
-    stream: Pin<Box<dyn Stream<Item = ArrowResult<RecordBatch>> + Send + Sync>>,
-    schema: SchemaRef,
-}
-
-impl RecordBatchStream for WrappedStream {
-    fn schema(&self) -> SchemaRef {
-        self.schema.clone()
-    }
-}
-
-impl Stream for WrappedStream {
-    type Item = ArrowResult<RecordBatch>;
-
-    fn poll_next(
-        mut self: Pin<&mut Self>,
-        cx: &mut std::task::Context<'_>,
-    ) -> std::task::Poll<Option<Self::Item>> {
-        self.stream.poll_next_unpin(cx)
-    }
-
-    fn size_hint(&self) -> (usize, Option<usize>) {
-        self.stream.size_hint()
-    }
-}
-

Review comment:
   this is a cool abstraction -- i have had need of something similar elsewhere -- perhaps it would be good to move to datafusion itself eventually
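The pattern being discussed is a newtype that pairs a boxed stream with its schema and forwards polling to the inner stream. A minimal sketch of the same shape, using a std `Iterator` in place of an async `Stream` and simplified placeholder types (`Schema` and `WrappedIter` are illustrative, not the real DataFusion types):

```rust
// Sketch of the WrappedStream pattern (names simplified for illustration):
// the wrapper carries metadata alongside a boxed source and delegates
// iteration, so consumers can ask for the schema without pulling data.

struct Schema {
    fields: Vec<String>,
}

struct WrappedIter {
    inner: Box<dyn Iterator<Item = i32>>,
    schema: Schema,
}

impl WrappedIter {
    fn new(inner: Box<dyn Iterator<Item = i32>>, schema: Schema) -> Self {
        Self { inner, schema }
    }
    // Available at any time, independent of how much data was consumed.
    fn schema(&self) -> &Schema {
        &self.schema
    }
}

impl Iterator for WrappedIter {
    type Item = i32;
    fn next(&mut self) -> Option<i32> {
        // Pure delegation, like poll_next_unpin on the boxed stream.
        self.inner.next()
    }
}

fn main() {
    let wrapped = WrappedIter::new(
        Box::new(vec![1, 2, 3].into_iter()),
        Schema { fields: vec!["c1".to_string()] },
    );
    assert_eq!(wrapped.schema().fields, vec!["c1".to_string()]);
    let data: Vec<i32> = wrapped.collect();
    assert_eq!(data, vec![1, 2, 3]);
}
```

The payoff is the same one `RecordBatchStream::schema` provides: the schema is answerable without driving the stream.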

##
File path: ballista/rust/core/src/execution_plans/shuffle_reader.rs
##
@@ -86,23 +87,18 @@ impl ExecutionPlan for ShuffleReaderExec {
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
         info!("ShuffleReaderExec::execute({})", partition);
-        let partition_location = &self.partition_location[partition];
-
-        let mut client = BallistaClient::try_new(
-            &partition_location.executor_meta.host,
-            partition_location.executor_meta.port,
-        )
-        .await
-        .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))?;
-
-        client
-            .fetch_partition(
-                &partition_location.partition_id.job_id,
-                partition_location.partition_id.stage_id,
-                partition,
-            )
+        let x = self.partition[partition].clone();
+        let result = future::join_all(x.into_iter().map(fetch_partition))
             .await
-            .map_err(|e| DataFusionError::Execution(format!("Ballista Error: {:?}", e)))
+            .into_iter()
+            .collect::<Result<Vec<_>>>()?;
+
+        let result = WrappedStream::new(
+            Box::pin(futures::stream::iter(result).flatten()),

Review comment:
    
   I guess I figured I would point it out (that the different partitions wouldn't be producing in parallel)







