[arrow] branch feature/format-string-view created (now 74756051c4)

2022-09-08 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch feature/format-string-view
in repository https://gitbox.apache.org/repos/asf/arrow.git


  at 74756051c4 ARROW-16855: [C++] Adding Read Relation ToProto (#13401)

No new revisions were added by this update.



[arrow] branch master updated: ARROW-17296: [Python] Update serialized metadata size in pyarrow.parquet.read_metadata doctest (#13790)

2022-08-03 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new ee874d67dd ARROW-17296: [Python] Update serialized metadata size in 
pyarrow.parquet.read_metadata doctest (#13790)
ee874d67dd is described below

commit ee874d67ddd417e5c33aff1979df782c4dfa1dfb
Author: Wes McKinney 
AuthorDate: Wed Aug 3 15:11:52 2022 -0600

ARROW-17296: [Python] Update serialized metadata size in 
pyarrow.parquet.read_metadata doctest (#13790)

This should remain correct until we hit major version 100 (or make changes 
that otherwise affect the metadata size)

Lead-authored-by: Wes McKinney 
Co-authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 python/pyarrow/parquet/__init__.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyarrow/parquet/__init__.py b/python/pyarrow/parquet/__init__.py
index 5feb922060..5f616bc209 100644
--- a/python/pyarrow/parquet/__init__.py
+++ b/python/pyarrow/parquet/__init__.py
@@ -3419,7 +3419,7 @@ def read_metadata(where, memory_map=False, decryption_properties=None):
   num_rows: 3
   num_row_groups: 1
   format_version: 2.6
-  serialized_size: 561
+  serialized_size: ...
 """
 return ParquetFile(where, memory_map=memory_map,
decryption_properties=decryption_properties).metadata



[arrow] branch master updated: ARROW-17213: [C++] Fix for valgrind issue in test-r-linux-valgrind crossbow build (#13715)

2022-07-26 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 49ae8fa953 ARROW-17213: [C++] Fix for valgrind issue in 
test-r-linux-valgrind crossbow build (#13715)
49ae8fa953 is described below

commit 49ae8fa9536b117f26e83941619df3b0e1b9e18a
Author: Wes McKinney 
AuthorDate: Tue Jul 26 20:12:41 2022 -0600

ARROW-17213: [C++] Fix for valgrind issue in test-r-linux-valgrind crossbow 
build (#13715)

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/compute/kernels/scalar_compare.cc | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/cpp/src/arrow/compute/kernels/scalar_compare.cc b/cpp/src/arrow/compute/kernels/scalar_compare.cc
index f071986dd2..cfe1085531 100644
--- a/cpp/src/arrow/compute/kernels/scalar_compare.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_compare.cc
@@ -271,8 +271,7 @@ struct CompareKernel {
 if (out_is_byte_aligned) {
   out_buffer = out_arr->buffers[1].data + out_arr->offset / 8;
 } else {
-  ARROW_ASSIGN_OR_RAISE(out_buffer_tmp,
-                        ctx->Allocate(bit_util::BytesForBits(batch.length)));
+  ARROW_ASSIGN_OR_RAISE(out_buffer_tmp, ctx->AllocateBitmap(batch.length));
   out_buffer = out_buffer_tmp->mutable_data();
 }
 if (batch[0].is_array() && batch[1].is_array()) {



[arrow-datafusion-python] branch master updated: Add .asf.yaml

2022-07-21 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion-python.git


The following commit(s) were added to refs/heads/master by this push:
 new 698fa72  Add .asf.yaml
698fa72 is described below

commit 698fa727fab25e31f9f09780e5f4a79d8966c192
Author: Wes McKinney 
AuthorDate: Thu Jul 21 17:46:48 2022 -0500

Add .asf.yaml
---
 .asf.yaml | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/.asf.yaml b/.asf.yaml
new file mode 100644
index 0000000..e59b243
--- /dev/null
+++ b/.asf.yaml
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+notifications:
+  commits:  commits@arrow.apache.org
+  issues:   git...@arrow.apache.org
+  pullrequests: git...@arrow.apache.org
+  jira_options: link label worklog
+github:
+  description: "Apache Arrow DataFusion Python Bindings"
+  homepage: https://arrow.apache.org/datafusion
+  enabled_merge_buttons:
+    squash: true
+    merge: false
+    rebase: false
+  features:
+    issues: true



[arrow] branch master updated: ARROW-17135: [C++] Reduce code size in compute/kernels/scalar_compare.cc (#13654)

2022-07-20 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 1214083f7e ARROW-17135: [C++] Reduce code size in 
compute/kernels/scalar_compare.cc (#13654)
1214083f7e is described below

commit 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89
Author: Wes McKinney 
AuthorDate: Wed Jul 20 13:12:23 2022 -0700

ARROW-17135: [C++] Reduce code size in compute/kernels/scalar_compare.cc 
(#13654)

This "leaner" implementation reduces the generated code size of this C++ 
file from 2307768 bytes to 1192608 bytes in gcc 10.3.0. The benchmarks are also 
faster (on my avx2 laptop):

before

```

----------------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
GreaterArrayArrayInt64/32768/10000    32.1 us         32.1 us        21533 items_per_second=1020.16M/s null_percent=0.01 size=32.768k
GreaterArrayArrayInt64/32768/100      32.1 us         32.1 us        21603 items_per_second=1019.27M/s null_percent=1 size=32.768k
GreaterArrayArrayInt64/32768/10       32.1 us         32.1 us        21479 items_per_second=1020.82M/s null_percent=10 size=32.768k
GreaterArrayArrayInt64/32768/2        32.0 us         32.0 us        21468 items_per_second=1023.12M/s null_percent=50 size=32.768k
GreaterArrayArrayInt64/32768/1        32.3 us         32.3 us        21720 items_per_second=1013.44M/s null_percent=100 size=32.768k
GreaterArrayArrayInt64/32768/0        31.6 us         31.6 us        21828 items_per_second=1036.94M/s null_percent=0 size=32.768k
GreaterArrayScalarInt64/32768/10000   20.8 us         20.8 us        33461 items_per_second=1.57238G/s null_percent=0.01 size=32.768k
GreaterArrayScalarInt64/32768/100     20.9 us         20.9 us        33625 items_per_second=1.56611G/s null_percent=1 size=32.768k
GreaterArrayScalarInt64/32768/10      20.8 us         20.8 us        33553 items_per_second=1.57338G/s null_percent=10 size=32.768k
GreaterArrayScalarInt64/32768/2       20.9 us         20.9 us        33348 items_per_second=1.5687G/s null_percent=50 size=32.768k
GreaterArrayScalarInt64/32768/1       20.9 us         20.9 us        33419 items_per_second=1.56879G/s null_percent=100 size=32.768k
GreaterArrayScalarInt64/32768/0       20.5 us         20.5 us        34116 items_per_second=1.59837G/s null_percent=0 size=32.768k
```

after

```

----------------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
GreaterArrayArrayInt64/32768/10000    18.1 us         18.1 us        38751 items_per_second=1.81199G/s null_percent=0.01 size=32.768k
GreaterArrayArrayInt64/32768/100      17.5 us         17.5 us        39374 items_per_second=1.86821G/s null_percent=1 size=32.768k
GreaterArrayArrayInt64/32768/10       19.0 us         19.0 us        33941 items_per_second=1.72066G/s null_percent=10 size=32.768k
GreaterArrayArrayInt64/32768/2        18.0 us         18.0 us        39589 items_per_second=1.81817G/s null_percent=50 size=32.768k
GreaterArrayArrayInt64/32768/1        18.1 us         18.1 us        39061 items_per_second=1.80719G/s null_percent=100 size=32.768k
GreaterArrayArrayInt64/32768/0        17.5 us         17.5 us        39813 items_per_second=1.87031G/s null_percent=0 size=32.768k
GreaterArrayScalarInt64/32768/10000   16.3 us         16.3 us        42281 items_per_second=2.01525G/s null_percent=0.01 size=32.768k
GreaterArrayScalarInt64/32768/100     16.5 us         16.5 us        42266 items_per_second=1.98195G/s null_percent=1 size=32.768k
GreaterArrayScalarInt64/32768/10      16.5 us         16.5 us        41872 items_per_second=1.98615G/s null_percent=10 size=32.768k
GreaterArrayScalarInt64/32768/2       16.3 us         16.3 us        42130 items_per_second=2.00447G/s null_percent=50 size=32.768k
GreaterArrayScalarInt64/32768/1       16.2 us         16.2 us        42391 items_per_second=2.02296G/s null_percent=100 size=32.768k
GreaterArrayScalarInt64/32768/0       15.9 us         15.9 us        43498 items_per_second=2.0614G/s null_percent=0 size=32.768k
```
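The leaner kernel compares values and writes the results directly into a bit-packed output bitmap, one bit per element. A rough pure-Python sketch of that output layout (a hypothetical `greater_packed` helper for illustration, not Arrow's API):

```python
def greater_packed(left, right):
    # Element-wise "greater" with results bit-packed little-endian, one bit
    # per element, mirroring how Arrow compare kernels fill an output bitmap
    # instead of a byte-per-value array.
    out = bytearray((len(left) + 7) // 8)
    for i, (a, b) in enumerate(zip(left, right)):
        if a > b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

bits = greater_packed([3, 1, 4, 1, 5], [2, 2, 2, 2, 2])
print(bits)  # b'\x15' -> 0b10101: elements 0, 2 and 4 compare greater
```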

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/compute/kernels/codegen_internal.cc  |   4 

[arrow] branch master updated: ARROW-16852: [C++] Migrate remaining kernels to use ExecSpan, remove ExecBatchIterator (#13630)

2022-07-19 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 4d931ff1c0 ARROW-16852: [C++] Migrate remaining kernels to use 
ExecSpan, remove ExecBatchIterator (#13630)
4d931ff1c0 is described below

commit 4d931ff1c0f5661a9b134c514555c8d769001759
Author: Wes McKinney 
AuthorDate: Tue Jul 19 16:26:46 2022 -0500

ARROW-16852: [C++] Migrate remaining kernels to use ExecSpan, remove 
ExecBatchIterator (#13630)

This completes the porting to use ExecSpan everywhere. I also changed the 
ExecBatchIterator benchmarks to use ExecSpan to show the performance 
improvement in input splitting that we've talked about in the past:

Splitting inputs into small ExecSpan:

```


Benchmark  Time CPU   Iterations 
UserCounters...


BM_ExecSpanIterator/1024  205671 ns   205667 ns 3395 
items_per_second=4.86223k/s
BM_ExecSpanIterator/4096   54749 ns54750 ns13121 
items_per_second=18.265k/s
BM_ExecSpanIterator/16384  15979 ns15979 ns42628 
items_per_second=62.5824k/s
BM_ExecSpanIterator/65536   5597 ns 5597 ns   125099 
items_per_second=178.668k/s
```

Splitting inputs into small ExecBatch:

```

------------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
BM_ExecBatchIterator/1024     17163432 ns     17163171 ns           41 items_per_second=58.2643/s
BM_ExecBatchIterator/4096      4243467 ns      4243316 ns          163 items_per_second=235.665/s
BM_ExecBatchIterator/16384     1093680 ns      1093638 ns          620 items_per_second=914.38/s
BM_ExecBatchIterator/65536      272451 ns       272435 ns         2584 items_per_second=3.6706k/s
```

Because the input in this benchmark has 1M elements, this shows that 
splitting into 1024 chunks of size 1024 adds only 0.2ms of overhead with 
ExecSpanIterator versus 17.16ms of overhead with ExecBatchIterator (> 80x 
improvement).

This won't by itself do much to impact performance in Acero but things for 
the community to explore in the future are the following (this work that I've 
been doing has been a precondition to consider this):

* A leaner ExecuteScalarExpression implementation that reuses temporary 
allocations (ARROW-16758)
* Parallel expression evaluation
* Better defining morsel (~1M elements) versus task (~1K elements) 
granularity in execution
* Work stealing so that we don't "hog" the thread pools, and we keep the 
work pinned to a particular CPU core if there are other things going on at the 
same time
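The chunking arithmetic being benchmarked can be modeled in a few lines. This sketch only mimics how one large batch is sliced into small fixed-size spans (a hypothetical `iter_spans` helper, with none of ExecSpan's zero-copy machinery):

```python
def iter_spans(columns, span_length):
    # Yield [start, end) windows over equal-length columns, analogous to how
    # ExecSpanIterator slices one big batch into small execution chunks
    # without materializing per-chunk batch objects.
    n = len(columns[0])
    for start in range(0, n, span_length):
        yield start, min(start + span_length, n)

a = list(range(10_000))
b = list(range(10_000))
matches = 0
spans = 0
for lo, hi in iter_spans([a, b], 1024):
    spans += 1
    matches += sum(1 for i in range(lo, hi) if a[i] == b[i])
print(spans, matches)  # 10 10000: ten spans of <=1024 rows cover all 10,000 rows
```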

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/array/data.cc|   6 +-
 cpp/src/arrow/array/data.h |  15 ++-
 cpp/src/arrow/compute/exec.cc  | 142 -
 cpp/src/arrow/compute/exec.h   |  34 +++--
 cpp/src/arrow/compute/exec/aggregate.cc|  31 +++--
 cpp/src/arrow/compute/exec/aggregate_node.cc   |  25 ++--
 cpp/src/arrow/compute/exec_internal.h  |  40 +-
 cpp/src/arrow/compute/exec_test.cc | 131 ---
 cpp/src/arrow/compute/function_benchmark.cc|  26 ++--
 cpp/src/arrow/compute/function_test.cc |   8 +-
 cpp/src/arrow/compute/kernel.h |  49 +++
 cpp/src/arrow/compute/kernels/aggregate_basic.cc   |  60 -
 .../compute/kernels/aggregate_basic_internal.h |  37 +++---
 cpp/src/arrow/compute/kernels/aggregate_internal.h |  12 +-
 cpp/src/arrow/compute/kernels/aggregate_mode.cc|  28 
 .../arrow/compute/kernels/aggregate_quantile.cc|  42 --
 cpp/src/arrow/compute/kernels/aggregate_tdigest.cc |  10 +-
 cpp/src/arrow/compute/kernels/aggregate_var_std.cc |  36 +++---
 cpp/src/arrow/compute/kernels/hash_aggregate.cc| 140 ++--
 .../arrow/compute/kernels/hash_aggregate_test.cc   |  31 +++--
 .../arrow/compute/kernels/scalar_cast_numeric.cc   |   8 +-
 cpp/src/arrow/compute/kernels/scalar_nested.cc |  10 +-
 cpp/src/arrow/compute/row/grouper.cc   |  42 +++---
 cpp/src/arrow/compute/row/grouper.h|   2 +-
 cpp/src/arrow/dataset/partition.cc |   6 +-
 25 files changed, 312 insertions(+), 659 deletions(-)

diff --git a/cpp/src/arrow/array/data

[arrow] branch master updated: ARROW-16807: [C++][R] count distinct incorrectly merges state (#13583)

2022-07-16 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new af4db7731b ARROW-16807: [C++][R] count distinct incorrectly merges 
state (#13583)
af4db7731b is described below

commit af4db7731b1f857e78221c53c2d8221849b1eeec
Author: octalene 
AuthorDate: Sat Jul 16 14:45:27 2022 -0700

ARROW-16807: [C++][R] count distinct incorrectly merges state (#13583)

This addresses a bug where the `count_distinct` function simply added 
counts when merging state. The correct logic would be to return the number of 
distinct elements after both states have been merged.

State for count_distinct is backed by a MemoTable, which is then backed by 
a HashTable. To properly merge state, this PR adds 2 functions to each 
MemoTable: `MaybeInsert` and `MergeTable`. The MaybeInsert function handles 
simplified logic for inserting an element into the MemoTable. The MergeTable 
function handles iteration over elements in the MemoTable _to be merged_.
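The corrected merge logic can be modeled with a set standing in for the MemoTable (a toy Python sketch, not the C++ implementation):

```python
class CountDistinctState:
    """Toy model of the fixed count_distinct aggregator (set ~ MemoTable)."""

    def __init__(self):
        self.seen = set()
        self.has_nulls = False

    def consume(self, values):
        for v in values:
            if v is None:
                self.has_nulls = True
            else:
                self.seen.add(v)  # plays the role of MemoTable MaybeInsert

    def merge_from(self, other):
        # The fix: merge the tables, then take the merged size -- never sum
        # the per-state counts, which double-counts shared values.
        self.seen |= other.seen  # plays the role of MemoTable MergeTable
        self.has_nulls = self.has_nulls or other.has_nulls

    def finalize(self):
        return len(self.seen)

left, right = CountDistinctState(), CountDistinctState()
left.consume([1, 2, 3])
right.consume([2, 3, 4, None])
left.merge_from(right)
print(left.finalize())  # 4 -- summing per-state counts would wrongly give 6
```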

This PR also adds an R test and a C++ test. The R test mirrors what was 
provided in ARROW-16807. The C++ test, `AllChunkedArrayTypesWithNulls`, mirrors 
another C++ test, `AllArrayTypesWithNulls`, but uses chunked arrays for test 
data.

Lead-authored-by: Aldrin Montana 
Co-authored-by: Aldrin M 
Co-authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/compute/kernels/aggregate_basic.cc | 17 --
 cpp/src/arrow/compute/kernels/aggregate_test.cc  | 72 
 cpp/src/arrow/compute/kernels/codegen_internal.h |  2 +-
 cpp/src/arrow/util/hashing.h | 32 +++
 r/tests/testthat/test-dplyr-summarize.R  |  9 +++
 5 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/cpp/src/arrow/compute/kernels/aggregate_basic.cc b/cpp/src/arrow/compute/kernels/aggregate_basic.cc
index 57cee87f00..fec483318e 100644
--- a/cpp/src/arrow/compute/kernels/aggregate_basic.cc
+++ b/cpp/src/arrow/compute/kernels/aggregate_basic.cc
@@ -136,27 +136,34 @@ struct CountDistinctImpl : public ScalarAggregator {
   Status Consume(KernelContext*, const ExecBatch& batch) override {
 if (batch[0].is_array()) {
   const ArrayData& arr = *batch[0].array();
+  this->has_nulls = arr.GetNullCount() > 0;
+
   auto visit_null = []() { return Status::OK(); };
   auto visit_value = [&](VisitorArgType arg) {
-int y;
+int32_t y;
 return memo_table_->GetOrInsert(arg, &y);
   };
   RETURN_NOT_OK(VisitArraySpanInline(arr, visit_value, visit_null));
-  this->non_nulls += memo_table_->size();
-  this->has_nulls = arr.GetNullCount() > 0;
+
 } else {
   const Scalar& input = *batch[0].scalar();
   this->has_nulls = !input.is_valid;
+
   if (input.is_valid) {
-this->non_nulls += batch.length;
+int32_t unused;
+
+RETURN_NOT_OK(memo_table_->GetOrInsert(UnboxScalar<Type>::Unbox(input), &unused));
   }
 }
+
+this->non_nulls = memo_table_->size();
+
 return Status::OK();
   }
 
   Status MergeFrom(KernelContext*, KernelState&& src) override {
 const auto& other_state = checked_cast<const CountDistinctImpl&>(src);
-this->non_nulls += other_state.non_nulls;
+RETURN_NOT_OK(this->memo_table_->MergeTable(*(other_state.memo_table_)));
+this->non_nulls = this->memo_table_->size();
 this->has_nulls = this->has_nulls || other_state.has_nulls;
 return Status::OK();
   }
diff --git a/cpp/src/arrow/compute/kernels/aggregate_test.cc b/cpp/src/arrow/compute/kernels/aggregate_test.cc
index aa54fe5f3e..abd5b5210a 100644
--- a/cpp/src/arrow/compute/kernels/aggregate_test.cc
+++ b/cpp/src/arrow/compute/kernels/aggregate_test.cc
@@ -962,11 +962,83 @@ class TestCountDistinctKernel : public ::testing::Test {
 EXPECT_THAT(CallFunction("count_distinct", {input}, ), one);
   }
 
+  void CheckChunkedArr(const std::shared_ptr<DataType>& type,
+                       const std::vector<std::string>& json, int64_t expected_all,
+                       bool has_nulls = true) {
+    Check(ChunkedArrayFromJSON(type, json), expected_all, has_nulls);
+  }
+
   CountOptions only_valid{CountOptions::ONLY_VALID};
   CountOptions only_null{CountOptions::ONLY_NULL};
   CountOptions all{CountOptions::ALL};
 };
 
+TEST_F(TestCountDistinctKernel, AllChunkedArrayTypesWithNulls) {
+  // Boolean
+  CheckChunkedArr(boolean(), {"[]", "[]"}, 0, /*has_nulls=*/false);
+  CheckChunkedArr(boolean(), {"[true, null]", "[false, null, false]", "[true]"}, 3);
+
+  // Number
+  for (auto ty : NumericTypes()) {
+    CheckChunkedArr(ty, {"[1, 1, null, 2]", "[5, 8, 9, 9, null, 10]", "[6, 6, 8, 9, 10]"},
+   

[arrow] branch master updated: ARROW-16757: [C++][FOLLOWUP] Fix mingw32 RTools 4.0 build by removing usage of alignas (#13557)

2022-07-10 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 88b42ef66f ARROW-16757: [C++][FOLLOWUP] Fix mingw32 RTools 4.0 build 
by removing usage of alignas (#13557)
88b42ef66f is described below

commit 88b42ef66fe664043c5ee5274b2982a3858b414e
Author: Wes McKinney 
AuthorDate: Sun Jul 10 09:20:18 2022 -0500

ARROW-16757: [C++][FOLLOWUP] Fix mingw32 RTools 4.0 build by removing usage 
of alignas (#13557)

Using `alignas(64)` (instead of `alignas(8)`) seemed to break this build.
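The fix replaces `alignas(64) uint8_t scratch_space[16]` with `uint64_t scratch_space[2]`: the same 16 bytes of scratch space, but aligned by the element type's natural alignment rather than by `alignas`. A ctypes sketch (hypothetical struct name, illustration only) showing the resulting layout:

```python
import ctypes

class ArraySpanScratch(ctypes.Structure):
    # Two 64-bit words give the same 16 bytes as uint8_t[16], while the
    # element type supplies the alignment that alignas(64) tried to force.
    _fields_ = [("scratch_space", ctypes.c_uint64 * 2)]

print(ctypes.sizeof(ArraySpanScratch))     # 16
print(ctypes.alignment(ArraySpanScratch))  # usually 8 (natural uint64 alignment)
```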

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/array/data.cc   | 6 +++---
 cpp/src/arrow/array/data.h| 2 +-
 cpp/src/arrow/compute/exec.cc | 4 
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/cpp/src/arrow/array/data.cc b/cpp/src/arrow/array/data.cc
index c1a597fea6..d3f28758d9 100644
--- a/cpp/src/arrow/array/data.cc
+++ b/cpp/src/arrow/array/data.cc
@@ -219,7 +219,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-span->buffers[i].data = span->scratch_space;
+span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
 span->buffers[i].size = 0;
   }
 
@@ -270,7 +270,7 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
 }
   } else if (is_base_binary_like(type_id)) {
 const auto& scalar = checked_cast<const BaseBinaryScalar&>(value);
-this->buffers[1].data = this->scratch_space;
+this->buffers[1].data = reinterpret_cast<uint8_t*>(this->scratch_space);
 const uint8_t* data_buffer = nullptr;
 int64_t data_size = 0;
 if (scalar.is_valid) {
@@ -328,7 +328,7 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
 // First buffer is kept null since unions have no validity vector
 this->buffers[0] = {};
 
-this->buffers[1].data = this->scratch_space;
+this->buffers[1].data = reinterpret_cast<uint8_t*>(this->scratch_space);
 this->buffers[1].size = 1;
 int8_t* type_codes = reinterpret_cast<int8_t*>(this->scratch_space);
 type_codes[0] = checked_cast<const UnionScalar&>(value).type_code;
diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
index fddc60293d..78643ae14a 100644
--- a/cpp/src/arrow/array/data.h
+++ b/cpp/src/arrow/array/data.h
@@ -269,7 +269,7 @@ struct ARROW_EXPORT ArraySpan {
   // 16 bytes of scratch space to enable this ArraySpan to be a view onto
   // scalar values including binary scalars (where we need to create a buffer
   // that looks like two 32-bit or 64-bit offsets)
-  alignas(64) uint8_t scratch_space[16];
+  uint64_t scratch_space[2];
 
   ArraySpan() = default;
 
diff --git a/cpp/src/arrow/compute/exec.cc b/cpp/src/arrow/compute/exec.cc
index e5e256ea6d..4dc5cdc542 100644
--- a/cpp/src/arrow/compute/exec.cc
+++ b/cpp/src/arrow/compute/exec.cc
@@ -383,6 +383,10 @@ int64_t ExecSpanIterator::GetNextChunkSpan(int64_t iteration_size, ExecSpan* span)
   continue;
 }
 const ChunkedArray* arg = args_->at(i).chunked_array().get();
+if (arg->num_chunks() == 0) {
+  iteration_size = 0;
+  continue;
+}
 const Array* current_chunk;
 while (true) {
   current_chunk = arg->chunk(chunk_indexes_[i]).get();
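The guard in the hunk above can be modeled in Python (a hypothetical `next_chunk_span` helper, using plain nested lists for chunked columns):

```python
def next_chunk_span(chunked_columns, requested_length):
    # A chunked argument with zero chunks contributes no rows, so the span
    # length collapses to zero instead of indexing a nonexistent first chunk.
    iteration_size = requested_length
    for chunks in chunked_columns:
        if len(chunks) == 0:
            iteration_size = 0
            continue
        iteration_size = min(iteration_size, len(chunks[0]))
    return iteration_size

print(next_chunk_span([[[1, 2, 3]], []], 1024))  # 0: the empty column caps the span
print(next_chunk_span([[[1, 2, 3]]], 1024))      # 3: limited by the first chunk
```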



[arrow-site] branch master updated (4066731 -> e599783)

2021-11-18 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


from 4066731  ARROW-14626: [Website] Update versions tested on
 add e599783  [Website] Update Rust release details info in release blog 
post template (#136)

No new revisions were added by this update.

Summary of changes:
 release-announcement-template.md | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)


[arrow-site] branch master updated: Add jiayuliu as committer (#152)

2021-10-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new 158713c  Add jiayuliu as committer (#152)
158713c is described below

commit 158713cca5dbd08c724eba0b6641f65949100ded
Author: Jiayu Liu 
AuthorDate: Mon Oct 11 23:30:24 2021 +0800

Add jiayuliu as committer (#152)
---
 _data/committers.yml | 4 
 1 file changed, 4 insertions(+)

diff --git a/_data/committers.yml b/_data/committers.yml
index 40e7ea4..33daca4 100644
--- a/_data/committers.yml
+++ b/_data/committers.yml
@@ -263,3 +263,7 @@
   role: Committer
   alias: houqp
   affiliation: Scribd, Inc.
+- name: Jiayu Liu
+  role: Committer
+  alias: jiayuliu
+  affiliation: Airbnb Inc.


[arrow-site] branch master updated: Add `graphique` to 'powered by' page. (#143)

2021-08-24 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new 50b9c81  Add `graphique` to 'powered by' page. (#143)
50b9c81 is described below

commit 50b9c815b02575a9c46e1bb520d4507fb2596996
Author: A. Coady 
AuthorDate: Tue Aug 24 17:26:48 2021 -0700

Add `graphique` to 'powered by' page. (#143)
---
 powered_by.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/powered_by.md b/powered_by.md
index d09c8e6..1486179 100644
--- a/powered_by.md
+++ b/powered_by.md
@@ -104,6 +104,7 @@ short description of your use case.
   visualizations and/or further analytics.
 * **[GOAI][19]:** Open GPU-Accelerated Analytics Initiative for Arrow-powered
   analytics across GPU tools and vendors
+* **[graphique][41]** GraphQL service for arrow tables and parquet data sets.
+  The schema for a query API is derived automatically.
 * **[Graphistry][18]:** Supercharged Visual Investigation Platform used by
   teams for security, anti-fraud, and related investigations. The Graphistry
   team uses Arrow in its NodeJS GPU backend and client libraries, and is an
@@ -219,3 +220,4 @@ short description of your use case.
 [38]: https://github.com/vaexio/vaex
 [39]: https://hash.ai
 [40]: https://github.com/pola-rs/polars
+[41]: https://github.com/coady/graphique


[arrow-cookbook] branch main updated: Initial content for Arrow Cookbook for Python and R (#1)

2021-07-28 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
 new d93c637  Initial content for Arrow Cookbook for Python and R (#1)
d93c637 is described below

commit d93c637895ca40d6ec5371c6399757dac7a6f6ea
Author: Alessandro Molina 
AuthorDate: Wed Jul 28 16:38:20 2021 +0200

Initial content for Arrow Cookbook for Python and R (#1)

* Initial Import

* R cookbook initial commit (#1)

* R Cookbook skeleton and initial chapter

* Move r test script to a separate directory

* Add Apache 2 license

* Add parquet section

* Delete files used to demonstrate failing tests in CI

* Licensing

* Add content for different formats and rearrange headings

* Small change to make the tests run on macOS

* Completed the IO section and added intersphinx with PyArrow

* Add workflow to deploy to GH pages

* Update path

* Rename chapters and fill in section titles

* Commit whitespace to trigger build

* Update bookdown job

* try new job config

* Install nightly Arrow

* Evaluate all relevant bits!

* Deploy to r dir

* Try new workflow

* update build path

* Add email and update paths

* Update job to build all cookbooks

* Delete whitespace to trigger build

* Swap order to see if this fixes build

* Install system dependencies

* Put it back on Mac so it's faster

* Separate steps to diagnose issue

* Brew not sudo

* Switching to ubuntu as I don't understand why python 2

* Don't put results in r directory

* Capitalise 'C'

* Update bookdown link so can click to fork/edit

* Add CI stage that runs tests

* Add examples of manually creating Arrow objects and writing to various 
formats

* Add S3 parquet

* Partitioned data

* Partitioned Data from S3

* Rename record_batch_create chunk

* CSV recipe requires pandas

* Filter parquet data on read

* Reading/Writing feather files

* remove duplicated chunk name

* tweak create

* Categorical data

* Speed up compiling

* Fix tests

* tests pass

* Data manipulation functions

* Link to compute functions

* Tweak naming

* Add contribution file

* landing page style tweak

* Improve contribution documentation

* Explicitly reference the contribution docs

* ignore build directory

* Change branch name

* Update contents

* Update CONTRIBUTING.md

* Suggestions from Grammarly

* Rename initial chapter

* Update Makefile to allow Arrow version to be specified

* Truncate license file to relevant part

* typo

* Apply suggestions from code review

Co-authored-by: Weston Pace 

* Add link to code of conduct

Co-authored-by: Ian Cook 

* Capitalise "Array"

* Update r/CONTRIBUTING.md

Co-authored-by: Ian Cook 

* Update r/content/manipulating_data.Rmd

Co-authored-by: Weston Pace 

* Update r/content/manipulating_data.Rmd

Co-authored-by: Weston Pace 

* Update r/content/manipulating_data.Rmd

Co-authored-by: Weston Pace 

* Update r/content/reading_and_writing_data.Rmd

Co-authored-by: Weston Pace 

* Update r/content/creating_arrow_objects.Rmd

Co-authored-by: Ian Cook 

* Update r/content/manipulating_data.Rmd

Co-authored-by: Ian Cook 

* Update r/content/manipulating_data.Rmd

Co-authored-by: Ian Cook 

* Apply suggestions from code review

Co-authored-by: Weston Pace 
Co-authored-by: Ian Cook 

* Mention dependencies

* Mention that this is not the documentation

* rewording

* Add -jauto by default and indent a print

* The Apache Software Foundation

* reword

* Correct ambiguous and incorrect phrasing

* Update r/content/reading_and_writing_data.Rmd

Co-authored-by: Weston Pace 

* Update r/content/reading_and_writing_data.Rmd

Co-authored-by: Weston Pace 

* Reorder sections

* Update r/content/manipulating_data.Rmd

Co-authored-by: Ian Cook 

* Remove redundant code snippet

* Update reading CSVs

* Add in section on converting from/to Arrow Tables and tibbles

* rephrase list of numbers

* rephrase list of numbers

* Add missing bracket

* Rephrase about parquet containing multiple cols

*

[arrow-cookbook] 01/01: Initial commit

2021-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git

commit a9352414df66e5387f478bee92d3de430d59cd47
Author: Wes McKinney 
AuthorDate: Wed Jul 14 16:42:28 2021 -0500

Initial commit
---
 .gitignore | 0
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..e69de29


[arrow-cookbook] branch main created (now a935241)

2021-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git.


  at a935241  Initial commit

This branch includes the following new commits:

 new a935241  Initial commit

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[arrow-site] branch master updated: Removing extra "}}" from the Feather Python link. (#126)

2021-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new 141667f  Removing extra "}}" from the Feather Python link. (#126)
141667f is described below

commit 141667f0f163711d0a4ceb2c8b7ceda15bdf2e7c
Author: Raul Ascencio 
AuthorDate: Wed Jul 14 15:10:21 2021 -0600

Removing extra "}}" from the Feather Python link. (#126)

Currently, the page: https://arrow.apache.org/use_cases/ contains a python 
link for "Feather" with python using 
"https://arrow.apache.org/docs/python/feather.html%20%7D%7D" which redirects to
a 404.

Instead, it seems that we should be using the following: 
"https://arrow.apache.org/docs/python/feather.html".
---
 use_cases.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/use_cases.md b/use_cases.md
index 5dffaf8..f15e55c 100644
--- a/use_cases.md
+++ b/use_cases.md
@@ -36,7 +36,7 @@ and the [Apache Parquet](https://parquet.apache.org/) format.
 
 
 
-* Feather: C++, [Python]({{ site.baseurl }}/docs/python/feather.html }}),
+* Feather: C++, [Python]({{ site.baseurl }}/docs/python/feather.html),
   [R]({{ site.baseurl }}/docs/r/reference/read_feather.html)
 * Parquet: [C++]({{ site.baseurl }}/docs/cpp/parquet.html),
   [Python]({{ site.baseurl }}/docs/python/parquet.html),


[arrow-site] branch master updated: Add polars project to Powered By (#123)

2021-07-05 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new 66074d2  Add polars project to Powered By (#123)
66074d2 is described below

commit 66074d254f96a8d7ba23d9142ad310e7d23de1a2
Author: Ritchie Vink 
AuthorDate: Mon Jul 5 18:43:55 2021 +0200

Add polars project to Powered By (#123)

This PR proposes adding Polars to the list of projects that use Apache 
Arrow.
---
 powered_by.md | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/powered_by.md b/powered_by.md
index 9fd3791..d09c8e6 100644
--- a/powered_by.md
+++ b/powered_by.md
@@ -137,6 +137,11 @@ short description of your use case.
   Parquet format. Petastorm supports popular Python-based machine learning
   (ML) frameworks such as Tensorflow, Pytorch, and PySpark. It can also be
   used from pure Python code.
+* **[Polars][40]:** Polars is a blazingly fast DataFrame library and query engine
+  that aims to utilize modern hardware efficiently
+  (e.g. multi-threading, SIMD vectorization, hiding memory latencies).
+  Polars is built upon Apache Arrow and uses its columnar memory, compute kernels,
+  and several IO utilities. Polars is written in Rust and available in Rust and Python.
 * **[Quilt Data][13]:** Quilt is a data package manager, designed to make
   managing data as easy as managing code. It supports Parquet format via
   pyarrow for data access.
@@ -213,3 +218,4 @@ short description of your use case.
 [37]: https://github.com/tenzir/vast
 [38]: https://github.com/vaexio/vaex
 [39]: https://hash.ai
+[40]: https://github.com/pola-rs/polars


[arrow] branch master updated (9162954 -> 7339bd5)

2021-06-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 9162954  ARROW-13065: [Packaging][RPM] Add missing required LZ4 
version information
 add 7339bd5  [GitHub] Add shorter GitHub repository description to 
.asf.yaml

No new revisions were added by this update.

Summary of changes:
 .asf.yaml | 4 
 1 file changed, 4 insertions(+)


[arrow-site] branch master updated (2d7b592 -> abc9bb2)

2021-04-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


from 2d7b592  ARROW-12192: [Website] Use downloadable URL for archive 
download
 add abc9bb2  ARROW-11911: [Website] Add protobuf vs arrow to FAQ (#97)

No new revisions were added by this update.

Summary of changes:
 faq.md | 25 +
 1 file changed, 25 insertions(+)


[arrow-site] branch master updated: Adding Vaex to powered by (#98)

2021-03-09 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new a2f6faf  Adding Vaex to powered by (#98)
a2f6faf is described below

commit a2f6faf0840c9ee42b8bead27652257fe687bfeb
Author: Maarten Breddels 
AuthorDate: Tue Mar 9 17:34:31 2021 +0100

Adding Vaex to powered by (#98)
---
 powered_by.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/powered_by.md b/powered_by.md
index 9a041bc..01dd9c5 100644
--- a/powered_by.md
+++ b/powered_by.md
@@ -163,6 +163,8 @@ short description of your use case.
   Database Connectivity (ODBC) interface. It provides the ability to return
   Arrow Tables and RecordBatches in addition to the Python Database API
   Specification 2.0.
+* **[Vaex][38]:** Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python,
+  ML, visualize and explore big tabular data at a billion rows per second.
 * **[VAST][37]:** A network telemetry engine for data-driven security
   investigations. VAST uses Arrow as standardized data plane to provide a
   high-bandwidth output path for downstream analytics. This makes it easy and
@@ -205,3 +207,4 @@ short description of your use case.
 [35]: https://cylondata.org/ 
 [36]: https://bodo.ai
 [37]: https://github.com/tenzir/vast
+[38]: https://github.com/vaexio/vaex



[arrow] branch master updated (8df91c9 -> 8d76312)

2020-12-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 8df91c9  ARROW-10908: [Rust][DataFusion] Update relevant tpch-queries 
with BETWEEN
 add 8d76312  ARROW-6883: [C++][Python] Allow writing dictionary deltas

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/array/array_base.cc|  18 ++--
 cpp/src/arrow/array/array_base.h |  12 ++-
 cpp/src/arrow/flight/client.cc   |   8 ++
 cpp/src/arrow/flight/server.cc   |   9 +-
 cpp/src/arrow/ipc/options.h  |  14 +++
 cpp/src/arrow/ipc/read_write_test.cc | 164 +--
 cpp/src/arrow/ipc/reader.cc  |  44 ++
 cpp/src/arrow/ipc/reader.h   |   5 +-
 cpp/src/arrow/ipc/writer.cc  |  76 +---
 cpp/src/arrow/ipc/writer.h   |  20 +
 docs/source/status.rst   |  25 +-
 python/pyarrow/includes/libarrow.pxd |  35 ++--
 python/pyarrow/ipc.pxi   |  82 +-
 python/pyarrow/ipc.py|   3 +-
 python/pyarrow/tests/test_ipc.py |  56 
 15 files changed, 509 insertions(+), 62 deletions(-)
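ARROW-6883 lets the IPC writer emit dictionary *deltas*: when a dictionary has only grown between record batches, just the newly appended entries are sent instead of the whole dictionary. A minimal sketch of the delta computation (function name and list-of-strings representation are illustrative, not the Arrow API):

```python
def dictionary_delta(previous, current):
    """Return the entries appended since `previous` was emitted.

    A delta is only valid when `current` extends `previous`; a changed
    prefix means the dictionary was replaced, which a delta cannot express.
    """
    if current[: len(previous)] != previous:
        raise ValueError("dictionary was replaced, not extended")
    return current[len(previous):]

# The first batch ships the full dictionary; later batches ship only the delta.
print(dictionary_delta([], ["a", "b"]))               # ['a', 'b']
print(dictionary_delta(["a", "b"], ["a", "b", "c"]))  # ['c']
```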



[arrow] branch master updated (b8e021c -> 8b9f6b9)

2020-11-19 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from b8e021c  ARROW-10634: [C#][CI] Change the build version from 2.2 to 
3.1 in CI
 add 8b9f6b9  ARROW-10598: [C++] Separate out bit-packing in 
internal::GenerateBitsUnrolled for better performance

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/bitmap_generate.h | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)
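The bit-packing that ARROW-10598 separates out in `GenerateBitsUnrolled` boils down to packing boolean values into a bitmap, least-significant bit first (Arrow's bitmap convention). A pure-Python sketch of that core operation, without the unrolling:

```python
def pack_bits(values):
    """Pack booleans into bytes, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)  # bit i of byte i//8
    return bytes(out)

print(pack_bits([True, False, True, True]).hex())  # '0d'  (0b00001101)
```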



[arrow] branch master updated (4d2cf9f -> 9e587be)

2020-10-09 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 4d2cf9f  ARROW-10175: [CI] Fix nightly HDFS integration tests (ensure 
to use legacy dataset)
 add 9e587be  ARROW-10206: [C++][Python][FlightRPC] Allow disabling server 
validation

No new revisions were added by this update.

Summary of changes:
 ci/conda_env_cpp.yml   |  2 +-
 cpp/cmake_modules/Findzstd.cmake   | 20 +++--
 cpp/src/arrow/flight/CMakeLists.txt| 42 ++
 cpp/src/arrow/flight/client.cc | 95 +++---
 cpp/src/arrow/flight/client.h  |  6 ++
 cpp/src/arrow/flight/flight_test.cc| 26 ++
 .../check_tls_opts_127.cc} | 44 --
 .../check_tls_opts_132.cc} | 44 --
 python/pyarrow/_flight.pyx | 31 +--
 python/pyarrow/includes/libarrow_flight.pxd|  1 +
 python/pyarrow/tests/test_flight.py| 13 +++
 11 files changed, 244 insertions(+), 80 deletions(-)
 copy cpp/src/arrow/flight/{middleware_internal.h => 
try_compile/check_tls_opts_127.cc} (55%)
 copy cpp/src/arrow/flight/{middleware_internal.h => 
try_compile/check_tls_opts_132.cc} (56%)
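ARROW-10206 adds an opt-out from TLS server certificate validation in Flight clients (useful for testing against self-signed certificates). The Python standard library's `ssl` module exposes the same kind of switch; this is the stdlib analogue of such an option, not the Flight API itself:

```python
import ssl

# Default context: hostname checking and certificate verification on.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED

# Opting out, e.g. for a test server with a self-signed certificate.
# check_hostname must be disabled before verify_mode can drop to CERT_NONE.
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```

As in Flight, the point of a dedicated flag is that disabling validation is an explicit, per-connection decision rather than a global default.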



[arrow] branch master updated (105873e -> b2842ab)

2020-10-05 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 105873e  ARROW-10068: [C++] Add bundled external project for 
aws-sdk-cpp
 add b2842ab  ARROW-10147: [Python] Pandas metadata fails if index name not 
JSON-serializable

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/pandas_compat.py | 11 ++-
 python/pyarrow/tests/test_pandas.py | 14 ++
 2 files changed, 24 insertions(+), 1 deletion(-)
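The pandas metadata ARROW-10147 fixes is stored as JSON inside the Parquet file, so an index name that `json` cannot encode (bytes, for instance) used to crash the writer. A hedged sketch of the kind of guard involved (helper name is illustrative, not the pyarrow internal):

```python
import json

def json_safe_name(name):
    """Fall back to str() for index names json cannot encode."""
    try:
        json.dumps(name)
        return name
    except TypeError:
        return str(name)

print(json_safe_name("rows"))        # rows
print(json_safe_name(b"raw-bytes"))  # b'raw-bytes'  (bytes are not JSON-serializable)
```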



[arrow] branch master updated (ecc3ed8 -> 72a0e96)

2020-10-05 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from ecc3ed8  ARROW-10008: [C++][Dataset] Fix filtering/row group 
statistics of dict columns
 add 72a0e96  ARROW-10121: [C++] Fix emission of new dictionaries in IPC 
writer

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/ipc/CMakeLists.txt |   3 +-
 cpp/src/arrow/ipc/dictionary.cc  |  15 +-
 cpp/src/arrow/ipc/dictionary.h   |   5 +-
 cpp/src/arrow/ipc/read_write_test.cc | 652 ---
 cpp/src/arrow/ipc/reader.cc  | 141 ++--
 cpp/src/arrow/ipc/reader.h   |  29 +-
 cpp/src/arrow/ipc/tensor_test.cc | 506 +++
 cpp/src/arrow/ipc/writer.cc  |  86 +++--
 cpp/src/arrow/ipc/writer.h   |   3 +
 cpp/src/arrow/pretty_print.cc|   2 +-
 10 files changed, 943 insertions(+), 499 deletions(-)
 create mode 100644 cpp/src/arrow/ipc/tensor_test.cc



[arrow] branch master updated (a1157b7 -> 9bff7c4)

2020-10-01 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from a1157b7  ARROW-10136: [Rust]: Fix null handling in StringArray and 
BinaryArray filtering, add BinaryArray::from_opt_vec
 add 9bff7c4  ARROW-10054: [Python] don't crash when slice offset > length

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/array.pxi   |  1 +
 python/pyarrow/table.pxi   |  3 +++
 python/pyarrow/tests/test_array.py |  2 ++
 python/pyarrow/tests/test_table.py | 24 
 4 files changed, 30 insertions(+)
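The slicing fix in ARROW-10054 comes down to clamping: an offset past the end of the array should yield an empty slice rather than a crash. A sketch of that clamping rule (a simplification of what the bindings do, with illustrative names):

```python
def clamped_slice(length, offset, slice_length=None):
    """Return (new_offset, new_length) with the offset clamped into [0, length]."""
    offset = min(max(offset, 0), length)
    available = length - offset
    if slice_length is None or slice_length > available:
        slice_length = available
    return offset, slice_length

print(clamped_slice(3, 10))    # (3, 0)  -> empty slice instead of a crash
print(clamped_slice(3, 1, 5))  # (1, 2)  -> requested length trimmed to what's left
```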



[arrow] branch master updated (571d48e -> 4b0448b)

2020-09-29 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 571d48e  ARROW-10119: [C++] Fix Parquet crashes on invalid input
 add 4b0448b  ARROW-10124: [C++] Don't restrict permissions when creating 
files

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/io_util.cc   | 11 +--
 python/pyarrow/tests/test_io.py | 16 
 2 files changed, 17 insertions(+), 10 deletions(-)
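ARROW-10124's "don't restrict permissions" means requesting the conventional wide-open mode at creation time and letting the process umask narrow it, instead of hard-coding a restrictive mode. The stdlib equivalent of that pattern:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")

# Request rw for everyone; the effective mode is 0o666 & ~umask,
# so the user's umask (not the library) decides the final permissions.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
os.close(fd)
print(oct(os.stat(path).st_mode & 0o777))  # depends on the process umask
```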



[arrow] branch master updated (515daab -> 477c102)

2020-09-28 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 515daab  ARROW-8618: [C++] Clean up some redundant std::move()s
 add 477c102  ARROW-9924: [C++][Dataset] Enable per-column parallelism for 
single ParquetFileFragment scans

No new revisions were added by this update.

Summary of changes:
 c_glib/test/dataset/test-scan-options.rb |  2 +-
 cpp/src/arrow/dataset/file_parquet.cc|  4 ++
 cpp/src/arrow/dataset/file_parquet.h |  6 +++
 cpp/src/arrow/dataset/scanner.h  |  4 +-
 cpp/src/parquet/arrow/reader.cc  | 49 +--
 python/pyarrow/_dataset.pyx  | 37 ++
 python/pyarrow/dataset.py|  2 +-
 python/pyarrow/includes/libarrow_dataset.pxd |  1 +
 python/pyarrow/parquet.py| 73 
 9 files changed, 119 insertions(+), 59 deletions(-)
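The per-column parallelism ARROW-9924 enables follows a simple pattern: when a scan covers a single fragment, decode each requested column on its own worker and reassemble in order. A stdlib sketch of that shape (the decoder callable is a stand-in, not the parquet reader API):

```python
from concurrent.futures import ThreadPoolExecutor

def read_columns(read_one, column_indices, use_threads=True):
    """Decode each column concurrently; results keep the input order."""
    if not use_threads:
        return [read_one(i) for i in column_indices]
    with ThreadPoolExecutor() as pool:
        # map() preserves ordering even though workers finish out of order.
        return list(pool.map(read_one, column_indices))

print(read_columns(lambda i: f"col{i}", [0, 2, 5]))  # ['col0', 'col2', 'col5']
```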





[arrow-site] branch master updated: ARROW-7384: Add an allow-all robots.txt (#76)

2020-09-27 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new ae5fbf9  ARROW-7384: Add an allow-all robots.txt (#76)
ae5fbf9 is described below

commit ae5fbf9ffec88dddc56c36d749849e8f164efc89
Author: Uwe L. Korn 
AuthorDate: Sun Sep 27 22:15:59 2020 +0200

ARROW-7384: Add an allow-all robots.txt (#76)
---
 robots.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/robots.txt b/robots.txt
new file mode 100644
index 0000000..f6e6d1d
--- /dev/null
+++ b/robots.txt
@@ -0,0 +1,2 @@
+User-Agent: *
+Allow: /



[arrow] branch master updated (fe862a4 -> 97ade81)

2020-09-25 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from fe862a4  ARROW-9981: [Rust] [Flight] Expose IpcWriteOptions on utils
 add 97ade81  ARROW-8601: [Go][FOLLOWUP] Fix RAT violations related to 
Flight in Go

No new revisions were added by this update.

Summary of changes:
 dev/release/rat_exclude_files.txt | 2 ++
 1 file changed, 2 insertions(+)



[arrow] branch master updated: ARROW-8601: [Go][Flight] Implementations Flight RPC server and client

2020-09-24 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new c0dd2e2  ARROW-8601: [Go][Flight] Implementations Flight RPC server 
and client
c0dd2e2 is described below

commit c0dd2e2166f5f3a9c6b6a03c6983bd886de16c65
Author: Matthew Topol 
AuthorDate: Thu Sep 24 20:33:00 2020 -0500

ARROW-8601: [Go][Flight] Implementations Flight RPC server and client

Built out from https://github.com/apache/arrow/pull/6731 with some 
inspiration from the existing Reader/Writer and the C++ Flight implementation. 
Still need to build out the tests some more, but would like to get opinions and 
thoughts on what I've got so far as I continue to build it out.

Closes #8175 from zeroshade/zeroshade/go/flight

Authored-by: Matthew Topol 
Signed-off-by: Wes McKinney 
---
 format/Flight.proto   |2 +
 go/arrow/flight/Flight.pb.go  | 1473 +
 go/arrow/flight/Flight_grpc.pb.go |  877 +++
 go/arrow/flight/client.go |   89 ++
 go/arrow/flight/client_auth.go|   91 ++
 go/arrow/flight/example_flight_server_test.go |   70 ++
 go/arrow/flight/flight_test.go|  305 +
 go/arrow/{go.mod => flight/gen.go}|   12 +-
 go/arrow/flight/server.go |  118 ++
 go/arrow/flight/server_auth.go|  145 +++
 go/arrow/go.mod   |8 +
 go/arrow/go.sum   |   94 ++
 go/arrow/ipc/flight_data_reader.go|  210 
 go/arrow/ipc/flight_data_writer.go|  150 +++
 14 files changed, 3634 insertions(+), 10 deletions(-)

diff --git a/format/Flight.proto b/format/Flight.proto
index 71ae7ca..7b0f591 100644
--- a/format/Flight.proto
+++ b/format/Flight.proto
@@ -19,6 +19,8 @@
 syntax = "proto3";
 
 option java_package = "org.apache.arrow.flight.impl";
+option go_package = "github.com/apache/arrow/go/flight;flight";
+
 package arrow.flight.protocol;
 
 /*
diff --git a/go/arrow/flight/Flight.pb.go b/go/arrow/flight/Flight.pb.go
new file mode 100644
index 0000000..75c6c2c
--- /dev/null
+++ b/go/arrow/flight/Flight.pb.go
@@ -0,0 +1,1473 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+// 
+// http://www.apache.org/licenses/LICENSE-2.0
+// 
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Code generated by protoc-gen-go. DO NOT EDIT.
+// versions:
+// protoc-gen-go v1.25.0
+// protoc        v3.9.1
+// source: Flight.proto
+
+package flight
+
+import (
+   proto "github.com/golang/protobuf/proto"
+   protoreflect "google.golang.org/protobuf/reflect/protoreflect"
+   protoimpl "google.golang.org/protobuf/runtime/protoimpl"
+   reflect "reflect"
+   sync "sync"
+)
+
+const (
+   // Verify that this generated code is sufficiently up-to-date.
+   _ = protoimpl.EnforceVersion(20 - protoimpl.MinVersion)
+   // Verify that runtime/protoimpl is sufficiently up-to-date.
+   _ = protoimpl.EnforceVersion(protoimpl.MaxVersion - 20)
+)
+
+// This is a compile-time assertion that a sufficiently up-to-date version
+// of the legacy proto package is being used.
+const _ = proto.ProtoPackageIsVersion4
+
+//
+// Describes what type of descriptor is defined.
+type FlightDescriptor_DescriptorType int32
+
+const (
+   // Protobuf pattern, not used.
+   FlightDescriptor_UNKNOWN FlightDescriptor_DescriptorType = 0
+   //
+   // A named path that identifies a dataset. A path is composed of a string
+   // or list of strings describing a particular dataset. This is conceptually
+   //  similar to a path inside a filesystem.
+   FlightDescriptor_PATH FlightDescriptor_DescriptorType = 1
+   //
+   // An opaque command to generate a dataset.
+   FlightDescriptor_CMD FlightDescriptor_DescriptorType = 2
+)
+
+// Enum value maps for FlightDescriptor_DescriptorType.
+var (
+   FlightDescriptor_DescriptorType_name = map[int32]string{
+   0: "UNKNOWN",
+

[arrow] branch master updated (152f8b0 -> ac86123)

2020-09-23 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 152f8b0  ARROW-10066: [C++] Make sure default AWS region selection 
algorithm is used
 add ac86123  ARROW-9970: [Go] fix checkptr failure in sum methods

No new revisions were added by this update.

Summary of changes:
 go/arrow/math/float64_avx2_amd64.go   | 4 ++--
 go/arrow/math/float64_sse4_amd64.go   | 4 ++--
 go/arrow/math/int64_avx2_amd64.go | 4 ++--
 go/arrow/math/int64_sse4_amd64.go | 4 ++--
 go/arrow/math/type_simd_amd64.go.tmpl | 4 ++--
 go/arrow/math/uint64_avx2_amd64.go| 4 ++--
 go/arrow/math/uint64_sse4_amd64.go| 4 ++--
 7 files changed, 14 insertions(+), 14 deletions(-)



[arrow] branch master updated (02287b4 -> 8563b42)

2020-09-22 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 02287b4  ARROW-9078: [C++] Parquet read / write extension type with 
nested storage type
 add 8563b42  PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop 
Lz4Codec

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/compression.cc |  15 +
 cpp/src/arrow/util/compression.h  |  13 +++-
 cpp/src/arrow/util/compression_internal.h |   3 +
 cpp/src/arrow/util/compression_lz4.cc | 107 ++
 cpp/src/arrow/util/compression_test.cc|  70 ---
 cpp/src/parquet/column_reader.cc  |   2 +-
 cpp/src/parquet/column_writer.cc  |   2 +-
 cpp/src/parquet/column_writer_test.cc |  10 ++-
 cpp/src/parquet/file_deserialize_test.cc  |   8 ++-
 cpp/src/parquet/file_serialize_test.cc|  15 -
 cpp/src/parquet/reader_test.cc|  74 -
 cpp/src/parquet/thrift_internal.h |   5 +-
 cpp/src/parquet/types.cc  |  41 +++-
 cpp/src/parquet/types.h   |   9 ---
 cpp/submodules/parquet-testing|   2 +-
 python/pyarrow/tests/test_parquet.py  |  16 +
 16 files changed, 296 insertions(+), 96 deletions(-)
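The incompatibility PARQUET-1878 addresses is one of framing: Hadoop's `Lz4Codec` wraps raw LZ4 blocks in its own length-prefixed frame, so a raw-LZ4 reader and a Hadoop reader cannot decode each other's output unless the reader detects the wrapper. As I read the Hadoop format (verify against the Hadoop source before relying on this), each frame starts with a 4-byte big-endian decompressed size, followed by blocks that are each prefixed with a 4-byte big-endian compressed size. A parsing sketch of that prefix:

```python
import struct

def parse_hadoop_frame_header(buf):
    """Read the assumed Hadoop Lz4Codec prefix: big-endian decompressed
    size of the frame, then the first block's big-endian compressed size."""
    decompressed_size, compressed_size = struct.unpack(">II", buf[:8])
    return decompressed_size, compressed_size

header = struct.pack(">II", 4096, 1200)
print(parse_hadoop_frame_header(header))  # (4096, 1200)
```

A compatibility detector can probe these fields for plausibility (sizes within the buffer, nonzero) before falling back to raw-LZ4 decoding.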



[arrow] branch master updated: ARROW-9490: [Python][C++] Bug in pa.array when input mixes int8 with float

2020-08-22 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 085b44d  ARROW-9490: [Python][C++] Bug in pa.array when input mixes 
int8 with float
085b44d is described below

commit 085b44d916cd1266911c05850a2369f30dd1fd65
Author: arw2019 
AuthorDate: Sat Aug 22 12:54:05 2020 -0500

ARROW-9490: [Python][C++] Bug in pa.array when input mixes int8 with float

Closes #8017 from arw2019/ARROW-9490

Authored-by: arw2019 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/python/helpers.cc  | 2 ++
 python/pyarrow/tests/test_convert_builtin.py | 9 -
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc
index 852bf76..1845aa1 100644
--- a/cpp/src/arrow/python/helpers.cc
+++ b/cpp/src/arrow/python/helpers.cc
@@ -328,6 +328,8 @@ Status UnboxIntegerAsInt64(PyObject* obj, int64_t* out) {
 if (overflow) {
   return Status::Invalid("PyLong is too large to fit int64");
 }
+  } else if (PyArray_IsScalar(obj, Byte)) {
+*out = reinterpret_cast(obj)->obval;
   } else if (PyArray_IsScalar(obj, UByte)) {
 *out = reinterpret_cast(obj)->obval;
   } else if (PyArray_IsScalar(obj, Short)) {
diff --git a/python/pyarrow/tests/test_convert_builtin.py 
b/python/pyarrow/tests/test_convert_builtin.py
index 788675a..f62a941 100644
--- a/python/pyarrow/tests/test_convert_builtin.py
+++ b/python/pyarrow/tests/test_convert_builtin.py
@@ -390,10 +390,17 @@ def test_broken_integers(seq):
 
 
 def test_numpy_scalars_mixed_type():
+
 # ARROW-4324
 data = [np.int32(10), np.float32(0.5)]
 arr = pa.array(data)
-expected = pa.array([10, 0.5], type='float64')
+expected = pa.array([10, 0.5], type="float64")
+assert arr.equals(expected)
+
+# ARROW-9490
+data = [np.int8(10), np.float32(0.5)]
+arr = pa.array(data)
+expected = pa.array([10, 0.5], type="float32")
 assert arr.equals(expected)
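
The inference behavior the tests above pin down can be summarized as a promotion rule: every int8 value fits exactly in a float32 (24-bit mantissa), so int8 mixed with float32 stays float32, while int32 mixed with float32 widens to float64. A simplified sketch of that rule, not Arrow's full type lattice:

```python
def promote(int_bits, float_bits):
    """Pick a float type wide enough for exact signed-int representation.

    float32 has a 24-bit mantissa, float64 a 53-bit one; a signed int has
    (int_bits - 1) value bits.
    """
    mantissa = 24 if float_bits == 32 else 53
    return f"float{float_bits}" if int_bits - 1 <= mantissa else "float64"

print(promote(8, 32))   # float32  (the ARROW-9490 case)
print(promote(32, 32))  # float64  (the ARROW-4324 case)
```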
 
 



[arrow] branch master updated (5d9ccb7 -> 36d267b)

2020-08-20 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 5d9ccb7  ARROW-6437: [R] Add AWS SDK to system dependencies for macOS 
and Windows
 add 36d267b  [MINOR] Fix typo and use more concise word in README.md

No new revisions were added by this update.

Summary of changes:
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[arrow] branch master updated: ARROW-9528: [Python] Honor tzinfo when converting from datetime

2020-08-16 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 2e3d7ec  ARROW-9528: [Python] Honor tzinfo when converting from 
datetime
2e3d7ec is described below

commit 2e3d7ecd320d3e91d285ad0ee729aa18e2b4e476
Author: Krisztián Szűcs 
AuthorDate: Sun Aug 16 15:12:28 2020 -0500

ARROW-9528: [Python] Honor tzinfo when converting from datetime

Follow up of:
- ARROW-9223: [Python] Propagate timezone information in pandas conversion
- ARROW-9528: [Python] Honor tzinfo when converting from datetime 
(https://github.com/apache/arrow/pull/7805)

TODOs:
- [x] Store all Timestamp values normalized to UTC
- [x] Infer timezone from the array values if no explicit type was given
- [x] Testing (especially pandas object roundtrip)
- [x] Testing of timezone-naive roundtrips
- [x] Testing mixed pandas and datetime objects

Closes #7816 from kszucs/tz

Lead-authored-by: Krisztián Szűcs 
Co-authored-by: Micah Kornfield 
Signed-off-by: Wes McKinney 
---
 ci/scripts/integration_spark.sh|   3 +
 cpp/src/arrow/compute/kernels/scalar_string.cc |   4 +-
 cpp/src/arrow/python/arrow_to_pandas.cc|  53 --
 cpp/src/arrow/python/arrow_to_pandas.h |   5 +-
 cpp/src/arrow/python/datetime.cc   | 172 +-
 cpp/src/arrow/python/datetime.h|  26 +++
 cpp/src/arrow/python/inference.cc  |  22 +--
 cpp/src/arrow/python/python_to_arrow.cc| 151 +---
 cpp/src/arrow/python/python_to_arrow.h |   8 +-
 python/pyarrow/array.pxi   |   7 +-
 python/pyarrow/includes/libarrow.pxd   |   5 +
 python/pyarrow/tests/test_array.py |  22 ++-
 python/pyarrow/tests/test_convert_builtin.py   | 234 -
 python/pyarrow/tests/test_pandas.py|  60 +--
 python/pyarrow/tests/test_types.py | 117 +
 python/pyarrow/types.pxi   |  40 +
 16 files changed, 747 insertions(+), 182 deletions(-)

diff --git a/ci/scripts/integration_spark.sh b/ci/scripts/integration_spark.sh
index 9828a28..a45ed7a 100755
--- a/ci/scripts/integration_spark.sh
+++ b/ci/scripts/integration_spark.sh
@@ -22,6 +22,9 @@ source_dir=${1}
 spark_dir=${2}
 spark_version=${SPARK_VERSION:-master}
 
+# Use old behavior that always dropped timezones.
+export PYARROW_IGNORE_TIMEZONE=1
+
 if [ "${SPARK_VERSION:0:2}" == "2." ]; then
   # 
https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibility-setting-for-pyarrow--0150-and-spark-23x-24x
   export ARROW_PRE_0_15_IPC_FORMAT=1
diff --git a/cpp/src/arrow/compute/kernels/scalar_string.cc 
b/cpp/src/arrow/compute/kernels/scalar_string.cc
index 7e61617..0332be9 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string.cc
@@ -861,10 +861,10 @@ void AddBinaryLength(FunctionRegistry* registry) {
   applicator::ScalarUnaryNotNull::Exec;
   ArrayKernelExec exec_offset_64 =
   applicator::ScalarUnaryNotNull::Exec;
-  for (const auto& input_type : {binary(), utf8()}) {
+  for (const auto input_type : {binary(), utf8()}) {
 DCHECK_OK(func->AddKernel({input_type}, int32(), exec_offset_32));
   }
-  for (const auto& input_type : {large_binary(), large_utf8()}) {
+  for (const auto input_type : {large_binary(), large_utf8()}) {
 DCHECK_OK(func->AddKernel({input_type}, int64(), exec_offset_64));
   }
   DCHECK_OK(registry->AddFunction(std::move(func)));
diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc 
b/cpp/src/arrow/python/arrow_to_pandas.cc
index bc4e25b..47b62a3 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.cc
+++ b/cpp/src/arrow/python/arrow_to_pandas.cc
@@ -17,9 +17,8 @@
 
 // Functions for pandas conversion via NumPy
 
-#include "arrow/python/numpy_interop.h"  // IWYU pragma: expand
-
 #include "arrow/python/arrow_to_pandas.h"
+#include "arrow/python/numpy_interop.h"  // IWYU pragma: expand
 
 #include 
 #include 
@@ -642,15 +641,15 @@ inline Status ConvertStruct(const PandasOptions& options, 
const ChunkedArray& da
   std::vector fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted thi

[arrow] branch master updated: ARROW-9598: [C++][Parquet] Fix writing nullable structs

2020-08-10 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 1b0aebe  ARROW-9598: [C++][Parquet] Fix writing nullable structs
1b0aebe is described below

commit 1b0aebea45bcd6b271324fcfc373e4ccc7543eaa
Author: Micah Kornfield 
AuthorDate: Mon Aug 10 15:33:10 2020 -0500

ARROW-9598: [C++][Parquet] Fix writing nullable structs

Traverse the node hierarchy to ensure we capture the right value count.

Closes #7862 from emkornfield/verify_parquetfg

Authored-by: Micah Kornfield 
Signed-off-by: Wes McKinney 
---
 cpp/src/parquet/arrow/arrow_reader_writer_test.cc | 17 +
 cpp/src/parquet/column_writer.cc  |  9 ++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc 
b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
index 661ce7b..476d82f 100644
--- a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
+++ b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -2344,6 +2344,23 @@ TEST(ArrowReadWrite, SimpleStructRoundTrip) {
   2);
 }
 
+TEST(ArrowReadWrite, SingleColumnNullableStruct) {
+  auto links =
+  field("Links",
+::arrow::struct_({field("Backward", ::arrow::int64(), 
/*nullable=*/true)}));
+
+  auto links_id_array = ::arrow::ArrayFromJSON(links->type(),
+   "[null, "
+   "{\"Backward\": 10}"
+   "]");
+
+  CheckSimpleRoundtrip(
+  ::arrow::Table::Make(std::make_shared<::arrow::Schema>(
+   
std::vector<std::shared_ptr<::arrow::Field>>{links}),
+   {links_id_array}),
+  3);
+}
+
 // Disabled until implementation can be finished.
 TEST(TestArrowReadWrite, DISABLED_CanonicalNestedRoundTrip) {
   auto doc_id = field("DocId", ::arrow::int64(), /*nullable=*/false);
diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index f9cf37c..6cb0bae 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -1138,8 +1138,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, 
public TypedColumnWriter<
 if (descr_->max_definition_level() > 0) {
   // Minimal definition level for which spaced values are written
   int16_t min_spaced_def_level = descr_->max_definition_level();
-  if (descr_->schema_node()->is_optional()) {
-min_spaced_def_level--;
+  const ::parquet::schema::Node* node = descr_->schema_node().get();
+  while (node != nullptr && !node->is_repeated()) {
+if (node->is_optional()) {
+  min_spaced_def_level--;
+}
+node = node->parent();
   }
   for (int64_t i = 0; i < num_levels; ++i) {
 if (def_levels[i] == descr_->max_definition_level()) {
@@ -1149,7 +1153,6 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, 
public TypedColumnWriter<
   ++spaced_values_to_write;
 }
   }
-
   WriteDefinitionLevels(num_levels, def_levels);
 } else {
   // Required field, write all values
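The fix above walks the schema node hierarchy instead of looking only at the leaf node. A hypothetical Python sketch of that logic (node and method names are illustrative, not the actual parquet-cpp API): decrement the minimal "spaced" definition level once per optional ancestor, stopping at the first repeated node.

```python
# Sketch of the ARROW-9598 fix: account for every optional ancestor
# (e.g. a nullable struct parent), not just the leaf's own optionality.

class Node:
    def __init__(self, repetition, parent=None):
        self.repetition = repetition  # "required" | "optional" | "repeated"
        self.parent = parent

    def is_optional(self):
        return self.repetition == "optional"

    def is_repeated(self):
        return self.repetition == "repeated"


def min_spaced_def_level(leaf, max_definition_level):
    """Minimal definition level at which a spaced (null-aware) value is written."""
    level = max_definition_level
    node = leaf
    # Walk up toward the root, decrementing once per optional node,
    # stopping at the first repeated ancestor.
    while node is not None and not node.is_repeated():
        if node.is_optional():
            level -= 1
        node = node.parent
    return level


# A nullable struct containing a nullable int64 leaf: max def level 2,
# and both levels can carry nulls, so spaced writing starts at level 0.
struct = Node("optional")
leaf = Node("optional", parent=struct)
print(min_spaced_def_level(leaf, 2))  # 0
```

Before the fix, only the leaf's `is_optional()` was checked, so the nullable struct parent's definition level was miscounted.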



[arrow] branch master updated (4489cb7 -> 9c04867)

2020-08-07 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 4489cb7  ARROW-9462:[Go] The Indentation after the first Record in 
arrjson writer is incorrect
 add 9c04867  ARROW-9643: [C++] Only register the SIMD variants when it's 
supported.

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/aggregate_basic.cc | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)



[arrow-site] branch master updated: Adjust positioning of badges (#70)

2020-08-04 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
 new 4632363  Adjust positioning of badges (#70)
4632363 is described below

commit 4632363bfae07650817030ca554d311875b97440
Author: Neal Richardson 
AuthorDate: Tue Aug 4 14:58:20 2020 -0700

Adjust positioning of badges (#70)
---
 _layouts/home.html | 11 +--
 css/main.scss  | 13 +
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/_layouts/home.html b/_layouts/home.html
index f6f49ea..fe074f9 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -8,8 +8,15 @@
 
   
   A cross-language development platform for in-memory 
analytics
-  
-  <a class="github-button" href="https://github.com/apache/arrow" data-size="large" data-show-count="true" aria-label="Star apache/arrow on GitHub">Star</a> <a href="https://twitter.com/ApacheArrow?ref_src=twsrc%5Etfw" class="twitter-follow-button" data-show-count="true">Follow @ApacheArrow</a><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
+
+  <div class="social-badges">
+    <span class="social-badge">
+      <a class="github-button" href="https://github.com/apache/arrow" data-size="large" data-show-count="true" aria-label="Star apache/arrow on GitHub">Star</a>
+    </span>
+    <span class="social-badge">
+      <a href="https://twitter.com/ApacheArrow?ref_src=twsrc%5Etfw" class="twitter-follow-button" data-show-count="true">Follow @ApacheArrow</a><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
+    </span>
+  </div>
 
   
   
diff --git a/css/main.scss b/css/main.scss
index e844dfb..a4cdb90 100644
--- a/css/main.scss
+++ b/css/main.scss
@@ -97,3 +97,16 @@ p code, li code {
 p a code {
   color: inherit;
 }
+
+.social-badges iframe {
+  vertical-align: middle;
+}
+
+.social-badges span {
+  vertical-align: top;
+}
+
+.social-badge {
+  display: inline;
+  padding: 12px;
+}



[arrow] branch master updated (0d25270 -> 50d6252)

2020-08-03 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 0d25270  PARQUET-1845: [C++] Add expected results of Int96 in 
big-endian
 add 50d6252  ARROW-9096: [Python] Pandas roundtrip with dtype="object" 
underlying numeric column index

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/pandas_compat.py | 21 +++--
 python/pyarrow/tests/test_pandas.py | 30 +-
 2 files changed, 32 insertions(+), 19 deletions(-)



[arrow] branch master updated: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-08-03 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d25270  PARQUET-1845: [C++] Add expected results of Int96 in 
big-endian
0d25270 is described below

commit 0d25270703fcc1db95104d6b77ae6d1286c36977
Author: Kazuaki Ishizaki 
AuthorDate: Mon Aug 3 11:46:18 2020 -0500

PARQUET-1845: [C++] Add expected results of Int96 in big-endian

This PR adds expected results of Int96 for parquet-internals-test in 
big-endian.

This PR assumes that the uint_64 and uint_32 elements in Int96 are handled 
using native endianness for efficiency.

Closes #6981 from kiszk/PARQUET-1845

Authored-by: Kazuaki Ishizaki 
Signed-off-by: Wes McKinney 
---
 cpp/src/parquet/types_test.cc | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/cpp/src/parquet/types_test.cc b/cpp/src/parquet/types_test.cc
index ccec95f..a14308f 100644
--- a/cpp/src/parquet/types_test.cc
+++ b/cpp/src/parquet/types_test.cc
@@ -102,8 +102,13 @@ TEST(TypePrinter, StatisticsTypes) {
   ASSERT_STREQ("1.0245", FormatStatValue(Type::DOUBLE, smin).c_str());
   ASSERT_STREQ("2.0489", FormatStatValue(Type::DOUBLE, smax).c_str());
 
+#if ARROW_LITTLE_ENDIAN
   Int96 Int96_min = {{1024, 2048, 4096}};
   Int96 Int96_max = {{2048, 4096, 8192}};
+#else
+  Int96 Int96_min = {{2048, 1024, 4096}};
+  Int96 Int96_max = {{4096, 2048, 8192}};
+#endif
  smin = std::string(reinterpret_cast<const char*>(&Int96_min), sizeof(Int96));
  smax = std::string(reinterpret_cast<const char*>(&Int96_max), sizeof(Int96));
   ASSERT_STREQ("1024 2048 4096", FormatStatValue(Type::INT96, smin).c_str());
@@ -126,9 +131,14 @@ TEST(TypePrinter, StatisticsTypes) {
 
 TEST(TestInt96Timestamp, Decoding) {
   auto check = [](int32_t julian_day, uint64_t nanoseconds) {
+#if ARROW_LITTLE_ENDIAN
    Int96 i96{static_cast<uint32_t>(nanoseconds),
              static_cast<uint32_t>(nanoseconds >> 32),
              static_cast<uint32_t>(julian_day)};
+#else
+    Int96 i96{static_cast<uint32_t>(nanoseconds >> 32),
+              static_cast<uint32_t>(nanoseconds), static_cast<uint32_t>(julian_day)};
+#endif
 // Official formula according to 
https://github.com/apache/parquet-format/pull/49
 int64_t expected =
 (julian_day - 2440588) * (86400LL * 1000 * 1000 * 1000) + nanoseconds;
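The decoding formula in the test above can be sketched in plain Python. This is an illustrative stand-in (not the parquet-cpp API): the first two 32-bit words of an Int96 hold the nanoseconds-within-day (little-endian word order shown), the third holds the Julian day, and 2440588 is the Julian day of the Unix epoch.

```python
# Sketch of Int96 timestamp decoding per the official formula
# (https://github.com/apache/parquet-format/pull/49).

JULIAN_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86400 * 1000 * 1000 * 1000

def decode_int96(words):
    """words: three uint32s as laid out on a little-endian machine."""
    nanoseconds = words[0] | (words[1] << 32)  # low word first
    julian_day = words[2]
    return (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanoseconds

# One day and five nanoseconds past the Unix epoch:
print(decode_int96([5, 0, 2440589]))  # 86400000000005
```

On a big-endian machine the two nanosecond words are stored in the opposite order, which is exactly the swap the `#else` branch in the diff above applies.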



[arrow-site] 01/01: Add GitHub star and Twitter follow buttons

2020-08-02 Thread wesm

wesm pushed a commit to branch follow-buttons
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit c1d35383d0272f4015c03e3011cbf7a82f81e8aa
Author: Wes McKinney 
AuthorDate: Sun Aug 2 13:30:21 2020 -0500

Add GitHub star and Twitter follow buttons
---
 _layouts/home.html | 4 
 1 file changed, 4 insertions(+)

diff --git a/_layouts/home.html b/_layouts/home.html
index c58651f..f6f49ea 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -8,6 +8,8 @@
 
   
   A cross-language development platform for in-memory 
analytics
+  
+  <a class="github-button" href="https://github.com/apache/arrow" data-size="large" data-show-count="true" aria-label="Star apache/arrow on GitHub">Star</a> <a href="https://twitter.com/ApacheArrow?ref_src=twsrc%5Etfw" class="twitter-follow-button" data-show-count="true">Follow @ApacheArrow</a><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
 
   
   
@@ -17,5 +19,7 @@
 
 {% include footer.html %}
   
+
+<script async defer src="https://buttons.github.io/buttons.js"></script>
 
 



[arrow-site] branch follow-buttons created (now c1d3538)

2020-08-02 Thread wesm

wesm pushed a change to branch follow-buttons
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


  at c1d3538  Add GitHub star and Twitter follow buttons

This branch includes the following new commits:

 new c1d3538  Add GitHub star and Twitter follow buttons

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[arrow] branch master updated: ARROW-9398: [C++] Register SIMD sum variants to function instance.

2020-07-30 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 6efba62  ARROW-9398: [C++] Register SIMD sum variants to function 
instance.
6efba62 is described below

commit 6efba62ee47196e62e3521b07d4c25c092e8910e
Author: Frank Du 
AuthorDate: Thu Jul 30 18:09:06 2020 -0500

ARROW-9398: [C++] Register SIMD sum variants to function instance.

Enable simd_level feature of kernel and use it in DispatchExactImpl.
Add simd_level as a parameter of sum template to make sure every simd 
kernel has its own instantiation instance.
Also expand sum/mean test case to cover BitBlockCounter method.

Signed-off-by: Frank Du 

Closes #7700 from jianxind/sum_variants_to_function

Authored-by: Frank Du 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/compute/function.cc  | 25 +-
 cpp/src/arrow/compute/kernel.h |  9 +++--
 cpp/src/arrow/compute/kernels/aggregate_basic.cc   | 40 --
 .../compute/kernels/aggregate_basic_internal.h | 30 ++--
 .../arrow/compute/kernels/aggregate_sum_avx2.cc| 39 -
 .../arrow/compute/kernels/aggregate_sum_avx512.cc  | 40 --
 cpp/src/arrow/compute/kernels/aggregate_test.cc|  8 +++--
 cpp/src/arrow/compute/registry.cc  | 14 
 cpp/src/arrow/compute/registry_internal.h  |  3 --
 9 files changed, 110 insertions(+), 98 deletions(-)

diff --git a/cpp/src/arrow/compute/function.cc 
b/cpp/src/arrow/compute/function.cc
index 1bce468..41c3e36 100644
--- a/cpp/src/arrow/compute/function.cc
+++ b/cpp/src/arrow/compute/function.cc
@@ -24,6 +24,7 @@
 #include "arrow/compute/exec.h"
 #include "arrow/compute/exec_internal.h"
 #include "arrow/datum.h"
+#include "arrow/util/cpu_info.h"
 
 namespace arrow {
 namespace compute {
@@ -58,6 +59,7 @@ Result<const KernelType*> DispatchExactImpl(const Function& func,
                                             const std::vector<KernelType>& kernels,
                                             const std::vector<ValueDescr>& values) {
   const int passed_num_args = static_cast<int>(values.size());
+  const KernelType* kernel_matches[SimdLevel::MAX] = {NULL};
 
   // Validate arity
   const Arity arity = func.arity();
@@ -70,9 +72,30 @@ Result<const KernelType*> DispatchExactImpl(const Function& func,
   }
   for (const auto& kernel : kernels) {
 if (kernel.signature->MatchesInputs(values)) {
-  return 
+  kernel_matches[kernel.simd_level] = 
 }
   }
+
+  // Dispatch as the CPU feature
+  auto cpu_info = arrow::internal::CpuInfo::GetInstance();
+#if defined(ARROW_HAVE_RUNTIME_AVX512)
+  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX512)) {
+if (kernel_matches[SimdLevel::AVX512]) {
+  return kernel_matches[SimdLevel::AVX512];
+}
+  }
+#endif
+#if defined(ARROW_HAVE_RUNTIME_AVX2)
+  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX2)) {
+if (kernel_matches[SimdLevel::AVX2]) {
+  return kernel_matches[SimdLevel::AVX2];
+}
+  }
+#endif
+  if (kernel_matches[SimdLevel::NONE]) {
+return kernel_matches[SimdLevel::NONE];
+  }
+
   return Status::NotImplemented("Function ", func.name(),
 " has no kernel matching input types ",
 FormatArgTypes(values));
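The dispatch strategy in this hunk can be sketched in Python: collect at most one matching kernel per SIMD level, then prefer the highest level the running CPU supports, falling back to the generic kernel. A hedged illustration, with level and kernel names as stand-ins for the C++ enum and kernel objects:

```python
# Sketch of SIMD-aware kernel dispatch: highest supported level wins,
# generic ("none") kernel is the fallback.

PREFERENCE = ["avx512", "avx2", "none"]  # highest level first

def dispatch(kernel_matches, cpu_features):
    """kernel_matches: dict simd_level -> kernel; cpu_features: set of levels."""
    for level in PREFERENCE:
        kernel = kernel_matches.get(level)
        if kernel is not None and (level == "none" or level in cpu_features):
            return kernel
    raise NotImplementedError("no kernel matching input types")

kernels = {"none": "sum_generic", "avx2": "sum_avx2"}
print(dispatch(kernels, {"avx2"}))  # sum_avx2
print(dispatch(kernels, set()))     # sum_generic
```

This mirrors the `kernel_matches[SimdLevel::...]` array plus the `#ifdef`-guarded `CpuInfo::IsSupported` checks above: registration records every variant, and the CPU-feature probe happens only at dispatch time.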
diff --git a/cpp/src/arrow/compute/kernel.h b/cpp/src/arrow/compute/kernel.h
index c581544..3fb6947 100644
--- a/cpp/src/arrow/compute/kernel.h
+++ b/cpp/src/arrow/compute/kernel.h
@@ -448,7 +448,7 @@ class ARROW_EXPORT KernelSignature {
 /// type combination for different SIMD levels. Based on the active system's
 /// CPU info or the user's preferences, we can elect to use one over the other.
 struct SimdLevel {
-  enum type { NONE, SSE4_2, AVX, AVX2, AVX512, NEON };
+  enum type { NONE = 0, SSE4_2, AVX, AVX2, AVX512, NEON, MAX };
 };
 
 /// \brief The strategy to use for propagating or otherwise populating the
@@ -555,10 +555,9 @@ struct Kernel {
   bool parallelizable = true;
 
   /// \brief Indicates the level of SIMD instruction support in the host CPU is
-  /// required to use the function. Currently this is not used, but the
-  /// intention is for functions to be able to contain multiple kernels with
-  /// the same signature but different levels of SIMD, so that the most
-  /// optimized kernel supported on a host's processor can be chosen.
+  /// required to use the function. The intention is for functions to be able 
to
+  /// contain multiple kernels with the same signature but different levels of 
SIMD,
+  /// so that the most optimized kernel supported on a host's processor can be 
chosen.
   SimdLevel::type simd_level = SimdLevel::NONE;
 };
 
diff --git a/cpp/src/arrow/compute/ker

[arrow] branch master updated (564366c -> fad0b94)

2020-07-28 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 564366c  ARROW-9589: [C++/R] Forward declare structs as structs
 add fad0b94  ARROW-9585: [Rust][DataFusion] Remove duplicated to-do line

No new revisions were added by this update.

Summary of changes:
 rust/datafusion/README.md | 1 -
 1 file changed, 1 deletion(-)



[arrow-testing] branch master updated: ARROW-8797: Add golden files to support ipc between different endians (#41)

2020-07-28 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-testing.git


The following commit(s) were added to refs/heads/master by this push:
 new 0e56bdd  ARROW-8797: Add golden files to support ipc between different 
endians (#41)
0e56bdd is described below

commit 0e56bdd4fc887f26fdf018c746b24f09f16e2a08
Author: Kazuaki Ishizaki 
AuthorDate: Wed Jul 29 03:41:23 2020 +0900

ARROW-8797: Add golden files to support ipc between different endians (#41)

* add golden files

* address review comment
---
 .../generated_custom_metadata.arrow_file | Bin 0 -> 2682 bytes
 .../generated_custom_metadata.json.gz| Bin 0 -> 598 bytes
 .../1.0.0-bigendian/generated_custom_metadata.stream | Bin 0 -> 1520 bytes
 .../1.0.0-bigendian/generated_datetime.arrow_file| Bin 0 -> 5498 bytes
 .../1.0.0-bigendian/generated_datetime.json.gz   | Bin 0 -> 2738 bytes
 .../1.0.0-bigendian/generated_datetime.stream| Bin 0 -> 4576 bytes
 .../1.0.0-bigendian/generated_decimal.arrow_file | Bin 0 -> 256642 bytes
 .../1.0.0-bigendian/generated_decimal.json.gz| Bin 0 -> 159351 bytes
 .../1.0.0-bigendian/generated_decimal.stream | Bin 0 -> 253920 bytes
 .../1.0.0-bigendian/generated_dictionary.arrow_file  | Bin 0 -> 2642 bytes
 .../1.0.0-bigendian/generated_dictionary.json.gz | Bin 0 -> 1166 bytes
 .../1.0.0-bigendian/generated_dictionary.stream  | Bin 0 -> 2136 bytes
 .../generated_dictionary_unsigned.arrow_file | Bin 0 -> 2178 bytes
 .../generated_dictionary_unsigned.json.gz| Bin 0 -> 693 bytes
 .../generated_dictionary_unsigned.stream | Bin 0 -> 1704 bytes
 .../generated_duplicate_fieldnames.arrow_file| Bin 0 -> 1130 bytes
 .../generated_duplicate_fieldnames.json.gz   | Bin 0 -> 415 bytes
 .../generated_duplicate_fieldnames.stream| Bin 0 -> 736 bytes
 .../1.0.0-bigendian/generated_extension.arrow_file   | Bin 0 -> 2050 bytes
 .../1.0.0-bigendian/generated_extension.json.gz  | Bin 0 -> 918 bytes
 .../1.0.0-bigendian/generated_extension.stream   | Bin 0 -> 1400 bytes
 .../1.0.0-bigendian/generated_interval.arrow_file| Bin 0 -> 2418 bytes
 .../1.0.0-bigendian/generated_interval.json.gz   | Bin 0 -> 1506 bytes
 .../1.0.0-bigendian/generated_interval.stream| Bin 0 -> 1984 bytes
 .../1.0.0-bigendian/generated_large_batch.arrow_file | Bin 0 -> 9838418 bytes
 .../1.0.0-bigendian/generated_large_batch.json.gz| Bin 0 -> 11050357 bytes
 .../1.0.0-bigendian/generated_large_batch.stream | Bin 0 -> 9836424 bytes
 .../1.0.0-bigendian/generated_map.arrow_file | Bin 0 -> 1642 bytes
 .../1.0.0-bigendian/generated_map.json.gz| Bin 0 -> 835 bytes
 .../integration/1.0.0-bigendian/generated_map.stream | Bin 0 -> 1256 bytes
 .../generated_map_non_canonical.arrow_file   | Bin 0 -> 1242 bytes
 .../generated_map_non_canonical.json.gz  | Bin 0 -> 718 bytes
 .../generated_map_non_canonical.stream   | Bin 0 -> 840 bytes
 .../1.0.0-bigendian/generated_nested.arrow_file  | Bin 0 -> 2714 bytes
 .../1.0.0-bigendian/generated_nested.json.gz | Bin 0 -> 1622 bytes
 .../1.0.0-bigendian/generated_nested.stream  | Bin 0 -> 2168 bytes
 .../generated_nested_dictionary.arrow_file   | Bin 0 -> 3362 bytes
 .../generated_nested_dictionary.json.gz  | Bin 0 -> 1149 bytes
 .../generated_nested_dictionary.stream   | Bin 0 -> 2632 bytes
 .../generated_nested_large_offsets.arrow_file| Bin 0 -> 2602 bytes
 .../generated_nested_large_offsets.json.gz   | Bin 0 -> 1105 bytes
 .../generated_nested_large_offsets.stream| Bin 0 -> 2032 bytes
 .../1.0.0-bigendian/generated_null.arrow_file| Bin 0 -> 1322 bytes
 .../1.0.0-bigendian/generated_null.json.gz   | Bin 0 -> 502 bytes
 .../1.0.0-bigendian/generated_null.stream| Bin 0 -> 920 bytes
 .../generated_null_trivial.arrow_file| Bin 0 -> 530 bytes
 .../1.0.0-bigendian/generated_null_trivial.json.gz   | Bin 0 -> 192 bytes
 .../1.0.0-bigendian/generated_null_trivial.stream| Bin 0 -> 320 bytes
 .../1.0.0-bigendian/generated_primitive.arrow_file   | Bin 0 -> 22306 bytes
 .../1.0.0-bigendian/generated_primitive.json.gz  | Bin 0 -> 19362 bytes
 .../1.0.0-bigendian/generated_primitive.stream   | Bin 0 -> 20288 bytes
 .../generated_primitive_large_offsets.arrow_file | Bin 0 -> 3586 bytes
 .../generated_primitive_large_offsets.json.gz| Bin 0 -> 1702 bytes
 .../generated_primitive_large_offsets.stream | Bin 0 -> 3160 bytes
 .../generated_primitive_no_batches.arrow_file| Bin 0 -&

[arrow] branch master updated: ARROW-9512: [C++] Avoid variadic template unpack inside lambda to work around gcc 4.8 bug

2020-07-19 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 8a8d7ce  ARROW-9512: [C++] Avoid variadic template unpack inside 
lambda to work around gcc 4.8 bug
8a8d7ce is described below

commit 8a8d7ce39793ed8cafb2318c2752f027c75a17e6
Author: Wes McKinney 
AuthorDate: Sun Jul 19 12:25:20 2020 -0500

ARROW-9512: [C++] Avoid variadic template unpack inside lambda to work 
around gcc 4.8 bug

This works around a gcc bug. This only affects compilation of unit tests on 
gcc 4.8 so not an issue for the 1.0.0 RC1

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47226

Closes #7794 from wesm/ARROW-9512

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/testing/gtest_util.cc | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/cpp/src/arrow/testing/gtest_util.cc 
b/cpp/src/arrow/testing/gtest_util.cc
index de5b87a..b2f5566 100644
--- a/cpp/src/arrow/testing/gtest_util.cc
+++ b/cpp/src/arrow/testing/gtest_util.cc
@@ -106,20 +106,6 @@ void AssertTsSame(const T& expected, const T& actual, 
CompareFunctor&& compare)
   }
 }
 
-template <typename T, typename... ExtraArgs>
-void AssertTsEqual(const T& expected, const T& actual, ExtraArgs... args) {
-  return AssertTsSame(expected, actual, [&](const T& expected, const T& 
actual) {
-return expected.Equals(actual, args...);
-  });
-}
-
-template <typename T>
-void AssertTsApproxEqual(const T& expected, const T& actual) {
-  return AssertTsSame(expected, actual, [](const T& expected, const T& actual) 
{
-return expected.ApproxEquals(actual);
-  });
-}
-
 template <typename CompareFunctor>
 void AssertArraysEqualWith(const Array& expected, const Array& actual, bool 
verbose,
CompareFunctor&& compare) {
@@ -175,11 +161,17 @@ void AssertScalarsEqual(const Scalar& expected, const 
Scalar& actual, bool verbo
 
 void AssertBatchesEqual(const RecordBatch& expected, const RecordBatch& actual,
 bool check_metadata) {
-  AssertTsEqual(expected, actual, check_metadata);
+  AssertTsSame(expected, actual,
+   [&](const RecordBatch& expected, const RecordBatch& actual) {
+ return expected.Equals(actual, check_metadata);
+   });
 }
 
 void AssertBatchesApproxEqual(const RecordBatch& expected, const RecordBatch& 
actual) {
-  AssertTsApproxEqual(expected, actual);
+  AssertTsSame(expected, actual,
+   [&](const RecordBatch& expected, const RecordBatch& actual) {
+ return expected.ApproxEquals(actual);
+   });
 }
 
 void AssertChunkedEqual(const ChunkedArray& expected, const ChunkedArray& 
actual) {



[arrow] branch master updated (1fcbc6d -> 954547a)

2020-07-15 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 1fcbc6d  ARROW-9478: [C++] Improve error message for unsupported casts
 add 954547a  ARROW-9499: [C++] AdaptiveIntBuilder::AppendNull does not 
increment the null count

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/array/array_test.cc  | 12 
 cpp/src/arrow/array/builder_adaptive.h |  1 +
 2 files changed, 13 insertions(+)
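The invariant this one-line fix restores can be sketched in Python (a hypothetical builder, not the Arrow C++ API): appending a null must both grow the length and increment `null_count`, or the finished array misreports its nulls.

```python
# Sketch of the ARROW-9499 invariant: AppendNull must track null_count.

class AdaptiveIntBuilderSketch:
    def __init__(self):
        self.values = []
        self.length = 0
        self.null_count = 0

    def append(self, v):
        self.values.append(v)
        self.length += 1

    def append_null(self):
        self.values.append(None)
        self.length += 1
        self.null_count += 1  # the increment the fix effectively adds

b = AdaptiveIntBuilderSketch()
b.append(1)
b.append_null()
print((b.length, b.null_count))  # (2, 1)
```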



[arrow-testing] branch master updated: ARROW-9497: [C++][Parquet] Add oss-fuzz test case

2020-07-15 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-testing.git


The following commit(s) were added to refs/heads/master by this push:
 new f552c4d  ARROW-9497: [C++][Parquet] Add oss-fuzz test case
f552c4d is described below

commit f552c4dcd2ae3d14048abd20919748cce5276ade
Author: Wes McKinney 
AuthorDate: Wed Jul 15 19:13:00 2020 -0500

ARROW-9497: [C++][Parquet] Add oss-fuzz test case
---
 ...testcase-minimized-parquet-arrow-fuzz-5747849626386432 | Bin 0 -> 213 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git 
a/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5747849626386432
 
b/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5747849626386432
new file mode 100644
index 000..67697be
Binary files /dev/null and 
b/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5747849626386432
 differ



[arrow] branch master updated (842d513 -> be84d7b)

2020-07-15 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 842d513  ARROW-9476: [C++][Dataset] Fix incorrect dictionary 
association in HivePartitioningFactory
 add be84d7b  ARROW-9486: [C++][Dataset] Support implicit cast of 
InExpression::set to dict

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/dataset/filter.cc  | 21 +++--
 cpp/src/arrow/dataset/filter_test.cc | 10 +++---
 2 files changed, 26 insertions(+), 5 deletions(-)



[arrow] branch master updated (a88635a -> 399c034)

2020-07-15 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from a88635a  ARROW-9485: [R] Better shared library stripping
 add 399c034  ARROW-9484: [Docs] Update is* functions to be is_* in the 
compute docs

No new revisions were added by this update.

Summary of changes:
 .../compute/kernels/scalar_string_benchmark.cc |  4 +--
 docs/source/cpp/compute.rst| 42 +++---
 r/README.md|  6 
 3 files changed, 23 insertions(+), 29 deletions(-)



[arrow] branch master updated: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

2020-07-14 Thread wesm

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 3586292  ARROW-9424: [C++][Parquet] Disable writing files with LZ4 
codec
3586292 is described below

commit 3586292d62c8c348e9fb85676eb524cde53179cf
Author: Wes McKinney 
AuthorDate: Tue Jul 14 21:39:47 2020 -0500

ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Due to ongoing LZ4 problems with Parquet files, this patch disables writing 
files with LZ4 codec by throwing a `ParquetException`.

In progress: adding exceptions for pyarrow when using LZ4 to write files 
and updating relevant pytests

Mailing list discussion: 
https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E

Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424

Closes #7757 from patrickpai/ARROW-9424

Lead-authored-by: Wes McKinney 
Co-authored-by: Patrick Pai 
Signed-off-by: Wes McKinney 
---
 cpp/src/parquet/column_reader.cc |  2 +-
 cpp/src/parquet/column_writer.cc |  2 +-
 cpp/src/parquet/column_writer_test.cc| 10 ++
 cpp/src/parquet/file_deserialize_test.cc |  5 ++---
 cpp/src/parquet/file_serialize_test.cc   |  2 +-
 cpp/src/parquet/thrift_internal.h|  1 +
 cpp/src/parquet/types.cc | 33 
 cpp/src/parquet/types.h  |  9 +
 python/pyarrow/tests/test_parquet.py | 16 ++--
 9 files changed, 64 insertions(+), 16 deletions(-)
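The gating this patch introduces can be sketched in Python: the read path keeps the old codec lookup, while the write path rejects LZ4 until the underlying format ambiguity (PARQUET-1878) is resolved. Function names below are illustrative stand-ins for the `internal::GetReadCodec` / `internal::GetWriteCodec` split in the diff.

```python
# Sketch of a read/write codec split that disables one codec for writing only.

class ParquetException(Exception):
    pass

def get_read_codec(codec):
    # Reading LZ4 data written by other tools remains allowed.
    return codec

def get_write_codec(codec, compression_level=None):
    if codec == "LZ4":
        raise ParquetException(
            "LZ4 compression is temporarily disabled for Parquet writing")
    return codec

print(get_read_codec("LZ4"))  # LZ4
try:
    get_write_codec("LZ4")
except ParquetException:
    print("write rejected")  # write rejected
```

Splitting the factory (rather than deleting the codec) keeps compatibility with existing files while preventing the creation of new, ambiguous ones.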

diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index 0bfc303..bc462ad 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -182,7 +182,7 @@ class SerializedPageReader : public PageReader {
   InitDecryption();
 }
 max_page_header_size_ = kDefaultMaxPageHeaderSize;
-decompressor_ = GetCodec(codec);
+decompressor_ = internal::GetReadCodec(codec);
   }
 
   // Implement the PageReader interface
diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index 13f91e3..f9cf37c 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -172,7 +172,7 @@ class SerializedPageWriter : public PageWriter {
 if (data_encryptor_ != nullptr || meta_encryptor_ != nullptr) {
   InitEncryption();
 }
-compressor_ = GetCodec(codec, compression_level);
+compressor_ = internal::GetWriteCodec(codec, compression_level);
 thrift_serializer_.reset(new ThriftSerializer);
   }
 
diff --git a/cpp/src/parquet/column_writer_test.cc 
b/cpp/src/parquet/column_writer_test.cc
index 23554aa..a92d4d2 100644
--- a/cpp/src/parquet/column_writer_test.cc
+++ b/cpp/src/parquet/column_writer_test.cc
@@ -488,13 +488,15 @@ TYPED_TEST(TestPrimitiveWriter, 
RequiredPlainWithStatsAndGzipCompression) {
 
 #ifdef ARROW_WITH_LZ4
 TYPED_TEST(TestPrimitiveWriter, RequiredPlainWithLz4Compression) {
-  this->TestRequiredWithSettings(Encoding::PLAIN, Compression::LZ4, false, 
false,
- LARGE_SIZE);
+  ASSERT_THROW(this->TestRequiredWithSettings(Encoding::PLAIN, 
Compression::LZ4, false,
+  false, LARGE_SIZE),
+   ParquetException);
 }
 
 TYPED_TEST(TestPrimitiveWriter, RequiredPlainWithStatsAndLz4Compression) {
-  this->TestRequiredWithSettings(Encoding::PLAIN, Compression::LZ4, false, 
true,
- LARGE_SIZE);
+  ASSERT_THROW(this->TestRequiredWithSettings(Encoding::PLAIN, 
Compression::LZ4, false,
+  true, LARGE_SIZE),
+   ParquetException);
 }
 #endif
 
diff --git a/cpp/src/parquet/file_deserialize_test.cc 
b/cpp/src/parquet/file_deserialize_test.cc
index 3fe2230..1dd3492 100644
--- a/cpp/src/parquet/file_deserialize_test.cc
+++ b/cpp/src/parquet/file_deserialize_test.cc
@@ -249,9 +249,8 @@ TEST_F(TestPageSerde, Compression) {
   codec_types.push_back(Compression::GZIP);
 #endif
 
-#ifdef ARROW_WITH_LZ4
-  codec_types.push_back(Compression::LZ4);
-#endif
+  // TODO: Add LZ4 compression type after PARQUET-1878 is complete.
+  // Testing for deserializing LZ4 is hard without writing enabled, so it is 
not included.
 
 #ifdef ARROW_WITH_ZSTD
   codec_types.push_back(Compression::ZSTD);
diff --git a/cpp/src/parquet/file_serialize_test.cc 
b/cpp/src/parquet/file_serialize_test.cc
index c5c4df2..72d7d6f 100644
--- a/cpp/src/parquet/file_serialize_test.cc
+++ b/cpp/src/parquet/file_serialize_test.cc
@@ -309,7 +309,7 @@ TYPED_TEST(TestSerialize, SmallFileGzip) {
 
 #ifdef ARROW_WITH_LZ4
 TYPED_TEST(TestSerialize, SmallFileLz4) {
-  ASSERT_NO_FATAL_FAILURE(this->FileSerializeTest(Compression::LZ4));

[arrow] branch master updated (075e4dd -> a0b7f2a)

2020-07-14 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 075e4dd  ARROW-9452: [Rust] [DataFusion] Optimize ParquetScanExec
 add a0b7f2a  ARROW-9399: [C++] Add forward compatibility test to detect 
and raise error for future MetadataVersion

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/flight/test_util.cc| 11 +--
 cpp/src/arrow/ipc/message.cc |  5 +
 cpp/src/arrow/ipc/read_write_test.cc | 20 
 cpp/src/arrow/testing/util.cc| 10 ++
 cpp/src/arrow/testing/util.h |  4 
 testing  |  2 +-
 6 files changed, 41 insertions(+), 11 deletions(-)
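A hedged Python sketch of the forward-compatibility check ARROW-9399 adds: a reader that knows metadata versions only up to some maximum must raise a clear error when it sees a message stamped with a future version, rather than silently misreading it. The version constant and message are illustrative.

```python
# Sketch of a forward-compatibility guard on an IPC metadata version field.

SUPPORTED_MAX = 5  # e.g. a library built before MetadataVersion::V6 existed

def check_metadata_version(version):
    if version > SUPPORTED_MAX:
        raise ValueError(
            f"Unsupported future MetadataVersion: V{version}; "
            f"this library supports up to V{SUPPORTED_MAX}")
    return version

print(check_metadata_version(4))  # 4
try:
    check_metadata_version(6)
except ValueError:
    print("rejected V6")  # rejected V6
```

The checked-in `schema_v6.arrow` file (see the arrow-testing commit above) exists precisely so older readers can be tested against this rejection path.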



[arrow] branch master updated (6a3f9eb -> 075e4dd)

2020-07-14 Thread wesm

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 6a3f9eb  ARROW-9473: [Doc] Polishing for 1.0
 add 075e4dd  ARROW-9452: [Rust] [DataFusion] Optimize ParquetScanExec

No new revisions were added by this update.

Summary of changes:
 .../src/execution/physical_plan/parquet.rs | 57 +-
 1 file changed, 24 insertions(+), 33 deletions(-)



[arrow] branch master updated (3fc83c2 -> f131fe6)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 3fc83c2  ARROW-9438: [CI] Add spark patch to compile with recent Arrow 
Java changes
 add f131fe6  ARROW-9390: [C++][Followup] Add underscores to is* string 
functions

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_string.cc | 44 ++---
 .../arrow/compute/kernels/scalar_string_test.cc| 77 +++---
 python/pyarrow/compute.py  | 40 +--
 python/pyarrow/tests/test_compute.py   | 29 
 4 files changed, 97 insertions(+), 93 deletions(-)



[arrow-testing] branch master updated: ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-testing.git


The following commit(s) were added to refs/heads/master by this push:
 new 41209ab  ARROW-9399: [C++] Check in serialized schema with 
MetadataVersion::V6
41209ab is described below

commit 41209ab1ead9fa8438cc41da4640354799627549
Author: Wes McKinney 
AuthorDate: Tue Jul 14 16:25:31 2020 -0500

ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6
---
 data/forward-compatibility/README.md   |  27 +++
 data/forward-compatibility/schema_v6.arrow | Bin 0 -> 120 bytes
 2 files changed, 27 insertions(+)

diff --git a/data/forward-compatibility/README.md 
b/data/forward-compatibility/README.md
new file mode 100644
index 000..f011f2f
--- /dev/null
+++ b/data/forward-compatibility/README.md
@@ -0,0 +1,27 @@
+
+
+# Forward compatibility testing files
+
+This folder contains files to help with verifying that current Arrow libraries
+reject Flatbuffers protocol additions "from the future" (like new data types,
+new features, new metadata versions, etc.).
+
+* schema_v6.arrow: a serialized Schema using a currently non-existent
+  MetadataVersion::V6
\ No newline at end of file
diff --git a/data/forward-compatibility/schema_v6.arrow 
b/data/forward-compatibility/schema_v6.arrow
new file mode 100644
index 000..a2cd1ae
Binary files /dev/null and b/data/forward-compatibility/schema_v6.arrow differ



[arrow] branch master updated: ARROW-9438: [CI] Add spark patch to compile with recent Arrow Java changes

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 3fc83c2  ARROW-9438: [CI] Add spark patch to compile with recent Arrow 
Java changes
3fc83c2 is described below

commit 3fc83c281104fff0bf8e07e7589281186c7ed251
Author: Bryan Cutler 
AuthorDate: Tue Jul 14 16:04:32 2020 -0500

ARROW-9438: [CI] Add spark patch to compile with recent Arrow Java changes

Recent changes in Arrow Java from ARROW-9300 now require adding a 
dependency on arrow-memory-netty to provide a default allocator. This adds a 
patch to build spark with the required dependency.

Closes #7746 from BryanCutler/spark-integration-patch-ARROW-9438

Lead-authored-by: Bryan Cutler 
Co-authored-by: Krisztián Szűcs 
Signed-off-by: Wes McKinney 
---
 ci/docker/conda-python-spark.dockerfile   |  4 ++
 ci/etc/integration_spark_ARROW-9438.patch | 72 +++
 dev/release/rat_exclude_files.txt |  1 +
 3 files changed, 77 insertions(+)

diff --git a/ci/docker/conda-python-spark.dockerfile 
b/ci/docker/conda-python-spark.dockerfile
index d3f0a22..a20f1ff 100644
--- a/ci/docker/conda-python-spark.dockerfile
+++ b/ci/docker/conda-python-spark.dockerfile
@@ -36,6 +36,10 @@ ARG spark=master
 COPY ci/scripts/install_spark.sh /arrow/ci/scripts/
 RUN /arrow/ci/scripts/install_spark.sh ${spark} /spark
 
+# patch spark to build with current Arrow Java
+COPY ci/etc/integration_spark_ARROW-9438.patch /arrow/ci/etc/
+RUN patch -d /spark -p1 -i /arrow/ci/etc/integration_spark_ARROW-9438.patch
+
 # build cpp with tests
 ENV CC=gcc \
 CXX=g++ \
diff --git a/ci/etc/integration_spark_ARROW-9438.patch 
b/ci/etc/integration_spark_ARROW-9438.patch
new file mode 100644
index 000..2baed30
--- /dev/null
+++ b/ci/etc/integration_spark_ARROW-9438.patch
@@ -0,0 +1,72 @@
+From 0b5388a945a7e5c5706cf00d0754540a6c68254d Mon Sep 17 00:00:00 2001
+From: Bryan Cutler 
+Date: Mon, 13 Jul 2020 23:12:25 -0700
+Subject: [PATCH] Update Arrow Java for 1.0.0
+
+---
+ pom.xml  | 17 ++---
+ sql/catalyst/pom.xml |  4 
+ 2 files changed, 18 insertions(+), 3 deletions(-)
+
+diff --git a/pom.xml b/pom.xml
+index 08ca13bfe9..6619fca200 100644
+--- a/pom.xml
 b/pom.xml
+@@ -199,7 +199,7 @@
+ If you are changing Arrow version specification, please check 
./python/pyspark/sql/utils.py,
+ and ./python/setup.py too.
+ -->
+-    <arrow.version>0.15.1</arrow.version>
++    <arrow.version>1.0.0-SNAPSHOT</arrow.version>
+ 
+ org.fusesource.leveldbjni
+ 
+@@ -2288,7 +2288,7 @@
+   
+   
+         <groupId>com.fasterxml.jackson.core</groupId>
+-        <artifactId>jackson-databind</artifactId>
++        <artifactId>jackson-core</artifactId>
+   
+   
+ io.netty
+@@ -2298,9 +2298,20 @@
+ io.netty
+ netty-common
+   
++
++  
++  
++org.apache.arrow
++arrow-memory-netty
++${arrow.version}
++
+   
+ io.netty
+-netty-handler
++netty-buffer
++  
++  
++io.netty
++netty-common
+   
+ 
+   
+diff --git a/sql/catalyst/pom.xml b/sql/catalyst/pom.xml
+index 9edbb7fec9..6b79eb722f 100644
+--- a/sql/catalyst/pom.xml
 b/sql/catalyst/pom.xml
+@@ -117,6 +117,10 @@
+       <groupId>org.apache.arrow</groupId>
+       <artifactId>arrow-vector</artifactId>
+     </dependency>
++    <dependency>
++      <groupId>org.apache.arrow</groupId>
++      <artifactId>arrow-memory-netty</artifactId>
++    </dependency>
+   
+   
+ 
target/scala-${scala.binary.version}/classes
+-- 
+2.17.1
+
diff --git a/dev/release/rat_exclude_files.txt 
b/dev/release/rat_exclude_files.txt
index d25e2e3..158790d 100644
--- a/dev/release/rat_exclude_files.txt
+++ b/dev/release/rat_exclude_files.txt
@@ -9,6 +9,7 @@
 *.snap
 .github/ISSUE_TEMPLATE/question.md
 ci/etc/rprofile
+ci/etc/*.patch
 cpp/CHANGELOG_PARQUET.md
 cpp/src/arrow/io/mman.h
 cpp/src/arrow/util/random.h



[arrow] branch master updated (e771b94 -> 1413963)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from e771b94  ARROW-8480: [Rust] Use NonNull well aligned pointer as Unique 
reference
 add 1413963  ARROW-8314: [Python] Add a Table.select method to select a 
subset of columns

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/table.cc   | 20 
 cpp/src/arrow/table.h|  3 ++
 cpp/src/arrow/table_test.cc  | 16 +
 python/pyarrow/feather.py|  5 +--
 python/pyarrow/includes/libarrow.pxd |  1 +
 python/pyarrow/table.pxi | 63 
 python/pyarrow/tests/test_dataset.py |  4 +--
 python/pyarrow/tests/test_table.py   | 51 +
 8 files changed, 144 insertions(+), 19 deletions(-)
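ARROW-8314 adds a `Table.select` method for picking a subset of columns. A minimal pure-Python sketch of the semantics, using a dict of columns as a stand-in for a table (illustrative, not pyarrow's implementation):

```python
def select(table: dict, names: list) -> dict:
    """Return a new 'table' holding only the requested columns, in request order."""
    missing = [n for n in names if n not in table]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return {name: table[name] for name in names}

table = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
subset = select(table, ["c", "a"])   # column order follows the request
```

Selecting is a metadata-level operation: in Arrow the underlying column data is shared, not copied.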



[arrow] branch master updated (17a0e47 -> e771b94)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 17a0e47  ARROW-9449: [R] Strip arrow.so
 add e771b94  ARROW-8480: [Rust] Use NonNull well aligned pointer as Unique 
reference

No new revisions were added by this update.

Summary of changes:
 rust/arrow/src/buffer.rs | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)



[arrow] branch master updated (4d9d66f -> cd6bd82)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 4d9d66f  ARROW-9458: [Python] Release GIL in ScanTask.execute
 add cd6bd82  ARROW-9447 [Rust][DataFusion] Made ScalarUDF (Send + Sync)

No new revisions were added by this update.

Summary of changes:
 rust/datafusion/src/execution/context.rs| 4 ++--
 rust/datafusion/src/execution/physical_plan/math_expressions.rs | 4 ++--
 rust/datafusion/src/execution/physical_plan/udf.rs  | 2 +-
 rust/datafusion/tests/sql.rs| 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)



[arrow] branch master updated (8ea00f0 -> 4d9d66f)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 8ea00f0  ARROW-9470: [CI][Java] Run Maven in parallel
 add 4d9d66f  ARROW-9458: [Python] Release GIL in ScanTask.execute

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/_dataset.pyx | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)



[arrow] branch master updated (bfd2568 -> 8ea00f0)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from bfd2568  ARROW-9390: [Doc] Add missing file
 add 8ea00f0  ARROW-9470: [CI][Java] Run Maven in parallel

No new revisions were added by this update.

Summary of changes:
 ci/scripts/java_build.sh | 2 ++
 ci/scripts/java_test.sh  | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)



[arrow] branch master updated (ad2b2c5 -> 4eaca73)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from ad2b2c5  ARROW-8729: [C++][Dataset] Ensure non-empty batches when only 
virtual columns are projected
 add 4eaca73  ARROW-7831: [Java] do not allocate a new offset buffer if the 
slice starts at 0 since the relative offset pointer would be unchanged

No new revisions were added by this update.

Summary of changes:
 .../arrow/vector/BaseVariableWidthVector.java  | 113 
 .../org/apache/arrow/vector/TestValueVector.java   | 145 +
 2 files changed, 206 insertions(+), 52 deletions(-)



[arrow] branch master updated (10289a0 -> ad2b2c5)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 10289a0  ARROW-9390: [C++][Doc] Review compute function names
 add ad2b2c5  ARROW-8729: [C++][Dataset] Ensure non-empty batches when only 
virtual columns are projected

No new revisions were added by this update.

Summary of changes:
 cpp/src/parquet/arrow/arrow_reader_writer_test.cc |  30 ++-
 cpp/src/parquet/arrow/reader.cc   | 257 --
 cpp/src/parquet/arrow/reader.h|  15 +-
 cpp/src/parquet/arrow/reader_internal.cc  |   4 +-
 python/pyarrow/tests/test_dataset.py  |  18 ++
 5 files changed, 188 insertions(+), 136 deletions(-)



[arrow] branch master updated (6d7e4ec -> 1d7d919)

2020-07-14 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 6d7e4ec  ARROW-9450: [Python] Fix tests startup time
 add 1d7d919  ARROW-9460: [C++] Fix BinaryContainsExact for pattern with 
repeated characters

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_string.cc  | 17 -
 cpp/src/arrow/compute/kernels/scalar_string_test.cc |  8 
 2 files changed, 16 insertions(+), 9 deletions(-)
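ARROW-9460 fixes `BinaryContainsExact` for patterns with repeated characters. The classic failure mode is a matcher that, after a partial match fails, skips past the whole partial match instead of retrying overlapping start positions. A pure-Python sketch of the correct behavior (not the optimized C++ kernel):

```python
def contains_exact(haystack: bytes, pattern: bytes) -> bool:
    """True if pattern occurs in haystack; tries every start position."""
    m = len(pattern)
    if m == 0:
        return True
    return any(haystack[i:i + m] == pattern
               for i in range(len(haystack) - m + 1))

# b"aab" in b"aaab": a buggy matcher that jumps past the failed partial
# match "aa" at index 0 would miss the real occurrence at index 1.
assert contains_exact(b"aaab", b"aab")
assert not contains_exact(b"abc", b"abd")
```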



[arrow] branch master updated: ARROW-9440: [Python] Expose Fill Null kernel

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new e559dd0  ARROW-9440: [Python] Expose Fill Null kernel
e559dd0 is described below

commit e559dd080a27875bab3d5cdb0da115c62e2f60bb
Author: c-jamie 
AuthorDate: Mon Jul 13 19:53:47 2020 -0500

ARROW-9440: [Python] Expose Fill Null kernel

Closes #7736 from c-jamie/ARROW-9440

Lead-authored-by: c-jamie 
Co-authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 python/pyarrow/array.pxi |  6 
 python/pyarrow/compute.py| 41 +++
 python/pyarrow/includes/libarrow.pxd |  1 +
 python/pyarrow/scalar.pxi| 13 
 python/pyarrow/table.pxi |  6 
 python/pyarrow/tests/test_compute.py | 63 
 python/pyarrow/tests/test_scalars.py |  9 ++
 7 files changed, 139 insertions(+)

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index 1cffd37..1dcff02 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1004,6 +1004,12 @@ cdef class Array(_PandasConvertible):
 """
 return _pc().is_valid(self)
 
+def fill_null(self, fill_value):
+"""
+See pyarrow.compute.fill_null for usage.
+"""
+return _pc().fill_null(self, fill_value)
+
 def __getitem__(self, key):
 """
 Slice or return value at given index
diff --git a/python/pyarrow/compute.py b/python/pyarrow/compute.py
index c8443ed..b8e678f 100644
--- a/python/pyarrow/compute.py
+++ b/python/pyarrow/compute.py
@@ -24,6 +24,7 @@ from pyarrow._compute import (  # noqa
 call_function,
 TakeOptions
 )
+import pyarrow as pa
 import pyarrow._compute as _pc
 
 
@@ -259,3 +260,43 @@ def take(data, indices, boundscheck=True):
 """
 options = TakeOptions(boundscheck)
 return call_function('take', [data, indices], options)
+
+
+def fill_null(values, fill_value):
+"""
+Replace each null element in values with fill_value. The fill_value must be
+the same type as values or able to be implicitly casted to the array's
+type.
+
+Parameters
+----------
+data : Array, ChunkedArray
+replace each null element with fill_value
+fill_value: Scalar-like object
+Either a pyarrow.Scalar or any python object coercible to a
+Scalar. If not same type as data will attempt to cast.
+
+Returns
+-------
+result : depends on inputs
+
+Examples
+--------
+>>> import pyarrow as pa
+>>> arr = pa.array([1, 2, None, 3], type=pa.int8())
+>>> fill_value = pa.scalar(5, type=pa.int8())
+>>> arr.fill_null(fill_value)
+<pyarrow.lib.Int8Array object at 0x7f95437f01a0>
+[
+  1,
+  2,
+  5,
+  3
+]
+"""
+if not isinstance(fill_value, pa.Scalar):
+fill_value = pa.scalar(fill_value, type=values.type)
+elif values.type != fill_value.type:
+fill_value = pa.scalar(fill_value.as_py(), type=values.type)
+
+return call_function("fill_null", [values, fill_value])
diff --git a/python/pyarrow/includes/libarrow.pxd 
b/python/pyarrow/includes/libarrow.pxd
index 213ef24..c8e7c5b 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -887,6 +887,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
 c_bool is_valid
 c_string ToString() const
 c_bool Equals(const CScalar& other) const
+CResult[shared_ptr[CScalar]] CastTo(shared_ptr[CDataType] to) const
 
 cdef cppclass CScalarHash" arrow::Scalar::Hash":
 size_t operator()(const shared_ptr[CScalar]& scalar) const
diff --git a/python/pyarrow/scalar.pxi b/python/pyarrow/scalar.pxi
index 903faae..248d926 100644
--- a/python/pyarrow/scalar.pxi
+++ b/python/pyarrow/scalar.pxi
@@ -63,6 +63,19 @@ cdef class Scalar:
 """
 return self.wrapped.get().is_valid
 
+def cast(self, object target_type):
+"""
+Attempt a safe cast to target data type.
+"""
+cdef:
+DataType type = ensure_type(target_type)
+shared_ptr[CScalar] result
+
+with nogil:
+result = GetResultValue(self.wrapped.get().CastTo(type.sp_type))
+
+return Scalar.wrap(result)
+
 def __repr__(self):
return '<pyarrow.{}: {!r}>'.format(
 self.__class__.__name__, self.as_py()
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index 08e3f75..688d668 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -191,6 +191,12 @@ cdef class 
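The `fill_null` kernel exposed above replaces each null element with a fill value (casting the scalar to the array's type when needed). A pure-Python sketch of the core semantics, with `None` standing in for Arrow nulls (not pyarrow's actual kernel):

```python
def fill_null(values, fill_value):
    """Replace each None in values with fill_value."""
    return [fill_value if v is None else v for v in values]

# Mirrors the docstring example: nulls become the fill value, valid slots pass through.
assert fill_null([1, 2, None, 3], 5) == [1, 2, 5, 3]
```

In the real kernel the cast step matters: a Python `fill_value` that is not a `pa.Scalar` is first coerced to the array's type, so filling an `int8` array with `5` does not silently promote the result.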

[arrow] branch master updated (dcd17bf -> cad2e96)

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from dcd17bf  ARROW-9445: [Python] Revert Array.equals changes + expose 
comparison ops in compute
 add cad2e96  ARROW-9442: [Python] Do not call Validate() in 
pyarrow_wrap_table

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/public-api.pxi | 2 --
 1 file changed, 2 deletions(-)



[arrow] branch master updated (cad2e96 -> 427fe07)

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from cad2e96  ARROW-9442: [Python] Do not call Validate() in 
pyarrow_wrap_table
 add 427fe07  ARROW-9443: [C++] Bundled bz2 build should only build libbz2

No new revisions were added by this update.

Summary of changes:
 .github/workflows/r.yml |  3 +++
 cpp/cmake_modules/ThirdpartyToolchain.cmake |  3 ++-
 dev/tasks/r/azure.linux.yml |  1 +
 dev/tasks/r/github.linux.cran.yml   |  1 +
 r/configure | 20 +++-
 r/inst/build_arrow_static.sh| 13 -
 r/tools/linuxlibs.R | 19 +--
 r/vignettes/install.Rmd |  2 +-
 8 files changed, 40 insertions(+), 22 deletions(-)



[arrow] branch master updated (389b153 -> dcd17bf)

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 389b153  ARROW-9439: [C++] Fix crash on invalid IPC input
 add dcd17bf  ARROW-9445: [Python] Revert Array.equals changes + expose 
comparison ops in compute

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/array.pxi | 31 ++-
 python/pyarrow/compute.py|  7 +++
 python/pyarrow/table.pxi | 10 ++
 python/pyarrow/tests/test_array.py   | 13 +
 python/pyarrow/tests/test_compute.py | 33 +
 python/pyarrow/tests/test_scalars.py |  4 ++--
 python/pyarrow/tests/test_table.py   |  3 +++
 7 files changed, 54 insertions(+), 47 deletions(-)



[arrow] branch master updated (8daf756 -> 389b153)

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 8daf756  ARROW-9446: [C++] Add compiler id, version, and build flags 
to BuildInfo
 add 389b153  ARROW-9439: [C++] Fix crash on invalid IPC input

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/array/array_base.cc  | 13 ++
 cpp/src/arrow/array/array_base.h   |  5 +++
 cpp/src/arrow/array/array_test.cc  | 49 ++
 cpp/src/arrow/array/concatenate.cc | 86 --
 cpp/src/arrow/array/data.cc|  6 +++
 cpp/src/arrow/array/data.h |  8 +++-
 cpp/src/arrow/buffer.cc| 41 ++
 cpp/src/arrow/buffer.h | 28 +
 cpp/src/arrow/buffer_test.cc   | 37 +++-
 cpp/src/arrow/ipc/reader.cc|  6 +++
 cpp/src/arrow/util/int_util.h  | 17 
 testing|  2 +-
 12 files changed, 263 insertions(+), 35 deletions(-)



[arrow] branch master updated: ARROW-9333: [Python] Expose more IPC options

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new feda987  ARROW-9333: [Python] Expose more IPC options
feda987 is described below

commit feda9877f8145aebf907c61a24640735a968a230
Author: Antoine Pitrou 
AuthorDate: Mon Jul 13 12:49:07 2020 -0500

ARROW-9333: [Python] Expose more IPC options

Also make some optional arguments keyword-only.

Closes #7730 from pitrou/ARROW-9333-py-ipc-options

Authored-by: Antoine Pitrou 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/ipc/options.h  |  7 ++-
 python/pyarrow/_flight.pyx   |  6 +--
 python/pyarrow/includes/libarrow.pxd |  2 +
 python/pyarrow/io.pxi| 29 +--
 python/pyarrow/ipc.pxi   | 55 ++---
 python/pyarrow/ipc.py| 15 +++---
 python/pyarrow/tests/test_flight.py  |  6 +++
 python/pyarrow/tests/test_ipc.py | 95 
 python/pyarrow/tests/util.py | 16 ++
 9 files changed, 174 insertions(+), 57 deletions(-)

diff --git a/cpp/src/arrow/ipc/options.h b/cpp/src/arrow/ipc/options.h
index 69e248c..6bbd7b8 100644
--- a/cpp/src/arrow/ipc/options.h
+++ b/cpp/src/arrow/ipc/options.h
@@ -56,10 +56,9 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// \brief The memory pool to use for allocations made during IPC writing
   MemoryPool* memory_pool = default_memory_pool();
 
-  /// \brief EXPERIMENTAL: Codec to use for compressing and decompressing
-  /// record batch body buffers. This is not part of the Arrow IPC protocol and
-  /// only for internal use (e.g. Feather files). May only be LZ4_FRAME and
-  /// ZSTD
+  /// \brief Compression codec to use for record batch body buffers
+  ///
+  /// May only be UNCOMPRESSED, LZ4_FRAME and ZSTD.
   Compression::type compression = Compression::UNCOMPRESSED;
   int compression_level = Compression::kUseDefaultCompressionLevel;
 
diff --git a/python/pyarrow/_flight.pyx b/python/pyarrow/_flight.pyx
index 7e3c837..7b6b281 100644
--- a/python/pyarrow/_flight.pyx
+++ b/python/pyarrow/_flight.pyx
@@ -97,10 +97,8 @@ def _munge_grpc_python_error(message):
 
 
 cdef IpcWriteOptions _get_options(options):
-cdef IpcWriteOptions write_options = \
- _get_legacy_format_default(
-use_legacy_format=None, options=options)
-return write_options
+return  _get_legacy_format_default(
+use_legacy_format=None, options=options)
 
 
 cdef class FlightCallOptions:
diff --git a/python/pyarrow/includes/libarrow.pxd 
b/python/pyarrow/includes/libarrow.pxd
index 76203f0..3e461c4 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -1329,6 +1329,8 @@ cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" 
nogil:
 c_bool write_legacy_ipc_format
 CMemoryPool* memory_pool
 CMetadataVersion metadata_version
+CCompressionType compression
+c_bool use_threads
 
 @staticmethod
 CIpcWriteOptions Defaults()
diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi
index 76a058d..058b09a 100644
--- a/python/pyarrow/io.pxi
+++ b/python/pyarrow/io.pxi
@@ -1539,24 +1539,43 @@ def _detect_compression(path):
 
 cdef CCompressionType _ensure_compression(str name) except *:
 uppercase = name.upper()
-if uppercase == 'GZIP':
-return CCompressionType_GZIP
-elif uppercase == 'BZ2':
+if uppercase == 'BZ2':
 return CCompressionType_BZ2
+elif uppercase == 'GZIP':
+return CCompressionType_GZIP
 elif uppercase == 'BROTLI':
 return CCompressionType_BROTLI
 elif uppercase == 'LZ4' or uppercase == 'LZ4_FRAME':
 return CCompressionType_LZ4_FRAME
 elif uppercase == 'LZ4_RAW':
 return CCompressionType_LZ4
-elif uppercase == 'ZSTD':
-return CCompressionType_ZSTD
 elif uppercase == 'SNAPPY':
 return CCompressionType_SNAPPY
+elif uppercase == 'ZSTD':
+return CCompressionType_ZSTD
 else:
 raise ValueError('Invalid value for compression: {!r}'.format(name))
 
 
+cdef str _compression_name(CCompressionType ctype):
+if ctype == CCompressionType_GZIP:
+return 'gzip'
+elif ctype == CCompressionType_BROTLI:
+return 'brotli'
+elif ctype == CCompressionType_BZ2:
+return 'bz2'
+elif ctype == CCompressionType_LZ4_FRAME:
+return 'lz4'
+elif ctype == CCompressionType_LZ4:
+return 'lz4_raw'
+elif ctype == CCompressionType_SNAPPY:
+return 'snappy'
+elif ctype == CCompressionType_ZSTD:
+return 'zstd'
+else:
+raise RuntimeError('Unexpected CCompressionType value')
+
+
 cdef class Codec:
 """
 Compression codec.
diff --git a/python/pyarrow/ipc.pxi b/python/pya
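The `_ensure_compression` / `_compression_name` helpers in the diff above map user-facing codec names to enum values and back. A table-driven pure-Python sketch of the same normalization, with string constants standing in for the Cython enum values (illustrative, not the pyarrow internals):

```python
_NAME_TO_CODEC = {
    "BZ2": "BZ2", "GZIP": "GZIP", "BROTLI": "BROTLI",
    "LZ4": "LZ4_FRAME", "LZ4_FRAME": "LZA" if False else "LZ4_FRAME",  # 'lz4' means the frame format
    "LZ4_RAW": "LZ4", "SNAPPY": "SNAPPY", "ZSTD": "ZSTD",
}

def ensure_compression(name: str) -> str:
    """Case-insensitive lookup, matching the if/elif chain in the diff."""
    try:
        return _NAME_TO_CODEC[name.upper()]
    except KeyError:
        raise ValueError(f"Invalid value for compression: {name!r}") from None

assert ensure_compression("lz4") == "LZ4_FRAME"
assert ensure_compression("zstd") == "ZSTD"
```

A dict makes the alias ("LZ4" → frame format, "LZ4_RAW" → raw format) explicit in one place, which is why table-driven mappings are a common refactor of such chains.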

[arrow] branch master updated: ARROW-8989: [C++][Doc] Document available compute functions

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d2079c  ARROW-8989: [C++][Doc] Document available compute functions
9d2079c is described below

commit 9d2079c2ead31399b724ecc3775d61432a8096af
Author: Antoine Pitrou 
AuthorDate: Mon Jul 13 12:48:30 2020 -0500

ARROW-8989: [C++][Doc] Document available compute functions

Also fix glaring bugs in arithmetic kernels
(signed overflow detection was broken).

Closes #7695 from pitrou/ARROW-8989-doc-compute-functions

Authored-by: Antoine Pitrou 
Signed-off-by: Wes McKinney 
---
 c_glib/arrow-glib/compute.cpp  |   5 +-
 cpp/src/arrow/array/validate.cc|   7 +-
 cpp/src/arrow/compute/api.h|   4 +
 cpp/src/arrow/compute/api_aggregate.h  |  61 +--
 cpp/src/arrow/compute/api_scalar.h |  97 ++--
 cpp/src/arrow/compute/api_vector.h |  37 +-
 cpp/src/arrow/compute/cast.cc  |   2 +-
 cpp/src/arrow/compute/cast.h   |   5 +
 cpp/src/arrow/compute/exec.h   |  14 +-
 cpp/src/arrow/compute/function.h   |   6 +
 cpp/src/arrow/compute/kernels/aggregate_basic.cc   |   2 +-
 cpp/src/arrow/compute/kernels/aggregate_test.cc|   2 +-
 cpp/src/arrow/compute/kernels/scalar_arithmetic.cc |  28 +-
 .../compute/kernels/scalar_arithmetic_test.cc  |  47 +-
 cpp/src/arrow/compute/registry.h   |   2 +-
 cpp/src/arrow/scalar.h |  40 +-
 cpp/src/arrow/util/int_util.h  |  33 +-
 cpp/src/parquet/column_reader.cc   |   7 +-
 docs/source/conf.py|   7 +-
 docs/source/cpp/api.rst|   2 +
 .../cpp/{getting_started.rst => api/compute.rst}   |  59 ++-
 docs/source/cpp/compute.rst| 526 +
 docs/source/cpp/getting_started.rst|   1 +
 docs/source/python/api/arrays.rst  |  71 +--
 docs/source/python/dataset.rst |   4 +-
 25 files changed, 883 insertions(+), 186 deletions(-)

diff --git a/c_glib/arrow-glib/compute.cpp b/c_glib/arrow-glib/compute.cpp
index d8d0bdc..3e31899 100644
--- a/c_glib/arrow-glib/compute.cpp
+++ b/c_glib/arrow-glib/compute.cpp
@@ -676,7 +676,7 @@ garrow_count_options_set_property(GObject *object,
   switch (prop_id) {
   case PROP_MODE:
 priv->options.count_mode =
-  static_cast(g_value_get_enum(value));
+  static_cast(g_value_get_enum(value));
 break;
   default:
 G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
@@ -706,7 +706,8 @@ static void
 garrow_count_options_init(GArrowCountOptions *object)
 {
   auto priv = GARROW_COUNT_OPTIONS_GET_PRIVATE(object);
-  new(&priv->options) arrow::compute::CountOptions(arrow::compute::CountOptions::COUNT_ALL);
+  new(&priv->options) arrow::compute::CountOptions(
+    arrow::compute::CountOptions::COUNT_NON_NULL);
 }
 
 static void
diff --git a/cpp/src/arrow/array/validate.cc b/cpp/src/arrow/array/validate.cc
index 3dd0ffd..8fb8b59 100644
--- a/cpp/src/arrow/array/validate.cc
+++ b/cpp/src/arrow/array/validate.cc
@@ -98,7 +98,7 @@ struct ValidateArrayVisitor {
 if (value_size < 0) {
   return Status::Invalid("FixedSizeListArray has negative value size ", 
value_size);
 }
-if (HasMultiplyOverflow(len, value_size) ||
+if (HasPositiveMultiplyOverflow(len, value_size) ||
 array.values()->length() != len * value_size) {
   return Status::Invalid("Values Length (", array.values()->length(),
  ") is not equal to the length (", len,
@@ -329,7 +329,7 @@ Status ValidateArray(const Array& array) {
type.ToString(), ", got ", data.buffers.size());
   }
   // This check is required to avoid addition overflow below
-  if (HasAdditionOverflow(array.length(), array.offset())) {
+  if (HasPositiveAdditionOverflow(array.length(), array.offset())) {
 return Status::Invalid("Array of type ", type.ToString(),
" has impossibly large length and offset");
   }
@@ -346,7 +346,8 @@ Status ValidateArray(const Array& array) {
 min_buffer_size = BitUtil::BytesForBits(array.length() + 
array.offset());
 break;
   case DataTypeLayout::FIXED_WIDTH:
-if (HasMultiplyOverflow(array.length() + array.offset(), 
spec.byte_width)) {
+if (HasPositiveMultiplyOverflow(array.length() + array.offset(),
+spec.byte_width)) {
   return Status::Invalid("Array of type ", type.ToString(),
  "
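The ARROW-8989 change also renames the overflow helpers to `HasPositiveAdditionOverflow` / `HasPositiveMultiplyOverflow`, making explicit that both operands are assumed non-negative. Python integers do not overflow, so a sketch against a fixed 64-bit bound shows the check (illustrative, not the C++ helpers):

```python
INT64_MAX = 2**63 - 1

def has_positive_multiply_overflow(a: int, b: int) -> bool:
    """Would a * b exceed INT64_MAX? Both operands assumed non-negative."""
    # Dividing the bound avoids computing the (possibly huge) product.
    return b != 0 and a > INT64_MAX // b

def has_positive_addition_overflow(a: int, b: int) -> bool:
    """Would a + b exceed INT64_MAX? Both operands assumed non-negative."""
    return a > INT64_MAX - b

assert has_positive_multiply_overflow(2**62, 4)
assert not has_positive_multiply_overflow(3, 5)
assert has_positive_addition_overflow(INT64_MAX, 1)
```

Checking `a > INT64_MAX // b` before multiplying is the standard portable idiom in C++ too, since signed overflow there is undefined behavior rather than merely wrong.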

[arrow] branch master updated (1f42ac0 -> 875d0539)

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 1f42ac0  ARROW-9428: [C++][Doc] Update buffer allocation documentation
 add 875d0539 ARROW-9436: [C++][CI] Fix Valgrind failure

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_fill_null_test.cc | 3 +--
 cpp/src/arrow/ipc/message.cc   | 2 +-
 cpp/src/arrow/ipc/metadata_internal.cc | 2 +-
 cpp/src/arrow/ipc/reader.cc| 2 +-
 cpp/src/arrow/util/value_parsing_test.cc   | 4 ++--
 cpp/src/parquet/column_scanner.h   | 2 +-
 docker-compose.yml | 2 +-
 7 files changed, 8 insertions(+), 9 deletions(-)



[arrow] branch master updated: ARROW-9428: [C++][Doc] Update buffer allocation documentation

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 1f42ac0  ARROW-9428: [C++][Doc] Update buffer allocation documentation
1f42ac0 is described below

commit 1f42ac0ff0bc1ac098cd64ba27c354890c5b8ff4
Author: Antoine Pitrou 
AuthorDate: Mon Jul 13 12:27:20 2020 -0500

ARROW-9428: [C++][Doc] Update buffer allocation documentation

Use Result-returning AllocateBuffer() version in example.

Also improve cross-referencing in some places.

Closes #7731 from pitrou/ARROW-9428-buffer-allocation-doc

Authored-by: Antoine Pitrou 
Signed-off-by: Wes McKinney 
---
 docs/source/cpp/api/formats.rst |  6 ++
 docs/source/cpp/api/support.rst | 11 +++
 docs/source/cpp/arrays.rst  |  3 +++
 docs/source/cpp/conventions.rst |  3 +++
 docs/source/cpp/csv.rst |  3 +++
 docs/source/cpp/datatypes.rst   |  3 +++
 docs/source/cpp/io.rst  |  4 +++-
 docs/source/cpp/json.rst|  3 +++
 docs/source/cpp/memory.rst  | 10 +++---
 docs/source/cpp/parquet.rst |  3 +++
 docs/source/cpp/tables.rst  |  3 +++
 11 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/docs/source/cpp/api/formats.rst b/docs/source/cpp/api/formats.rst
index 75dfb00..a072f11 100644
--- a/docs/source/cpp/api/formats.rst
+++ b/docs/source/cpp/api/formats.rst
@@ -19,6 +19,8 @@
 File Formats
 
 
+.. _cpp-api-csv:
+
 CSV
 ===
 
@@ -34,6 +36,8 @@ CSV
 .. doxygenclass:: arrow::csv::TableReader
:members:
 
+.. _cpp-api-json:
+
 Line-separated JSON
 ===
 
@@ -48,6 +52,8 @@ Line-separated JSON
 .. doxygenclass:: arrow::json::TableReader
:members:
 
+.. _cpp-api-parquet:
+
 Parquet reader
 ==
 
diff --git a/docs/source/cpp/api/support.rst b/docs/source/cpp/api/support.rst
index 1547a20..c3310e5 100644
--- a/docs/source/cpp/api/support.rst
+++ b/docs/source/cpp/api/support.rst
@@ -15,9 +15,20 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+===
 Programming Support
 ===
 
+General information
+---
+
+.. doxygenfunction:: arrow::GetBuildInfo
+   :project: arrow_cpp
+
+.. doxygenstruct:: arrow::BuildInfo
+   :project: arrow_cpp
+   :members:
+
 Error return and reporting
 --
 
diff --git a/docs/source/cpp/arrays.rst b/docs/source/cpp/arrays.rst
index 43ac414..bd6ba64 100644
--- a/docs/source/cpp/arrays.rst
+++ b/docs/source/cpp/arrays.rst
@@ -22,6 +22,9 @@
 Arrays
 ==
 
+.. seealso::
+   :doc:`Array API reference `
+
 The central type in Arrow is the class :class:`arrow::Array`.   An array
 represents a known-length sequence of values all having the same type.
 Internally, those values are represented by one or several buffers, the
diff --git a/docs/source/cpp/conventions.rst b/docs/source/cpp/conventions.rst
index 33f0a8c..218d028 100644
--- a/docs/source/cpp/conventions.rst
+++ b/docs/source/cpp/conventions.rst
@@ -102,3 +102,6 @@ For example::
   // return success at the end
   return Status::OK();
}
+
+.. seealso::
+   :doc:`API reference for error reporting `
diff --git a/docs/source/cpp/csv.rst b/docs/source/cpp/csv.rst
index 8d37b29..50a5cdb 100644
--- a/docs/source/cpp/csv.rst
+++ b/docs/source/cpp/csv.rst
@@ -27,6 +27,9 @@ Reading CSV files
 Arrow provides a fast CSV reader allowing ingestion of external data
 as Arrow tables.
 
+.. seealso::
+   :ref:`CSV reader API reference <cpp-api-csv>`.
+
 Basic usage
 ===========
 
diff --git a/docs/source/cpp/datatypes.rst b/docs/source/cpp/datatypes.rst
index c411632..9149420 100644
--- a/docs/source/cpp/datatypes.rst
+++ b/docs/source/cpp/datatypes.rst
@@ -21,6 +21,9 @@
 Data Types
 ==========
 
+.. seealso::
+   :doc:`Datatype API reference <api/datatype>`.
+
 Data types govern how physical data is interpreted.  Their :ref:`specification
 ` allows binary interoperability between different Arrow
 implementations, including from different programming languages and runtimes
diff --git a/docs/source/cpp/io.rst b/docs/source/cpp/io.rst
index ed357c6..501998b 100644
--- a/docs/source/cpp/io.rst
+++ b/docs/source/cpp/io.rst
@@ -17,6 +17,7 @@
 
 .. default-domain:: cpp
 .. highlight:: cpp
+.. cpp:namespace:: arrow::io
 
 ==============================
 Input / output and filesystems
@@ -27,7 +28,8 @@ of input / output operations.  They operate on streams of 
untyped binary data.
 Those abstractions are used for various purposes such as reading CSV or
 Parquet data, transmitting IPC streams, and more.
 
-.. cpp:namespace:: arrow::io
+.. seealso::
+   :doc:`API reference for input/output facilities <api/io>`.
 
 Reading binary data
 ===================
diff --git a/docs/source/cpp/json.rst b/docs/source/cpp/json.rst
index 93dcdfa..cdb742e 100644
--- a/docs/source/cpp/json.rst
+++ b/docs/source/cpp/json.rst

[arrow] branch master updated: ARROW-9374: [C++][Python] Expose MakeArrayFromScalar

2020-07-13 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new d1db0b0  ARROW-9374: [C++][Python] Expose MakeArrayFromScalar
d1db0b0 is described below

commit d1db0b08da7fad1fd171c7275264b87a3d9435dc
Author: Krisztián Szűcs 
AuthorDate: Mon Jul 13 12:25:33 2020 -0500

ARROW-9374: [C++][Python] Expose MakeArrayFromScalar

Since we have a complete scalar implementation on the python side, we can 
implement `pa.repeat(value, size=n)`

Closes #7684 from kszucs/repeat

Authored-by: Krisztián Szűcs 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/array/array_test.cc|  86 +++--
 cpp/src/arrow/array/util.cc  |  70 
 cpp/src/arrow/scalar.cc  |   2 +-
 cpp/src/arrow/scalar.h   |   6 +-
 cpp/src/arrow/scalar_test.cc |  12 
 python/pyarrow/__init__.py   |   4 +-
 python/pyarrow/array.pxi | 120 +++
 python/pyarrow/includes/libarrow.pxd |   3 +
 python/pyarrow/scalar.pxi|  14 ++--
 python/pyarrow/tests/test_array.py   |  56 
 python/pyarrow/tests/test_scalars.py |  11 
 11 files changed, 339 insertions(+), 45 deletions(-)

diff --git a/cpp/src/arrow/array/array_test.cc 
b/cpp/src/arrow/array/array_test.cc
index ea1ded6..42e25d0 100644
--- a/cpp/src/arrow/array/array_test.cc
+++ b/cpp/src/arrow/array/array_test.cc
@@ -354,25 +354,39 @@ TEST_F(TestArray, TestMakeArrayFromScalar) {
   ASSERT_EQ(null_array->null_count(), 5);
 
   auto hello = Buffer::FromString("hello");
-  ScalarVector scalars{std::make_shared(false),
-   std::make_shared(3),
-   std::make_shared(3),
-   std::make_shared(3),
-   std::make_shared(3),
-   std::make_shared(3.0),
-   std::make_shared(hello),
-   std::make_shared(hello),
-   std::make_shared(
-   hello, 
fixed_size_binary(static_cast(hello->size(,
-   std::make_shared(Decimal128(10), 
decimal(16, 4)),
-   std::make_shared(hello),
-   std::make_shared(hello),
-   std::make_shared(
-   ScalarVector{
-   std::make_shared(2),
-   std::make_shared(6),
-   },
-   struct_({field("min", int32()), field("max", 
int32())}))};
+  DayTimeIntervalType::DayMilliseconds daytime{1, 100};
+
+  ScalarVector scalars{
+  std::make_shared(false),
+  std::make_shared(3),
+  std::make_shared(3),
+  std::make_shared(3),
+  std::make_shared(3),
+  std::make_shared(3.0),
+  std::make_shared(10),
+  std::make_shared(11),
+  std::make_shared(1000, time32(TimeUnit::SECOND)),
+  std::make_shared(, time64(TimeUnit::MICRO)),
+  std::make_shared(, timestamp(TimeUnit::MILLI)),
+  std::make_shared(1),
+  std::make_shared(daytime),
+  std::make_shared(60, duration(TimeUnit::SECOND)),
+  std::make_shared(hello),
+  std::make_shared(hello),
+  std::make_shared(
+  hello, fixed_size_binary(static_cast(hello->size(,
+  std::make_shared(Decimal128(10), decimal(16, 4)),
+  std::make_shared(hello),
+  std::make_shared(hello),
+  std::make_shared(ArrayFromJSON(int8(), "[1, 2, 3]")),
+  std::make_shared(ArrayFromJSON(int8(), "[1, 1, 2, 2, 3, 
3]")),
+  std::make_shared(ArrayFromJSON(int8(), "[1, 2, 3, 
4]")),
+  std::make_shared(
+  ScalarVector{
+  std::make_shared(2),
+  std::make_shared(6),
+  },
+  struct_({field("min", int32()), field("max", int32())}))};
 
   for (int64_t length : {16}) {
 for (auto scalar : scalars) {
@@ -384,6 +398,40 @@ TEST_F(TestArray, TestMakeArrayFromScalar) {
   }
 }
 
+TEST_F(TestArray, TestMakeArrayFromDictionaryScalar) {
+  auto dictionary = ArrayFromJSON(utf8(), R"(["foo", "bar", "baz"])");
+  auto type = std::make_shared(int8(), utf8());
+  ASSERT_OK_AND_ASSIGN(auto value, MakeScalar(int8(), 1));
+  auto scalar = DictionaryScalar({value, dictionary}, type);
+
+  ASSERT_OK_AND_ASSIGN(auto array, MakeArrayFromScalar(scalar, 4));
+  ASSERT_OK(array->ValidateFull());
+  ASSERT_EQ(array->length(), 4);
+  ASSERT_EQ(array->null_count(), 0);
+
+  for (int i = 0; i < 4; i++) {
+ASSERT_OK_AND_ASSIGN(auto item, array->GetScalar(i));
+ASSERT_TRUE(item->Equals(scalar));
+  }
+}
+
+TEST_F(TestArray, TestMakeArrayFromMapScalar) {

[arrow] branch master updated: ARROW-7208: [Python][Parquet] Raise better error message when passing a directory path instead of a file path to ParquetFile

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 658618e  ARROW-7208: [Python][Parquet] Raise better error message when 
passing a directory path instead of a file path to ParquetFile
658618e is described below

commit 658618ecd540bc6af76efa608cd1ff7b7938ba4c
Author: Wes McKinney 
AuthorDate: Sun Jul 12 22:31:18 2020 -0500

ARROW-7208: [Python][Parquet] Raise better error message when passing a 
directory path instead of a file path to ParquetFile

Closes #7722 from wesm/ARROW-7208

Authored-by: Wes McKinney 
Signed-off-by: Wes McKinney 
---
 python/pyarrow/io.pxi| 9 +
 python/pyarrow/tests/test_parquet.py | 9 +
 2 files changed, 18 insertions(+)

diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi
index 8f8cbd1..76a058d 100644
--- a/python/pyarrow/io.pxi
+++ b/python/pyarrow/io.pxi
@@ -776,11 +776,19 @@ def memory_map(path, mode='r'):
 ---
 mmap : MemoryMappedFile
 """
+_check_is_file(path)
+
 cdef MemoryMappedFile mmap = MemoryMappedFile()
 mmap._open(path, mode)
 return mmap
 
 
+cdef _check_is_file(path):
+if os.path.isdir(path):
+raise IOError("Expected file path, but {0} is a directory"
+  .format(path))
+
+
 def create_memory_map(path, size):
 """
 Create a file of the given size and memory-map it.
@@ -807,6 +815,7 @@ cdef class OSFile(NativeFile):
 object path
 
 def __cinit__(self, path, mode='r', MemoryPool memory_pool=None):
+_check_is_file(path)
 self.path = path
 
 cdef:
diff --git a/python/pyarrow/tests/test_parquet.py 
b/python/pyarrow/tests/test_parquet.py
index 539c444..410eee1 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -3448,6 +3448,15 @@ def test_empty_row_groups(tempdir):
 assert reader.read_row_group(i).equals(table)
 
 
+def test_parquet_file_pass_directory_instead_of_file(tempdir):
+# ARROW-7208
+path = tempdir / 'directory'
+os.mkdir(str(path))
+
+with pytest.raises(IOError, match="Expected file path"):
+pq.ParquetFile(path)
+
+
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_parquet_writer_with_caller_provided_filesystem(use_legacy_dataset):
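The guard added to `io.pxi` above is small enough to sketch in plain Python. `check_is_file` mirrors the Cython `_check_is_file` and its error message, but is a standalone illustration, not pyarrow code:

```python
import os
import tempfile

def check_is_file(path):
    # Reject directories early with a clear message, as the patch does
    # before handing the path to MemoryMappedFile / OSFile.
    if os.path.isdir(path):
        raise IOError("Expected file path, but {0} is a directory"
                      .format(path))

# A directory triggers the error; the message matches the
# pytest.raises(..., match="Expected file path") check in the test above.
directory = tempfile.mkdtemp()
try:
    check_is_file(directory)
    raised = False
except IOError as exc:
    raised = "Expected file path" in str(exc)
assert raised
```

Failing fast here turns an obscure low-level open error into an actionable message at the API boundary.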



[arrow] branch master updated: ARROW-9413: [Rust] Disable cpm_nan clippy error

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new b9bbee2  ARROW-9413: [Rust] Disable cpm_nan clippy error
b9bbee2 is described below

commit b9bbee2511300d39b3f327fa4dd608648d5bde59
Author: Neville Dipale 
AuthorDate: Sun Jul 12 17:59:48 2020 -0500

ARROW-9413: [Rust] Disable cpm_nan clippy error

Using the comparison recommended by clippy makes sorts with `NAN` 
non-deterministic. We currently sort NaN separately from nulls; we could 
resolve this separately.

Closes #7710 from nevi-me/ARROW-9413

Authored-by: Neville Dipale 
Signed-off-by: Wes McKinney 
---
 rust/arrow/src/compute/kernels/sort.rs | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rust/arrow/src/compute/kernels/sort.rs 
b/rust/arrow/src/compute/kernels/sort.rs
index 8cd6f7b..2b4cbbc 100644
--- a/rust/arrow/src/compute/kernels/sort.rs
+++ b/rust/arrow/src/compute/kernels/sort.rs
@@ -52,12 +52,14 @@ pub fn sort_to_indices(
 .as_any()
 .downcast_ref::()
 .expect("Unable to downcast array");
+#[allow(clippy::cmp_nan)]
 range.partition(|index| array.is_valid(*index) && array.value(*index) 
!= f32::NAN)
 } else if values.data_type() == ::Float64 {
 let array = values
 .as_any()
 .downcast_ref::()
 .expect("Unable to downcast array");
+#[allow(clippy::cmp_nan)]
 range.partition(|index| array.is_valid(*index) && array.value(*index) 
!= f64::NAN)
 } else {
 range.partition(|index| values.is_valid(*index))
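The reason this lint exists is the IEEE 754 rule that NaN compares unequal to everything, including itself, so `value != f64::NAN` is always true. A quick illustration in Python (the same semantics apply to Rust's `f32::NAN`/`f64::NAN`):

```python
import math

nan = float("nan")

# NaN is unordered: every comparison with it is False, so `!=` is always True.
assert (nan == nan) is False
assert (nan != nan) is True
assert (nan < 0.0) is False and (nan > 0.0) is False

# The robust test is an explicit isnan check rather than a comparison.
assert math.isnan(nan)
```

Clippy's suggested `is_nan()` check would change which rows the partition keeps, which is why the kernel keeps the raw comparison and silences the lint instead.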



[arrow] branch master updated: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 44aa829  ARROW-9288: [C++][Dataset] Fix PartitioningFactory with 
dictionary encoding for HivePartitioning
44aa829 is described below

commit 44aa8292605bf7484ae73b289055482e399e90d0
Author: Joris Van den Bossche 
AuthorDate: Sun Jul 12 17:58:10 2020 -0500

ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding 
for HivePartitioning

Closes #7608 from jorisvandenbossche/ARROW-9288

Authored-by: Joris Van den Bossche 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/dataset/partition.cc   | 26 +-
 python/pyarrow/tests/test_dataset.py | 29 +
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/dataset/partition.cc 
b/cpp/src/arrow/dataset/partition.cc
index 744e9dd..2a2ecdf 100644
--- a/cpp/src/arrow/dataset/partition.cc
+++ b/cpp/src/arrow/dataset/partition.cc
@@ -317,6 +317,16 @@ class KeyValuePartitioningInspectImpl {
 return ::arrow::schema(std::move(fields));
   }
 
+  std::vector FieldNames() {
+std::vector names;
+names.reserve(name_to_index_.size());
+
+for (auto kv : name_to_index_) {
+  names.push_back(kv.first);
+}
+return names;
+  }
+
  private:
   std::unordered_map name_to_index_;
   std::vector> values_;
@@ -657,15 +667,29 @@ class HivePartitioningFactory : public 
PartitioningFactory {
   }
 }
 
+field_names_ = impl.FieldNames();
 return impl.Finish(_);
   }
 
   Result> Finish(
   const std::shared_ptr& schema) const override {
-return std::shared_ptr(new HivePartitioning(schema, 
dictionaries_));
+if (dictionaries_.empty()) {
+  return std::make_shared(schema, dictionaries_);
+} else {
+  for (FieldRef ref : field_names_) {
+// ensure all of field_names_ are present in schema
+RETURN_NOT_OK(ref.FindOne(*schema).status());
+  }
+
+  // drop fields which aren't in field_names_
+  auto out_schema = SchemaFromColumnNames(schema, field_names_);
+
+  return std::make_shared(std::move(out_schema), 
dictionaries_);
+}
   }
 
  private:
+  std::vector field_names_;
   ArrayVector dictionaries_;
   PartitioningFactoryOptions options_;
 };
diff --git a/python/pyarrow/tests/test_dataset.py 
b/python/pyarrow/tests/test_dataset.py
index 1c348f4..428547c 100644
--- a/python/pyarrow/tests/test_dataset.py
+++ b/python/pyarrow/tests/test_dataset.py
@@ -1484,6 +1484,35 @@ def test_open_dataset_non_existing_file():
 ds.dataset('file:i-am-not-existing.parquet', format='parquet')
 
 
+@pytest.mark.parquet
+@pytest.mark.parametrize('partitioning', ["directory", "hive"])
+def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning):
+# ARROW-9288
+import pyarrow.parquet as pq
+table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+path = tempdir / "dataset"
+path.mkdir()
+
+for part in ["A", "B", "C"]:
+fmt = "{}" if partitioning == "directory" else "part={}"
+part = path / fmt.format(part)
+part.mkdir()
+pq.write_table(table, part / "test.parquet")
+
+if partitioning == "directory":
+part = ds.DirectoryPartitioning.discover(
+["part"], max_partition_dictionary_size=-1)
+else:
+part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
+
+dataset = ds.dataset(str(path), partitioning=part)
+expected_schema = table.schema.append(
+pa.field("part", pa.dictionary(pa.int32(), pa.string()))
+)
+assert dataset.schema.equals(expected_schema)
+
+
 @pytest.fixture
 def s3_example_simple(s3_connection, s3_server):
 from pyarrow.fs import FileSystem
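For context on the test parametrization above: hive-style partitioning encodes the field name in each path segment (`part=A`), whereas directory partitioning relies on segment position alone. A minimal sketch of hive segment parsing (`parse_hive_segment` is a hypothetical helper for illustration, not a pyarrow API):

```python
def parse_hive_segment(segment):
    # "part=A" -> ("part", "A"); reject a bare directory segment like "A".
    key, sep, value = segment.partition("=")
    if not sep:
        raise ValueError("not a hive-style segment: {0!r}".format(segment))
    return key, value

assert parse_hive_segment("part=A") == ("part", "A")
```

It is this recoverable key that lets `HivePartitioning.discover` know the field names up front, which the fix uses to drop extraneous fields from the inferred schema.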



[arrow] branch master updated: ARROW-9321: [C++][Dataset] Populate statistics opportunistically

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 3ae46e3  ARROW-9321: [C++][Dataset] Populate statistics 
opportunistically
3ae46e3 is described below

commit 3ae46e33aa94c8f357abb8c6debe361b53d7907d
Author: Benjamin Kietzman 
AuthorDate: Sun Jul 12 17:53:16 2020 -0500

ARROW-9321: [C++][Dataset] Populate statistics opportunistically

Populate ParquetFileFragment statistics whenever a reader is opened anyway. 
Also provides an explicit method for forcing load of statistics. (I exposed 
this as a public method, but maybe we'd prefer to hide it inside the 
`statistics` property the way we do physical schema?)

Closes #7692 from bkietz/9321-populate-statistics-on-read

Lead-authored-by: Benjamin Kietzman 
Co-authored-by: Joris Van den Bossche 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/dataset/dataset.cc |  12 +-
 cpp/src/arrow/dataset/file_parquet.cc| 230 ++-
 cpp/src/arrow/dataset/file_parquet.h |  24 +--
 python/pyarrow/_dataset.pyx  |  13 +-
 python/pyarrow/includes/libarrow_dataset.pxd |   1 +
 python/pyarrow/tests/test_dataset.py |  54 ++-
 6 files changed, 207 insertions(+), 127 deletions(-)

diff --git a/cpp/src/arrow/dataset/dataset.cc b/cpp/src/arrow/dataset/dataset.cc
index ed936db..71755aa 100644
--- a/cpp/src/arrow/dataset/dataset.cc
+++ b/cpp/src/arrow/dataset/dataset.cc
@@ -40,9 +40,17 @@ Fragment::Fragment(std::shared_ptr 
partition_expression,
 }
 
 Result> Fragment::ReadPhysicalSchema() {
+  {
+auto lock = physical_schema_mutex_.Lock();
+if (physical_schema_ != nullptr) return physical_schema_;
+  }
+
+  // allow ReadPhysicalSchemaImpl to lock mutex_, if necessary
+  ARROW_ASSIGN_OR_RAISE(auto physical_schema, ReadPhysicalSchemaImpl());
+
   auto lock = physical_schema_mutex_.Lock();
-  if (physical_schema_ == NULLPTR) {
-ARROW_ASSIGN_OR_RAISE(physical_schema_, ReadPhysicalSchemaImpl());
+  if (physical_schema_ == nullptr) {
+physical_schema_ = std::move(physical_schema);
   }
   return physical_schema_;
 }
diff --git a/cpp/src/arrow/dataset/file_parquet.cc 
b/cpp/src/arrow/dataset/file_parquet.cc
index d5e05ed..4581faa 100644
--- a/cpp/src/arrow/dataset/file_parquet.cc
+++ b/cpp/src/arrow/dataset/file_parquet.cc
@@ -286,10 +286,9 @@ ParquetFileFormat::ParquetFileFormat(const 
parquet::ReaderProperties& reader_pro
 Result ParquetFileFormat::IsSupported(const FileSource& source) const {
   try {
 ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
-auto properties = MakeReaderProperties(*this);
 auto reader =
-parquet::ParquetFileReader::Open(std::move(input), 
std::move(properties));
-auto metadata = reader->metadata();
+parquet::ParquetFileReader::Open(std::move(input), 
MakeReaderProperties(*this));
+std::shared_ptr metadata = reader->metadata();
 return metadata != nullptr && metadata->can_decompress();
   } catch (const ::parquet::ParquetInvalidOrCorruptedFileException& e) {
 ARROW_UNUSED(e);
@@ -316,7 +315,7 @@ Result> 
ParquetFileFormat::GetReader
   auto properties = MakeReaderProperties(*this, pool);
   ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source, 
std::move(properties)));
 
-  auto metadata = reader->metadata();
+  std::shared_ptr metadata = reader->metadata();
   auto arrow_properties = MakeArrowReaderProperties(*this, *metadata);
 
   if (options) {
@@ -335,91 +334,41 @@ static inline bool RowGroupInfosAreComplete(const 
std::vector& inf
  [](const RowGroupInfo& i) { return i.HasStatistics(); });
 }
 
-static inline std::vector FilterRowGroups(
-std::vector row_groups, const Expression& predicate) {
-  auto filter = [](const RowGroupInfo& info) {
-return !info.Satisfy(predicate);
-  };
-  auto end = std::remove_if(row_groups.begin(), row_groups.end(), filter);
-  row_groups.erase(end, row_groups.end());
-  return row_groups;
-}
-
-static inline Result> AugmentRowGroups(
-std::vector row_groups, parquet::arrow::FileReader* reader) {
-  auto metadata = reader->parquet_reader()->metadata();
-  auto manifest = reader->manifest();
-  auto num_row_groups = metadata->num_row_groups();
-
-  if (row_groups.empty()) {
-row_groups = RowGroupInfo::FromCount(num_row_groups);
-  }
-
-  // Augment a RowGroup with statistics if missing.
-  auto augment = [&](RowGroupInfo& info) {
-if (!info.HasStatistics() && info.id() < num_row_groups) {
-  auto row_group = metadata->RowGroup(info.id());
-  info.set_num_rows(row_group->num_rows());
-  info.set_total_byte_size(row_group->total_byte_size());
-  info.set_statistics(RowGroupStatisticsAsStructScalar(*row_group, 
m

[arrow] branch master updated (2e94641 -> 5dbf30a)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 2e94641  ARROW-9297: [C++][Parquet] Support chunked row groups in 
RowGroupRecordBatchReader
 add 5dbf30a  ARROW-9418 [R] nyc-taxi Parquet files not downloaded in 
binary mode on Windows

No new revisions were added by this update.

Summary of changes:
 r/vignettes/dataset.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)



[arrow] branch master updated (9ef539e -> 2e94641)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 9ef539e  ARROW-4221: [C++][Python] Add canonical flag in COO sparse 
index
 add 2e94641  ARROW-9297: [C++][Parquet] Support chunked row groups in 
RowGroupRecordBatchReader

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/iterator.h |  16 +--
 cpp/src/arrow/util/iterator_test.cc   |   8 +-
 cpp/src/parquet/arrow/arrow_reader_writer_test.cc |  16 ++-
 cpp/src/parquet/arrow/reader.cc   | 116 +++---
 cpp/src/parquet/arrow/reader.h|  27 +++--
 cpp/src/parquet/arrow/schema.h|  56 +++
 6 files changed, 147 insertions(+), 92 deletions(-)



[arrow] branch master updated (d019bc3 -> 9ef539e)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from d019bc3  PARQUET-1882: [C++] Buffered Reads should allow for 0 length
 add 9ef539e  ARROW-4221: [C++][Python] Add canonical flag in COO sparse 
index

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/ipc/metadata_internal.cc |   3 +-
 cpp/src/arrow/ipc/read_write_test.cc   |  25 
 cpp/src/arrow/ipc/reader.cc|   5 +-
 cpp/src/arrow/python/numpy_convert.cc  |   4 +-
 cpp/src/arrow/sparse_tensor.cc | 126 -
 cpp/src/arrow/sparse_tensor.h  |  34 -
 cpp/src/arrow/sparse_tensor_test.cc| 213 +
 cpp/src/arrow/tensor/coo_converter.cc  |  10 +-
 cpp/src/generated/SparseTensor_generated.h |  21 ++-
 format/SparseTensor.fbs|  11 +-
 python/pyarrow/includes/libarrow.pxd   |   8 ++
 python/pyarrow/tensor.pxi  |  36 -
 python/pyarrow/tests/test_sparse_tensor.py |  33 +++--
 13 files changed, 470 insertions(+), 59 deletions(-)



[arrow] branch master updated (7d377ba -> d019bc3)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 7d377ba  ARROW-8559: [Rust] Consolidate Record Batch reader traits in 
main arrow crate
 add d019bc3  PARQUET-1882: [C++] Buffered Reads should allow for 0 length

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/io/buffered.cc   |  4 +++-
 cpp/src/arrow/io/buffered_test.cc  |  9 
 cpp/src/parquet/file_serialize_test.cc | 42 ++
 3 files changed, 54 insertions(+), 1 deletion(-)



[arrow] branch master updated (3b0055a -> df629f9)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 3b0055a  ARROW-9417: [C++] Write length in IPC message by using 
little-endian
 add df629f9  ARROW-9419: [C++] Expand fill_null function testing, test 
sliced arrays, fix some bugs

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_fill_null.cc  | 21 
 .../arrow/compute/kernels/scalar_fill_null_test.cc | 62 +++---
 cpp/src/arrow/testing/gtest_util.cc|  4 ++
 3 files changed, 72 insertions(+), 15 deletions(-)



[arrow] branch master updated: ARROW-9417: [C++] Write length in IPC message by using little-endian

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 3b0055a  ARROW-9417: [C++] Write length in IPC message by using 
little-endian
3b0055a is described below

commit 3b0055adc4ab54b59d0671821c3767cebf291bd5
Author: Kazuaki Ishizaki 
AuthorDate: Sun Jul 12 12:09:18 2020 -0500

ARROW-9417: [C++] Write length in IPC message by using little-endian

This PR forces to write metadata_length and footer_length in IPC messages 
by using little-endian to follow [the 
specification](https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst).

Closes #7716 from kiszk/ARROW-9417

Authored-by: Kazuaki Ishizaki 
Signed-off-by: Wes McKinney 
---
 cpp/src/arrow/ipc/message.cc | 18 ++
 cpp/src/arrow/ipc/read_write_test.cc |  5 +
 cpp/src/arrow/ipc/reader.cc  |  3 ++-
 cpp/src/arrow/ipc/writer.cc  |  2 ++
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/cpp/src/arrow/ipc/message.cc b/cpp/src/arrow/ipc/message.cc
index aeb106e..dcf61ef 100644
--- a/cpp/src/arrow/ipc/message.cc
+++ b/cpp/src/arrow/ipc/message.cc
@@ -424,8 +424,9 @@ Status WriteMessage(const Buffer& message, const 
IpcWriteOptions& options,
 RETURN_NOT_OK(file->Write(::kIpcContinuationToken, 
sizeof(int32_t)));
   }
 
-  // Write the flatbuffer size prefix including padding
-  int32_t padded_flatbuffer_size = padded_message_length - prefix_size;
+  // Write the flatbuffer size prefix including padding in little endian
+  int32_t padded_flatbuffer_size =
+  BitUtil::ToLittleEndian(padded_message_length - prefix_size);
   RETURN_NOT_OK(file->Write(_flatbuffer_size, sizeof(int32_t)));
 
   // Write the flatbuffer
@@ -577,18 +578,18 @@ class MessageDecoder::MessageDecoderImpl {
   }
 
   Status ConsumeInitialData(const uint8_t* data, int64_t size) {
-return ConsumeInitial(util::SafeLoadAs(data));
+return 
ConsumeInitial(BitUtil::FromLittleEndian(util::SafeLoadAs(data)));
   }
 
   Status ConsumeInitialBuffer(const std::shared_ptr& buffer) {
 ARROW_ASSIGN_OR_RAISE(auto continuation, ConsumeDataBufferInt32(buffer));
-return ConsumeInitial(continuation);
+return ConsumeInitial(BitUtil::FromLittleEndian(continuation));
   }
 
   Status ConsumeInitialChunks() {
 int32_t continuation = 0;
 RETURN_NOT_OK(ConsumeDataChunks(sizeof(int32_t), ));
-return ConsumeInitial(continuation);
+return ConsumeInitial(BitUtil::FromLittleEndian(continuation));
   }
 
   Status ConsumeInitial(int32_t continuation) {
@@ -616,18 +617,19 @@ class MessageDecoder::MessageDecoderImpl {
   }
 
   Status ConsumeMetadataLengthData(const uint8_t* data, int64_t size) {
-return ConsumeMetadataLength(util::SafeLoadAs(data));
+return ConsumeMetadataLength(
+BitUtil::FromLittleEndian(util::SafeLoadAs(data)));
   }
 
   Status ConsumeMetadataLengthBuffer(const std::shared_ptr& buffer) {
 ARROW_ASSIGN_OR_RAISE(auto metadata_length, 
ConsumeDataBufferInt32(buffer));
-return ConsumeMetadataLength(metadata_length);
+return ConsumeMetadataLength(BitUtil::FromLittleEndian(metadata_length));
   }
 
   Status ConsumeMetadataLengthChunks() {
 int32_t metadata_length = 0;
 RETURN_NOT_OK(ConsumeDataChunks(sizeof(int32_t), _length));
-return ConsumeMetadataLength(metadata_length);
+return ConsumeMetadataLength(BitUtil::FromLittleEndian(metadata_length));
   }
 
   Status ConsumeMetadataLength(int32_t metadata_length) {
diff --git a/cpp/src/arrow/ipc/read_write_test.cc 
b/cpp/src/arrow/ipc/read_write_test.cc
index 9e4f4c9..6ae7611 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -131,6 +131,11 @@ TEST_P(TestMessage, SerializeTo) {
 ASSERT_EQ(BitUtil::RoundUp(metadata->size() + prefix_size, alignment) + 
body_length,
   output_length);
 ASSERT_OK_AND_EQ(output_length, stream->Tell());
+ASSERT_OK_AND_ASSIGN(auto buffer, stream->Finish());
+// check whether length is written in little endian
+auto buffer_ptr = buffer.get()->data();
+ASSERT_EQ(output_length - body_length - prefix_size,
+  BitUtil::FromLittleEndian(*(uint32_t*)(buffer_ptr + 4)));
   };
 
   CheckWithAlignment(8);
diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc
index 3c51fef..75f2213 100644
--- a/cpp/src/arrow/ipc/reader.cc
+++ b/cpp/src/arrow/ipc/reader.cc
@@ -979,7 +979,8 @@ class RecordBatchFileReaderImpl : public 
RecordBatchFileReader {
   return Status::Invalid("Not an Arrow file");
 }
 
-int32_t footer_length = *reinterpret_cast(buffer->data());
+int32_t footer_length =
+BitUtil::FromLittleEndian(*reinterpret_cast(buffer->data()));
 
 if (footer_length <= 0 ||
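The fix above boils down to: the Arrow IPC format mandates little-endian length prefixes, so a big-endian host must byte-swap before writing and after reading, which is what `BitUtil::ToLittleEndian`/`FromLittleEndian` do. The same idea in Python, as an illustration only:

```python
import struct

def write_length_prefix(length):
    # "<i" forces a 32-bit little-endian encoding regardless of host order,
    # matching what the IPC specification requires for metadata lengths.
    return struct.pack("<i", length)

def read_length_prefix(data):
    return struct.unpack("<i", data[:4])[0]

assert write_length_prefix(256) == b"\x00\x01\x00\x00"
assert read_length_prefix(b"\x00\x01\x00\x00") == 256
```

On little-endian hosts the swap is a no-op, which is why the bug only surfaced on big-endian platforms.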

[arrow] branch master updated (a5914d5 -> 35c8dff)

2020-07-12 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from a5914d5  ARROW-9268: [C++] add string_is{alnum,alpha...,upper} kernels
 add 35c8dff  PARQUET-1839: Set values read for required column

No new revisions were added by this update.

Summary of changes:
 cpp/src/parquet/column_reader.cc | 1 +
 1 file changed, 1 insertion(+)



[arrow] branch master updated (3e940dc -> a5914d5)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 3e940dc  ARROW-9389: [C++] Add binary metafunctions for the set lookup 
kernels isin and match that can be called with CallFunction
 add a5914d5  ARROW-9268: [C++] add string_is{alnum,alpha...,upper} kernels

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_string.cc | 491 -
 .../compute/kernels/scalar_string_benchmark.cc |  10 +
 .../arrow/compute/kernels/scalar_string_test.cc| 164 +++
 cpp/src/arrow/util/utf8.h  |  19 +
 docker-compose.yml |   2 +
 python/pyarrow/compute.py  |  21 +
 python/pyarrow/tests/test_compute.py   | 122 +
 7 files changed, 826 insertions(+), 3 deletions(-)



[arrow] branch master updated (1a7519f -> 3e940dc)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 1a7519f  ARROW-9395: [Python] allow configuring MetadataVersion
 add 3e940dc  ARROW-9389: [C++] Add binary metafunctions for the set lookup 
kernels isin and match that can be called with CallFunction

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/compute/kernels/scalar_set_lookup.cc | 30 ++
 .../compute/kernels/scalar_set_lookup_test.cc  | 16 +---
 2 files changed, 43 insertions(+), 3 deletions(-)



[arrow] branch master updated (18a5e3e -> 1a7519f)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 18a5e3e  ARROW-9331: [C++] Improve the performance of 
Tensor-to-SparseTensor conversion
 add 1a7519f  ARROW-9395: [Python] allow configuring MetadataVersion

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/python/flight.cc  |  4 +-
 cpp/src/arrow/python/flight.h   |  3 +-
 python/pyarrow/_flight.pyx  | 46 +--
 python/pyarrow/includes/libarrow.pxd|  1 +
 python/pyarrow/includes/libarrow_flight.pxd | 10 +++--
 python/pyarrow/ipc.pxi  | 46 +--
 python/pyarrow/ipc.py   | 56 ---
 python/pyarrow/lib.pxd  |  6 +++
 python/pyarrow/tests/test_flight.py | 62 --
 python/pyarrow/tests/test_ipc.py| 69 -
 10 files changed, 259 insertions(+), 44 deletions(-)



[arrow] branch master updated (d2ddaa6 -> 18a5e3e)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from d2ddaa6  ARROW-1692: [Java] UnionArray round trip not working
 add 18a5e3e  ARROW-9331: [C++] Improve the performance of 
Tensor-to-SparseTensor conversion

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/tensor/converter_internal.h |  88 +++
 cpp/src/arrow/tensor/coo_converter.cc | 140 +-
 cpp/src/arrow/tensor/csx_converter.cc |   2 +-
 cpp/src/arrow/util/macros.h   |   1 +
 4 files changed, 208 insertions(+), 23 deletions(-)
 create mode 100644 cpp/src/arrow/tensor/converter_internal.h



[arrow] branch master updated (32e1ab3 -> d2ddaa6)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 32e1ab3  ARROW-9276: [Dev] Enable ARROW_CUDA when generating API 
documentations
 add d2ddaa6  ARROW-1692: [Java] UnionArray round trip not working

No new revisions were added by this update.

Summary of changes:
 dev/archery/archery/integration/datagen.py |   1 -
 dev/archery/archery/integration/runner.py  |   2 +
 .../main/codegen/templates/DenseUnionVector.java   | 154 +++--
 .../src/main/codegen/templates/UnionVector.java|  91 
 .../java/org/apache/arrow/vector/BufferLayout.java |   2 +-
 .../java/org/apache/arrow/vector/NullVector.java   |   5 +-
 .../java/org/apache/arrow/vector/TypeLayout.java   |   4 +-
 .../apache/arrow/vector/ipc/JsonFileReader.java|  17 ++-
 .../apache/arrow/vector/ipc/JsonFileWriter.java|  11 +-
 .../java/org/apache/arrow/vector/types/Types.java  |   9 +-
 .../org/apache/arrow/vector/util/Validator.java|   2 +
 .../apache/arrow/vector/util/VectorAppender.java   |  13 +-
 .../apache/arrow/vector/TestDenseUnionVector.java  |  23 +--
 .../org/apache/arrow/vector/TestTypeLayout.java|   2 +-
 .../org/apache/arrow/vector/TestUnionVector.java   |  13 +-
 .../org/apache/arrow/vector/TestValueVector.java   |  24 ++--
 .../vector/complex/impl/TestPromotableWriter.java  |   4 +-
 17 files changed, 182 insertions(+), 195 deletions(-)



[arrow] branch master updated (6ada172 -> 32e1ab3)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 6ada172  ARROW-9283: [Python] Expose build info
 add 32e1ab3  ARROW-9276: [Dev] Enable ARROW_CUDA when generating API 
documentations

No new revisions were added by this update.

Summary of changes:
 ci/docker/linux-apt-docs.dockerfile |  1 +
 dev/release/post-09-docs.sh | 31 ++-
 docker-compose.yml  | 29 ++---
 3 files changed, 21 insertions(+), 40 deletions(-)



[arrow] branch master updated (2fac048 -> 6ada172)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 2fac048  ARROW-9403: [Python] add Array.tolist as alias of .to_pylist
 add 6ada172  ARROW-9283: [Python] Expose build info

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/config.h.cmake|  2 +-
 python/pyarrow/__init__.py   | 21 +++-
 python/pyarrow/config.pxi| 49 
 python/pyarrow/includes/libarrow.pxd | 14 +++
 python/pyarrow/lib.pyx   |  3 +++
 python/pyarrow/tests/test_misc.py| 10 
 python/setup.py  | 12 -
 7 files changed, 108 insertions(+), 3 deletions(-)
 create mode 100644 python/pyarrow/config.pxi



[arrow] branch master updated (16290e7 -> 2fac048)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 16290e7  ARROW-1567: [C++] Implement "fill_null" function that 
replaces null values with a scalar value
 add 2fac048  ARROW-9403: [Python] add Array.tolist as alias of .to_pylist

No new revisions were added by this update.

Summary of changes:
 python/pyarrow/array.pxi | 6 ++
 1 file changed, 6 insertions(+)



[arrow] branch master updated (b02095f -> 16290e7)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from b02095f  ARROW-9415: [C++] Arrow does not compile on Power9
 add 16290e7  ARROW-1567: [C++] Implement "fill_null" function that 
replaces null values with a scalar value

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/CMakeLists.txt   |   1 +
 cpp/src/arrow/compute/api_scalar.cc|   4 +
 cpp/src/arrow/compute/api_scalar.h |  15 ++
 cpp/src/arrow/compute/kernels/CMakeLists.txt   |   1 +
 cpp/src/arrow/compute/kernels/codegen_internal.h   |  40 +
 cpp/src/arrow/compute/kernels/scalar_fill_null.cc  | 168 +
 .../arrow/compute/kernels/scalar_fill_null_test.cc | 109 +
 cpp/src/arrow/compute/registry.cc  |   1 +
 cpp/src/arrow/compute/registry_internal.h  |   1 +
 9 files changed, 340 insertions(+)
 create mode 100644 cpp/src/arrow/compute/kernels/scalar_fill_null.cc
 create mode 100644 cpp/src/arrow/compute/kernels/scalar_fill_null_test.cc



[arrow] branch master updated (5e122c6 -> b02095f)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 5e122c6  ARROW-9407: [Python] Recognize more pandas null sentinels in 
sequence type inference when converting to Arrow
 add b02095f  ARROW-9415: [C++] Arrow does not compile on Power9

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/util/hashing.h | 7 +++
 1 file changed, 7 insertions(+)



[arrow] branch master updated (fe541e8 -> 5e122c6)

2020-07-11 Thread wesm
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from fe541e8  ARROW-9362: [Java] Support reading/writing V5 MetadataVersion
 add 5e122c6  ARROW-9407: [Python] Recognize more pandas null sentinels in 
sequence type inference when converting to Arrow

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/python/inference.cc   |  8 +++-
 python/pyarrow/tests/test_pandas.py | 10 +++---
 2 files changed, 14 insertions(+), 4 deletions(-)


