date:20210104

[GitHub] [arrow] mrkn commented on a change in pull request #6302: ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors

2021-01-04 Thread GitBox



mrkn commented on a change in pull request #6302:
URL: https://github.com/apache/arrow/pull/6302#discussion_r551765015



##
File path: cpp/src/arrow/ipc/generate_tensor_fuzz_corpus.cc
##
@@ -0,0 +1,132 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// A command line executable that generates a bunch of valid IPC files
+// containing example tensors.  Those are used as fuzzing seeds to make
+// fuzzing more efficient.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/io/file.h"
+#include "arrow/io/memory.h"
+#include "arrow/ipc/test_common.h"
+#include "arrow/ipc/writer.h"
+#include "arrow/result.h"
+#include "arrow/tensor.h"
+#include "arrow/util/io_util.h"
+
+namespace arrow {
+namespace ipc {
+
+using ::arrow::internal::PlatformFilename;
+
+Result PrepareDirectory(const std::string& dir) {
+  ARROW_ASSIGN_OR_RAISE(auto dir_fn, PlatformFilename::FromString(dir));
+  RETURN_NOT_OK(::arrow::internal::CreateDir(dir_fn));
+  return std::move(dir_fn);
+}
+
+Result> MakeSerializedBuffer(
+std::function&)> fn) {
+  ARROW_ASSIGN_OR_RAISE(auto sink, io::BufferOutputStream::Create(1024));
+  RETURN_NOT_OK(fn(sink));
+  return sink->Finish();
+}
+
+Result> SerializeTensor(const std::shared_ptr& 
tensor) {
+  return MakeSerializedBuffer(
+  [&](const std::shared_ptr& sink) -> Status {
+int32_t metadata_length;
+int64_t body_length;
+return ipc::WriteTensor(*tensor, sink.get(), _length, 
_length);
+  });
+}
+
+Result>> Tensors() {
+  std::vector> tensors;
+  std::shared_ptr tensor;
+  std::vector shape = {5, 3, 7};
+  std::shared_ptr types[] = {int8(),  int16(),  int32(),  int64(),
+   uint8(), uint16(), uint32(), uint64()};
+  uint32_t seed = 0;
+  for (auto type : types) {
+RETURN_NOT_OK(test::MakeRandomTensor(type, shape, true, , seed++));

Review comment:
   Fixed.

##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -1334,17 +1334,22 @@ Status GetTensorMetadata(const Buffer& metadata, 
std::shared_ptr* type
 return Status::IOError("Header-type of flatbuffer-encoded Message is not 
Tensor.");
   }
 
-  int ndim = static_cast(tensor->shape()->size());
+  auto ndim = tensor->shape()->size();

Review comment:
   Fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] mrkn commented on a change in pull request #6302: ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors

2021-01-04 Thread GitBox



mrkn commented on a change in pull request #6302:
URL: https://github.com/apache/arrow/pull/6302#discussion_r551764667



##
File path: cpp/src/arrow/ipc/test_common.h
##
@@ -161,6 +162,11 @@ Status MakeUuid(std::shared_ptr* out);
 ARROW_TESTING_EXPORT
 Status MakeDictExtension(std::shared_ptr* out);
 
+ARROW_TESTING_EXPORT
+Status MakeRandomTensor(const std::shared_ptr& type,
+const std::vector& shape, bool row_major_p,
+std::shared_ptr* out, uint32_t seed = 0);

Review comment:
   Because I followed `MakeRandomArray`.  It is in 
`arrow/ipc/test_common.h`.
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



nevi-me commented on pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#issuecomment-754464034


   Thanks for the review @jorgecarleitao. I've addressed your queries and 
comments, and cleaned up the TODOs



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551762894



##
File path: rust/arrow/src/array/equal/utils.rs
##
@@ -76,3 +80,185 @@ pub(super) fn equal_len(
 ) -> bool {
 lhs_values[lhs_start..(lhs_start + len)] == 
rhs_values[rhs_start..(rhs_start + len)]
 }
+
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+parent_data: ,
+logical_null_buffer: Option,

Review comment:
   @alamb @jorgecarleitao I wanted to make this `Option<>` to avoid 
cloning, but because I create a `Bitmap` for `parent_bitmap` and 
`self_null_bitmap` , I have to end up cloning the ``. So it's extra work 
to change the signature, and probably doesn't yield any benefit.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551762193



##
File path: rust/arrow/src/array/equal/structure.rs
##
@@ -37,39 +37,20 @@ fn equal_values(
 rhs_start: usize,
 len: usize,
 ) -> bool {
-let mut temp_lhs: Option = None;
-let mut temp_rhs: Option = None;
-
 lhs.child_data()
 .iter()
 .zip(rhs.child_data())
 .all(|(lhs_values, rhs_values)| {
 // merge the null data
-let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) 
{

Review comment:
   @jorgecarleitao I thought you were referring to this part of the code. I 
removed the one in `mod.rs` because I found that it was computing a duplicate 
when dealing with just primitive arrays.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



jorgecarleitao commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551746160



##
File path: rust/arrow/src/array/equal/mod.rs
##
@@ -146,118 +146,103 @@ fn equal_values(
 rhs_start: usize,
 len: usize,
 ) -> bool {
-// compute the nested buffer of the parent and child
-// if the array has no parent, the child is computed with itself
-#[allow(unused_assignments)]
-let mut temp_lhs: Option = None;
-#[allow(unused_assignments)]
-let mut temp_rhs: Option = None;
-let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) {
-(None, None) => None,
-(None, Some(c)) => Some(c),
-(Some(p), None) => Some(p),
-(Some(p), Some(c)) => {
-let merged = (p & c).unwrap();
-temp_lhs = Some(merged);
-temp_lhs.as_ref()
-}
-};

Review comment:
   Well, if it does not work it does not work and we scratch it ^_^





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551745156



##
File path: rust/arrow/src/array/equal/mod.rs
##
@@ -146,118 +146,103 @@ fn equal_values(
 rhs_start: usize,
 len: usize,
 ) -> bool {
-// compute the nested buffer of the parent and child
-// if the array has no parent, the child is computed with itself
-#[allow(unused_assignments)]
-let mut temp_lhs: Option = None;
-#[allow(unused_assignments)]
-let mut temp_rhs: Option = None;
-let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) {
-(None, None) => None,
-(None, Some(c)) => Some(c),
-(Some(p), None) => Some(p),
-(Some(p), Some(c)) => {
-let merged = (p & c).unwrap();
-temp_lhs = Some(merged);
-temp_lhs.as_ref()
-}
-};

Review comment:
   Interestingly, I found an issue with this while fixing the failing 
integration tests. See my latest commit 
https://github.com/apache/arrow/pull/9093/commits/66bb98e240f9ca49a7abe2a00ce83a117a62b426.
   
   I'll follow up with changes for your suggestions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



jorgecarleitao commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551739520



##
File path: rust/arrow/src/array/data.rs
##
@@ -136,6 +137,84 @@ impl ArrayData {
 _bitmap
 }
 
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+///
+/// Safety
+///
+/// As we index into [`ArrayData::child_data`], this function panics if
+/// array data is not a nested type, as it will not have child data.
+pub fn child_logical_null_buffer(

Review comment:
   I would have placed this function outside of `impl ArrayData`, add the 
corresponding arguments, and place it under `equal/utils.rs` for now. 
Alternatively, use `pub(crate)` to not make this public.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



jorgecarleitao commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551738974



##
File path: rust/arrow/src/array/equal/structure.rs
##
@@ -55,21 +56,24 @@ fn equal_values(
 temp_lhs.as_ref()
 }
 };
-let rhs_merged_nulls = match (rhs_nulls, rhs_values.null_buffer()) 
{
-(None, None) => None,
-(None, Some(c)) => Some(c),
-(Some(p), None) => Some(p),
-(Some(p), Some(c)) => {
-let merged = (p & c).unwrap();
-temp_rhs = Some(merged);
-temp_rhs.as_ref()
-}
-};
+// TODO: this is intentional, looking at which is the better option
+let rhs_merged_nulls =
+rhs.child_logical_null_buffer(rhs_nulls.cloned(), index);

Review comment:
   pass `rhs_values` instead of `index` here?

##
File path: rust/arrow/src/array/data.rs
##
@@ -136,6 +137,84 @@ impl ArrayData {
 _bitmap
 }
 
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+///
+/// Safety

Review comment:
   `# safety` is used for `unsafe`, this is `# Panics`.

##
File path: rust/arrow/src/array/equal/structure.rs
##
@@ -38,12 +38,13 @@ fn equal_values(
 len: usize,
 ) -> bool {
 let mut temp_lhs: Option = None;
-let mut temp_rhs: Option = None;
+// let mut temp_rhs: Option = None;
 
 lhs.child_data()
 .iter()
 .zip(rhs.child_data())
-.all(|(lhs_values, rhs_values)| {
+.enumerate()

Review comment:
   which allows to remove this `enumerate` here.

##
File path: rust/arrow/src/array/equal/primitive.rs
##
@@ -32,7 +36,10 @@ pub(super) fn primitive_equal(
 let lhs_values = ()[0].as_slice()[lhs.offset() * byte_width..];
 let rhs_values = ()[0].as_slice()[rhs.offset() * byte_width..];
 
-if lhs.null_count() == 0 && rhs.null_count() == 0 {
+let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
+let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+
+if lhs_null_count == 0 && rhs_null_count == 0 {

Review comment:
   This is even better: if the buffer only has 1s outside of the range, we 
short-circuit to the fast case.  

##
File path: rust/arrow/src/array/equal/mod.rs
##
@@ -146,118 +146,103 @@ fn equal_values(
 rhs_start: usize,
 len: usize,
 ) -> bool {
-// compute the nested buffer of the parent and child
-// if the array has no parent, the child is computed with itself
-#[allow(unused_assignments)]
-let mut temp_lhs: Option = None;
-#[allow(unused_assignments)]
-let mut temp_rhs: Option = None;
-let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) {
-(None, None) => None,
-(None, Some(c)) => Some(c),
-(Some(p), None) => Some(p),
-(Some(p), Some(c)) => {
-let merged = (p & c).unwrap();
-temp_lhs = Some(merged);
-temp_lhs.as_ref()
-}
-};

Review comment:
   I liked this more. A bit more explicit what was happening here.

##
File path: rust/arrow/src/array/data.rs
##
@@ -136,6 +137,84 @@ impl ArrayData {
 _bitmap
 }
 
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+///
+/// Safety
+///
+/// As we index into [`ArrayData::child_data`], this function panics if
+/// array data is not a nested type, as it will not have child data.
+pub fn child_logical_null_buffer(

Review comment:
   I would have placed this function outside of `impl ArrayData`, add the 
corresponding arguments, and place it under `equal/` for now. Alternatively, 
use `pub(crate)` to not make this public.

##
File path: rust/arrow/src/array/data.rs
##
@@ -136,6 +137,84 @@ impl ArrayData {
 _bitmap
 }
 
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+

[GitHub] [arrow] mqy commented on pull request #9025: ARROW-10259: [Rust] Add custom metadata to Field

2021-01-04 Thread GitBox



mqy commented on pull request #9025:
URL: https://github.com/apache/arrow/pull/9025#issuecomment-754421764


   > Hey @mqy, I'm back in the city, and have access to my desktop; so I'll be 
able to review this PR and help you enable integration tests during the week.
   
   Welcome back!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for writing parquet

2021-01-04 Thread GitBox



jorgecarleitao commented on pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#issuecomment-754386570


   Two CI jobs are hanging indefinitely, I canceled them.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] arw2019 commented on pull request #8955: ARROW-9948: [C++] in Decimal128::FromString raise when scale is out of bounds

2021-01-04 Thread GitBox



arw2019 commented on pull request #8955:
URL: https://github.com/apache/arrow/pull/8955#issuecomment-754383958


   Thanks @pitrou @kiszk for reviewing! Planning to push an update in the next 
day or so



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter

2021-01-04 Thread GitBox



jorgecarleitao commented on a change in pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#discussion_r551702345



##
File path: rust/datafusion/src/physical_plan/filter.rs
##
@@ -103,25 +103,23 @@ impl ExecutionPlan for FilterExec {
 }
 
 async fn execute(, partition: usize) -> 
Result {
-Ok(Box::pin(FilterExecStream {
-schema: self.input.schema().clone(),
-predicate: self.predicate.clone(),
-input: self.input.execute(partition).await?,
-}))
+let predicate = self.predicate.clone();
+
+let stream = self.input.execute(partition).await?;
+let stream = stream.then(move |batch| {
+let predicate = predicate.clone();
+async move {
+// Filtering batches is CPU-bounded and therefore justifies a 
dedicated thread pool
+task::spawn_blocking(move || batch_filter(?, ))
+.await
+.unwrap()
+}
+});
+
+Ok(Box::pin(SizedRecordBatchStream::new(stream, self.schema(

Review comment:
   no, but this did not happen in master either: a stream does not ask for 
the next item until the current item is completed. We get no parallelism 
whatsoever from streams, only concurrency (because they yield). E.g. 
https://docs.rs/parallel-stream/2.1.2/parallel_stream/#differences-with-rayon
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551694880



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();
+};
+
+class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool {
+ public:
+  explicit ReservationListenableMemoryPool(MemoryPool* pool,
+   
std::shared_ptr listener,

Review comment:
   Would you suggest `unique_ptr` here? I thought that ReservationListener 
instances may be able to share over pools. For example a listener simply for 
logging reservations to std out.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551694096



##
File path: cpp/src/arrow/memory_pool.cc
##
@@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return 
impl_->max_memory(); }
 
 std::string ProxyMemoryPool::backend_name() const { return 
impl_->backend_name(); }
 
+ReservationListener::~ReservationListener() {}
+
+ReservationListener::ReservationListener() {}
+
+class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl {
+ public:
+  explicit ReservationListenableMemoryPoolImpl(
+  MemoryPool* pool, std::shared_ptr listener, int64_t 
block_size)
+  : pool_(pool),
+listener_(listener),
+block_size_(block_size),
+blocks_reserved_(0),
+bytes_reserved_(0) {}
+
+  Status Allocate(int64_t size, uint8_t** out) {
+RETURN_NOT_OK(UpdateReservation(size));
+Status error = pool_->Allocate(size, out);
+if (!error.ok()) {
+  RETURN_NOT_OK(UpdateReservation(-size));
+  return error;
+}
+return Status::OK();
+  }
+
+  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) {
+bool reserved = false;
+int64_t diff = new_size - old_size;
+if (new_size >= old_size) {
+  RETURN_NOT_OK(UpdateReservation(diff));
+  reserved = true;

Review comment:
   Done, thanks for the suggestion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551693971



##
File path: cpp/src/arrow/memory_pool.cc
##
@@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return 
impl_->max_memory(); }
 
 std::string ProxyMemoryPool::backend_name() const { return 
impl_->backend_name(); }
 
+ReservationListener::~ReservationListener() {}

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551693725



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {

Review comment:
   added

##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();

Review comment:
   Done. Thanks for the suggestion.

##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();

Review comment:
   done

##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();
+};
+
+class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool {

Review comment:
   added





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] kou closed pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms

2021-01-04 Thread GitBox



kou closed pull request #9098:
URL: https://github.com/apache/arrow/pull/9098


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types

2021-01-04 Thread GitBox



sunchao commented on a change in pull request #9047:
URL: https://github.com/apache/arrow/pull/9047#discussion_r551681563



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> {
 LogicalType::INT_32 => Ok(DataType::Int32),
 LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)),
 LogicalType::TIME_MILLIS => 
Ok(DataType::Time32(TimeUnit::Millisecond)),
+LogicalType::DECIMAL => self.to_decimal(),

Review comment:
   np @sweb - thanks for the contribution!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435



##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   What I was thinking is that, we can have another constructor for 
`SerializedFileReader` which takes a custom metadata:
   ```rust
   pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> 
Result {
   Ok(Self {
   chunk_reader: Arc::new(chunk_reader),
   metadata: metadata,
   })
   }
   ```
   and we move the metadata filtering part to data fusion, or a util function 
in `footer.rs`.
   
   In the long term though, I think we should do something similar to 
[parquet-mr is 
doing](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656),
 that is, having a `ParquetReadOptions`-like struct which allows user to 
specify various configs, properties, filters, etc when reading a parquet file. 
The struct is extendable as well to accommodate new features in future such as 
filtering with column indexes or bloom filters, so we don't need to have 
multiple constructors. The constructor can become like this:
   ```rust
   pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result 
{
   Ok(Self {
   chunk_reader: Arc::new(chunk_reader),
   options: options,
   })
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435



##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   What I was thinking is that, we can have another constructor for 
`SerializedFileReader` which takes a custom metadata:
   ```rust
   pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> 
Result {
   Ok(Self {
   chunk_reader: Arc::new(chunk_reader),
   metadata: metadata,
   })
   }
   ```
   and we move the metadata filtering part to data fusion, or a util function 
in `footer.rs`.
   
   In the long term though, I think we should do something similar to 
[parquet-mr is 
doing](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656),
 that is, having a `ParquetReadOptions`-like struct which allows user to 
specify various configs, properties, filters, etc when reading a parquet file. 
The struct is extendable as well to accommodate new features in future such as 
filtering with column indexes or bloom filters. The constructor can become like 
this:
   ```rust
   pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result 
{
   Ok(Self {
   chunk_reader: Arc::new(chunk_reader),
   options: options,
   })
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-04 Thread GitBox



terencehonles commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754339737


   @kszucs I noticed there was a list of allowed actions. Could the allow list 
just be updated or is that enforced at the apache (org) level rather than the 
repo level?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] kszucs commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-04 Thread GitBox



kszucs commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754300248


   Yes, the comment bot seems to be failing 
https://github.com/apache/arrow/actions/runs/462228610
   This is most likely caused by the recent security changes, so we need to 
update the 
[comment_bot.yml](https://github.com/apache/arrow/blob/master/.github/workflows/comment_bot.yml)
 action to use verified plugins only.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-04 Thread GitBox



nealrichardson commented on pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#issuecomment-754297915


   @xhochy autotune isn't working at the moment--INFRA has blocked some of the 
Actions we use, including on that workflow. See 
https://issues.apache.org/jira/browse/INFRA-21239



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson closed pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



nealrichardson closed pull request #9092:
URL: https://github.com/apache/arrow/pull/9092


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551621749



##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   there is another possibility - I have just noticed 
`FilePageIterator::with_row_groups` which could be used to filter row groups 
based on a list of row group indexes; this could replace the 
`filter_row_groups` method but would require the row group indexes to be passed 
down all the way to `build_for_primitive_type_inner` where `FilePageIterator` 
is created; this could be done through a new field in 
`ArrayReaderBuilderContext`.
   It's a deeper change but would mean that `filter_row_groups` method is no 
longer necessary. @sunchao  do you think this would be a better way to go about 
filtering of row groups? I am not sure the complexity is worth it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551621749



##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   there is another possibility - I have just noticed 
`FilePageIterator::with_row_groups` which could be used to filter row groups 
based on a list of row group indexes; this could replace the 
`filter_row_groups` method but would require the row group indexes to be passed 
down all the way to `build_for_primitive_type_inner` where `FilePageIterator` 
is created; this could be done through a new field in 
`ArrayReaderBuilderContext`.
   It's a deeper change but would mean that `filter_row_groups` method is no 
longer necessary. @sunchao  do you think this would be a better way to go about 
filtering of row groups?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#discussion_r551619381



##
File path: cpp/src/arrow/compute/kernels/scalar_replace.cc
##
@@ -0,0 +1,309 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+
+namespace arrow {
+
+using internal::BitBlockCount;

Review comment:
   use full imports and put them inside the anonymous namespace.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-04 Thread GitBox



terencehonles commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754272662


   @kou / @kszucs you may already know but it looks like the crossbow submitter 
/ github actions bot isn't working(?). It doesn't look like multibuild has 
tasks for GitHub Actions, but I wanted to see how far the Travis OS X commands 
would get.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-754270017


   @zhztheplayer I need to take a closer look at the JNI changes since this was 
last approved, and will take another look at the memory stuff once you've added 
some docs.  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] alamb closed pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656

2021-01-04 Thread GitBox



alamb closed pull request #9094:
URL: https://github.com/apache/arrow/pull/9094


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet [WIP]

2021-01-04 Thread GitBox



github-actions[bot] commented on pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#issuecomment-754266445


   https://issues.apache.org/jira/browse/ARROW-11129



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] Dandandan opened a new pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet [WIP]

2021-01-04 Thread GitBox



Dandandan opened a new pull request #9099:
URL: https://github.com/apache/arrow/pull/9099


   Inspired by PR https://github.com/apache/arrow/pull/9086 from 
@jorgecarleitao 
   
   Seems better to use  `task::spawn_blocking` to set upper limit on nr. of 
threads (512 by default) using tokio.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551610971



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();
+};
+
+class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool {
+ public:
+  explicit ReservationListenableMemoryPool(MemoryPool* pool,
+   
std::shared_ptr listener,

Review comment:
   why shared ptr?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551610553



##
File path: cpp/src/arrow/memory_pool.cc
##
@@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return 
impl_->max_memory(); }
 
 std::string ProxyMemoryPool::backend_name() const { return 
impl_->backend_name(); }
 
+ReservationListener::~ReservationListener() {}
+
+ReservationListener::ReservationListener() {}
+
+class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl {
+ public:
+  explicit ReservationListenableMemoryPoolImpl(
+  MemoryPool* pool, std::shared_ptr listener, int64_t 
block_size)
+  : pool_(pool),
+listener_(listener),
+block_size_(block_size),
+blocks_reserved_(0),
+bytes_reserved_(0) {}
+
+  Status Allocate(int64_t size, uint8_t** out) {
+RETURN_NOT_OK(UpdateReservation(size));
+Status error = pool_->Allocate(size, out);
+if (!error.ok()) {
+  RETURN_NOT_OK(UpdateReservation(-size));
+  return error;
+}
+return Status::OK();
+  }
+
+  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) {
+bool reserved = false;
+int64_t diff = new_size - old_size;
+if (new_size >= old_size) {
+  RETURN_NOT_OK(UpdateReservation(diff));
+  reserved = true;

Review comment:
   some docs in this method could help readability.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551610223



##
File path: cpp/src/arrow/memory_pool.cc
##
@@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return 
impl_->max_memory(); }
 
 std::string ProxyMemoryPool::backend_name() const { return 
impl_->backend_name(); }
 
+ReservationListener::~ReservationListener() {}

Review comment:
   this and the constructor aren't needed if you use = default as suggested 
in the header.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551610076



##
File path: cpp/src/arrow/memory_pool.cc
##
@@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return 
impl_->max_memory(); }
 
 std::string ProxyMemoryPool::backend_name() const { return 
impl_->backend_name(); }
 
+ReservationListener::~ReservationListener() {}
+
+ReservationListener::ReservationListener() {}
+
+class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl {
+ public:
+  explicit ReservationListenableMemoryPoolImpl(
+  MemoryPool* pool, std::shared_ptr listener, int64_t 
block_size)
+  : pool_(pool),
+listener_(listener),
+block_size_(block_size),
+blocks_reserved_(0),
+bytes_reserved_(0) {}
+
+  Status Allocate(int64_t size, uint8_t** out) {
+RETURN_NOT_OK(UpdateReservation(size));
+Status error = pool_->Allocate(size, out);
+if (!error.ok()) {
+  RETURN_NOT_OK(UpdateReservation(-size));
+  return error;
+}
+return Status::OK();
+  }
+
+  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) {
+bool reserved = false;
+int64_t diff = new_size - old_size;
+if (new_size >= old_size) {
+  RETURN_NOT_OK(UpdateReservation(diff));
+  reserved = true;
+}
+Status error = pool_->Reallocate(old_size, new_size, ptr);
+if (!error.ok()) {
+  if (reserved) {
+RETURN_NOT_OK(UpdateReservation(-diff));
+  }
+  return error;
+}
+if (!reserved) {
+  RETURN_NOT_OK(UpdateReservation(diff));
+}
+return Status::OK();
+  }
+
+  void Free(uint8_t* buffer, int64_t size) {
+pool_->Free(buffer, size);
+// fixme currently method ::Free doesn't allow Status return

Review comment:
   please open a JIRA or e-mail the mailing list for discussion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551609465



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();

Review comment:
   
   ```suggestion
 ReservationListener() = default;
   ```

##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();
+
+  virtual Status OnReservation(int64_t size) = 0;
+  virtual Status OnRelease(int64_t size) = 0;
+
+ protected:
+  ReservationListener();
+};
+
+class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool {

Review comment:
   documentation please.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551609281



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {
+ public:
+  virtual ~ReservationListener();

Review comment:
   ```suggestion
 virtual ~ReservationListener() = default;
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r551609196



##
File path: cpp/src/arrow/memory_pool.h
##
@@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
   std::unique_ptr impl_;
 };
 
+class ARROW_EXPORT ReservationListener {

Review comment:
   documentation please.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-04 Thread GitBox



bkietz commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r551608598



##
File path: cpp/src/arrow/dataset/expression_internal.h
##
@@ -0,0 +1,465 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/expression.h"
+
+#include 
+#include 
+#include 
+
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/registry.h"
+#include "arrow/record_batch.h"
+#include "arrow/table.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+
+namespace dataset {
+
+bool Identical(const Expression& l, const Expression& r) { return l.impl_ == 
r.impl_; }
+
+const Expression::Call* CallNotNull(const Expression& expr) {
+  auto call = expr.call();
+  DCHECK_NE(call, nullptr);
+  return call;
+}
+
+inline void GetAllFieldRefs(const Expression& expr,
+std::unordered_set* 
refs) {
+  if (auto lit = expr.literal()) return;
+
+  if (auto ref = expr.field_ref()) {
+refs->emplace(*ref);
+return;
+  }
+
+  for (const Expression& arg : CallNotNull(expr)->arguments) {
+GetAllFieldRefs(arg, refs);
+  }
+}
+
+inline std::vector GetDescriptors(const std::vector& 
exprs) {
+  std::vector descrs(exprs.size());
+  for (size_t i = 0; i < exprs.size(); ++i) {
+DCHECK(exprs[i].IsBound());
+descrs[i] = exprs[i].descr();
+  }
+  return descrs;
+}
+
+inline std::vector GetDescriptors(const std::vector& 
values) {
+  std::vector descrs(values.size());
+  for (size_t i = 0; i < values.size(); ++i) {
+descrs[i] = values[i].descr();
+  }
+  return descrs;
+}
+
+struct FieldPathGetDatumImpl {
+  template ()))>
+  Result operator()(const std::shared_ptr& ptr) {
+return path_.Get(*ptr).template As();
+  }
+
+  template 
+  Result operator()(const T&) {
+return Status::NotImplemented("FieldPath::Get() into Datum ", 
datum_.ToString());
+  }
+
+  const Datum& datum_;
+  const FieldPath& path_;
+};
+
+inline Result GetDatumField(const FieldRef& ref, const Datum& input) {
+  Datum field;
+
+  FieldPath path;
+  if (auto type = input.type()) {
+ARROW_ASSIGN_OR_RAISE(path, ref.FindOneOrNone(*input.type()));
+  } else if (input.kind() == Datum::RECORD_BATCH) {
+ARROW_ASSIGN_OR_RAISE(path, 
ref.FindOneOrNone(*input.record_batch()->schema()));
+  } else if (input.kind() == Datum::TABLE) {
+ARROW_ASSIGN_OR_RAISE(path, ref.FindOneOrNone(*input.table()->schema()));
+  }
+
+  if (path) {
+ARROW_ASSIGN_OR_RAISE(field,
+  util::visit(FieldPathGetDatumImpl{input, path}, 
input.value));
+  }
+
+  if (field == Datum{}) {
+field = Datum(std::make_shared());
+  }
+
+  return field;
+}
+
+struct Comparison {
+  enum type {
+NA = 0,
+EQUAL = 1,
+LESS = 2,
+GREATER = 4,
+NOT_EQUAL = LESS | GREATER,
+LESS_EQUAL = LESS | EQUAL,
+GREATER_EQUAL = GREATER | EQUAL,
+  };
+
+  static const type* Get(const std::string& function) {
+static std::unordered_map flipped_comparisons{
+{"equal", EQUAL}, {"not_equal", NOT_EQUAL},
+{"less", LESS},   {"less_equal", LESS_EQUAL},
+{"greater", GREATER}, {"greater_equal", GREATER_EQUAL},
+};
+
+auto it = flipped_comparisons.find(function);
+return it != flipped_comparisons.end() ? >second : nullptr;
+  }
+
+  static const type* Get(const Expression& expr) {
+if (auto call = expr.call()) {
+  return Comparison::Get(call->function_name);
+}
+return nullptr;
+  }
+
+  // Execute a simple Comparison between scalars, casting the RHS if types 
disagree
+  static Result Execute(Datum l, Datum r) {
+if (!l.is_scalar() || !r.is_scalar()) {
+  return Status::Invalid("Cannot Execute Comparison on non-scalars");
+}
+
+if (!l.type()->Equals(r.type())) {
+  ARROW_ASSIGN_OR_RAISE(r, compute::Cast(r, l.type()));
+}
+
+std::vector arguments{std::move(l), std::move(r)};
+
+ARROW_ASSIGN_OR_RAISE(auto equal, compute::CallFunction("equal", 
arguments));
+
+if (!equal.scalar()->is_valid) return NA;
+if (equal.scalar_as().value) return EQUAL;
+
+ARROW_ASSIGN_OR_RAISE(auto less, compute::CallFunction("less", arguments));
+
+if

[GitHub] [arrow] emkornfield closed pull request #8597: ARROW-10492: [Java][JDBC] Allow users to config the mapping between SQL types and Arrow types

2021-01-04 Thread GitBox



emkornfield closed pull request #8597:
URL: https://github.com/apache/arrow/pull/8597


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on issue #8987: Are the Date Logical Type and Date Converted Type implemented?

2021-01-04 Thread GitBox



emkornfield commented on issue #8987:
URL: https://github.com/apache/arrow/issues/8987#issuecomment-754262638


   I believe this was addressed on the mailing list.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield closed issue #8987: Are the Date Logical Type and Date Converted Type implemented?

2021-01-04 Thread GitBox



emkornfield closed issue #8987:
URL: https://github.com/apache/arrow/issues/8987


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #8597: ARROW-10492: [Java][JDBC] Allow users to config the mapping between SQL types and Arrow types

2021-01-04 Thread GitBox



emkornfield commented on pull request #8597:
URL: https://github.com/apache/arrow/pull/8597#issuecomment-754261654


   +1 thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-04 Thread GitBox



emkornfield commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-754259372


   Is it possible to add a test to confirm that this can be read/written from 
the C++ implementation?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r551603469



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN;
+
+import java.nio.ByteBuffer;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+
+import net.jpountz.lz4.LZ4Compressor;
+import net.jpountz.lz4.LZ4Factory;
+import net.jpountz.lz4.LZ4FastDecompressor;
+
+/**
+ * Compression codec for the LZ4 algorithm.
+ */
+public class Lz4CompressionCodec implements CompressionCodec {
+
+  private static final long SIZE_OF_MESSAGE_LENGTH = 8L;
+
+  private final LZ4Factory factory;
+
+  private LZ4Compressor compressor;
+
+  private LZ4FastDecompressor decompressor;
+
+  public Lz4CompressionCodec() {
+factory = LZ4Factory.fastestInstance();
+  }
+
+  @Override
+  public ArrowBuf compress(BufferAllocator allocator, ArrowBuf 
unCompressedBuffer) {
+Preconditions.checkArgument(unCompressedBuffer.writerIndex() <= 
Integer.MAX_VALUE,
+"The uncompressed buffer size exceeds the integer limit");
+
+// create compressor lazily

Review comment:
   why?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r551603359



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN;
+
+import java.nio.ByteBuffer;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+
+import net.jpountz.lz4.LZ4Compressor;
+import net.jpountz.lz4.LZ4Factory;
+import net.jpountz.lz4.LZ4FastDecompressor;
+
+/**
+ * Compression codec for the LZ4 algorithm.
+ */
+public class Lz4CompressionCodec implements CompressionCodec {
+
+  private static final long SIZE_OF_MESSAGE_LENGTH = 8L;
+
+  private final LZ4Factory factory;
+
+  private LZ4Compressor compressor;
+
+  private LZ4FastDecompressor decompressor;
+
+  public Lz4CompressionCodec() {
+factory = LZ4Factory.fastestInstance();
+  }
+
+  @Override
+  public ArrowBuf compress(BufferAllocator allocator, ArrowBuf 
unCompressedBuffer) {

Review comment:
   ? or is this consistent with the existing API?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r551603225



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN;
+
+import java.nio.ByteBuffer;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+
+import net.jpountz.lz4.LZ4Compressor;
+import net.jpountz.lz4.LZ4Factory;
+import net.jpountz.lz4.LZ4FastDecompressor;
+
+/**
+ * Compression codec for the LZ4 algorithm.
+ */
+public class Lz4CompressionCodec implements CompressionCodec {
+
+  private static final long SIZE_OF_MESSAGE_LENGTH = 8L;
+
+  private final LZ4Factory factory;
+
+  private LZ4Compressor compressor;
+
+  private LZ4FastDecompressor decompressor;
+
+  public Lz4CompressionCodec() {
+factory = LZ4Factory.fastestInstance();
+  }
+
+  @Override
+  public ArrowBuf compress(BufferAllocator allocator, ArrowBuf 
unCompressedBuffer) {

Review comment:
   ```suggestion
 public ArrowBuf compress(BufferAllocator allocator, ArrowBuf 
uncompressedBuffer) {
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r551603003



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN;
+
+import java.nio.ByteBuffer;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+
+import net.jpountz.lz4.LZ4Compressor;

Review comment:
   How was this library chosen?  It looks like it might not have been 
released in a while?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551597243



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -209,6 +251,477 @@ impl ParquetExec {
 }
 }
 
+#[derive(Debug, Clone)]
+/// Predicate builder used for generating of predicate functions, used to 
filter row group metadata
+pub struct PredicateExpressionBuilder {
+parquet_schema: Schema,
+predicate_expr: Arc,
+stat_column_req: Vec<(String, StatisticsType, Field)>,
+}
+
+impl PredicateExpressionBuilder {
+/// Try to create a new instance of PredicateExpressionBuilder
+pub fn try_new(expr: , parquet_schema: Schema) -> Result {
+// build predicate expression once
+let mut stat_column_req = Vec::<(String, StatisticsType, 
Field)>::new();
+let predicate_expr =
+build_predicate_expression(expr, _schema,  
stat_column_req)?;
+
+Ok(Self {
+parquet_schema,
+predicate_expr,
+stat_column_req,
+})
+}
+
+/// Generate a predicate function used to filter row group metadata
+pub fn build_row_group_predicate(
+,
+row_group_metadata: &[RowGroupMetaData],
+) -> Box bool> {
+// build statistics record batch
+let predicate_result = build_row_group_record_batch(
+row_group_metadata,
+_schema,
+_column_req,
+)
+.and_then(|statistics_batch| {
+// execute predicate expression
+self.predicate_expr.evaluate(_batch)
+})
+.and_then(|v| match v {
+ColumnarValue::Array(array) => Ok(array),
+ColumnarValue::Scalar(_) => Err(DataFusionError::Plan(
+"predicate expression didn't return an array".to_string(),
+)),
+});
+
+let predicate_array = match predicate_result {
+Ok(array) => array,
+_ => return Box::new(|_r, _i| true),
+};
+
+let predicate_array = 
predicate_array.as_any().downcast_ref::();
+match predicate_array {
+// return row group predicate function
+Some(array) => {
+let predicate_values =
+array.iter().map(|x| 
x.unwrap_or(false)).collect::>();
+Box::new(move |_, i| predicate_values[i])
+}
+// predicate result is not a BooleanArray
+_ => Box::new(|_r, _i| true),
+}
+}
+}
+
+fn build_row_group_record_batch(
+row_groups: &[RowGroupMetaData],
+parquet_schema: ,
+stat_column_req: <(String, StatisticsType, Field)>,
+) -> Result {
+let mut fields = Vecnew();
+let mut arrays = Vecnew();
+for (column_name, statistics_type, stat_field) in stat_column_req {
+if let Some((column_index, _)) = 
parquet_schema.column_with_name(column_name) {
+let statistics = row_groups
+.iter()
+.map(|g| g.column(column_index).statistics())
+.collect::>();
+let array = build_statistics_array(
+,
+*statistics_type,
+stat_field.data_type(),
+);
+fields.push(stat_field.clone());
+arrays.push(array);
+}
+}
+let schema = Arc::new(Schema::new(fields));
+RecordBatch::try_new(schema, arrays)
+.map_err(|err| DataFusionError::Plan(err.to_string()))
+}
+
+struct PhysicalExpressionBuilder<'a> {
+column_name: String,
+column_expr: &'a Expr,
+scalar_expr: &'a Expr,
+parquet_field: &'a Field,
+statistics_fields: Vec,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+reverse_operator: bool,
+}
+
+impl<'a> PhysicalExpressionBuilder<'a> {
+fn try_new(
+left: &'a Expr,
+right: &'a Expr,
+parquet_schema: &'a Schema,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+) -> Result {
+// find column name; input could be a more complicated expression
+let mut left_columns = HashSetnew();
+utils::expr_to_column_names(left,  left_columns)?;
+let mut right_columns = HashSetnew();
+utils::expr_to_column_names(right,  right_columns)?;
+let (column_expr, scalar_expr, column_names, reverse_operator) =
+match (left_columns.len(), right_columns.len()) {
+(1, 0) => (left, right, left_columns, false),
+(0, 1) => (right, left, right_columns, true),
+_ => {
+// if more than one column used in expression - not 
supported
+return Err(DataFusionError::Plan(
+"Multi-column expressions are not currently supported"
+.to_string(),
+));
+}
+};
+let column_name =

[GitHub] [arrow] github-actions[bot] commented on pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms

2021-01-04 Thread GitBox



github-actions[bot] commented on pull request #9098:
URL: https://github.com/apache/arrow/pull/9098#issuecomment-754256502


   https://issues.apache.org/jira/browse/ARROW-11127



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] xhochy commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-04 Thread GitBox



xhochy commented on pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#issuecomment-754256374


   @github-actions autotune



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-04 Thread GitBox



emkornfield commented on pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#issuecomment-754256316


   @liyafan82 does this actually make a difference in benchmarks?  I agree it 
is easier to reason about, but is there any way to avoid backward incompability?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] xhochy opened a new pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms

2021-01-04 Thread GitBox



xhochy opened a new pull request #9098:
URL: https://github.com/apache/arrow/pull/9098


   Otherwise we get an error in debug builds because of `-Werror,Wunused`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#discussion_r551601013



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestMetadataVersion.java
##
@@ -56,9 +56,8 @@ public static void setUpClass() {
 schema = new Schema(Collections.singletonList(Field.nullable("foo", new 
ArrowType.Int(32, true;
 unionSchema = new Schema(
 Collections.singletonList(Field.nullable("union", new 
ArrowType.Union(UnionMode.Dense, new int[]{0};
-optionV4 = new IpcOption();
-optionV4.metadataVersion = MetadataVersion.V4;
-optionV5 = new IpcOption();
+optionV4 = new IpcOption(false, MetadataVersion.V4);

Review comment:
   please comment the literal.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-04 Thread GitBox



emkornfield commented on a change in pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#discussion_r551600824



##
File path: 
java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java
##
@@ -194,10 +194,9 @@ public ArrowMessage(FlightDescriptor descriptor) {
   private ArrowMessage(FlightDescriptor descriptor, MessageMetadataResult 
message, ArrowBuf appMetadata,
ArrowBuf buf) {
 // No need to take IpcOption as this is used for deserialized ArrowMessage 
coming from the wire.
-this.writeOption = new IpcOption();
-if (message != null) {
-  this.writeOption.metadataVersion = 
MetadataVersion.fromFlatbufID(message.getMessage().version());
-}
+this.writeOption = message != null ?
+new IpcOption(false, 
MetadataVersion.fromFlatbufID(message.getMessage().version())) :

Review comment:
   please comment the literal.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] emkornfield commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-04 Thread GitBox



emkornfield commented on pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#issuecomment-754254123


   Is it possible to add a unit test?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] kou commented on pull request #9096: [Python][Packaging] Refactor manylinux and windows wheel building [WIP]

2021-01-04 Thread GitBox



kou commented on pull request #9096:
URL: https://github.com/apache/arrow/pull/9096#issuecomment-754252075


   Related works:
   
 * https://github.com/apache/arrow/pull/8386#issuecomment-728321361
 * #8881



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551597243



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -209,6 +251,477 @@ impl ParquetExec {
 }
 }
 
+#[derive(Debug, Clone)]
+/// Predicate builder used for generating of predicate functions, used to 
filter row group metadata
+pub struct PredicateExpressionBuilder {
+parquet_schema: Schema,
+predicate_expr: Arc,
+stat_column_req: Vec<(String, StatisticsType, Field)>,
+}
+
+impl PredicateExpressionBuilder {
+/// Try to create a new instance of PredicateExpressionBuilder
+pub fn try_new(expr: , parquet_schema: Schema) -> Result {
+// build predicate expression once
+let mut stat_column_req = Vec::<(String, StatisticsType, 
Field)>::new();
+let predicate_expr =
+build_predicate_expression(expr, _schema,  
stat_column_req)?;
+
+Ok(Self {
+parquet_schema,
+predicate_expr,
+stat_column_req,
+})
+}
+
+/// Generate a predicate function used to filter row group metadata
+pub fn build_row_group_predicate(
+,
+row_group_metadata: &[RowGroupMetaData],
+) -> Box bool> {
+// build statistics record batch
+let predicate_result = build_row_group_record_batch(
+row_group_metadata,
+_schema,
+_column_req,
+)
+.and_then(|statistics_batch| {
+// execute predicate expression
+self.predicate_expr.evaluate(_batch)
+})
+.and_then(|v| match v {
+ColumnarValue::Array(array) => Ok(array),
+ColumnarValue::Scalar(_) => Err(DataFusionError::Plan(
+"predicate expression didn't return an array".to_string(),
+)),
+});
+
+let predicate_array = match predicate_result {
+Ok(array) => array,
+_ => return Box::new(|_r, _i| true),
+};
+
+let predicate_array = 
predicate_array.as_any().downcast_ref::();
+match predicate_array {
+// return row group predicate function
+Some(array) => {
+let predicate_values =
+array.iter().map(|x| 
x.unwrap_or(false)).collect::>();
+Box::new(move |_, i| predicate_values[i])
+}
+// predicate result is not a BooleanArray
+_ => Box::new(|_r, _i| true),
+}
+}
+}
+
+fn build_row_group_record_batch(
+row_groups: &[RowGroupMetaData],
+parquet_schema: ,
+stat_column_req: <(String, StatisticsType, Field)>,
+) -> Result {
+let mut fields = Vecnew();
+let mut arrays = Vecnew();
+for (column_name, statistics_type, stat_field) in stat_column_req {
+if let Some((column_index, _)) = 
parquet_schema.column_with_name(column_name) {
+let statistics = row_groups
+.iter()
+.map(|g| g.column(column_index).statistics())
+.collect::>();
+let array = build_statistics_array(
+,
+*statistics_type,
+stat_field.data_type(),
+);
+fields.push(stat_field.clone());
+arrays.push(array);
+}
+}
+let schema = Arc::new(Schema::new(fields));
+RecordBatch::try_new(schema, arrays)
+.map_err(|err| DataFusionError::Plan(err.to_string()))
+}
+
+struct PhysicalExpressionBuilder<'a> {
+column_name: String,
+column_expr: &'a Expr,
+scalar_expr: &'a Expr,
+parquet_field: &'a Field,
+statistics_fields: Vec,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+reverse_operator: bool,
+}
+
+impl<'a> PhysicalExpressionBuilder<'a> {
+fn try_new(
+left: &'a Expr,
+right: &'a Expr,
+parquet_schema: &'a Schema,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+) -> Result {
+// find column name; input could be a more complicated expression
+let mut left_columns = HashSetnew();
+utils::expr_to_column_names(left,  left_columns)?;
+let mut right_columns = HashSetnew();
+utils::expr_to_column_names(right,  right_columns)?;
+let (column_expr, scalar_expr, column_names, reverse_operator) =
+match (left_columns.len(), right_columns.len()) {
+(1, 0) => (left, right, left_columns, false),
+(0, 1) => (right, left, right_columns, true),
+_ => {
+// if more than one column used in expression - not 
supported
+return Err(DataFusionError::Plan(
+"Multi-column expressions are not currently supported"
+.to_string(),
+));
+}
+};
+let column_name =

[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

2021-01-04 Thread GitBox



nealrichardson commented on pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#issuecomment-754251796


   > This increases again the weight of the metadata, which now has to include 
attributes for each element of a list column
   
   How much does it blow up the metadata? Is this going to scale to be able to 
handle normal/large shapefiles? 
   
   Is there a more efficient representation for this metadata (considering that 
we have to serialize it to a string)? Should we special-case `sf` objects?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jonkeane commented on pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



jonkeane commented on pull request #9092:
URL: https://github.com/apache/arrow/pull/9092#issuecomment-754243679


   Updated, passing actions on my fork:
   https://github.com/jonkeane/arrow/actions/runs/461948379
   https://github.com/jonkeane/arrow/actions/runs/461948380



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551588681



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -209,6 +251,477 @@ impl ParquetExec {
 }
 }
 
+#[derive(Debug, Clone)]
+/// Predicate builder used for generating of predicate functions, used to 
filter row group metadata
+pub struct PredicateExpressionBuilder {
+parquet_schema: Schema,
+predicate_expr: Arc,
+stat_column_req: Vec<(String, StatisticsType, Field)>,
+}
+
+impl PredicateExpressionBuilder {
+/// Try to create a new instance of PredicateExpressionBuilder
+pub fn try_new(expr: , parquet_schema: Schema) -> Result {
+// build predicate expression once
+let mut stat_column_req = Vec::<(String, StatisticsType, 
Field)>::new();
+let predicate_expr =
+build_predicate_expression(expr, _schema,  
stat_column_req)?;
+
+Ok(Self {
+parquet_schema,
+predicate_expr,
+stat_column_req,
+})
+}
+
+/// Generate a predicate function used to filter row group metadata
+pub fn build_row_group_predicate(
+,
+row_group_metadata: &[RowGroupMetaData],
+) -> Box bool> {
+// build statistics record batch
+let predicate_result = build_row_group_record_batch(
+row_group_metadata,
+_schema,
+_column_req,
+)
+.and_then(|statistics_batch| {
+// execute predicate expression
+self.predicate_expr.evaluate(_batch)
+})
+.and_then(|v| match v {
+ColumnarValue::Array(array) => Ok(array),
+ColumnarValue::Scalar(_) => Err(DataFusionError::Plan(
+"predicate expression didn't return an array".to_string(),
+)),
+});
+
+let predicate_array = match predicate_result {
+Ok(array) => array,
+_ => return Box::new(|_r, _i| true),
+};
+
+let predicate_array = 
predicate_array.as_any().downcast_ref::();
+match predicate_array {
+// return row group predicate function
+Some(array) => {
+let predicate_values =
+array.iter().map(|x| 
x.unwrap_or(false)).collect::>();
+Box::new(move |_, i| predicate_values[i])
+}
+// predicate result is not a BooleanArray
+_ => Box::new(|_r, _i| true),
+}
+}
+}
+
+fn build_row_group_record_batch(
+row_groups: &[RowGroupMetaData],
+parquet_schema: ,
+stat_column_req: <(String, StatisticsType, Field)>,
+) -> Result {
+let mut fields = Vecnew();
+let mut arrays = Vecnew();
+for (column_name, statistics_type, stat_field) in stat_column_req {
+if let Some((column_index, _)) = 
parquet_schema.column_with_name(column_name) {
+let statistics = row_groups
+.iter()
+.map(|g| g.column(column_index).statistics())
+.collect::>();
+let array = build_statistics_array(
+,
+*statistics_type,
+stat_field.data_type(),
+);
+fields.push(stat_field.clone());
+arrays.push(array);
+}
+}
+let schema = Arc::new(Schema::new(fields));
+RecordBatch::try_new(schema, arrays)
+.map_err(|err| DataFusionError::Plan(err.to_string()))
+}
+
+struct PhysicalExpressionBuilder<'a> {
+column_name: String,
+column_expr: &'a Expr,
+scalar_expr: &'a Expr,
+parquet_field: &'a Field,
+statistics_fields: Vec,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+reverse_operator: bool,
+}
+
+impl<'a> PhysicalExpressionBuilder<'a> {
+fn try_new(
+left: &'a Expr,
+right: &'a Expr,
+parquet_schema: &'a Schema,
+stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>,
+) -> Result {
+// find column name; input could be a more complicated expression
+let mut left_columns = HashSetnew();
+utils::expr_to_column_names(left,  left_columns)?;
+let mut right_columns = HashSetnew();
+utils::expr_to_column_names(right,  right_columns)?;
+let (column_expr, scalar_expr, column_names, reverse_operator) =
+match (left_columns.len(), right_columns.len()) {
+(1, 0) => (left, right, left_columns, false),
+(0, 1) => (right, left, right_columns, true),
+_ => {
+// if more than one column used in expression - not 
supported
+return Err(DataFusionError::Plan(
+"Multi-column expressions are not currently supported"
+.to_string(),
+));
+}
+};
+let column_name =

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551587504



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -209,6 +251,477 @@ impl ParquetExec {
 }
 }
 
+#[derive(Debug, Clone)]
+/// Predicate builder used for generating of predicate functions, used to 
filter row group metadata
+pub struct PredicateExpressionBuilder {
+parquet_schema: Schema,
+predicate_expr: Arc,
+stat_column_req: Vec<(String, StatisticsType, Field)>,
+}
+
+impl PredicateExpressionBuilder {
+/// Try to create a new instance of PredicateExpressionBuilder
+pub fn try_new(expr: , parquet_schema: Schema) -> Result {
+// build predicate expression once
+let mut stat_column_req = Vec::<(String, StatisticsType, 
Field)>::new();
+let predicate_expr =
+build_predicate_expression(expr, _schema,  
stat_column_req)?;
+
+Ok(Self {
+parquet_schema,
+predicate_expr,
+stat_column_req,
+})
+}
+
+/// Generate a predicate function used to filter row group metadata
+pub fn build_row_group_predicate(
+,
+row_group_metadata: &[RowGroupMetaData],
+) -> Box bool> {
+// build statistics record batch
+let predicate_result = build_row_group_record_batch(
+row_group_metadata,
+_schema,
+_column_req,
+)
+.and_then(|statistics_batch| {
+// execute predicate expression
+self.predicate_expr.evaluate(_batch)
+})
+.and_then(|v| match v {
+ColumnarValue::Array(array) => Ok(array),
+ColumnarValue::Scalar(_) => Err(DataFusionError::Plan(
+"predicate expression didn't return an array".to_string(),
+)),
+});
+
+let predicate_array = match predicate_result {
+Ok(array) => array,
+_ => return Box::new(|_r, _i| true),
+};
+
+let predicate_array = 
predicate_array.as_any().downcast_ref::();
+match predicate_array {
+// return row group predicate function
+Some(array) => {
+let predicate_values =
+array.iter().map(|x| 
x.unwrap_or(false)).collect::>();
+Box::new(move |_, i| predicate_values[i])
+}
+// predicate result is not a BooleanArray
+_ => Box::new(|_r, _i| true),

Review comment:
   My thinking in designing this has been that pushing the predicate down 
to parquet is optional, because even if it fails the query will still compute, 
just slower; because of that the code tries to avoid panicking and instead 
returns a predicate which returns true - it doesn't filter any row groups and 
lets them be processed by downstream operators. 
   It is even possible to have a partial predicate expression, where multiple 
conditions are joined using a logical `AND`, and some of them can't be 
translated for some reason to physical expressions, they will be replaced by 
`true`, but the rest will still be evaluated and could still filter some row 
groups.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #9089: ARROW-11122: [Rust] Added FFI support for date and time.

2021-01-04 Thread GitBox



nevi-me commented on a change in pull request #9089:
URL: https://github.com/apache/arrow/pull/9089#discussion_r551586245



##
File path: rust/arrow-pyarrow-integration-testing/src/lib.rs
##
@@ -153,10 +153,25 @@ fn substring(array: PyObject, start: i64, py: Python) -> 
PyResult {
 to_py(array, py)
 }
 
+/// Returns the concatenate

Review comment:
   Is the doc comment complete?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me closed pull request #9065: ARROW-11096: [Rust] C data interface for [Large]binary

2021-01-04 Thread GitBox



nevi-me closed pull request #9065:
URL: https://github.com/apache/arrow/pull/9065


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551583777



##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   Good point about documentation - will add some shortly. 
   
   As long as row group metadata is filtered immediately after creating a 
SerializedFileReader, this approach will work.
   
   That's the simplest way I could think of to allow filtering of row groups 
using statistics metadata; not sure how this could be done within DataFusion 
itself, because it reads data in batches (of configurable size) which could 
potentially span multiple row groups; it could be done, but would probably move 
a lot of complexity into DataFusion which today is nicely abstracted into the 
parquet library. This would also expose a lot more about the internals of a 
parquet file format to the outside as the user would have to be aware of row 
groups rather than just requesting batches of data.
   May be I misunderstand what you are suggesting?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip

2021-01-04 Thread GitBox



jonkeane commented on a change in pull request #8549:
URL: https://github.com/apache/arrow/pull/8549#discussion_r551552588



##
File path: r/R/record-batch.R
##
@@ -286,6 +286,20 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 
 apply_arrow_r_metadata <- function(x, r_metadata) {
   tryCatch({
+columns_metadata <- r_metadata$columns
+if (is.data.frame(x)) {
+  if (length(names(x)) && !is.null(columns_metadata)) {
+for (name in intersect(names(columns_metadata), names(x))) {
+  x[[name]] <- apply_arrow_r_metadata(x[[name]], 
columns_metadata[[name]])
+}
+  }
+} else if(is.list(x) && !inherits(x, "POSIXlt") && 
!is.null(columns_metadata)) {
+  x <- map2(x, columns_metadata, function(.x, .y) {
+apply_arrow_r_metadata(.x, .y)
+  })
+  x
+}

Review comment:
   Could we move `!is.null(columns_metadata)` up to line 290 instead of 
having it on both 291 and 296?

##
File path: r/tests/testthat/test-metadata.R
##
@@ -134,3 +134,10 @@ test_that("metadata keeps attribute of top level data 
frame", {
   expect_identical(attr(as.data.frame(tab), "foo"), "bar")
   expect_identical(as.data.frame(tab), df)
 })
+
+test_that("metadata of list elements (ARROW-10386)", {
+  df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = 
"bar"
+  tab <- Table$create(df)
+  expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar")
+  expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar")

Review comment:
   I wonder if it would be clearer to have different attributes on each 
item/row to make it super obvious that we're not picking the attributes of the 
first item/row and copying them for the whole column.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-04 Thread GitBox



bkietz commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r551578256



##
File path: cpp/src/arrow/compute/cast.cc
##
@@ -118,8 +118,86 @@ class CastMetaFunction : public MetaFunction {
 
 }  // namespace
 
+const FunctionDoc struct_doc{"Wrap Arrays into a StructArray",
+ ("Names of the StructArray's fields are\n"
+  "specified through StructOptions."),
+ {},
+ "StructOptions"};
+
+Result StructResolve(KernelContext* ctx,
+ const std::vector& descrs) {
+  const auto& names = OptionsWrapper::Get(ctx).field_names;
+  if (names.size() != descrs.size()) {
+return Status::Invalid("Struct() was passed ", names.size(), " field ", 
"names but ",
+   descrs.size(), " arguments");
+  }
+
+  size_t i = 0;
+  FieldVector fields(descrs.size());
+
+  ValueDescr::Shape shape = ValueDescr::SCALAR;
+  for (const ValueDescr& descr : descrs) {
+if (descr.shape != ValueDescr::SCALAR) {
+  shape = ValueDescr::ARRAY;
+} else {
+  switch (descr.type->id()) {
+case Type::EXTENSION:

Review comment:
   ExtensionScalar hasn't been implemented yet





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

2021-01-04 Thread GitBox



codecov-io edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=h1) Report
   > Merging 
[#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=desc) (7a44a99) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/5db1d2aad0b6aaca6dfe4e01b0afeefb0311d109?el=desc)
 (5db1d2a) will **decrease** coverage by `0.01%`.
   > The diff coverage is `78.97%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#9038  +/-   ##
   ==
   - Coverage   82.60%   82.58%   -0.02% 
   ==
 Files 204  204  
 Lines   5020050533 +333 
   ==
   + Hits4146741735 +268 
   - Misses   8733 8798  +65 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=)
 | `53.92% <47.05%> (-0.68%)` | :arrow_down: |
   | 
[rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=)
 | `59.04% <71.42%> (+0.32%)` | :arrow_up: |
   | 
[rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==)
 | `76.92% <77.27%> (+0.02%)` | :arrow_up: |
   | 
[rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz)
 | `83.77% <78.53%> (-0.71%)` | :arrow_down: |
   | 
[rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==)
 | `84.06% <81.81%> (+0.09%)` | :arrow_up: |
   | 
[rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=)
 | `78.63% <82.53%> (+0.74%)` | :arrow_up: |
   | 
[rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==)
 | `99.83% <100.00%> (+<0.01%)` | :arrow_up: |
   | 
[rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=)
 | `57.37% <0.00%> (+1.19%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=footer). Last 
update 
[5db1d2a...7a44a99](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sweb commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types

2021-01-04 Thread GitBox



sweb commented on a change in pull request #9047:
URL: https://github.com/apache/arrow/pull/9047#discussion_r551575727



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> {
 LogicalType::INT_32 => Ok(DataType::Int32),
 LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)),
 LogicalType::TIME_MILLIS => 
Ok(DataType::Time32(TimeUnit::Millisecond)),
+LogicalType::DECIMAL => self.to_decimal(),

Review comment:
   @sunchao thank you for your help with this PR. The result is much better 
than my initial proposal!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-04 Thread GitBox



bkietz commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r551574884



##
File path: cpp/src/arrow/dataset/expression_internal.h
##
@@ -0,0 +1,465 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/expression.h"
+
+#include 
+#include 
+#include 
+
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/registry.h"
+#include "arrow/record_batch.h"
+#include "arrow/table.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+
+namespace dataset {
+
+bool Identical(const Expression& l, const Expression& r) { return l.impl_ == 
r.impl_; }
+
+const Expression::Call* CallNotNull(const Expression& expr) {
+  auto call = expr.call();
+  DCHECK_NE(call, nullptr);
+  return call;
+}
+
+inline void GetAllFieldRefs(const Expression& expr,
+std::unordered_set* 
refs) {
+  if (auto lit = expr.literal()) return;
+
+  if (auto ref = expr.field_ref()) {
+refs->emplace(*ref);
+return;
+  }
+
+  for (const Expression& arg : CallNotNull(expr)->arguments) {
+GetAllFieldRefs(arg, refs);
+  }
+}
+
+inline std::vector GetDescriptors(const std::vector& 
exprs) {
+  std::vector descrs(exprs.size());
+  for (size_t i = 0; i < exprs.size(); ++i) {
+DCHECK(exprs[i].IsBound());
+descrs[i] = exprs[i].descr();
+  }
+  return descrs;
+}
+
+inline std::vector GetDescriptors(const std::vector& 
values) {
+  std::vector descrs(values.size());
+  for (size_t i = 0; i < values.size(); ++i) {
+descrs[i] = values[i].descr();
+  }
+  return descrs;
+}
+
+struct FieldPathGetDatumImpl {

Review comment:
   Since this is only valid for some Datum variant members, I think it 
should remain a special case for datasets at least for now





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



yordan-pavlov commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551574592



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -209,6 +251,479 @@ impl ParquetExec {
 }
 }
 
+#[derive(Debug, Clone)]
+/// Predicate builder used for generating of predicate functions, used to 
filter row group metadata
+pub struct PredicateExpressionBuilder {

Review comment:
   thinking some more about this, it could be done by moving the creation 
of PredicateExpressionBuilder from `ParquetExec::try_from_files`  into 
`(ExecutionPlan for ParquetExec)::execute`, but then this work would be 
repeated for each partition, where as currently it's only done once; at this 
point I don't think it's worth it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

2021-01-04 Thread GitBox



seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-754223732


   Ok, I have done a major refactor against a rebased master.
   
   I believe this now meets the ANSI behavior with regard to `NULL` handling 
but it does not yet support syntax where columns are referenced in the `list` 
parameter like:
   
   ```sql
   SELECT TRUE IN (col1, col2, FALSE)
   ```
   
   This has been implemented with a "make it work then make it fast" approach 
as this `InList` expression actually has a lot of complexity.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-04 Thread GitBox



github-actions[bot] commented on pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#issuecomment-754222849


   https://issues.apache.org/jira/browse/ARROW-10881



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] xhochy opened a new pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-04 Thread GitBox



xhochy opened a new pull request #9097:
URL: https://github.com/apache/arrow/pull/9097


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-04 Thread GitBox



terencehonles commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754216137


   @github-actions crossbow submit wheel-osx-*-cp39



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] johncassil commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



johncassil commented on pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#issuecomment-754215072


   @jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois 
   
   Just wanted to say thanks for working on this.  I reported it a long time 
ago and have just been periodically watching the developments slowly progress.  
 I'm excited to see that there will be a resolution!
   
   Cheers!
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] alamb closed pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types

2021-01-04 Thread GitBox



alamb closed pull request #9047:
URL: https://github.com/apache/arrow/pull/9047


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #8927: ARROW-10766: [Rust] [Parquet] Nested List IO

2021-01-04 Thread GitBox



nevi-me commented on pull request #8927:
URL: https://github.com/apache/arrow/pull/8927#issuecomment-754211680


   I got stalled with #9093. I think it's the last blocker before I can 
complete this :(



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



jonkeane commented on a change in pull request #9092:
URL: https://github.com/apache/arrow/pull/9092#discussion_r551560380



##
File path: r/R/record-batch.R
##
@@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 }
 
 .serialize_arrow_r_metadata <- function(x) {
+  # drop problems attributes (most likely from readr)
+  if ("attributes" %in% names(x) &&
+  "problems" %in% names(x[["attributes"]]) ) {
+x[["attributes"]][["problems"]] <- NULL
+  }
+

Review comment:
   Ok, cool — I'll rewrite the second test to avoid using 
`.serialize_arrow_r_metadata()` so that we don't error there, but still keep 
the testing-serialized-data aspect of that test.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson closed pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



nealrichardson closed pull request #8365:
URL: https://github.com/apache/arrow/pull/8365


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-04 Thread GitBox



sunchao commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551558307



##
File path: rust/datafusion/src/datasource/parquet.rs
##
@@ -62,17 +64,37 @@ impl TableProvider for ParquetTable {
 self.schema.clone()
 }
 
+fn supports_filter_pushdown(
+,
+_filter: ,
+) -> Result {
+Ok(TableProviderFilterPushDown::Inexact)
+}
+
 /// Scan the file(s), using the provided projection, and return one 
BatchIterator per
 /// partition.
 fn scan(
 ,
 projection: >,
 batch_size: usize,
-_filters: &[Expr],
+filters: &[Expr],

Review comment:
   Ideally, I feel we should have a proper filter API defined in data 
fusion which can be shared among various data sources. On the other hand, the 
actual filtering logic should be implemented by different data sources / 
formats, probably via converting the data fusion's filter API to the 
corresponding ones from the latter. 
   
   But this is a very good start and we can probably do them as follow ups (if 
we don't care much for API changes).

##
File path: rust/parquet/src/file/serialized_reader.rs
##
@@ -137,6 +137,22 @@ impl SerializedFileReader {
 metadata,
 })
 }
+
+pub fn filter_row_groups(

Review comment:
   Yeah I think we can either move this to the application layer (i.e., 
data fusion), or expose it as a util function from `footer.rs`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

2021-01-04 Thread GitBox



nevi-me commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754208396


   > * Implement cast that returns an error on parsing errors instead of null
   
   See https://issues.apache.org/jira/browse/ARROW-7364



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



nealrichardson commented on a change in pull request #9092:
URL: https://github.com/apache/arrow/pull/9092#discussion_r551556251



##
File path: r/R/record-batch.R
##
@@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 }
 
 .serialize_arrow_r_metadata <- function(x) {
+  # drop problems attributes (most likely from readr)
+  if ("attributes" %in% names(x) &&
+  "problems" %in% names(x[["attributes"]]) ) {
+x[["attributes"]][["problems"]] <- NULL
+  }
+

Review comment:
   I would have to look into the implementation to be sure, but I think the 
intent of the first test should be that we don't save garbage `r` metadata and 
the second would be that, if someone happened to have bad metadata stored in 
the `r` key (we can't prevent some other data generating process from doing 
that), that we don't crash on loading it. It's not clear that that's actually 
what those tests are doing, but that's what I think they should do.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



nealrichardson commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551551461



##
File path: r/src/array_to_vector.cpp
##
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
 }
 
 StringArrayType* string_array = static_cast(array.get());
-if (array->null_count()) {
-  // need to watch for nulls
-  arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-array->offset(), n);
+
+const bool all_valid = array->null_count() == 0;
+const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+bool nul_was_stripped = false;
+
+if (all_valid) {
+  // no need to watch for missing strings
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++, null_reader.Next()) {
-  if (null_reader.IsSet()) {
-SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
-  } else {
-SET_STRING_ELT(data, start + i, NA_STRING);
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++) {
+SET_STRING_ELT(data, start + i,
+   
r_string_from_view_strip_nul(string_array->GetView(i),
+_was_stripped));
   }
+  return;

Review comment:
   no preference, just trying to understand





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



jonkeane commented on a change in pull request #9092:
URL: https://github.com/apache/arrow/pull/9092#discussion_r551549190



##
File path: r/R/record-batch.R
##
@@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 }
 
 .serialize_arrow_r_metadata <- function(x) {
+  # drop problems attributes (most likely from readr)
+  if ("attributes" %in% names(x) &&
+  "problems" %in% names(x[["attributes"]]) ) {
+x[["attributes"]][["problems"]] <- NULL
+  }
+

Review comment:
   Yeah I can do that. Though [one of the existing 
tests](https://github.com/apache/arrow/blob/master/r/tests/testthat/test-metadata.R#L76-L81)
 fails here — the one right above it works which might be sufficient, is there 
something that lines 76-81 is testing that we aren't covering with lines 70-75?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



bkietz commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551542391



##
File path: r/src/array_to_vector.cpp
##
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
 }
 
 StringArrayType* string_array = static_cast(array.get());
-if (array->null_count()) {
-  // need to watch for nulls
-  arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-array->offset(), n);
+
+const bool all_valid = array->null_count() == 0;
+const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+bool nul_was_stripped = false;
+
+if (all_valid) {
+  // no need to watch for missing strings
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++, null_reader.Next()) {
-  if (null_reader.IsSet()) {
-SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
-  } else {
-SET_STRING_ELT(data, start + i, NA_STRING);
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++) {
+SET_STRING_ELT(data, start + i,
+   
r_string_from_view_strip_nul(string_array->GetView(i),
+_was_stripped));
   }
+  return;
+}
+
+for (int i = 0; i < n; i++) {
+  SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
 }
   });
 } else {
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++) {
-  SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
+arrow::internal::BitmapReader 
validity_reader(array->null_bitmap_data(),
+  array->offset(), n);
+
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++, validity_reader.Next()) {
+if (validity_reader.IsSet()) {
+  SET_STRING_ELT(data, start + i,
+ 
r_string_from_view_strip_nul(string_array->GetView(i),
+  _was_stripped));
+} else {
+  SET_STRING_ELT(data, start + i, NA_STRING);
+}
+  }
+  return;
+}
+
+for (int i = 0; i < n; i++, validity_reader.Next()) {
+  if (validity_reader.IsSet()) {
+SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
+  } else {
+SET_STRING_ELT(data, start + i, NA_STRING);
+  }
 }
   });
 }
 
+if (nul_was_stripped) {
+  cpp11::warning("Stripping '\\0' (nul) from character vector");
+}
+
 return Status::OK();
   }
 
   bool Parallel() const { return false; }
 
  private:
-  static SEXP r_string_from_view(const arrow::util::string_view& view) {
+  static SEXP r_string_from_view(arrow::util::string_view view) {

Review comment:
   string_view is a trivially copyable structure so there's no need to 
protect it from copies with `const&`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



bkietz commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551542022



##
File path: r/src/array_to_vector.cpp
##
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
 }
 
 StringArrayType* string_array = static_cast(array.get());
-if (array->null_count()) {
-  // need to watch for nulls
-  arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-array->offset(), n);
+
+const bool all_valid = array->null_count() == 0;
+const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+bool nul_was_stripped = false;
+
+if (all_valid) {
+  // no need to watch for missing strings
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++, null_reader.Next()) {
-  if (null_reader.IsSet()) {
-SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
-  } else {
-SET_STRING_ELT(data, start + i, NA_STRING);
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++) {
+SET_STRING_ELT(data, start + i,
+   
r_string_from_view_strip_nul(string_array->GetView(i),
+_was_stripped));
   }
+  return;

Review comment:
   This `return` is breaking out of the lambda wrapped by `unwind_protect`; 
it doesn't make the parent scope return. If you prefer, we could use an `else` 
here





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] github-actions[bot] removed a comment on pull request #9095: ARROW-10183: [C++] Apply composable futures to CSV

2021-01-04 Thread GitBox



github-actions[bot] removed a comment on pull request #9095:
URL: https://github.com/apache/arrow/pull/9095#issuecomment-754137607


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types

2021-01-04 Thread GitBox



sunchao commented on a change in pull request #9047:
URL: https://github.com/apache/arrow/pull/9047#discussion_r551540021



##
File path: rust/parquet/src/arrow/schema.rs
##
@@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> {
 LogicalType::INT_32 => Ok(DataType::Int32),
 LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)),
 LogicalType::TIME_MILLIS => 
Ok(DataType::Time32(TimeUnit::Millisecond)),
+LogicalType::DECIMAL => self.to_decimal(),

Review comment:
   Thanks for the update. This PR looks good and we can address the 
remaining in a followup.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-04 Thread GitBox



nevi-me commented on pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#issuecomment-754186287


   I saw the clippy warning, I'll fix it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656

2021-01-04 Thread GitBox



nevi-me commented on a change in pull request #9094:
URL: https://github.com/apache/arrow/pull/9094#discussion_r551536755



##
File path: rust/arrow/src/record_batch.rs
##
@@ -80,6 +80,10 @@ impl RecordBatch {
 Ok(RecordBatch { schema, columns })
 }
 
+/// Creates a [RecordBatch` from a schema and columns, with additional 
options,

Review comment:
   Thanks, I meant for it to be `RecordBatch`, as the link has little 
value. I think I removed the wrong character while changing it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread GitBox



nealrichardson commented on a change in pull request #8365:
URL: https://github.com/apache/arrow/pull/8365#discussion_r551519433



##
File path: r/src/array_to_vector.cpp
##
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
 }
 
 StringArrayType* string_array = static_cast(array.get());
-if (array->null_count()) {
-  // need to watch for nulls
-  arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-array->offset(), n);
+
+const bool all_valid = array->null_count() == 0;
+const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+bool nul_was_stripped = false;
+
+if (all_valid) {
+  // no need to watch for missing strings
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++, null_reader.Next()) {
-  if (null_reader.IsSet()) {
-SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
-  } else {
-SET_STRING_ELT(data, start + i, NA_STRING);
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++) {
+SET_STRING_ELT(data, start + i,
+   
r_string_from_view_strip_nul(string_array->GetView(i),
+_was_stripped));
   }
+  return;

Review comment:
   The main function return (L347) is `return Status::OK();` -- should 
these do the same? Or do I misunderstand what this return does?

##
File path: r/src/array_to_vector.cpp
##
@@ -288,36 +290,104 @@ struct Converter_String : public Converter {
 }
 
 StringArrayType* string_array = static_cast(array.get());
-if (array->null_count()) {
-  // need to watch for nulls
-  arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
-array->offset(), n);
+
+const bool all_valid = array->null_count() == 0;
+const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false);
+
+bool nul_was_stripped = false;
+
+if (all_valid) {
+  // no need to watch for missing strings
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++, null_reader.Next()) {
-  if (null_reader.IsSet()) {
-SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
-  } else {
-SET_STRING_ELT(data, start + i, NA_STRING);
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++) {
+SET_STRING_ELT(data, start + i,
+   
r_string_from_view_strip_nul(string_array->GetView(i),
+_was_stripped));
   }
+  return;
+}
+
+for (int i = 0; i < n; i++) {
+  SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
 }
   });
 } else {
   cpp11::unwind_protect([&] {
-for (int i = 0; i < n; i++) {
-  SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
+arrow::internal::BitmapReader 
validity_reader(array->null_bitmap_data(),
+  array->offset(), n);
+
+if (strip_out_nuls) {
+  for (int i = 0; i < n; i++, validity_reader.Next()) {
+if (validity_reader.IsSet()) {
+  SET_STRING_ELT(data, start + i,
+ 
r_string_from_view_strip_nul(string_array->GetView(i),
+  _was_stripped));
+} else {
+  SET_STRING_ELT(data, start + i, NA_STRING);
+}
+  }
+  return;
+}
+
+for (int i = 0; i < n; i++, validity_reader.Next()) {
+  if (validity_reader.IsSet()) {
+SET_STRING_ELT(data, start + i, 
r_string_from_view(string_array->GetView(i)));
+  } else {
+SET_STRING_ELT(data, start + i, NA_STRING);
+  }
 }
   });
 }
 
+if (nul_was_stripped) {
+  cpp11::warning("Stripping '\\0' (nul) from character vector");
+}
+
 return Status::OK();
   }
 
   bool Parallel() const { return false; }
 
  private:
-  static SEXP r_string_from_view(const arrow::util::string_view& view) {
+  static SEXP r_string_from_view(arrow::util::string_view view) {

Review comment:
   Why this change?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes

2021-01-04 Thread GitBox



nealrichardson commented on a change in pull request #9092:
URL: https://github.com/apache/arrow/pull/9092#discussion_r551534161



##
File path: r/R/record-batch.R
##
@@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 }
 
 .serialize_arrow_r_metadata <- function(x) {
+  # drop problems attributes (most likely from readr)
+  if ("attributes" %in% names(x) &&
+  "problems" %in% names(x[["attributes"]]) ) {
+x[["attributes"]][["problems"]] <- NULL
+  }
+

Review comment:
   Should we validate/assert `is.list()` in this function?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

2021-01-04 Thread GitBox



Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether 
strings are utf8. It adds a bit of overhead, but the utf8 checking itself is 
not much for now, it is mostly the logic surrounding `StringRecord` that adds 
the most overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer 
so we can remove the `StringRecords` which is internally a `Vec` (by using 
`ByteRecord`) and a `Vec` for the rows.
   
   The current performance penalty between master and this branch currently is 
~10% as we introduce an extra intermediate step which I think could be more 
than compensated for by removing the `StringRecord` abstraction, and trying to 
write to a string or binary array without intermediate steps.
   
   This is the structure the csv crate is using per row:
   
   ```rust
   struct ByteRecordInner {
   /// The position of this byte record.
   pos: Option,
   /// All fields in this record, stored contiguously.
   fields: Vec,
   /// The number of and location of each field in this record.
   bounds: Bounds,
   }
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

2021-01-04 Thread GitBox



Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies whether 
strings are utf8. It adds a bit of overhead, but the utf8 checking itself is 
not much for now, it is mostly the logic surrounding `StringRecord` that adds 
the most overhead.
   I think eventually we could use a `StringArray` or `BinaryArray` as buffer 
so we can remove the `StringRecords` which is internally a `Vec` (by using 
`ByteRecord`) and a `Vec` for the rows.
   
   The current performance penalty between master and this branch currently is 
~10% as we introduce an extra intermediate step which I think could be more 
than compensated for by removing the `StringRecord` abstraction, and trying to 
write to a string or binary array without intermediate steps.
   
   ```rust
   struct ByteRecordInner {
   /// The position of this byte record.
   pos: Option,
   /// All fields in this record, stored contiguously.
   fields: Vec,
   /// The number of and location of each field in this record.
   bounds: Bounds,
   }
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

2021-01-04 Thread GitBox



alamb commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754181158


   I also like the general idea of using the `cast` kernel for consistency. 
Nice idea @Dandandan 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] alamb closed pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

2021-01-04 Thread GitBox



alamb closed pull request #9016:
URL: https://github.com/apache/arrow/pull/9016


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656

2021-01-04 Thread GitBox



alamb commented on a change in pull request #9094:
URL: https://github.com/apache/arrow/pull/9094#discussion_r551527136



##
File path: rust/arrow/src/record_batch.rs
##
@@ -80,6 +80,10 @@ impl RecordBatch {
 Ok(RecordBatch { schema, columns })
 }
 
+/// Creates a [RecordBatch` from a schema and columns, with additional 
options,

Review comment:
   ```suggestion
   /// Creates a [`RecordBatch`] from a schema and columns, with additional 
options,
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

1 2 >

1 - 100 of 170 matches

Mail list logo