[GitHub] [arrow] mrkn commented on a change in pull request #6302: ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors
mrkn commented on a change in pull request #6302: URL: https://github.com/apache/arrow/pull/6302#discussion_r551765015 ## File path: cpp/src/arrow/ipc/generate_tensor_fuzz_corpus.cc ## @@ -0,0 +1,132 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// A command line executable that generates a bunch of valid IPC files +// containing example tensors. Those are used as fuzzing seeds to make +// fuzzing more efficient. + +#include +#include +#include +#include +#include + +#include "arrow/io/file.h" +#include "arrow/io/memory.h" +#include "arrow/ipc/test_common.h" +#include "arrow/ipc/writer.h" +#include "arrow/result.h" +#include "arrow/tensor.h" +#include "arrow/util/io_util.h" + +namespace arrow { +namespace ipc { + +using ::arrow::internal::PlatformFilename; + +Result PrepareDirectory(const std::string& dir) { + ARROW_ASSIGN_OR_RAISE(auto dir_fn, PlatformFilename::FromString(dir)); + RETURN_NOT_OK(::arrow::internal::CreateDir(dir_fn)); + return std::move(dir_fn); +} + +Result> MakeSerializedBuffer( +std::function&)> fn) { + ARROW_ASSIGN_OR_RAISE(auto sink, io::BufferOutputStream::Create(1024)); + RETURN_NOT_OK(fn(sink)); + return sink->Finish(); +} + +Result> SerializeTensor(const std::shared_ptr& tensor) { + return MakeSerializedBuffer( + [&](const std::shared_ptr& sink) -> Status { +int32_t metadata_length; +int64_t body_length; +return ipc::WriteTensor(*tensor, sink.get(), _length, _length); + }); +} + +Result>> Tensors() { + std::vector> tensors; + std::shared_ptr tensor; + std::vector shape = {5, 3, 7}; + std::shared_ptr types[] = {int8(), int16(), int32(), int64(), + uint8(), uint16(), uint32(), uint64()}; + uint32_t seed = 0; + for (auto type : types) { +RETURN_NOT_OK(test::MakeRandomTensor(type, shape, true, , seed++)); Review comment: Fixed. ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -1334,17 +1334,22 @@ Status GetTensorMetadata(const Buffer& metadata, std::shared_ptr* type return Status::IOError("Header-type of flatbuffer-encoded Message is not Tensor."); } - int ndim = static_cast(tensor->shape()->size()); + auto ndim = tensor->shape()->size(); Review comment: Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mrkn commented on a change in pull request #6302: ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors
mrkn commented on a change in pull request #6302: URL: https://github.com/apache/arrow/pull/6302#discussion_r551764667 ## File path: cpp/src/arrow/ipc/test_common.h ## @@ -161,6 +162,11 @@ Status MakeUuid(std::shared_ptr* out); ARROW_TESTING_EXPORT Status MakeDictExtension(std::shared_ptr* out); +ARROW_TESTING_EXPORT +Status MakeRandomTensor(const std::shared_ptr& type, +const std::vector& shape, bool row_major_p, +std::shared_ptr* out, uint32_t seed = 0); Review comment: Because I followed `MakeRandomArray`. It is in `arrow/ipc/test_common.h`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
nevi-me commented on pull request #9093: URL: https://github.com/apache/arrow/pull/9093#issuecomment-754464034 Thanks for the review @jorgecarleitao. I've addressed your queries and comments, and cleaned up the TODOs This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
nevi-me commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551762894 ## File path: rust/arrow/src/array/equal/utils.rs ## @@ -76,3 +80,185 @@ pub(super) fn equal_len( ) -> bool { lhs_values[lhs_start..(lhs_start + len)] == rhs_values[rhs_start..(rhs_start + len)] } + +/// Computes the logical validity bitmap of the array data using the +/// parent's array data. The parent should be a list or struct, else +/// the logical bitmap of the array is returned unaltered. +/// +/// Parent data is passed along with the parent's logical bitmap, as +/// nested arrays could have a logical bitmap different to the physical +/// one on the `ArrayData`. +pub(super) fn child_logical_null_buffer( +parent_data: , +logical_null_buffer: Option, Review comment: @alamb @jorgecarleitao I wanted to make this `Option<>` to avoid cloning, but because I create a `Bitmap` for `parent_bitmap` and `self_null_bitmap` , I have to end up cloning the ``. So it's extra work to change the signature, and probably doesn't yield any benefit. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
nevi-me commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551762193 ## File path: rust/arrow/src/array/equal/structure.rs ## @@ -37,39 +37,20 @@ fn equal_values( rhs_start: usize, len: usize, ) -> bool { -let mut temp_lhs: Option = None; -let mut temp_rhs: Option = None; - lhs.child_data() .iter() .zip(rhs.child_data()) .all(|(lhs_values, rhs_values)| { // merge the null data -let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) { Review comment: @jorgecarleitao I thought you were referring to this part of the code. I removed the one in `mod.rs` because I found that it was computing a duplicate when dealing with just primitive arrays. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
jorgecarleitao commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551746160 ## File path: rust/arrow/src/array/equal/mod.rs ## @@ -146,118 +146,103 @@ fn equal_values( rhs_start: usize, len: usize, ) -> bool { -// compute the nested buffer of the parent and child -// if the array has no parent, the child is computed with itself -#[allow(unused_assignments)] -let mut temp_lhs: Option = None; -#[allow(unused_assignments)] -let mut temp_rhs: Option = None; -let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) { -(None, None) => None, -(None, Some(c)) => Some(c), -(Some(p), None) => Some(p), -(Some(p), Some(c)) => { -let merged = (p & c).unwrap(); -temp_lhs = Some(merged); -temp_lhs.as_ref() -} -}; Review comment: Well, if it does not work it does not work and we scratch it ^_^ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
nevi-me commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551745156 ## File path: rust/arrow/src/array/equal/mod.rs ## @@ -146,118 +146,103 @@ fn equal_values( rhs_start: usize, len: usize, ) -> bool { -// compute the nested buffer of the parent and child -// if the array has no parent, the child is computed with itself -#[allow(unused_assignments)] -let mut temp_lhs: Option = None; -#[allow(unused_assignments)] -let mut temp_rhs: Option = None; -let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) { -(None, None) => None, -(None, Some(c)) => Some(c), -(Some(p), None) => Some(p), -(Some(p), Some(c)) => { -let merged = (p & c).unwrap(); -temp_lhs = Some(merged); -temp_lhs.as_ref() -} -}; Review comment: Interestingly, I found an issue with this while fixing the failing integration tests. See my latest commit https://github.com/apache/arrow/pull/9093/commits/66bb98e240f9ca49a7abe2a00ce83a117a62b426. I'll follow up with changes for your suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
jorgecarleitao commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551739520 ## File path: rust/arrow/src/array/data.rs ## @@ -136,6 +137,84 @@ impl ArrayData { _bitmap } +/// Computes the logical validity bitmap of the array data using the +/// parent's array data. The parent should be a list or struct, else +/// the logical bitmap of the array is returned unaltered. +/// +/// Parent data is passed along with the parent's logical bitmap, as +/// nested arrays could have a logical bitmap different to the physical +/// one on the `ArrayData`. +/// +/// Safety +/// +/// As we index into [`ArrayData::child_data`], this function panics if +/// array data is not a nested type, as it will not have child data. +pub fn child_logical_null_buffer( Review comment: I would have placed this function outside of `impl ArrayData`, add the corresponding arguments, and place it under `equal/utils.rs` for now. Alternatively, use `pub(crate)` to not make this public. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
jorgecarleitao commented on a change in pull request #9093: URL: https://github.com/apache/arrow/pull/9093#discussion_r551738974 ## File path: rust/arrow/src/array/equal/structure.rs ## @@ -55,21 +56,24 @@ fn equal_values( temp_lhs.as_ref() } }; -let rhs_merged_nulls = match (rhs_nulls, rhs_values.null_buffer()) { -(None, None) => None, -(None, Some(c)) => Some(c), -(Some(p), None) => Some(p), -(Some(p), Some(c)) => { -let merged = (p & c).unwrap(); -temp_rhs = Some(merged); -temp_rhs.as_ref() -} -}; +// TODO: this is intentional, looking at which is the better option +let rhs_merged_nulls = +rhs.child_logical_null_buffer(rhs_nulls.cloned(), index); Review comment: pass `rhs_values` instead of `index` here? ## File path: rust/arrow/src/array/data.rs ## @@ -136,6 +137,84 @@ impl ArrayData { _bitmap } +/// Computes the logical validity bitmap of the array data using the +/// parent's array data. The parent should be a list or struct, else +/// the logical bitmap of the array is returned unaltered. +/// +/// Parent data is passed along with the parent's logical bitmap, as +/// nested arrays could have a logical bitmap different to the physical +/// one on the `ArrayData`. +/// +/// Safety Review comment: `# safety` is used for `unsafe`, this is `# Panics`. ## File path: rust/arrow/src/array/equal/structure.rs ## @@ -38,12 +38,13 @@ fn equal_values( len: usize, ) -> bool { let mut temp_lhs: Option = None; -let mut temp_rhs: Option = None; +// let mut temp_rhs: Option = None; lhs.child_data() .iter() .zip(rhs.child_data()) -.all(|(lhs_values, rhs_values)| { +.enumerate() Review comment: which allows to remove this `enumerate` here. ## File path: rust/arrow/src/array/equal/primitive.rs ## @@ -32,7 +36,10 @@ pub(super) fn primitive_equal( let lhs_values = ()[0].as_slice()[lhs.offset() * byte_width..]; let rhs_values = ()[0].as_slice()[rhs.offset() * byte_width..]; -if lhs.null_count() == 0 && rhs.null_count() == 0 { +let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len); +let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len); + +if lhs_null_count == 0 && rhs_null_count == 0 { Review comment: This is even better: if the buffer only has 1s outside of the range, we short-circuit to the fast case. ## File path: rust/arrow/src/array/equal/mod.rs ## @@ -146,118 +146,103 @@ fn equal_values( rhs_start: usize, len: usize, ) -> bool { -// compute the nested buffer of the parent and child -// if the array has no parent, the child is computed with itself -#[allow(unused_assignments)] -let mut temp_lhs: Option = None; -#[allow(unused_assignments)] -let mut temp_rhs: Option = None; -let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) { -(None, None) => None, -(None, Some(c)) => Some(c), -(Some(p), None) => Some(p), -(Some(p), Some(c)) => { -let merged = (p & c).unwrap(); -temp_lhs = Some(merged); -temp_lhs.as_ref() -} -}; Review comment: I liked this more. A bit more explicit what was happening here. ## File path: rust/arrow/src/array/data.rs ## @@ -136,6 +137,84 @@ impl ArrayData { _bitmap } +/// Computes the logical validity bitmap of the array data using the +/// parent's array data. The parent should be a list or struct, else +/// the logical bitmap of the array is returned unaltered. +/// +/// Parent data is passed along with the parent's logical bitmap, as +/// nested arrays could have a logical bitmap different to the physical +/// one on the `ArrayData`. +/// +/// Safety +/// +/// As we index into [`ArrayData::child_data`], this function panics if +/// array data is not a nested type, as it will not have child data. +pub fn child_logical_null_buffer( Review comment: I would have placed this function outside of `impl ArrayData`, add the corresponding arguments, and place it under `equal/` for now. Alternatively, use `pub(crate)` to not make this public. ## File path: rust/arrow/src/array/data.rs ## @@ -136,6 +137,84 @@ impl ArrayData { _bitmap } +/// Computes the logical validity bitmap of the array data using the +/// parent's array data. The parent should be a list or struct, else +/// the logical bitmap of the array is returned unaltered. +/// +/// Parent data is passed along with the parent's logical bitmap, as +
[GitHub] [arrow] mqy commented on pull request #9025: ARROW-10259: [Rust] Add custom metadata to Field
mqy commented on pull request #9025: URL: https://github.com/apache/arrow/pull/9025#issuecomment-754421764 > Hey @mqy, I'm back in the city, and have access to my desktop; so I'll be able to review this PR and help you enable integration tests during the week. Welcome back! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for writing parquet
jorgecarleitao commented on pull request #9099: URL: https://github.com/apache/arrow/pull/9099#issuecomment-754386570 Two CI jobs are hanging indefinitely, I canceled them. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] arw2019 commented on pull request #8955: ARROW-9948: [C++] in Decimal128::FromString raise when scale is out of bounds
arw2019 commented on pull request #8955: URL: https://github.com/apache/arrow/pull/8955#issuecomment-754383958 Thanks @pitrou @kiszk for reviewing! Planning to push an update in the next day or so This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter
jorgecarleitao commented on a change in pull request #9086: URL: https://github.com/apache/arrow/pull/9086#discussion_r551702345 ## File path: rust/datafusion/src/physical_plan/filter.rs ## @@ -103,25 +103,23 @@ impl ExecutionPlan for FilterExec { } async fn execute(, partition: usize) -> Result { -Ok(Box::pin(FilterExecStream { -schema: self.input.schema().clone(), -predicate: self.predicate.clone(), -input: self.input.execute(partition).await?, -})) +let predicate = self.predicate.clone(); + +let stream = self.input.execute(partition).await?; +let stream = stream.then(move |batch| { +let predicate = predicate.clone(); +async move { +// Filtering batches is CPU-bounded and therefore justifies a dedicated thread pool +task::spawn_blocking(move || batch_filter(?, )) +.await +.unwrap() +} +}); + +Ok(Box::pin(SizedRecordBatchStream::new(stream, self.schema( Review comment: no, but this did not happen in master either: a stream does not ask for the next item until the current item is completed. We get no parallelism whatsoever from streams, only concurrency (because they yield). E.g. https://docs.rs/parallel-stream/2.1.2/parallel_stream/#differences-with-rayon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551694880 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); +}; + +class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool { + public: + explicit ReservationListenableMemoryPool(MemoryPool* pool, + std::shared_ptr listener, Review comment: Would you suggest `unique_ptr` here? I thought that ReservationListener instances may be able to share over pools. For example a listener simply for logging reservations to std out. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551694096 ## File path: cpp/src/arrow/memory_pool.cc ## @@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return impl_->max_memory(); } std::string ProxyMemoryPool::backend_name() const { return impl_->backend_name(); } +ReservationListener::~ReservationListener() {} + +ReservationListener::ReservationListener() {} + +class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl { + public: + explicit ReservationListenableMemoryPoolImpl( + MemoryPool* pool, std::shared_ptr listener, int64_t block_size) + : pool_(pool), +listener_(listener), +block_size_(block_size), +blocks_reserved_(0), +bytes_reserved_(0) {} + + Status Allocate(int64_t size, uint8_t** out) { +RETURN_NOT_OK(UpdateReservation(size)); +Status error = pool_->Allocate(size, out); +if (!error.ok()) { + RETURN_NOT_OK(UpdateReservation(-size)); + return error; +} +return Status::OK(); + } + + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { +bool reserved = false; +int64_t diff = new_size - old_size; +if (new_size >= old_size) { + RETURN_NOT_OK(UpdateReservation(diff)); + reserved = true; Review comment: Done, thanks for the suggestion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551693971 ## File path: cpp/src/arrow/memory_pool.cc ## @@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return impl_->max_memory(); } std::string ProxyMemoryPool::backend_name() const { return impl_->backend_name(); } +ReservationListener::~ReservationListener() {} Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551693725 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { Review comment: added ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); Review comment: Done. Thanks for the suggestion. ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); Review comment: done ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); +}; + +class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool { Review comment: added This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms
kou closed pull request #9098: URL: https://github.com/apache/arrow/pull/9098 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types
sunchao commented on a change in pull request #9047: URL: https://github.com/apache/arrow/pull/9047#discussion_r551681563 ## File path: rust/parquet/src/arrow/schema.rs ## @@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> { LogicalType::INT_32 => Ok(DataType::Int32), LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)), LogicalType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)), +LogicalType::DECIMAL => self.to_decimal(), Review comment: np @sweb - thanks for the contribution! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
sunchao commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435 ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: What I was thinking is that, we can have another constructor for `SerializedFileReader` which takes a custom metadata: ```rust pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result { Ok(Self { chunk_reader: Arc::new(chunk_reader), metadata: metadata, }) } ``` and we move the metadata filtering part to data fusion, or a util function in `footer.rs`. In the long term though, I think we should do something similar to [parquet-mr is doing](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656), that is, having a `ParquetReadOptions`-like struct which allows user to specify various configs, properties, filters, etc when reading a parquet file. The struct is extendable as well to accommodate new features in future such as filtering with column indexes or bloom filters, so we don't need to have multiple constructors. The constructor can become like this: ```rust pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result { Ok(Self { chunk_reader: Arc::new(chunk_reader), options: options, }) } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
sunchao commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551680435 ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: What I was thinking is that, we can have another constructor for `SerializedFileReader` which takes a custom metadata: ```rust pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result { Ok(Self { chunk_reader: Arc::new(chunk_reader), metadata: metadata, }) } ``` and we move the metadata filtering part to data fusion, or a util function in `footer.rs`. In the long term though, I think we should do something similar to [parquet-mr is doing](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L656), that is, having a `ParquetReadOptions`-like struct which allows user to specify various configs, properties, filters, etc when reading a parquet file. The struct is extendable as well to accommodate new features in future such as filtering with column indexes or bloom filters. The constructor can become like this: ```rust pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result { Ok(Self { chunk_reader: Arc::new(chunk_reader), options: options, }) } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels
terencehonles commented on pull request #8915: URL: https://github.com/apache/arrow/pull/8915#issuecomment-754339737 @kszucs I noticed there was a list of allowed actions. Could the allow list just be updated or is that enforced at the apache (org) level rather than the repo level? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels
kszucs commented on pull request #8915: URL: https://github.com/apache/arrow/pull/8915#issuecomment-754300248 Yes, the comment bot seems to be failing https://github.com/apache/arrow/actions/runs/462228610 This is most likely caused by the recent security changes, so we need to update the [comment_bot.yml](https://github.com/apache/arrow/blob/master/.github/workflows/comment_bot.yml) action to use verified plugins only. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced
nealrichardson commented on pull request #9097: URL: https://github.com/apache/arrow/pull/9097#issuecomment-754297915 @xhochy autotune isn't working at the moment--INFRA has blocked some of the Actions we use, including on that workflow. See https://issues.apache.org/jira/browse/INFRA-21239 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
nealrichardson closed pull request #9092: URL: https://github.com/apache/arrow/pull/9092 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551621749 ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: there is another possibility - I have just noticed `FilePageIterator::with_row_groups` which could be used to filter row groups based on a list of row group indexes; this could replace the `filter_row_groups` method but would require the row group indexes to be passed down all the way to `build_for_primitive_type_inner` where `FilePageIterator` is created; this could be done through a new field in `ArrayReaderBuilderContext`. It's a deeper change but would mean that `filter_row_groups` method is no longer necessary. @sunchao do you think this would be a better way to go about filtering of row groups? I am not sure the complexity is worth it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551621749 ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: there is another possibility - I have just noticed `FilePageIterator::with_row_groups` which could be used to filter row groups based on a list of row group indexes; this could replace the `filter_row_groups` method but would require the row group indexes to be passed down all the way to `build_for_primitive_type_inner` where `FilePageIterator` is created; this could be done through a new field in `ArrayReaderBuilderContext`. It's a deeper change but would mean that `filter_row_groups` method is no longer necessary. @sunchao do you think this would be a better way to go about filtering of row groups? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9024: ARROW-11044: [C++] Add "replace" kernel
emkornfield commented on a change in pull request #9024: URL: https://github.com/apache/arrow/pull/9024#discussion_r551619381 ## File path: cpp/src/arrow/compute/kernels/scalar_replace.cc ## @@ -0,0 +1,309 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/common.h" +#include "arrow/util/bit_block_counter.h" +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" + +namespace arrow { + +using internal::BitBlockCount; Review comment: use full imports and put them inside the anonymous namespace. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels
terencehonles commented on pull request #8915: URL: https://github.com/apache/arrow/pull/8915#issuecomment-754272662 @kou / @kszucs you may already know but it looks like the crossbow submitter / github actions bot isn't working(?). It doesn't look like multibuild has tasks for GitHub Actions, but I wanted to see how far the Travis OS X commands would get. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-754270017 @zhztheplayer I need to take a closer look at the JNI changes since this was last approved, and will take another look at the memory stuff once you've added some docs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656
alamb closed pull request #9094: URL: https://github.com/apache/arrow/pull/9094 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet [WIP]
github-actions[bot] commented on pull request #9099: URL: https://github.com/apache/arrow/pull/9099#issuecomment-754266445 https://issues.apache.org/jira/browse/ARROW-11129 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan opened a new pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet [WIP]
Dandandan opened a new pull request #9099: URL: https://github.com/apache/arrow/pull/9099 Inspired by PR https://github.com/apache/arrow/pull/9086 from @jorgecarleitao Seems better to use `task::spawn_blocking` to set upper limit on nr. of threads (512 by default) using tokio. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551610971 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); +}; + +class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool { + public: + explicit ReservationListenableMemoryPool(MemoryPool* pool, + std::shared_ptr listener, Review comment: why shared ptr? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551610553 ## File path: cpp/src/arrow/memory_pool.cc ## @@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return impl_->max_memory(); } std::string ProxyMemoryPool::backend_name() const { return impl_->backend_name(); } +ReservationListener::~ReservationListener() {} + +ReservationListener::ReservationListener() {} + +class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl { + public: + explicit ReservationListenableMemoryPoolImpl( + MemoryPool* pool, std::shared_ptr listener, int64_t block_size) + : pool_(pool), +listener_(listener), +block_size_(block_size), +blocks_reserved_(0), +bytes_reserved_(0) {} + + Status Allocate(int64_t size, uint8_t** out) { +RETURN_NOT_OK(UpdateReservation(size)); +Status error = pool_->Allocate(size, out); +if (!error.ok()) { + RETURN_NOT_OK(UpdateReservation(-size)); + return error; +} +return Status::OK(); + } + + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { +bool reserved = false; +int64_t diff = new_size - old_size; +if (new_size >= old_size) { + RETURN_NOT_OK(UpdateReservation(diff)); + reserved = true; Review comment: some docs in this method could help readability. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551610223 ## File path: cpp/src/arrow/memory_pool.cc ## @@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return impl_->max_memory(); } std::string ProxyMemoryPool::backend_name() const { return impl_->backend_name(); } +ReservationListener::~ReservationListener() {} Review comment: this and the constructor aren't needed if you use = default as suggested in the header. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551610076 ## File path: cpp/src/arrow/memory_pool.cc ## @@ -534,4 +535,139 @@ int64_t ProxyMemoryPool::max_memory() const { return impl_->max_memory(); } std::string ProxyMemoryPool::backend_name() const { return impl_->backend_name(); } +ReservationListener::~ReservationListener() {} + +ReservationListener::ReservationListener() {} + +class ReservationListenableMemoryPool::ReservationListenableMemoryPoolImpl { + public: + explicit ReservationListenableMemoryPoolImpl( + MemoryPool* pool, std::shared_ptr listener, int64_t block_size) + : pool_(pool), +listener_(listener), +block_size_(block_size), +blocks_reserved_(0), +bytes_reserved_(0) {} + + Status Allocate(int64_t size, uint8_t** out) { +RETURN_NOT_OK(UpdateReservation(size)); +Status error = pool_->Allocate(size, out); +if (!error.ok()) { + RETURN_NOT_OK(UpdateReservation(-size)); + return error; +} +return Status::OK(); + } + + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { +bool reserved = false; +int64_t diff = new_size - old_size; +if (new_size >= old_size) { + RETURN_NOT_OK(UpdateReservation(diff)); + reserved = true; +} +Status error = pool_->Reallocate(old_size, new_size, ptr); +if (!error.ok()) { + if (reserved) { +RETURN_NOT_OK(UpdateReservation(-diff)); + } + return error; +} +if (!reserved) { + RETURN_NOT_OK(UpdateReservation(diff)); +} +return Status::OK(); + } + + void Free(uint8_t* buffer, int64_t size) { +pool_->Free(buffer, size); +// fixme currently method ::Free doesn't allow Status return Review comment: please open a JIRA or e-mail the mailing list for discussion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551609465 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); Review comment: ```suggestion ReservationListener() = default; ``` ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); + + virtual Status OnReservation(int64_t size) = 0; + virtual Status OnRelease(int64_t size) = 0; + + protected: + ReservationListener(); +}; + +class ARROW_EXPORT ReservationListenableMemoryPool : public MemoryPool { Review comment: documentation please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551609281 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { + public: + virtual ~ReservationListener(); Review comment: ```suggestion virtual ~ReservationListener() = default; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++
emkornfield commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r551609196 ## File path: cpp/src/arrow/memory_pool.h ## @@ -149,6 +149,43 @@ class ARROW_EXPORT ProxyMemoryPool : public MemoryPool { std::unique_ptr impl_; }; +class ARROW_EXPORT ReservationListener { Review comment: documentation please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression
bkietz commented on a change in pull request #8894: URL: https://github.com/apache/arrow/pull/8894#discussion_r551608598 ## File path: cpp/src/arrow/dataset/expression_internal.h ## @@ -0,0 +1,465 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/expression.h" + +#include +#include +#include + +#include "arrow/compute/api_vector.h" +#include "arrow/compute/registry.h" +#include "arrow/record_batch.h" +#include "arrow/table.h" +#include "arrow/util/logging.h" + +namespace arrow { + +using internal::checked_cast; + +namespace dataset { + +bool Identical(const Expression& l, const Expression& r) { return l.impl_ == r.impl_; } + +const Expression::Call* CallNotNull(const Expression& expr) { + auto call = expr.call(); + DCHECK_NE(call, nullptr); + return call; +} + +inline void GetAllFieldRefs(const Expression& expr, +std::unordered_set* refs) { + if (auto lit = expr.literal()) return; + + if (auto ref = expr.field_ref()) { +refs->emplace(*ref); +return; + } + + for (const Expression& arg : CallNotNull(expr)->arguments) { +GetAllFieldRefs(arg, refs); + } +} + +inline std::vector GetDescriptors(const std::vector& exprs) { + std::vector descrs(exprs.size()); + for (size_t i = 0; i < exprs.size(); ++i) { +DCHECK(exprs[i].IsBound()); +descrs[i] = exprs[i].descr(); + } + return descrs; +} + +inline std::vector GetDescriptors(const std::vector& values) { + std::vector descrs(values.size()); + for (size_t i = 0; i < values.size(); ++i) { +descrs[i] = values[i].descr(); + } + return descrs; +} + +struct FieldPathGetDatumImpl { + template ()))> + Result operator()(const std::shared_ptr& ptr) { +return path_.Get(*ptr).template As(); + } + + template + Result operator()(const T&) { +return Status::NotImplemented("FieldPath::Get() into Datum ", datum_.ToString()); + } + + const Datum& datum_; + const FieldPath& path_; +}; + +inline Result GetDatumField(const FieldRef& ref, const Datum& input) { + Datum field; + + FieldPath path; + if (auto type = input.type()) { +ARROW_ASSIGN_OR_RAISE(path, ref.FindOneOrNone(*input.type())); + } else if (input.kind() == Datum::RECORD_BATCH) { +ARROW_ASSIGN_OR_RAISE(path, ref.FindOneOrNone(*input.record_batch()->schema())); + } else if (input.kind() == Datum::TABLE) { +ARROW_ASSIGN_OR_RAISE(path, ref.FindOneOrNone(*input.table()->schema())); + } + + if (path) { +ARROW_ASSIGN_OR_RAISE(field, + util::visit(FieldPathGetDatumImpl{input, path}, input.value)); + } + + if (field == Datum{}) { +field = Datum(std::make_shared()); + } + + return field; +} + +struct Comparison { + enum type { +NA = 0, +EQUAL = 1, +LESS = 2, +GREATER = 4, +NOT_EQUAL = LESS | GREATER, +LESS_EQUAL = LESS | EQUAL, +GREATER_EQUAL = GREATER | EQUAL, + }; + + static const type* Get(const std::string& function) { +static std::unordered_map flipped_comparisons{ +{"equal", EQUAL}, {"not_equal", NOT_EQUAL}, +{"less", LESS}, {"less_equal", LESS_EQUAL}, +{"greater", GREATER}, {"greater_equal", GREATER_EQUAL}, +}; + +auto it = flipped_comparisons.find(function); +return it != flipped_comparisons.end() ? >second : nullptr; + } + + static const type* Get(const Expression& expr) { +if (auto call = expr.call()) { + return Comparison::Get(call->function_name); +} +return nullptr; + } + + // Execute a simple Comparison between scalars, casting the RHS if types disagree + static Result Execute(Datum l, Datum r) { +if (!l.is_scalar() || !r.is_scalar()) { + return Status::Invalid("Cannot Execute Comparison on non-scalars"); +} + +if (!l.type()->Equals(r.type())) { + ARROW_ASSIGN_OR_RAISE(r, compute::Cast(r, l.type())); +} + +std::vector arguments{std::move(l), std::move(r)}; + +ARROW_ASSIGN_OR_RAISE(auto equal, compute::CallFunction("equal", arguments)); + +if (!equal.scalar()->is_valid) return NA; +if (equal.scalar_as().value) return EQUAL; + +ARROW_ASSIGN_OR_RAISE(auto less, compute::CallFunction("less", arguments)); + +if
[GitHub] [arrow] emkornfield closed pull request #8597: ARROW-10492: [Java][JDBC] Allow users to config the mapping between SQL types and Arrow types
emkornfield closed pull request #8597: URL: https://github.com/apache/arrow/pull/8597 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on issue #8987: Are the Date Logical Type and Date Converted Type implemented?
emkornfield commented on issue #8987: URL: https://github.com/apache/arrow/issues/8987#issuecomment-754262638 I believe this was addressed on the mailing list. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield closed issue #8987: Are the Date Logical Type and Date Converted Type implemented?
emkornfield closed issue #8987: URL: https://github.com/apache/arrow/issues/8987 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #8597: ARROW-10492: [Java][JDBC] Allow users to config the mapping between SQL types and Arrow types
emkornfield commented on pull request #8597: URL: https://github.com/apache/arrow/pull/8597#issuecomment-754261654 +1 thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-754259372 Is it possible to add a test to confirm that this can be read/written from the C++ implementation? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r551603469 ## File path: java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java ## @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector.compression; + +import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN; + +import java.nio.ByteBuffer; + +import org.apache.arrow.flatbuf.CompressionType; +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.util.MemoryUtil; +import org.apache.arrow.util.Preconditions; + +import net.jpountz.lz4.LZ4Compressor; +import net.jpountz.lz4.LZ4Factory; +import net.jpountz.lz4.LZ4FastDecompressor; + +/** + * Compression codec for the LZ4 algorithm. + */ +public class Lz4CompressionCodec implements CompressionCodec { + + private static final long SIZE_OF_MESSAGE_LENGTH = 8L; + + private final LZ4Factory factory; + + private LZ4Compressor compressor; + + private LZ4FastDecompressor decompressor; + + public Lz4CompressionCodec() { +factory = LZ4Factory.fastestInstance(); + } + + @Override + public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) { +Preconditions.checkArgument(unCompressedBuffer.writerIndex() <= Integer.MAX_VALUE, +"The uncompressed buffer size exceeds the integer limit"); + +// create compressor lazily Review comment: why? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r551603359 ## File path: java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java ## @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector.compression; + +import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN; + +import java.nio.ByteBuffer; + +import org.apache.arrow.flatbuf.CompressionType; +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.util.MemoryUtil; +import org.apache.arrow.util.Preconditions; + +import net.jpountz.lz4.LZ4Compressor; +import net.jpountz.lz4.LZ4Factory; +import net.jpountz.lz4.LZ4FastDecompressor; + +/** + * Compression codec for the LZ4 algorithm. + */ +public class Lz4CompressionCodec implements CompressionCodec { + + private static final long SIZE_OF_MESSAGE_LENGTH = 8L; + + private final LZ4Factory factory; + + private LZ4Compressor compressor; + + private LZ4FastDecompressor decompressor; + + public Lz4CompressionCodec() { +factory = LZ4Factory.fastestInstance(); + } + + @Override + public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) { Review comment: ? or is this consistent with the existing API? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r551603225 ## File path: java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java ## @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector.compression; + +import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN; + +import java.nio.ByteBuffer; + +import org.apache.arrow.flatbuf.CompressionType; +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.util.MemoryUtil; +import org.apache.arrow.util.Preconditions; + +import net.jpountz.lz4.LZ4Compressor; +import net.jpountz.lz4.LZ4Factory; +import net.jpountz.lz4.LZ4FastDecompressor; + +/** + * Compression codec for the LZ4 algorithm. + */ +public class Lz4CompressionCodec implements CompressionCodec { + + private static final long SIZE_OF_MESSAGE_LENGTH = 8L; + + private final LZ4Factory factory; + + private LZ4Compressor compressor; + + private LZ4FastDecompressor decompressor; + + public Lz4CompressionCodec() { +factory = LZ4Factory.fastestInstance(); + } + + @Override + public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) { Review comment: ```suggestion public ArrowBuf compress(BufferAllocator allocator, ArrowBuf uncompressedBuffer) { ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r551603003 ## File path: java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java ## @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector.compression; + +import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN; + +import java.nio.ByteBuffer; + +import org.apache.arrow.flatbuf.CompressionType; +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.util.MemoryUtil; +import org.apache.arrow.util.Preconditions; + +import net.jpountz.lz4.LZ4Compressor; Review comment: How was this library chosen? It looks like it might not have been released in a while? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551597243 ## File path: rust/datafusion/src/physical_plan/parquet.rs ## @@ -209,6 +251,477 @@ impl ParquetExec { } } +#[derive(Debug, Clone)] +/// Predicate builder used for generating of predicate functions, used to filter row group metadata +pub struct PredicateExpressionBuilder { +parquet_schema: Schema, +predicate_expr: Arc, +stat_column_req: Vec<(String, StatisticsType, Field)>, +} + +impl PredicateExpressionBuilder { +/// Try to create a new instance of PredicateExpressionBuilder +pub fn try_new(expr: , parquet_schema: Schema) -> Result { +// build predicate expression once +let mut stat_column_req = Vec::<(String, StatisticsType, Field)>::new(); +let predicate_expr = +build_predicate_expression(expr, _schema, stat_column_req)?; + +Ok(Self { +parquet_schema, +predicate_expr, +stat_column_req, +}) +} + +/// Generate a predicate function used to filter row group metadata +pub fn build_row_group_predicate( +, +row_group_metadata: &[RowGroupMetaData], +) -> Box bool> { +// build statistics record batch +let predicate_result = build_row_group_record_batch( +row_group_metadata, +_schema, +_column_req, +) +.and_then(|statistics_batch| { +// execute predicate expression +self.predicate_expr.evaluate(_batch) +}) +.and_then(|v| match v { +ColumnarValue::Array(array) => Ok(array), +ColumnarValue::Scalar(_) => Err(DataFusionError::Plan( +"predicate expression didn't return an array".to_string(), +)), +}); + +let predicate_array = match predicate_result { +Ok(array) => array, +_ => return Box::new(|_r, _i| true), +}; + +let predicate_array = predicate_array.as_any().downcast_ref::(); +match predicate_array { +// return row group predicate function +Some(array) => { +let predicate_values = +array.iter().map(|x| x.unwrap_or(false)).collect::>(); +Box::new(move |_, i| predicate_values[i]) +} +// predicate result is not a BooleanArray +_ => Box::new(|_r, _i| true), +} +} +} + +fn build_row_group_record_batch( +row_groups: &[RowGroupMetaData], +parquet_schema: , +stat_column_req: <(String, StatisticsType, Field)>, +) -> Result { +let mut fields = Vecnew(); +let mut arrays = Vecnew(); +for (column_name, statistics_type, stat_field) in stat_column_req { +if let Some((column_index, _)) = parquet_schema.column_with_name(column_name) { +let statistics = row_groups +.iter() +.map(|g| g.column(column_index).statistics()) +.collect::>(); +let array = build_statistics_array( +, +*statistics_type, +stat_field.data_type(), +); +fields.push(stat_field.clone()); +arrays.push(array); +} +} +let schema = Arc::new(Schema::new(fields)); +RecordBatch::try_new(schema, arrays) +.map_err(|err| DataFusionError::Plan(err.to_string())) +} + +struct PhysicalExpressionBuilder<'a> { +column_name: String, +column_expr: &'a Expr, +scalar_expr: &'a Expr, +parquet_field: &'a Field, +statistics_fields: Vec, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +reverse_operator: bool, +} + +impl<'a> PhysicalExpressionBuilder<'a> { +fn try_new( +left: &'a Expr, +right: &'a Expr, +parquet_schema: &'a Schema, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +) -> Result { +// find column name; input could be a more complicated expression +let mut left_columns = HashSetnew(); +utils::expr_to_column_names(left, left_columns)?; +let mut right_columns = HashSetnew(); +utils::expr_to_column_names(right, right_columns)?; +let (column_expr, scalar_expr, column_names, reverse_operator) = +match (left_columns.len(), right_columns.len()) { +(1, 0) => (left, right, left_columns, false), +(0, 1) => (right, left, right_columns, true), +_ => { +// if more than one column used in expression - not supported +return Err(DataFusionError::Plan( +"Multi-column expressions are not currently supported" +.to_string(), +)); +} +}; +let column_name =
[GitHub] [arrow] github-actions[bot] commented on pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms
github-actions[bot] commented on pull request #9098: URL: https://github.com/apache/arrow/pull/9098#issuecomment-754256502 https://issues.apache.org/jira/browse/ARROW-11127 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] xhochy commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced
xhochy commented on pull request #9097: URL: https://github.com/apache/arrow/pull/9097#issuecomment-754256374 @github-actions autotune This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #9053: ARROW-11081: [Java] Make IPC option immutable
emkornfield commented on pull request #9053: URL: https://github.com/apache/arrow/pull/9053#issuecomment-754256316 @liyafan82 does this actually make a difference in benchmarks? I agree it is easier to reason about, but is there any way to avoid backward incompability? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] xhochy opened a new pull request #9098: ARROW-11127: [C++] ifdef unused cpu_info on non-x86 platforms
xhochy opened a new pull request #9098: URL: https://github.com/apache/arrow/pull/9098 Otherwise we get an error in debug builds because of `-Werror,Wunused` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable
emkornfield commented on a change in pull request #9053: URL: https://github.com/apache/arrow/pull/9053#discussion_r551601013 ## File path: java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestMetadataVersion.java ## @@ -56,9 +56,8 @@ public static void setUpClass() { schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true; unionSchema = new Schema( Collections.singletonList(Field.nullable("union", new ArrowType.Union(UnionMode.Dense, new int[]{0}; -optionV4 = new IpcOption(); -optionV4.metadataVersion = MetadataVersion.V4; -optionV5 = new IpcOption(); +optionV4 = new IpcOption(false, MetadataVersion.V4); Review comment: please comment the literal. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable
emkornfield commented on a change in pull request #9053: URL: https://github.com/apache/arrow/pull/9053#discussion_r551600824 ## File path: java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java ## @@ -194,10 +194,9 @@ public ArrowMessage(FlightDescriptor descriptor) { private ArrowMessage(FlightDescriptor descriptor, MessageMetadataResult message, ArrowBuf appMetadata, ArrowBuf buf) { // No need to take IpcOption as this is used for deserialized ArrowMessage coming from the wire. -this.writeOption = new IpcOption(); -if (message != null) { - this.writeOption.metadataVersion = MetadataVersion.fromFlatbufID(message.getMessage().version()); -} +this.writeOption = message != null ? +new IpcOption(false, MetadataVersion.fromFlatbufID(message.getMessage().version())) : Review comment: please comment the literal. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced
emkornfield commented on pull request #9097: URL: https://github.com/apache/arrow/pull/9097#issuecomment-754254123 Is it possible to add a unit test? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on pull request #9096: [Python][Packaging] Refactor manylinux and windows wheel building [WIP]
kou commented on pull request #9096: URL: https://github.com/apache/arrow/pull/9096#issuecomment-754252075 Related works: * https://github.com/apache/arrow/pull/8386#issuecomment-728321361 * #8881 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551597243 ## File path: rust/datafusion/src/physical_plan/parquet.rs ## @@ -209,6 +251,477 @@ impl ParquetExec { } } +#[derive(Debug, Clone)] +/// Predicate builder used for generating of predicate functions, used to filter row group metadata +pub struct PredicateExpressionBuilder { +parquet_schema: Schema, +predicate_expr: Arc, +stat_column_req: Vec<(String, StatisticsType, Field)>, +} + +impl PredicateExpressionBuilder { +/// Try to create a new instance of PredicateExpressionBuilder +pub fn try_new(expr: , parquet_schema: Schema) -> Result { +// build predicate expression once +let mut stat_column_req = Vec::<(String, StatisticsType, Field)>::new(); +let predicate_expr = +build_predicate_expression(expr, _schema, stat_column_req)?; + +Ok(Self { +parquet_schema, +predicate_expr, +stat_column_req, +}) +} + +/// Generate a predicate function used to filter row group metadata +pub fn build_row_group_predicate( +, +row_group_metadata: &[RowGroupMetaData], +) -> Box bool> { +// build statistics record batch +let predicate_result = build_row_group_record_batch( +row_group_metadata, +_schema, +_column_req, +) +.and_then(|statistics_batch| { +// execute predicate expression +self.predicate_expr.evaluate(_batch) +}) +.and_then(|v| match v { +ColumnarValue::Array(array) => Ok(array), +ColumnarValue::Scalar(_) => Err(DataFusionError::Plan( +"predicate expression didn't return an array".to_string(), +)), +}); + +let predicate_array = match predicate_result { +Ok(array) => array, +_ => return Box::new(|_r, _i| true), +}; + +let predicate_array = predicate_array.as_any().downcast_ref::(); +match predicate_array { +// return row group predicate function +Some(array) => { +let predicate_values = +array.iter().map(|x| x.unwrap_or(false)).collect::>(); +Box::new(move |_, i| predicate_values[i]) +} +// predicate result is not a BooleanArray +_ => Box::new(|_r, _i| true), +} +} +} + +fn build_row_group_record_batch( +row_groups: &[RowGroupMetaData], +parquet_schema: , +stat_column_req: <(String, StatisticsType, Field)>, +) -> Result { +let mut fields = Vecnew(); +let mut arrays = Vecnew(); +for (column_name, statistics_type, stat_field) in stat_column_req { +if let Some((column_index, _)) = parquet_schema.column_with_name(column_name) { +let statistics = row_groups +.iter() +.map(|g| g.column(column_index).statistics()) +.collect::>(); +let array = build_statistics_array( +, +*statistics_type, +stat_field.data_type(), +); +fields.push(stat_field.clone()); +arrays.push(array); +} +} +let schema = Arc::new(Schema::new(fields)); +RecordBatch::try_new(schema, arrays) +.map_err(|err| DataFusionError::Plan(err.to_string())) +} + +struct PhysicalExpressionBuilder<'a> { +column_name: String, +column_expr: &'a Expr, +scalar_expr: &'a Expr, +parquet_field: &'a Field, +statistics_fields: Vec, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +reverse_operator: bool, +} + +impl<'a> PhysicalExpressionBuilder<'a> { +fn try_new( +left: &'a Expr, +right: &'a Expr, +parquet_schema: &'a Schema, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +) -> Result { +// find column name; input could be a more complicated expression +let mut left_columns = HashSetnew(); +utils::expr_to_column_names(left, left_columns)?; +let mut right_columns = HashSetnew(); +utils::expr_to_column_names(right, right_columns)?; +let (column_expr, scalar_expr, column_names, reverse_operator) = +match (left_columns.len(), right_columns.len()) { +(1, 0) => (left, right, left_columns, false), +(0, 1) => (right, left, right_columns, true), +_ => { +// if more than one column used in expression - not supported +return Err(DataFusionError::Plan( +"Multi-column expressions are not currently supported" +.to_string(), +)); +} +}; +let column_name =
[GitHub] [arrow] nealrichardson commented on pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
nealrichardson commented on pull request #8549: URL: https://github.com/apache/arrow/pull/8549#issuecomment-754251796 > This increases again the weight of the metadata, which now has to include attributes for each element of a list column How much does it blow up the metadata? Is this going to scale to be able to handle normal/large shapefiles? Is there a more efficient representation for this metadata (considering that we have to serialize it to a string)? Should we special-case `sf` objects? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jonkeane commented on pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
jonkeane commented on pull request #9092: URL: https://github.com/apache/arrow/pull/9092#issuecomment-754243679 Updated, passing actions on my fork: https://github.com/jonkeane/arrow/actions/runs/461948379 https://github.com/jonkeane/arrow/actions/runs/461948380 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551588681 ## File path: rust/datafusion/src/physical_plan/parquet.rs ## @@ -209,6 +251,477 @@ impl ParquetExec { } } +#[derive(Debug, Clone)] +/// Predicate builder used for generating of predicate functions, used to filter row group metadata +pub struct PredicateExpressionBuilder { +parquet_schema: Schema, +predicate_expr: Arc, +stat_column_req: Vec<(String, StatisticsType, Field)>, +} + +impl PredicateExpressionBuilder { +/// Try to create a new instance of PredicateExpressionBuilder +pub fn try_new(expr: , parquet_schema: Schema) -> Result { +// build predicate expression once +let mut stat_column_req = Vec::<(String, StatisticsType, Field)>::new(); +let predicate_expr = +build_predicate_expression(expr, _schema, stat_column_req)?; + +Ok(Self { +parquet_schema, +predicate_expr, +stat_column_req, +}) +} + +/// Generate a predicate function used to filter row group metadata +pub fn build_row_group_predicate( +, +row_group_metadata: &[RowGroupMetaData], +) -> Box bool> { +// build statistics record batch +let predicate_result = build_row_group_record_batch( +row_group_metadata, +_schema, +_column_req, +) +.and_then(|statistics_batch| { +// execute predicate expression +self.predicate_expr.evaluate(_batch) +}) +.and_then(|v| match v { +ColumnarValue::Array(array) => Ok(array), +ColumnarValue::Scalar(_) => Err(DataFusionError::Plan( +"predicate expression didn't return an array".to_string(), +)), +}); + +let predicate_array = match predicate_result { +Ok(array) => array, +_ => return Box::new(|_r, _i| true), +}; + +let predicate_array = predicate_array.as_any().downcast_ref::(); +match predicate_array { +// return row group predicate function +Some(array) => { +let predicate_values = +array.iter().map(|x| x.unwrap_or(false)).collect::>(); +Box::new(move |_, i| predicate_values[i]) +} +// predicate result is not a BooleanArray +_ => Box::new(|_r, _i| true), +} +} +} + +fn build_row_group_record_batch( +row_groups: &[RowGroupMetaData], +parquet_schema: , +stat_column_req: <(String, StatisticsType, Field)>, +) -> Result { +let mut fields = Vecnew(); +let mut arrays = Vecnew(); +for (column_name, statistics_type, stat_field) in stat_column_req { +if let Some((column_index, _)) = parquet_schema.column_with_name(column_name) { +let statistics = row_groups +.iter() +.map(|g| g.column(column_index).statistics()) +.collect::>(); +let array = build_statistics_array( +, +*statistics_type, +stat_field.data_type(), +); +fields.push(stat_field.clone()); +arrays.push(array); +} +} +let schema = Arc::new(Schema::new(fields)); +RecordBatch::try_new(schema, arrays) +.map_err(|err| DataFusionError::Plan(err.to_string())) +} + +struct PhysicalExpressionBuilder<'a> { +column_name: String, +column_expr: &'a Expr, +scalar_expr: &'a Expr, +parquet_field: &'a Field, +statistics_fields: Vec, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +reverse_operator: bool, +} + +impl<'a> PhysicalExpressionBuilder<'a> { +fn try_new( +left: &'a Expr, +right: &'a Expr, +parquet_schema: &'a Schema, +stat_column_req: &'a mut Vec<(String, StatisticsType, Field)>, +) -> Result { +// find column name; input could be a more complicated expression +let mut left_columns = HashSetnew(); +utils::expr_to_column_names(left, left_columns)?; +let mut right_columns = HashSetnew(); +utils::expr_to_column_names(right, right_columns)?; +let (column_expr, scalar_expr, column_names, reverse_operator) = +match (left_columns.len(), right_columns.len()) { +(1, 0) => (left, right, left_columns, false), +(0, 1) => (right, left, right_columns, true), +_ => { +// if more than one column used in expression - not supported +return Err(DataFusionError::Plan( +"Multi-column expressions are not currently supported" +.to_string(), +)); +} +}; +let column_name =
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551587504 ## File path: rust/datafusion/src/physical_plan/parquet.rs ## @@ -209,6 +251,477 @@ impl ParquetExec { } } +#[derive(Debug, Clone)] +/// Predicate builder used for generating of predicate functions, used to filter row group metadata +pub struct PredicateExpressionBuilder { +parquet_schema: Schema, +predicate_expr: Arc, +stat_column_req: Vec<(String, StatisticsType, Field)>, +} + +impl PredicateExpressionBuilder { +/// Try to create a new instance of PredicateExpressionBuilder +pub fn try_new(expr: , parquet_schema: Schema) -> Result { +// build predicate expression once +let mut stat_column_req = Vec::<(String, StatisticsType, Field)>::new(); +let predicate_expr = +build_predicate_expression(expr, _schema, stat_column_req)?; + +Ok(Self { +parquet_schema, +predicate_expr, +stat_column_req, +}) +} + +/// Generate a predicate function used to filter row group metadata +pub fn build_row_group_predicate( +, +row_group_metadata: &[RowGroupMetaData], +) -> Box bool> { +// build statistics record batch +let predicate_result = build_row_group_record_batch( +row_group_metadata, +_schema, +_column_req, +) +.and_then(|statistics_batch| { +// execute predicate expression +self.predicate_expr.evaluate(_batch) +}) +.and_then(|v| match v { +ColumnarValue::Array(array) => Ok(array), +ColumnarValue::Scalar(_) => Err(DataFusionError::Plan( +"predicate expression didn't return an array".to_string(), +)), +}); + +let predicate_array = match predicate_result { +Ok(array) => array, +_ => return Box::new(|_r, _i| true), +}; + +let predicate_array = predicate_array.as_any().downcast_ref::(); +match predicate_array { +// return row group predicate function +Some(array) => { +let predicate_values = +array.iter().map(|x| x.unwrap_or(false)).collect::>(); +Box::new(move |_, i| predicate_values[i]) +} +// predicate result is not a BooleanArray +_ => Box::new(|_r, _i| true), Review comment: My thinking in designing this has been that pushing the predicate down to parquet is optional, because even if it fails the query will still compute, just slower; because of that the code tries to avoid panicking and instead returns a predicate which returns true - it doesn't filter any row groups and lets them be processed by downstream operators. It is even possible to have a partial predicate expression, where multiple conditions are joined using a logical `AND`, and some of them can't be translated for some reason to physical expressions, they will be replaced by `true`, but the rest will still be evaluated and could still filter some row groups. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9089: ARROW-11122: [Rust] Added FFI support for date and time.
nevi-me commented on a change in pull request #9089: URL: https://github.com/apache/arrow/pull/9089#discussion_r551586245 ## File path: rust/arrow-pyarrow-integration-testing/src/lib.rs ## @@ -153,10 +153,25 @@ fn substring(array: PyObject, start: i64, py: Python) -> PyResult { to_py(array, py) } +/// Returns the concatenate Review comment: Is the doc comment complete? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me closed pull request #9065: ARROW-11096: [Rust] C data interface for [Large]binary
nevi-me closed pull request #9065: URL: https://github.com/apache/arrow/pull/9065 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551583777 ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: Good point about documentation - will add some shortly. As long as row group metadata is filtered immediately after creating a SerializedFileReader, this approach will work. That's the simplest way I could think of to allow filtering of row groups using statistics metadata; not sure how this could be done within DataFusion itself, because it reads data in batches (of configurable size) which could potentially span multiple row groups; it could be done, but would probably move a lot of complexity into DataFusion which today is nicely abstracted into the parquet library. This would also expose a lot more about the internals of a parquet file format to the outside as the user would have to be aware of row groups rather than just requesting batches of data. May be I misunderstand what you are suggesting? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jonkeane commented on a change in pull request #8549: ARROW-10386 [R]: List column class attributes not preserved in roundtrip
jonkeane commented on a change in pull request #8549: URL: https://github.com/apache/arrow/pull/8549#discussion_r551552588 ## File path: r/R/record-batch.R ## @@ -286,6 +286,20 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ... apply_arrow_r_metadata <- function(x, r_metadata) { tryCatch({ +columns_metadata <- r_metadata$columns +if (is.data.frame(x)) { + if (length(names(x)) && !is.null(columns_metadata)) { +for (name in intersect(names(columns_metadata), names(x))) { + x[[name]] <- apply_arrow_r_metadata(x[[name]], columns_metadata[[name]]) +} + } +} else if(is.list(x) && !inherits(x, "POSIXlt") && !is.null(columns_metadata)) { + x <- map2(x, columns_metadata, function(.x, .y) { +apply_arrow_r_metadata(.x, .y) + }) + x +} Review comment: Could we move `!is.null(columns_metadata)` up to line 290 instead of having it on both 291 and 296? ## File path: r/tests/testthat/test-metadata.R ## @@ -134,3 +134,10 @@ test_that("metadata keeps attribute of top level data frame", { expect_identical(attr(as.data.frame(tab), "foo"), "bar") expect_identical(as.data.frame(tab), df) }) + +test_that("metadata of list elements (ARROW-10386)", { + df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, foo = "bar" + tab <- Table$create(df) + expect_identical(attr(as.data.frame(tab)$x[[1]], "foo"), "bar") + expect_identical(attr(as.data.frame(tab)$x[[2]], "foo"), "bar") Review comment: I wonder if it would be clearer to have different attributes on each item/row to make it super obvious that we're not picking the attributes of the first item/row and copying them for the whole column. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression
bkietz commented on a change in pull request #8894: URL: https://github.com/apache/arrow/pull/8894#discussion_r551578256 ## File path: cpp/src/arrow/compute/cast.cc ## @@ -118,8 +118,86 @@ class CastMetaFunction : public MetaFunction { } // namespace +const FunctionDoc struct_doc{"Wrap Arrays into a StructArray", + ("Names of the StructArray's fields are\n" + "specified through StructOptions."), + {}, + "StructOptions"}; + +Result StructResolve(KernelContext* ctx, + const std::vector& descrs) { + const auto& names = OptionsWrapper::Get(ctx).field_names; + if (names.size() != descrs.size()) { +return Status::Invalid("Struct() was passed ", names.size(), " field ", "names but ", + descrs.size(), " arguments"); + } + + size_t i = 0; + FieldVector fields(descrs.size()); + + ValueDescr::Shape shape = ValueDescr::SCALAR; + for (const ValueDescr& descr : descrs) { +if (descr.shape != ValueDescr::SCALAR) { + shape = ValueDescr::ARRAY; +} else { + switch (descr.type->id()) { +case Type::EXTENSION: Review comment: ExtensionScalar hasn't been implemented yet This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)
codecov-io edited a comment on pull request #9038: URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531 # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=h1) Report > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=desc) (7a44a99) into [master](https://codecov.io/gh/apache/arrow/commit/5db1d2aad0b6aaca6dfe4e01b0afeefb0311d109?el=desc) (5db1d2a) will **decrease** coverage by `0.01%`. > The diff coverage is `78.97%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650=150=pr=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=tree) ```diff @@Coverage Diff @@ ## master#9038 +/- ## == - Coverage 82.60% 82.58% -0.02% == Files 204 204 Lines 5020050533 +333 == + Hits4146741735 +268 - Misses 8733 8798 +65 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=tree) | Coverage Δ | | |---|---|---| | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `53.92% <47.05%> (-0.68%)` | :arrow_down: | | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `59.04% <71.42%> (+0.32%)` | :arrow_up: | | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <77.27%> (+0.02%)` | :arrow_up: | | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <78.53%> (-0.71%)` | :arrow_down: | | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `84.06% <81.81%> (+0.09%)` | :arrow_up: | | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `78.63% <82.53%> (+0.74%)` | :arrow_up: | | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.83% <100.00%> (+<0.01%)` | :arrow_up: | | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `57.37% <0.00%> (+1.19%)` | :arrow_up: | -- [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=continue). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=footer). Last update [5db1d2a...7a44a99](https://codecov.io/gh/apache/arrow/pull/9038?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sweb commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types
sweb commented on a change in pull request #9047: URL: https://github.com/apache/arrow/pull/9047#discussion_r551575727 ## File path: rust/parquet/src/arrow/schema.rs ## @@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> { LogicalType::INT_32 => Ok(DataType::Int32), LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)), LogicalType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)), +LogicalType::DECIMAL => self.to_decimal(), Review comment: @sunchao thank you for your help with this PR. The result is much better than my initial proposal! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression
bkietz commented on a change in pull request #8894: URL: https://github.com/apache/arrow/pull/8894#discussion_r551574884 ## File path: cpp/src/arrow/dataset/expression_internal.h ## @@ -0,0 +1,465 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/expression.h" + +#include +#include +#include + +#include "arrow/compute/api_vector.h" +#include "arrow/compute/registry.h" +#include "arrow/record_batch.h" +#include "arrow/table.h" +#include "arrow/util/logging.h" + +namespace arrow { + +using internal::checked_cast; + +namespace dataset { + +bool Identical(const Expression& l, const Expression& r) { return l.impl_ == r.impl_; } + +const Expression::Call* CallNotNull(const Expression& expr) { + auto call = expr.call(); + DCHECK_NE(call, nullptr); + return call; +} + +inline void GetAllFieldRefs(const Expression& expr, +std::unordered_set* refs) { + if (auto lit = expr.literal()) return; + + if (auto ref = expr.field_ref()) { +refs->emplace(*ref); +return; + } + + for (const Expression& arg : CallNotNull(expr)->arguments) { +GetAllFieldRefs(arg, refs); + } +} + +inline std::vector GetDescriptors(const std::vector& exprs) { + std::vector descrs(exprs.size()); + for (size_t i = 0; i < exprs.size(); ++i) { +DCHECK(exprs[i].IsBound()); +descrs[i] = exprs[i].descr(); + } + return descrs; +} + +inline std::vector GetDescriptors(const std::vector& values) { + std::vector descrs(values.size()); + for (size_t i = 0; i < values.size(); ++i) { +descrs[i] = values[i].descr(); + } + return descrs; +} + +struct FieldPathGetDatumImpl { Review comment: Since this is only valid for some Datum variant members, I think it should remain a special case for datasets at least for now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
yordan-pavlov commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551574592 ## File path: rust/datafusion/src/physical_plan/parquet.rs ## @@ -209,6 +251,479 @@ impl ParquetExec { } } +#[derive(Debug, Clone)] +/// Predicate builder used for generating of predicate functions, used to filter row group metadata +pub struct PredicateExpressionBuilder { Review comment: thinking some more about this, it could be done by moving the creation of PredicateExpressionBuilder from `ParquetExec::try_from_files` into `(ExecutionPlan for ParquetExec)::execute`, but then this work would be repeated for each partition, where as currently it's only done once; at this point I don't think it's worth it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)
seddonm1 commented on pull request #9038: URL: https://github.com/apache/arrow/pull/9038#issuecomment-754223732 Ok, I have done a major refactor against a rebased master. I believe this now meets the ANSI behavior with regard to `NULL` handling but it does not yet support syntax where columns are referenced in the `list` parameter like: ```sql SELECT TRUE IN (col1, col2, FALSE) ``` This has been implemented with a "make it work then make it fast" approach as this `InList` expression actually has a lot of complexity. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced
github-actions[bot] commented on pull request #9097: URL: https://github.com/apache/arrow/pull/9097#issuecomment-754222849 https://issues.apache.org/jira/browse/ARROW-10881 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] xhochy opened a new pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced
xhochy opened a new pull request #9097: URL: https://github.com/apache/arrow/pull/9097 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] terencehonles commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels
terencehonles commented on pull request #8915: URL: https://github.com/apache/arrow/pull/8915#issuecomment-754216137 @github-actions crossbow submit wheel-osx-*-cp39 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] johncassil commented on pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
johncassil commented on pull request #8365: URL: https://github.com/apache/arrow/pull/8365#issuecomment-754215072 @jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois Just wanted to say thanks for working on this. I reported it a long time ago and have just been periodically watching the developments slowly progress. I'm excited to see that there will be a resolution! Cheers! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types
alamb closed pull request #9047: URL: https://github.com/apache/arrow/pull/9047 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #8927: ARROW-10766: [Rust] [Parquet] Nested List IO
nevi-me commented on pull request #8927: URL: https://github.com/apache/arrow/pull/8927#issuecomment-754211680 I got stalled with #9093. I think it's the last blocker before I can complete this :( This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jonkeane commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
jonkeane commented on a change in pull request #9092: URL: https://github.com/apache/arrow/pull/9092#discussion_r551560380 ## File path: r/R/record-batch.R ## @@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ... } .serialize_arrow_r_metadata <- function(x) { + # drop problems attributes (most likely from readr) + if ("attributes" %in% names(x) && + "problems" %in% names(x[["attributes"]]) ) { +x[["attributes"]][["problems"]] <- NULL + } + Review comment: Ok, cool — I'll rewrite the second test to avoid using `.serialize_arrow_r_metadata()` so that we don't error there, but still keep the testing-serialized-data aspect of that test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
nealrichardson closed pull request #8365: URL: https://github.com/apache/arrow/pull/8365 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
sunchao commented on a change in pull request #9064: URL: https://github.com/apache/arrow/pull/9064#discussion_r551558307 ## File path: rust/datafusion/src/datasource/parquet.rs ## @@ -62,17 +64,37 @@ impl TableProvider for ParquetTable { self.schema.clone() } +fn supports_filter_pushdown( +, +_filter: , +) -> Result { +Ok(TableProviderFilterPushDown::Inexact) +} + /// Scan the file(s), using the provided projection, and return one BatchIterator per /// partition. fn scan( , projection: >, batch_size: usize, -_filters: &[Expr], +filters: &[Expr], Review comment: Ideally, I feel we should have a proper filter API defined in data fusion which can be shared among various data sources. On the other hand, the actual filtering logic should be implemented by different data sources / formats, probably via converting the data fusion's filter API to the corresponding ones from the latter. But this is a very good start and we can probably do them as follow ups (if we don't care much for API changes). ## File path: rust/parquet/src/file/serialized_reader.rs ## @@ -137,6 +137,22 @@ impl SerializedFileReader { metadata, }) } + +pub fn filter_row_groups( Review comment: Yeah I think we can either move this to the application layer (i.e., data fusion), or expose it as a util function from `footer.rs`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser
nevi-me commented on pull request #9090: URL: https://github.com/apache/arrow/pull/9090#issuecomment-754208396 > * Implement cast that returns an error on parsing errors instead of null See https://issues.apache.org/jira/browse/ARROW-7364 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
nealrichardson commented on a change in pull request #9092: URL: https://github.com/apache/arrow/pull/9092#discussion_r551556251 ## File path: r/R/record-batch.R ## @@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ... } .serialize_arrow_r_metadata <- function(x) { + # drop problems attributes (most likely from readr) + if ("attributes" %in% names(x) && + "problems" %in% names(x[["attributes"]]) ) { +x[["attributes"]][["problems"]] <- NULL + } + Review comment: I would have to look into the implementation to be sure, but I think the intent of the first test should be that we don't save garbage `r` metadata and the second would be that, if someone happened to have bad metadata stored in the `r` key (we can't prevent some other data generating process from doing that), that we don't crash on loading it. It's not clear that that's actually what those tests are doing, but that's what I think they should do. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
nealrichardson commented on a change in pull request #8365: URL: https://github.com/apache/arrow/pull/8365#discussion_r551551461 ## File path: r/src/array_to_vector.cpp ## @@ -288,36 +290,104 @@ struct Converter_String : public Converter { } StringArrayType* string_array = static_cast(array.get()); -if (array->null_count()) { - // need to watch for nulls - arrow::internal::BitmapReader null_reader(array->null_bitmap_data(), -array->offset(), n); + +const bool all_valid = array->null_count() == 0; +const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false); + +bool nul_was_stripped = false; + +if (all_valid) { + // no need to watch for missing strings cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++, null_reader.Next()) { - if (null_reader.IsSet()) { -SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); - } else { -SET_STRING_ELT(data, start + i, NA_STRING); +if (strip_out_nuls) { + for (int i = 0; i < n; i++) { +SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), +_was_stripped)); } + return; Review comment: no preference, just trying to understand This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jonkeane commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
jonkeane commented on a change in pull request #9092: URL: https://github.com/apache/arrow/pull/9092#discussion_r551549190 ## File path: r/R/record-batch.R ## @@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ... } .serialize_arrow_r_metadata <- function(x) { + # drop problems attributes (most likely from readr) + if ("attributes" %in% names(x) && + "problems" %in% names(x[["attributes"]]) ) { +x[["attributes"]][["problems"]] <- NULL + } + Review comment: Yeah I can do that. Though [one of the existing tests](https://github.com/apache/arrow/blob/master/r/tests/testthat/test-metadata.R#L76-L81) fails here — the one right above it works which might be sufficient, is there something that lines 76-81 is testing that we aren't covering with lines 70-75? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
bkietz commented on a change in pull request #8365: URL: https://github.com/apache/arrow/pull/8365#discussion_r551542391 ## File path: r/src/array_to_vector.cpp ## @@ -288,36 +290,104 @@ struct Converter_String : public Converter { } StringArrayType* string_array = static_cast(array.get()); -if (array->null_count()) { - // need to watch for nulls - arrow::internal::BitmapReader null_reader(array->null_bitmap_data(), -array->offset(), n); + +const bool all_valid = array->null_count() == 0; +const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false); + +bool nul_was_stripped = false; + +if (all_valid) { + // no need to watch for missing strings cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++, null_reader.Next()) { - if (null_reader.IsSet()) { -SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); - } else { -SET_STRING_ELT(data, start + i, NA_STRING); +if (strip_out_nuls) { + for (int i = 0; i < n; i++) { +SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), +_was_stripped)); } + return; +} + +for (int i = 0; i < n; i++) { + SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); } }); } else { cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++) { - SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); +arrow::internal::BitmapReader validity_reader(array->null_bitmap_data(), + array->offset(), n); + +if (strip_out_nuls) { + for (int i = 0; i < n; i++, validity_reader.Next()) { +if (validity_reader.IsSet()) { + SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), + _was_stripped)); +} else { + SET_STRING_ELT(data, start + i, NA_STRING); +} + } + return; +} + +for (int i = 0; i < n; i++, validity_reader.Next()) { + if (validity_reader.IsSet()) { +SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); + } else { +SET_STRING_ELT(data, start + i, NA_STRING); + } } }); } +if (nul_was_stripped) { + cpp11::warning("Stripping '\\0' (nul) from character vector"); +} + return Status::OK(); } bool Parallel() const { return false; } private: - static SEXP r_string_from_view(const arrow::util::string_view& view) { + static SEXP r_string_from_view(arrow::util::string_view view) { Review comment: string_view is a trivially copyable structure so there's no need to protect it from copies with `const&` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
bkietz commented on a change in pull request #8365: URL: https://github.com/apache/arrow/pull/8365#discussion_r551542022 ## File path: r/src/array_to_vector.cpp ## @@ -288,36 +290,104 @@ struct Converter_String : public Converter { } StringArrayType* string_array = static_cast(array.get()); -if (array->null_count()) { - // need to watch for nulls - arrow::internal::BitmapReader null_reader(array->null_bitmap_data(), -array->offset(), n); + +const bool all_valid = array->null_count() == 0; +const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false); + +bool nul_was_stripped = false; + +if (all_valid) { + // no need to watch for missing strings cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++, null_reader.Next()) { - if (null_reader.IsSet()) { -SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); - } else { -SET_STRING_ELT(data, start + i, NA_STRING); +if (strip_out_nuls) { + for (int i = 0; i < n; i++) { +SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), +_was_stripped)); } + return; Review comment: This `return` is breaking out of the lambda wrapped by `unwind_protect`; it doesn't make the parent scope return. If you prefer, we could use an `else` here This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] removed a comment on pull request #9095: ARROW-10183: [C++] Apply composable futures to CSV
github-actions[bot] removed a comment on pull request #9095: URL: https://github.com/apache/arrow/pull/9095#issuecomment-754137607 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sunchao commented on a change in pull request #9047: ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types
sunchao commented on a change in pull request #9047: URL: https://github.com/apache/arrow/pull/9047#discussion_r551540021 ## File path: rust/parquet/src/arrow/schema.rs ## @@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> { LogicalType::INT_32 => Ok(DataType::Int32), LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)), LogicalType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)), +LogicalType::DECIMAL => self.to_decimal(), Review comment: Thanks for the update. This PR looks good and we can address the remaining in a followup. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays
nevi-me commented on pull request #9093: URL: https://github.com/apache/arrow/pull/9093#issuecomment-754186287 I saw the clippy warning, I'll fix it This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656
nevi-me commented on a change in pull request #9094: URL: https://github.com/apache/arrow/pull/9094#discussion_r551536755 ## File path: rust/arrow/src/record_batch.rs ## @@ -80,6 +80,10 @@ impl RecordBatch { Ok(RecordBatch { schema, columns }) } +/// Creates a [RecordBatch` from a schema and columns, with additional options, Review comment: Thanks, I meant for it to be `RecordBatch`, as the link has little value. I think I removed the wrong character while changing it This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #8365: ARROW-6582: [R] Arrow to R fails with embedded nuls in strings
nealrichardson commented on a change in pull request #8365: URL: https://github.com/apache/arrow/pull/8365#discussion_r551519433 ## File path: r/src/array_to_vector.cpp ## @@ -288,36 +290,104 @@ struct Converter_String : public Converter { } StringArrayType* string_array = static_cast(array.get()); -if (array->null_count()) { - // need to watch for nulls - arrow::internal::BitmapReader null_reader(array->null_bitmap_data(), -array->offset(), n); + +const bool all_valid = array->null_count() == 0; +const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false); + +bool nul_was_stripped = false; + +if (all_valid) { + // no need to watch for missing strings cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++, null_reader.Next()) { - if (null_reader.IsSet()) { -SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); - } else { -SET_STRING_ELT(data, start + i, NA_STRING); +if (strip_out_nuls) { + for (int i = 0; i < n; i++) { +SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), +_was_stripped)); } + return; Review comment: The main function return (L347) is `return Status::OK();` -- should these do the same? Or do I misunderstand what this return does? ## File path: r/src/array_to_vector.cpp ## @@ -288,36 +290,104 @@ struct Converter_String : public Converter { } StringArrayType* string_array = static_cast(array.get()); -if (array->null_count()) { - // need to watch for nulls - arrow::internal::BitmapReader null_reader(array->null_bitmap_data(), -array->offset(), n); + +const bool all_valid = array->null_count() == 0; +const bool strip_out_nuls = GetBoolOption("arrow.skip_nul", false); + +bool nul_was_stripped = false; + +if (all_valid) { + // no need to watch for missing strings cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++, null_reader.Next()) { - if (null_reader.IsSet()) { -SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); - } else { -SET_STRING_ELT(data, start + i, NA_STRING); +if (strip_out_nuls) { + for (int i = 0; i < n; i++) { +SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), +_was_stripped)); } + return; +} + +for (int i = 0; i < n; i++) { + SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); } }); } else { cpp11::unwind_protect([&] { -for (int i = 0; i < n; i++) { - SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); +arrow::internal::BitmapReader validity_reader(array->null_bitmap_data(), + array->offset(), n); + +if (strip_out_nuls) { + for (int i = 0; i < n; i++, validity_reader.Next()) { +if (validity_reader.IsSet()) { + SET_STRING_ELT(data, start + i, + r_string_from_view_strip_nul(string_array->GetView(i), + _was_stripped)); +} else { + SET_STRING_ELT(data, start + i, NA_STRING); +} + } + return; +} + +for (int i = 0; i < n; i++, validity_reader.Next()) { + if (validity_reader.IsSet()) { +SET_STRING_ELT(data, start + i, r_string_from_view(string_array->GetView(i))); + } else { +SET_STRING_ELT(data, start + i, NA_STRING); + } } }); } +if (nul_was_stripped) { + cpp11::warning("Stripping '\\0' (nul) from character vector"); +} + return Status::OK(); } bool Parallel() const { return false; } private: - static SEXP r_string_from_view(const arrow::util::string_view& view) { + static SEXP r_string_from_view(arrow::util::string_view view) { Review comment: Why this change? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #9092: ARROW-10624: [R] Proactively remove "problems" attributes
nealrichardson commented on a change in pull request #9092: URL: https://github.com/apache/arrow/pull/9092#discussion_r551534161 ## File path: r/R/record-batch.R ## @@ -274,6 +274,12 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, optional = FALSE, ... } .serialize_arrow_r_metadata <- function(x) { + # drop problems attributes (most likely from readr) + if ("attributes" %in% names(x) && + "problems" %in% names(x[["attributes"]]) ) { +x[["attributes"]][["problems"]] <- NULL + } + Review comment: Should we validate/assert `is.list()` in this function? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser
Dandandan edited a comment on pull request #9090: URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662 @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds the most overhead. I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec` (by using `ByteRecord`) and a `Vec` for the rows. The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or binary array without intermediate steps. This is the structure the csv crate is using per row: ```rust struct ByteRecordInner { /// The position of this byte record. pos: Option, /// All fields in this record, stored contiguously. fields: Vec, /// The number of and location of each field in this record. bounds: Bounds, } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser
Dandandan edited a comment on pull request #9090: URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662 @jorgecarleitao note that the `csv` `StringRecord` also verifies whether strings are utf8. It adds a bit of overhead, but the utf8 checking itself is not much for now, it is mostly the logic surrounding `StringRecord` that adds the most overhead. I think eventually we could use a `StringArray` or `BinaryArray` as buffer so we can remove the `StringRecords` which is internally a `Vec` (by using `ByteRecord`) and a `Vec` for the rows. The current performance penalty between master and this branch currently is ~10% as we introduce an extra intermediate step which I think could be more than compensated for by removing the `StringRecord` abstraction, and trying to write to a string or binary array without intermediate steps. ```rust struct ByteRecordInner { /// The position of this byte record. pos: Option, /// All fields in this record, stored contiguously. fields: Vec, /// The number of and location of each field in this record. bounds: Bounds, } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser
alamb commented on pull request #9090: URL: https://github.com/apache/arrow/pull/9090#issuecomment-754181158 I also like the general idea of using the `cast` kernel for consistency. Nice idea @Dandandan This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
alamb closed pull request #9016: URL: https://github.com/apache/arrow/pull/9016 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9094: ARROW-11126: [Rust] Document and test ARROW-10656
alamb commented on a change in pull request #9094: URL: https://github.com/apache/arrow/pull/9094#discussion_r551527136 ## File path: rust/arrow/src/record_batch.rs ## @@ -80,6 +80,10 @@ impl RecordBatch { Ok(RecordBatch { schema, columns }) } +/// Creates a [RecordBatch` from a schema and columns, with additional options, Review comment: ```suggestion /// Creates a [`RecordBatch`] from a schema and columns, with additional options, ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org