[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7027: ARROW-8572: [Python] expose UnionArray fields to Python
jorisvandenbossche commented on a change in pull request #7027: URL: https://github.com/apache/arrow/pull/7027#discussion_r416374325

## File path: python/pyarrow/array.pxi

@@ -1720,6 +1720,39 @@ cdef class UnionArray(Array):
     Concrete class for Arrow arrays of a Union data type.
     """

+    def child(self, int pos):
+        """
+        Return the given child array as an individual array.
+
+        For sparse unions, the returned array has its offset, length,
+        and null count adjusted.
+
+        For dense unions, the returned array is unchanged.
+        """
+        cdef shared_ptr[CArray] result
+        result = (<CUnionArray*> self.ap).child(pos)
+        if result != NULL:
+            return pyarrow_wrap_array(result)
+        raise KeyError("UnionArray does not have child {}".format(pos))
+
+    @property
+    def type_codes(self):
+        """Get the type codes array."""
+        buf = pyarrow_wrap_buffer((<CUnionArray*> self.ap).type_codes())
+        return Array.from_buffers(int8(), len(self), [None, buf])
+
+    @property
+    def value_offsets(self):

Review comment: For ListArray, this is called "offsets" (not sure if we prefer consistency within the python api, or consistency with the c++ api)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
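The sparse/dense adjustment described in the docstring above can be illustrated without pyarrow at all. Below is a minimal pure-Python sketch; the function and variable names are hypothetical, not part of the pyarrow API:

```python
def union_value(type_codes, children, index, value_offsets=None):
    """Resolve slot `index` of a union into a concrete value.

    Sparse union: every child array has the same length as the union,
    so the slot index is used directly (which is why a sparse child can
    be returned with only its offset/length/null count adjusted).
    Dense union: children are packed, so `value_offsets[index]` gives
    the position inside the selected child.
    """
    code = type_codes[index]
    child = children[code]
    if value_offsets is None:  # sparse layout
        return child[index]
    return child[value_offsets[index]]  # dense layout

# The same logical union [1, "b", 3] in both layouts:
type_codes = [0, 1, 0]
sparse_children = [[1, 0, 3], ["", "b", ""]]  # padded to union length
dense_children = [[1, 3], ["b"]]              # packed per child

assert union_value(type_codes, sparse_children, 1) == "b"
assert union_value(type_codes, dense_children, 1, value_offsets=[0, 0, 1]) == "b"
```

This also makes the naming question concrete: `value_offsets` only exists for the dense layout, analogous to (but distinct from) a ListArray's offsets.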
[GitHub] [arrow] kou commented on pull request #6963: ARROW-8509: [Plasma][GLib] GArrowRecordBatch <-> GArrowBuffer conversion functions
kou commented on pull request #6963: URL: https://github.com/apache/arrow/pull/6963#issuecomment-620412397

I'll take over this.
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r416295541

## File path: cpp/src/jni/dataset/proto/Types.proto

@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment: Thanks, I think I understand the concerns about a too-detailed mapping. (Actually, Fragment is not currently mapped to the Java side, so it's not yet a 1:1 mapping.) I'll try to remove more JNI code. As for filters: if I understand correctly, it's OK to keep the Java API, but the JNI mapping for filters is considered fragile, right? I can remove the mapping anyway, but then users reading Parquet files from Java will not be able to filter row groups to reduce I/O, which is extremely important when a low-selectivity filter is specified.
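For context on why that row-group filtering matters, here is a hedged pure-Python sketch, with made-up names rather than the actual JNI or Datasets API, of how per-row-group min/max statistics let a reader skip I/O entirely when a filter is pushed down:

```python
def row_groups_to_read(stats, lower, upper):
    """stats: one (min, max) pair per row group.

    Keep only the groups whose value range can overlap the predicate
    interval [lower, upper]; the rest never need to be read from disk.
    """
    return [i for i, (lo, hi) in enumerate(stats)
            if hi >= lower and lo <= upper]

# A selective filter skips two of three row groups, i.e. two thirds of the I/O:
stats = [(0, 9), (10, 19), (20, 29)]
assert row_groups_to_read(stats, 12, 15) == [1]
```

Without the filter mapping across the JNI boundary, every row group would have to be read and filtered on the Java side instead.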
[GitHub] [arrow] rtyler commented on pull request #7049: Avoid loading simd_load_set_invalid which doesn't exist on aarch64
rtyler commented on pull request #7049: URL: https://github.com/apache/arrow/pull/7049#issuecomment-620377892

Fixing this also helps surface [ARROW-8610](https://issues.apache.org/jira/browse/ARROW-8610), which I have had a hell of a time fixing
[GitHub] [arrow] rtyler opened a new pull request #7049: Avoid loading simd_load_set_invalid which doesn't exist on aarch64
rtyler opened a new pull request #7049: URL: https://github.com/apache/arrow/pull/7049

    error[E0432]: unresolved import `crate::compute::util::simd_load_set_invalid`
      --> /home/tyler/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/2a8e37d/rust/arrow/src/compute/kernels/arithmetic.rs:42:5
       |
    42 | use crate::compute::util::simd_load_set_invalid;
       |     ^^^ no `simd_load_set_invalid` in `compute::util`
        Compiling thiserror v1.0.16
    error: aborting due to previous error
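The underlying pattern of the fix, gating a platform-specific fast path so code still builds where that path does not exist, can be sketched in Python. The module and function names below are made up for illustration; the actual fix is a Rust conditional-compilation change:

```python
# Hypothetical optional fast path: fall back to a plain implementation
# when the accelerated one is unavailable on this platform (conceptually
# what the Rust fix does with cfg-gated imports).
try:
    from simd_kernels import add as fast_add  # hypothetical module
except ImportError:
    def fast_add(a, b):
        return [x + y for x, y in zip(a, b)]

print(fast_add([1, 2], [3, 4]))
```

Either way the caller gets a working `fast_add`, just as the aarch64 build should get a working kernel without the x86-only symbol.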
[GitHub] [arrow] liyafan82 commented on pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on pull request #6323: URL: https://github.com/apache/arrow/pull/6323#issuecomment-620338202

@BryanCutler @emkornfield Thanks a lot for your effort.
[GitHub] [arrow] github-actions[bot] commented on pull request #7048: ARROW-8609: [C++] fix orc jni crash
github-actions[bot] commented on pull request #7048: URL: https://github.com/apache/arrow/pull/7048#issuecomment-620323735

https://issues.apache.org/jira/browse/ARROW-8609
[GitHub] [arrow] zhouyuan commented on pull request #6953: [Java] Reproduce ORC JNI binding crash [WIP]
zhouyuan commented on pull request #6953: URL: https://github.com/apache/arrow/pull/6953#issuecomment-620321153

@kszucs ran into a similar issue on the Parquet side; not sure if #7048 works for it.
[GitHub] [arrow] zhouyuan opened a new pull request #7048: ARROW-8609: [C++] fix orc jni crash
zhouyuan opened a new pull request #7048: URL: https://github.com/apache/arrow/pull/7048

Check if the Arrow buffer is null before passing it to the constructor.

Signed-off-by: Yuan Zhou
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416253903

## File path: cpp/src/arrow/dataset/file_csv.cc

@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include <memory>
+#include <utility>
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result<std::shared_ptr<csv::StreamingReader>> OpenReader(
+    const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+      default_memory_pool(), std::move(input), default_read_options(),
+      format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+    return maybe_reader.status().WithMessage("Could not open IPC input source '",
+                                             source.path(), "': ", maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by a CSV file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr<CsvFileFormat> format, FileSource source,
+              std::shared_ptr<ScanOptions> options, std::shared_ptr<ScanContext> context)
+      : ScanTask(std::move(options), std::move(context)),
+        format_(std::move(format)),
+        source_(std::move(source)) {}
+
+  Result Execute() override {
+    ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment: I'll add a failing test tomorrow so this can become less hand-wavy.
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416239202

## File path: cpp/src/arrow/dataset/file_csv.cc

(quoting the same `CsvScanTask::Execute` context from file_csv.cc as the earlier comment)
+    ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment: The edge case here is due to the way we pass only the projected schema during a scan, so maybe the solution is to always pass the full dataset schema as well, so that no materialized fields will be inferred at scan time.
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416235260

## File path: cpp/src/arrow/dataset/file_csv.cc

(quoting the same `CsvScanTask::Execute` context from file_csv.cc as the earlier comment)
+    ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment: Reading using a predetermined schema is already supported for CSV and for all the other formats. The first chunk (as determined by `CsvFileFormat::chunk_size`) is used to sniff the schema, which is then assumed for the rest of the file, and this inspection can be applied to just one or to all the files in a dataset at discovery time.
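The sniff-once-then-assume approach described here can be sketched in plain Python. The helper below is illustrative only (the real logic lives in arrow/csv and `CsvFileFormat`), inferring a column-to-type mapping from a leading sample and assuming it for the rest of the file rather than re-inferring per chunk:

```python
import csv
import io

def sniff_schema(text, sample_rows=100):
    """Infer a column -> type mapping from the first `sample_rows` rows,
    then assume it for the remainder of the file."""
    reader = csv.DictReader(io.StringIO(text))
    schema = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for col, val in row.items():
            if schema.get(col) is str:
                continue  # already demoted to string; stays string
            try:
                int(val)
                schema.setdefault(col, int)
            except ValueError:
                schema[col] = str
    return schema

assert sniff_schema("a,b\n1,x\n2,y\n") == {"a": int, "b": str}
```

The trade-off is the one being debated above: rows past the sample that do not match the sniffed schema become errors rather than triggering re-inference.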
[GitHub] [arrow] zgramana commented on pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on pull request #7032: URL: https://github.com/apache/arrow/pull/7032#issuecomment-620266004

@eerhardt thanks for the excellent review and feedback. I believe that I have pushed fixes for everything. I also noticed I had missed validity buffer state management in `PrimitiveArrayBuilder`'s `Set`, `Swap` and `Clear` methods. I've added those in.
[GitHub] [arrow] wesm commented on pull request #6154: ARROW-7531: [C++] Reduce header cost
wesm commented on pull request #6154: URL: https://github.com/apache/arrow/pull/6154#issuecomment-620248207

There are a lot of good changes here. I agree that starting fresh and making smaller PRs is the path forward. I'm going to close this for now if that's OK.
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416165482

## File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs

@@ -48,21 +48,28 @@ public BinaryArray(ArrowTypeId typeId, ArrayData data)
             data.EnsureBufferCount(3);
         }

-        public abstract class BuilderBase : IArrowArrayBuilder
+        public abstract class BuilderBase : IArrowArrayBuilder

Review comment: Early on in the effort I was trying to get a more generalized BuilderBase, one that could be more easily reused by end users (such as myself) who might be interested in implementing a custom type. It didn't really end up working out that way, however, so I have reverted it.
[GitHub] [arrow] nealrichardson commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings
nealrichardson commented on pull request #7026: URL: https://github.com/apache/arrow/pull/7026#issuecomment-620245767

@github-actions autotune everything
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416155042

## File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs

@@ -153,17 +182,25 @@ private void CheckIndex(int index)
                 new[] { nullBitmapBuffer, valueBuffer }))
         { }

-        public BooleanArray(ArrayData data)
+        public BooleanArray(ArrayData data) : base(data)
         {
             data.EnsureDataType(ArrowTypeId.Boolean);
         }

         public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor);

+        [Obsolete("GetBoolean does not support null values. Use GetValue instead (which this method invokes internally).")]

Review comment: Fixed
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416154898

## File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs

@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
         {
             ValueOffsets.Append(Offset);

-            var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-                new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+            var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+                new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });

             return Build(data);
         }

-        public TBuilder Append(byte value)
+        public TBuilder AppendNull()
+        {
+            ValueOffsets.Append(Offset);
+            Offset++;
+            NullCount++;
+            return Instance;
+        }
+
+        public TBuilder Append(T value)
         {
             ValueOffsets.Append(Offset);
             ValueBuffer.Append(value);
             Offset++;
+            ValidityBuffer.Append(true);
             return Instance;
         }

-        public TBuilder Append(ReadOnlySpan span)
+        public TBuilder Append(ReadOnlySpan span)
         {
             ValueOffsets.Append(Offset);
             ValueBuffer.Append(span);
-            Offset += span.Length;
+            ValidityBuffer.Append(true);
+            Offset += span == Span.Empty
+                ? 1
+                : span.Length;
             return Instance;
         }

-        public TBuilder AppendRange(IEnumerable values)
+        public TBuilder AppendRange(IEnumerable values)
         {
             foreach (var arr in values)
             {
+                if (arr != null && arr.Length > 0)

Review comment: Fixed
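The validity-buffer bookkeeping being added in this diff (one bit appended per slot, plus a running null count) can be sketched in Python. This is a conceptual model of an Arrow-style validity bitmap (LSB-first, 1 = valid), not the C# builder itself:

```python
class ValidityBufferSketch:
    """Toy validity bitmap builder: append one bit per appended value."""

    def __init__(self):
        self.bits = bytearray()
        self.length = 0
        self.null_count = 0

    def append(self, is_valid):
        byte_i, bit_i = divmod(self.length, 8)
        if byte_i == len(self.bits):
            self.bits.append(0)  # grow the buffer one byte at a time
        if is_valid:
            self.bits[byte_i] |= 1 << bit_i  # set bit: slot is valid
        else:
            self.null_count += 1  # bit stays 0: slot is null
        self.length += 1

v = ValidityBufferSketch()
for valid in (True, False, True):
    v.append(valid)
assert v.bits == bytearray([0b101])
assert v.null_count == 1
```

The point of the review thread is that every Append/Set/Swap/Clear path must keep this bitmap and null count in sync with the value buffer, which is what the pushed fixes add.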
[GitHub] [arrow] github-actions[bot] commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release
github-actions[bot] commented on pull request #7047: URL: https://github.com/apache/arrow/pull/7047#issuecomment-620239462

https://issues.apache.org/jira/browse/ARROW-8607
[GitHub] [arrow] wesm commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
wesm commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416152095

## File path: cpp/src/arrow/dataset/file_csv.cc

(quoting the same `CsvScanTask::Execute` context from file_csv.cc as the earlier comment)
+    ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment: In most database systems, it is required to use a predetermined schema when reading CSV files. If we want to support schema inference, we could use a part of a file to "sniff" the schema but then assume it going forward. That's a bit better than forcing people to write down the schema.
[GitHub] [arrow] wesm commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
wesm commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416152095

## File path: cpp/src/arrow/dataset/file_csv.cc ##

@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include <memory>
+#include <utility>
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result<std::shared_ptr<csv::StreamingReader>> OpenReader(
+    const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+      default_memory_pool(), std::move(input), default_read_options(),
+      format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+    return maybe_reader.status().WithMessage("Could not open IPC input source '",
+                                             source.path(), "': ", maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr<const CsvFileFormat> format, FileSource source,
+              std::shared_ptr<ScanOptions> options, std::shared_ptr<ScanContext> context)
+      : ScanTask(std::move(options), std::move(context)),
+        format_(std::move(format)),
+        source_(std::move(source)) {}
+
+  Result<RecordBatchIterator> Execute() override {
+    ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment: In database systems, it is required to use a pre-determined schema when reading CSV files. If we want to support schema inference, we could use a part of a file to "sniff" the schema but then assume it going forward. That's a bit better than forcing people to write down the schema.
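The "sniff, then assume" idea above can be illustrated with a small, self-contained Python sketch that infers per-column types from the first rows of a CSV and then treats the result as fixed. The inference ladder (int, then float, else str), the row limit, and the function name are assumptions for illustration, not Arrow's CSV reader API.

```python
import csv
import io

def sniff_schema(csv_text, num_rows=100):
    """Guess a column -> type mapping from at most num_rows rows."""
    def infer(value):
        # Try the narrowest type first; fall back to str.
        for typ in (int, float):
            try:
                typ(value)
                return typ
            except ValueError:
                pass
        return str

    schema = {}
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if i >= num_rows:
            break
        for name, value in row.items():
            seen = schema.get(name)
            guess = infer(value)
            # Widen int -> float -> str if later rows contradict the guess.
            if seen is None or (seen is int and guess is float) or guess is str:
                schema[name] = guess
    return schema
```

A scanner could run this once over a file prefix and then error (rather than re-infer) when later rows disagree, which is the trade-off the comment describes.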
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416150841

## File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs ##

@@ -73,61 +80,88 @@
 public TArray Build(MemoryAllocator allocator = default)
 {
     ValueOffsets.Append(Offset);

-    var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-        new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+    var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+        new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });

     return Build(data);
 }

-public TBuilder Append(byte value)
+public TBuilder AppendNull()
+{
+    ValueOffsets.Append(Offset);
+    Offset++;

Review comment: We do not. You are correct.
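The reviewer's point here — that a null slot should occupy a position but not advance the data offset — can be seen in a toy Python version of the offsets/validity bookkeeping. The class and member names are illustrative, not the C# API under review.

```python
class BinaryBuilder:
    """Toy variable-length binary builder tracking offsets and validity."""

    def __init__(self):
        self.values = bytearray()   # concatenated value bytes
        self.offsets = [0]          # offsets[i]..offsets[i+1] spans slot i
        self.validity = []          # one bool per logical slot
        self.null_count = 0

    def append(self, data: bytes):
        self.values.extend(data)
        self.offsets.append(len(self.values))
        self.validity.append(True)

    def append_null(self):
        # A null occupies a slot but adds no value bytes: repeat the offset.
        self.offsets.append(len(self.values))
        self.validity.append(False)
        self.null_count += 1
```

Because `append_null` repeats the last offset, slot i always spans `offsets[i]..offsets[i+1]`, and null slots span zero bytes.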
[GitHub] [arrow] github-actions[bot] commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release
github-actions[bot] commented on pull request #7047: URL: https://github.com/apache/arrow/pull/7047#issuecomment-620234261

Revision: ee605c39e96f6750e72f1b2d10b929ebffee4015

Submitted crossbow builds: [ursa-labs/crossbow @ actions-167](https://github.com/ursa-labs/crossbow/branches/all?query=actions-167)

|Task|Status|
|----|------|
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-167-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|test-conda-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-conda-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-conda-r-3.6)|
|test-r-linux-as-cran|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-167-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-167-github-test-r-linux-as-cran)|
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rhub-ubuntu-gcc-release)|
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rocker-r-base-latest)|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-bionic)|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse15)|
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse42)|
|test-ubuntu-18.04-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-ubuntu-18.04-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-ubuntu-18.04-r-3.6)|
|test-ubuntu-18.04-r-sanitizer|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-ubuntu-18.04-r-sanitizer.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-ubuntu-18.04-r-sanitizer)|
[GitHub] [arrow] wesm commented on pull request #7021: Wrap docker-compose commands with archery
wesm commented on pull request #7021: URL: https://github.com/apache/arrow/pull/7021#issuecomment-620234647

> it's pretty straightforward to use docker-compose to reproduce these builds locally

I see both sides of this -- with some builds requiring 3 or more `docker-compose build` invocations and relying on various environment variables / parameters, there's a lot of room for frustration and user error (speaking from experience). I'm also pretty anti-copypasta in build configurations -- having all the build logic tightly coupled to GHA seems to go against some of the spirit of our prior decoupling from Travis CI, Appveyor. Of course, everything has to be well-documented and able to be serviced successfully by most Arrow maintainers.
[GitHub] [arrow] nealrichardson opened a new pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release
nealrichardson opened a new pull request #7047: URL: https://github.com/apache/arrow/pull/7047
[GitHub] [arrow] nealrichardson commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release
nealrichardson commented on pull request #7047: URL: https://github.com/apache/arrow/pull/7047#issuecomment-620233527 @github-actions crossbow submit -g r
[GitHub] [arrow] nealrichardson commented on pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/
nealrichardson commented on pull request #7046: URL: https://github.com/apache/arrow/pull/7046#issuecomment-620222893 Rust lint failure is clearly not related.
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416116593

## File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs ##

@@ -48,21 +48,28 @@
 public BinaryArray(ArrowTypeId typeId, ArrayData data)
 data.EnsureBufferCount(3);
 }

-public abstract class BuilderBase : IArrowArrayBuilder
+public abstract class BuilderBase : IArrowArrayBuilder
+    where T : struct
     where TArray : IArrowArray
-    where TBuilder : class, IArrowArrayBuilder
+    where TBuilder : class, IArrowArrayBuilder
 {
+    protected IArrowType DataType { get; }
     protected TBuilder Instance => this as TBuilder;
     protected ArrowBuffer.Builder<int> ValueOffsets { get; }
-    protected ArrowBuffer.Builder<byte> ValueBuffer { get; }
+    protected ArrowBuffer.Builder<T> ValueBuffer { get; }
+    protected BooleanArray.Builder ValidityBuffer { get; }
+
     protected int Offset { get; set; }
+    protected int ValidityOffset { get; set; }

Review comment: Fixed
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416115777

## File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs ##

@@ -73,61 +80,88 @@
 public TArray Build(MemoryAllocator allocator = default)
 {
     ValueOffsets.Append(Offset);

-    var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-        new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+    var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+        new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });

     return Build(data);
 }

-public TBuilder Append(byte value)
+public TBuilder AppendNull()
+{
+    ValueOffsets.Append(Offset);
+    Offset++;
+    NullCount++;
+    return Instance;
+}
+
+public TBuilder Append(T value)
 {
     ValueOffsets.Append(Offset);
     ValueBuffer.Append(value);
     Offset++;
+    ValidityBuffer.Append(true);
     return Instance;
 }

-public TBuilder Append(ReadOnlySpan<byte> span)
+public TBuilder Append(ReadOnlySpan<T> span)
 {
     ValueOffsets.Append(Offset);
     ValueBuffer.Append(span);
-    Offset += span.Length;
+    ValidityBuffer.Append(true);
+    Offset += span == Span<T>.Empty
+        ? 1

Review comment: Fixed
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416115591

## File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs ##

@@ -149,8 +180,8 @@ public void ProducesExpectedArray()
 Assert.IsAssignableFrom(array);
 Assert.NotNull(array);
-Assert.Equal(3, array.Length);
-Assert.Equal(0, array.NullCount);
+Assert.Equal(expectedLength, array.Length);
+Assert.Equal(expectedNullCount, array.NullCount);

Review comment: Good idea, and done
[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
eerhardt commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416111304

## File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs ##

@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
 public TArray Build(MemoryAllocator allocator = default)
 {
     return Build(
-        ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-        ValueBuffer.Length, 0, 0);
+        ValueBuffer.Build(allocator), ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:

> That said, Build could easily pass an empty buffer if NullCount == 0.

Yes that would work well. It is what the C++ implementation does: https://github.com/apache/arrow/blob/3064a27cf284726ce08b98b14725b2f9d7ef5ea2/cpp/src/arrow/array.cc#L59-L62
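The C++ behavior referenced above — when a builder saw no nulls, emit no validity bitmap at all rather than an all-valid one — can be sketched in a few lines of Python. The function name and the bit-packing convention (least significant bit first, matching Arrow's columnar format) are assumptions for illustration.

```python
def finish_validity(validity, null_count):
    """Return the validity bitmap for a finished builder, or None.

    When no nulls were appended, skip allocating a bitmap entirely,
    mirroring the null_count == 0 shortcut in the linked C++ code.
    """
    if null_count == 0:
        return None
    # Pack bools into a bitmap, least significant bit first.
    bitmap = bytearray((len(validity) + 7) // 8)
    for i, valid in enumerate(validity):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)
```

Consumers then treat a missing bitmap as "all values valid", which saves both the allocation and a per-value bit test on the hot path.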
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416110664

## File path: cpp/src/arrow/dataset/file_csv.cc ##

Review comment: That's indeed a data error. Whether parquet or csv, in C++ this would be caught by the ScannerBuilder::Filter setter (which *does* have access to the dataset's schema), so maybe this edge case isn't interesting after all.
[GitHub] [arrow] bkietz commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings
bkietz commented on pull request #7026: URL: https://github.com/apache/arrow/pull/7026#issuecomment-620196068 @kszucs @jorisvandenbossche PTAL
[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
pitrou commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416099536

## File path: cpp/src/arrow/dataset/file_csv.cc ##

Review comment: So if e.g. two Parquet files in a dataset have different types for `a` or `b` (which may be a data error), that passes silently?
[GitHub] [arrow] chrish42 commented on a change in pull request #7025: ARROW-2260: [C++][Plasma] Use Gflags for command-line parsing
chrish42 commented on a change in pull request #7025: URL: https://github.com/apache/arrow/pull/7025#discussion_r416057944

## File path: cpp/src/plasma/store.cc ##

@@ -1207,65 +1211,77 @@ void StartServer(char* socket_name, std::string plasma_directory, bool hugepages
 g_runner->Start(socket_name, plasma_directory, hugepages_enabled, external_store);
 }

+void UsageError(const char* error_msg, int exit_code=1) {
+  std::cerr << gflags::ProgramInvocationShortName() << ": " << error_msg << std::endl;
+  exit(exit_code);
+}
+
 }  // namespace plasma

+#ifdef __linux__
+#define SHM_DEFAULT_PATH "/dev/shm"
+#else
+#define SHM_DEFAULT_PATH "/tmp"
+#endif
+
+// Command-line flags.
+DEFINE_string(d, SHM_DEFAULT_PATH, "directory where to create the memory-backed file");
+DEFINE_string(e, "", "endpoint for external storage service, where objects "
+                     "evicted from Plasma store can be written to, optional");
+DEFINE_bool(h, false, "whether to enable hugepage support");
+DEFINE_string(s, "", "socket name where the Plasma store will listen for requests, required");
+DEFINE_string(m, "", "amount of memory in bytes to use for Plasma store, required");
+
 int main(int argc, char* argv[]) {
   ArrowLog::StartArrowLog(argv[0], ArrowLogLevel::ARROW_INFO);
   ArrowLog::InstallFailureSignalHandler();
+
+  gflags::SetUsageMessage("Shared-memory server for Arrow data.\nUsage: ");
+  gflags::SetVersionString("TODO");
+
   char* socket_name = nullptr;
   // Directory where plasma memory mapped files are stored.
   std::string plasma_directory;
   std::string external_store_endpoint;
   bool hugepages_enabled = false;
   int64_t system_memory = -1;
-  int c;
-  while ((c = getopt(argc, argv, "s:m:d:e:h")) != -1) {
-    switch (c) {
-      case 'd':
-        plasma_directory = std::string(optarg);
-        break;
-      case 'e':
-        external_store_endpoint = std::string(optarg);
-        break;
-      case 'h':
-        hugepages_enabled = true;
-        break;
-      case 's':
-        socket_name = optarg;
-        break;
-      case 'm': {
-        char extra;
-        int scanned = sscanf(optarg, "%" SCNd64 "%c", &system_memory, &extra);
-        ARROW_CHECK(scanned == 1);
-        // Set system memory capacity
-        plasma::PlasmaAllocator::SetFootprintLimit(static_cast<int64_t>(system_memory));
-        ARROW_LOG(INFO) << "Allowing the Plasma store to use up to "
-                        << static_cast<double>(system_memory) / 1e9
-                        << "GB of memory.";
-        break;
-      }
-      default:
-        exit(-1);
-    }
+
+  gflags::ParseCommandLineFlags(&argc, &argv, true);
+  plasma_directory = FLAGS_d;
+  external_store_endpoint = FLAGS_e;
+  hugepages_enabled = FLAGS_h;
+  if (!FLAGS_s.empty()) {
+    // We only check below if socket_name is null, so don't set it if the flag was empty.
+    socket_name = const_cast<char*>(FLAGS_s.c_str());
+  }
+
+  if (!FLAGS_m.empty()) {
+    char extra;
+    int scanned = sscanf(FLAGS_m.c_str(), "%" SCNd64 "%c", &system_memory, &extra);
+    ARROW_CHECK(scanned == 1);
+    // Set system memory capacity
+    plasma::PlasmaAllocator::SetFootprintLimit(static_cast<int64_t>(system_memory));
+    ARROW_LOG(INFO) << "Allowing the Plasma store to use up to "
+                    << static_cast<double>(system_memory) / 1e9
+                    << "GB of memory.";
   }

   // Sanity check command line options.
-  if (!socket_name) {
-    ARROW_LOG(FATAL) << "please specify socket for incoming connections with -s switch";
+  if (!socket_name && system_memory == -1) {

Review comment: I'm checking here if both are wrong to give a better error message, so people who run the program without arguments get it right the second time, instead of getting an error telling them about `-s`, only to then get a second error telling them about `-m`. But I guess I could add a little comment about that...
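The point about reporting every missing required flag at once — so a user who runs the program with no arguments sees one complete error instead of two sequential ones — can be sketched with plain Python. The helper name and the dict-based flag store are assumptions for illustration, not gflags' API.

```python
def check_required(flags, required):
    """Raise one error naming every required flag that is unset or empty."""
    missing = [name for name in required if not flags.get(name)]
    if missing:
        raise SystemExit(
            "please specify the following required option(s): "
            + ", ".join("-" + name for name in missing))

# With no arguments at all, the user learns about -s and -m in one pass:
# check_required({"s": "", "m": ""}, ["s", "m"])
```

Validating the whole option set before exiting is the same design choice the review comment defends: fail once, with everything the user needs to fix.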
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416097760

## File path: cpp/src/arrow/dataset/file_csv.cc ##

Review comment: The type of field "a" can't be known without the schema from which it originates. When the filter is applied to a dataset in python or R the dataset's schema will be used to add cast expressions to the filter which will handle the necessary conversions. However in c++ we don't pass the dataset schema at scan time.
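A rough sketch of the cast-insertion step described in this thread: given the dataset's schema, wrap every field reference in the filter with a cast to the schema's type, so each file's reader can coerce values consistently. The tuple-based expression encoding below is purely illustrative, not Arrow's Expression classes.

```python
def insert_casts(expr, schema):
    """Recursively wrap ("field", name) nodes in a cast to schema[name]."""
    kind = expr[0]
    if kind == "field":
        return ("cast", expr, schema[expr[1]])
    if kind in ("==", "!=", "<", ">", "and", "or"):
        # Rebuild the node with casts inserted in every operand.
        return (kind,) + tuple(insert_casts(arg, schema) for arg in expr[1:])
    return expr  # literals and anything else pass through unchanged

# A filter over fields "a" and "b", before cast insertion:
flt = ("and", ("==", ("field", "a"), ("literal", 3)),
              (">", ("field", "b"), ("literal", 0.5)))
casted = insert_casts(flt, {"a": "int64", "b": "float64"})
```

This mirrors why the C++ scan path can't do it alone: the rewrite needs the dataset-level schema, which is only available where the filter is bound.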
[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
pitrou commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416093398

## File path: cpp/src/arrow/dataset/file_csv.cc ##

Review comment: That sounds nasty indeed. The datasets layer doesn't know the type of these columns at all?
[GitHub] [arrow] bkietz commented on a change in pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings
bkietz commented on a change in pull request #7026: URL: https://github.com/apache/arrow/pull/7026#discussion_r416092830 ## File path: python/pyarrow/_dataset.pyx ## @@ -269,20 +454,21 @@ cdef class FileSystemDataset(Dataset): cdef: CFileSystemDataset* filesystem_dataset -def __init__(self, paths_or_selector, schema=None, format=None, - filesystem=None, partitions=None, root_partition=None): +def __init__(self, paths_or_selector, schema, format, filesystem, + partitions=None, Expression root_partition=_true): Review comment: Reverted.
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416092312 ## File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs ## @@ -24,25 +24,38 @@ public class BooleanArray: Array { public class Builder : IArrowArrayBuilder { -private ArrowBuffer.Builder ValueBuffer { get; } Review comment: Remnants of testing which were refactored away. Fixed.
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r416090623 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; + return defaults; +} + +static inline Result> OpenReader( +const FileSource& source, const CsvFileFormat& format) { + ARROW_ASSIGN_OR_RAISE(auto input, source.Open()); + + auto maybe_reader = csv::StreamingReader::Make( + default_memory_pool(), std::move(input), default_read_options(), + format.parse_options, format.convert_options); + if (!maybe_reader.ok()) { +return 
maybe_reader.status().WithMessage("Could not open IPC input source '", + source.path(), "': ", maybe_reader.status()); + } + + return std::move(maybe_reader).ValueOrDie(); +} + +/// \brief A ScanTask backed by an Csv file. +class CsvScanTask : public ScanTask { + public: + CsvScanTask(std::shared_ptr format, FileSource source, + std::shared_ptr options, std::shared_ptr context) + : ScanTask(std::move(options), std::move(context)), +format_(std::move(format)), +source_(std::move(source)) {} + + Result Execute() override { +ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_)); Review comment: PTAL. There's a subtlety resulting from the loosely typed nature of CSV: we don't know the types of columns materialized but not projected, which could lead to some very nasty errors in edge cases. For example, if filter were `"a"_ == "b"_` but only column `"c"` were projected then we have no information about the types of fields a, b except that they must be identical. I'm not sure what the correct solution here is This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
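The edge case bkietz describes — a filter referencing columns the projection omits — can be illustrated with a toy expression walker. This is a hedged sketch with made-up names (`Field`, `Equals`, `referenced_columns`), not the Arrow dataset expression API: it only shows why filtered-but-unprojected CSV columns must be materialized even though their types are unknown.

```python
# Illustrative sketch (not the Arrow C++ API): walk a toy expression tree
# to find which columns a filter references, so a CSV reader knows to
# materialize them even when they are not projected.

class Field:
    """A reference to a named column."""
    def __init__(self, name):
        self.name = name

class Equals:
    """A binary equality comparison between two sub-expressions."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def referenced_columns(expr):
    """Recursively collect the column names an expression references."""
    if isinstance(expr, Field):
        return {expr.name}
    if isinstance(expr, Equals):
        return referenced_columns(expr.left) | referenced_columns(expr.right)
    return set()  # literals reference no columns

# filter: "a" == "b", projection: ["c"]
filt = Equals(Field("a"), Field("b"))
projected = {"c"}
materialized = referenced_columns(filt) | projected
# Columns a and b must be read for the filter, though only c is projected.
# For CSV, their types stay unknown unless explicitly supplied.
```

The scan can compute which columns to read this way, but nothing here recovers the *types* of `a` and `b` — which is exactly the subtlety in the comment above.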
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416089870 ## File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs ## @@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j) public TArray Build(MemoryAllocator allocator = default) { return Build( -ValueBuffer.Build(allocator), ArrowBuffer.Empty, -ValueBuffer.Length, 0, 0); +ValueBuffer.Build(allocator), ValidityBuffer.Build(allocator).ValueBuffer, Review comment: I agree with the sentiment, but given that every call to `Append[Range]` also touches `ValidityBuffer` I think this could end up trading-off a modest memory utilization improvement for [a modest-to-significant compute overhead](https://gist.github.com/mrange/aa82d33e94ad76d0f33ed86e704d7492) depending on the desired level of thread-safety and a given application's access pattern. That said, `Build` could easily pass an empty buffer if `NullCount` == 0. Then `ArrowStreamWriter` could check the buffer's span `IsEmpty` property, and if so, emit [NullType instead](https://issues.apache.org/jira/browse/ARROW-6643). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
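The trade-off zgramana outlines — always track validity while appending, but elide the buffer at build time when nothing was null — can be sketched in a few lines. This is a conceptual Python model, not the C# `Apache.Arrow` builder API; the class and field names are invented for illustration.

```python
# Conceptual sketch of "pass an empty buffer if NullCount == 0": the
# builder records one validity bit per appended slot, then drops the
# validity buffer entirely at build() time if no null was ever appended.

class PrimitiveBuilder:
    def __init__(self):
        self.values = []
        self.validity = []   # one bool per appended slot
        self.null_count = 0

    def append(self, value):
        is_valid = value is not None
        self.values.append(0 if value is None else value)
        self.validity.append(is_valid)
        if not is_valid:
            self.null_count += 1
        return self

    def build(self):
        # Elide the validity buffer when the array is fully valid, so a
        # downstream writer can detect the empty buffer and skip it.
        validity = [] if self.null_count == 0 else list(self.validity)
        return {"values": list(self.values),
                "validity": validity,
                "null_count": self.null_count}

arr = PrimitiveBuilder().append(1).append(None).append(3).build()
# validity is kept because null_count == 1
dense = PrimitiveBuilder().append(1).append(2).build()
# validity is elided ([]) because null_count == 0
```

This keeps the per-append cost uniform (no branching on "first null seen"), while recovering the memory win only where it is free: at the single `build()` call.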
[GitHub] [arrow] BryanCutler commented on pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
BryanCutler commented on pull request #6323: URL: https://github.com/apache/arrow/pull/6323#issuecomment-620174537 Test failure is unrelated; merged to master. Thanks @liyafan82!
[GitHub] [arrow] wesm commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
wesm commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r416073397 ## File path: cpp/src/jni/dataset/proto/Types.proto ## @@ -0,0 +1,149 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto2"; +package types; + +option java_package = "org.apache.arrow.dataset"; Review comment: I think it's fine if you have a filter API in Java, but I think you should try to limit the exposure of the particular details of the current C++ API in Java. If the Java API has a 1-1 correspondence with the C++ API that's what I'm advising you against This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416050847 ## File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs ## @@ -99,43 +105,63 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu { protected TBuilder Instance => this as TBuilder; protected ArrowBuffer.Builder ValueBuffer { get; } +protected BooleanArray.Builder ValidityBuffer { get; } public int Length => ValueBuffer.Length; +protected int NullCount { get; set; } + // TODO: Implement support for null values (null bitmaps) internal PrimitiveArrayBuilder() { ValueBuffer = new ArrowBuffer.Builder(); +ValidityBuffer = new BooleanArray.Builder(); } public TBuilder Resize(int length) { ValueBuffer.Resize(length); +ValidityBuffer.Resize(length + 1); return Instance; } public TBuilder Reserve(int capacity) { ValueBuffer.Reserve(capacity); +ValidityBuffer.Reserve(capacity + 1); return Instance; } public TBuilder Append(T value) { ValueBuffer.Append(value); +ValidityBuffer.Append(true); return Instance; } public TBuilder Append(ReadOnlySpan span) { ValueBuffer.Append(span); +ValidityBuffer.Append(span != Span.Empty); Review comment: Gah! Forgot to refactor here after fixing it in `BinaryBuilder`. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r416049921 ## File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs ## @@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length) length = Math.Min(Length - offset, length); offset += Offset; -return new ArrayData(DataType, length, -1, offset, Buffers, Children); +return new ArrayData(DataType, length, NullCount, offset, Buffers, Children); Review comment: Appears obvious in retrospect. I reverted, though I did introduce a constant for enhanced readability.
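Why the revert matters: a slice's null count generally differs from its parent's, so `Slice` records it as unknown (`-1`, here a named constant, matching the "introduce a constant" fix) and recomputes it lazily on first access. A minimal Python model with illustrative names, not the C# `Apache.Arrow` API:

```python
# Sketch: slicing must not inherit the parent's null count, because the
# slice may cover a different subset of the nulls. Use a sentinel and
# recompute lazily instead.

RECOMPUTE_NULL_COUNT = -1  # sentinel: null count unknown, compute on demand

class ArrayData:
    def __init__(self, validity, length, offset=0,
                 null_count=RECOMPUTE_NULL_COUNT):
        self.validity = validity  # one bool per slot, True = valid
        self.length = length
        self.offset = offset
        self._null_count = null_count

    @property
    def null_count(self):
        if self._null_count == RECOMPUTE_NULL_COUNT:
            window = self.validity[self.offset:self.offset + self.length]
            self._null_count = sum(1 for v in window if not v)
        return self._null_count

    def slice(self, offset, length):
        length = min(self.length - offset, length)
        # Passing self.null_count here would be the bug being reverted.
        return ArrayData(self.validity, length, self.offset + offset,
                         RECOMPUTE_NULL_COUNT)

parent = ArrayData([True, False, False, True], length=4)
# parent has 2 nulls, but parent.slice(0, 2) has only 1
```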
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r416028395 ## File path: python/pyarrow/parquet.py ## @@ -310,6 +310,44 @@ def read_row_groups(self, row_groups, columns=None, use_threads=True, column_indices=column_indices, use_threads=use_threads) +def iter_batches(self, batch_size, row_groups=None, columns=None, Review comment: In the other method `scan_contents`, batch_size is optional.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r416027477 ## File path: python/pyarrow/_parquet.pyx ## @@ -1077,6 +1078,54 @@ cdef class ParquetReader: def set_use_threads(self, bint use_threads): self.reader.get().set_use_threads(use_threads) +def set_batch_size(self, int64_t batch_size): +self.reader.get().set_batch_size(batch_size) + +def iter_batches(self, int64_t batch_size, row_groups, column_indices=None, + bint use_threads=True): +cdef: +vector[int] c_row_groups +vector[int] c_column_indices +shared_ptr[CRecordBatch] record_batch +shared_ptr[TableBatchReader] batch_reader +unique_ptr[CRecordBatchReader] recordbatchreader + +self.set_batch_size(batch_size) + +if use_threads: +self.set_use_threads(use_threads) + +for row_group in row_groups: +c_row_groups.push_back(row_group) + +if column_indices is not None: +for index in column_indices: +c_column_indices.push_back(index) +check_status( +self.reader.get().GetRecordBatchReader( +c_row_groups, c_column_indices, &recordbatchreader +) +) +else: +check_status( +self.reader.get().GetRecordBatchReader( +c_row_groups, &recordbatchreader +) +) + +while True: +with nogil: +check_status( +recordbatchreader.get().ReadNext(&record_batch) +) + +if record_batch.get() == NULL: +break + +py_record_batch = pyarrow_wrap_batch(record_batch) Review comment: ```suggestion yield pyarrow_wrap_batch(record_batch) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
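The suggested change (`yield pyarrow_wrap_batch(record_batch)`) turns `iter_batches` into a generator. The underlying loop shape — read until the reader returns nothing, yielding each batch — can be shown with a stand-in reader; `FakeReader` and `read_next` are invented here to keep the sketch self-contained, standing in for `ReadNext(&record_batch)` on the C++ reader:

```python
# Generator pattern behind iter_batches: pull batches until the reader
# signals end-of-stream, yielding each one directly.

def iter_batches(reader):
    """Yield record batches until the reader is exhausted."""
    while True:
        batch = reader.read_next()  # stand-in for ReadNext(&record_batch)
        if batch is None:           # a NULL batch signals end of stream
            return
        yield batch                 # was: py_record_batch = ...; yield it

class FakeReader:
    """Minimal stand-in for a streaming record batch reader."""
    def __init__(self, batches):
        self._batches = iter(batches)

    def read_next(self):
        return next(self._batches, None)

batches = list(iter_batches(FakeReader([[1, 2], [3, 4], [5]])))
# batches == [[1, 2], [3, 4], [5]]
```

Yielding directly (rather than binding to a temporary first) avoids an extra name and makes the generator's streaming intent explicit, which is presumably the point of the suggestion.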
[GitHub] [arrow] github-actions[bot] commented on pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/
github-actions[bot] commented on pull request #7046: URL: https://github.com/apache/arrow/pull/7046#issuecomment-620134352 https://issues.apache.org/jira/browse/ARROW-8606
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
fsaintjacques commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r416018166 ## File path: python/pyarrow/_parquet.pyx ## @@ -1083,6 +1084,50 @@ cdef class ParquetReader: def set_use_threads(self, bint use_threads): self.reader.get().set_use_threads(use_threads) +def iter_batches(self, int64_t batch_size, row_groups, column_indices=None, Review comment: Noted for consistency. I find it unintuitive that calling this method changes the batch size (or threading) for subsequent callers. Since this is not introduced by this PR, I won't make this a blocking issue. I'll let @jorisvandenbossche decide if we should have a follow-up to fix this anti-pattern.
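The anti-pattern flagged above is a method that mutates shared reader state (the batch size) as a side effect, silently changing behaviour for later callers. One remedy is to restore the previous value when the iterator finishes. A hedged Python sketch with invented names, not the `ParquetReader` API:

```python
# Sketch: scope the batch-size override to the lifetime of the iterator,
# so later callers still see the reader's original setting.

class Reader:
    def __init__(self):
        self.batch_size = 64 * 1024  # shared default, visible to all callers

    def iter_batches(self, data, batch_size):
        previous = self.batch_size
        self.batch_size = batch_size
        try:
            for i in range(0, len(data), self.batch_size):
                yield data[i:i + self.batch_size]
        finally:
            # Restore on exhaustion or early close, undoing the side effect.
            self.batch_size = previous

r = Reader()
chunks = list(r.iter_batches(list(range(5)), batch_size=2))
# chunks == [[0, 1], [2, 3], [4]], and r.batch_size is back to 64 * 1024
```

An alternative is to avoid the shared setter entirely and thread the batch size through as a per-call argument; the save/restore version has the advantage of keeping the existing C++ setter interface untouched.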
[GitHub] [arrow] nealrichardson opened a new pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/
nealrichardson opened a new pull request #7046: URL: https://github.com/apache/arrow/pull/7046 If I edit an R build script, we shouldn't be running JS and Go builds.
[GitHub] [arrow] sunchao commented on pull request #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
sunchao commented on pull request #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-620113014 Merged. Thanks @rdettai!
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415981809 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; + return defaults; +} + +static inline Result> OpenReader( +const FileSource& source, const CsvFileFormat& format) { + ARROW_ASSIGN_OR_RAISE(auto input, source.Open()); + + auto maybe_reader = csv::StreamingReader::Make( + default_memory_pool(), std::move(input), default_read_options(), Review comment: - not currently implemented, will add This is an automated message from the Apache Git Service. 
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415958451 ## File path: cpp/src/arrow/dataset/file_csv.h ## @@ -0,0 +1,52 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" + +namespace arrow { +namespace dataset { + +/// \brief A FileFormat implementation that reads from and writes to Csv files +class ARROW_DS_EXPORT CsvFileFormat : public FileFormat { + public: + csv::ParseOptions parse_options = csv::ParseOptions::Defaults(); Review comment: Definitely you can set a full ParseOptions. I haven't exposed any of ReadOptions and ConvertOptions will probably need separate handling, as noted below This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415971888 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; + return defaults; +} + +static inline Result> OpenReader( +const FileSource& source, const CsvFileFormat& format) { + ARROW_ASSIGN_OR_RAISE(auto input, source.Open()); + + auto maybe_reader = csv::StreamingReader::Make( + default_memory_pool(), std::move(input), default_read_options(), + format.parse_options, format.convert_options); + if (!maybe_reader.ok()) { +return 
maybe_reader.status().WithMessage("Could not open IPC input source '", + source.path(), "': ", maybe_reader.status()); + } + + return std::move(maybe_reader).ValueOrDie(); +} + +/// \brief A ScanTask backed by an Csv file. +class CsvScanTask : public ScanTask { + public: + CsvScanTask(std::shared_ptr format, FileSource source, + std::shared_ptr options, std::shared_ptr context) + : ScanTask(std::move(options), std::move(context)), +format_(std::move(format)), +source_(std::move(source)) {} + + Result Execute() override { +ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_)); Review comment: Assembling these from `ScanOptions` is definitely a goal; the first pass just defers all this to the format. Some of `ConvertOptions` belong in the format (`null_values`, ..., `strings_can_be_null`, ...) while others such as the ones you note do not. I'll rewrite shortly This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415971988 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; Review comment: will do This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r415963714 ## File path: csharp/Directory.Build.props ## @@ -21,6 +21,12 @@ $(CSharpDir)ApacheArrow.snk + + Review comment: This was required for some early tests I wrote; later refactoring removed the need for it, but it was unfortunately left behind. Cleaned it up. Thanks for catching it.
[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
bkietz commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415958451 ## File path: cpp/src/arrow/dataset/file_csv.h ## @@ -0,0 +1,52 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" + +namespace arrow { +namespace dataset { + +/// \brief A FileFormat implementation that reads from and writes to Csv files +class ARROW_DS_EXPORT CsvFileFormat : public FileFormat { + public: + csv::ParseOptions parse_options = csv::ParseOptions::Defaults(); Review comment: Definitely you can set a full ParseOptions and ConvertOptions. I haven't exposed any of ReadOptions, as noted below This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
zgramana commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r415955058 ## File path: csharp/src/Apache.Arrow/Apache.Arrow.csproj ## @@ -4,7 +4,7 @@ netstandard1.3;netcoreapp2.1 true $(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T - + Review comment: 👍
[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
pitrou commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415949596 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; + return defaults; +} + +static inline Result> OpenReader( +const FileSource& source, const CsvFileFormat& format) { + ARROW_ASSIGN_OR_RAISE(auto input, source.Open()); + + auto maybe_reader = csv::StreamingReader::Make( + default_memory_pool(), std::move(input), default_read_options(), + format.parse_options, format.convert_options); + if (!maybe_reader.ok()) { +return 
maybe_reader.status().WithMessage("Could not open IPC input source '", + source.path(), "': ", maybe_reader.status()); + } + + return std::move(maybe_reader).ValueOrDie(); +} + +/// \brief A ScanTask backed by an Csv file. +class CsvScanTask : public ScanTask { + public: + CsvScanTask(std::shared_ptr format, FileSource source, + std::shared_ptr options, std::shared_ptr context) + : ScanTask(std::move(options), std::move(context)), +format_(std::move(format)), +source_(std::move(source)) {} + + Result Execute() override { +ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_)); Review comment: Also, the `ScanOptions` should be used to set the `column_names`, `column_types` and `include_columns` options. Unless you prefer to let the CSV reader infer types and then convert them yourself? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
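pitrou's suggestion above — driving the CSV reader from the scan's projected schema via `column_names`, `column_types` and `include_columns`, instead of letting the reader infer types — can be sketched in pure Python. This is an illustrative sketch, not the actual C++ wiring in `file_csv.cc`; the helper name `scan_to_csv_options` and the dict-based option shape are assumptions:

```python
# Hypothetical sketch: derive CSV reader options from a scan's
# projected schema, instead of letting the reader infer types.

def scan_to_csv_options(projected_schema):
    """projected_schema: mapping of column name -> type name.

    Returns the three option fields mentioned in the review:
    column_names, column_types and include_columns.
    """
    column_names = list(projected_schema)
    return {
        # Explicit names: the reader does not need to infer them.
        "column_names": column_names,
        # Explicit types: skips type inference entirely.
        "column_types": dict(projected_schema),
        # Only materialize the projected columns.
        "include_columns": column_names,
    }

opts = scan_to_csv_options({"id": "int64", "price": "float64"})
```

Passing explicit `column_types` avoids inference altogether, which is the alternative the review weighs against inferring types and converting afterwards.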
[GitHub] [arrow] nealrichardson commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
nealrichardson commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415887520 ## File path: cpp/src/arrow/dataset/file_csv.h ## @@ -0,0 +1,52 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" + +namespace arrow { +namespace dataset { + +/// \brief A FileFormat implementation that reads from and writes to Csv files +class ARROW_DS_EXPORT CsvFileFormat : public FileFormat { + public: + csv::ParseOptions parse_options = csv::ParseOptions::Defaults(); Review comment: Can I set a full `csv::ParseOptions` object? We have builders for those in R already.
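The pattern being asked about — a format object whose `parse_options` field holds a complete, defaults-initialized options value that a binding can replace wholesale — looks like this in a pure-Python sketch. The class and field names mirror the C++ snippet above, but this is illustrative, not Arrow's actual API:

```python
# Sketch of a format object exposing a full parse-options value
# (mirroring csv::ParseOptions::Defaults()), which a binding such as
# R could replace wholesale. Field names are illustrative.
from dataclasses import dataclass, replace

@dataclass
class ParseOptions:
    delimiter: str = ","
    quote_char: str = '"'
    newlines_in_values: bool = False

    @classmethod
    def defaults(cls):
        # Analogue of csv::ParseOptions::Defaults().
        return cls()

@dataclass
class CsvFileFormat:
    parse_options: ParseOptions = None

    def __post_init__(self):
        if self.parse_options is None:
            self.parse_options = ParseOptions.defaults()

fmt = CsvFileFormat()
# A binding can swap in a fully-built options object:
fmt.parse_options = replace(fmt.parse_options, delimiter="\t")
```

Keeping the whole options value assignable (rather than exposing individual fields) is what makes reusing existing option builders straightforward.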
[GitHub] [arrow] yonidavidson commented on pull request #6731: [WIP] ARROW-8601: [Go][Flight] Added implementation of FlightDataWriter
yonidavidson commented on pull request #6731: URL: https://github.com/apache/arrow/pull/6731#issuecomment-620017093 Hi all, I am working on https://github.com/353solutions/carrow/tree/issue/46-flight . This is a Go project that wraps the C++ one, but I wanted everyone to know that it's also on our roadmap.
[GitHub] [arrow] github-actions[bot] commented on pull request #7045: ARROW-8603: [C++][Documentation] Add missing params comment
github-actions[bot] commented on pull request #7045: URL: https://github.com/apache/arrow/pull/7045#issuecomment-619997013 https://issues.apache.org/jira/browse/ARROW-8603
[GitHub] [arrow] fsaintjacques opened a new pull request #7045: ARROW-8603: [C++][Documentation] Add missing params comment
fsaintjacques opened a new pull request #7045: URL: https://github.com/apache/arrow/pull/7045
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat
fsaintjacques commented on a change in pull request #7033: URL: https://github.com/apache/arrow/pull/7033#discussion_r415798798 ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; + return defaults; +} + +static inline Result> OpenReader( +const FileSource& source, const CsvFileFormat& format) { + ARROW_ASSIGN_OR_RAISE(auto input, source.Open()); + + auto maybe_reader = csv::StreamingReader::Make( + default_memory_pool(), std::move(input), default_read_options(), + format.parse_options, format.convert_options); + if (!maybe_reader.ok()) { +return maybe_reader.status().WithMessage("Could not open IPC input source '", Review comment: ```suggestion return maybe_reader.status().WithMessage("Could not open CSV input source '", ``` ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. 
See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset/visibility.h" +#include "arrow/result.h" +#include "arrow/util/iterator.h" + +namespace arrow { +namespace dataset { + +using internal::checked_cast; +using internal::checked_pointer_cast; + +static inline csv::ReadOptions default_read_options() { + auto defaults = csv::ReadOptions::Defaults(); + defaults.use_threads = false; Review comment: This warrants a comment explaining why we disable threading. ## File path: cpp/src/arrow/dataset/file_csv.cc ## @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/dataset/file_csv.h" + +#include +#include + +#include "arrow/csv/options.h" +#include "arrow/csv/reader.h" +#include "arrow/dataset/dataset_internal.h" +#include "arrow/dataset/file_base.h" +#include "arrow/dataset/type_fwd.h" +#include "arrow/dataset
[GitHub] [arrow] paddyhoran commented on pull request #7043: ARROW-8598: [Rust] `simd_compare_op` creates buffer of incorrect length
paddyhoran commented on pull request #7043: URL: https://github.com/apache/arrow/pull/7043#issuecomment-619967530 @nevi-me when I added this fix I hadn't looked at your PR removing packed_simd. I introduced this bug, so I wanted to get a fix posted. Also, I wasn't sure what your timelines were for #7037.
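For context on why the buffer length matters here: Arrow's comparison kernels produce a bit-packed boolean buffer, so for `n` input values the output must hold `ceil(n / 8)` bytes regardless of the SIMD lane width used to compute it. A minimal Python sketch of the two size computations involved (illustrative; the real fix is in the Rust `simd_compare_op`):

```python
# Bytes needed for a bit-packed buffer of n_values booleans: the
# output length depends only on n_values, not on the SIMD lane width.

def compare_buffer_len(n_values: int) -> int:
    """Bytes for a bit-packed boolean buffer of n_values entries."""
    return (n_values + 7) // 8

def simd_chunks(n_values: int, lanes: int):
    """Split n_values into full SIMD chunks plus a scalar remainder."""
    full = n_values // lanes
    return full, n_values - full * lanes
```

For example, 1024 comparison results fit in 128 bytes whether the kernel processed them 64 lanes at a time or one by one.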
[GitHub] [arrow] fsaintjacques commented on pull request #7001: ARROW-8602: [C++][CMake] Fix ws2_32 link issue when cross-compiling on Linux
fsaintjacques commented on pull request #7001: URL: https://github.com/apache/arrow/pull/7001#issuecomment-619964385 @davidanthoff feel free to create an account on JIRA so that I can assign you the ticket: https://issues.apache.org/jira/browse/ARROW-8602
[GitHub] [arrow] fsaintjacques commented on pull request #6731: [WIP] ARROW-8601: [Go][Flight] Added implementation of FlightDataWriter
fsaintjacques commented on pull request #6731: URL: https://github.com/apache/arrow/pull/6731#issuecomment-619959812 Created https://jira.apache.org/jira/browse/ARROW-8601 for this.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #6731: feat(flight): Added implementation of FlightDataWriter
fsaintjacques commented on a change in pull request #6731: URL: https://github.com/apache/arrow/pull/6731#discussion_r415775496 ## File path: format/Flight.proto ## @@ -19,8 +19,17 @@ syntax = "proto3"; option java_package = "org.apache.arrow.flight.impl"; +option go_package = "ipc"; Review comment: Can you use the `flight` namespace? There's already a component named `ipc` in the C++ code base, and this could cause confusion.
[GitHub] [arrow] pitrou commented on pull request #6154: ARROW-7531: [C++] Reduce header cost
pitrou commented on pull request #6154: URL: https://github.com/apache/arrow/pull/6154#issuecomment-619959114 Given the diffusion of changes across the codebase, rebasing this wholesale would probably be painful. A better strategy would probably be to redo some of the changes one by one, in separate PRs. I should try to do that someday, but this is also low-priority.
[GitHub] [arrow] fsaintjacques commented on pull request #6645: ARROW-8074: [Dataset][Python] FileFragments from Buffers
fsaintjacques commented on pull request #6645: URL: https://github.com/apache/arrow/pull/6645#issuecomment-619957568 I'll close this for now; ARROW-8318 will remove this limitation, and FileSystemDataset will be created from a list of FileFragment, which themselves can be created from a Buffer-backed FileSource.
[GitHub] [arrow] fsaintjacques edited a comment on pull request #6645: ARROW-8074: [Dataset][Python] FileFragments from Buffers
fsaintjacques edited a comment on pull request #6645: URL: https://github.com/apache/arrow/pull/6645#issuecomment-619957568 I'll close this for now; ARROW-8318 will remove this limitation, and FileSystemDataset will be created from a list of FileFragment, which themselves can be created from a Buffer-backed FileSource. You'll be able to create a Dataset from a mix of files on disk and buffers in memory.
[GitHub] [arrow] fsaintjacques commented on pull request #6154: ARROW-7531: [C++] Reduce header cost
fsaintjacques commented on pull request #6154: URL: https://github.com/apache/arrow/pull/6154#issuecomment-619954274 @pitrou close or rebase?
[GitHub] [arrow] kszucs commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings
kszucs commented on pull request #7026: URL: https://github.com/apache/arrow/pull/7026#issuecomment-619902102 Agree that exposing `ds.field` and `ds.scalar` should be sufficient on the python side.
[GitHub] [arrow] jianxind edited a comment on pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
jianxind edited a comment on pull request #7029: URL: https://github.com/apache/arrow/pull/7029#issuecomment-618855696 cc @emkornfield The AVX512 path is straightforward thanks to the mask_compress/mask_expand helpers provided by AVX512. For potential path-finding of SSE/AVX2, as you pointed out in the Jira, a solution with a fixed lookup table may help; I will work on that when I get the chance, but it will definitely take more time, so I am committing this completed part first. Below is the benchmark data on an AVX512 device before/after the intrinsics:

Before:
BM_PlainEncodingSpacedFloat/1024 1471 ns 1469 ns 476373 bytes_per_second=2.59603G/s
BM_PlainEncodingSpacedDouble/1024 1498 ns 1496 ns 468131 bytes_per_second=5.09834G/s
BM_PlainDecodingSpacedFloat/1024 1266 ns 1265 ns 554320 bytes_per_second=3.01623G/s
BM_PlainDecodingSpacedDouble/1024 920 ns 919 ns 759151 bytes_per_second=8.30509G/s

After:
BM_PlainEncodingSpacedFloat/1024 513 ns 512 ns 1374561 bytes_per_second=7.44416G/s
BM_PlainEncodingSpacedDouble/1024 635 ns 634 ns 1108739 bytes_per_second=12.0322G/s
BM_PlainDecodingSpacedFloat/1024 217 ns 217 ns 3233406 bytes_per_second=17.613G/s
BM_PlainDecodingSpacedDouble/1024 309 ns 309 ns 2267740 bytes_per_second=24.7257G/s
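For readers unfamiliar with "spaced" encoding: the decoder receives densely packed non-null values plus a validity bitmap, and must scatter the values into the slots whose bit is set — exactly the operation AVX-512's mask-expand instructions accelerate. A scalar reference in Python (illustrative, not the C++ kernel):

```python
# Scalar reference for the "spaced" expand that the AVX512 path
# accelerates: dense decoded values are scattered to the slots whose
# validity bit is set; null slots receive a fill value.

def decode_spaced(values, validity, null_fill=None):
    """values: densely packed non-null values (len == popcount(validity));
    validity: sequence of 0/1 flags, one per output slot."""
    out = []
    it = iter(values)
    for bit in validity:
        out.append(next(it) if bit else null_fill)
    return out
```

For example, `decode_spaced([1.5, 2.5, 3.5], [1, 0, 1, 0, 1])` expands three dense values into a five-slot output with nulls in between; the AVX-512 kernel does the same per 512-bit register with a single masked expand.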
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings
jorisvandenbossche commented on a change in pull request #7026: URL: https://github.com/apache/arrow/pull/7026#discussion_r415703472 ## File path: python/pyarrow/_dataset.pyx ## @@ -269,20 +454,21 @@ cdef class FileSystemDataset(Dataset): cdef: CFileSystemDataset* filesystem_dataset -def __init__(self, paths_or_selector, schema=None, format=None, - filesystem=None, partitions=None, root_partition=None): +def __init__(self, paths_or_selector, schema, format, filesystem, + partitions=None, Expression root_partition=_true): Review comment: Can you leave this as it was? This was done on purpose to provide decent error messages (see discussion in https://github.com/apache/arrow/pull/6913, I am happy to find a better way though)
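The error-message pattern being defended — accept `None` defaults and validate explicitly, so a missing argument produces a targeted message rather than a generic TypeError — can be sketched in plain Python. The `make_dataset` helper below is illustrative, not the pyarrow constructor:

```python
# Sketch of "None defaults + explicit validation" for decent error
# messages. Names are illustrative, not the pyarrow API.

def make_dataset(paths, schema=None, format=None, filesystem=None):
    for name, value in [("schema", schema), ("format", format),
                        ("filesystem", filesystem)]:
        if value is None:
            # The caller sees exactly which argument is missing.
            raise ValueError(
                "Argument '{}' is required for FileSystemDataset".format(name))
    return {"paths": paths, "schema": schema,
            "format": format, "filesystem": filesystem}
```

With required positional parameters instead, a missing argument raises a generic arity error, which is the trade-off under discussion.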
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r415667244 ## File path: cpp/src/jni/dataset/jni_wrapper.cpp ## @@ -0,0 +1,577 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "concurrent_map.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/cast.h" +#include "arrow/compute/kernels/compare.h" +#include "jni/dataset/Types.pb.h" + +#include "org_apache_arrow_dataset_jni_JniWrapper.h" +#include "org_apache_arrow_dataset_file_JniWrapper.h" + +static jclass illegal_access_exception_class; +static jclass illegal_argument_exception_class; +static jclass runtime_exception_class; + +static jclass record_batch_handle_class; +static jclass record_batch_handle_field_class; +static jclass record_batch_handle_buffer_class; + +static jmethodID record_batch_handle_constructor; +static jmethodID record_batch_handle_field_constructor; +static jmethodID record_batch_handle_buffer_constructor; + +static jint JNI_VERSION = JNI_VERSION_1_6; + +using arrow::jni::ConcurrentMap; + +static ConcurrentMap> dataset_factory_holder_; +static ConcurrentMap> dataset_holder_; +static ConcurrentMap> scan_task_holder_; +static ConcurrentMap> scanner_holder_; +static ConcurrentMap> iterator_holder_; +static ConcurrentMap> buffer_holder_; + +#define JNI_ASSIGN_OR_THROW_NAME(x, y) ARROW_CONCAT(x, y) + +#define JNI_ASSIGN_OR_THROW_IMPL(t, lhs, rexpr) \ + auto t = (rexpr); \ + if (!t.status().ok()) { \ +env->ThrowNew(runtime_exception_class, t.status().message().c_str()); \ + } \ + lhs = std::move(t).ValueOrDie(); + +#define JNI_ASSIGN_OR_THROW(lhs, rexpr) \ + JNI_ASSIGN_OR_THROW_IMPL(JNI_ASSIGN_OR_THROW_NAME(_tmp_var, __COUNTER__), lhs, rexpr) + +#define JNI_ASSERT_OK_OR_THROW(expr) \ + do {\ +auto _res = (expr); \ +arrow::Status _st = ::arrow::internal::GenericToStatus(_res); \ +if (!_st.ok()) { \ + env->ThrowNew(runtime_exception_class, _st.message().c_str()); \ +} \ + } while (false); + +jclass CreateGlobalClassReference(JNIEnv* env, const char* class_name) { + jclass local_class = env->FindClass(class_name); + jclass global_class = 
(jclass)env->NewGlobalRef(local_class); + env->DeleteLocalRef(local_class); + return global_class; +} + +jmethodID GetMethodID(JNIEnv* env, jclass this_class, const char* name, const char* sig) { + jmethodID ret = env->GetMethodID(this_class, name, sig); + if (ret == nullptr) { +std::string error_message = "Unable to find method " + std::string(name) + +" within signature" + std::string(sig); +env->ThrowNew(illegal_access_exception_class, error_message.c_str()); + } + return ret; +} + +jint JNI_OnLoad(JavaVM* vm, void* reserved) { + JNIEnv* env; + if (vm->GetEnv(reinterpret_cast(&env), JNI_VERSION) != JNI_OK) { +return JNI_ERR; + } + + illegal_access_exception_class = + CreateGlobalClassReference(env, "Ljava/lang/IllegalAccessException;"); + illegal_argument_exception_class = + CreateGlobalClassReference(env, "Ljava/lang/IllegalArgumentException;"); + runtime_exception_class = + CreateGlobalClassReference(env, "Ljava/lang/RuntimeException;"); + + record_batch_handle_class = + CreateGlobalClassReference(env, "Lorg/apache/arrow/dataset/jni/NativeRecordBatchHandle;"); + record_batch_handle_field_class = + CreateGlobalClassReference(env, "Lorg/apache/arrow/dataset/jni/NativeRecordBatchHandle$Field;"); + record_batch_handle_buffer_class = + CreateGlobalClassReference(env, "Lorg/apache/arrow/dataset
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r415664616 ## File path: cpp/src/jni/dataset/proto/Types.proto ## @@ -0,0 +1,149 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +syntax = "proto2"; +package types; + +option java_package = "org.apache.arrow.dataset"; Review comment: Currently it is for passing dataset filters via JNI. I see Wes has recommended not to include filter bindings (in the bottom comment) in this version. I may consider removing it in a later commit.
[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on a change in pull request #7030: URL: https://github.com/apache/arrow/pull/7030#discussion_r415661628 ## File path: cpp/src/jni/dataset/jni_wrapper.cpp ## @@ -0,0 +1,577 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "concurrent_map.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/cast.h" +#include "arrow/compute/kernels/compare.h" +#include "jni/dataset/Types.pb.h" + +#include "org_apache_arrow_dataset_jni_JniWrapper.h" +#include "org_apache_arrow_dataset_file_JniWrapper.h" + +static jclass illegal_access_exception_class; +static jclass illegal_argument_exception_class; +static jclass runtime_exception_class; + +static jclass record_batch_handle_class; +static jclass record_batch_handle_field_class; +static jclass record_batch_handle_buffer_class; + +static jmethodID record_batch_handle_constructor; +static jmethodID record_batch_handle_field_constructor; +static jmethodID record_batch_handle_buffer_constructor; + +static jint JNI_VERSION = JNI_VERSION_1_6; + +using arrow::jni::ConcurrentMap; + +static ConcurrentMap> dataset_factory_holder_; Review comment: done
[GitHub] [arrow] nevi-me commented on pull request #7043: ARROW-8598: [Rust] `simd_compare_op` creates buffer of incorrect length
nevi-me commented on pull request #7043: URL: https://github.com/apache/arrow/pull/7043#issuecomment-619853175 @paddyhoran do we need to worry about this, as it'd get removed by #7037?
[GitHub] [arrow] vertexclique commented on a change in pull request #7036: ARROW-8591: [Rust] Reverse lookup for a key in DictionaryArray
vertexclique commented on a change in pull request #7036: URL: https://github.com/apache/arrow/pull/7036#discussion_r415653220 ## File path: rust/arrow/src/array/array.rs ## @@ -1786,38 +1786,34 @@ impl From<(Vec<(Field, ArrayRef)>, Buffer, usize)> for StructArray { /// This is mostly used to represent strings or a limited set of primitive types as integers, /// for example when doing NLP analysis or representing chromosomes by name. /// -/// Example with nullable data: +/// Example **with nullable** data: /// /// ``` -/// use arrow::array::DictionaryArray; -/// use arrow::datatypes::Int8Type; -/// let test = vec!["a", "a", "b", "c"]; -/// let array : DictionaryArray = test.iter().map(|&x| if x == "b" {None} else {Some(x)}).collect(); -/// assert_eq!(array.keys().collect::>>(), vec![Some(0), Some(0), None, Some(1)]); +/// use arrow::array::DictionaryArray; +/// use arrow::datatypes::Int8Type; +/// let test = vec!["a", "a", "b", "c"]; +/// let array : DictionaryArray = test.iter().map(|&x| if x == "b" {None} else {Some(x)}).collect(); +/// assert_eq!(array.keys().collect::>>(), vec![Some(0), Some(0), None, Some(1)]); /// ``` /// -/// Example without nullable data: +/// Example **without nullable** data: /// /// ``` -/// -/// use arrow::array::DictionaryArray; -/// use arrow::datatypes::Int8Type; -/// let test = vec!["a", "a", "b", "c"]; -/// let array : DictionaryArray = test.into_iter().collect(); -/// assert_eq!(array.keys().collect::>>(), vec![Some(0), Some(0), Some(1), Some(2)]); +/// use arrow::array::DictionaryArray; +/// use arrow::datatypes::Int8Type; +/// let test = vec!["a", "a", "b", "c"]; +/// let array : DictionaryArray = test.into_iter().collect(); +/// assert_eq!(array.keys().collect::>>(), vec![Some(0), Some(0), Some(1), Some(2)]); /// ``` pub struct DictionaryArray { -// Array of keys, much like a PrimitiveArray +/// Array of keys, much like a PrimitiveArray data: ArrayDataRef, -// Pointer to the key values. +/// Pointer to the key values. 
raw_values: RawPtrBox, -// Array of any values. +/// Array of any values. values: ArrayRef, - Review comment: Brought it back.
[GitHub] [arrow] rdettai commented on a change in pull request #6935: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files
rdettai commented on a change in pull request #6935: URL: https://github.com/apache/arrow/pull/6935#discussion_r415644697 ## File path: rust/parquet/src/column/reader.rs ## @@ -190,15 +190,12 @@ impl ColumnReaderImpl { (self.num_buffered_values - self.num_decoded_values) as usize, ); -// Adjust batch size by taking into account how much space is left in -// values slice or levels slices (if available) -adjusted_size = min(adjusted_size, values.len() - values_read); -if let Some(ref levels) = def_levels { -adjusted_size = min(adjusted_size, levels.len() - levels_read); -} -if let Some(ref levels) = rep_levels { -adjusted_size = min(adjusted_size, levels.len() - levels_read); -} +// Adjust batch size by taking into account how much data there +// to read. As batch_size is also smaller than value and level +// slices (if available), this ensures that available space is not +// exceeded. +adjusted_size = min(adjusted_size, batch_size - values_read); Review comment: Is it clearer now, or do you need me to find a better example?
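The clamping logic being discussed can be shown as a standalone function: the batch size is limited both by the values still buffered in the current page and by the room the caller has left, and since `batch_size` also bounds the value/level slices, the single `batch_size - values_read` clamp keeps all output slices safe. A Python sketch (illustrative; the real code lives in `rust/parquet/src/column/reader.rs`):

```python
# Illustrative sketch of the batch-size clamping discussed above.
# `buffered` / `decoded` mirror num_buffered_values / num_decoded_values.

def adjusted_batch_size(batch_size, values_read, buffered, decoded):
    # Values still available in the current page:
    adjusted = min(batch_size, buffered - decoded)
    # Never exceed what the caller still has room for in this call;
    # batch_size also bounds the value and level slices, so this one
    # clamp replaces the per-slice min() calls from the old code.
    return min(adjusted, batch_size - values_read)
```

For example, with 60 values still wanted (`batch_size=100`, `values_read=40`) and 100 values left in the page, the adjusted size is 60, not 100.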
[GitHub] [arrow] rdettai commented on pull request #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)
rdettai commented on pull request #6949: URL: https://github.com/apache/arrow/pull/6949#issuecomment-619834304 I completely agree! What I am saying is that the layer that makes the handle thread-safe cannot be the `FileSource`, which is in charge of tracking the reading position of one specific reading thread. But this question of multi-threading seems to be a whole other concern :-) I've made the _nit_ changes you proposed and improved the tests.
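The buffer-discarding behaviour behind this PR is easy to model: a seek whose target still falls inside the buffered window only needs to move the cursor, while any other seek invalidates the buffer. (Plain `Seek` on Rust's `BufReader` discards the buffer unconditionally, which is what ARROW-7681 works around.) A pure-Python model, not the actual `FileSource` code:

```python
# Model of a buffered reader window: seeks inside the window just move
# the cursor; seeks outside it discard the buffer.

class BufferedWindow:
    def __init__(self, start, data):
        self.start = start      # file offset of data[0]
        self.data = data        # buffered bytes
        self.pos = 0            # cursor within data

    def seek(self, target):
        """Return True if served from the buffer, False if discarded."""
        if self.start <= target < self.start + len(self.data):
            self.pos = target - self.start
            return True
        self.data = b""         # buffer no longer valid
        self.pos = 0
        self.start = target
        return False
```

A window buffering offsets 100..110 serves a seek to 105 from memory, while a seek to 95 throws the buffer away and would trigger a fresh read.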
[GitHub] [arrow] durch commented on pull request #7042: ARROW-8597 [Rust] Lints and readability improvements for arrow crate
durch commented on pull request #7042: URL: https://github.com/apache/arrow/pull/7042#issuecomment-619798711 @nevi-me I left it as is for now; I did not want to actually write any new code in this PR. After this one gets merged I'll make another pass and write some code: in addition to `is_empty` there is also a low-hanging `Default` impl, and possibly a few more pedantic changes :).
[GitHub] [arrow] nevi-me commented on pull request #6898: ARROW-8399: [Rust] Extend memory alignments to include other architectures
nevi-me commented on pull request #6898: URL: https://github.com/apache/arrow/pull/6898#issuecomment-619797480 > > I am using this code in various places, but if it is against the spec or not useful for the improvements we can drop it off. > > As @pitrou says: > > > It's a recommendation, not an obligation. If you decide that specific architectures deserve a specific alignment value, then IMHO it's not a problem. > > I'm all for the changes if they improve performance but this is at the limit of my understanding. @andygrove @nevi-me thoughts? I don't know much about alignment, but I'm happy to claw back performance (esp after we remove packed_simd). I'll rebase this against the PR to drop packed_simd, then check the performance difference.
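The alignment arithmetic behind this discussion is compact enough to state directly: sizes are rounded up to a power-of-two alignment with a mask, and an address is aligned when it is a multiple of that alignment. 64 bytes is the usual x86-64 cache line; the PR's point is that other architectures may warrant other values. Helper names below are illustrative:

```python
# Power-of-two alignment arithmetic: round a size up to the next
# multiple of `alignment`, and check whether an address is aligned.

def align_up(size: int, alignment: int) -> int:
    assert alignment and (alignment & (alignment - 1)) == 0, "power of two"
    return (size + alignment - 1) & ~(alignment - 1)

def is_aligned(addr: int, alignment: int) -> bool:
    return addr % alignment == 0
```

For example, a 65-byte allocation rounds up to 128 bytes under 64-byte alignment but to 128 under 128-byte alignment too, while 64 bytes is already a multiple of 64 and stays put.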