[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7027: ARROW-8572: [Python] expose UnionArray fields to Python

2020-04-27 Thread GitBox


jorisvandenbossche commented on a change in pull request #7027:
URL: https://github.com/apache/arrow/pull/7027#discussion_r416374325



##
File path: python/pyarrow/array.pxi
##
@@ -1720,6 +1720,39 @@ cdef class UnionArray(Array):
 Concrete class for Arrow arrays of a Union data type.
 """
 
+def child(self, int pos):
+"""
+Return the given child array as an individual array.
+
+For sparse unions, the returned array has its offset, length,
+and null count adjusted.
+
+For dense unions, the returned array is unchanged.
+"""
+cdef shared_ptr[CArray] result
+result = ( self.ap).child(pos)
+if result != NULL:
+return pyarrow_wrap_array(result)
+raise KeyError("UnionArray does not have child {}".format(pos))
+
+@property
+def type_codes(self):
+"""Get the type codes array."""
+buf = pyarrow_wrap_buffer(( self.ap).type_codes())
+return Array.from_buffers(int8(), len(self), [None, buf])
+
+@property
+def value_offsets(self):

Review comment:
   For ListArray, this is called "offsets" (not sure if we prefer 
consistency within the python api, or consistency with the c++ api)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #6963: ARROW-8509: [Plasma][GLib] GArrowRecordBatch <-> GArrowBuffer conversion functions

2020-04-27 Thread GitBox


kou commented on pull request #6963:
URL: https://github.com/apache/arrow/pull/6963#issuecomment-620412397


   I take over this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r416295541



##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   Thanks, I think I would understand any of the concerns against a too 
detailed mapping. (actually currently Fragment is not mapped to Java-side so 
it's not yet a 1-1 mapping) I'll try remove more JNI stuffs.
   
   As for filter, If I understand correctly it's OK to keep Java API but the 
JNI mapping for filters is considered fragile, right? I can remove the mapping 
anyway but when users read parquet files from Java they'll not be able to 
filter row groups to reduce I/O. Which is extremely important when 
low-selectivity filter is specified.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rtyler commented on pull request #7049: Avoid loading simd_load_set_invalid which doesn't exist on aarch64

2020-04-27 Thread GitBox


rtyler commented on pull request #7049:
URL: https://github.com/apache/arrow/pull/7049#issuecomment-620377892


   Fixing this also helps surface 
[ARROW-8610](https://issues.apache.org/jira/browse/ARROW-8610), which I have 
had a hell of a time fixing



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rtyler opened a new pull request #7049: Avoid loading simd_load_set_invalid which doesn't exist on aarch64

2020-04-27 Thread GitBox


rtyler opened a new pull request #7049:
URL: https://github.com/apache/arrow/pull/7049


   error[E0432]: unresolved import 
`crate::compute::util::simd_load_set_invalid`
   --> 
/home/tyler/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/2a8e37d/rust/arrow/src/compute/kernels/arithmetic.rs:42:5
   |
   42 | use crate::compute::util::simd_load_set_invalid;
   | ^^^ no 
`simd_load_set_invalid` in `compute::util`
   
   Compiling thiserror v1.0.16
   error: aborting due to previous error



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r416295541



##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   Thanks, as I knew some of the base design purposes of datasets API, 
dataframe API, and query engine system, so I think I would understand any of 
the concerns against a too detailed mapping. (actually currently Fragment is 
not mapped to Java-side so it's not yet a 1-1 mapping) I'll try remove more JNI 
stuffs.
   
   As for filter, If I understand correctly it's OK to keep Java API but the 
JNI mapping for filters is considered fragile, right? I can remove the mapping 
anyway but when users read parquet files from Java they'll not be able to 
filter row groups to reduce I/O. Which is extremely important when 
low-selectivity filter is specified.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r416295541



##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   Thanks, as I knew some of the base design purposes of datasets API, 
dataframe API, and query engine system, so I think I would understand any of 
the concerns against a too detailed mapping. I'll try remove more JNI stuffs.
   
   As for filter, If I understand correctly it's OK to keep Java API but the 
JNI mapping for filters is considered fragile, right? I can remove the mapping 
anyway but when users read parquet files from Java they'll not be able to 
filter row groups to reduce I/O. Which is extremely important when 
low-selectivity filters is specified.

##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   Thanks, as I knew some of the base design purposes of datasets API, 
dataframe API, and query engine system, so I think I would understand any of 
the concerns against a too detailed mapping. I'll try remove more JNI stuffs.
   
   As for filter, If I understand correctly it's OK to keep Java API but the 
JNI mapping for filters is considered fragile, right? I can remove the mapping 
anyway but when users read parquet files from Java they'll not be able to 
filter row groups to reduce I/O. Which is extremely important when 
low-selectivity filter is specified.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-27 Thread GitBox


liyafan82 commented on pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#issuecomment-620338202


   @BryanCutler @emkornfield Thanks a lot for your effort. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7048: ARROW-8609: [C++] fix orc jni crash

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7048:
URL: https://github.com/apache/arrow/pull/7048#issuecomment-620323735


   https://issues.apache.org/jira/browse/ARROW-8609



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhouyuan commented on pull request #6953: [Java] Reproduce ORC JNI binding crash [WIP]

2020-04-27 Thread GitBox


zhouyuan commented on pull request #6953:
URL: https://github.com/apache/arrow/pull/6953#issuecomment-620321153


   @kszucs 
   ran into similar issue on parquet side, not sure if this #7048 works  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhouyuan opened a new pull request #7048: ARROW-8609: [C++] fix orc jni crash

2020-04-27 Thread GitBox


zhouyuan opened a new pull request #7048:
URL: https://github.com/apache/arrow/pull/7048


   check if arrow buffer is null before passing to the constructor
   
   Signed-off-by: Yuan Zhou 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416253903



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   I'll add a failing test tomorrow so this can become less hand wavy





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416239202



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   The edge case here is due to the way we pass only the projected schema 
during a scan, so maybe the solution is to always pass the full dataset schema 
as well so that no materialized fields will be inferring at scan time





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416235260



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   Reading using a predetermined schema is already supported for CSV and 
for all the other formats. The first chunk (as determined by 
`CsvFileFormat::chunk_size`) is used to sniff the schema which is assumed for 
the rest of the file, and this inspection can be applied to just one or to all 
the files in a dataset at discovery time.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#issuecomment-620266004


   @eerhardt thanks for the excellent review and feedback. I believe that I 
have pushed fixes for everything.
   
   I also noticed I had missed validity buffer state management in 
`PrimitiveArrayBuilder`'s `Set`, `Swap` and `Clear` methods. I've added those 
in.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6154: ARROW-7531: [C++] Reduce header cost

2020-04-27 Thread GitBox


wesm commented on pull request #6154:
URL: https://github.com/apache/arrow/pull/6154#issuecomment-620248207


   There's a lot of good changes here. I agree that starting fresh and making 
smaller PRs is the path forward. I'm going to close this for now if that's OK 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416165482



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -48,21 +48,28 @@ public BinaryArray(ArrowTypeId typeId, ArrayData data)
 data.EnsureBufferCount(3);
 }
 
-public abstract class BuilderBase : 
IArrowArrayBuilder
+public abstract class BuilderBase : 
IArrowArrayBuilder

Review comment:
   Early on in the effort I was trying to get a more generalized 
BuilderBase, and one that could be more easily re-used by end users (such as 
myself) who might be interested in implementing a custom type. It didn't really 
end working out that way, however, so I have reverted it back.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings

2020-04-27 Thread GitBox


nealrichardson commented on pull request #7026:
URL: https://github.com/apache/arrow/pull/7026#issuecomment-620245767


   @github-actions autotune everything



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416155042



##
File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs
##
@@ -153,17 +182,25 @@ private void CheckIndex(int index)
 new[] { nullBitmapBuffer, valueBuffer }))
 { }
 
-public BooleanArray(ArrayData data) 
+public BooleanArray(ArrayData data)
 : base(data)
 {
 data.EnsureDataType(ArrowTypeId.Boolean);
 }
 
 public override void Accept(IArrowArrayVisitor visitor) => 
Accept(this, visitor);
 
+[Obsolete("GetBoolean does not support null values. Use GetValue 
instead (which this method invokes internally).")]

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416154898



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
 {
 ValueOffsets.Append(Offset);
 
-var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 
0,
-new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), 
ValueBuffer.Build(allocator) });
+var data = new ArrayData(DataType, ValueOffsets.Length - 1, 
NullCount, 0,
+new[] { ValidityBuffer.Build(allocator).ValueBuffer, 
ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
 return Build(data);
 }
 
-public TBuilder Append(byte value)
+public TBuilder AppendNull()
+{
+ValueOffsets.Append(Offset);
+Offset++;
+NullCount++;
+return Instance;
+}
+
+public TBuilder Append(T value)
 {
 ValueOffsets.Append(Offset);
 ValueBuffer.Append(value);
 Offset++;
+ValidityBuffer.Append(true);
 return Instance;
 }
 
-public TBuilder Append(ReadOnlySpan span)
+public TBuilder Append(ReadOnlySpan span)
 {
 ValueOffsets.Append(Offset);
 ValueBuffer.Append(span);
-Offset += span.Length;
+ValidityBuffer.Append(true);
+Offset += span == Span.Empty
+? 1
+: span.Length;
 return Instance;
 }
 
-public TBuilder AppendRange(IEnumerable values)
+public TBuilder AppendRange(IEnumerable values)
 {
 foreach (var arr in values)
 {
+if (arr != null && arr.Length > 0)

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7047:
URL: https://github.com/apache/arrow/pull/7047#issuecomment-620239462


   https://issues.apache.org/jira/browse/ARROW-8607



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


wesm commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416152095



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   In most database systems, it is required to use a pre-determined schema 
when reading CSV files. If we want to support schema inference, we could use a 
part of a file to "sniff" the schema but then assume it going forward. That's a 
bit better than forcing people to write down the schema





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


wesm commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416152095



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   In database systems, it is required to use a pre-determined schema when 
reading CSV files. If we want to support schema inference, we could use a part 
of a file to "sniff" the schema but then assume it going forward. That's a bit 
better than forcing people to write down the schema





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416150841



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
 {
 ValueOffsets.Append(Offset);
 
-var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 
0,
-new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), 
ValueBuffer.Build(allocator) });
+var data = new ArrayData(DataType, ValueOffsets.Length - 1, 
NullCount, 0,
+new[] { ValidityBuffer.Build(allocator).ValueBuffer, 
ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
 return Build(data);
 }
 
-public TBuilder Append(byte value)
+public TBuilder AppendNull()
+{
+ValueOffsets.Append(Offset);
+Offset++;

Review comment:
   We do not. You are correct.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7047:
URL: https://github.com/apache/arrow/pull/7047#issuecomment-620234261


   Revision: ee605c39e96f6750e72f1b2d10b929ebffee4015
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-167](https://github.com/ursa-labs/crossbow/branches/all?query=actions-167)
   
   |Task|Status|
   ||--|
   
|homebrew-r-autobrew|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-167-travis-homebrew-r-autobrew.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|test-conda-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-conda-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-conda-r-3.6)|
   |test-r-linux-as-cran|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-167-github-test-r-linux-as-cran)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-167-github-test-r-linux-as-cran)|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-167-azure-test-r-rstudio-r-base-3.6-opensuse42)|
   
|test-ubuntu-18.04-r-3.6|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-ubuntu-18.04-r-3.6.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-ubuntu-18.04-r-3.6)|
   
|test-ubuntu-18.04-r-sanitizer|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-167-circle-test-ubuntu-18.04-r-sanitizer.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-167-circle-test-ubuntu-18.04-r-sanitizer)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7021: Wrap docker-compose commands with archery

2020-04-27 Thread GitBox


wesm commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-620234647


   > it's pretty straightforward to use docker-compose to reproduce these 
builds locally
   
   I see both sides of this -- with some builds requiring 3 or more 
`docker-compose build` invocations and relying on various environment variables 
/ parameters, there's a lot of room for frustration and user error (speaking 
from experience). I'm also pretty anti-copypasta in build configurations -- 
having all the build logic tightly coupled to GHA seems to go against some of 
the spirit of our prior decoupling from Travis CI, Appveyor. 
   
   Of course, everything has to be well-documented and able to be serviced 
successfully by most Arrow maintainers



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson opened a new pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread GitBox


nealrichardson opened a new pull request #7047:
URL: https://github.com/apache/arrow/pull/7047


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7047: ARROW-8607: [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread GitBox


nealrichardson commented on pull request #7047:
URL: https://github.com/apache/arrow/pull/7047#issuecomment-620233527


   @github-actions crossbow submit -g r



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread GitBox


nealrichardson commented on pull request #7046:
URL: https://github.com/apache/arrow/pull/7046#issuecomment-620222893


   Rust lint failure is clearly not related.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416116593



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -48,21 +48,28 @@ public BinaryArray(ArrowTypeId typeId, ArrayData data)
 data.EnsureBufferCount(3);
 }
 
-public abstract class BuilderBase : 
IArrowArrayBuilder
+public abstract class BuilderBase : 
IArrowArrayBuilder
+where T: struct
 where TArray : IArrowArray
-where TBuilder : class, IArrowArrayBuilder
+where TBuilder : class, IArrowArrayBuilder
 {
+
 protected IArrowType DataType { get; }
 protected TBuilder Instance => this as TBuilder;
 protected ArrowBuffer.Builder ValueOffsets { get; }
-protected ArrowBuffer.Builder ValueBuffer { get; }
+protected ArrowBuffer.Builder ValueBuffer { get; }
+protected BooleanArray.Builder ValidityBuffer { get; }
+
 protected int Offset { get; set; }
+protected int ValidityOffset { get; set; }

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416115777



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
 {
 ValueOffsets.Append(Offset);
 
-var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 
0,
-new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), 
ValueBuffer.Build(allocator) });
+var data = new ArrayData(DataType, ValueOffsets.Length - 1, 
NullCount, 0,
+new[] { ValidityBuffer.Build(allocator).ValueBuffer, 
ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
 return Build(data);
 }
 
-public TBuilder Append(byte value)
+public TBuilder AppendNull()
+{
+ValueOffsets.Append(Offset);
+Offset++;
+NullCount++;
+return Instance;
+}
+
+public TBuilder Append(T value)
 {
 ValueOffsets.Append(Offset);
 ValueBuffer.Append(value);
 Offset++;
+ValidityBuffer.Append(true);
 return Instance;
 }
 
-public TBuilder Append(ReadOnlySpan span)
+public TBuilder Append(ReadOnlySpan span)
 {
 ValueOffsets.Append(Offset);
 ValueBuffer.Append(span);
-Offset += span.Length;
+ValidityBuffer.Append(true);
+Offset += span == Span.Empty
+? 1

Review comment:
   Fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416115591



##
File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs
##
@@ -149,8 +180,8 @@ public void ProducesExpectedArray()
 
 Assert.IsAssignableFrom(array);
 Assert.NotNull(array);
-Assert.Equal(3, array.Length);
-Assert.Equal(0, array.NullCount);
+Assert.Equal(expectedLength, array.Length);
+Assert.Equal(expectedNullCount, array.NullCount);

Review comment:
   Good idea, and done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416111304



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
 public TArray Build(MemoryAllocator allocator = default)
 {
 return Build(
-ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-ValueBuffer.Length, 0, 0);
+ValueBuffer.Build(allocator), 
ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:
   > That said, Build could easily pass an empty buffer if NullCount == 0. 
   
   Yes that would work well. It is what the C++ implementation does:
   
   
https://github.com/apache/arrow/blob/3064a27cf284726ce08b98b14725b2f9d7ef5ea2/cpp/src/arrow/array.cc#L59-L62





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416110664



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   That's indeed a data error. Whether parquet or csv, in C++ this would be 
caught by the ScannerBuilder::Filter setter (which *does* have access to the 
dataset's schema), so maybe this edge case isn't interesting after all





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings

2020-04-27 Thread GitBox


bkietz commented on pull request #7026:
URL: https://github.com/apache/arrow/pull/7026#issuecomment-620196068


   @kszucs @jorisvandenbossche PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


pitrou commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416099536



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   So if e.g. two Parquet files in a dataset have different types for `a` 
or `b` (which may be a data error), that passes silently?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] chrish42 commented on a change in pull request #7025: ARROW-2260: [C++][Plasma] Use Gflags for command-line parsing

2020-04-27 Thread GitBox


chrish42 commented on a change in pull request #7025:
URL: https://github.com/apache/arrow/pull/7025#discussion_r416057944



##
File path: cpp/src/plasma/store.cc
##
@@ -1207,65 +1211,77 @@ void StartServer(char* socket_name, std::string 
plasma_directory, bool hugepages
   g_runner->Start(socket_name, plasma_directory, hugepages_enabled, 
external_store);
 }
 
+void UsageError(const char* error_msg, int exit_code=1) {
+  std::cerr << gflags::ProgramInvocationShortName() << ": " << error_msg << 
std::endl;
+  exit(exit_code);
+}
+
 }  // namespace plasma
 
+#ifdef __linux__
+#define SHM_DEFAULT_PATH "/dev/shm"
+#else
+#define SHM_DEFAULT_PATH "/tmp"
+#endif
+
+// Command-line flags.
+DEFINE_string(d, SHM_DEFAULT_PATH, "directory where to create the 
memory-backed file");
+DEFINE_string(e, "", "endpoint for external storage service, where objects "
+"evicted from Plasma store can be written to, optional");
+DEFINE_bool(h, false, "whether to enable hugepage support");
+DEFINE_string(s, "", "socket name where the Plasma store will listen for 
requests, required");
+DEFINE_string(m, "", "amount of memory in bytes to use for Plasma store, 
required");
+
 int main(int argc, char* argv[]) {
   ArrowLog::StartArrowLog(argv[0], ArrowLogLevel::ARROW_INFO);
   ArrowLog::InstallFailureSignalHandler();
+
+  gflags::SetUsageMessage("Shared-memory server for Arrow data.\nUsage: ");
+  gflags::SetVersionString("TODO");
+
   char* socket_name = nullptr;
   // Directory where plasma memory mapped files are stored.
   std::string plasma_directory;
   std::string external_store_endpoint;
   bool hugepages_enabled = false;
   int64_t system_memory = -1;
-  int c;
-  while ((c = getopt(argc, argv, "s:m:d:e:h")) != -1) {
-switch (c) {
-  case 'd':
-plasma_directory = std::string(optarg);
-break;
-  case 'e':
-external_store_endpoint = std::string(optarg);
-break;
-  case 'h':
-hugepages_enabled = true;
-break;
-  case 's':
-socket_name = optarg;
-break;
-  case 'm': {
-char extra;
-int scanned = sscanf(optarg, "%" SCNd64 "%c", &system_memory, &extra);
-ARROW_CHECK(scanned == 1);
-// Set system memory capacity
-
plasma::PlasmaAllocator::SetFootprintLimit(static_cast(system_memory));
-ARROW_LOG(INFO) << "Allowing the Plasma store to use up to "
-<< static_cast(system_memory) / 10
-<< "GB of memory.";
-break;
-  }
-  default:
-exit(-1);
-}
+
+  gflags::ParseCommandLineFlags(&argc, &argv, true);
+  plasma_directory = FLAGS_d;
+  external_store_endpoint = FLAGS_e;
+  hugepages_enabled = FLAGS_h;
+  if (!FLAGS_s.empty()) {
+// We only check below if socket_name is null, so don't set it if the flag 
was empty.
+socket_name = const_cast(FLAGS_s.c_str());
+  }
+
+  if (!FLAGS_m.empty()) {
+char extra;
+int scanned = sscanf(FLAGS_m.c_str(), "%" SCNd64 "%c", &system_memory, 
&extra);
+ARROW_CHECK(scanned == 1);
+// Set system memory capacity
+
plasma::PlasmaAllocator::SetFootprintLimit(static_cast(system_memory));
+ARROW_LOG(INFO) << "Allowing the Plasma store to use up to "
+<< static_cast(system_memory) / 10
+<< "GB of memory.";
   }
+
   // Sanity check command line options.
-  if (!socket_name) {
-ARROW_LOG(FATAL) << "please specify socket for incoming connections with 
-s switch";
+  if (!socket_name && system_memory == -1) {

Review comment:
   I'm checking here if both are wrong to give a better error message, so 
people who run the program without arguments get it right the second time, 
instead of getting an error telling them about `-s`, only to then get a second 
error telling them about `-m`. But I guess I could add a little comment about 
that...

##
File path: cpp/src/plasma/store.cc
##
@@ -1207,65 +1211,77 @@ void StartServer(char* socket_name, std::string 
plasma_directory, bool hugepages
   g_runner->Start(socket_name, plasma_directory, hugepages_enabled, 
external_store);
 }
 
+void UsageError(const char* error_msg, int exit_code=1) {
+  std::cerr << gflags::ProgramInvocationShortName() << ": " << error_msg << 
std::endl;
+  exit(exit_code);
+}
+
 }  // namespace plasma
 
+#ifdef __linux__
+#define SHM_DEFAULT_PATH "/dev/shm"
+#else
+#define SHM_DEFAULT_PATH "/tmp"
+#endif
+
+// Command-line flags.
+DEFINE_string(d, SHM_DEFAULT_PATH, "directory where to create the 
memory-backed file");
+DEFINE_string(e, "", "endpoint for external storage service, where objects "
+"evicted from Plasma store can be written to, optional");
+DEFINE_bool(h, false, "whether to enable hugepage support");
+DEFINE_string(s, "", "socket name where the Plasma store will listen for 
requests, required");
+DEFINE_string(m, "", "amount of memory in bytes to use for Plasma store, 
required");
+
 int main(int argc, c

[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416097760



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   The type of field "a" can't be known without the schema from which it 
originates. When the filter is applied to a dataset in python or R the 
dataset's schema will be used to add cast expressions to the filter which will 
handle the necessary conversions. However in c++ we don't pass the dataset 
schema at scan time





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


pitrou commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416093398



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   That sounds nasty indeed. The datasets layer doesn't know the type of 
these columns at all?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7026:
URL: https://github.com/apache/arrow/pull/7026#discussion_r416092830



##
File path: python/pyarrow/_dataset.pyx
##
@@ -269,20 +454,21 @@ cdef class FileSystemDataset(Dataset):
 cdef:
 CFileSystemDataset* filesystem_dataset
 
-def __init__(self, paths_or_selector, schema=None, format=None,
- filesystem=None, partitions=None, root_partition=None):
+def __init__(self, paths_or_selector, schema, format, filesystem,
+ partitions=None, Expression root_partition=_true):

Review comment:
   reverted





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416092312



##
File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs
##
@@ -24,25 +24,38 @@ public class BooleanArray: Array
 {
 public class Builder : IArrowArrayBuilder
 {
-private ArrowBuffer.Builder ValueBuffer { get; }

Review comment:
   Remnants of testing which were refactored away. Fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416090623



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   PTAL. There's a subtlety resulting from the loosely typed nature of CSV: 
we don't know the types of columns materialized but not projected, which could 
lead to some very nasty errors in edge cases. For example, if filter were `"a"_ 
== "b"_` but only column `"c"` were projected then we have no information about 
the types of fields a, b except that they must be identical. I'm not sure what 
the correct solution here is





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416089870



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
 public TArray Build(MemoryAllocator allocator = default)
 {
 return Build(
-ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-ValueBuffer.Length, 0, 0);
+ValueBuffer.Build(allocator), 
ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:
   I agree with the sentiment, but given that every call to `Append[Range]` 
also touches `ValidityBuffer` I think this could end up trading-off a modest 
memory utilization improvement for [a modest-to-significant compute 
overhead](https://gist.github.com/mrange/aa82d33e94ad76d0f33ed86e704d7492) 
depending on the desired level of thread-safety and a given application's 
access pattern.
   
   That said, `Build` could easily pass an empty buffer if `NullCount` == 0. 
Then `ArrowStreamWriter` could check the buffer's span `IsEmpty` property, and 
if so, emit [NullType 
instead](https://issues.apache.org/jira/browse/ARROW-6643).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] BryanCutler commented on pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations

2020-04-27 Thread GitBox


BryanCutler commented on pull request #6323:
URL: https://github.com/apache/arrow/pull/6323#issuecomment-620174537


   test failure is unrelated, merged to master. Thanks @liyafan82 !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


wesm commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r416073397



##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   I think it's fine if you have a filter API in Java, but I think you 
should try to limit the exposure of the particular details of the current C++ 
API in Java. If the Java API has a 1-1 correspondence with the C++ API that's 
what I'm advising you against





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416050847



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -99,43 +105,63 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu
 {
 protected TBuilder Instance => this as TBuilder;
 protected ArrowBuffer.Builder ValueBuffer { get; }
+protected BooleanArray.Builder ValidityBuffer { get; }
 
 public int Length => ValueBuffer.Length;
 
+protected int NullCount { get; set; }
+
 // TODO: Implement support for null values (null bitmaps)
 
 internal PrimitiveArrayBuilder()
 {
 ValueBuffer = new ArrowBuffer.Builder();
+ValidityBuffer = new BooleanArray.Builder();
 }
 
 public TBuilder Resize(int length)
 {
 ValueBuffer.Resize(length);
+ValidityBuffer.Resize(length + 1);
 return Instance;
 }
 
 public TBuilder Reserve(int capacity)
 {
 ValueBuffer.Reserve(capacity);
+ValidityBuffer.Reserve(capacity + 1);
 return Instance;
 }
 
 public TBuilder Append(T value)
 {
 ValueBuffer.Append(value);
+ValidityBuffer.Append(true);
 return Instance;
 }
 
 public TBuilder Append(ReadOnlySpan span)
 {
 ValueBuffer.Append(span);
+ValidityBuffer.Append(span != Span.Empty);

Review comment:
   Gah! Forgot to refactor here after fixing it in `BinaryBuilder`. Thanks.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416049921



##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length)
 length = Math.Min(Length - offset, length);
 offset += Offset;
 
-return new ArrayData(DataType, length, -1, offset, Buffers, 
Children);
+return new ArrayData(DataType, length, NullCount, offset, Buffers, 
Children);

Review comment:
   Appears obvious in retrospect. I reverted, though I did introduce a 
constant for enhanced readability.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-27 Thread GitBox


fsaintjacques commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r416028395



##
File path: python/pyarrow/parquet.py
##
@@ -310,6 +310,44 @@ def read_row_groups(self, row_groups, columns=None, 
use_threads=True,
column_indices=column_indices,
use_threads=use_threads)
 
+def iter_batches(self, batch_size, row_groups=None, columns=None,

Review comment:
   In the other method `scan_contents`, batch_size is optional.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-27 Thread GitBox


fsaintjacques commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r416027477



##
File path: python/pyarrow/_parquet.pyx
##
@@ -1077,6 +1078,54 @@ cdef class ParquetReader:
 def set_use_threads(self, bint use_threads):
 self.reader.get().set_use_threads(use_threads)
 
+def set_batch_size(self, int64_t batch_size):
+self.reader.get().set_batch_size(batch_size)
+
+def iter_batches(self, int64_t batch_size, row_groups, column_indices=None,
+ bint use_threads=True):
+cdef:
+vector[int] c_row_groups
+vector[int] c_column_indices
+shared_ptr[CRecordBatch] record_batch
+shared_ptr[TableBatchReader] batch_reader
+unique_ptr[CRecordBatchReader] recordbatchreader
+
+self.set_batch_size(batch_size)
+
+if use_threads:
+self.set_use_threads(use_threads)
+
+for row_group in row_groups:
+c_row_groups.push_back(row_group)
+
+if column_indices is not None:
+for index in column_indices:
+c_column_indices.push_back(index)
+check_status(
+self.reader.get().GetRecordBatchReader(
+c_row_groups, c_column_indices, &recordbatchreader
+)
+)
+else:
+check_status(
+self.reader.get().GetRecordBatchReader(
+c_row_groups, &recordbatchreader
+)
+)
+
+while True:
+with nogil:
+check_status(
+recordbatchreader.get().ReadNext(&record_batch)
+)
+
+if record_batch.get() == NULL:
+break
+
+py_record_batch = pyarrow_wrap_batch(record_batch)

Review comment:
   ```suggestion
   yield pyarrow_wrap_batch(record_batch)
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7046:
URL: https://github.com/apache/arrow/pull/7046#issuecomment-620134352


   https://issues.apache.org/jira/browse/ARROW-8606



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-27 Thread GitBox


fsaintjacques commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r416018166



##
File path: python/pyarrow/_parquet.pyx
##
@@ -1083,6 +1084,50 @@ cdef class ParquetReader:
 def set_use_threads(self, bint use_threads):
 self.reader.get().set_use_threads(use_threads)
 
+def iter_batches(self, int64_t batch_size, row_groups, column_indices=None,

Review comment:
   Noted for consistency. I find it un-intuitive that calling this method 
changes the batch size (or threading) for the following callers. Since this is 
not introduced by this PR, I won't make this a blocking issue.
   
   I'll leave @jorisvandenbossche decide if we should have a followup to fix 
this anti-pattern.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson opened a new pull request #7046: ARROW-8606: [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread GitBox


nealrichardson opened a new pull request #7046:
URL: https://github.com/apache/arrow/pull/7046


   If I edit an R build script, we shouldn't be running JS and Go builds.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sunchao commented on pull request #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)

2020-04-27 Thread GitBox


sunchao commented on pull request #6949:
URL: https://github.com/apache/arrow/pull/6949#issuecomment-620113014


   Merged. Thanks @rdettai !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415981809



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),

Review comment:
   - not currently implemented, will add





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415958451



##
File path: cpp/src/arrow/dataset/file_csv.h
##
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+
+namespace arrow {
+namespace dataset {
+
+/// \brief A FileFormat implementation that reads from and writes to Csv files
+class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
+ public:
+  csv::ParseOptions parse_options = csv::ParseOptions::Defaults();

Review comment:
   Definitely you can set a full ParseOptions. I haven't exposed any of 
ReadOptions and ConvertOptions will probably need separate handling, as noted 
below





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415971888



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   Assembling these from `ScanOptions` is definitely a goal; the first pass 
just defers all this to the format. Some of `ConvertOptions` belong in the 
format (`null_values`, ..., `strings_can_be_null`, ...) while others such as 
the ones you note do not. I'll rewrite shortly





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415971988



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;

Review comment:
   will do





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r415963714



##
File path: csharp/Directory.Build.props
##
@@ -21,6 +21,12 @@
 
$(CSharpDir)ApacheArrow.snk
   
 
+  
+

Review comment:
   This was required for some early tests that I wrote, which later 
refactoring removed the need for. This was unfortunately left behind. Cleanup 
it up. Thanks for catching it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415958451



##
File path: cpp/src/arrow/dataset/file_csv.h
##
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+
+namespace arrow {
+namespace dataset {
+
+/// \brief A FileFormat implementation that reads from and writes to Csv files
+class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
+ public:
+  csv::ParseOptions parse_options = csv::ParseOptions::Defaults();

Review comment:
   Definitely you can set a full ParseOptions and ConvertOptions. I haven't 
exposed any of ReadOptions, as noted below





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r415955058



##
File path: csharp/src/Apache.Arrow/Apache.Arrow.csproj
##
@@ -4,7 +4,7 @@
 netstandard1.3;netcoreapp2.1
 true
 
$(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T
-
+

Review comment:
   👍 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


pitrou commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415949596



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   Also, the `ScanOptions` should be used to set the `column_names`, 
`column_types` and `include_columns` options. Unless you prefer to let the CSV 
reader infer types and then convert them yourself?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


nealrichardson commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415887520



##
File path: cpp/src/arrow/dataset/file_csv.h
##
@@ -0,0 +1,52 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+
+namespace arrow {
+namespace dataset {
+
+/// \brief A FileFormat implementation that reads from and writes to Csv files
+class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
+ public:
+  csv::ParseOptions parse_options = csv::ParseOptions::Defaults();

Review comment:
   Can I set a full `csv::ParseOptions` object? We have builders for those 
in R already.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] yonidavidson commented on pull request #6731: [WIP] ARROW-8601: [Go][Flight] Added implementation of FlightDataWriter

2020-04-27 Thread GitBox


yonidavidson commented on pull request #6731:
URL: https://github.com/apache/arrow/pull/6731#issuecomment-620017093


   Hi All,
   I am working on https://github.com/353solutions/carrow/tree/issue/46-flight .
   This is a Go project that wraps the C++ one but wanted everyone to know that 
it's also on our roadmap. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7045: ARROW-8603: [C++][Documentation] Add missing params comment

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7045:
URL: https://github.com/apache/arrow/pull/7045#issuecomment-619997013


   https://issues.apache.org/jira/browse/ARROW-8603



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques opened a new pull request #7045: ARROW-8603: [C++][Documentation] Add missing params comment

2020-04-27 Thread GitBox


fsaintjacques opened a new pull request #7045:
URL: https://github.com/apache/arrow/pull/7045


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


fsaintjacques commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r415798798



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",

Review comment:
   ```suggestion
   return maybe_reader.status().WithMessage("Could not open CSV input 
source '",
   ```

##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;

Review comment:
   This warrants a comment explaining why we disable threading.

##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset

[GitHub] [arrow] paddyhoran commented on pull request #7043: ARROW-8598: [Rust] `simd_compare_op` creates buffer of incorrect length

2020-04-27 Thread GitBox


paddyhoran commented on pull request #7043:
URL: https://github.com/apache/arrow/pull/7043#issuecomment-619967530


   @nevi-me when I added this fix I hadn't looked at your PR removing 
packed_simd.  I introduced this bug so wanted to get a fix posted.  
   
   Also, I wasn't sure what your timelines were for #7037. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7001: ARROW-8602: [C++][CMake] Fix ws2_32 link issue when cross-compiling on Linux

2020-04-27 Thread GitBox


fsaintjacques commented on pull request #7001:
URL: https://github.com/apache/arrow/pull/7001#issuecomment-619964385


   @davidanthoff feel free to create an account on JIRA so that I can assign 
you the ticket https://issues.apache.org/jira/browse/ARROW-8602



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #6731: [WIP] ARROW-8601: [Go][Flight] Added implementation of FlightDataWriter

2020-04-27 Thread GitBox


fsaintjacques commented on pull request #6731:
URL: https://github.com/apache/arrow/pull/6731#issuecomment-619959812


   Created https://jira.apache.org/jira/browse/ARROW-8601 for this



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #6731: feat(flight): Added implementation of FlightDataWriter

2020-04-27 Thread GitBox


fsaintjacques commented on a change in pull request #6731:
URL: https://github.com/apache/arrow/pull/6731#discussion_r415775496



##
File path: format/Flight.proto
##
@@ -19,8 +19,17 @@
 syntax = "proto3";
 
 option java_package = "org.apache.arrow.flight.impl";
+option go_package = "ipc";

Review comment:
   Can you use `flight` namespace, there's already a component named `ipc` 
in the C++ code base and this could bring confusion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #6154: ARROW-7531: [C++] Reduce header cost

2020-04-27 Thread GitBox


pitrou commented on pull request #6154:
URL: https://github.com/apache/arrow/pull/6154#issuecomment-619959114


   Given the diffusion of changes accross the codebase, rebasing this wholesale 
would probably be painful. A better strategy would probably to retry and do 
some of the changes one by one, in separate PRs. I should try to do that 
someday, but this is also low-priority.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #6645: ARROW-8074: [Dataset][Python] FileFragments from Buffers

2020-04-27 Thread GitBox


fsaintjacques commented on pull request #6645:
URL: https://github.com/apache/arrow/pull/6645#issuecomment-619957568


   I'll close this for now, ARROW-8318 will remove this limitation and 
FileSystemDataset will be created from a list of FileFragment, which themselves 
can be created from Buffer-backed FileSource.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques edited a comment on pull request #6645: ARROW-8074: [Dataset][Python] FileFragments from Buffers

2020-04-27 Thread GitBox


fsaintjacques edited a comment on pull request #6645:
URL: https://github.com/apache/arrow/pull/6645#issuecomment-619957568


   I'll close this for now, ARROW-8318 will remove this limitation and 
FileSystemDataset will be created from a list of FileFragment, which themselves 
can be created from Buffer-backed FileSource. You'll be able to create Dataset 
from a mix of files on disk and buffers in memory.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #6154: ARROW-7531: [C++] Reduce header cost

2020-04-27 Thread GitBox


fsaintjacques commented on pull request #6154:
URL: https://github.com/apache/arrow/pull/6154#issuecomment-619954274


   @pitrou close or rebase?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings

2020-04-27 Thread GitBox


kszucs commented on pull request #7026:
URL: https://github.com/apache/arrow/pull/7026#issuecomment-619902102


   Agree that exposing `ds.field` and `ds.scalar` should be sufficient on the 
python side.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jianxind edited a comment on pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.

2020-04-27 Thread GitBox


jianxind edited a comment on pull request #7029:
URL: https://github.com/apache/arrow/pull/7029#issuecomment-618855696


   cc @emkornfield 
   
   The AVX512 path is straightforward as the helper of 
mask_compress/mask_expand API provide by AVX512. For potential path-finding of 
SSE/AVX2, as you pointed in the Jira, a solution with fixed lookup table may 
help, I will work the chance then but it definitely need take more time thus I 
commit this done part firstly.
   
   Below is the benchmark data on Avx512 device before/after the intrinsics:
   
   Before:
   BM_PlainEncodingSpacedFloat/1024  1471 ns 1469 ns   
476373 bytes_per_second=2.59603G/s
   BM_PlainEncodingSpacedDouble/1024 1498 ns 1496 ns   
468131 bytes_per_second=5.09834G/s
   BM_PlainDecodingSpacedFloat/1024  1266 ns 1265 ns   
554320 bytes_per_second=3.01623G/s
   BM_PlainDecodingSpacedDouble/1024  920 ns  919 ns   
759151 bytes_per_second=8.30509G/s
   
   After:
   BM_PlainEncodingSpacedFloat/1024   513 ns  512 ns  
1374561 bytes_per_second=7.44416G/s
   BM_PlainEncodingSpacedDouble/1024  635 ns  634 ns  
1108739 bytes_per_second=12.0322G/s
   BM_PlainDecodingSpacedFloat/1024   217 ns  217 ns  
3233406 bytes_per_second=17.613G/s
   BM_PlainDecodingSpacedDouble/1024  309 ns  309 ns  
2267740 bytes_per_second=24.7257G/s



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7026: ARROW-7391: [C++][Dataset] Remove Expression subclasses from bindings

2020-04-27 Thread GitBox


jorisvandenbossche commented on a change in pull request #7026:
URL: https://github.com/apache/arrow/pull/7026#discussion_r415703472



##
File path: python/pyarrow/_dataset.pyx
##
@@ -269,20 +454,21 @@ cdef class FileSystemDataset(Dataset):
 cdef:
 CFileSystemDataset* filesystem_dataset
 
-def __init__(self, paths_or_selector, schema=None, format=None,
- filesystem=None, partitions=None, root_partition=None):
+def __init__(self, paths_or_selector, schema, format, filesystem,
+ partitions=None, Expression root_partition=_true):

Review comment:
   Can you leave this as it was? This was done on purpose to provide decent 
error messages (see discussion in https://github.com/apache/arrow/pull/6913, I 
am happy to find a better way though)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r415667244



##
File path: cpp/src/jni/dataset/jni_wrapper.cpp
##
@@ -0,0 +1,577 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "concurrent_map.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/cast.h"
+#include "arrow/compute/kernels/compare.h"
+#include "jni/dataset/Types.pb.h"
+
+#include "org_apache_arrow_dataset_jni_JniWrapper.h"
+#include "org_apache_arrow_dataset_file_JniWrapper.h"
+
+static jclass illegal_access_exception_class;
+static jclass illegal_argument_exception_class;
+static jclass runtime_exception_class;
+
+static jclass record_batch_handle_class;
+static jclass record_batch_handle_field_class;
+static jclass record_batch_handle_buffer_class;
+
+static jmethodID record_batch_handle_constructor;
+static jmethodID record_batch_handle_field_constructor;
+static jmethodID record_batch_handle_buffer_constructor;
+
+static jint JNI_VERSION = JNI_VERSION_1_6;
+
+using arrow::jni::ConcurrentMap;
+
+static ConcurrentMap> 
dataset_factory_holder_;
+static ConcurrentMap> dataset_holder_;
+static ConcurrentMap> 
scan_task_holder_;
+static ConcurrentMap> scanner_holder_;
+static ConcurrentMap> 
iterator_holder_;
+static ConcurrentMap> buffer_holder_;
+
+#define JNI_ASSIGN_OR_THROW_NAME(x, y) ARROW_CONCAT(x, y)
+
+#define JNI_ASSIGN_OR_THROW_IMPL(t, lhs, rexpr) \
+  auto t = (rexpr); \
+  if (!t.status().ok()) {   \
+env->ThrowNew(runtime_exception_class, t.status().message().c_str());   \
+  } \
+  lhs = std::move(t).ValueOrDie();
+
+#define JNI_ASSIGN_OR_THROW(lhs, rexpr) \
+  JNI_ASSIGN_OR_THROW_IMPL(JNI_ASSIGN_OR_THROW_NAME(_tmp_var, __COUNTER__), 
lhs, rexpr)
+
+#define JNI_ASSERT_OK_OR_THROW(expr)  \
+  do {\
+auto _res = (expr);   \
+arrow::Status _st = ::arrow::internal::GenericToStatus(_res); \
+if (!_st.ok()) {  \
+   env->ThrowNew(runtime_exception_class, _st.message().c_str());  \
+} \
+  } while (false);
+
+jclass CreateGlobalClassReference(JNIEnv* env, const char* class_name) {
+  jclass local_class = env->FindClass(class_name);
+  jclass global_class = (jclass)env->NewGlobalRef(local_class);
+  env->DeleteLocalRef(local_class);
+  return global_class;
+}
+
+jmethodID GetMethodID(JNIEnv* env, jclass this_class, const char* name, const 
char* sig) {
+  jmethodID ret = env->GetMethodID(this_class, name, sig);
+  if (ret == nullptr) {
+std::string error_message = "Unable to find method " + std::string(name) +
+" within signature" + std::string(sig);
+env->ThrowNew(illegal_access_exception_class, error_message.c_str());
+  }
+  return ret;
+}
+
+jint JNI_OnLoad(JavaVM* vm, void* reserved) {
+  JNIEnv* env;
+  if (vm->GetEnv(reinterpret_cast(&env), JNI_VERSION) != JNI_OK) {
+return JNI_ERR;
+  }
+
+  illegal_access_exception_class =
+  CreateGlobalClassReference(env, "Ljava/lang/IllegalAccessException;");
+  illegal_argument_exception_class =
+  CreateGlobalClassReference(env, "Ljava/lang/IllegalArgumentException;");
+  runtime_exception_class =
+  CreateGlobalClassReference(env, "Ljava/lang/RuntimeException;");
+
+  record_batch_handle_class =
+  CreateGlobalClassReference(env, 
"Lorg/apache/arrow/dataset/jni/NativeRecordBatchHandle;");
+  record_batch_handle_field_class =
+  CreateGlobalClassReference(env, 
"Lorg/apache/arrow/dataset/jni/NativeRecordBatchHandle$Field;");
+  record_batch_handle_buffer_class =
+  CreateGlobalClassReference(env, 
"Lorg/apache/arrow/dataset

[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r415664616



##
File path: cpp/src/jni/dataset/proto/Types.proto
##
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+syntax = "proto2";
+package types;
+
+option java_package = "org.apache.arrow.dataset";

Review comment:
   Currently it is for passing datasets filter via JNI. I see Wes has 
recommended not to include filter bindings (in the bottom comment) in this 
version. I may consider removing it in a later commit.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on a change in pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-27 Thread GitBox


zhztheplayer commented on a change in pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#discussion_r415661628



##
File path: cpp/src/jni/dataset/jni_wrapper.cpp
##
@@ -0,0 +1,577 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "concurrent_map.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/cast.h"
+#include "arrow/compute/kernels/compare.h"
+#include "jni/dataset/Types.pb.h"
+
+#include "org_apache_arrow_dataset_jni_JniWrapper.h"
+#include "org_apache_arrow_dataset_file_JniWrapper.h"
+
+static jclass illegal_access_exception_class;
+static jclass illegal_argument_exception_class;
+static jclass runtime_exception_class;
+
+static jclass record_batch_handle_class;
+static jclass record_batch_handle_field_class;
+static jclass record_batch_handle_buffer_class;
+
+static jmethodID record_batch_handle_constructor;
+static jmethodID record_batch_handle_field_constructor;
+static jmethodID record_batch_handle_buffer_constructor;
+
+static jint JNI_VERSION = JNI_VERSION_1_6;
+
+using arrow::jni::ConcurrentMap;
+
+static ConcurrentMap> 
dataset_factory_holder_;

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #7043: ARROW-8598: [Rust] `simd_compare_op` creates buffer of incorrect length

2020-04-27 Thread GitBox


nevi-me commented on pull request #7043:
URL: https://github.com/apache/arrow/pull/7043#issuecomment-619853175


   @paddyhoran do we need to worry about this, as it'd get removed by #7037?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] vertexclique commented on a change in pull request #7036: ARROW-8591: [Rust] Reverse lookup for a key in DictionaryArray

2020-04-27 Thread GitBox


vertexclique commented on a change in pull request #7036:
URL: https://github.com/apache/arrow/pull/7036#discussion_r415653220



##
File path: rust/arrow/src/array/array.rs
##
@@ -1786,38 +1786,34 @@ impl From<(Vec<(Field, ArrayRef)>, Buffer, usize)> for 
StructArray {
 /// This is mostly used to represent strings or a limited set of primitive 
types as integers,
 /// for example when doing NLP analysis or representing chromosomes by name.
 ///
-/// Example with nullable data:
+/// Example **with nullable** data:
 ///
 /// ```
-/// use arrow::array::DictionaryArray;
-/// use arrow::datatypes::Int8Type;
-/// let test = vec!["a", "a", "b", "c"];
-/// let array : DictionaryArray = test.iter().map(|&x| if x == 
"b" {None} else {Some(x)}).collect();
-/// assert_eq!(array.keys().collect::>>(), vec![Some(0), 
Some(0), None, Some(1)]);
+/// use arrow::array::DictionaryArray;
+/// use arrow::datatypes::Int8Type;
+/// let test = vec!["a", "a", "b", "c"];
+/// let array : DictionaryArray = test.iter().map(|&x| if x == "b" 
{None} else {Some(x)}).collect();
+/// assert_eq!(array.keys().collect::>>(), vec![Some(0), 
Some(0), None, Some(1)]);
 /// ```
 ///
-/// Example without nullable data:
+/// Example **without nullable** data:
 ///
 /// ```
-///
-/// use arrow::array::DictionaryArray;
-/// use arrow::datatypes::Int8Type;
-/// let test = vec!["a", "a", "b", "c"];
-/// let array : DictionaryArray = test.into_iter().collect();
-/// assert_eq!(array.keys().collect::>>(), vec![Some(0), 
Some(0), Some(1), Some(2)]);
+/// use arrow::array::DictionaryArray;
+/// use arrow::datatypes::Int8Type;
+/// let test = vec!["a", "a", "b", "c"];
+/// let array : DictionaryArray = test.into_iter().collect();
+/// assert_eq!(array.keys().collect::>>(), vec![Some(0), 
Some(0), Some(1), Some(2)]);
 /// ```
 pub struct DictionaryArray {
-// Array of keys, much like a PrimitiveArray
+/// Array of keys, much like a PrimitiveArray
 data: ArrayDataRef,
 
-// Pointer to the key values.
+/// Pointer to the key values.
 raw_values: RawPtrBox,
 
-// Array of any values.
+/// Array of any values.
 values: ArrayRef,
-

Review comment:
   Brought it back.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rdettai commented on a change in pull request #6935: ARROW-8455: [Rust] Parquet Arrow column read on partially compatible files

2020-04-27 Thread GitBox


rdettai commented on a change in pull request #6935:
URL: https://github.com/apache/arrow/pull/6935#discussion_r415644697



##
File path: rust/parquet/src/column/reader.rs
##
@@ -190,15 +190,12 @@ impl ColumnReaderImpl {
 (self.num_buffered_values - self.num_decoded_values) as 
usize,
 );
 
-// Adjust batch size by taking into account how much space is 
left in
-// values slice or levels slices (if available)
-adjusted_size = min(adjusted_size, values.len() - values_read);
-if let Some(ref levels) = def_levels {
-adjusted_size = min(adjusted_size, levels.len() - 
levels_read);
-}
-if let Some(ref levels) = rep_levels {
-adjusted_size = min(adjusted_size, levels.len() - 
levels_read);
-}
+// Adjust batch size by taking into account how much data there
+// to read. As batch_size is also smaller than value and level
+// slices (if available), this ensures that available space is 
not
+// exceeded.
+adjusted_size = min(adjusted_size, batch_size - values_read);

Review comment:
   is it clearer now or do you need me to find a better example ?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rdettai commented on pull request #6949: ARROW-7681: [Rust] Explicitly seeking a BufReader will discard the internal buffer (2)

2020-04-27 Thread GitBox


rdettai commented on pull request #6949:
URL: https://github.com/apache/arrow/pull/6949#issuecomment-619834304


   I completely agree ! What I am saying is that the layer that makes the 
handle thread safe cannot be the `FileSource` which is in charge of tracking 
the reading position of one specific reading thread.
   
   But this question of multi-threading seems to be a whole other concern :-) 
I've made the _nit_ changes you've proposed and improved the tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] durch commented on pull request #7042: ARROW-8597 [Rust] Lints and readability improvements for arrow crate

2020-04-27 Thread GitBox


durch commented on pull request #7042:
URL: https://github.com/apache/arrow/pull/7042#issuecomment-619798711


   @nevi-me I left it as is for now, did not want to actually write any new 
code in this PR. After this one gets merged I'll make another pass and write 
some code, in addition to the `is_empty` there is also a low hanging `Default` 
impl, possibly a few more pedantic changes :).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #6898: ARROW-8399: [Rust] Extend memory alignments to include other architectures

2020-04-27 Thread GitBox


nevi-me commented on pull request #6898:
URL: https://github.com/apache/arrow/pull/6898#issuecomment-619797480


   > > I am using this code in various places, but if it is against the spec or 
not useful for the improvements we can drop it off.
   > 
   > As @pitrou says:
   > 
   > > It's a recommendation, not an obligation. If you decide that specific 
architectures deserve a specific alignment value, then IMHO it's not a problem.
   > 
   > I'm all for the changes if they improve performance but this is at the 
limit of my understanding. @andygrove @nevi-me thoughts?
   
   I don't know much about alignment, but I'm happy to claw back performance 
(esp after we remove packed_simd). I'll rebase this against the PR to drop 
packed_simd, then check the performance difference.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org