[GitHub] [arrow-datafusion] lvheyang edited a comment on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang edited a comment on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` // would panic before this pull request 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
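[Editor's note] The return-type conflict exercised by these test SQLs can be illustrated outside DataFusion. A minimal, hypothetical sketch (not DataFusion's actual implementation; `sqrt_f64` is an invented name): a sqrt whose declared return type is always a 64-bit float, so f32 and integer inputs are cast up before the math runs instead of producing a conflicting return type.

```python
import math

def sqrt_f64(values):
    """Apply sqrt to a column of any numeric type, always returning floats.

    Casting each input to float first mirrors the idea of coercing
    i32/i64/f32 columns to Float64 so the function has one return type.
    """
    return [math.sqrt(float(v)) for v in values]

print(sqrt_f64([4, 9]))   # i64-style input -> [2.0, 3.0]
print(sqrt_f64([2.25]))   # f32/f64-style input -> [1.5]
```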
[GitHub] [arrow-datafusion] lvheyang edited a comment on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang edited a comment on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` // would fail before this pull request 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed?
[GitHub] [arrow-datafusion] lvheyang commented on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang commented on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed?
[GitHub] [arrow] cyb70289 commented on a change in pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
cyb70289 commented on a change in pull request #10719: URL: https://github.com/apache/arrow/pull/10719#discussion_r669292380 ## File path: cpp/src/arrow/testing/gtest_util.cc ## @@ -907,7 +910,7 @@ class GatingTask::Impl : public std::enable_shared_from_this<Impl> { double timeout_seconds_; Status status_; bool unlocked_; - int num_launched_ = 0; + std::atomic<int> num_launched_; Review comment: Nit: since we don't care about memory order, I think we can change this line to `std::atomic<int> num_launched_{0}`, and leave all other code untouched.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669276514 ## File path: python/pyarrow/_hdfs.pyx ## @@ -93,9 +93,10 @@ cdef class HadoopFileSystem(FileSystem): Instantiate HadoopFileSystem object from an URI string. The following two calls are equivalent -* HadoopFileSystem.from_uri('hdfs://localhost:8020/?user=test' -'&replication=1') -* HadoopFileSystem('localhost', port=8020, user='test', replication=1) + +* ``HadoopFileSystem.from_uri('hdfs://localhost:8020/?user=test&replication=1')`` # noqa: E501 + +* ``HadoopFileSystem('localhost', port=8020, user='test', replication=1)`` # noqa: E501 Review comment: Anyone have any opinion on ignoring flake8 here? RST didn't like the multi-line strings. I tried using literal blocks `::` but then flake8 complained about the empty lines required by RST. This seems like the most readable option, but the lines are too long (hence the `# noqa`).
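[Editor's note] The equivalence the docstring claims between the URI form and the keyword form can be sketched with the standard library. This is a hypothetical illustration (`hdfs_uri_to_kwargs` is an invented helper, not pyarrow's implementation, which lives in C++): the URI's host and port become positional/keyword arguments, and query-string options become the remaining keyword arguments.

```python
from urllib.parse import urlsplit, parse_qsl

def hdfs_uri_to_kwargs(uri):
    """Map an hdfs:// URI onto HadoopFileSystem-style constructor kwargs."""
    parts = urlsplit(uri)
    kwargs = {"host": parts.hostname, "port": parts.port}
    for key, value in parse_qsl(parts.query):
        # Options like replication are integers in the keyword form.
        kwargs[key] = int(value) if value.isdigit() else value
    return kwargs

print(hdfs_uri_to_kwargs("hdfs://localhost:8020/?user=test&replication=1"))
# -> {'host': 'localhost', 'port': 8020, 'user': 'test', 'replication': 1}
```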
[GitHub] [arrow-rs] nevi-me merged pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
nevi-me merged pull request #545: URL: https://github.com/apache/arrow-rs/pull/545
[GitHub] [arrow-rs] nevi-me closed issue #544: Error building on master - error: cyclic package dependency: package `ahash v0.7.4` depends on itself. Cycle
nevi-me closed issue #544: URL: https://github.com/apache/arrow-rs/issues/544
[GitHub] [arrow] westonpace commented on pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on pull request #10693: URL: https://github.com/apache/arrow/pull/10693#issuecomment-879575972 I believe I have addressed all comments. I also noticed that IpcWriteOptions was undocumented so I added it in.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669270057 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python + +import pyarrow.compute as pc + +sum = 0 +count = 0 +for batch in dataset.to_batches(filter=~ds.field('col2').is_null()): +sum += pc.sum(batch.column('col2')).as_py() +count += batch.num_rows +mean_a = sum/count Review comment: True, but then you wouldn't be using `to_batches` anyways and this whole section goes away. I think dataset scanning is likely to be a niche API once any kind of SQL support is added. We can address it then.
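[Editor's note] The streaming-average pattern in the quoted docs diff can be sketched in plain Python. This is a stand-in sketch: the pyarrow calls (`dataset.to_batches`, `pc.sum`, `batch.num_rows`) are replaced with plain lists so the idea is self-contained; the point is that only a running sum and count are kept, never the whole column.

```python
def streaming_mean(batches):
    """Compute a column mean one batch at a time (out-of-core style)."""
    total = 0.0
    count = 0
    for batch in batches:        # each batch stands in for a RecordBatch column
        total += sum(batch)
        count += len(batch)
    return total / count

batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(streaming_mean(batches))  # -> 3.5
```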
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669269421 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning Review comment: Hmmm, I just dropped the sentence about scanning. If someone wants clarification we can discuss it at that time. @bkietz any opinion? I recall we spoke about this before.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669268967 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python + +import pyarrow.compute as pc + +sum = 0 Review comment: Thanks. I renamed it. 
[GitHub] [arrow] github-actions[bot] commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
github-actions[bot] commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879572131 Revision: 3a1f4e47a3c7af582d67e013bed3bd712d87dba8 Submitted crossbow builds: [ursacomputing/crossbow @ actions-594](https://github.com/ursacomputing/crossbow/branches/all?query=actions-594)

|Task|Status|
|----|------|
|test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-594-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-594-github-test-ubuntu-20.04-cpp-thread-sanitizer)|
[GitHub] [arrow] cyb70289 commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
cyb70289 commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879571834 @github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Finished :arrow_down:0.24% :arrow_up:0.05%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669261083 ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); Review comment: From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community?
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669252727 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python Review comment: All the scripts I added execute pretty rapidly as they are dealing with tables with fewer than 10 rows. I'm not sure they add significantly to the build times. As a test I tried converting all ipython to code-block and saw no noticeable difference in build times. I'd prefer ipython just for the testing's sake but I'm happy to go with whatever is decided in ARROW-13159. Since it should be pretty easy to change after the fact (just a find-replace from ipython to code-block) I'd rather address it after ARROW-13159 is resolved if that is ok.
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669247552 ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: It's a bit strange to continue running here if we know it will eventually fail (will it?) From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community? If grpc is willing to change, do we still need to update our code? For now, I'm not sure if it's appropriate to simply abort here in a release build, with some clear error messages and hints to fix it by chunking the data.
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669232404 ## File path: cpp/src/arrow/flight/client.cc ## @@ -688,11 +688,12 @@ class GrpcStreamWriter : public FlightStreamWriter { Status WriteMetadata(std::shared_ptr<Buffer> app_metadata) override { FlightPayload payload{}; payload.app_metadata = app_metadata; -if (!internal::WritePayload(payload, writer_->stream().get())) { +auto status = internal::WritePayload(payload, writer_->stream().get()); +if (!status.ok() && status.IsIOError()) { Review comment: Can we remove `!status.ok()` and only check `if (status.IsIOError())` ? ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: `DCHECK_LE` ? ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: It's a bit strange to continue running here if we know it will eventually fail (will it?) From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community? If grpc is willing to change, do we still need to update our code? For now, I'm not sure if it's appropriate to simply abort here in a release build, with some clear error messages and hints to fix it by chunking the data.
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10604: ARROW-13190: [C++] [Gandiva] Change behavior of INITCAP function
anthonylouisbsb commented on a change in pull request #10604: URL: https://github.com/apache/arrow/pull/10604#discussion_r669245436 ## File path: cpp/src/gandiva/gdv_function_stubs.cc ## @@ -635,30 +638,31 @@ const char* gdv_fn_initcap_utf8(int64_t context, const char* data, int32_t data_ int32_t out_char_len = 0; int32_t out_idx = 0; uint32_t char_codepoint; + + // Any character is considered as space, except if it is alphanumeric bool last_char_was_space = true; for (int32_t i = 0; i < data_len; i += char_len) { char_len = gdv_fn_utf8_char_length(data[i]); -// For single byte characters: -// If it is a lowercase ASCII character, set the output to its corresponding uppercase -// character; else, set the output to the read character +// An optimization for single byte characters: if (char_len == 1) { Review comment: Suggestion applied
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] ursabot commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Scheduled] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] dianaclarke commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
dianaclarke commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533345 @ursabot please benchmark
[GitHub] [arrow] github-actions[bot] commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
github-actions[bot] commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879528937 https://issues.apache.org/jira/browse/ARROW-13332
[GitHub] [arrow] westonpace opened a new pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
westonpace opened a new pull request #10719: URL: https://github.com/apache/arrow/pull/10719 Minor cleanup fix from my earlier fix of this test in #10602
[GitHub] [arrow-datafusion] mcassels opened a new issue #723: ABS() function in WHERE clause gives unexpected results
mcassels opened a new issue #723: URL: https://github.com/apache/arrow-datafusion/issues/723

**Describe the bug** `ABS(col - x)` in the WHERE clause of a query sometimes does not filter results correctly. The test file has a float column with high-precision values. We want to use `ABS(col - x) < y` to do equality comparisons for high-precision float columns.

**To Reproduce** datafusion-cli example using the test parquet file attached:

```
➜ arrow-datafusion git:(master) ✗ cargo run --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.22s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION 'test.parquet';
0 rows in set. Query took 0.001 seconds.
> select * from foo;
+--------------------+
| c0                 |
+--------------------+
| 107.0090813093981  |
| 125.51519138755981 |
| 141.83342587451415 |
| 113.65534481251639 |
| 251.10794896957802 |
| 112.08361695028363 |
+--------------------+
6 rows in set. Query took 0.006 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo;
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 107.0090813093981  | 144.0988676601799                         |
| 125.51519138755981 | 125.59275758201821                        |
| 141.83342587451415 | 109.27452309506387                        |
| 113.65534481251639 | 137.45260415706161                        |
| 251.10794896957802 | 0                                         |
| 112.08361695028363 | 139.02433201929438                        |
+--------------------+-------------------------------------------+
6 rows in set. Query took 0.006 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 1;
0 rows in set. Query took 0.005 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 111;
0 rows in set. Query took 0.003 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 150;
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 107.0090813093981  | 144.0988676601799                         |
| 125.51519138755981 | 125.59275758201821                        |
| 141.83342587451415 | 109.27452309506387                        |
| 113.65534481251639 | 137.45260415706161                        |
| 251.10794896957802 | 0                                         |
| 112.08361695028363 | 139.02433201929438                        |
+--------------------+-------------------------------------------+
6 rows in set. Query took 0.007 seconds.
```

**Expected behavior** The query

```
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 1;
```

was expected to give the following 1 row:

```
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 251.10794896957802 | 0                                         |
```

And the query

```
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 111;
```

was expected to give the following 2 rows:

```
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 251.10794896957802 | 0                                         |
| 141.83342587451415 | 109.27452309506387                        |
```
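The epsilon-comparison pattern the issue relies on can be sketched in plain Python, using the values from the report's test table (the helper name `within` is illustrative, not from the codebase):

```python
# Epsilon-based float "equality": the intent behind ABS(col - x) < y in SQL.
rows = [
    107.0090813093981,
    125.51519138755981,
    141.83342587451415,
    113.65534481251639,
    251.10794896957802,
    112.08361695028363,
]
target = 251.10794896957802

def within(c0, x, y):
    """True when |c0 - x| < y, i.e. c0 is within distance y of x."""
    return abs(c0 - x) < y

# Only the exact-match row should pass the tight threshold...
print([c0 for c0 in rows if within(c0, target, 1.0)])
# ...and one more row (distance ~109.27) should pass the looser one.
print([c0 for c0 in rows if within(c0, target, 111.0)])
```

Run against the table above, this yields one row for `y = 1` and two rows for `y = 111`, which is the expected behavior the issue describes and which DataFusion's WHERE clause failed to produce.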
[GitHub] [arrow] kou commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kou commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879506199 It's strange... Can we confirm which OpenSSL is used for this case?
[GitHub] [arrow] ZMZ91 commented on pull request #10606: ARROW-13005: [C++] Add support for take implementation on dense union type
ZMZ91 commented on pull request #10606: URL: https://github.com/apache/arrow/pull/10606#issuecomment-879501530 Hi @bkietz and @pitrou, could you help check what else needs updating in this PR? Or may we merge it? Thanks.
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10604: ARROW-13190: [C++] [Gandiva] Change behavior of INITCAP function
anthonylouisbsb commented on a change in pull request #10604: URL: https://github.com/apache/arrow/pull/10604#discussion_r669203451 ## File path: cpp/src/gandiva/gdv_function_stubs.cc ## @@ -427,7 +427,8 @@ CAST_VARLEN_TYPE_FROM_NUMERIC(VARBINARY) #undef GDV_FN_CAST_VARCHAR_REAL GANDIVA_EXPORT -int32_t gdv_fn_utf8_char_length(char c) { +FORCE_INLINE Review comment: I applied that change; let's see if the tests pass
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669196394 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { Review comment: I added the test and found an error in the implementation when it needs to match the string partially. I will look into it; I think it is an incorrect use of the RE2 library
[GitHub] [arrow] westonpace commented on pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage
westonpace commented on pull request #10629: URL: https://github.com/apache/arrow/pull/10629#issuecomment-879491179 Any remaining comments or suggestions?
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669196057 ## File path: cpp/src/gandiva/string_function_holder.h ## @@ -0,0 +1,140 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include + +#include +#include + +#include "arrow/status.h" +#include "gandiva/execution_context.h" +#include "gandiva/function_holder.h" +#include "gandiva/node.h" +#include "gandiva/visibility.h" + +namespace gandiva { + +/// Function Holder for SQL 'like' +class GANDIVA_EXPORT LikeHolder : public FunctionHolder { + public: + ~LikeHolder() override = default; + + static Status Make(const FunctionNode& node, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, const std::string& escape_char, + std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, std::shared_ptr* holder, + RE2::Options regex_op); + + // Try and optimise a function node with a "like" pattern. + static const FunctionNode TryOptimize(const FunctionNode& node); + + /// Return true if the data matches the pattern. 
+ bool operator()(const std::string& data) { return RE2::FullMatch(data, regex_); } + + private: + explicit LikeHolder(const std::string& pattern) : pattern_(pattern), regex_(pattern) {} + + LikeHolder(const std::string& pattern, RE2::Options regex_op) + : pattern_(pattern), regex_(pattern, regex_op) {} + + std::string pattern_; // posix pattern string, to help debugging + RE2 regex_;// compiled regex for the pattern + + static RE2 starts_with_regex_; // pre-compiled pattern for matching starts_with + static RE2 ends_with_regex_;// pre-compiled pattern for matching ends_with + static RE2 is_substr_regex_;// pre-compiled pattern for matching is_substr +}; + +/// Function Holder for 'regexp_extract' function +class GANDIVA_EXPORT ExtractHolder : public FunctionHolder { + public: + ~ExtractHolder() override = default; + + static Status Make(const FunctionNode& node, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, + std::shared_ptr* holder); + + /// Extracts the matching text from a string using a regex + const char* operator()(ExecutionContext* ctx, const char* user_input, Review comment: I moved the code to the file
[GitHub] [arrow-datafusion] Jimexist commented on pull request #687: #554: Lead/lag window function with offset and default value arguments
Jimexist commented on pull request #687: URL: https://github.com/apache/arrow-datafusion/pull/687#issuecomment-879487452 > @Jimexist do you think this PR is ready? Do you need help reviewing ? looks okay after rebasing.
[GitHub] [arrow-datafusion] djKooks edited a comment on pull request #442: Change return type of 'DataFrame.collect()'
djKooks edited a comment on pull request #442: URL: https://github.com/apache/arrow-datafusion/pull/442#issuecomment-879483480 @alamb sure~ But since I'm still reviewing the project, I want to make sure I'm going in the right direction... > As far as I can tell so far, it seems the return format of collect inside dataframe_impl needs to be fixed. And I've fixed collect inside physical_plan because it calls this function internally... > Please let me know if there's any misunderstanding => Am I thinking correctly?
[GitHub] [arrow] kszucs edited a comment on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs edited a comment on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879399202 @kou Interestingly we have [linker errors](https://github.com/ursacomputing/crossbow/runs/3060717856?check_suite_focus=true#step:8:654) in the universal2 build but the tests seem to work for both x86_64 and arm64 in their own separate virtualenvs. Can this cause issues with `PARQUET_REQUIRE_ENCRYPTION` and `ARROW_S3` (`ARROW_FLIGHT` is disabled)? cc @xhochy
[GitHub] [arrow] github-actions[bot] commented on pull request #10718: ARROW-13331: [C++][Gandiva] Add format_number hive function to gandiva
github-actions[bot] commented on pull request #10718: URL: https://github.com/apache/arrow/pull/10718#issuecomment-879478708 https://issues.apache.org/jira/browse/ARROW-13331
[GitHub] [arrow] augustoasilva opened a new pull request #10718: ARROW-13331: [C++][Gandiva] Add format_number hive function to gandiva
augustoasilva opened a new pull request #10718: URL: https://github.com/apache/arrow/pull/10718 Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part.
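The described semantics can be sketched in a few lines of Python (this is a rough model of the Hive `format_number` behavior described above, not Gandiva's implementation; note Python's `round` uses banker's rounding, while Hive rounds half-up, so some edge cases differ):

```python
def format_number(x, d):
    """Format x with thousands separators, rounded to d decimal places.
    With d == 0 the result has no decimal point or fractional part."""
    if d == 0:
        return format(round(x), ",d")
    # format() itself rounds to d fractional digits
    return format(x, f",.{d}f")

print(format_number(1234567.891, 2))  # 1,234,567.89
print(format_number(1234567.891, 0))  # 1,234,568
```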
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669181293 ## File path: cpp/src/gandiva/CMakeLists.txt ## @@ -82,7 +82,7 @@ set(SRC_FILES hash_utils.cc llvm_generator.cc llvm_types.cc -like_holder.cc +string_function_holder.cc Review comment: Renamed
[GitHub] [arrow] jpedroantunes commented on a change in pull request #10516: ARROW-13049: [C++][Gandiva] Implement BIN Hive function on Gandiva
jpedroantunes commented on a change in pull request #10516: URL: https://github.com/apache/arrow/pull/10516#discussion_r669179317 ## File path: cpp/src/gandiva/precompiled/extended_math_ops.cc ## @@ -367,4 +367,36 @@ gdv_float64 get_scale_multiplier(gdv_int32 scale) { return power_float64_float64(10.0, scale); } +// returns the binary representation of a given integer (e.g. 928 -> 111010) +#define BIN_INTEGER(IN_TYPE) \ + FORCE_INLINE \ + const char* bin_##IN_TYPE(int64_t context, gdv_##IN_TYPE value, int32_t* out_len) { \ +*out_len = 0; \ +int32_t len = 8 * sizeof(value); \ +char* ret = reinterpret_cast(gdv_fn_context_arena_malloc(context, len)); \ +if (ret == nullptr) { \ + gdv_fn_context_set_error_msg(context, "Could not allocate memory for output"); \ + return ""; \ +} \ +/* generate binary representation iteratively */ \ +gdv_u##IN_TYPE i; \ +bool first = false; /* flag for not printing left zeros in positive numbers */\ +for (i = static_cast(1) << (len - 1); i > 0; i = i / 2) { \ + if ((value & i) != 0) { \ Review comment: You are right! Added the handler for that case and also added tests to cover it.
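For reference, here is a Python sketch of what the quoted `bin_##IN_TYPE` macro computes: the two's-complement bit pattern of the input within a fixed word width, with leading zeros dropped (the role of the macro's `first` flag for non-negative inputs). The function name and the 32-bit default width are illustrative; 928, for instance, comes out as `1110100000`:

```python
def bin_int(value, bits=32):
    """Binary string for value in a bits-wide two's-complement word;
    leading zeros are dropped, so negatives print all `bits` digits."""
    return format(value & ((1 << bits) - 1), "b")

print(bin_int(928))    # 1110100000
print(bin_int(-1, 8))  # 11111111
```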
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669179205 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { + std::shared_ptr extract_holder; + + // Pattern to match of two group of letters + auto status = ExtractHolder::Make(R"((\w+) (\w+))", _holder); Review comment: And in the example, index 0 returns the entire text matched by the regex, index 1 the first group in the regex (the first name), and index 2 the second group (the surname)
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669177954 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { + std::shared_ptr extract_holder; + + // Pattern to match of two group of letters + auto status = ExtractHolder::Make(R"((\w+) (\w+))", _holder); Review comment: About the first question, the first and last parentheses do not count; they are part of the [raw string literal syntax](https://www.geeksforgeeks.org/raw-string-literal-c/) in C++
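The group-index semantics discussed above can be demonstrated with Python's `re` module (Python raw strings `r"..."` play the same delimiting role that C++ raw string literals `R"(...)"` do, except that C++ additionally requires the outer parentheses, which are not a capture group):

```python
import re

# Two groups of word characters separated by a space, as in the Gandiva test.
# In C++ the pattern is written R"((\w+) (\w+))"; the outermost parentheses
# belong to the raw-string literal, not to the regex.
pattern = re.compile(r"(\w+) (\w+)")

m = pattern.match("John Doe")
print(m.group(0))  # John Doe  -> whole matched text (index 0)
print(m.group(1))  # John      -> first capture group (first name)
print(m.group(2))  # Doe       -> second capture group (surname)
```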
[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
rok commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669166367 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: No worries, thanks for responding :) The TZ functionality of the vendored library (date.h) needs to download a timezone DB on windows to work. There's [a jira to set that up](https://issues.apache.org/jira/browse/ARROW-13168) so this will eventually need to compile on windows.
[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
westonpace commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669164458 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Ah, sorry, thought this was on the other PR where I was questioning this. I will let someone with more knowledge of the build process chime in on whether this is correct or not. I thought the TZ stuff didn't support Windows anyways? Or is that only strftime?
[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
westonpace commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669163511 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: No, I didn't see your earlier comment so I didn't understand why you were adding this. I get it now, sorry.
[GitHub] [arrow] github-actions[bot] commented on pull request #10717: ARROW-13091: [Python] Added compression_level to IpcWriteOptions
github-actions[bot] commented on pull request #10717: URL: https://github.com/apache/arrow/pull/10717#issuecomment-879456403 https://issues.apache.org/jira/browse/ARROW-13091
[GitHub] [arrow] westonpace opened a new pull request #10717: ARROW-13091: [Python] Added compression_level to IpcWriteOptions
westonpace opened a new pull request #10717: URL: https://github.com/apache/arrow/pull/10717
[GitHub] [arrow] kevingurney commented on pull request #10614: ARROW-13100: [MATLAB] Integrate GoogleTest with MATLAB Interface C++ Code
kevingurney commented on pull request #10614: URL: https://github.com/apache/arrow/pull/10614#issuecomment-879449900 @kou we just pushed updates to `matlab/CMakeLists.txt` which enable building of the Arrow C++ libraries and bundled GoogleTest binaries automatically from source on macOS and Windows. We are now able to build and run the MATLAB C++ tests consistently, using the same set of commands, on Windows, Mac, and Linux: ```bash $ cmake -S . -B build -D MATLAB_BUILD_TESTS=ON $ cmake --build build --config Release $ ctest --test-dir build ``` I think we are good to go now.
[GitHub] [arrow] thisisnic commented on a change in pull request #10624: ARROW-12992: [R] bindings for substr(), substring(), str_sub()
thisisnic commented on a change in pull request #10624: URL: https://github.com/apache/arrow/pull/10624#discussion_r669142885 ## File path: r/src/compute.cpp ## @@ -316,6 +316,19 @@ std::shared_ptr make_compute_options( return std::make_shared(max_splits, reverse); } + if (func_name == "utf8_slice_codeunits") { +using Options = arrow::compute::SliceOptions; +int64_t start = 1; +int64_t stop = -1; Review comment: Sorry, I didn't explain this properly, my fault! What I mean is that here stop has been set to `-1` but the default C++ value isn't `-1`, so the default value here probably shouldn't be `-1` either. Check out this line of code here that shows the default value of `stop` in the C++: https://github.com/apache/arrow/blob/7114c4b6fdb639b3500d77cfd66649af8c5c5e6b/cpp/src/arrow/compute/api_scalar.h#L205-L206 The default value of `stop` is `std::numeric_limits::max()` which is the biggest int64. So if you don't supply a value for `stop` and just use the default, this allows you to slice to the end of the string. In some of the other functions in this file, this line has been used to set the default values to the ones from the C++ code: `std::make_shared(Options::Defaults());` Maybe instead of manually setting the value of stop to `-1`, if you use the line above, it might make it easier to fix some of the failing tests as now you'll be able to slice to the end of a string. If this doesn't make sense, let me know!
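The reviewer's point about the default `stop` can be illustrated with a toy Python mirror (the function name is hypothetical and the indexing is 0-based for simplicity; the real `SliceOptions` lives in Arrow C++): when `stop` defaults to the maximum int64, an unspecified stop always slices to the end of the string.

```python
INT64_MAX = 2**63 - 1  # mirrors the "biggest int64" default described above

def utf8_slice(s, start=0, stop=INT64_MAX):
    """Slice s from start up to but not including stop; the huge
    default stop means "to the end of the string"."""
    return s[start:min(stop, len(s))]

print(utf8_slice("datafusion", 4))     # fusion
print(utf8_slice("datafusion", 0, 4))  # data
```

This is why a hard-coded `stop = -1` misbehaves: it bakes in a sentinel the C++ side never uses, instead of inheriting the slice-to-end default.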
[GitHub] [arrow] zeroshade opened a new pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package
zeroshade opened a new pull request #10716: URL: https://github.com/apache/arrow/pull/10716 @emkornfield Thanks for merging the previous PR #10379 Here's the remaining files that we pulled out of that PR to shrink it down, including all the unit tests for the Encoding package.
[GitHub] [arrow] github-actions[bot] commented on pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package
github-actions[bot] commented on pull request #10716: URL: https://github.com/apache/arrow/pull/10716#issuecomment-879433836 https://issues.apache.org/jira/browse/ARROW-13330
[GitHub] [arrow-rs] alamb closed issue #343: Add a RecordBatch::split to split large batches into a set of smaller batches
alamb closed issue #343: URL: https://github.com/apache/arrow-rs/issues/343
[GitHub] [arrow-rs] alamb commented on issue #343: Add a RecordBatch::split to split large batches into a set of smaller batches
alamb commented on issue #343: URL: https://github.com/apache/arrow-rs/issues/343#issuecomment-879428117 Given we now have slice in https://github.com/apache/arrow-rs/issues/460 I don't think this adds much anymore
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669128039 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -121,6 +121,61 @@ def show(self, outstream, asset_callback=None): asset)) +class JsonReport(Report): + +HEADER = textwrap.dedent(""" +Arrow Build Report for Job {job_name} + +All tasks: {all_tasks_url} +""") + +TASK = textwrap.dedent(""" + - {name}: +URL: {url} +""").strip() + +STATUS_HEADERS = { +# from CombinedStatus +'error': 'Errored Tasks:', +'failure': 'Failed Tasks:', +'pending': 'Pending Tasks:', +'success': 'Succeeded Tasks:', +} + +def __init__(self, job): Review comment: You can remove this method since this is the default behaviour and you're not extending it.
[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
rok commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669127595 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Should this be somewhere else?
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669125627 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -32,7 +33,6 @@ def __init__(self, job): def show(self): raise NotImplementedError() - Review comment: I suspect this new line was intentional, as archery uses black for formatting. - https://github.com/apache/arrow/pull/10012 - https://github.com/psf/black > Surround top-level function and class definitions with two blank lines. - https://www.python.org/dev/peps/pep-0008/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
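The PEP 8 rule dianaclarke quotes ("surround top-level function and class definitions with two blank lines", which black enforces) can be checked mechanically. A minimal sketch, assuming nothing about archery's code beyond the quoted `Report`/`JsonReport` classes; the helper name is made up for illustration:

```python
import re

# Source laid out the way black would format it: two blank lines
# between top-level definitions.
GOOD = """\
class Report:
    def show(self):
        raise NotImplementedError()


class JsonReport(Report):
    pass
"""

def blank_lines_before_top_level_defs(source):
    """Return, for each top-level def/class, the number of blank lines
    immediately preceding it."""
    counts = []
    blanks = 0
    for line in source.splitlines():
        if re.match(r"(def |class )", line):
            counts.append(blanks)
        blanks = blanks + 1 if line.strip() == "" else 0
    return counts

# The first class has nothing above it; the second is separated by
# exactly two blank lines, as PEP 8/black require.
print(blank_lines_before_top_level_defs(GOOD))  # [0, 2]
```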
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669121940 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -121,6 +121,61 @@ def show(self, outstream, asset_callback=None): asset)) +class JsonReport(Report): + +HEADER = textwrap.dedent(""" +Arrow Build Report for Job {job_name} + +All tasks: {all_tasks_url} +""") + +TASK = textwrap.dedent(""" + - {name}: +URL: {url} +""").strip() + +STATUS_HEADERS = { +# from CombinedStatus +'error': 'Errored Tasks:', +'failure': 'Failed Tasks:', +'pending': 'Pending Tasks:', +'success': 'Succeeded Tasks:', +} + +def __init__(self, job): +super().__init__(job) + +def url(self, query): +repo_url = self.job.queue.remote_url.strip('.git') +return '{}/branches/all?query={}'.format(repo_url, query) + +def todayStr(self): +date = datetime.utcnow() +return "{}-{}-{}".format(date.year, date.month, date.day) + +def tasksToDict(self, date, tasks): Review comment: Python uses snake_case for variable, function, and method names. - `tasksToDict` --> `tasks_to_dict` - `jsonTasks` --> `json_tasks` - `jsonStr` --> `json_str` - `getJsonTasks` --> `get_json_tasks` - etc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
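The camelCase-to-snake_case renames suggested above follow a mechanical pattern; a small sketch (the converter is illustrative, not part of archery):

```python
import re

def to_snake_case(name):
    """Convert a camelCase identifier to the snake_case form PEP 8 expects:
    insert an underscore before each uppercase letter that follows a
    lowercase letter or digit, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

# The renames from the review comment:
print(to_snake_case("tasksToDict"))   # tasks_to_dict
print(to_snake_case("jsonTasks"))     # json_tasks
print(to_snake_case("jsonStr"))       # json_str
print(to_snake_case("getJsonTasks"))  # get_json_tasks
```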
[GitHub] [arrow] westonpace commented on a change in pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
westonpace commented on a change in pull request #10705: URL: https://github.com/apache/arrow/pull/10705#discussion_r669115135 ## File path: cpp/src/arrow/compute/exec/exec_plan.cc ## @@ -601,5 +618,215 @@ AsyncGenerator<util::optional<ExecBatch>> MakeSinkNode(ExecNode* input, return out; } +std::shared_ptr<RecordBatchReader> MakeSinkNodeReader(ExecNode* input, + std::string label) { + struct Impl : RecordBatchReader { +std::shared_ptr<Schema> schema() const override { return schema_; } +Status ReadNext(std::shared_ptr<RecordBatch>* record_batch) override { + ARROW_ASSIGN_OR_RAISE(auto batch, iterator_.Next()); + if (batch) { +ARROW_ASSIGN_OR_RAISE(*record_batch, batch->ToRecordBatch(schema_, pool_)); + } else { +*record_batch = IterationEnd<std::shared_ptr<RecordBatch>>(); + } + return Status::OK(); +} + +MemoryPool* pool_; +std::shared_ptr<Schema> schema_; +Iterator<util::optional<ExecBatch>> iterator_; + }; + + auto out = std::make_shared<Impl>(); + out->pool_ = input->plan()->exec_context()->memory_pool(); + out->schema_ = input->output_schema(); + out->iterator_ = MakeGeneratorIterator(MakeSinkNode(input, std::move(label))); + return out; +} + +struct ScalarAggregateNode : ExecNode { + ScalarAggregateNode(ExecNode* input, std::string label, + std::shared_ptr<Schema> output_schema, + std::vector<const ScalarAggregateKernel*> kernels, + std::vector<std::vector<std::unique_ptr<KernelState>>> states) + : ExecNode(input->plan(), std::move(label), {input}, {"target"}, + /*output_schema=*/std::move(output_schema), + /*num_outputs=*/1), +kernels_(std::move(kernels)), +states_(std::move(states)) {} + + const char* kind_name() override { return "ScalarAggregateNode"; } + + Status DoConsume(const ExecBatch& batch, + const std::vector<std::unique_ptr<KernelState>>& states) { +for (size_t i = 0; i < states.size(); ++i) { + KernelContext batch_ctx{plan()->exec_context()}; + batch_ctx.SetState(states[i].get()); + ExecBatch single_column_batch{{batch.values[i]}, batch.length}; + RETURN_NOT_OK(kernels_[i]->consume(&batch_ctx, single_column_batch)); +} +return Status::OK(); + } + + void InputReceived(ExecNode* input, int seq, ExecBatch batch) override { +DCHECK_EQ(input, inputs_[0]); + +std::unique_lock<std::mutex> lock(mutex_); 
+auto it = +thread_indices_.emplace(std::this_thread::get_id(), thread_indices_.size()).first; +++num_received_; +auto thread_index = it->second; + +lock.unlock(); + +const auto& thread_local_state = states_[thread_index]; +Status st = DoConsume(std::move(batch), thread_local_state); +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); + return; +} + +lock.lock(); +st = MaybeFinish(&lock); +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); +} + } + + void ErrorReceived(ExecNode* input, Status error) override { +DCHECK_EQ(input, inputs_[0]); +outputs_[0]->ErrorReceived(this, std::move(error)); + } + + void InputFinished(ExecNode* input, int seq) override { +DCHECK_EQ(input, inputs_[0]); +std::unique_lock<std::mutex> lock(mutex_); +num_total_ = seq; +Status st = MaybeFinish(&lock); + +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); +} + } + + Status StartProducing() override { +finished_ = Future<>::Make(); +// Scalar aggregates will only output a single batch +outputs_[0]->InputFinished(this, 1); +return Status::OK(); + } + + void PauseProducing(ExecNode* output) override {} + + void ResumeProducing(ExecNode* output) override {} + + void StopProducing(ExecNode* output) override { +DCHECK_EQ(output, outputs_[0]); +StopProducing(); + } + + void StopProducing() override { +inputs_[0]->StopProducing(this); +finished_.MarkFinished(); + } + + Future<> finished() override { return finished_; } + + private: + Status MaybeFinish(std::unique_lock<std::mutex>* lock) { +if (num_received_ != num_total_) return Status::OK(); + +if (finished_.is_finished()) return Status::OK(); + +ExecBatch batch{{}, 1}; +batch.values.resize(kernels_.size()); + +for (size_t i = 0; i < kernels_.size(); ++i) { Review comment: Ah, in my head "merge" meant something more like a merge sort. If it's just summing up a sum/mean/etc. counter across the various states, then I agree it's not necessary. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
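The point westonpace concedes above — that combining thread-local scalar-aggregate states is not a merge sort but a pointwise sum of counters — can be sketched as follows. This is purely illustrative Python, not Arrow's kernel API; the class and method names are made up:

```python
# Each "thread" consumes its own batches into a private state; the final
# step combines the per-thread counters. For decomposable aggregates like
# sum/count/mean this combination is just addition -- no ordering involved.
class MeanState:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def consume(self, batch):
        # Called on the owning thread only, so no locking needed here.
        self.total += sum(batch)
        self.count += len(batch)

    def merge(self, other):
        # "Merging" is a pointwise sum of counters, not a merge sort.
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count

# Two thread-local states, combined at the end:
a, b = MeanState(), MeanState()
a.consume([1.0, 2.0])
b.consume([3.0, 4.0, 5.0])
a.merge(b)
print(a.finalize())  # 3.0
```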
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879407362 Revision: 10dbf02030fc431bf175826a51214e643a916042 Submitted crossbow builds: [ursacomputing/crossbow @ actions-593](https://github.com/ursacomputing/crossbow/branches/all?query=actions-593) |Task|Status| ||--| |wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-big-sur-cp39-arm64)| |wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-big-sur-cp39-universal2)| |wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp36-amd64)| |wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp37-amd64)| |wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp38-amd64)| |wheel-macos-high-sierra-cp39-amd64|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp39-amd64)| |wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp36-amd64)| |wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp37-amd64)| |wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp38-amd64)| |wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp39-amd64)| |wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp36-amd64)| |wheel-manylinux2010-cp37-amd64|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp37-amd64)| |wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp38-amd64)| |wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp39-amd64)| |wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879406726 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
rok commented on pull request #10647: URL: https://github.com/apache/arrow/pull/10647#issuecomment-879403748 > Thanks for the changes. This looks good. I'm afraid I don't know the differences between `date.h` and the system equivalents. By system equivalents do you mean `ctime`? In that case does strftime support time zones? I meant like [cpp](https://en.cppreference.com/w/cpp/chrono/c/strftime) or [linux](https://man7.org/linux/man-pages/man3/strftime.3.html) strftimes. It seems they support timezones but one still needs to go from UTC to local time etc. > Since we already have the vendored lib in I suppose it comes down to readability, simplicity, and performance. What you have here seems plenty readable and simple. We can always benchmark the two approaches later. Fully agreed. Without benchmarking alternatives it's a moot discussion anyway. I just wanted to check if I missed something obvious. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
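The detail rok notes — that the C/Linux `strftime` formats whatever broken-down time it is handed, so the UTC-to-local conversion has to happen separately beforehand — has a direct Python analogue, shown here as an illustrative sketch (the +02:00 offset is an arbitrary example, not anything from the PR):

```python
from datetime import datetime, timezone, timedelta

# strftime only formats; it does not convert between zones. Convert from
# UTC to the target zone first, then format.
utc = datetime(2021, 7, 14, 12, 0, tzinfo=timezone.utc)
local = utc.astimezone(timezone(timedelta(hours=2)))  # e.g. a +02:00 zone

print(utc.strftime("%Y-%m-%d %H:%M %Z"))    # 2021-07-14 12:00 UTC
print(local.strftime("%Y-%m-%d %H:%M %Z"))  # 2021-07-14 14:00 UTC+02:00
```

The same wall-clock instant renders with a different hour and zone name depending on which converted value is formatted — which is exactly the extra step a vendored library like `date.h` can bundle with formatting.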
[GitHub] [arrow-rs] alamb commented on a change in pull request #491: Minimal MapArray support
alamb commented on a change in pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#discussion_r669078940 ## File path: arrow/Cargo.toml ## @@ -58,13 +58,15 @@ multiversion = "0.6.1" bitflags = "1.2.1" [features] -default = ["csv", "ipc"] +default = ["csv", "ipc", "experimental-api"] Review comment: I recommend that `experimental-api` is not enabled by default as an additional safeguard against someone being surprised if a new arrow release breaks it ## File path: arrow/src/array/array_map.rs ## @@ -0,0 +1,380 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt; +use std::mem; + +use super::make_array; +use super::{ +array::print_long_array, raw_pointer::RawPtrBox, Array, ArrayData, ArrayRef, +}; +use crate::datatypes::{ArrowNativeType, DataType}; +use crate::error::ArrowError; + +/// A nested array type where each record is a key-value map. +/// Keys should always be non-null, but values can be null. +/// +/// [MapArray] is physically a [ListArray] that has a [StructArray] +/// with 2 child fields. 
+pub struct MapArray { +data: ArrayData, +values: ArrayRef, +value_offsets: RawPtrBox<i32>, +} + +impl MapArray { +#[cfg(feature = "experimental-api")] +/// Returns a reference to the keys of this map. +pub fn keys(&self) -> ArrayRef { +make_array(self.values.data().child_data()[0].clone()) +} + +#[cfg(feature = "experimental-api")] +/// Returns a reference to the values of this map. +pub fn values(&self) -> ArrayRef { +make_array(self.values.data().child_data()[1].clone()) +} + +#[cfg(feature = "experimental-api")] +/// Returns the data type of the map's keys. +pub fn key_type(&self) -> DataType { +self.values.data().child_data()[0].data_type().clone() +} + +#[cfg(feature = "experimental-api")] +/// Returns the data type of the map's values. +pub fn value_type(&self) -> DataType { +self.values.data().child_data()[1].data_type().clone() +} + +/// Returns ith value of this map array. +/// # Safety +/// Caller must ensure that the index is within the array bounds +pub unsafe fn value_unchecked(&self, i: usize) -> ArrayRef { +let end = *self.value_offsets().get_unchecked(i + 1); +let start = *self.value_offsets().get_unchecked(i); +self.values +.slice(start.to_usize().unwrap(), (end - start).to_usize().unwrap()) +} + +/// Returns ith value of this map array. +pub fn value(&self, i: usize) -> ArrayRef { +let end = self.value_offsets()[i + 1] as usize; +let start = self.value_offsets()[i] as usize; +self.values.slice(start, end - start) +} + +/// Returns the offset values in the offsets buffer +#[inline] +pub fn value_offsets(&self) -> &[i32] { +// Soundness +// pointer alignment & location is ensured by RawPtrBox +// buffer bounds/offset is ensured by the ArrayData instance. +unsafe { +std::slice::from_raw_parts( +self.value_offsets.as_ptr().add(self.data.offset()), +self.len() + 1, +) +} +} + +/// Returns the length for value at index `i`. 
+#[inline] +pub fn value_length(&self, i: usize) -> i32 { +let offsets = self.value_offsets(); +offsets[i + 1] - offsets[i] +} +} + +impl From<ArrayData> for MapArray { +fn from(data: ArrayData) -> Self { +Self::try_new_from_array_data(data) +.expect("Expected infallible creation of MapArray from ArrayData failed") +} +} + +impl MapArray { +fn try_new_from_array_data(data: ArrayData) -> Result<Self> { +if data.buffers().len() != 1 { +return Err(ArrowError::InvalidArgumentError( +format!("MapArray data should contain a single buffer only (value offsets), had {}", +data.len()))); +} + +if data.child_data().len() != 1 { +return Err(ArrowError::InvalidArgumentError(format!( +"MapArray should contain a single child array (values array), had {}", +data.child_data().len() +
[GitHub] [arrow] rok commented on a change in pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
rok commented on a change in pull request #10647: URL: https://github.com/apache/arrow/pull/10647#discussion_r669107412 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Yeah, without it RTools40 fails on the mingw32 build: ``` C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-4.0.1.9000/lib/i386/libarrow.a(tz.cpp.obj):(.text+0x725): undefined reference to `_imp__CoTaskMemFree@4' C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-4.0.1.9000/lib/i386/libarrow.a(tz.cpp.obj):(.text.unlikely+0x3e): undefined reference to `_imp__CoTaskMemFree@4' ``` [Fix source here](https://github.com/HowardHinnant/date/issues/272). I don't really know what it does. I'm also not sure if this is the place to put it. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs edited a comment on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs edited a comment on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879399202 @kou Interestingly we have [linker errors](https://github.com/ursacomputing/crossbow/runs/3060717856?check_suite_focus=true#step:8:654) in the universal2 build but the tests seem to work for both x86_64 and arm64 in their own separate virtualenvs. Can this cause issues with `PARQUET_REQUIRE_ENCRYPTION` and `ARROW_S3` (`ARROW_FLIGHT` is disabled)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
alamb commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879398069 Cool -- I am just trying to shepherd PRs that look like they got stale -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
bkietz closed pull request #10705: URL: https://github.com/apache/arrow/pull/10705 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #722: perf: improve performance of `SortPreservingMergeExec` operator
alamb commented on a change in pull request #722: URL: https://github.com/apache/arrow-datafusion/pull/722#discussion_r669104634 ## File path: datafusion/src/physical_plan/sort_preserving_merge.rs ## @@ -246,7 +274,19 @@ impl SortKeyCursor { .zip(other.columns.iter()) .zip(options.iter()); -for ((l, r), sort_options) in zipped { +// Recall or initialise a collection of comparators for comparing +// columnar arrays of this cursor and "other". +let cmp = self +.batch_comparators +.entry(other.batch_idx) +.or_insert_with(|| Vec::with_capacity(other.columns.len())); Review comment: I think this way is fine, personally -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
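The pattern alamb is approving — lazily create a per-partner collection of comparators the first time a cursor meets that partner's batch, then reuse it on every subsequent comparison — maps onto `dict.setdefault` in Python. An illustrative sketch, not DataFusion's actual API (the class and method names here are made up):

```python
class SortKeyCursor:
    def __init__(self, num_columns):
        self.num_columns = num_columns
        # batch_idx -> list of column comparators, built on first use.
        self.batch_comparators = {}

    def comparators_for(self, other_batch_idx):
        # Like Rust's `entry(...).or_insert_with(|| Vec::with_capacity(n))`:
        # create the (initially empty) list once per partner batch, then
        # hand back the cached list on every later call.
        return self.batch_comparators.setdefault(other_batch_idx, [])

cursor = SortKeyCursor(3)
cmp1 = cursor.comparators_for(7)
cmp1.append("comparator for column 0")   # filled in lazily per column
cmp2 = cursor.comparators_for(7)         # second lookup hits the cache
print(cmp2 is cmp1)  # True
```

The win is the same as in the PR: comparator construction cost is paid once per (cursor, partner-batch) pair instead of once per row comparison.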
[GitHub] [arrow] lidavidm commented on a change in pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
lidavidm commented on a change in pull request #10705: URL: https://github.com/apache/arrow/pull/10705#discussion_r669103875 ## File path: cpp/src/arrow/dataset/scanner_test.cc ## @@ -1323,5 +1323,86 @@ TEST(ScanNode, MaterializationOfVirtualColumn) { Finishes(ResultWith(UnorderedElementsAreArray(expected)))); } +TEST(ScanNode, MinimalEndToEnd) { + // NB: This test is here for didactic purposes + + // Specify a MemoryPool and ThreadPool for the ExecPlan + compute::ExecContext exec_context(default_memory_pool(), internal::GetCpuThreadPool()); + + // A ScanNode is constructed from an ExecPlan (into which it is inserted), + // a Dataset (whose batches will be scanned), and ScanOptions (to specify a filter for + // predicate pushdown, a projection to skip materialization of unnecessary columns, ...) + ASSERT_OK_AND_ASSIGN(std::shared_ptr<compute::ExecPlan> plan, + compute::ExecPlan::Make(&exec_context)); + + std::shared_ptr<Dataset> dataset = std::make_shared<InMemoryDataset>( + TableFromJSON(schema({field("a", int32()), field("b", boolean())}), +{ +R"([{"a": 1,"b": null}, +{"a": 2,"b": true}])", +R"([{"a": null, "b": true}, +{"a": 3,"b": false}])", +R"([{"a": null, "b": true}, +{"a": 4,"b": false}])", +R"([{"a": 5,"b": null}, +{"a": 6,"b": false}, +{"a": 7,"b": false}])", +})); + + auto options = std::make_shared<ScanOptions>(); + // sync scanning is not supported by ScanNode + options->use_async = true; + // for now, we must replicate the dataset schema here + options->dataset_schema = dataset->schema(); + // specify the filter + compute::Expression b_is_true = field_ref("b"); + ASSERT_OK_AND_ASSIGN(b_is_true, b_is_true.Bind(*dataset->schema())); + options->filter = b_is_true; + // for now, specify the projection as the full project expression (eventually this can + // just be a list of materialized field names) + compute::Expression a_times_2 = call("multiply", {field_ref("a"), literal(2)}); + ASSERT_OK_AND_ASSIGN(a_times_2, a_times_2.Bind(*dataset->schema())); + options->projection = call("project", {a_times_2}, 
 compute::ProjectOptions{{"a * 2"}}); + + // construct the scan node + ASSERT_OK_AND_ASSIGN(compute::ExecNode * scan, + dataset::MakeScanNode(plan.get(), dataset, options)); + + // pipe the scan node into a filter node + ASSERT_OK_AND_ASSIGN(compute::ExecNode * filter, + compute::MakeFilterNode(scan, "filter", b_is_true)); + + // pipe the filter node into a project node + // NB: we're using the project node factory which preserves fragment/batch index + // tagging, so we *can* reorder later if we choose. The tags will not appear in + // our output. + ASSERT_OK_AND_ASSIGN(compute::ExecNode * project, + dataset::MakeAugmentedProjectNode(filter, "project", {a_times_2})); + + // finally, pipe the project node into a sink node + // NB: if we don't need ordering, we could use compute::MakeSinkNode instead + ASSERT_OK_AND_ASSIGN(auto sink_gen, dataset::MakeOrderedSinkNode(project, "sink")); Review comment: Thanks for the example! I do like this setup a lot more since it is clearer what all the steps are in reading a dataset + everything is neatly separated. It is not that the current scanner does not separate the various stages of its pipeline, but the pipeline in the scanner is rather tied together while this is clearly partitioned. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
westonpace commented on a change in pull request #10647: URL: https://github.com/apache/arrow/pull/10647#discussion_r669098242 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Is this related? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
lidavidm commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879390751 I would kind of prefer to get all these kernels merged and consolidated before I start trying to microoptimize them, though, given they've been around for a while and all use similar helper code (that's now starting to diverge slightly once I look at optimizing). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
lidavidm commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879390143 Basically, there was a lot of overhead from the fallback loop of "for offset in range(block size), if bit is set, copy one element" because 1) the 'copy one element' function used CopyBitmap which has a *ton* of overhead for copying one bit and 2) unboxing the array every time was costly when done in a loop like that (e.g. the profiler showed that even Buffer::data()'s check for whether the buffer is on-CPU was hot). But now I've specialized things to avoid most of that overhead. The reason why I wanted something like VisitSetBitRunsVoid was to go a step further and always try to perform block copies instead of falling back to one-element-at-a-time-copies. But yes, then it needs to be able to combine two bitmaps with AndNot (we want runs of bits where !output_valid & input_valid) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
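The strategy lidavidm describes — combine the two validity bitmaps with "input valid AND NOT output valid", then copy whole runs of set positions instead of testing one bit at a time — can be sketched in Python with bool lists standing in for bitmaps (illustrative only; the helper name echoes but is not Arrow's `VisitSetBitRunsVoid`):

```python
def set_bit_runs(bits):
    """Yield (start, length) for each maximal run of True values."""
    runs, start = [], None
    for i, b in enumerate(bits + [False]):  # sentinel flushes the last run
        if b and start is None:
            start = i
        elif not b and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

# Positions to fill: the input has a value AND the output doesn't yet.
output_valid = [True, False, False, True, False]
input_valid  = [True, True,  True,  True, False]
to_copy = [i and not o for o, i in zip(output_valid, input_valid)]

# One block copy of length 2 instead of two single-element copies.
print(set_bit_runs(to_copy))  # [(1, 2)]
```

Each `(start, length)` run then becomes one block memcpy, avoiding the per-element `CopyBitmap`/unboxing overhead described above.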
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879378962 Revision: 3aa0df229a92a3425bcb9b5a7711853bcc49e53d Submitted crossbow builds: [ursacomputing/crossbow @ actions-592](https://github.com/ursacomputing/crossbow/branches/all?query=actions-592)

|Task|Status|
|----|------|
|wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-big-sur-cp39-arm64)|
|wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-big-sur-cp39-universal2)|
|wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp36-amd64)|
|wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp37-amd64)|
|wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp38-amd64)|
|wheel-macos-high-sierra-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp39-amd64)|
|wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp36-amd64)|
|wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp37-amd64)|
|wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp38-amd64)|
|wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp39-amd64)|
|wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp36-amd64)|
|wheel-manylinux2010-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp37-amd64)|
|wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp38-amd64)|
|wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp39-amd64)|
|wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879378309 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
codecov-commenter commented on pull request #545: URL: https://github.com/apache/arrow-rs/pull/545#issuecomment-879376748

# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) Report

> Merging [#545](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (89a087c) into [master](https://codecov.io/gh/apache/arrow-rs/commit/6698eed2900cffadc549a92defe2030d36ccf0d6?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (6698eed) will **not change** coverage.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/545/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)

```diff
@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   82.60%   82.60%
=======================================
  Files         167      167
  Lines       45984    45984
=======================================
  Hits        37984    37984
  Misses       8000     8000
```

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
Last update [6698eed...89a087c](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
lidavidm commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879371544 @cyb70289, might you have time to look at this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm closed pull request #10712: ARROW-13307: [C++][Python][R] Use reflection-based enums for function options
lidavidm closed pull request #10712: URL: https://github.com/apache/arrow/pull/10712 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #10412: ARROW-9430: [C++] Implement replace_with_mask kernel
lidavidm commented on a change in pull request #10412: URL: https://github.com/apache/arrow/pull/10412#discussion_r669074742

## File path: cpp/src/arrow/compute/kernels/vector_replace.cc

## @@ -0,0 +1,510 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/bitmap_ops.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+namespace {
+
+Status ReplacementArrayTooShort(int64_t expected, int64_t actual) {
+  return Status::Invalid("Replacement array must be of appropriate length (expected ",
+                         expected, " items but got ", actual, " items)");
+}
+
+// Helper to implement replace_with kernel with scalar mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types
+Status ReplaceWithScalarMask(KernelContext* ctx, const ArrayData& array,
+                             const BooleanScalar& mask, const Datum& replacements,
+                             ArrayData* output) {
+  if (!mask.is_valid) {
+    // Output = null
+    ARROW_ASSIGN_OR_RAISE(auto replacement_array,
+                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
+    *output = *replacement_array->data();
+    return Status::OK();
+  }
+  if (mask.value) {
+    // Output = replacement
+    if (replacements.is_scalar()) {
+      ARROW_ASSIGN_OR_RAISE(
+          auto replacement_array,
+          MakeArrayFromScalar(*replacements.scalar(), array.length, ctx->memory_pool()));
+      *output = *replacement_array->data();
+    } else {
+      auto replacement_array = replacements.array();
+      if (replacement_array->length != array.length) {
+        return ReplacementArrayTooShort(array.length, replacement_array->length);
+      }
+      *output = *replacement_array;
+    }
+  } else {
+    // Output = input
+    *output = array;
+  }
+  return Status::OK();
+}
+
+struct CopyArrayBitmap {
+  const uint8_t* in_bitmap;
+  int64_t in_offset;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    arrow::internal::CopyBitmap(in_bitmap, in_offset + offset, length, out_bitmap,
+                                out_offset);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset,
+                      BitUtil::GetBit(in_bitmap, in_offset + offset));
+  }
+};
+
+struct CopyScalarBitmap {
+  const bool is_valid;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    BitUtil::SetBitsTo(out_bitmap, out_offset, length, is_valid);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset, is_valid);
+  }
+};
+
+// Helper to implement replace_with kernel with array mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types and to handle
+// scalar and array replacements
+template <typename Functor, typename Data, typename CopyBitmap>
+void ReplaceWithArrayMaskImpl(const ArrayData& array, const ArrayData& mask,
+                              const Data& replacements, bool replacements_bitmap,
+                              const CopyBitmap copy_bitmap, const uint8_t* mask_bitmap,
+                              const uint8_t* mask_values, uint8_t* out_bitmap,
+                              uint8_t* out_values, const int64_t out_offset) {
+  Functor::CopyData(*array.type, out_values, /*out_offset=*/0, array, /*in_offset=*/0,
+                    array.length);
+  arrow::internal::OptionalBinaryBitBlockCounter counter(
+      mask_values, mask.offset, mask_bitmap, mask.offset, mask.length);
+  int64_t write_offset = 0;
+  int64_t replacements_offset = 0;
+  while (write_offset < array.length) {
+    BitBlockCount block = counter.NextAndBlock();
+    if (block.AllSet()) {
+      // Copy from replacement array
+      Functor::CopyData(*array.type, out_values, out_offset + write_offset, replacements,
+                        replacements_offset, block.length);
+      if (replacements_bitmap) {
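The scalar-mask branch structure quoted in the review above (null mask → all-null output, true mask → replacements, false mask → input unchanged, with a length check on array replacements) can be sketched in plain Python. This is an illustrative model of the control flow only, not the Arrow implementation; names and the list-based representation are assumptions:

```python
def replace_with_scalar_mask(array, mask, replacements):
    """Model of the scalar-mask dispatch: `array` is a list of values
    (None = null), `mask` is True, False, or None (a null boolean scalar),
    and `replacements` is either a single value or a list matching `array`."""
    if mask is None:
        # Null mask: the entire output is null.
        return [None] * len(array)
    if mask:
        # True mask: output comes entirely from the replacements.
        if not isinstance(replacements, list):
            # Scalar replacement: broadcast to the array length.
            return [replacements] * len(array)
        if len(replacements) != len(array):
            raise ValueError(
                "Replacement array must be of appropriate length (expected "
                f"{len(array)} items but got {len(replacements)} items)")
        return list(replacements)
    # False mask: output is the input, untouched.
    return list(array)
```

For example, `replace_with_scalar_mask([1, 2, 3], True, 9)` broadcasts the scalar and returns `[9, 9, 9]`, while a `None` mask nulls out every slot.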
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879364310 Revision: 086b4582939f4aef8b5784ab45ba4276f438ed6d Submitted crossbow builds: [ursacomputing/crossbow @ actions-591](https://github.com/ursacomputing/crossbow/branches/all?query=actions-591)

|Task|Status|
|----|------|
|wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-big-sur-cp39-arm64)|
|wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-big-sur-cp39-universal2)|
|wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp36-amd64)|
|wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp37-amd64)|
|wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp38-amd64)|
|wheel-macos-high-sierra-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp39-amd64)|
|wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp36-amd64)|
|wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp37-amd64)|
|wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp38-amd64)|
|wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp39-amd64)|
|wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp36-amd64)|
|wheel-manylinux2010-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp37-amd64)|
|wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp38-amd64)|
|wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp39-amd64)|
|wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879363719 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10557: ARROW-13064: [C++] Implement select ('case when') function for fixed-width types
lidavidm commented on pull request #10557: URL: https://github.com/apache/arrow/pull/10557#issuecomment-879363403 Fixed docstrings and fixed a TODO I noticed about fully initializing the output buffer. Also added a benchmark case for when both the cond struct array and the child cond boolean arrays can have nulls. This case is especially terrible; I made a slight optimization to eliminate one of the more egregious offenders I saw in perf, but it's still bad even then:

```
--------------------------------------------------------------------------------------------------------------
Benchmark                                       Time           CPU   Iterations   UserCounters...
--------------------------------------------------------------------------------------------------------------
CaseWhenBench64/1048576/0                 6045379 ns    6045204 ns          125   bytes_per_second=1.29235G/s
CaseWhenBench64/1048576/99                6106510 ns    6106346 ns          123   bytes_per_second=1.27929G/s
CaseWhenBench64OuterNulls/1048576/0      32735686 ns   32733991 ns           23   bytes_per_second=244.394M/s
CaseWhenBench64OuterNulls/1048576/99     34098317 ns   34097473 ns           21   bytes_per_second=234.599M/s
CaseWhenBench64Contiguous/1048576/0       2229728 ns    2229687 ns          338   bytes_per_second=3.50386G/s
CaseWhenBench64Contiguous/1048576/99      2456084 ns    2456085 ns          237   bytes_per_second=3.18058G/s
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
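For context on what the CaseWhenBench numbers above measure: a "case when" kernel picks, per row, the value from the first branch whose boolean condition holds, falling back to a default when no condition fires (a null condition skips that branch). A rough Python model of that row-wise selection, with illustrative names rather than the Arrow API:

```python
def case_when(conds, values, default=None):
    """Row-wise 'case when' over columns.

    conds:  list of boolean columns (True/False/None, where None is a
            null condition and does not select the branch)
    values: list of value columns, one per condition
    """
    n = len(conds[0])
    out = []
    for i in range(n):
        for cond, val in zip(conds, values):
            if cond[i]:  # None (null) is falsy, so null conditions skip the branch
                out.append(val[i])
                break
        else:
            # No condition matched this row: use the default ("else" value).
            out.append(default)
    return out
```

The vectorized C++ kernel avoids this per-row loop by processing runs of the condition bitmaps in blocks, which is why the null-laden benchmark cases above are so much slower than the contiguous ones.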
[GitHub] [arrow-datafusion] kszucs commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
kszucs commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879359345 Since https://github.com/apache/arrow-rs/pull/439 has been merged I can expose the `schema` argument as well, though we can defer that to a follow-up PR too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
alamb commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879357118 @jorgecarleitao / @kszucs -- what is the plan for this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #442: Change return type of 'DataFrame.collect()'
alamb commented on pull request #442: URL: https://github.com/apache/arrow-datafusion/pull/442#issuecomment-879356925 @djKooks do you plan to keep working on this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #441: WIP: Add tokomak optimizer
alamb commented on pull request #441: URL: https://github.com/apache/arrow-datafusion/pull/441#issuecomment-879356559 @Dandandan do you still plan to work on this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on pull request #8151: ARROW-9279: [C++] Implement PrettyPrint for Scalars
bkietz commented on pull request #8151: URL: https://github.com/apache/arrow/pull/8151#issuecomment-879355469 Closing this for now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #8151: ARROW-9279: [C++] Implement PrettyPrint for Scalars
bkietz closed pull request #8151: URL: https://github.com/apache/arrow/pull/8151 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #537: Implement `RecordBatch::concat`
alamb commented on pull request #537: URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-879355147 Unless I hear different, I plan to merge this PR tomorrow (and include it in 5.0.0) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669059415

## File path: cpp/src/gandiva/string_function_holder_test.cc

## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) {
   auto status = LikeHolder::Make("ab\\_", "", _holder);
   EXPECT_EQ(status.ok(), false) << status.message();
 }
+
+// Tests related to the REGEXP_EXTRACT function
+class TestExtractHolder : public ::testing::Test {
+ protected:
+  ExecutionContext execution_context_;
+};
+
+TEST_F(TestExtractHolder, TestSimpleExtract) {

Review comment: Can you also add an integration test in projector_tests?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669058383

## File path: cpp/src/gandiva/string_function_holder.h

## @@ -0,0 +1,140 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <re2/re2.h>
+
+#include <memory>
+#include <string>
+
+#include "arrow/status.h"
+#include "gandiva/execution_context.h"
+#include "gandiva/function_holder.h"
+#include "gandiva/node.h"
+#include "gandiva/visibility.h"
+
+namespace gandiva {
+
+/// Function Holder for SQL 'like'
+class GANDIVA_EXPORT LikeHolder : public FunctionHolder {
+ public:
+  ~LikeHolder() override = default;
+
+  static Status Make(const FunctionNode& node, std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, const std::string& escape_char,
+                     std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, std::shared_ptr<LikeHolder>* holder,
+                     RE2::Options regex_op);
+
+  // Try and optimise a function node with a "like" pattern.
+  static const FunctionNode TryOptimize(const FunctionNode& node);
+
+  /// Return true if the data matches the pattern.
+  bool operator()(const std::string& data) { return RE2::FullMatch(data, regex_); }
+
+ private:
+  explicit LikeHolder(const std::string& pattern) : pattern_(pattern), regex_(pattern) {}
+
+  LikeHolder(const std::string& pattern, RE2::Options regex_op)
+      : pattern_(pattern), regex_(pattern, regex_op) {}
+
+  std::string pattern_;  // posix pattern string, to help debugging
+  RE2 regex_;            // compiled regex for the pattern
+
+  static RE2 starts_with_regex_;  // pre-compiled pattern for matching starts_with
+  static RE2 ends_with_regex_;    // pre-compiled pattern for matching ends_with
+  static RE2 is_substr_regex_;    // pre-compiled pattern for matching is_substr
+};
+
+/// Function Holder for 'regexp_extract' function
+class GANDIVA_EXPORT ExtractHolder : public FunctionHolder {
+ public:
+  ~ExtractHolder() override = default;
+
+  static Status Make(const FunctionNode& node, std::shared_ptr<ExtractHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern,
+                     std::shared_ptr<ExtractHolder>* holder);
+
+  /// Extracts the matching text from a string using a regex
+  const char* operator()(ExecutionContext* ctx, const char* user_input,

Review comment: Better to move the method to the .cc file.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
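The two holders quoted in this review wrap compiled RE2 patterns: LikeHolder full-matches data against a regex derived from a SQL LIKE pattern, and ExtractHolder returns the text a regex captures. A loose stdlib-`re` analogue of those semantics (function names and the escape-char-free LIKE translation are illustrative; the real Gandiva code uses RE2 and also handles escape characters):

```python
import re

def sql_like_to_regex(pattern):
    # Translate SQL LIKE wildcards: '%' -> '.*', '_' -> '.';
    # everything else is matched literally.
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def like(data, pattern):
    # Mirrors LikeHolder::operator(): a *full* match against the
    # translated pattern, not a substring search.
    return re.fullmatch(sql_like_to_regex(pattern), data) is not None

def regexp_extract(data, pattern, group=1):
    # ExtractHolder-style extraction: return the requested capture
    # group, or an empty string when nothing matches.
    m = re.search(pattern, data)
    return m.group(group) if m else ""
```

For example, `like("arrow", "arr%")` is true while `like("arrow", "arr")` is false, because LIKE without wildcards must match the whole string.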
[GitHub] [arrow] pachadotdev commented on pull request #10650: ARROW-13058: This is a draft to provide save-report
pachadotdev commented on pull request #10650: URL: https://github.com/apache/arrow/pull/10650#issuecomment-879349226 I made a few changes for the task of adding a function to capture the nightly build state in JSON before sending the emails. I'm going to need a second review on this (please bear in mind I had to learn Python from 0.05 to implement this). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #10679: ARROW-13170 [C++] Reducing branching in compute/kernels/vector_selection.cc
ursabot edited a comment on pull request #10679: URL: https://github.com/apache/arrow/pull/10679#issuecomment-879314016

Benchmark runs are scheduled for baseline = cf6a7ff65f4e2920641d116a3ba1f578b2bd8a9e and contender = 38110e8e7ee598ddb0e8a3465d81ea7e24bafebc. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped :warning: Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/96acb784c87842b2bfd28e17a9a90432...fdfaab8d2dc84fbd81318b663afe1e44/)
- [Skipped :warning: Only ['Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/5744b64d14d448198229be1dbb5265e7...a08372ce96df46c48f6874d84c74035c/)
- [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/cd0a0e80ad2c4de0b60cda38c58b64a4...9d55178f98434a64b8398f25458f5408/)

Supported benchmarks:
- ursa-i9-9960x: langs = Python, R
- ursa-thinkcentre-m75q: langs = C++, Java
- ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669049315

## File path: cpp/src/gandiva/CMakeLists.txt

```diff
@@ -82,7 +82,7 @@ set(SRC_FILES
     hash_utils.cc
     llvm_generator.cc
     llvm_types.cc
-    like_holder.cc
+    string_function_holder.cc
```

Review comment: can you name it `regex_functions_holder`?
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669049098

## File path: cpp/src/gandiva/string_function_holder_test.cc

```diff
@@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) {
   auto status = LikeHolder::Make("ab\\_", "", _holder);
   EXPECT_EQ(status.ok(), false) << status.message();
 }
+
+// Tests related to the REGEXP_EXTRACT function
+class TestExtractHolder : public ::testing::Test {
+ protected:
+  ExecutionContext execution_context_;
+};
+
+TEST_F(TestExtractHolder, TestSimpleExtract) {
+  std::shared_ptr<ExtractHolder> extract_holder;
+
+  // Pattern to match two groups of letters
+  auto status = ExtractHolder::Make(R"((\w+) (\w+))", &extract_holder);
```

Review comment: Shouldn't this be `R"(\w+) (\w+)"`, since you are enclosing another bracket inside the constructor? So in your example, should index 2 return the first name?
[GitHub] [arrow] bkietz commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
bkietz commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879342845

> Trying an approach based on VisitSetBitRunsVoid

IIUC this would require a varargs version of `OptionalBitBlockCounter` or `Bitmap::VisitWords`, which would probably be generally useful as varargs compute functions continue to proliferate.
[GitHub] [arrow] bkietz commented on a change in pull request #10412: ARROW-9430: [C++] Implement replace_with_mask kernel
bkietz commented on a change in pull request #10412: URL: https://github.com/apache/arrow/pull/10412#discussion_r669046761

## File path: cpp/src/arrow/compute/kernels/vector_replace.cc

```diff
@@ -0,0 +1,510 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/bitmap_ops.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+namespace {
+
+Status ReplacementArrayTooShort(int64_t expected, int64_t actual) {
+  return Status::Invalid("Replacement array must be of appropriate length (expected ",
+                         expected, " items but got ", actual, " items)");
+}
+
+// Helper to implement replace_with kernel with scalar mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types
+Status ReplaceWithScalarMask(KernelContext* ctx, const ArrayData& array,
+                             const BooleanScalar& mask, const Datum& replacements,
+                             ArrayData* output) {
+  if (!mask.is_valid) {
+    // Output = null
+    ARROW_ASSIGN_OR_RAISE(auto replacement_array,
+                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
+    *output = *replacement_array->data();
+    return Status::OK();
+  }
+  if (mask.value) {
+    // Output = replacement
+    if (replacements.is_scalar()) {
+      ARROW_ASSIGN_OR_RAISE(
+          auto replacement_array,
+          MakeArrayFromScalar(*replacements.scalar(), array.length, ctx->memory_pool()));
+      *output = *replacement_array->data();
+    } else {
+      auto replacement_array = replacements.array();
+      if (replacement_array->length != array.length) {
+        return ReplacementArrayTooShort(array.length, replacement_array->length);
+      }
+      *output = *replacement_array;
+    }
+  } else {
+    // Output = input
+    *output = array;
+  }
+  return Status::OK();
+}
+
+struct CopyArrayBitmap {
+  const uint8_t* in_bitmap;
+  int64_t in_offset;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    arrow::internal::CopyBitmap(in_bitmap, in_offset + offset, length, out_bitmap,
+                                out_offset);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset,
+                      BitUtil::GetBit(in_bitmap, in_offset + offset));
+  }
+};
+
+struct CopyScalarBitmap {
+  const bool is_valid;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    BitUtil::SetBitsTo(out_bitmap, out_offset, length, is_valid);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset, is_valid);
+  }
+};
+
+// Helper to implement replace_with kernel with array mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types and to handle
+// scalar and array replacements
+template <typename Functor, typename Data, typename CopyBitmap>
+void ReplaceWithArrayMaskImpl(const ArrayData& array, const ArrayData& mask,
+                              const Data& replacements, bool replacements_bitmap,
+                              const CopyBitmap copy_bitmap, const uint8_t* mask_bitmap,
+                              const uint8_t* mask_values, uint8_t* out_bitmap,
+                              uint8_t* out_values, const int64_t out_offset) {
+  Functor::CopyData(*array.type, out_values, /*out_offset=*/0, array, /*in_offset=*/0,
+                    array.length);
+  arrow::internal::OptionalBinaryBitBlockCounter counter(
+      mask_values, mask.offset, mask_bitmap, mask.offset, mask.length);
+  int64_t write_offset = 0;
+  int64_t replacements_offset = 0;
+  while (write_offset < array.length) {
+    BitBlockCount block = counter.NextAndBlock();
+    if (block.AllSet()) {
+      // Copy from replacement array
+      Functor::CopyData(*array.type, out_values, out_offset + write_offset, replacements,
+                        replacements_offset, block.length);
+      if (replacements_bitmap) {
```
[GitHub] [arrow-rs] alamb opened a new pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
alamb opened a new pull request #545: URL: https://github.com/apache/arrow-rs/pull/545

# Which issue does this PR close?

Closes https://github.com/apache/arrow-rs/issues/544

# Rationale for this change

We upgraded Arrow to the latest version of `rand` in https://github.com/apache/arrow-rs/pull/488, which changed how it handles js/wasm support. When I tried to use arrow/datafusion in IOx (which depends on both arrow and datafusion) I got this crazy error while trying to resolve dependencies:

```
error: cyclic package dependency: package `ahash v0.7.4` depends on itself. Cycle:
package `ahash v0.7.4`
... which is depended on by `hashbrown v0.11.2`
... which is depended on by `indexmap v1.7.0`
... which is depended on by `serde_json v1.0.64`
... which is depended on by `wasm-bindgen v0.2.74`
... which is depended on by `js-sys v0.3.51`
... which is depended on by `getrandom v0.2.3`
... which is depended on by `ahash v0.7.4`
```

The issue appears to be that the `js` feature of rand somehow pulls in ahash again.

# What changes are included in this PR?

Make the `getrandom/js` dependency (needed to build for wasm) optional for arrow.

# Are there any user-facing changes?

Yes. In order to build for WASM, the previous command

```shell
cargo build --target wasm32-unknown-unknown
```

fails with an error such as

```
cd /Users/alamb/Software/arrow-rs/arrow && cargo build --target wasm32-unknown-unknown
   Compiling getrandom v0.2.3
error: the wasm32-unknown-unknown target is not supported by default, you may need to enable the "js" feature. For more information see: https://docs.rs/getrandom/#webassembly-support
   --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/getrandom-0.2.3/src/lib.rs:219:9
    |
219 | / compile_error!("the wasm32-unknown-unknown target is not supported by \
220 | | default, you may need to enable the \"js\" feature. \
221 | | For more information see: \
222 | | https://docs.rs/getrandom/#webassembly-support");
    | |_^
```

Now, arrow users will have to explicitly specify the `js` feature:

```shell
cargo build --features=js --target wasm32-unknown-unknown
```
[GitHub] [arrow] bkietz commented on a change in pull request #10557: ARROW-13064: [C++] Implement select ('case when') function for fixed-width types
bkietz commented on a change in pull request #10557: URL: https://github.com/apache/arrow/pull/10557#discussion_r669034240

## File path: docs/source/cpp/compute.rst

```diff
@@ -859,50 +859,60 @@ Structural transforms
 +--------------------+------------+------------------------------------------------+-----------------+-------+
 | Function name      | Arity      | Input types                                    | Output type     | Notes |
 +====================+============+================================================+=================+=======+
-| fill_null          | Binary     | Boolean, Null, Numeric, Temporal, String-like  | Input type      | \(1)  |
+| case_when          | Varargs    | Boolean, Any fixed-width                       | Input type      | \(1)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| if_else            | Ternary    | Boolean, Null, Numeric, Temporal               | Input type      | \(2)  |
+| fill_null          | Binary     | Boolean, Null, Numeric, Temporal, String-like  | Input type      | \(2)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_finite          | Unary      | Float, Double                                  | Boolean         | \(3)  |
+| if_else            | Ternary    | Boolean, Null, Numeric, Temporal               | Input type      | \(3)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_inf             | Unary      | Float, Double                                  | Boolean         | \(4)  |
+| is_finite          | Unary      | Float, Double                                  | Boolean         | \(4)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_nan             | Unary      | Float, Double                                  | Boolean         | \(5)  |
+| is_inf             | Unary      | Float, Double                                  | Boolean         | \(5)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_null            | Unary      | Any                                            | Boolean         | \(6)  |
+| is_nan             | Unary      | Float, Double                                  | Boolean         | \(6)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_valid           | Unary      | Any                                            | Boolean         | \(7)  |
+| is_null            | Unary      | Any                                            | Boolean         | \(7)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| list_value_length  | Unary      | List-like                                      | Int32 or Int64  | \(8)  |
+| is_valid           | Unary      | Any                                            | Boolean         | \(8)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| project            | Varargs    | Any                                            | Struct          | \(9)  |
+| list_value_length  | Unary      | List-like                                      | Int32 or Int64  | \(9)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
+| project            | Varargs    | Any                                            | Struct          | \(10) |
++--------------------+------------+------------------------------------------------+-----------------+-------+
+
+* \(1) This function acts like a SQL 'case when' statement or switch-case. The
+  input is any number of alternating Boolean and value data, followed by an
```

Review comment: This (and some of the other comments) need to be updated with the new signature.

## File path: cpp/src/arrow/compute/kernels/scalar_if_else_benchmark.cc

```diff
@@ -97,6 +97,99 @@ static void IfElseBench32Contiguous(benchmark::State& state) {
   return IfElseBenchContiguous(state);
 }
+
+template <typename Type>
+static void CaseWhenBench(benchmark::State& state) {
+  using CType = typename Type::c_type;
+  auto type = TypeTraits<Type>::type_singleton();
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  int64_t len = state.range(0);
+  int64_t offset = state.range(1);
+
```