[GitHub] [arrow-datafusion] lvheyang edited a comment on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang edited a comment on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` // would panic before this pull request 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
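[Editor's note] The return-type conflict exercised by these test SQLs can be illustrated outside DataFusion. A minimal, hypothetical sketch (not DataFusion's actual implementation; `sqrt_f64` is an invented name): a sqrt whose declared return type is always a 64-bit float, so f32 and integer inputs are cast up before the math runs instead of producing a conflicting return type.

```python
import math

def sqrt_f64(values):
    """Apply sqrt to a column of any numeric type, always returning floats.

    Casting each input to float first mirrors the idea of coercing
    i32/i64/f32 columns to Float64 so the function has one return type.
    """
    return [math.sqrt(float(v)) for v in values]

print(sqrt_f64([4, 9]))   # i64-style input -> [2.0, 3.0]
print(sqrt_f64([2.25]))   # f32/f64-style input -> [1.5]
```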
[GitHub] [arrow-datafusion] lvheyang edited a comment on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang edited a comment on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` // would fail before this pull request 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed?
[GitHub] [arrow-datafusion] lvheyang commented on pull request #716: #699 fix return type conflict when calling builtin math functions
lvheyang commented on pull request #716: URL: https://github.com/apache/arrow-datafusion/pull/716#issuecomment-879595756 Thanks for your kind help! @alamb I added test cases in context.rs, test SQLs: 1. `SELECT sqrt(f64) FROM t` 2. `SELECT sqrt(f32) FROM t` 3. `SELECT sqrt(i64) FROM t` 4. `SELECT sqrt(i32) FROM t` Are there any other cases needed?
[GitHub] [arrow] cyb70289 commented on a change in pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
cyb70289 commented on a change in pull request #10719: URL: https://github.com/apache/arrow/pull/10719#discussion_r669292380 ## File path: cpp/src/arrow/testing/gtest_util.cc ## @@ -907,7 +910,7 @@ class GatingTask::Impl : public std::enable_shared_from_this<Impl> { double timeout_seconds_; Status status_; bool unlocked_; - int num_launched_ = 0; + std::atomic<int> num_launched_; Review comment: Nit: since we don't care about memory order, I think we can change this line to `std::atomic<int> num_launched_{0}`, and leave all other code untouched.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669276514 ## File path: python/pyarrow/_hdfs.pyx ## @@ -93,9 +93,10 @@ cdef class HadoopFileSystem(FileSystem): Instantiate HadoopFileSystem object from an URI string. The following two calls are equivalent -* HadoopFileSystem.from_uri('hdfs://localhost:8020/?user=test' -'&replication=1') -* HadoopFileSystem('localhost', port=8020, user='test', replication=1) + +* ``HadoopFileSystem.from_uri('hdfs://localhost:8020/?user=test&replication=1')`` # noqa: E501 + +* ``HadoopFileSystem('localhost', port=8020, user='test', replication=1)`` # noqa: E501 Review comment: Anyone have any opinion on ignoring flake8 here? RST didn't like the multi-line strings. I tried using literal blocks `::` but then flake8 complained about the empty lines required by RST. This seems like the most readable option, but the lines are too long (hence the `# noqa`).
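[Editor's note] The equivalence the docstring claims between the URI form and the keyword form can be sketched with the standard library. This is a hypothetical illustration (`hdfs_uri_to_kwargs` is an invented helper, not pyarrow's implementation, which lives in C++): the URI's host and port become positional/keyword arguments, and query-string options become the remaining keyword arguments.

```python
from urllib.parse import urlsplit, parse_qsl

def hdfs_uri_to_kwargs(uri):
    """Map an hdfs:// URI onto HadoopFileSystem-style constructor kwargs."""
    parts = urlsplit(uri)
    kwargs = {"host": parts.hostname, "port": parts.port}
    for key, value in parse_qsl(parts.query):
        # Options like replication are integers in the keyword form.
        kwargs[key] = int(value) if value.isdigit() else value
    return kwargs

print(hdfs_uri_to_kwargs("hdfs://localhost:8020/?user=test&replication=1"))
# -> {'host': 'localhost', 'port': 8020, 'user': 'test', 'replication': 1}
```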
[GitHub] [arrow-rs] nevi-me merged pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
nevi-me merged pull request #545: URL: https://github.com/apache/arrow-rs/pull/545
[GitHub] [arrow-rs] nevi-me closed issue #544: Error building on master - error: cyclic package dependency: package `ahash v0.7.4` depends on itself. Cycle
nevi-me closed issue #544: URL: https://github.com/apache/arrow-rs/issues/544
[GitHub] [arrow] westonpace commented on pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on pull request #10693: URL: https://github.com/apache/arrow/pull/10693#issuecomment-879575972 I believe I have addressed all comments. I also noticed that IpcWriteOptions was undocumented so I added it in.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669270057 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python + +import pyarrow.compute as pc + +sum = 0 +count = 0 +for batch in dataset.to_batches(filter=~ds.field('col2').is_null()): +sum += pc.sum(batch.column('col2')).as_py() +count += batch.num_rows +mean_a = sum/count Review comment: True, but then you wouldn't be using `to_batches` anyways and this whole section goes away. I think dataset scanning is likely to be a niche API once any kind of SQL support is added. We can address it then.
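[Editor's note] The streaming-average pattern in the quoted docs diff can be sketched in plain Python. This is a stand-in sketch: the pyarrow calls (`dataset.to_batches`, `pc.sum`, `batch.num_rows`) are replaced with plain lists so the idea is self-contained; the point is that only a running sum and count are kept, never the whole column.

```python
def streaming_mean(batches):
    """Compute a column mean one batch at a time (out-of-core style)."""
    total = 0.0
    count = 0
    for batch in batches:        # each batch stands in for a RecordBatch column
        total += sum(batch)
        count += len(batch)
    return total / count

batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(streaming_mean(batches))  # -> 3.5
```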
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669269421 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning Review comment: Hmmm, I just dropped the sentence about scanning. If someone wants clarification we can discuss it at that time. @bkietz any opinion? I recall we spoke about this before.
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669268967 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python + +import pyarrow.compute as pc + +sum = 0 Review comment: Thanks. I renamed it. 
[GitHub] [arrow] github-actions[bot] commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
github-actions[bot] commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879572131 Revision: 3a1f4e47a3c7af582d67e013bed3bd712d87dba8 Submitted crossbow builds: [ursacomputing/crossbow @ actions-594](https://github.com/ursacomputing/crossbow/branches/all?query=actions-594)

|Task|Status|
|----|------|
|test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-594-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-594-github-test-ubuntu-20.04-cpp-thread-sanitizer)|
[GitHub] [arrow] cyb70289 commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
cyb70289 commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879571834 @github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Finished :arrow_down:0.24% :arrow_up:0.05%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669261083 ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); Review comment: From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community?
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset
westonpace commented on a change in pull request #10693: URL: https://github.com/apache/arrow/pull/10693#discussion_r669252727 ## File path: docs/source/python/dataset.rst ## @@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering: dataset.to_table().to_pandas() dataset.to_table(filter=ds.field('year') == 2019).to_pandas() +Another benefit of manually scheduling the files is that the order of the files +controls the order of the data. When performing an ordered read (or a read to +a table) then the rows returned will match the order of the files given. This +only applies when the dataset is constructed with a list of files. There +are no order guarantees given when the files are instead discovered by scanning +a directory. -Manual scheduling -- +Iterative (out of core or streaming) reads +-- -.. -Possible content: -- fragments (get_fragments) -- scan / scan tasks / iterators of record batches +The previous examples have demonstrated how to read the data into a table. This is +useful if the dataset is small or there is only a small amount of data that needs to +be read. The dataset API contains additional methods to read and process large amounts +of data in a streaming fashion. -The :func:`~Dataset.to_table` method loads all selected data into memory -at once resulting in a pyarrow Table. Alternatively, a dataset can also be -scanned one RecordBatch at a time in an iterative manner using the -:func:`~Dataset.scan` method:: +The easiest way to do this is to use the method :meth:`Dataset.to_batches`. This +method returns an iterator of record batches. For example, we can use this method to +calculate the average of a column without loading the entire column into memory: -for scan_task in dataset.scan(columns=[...], filter=...): -for record_batch in scan_task.execute(): -# process the record batch +.. ipython:: python Review comment: All the scripts I added execute pretty rapidly as they are dealing with tables with fewer than 10 rows. I'm not sure they add significantly to the build times. As a test I tried converting all ipython to code-block and saw no noticeable difference in build times. I'd prefer ipython just for the testing's sake but I'm happy to go with whatever is decided in ARROW-13159. Since it should be pretty easy to change after the fact (just a find-replace from ipython to code-block) I'd rather address it after ARROW-13159 is resolved if that is ok.
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669247552 ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: It's a bit strange to continue running here if we know it will eventually fail (will it?) From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community? If grpc is willing to change, do we still need to update our code? For now, I'm not sure if it's appropriate to simply abort here in a release build, with some clear error messages and hints to fix it by chunking the data.
[GitHub] [arrow] cyb70289 commented on a change in pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
cyb70289 commented on a change in pull request #10663: URL: https://github.com/apache/arrow/pull/10663#discussion_r669232404 ## File path: cpp/src/arrow/flight/client.cc ## @@ -688,11 +688,12 @@ class GrpcStreamWriter : public FlightStreamWriter { Status WriteMetadata(std::shared_ptr<Buffer> app_metadata) override { FlightPayload payload{}; payload.app_metadata = app_metadata; -if (!internal::WritePayload(payload, writer_->stream().get())) { +auto status = internal::WritePayload(payload, writer_->stream().get()); +if (!status.ok() && status.IsIOError()) { Review comment: Can we remove `!status.ok()` and only check `if (status.IsIOError())` ? ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: `DCHECK_LE` ? ## File path: cpp/src/arrow/flight/serialization_internal.cc ## @@ -201,9 +193,7 @@ grpc::Status FlightDataSerialize(const FlightPayload& msg, ByteBuffer* out, // Write the descriptor if present int32_t descriptor_size = 0; if (msg.descriptor != nullptr) { -if (msg.descriptor->size() > kInt32Max) { - return ToGrpcStatus(Status::CapacityError("Descriptor size overflow (>= 2**31)")); -} +DCHECK_LT(msg.descriptor->size(), kInt32Max); Review comment: It's a bit strange to continue running here if we know it will eventually fail (will it?) From the Jira issue, it looks like grpc should check and handle our return value decently. Maybe open an issue in the grpc community? If grpc is willing to change, do we still need to update our code? For now, I'm not sure if it's appropriate to simply abort here in a release build, with some clear error messages and hints to fix it by chunking the data.
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10604: ARROW-13190: [C++] [Gandiva] Change behavior of INITCAP function
anthonylouisbsb commented on a change in pull request #10604: URL: https://github.com/apache/arrow/pull/10604#discussion_r669245436 ## File path: cpp/src/gandiva/gdv_function_stubs.cc ## @@ -635,30 +638,31 @@ const char* gdv_fn_initcap_utf8(int64_t context, const char* data, int32_t data_ int32_t out_char_len = 0; int32_t out_idx = 0; uint32_t char_codepoint; + + // Any character is considered as space, except if it is alphanumeric bool last_char_was_space = true; for (int32_t i = 0; i < data_len; i += char_len) { char_len = gdv_fn_utf8_char_length(data[i]); -// For single byte characters: -// If it is a lowercase ASCII character, set the output to its corresponding uppercase -// character; else, set the output to the read character +// An optimization for single byte characters: if (char_len == 1) { Review comment: Suggestion applied
[GitHub] [arrow] ursabot edited a comment on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot edited a comment on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] ursabot commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
ursabot commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533552 Benchmark runs are scheduled for baseline = 780e95c512d63bbea1e040af0eb44a0bf63c4d72 and contender = 64ecb2a73c76e30091428af5e03d538766b12c74. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Scheduled] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/399dd7ae47d345969dfba70ba5d60fae...720759ec32734f10a4016acec5cc4633/) [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/eef93c4bfb1d4a8b92351f41883dfd55...41a8348d634e4410808dad7fcdd62c59/) [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/e5af519d38bd4cc483cd45f86750d807...b477c89467fc4b2d80211db405bd786e/) Supported benchmarks: ursa-i9-9960x: langs = Python, R ursa-thinkcentre-m75q: langs = C++, Java ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] dianaclarke commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
dianaclarke commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879533345 @ursabot please benchmark
[GitHub] [arrow] github-actions[bot] commented on pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
github-actions[bot] commented on pull request #10719: URL: https://github.com/apache/arrow/pull/10719#issuecomment-879528937 https://issues.apache.org/jira/browse/ARROW-13332
[GitHub] [arrow] westonpace opened a new pull request #10719: ARROW-13332: [C++] TSAN failure in TestAsyncUtil.ReadaheadFailed
westonpace opened a new pull request #10719: URL: https://github.com/apache/arrow/pull/10719 Minor cleanup fix from my earlier fix of this test in #10602
[GitHub] [arrow-datafusion] mcassels opened a new issue #723: ABS() function in WHERE clause gives unexpected results
mcassels opened a new issue #723: URL: https://github.com/apache/arrow-datafusion/issues/723

**Describe the bug** `ABS(col - x)` in the WHERE clause of a query sometimes does not filter results correctly. The test file has a float column with high-precision values. We want to use `ABS(col - x) < y` to do equality comparisons for high-precision float columns.

**To Reproduce** datafusion-cli example using the test parquet file attached:

```
➜ arrow-datafusion git:(master) ✗ cargo run --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.22s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION 'test.parquet';
0 rows in set. Query took 0.001 seconds.
> select * from foo;
+--------------------+
| c0                 |
+--------------------+
| 107.0090813093981  |
| 125.51519138755981 |
| 141.83342587451415 |
| 113.65534481251639 |
| 251.10794896957802 |
| 112.08361695028363 |
+--------------------+
6 rows in set. Query took 0.006 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo;
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 107.0090813093981  | 144.0988676601799                         |
| 125.51519138755981 | 125.59275758201821                        |
| 141.83342587451415 | 109.27452309506387                        |
| 113.65534481251639 | 137.45260415706161                        |
| 251.10794896957802 | 0                                         |
| 112.08361695028363 | 139.02433201929438                        |
+--------------------+-------------------------------------------+
6 rows in set. Query took 0.006 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 1;
0 rows in set. Query took 0.005 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 111;
0 rows in set. Query took 0.003 seconds.
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 150;
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 107.0090813093981  | 144.0988676601799                         |
| 125.51519138755981 | 125.59275758201821                        |
| 141.83342587451415 | 109.27452309506387                        |
| 113.65534481251639 | 137.45260415706161                        |
| 251.10794896957802 | 0                                         |
| 112.08361695028363 | 139.02433201929438                        |
+--------------------+-------------------------------------------+
6 rows in set. Query took 0.007 seconds.
```

**Expected behavior** The query

```
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 1;
```

was expected to give the following 1 row:

```
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 251.10794896957802 | 0                                         |
```

And the query

```
> select c0, ABS(c0 - 251.10794896957802) from foo where ABS(c0 - 251.10794896957802) < 111;
```

was expected to give the following 2 rows:

```
+--------------------+-------------------------------------------+
| c0                 | abs(c0 Minus Float64(251.10794896957802)) |
+--------------------+-------------------------------------------+
| 251.10794896957802 | 0                                         |
| 141.83342587451415 | 109.27452309506387                        |
```
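The epsilon-comparison pattern the issue relies on can be sketched in plain Python, using the values from the report's test table (the helper name `within` is illustrative, not from the codebase):

```python
# Epsilon-based float "equality": the intent behind ABS(col - x) < y in SQL.
rows = [
    107.0090813093981,
    125.51519138755981,
    141.83342587451415,
    113.65534481251639,
    251.10794896957802,
    112.08361695028363,
]
target = 251.10794896957802

def within(c0, x, y):
    """True when |c0 - x| < y, i.e. c0 is within distance y of x."""
    return abs(c0 - x) < y

# Only the exact-match row should pass the tight threshold...
print([c0 for c0 in rows if within(c0, target, 1.0)])
# ...and one more row (distance ~109.27) should pass the looser one.
print([c0 for c0 in rows if within(c0, target, 111.0)])
```

Run against the table above, this yields one row for `y = 1` and two rows for `y = 111`, which is the expected behavior the issue describes and which DataFusion's WHERE clause failed to produce.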
[GitHub] [arrow] kou commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kou commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879506199 It's strange... Can we confirm which OpenSSL is used for this case?
[GitHub] [arrow] ZMZ91 commented on pull request #10606: ARROW-13005: [C++] Add support for take implementation on dense union type
ZMZ91 commented on pull request #10606: URL: https://github.com/apache/arrow/pull/10606#issuecomment-879501530 Hi @bkietz and @pitrou, could you help check what else needs updating in this PR? Or may we merge it? Thanks.
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10604: ARROW-13190: [C++] [Gandiva] Change behavior of INITCAP function
anthonylouisbsb commented on a change in pull request #10604: URL: https://github.com/apache/arrow/pull/10604#discussion_r669203451 ## File path: cpp/src/gandiva/gdv_function_stubs.cc ## @@ -427,7 +427,8 @@ CAST_VARLEN_TYPE_FROM_NUMERIC(VARBINARY) #undef GDV_FN_CAST_VARCHAR_REAL GANDIVA_EXPORT -int32_t gdv_fn_utf8_char_length(char c) { +FORCE_INLINE Review comment: I applied that change; let's see if the tests pass
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669196394 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { Review comment: I added the test and found an error in the implementation when it needs to match the string partially. I will look into it; I think it is an incorrect use of the RE2 library
[GitHub] [arrow] westonpace commented on pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage
westonpace commented on pull request #10629: URL: https://github.com/apache/arrow/pull/10629#issuecomment-879491179 Any remaining comments or suggestions?
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669196057 ## File path: cpp/src/gandiva/string_function_holder.h ## @@ -0,0 +1,140 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include + +#include +#include + +#include "arrow/status.h" +#include "gandiva/execution_context.h" +#include "gandiva/function_holder.h" +#include "gandiva/node.h" +#include "gandiva/visibility.h" + +namespace gandiva { + +/// Function Holder for SQL 'like' +class GANDIVA_EXPORT LikeHolder : public FunctionHolder { + public: + ~LikeHolder() override = default; + + static Status Make(const FunctionNode& node, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, const std::string& escape_char, + std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, std::shared_ptr* holder, + RE2::Options regex_op); + + // Try and optimise a function node with a "like" pattern. + static const FunctionNode TryOptimize(const FunctionNode& node); + + /// Return true if the data matches the pattern. 
+ bool operator()(const std::string& data) { return RE2::FullMatch(data, regex_); } + + private: + explicit LikeHolder(const std::string& pattern) : pattern_(pattern), regex_(pattern) {} + + LikeHolder(const std::string& pattern, RE2::Options regex_op) + : pattern_(pattern), regex_(pattern, regex_op) {} + + std::string pattern_; // posix pattern string, to help debugging + RE2 regex_;// compiled regex for the pattern + + static RE2 starts_with_regex_; // pre-compiled pattern for matching starts_with + static RE2 ends_with_regex_;// pre-compiled pattern for matching ends_with + static RE2 is_substr_regex_;// pre-compiled pattern for matching is_substr +}; + +/// Function Holder for 'regexp_extract' function +class GANDIVA_EXPORT ExtractHolder : public FunctionHolder { + public: + ~ExtractHolder() override = default; + + static Status Make(const FunctionNode& node, std::shared_ptr* holder); + + static Status Make(const std::string& sql_pattern, + std::shared_ptr* holder); + + /// Extracts the matching text from a string using a regex + const char* operator()(ExecutionContext* ctx, const char* user_input, Review comment: I moved the code to the file
[GitHub] [arrow-datafusion] Jimexist commented on pull request #687: #554: Lead/lag window function with offset and default value arguments
Jimexist commented on pull request #687: URL: https://github.com/apache/arrow-datafusion/pull/687#issuecomment-879487452 > @Jimexist do you think this PR is ready? Do you need help reviewing ? looks okay after rebasing.
[GitHub] [arrow-datafusion] djKooks edited a comment on pull request #442: Change return type of 'DataFrame.collect()'
djKooks edited a comment on pull request #442: URL: https://github.com/apache/arrow-datafusion/pull/442#issuecomment-879483480 @alamb sure~ But since I'm still reviewing the project, I want to make sure I'm going in the right direction... > As far as I can tell so far, it seems the return format of collect inside dataframe_impl needs to be fixed. And I've fixed collect inside physical_plan because it calls this function internally... > Please let me know if there's any misunderstanding => Am I thinking correctly?
[GitHub] [arrow] kszucs edited a comment on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs edited a comment on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879399202 @kou Interestingly we have [linker errors](https://github.com/ursacomputing/crossbow/runs/3060717856?check_suite_focus=true#step:8:654) in the universal2 build but the tests seem to work for both x86_64 and arm64 in their own separate virtualenvs. Can this cause issues with `PARQUET_REQUIRE_ENCRYPTION` and `ARROW_S3` (`ARROW_FLIGHT` is disabled)? cc @xhochy
[GitHub] [arrow] github-actions[bot] commented on pull request #10718: ARROW-13331: [C++][Gandiva] Add format_number hive function to gandiva
github-actions[bot] commented on pull request #10718: URL: https://github.com/apache/arrow/pull/10718#issuecomment-879478708 https://issues.apache.org/jira/browse/ARROW-13331
[GitHub] [arrow] augustoasilva opened a new pull request #10718: ARROW-13331: [C++][Gandiva] Add format_number hive function to gandiva
augustoasilva opened a new pull request #10718: URL: https://github.com/apache/arrow/pull/10718 Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part.
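The described semantics can be sketched in a few lines of Python (this is a rough model of the Hive `format_number` behavior described above, not Gandiva's implementation; note Python's `round` uses banker's rounding, while Hive rounds half-up, so some edge cases differ):

```python
def format_number(x, d):
    """Format x with thousands separators, rounded to d decimal places.
    With d == 0 the result has no decimal point or fractional part."""
    if d == 0:
        return format(round(x), ",d")
    # format() itself rounds to d fractional digits
    return format(x, f",.{d}f")

print(format_number(1234567.891, 2))  # 1,234,567.89
print(format_number(1234567.891, 0))  # 1,234,568
```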
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669181293 ## File path: cpp/src/gandiva/CMakeLists.txt ## @@ -82,7 +82,7 @@ set(SRC_FILES hash_utils.cc llvm_generator.cc llvm_types.cc -like_holder.cc +string_function_holder.cc Review comment: Renamed
[GitHub] [arrow] jpedroantunes commented on a change in pull request #10516: ARROW-13049: [C++][Gandiva] Implement BIN Hive function on Gandiva
jpedroantunes commented on a change in pull request #10516: URL: https://github.com/apache/arrow/pull/10516#discussion_r669179317 ## File path: cpp/src/gandiva/precompiled/extended_math_ops.cc ## @@ -367,4 +367,36 @@ gdv_float64 get_scale_multiplier(gdv_int32 scale) { return power_float64_float64(10.0, scale); } +// returns the binary representation of a given integer (e.g. 928 -> 111010) +#define BIN_INTEGER(IN_TYPE) \ + FORCE_INLINE \ + const char* bin_##IN_TYPE(int64_t context, gdv_##IN_TYPE value, int32_t* out_len) { \ +*out_len = 0; \ +int32_t len = 8 * sizeof(value); \ +char* ret = reinterpret_cast(gdv_fn_context_arena_malloc(context, len)); \ +if (ret == nullptr) { \ + gdv_fn_context_set_error_msg(context, "Could not allocate memory for output"); \ + return ""; \ +} \ +/* generate binary representation iteratively */ \ +gdv_u##IN_TYPE i; \ +bool first = false; /* flag for not printing left zeros in positive numbers */\ +for (i = static_cast(1) << (len - 1); i > 0; i = i / 2) { \ + if ((value & i) != 0) { \ Review comment: You are right! Added the handler for that case and also added tests to cover it.
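For reference, here is a Python sketch of what the quoted `bin_##IN_TYPE` macro computes: the two's-complement bit pattern of the input within a fixed word width, with leading zeros dropped (the role of the macro's `first` flag for non-negative inputs). The function name and the 32-bit default width are illustrative; 928, for instance, comes out as `1110100000`:

```python
def bin_int(value, bits=32):
    """Binary string for value in a bits-wide two's-complement word;
    leading zeros are dropped, so negatives print all `bits` digits."""
    return format(value & ((1 << bits) - 1), "b")

print(bin_int(928))    # 1110100000
print(bin_int(-1, 8))  # 11111111
```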
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669179205 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { + std::shared_ptr extract_holder; + + // Pattern to match of two group of letters + auto status = ExtractHolder::Make(R"((\w+) (\w+))", _holder); Review comment: And in the example, index 0 returns the entire text matched by the regex, index 1 the first group in the regex (the first name), and index 2 the second group (the surname)
[GitHub] [arrow] anthonylouisbsb commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
anthonylouisbsb commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669177954 ## File path: cpp/src/gandiva/string_function_holder_test.cc ## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) { auto status = LikeHolder::Make("ab\\_", "", _holder); EXPECT_EQ(status.ok(), false) << status.message(); } + +// Tests related to the REGEXP_EXTRACT function +class TestExtractHolder : public ::testing::Test { + protected: + ExecutionContext execution_context_; +}; + +TEST_F(TestExtractHolder, TestSimpleExtract) { + std::shared_ptr extract_holder; + + // Pattern to match of two group of letters + auto status = ExtractHolder::Make(R"((\w+) (\w+))", _holder); Review comment: About the first question, the first and last parentheses do not count; they are part of the [raw string literal syntax](https://www.geeksforgeeks.org/raw-string-literal-c/) in C++
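The group-index semantics discussed above can be demonstrated with Python's `re` module (Python raw strings `r"..."` play the same delimiting role that C++ raw string literals `R"(...)"` do, except that C++ additionally requires the outer parentheses, which are not a capture group):

```python
import re

# Two groups of word characters separated by a space, as in the Gandiva test.
# In C++ the pattern is written R"((\w+) (\w+))"; the outermost parentheses
# belong to the raw-string literal, not to the regex.
pattern = re.compile(r"(\w+) (\w+)")

m = pattern.match("John Doe")
print(m.group(0))  # John Doe  -> whole matched text (index 0)
print(m.group(1))  # John      -> first capture group (first name)
print(m.group(2))  # Doe       -> second capture group (surname)
```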
[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
rok commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669166367 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: No worries, thanks for responding :) The TZ functionality of the vendored library (date.h) needs to download a timezone DB on windows to work. There's [a jira to set that up](https://issues.apache.org/jira/browse/ARROW-13168) so this will eventually need to compile on windows.
[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
westonpace commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669164458 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Ah, sorry, thought this was on the other PR where I was questioning this. I will let someone with more knowledge of the build process chime in on whether this is correct or not. I thought the TZ stuff didn't support Windows anyways? Or is that only strftime?
[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
westonpace commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669163511 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: No, I didn't see your earlier comment so I didn't understand why you were adding this. I get it now, sorry.
[GitHub] [arrow] github-actions[bot] commented on pull request #10717: ARROW-13091: [Python] Added compression_level to IpcWriteOptions
github-actions[bot] commented on pull request #10717: URL: https://github.com/apache/arrow/pull/10717#issuecomment-879456403 https://issues.apache.org/jira/browse/ARROW-13091
[GitHub] [arrow] westonpace opened a new pull request #10717: ARROW-13091: [Python] Added compression_level to IpcWriteOptions
westonpace opened a new pull request #10717: URL: https://github.com/apache/arrow/pull/10717
[GitHub] [arrow] kevingurney commented on pull request #10614: ARROW-13100: [MATLAB] Integrate GoogleTest with MATLAB Interface C++ Code
kevingurney commented on pull request #10614: URL: https://github.com/apache/arrow/pull/10614#issuecomment-879449900 @kou we just pushed updates to `matlab/CMakeLists.txt` which enable building of the Arrow C++ libraries and bundled GoogleTest binaries automatically from source on macOS and Windows. We are now able to build and run the MATLAB C++ tests consistently, using the same set of commands, on Windows, Mac, and Linux: ```bash $ cmake -S . -B build -D MATLAB_BUILD_TESTS=ON $ cmake --build build --config Release $ ctest --test-dir build ``` I think we are good to go now.
[GitHub] [arrow] thisisnic commented on a change in pull request #10624: ARROW-12992: [R] bindings for substr(), substring(), str_sub()
thisisnic commented on a change in pull request #10624: URL: https://github.com/apache/arrow/pull/10624#discussion_r669142885 ## File path: r/src/compute.cpp ## @@ -316,6 +316,19 @@ std::shared_ptr make_compute_options( return std::make_shared(max_splits, reverse); } + if (func_name == "utf8_slice_codeunits") { +using Options = arrow::compute::SliceOptions; +int64_t start = 1; +int64_t stop = -1; Review comment: Sorry, I didn't explain this properly, my fault! What I mean is that here stop has been set to `-1` but the default C++ value isn't `-1`, so the default value here probably shouldn't be `-1` either. Check out this line of code here that shows the default value of `stop` in the C++: https://github.com/apache/arrow/blob/7114c4b6fdb639b3500d77cfd66649af8c5c5e6b/cpp/src/arrow/compute/api_scalar.h#L205-L206 The default value of `stop` is `std::numeric_limits::max()` which is the biggest int64. So if you don't supply a value for `stop` and just use the default, this allows you to slice to the end of the string. In some of the other functions in this file, this line has been used to set the default values to the ones from the C++ code: `std::make_shared(Options::Defaults());` Maybe instead of manually setting the value of stop to `-1`, if you use the line above, it might make it easier to fix some of the failing tests as now you'll be able to slice to the end of a string. If this doesn't make sense, let me know!
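The reviewer's point about the default `stop` can be illustrated with a toy Python mirror (the function name is hypothetical and the indexing is 0-based for simplicity; the real `SliceOptions` lives in Arrow C++): when `stop` defaults to the maximum int64, an unspecified stop always slices to the end of the string.

```python
INT64_MAX = 2**63 - 1  # mirrors the "biggest int64" default described above

def utf8_slice(s, start=0, stop=INT64_MAX):
    """Slice s from start up to but not including stop; the huge
    default stop means "to the end of the string"."""
    return s[start:min(stop, len(s))]

print(utf8_slice("datafusion", 4))     # fusion
print(utf8_slice("datafusion", 0, 4))  # data
```

This is why a hard-coded `stop = -1` misbehaves: it bakes in a sentinel the C++ side never uses, instead of inheriting the slice-to-end default.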
[GitHub] [arrow] zeroshade opened a new pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package
zeroshade opened a new pull request #10716: URL: https://github.com/apache/arrow/pull/10716 @emkornfield Thanks for merging the previous PR #10379 Here's the remaining files that we pulled out of that PR to shrink it down, including all the unit tests for the Encoding package.
[GitHub] [arrow] github-actions[bot] commented on pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package
github-actions[bot] commented on pull request #10716: URL: https://github.com/apache/arrow/pull/10716#issuecomment-879433836 https://issues.apache.org/jira/browse/ARROW-13330
[GitHub] [arrow-rs] alamb closed issue #343: Add a RecordBatch::split to split large batches into a set of smaller batches
alamb closed issue #343: URL: https://github.com/apache/arrow-rs/issues/343
[GitHub] [arrow-rs] alamb commented on issue #343: Add a RecordBatch::split to split large batches into a set of smaller batches
alamb commented on issue #343: URL: https://github.com/apache/arrow-rs/issues/343#issuecomment-879428117 Given we now have slice in https://github.com/apache/arrow-rs/issues/460 I don't think this adds much anymore
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669128039 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -121,6 +121,61 @@ def show(self, outstream, asset_callback=None): asset)) +class JsonReport(Report): + +HEADER = textwrap.dedent(""" +Arrow Build Report for Job {job_name} + +All tasks: {all_tasks_url} +""") + +TASK = textwrap.dedent(""" + - {name}: +URL: {url} +""").strip() + +STATUS_HEADERS = { +# from CombinedStatus +'error': 'Errored Tasks:', +'failure': 'Failed Tasks:', +'pending': 'Pending Tasks:', +'success': 'Succeeded Tasks:', +} + +def __init__(self, job): Review comment: You can remove this method since this is the default behaviour and you're not extending it.
[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)
rok commented on a change in pull request #10610: URL: https://github.com/apache/arrow/pull/10610#discussion_r669127595 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Should this be somewhere else?
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669125627 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -32,7 +33,6 @@ def __init__(self, job): def show(self): raise NotImplementedError() - Review comment: I suspect this new line was intentional, as archery uses black for formatting. - https://github.com/apache/arrow/pull/10012 - https://github.com/psf/black > Surround top-level function and class definitions with two blank lines. - https://www.python.org/dev/peps/pep-0008/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
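The PEP 8 rule dianaclarke quotes ("surround top-level function and class definitions with two blank lines", which black enforces) can be checked mechanically. A minimal sketch, assuming nothing about archery's code beyond the quoted `Report`/`JsonReport` classes; the helper name is made up for illustration:

```python
import re

# Source laid out the way black would format it: two blank lines
# between top-level definitions.
GOOD = """\
class Report:
    def show(self):
        raise NotImplementedError()


class JsonReport(Report):
    pass
"""

def blank_lines_before_top_level_defs(source):
    """Return, for each top-level def/class, the number of blank lines
    immediately preceding it."""
    counts = []
    blanks = 0
    for line in source.splitlines():
        if re.match(r"(def |class )", line):
            counts.append(blanks)
        blanks = blanks + 1 if line.strip() == "" else 0
    return counts

# The first class has nothing above it; the second is separated by
# exactly two blank lines, as PEP 8/black require.
print(blank_lines_before_top_level_defs(GOOD))  # [0, 2]
```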
[GitHub] [arrow] dianaclarke commented on a change in pull request #10650: ARROW-13058: This is a draft to provide save-report
dianaclarke commented on a change in pull request #10650: URL: https://github.com/apache/arrow/pull/10650#discussion_r669121940 ## File path: dev/archery/archery/crossbow/reports.py ## @@ -121,6 +121,61 @@ def show(self, outstream, asset_callback=None): asset)) +class JsonReport(Report): + +HEADER = textwrap.dedent(""" +Arrow Build Report for Job {job_name} + +All tasks: {all_tasks_url} +""") + +TASK = textwrap.dedent(""" + - {name}: +URL: {url} +""").strip() + +STATUS_HEADERS = { +# from CombinedStatus +'error': 'Errored Tasks:', +'failure': 'Failed Tasks:', +'pending': 'Pending Tasks:', +'success': 'Succeeded Tasks:', +} + +def __init__(self, job): +super().__init__(job) + +def url(self, query): +repo_url = self.job.queue.remote_url.strip('.git') +return '{}/branches/all?query={}'.format(repo_url, query) + +def todayStr(self): +date = datetime.utcnow() +return "{}-{}-{}".format(date.year, date.month, date.day) + +def tasksToDict(self, date, tasks): Review comment: Python uses snake_case for variable, function, and method names. - `tasksToDict` --> `tasks_to_dict` - `jsonTasks` --> `json_tasks` - `jsonStr` --> `json_str` - `getJsonTasks` --> `get_json_tasks` - etc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
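The camelCase-to-snake_case renames suggested above follow a mechanical pattern; a small sketch (the converter is illustrative, not part of archery):

```python
import re

def to_snake_case(name):
    """Convert a camelCase identifier to the snake_case form PEP 8 expects:
    insert an underscore before each uppercase letter that follows a
    lowercase letter or digit, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

# The renames from the review comment:
print(to_snake_case("tasksToDict"))   # tasks_to_dict
print(to_snake_case("jsonTasks"))     # json_tasks
print(to_snake_case("jsonStr"))       # json_str
print(to_snake_case("getJsonTasks"))  # get_json_tasks
```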
[GitHub] [arrow] westonpace commented on a change in pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
westonpace commented on a change in pull request #10705: URL: https://github.com/apache/arrow/pull/10705#discussion_r669115135 ## File path: cpp/src/arrow/compute/exec/exec_plan.cc ## @@ -601,5 +618,215 @@ AsyncGenerator<util::optional<ExecBatch>> MakeSinkNode(ExecNode* input, return out; } +std::shared_ptr<RecordBatchReader> MakeSinkNodeReader(ExecNode* input, + std::string label) { + struct Impl : RecordBatchReader { +std::shared_ptr<Schema> schema() const override { return schema_; } +Status ReadNext(std::shared_ptr<RecordBatch>* record_batch) override { + ARROW_ASSIGN_OR_RAISE(auto batch, iterator_.Next()); + if (batch) { +ARROW_ASSIGN_OR_RAISE(*record_batch, batch->ToRecordBatch(schema_, pool_)); + } else { +*record_batch = IterationEnd<std::shared_ptr<RecordBatch>>(); + } + return Status::OK(); +} + +MemoryPool* pool_; +std::shared_ptr<Schema> schema_; +Iterator<util::optional<ExecBatch>> iterator_; + }; + + auto out = std::make_shared<Impl>(); + out->pool_ = input->plan()->exec_context()->memory_pool(); + out->schema_ = input->output_schema(); + out->iterator_ = MakeGeneratorIterator(MakeSinkNode(input, std::move(label))); + return out; +} + +struct ScalarAggregateNode : ExecNode { + ScalarAggregateNode(ExecNode* input, std::string label, + std::shared_ptr<Schema> output_schema, + std::vector<const ScalarAggregateKernel*> kernels, + std::vector<std::vector<std::unique_ptr<KernelState>>> states) + : ExecNode(input->plan(), std::move(label), {input}, {"target"}, + /*output_schema=*/std::move(output_schema), + /*num_outputs=*/1), +kernels_(std::move(kernels)), +states_(std::move(states)) {} + + const char* kind_name() override { return "ScalarAggregateNode"; } + + Status DoConsume(const ExecBatch& batch, + const std::vector<std::unique_ptr<KernelState>>& states) { +for (size_t i = 0; i < states.size(); ++i) { + KernelContext batch_ctx{plan()->exec_context()}; + batch_ctx.SetState(states[i].get()); + ExecBatch single_column_batch{{batch.values[i]}, batch.length}; + RETURN_NOT_OK(kernels_[i]->consume(&batch_ctx, single_column_batch)); +} +return Status::OK(); + } + + void InputReceived(ExecNode* input, int seq, ExecBatch batch) override { +DCHECK_EQ(input, inputs_[0]); + +std::unique_lock<std::mutex> lock(mutex_); 
+auto it = +thread_indices_.emplace(std::this_thread::get_id(), thread_indices_.size()).first; +++num_received_; +auto thread_index = it->second; + +lock.unlock(); + +const auto& thread_local_state = states_[thread_index]; +Status st = DoConsume(std::move(batch), thread_local_state); +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); + return; +} + +lock.lock(); +st = MaybeFinish(&lock); +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); +} + } + + void ErrorReceived(ExecNode* input, Status error) override { +DCHECK_EQ(input, inputs_[0]); +outputs_[0]->ErrorReceived(this, std::move(error)); + } + + void InputFinished(ExecNode* input, int seq) override { +DCHECK_EQ(input, inputs_[0]); +std::unique_lock<std::mutex> lock(mutex_); +num_total_ = seq; +Status st = MaybeFinish(&lock); + +if (!st.ok()) { + outputs_[0]->ErrorReceived(this, std::move(st)); +} + } + + Status StartProducing() override { +finished_ = Future<>::Make(); +// Scalar aggregates will only output a single batch +outputs_[0]->InputFinished(this, 1); +return Status::OK(); + } + + void PauseProducing(ExecNode* output) override {} + + void ResumeProducing(ExecNode* output) override {} + + void StopProducing(ExecNode* output) override { +DCHECK_EQ(output, outputs_[0]); +StopProducing(); + } + + void StopProducing() override { +inputs_[0]->StopProducing(this); +finished_.MarkFinished(); + } + + Future<> finished() override { return finished_; } + + private: + Status MaybeFinish(std::unique_lock<std::mutex>* lock) { +if (num_received_ != num_total_) return Status::OK(); + +if (finished_.is_finished()) return Status::OK(); + +ExecBatch batch{{}, 1}; +batch.values.resize(kernels_.size()); + +for (size_t i = 0; i < kernels_.size(); ++i) { Review comment: Ah, in my head "merge" meant something more like a merge sort. If it's just summing up a sum/mean/etc. counter across the various states, then I agree it's not necessary. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
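The point westonpace concedes above — that combining thread-local scalar-aggregate states is not a merge sort but a pointwise sum of counters — can be sketched as follows. This is purely illustrative Python, not Arrow's kernel API; the class and method names are made up:

```python
# Each "thread" consumes its own batches into a private state; the final
# step combines the per-thread counters. For decomposable aggregates like
# sum/count/mean this combination is just addition -- no ordering involved.
class MeanState:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def consume(self, batch):
        # Called on the owning thread only, so no locking needed here.
        self.total += sum(batch)
        self.count += len(batch)

    def merge(self, other):
        # "Merging" is a pointwise sum of counters, not a merge sort.
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count

# Two thread-local states, combined at the end:
a, b = MeanState(), MeanState()
a.consume([1.0, 2.0])
b.consume([3.0, 4.0, 5.0])
a.merge(b)
print(a.finalize())  # 3.0
```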
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879407362 Revision: 10dbf02030fc431bf175826a51214e643a916042 Submitted crossbow builds: [ursacomputing/crossbow @ actions-593](https://github.com/ursacomputing/crossbow/branches/all?query=actions-593) |Task|Status| ||--| |wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-big-sur-cp39-arm64)| |wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-big-sur-cp39-universal2)| |wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp36-amd64)| |wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp37-amd64)| |wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp38-amd64)| |wheel-macos-high-sierra-cp39-amd64|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-high-sierra-cp39-amd64)| |wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp36-amd64)| |wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp37-amd64)| |wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp38-amd64)| |wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-macos-mavericks-cp39-amd64)| |wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp36-amd64)| |wheel-manylinux2010-cp37-amd64|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp37-amd64)| |wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp38-amd64)| |wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2010-cp39-amd64)| |wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-593-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-593-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879406726 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
rok commented on pull request #10647: URL: https://github.com/apache/arrow/pull/10647#issuecomment-879403748 > Thanks for the changes. This looks good. I'm afraid I don't know the differences between `date.h` and the system equivalents. By system equivalents do you mean `ctime`? In that case does strftime support time zones? I meant like [cpp](https://en.cppreference.com/w/cpp/chrono/c/strftime) or [linux](https://man7.org/linux/man-pages/man3/strftime.3.html) strftimes. It seems they support timezones but one still needs to go from UTC to local time etc. > Since we already have the vendored lib in I suppose it comes down to readability, simplicity, and performance. What you have here seems plenty readable and simple. We can always benchmark the two approaches later. Fully agreed. Without benchmarking alternatives it's a moot discussion anyway. I just wanted to check if I missed something obvious. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
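The detail rok notes — that the C/Linux `strftime` formats whatever broken-down time it is handed, so the UTC-to-local conversion has to happen separately beforehand — has a direct Python analogue, shown here as an illustrative sketch (the +02:00 offset is an arbitrary example, not anything from the PR):

```python
from datetime import datetime, timezone, timedelta

# strftime only formats; it does not convert between zones. Convert from
# UTC to the target zone first, then format.
utc = datetime(2021, 7, 14, 12, 0, tzinfo=timezone.utc)
local = utc.astimezone(timezone(timedelta(hours=2)))  # e.g. a +02:00 zone

print(utc.strftime("%Y-%m-%d %H:%M %Z"))    # 2021-07-14 12:00 UTC
print(local.strftime("%Y-%m-%d %H:%M %Z"))  # 2021-07-14 14:00 UTC+02:00
```

The same wall-clock instant renders with a different hour and zone name depending on which converted value is formatted — which is exactly the extra step a vendored library like `date.h` can bundle with formatting.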
[GitHub] [arrow-rs] alamb commented on a change in pull request #491: Minimal MapArray support
alamb commented on a change in pull request #491: URL: https://github.com/apache/arrow-rs/pull/491#discussion_r669078940 ## File path: arrow/Cargo.toml ## @@ -58,13 +58,15 @@ multiversion = "0.6.1" bitflags = "1.2.1" [features] -default = ["csv", "ipc"] +default = ["csv", "ipc", "experimental-api"] Review comment: I recommend that `experimental-api` is not enabled by default as an additional safeguard against someone being surprised if a new arrow release breaks it ## File path: arrow/src/array/array_map.rs ## @@ -0,0 +1,380 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt; +use std::mem; + +use super::make_array; +use super::{ +array::print_long_array, raw_pointer::RawPtrBox, Array, ArrayData, ArrayRef, +}; +use crate::datatypes::{ArrowNativeType, DataType}; +use crate::error::ArrowError; + +/// A nested array type where each record is a key-value map. +/// Keys should always be non-null, but values can be null. +/// +/// [MapArray] is physically a [ListArray] that has a [StructArray] +/// with 2 child fields. 
+pub struct MapArray { +data: ArrayData, +values: ArrayRef, +value_offsets: RawPtrBox<i32>, +} + +impl MapArray { +#[cfg(feature = "experimental-api")] +/// Returns a reference to the keys of this map. +pub fn keys(&self) -> ArrayRef { +make_array(self.values.data().child_data()[0].clone()) +} + +#[cfg(feature = "experimental-api")] +/// Returns a reference to the values of this map. +pub fn values(&self) -> ArrayRef { +make_array(self.values.data().child_data()[1].clone()) +} + +#[cfg(feature = "experimental-api")] +/// Returns the data type of the map's keys. +pub fn key_type(&self) -> DataType { +self.values.data().child_data()[0].data_type().clone() +} + +#[cfg(feature = "experimental-api")] +/// Returns the data type of the map's values. +pub fn value_type(&self) -> DataType { +self.values.data().child_data()[1].data_type().clone() +} + +/// Returns ith value of this map array. +/// # Safety +/// Caller must ensure that the index is within the array bounds +pub unsafe fn value_unchecked(&self, i: usize) -> ArrayRef { +let end = *self.value_offsets().get_unchecked(i + 1); +let start = *self.value_offsets().get_unchecked(i); +self.values +.slice(start.to_usize().unwrap(), (end - start).to_usize().unwrap()) +} + +/// Returns ith value of this map array. +pub fn value(&self, i: usize) -> ArrayRef { +let end = self.value_offsets()[i + 1] as usize; +let start = self.value_offsets()[i] as usize; +self.values.slice(start, end - start) +} + +/// Returns the offset values in the offsets buffer +#[inline] +pub fn value_offsets(&self) -> &[i32] { +// Soundness +// pointer alignment & location is ensured by RawPtrBox +// buffer bounds/offset is ensured by the ArrayData instance. +unsafe { +std::slice::from_raw_parts( +self.value_offsets.as_ptr().add(self.data.offset()), +self.len() + 1, +) +} +} + +/// Returns the length for value at index `i`. 
+#[inline] +pub fn value_length(&self, i: usize) -> i32 { +let offsets = self.value_offsets(); +offsets[i + 1] - offsets[i] +} +} + +impl From<ArrayData> for MapArray { +fn from(data: ArrayData) -> Self { +Self::try_new_from_array_data(data) +.expect("Expected infallible creation of MapArray from ArrayData failed") +} +} + +impl MapArray { +fn try_new_from_array_data(data: ArrayData) -> Result<Self> { +if data.buffers().len() != 1 { +return Err(ArrowError::InvalidArgumentError( +format!("MapArray data should contain a single buffer only (value offsets), had {}", +data.len()))); +} + +if data.child_data().len() != 1 { +return Err(ArrowError::InvalidArgumentError(format!( +"MapArray should contain a single child array (values array), had {}", +data.child_data().len() +
[GitHub] [arrow] rok commented on a change in pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
rok commented on a change in pull request #10647: URL: https://github.com/apache/arrow/pull/10647#discussion_r669107412 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Yeah, without it RTools40 fails on the mingw32 build: ``` C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-4.0.1.9000/lib/i386/libarrow.a(tz.cpp.obj):(.text+0x725): undefined reference to `_imp__CoTaskMemFree@4' C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-4.0.1.9000/lib/i386/libarrow.a(tz.cpp.obj):(.text.unlikely+0x3e): undefined reference to `_imp__CoTaskMemFree@4' ``` [Fix source here](https://github.com/HowardHinnant/date/issues/272). I don't really know what it does. I'm also not sure if this is the place to put it. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kszucs edited a comment on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs edited a comment on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879399202 @kou Interestingly we have [linker errors](https://github.com/ursacomputing/crossbow/runs/3060717856?check_suite_focus=true#step:8:654) in the universal2 build but the tests seem to work for both x86_64 and arm64 in their own separate virtualenvs. Can this cause issues with `PARQUET_REQUIRE_ENCRYPTION` and `ARROW_S3` (`ARROW_FLIGHT` is disabled)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
alamb commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879398069 Cool -- I am just trying to shepherd PRs that look like they got stale -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
bkietz closed pull request #10705: URL: https://github.com/apache/arrow/pull/10705 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #722: perf: improve performance of `SortPreservingMergeExec` operator
alamb commented on a change in pull request #722: URL: https://github.com/apache/arrow-datafusion/pull/722#discussion_r669104634 ## File path: datafusion/src/physical_plan/sort_preserving_merge.rs ## @@ -246,7 +274,19 @@ impl SortKeyCursor { .zip(other.columns.iter()) .zip(options.iter()); -for ((l, r), sort_options) in zipped { +// Recall or initialise a collection of comparators for comparing +// columnar arrays of this cursor and "other". +let cmp = self +.batch_comparators +.entry(other.batch_idx) +.or_insert_with(|| Vec::with_capacity(other.columns.len())); Review comment: I think this way is fine, personally -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
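The pattern alamb is approving — lazily create a per-partner collection of comparators the first time a cursor meets that partner's batch, then reuse it on every subsequent comparison — maps onto `dict.setdefault` in Python. An illustrative sketch, not DataFusion's actual API (the class and method names here are made up):

```python
class SortKeyCursor:
    def __init__(self, num_columns):
        self.num_columns = num_columns
        # batch_idx -> list of column comparators, built on first use.
        self.batch_comparators = {}

    def comparators_for(self, other_batch_idx):
        # Like Rust's `entry(...).or_insert_with(|| Vec::with_capacity(n))`:
        # create the (initially empty) list once per partner batch, then
        # hand back the cached list on every later call.
        return self.batch_comparators.setdefault(other_batch_idx, [])

cursor = SortKeyCursor(3)
cmp1 = cursor.comparators_for(7)
cmp1.append("comparator for column 0")   # filled in lazily per column
cmp2 = cursor.comparators_for(7)         # second lookup hits the cache
print(cmp2 is cmp1)  # True
```

The win is the same as in the PR: comparator construction cost is paid once per (cursor, partner-batch) pair instead of once per row comparison.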
[GitHub] [arrow] lidavidm commented on a change in pull request #10705: ARROW-13313: [C++][Compute] Add scalar aggregate node
lidavidm commented on a change in pull request #10705: URL: https://github.com/apache/arrow/pull/10705#discussion_r669103875 ## File path: cpp/src/arrow/dataset/scanner_test.cc ## @@ -1323,5 +1323,86 @@ TEST(ScanNode, MaterializationOfVirtualColumn) { Finishes(ResultWith(UnorderedElementsAreArray(expected)))); } +TEST(ScanNode, MinimalEndToEnd) { + // NB: This test is here for didactic purposes + + // Specify a MemoryPool and ThreadPool for the ExecPlan + compute::ExecContext exec_context(default_memory_pool(), internal::GetCpuThreadPool()); + + // A ScanNode is constructed from an ExecPlan (into which it is inserted), + // a Dataset (whose batches will be scanned), and ScanOptions (to specify a filter for + // predicate pushdown, a projection to skip materialization of unnecessary columns, ...) + ASSERT_OK_AND_ASSIGN(std::shared_ptr<compute::ExecPlan> plan, + compute::ExecPlan::Make(&exec_context)); + + std::shared_ptr<Dataset> dataset = std::make_shared<InMemoryDataset>( + TableFromJSON(schema({field("a", int32()), field("b", boolean())}), +{ +R"([{"a": 1,"b": null}, +{"a": 2,"b": true}])", +R"([{"a": null, "b": true}, +{"a": 3,"b": false}])", +R"([{"a": null, "b": true}, +{"a": 4,"b": false}])", +R"([{"a": 5,"b": null}, +{"a": 6,"b": false}, +{"a": 7,"b": false}])", +})); + + auto options = std::make_shared<ScanOptions>(); + // sync scanning is not supported by ScanNode + options->use_async = true; + // for now, we must replicate the dataset schema here + options->dataset_schema = dataset->schema(); + // specify the filter + compute::Expression b_is_true = field_ref("b"); + ASSERT_OK_AND_ASSIGN(b_is_true, b_is_true.Bind(*dataset->schema())); + options->filter = b_is_true; + // for now, specify the projection as the full project expression (eventually this can + // just be a list of materialized field names) + compute::Expression a_times_2 = call("multiply", {field_ref("a"), literal(2)}); + ASSERT_OK_AND_ASSIGN(a_times_2, a_times_2.Bind(*dataset->schema())); + options->projection = call("project", {a_times_2}, 
 compute::ProjectOptions{{"a * 2"}}); + + // construct the scan node + ASSERT_OK_AND_ASSIGN(compute::ExecNode * scan, + dataset::MakeScanNode(plan.get(), dataset, options)); + + // pipe the scan node into a filter node + ASSERT_OK_AND_ASSIGN(compute::ExecNode * filter, + compute::MakeFilterNode(scan, "filter", b_is_true)); + + // pipe the filter node into a project node + // NB: we're using the project node factory which preserves fragment/batch index + // tagging, so we *can* reorder later if we choose. The tags will not appear in + // our output. + ASSERT_OK_AND_ASSIGN(compute::ExecNode * project, + dataset::MakeAugmentedProjectNode(filter, "project", {a_times_2})); + + // finally, pipe the project node into a sink node + // NB: if we don't need ordering, we could use compute::MakeSinkNode instead + ASSERT_OK_AND_ASSIGN(auto sink_gen, dataset::MakeOrderedSinkNode(project, "sink")); Review comment: Thanks for the example! I do like this setup a lot more since it is clearer what all the steps are in reading a dataset + everything is neatly separated. It is not that the current scanner does not separate the various stages of its pipeline, but the pipeline in the scanner is rather tied together while this is clearly partitioned. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #10647: ARROW-13174: [C++][Compute] Add strftime kernel
westonpace commented on a change in pull request #10647: URL: https://github.com/apache/arrow/pull/10647#discussion_r669098242 ## File path: r/configure.win ## @@ -44,7 +44,7 @@ else RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)" fi OPENSSL_LIBS="-lcrypto -lcrypt32" -MIMALLOC_LIBS="-lbcrypt -lpsapi" +MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32" Review comment: Is this related? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
lidavidm commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879390751 I would kind of prefer to get all these kernels merged and consolidated before I start trying to microoptimize them, though, given they've been around for a while and all use similar helper code (that's now starting to diverge slightly once I look at optimizing). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
lidavidm commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879390143 Basically, there was a lot of overhead from the fallback loop of "for offset in range(block size), if bit is set, copy one element" because 1) the 'copy one element' function used CopyBitmap which has a *ton* of overhead for copying one bit and 2) unboxing the array every time was costly when done in a loop like that (e.g. the profiler showed that even Buffer::data()'s check for whether the buffer is on-CPU was hot). But now I've specialized things to avoid most of that overhead. The reason why I wanted something like VisitSetBitRunsVoid was to go a step further and always try to perform block copies instead of falling back to one-element-at-a-time-copies. But yes, then it needs to be able to combine two bitmaps with AndNot (we want runs of bits where !output_valid & input_valid) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
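The strategy lidavidm describes — combine the two validity bitmaps with "input valid AND NOT output valid", then copy whole runs of set positions instead of testing one bit at a time — can be sketched in Python with bool lists standing in for bitmaps (illustrative only; the helper name echoes but is not Arrow's `VisitSetBitRunsVoid`):

```python
def set_bit_runs(bits):
    """Yield (start, length) for each maximal run of True values."""
    runs, start = [], None
    for i, b in enumerate(bits + [False]):  # sentinel flushes the last run
        if b and start is None:
            start = i
        elif not b and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

# Positions to fill: the input has a value AND the output doesn't yet.
output_valid = [True, False, False, True, False]
input_valid  = [True, True,  True,  True, False]
to_copy = [i and not o for o, i in zip(output_valid, input_valid)]

# One block copy of length 2 instead of two single-element copies.
print(set_bit_runs(to_copy))  # [(1, 2)]
```

Each `(start, length)` run then becomes one block memcpy, avoiding the per-element `CopyBitmap`/unboxing overhead described above.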
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879378962 Revision: 3aa0df229a92a3425bcb9b5a7711853bcc49e53d Submitted crossbow builds: [ursacomputing/crossbow @ actions-592](https://github.com/ursacomputing/crossbow/branches/all?query=actions-592)

|Task|Status|
|----|------|
|wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-big-sur-cp39-arm64)|
|wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-big-sur-cp39-universal2)|
|wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp36-amd64)|
|wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp37-amd64)|
|wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp38-amd64)|
|wheel-macos-high-sierra-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-high-sierra-cp39-amd64)|
|wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp36-amd64)|
|wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp37-amd64)|
|wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp38-amd64)|
|wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-macos-mavericks-cp39-amd64)|
|wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp36-amd64)|
|wheel-manylinux2010-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp37-amd64)|
|wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp38-amd64)|
|wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2010-cp39-amd64)|
|wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-592-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-592-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879378309 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
codecov-commenter commented on pull request #545: URL: https://github.com/apache/arrow-rs/pull/545#issuecomment-879376748

# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) Report

> Merging [#545](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (89a087c) into [master](https://codecov.io/gh/apache/arrow-rs/commit/6698eed2900cffadc549a92defe2030d36ccf0d6?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (6698eed) will **not change** coverage.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/545/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)

```diff
@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   82.60%   82.60%
=======================================
  Files         167      167
  Lines       45984    45984
=======================================
  Hits        37984    37984
  Misses       8000     8000
```

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=footer_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation).
Last update [6698eed...89a087c](https://codecov.io/gh/apache/arrow-rs/pull/545?src=pr=lastupdated_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10663: ARROW-13253: [FlightRPC][C++] Fix segfault with large messages
lidavidm commented on pull request #10663: URL: https://github.com/apache/arrow/pull/10663#issuecomment-879371544 @cyb70289, might you have time to look at this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm closed pull request #10712: ARROW-13307: [C++][Python][R] Use reflection-based enums for function options
lidavidm closed pull request #10712: URL: https://github.com/apache/arrow/pull/10712 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #10412: ARROW-9430: [C++] Implement replace_with_mask kernel
lidavidm commented on a change in pull request #10412: URL: https://github.com/apache/arrow/pull/10412#discussion_r669074742

## File path: cpp/src/arrow/compute/kernels/vector_replace.cc

## @@ -0,0 +1,510 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/bitmap_ops.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+namespace {
+
+Status ReplacementArrayTooShort(int64_t expected, int64_t actual) {
+  return Status::Invalid("Replacement array must be of appropriate length (expected ",
+                         expected, " items but got ", actual, " items)");
+}
+
+// Helper to implement replace_with kernel with scalar mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types
+Status ReplaceWithScalarMask(KernelContext* ctx, const ArrayData& array,
+                             const BooleanScalar& mask, const Datum& replacements,
+                             ArrayData* output) {
+  if (!mask.is_valid) {
+    // Output = null
+    ARROW_ASSIGN_OR_RAISE(auto replacement_array,
+                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
+    *output = *replacement_array->data();
+    return Status::OK();
+  }
+  if (mask.value) {
+    // Output = replacement
+    if (replacements.is_scalar()) {
+      ARROW_ASSIGN_OR_RAISE(
+          auto replacement_array,
+          MakeArrayFromScalar(*replacements.scalar(), array.length, ctx->memory_pool()));
+      *output = *replacement_array->data();
+    } else {
+      auto replacement_array = replacements.array();
+      if (replacement_array->length != array.length) {
+        return ReplacementArrayTooShort(array.length, replacement_array->length);
+      }
+      *output = *replacement_array;
+    }
+  } else {
+    // Output = input
+    *output = array;
+  }
+  return Status::OK();
+}
+
+struct CopyArrayBitmap {
+  const uint8_t* in_bitmap;
+  int64_t in_offset;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    arrow::internal::CopyBitmap(in_bitmap, in_offset + offset, length, out_bitmap,
+                                out_offset);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset,
+                      BitUtil::GetBit(in_bitmap, in_offset + offset));
+  }
+};
+
+struct CopyScalarBitmap {
+  const bool is_valid;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    BitUtil::SetBitsTo(out_bitmap, out_offset, length, is_valid);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset, is_valid);
+  }
+};
+
+// Helper to implement replace_with kernel with array mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types and to handle
+// scalar and array replacements
+template <typename Functor, typename Data, typename CopyBitmap>
+void ReplaceWithArrayMaskImpl(const ArrayData& array, const ArrayData& mask,
+                              const Data& replacements, bool replacements_bitmap,
+                              const CopyBitmap copy_bitmap, const uint8_t* mask_bitmap,
+                              const uint8_t* mask_values, uint8_t* out_bitmap,
+                              uint8_t* out_values, const int64_t out_offset) {
+  Functor::CopyData(*array.type, out_values, /*out_offset=*/0, array, /*in_offset=*/0,
+                    array.length);
+  arrow::internal::OptionalBinaryBitBlockCounter counter(
+      mask_values, mask.offset, mask_bitmap, mask.offset, mask.length);
+  int64_t write_offset = 0;
+  int64_t replacements_offset = 0;
+  while (write_offset < array.length) {
+    BitBlockCount block = counter.NextAndBlock();
+    if (block.AllSet()) {
+      // Copy from replacement array
+      Functor::CopyData(*array.type, out_values, out_offset + write_offset, replacements,
+                        replacements_offset, block.length);
+      if (replacements_bitmap) {
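The scalar-mask branch structure quoted in the review above (null mask → all-null output, true mask → replacements, false mask → input unchanged, with a length check on array replacements) can be sketched in plain Python. This is an illustrative model of the control flow only, not the Arrow implementation; names and the list-based representation are assumptions:

```python
def replace_with_scalar_mask(array, mask, replacements):
    """Model of the scalar-mask dispatch: `array` is a list of values
    (None = null), `mask` is True, False, or None (a null boolean scalar),
    and `replacements` is either a single value or a list matching `array`."""
    if mask is None:
        # Null mask: the entire output is null.
        return [None] * len(array)
    if mask:
        # True mask: output comes entirely from the replacements.
        if not isinstance(replacements, list):
            # Scalar replacement: broadcast to the array length.
            return [replacements] * len(array)
        if len(replacements) != len(array):
            raise ValueError(
                "Replacement array must be of appropriate length (expected "
                f"{len(array)} items but got {len(replacements)} items)")
        return list(replacements)
    # False mask: output is the input, untouched.
    return list(array)
```

For example, `replace_with_scalar_mask([1, 2, 3], True, 9)` broadcasts the scalar and returns `[9, 9, 9]`, while a `None` mask nulls out every slot.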
[GitHub] [arrow] github-actions[bot] commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
github-actions[bot] commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879364310 Revision: 086b4582939f4aef8b5784ab45ba4276f438ed6d Submitted crossbow builds: [ursacomputing/crossbow @ actions-591](https://github.com/ursacomputing/crossbow/branches/all?query=actions-591)

|Task|Status|
|----|------|
|wheel-macos-big-sur-cp39-arm64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-big-sur-cp39-arm64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-big-sur-cp39-arm64)|
|wheel-macos-big-sur-cp39-universal2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-big-sur-cp39-universal2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-big-sur-cp39-universal2)|
|wheel-macos-high-sierra-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp36-amd64)|
|wheel-macos-high-sierra-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp37-amd64)|
|wheel-macos-high-sierra-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp38-amd64)|
|wheel-macos-high-sierra-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-high-sierra-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-high-sierra-cp39-amd64)|
|wheel-macos-mavericks-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp36-amd64)|
|wheel-macos-mavericks-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp37-amd64)|
|wheel-macos-mavericks-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp38-amd64)|
|wheel-macos-mavericks-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-macos-mavericks-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-macos-mavericks-cp39-amd64)|
|wheel-manylinux2010-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp36-amd64)|
|wheel-manylinux2010-cp37-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp37-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp37-amd64)|
|wheel-manylinux2010-cp38-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp38-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp38-amd64)|
|wheel-manylinux2010-cp39-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2010-cp39-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2010-cp39-amd64)|
|wheel-manylinux2014-cp36-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-591-github-wheel-manylinux2014-cp36-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-591-github-wheel-manylinux2014-cp36-amd64)|
[GitHub] [arrow] kszucs commented on pull request #10659: ARROW-12122: [Python] Cannot install via pip M1 mac
kszucs commented on pull request #10659: URL: https://github.com/apache/arrow/pull/10659#issuecomment-879363719 @github-actions crossbow submit -g wheel -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10557: ARROW-13064: [C++] Implement select ('case when') function for fixed-width types
lidavidm commented on pull request #10557: URL: https://github.com/apache/arrow/pull/10557#issuecomment-879363403 Fixed docstrings and fixed a TODO I noticed about fully initializing the output buffer. Also added a benchmark case for when both the cond struct array and the child cond boolean arrays can have nulls. This case is especially terrible; I made a slight optimization to eliminate one of the more egregious offenders I saw in perf, but it's still bad even then:

```
--------------------------------------------------------------------------------------------------------------
Benchmark                                       Time           CPU   Iterations   UserCounters...
--------------------------------------------------------------------------------------------------------------
CaseWhenBench64/1048576/0                 6045379 ns    6045204 ns          125   bytes_per_second=1.29235G/s
CaseWhenBench64/1048576/99                6106510 ns    6106346 ns          123   bytes_per_second=1.27929G/s
CaseWhenBench64OuterNulls/1048576/0      32735686 ns   32733991 ns           23   bytes_per_second=244.394M/s
CaseWhenBench64OuterNulls/1048576/99     34098317 ns   34097473 ns           21   bytes_per_second=234.599M/s
CaseWhenBench64Contiguous/1048576/0       2229728 ns    2229687 ns          338   bytes_per_second=3.50386G/s
CaseWhenBench64Contiguous/1048576/99      2456084 ns    2456085 ns          237   bytes_per_second=3.18058G/s
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
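For context on what the CaseWhenBench numbers above measure: a "case when" kernel picks, per row, the value from the first branch whose boolean condition holds, falling back to a default when no condition fires (a null condition skips that branch). A rough Python model of that row-wise selection, with illustrative names rather than the Arrow API:

```python
def case_when(conds, values, default=None):
    """Row-wise 'case when' over columns.

    conds:  list of boolean columns (True/False/None, where None is a
            null condition and does not select the branch)
    values: list of value columns, one per condition
    """
    n = len(conds[0])
    out = []
    for i in range(n):
        for cond, val in zip(conds, values):
            if cond[i]:  # None (null) is falsy, so null conditions skip the branch
                out.append(val[i])
                break
        else:
            # No condition matched this row: use the default ("else" value).
            out.append(default)
    return out
```

The vectorized C++ kernel avoids this per-row loop by processing runs of the condition bitmaps in blocks, which is why the null-laden benchmark cases above are so much slower than the contiguous ones.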
[GitHub] [arrow-datafusion] kszucs commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
kszucs commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879359345 Since https://github.com/apache/arrow-rs/pull/439 has been merged I can expose the `schema` argument as well, though we can defer that to a follow-up PR too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #524: Expose ExecutionContext.register_csv to the python bindings
alamb commented on pull request #524: URL: https://github.com/apache/arrow-datafusion/pull/524#issuecomment-879357118 @jorgecarleitao / @kszucs -- what is the plan for this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #442: Change return type of 'DataFrame.collect()'
alamb commented on pull request #442: URL: https://github.com/apache/arrow-datafusion/pull/442#issuecomment-879356925 @djKooks do you plan to keep working on this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #441: WIP: Add tokomak optimizer
alamb commented on pull request #441: URL: https://github.com/apache/arrow-datafusion/pull/441#issuecomment-879356559 @Dandandan do you still plan to work on this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on pull request #8151: ARROW-9279: [C++] Implement PrettyPrint for Scalars
bkietz commented on pull request #8151: URL: https://github.com/apache/arrow/pull/8151#issuecomment-879355469 Closing this for now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #8151: ARROW-9279: [C++] Implement PrettyPrint for Scalars
bkietz closed pull request #8151: URL: https://github.com/apache/arrow/pull/8151 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #537: Implement `RecordBatch::concat`
alamb commented on pull request #537: URL: https://github.com/apache/arrow-rs/pull/537#issuecomment-879355147 Unless I hear different, I plan to merge this PR tomorrow (and include it in 5.0.0) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669059415

## File path: cpp/src/gandiva/string_function_holder_test.cc

## @@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) {
   auto status = LikeHolder::Make("ab\\_", "", _holder);
   EXPECT_EQ(status.ok(), false) << status.message();
 }
+
+// Tests related to the REGEXP_EXTRACT function
+class TestExtractHolder : public ::testing::Test {
+ protected:
+  ExecutionContext execution_context_;
+};
+
+TEST_F(TestExtractHolder, TestSimpleExtract) {

Review comment: Can you also add an integration test in projector_tests?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669058383

## File path: cpp/src/gandiva/string_function_holder.h

## @@ -0,0 +1,140 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <re2/re2.h>
+
+#include <memory>
+#include <string>
+
+#include "arrow/status.h"
+#include "gandiva/execution_context.h"
+#include "gandiva/function_holder.h"
+#include "gandiva/node.h"
+#include "gandiva/visibility.h"
+
+namespace gandiva {
+
+/// Function Holder for SQL 'like'
+class GANDIVA_EXPORT LikeHolder : public FunctionHolder {
+ public:
+  ~LikeHolder() override = default;
+
+  static Status Make(const FunctionNode& node, std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, const std::string& escape_char,
+                     std::shared_ptr<LikeHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern, std::shared_ptr<LikeHolder>* holder,
+                     RE2::Options regex_op);
+
+  // Try and optimise a function node with a "like" pattern.
+  static const FunctionNode TryOptimize(const FunctionNode& node);
+
+  /// Return true if the data matches the pattern.
+  bool operator()(const std::string& data) { return RE2::FullMatch(data, regex_); }
+
+ private:
+  explicit LikeHolder(const std::string& pattern) : pattern_(pattern), regex_(pattern) {}
+
+  LikeHolder(const std::string& pattern, RE2::Options regex_op)
+      : pattern_(pattern), regex_(pattern, regex_op) {}
+
+  std::string pattern_;  // posix pattern string, to help debugging
+  RE2 regex_;            // compiled regex for the pattern
+
+  static RE2 starts_with_regex_;  // pre-compiled pattern for matching starts_with
+  static RE2 ends_with_regex_;    // pre-compiled pattern for matching ends_with
+  static RE2 is_substr_regex_;    // pre-compiled pattern for matching is_substr
+};
+
+/// Function Holder for 'regexp_extract' function
+class GANDIVA_EXPORT ExtractHolder : public FunctionHolder {
+ public:
+  ~ExtractHolder() override = default;
+
+  static Status Make(const FunctionNode& node, std::shared_ptr<ExtractHolder>* holder);
+
+  static Status Make(const std::string& sql_pattern,
+                     std::shared_ptr<ExtractHolder>* holder);
+
+  /// Extracts the matching text from a string using a regex
+  const char* operator()(ExecutionContext* ctx, const char* user_input,

Review comment: Better to move the method to the .cc file.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
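The two holders quoted in this review wrap compiled RE2 patterns: LikeHolder full-matches data against a regex derived from a SQL LIKE pattern, and ExtractHolder returns the text a regex captures. A loose stdlib-`re` analogue of those semantics (function names and the escape-char-free LIKE translation are illustrative; the real Gandiva code uses RE2 and also handles escape characters):

```python
import re

def sql_like_to_regex(pattern):
    # Translate SQL LIKE wildcards: '%' -> '.*', '_' -> '.';
    # everything else is matched literally.
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def like(data, pattern):
    # Mirrors LikeHolder::operator(): a *full* match against the
    # translated pattern, not a substring search.
    return re.fullmatch(sql_like_to_regex(pattern), data) is not None

def regexp_extract(data, pattern, group=1):
    # ExtractHolder-style extraction: return the requested capture
    # group, or an empty string when nothing matches.
    m = re.search(pattern, data)
    return m.group(group) if m else ""
```

For example, `like("arrow", "arr%")` is true while `like("arrow", "arr")` is false, because LIKE without wildcards must match the whole string.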
[GitHub] [arrow] pachadotdev commented on pull request #10650: ARROW-13058: This is a draft to provide save-report
pachadotdev commented on pull request #10650: URL: https://github.com/apache/arrow/pull/10650#issuecomment-879349226 I made a few changes for the task of adding a function to capture the nightly build state in JSON before sending the emails. I'm going to need a second review on this (please bear in mind I had to learn Python from 0.05 to implement this). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #10679: ARROW-13170 [C++] Reducing branching in compute/kernels/vector_selection.cc
ursabot edited a comment on pull request #10679: URL: https://github.com/apache/arrow/pull/10679#issuecomment-879314016

Benchmark runs are scheduled for baseline = cf6a7ff65f4e2920641d116a3ba1f578b2bd8a9e and contender = 38110e8e7ee598ddb0e8a3465d81ea7e24bafebc. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped :warning: Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/96acb784c87842b2bfd28e17a9a90432...fdfaab8d2dc84fbd81318b663afe1e44/)
- [Skipped :warning: Only ['Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/5744b64d14d448198229be1dbb5265e7...a08372ce96df46c48f6874d84c74035c/)
- [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/cd0a0e80ad2c4de0b60cda38c58b64a4...9d55178f98434a64b8398f25458f5408/)

Supported benchmarks:
- ursa-i9-9960x: langs = Python, R
- ursa-thinkcentre-m75q: langs = C++, Java
- ec2-t3-xlarge-us-east-2: cloud = True
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669049315

## File path: cpp/src/gandiva/CMakeLists.txt

```diff
@@ -82,7 +82,7 @@ set(SRC_FILES
     hash_utils.cc
     llvm_generator.cc
     llvm_types.cc
-    like_holder.cc
+    string_function_holder.cc
```

Review comment: can you name it `regex_functions_holder`?
[GitHub] [arrow] projjal commented on a change in pull request #10518: ARROW-13052: [C++][Gandiva] Implements REGEXP_EXTRACT function
projjal commented on a change in pull request #10518: URL: https://github.com/apache/arrow/pull/10518#discussion_r669049098

## File path: cpp/src/gandiva/string_function_holder_test.cc

```diff
@@ -212,6 +213,189 @@ TEST_F(TestLikeHolder, TestMultipleEscapeChar) {
   auto status = LikeHolder::Make("ab\\_", "", _holder);
   EXPECT_EQ(status.ok(), false) << status.message();
 }
+
+// Tests related to the REGEXP_EXTRACT function
+class TestExtractHolder : public ::testing::Test {
+ protected:
+  ExecutionContext execution_context_;
+};
+
+TEST_F(TestExtractHolder, TestSimpleExtract) {
+  std::shared_ptr<ExtractHolder> extract_holder;
+
+  // Pattern to match two groups of letters
+  auto status = ExtractHolder::Make(R"((\w+) (\w+))", &extract_holder);
```

Review comment: Shouldn't this be `R"(\w+) (\w+)"`, since you are enclosing another bracket inside the constructor? So in your example, should index 2 return the first name?
[GitHub] [arrow] bkietz commented on pull request #10608: ARROW-13136: [C++] Add coalesce function
bkietz commented on pull request #10608: URL: https://github.com/apache/arrow/pull/10608#issuecomment-879342845

> Trying an approach based on VisitSetBitRunsVoid

IIUC this would require a varargs version of `OptionalBitBlockCounter` or `Bitmap::VisitWords`, which would probably be generally useful as varargs compute functions continue to proliferate.
[GitHub] [arrow] bkietz commented on a change in pull request #10412: ARROW-9430: [C++] Implement replace_with_mask kernel
bkietz commented on a change in pull request #10412: URL: https://github.com/apache/arrow/pull/10412#discussion_r669046761

## File path: cpp/src/arrow/compute/kernels/vector_replace.cc

```diff
@@ -0,0 +1,510 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/bitmap_ops.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+namespace {
+
+Status ReplacementArrayTooShort(int64_t expected, int64_t actual) {
+  return Status::Invalid("Replacement array must be of appropriate length (expected ",
+                         expected, " items but got ", actual, " items)");
+}
+
+// Helper to implement replace_with kernel with scalar mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types
+Status ReplaceWithScalarMask(KernelContext* ctx, const ArrayData& array,
+                             const BooleanScalar& mask, const Datum& replacements,
+                             ArrayData* output) {
+  if (!mask.is_valid) {
+    // Output = null
+    ARROW_ASSIGN_OR_RAISE(auto replacement_array,
+                          MakeArrayOfNull(array.type, array.length, ctx->memory_pool()));
+    *output = *replacement_array->data();
+    return Status::OK();
+  }
+  if (mask.value) {
+    // Output = replacement
+    if (replacements.is_scalar()) {
+      ARROW_ASSIGN_OR_RAISE(
+          auto replacement_array,
+          MakeArrayFromScalar(*replacements.scalar(), array.length, ctx->memory_pool()));
+      *output = *replacement_array->data();
+    } else {
+      auto replacement_array = replacements.array();
+      if (replacement_array->length != array.length) {
+        return ReplacementArrayTooShort(array.length, replacement_array->length);
+      }
+      *output = *replacement_array;
+    }
+  } else {
+    // Output = input
+    *output = array;
+  }
+  return Status::OK();
+}
+
+struct CopyArrayBitmap {
+  const uint8_t* in_bitmap;
+  int64_t in_offset;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    arrow::internal::CopyBitmap(in_bitmap, in_offset + offset, length, out_bitmap,
+                                out_offset);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset,
+                      BitUtil::GetBit(in_bitmap, in_offset + offset));
+  }
+};
+
+struct CopyScalarBitmap {
+  const bool is_valid;
+
+  void CopyBitmap(uint8_t* out_bitmap, int64_t out_offset, int64_t offset,
+                  int64_t length) const {
+    BitUtil::SetBitsTo(out_bitmap, out_offset, length, is_valid);
+  }
+
+  void SetBit(uint8_t* out_bitmap, int64_t out_offset, int64_t offset) const {
+    BitUtil::SetBitTo(out_bitmap, out_offset, is_valid);
+  }
+};
+
+// Helper to implement replace_with kernel with array mask for fixed-width types,
+// using callbacks to handle both bool and byte-sized types and to handle
+// scalar and array replacements
+template <typename Functor, typename Data, typename CopyBitmap>
+void ReplaceWithArrayMaskImpl(const ArrayData& array, const ArrayData& mask,
+                              const Data& replacements, bool replacements_bitmap,
+                              const CopyBitmap copy_bitmap, const uint8_t* mask_bitmap,
+                              const uint8_t* mask_values, uint8_t* out_bitmap,
+                              uint8_t* out_values, const int64_t out_offset) {
+  Functor::CopyData(*array.type, out_values, /*out_offset=*/0, array, /*in_offset=*/0,
+                    array.length);
+  arrow::internal::OptionalBinaryBitBlockCounter counter(
+      mask_values, mask.offset, mask_bitmap, mask.offset, mask.length);
+  int64_t write_offset = 0;
+  int64_t replacements_offset = 0;
+  while (write_offset < array.length) {
+    BitBlockCount block = counter.NextAndBlock();
+    if (block.AllSet()) {
+      // Copy from replacement array
+      Functor::CopyData(*array.type, out_values, out_offset + write_offset, replacements,
+                        replacements_offset, block.length);
+      if (replacements_bitmap) {
```
[GitHub] [arrow-rs] alamb opened a new pull request #545: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on
alamb opened a new pull request #545: URL: https://github.com/apache/arrow-rs/pull/545

# Which issue does this PR close?

Closes https://github.com/apache/arrow-rs/issues/544

# Rationale for this change

We upgraded Arrow to the latest version of `rand` in https://github.com/apache/arrow-rs/pull/488, which changed how it handles js/wasm support. When I tried to use arrow/datafusion in IOx (which depends on both arrow and datafusion) I got this crazy error while trying to resolve dependencies:

```
error: cyclic package dependency: package `ahash v0.7.4` depends on itself. Cycle:
package `ahash v0.7.4`
... which is depended on by `hashbrown v0.11.2`
... which is depended on by `indexmap v1.7.0`
... which is depended on by `serde_json v1.0.64`
... which is depended on by `wasm-bindgen v0.2.74`
... which is depended on by `js-sys v0.3.51`
... which is depended on by `getrandom v0.2.3`
... which is depended on by `ahash v0.7.4`
```

The issue appears to be that the `js` feature of rand somehow pulls in ahash again.

# What changes are included in this PR?

Make the `getrandom/js` dependency (needed to build for wasm) optional for arrow.

# Are there any user-facing changes?

Yes. In order to build for WASM, the previous command

```shell
cargo build --target wasm32-unknown-unknown
```

fails with an error such as

```
cd /Users/alamb/Software/arrow-rs/arrow && cargo build --target wasm32-unknown-unknown
   Compiling getrandom v0.2.3
error: the wasm32-unknown-unknown target is not supported by default, you may need to enable the "js" feature. For more information see: https://docs.rs/getrandom/#webassembly-support
   --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/getrandom-0.2.3/src/lib.rs:219:9
    |
219 | / compile_error!("the wasm32-unknown-unknown target is not supported by \
220 | | default, you may need to enable the \"js\" feature. \
221 | | For more information see: \
222 | | https://docs.rs/getrandom/#webassembly-support");
    | |_^
```

Now, arrow users will have to explicitly specify the `js` feature:

```shell
cargo build --features=js --target wasm32-unknown-unknown
```
[GitHub] [arrow] bkietz commented on a change in pull request #10557: ARROW-13064: [C++] Implement select ('case when') function for fixed-width types
bkietz commented on a change in pull request #10557: URL: https://github.com/apache/arrow/pull/10557#discussion_r669034240

## File path: docs/source/cpp/compute.rst

```diff
@@ -859,50 +859,60 @@ Structural transforms
 +--------------------+------------+------------------------------------------------+-----------------+-------+
 | Function name      | Arity      | Input types                                    | Output type     | Notes |
 +====================+============+================================================+=================+=======+
-| fill_null          | Binary     | Boolean, Null, Numeric, Temporal, String-like  | Input type      | \(1)  |
+| case_when          | Varargs    | Boolean, Any fixed-width                       | Input type      | \(1)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| if_else            | Ternary    | Boolean, Null, Numeric, Temporal               | Input type      | \(2)  |
+| fill_null          | Binary     | Boolean, Null, Numeric, Temporal, String-like  | Input type      | \(2)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_finite          | Unary      | Float, Double                                  | Boolean         | \(3)  |
+| if_else            | Ternary    | Boolean, Null, Numeric, Temporal               | Input type      | \(3)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_inf             | Unary      | Float, Double                                  | Boolean         | \(4)  |
+| is_finite          | Unary      | Float, Double                                  | Boolean         | \(4)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_nan             | Unary      | Float, Double                                  | Boolean         | \(5)  |
+| is_inf             | Unary      | Float, Double                                  | Boolean         | \(5)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_null            | Unary      | Any                                            | Boolean         | \(6)  |
+| is_nan             | Unary      | Float, Double                                  | Boolean         | \(6)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| is_valid           | Unary      | Any                                            | Boolean         | \(7)  |
+| is_null            | Unary      | Any                                            | Boolean         | \(7)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| list_value_length  | Unary      | List-like                                      | Int32 or Int64  | \(8)  |
+| is_valid           | Unary      | Any                                            | Boolean         | \(8)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
-| project            | Varargs    | Any                                            | Struct          | \(9)  |
+| list_value_length  | Unary      | List-like                                      | Int32 or Int64  | \(9)  |
 +--------------------+------------+------------------------------------------------+-----------------+-------+
+| project            | Varargs    | Any                                            | Struct          | \(10) |
++--------------------+------------+------------------------------------------------+-----------------+-------+
+
+* \(1) This function acts like a SQL 'case when' statement or switch-case. The
+  input is any number of alternating Boolean and value data, followed by an
```

Review comment: This (and some of the other comments) need to be updated with the new signature.

## File path: cpp/src/arrow/compute/kernels/scalar_if_else_benchmark.cc

```diff
@@ -97,6 +97,99 @@ static void IfElseBench32Contiguous(benchmark::State& state) {
   return IfElseBenchContiguous(state);
 }
+
+template <typename Type>
+static void CaseWhenBench(benchmark::State& state) {
+  using CType = typename Type::c_type;
+  auto type = TypeTraits<Type>::type_singleton();
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  int64_t len = state.range(0);
+  int64_t offset = state.range(1);
+
```