[GitHub] [arrow-datafusion] houqp commented on a change in pull request #443: add invariants spec
houqp commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643661190

## File path: docs/specification/invariants.md

@@ -0,0 +1,327 @@

# DataFusion's Invariants

This document enumerates invariants of DataFusion's logical and physical planes (functions, and nodes). Some of these invariants are currently not enforced. This document assumes that the reader is familiar with some of the codebase, including Rust Arrow's `RecordBatch` and `Array`.

## Rationale

DataFusion's computational model is built on top of a dynamically typed Arrow object, `Array`, that offers the interface `Array::as_any` to downcast itself to its statically typed versions (e.g. `Int32Array`). DataFusion uses `Array::data_type` to perform the respective downcasting in its physical operations. DataFusion uses a dynamic type system because the queries being executed are not always known at compile time: they are only known during the runtime (or query time) of programs built with DataFusion. This document is built on top of this principle.

In dynamically typed interfaces, it is up to developers to enforce type invariants. This document declares some of these invariants, so that users know what they can expect from a query in DataFusion, and DataFusion developers know what they need to enforce at the coding level.

## Notation

* Field or physical field: the tuple of name, `arrow::DataType`, and nullability flag (a bool indicating whether values can be null), represented in this document by `PF(name, type, nullable)`.
* Logical field: a field with a relation name, represented in this document by `LF(relation, name, type, nullable)`.
* Projected plan: a plan with a projection as the root node.
* Logical schema: a vector of logical fields, used by the logical plan.
* Physical schema: a vector of physical fields, used by both the physical plan and Arrow record batches.
### Logical

#### Function

An object that knows its valid incoming logical fields and how to derive its output logical field from its arguments' logical fields. A function's output field is itself a function of its input fields:

```
logical_field(lf1: LF, lf2: LF, ...) -> LF
```

Examples:

* `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` is the function mapping input types to the output type (`get_supertype` in our current implementation).
* `length(a) -> LF(None, "length({a})", u32, a.nullable)`

#### Plan

A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`) that knows how to derive its schema.

Certain plans have a frozen schema (e.g. Scan), while others derive their schema from their child nodes.

#### Column

A type of logical node in a logical plan, consisting of a field name and a relation name.

### Physical

#### Function

An object that knows how to derive its physical field from its arguments' physical fields, and also how to actually perform the computation on data. A function's output physical field is a function of its input physical fields:

```
physical_field(PF1, PF2, ...) -> PF
```

Examples:

* `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` is a complex function (`get_supertype` in our current implementation) whose computation is, for each element in the columns, to sum the two entries together and return the result in the same type as the smallest type of both columns.
* `length(a) -> PF("length({a})", u32, a.nullable)` whose computation is "count the number of bytes in the string".

#### Plan

A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`) that knows how to derive its metadata and compute itself.
Note how the physical plane does not know how to derive field names: field names are solely a property of the logical plane, as they are not needed in the physical plane.

#### Column

A type of physical node in a physical plan, consisting of a field name and a unique index.

### Data Sources' registry

A map of source name/relation -> Schema, plus the associated properties necessary to read data from it (e.g. a file path).

### Functions' registry

A map of function name -> logical + physical function.

### Physical Planner

A function that knows how to derive a physical plan from a logical plan:

```
plan(LogicalPlan) -> PhysicalPlan
```

### Logical Optimizer

A function that accepts a logical plan and returns an (optimized) logical plan which computes the same results, but in a more efficient manner:

```
optimize(LogicalPlan) -> LogicalPlan
```

### Physical Optimizer

A function that accepts a physical plan and returns an (optimized) physical plan which computes the same results, but may differ based on the actual hardware or execution environment being run:

```
optimize(PhysicalPlan) -> PhysicalPlan
```

### Builder

A function
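The logical-field derivation quoted above (e.g. for `plus`) can be sketched concretely. The following is an illustrative Rust sketch, not DataFusion's actual code: `Type`, `LogicalField`, and `supertype` are toy stand-ins for `arrow::DataType`, DataFusion's logical field type, and `get_supertype`, and the coercion table covers only three types.

```rust
// Toy stand-ins for arrow::DataType and DataFusion's logical field.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int32,
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
struct LogicalField {
    relation: Option<String>,
    name: String,
    data_type: Type,
    nullable: bool,
}

// A tiny illustrative subset of a `get_supertype`-style coercion table.
fn supertype(a: &Type, b: &Type) -> Type {
    use Type::*;
    match (a, b) {
        (Float64, _) | (_, Float64) => Float64,
        (Int64, _) | (_, Int64) => Int64,
        _ => Int32,
    }
}

// plus(a, b) -> LF(None, "{a} Plus {b}", d(a.type, b.type),
//                  a.nullable | b.nullable)
fn plus_field(a: &LogicalField, b: &LogicalField) -> LogicalField {
    LogicalField {
        relation: None,
        name: format!("{} Plus {}", a.name, b.name),
        data_type: supertype(&a.data_type, &b.data_type),
        nullable: a.nullable | b.nullable,
    }
}

fn main() {
    let a = LogicalField {
        relation: Some("t".into()),
        name: "c1".into(),
        data_type: Type::Int32,
        nullable: false,
    };
    let b = LogicalField {
        relation: Some("t".into()),
        name: "c2".into(),
        data_type: Type::Int64,
        nullable: true,
    };
    // Output name, type, and nullability are pure functions of the inputs.
    let out = plus_field(&a, &b);
    println!("{:?}", out); // name "c1 Plus c2", type Int64, nullable true
}
```

Note how the output field is derived entirely from the input fields, which is exactly the invariant the spec asks implementations to uphold.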
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #443: add invariants spec
jorgecarleitao commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643655420

## File path: docs/specification/invariants.md

(Quoted file excerpt identical to the one in houqp's comment above; omitted.)
Note how the physical plane does not know how to derive field names: field names are solely a property of the logical plane, as they are not needed in the

Review comment: That is my understanding also: naming is in the logical plane; in the physical plane, people can use whatever they want, as long as the record batches respect the output schema declared on the corresponding logical nodes. People can derive a debug format for the physical expressions, but the naming must be set beforehand, or we risk a divergence between the names declared in the output schema of the logical node and the record batches returned by the physical node.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #443: add invariants spec
houqp commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643649837

## File path: docs/specification/invariants.md

(Quoted file excerpt identical to the one in houqp's first comment above; omitted.)
Note how the physical plane does not know how to derive field names: field names are solely a property of the logical plane, as they are not needed in the

Review comment: @jorgecarleitao please correct me if I am wrong. I think the focus of this statement is on the generation of field names. In the current code base, physical field names are generated using metadata from the logical plane, more specifically from logical `Expr`s. The actual physical plan and physical expressions are not used to generate the field names.

> names are used to line up inputs to PhysicalExpr

This used to be true at the time when Jorge wrote it, but with https://github.com/apache/arrow-datafusion/pull/55 we will be changing it to line up inputs using the field index instead. But regardless of whether we make this change or not, it doesn't change the fact that physical field names are only generated using metadata from the logical plane.
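The trade-off discussed here (lining up inputs by name vs. by index) can be illustrated with a small sketch. The `Field` type and `index_of_unique` helper below are hypothetical stand-ins, not DataFusion APIs: with duplicate field names, name-based lookup becomes ambiguous, while index-based lookup stays well defined.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

// Resolve a column by name, succeeding only when exactly one field matches.
fn index_of_unique(schema: &[Field], name: &str) -> Option<usize> {
    let mut hits = schema.iter().enumerate().filter(|(_, f)| f.name == name);
    match (hits.next(), hits.next()) {
        (Some((i, _)), None) => Some(i), // exactly one match
        _ => None,                       // missing or ambiguous
    }
}

fn main() {
    let schema = vec![
        Field { name: "c1".into() },
        Field { name: "c1".into() }, // duplicate name, e.g. after a self-join
        Field { name: "c2".into() },
    ];
    assert_eq!(index_of_unique(&schema, "c2"), Some(2));
    assert_eq!(index_of_unique(&schema, "c1"), None); // ambiguous by name
    // By index there is no ambiguity: schema[1] denotes one definite column.
    assert_eq!(schema[1].name, "c1");
}
```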
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #463: Add sort in window functions
codecov-commenter commented on pull request #463:
URL: https://github.com/apache/arrow-datafusion/pull/463#issuecomment-852716291

# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/463) Report

> Merging [#463](https://codecov.io/gh/apache/arrow-datafusion/pull/463) (50f848d) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/16011120a1b73798049c5be49f9548b00f8a0a00) (1601112) will **increase** coverage by `0.03%`.
> The diff coverage is `76.23%`.

```diff
@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
+ Coverage   75.84%   75.87%   +0.03%
==========================================
  Files         153      153
  Lines       25872    26030     +158
==========================================
+ Hits        19622    19751     +129
- Misses       6250     6279      +29
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| ballista/rust/core/src/serde/logical\_plan/from\_proto.rs | `36.00% <0.00%> (-0.22%)` | :arrow_down: |
| ballista/rust/core/src/serde/physical\_plan/from\_proto.rs | `38.92% <0.00%> (-0.86%)` | :arrow_down: |
| datafusion/src/optimizer/utils.rs | `47.51% <0.00%> (-2.49%)` | :arrow_down: |
| ballista/rust/core/src/serde/logical\_plan/to\_proto.rs | `62.56% <18.18%> (+0.15%)` | :arrow_up: |
| datafusion/src/optimizer/projection\_push\_down.rs | `98.46% <87.50%> (+<0.01%)` | :arrow_up: |
| datafusion/src/sql/utils.rs | `64.92% <91.26%> (+17.02%)` | :arrow_up: |
| datafusion/src/logical\_plan/expr.rs | `84.60% <91.66%> (+0.07%)` | :arrow_up: |
| datafusion/src/sql/planner.rs | `84.37% <98.07%> (+0.31%)` | :arrow_up: |
| datafusion/src/logical\_plan/builder.rs | `90.43% <100.00%> (-0.05%)` | :arrow_down: |
[GitHub] [arrow] cyb70289 edited a comment on pull request #10364: ARROW-12074: [C++][Compute] Add scalar arithmetic kernels for decimal
cyb70289 edited a comment on pull request #10364:
URL: https://github.com/apache/arrow/pull/10364#issuecomment-852709410

@bkietz, I have met with one problem and would like to hear your comments. Thanks.

Decimal upscaling is operation dependent. E.g., `+`/`-` will upscale the arg with the smaller scale to align digits, `*` needs no scaling, and `/` is more complicated.

Implicit arg casting happens before the kernel is created. `DispatchBest` only knows the arg types; no operation type (which is kernel dependent) is available. So we cannot figure out the "to be casted" arg type (new precision, scale).
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.cc#L175

Maybe add another callback `kernel->explicit_cast()` and call it after or inside `DispatchBest`? Or create a different `ScalarFunction` struct (and `DispatchBest`) for each decimal operation?
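To make the "operation dependent" point concrete, here is a hedged Rust sketch of the commonly used SQL-style result-type rules for decimal add vs. multiply. This is for illustration only: Arrow's eventual rules may differ, and the helpers assume well-formed decimals with scale ≤ precision.

```rust
// Result (precision, scale) for adding two decimals decimal(p1, s1) and
// decimal(p2, s2): align scales to max(s1, s2), keep enough integer digits
// for the wider operand, plus one digit for a possible carry.
fn add_result(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    let s = s1.max(s2);
    let p = (p1 - s1).max(p2 - s2) + s + 1; // +1 for carry
    (p, s)
}

// Result type for multiplication: scales simply add, so no operand
// rescaling is needed before the kernel runs -- unlike add/subtract.
fn mul_result(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    (p1 + p2 + 1, s1 + s2)
}

fn main() {
    // decimal(5,2) + decimal(7,4): the first arg must be upscaled to scale 4.
    assert_eq!(add_result(5, 2, 7, 4), (8, 4));
    // decimal(5,2) * decimal(7,4): args are used as-is; only the output
    // type changes.
    assert_eq!(mul_result(5, 2, 7, 4), (13, 6));
}
```

Because the output type depends on the operation and not just the input types, a type-only `DispatchBest` cannot decide the casts on its own, which is the difficulty described above.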
[GitHub] [arrow-datafusion] msathis commented on issue #472: [Ballista] Improve task and job metadata
msathis commented on issue #472:
URL: https://github.com/apache/arrow-datafusion/issues/472#issuecomment-852710260

This will be a great addition. We can expose this information to the UI as well.
[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
westonpace commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643637975 ## File path: cpp/src/arrow/compute/exec/expression_test.cc ## @@ -165,6 +165,56 @@ TEST(ExpressionUtils, StripOrderPreservingCasts) { Expect(cast(field_ref("i32"), uint64()), no_change); }

+TEST(ExpressionUtils, MakeExecBatch) {
+  auto Expect = [](std::shared_ptr<RecordBatch> partial_batch) {
+    SCOPED_TRACE(partial_batch->ToString());
+    ASSERT_OK_AND_ASSIGN(auto batch, MakeExecBatch(*kBoringSchema, partial_batch));
+
+    ASSERT_EQ(batch.num_values(), kBoringSchema->num_fields());
+    for (int i = 0; i < kBoringSchema->num_fields(); ++i) {
+      const auto& field = *kBoringSchema->field(i);
+
+      SCOPED_TRACE("Field#" + std::to_string(i) + " " + field.ToString());
+
+      EXPECT_TRUE(batch[i].type()->Equals(field.type()))
+          << "Incorrect type " << batch[i].type()->ToString();
+
+      ASSERT_OK_AND_ASSIGN(auto col, FieldRef(field.name()).GetOneOrNone(*partial_batch));

Review comment: Ah, I expected `GetColumnByName` to just return the first instance of the field (I thought duplicate field names were generally allowed outside of `compute`), but you are correct, it treats it the same as "not found".
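The "duplicate name is treated the same as not found" behavior discussed here can be sketched in isolation (simplified standalone code with an invented `Field` struct — not Arrow's `FieldRef` implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of GetOneOrNone-style lookup semantics: a name matching more than
// one field yields "none", exactly like a name matching zero fields.
struct Field {
  std::string name;
};

const Field* GetOneOrNone(const std::vector<Field>& fields, const std::string& name) {
  const Field* found = nullptr;
  for (const auto& f : fields) {
    if (f.name == name) {
      if (found != nullptr) return nullptr;  // ambiguous: duplicate name
      found = &f;
    }
  }
  return found;  // nullptr when absent
}
```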
[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
westonpace commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643637072 ## File path: cpp/src/arrow/dataset/dataset_internal.h ## @@ -204,5 +204,35 @@ arrow::Result> GetFragmentScanOptions( return internal::checked_pointer_cast(source); } +class FragmentDataset : public Dataset { Review comment: Right, sorry, I meant the other way around. `InMemoryDataset : public FragmentDataset`.
[GitHub] [arrow] westonpace commented on pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on pull request #10258: URL: https://github.com/apache/arrow/pull/10258#issuecomment-852702836 @pitrou Don't worry about the delay, I've been plenty busy elsewhere. I have just a few follow-up questions and then I'll make the changes.
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643636361 ## File path: cpp/src/arrow/util/future_test.cc ## @@ -952,6 +951,85 @@ TEST(FutureCompletionTest, FutureVoid) { } }

+class FutureSchedulingTest : public testing::Test {
+ public:
+  internal::Executor* executor() { return mock_executor.get(); }
+  int spawn_count() { return mock_executor->spawn_count; }
+
+  std::function<void(const Status&)> callback = [](const Status&) {};
+  std::shared_ptr<MockExecutor> mock_executor = std::make_shared<MockExecutor>();
+};
+
+TEST_F(FutureSchedulingTest, ScheduleAlways) {
+  CallbackOptions options;
+  options.should_schedule = ShouldSchedule::ALWAYS;
+  options.executor = executor();
+  // Successful future
+  {
+    auto fut = Future<>::Make();
+    fut.AddCallback(callback, options);
+    fut.MarkFinished();
+    fut.AddCallback(callback, options);
+    ASSERT_EQ(2, spawn_count());
+  }
+  // Failing future
+  {
+    auto fut = Future<>::Make();
+    fut.AddCallback(callback, options);
+    fut.MarkFinished(Status::Invalid("XYZ"));
+    fut.AddCallback(callback, options);
+    ASSERT_EQ(4, spawn_count());
+  }
+}
+
+TEST_F(FutureSchedulingTest, ScheduleIfUnfinished) {
+  CallbackOptions options;
+  options.should_schedule = ShouldSchedule::IF_UNFINISHED;
+  options.executor = executor();
+  // Successful future
+  {
+    auto fut = Future<>::Make();
+    fut.AddCallback(callback, options);
+    fut.MarkFinished();
+    fut.AddCallback(callback, options);
+    ASSERT_EQ(1, spawn_count());
+  }
+  // Failing future
+  {
+    auto fut = Future<>::Make();
+    fut.AddCallback(callback, options);
+    fut.MarkFinished(Status::Invalid("XYZ"));
+    fut.AddCallback(callback, options);
+    ASSERT_EQ(2, spawn_count());
+  }
+}
+
+class DelayedExecutor : public internal::Executor {

Review comment: My rationale was only that `DelayedExecutor` is only used in `future_test.cc` while `MockExecutor` is used in `future_test.cc` and `thread_pool_test.cc`, but I see your point. I'll move this into `test_common.h`.
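The `ALWAYS`/`IF_UNFINISHED` counting behavior these tests assert can be modeled with a toy sketch (invented `MiniFuture`/`MockExecutor` types standing in for Arrow's real `Future`/`Executor`; this is only an illustration of the scheduling policy, not the actual implementation):

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Counts scheduled tasks instead of running them on a thread pool,
// mirroring the role of MockExecutor::spawn_count in the test above.
struct MockExecutor {
  int spawn_count = 0;
  std::vector<std::function<void()>> tasks;
  void Spawn(std::function<void()> task) {
    ++spawn_count;
    tasks.push_back(std::move(task));
  }
};

enum class ShouldSchedule { Never, Always, IfUnfinished };

struct MiniFuture {
  bool finished = false;
  std::vector<std::function<void()>> pending;

  void AddCallback(std::function<void()> cb, ShouldSchedule policy,
                   MockExecutor* exec) {
    if (finished) {
      // Added after completion: only Always still goes through the executor.
      if (policy == ShouldSchedule::Always) {
        exec->Spawn(std::move(cb));
      } else {
        cb();
      }
    } else {
      // Stored callbacks fire at MarkFinished; Always and IfUnfinished
      // get scheduled at that point, Never runs inline.
      if (policy == ShouldSchedule::Never) {
        pending.push_back(std::move(cb));
      } else {
        pending.push_back([cb, exec] { exec->Spawn(cb); });
      }
    }
  }

  void MarkFinished() {
    finished = true;
    for (auto& cb : pending) cb();
    pending.clear();
  }
};
```

With this model, `Always` yields two spawns per test block (one callback added before finishing, one after) while `IfUnfinished` yields one — the same counts the assertions above check.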
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643636029 ## File path: cpp/src/arrow/util/test_common.h ## @@ -85,4 +88,18 @@ inline void AssertIteratorExhausted(Iterator& it) { Transformer MakeFilter(std::function filter); +class MockExecutor : public internal::Executor { Review comment: There is a bit of categorization based on folder (e.g. `util/test_common.h` vs `io/test_common.h`). This file isn't really all that large. However, it could probably be split into `test_iterator_common.h` and `test_thread_pool_common.h`. Or am I missing the point?
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643634671 ## File path: cpp/src/arrow/util/future.h ## @@ -453,30 +480,35 @@ class Future { /// cyclic reference to itself through the callback. template typename std::enable_if::value>::type - AddCallback(OnComplete on_complete) const { + AddCallback(OnComplete on_complete, + CallbackOptions opts = CallbackOptions::Defaults()) const { // We know impl_ will not be dangling when invoking callbacks because at least one // thread will be waiting for MarkFinished to return. Thus it's safe to keep a // weak reference to impl_ here struct Callback { - void operator()() && { std::move(on_complete)(weak_self.get().result()); } - WeakFuture weak_self; + void operator()(const FutureImpl& impl) && { + std::move(on_complete)(*static_cast*>(impl.result_.get())); + } OnComplete on_complete; }; -impl_->AddCallback(Callback{WeakFuture(*this), std::move(on_complete)}); +impl_->AddCallback(Callback{std::move(on_complete)}, opts); } /// Overload for callbacks accepting a Status template typename std::enable_if::value>::type - AddCallback(OnComplete on_complete) const { + AddCallback(OnComplete on_complete, + CallbackOptions opts = CallbackOptions::Defaults()) const { static_assert(std::is_same::value, "Callbacks for Future<> should accept Status and not Result"); struct Callback { - void operator()() && { std::move(on_complete)(weak_self.get().status()); } - WeakFuture weak_self; + void operator()(const FutureImpl& impl) && { +std::move(on_complete)( +static_cast*>(impl.result_.get())->status()); Review comment: It is a method on `Future` (named `GetResult`) but `FutureImpl` is type-erased and so it has no reference to `ValueType`. If it were a `.cc` file I could extract it into an anonymous function but no luck there because of templates. I think I can create a `future_internal.h`. 
Would it be acceptable to create a method `GetResultFromFutureImpl` inside of `arrow::detail`? Or is there some other trick I can use?
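The type-erasure constraint behind this question can be sketched in isolation (invented `ErasedImpl`/`GetResultFromImpl` names, standing in for `FutureImpl` and the proposed helper; not Arrow's real headers):

```cpp
#include <cassert>
#include <memory>

// The impl stores its result as a type-erased pointer, so only templated
// code in the header — which still knows the concrete type — can cast it
// back. A plain .cc helper cannot, because templates must be visible at
// the call site.
struct ErasedImpl {
  std::shared_ptr<void> result_;  // type-erased storage
};

template <typename T>
const T& GetResultFromImpl(const ErasedImpl& impl) {
  return *static_cast<const T*>(impl.result_.get());
}
```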
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643632996 ## File path: cpp/src/arrow/util/future.cc ## @@ -272,8 +315,8 @@ class ConcreteFutureImpl : public FutureImpl { // // In fact, it is important not to hold the locks because the callback // may be slow or do its own locking on other resources -for (auto&& callback : callbacks_) { - std::move(callback)(); +for (auto& callback_record : callbacks_) { + RunOrScheduleCallback(callback_record, /*from_unfinished=*/true); Review comment: On the bright side, if we remove `IF_UNFINISHED` then we can change `ShouldSchedule` to a `bool`.
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643632152 ## File path: cpp/src/arrow/util/future.h ## @@ -202,8 +202,30 @@ enum class FutureState : int8_t { PENDING, SUCCESS, FAILURE }; inline bool IsFutureFinished(FutureState state) { return state != FutureState::PENDING; } +/// \brief Describes whether the callback should be scheduled or run synchronously +enum ShouldSchedule { + /// Always run the callback synchronously (the default) + NEVER = 0, Review comment: Hmm, technically the style guide prefers `kAlways` but I see `Always` used more often in Arrow. Although some of the gandiva code uses kAlways. (https://google.github.io/styleguide/cppguide.html#Enumerator_Names). Any preference?
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643630671 ## File path: cpp/src/arrow/util/future.cc ## @@ -272,8 +315,8 @@ class ConcreteFutureImpl : public FutureImpl { // // In fact, it is important not to hold the locks because the callback // may be slow or do its own locking on other resources -for (auto&& callback : callbacks_) { - std::move(callback)(); +for (auto& callback_record : callbacks_) { + RunOrScheduleCallback(callback_record, /*from_unfinished=*/true); Review comment: Hmm, `from_unfinished` is supposed to mean "was the callback added when the future was unfinished" but I can see how that is vague. I could just remove this option entirely. The `TryAddCallback` function already gives you this capability (it is used in `Transfer`) and so it isn't used anywhere. Any strong opinion?
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643628821 ## File path: cpp/src/arrow/util/future.cc ## @@ -231,26 +232,68 @@ class ConcreteFutureImpl : public FutureImpl { void DoMarkFailed() { DoMarkFinishedOrFailed(FutureState::FAILURE); } - void AddCallback(Callback callback) { + void CheckOptions(const CallbackOptions& opts) { +if (opts.should_schedule != ShouldSchedule::NEVER) { + DCHECK_NE(opts.executor, NULL) + << "An executor must be specified when adding a callback that might schedule"; +} + } + + void AddCallback(Callback callback, CallbackOptions opts) { +CheckOptions(opts); std::unique_lock lock(mutex_); +CallbackRecord callback_record{std::move(callback), opts}; if (IsFutureFinished(state_)) { lock.unlock(); - std::move(callback)(); + RunOrScheduleCallback(callback_record, /*from_unfinished=*/false); } else { - callbacks_.push_back(std::move(callback)); + callbacks_.push_back(std::move(callback_record)); } } - bool TryAddCallback(const std::function& callback_factory) { + bool TryAddCallback(const std::function& callback_factory, + CallbackOptions opts) { +CheckOptions(opts); std::unique_lock lock(mutex_); if (IsFutureFinished(state_)) { return false; } else { - callbacks_.push_back(callback_factory()); + callbacks_.push_back({callback_factory(), opts}); return true; } } + bool ShouldSchedule(const CallbackRecord& callback_record, bool from_unfinished) { +switch (callback_record.options.should_schedule) { + case ShouldSchedule::NEVER: +return false; + case ShouldSchedule::ALWAYS: +return true; + case ShouldSchedule::IF_UNFINISHED: +return from_unfinished; + default: +DCHECK(false) << "Unrecognized ShouldSchedule option"; +return false; +} + } + + void RunOrScheduleCallback(CallbackRecord& callback_record, bool from_unfinished) { +if (ShouldSchedule(callback_record, from_unfinished)) { + // Need to make a copy of this to keep it alive until the callback has a chance + // to be 
scheduled.
+  struct CallbackTask {
+    void operator()() { std::move(callback)(*self); }
+
+    Callback callback;
+    std::shared_ptr<FutureImpl> self;

Review comment: The weak pointer in the old implementation was used to prevent futures from creating a circular reference on themselves (callback references future which references callback). Unfortunately, the weak pointer relied on the future remaining valid until all callbacks had run. If all callbacks run synchronously this is easy (whoever is calling `MarkFinished` must have a valid reference until all callbacks finish). Once we start scheduling callbacks we run into the problem where `MarkFinished` can return before some callbacks have run, and then when those callbacks get scheduled the future has been deleted. This fix isn't just a change to a strong pointer though (that would introduce the circular reference problem again). Instead of the callback itself having a reference to the future, I changed it so that the callback takes the FutureImpl in as an argument (note, this is the internal `FutureImpl` callback and not the publicly exposed `Future` callback). This allowed me to avoid the circular reference because the strong pointer is created when the callback is being triggered and not when the callback is being added. Also, the strong pointer is only created if it is a scheduled callback. Any existing performance should remain the same since no strong pointer or `shared_from_this` call is made.
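The design point above — materialize the strong reference only at invocation time, so the stored callback never owns the future — can be sketched with a simplified, invented `Impl` type (not Arrow's `FutureImpl`):

```cpp
#include <cassert>
#include <functional>
#include <memory>

// The stored callback takes the impl as an argument instead of capturing a
// shared_ptr back to it. This avoids the impl -> callback -> impl cycle:
// a strong reference exists only for the duration of the invocation.
struct Impl : std::enable_shared_from_this<Impl> {
  int result = 0;
  std::function<void(const Impl&)> callback;  // no self-reference captured

  void MarkFinished(int value) {
    result = value;
    if (callback) {
      auto self = shared_from_this();  // strong copy, kept alive for the call
      callback(*self);
    }
  }
};
```

After `MarkFinished` returns, the temporary strong reference is gone, so the impl's lifetime is still controlled solely by its external owner.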
[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future
westonpace commented on a change in pull request #10258: URL: https://github.com/apache/arrow/pull/10258#discussion_r643625949 ## File path: cpp/src/arrow/util/future.cc ## @@ -231,26 +232,68 @@ class ConcreteFutureImpl : public FutureImpl { void DoMarkFailed() { DoMarkFinishedOrFailed(FutureState::FAILURE); } - void AddCallback(Callback callback) { + void CheckOptions(const CallbackOptions& opts) { +if (opts.should_schedule != ShouldSchedule::NEVER) { + DCHECK_NE(opts.executor, NULL) + << "An executor must be specified when adding a callback that might schedule"; +} + } + + void AddCallback(Callback callback, CallbackOptions opts) { +CheckOptions(opts); std::unique_lock lock(mutex_); +CallbackRecord callback_record{std::move(callback), opts}; if (IsFutureFinished(state_)) { lock.unlock(); - std::move(callback)(); + RunOrScheduleCallback(callback_record, /*from_unfinished=*/false); } else { - callbacks_.push_back(std::move(callback)); + callbacks_.push_back(std::move(callback_record)); } } - bool TryAddCallback(const std::function& callback_factory) { + bool TryAddCallback(const std::function& callback_factory, + CallbackOptions opts) { +CheckOptions(opts); std::unique_lock lock(mutex_); if (IsFutureFinished(state_)) { return false; } else { - callbacks_.push_back(callback_factory()); + callbacks_.push_back({callback_factory(), opts}); return true; } } + bool ShouldSchedule(const CallbackRecord& callback_record, bool from_unfinished) { +switch (callback_record.options.should_schedule) { + case ShouldSchedule::NEVER: +return false; + case ShouldSchedule::ALWAYS: +return true; + case ShouldSchedule::IF_UNFINISHED: +return from_unfinished; + default: +DCHECK(false) << "Unrecognized ShouldSchedule option"; +return false; +} + } + + void RunOrScheduleCallback(CallbackRecord& callback_record, bool from_unfinished) { +if (ShouldSchedule(callback_record, from_unfinished)) { + // Need to make a copy of this to keep it alive until the callback has a chance Review comment: 
The copy is a few lines down when we call `shared_from_this`. I'll move the comment and make it more explicit.
[GitHub] [arrow] westonpace commented on a change in pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck
westonpace commented on a change in pull request #10421: URL: https://github.com/apache/arrow/pull/10421#discussion_r643623285 ## File path: cpp/src/arrow/util/thread_pool.h ## @@ -288,6 +288,10 @@ class ARROW_EXPORT ThreadPool : public Executor { // tasks are finished. Status Shutdown(bool wait = true); + // Waits for the thread pool to reach a quiet state where all workers are Review comment: Fixed.
[GitHub] [arrow] westonpace commented on a change in pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck
westonpace commented on a change in pull request #10421: URL: https://github.com/apache/arrow/pull/10421#discussion_r643622911 ## File path: cpp/src/arrow/util/thread_pool_benchmark.cc ## @@ -103,6 +103,52 @@ static void ThreadPoolSpawn(benchmark::State& state) { // NOLINT non-const reference state.SetItemsProcessed(state.iterations() * nspawns); }

+// The ThreadPoolSpawn benchmark submits all tasks from a single outside thread. This
+// ends up causing a worst-case scenario for the current simple thread pool. All threads
+// compete over the task queue mutex trying to grab the next task off the queue and the
+// result is a large amount of contention.
+//
+// By spreading out the scheduling across multiple threads we can help reduce that
+// contention. This benchmark demonstrates the ideal case where we are able to perfectly
+// partition the scheduling across the available threads.
+//
+// Both situations could be encountered (the thread pool can't choose how it is used) but
+// by having both benchmarks we can express the importance of distributed scheduling.
+static void ThreadPoolIdealSpawn(benchmark::State& state) {  // NOLINT non-const reference
+  const auto nthreads = static_cast<int32_t>(state.range(0));
+  const auto workload_size = static_cast<int32_t>(state.range(1));
+
+  Workload workload(workload_size);
+
+  // Spawn enough tasks to make the pool start up overhead negligible
+  const int32_t nspawns = 2 / workload_size + 1;
+  const int32_t nspawns_per_thread = nspawns / nthreads;
+
+  for (auto _ : state) {
+    state.PauseTiming();
+    std::shared_ptr<ThreadPool> pool;
+    pool = *ThreadPool::Make(nthreads);
+    state.ResumeTiming();
+
+    for (int32_t i = 0; i < nthreads; ++i) {
+      // Pass the task by reference to avoid copying it around
+      ABORT_NOT_OK(pool->Spawn([&pool, &workload, nspawns_per_thread] {
+        for (int32_t j = 0; j < nspawns_per_thread; j++) {
+          ABORT_NOT_OK(pool->Spawn(std::ref(workload)));
+        }
+      }));
+    }
+
+    // Wait for all tasks to finish
+    pool->WaitForIdle();

Review comment: At this point we cannot know that all the tasks have been spawned. If I call `Shutdown(wait=true)` then a slow spawner will fail because `SpawnReal` returns `Status::Invalid` if `please_shutdown_` is `true`.
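The partitioning arithmetic the benchmark relies on (each of `nthreads` submitters spawns `nspawns / nthreads` tasks) can be shown in a self-contained sketch; this models only the distributed submission pattern with plain `std::thread`, not Arrow's pool or the mutex contention being measured:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch of distributed scheduling: instead of one outside thread submitting
// every task, each of nthreads submitter threads enqueues its own share, so
// no single thread serializes on the submission path.
int RunPartitioned(int nthreads, int nspawns) {
  std::atomic<int> tasks_run{0};
  const int per_thread = nspawns / nthreads;  // perfect partition assumed
  std::vector<std::thread> submitters;
  for (int i = 0; i < nthreads; ++i) {
    submitters.emplace_back([&tasks_run, per_thread] {
      for (int j = 0; j < per_thread; ++j) {
        tasks_run.fetch_add(1);  // stands in for pool->Spawn(workload)
      }
    });
  }
  for (auto& t : submitters) t.join();  // stands in for WaitForIdle()
  return tasks_run.load();
}
```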
[GitHub] [arrow] westonpace edited a comment on pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck
westonpace edited a comment on pull request #10421: URL: https://github.com/apache/arrow/pull/10421#issuecomment-852682551 Just adding the benchmark... ``` ThreadPoolSpawn/threads:1/task_cost:1000/real_time 104576026 ns 39527736 ns7 items_per_second=1.91249M/s ThreadPoolSpawn/threads:2/task_cost:1000/real_time 81736943 ns 69631881 ns8 items_per_second=2.44689M/s ThreadPoolSpawn/threads:4/task_cost:1000/real_time 395577537 ns 337000146 ns2 items_per_second=505.592k/s ThreadPoolSpawn/threads:8/task_cost:1000/real_time 326650393 ns 290524204 ns2 items_per_second=612.278k/s ThreadPoolSpawn/threads:1/task_cost:1/real_time 81345849 ns 2355243 ns8 items_per_second=245.876k/s ThreadPoolSpawn/threads:2/task_cost:1/real_time 43399109 ns 2694481 ns 16 items_per_second=460.862k/s ThreadPoolSpawn/threads:4/task_cost:1/real_time 22975768 ns 3457033 ns 31 items_per_second=870.526k/s ThreadPoolSpawn/threads:8/task_cost:1/real_time 21717680 ns 14331911 ns 37 items_per_second=920.955k/s ThreadPoolSpawn/threads:1/task_cost:10/real_time81517332 ns 270745 ns8 items_per_second=24.5469k/s ThreadPoolSpawn/threads:2/task_cost:10/real_time41615534 ns 281452 ns 17 items_per_second=48.083k/s ThreadPoolSpawn/threads:4/task_cost:10/real_time21324149 ns 316989 ns 33 items_per_second=93.8373k/s ThreadPoolSpawn/threads:8/task_cost:10/real_time12702954 ns 443910 ns 55 items_per_second=157.522k/s ThreadPoolIdealSpawn/threads:1/task_cost:1000/real_time107253606 ns 91704 ns6 items_per_second=1.86475M/s ThreadPoolIdealSpawn/threads:2/task_cost:1000/real_time 88176114 ns 117639 ns8 items_per_second=2.2682M/s ThreadPoolIdealSpawn/threads:4/task_cost:1000/real_time 86717904 ns 107266 ns8 items_per_second=2.30634M/s ThreadPoolIdealSpawn/threads:8/task_cost:1000/real_time 98762117 ns 209733 ns7 items_per_second=2.02508M/s ThreadPoolIdealSpawn/threads:1/task_cost:1/real_time84566727 ns 64819 ns8 items_per_second=236.511k/s ThreadPoolIdealSpawn/threads:2/task_cost:1/real_time46885981 ns 70276 ns 15 
items_per_second=426.588k/s ThreadPoolIdealSpawn/threads:4/task_cost:1/real_time27807860 ns 94742 ns 26 items_per_second=719.257k/s ThreadPoolIdealSpawn/threads:8/task_cost:1/real_time16994645 ns 148328 ns 41 items_per_second=1.1769M/s ThreadPoolIdealSpawn/threads:1/task_cost:10/real_time 81094164 ns 52531 ns8 items_per_second=24.675k/s ThreadPoolIdealSpawn/threads:2/task_cost:10/real_time 42196762 ns 85439 ns 16 items_per_second=47.4207k/s ThreadPoolIdealSpawn/threads:4/task_cost:10/real_time 22380385 ns 120278 ns 32 items_per_second=89.4086k/s ThreadPoolIdealSpawn/threads:8/task_cost:10/real_time 12873517 ns 180938 ns 56 items_per_second=155.435k/s ``` Early results from work-stealing (note, impl:1 is the single queue implementation, it benefits quite a bit from the generalized refactor (#10401) which shrinks the critical section)... ``` ThreadPoolSpawn/impl:1/threads:1/task_cost:1000/real_time 109095290 ns 45507459 ns6 items_per_second=1.83327M/s ThreadPoolSpawn/impl:1/threads:2/task_cost:1000/real_time 84445897 ns 73408467 ns8 items_per_second=2.36839M/s ThreadPoolSpawn/impl:1/threads:4/task_cost:1000/real_time 384508473 ns33388 ns2 items_per_second=520.147k/s ThreadPoolSpawn/impl:1/threads:8/task_cost:1000/real_time 340298964 ns310431590 ns2 items_per_second=587.721k/s ThreadPoolSpawn/impl:1/threads:1/task_cost:1/real_time 84889601 ns 2927850 ns8 items_per_second=235.612k/s ThreadPoolSpawn/impl:1/threads:2/task_cost:1/real_time 46962168 ns 4429182 ns 16 items_per_second=425.896k/s ThreadPoolSpawn/impl:1/threads:4/task_cost:1/real_time 27891032 ns 5498450 ns 24 items_per_second=717.112k/s ThreadPoolSpawn/impl:1/threads:8/task_cost:1/real_time 23484115 ns 15697174 ns 29 items_per_second=851.682k/s ThreadPoolSpawn/impl:1/threads:1/task_cost:10/real_time86121178 ns 466594 ns8 items_per_second=23.2347k/s ThreadPoolSpawn/impl:1/threads:2/task_cost:10/real_time47425209 ns 563522 ns 14 items_per_second=42.1928k/s
[GitHub] [arrow] westonpace commented on pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck
westonpace commented on pull request #10421: URL: https://github.com/apache/arrow/pull/10421#issuecomment-852682551

Just adding the benchmark...

```
ThreadPoolSpawn/threads:1/task_cost:1000/real_time         104576026 ns   39527736 ns    7   items_per_second=1.91249M/s
ThreadPoolSpawn/threads:2/task_cost:1000/real_time          81736943 ns   69631881 ns    8   items_per_second=2.44689M/s
ThreadPoolSpawn/threads:4/task_cost:1000/real_time         395577537 ns  337000146 ns    2   items_per_second=505.592k/s
ThreadPoolSpawn/threads:8/task_cost:1000/real_time         326650393 ns  290524204 ns    2   items_per_second=612.278k/s
ThreadPoolSpawn/threads:1/task_cost:1/real_time             81345849 ns    2355243 ns    8   items_per_second=245.876k/s
ThreadPoolSpawn/threads:2/task_cost:1/real_time             43399109 ns    2694481 ns   16   items_per_second=460.862k/s
ThreadPoolSpawn/threads:4/task_cost:1/real_time             22975768 ns    3457033 ns   31   items_per_second=870.526k/s
ThreadPoolSpawn/threads:8/task_cost:1/real_time             21717680 ns   14331911 ns   37   items_per_second=920.955k/s
ThreadPoolSpawn/threads:1/task_cost:10/real_time            81517332 ns     270745 ns    8   items_per_second=24.5469k/s
ThreadPoolSpawn/threads:2/task_cost:10/real_time            41615534 ns     281452 ns   17   items_per_second=48.083k/s
ThreadPoolSpawn/threads:4/task_cost:10/real_time            21324149 ns     316989 ns   33   items_per_second=93.8373k/s
ThreadPoolSpawn/threads:8/task_cost:10/real_time            12702954 ns     443910 ns   55   items_per_second=157.522k/s
ThreadPoolIdealSpawn/threads:1/task_cost:1000/real_time    107253606 ns      91704 ns    6   items_per_second=1.86475M/s
ThreadPoolIdealSpawn/threads:2/task_cost:1000/real_time     88176114 ns     117639 ns    8   items_per_second=2.2682M/s
ThreadPoolIdealSpawn/threads:4/task_cost:1000/real_time     86717904 ns     107266 ns    8   items_per_second=2.30634M/s
ThreadPoolIdealSpawn/threads:8/task_cost:1000/real_time     98762117 ns     209733 ns    7   items_per_second=2.02508M/s
ThreadPoolIdealSpawn/threads:1/task_cost:1/real_time        84566727 ns      64819 ns    8   items_per_second=236.511k/s
ThreadPoolIdealSpawn/threads:2/task_cost:1/real_time        46885981 ns      70276 ns   15   items_per_second=426.588k/s
ThreadPoolIdealSpawn/threads:4/task_cost:1/real_time        27807860 ns      94742 ns   26   items_per_second=719.257k/s
ThreadPoolIdealSpawn/threads:8/task_cost:1/real_time        16994645 ns     148328 ns   41   items_per_second=1.1769M/s
ThreadPoolIdealSpawn/threads:1/task_cost:10/real_time       81094164 ns      52531 ns    8   items_per_second=24.675k/s
ThreadPoolIdealSpawn/threads:2/task_cost:10/real_time       42196762 ns      85439 ns   16   items_per_second=47.4207k/s
ThreadPoolIdealSpawn/threads:4/task_cost:10/real_time       22380385 ns     120278 ns   32   items_per_second=89.4086k/s
ThreadPoolIdealSpawn/threads:8/task_cost:10/real_time       12873517 ns     180938 ns   56   items_per_second=155.435k/s
```

Early results from work-stealing (note, impl:1 is the single queue implementation, it benefits quite a bit from the generalized refactor (#10401) which shrinks the critical section considerably)...

```
ThreadPoolSpawn/impl:1/threads:1/task_cost:1000/real_time   109095290 ns   45507459 ns    6   items_per_second=1.83327M/s
ThreadPoolSpawn/impl:1/threads:2/task_cost:1000/real_time    84445897 ns   73408467 ns    8   items_per_second=2.36839M/s
ThreadPoolSpawn/impl:1/threads:4/task_cost:1000/real_time   384508473 ns      33388 ns    2   items_per_second=520.147k/s
ThreadPoolSpawn/impl:1/threads:8/task_cost:1000/real_time   340298964 ns  310431590 ns    2   items_per_second=587.721k/s
ThreadPoolSpawn/impl:1/threads:1/task_cost:1/real_time       84889601 ns    2927850 ns    8   items_per_second=235.612k/s
ThreadPoolSpawn/impl:1/threads:2/task_cost:1/real_time       46962168 ns    4429182 ns   16   items_per_second=425.896k/s
ThreadPoolSpawn/impl:1/threads:4/task_cost:1/real_time       27891032 ns    5498450 ns   24   items_per_second=717.112k/s
ThreadPoolSpawn/impl:1/threads:8/task_cost:1/real_time       23484115 ns   15697174 ns   29   items_per_second=851.682k/s
ThreadPoolSpawn/impl:1/threads:1/task_cost:10/real_time      86121178 ns     466594 ns    8   items_per_second=23.2347k/s
ThreadPoolSpawn/impl:1/threads:2/task_cost:10/real_time      47425209 ns     563522 ns   14   items_per_second=42.1928k/s
```
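A rough way to read the quoted `items_per_second` figures is to normalize each thread count against the single-threaded run. Doing that for the `ThreadPoolSpawn` results above shows the scheduling bottleneck this benchmark was written to demonstrate: with `task_cost:1` the pool only reaches about 3.7x throughput on 8 threads, and with `task_cost:1000` throughput actually collapses past 2 threads. (The numbers below are copied from the first listing; the `speedup` helper is just illustration.)

```python
# items_per_second for ThreadPoolSpawn, copied from the listing above
spawn_cost1 = {1: 245_876, 2: 460_862, 4: 870_526, 8: 920_955}
spawn_cost1000 = {1: 1_912_490, 2: 2_446_890, 4: 505_592, 8: 612_278}

def speedup(results):
    # throughput of each run relative to the single-threaded run
    base = results[1]
    return {threads: rate / base for threads, rate in results.items()}

print(speedup(spawn_cost1))     # 8 threads reach well under the ideal 8x
print(speedup(spawn_cost1000))  # scaling goes below 1x past 2 threads
```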
[GitHub] [arrow] liyafan82 commented on a change in pull request #10423: ARROW-12907: [Java] Fix memory leak on deserialization errors
liyafan82 commented on a change in pull request #10423: URL: https://github.com/apache/arrow/pull/10423#discussion_r643606492 ## File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java ## @@ -197,12 +199,30 @@ public void testSerializeRecordBatchV5() throws IOException { IpcOption option = new IpcOption(false, MetadataVersion.V5); ByteArrayOutputStream out = new ByteArrayOutputStream(); MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), batch, option); +validityb.close(); +valuesb.close(); +batch.close(); + +{ + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); Review comment: It would be nice to wrap this into a try-with-resource block. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #10423: ARROW-12907: [Java] Fix memory leak on deserialization errors
liyafan82 commented on a change in pull request #10423: URL: https://github.com/apache/arrow/pull/10423#discussion_r643604065 ## File path: java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java ## @@ -723,8 +723,13 @@ public static MessageMetadataResult readMessage(ReadChannel in) throws IOExcepti public static ArrowBuf readMessageBody(ReadChannel in, long bodyLength, BufferAllocator allocator) throws IOException { ArrowBuf bodyBuffer = allocator.buffer(bodyLength); Review comment: Nice catch! A more conventional way is to wrap the statements in try-with-resource block? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
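The leak being fixed here follows a classic shape: the buffer allocation succeeds, then the subsequent read throws, and unless the buffer is released on the error path it leaks (try-with-resources in Java gives the same guarantee declaratively). A minimal language-agnostic sketch of the pattern, in Python with a hypothetical `Buffer` class standing in for `ArrowBuf`:

```python
class Buffer:
    """Stands in for an allocator-backed ArrowBuf (hypothetical)."""
    def __init__(self, size):
        self.size = size
        self.closed = False

    def close(self):
        self.closed = True

def read_message_body(read_fn, body_length):
    buf = Buffer(body_length)  # allocation succeeds first
    try:
        read_fn(buf)           # may raise, e.g. on a truncated stream
        return buf
    except Exception:
        buf.close()            # release on the error path -> no leak
        raise
```

The caller only owns the buffer once `read_message_body` returns normally; any failure in between is the function's responsibility to clean up.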
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #384: Implement faster arrow array reader
nevi-me commented on a change in pull request #384: URL: https://github.com/apache/arrow-rs/pull/384#discussion_r643577679 ## File path: parquet/src/arrow/arrow_array_reader.rs ## @@ -0,0 +1,1394 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use std::{any::Any, collections::VecDeque, marker::PhantomData}; +use std::{rc::Rc, cell::RefCell}; +use arrow::{array::{ArrayRef, Int16Array}, buffer::MutableBuffer, datatypes::{DataType as ArrowType, ToByteSlice}}; +use crate::{column::page::{Page, PageIterator}, memory::ByteBufferPtr, schema::types::{ColumnDescPtr, ColumnDescriptor}}; +use crate::arrow::schema::parquet_to_arrow_field; +use crate::errors::{ParquetError, Result}; +use crate::basic::Encoding; +use super::array_reader::ArrayReader; + +struct UnzipIter +{ +shared_state: Rc>, +select_item_buffer: fn( State) -> VecDeque, +consume_source_item: fn(source_item: Source, state: State) -> Target, +} + +impl UnzipIter +{ +fn new( +shared_state: Rc>, +item_buffer_selector: fn( State) -> VecDeque, +source_item_consumer: fn(source_item: Source, state: State) -> Target +) -> Self { +Self { +shared_state, +select_item_buffer: item_buffer_selector, +consume_source_item: source_item_consumer, +} +} +} + +trait UnzipIterState { +type SourceIter: Iterator; +fn source_iter( self) -> Self::SourceIter; +} + +impl> Iterator for UnzipIter { +type Item = Target; + +fn next( self) -> Option { +let mut inner = self.shared_state.borrow_mut(); +// try to get one from the stored data +(self.select_item_buffer)( *inner).pop_front().or_else(|| +// nothing stored, we need a new element. 
+inner.source_iter().next().map(|s| { +(self.consume_source_item)(s, inner) +})) +} +} + +struct PageBufferUnzipIterState { +iter: It, +value_iter_buffer: VecDeque, +def_level_iter_buffer: VecDeque, +rep_level_iter_buffer: VecDeque, +} + +impl> UnzipIterState<(V, L, L)> for PageBufferUnzipIterState { +type SourceIter = It; + +#[inline] +fn source_iter( self) -> Self::SourceIter { + self.iter +} +} + +fn unzip_iter>(it: It) -> ( +UnzipIter<(V, L, L), V, PageBufferUnzipIterState>, +UnzipIter<(V, L, L), L, PageBufferUnzipIterState>, +UnzipIter<(V, L, L), L, PageBufferUnzipIterState>, +) { +let shared_data = Rc::new(RefCell::new(PageBufferUnzipIterState { +iter: it, +value_iter_buffer: VecDeque::new(), +def_level_iter_buffer: VecDeque::new(), +rep_level_iter_buffer: VecDeque::new(), +})); + +let value_iter = UnzipIter::new( +shared_data.clone(), +|state| state.value_iter_buffer, +|(v, d, r), state| { +state.def_level_iter_buffer.push_back(d); +state.rep_level_iter_buffer.push_back(r); +v +}, +); + +let def_level_iter = UnzipIter::new( +shared_data.clone(), +|state| state.def_level_iter_buffer, +|(v, d, r), state| { +state.value_iter_buffer.push_back(v); +state.rep_level_iter_buffer.push_back(r); +d +}, +); + +let rep_level_iter = UnzipIter::new( +shared_data, +|state| state.rep_level_iter_buffer, +|(v, d, r), state| { +state.value_iter_buffer.push_back(v); +state.def_level_iter_buffer.push_back(d); +r +}, +); + +(value_iter, def_level_iter, rep_level_iter) +} + +pub trait ArrayConverter { +fn convert_value_bytes(, value_decoder: impl ValueDecoder, num_values: usize) -> Result; +} + +pub struct ArrowArrayReader<'a, C: ArrayConverter + 'a> { +column_desc: ColumnDescPtr, +data_type: ArrowType, +def_level_decoder: Box, +rep_level_decoder: Box, +value_decoder: Box, +last_def_levels: Option, +last_rep_levels: Option, +array_converter: C, +} + +pub(crate) struct ColumnChunkContext { +dictionary_values: Option>, +} + +impl ColumnChunkContext { +fn new() -> Self { +Self { +
[GitHub] [arrow] github-actions[bot] commented on pull request #10433: ARROW-12911: [Python] Export scalar aggregate options to pc.sum
github-actions[bot] commented on pull request #10433: URL: https://github.com/apache/arrow/pull/10433#issuecomment-852644263 https://issues.apache.org/jira/browse/ARROW-12911 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] cyb70289 opened a new pull request #10433: ARROW-12911: [Python] Export scalar aggregate options to pc.sum
cyb70289 opened a new pull request #10433: URL: https://github.com/apache/arrow/pull/10433 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643590723 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ## @@ -0,0 +1,614 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/builder.h" +#include "arrow/compute/kernels/common.h" +#include "arrow/util/time.h" +#include "arrow/vendored/datetime.h" + +namespace arrow { + +namespace compute { +namespace internal { + +using applicator::ScalarUnaryNotNull; +using applicator::SimpleUnary; +using arrow_vendored::date::days; +using arrow_vendored::date::floor; +using arrow_vendored::date::hh_mm_ss; +using arrow_vendored::date::sys_days; +using arrow_vendored::date::sys_time; +using arrow_vendored::date::trunc; +using arrow_vendored::date::weekday; +using arrow_vendored::date::weeks; +using arrow_vendored::date::year_month_day; +using arrow_vendored::date::years; +using arrow_vendored::date::literals::dec; +using arrow_vendored::date::literals::jan; +using arrow_vendored::date::literals::last; +using arrow_vendored::date::literals::mon; +using arrow_vendored::date::literals::thu; + +// Based on ScalarUnaryNotNullStateful. Adds timezone awareness. 
+template +struct ScalarUnaryStatefulTemporal { + using ThisType = ScalarUnaryStatefulTemporal; + using OutValue = typename GetOutputType::T; + + Op op; + explicit ScalarUnaryStatefulTemporal(Op op) : op(std::move(op)) {} + + template + struct ArrayExec { +static Status Exec(const ThisType& functor, KernelContext* ctx, const ArrayData& arg0, + Datum* out) { + const std::string timezone = + std::static_pointer_cast(arg0.type)->timezone(); + Status st = Status::OK(); + ArrayData* out_arr = out->mutable_array(); + auto out_data = out_arr->GetMutableValues(1); + + if (timezone.empty()) { +VisitArrayValuesInline( +arg0, +[&](int64_t v) { + *out_data++ = functor.op.template Call(ctx, v, ); +}, +[&]() { + // null + ++out_data; +}); + } else { +st = Status::Invalid("Timezone aware timestamps not supported. Timezone found: ", + timezone); + } + return st; +} + }; + + Status Scalar(KernelContext* ctx, const Scalar& arg0, Datum* out) { +const std::string timezone = +std::static_pointer_cast(arg0.type)->timezone(); +Status st = Status::OK(); +if (timezone.empty()) { + if (arg0.is_valid) { +int64_t arg0_val = UnboxScalar::Unbox(arg0); +BoxScalar::Box(this->op.template Call(ctx, arg0_val, ), +out->scalar().get()); + } +} else { + st = Status::Invalid("Timezone aware timestamps not supported. 
Timezone found: ", + timezone); +} +return st; + } + + Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +if (batch[0].kind() == Datum::ARRAY) { + return ArrayExec::Exec(*this, ctx, *batch[0].array(), out); +} else { + return Scalar(ctx, *batch[0].scalar(), out); +} + } +}; + +template +struct ScalarUnaryTemporal { + using OutValue = typename GetOutputType::T; + + static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +// Seed kernel with dummy state +ScalarUnaryStatefulTemporal kernel({}); +return kernel.Exec(ctx, batch, out); + } +}; + +// -- +// Extract year from timestamp + +template +struct Year { + template + static T Call(KernelContext*, int64_t arg, Status*) { +return static_cast(static_cast( + year_month_day(floor(sys_time(Duration{arg}))).year())); + } +}; + +// -- +// Extract month from timestamp + +template +struct Month { + template + static T Call(KernelContext*, int64_t arg, Status*) { +return static_cast(static_cast( + year_month_day(floor(sys_time(Duration{arg}))).month())); + } +}; + +// -- +//
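The kernels above all follow the same recipe: treat the stored int64 as a duration since the Unix epoch, floor it to the relevant granularity, and decompose. For intuition only, a rough Python equivalent for nanosecond-unit timestamps (this is not the Arrow implementation, which uses the vendored date library):

```python
from datetime import datetime, timezone

def extract_components(ns_since_epoch):
    # Split into whole seconds and the sub-second nanosecond remainder.
    # divmod floors toward negative infinity, matching the kernel's floor().
    secs, nanos = divmod(ns_since_epoch, 10**9)
    dt = datetime.fromtimestamp(secs, tz=timezone.utc)
    return {
        "year": dt.year, "month": dt.month, "day": dt.day,
        "hour": dt.hour, "minute": dt.minute, "second": dt.second,
        "nanosecond": nanos,
    }

print(extract_components(59_123_456_789))
# 59.123456789 s after the epoch, i.e. 1970-01-01T00:00:59.123456789
```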
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643590630

## File path: docs/source/cpp/compute.rst ##

@@ -637,6 +637,54 @@ String extraction
   e.g. 'letter' and 'digit' for the regular expression
   ``(?P<letter>[ab])(?P<digit>\d)``.
 
+Temporal component extraction
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions extract datetime components (year, month, day, etc) from timestamp type.
+Note: timezone information is currently ignored if present.
+
++---------------+-------+-------------+---------------+-------+
+| Function name | Arity | Input types | Output type   | Notes |
++===============+=======+=============+===============+=======+
+| year          | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| month         | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| day           | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| day_of_week   | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| day_of_year   | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| iso_year      | Unary | Temporal    | Numeric       | \(1)  |
++---------------+-------+-------------+---------------+-------+
+| iso_week      | Unary | Temporal    | Numeric       | \(1)  |
++---------------+-------+-------------+---------------+-------+
+| iso_calendar  | Unary | Temporal    | Scalar Struct | \(2)  |
++---------------+-------+-------------+---------------+-------+
+| quarter       | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| hour          | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| minute        | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| second        | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| millisecond   | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| microsecond   | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| nanosecond    | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+| subsecond     | Unary | Temporal    | Numeric       |       |
++---------------+-------+-------------+---------------+-------+
+
+* \(1) The ISO 8601 definition for week 01 is the week with the first Thursday
+  of the Gregorian year (i.e. of January) in it.
+
+  .. _Wikipedia ISO Week date: https://en.wikipedia.org/wiki/ISO_week_date#First_week

Review comment: Changed.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643590434 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc ## @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+
+#include <gtest/gtest.h>
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/formatting.h"
+
+namespace arrow {
+
+using internal::StringFormatter;
+
+class ScalarTemporalTest : public ::testing::Test {};
+
+namespace compute {
+
+TEST(ScalarTemporalTest, TestSimpleTemporalComponentExtraction) {
+  const char* times =
+      R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+          "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", null])";
+  auto unit = timestamp(TimeUnit::NANO);
+  auto timestamps = ArrayFromJSON(unit, times);
+  auto iso_calendar_type =
+      struct_({field("iso_year", int64()), field("iso_week", int64()),
+               field("weekday", int64())});
+
+  auto year = "[1970, 2000, 1899, 2033, null]";
+  auto month = "[1, 2, 1, 5, null]";
+  auto day = "[1, 29, 1, 18, null]";
+  auto day_of_week = "[4, 2, 7, 3, null]";
+  auto day_of_year = "[1, 60, 1, 138, null]";
+  auto iso_year = "[1970, 2000, 1899, 2033, null]";
+  auto iso_week = "[1, 9, 52, 20, null]";
+  auto iso_calendar = ArrayFromJSON(iso_calendar_type,
+        R"([{"iso_year": 1970, "iso_week": 1, "weekday": 4},

Review comment: `day_of_week`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
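The ISO week-date expectations used in tests like this one can be cross-checked independently: Python's `date.isocalendar()` implements the same ISO 8601 rules (week 01 is the week containing the first Thursday of the year). For example, for some of the timestamps above:

```python
from datetime import date

# 1970-01-01 was a Thursday, so it is day 4 of ISO week 1 of 1970
assert tuple(date(1970, 1, 1).isocalendar()) == (1970, 1, 4)

# 2000-02-29 falls in ISO week 9 of 2000, on a Tuesday
assert tuple(date(2000, 2, 29).isocalendar()) == (2000, 9, 2)

# 2033-05-18 (day 138 of the year) falls in ISO week 20, on a Wednesday
assert tuple(date(2033, 5, 18).isocalendar()) == (2033, 20, 3)
```

Wrapping the result in `tuple()` makes the comparison work on both older Pythons (plain tuple) and 3.9+ (named tuple).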
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643590355 ## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -462,5 +462,177 @@ ARROW_EXPORT Result FillNull(const Datum& values, const Datum& fill_value, ExecContext* ctx = NULLPTR); +/// \brief Year returns year value for each element of `values` +/// +/// \param[in] values input to extract year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Year(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Month returns month value for each element of `values` +/// +/// \param[in] values input to extract month from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Month(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Day returns day value for each element of `values` +/// +/// \param[in] values input to extract day from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Day(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief DayOfWeek returns day of the week value for each element of `values` +/// +/// \param[in] values input to extract dat of the week from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT Result DayOfWeek(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief DayOfYear returns day of year value for each element of `values` +/// +/// \param[in] values input to extract day of year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not 
yet finalized +ARROW_EXPORT Result DayOfYear(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief ISOYear returns ISO year value for each element of `values` +/// +/// \param[in] values input to extract ISO year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result ISOYear(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief ISOWeek returns ISO week of year value for each element of `values` +/// +/// \param[in] values input to extract ISO week of year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT Result ISOWeek(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief ISOCalendar returns a (year, ISO week, weekday) struct for each element of +/// `values` +/// +/// \param[in] values input to ISO calendar struct from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT Result ISOCalendar(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Quarter returns quarter of year value for each element of `values` Review comment: Done. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643590073 ## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -462,5 +462,177 @@ ARROW_EXPORT Result FillNull(const Datum& values, const Datum& fill_value, ExecContext* ctx = NULLPTR); +/// \brief Year returns year value for each element of `values` +/// +/// \param[in] values input to extract year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Year(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Month returns month value for each element of `values` Review comment: Done. ## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -462,5 +462,177 @@ ARROW_EXPORT Result FillNull(const Datum& values, const Datum& fill_value, ExecContext* ctx = NULLPTR); +/// \brief Year returns year value for each element of `values` +/// +/// \param[in] values input to extract year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Year(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Month returns month value for each element of `values` +/// +/// \param[in] values input to extract month from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Month(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Day returns day value for each element of `values` Review comment: Done. 
## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -462,5 +462,177 @@ ARROW_EXPORT Result FillNull(const Datum& values, const Datum& fill_value, ExecContext* ctx = NULLPTR); +/// \brief Year returns year value for each element of `values` +/// +/// \param[in] values input to extract year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Year(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Month returns month value for each element of `values` +/// +/// \param[in] values input to extract month from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Month(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Day returns day value for each element of `values` +/// +/// \param[in] values input to extract day from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Day(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief DayOfWeek returns day of the week value for each element of `values` Review comment: Done. 
## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -462,5 +462,177 @@ ARROW_EXPORT Result FillNull(const Datum& values, const Datum& fill_value, ExecContext* ctx = NULLPTR); +/// \brief Year returns year value for each element of `values` +/// +/// \param[in] values input to extract year from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Year(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Month returns month value for each element of `values` +/// +/// \param[in] values input to extract month from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Month(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief Day returns day value for each element of `values` +/// +/// \param[in] values input to extract day from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT +Result Day(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief DayOfWeek returns day of the week value for each element of `values` +/// +/// \param[in] values input to extract dat of the week from +/// \param[in] ctx the function execution context, optional +/// \return the resulting datum +/// +/// \since 4.0.0 +/// \note API not yet finalized +ARROW_EXPORT Result DayOfWeek(const Datum& values, ExecContext* ctx = NULLPTR); + +/// \brief DayOfYear returns day of year value for each element of `values` Review comment: Done. ## File path:
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #386: add more tests for window::shift and handle boundary cases
nevi-me commented on a change in pull request #386: URL: https://github.com/apache/arrow-rs/pull/386#discussion_r643581616

## File path: arrow/src/compute/kernels/window.rs ##

@@ -33,56 +32,120 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result}
 /// use arrow::compute::shift;
 ///
 /// let a: Int32Array = vec![Some(1), None, Some(4)].into();
+///
 /// // shift array 1 element to the right
 /// let res = shift(&a, 1).unwrap();
 /// let expected: Int32Array = vec![None, Some(1), None].into();
-/// assert_eq!(res.as_ref(), &expected)
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 1 element to the left
+/// let res = shift(&a, -1).unwrap();
+/// let expected: Int32Array = vec![None, Some(4), None].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 0 elements, although not recommended
+/// let res = shift(&a, 0).unwrap();
+/// let expected: Int32Array = vec![Some(1), None, Some(4)].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 3 elements to the right
+/// let res = shift(&a, 3).unwrap();
+/// let expected: Int32Array = vec![None, None, None].into();
+/// assert_eq!(res.as_ref(), &expected);
 /// ```
 pub fn shift<T>(values: &PrimitiveArray<T>, offset: i64) -> Result<ArrayRef>
 where
     T: ArrowPrimitiveType,
 {
-    // Compute slice
-    let slice_offset = clamp(-offset, 0, values.len() as i64) as usize;
-    let length = values.len() - abs(offset) as usize;
-    let slice = values.slice(slice_offset, length);
-
-    // Generate array with remaining `null` items
-    let nulls = abs(offset as i64) as usize;
+    let value_len = values.len() as i64;
+    if offset == 0 {
+        Ok(values.slice(0, values.len()))

Review comment: Same comment as #388

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #388: window::shift to work for all array types
nevi-me commented on a change in pull request #388: URL: https://github.com/apache/arrow-rs/pull/388#discussion_r643581161 ## File path: arrow/src/compute/kernels/window.rs ## @@ -33,56 +32,161 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result} /// use arrow::compute::shift; /// /// let a: Int32Array = vec![Some(1), None, Some(4)].into(); +/// /// // shift array 1 element to the right /// let res = shift(, 1).unwrap(); /// let expected: Int32Array = vec![None, Some(1), None].into(); -/// assert_eq!(res.as_ref(), ) +/// assert_eq!(res.as_ref(), ); +/// +/// // shift array 1 element to the left +/// let res = shift(, -1).unwrap(); +/// let expected: Int32Array = vec![None, Some(4), None].into(); +/// assert_eq!(res.as_ref(), ); +/// +/// // shift array 0 element, although not recommended +/// let res = shift(, 0).unwrap(); +/// let expected: Int32Array = vec![Some(1), None, Some(4)].into(); +/// assert_eq!(res.as_ref(), ); +/// +/// // shift array 3 element tot he right +/// let res = shift(, 3).unwrap(); +/// let expected: Int32Array = vec![None, None, None].into(); +/// assert_eq!(res.as_ref(), ); /// ``` -pub fn shift(values: , offset: i64) -> Result -where -T: ArrowPrimitiveType, -{ -// Compute slice -let slice_offset = clamp(-offset, 0, values.len() as i64) as usize; -let length = values.len() - abs(offset) as usize; -let slice = values.slice(slice_offset, length); - -// Generate array with remaining `null` items -let nulls = abs(offset as i64) as usize; - -let null_arr = new_null_array(::DATA_TYPE, nulls); - -// Concatenate both arrays, add nulls after if shift > 0 else before -if offset > 0 { -concat(&[null_arr.as_ref(), slice.as_ref()]) +pub fn shift(array: , offset: i64) -> Result { Review comment: This looks great! May you please add tests for struct and list types. 
## File path: arrow/src/compute/kernels/window.rs ##

@@ -33,56 +32,161 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result}
 /// use arrow::compute::shift;
 ///
 /// let a: Int32Array = vec![Some(1), None, Some(4)].into();
+///
 /// // shift array 1 element to the right
 /// let res = shift(&a, 1).unwrap();
 /// let expected: Int32Array = vec![None, Some(1), None].into();
-/// assert_eq!(res.as_ref(), &expected)
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 1 element to the left
+/// let res = shift(&a, -1).unwrap();
+/// let expected: Int32Array = vec![None, Some(4), None].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 0 elements, although not recommended
+/// let res = shift(&a, 0).unwrap();
+/// let expected: Int32Array = vec![Some(1), None, Some(4)].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 3 elements to the right
+/// let res = shift(&a, 3).unwrap();
+/// let expected: Int32Array = vec![None, None, None].into();
+/// assert_eq!(res.as_ref(), &expected);
 /// ```
-pub fn shift<T>(values: &PrimitiveArray<T>, offset: i64) -> Result<ArrayRef>
-where
-T: ArrowPrimitiveType,
-{
-// Compute slice
-let slice_offset = clamp(-offset, 0, values.len() as i64) as usize;
-let length = values.len() - abs(offset) as usize;
-let slice = values.slice(slice_offset, length);
-
-// Generate array with remaining `null` items
-let nulls = abs(offset as i64) as usize;
-
-let null_arr = new_null_array(&T::DATA_TYPE, nulls);
-
-// Concatenate both arrays, add nulls after if shift > 0 else before
-if offset > 0 {
-concat(&[null_arr.as_ref(), slice.as_ref()])
+pub fn shift(array: &dyn Array, offset: i64) -> Result<ArrayRef> {
+let value_len = array.len() as i64;
+if offset == 0 {
+Ok(array.slice(0, array.len()))

Review comment: If offset == 0, why not return `array.clone()`? `array.slice` does incur small overhead as it might have to descend into the array children for nested types (see #389 )

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
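The edge cases debated in these reviews (offset 0, |offset| at or beyond the array length, left vs right shifts) can be modeled outside arrow with plain vectors. The sketch below is illustrative only — `shift_vec` is an invented name operating on `Vec<Option<i32>>`, not the arrow-rs kernel — but it follows the semantics shown in the doc examples above:

```rust
// Plain-Rust model of the `shift` semantics: positive offsets move
// values to the right and fill the vacated slots with None; negative
// offsets mirror that to the left.
fn shift_vec(values: &[Option<i32>], offset: i64) -> Vec<Option<i32>> {
    let len = values.len();
    let nulls = offset.unsigned_abs() as usize;
    // Shifting by the full length (or more) leaves nothing but nulls.
    if nulls >= len {
        return vec![None; len];
    }
    if offset > 0 {
        // Keep the first `len - nulls` values, prepend `nulls` Nones.
        let mut out = vec![None; nulls];
        out.extend_from_slice(&values[..len - nulls]);
        out
    } else if offset < 0 {
        // Keep the last `len - nulls` values, append `nulls` Nones.
        let mut out = values[nulls..].to_vec();
        out.extend(std::iter::repeat(None).take(nulls));
        out
    } else {
        values.to_vec()
    }
}

fn main() {
    let a = vec![Some(1), None, Some(4)];
    assert_eq!(shift_vec(&a, 1), vec![None, Some(1), None]);
    assert_eq!(shift_vec(&a, -1), vec![None, Some(4), None]);
    assert_eq!(shift_vec(&a, 0), a);
    assert_eq!(shift_vec(&a, 3), vec![None, None, None]);
    println!("ok");
}
```

The `offset == 0` arm is the case the reviewers discuss above: in the real kernel a zero-copy view (`slice` or `clone`) avoids the copy that `to_vec` makes in this toy model.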
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643578583

## File path: cpp/src/arrow/compute/exec/expression.cc ##

@@ -613,6 +639,22 @@ std::vector<FieldRef> FieldsInExpression(const Expression& expr) {
 return fields;
 }

+std::vector<FieldRef> ParametersInExpression(const Expression& expr) {

Review comment: I may remove this, I don't think I'm using it anymore

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643578392

## File path: cpp/src/arrow/compute/exec/expression.cc ##

@@ -61,13 +61,22 @@ Expression call(std::string function, std::vector<Expression> arguments,
 call.function_name = std::move(function);
 call.arguments = std::move(arguments);
 call.options = std::move(options);
+
+  call.hash = std::hash<std::string>{}(call.function_name);
+  for (const auto& arg : call.arguments) {
+call.hash ^= arg.hash();
+  }
 return Expression(std::move(call));
 }

 const Datum* Expression::literal() const { return util::get_if<Datum>(impl_.get()); }

+const Expression::Parameter* Expression::parameter() const {

Review comment: a Parameter is a {field_ref, type, index} (with the last two properties only available after `Bind`). During simplification against guarantees (which are unbound), parameters must be compared by their field_refs

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
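The C++ hunk above builds an expression's hash by hashing the function name and folding in each argument's hash with `^`. A small standalone Rust sketch of that folding (the names here are invented for illustration, not Arrow API); note that XOR folding is order-insensitive, so swapping arguments leaves the combined hash unchanged:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Combine a function name's hash with precomputed argument hashes
// using XOR, mirroring the structure of the C++ diff above.
fn combined_hash(function_name: &str, arg_hashes: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    function_name.hash(&mut h);
    let mut out = h.finish();
    for a in arg_hashes {
        out ^= *a; // XOR fold: cheap, but commutative across arguments
    }
    out
}

fn main() {
    // Order-insensitive across arguments:
    assert_eq!(
        combined_hash("add", &[1, 2, 3]),
        combined_hash("add", &[3, 2, 1])
    );
    println!("ok");
}
```

That commutativity is a deliberate trade-off: it keeps construction cheap, at the cost of more collisions between calls that differ only in argument order.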
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643577382 ## File path: cpp/src/arrow/compute/exec/expression.cc ## @@ -510,7 +475,67 @@ Result Expression::Bind(const Schema& in_schema, return Bind(ValueDescr::Array(struct_(in_schema.fields())), exec_context); } -Result ExecuteScalarExpression(const Expression& expr, const Datum& input, +Result MakeExecBatch(const Schema& full_schema, const Datum& partial) { + ExecBatch out; + + if (partial.kind() == Datum::RECORD_BATCH) { +const auto& partial_batch = *partial.record_batch(); +out.length = partial_batch.num_rows(); + +for (const auto& field : full_schema.fields()) { + ARROW_ASSIGN_OR_RAISE(auto column, + FieldRef(field->name()).GetOneOrNone(partial_batch)); + + if (column) { +if (!column->type()->Equals(field->type())) { + // Referenced field was present but didn't have the expected type. + // This *should* be handled by readers, and will just be an error in the future. + ARROW_ASSIGN_OR_RAISE( + auto converted, + compute::Cast(column, field->type(), compute::CastOptions::Safe())); + column = converted.make_array(); +} +out.values.emplace_back(std::move(column)); + } else { +out.values.emplace_back(MakeNullScalar(field->type())); + } +} +return out; + } + + // wasteful but useful for testing: + if (partial.type()->id() == Type::STRUCT) { +if (partial.is_array()) { + ARROW_ASSIGN_OR_RAISE(auto partial_batch, + RecordBatch::FromStructArray(partial.make_array())); + + return MakeExecBatch(full_schema, partial_batch); +} + +if (partial.is_scalar()) { + ARROW_ASSIGN_OR_RAISE(auto partial_array, +MakeArrayFromScalar(*partial.scalar(), 1)); + ARROW_ASSIGN_OR_RAISE(auto out, MakeExecBatch(full_schema, partial_array)); + + for (Datum& value : out.values) { +if (value.is_scalar()) continue; +ARROW_ASSIGN_OR_RAISE(value, value.make_array()->GetScalar(0)); + } Review comment: This was as compact as I could write this case; if you see a way to compress/simplify it 
then I'll take it but the scalar/array cases are really just for testing purposes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643576195 ## File path: cpp/src/arrow/compute/exec/expression_test.cc ## @@ -165,6 +165,56 @@ TEST(ExpressionUtils, StripOrderPreservingCasts) { Expect(cast(field_ref("i32"), uint64()), no_change); } +TEST(ExpressionUtils, MakeExecBatch) { + auto Expect = [](std::shared_ptr partial_batch) { +SCOPED_TRACE(partial_batch->ToString()); +ASSERT_OK_AND_ASSIGN(auto batch, MakeExecBatch(*kBoringSchema, partial_batch)); + +ASSERT_EQ(batch.num_values(), kBoringSchema->num_fields()); +for (int i = 0; i < kBoringSchema->num_fields(); ++i) { + const auto& field = *kBoringSchema->field(i); + + SCOPED_TRACE("Field#" + std::to_string(i) + " " + field.ToString()); + + EXPECT_TRUE(batch[i].type()->Equals(field.type())) + << "Incorrect type " << batch[i].type()->ToString(); + + ASSERT_OK_AND_ASSIGN(auto col, FieldRef(field.name()).GetOneOrNone(*partial_batch)); Review comment: GetOneOrNone raises a descriptive error if duplicate field names are found, whereas GetFieldByName will just return null IIRC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643575745 ## File path: cpp/src/arrow/dataset/dataset_internal.h ## @@ -204,5 +204,35 @@ arrow::Result> GetFragmentScanOptions( return internal::checked_pointer_cast(source); } +class FragmentDataset : public Dataset { Review comment: Whether or not the component fragments are in memory is up to the fragments, so I don't think it's appropriate to inherit that here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] nevi-me commented on pull request #389: make slice work for nested types
nevi-me commented on pull request #389: URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-852610400

@jorgecarleitao I have the below failures, which are mostly related to `MutableArrayData`. I need your help when you have some time to spare. I worked on this over the weekend out of curiosity, so I'll continue working on the other tests this week.

```
failures:
    array::array_union::tests::test_dense_mixed_with_nulls_and_offset
    array::array_union::tests::test_sparse_mixed_with_nulls_and_offset
    array::transform::tests::test_list_append
    array::transform::tests::test_list_nulls_append
    array::transform::tests::test_struct_offset
    compute::kernels::cast::tests::test_list_cast_offsets
    compute::kernels::concat::tests::test_concat_struct_array_slices
    json::writer::tests::write_dictionary
    json::writer::tests::write_nested_list
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #381: Respect max rowgroup size in Arrow writer
codecov-commenter edited a comment on pull request #381: URL: https://github.com/apache/arrow-rs/pull/381#issuecomment-850950657 # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/381?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) Report > Merging [#381](https://codecov.io/gh/apache/arrow-rs/pull/381?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (530949e) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f26ffb3091ae355d246edc4a6fcc2c8e5b9bc570?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) (f26ffb3) will **increase** coverage by `0.02%`. > The diff coverage is `93.33%`. [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/381/graphs/tree.svg?width=650=150=pr=pq9V9qWZ1N_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/381?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) ```diff @@Coverage Diff @@ ## master #381 +/- ## == + Coverage 82.60% 82.62% +0.02% == Files 162 162 Lines 4419944275 +76 == + Hits3650936583 +74 - Misses 7690 7692 +2 ``` | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/381?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `82.78% <89.47%> (+0.19%)` | :arrow_up: | | 
[parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.04% <96.77%> (-0.08%)` | :arrow_down: | | [parquet/src/file/properties.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wcm9wZXJ0aWVzLnJz) | `95.74% <100.00%> (+0.01%)` | :arrow_up: | | [arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz) | `76.92% <0.00%> (-7.70%)` | :arrow_down: | | [arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `85.29% <0.00%> (ø)` | | | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <0.00%> (ø)` | | | [arrow/src/compute/kernels/window.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy93aW5kb3cucnM=) | `100.00% <0.00%> (ø)` | | | 
[parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `77.18% <0.00%> (+0.09%)` | :arrow_up: | | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `95.04% <0.00%> (+0.19%)` | :arrow_up: | | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `89.18% <0.00%> (+0.24%)` | :arrow_up: | | ... and [1
[GitHub] [arrow-rs] nevi-me opened a new pull request #389: make slice work for nested types
nevi-me opened a new pull request #389: URL: https://github.com/apache/arrow-rs/pull/389 # Which issue does this PR close? Corresponding issue might not yet exist, will check when finalising this PR # Rationale for this change `ArrayData::slice()` does not work for nested types, because only the `ArrayData::buffers` are updated with the new offset and length. This has caused a lot of issues in the past. This blocks us from being able to implement `RecordBatch::slice()`, and has led to creating #381 to sidestep this issue. # What changes are included in this PR? I propose a manual slice where we check if `ArrayData::child_data` is not empty, and then propagate the offset and length down to them. The only trick is with list and largelist, where we have to inspect the offset buffers to determine what the new offset and length should be. # Are there any user-facing changes? No UFC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
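The list/largelist "trick" described in the PR body — inspecting the offsets buffer to derive the child's new offset and length — comes down to a small piece of index arithmetic: slicing `length` lists starting at list index `offset` covers the child values from `offsets[offset]` to `offsets[offset + length]`. A standalone sketch of that arithmetic (assuming `i32` offsets; `child_range` is an illustrative helper, not an arrow-rs API):

```rust
// Given a list array's offsets buffer, compute the (offset, length)
// that a slice of the parent implies for the child values array.
fn child_range(offsets: &[i32], offset: usize, length: usize) -> (usize, usize) {
    let start = offsets[offset] as usize;
    let end = offsets[offset + length] as usize;
    (start, end - start)
}

fn main() {
    // Lists [[0, 1], [], [2, 3, 4]] are encoded with offsets [0, 2, 2, 5].
    let offsets = [0, 2, 2, 5];
    // Slicing off the last two lists: child values start at 2, length 3.
    assert_eq!(child_range(&offsets, 1, 2), (2, 3));
    // Slicing just the empty middle list: zero-length child range.
    assert_eq!(child_range(&offsets, 1, 1), (2, 0));
    println!("ok");
}
```

For struct arrays the same propagation is simpler: the parent's offset and length apply to each child directly, with no offsets buffer to consult.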
[GitHub] [arrow-rs] nevi-me commented on pull request #381: Respect max rowgroup size in Arrow writer
nevi-me commented on pull request #381: URL: https://github.com/apache/arrow-rs/pull/381#issuecomment-852597694 I've addressed feedback, PTAL @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] nevi-me commented on a change in pull request #381: Respect max rowgroup size in Arrow writer
nevi-me commented on a change in pull request #381: URL: https://github.com/apache/arrow-rs/pull/381#discussion_r643570618 ## File path: parquet/src/arrow/arrow_writer.rs ## @@ -1176,31 +1236,51 @@ mod tests { let raw_values: Vec<_> = (0..SMALL_SIZE as i64).collect(); let values = Arc::new(TimestampSecondArray::from_vec(raw_values, None)); -one_column_roundtrip("timestamp_second_single_column", values, false); +one_column_roundtrip( Review comment: I've changed the `SMALL_SIZE` to an odd number, and then changed the batch sizes of some tests. ## File path: parquet/src/arrow/arrow_writer.rs ## @@ -87,17 +92,31 @@ impl ArrowWriter { "Record batch schema does not match writer schema".to_string(), )); } -// compute the definition and repetition levels of the batch -let batch_level = LevelInfo::new_from_batch(batch); -let mut row_group_writer = self.writer.next_row_group()?; -for (array, field) in batch.columns().iter().zip(batch.schema().fields()) { -let mut levels = batch_level.calculate_array_levels(array, field); -// Reverse levels as we pop() them when writing arrays -levels.reverse(); -write_leaves( row_group_writer, array, levels)?; +// Track the number of rows being written in the batch. +// We currently do not have a way of slicing nested arrays, thus we +// track this manually. +let num_rows = batch.num_rows(); +let batches = (num_rows + self.max_row_group_size - 1) / self.max_row_group_size; Review comment: Added a note, thanks :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
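The `(num_rows + self.max_row_group_size - 1) / self.max_row_group_size` expression in the hunk above is integer ceiling division: it rounds up so that a trailing partial batch still becomes its own row group. A minimal sketch of the arithmetic (the function name is illustrative, not the writer's API):

```rust
// Ceiling division: how many row groups of at most `max_row_group_size`
// rows are needed to hold `num_rows` rows. Adding `max - 1` before the
// integer division rounds any remainder up to one extra group.
fn num_row_groups(num_rows: usize, max_row_group_size: usize) -> usize {
    (num_rows + max_row_group_size - 1) / max_row_group_size
}

fn main() {
    assert_eq!(num_row_groups(1024, 512), 2); // exact fit
    assert_eq!(num_row_groups(1025, 512), 3); // one spillover row
    assert_eq!(num_row_groups(1, 512), 1);    // partial group still written
    println!("ok");
}
```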
[GitHub] [arrow] jonkeane commented on a change in pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift
jonkeane commented on a change in pull request #10430: URL: https://github.com/apache/arrow/pull/10430#discussion_r643569671 ## File path: ci/docker/ubuntu-20.04-cpp.dockerfile ## @@ -75,6 +75,7 @@ RUN apt-get update -y -q && \ libcurl4-openssl-dev \ libgflags-dev \ libgoogle-glog-dev \ +libgrpc++-dev \ Review comment: Ah yes, thanks for that! Cruft leftover from me trying to get it working -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] BohuTANG edited a comment on issue #354: Implement some way to self assign tickets without having full edit access to github
BohuTANG edited a comment on issue #354: URL: https://github.com/apache/arrow-datafusion/issues/354#issuecomment-852561940 Hello, we can consider this bot https://github.com/datafuselabs/fusebots , who has an `/assignme` command to take the current issue away and then automatically add a `community-take` label, demo: https://github.com/datafuselabs/datafuse/issues/663#issuecomment-851260591 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou commented on a change in pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift
kou commented on a change in pull request #10430: URL: https://github.com/apache/arrow/pull/10430#discussion_r643565782 ## File path: ci/docker/ubuntu-20.04-cpp.dockerfile ## @@ -75,6 +75,7 @@ RUN apt-get update -y -q && \ libcurl4-openssl-dev \ libgflags-dev \ libgoogle-glog-dev \ +libgrpc++-dev \ Review comment: We don't need to install `libgrpc++-dev` and `protobuf-compiler-grpc` when we use bundled gRPC and ProtoBuf. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] BohuTANG commented on issue #354: Implement some way to self assign tickets without having full edit access to github
BohuTANG commented on issue #354: URL: https://github.com/apache/arrow-datafusion/issues/354#issuecomment-852561940 Hello, we can consider this bot https://github.com/datafuselabs/fusebots , which has an '/assignme' command to take the current issue and then automatically add a 'community-take' label, demo: https://github.com/datafuselabs/datafuse/issues/663#issuecomment-851260591 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on a change in pull request #9620: ARROW-11843: [C++] Provide async Parquet reader
lidavidm commented on a change in pull request #9620: URL: https://github.com/apache/arrow/pull/9620#discussion_r643540804 ## File path: cpp/src/parquet/file_reader.cc ## @@ -264,23 +264,92 @@ class SerializedFile : public ParquetFileReader::Contents { } } PARQUET_THROW_NOT_OK(cached_source_->Cache(ranges)); -return cached_source_->Wait(); } + ::arrow::Future<> WhenBuffered(const std::vector& row_groups, + const std::vector& column_indices) const { +if (!cached_source_) { + return ::arrow::Status::Invalid("Must call PreBuffer before WhenBuffered"); +} +std::vector<::arrow::io::ReadRange> ranges; +for (int row : row_groups) { + for (int col : column_indices) { +ranges.push_back( +ComputeColumnChunkRange(file_metadata_.get(), source_size_, row, col)); + } +} +return cached_source_->WaitFor(ranges); + } + + // Metadata/footer parsing. Divided up to separate sync/async paths, and to use + // exceptions for error handling (with the async path converting to Future/Status). + void ParseMetaData() { Review comment: I've updated the current encryption tests to test both Open and OpenAsync (though let's hope they don't make the test flaky again). 
## File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc ##

@@ -2331,6 +2330,63 @@ TEST(TestArrowReadWrite, GetRecordBatchReaderNoColumns) {
 ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }

+TEST(TestArrowReadWrite, GetRecordBatchGenerator) {
+ ArrowReaderProperties properties = default_arrow_reader_properties();
+ const int num_rows = 1024;
+ const int row_group_size = 512;
+ const int num_columns = 2;
+
+ std::shared_ptr<Table> table;
+ ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(num_columns, num_rows, 1, &table));
+
+ std::shared_ptr<Buffer> buffer;
+ ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(table, row_group_size,
+ default_arrow_writer_properties(), &buffer));
+
+ std::shared_ptr<FileReader> reader;
+ {
+std::unique_ptr<FileReader> unique_reader;
+FileReaderBuilder builder;
+ASSERT_OK(builder.Open(std::make_shared<BufferReader>(buffer)));
+ASSERT_OK(builder.properties(properties)->Build(&unique_reader));
+reader = std::move(unique_reader);
+ }
+
+ auto check_batches = [](const std::shared_ptr<::arrow::RecordBatch>& batch,
+ int num_columns, int num_rows) {
+ASSERT_NE(batch, nullptr);
+ASSERT_EQ(batch->num_columns(), num_columns);
+ASSERT_EQ(batch->num_rows(), num_rows);

Review comment: I've adjusted the test to check equality below.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #429: implement lead and lag built-in window function
alamb commented on pull request #429: URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-852533565 I plan to review this PR tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #470: Support semi join
alamb commented on pull request #470: URL: https://github.com/apache/arrow-datafusion/pull/470#issuecomment-852533237 I plan to review this tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #384: Implement faster arrow array reader
alamb commented on pull request #384: URL: https://github.com/apache/arrow-rs/pull/384#issuecomment-852532332 Thanks @yordan-pavlov -- I will try and set time aside tomorrow to review this PR. Sorry for the delay -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson closed pull request #10381: ARROW-12722: [R] Raise error when attemping to print table with duplicated naming
nealrichardson closed pull request #10381: URL: https://github.com/apache/arrow/pull/10381 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10432: ARROW-12924: [Gandiva][C++] Implement CONVERT_TIMEZONE SQL function in Gandiva
github-actions[bot] commented on pull request #10432: URL: https://github.com/apache/arrow/pull/10432#issuecomment-852494029 https://issues.apache.org/jira/browse/ARROW-12924 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jvictorhuguenin opened a new pull request #10432: ARROW-12924: [Gandiva][C++] Implement CONVERT_TIMEZONE SQL function in Gandiva
jvictorhuguenin opened a new pull request #10432: URL: https://github.com/apache/arrow/pull/10432 Converts timestamp to specified timezone. If the sourceTimezone parameter is not present, Dremio assumes the timestamp provided in the third parameter is in UTC format. The sourceTimezone and destinationTimezone parameters accept any of the following values: timezone name from sys.timezone_names, timezone abbreviation from sys.timezone_abbrevs, and offset, such as +02:00. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jonkeane commented on pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift
jonkeane commented on pull request #10430: URL: https://github.com/apache/arrow/pull/10430#issuecomment-852480193 I ran this on crossbow (locally since the GH actions comment bot isn't working right now https://issues.apache.org/jira/browse/ARROW-12919) and the only ubuntu crossbow builds that failed were all ones that have other current nightly failures. I've also added a defaults docs nightly to catch if we have a mismatch like this in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10386: ARROW-12859: [C++] Add ScalarFromJSON for testing
lidavidm commented on pull request #10386: URL: https://github.com/apache/arrow/pull/10386#issuecomment-852463069 For CastTo: I think it was actually the Cast kernel (string->type cast). I added some basic tests. I don't think there's a good way to hit the DCHECK because that would imply a JSON value converter appended two array values for a single JSON value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #474: Update k8s user guide to use deployments
codecov-commenter commented on pull request #474: URL: https://github.com/apache/arrow-datafusion/pull/474#issuecomment-852435958

# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474) Report

> Merging [#474](https://codecov.io/gh/apache/arrow-datafusion/pull/474) (56aeaea) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/16011120a1b73798049c5be49f9548b00f8a0a00) (1601112) will **not change** coverage.
> The diff coverage is `0.00%`.

```diff
@@           Coverage Diff           @@
##           master     #474   +/-   ##
=======================================
  Coverage   75.84%   75.84%
=======================================
  Files         153      153
  Lines       25872    25872
=======================================
  Hits        19622    19622
  Misses       6250     6250
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| [ballista/rust/executor/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/474/diff#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvbWFpbi5ycw==) | `0.00% <0.00%> (ø)` | |
| [ballista/rust/scheduler/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/474/diff#diff-YmFsbGlzdGEvcnVzdC9zY2hlZHVsZXIvc3JjL21haW4ucnM=) | `0.00% <0.00%> (ø)` | |

[Continue to review the full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`

Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474). Last update [1601112...56aeaea](https://codecov.io/gh/apache/arrow-datafusion/pull/474). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #436: Remove redundant filters (e.g. c>5 AND c>5 --> c>5)
codecov-commenter commented on pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#issuecomment-852434776

# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/436) Report

> Merging [#436](https://codecov.io/gh/apache/arrow-datafusion/pull/436) (25adf53) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/db4f098d38993b96ce1134c4bc7bf5c6579509cf) (db4f098) will **increase** coverage by `1.00%`.
> The diff coverage is `86.61%`.

```diff
@@            Coverage Diff             @@
##           master     #436      +/-   ##
==========================================
+ Coverage   74.94%   75.94%   +1.00%
==========================================
  Files         146      154       +8
  Lines       24314    26126    +1812
==========================================
+ Hits        18221    19842    +1621
- Misses       6093     6284     +191
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| datafusion/src/execution/context.rs | `92.08% <ø> (+0.03%)` | :arrow_up: |
| datafusion/src/optimizer/simplify\_expressions.rs | `86.61% <86.61%> (ø)` | |
| datafusion-cli/src/print\_format.rs | `81.25% <0.00%> (-9.17%)` | :arrow_down: |
| datafusion/src/physical\_plan/mod.rs | `78.70% <0.00%> (-4.06%)` | :arrow_down: |
| datafusion/src/physical\_plan/window\_functions.rs | `85.71% <0.00%> (-3.01%)` | :arrow_down: |
| ...tafusion/src/physical\_plan/datetime\_expressions.rs | `67.29% <0.00%> (-2.52%)` | :arrow_down: |
| datafusion/src/physical\_plan/common.rs | `84.21% <0.00%> (-2.00%)` | :arrow_down: |
| datafusion/src/physical\_plan/repartition.rs | `82.45% <0.00%> (-1.89%)` | :arrow_down: |
| ballista/rust/scheduler/src/planner.rs | `66.91% <0.00%> (-0.74%)` | :arrow_down: |
[GitHub] [arrow-datafusion] edrevo commented on issue #72: Update link in Ballista donation blog post
edrevo commented on issue #72: URL: https://github.com/apache/arrow-datafusion/issues/72#issuecomment-852428471 I'd say this is fixed, right?
[GitHub] [arrow-datafusion] edrevo commented on pull request #474: Update k8s user guide to use deployments
edrevo commented on pull request #474: URL: https://github.com/apache/arrow-datafusion/pull/474#issuecomment-852426880 cc @andygrove
[GitHub] [arrow] bkietz commented on a change in pull request #10410: ARROW-10640: [C++] A "where" kernel to combine two arrays based on a mask
bkietz commented on a change in pull request #10410: URL: https://github.com/apache/arrow/pull/10410#discussion_r643454193 ## File path: cpp/src/arrow/compute/kernels/scalar_if_else_test.cc ## @@ -0,0 +1,264 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include + +namespace arrow { +namespace compute { + +void CheckIfElseOutput(const Datum& cond, const Datum& left, const Datum& right, + const Datum& expected) { + ASSERT_OK_AND_ASSIGN(Datum datum_out, IfElse(cond, left, right)); + if (datum_out.is_array()) { +std::shared_ptr result = datum_out.make_array(); +ASSERT_OK(result->ValidateFull()); +std::shared_ptr expected_ = expected.make_array(); +AssertArraysEqual(*expected_, *result, /*verbose=*/true); + } else { // expecting scalar +const std::shared_ptr& result = datum_out.scalar(); +const std::shared_ptr& expected_ = expected.scalar(); +AssertScalarsEqual(*expected_, *result, /*verbose=*/true); + } +} + +class TestIfElseKernel : public ::testing::Test {}; + +template +class TestIfElsePrimitive : public ::testing::Test {}; + +using PrimitiveTypes = ::testing::Types; + +TYPED_TEST_SUITE(TestIfElsePrimitive, PrimitiveTypes); + +TYPED_TEST(TestIfElsePrimitive, IfElseFixedSizeRand) { + using ArrayType = typename TypeTraits::ArrayType; + auto type = TypeTraits::type_singleton(); + + random::RandomArrayGenerator rand(/*seed=*/0); + int64_t len = 1000; + auto cond = std::static_pointer_cast( + rand.ArrayOf(boolean(), len, /*null_probability=*/0.01)); + auto left = std::static_pointer_cast( + rand.ArrayOf(type, len, /*null_probability=*/0.01)); + auto right = std::static_pointer_cast( + rand.ArrayOf(type, len, /*null_probability=*/0.01)); + + typename TypeTraits::BuilderType builder; + + for (int64_t i = 0; i < len; ++i) { +if (!cond->IsValid(i) || (cond->Value(i) && !left->IsValid(i)) || +(!cond->Value(i) && !right->IsValid(i))) { + ASSERT_OK(builder.AppendNull()); + continue; +} + +if (cond->Value(i)) { + ASSERT_OK(builder.Append(left->Value(i))); +} else { + ASSERT_OK(builder.Append(right->Value(i))); +} + } + ASSERT_OK_AND_ASSIGN(auto expected_data, builder.Finish()); + + CheckIfElseOutput(cond, left, right, expected_data); +} + +void CheckWithDifferentShapes(const 
std::shared_ptr& cond, + const std::shared_ptr& left, + const std::shared_ptr& right, + const std::shared_ptr& expected) { + // this will check for whole arrays, every scalar at i'th index and slicing (offset) + CheckScalar("if_else", {cond, left, right}, expected); + + auto len = left->length(); + + enum { COND_SCALAR = 1, LEFT_SCALAR = 2, RIGHT_SCALAR = 4 }; + for (int mask = 0; mask < (COND_SCALAR | LEFT_SCALAR | RIGHT_SCALAR); ++mask) { +for (int64_t cond_idx = 0; cond_idx < len; ++cond_idx) { + Datum cond_in, cond_bcast; + if (mask & COND_SCALAR) { +ASSERT_OK_AND_ASSIGN(cond_in, cond->GetScalar(cond_idx)); +ASSERT_OK_AND_ASSIGN(cond_bcast, MakeArrayFromScalar(*cond_in.scalar(), len)); + } else { +cond_in = cond_bcast = cond; + } Review comment: Please add SCOPED_TRACEs here and below for left/right so that failures within the loop are more informative about which iteration failed ```suggestion Datum cond_in, cond_bcast; std::string trace_msg = "Cond"; if (mask & COND_SCALAR) { ASSERT_OK_AND_ASSIGN(cond_in, cond->GetScalar(cond_idx)); ASSERT_OK_AND_ASSIGN(cond_bcast, MakeArrayFromScalar(*cond_in.scalar(), len)); trace_msg += "@" + std::to_string(cond_idx) + "=" + cond_in.scalar()->ToString(); } else { cond_in = cond_bcast = cond; } SCOPED_TRACE(trace_msg); ``` ## File path: cpp/src/arrow/compute/kernels/scalar_if_else.cc ## @@ -699,14 +699,58 @@ struct IfElseFunctor> { // AAS static Status Call(KernelContext* ctx, const ArrayData& cond, const ArrayData& left, const Scalar& right,
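The expected-value loop in the test above encodes `if_else`'s null semantics: the output at an index is null when the condition is null or when the selected branch is null; otherwise it takes the left value where the condition is true and the right value where it is false. A minimal standalone Rust sketch of those semantics (illustrative only, not Arrow's kernel code):

```rust
/// Element-wise if_else over nullable slices: None models an Arrow null.
/// Output is None when the condition is null or the chosen side is null.
fn if_else(
    cond: &[Option<bool>],
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Vec<Option<i32>> {
    cond.iter()
        .zip(left.iter().zip(right.iter()))
        .map(|(c, (l, r))| match c {
            Some(true) => *l,   // may itself be None -> null output
            Some(false) => *r,  // may itself be None -> null output
            None => None,       // null condition -> null output
        })
        .collect()
}

fn main() {
    let cond = vec![Some(true), Some(false), None, Some(true)];
    let left = vec![Some(1), Some(2), Some(3), None];
    let right = vec![Some(10), Some(20), Some(30), Some(40)];
    assert_eq!(
        if_else(&cond, &left, &right),
        vec![Some(1), Some(20), None, None]
    );
}
```

The same rule is what the builder loop in the C++ test expresses with `AppendNull` before appending the selected side.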
[GitHub] [arrow-datafusion] edrevo opened a new pull request #474: K8s deployments
edrevo opened a new pull request #474: URL: https://github.com/apache/arrow-datafusion/pull/474

# Which issue does this PR close?

Closes #473.

# What changes are included in this PR?

- Rename port to bind-port, since the configure_me crate was trying to parse an env var that was being autopopulated by Kubernetes
- Change the user guide for k8s
[GitHub] [arrow-datafusion] edrevo opened a new issue #473: [Ballista] Use deployments in k8s user guide
edrevo opened a new issue #473: URL: https://github.com/apache/arrow-datafusion/issues/473 The executors can now be used as a k8s deployment, which is a more flexible and simpler k8s primitive.
[GitHub] [arrow] rok commented on pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on pull request #10176: URL: https://github.com/apache/arrow/pull/10176#issuecomment-852423550 Thanks for the review @pitrou! I've addressed some of the comments and I'll try to finish the rest today.
[GitHub] [arrow-datafusion] jgoday commented on a change in pull request #436: Remove redundant filters (e.g. c>5 AND c>5 --> c>5)
jgoday commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643455939 ## File path: datafusion/src/optimizer/mod.rs ## @@ -25,4 +25,5 @@ pub mod hash_build_probe_order; pub mod limit_push_down; pub mod optimizer; pub mod projection_push_down; +pub mod remove_duplicate_filters; Review comment: Ok, changed to simplify_expressions.
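The rewrite this optimizer pass performs on `c>5 AND c>5` boils down to de-duplicating equal conjuncts. A minimal Rust sketch of the idea, using a hypothetical `Expr` type rather than DataFusion's actual `Expr` API:

```rust
// Toy expression tree; DataFusion's real Expr enum is far richer.
#[derive(PartialEq, Debug)]
enum Expr {
    Gt(String, i64),    // column > literal
    And(Vec<Expr>),     // conjunction of predicates
}

/// Collapse duplicate conjuncts: And[p, p, q] -> And[p, q]; And[p] -> p.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::And(terms) => {
            let mut unique: Vec<Expr> = Vec::new();
            for t in terms.into_iter().map(simplify) {
                if !unique.contains(&t) {
                    unique.push(t);
                }
            }
            if unique.len() == 1 {
                unique.pop().unwrap()
            } else {
                Expr::And(unique)
            }
        }
        other => other,
    }
}

fn main() {
    // c > 5 AND c > 5  -->  c > 5
    let e = Expr::And(vec![Expr::Gt("c".into(), 5), Expr::Gt("c".into(), 5)]);
    assert_eq!(simplify(e), Expr::Gt("c".into(), 5));
}
```

The actual `simplify_expressions` pass works on logical-plan filter expressions and handles more rewrites than duplicate removal; this only sketches the `p AND p -> p` case named in the PR title.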
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643455290 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ## @@ -0,0 +1,614 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/builder.h" +#include "arrow/compute/kernels/common.h" +#include "arrow/util/time.h" +#include "arrow/vendored/datetime.h" + +namespace arrow { + +namespace compute { +namespace internal { + +using applicator::ScalarUnaryNotNull; +using applicator::SimpleUnary; +using arrow_vendored::date::days; +using arrow_vendored::date::floor; +using arrow_vendored::date::hh_mm_ss; +using arrow_vendored::date::sys_days; +using arrow_vendored::date::sys_time; +using arrow_vendored::date::trunc; +using arrow_vendored::date::weekday; +using arrow_vendored::date::weeks; +using arrow_vendored::date::year_month_day; +using arrow_vendored::date::years; +using arrow_vendored::date::literals::dec; +using arrow_vendored::date::literals::jan; +using arrow_vendored::date::literals::last; +using arrow_vendored::date::literals::mon; +using arrow_vendored::date::literals::thu; + +// Based on ScalarUnaryNotNullStateful. Adds timezone awareness. 
+template +struct ScalarUnaryStatefulTemporal { + using ThisType = ScalarUnaryStatefulTemporal; + using OutValue = typename GetOutputType::T; + + Op op; + explicit ScalarUnaryStatefulTemporal(Op op) : op(std::move(op)) {} + + template + struct ArrayExec { +static Status Exec(const ThisType& functor, KernelContext* ctx, const ArrayData& arg0, + Datum* out) { + const std::string timezone = + std::static_pointer_cast(arg0.type)->timezone(); + Status st = Status::OK(); + ArrayData* out_arr = out->mutable_array(); + auto out_data = out_arr->GetMutableValues(1); + + if (timezone.empty()) { +VisitArrayValuesInline( +arg0, +[&](int64_t v) { + *out_data++ = functor.op.template Call(ctx, v, ); +}, +[&]() { + // null + ++out_data; +}); + } else { +st = Status::Invalid("Timezone aware timestamps not supported. Timezone found: ", + timezone); + } + return st; +} + }; + + Status Scalar(KernelContext* ctx, const Scalar& arg0, Datum* out) { +const std::string timezone = +std::static_pointer_cast(arg0.type)->timezone(); +Status st = Status::OK(); +if (timezone.empty()) { + if (arg0.is_valid) { +int64_t arg0_val = UnboxScalar::Unbox(arg0); +BoxScalar::Box(this->op.template Call(ctx, arg0_val, ), +out->scalar().get()); + } +} else { + st = Status::Invalid("Timezone aware timestamps not supported. 
Timezone found: ", + timezone); +} +return st; + } + + Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +if (batch[0].kind() == Datum::ARRAY) { + return ArrayExec::Exec(*this, ctx, *batch[0].array(), out); +} else { + return Scalar(ctx, *batch[0].scalar(), out); +} + } +}; + +template +struct ScalarUnaryTemporal { + using OutValue = typename GetOutputType::T; + + static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +// Seed kernel with dummy state +ScalarUnaryStatefulTemporal kernel({}); +return kernel.Exec(ctx, batch, out); + } +}; + +// -- +// Extract year from timestamp + +template +struct Year { + template + static T Call(KernelContext*, int64_t arg, Status*) { +return static_cast(static_cast( + year_month_day(floor(sys_time(Duration{arg}))).year())); + } +}; + +// -- +// Extract month from timestamp + +template +struct Month { + template + static T Call(KernelContext*, int64_t arg, Status*) { +return static_cast(static_cast( + year_month_day(floor(sys_time(Duration{arg}))).month())); + } +}; + +// -- +//
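The `Year` and `Month` ops above convert an epoch-based duration to a civil date through the vendored Hinnant date library (`year_month_day(floor<days>(...))`). The underlying days-to-civil-date algorithm can be sketched standalone; this Rust port of the well-known `civil_from_days` routine is illustrative, not Arrow code:

```rust
/// Convert days since the Unix epoch (1970-01-01) to (year, month, day),
/// following Howard Hinnant's civil_from_days algorithm, which the
/// vendored date library used by these kernels is built on.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = if z >= 0 { z } else { z - 146_096 } / 146_097;
    let doe = z - era * 146_097; // day-of-era   [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365; // [0, 399]
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day-of-year [0, 365]
    let mp = (5 * doy + 2) / 153; // March-based month [0, 11]
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32; // [1, 31]
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32; // [1, 12]
    (if m <= 2 { y + 1 } else { y }, m, d)
}

fn main() {
    assert_eq!(civil_from_days(0), (1970, 1, 1));
    // 1_609_459_200 s = 2021-01-01T00:00:00Z
    assert_eq!(civil_from_days(1_609_459_200 / 86_400), (2021, 1, 1));
}
```

A timestamp kernel first floors the stored integer (in its unit) to whole days, then applies this conversion and picks out the requested component, which is what the `year_month_day(floor<days>(sys_time(...)))` chain in the diff does.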
[GitHub] [arrow-datafusion] edrevo commented on issue #472: [Ballista] Improve task and job metadata
edrevo commented on issue #472: URL: https://github.com/apache/arrow-datafusion/issues/472#issuecomment-852422276 cc @pradomota. I'm opening this one in case you want to take a stab at it. We can do pair programming if you want.
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643454925 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ##
[GitHub] [arrow-datafusion] edrevo opened a new issue #472: [Ballista] Improve task and job metadata
edrevo opened a new issue #472: URL: https://github.com/apache/arrow-datafusion/issues/472

The task and job status we save in the scheduler state is currently lacking. See: https://github.com/apache/arrow-datafusion/blob/16011120a1b73798049c5be49f9548b00f8a0a00/ballista/rust/core/proto/ballista.proto#L669-L680 and https://github.com/apache/arrow-datafusion/blob/16011120a1b73798049c5be49f9548b00f8a0a00/ballista/rust/core/proto/ballista.proto#L614-L624

The job status should include:

- Start time for the job (the moment it was queued)
- Finish time for the job (for the completed state)
- Number of pending/running/completed/failed tasks and stages (for the running state)

The task status should include:

- Start time for the task
- Finish time for the task (for the completed state)
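As a sketch of the shape the richer status could take (hypothetical Rust types with illustrative field names, not the actual ballista.proto messages):

```rust
/// Hypothetical richer job status, per the fields the issue asks for.
#[derive(Debug, PartialEq)]
enum JobStatus {
    Queued {
        start_time_ms: u64, // when the job entered the queue
    },
    Running {
        start_time_ms: u64,
        pending: u32,
        running: u32,
        completed: u32,
        failed: u32, // task/stage counters for the running state
    },
    Completed {
        start_time_ms: u64,
        finish_time_ms: u64,
    },
}

/// Total duration is only well-defined once the job has completed.
fn duration_ms(status: &JobStatus) -> Option<u64> {
    match status {
        JobStatus::Completed { start_time_ms, finish_time_ms } => {
            Some(finish_time_ms - start_time_ms)
        }
        _ => None,
    }
}

fn main() {
    let done = JobStatus::Completed { start_time_ms: 1_000, finish_time_ms: 4_500 };
    assert_eq!(duration_ms(&done), Some(3_500));
    assert_eq!(duration_ms(&JobStatus::Queued { start_time_ms: 0 }), None);
}
```

In the real scheduler these would be protobuf messages persisted in the scheduler state store, with an analogous per-task status carrying start and finish times.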
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643453195 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ##
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643453054 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ##
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643452934 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ## (quotes the same `ScalarUnaryStatefulTemporal` hunk shown in the previous comment)
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643452234 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ## (quotes the same `ScalarUnaryStatefulTemporal` hunk shown in the previous comments, ending at the `Year` functor's double `static_cast`) Review comment: date.h won't allow casting to types other than int32 and throws at compile-time:
```
../src/arrow/compute/kernels/scalar_temporal.cc:144:78: error: invalid static_cast from type 'arrow_vendored::date::year' to type 'const int64_t' {aka 'const long int'}
  144 |         year_month_day(floor<days>(sys_time<Duration>(Duration{arg}))).year()));
```
I'll take a look if there's a recommended way to deal with this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] jgoday commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
jgoday commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643450066 ## File path: datafusion/src/optimizer/remove_duplicate_filters.rs ## @@ -0,0 +1,611 @@ +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Remove duplicate filters optimizer rule + +use crate::execution::context::ExecutionProps; +use crate::logical_plan::LogicalPlan; +use crate::logical_plan::{lit, Expr}; +use crate::optimizer::optimizer::OptimizerRule; +use crate::optimizer::utils; +use crate::optimizer::utils::optimize_explain; +use crate::scalar::ScalarValue; +use crate::{error::Result, logical_plan::Operator}; + +/// Remove duplicate filters optimizer. +/// # Introduction +/// It uses boolean algebra laws to simplify or reduce the number of terms in expressions. +/// +/// Filter: #b Gt Int32(2) And #b Gt Int32(2) +/// is optimized to +/// Filter: #b Gt Int32(2) +pub struct RemoveDuplicateFilters {} + +fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool { +match expr { +Expr::BinaryExpr { +left, +op: Operator::And, +right, +} => expr_contains(left, needle) || expr_contains(right, needle), +Expr::BinaryExpr { +left, +op: Operator::Or, +right, +} => expr_contains(left, needle) || expr_contains(right, needle), +_ => expr == needle, +} +} + +fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> { +match expr { +Expr::BinaryExpr { .. 
} => Some(expr), +_ => None, +} +} + +fn operator_is_boolean(op: &Operator) -> bool { +op == &Operator::And || op == &Operator::Or +} + +fn is_one<'a>(s: &'a Expr) -> bool { +match s { Review comment: I like that last option :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
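The rule under review applies boolean-algebra identities such as `x AND x -> x` and `x AND true -> x`. A toy Python sketch of the same bottom-up rewriting over a tuple-encoded expression tree (the encoding and names here are illustrative, not DataFusion's API):

```python
# Expressions: ("and", l, r), ("or", l, r), leaves like ("gt", col, lit), True, False.
def simplify(e):
    if not (isinstance(e, tuple) and e[0] in ("and", "or")):
        return e  # leaf predicate or boolean literal: nothing to do
    op, left, right = e[0], simplify(e[1]), simplify(e[2])
    if op == "and":
        if left is False or right is False:
            return False              # x AND false -> false
        if left is True:
            return right              # true AND x -> x
        if right is True:
            return left
        if left == right:
            return left               # x AND x -> x  (the c > 5 AND c > 5 case)
        return ("and", left, right)
    # op == "or"
    if left is True or right is True:
        return True                   # x OR true -> true
    if left is False:
        return right
    if right is False:
        return left
    if left == right:
        return left                   # x OR x -> x
    return ("or", left, right)

print(simplify(("and", ("gt", "c", 5), ("gt", "c", 5))))  # -> ('gt', 'c', 5)
```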
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643449818 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc ## (quotes the license header and `using` declarations shown in the earlier comments, down to the line `// Based on ScalarUnaryNotNullStateful. Adds timezone awareness.`) Review comment: Done.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643449669 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc ## @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include <gtest/gtest.h> + +#include "arrow/compute/api_scalar.h" +#include "arrow/compute/kernels/test_util.h" +#include "arrow/util/checked_cast.h" +#include "arrow/util/formatting.h" + +namespace arrow { + +using internal::StringFormatter; + +class ScalarTemporalTest : public ::testing::Test {}; + +namespace compute { + +TEST(ScalarTemporalTest, TestSimpleTemporalComponentExtraction) { + const char* times = + R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9", + "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", null])"; + auto unit = timestamp(TimeUnit::NANO); + auto timestamps = ArrayFromJSON(unit, times); + auto iso_calendar_type = + struct_({field("iso_year", int64()), field("iso_week", int64()), + field("weekday", int64())}); + + auto year = "[1970, 2000, 1899, 2033, null]"; + auto month = "[1, 2, 1, 5, null]"; + auto day = "[1, 29, 1, 18, null]"; + auto day_of_week = "[4, 2, 7, 3, null]"; + auto day_of_year = "[1, 60, 1, 138, null]"; + auto iso_year = "[1970, 2000, 1899, 2033, null]"; + auto iso_week = "[1, 9, 52, 20, null]"; + auto iso_calendar = ArrayFromJSON(iso_calendar_type, +R"([{"iso_year": 1970, "iso_week": 1, "weekday": 4}, +{"iso_year": 2000, "iso_week": 9, "weekday": 2}, +{"iso_year": 1899, "iso_week": 52, "weekday": 7}, +{"iso_year": 2033, "iso_week": 20, "weekday": 3}, null])"); + auto quarter = "[1, 1, 1, 2, null]"; + auto hour = "[0, 23, 0, 3, null]"; + auto minute = "[0, 23, 59, 33, null]"; + auto second = "[59.123456789, 23.9, 20.001001001, 20.0, null]"; + auto millisecond = "[123, 999, 1, 0, null]"; + auto microsecond = "[456, 999, 1, 0, null]"; + auto nanosecond = "[789, 999, 1, 0, null]"; + auto subsecond = "[123456789, 9, 1001001, 0, null]"; + + CheckScalarUnary("year", unit, times, int64(), year); + CheckScalarUnary("month", unit, times, int64(), month); + CheckScalarUnary("day", unit, times, int64(), day); + CheckScalarUnary("day_of_week", unit, times, int64(), day_of_week); + CheckScalarUnary("day_of_year", unit, times, int64(),
day_of_year); + CheckScalarUnary("iso_year", unit, times, int64(), iso_year); + CheckScalarUnary("iso_week", unit, times, int64(), iso_week); + CheckScalarUnary("iso_calendar", timestamps, iso_calendar); + CheckScalarUnary("quarter", unit, times, int64(), quarter); + CheckScalarUnary("hour", unit, times, int64(), hour); + CheckScalarUnary("minute", unit, times, int64(), minute); + CheckScalarUnary("second", unit, times, float64(), second); + CheckScalarUnary("millisecond", unit, times, int64(), millisecond); + CheckScalarUnary("microsecond", unit, times, int64(), microsecond); + CheckScalarUnary("nanosecond", unit, times, int64(), nanosecond); + CheckScalarUnary("subsecond", unit, times, int64(), subsecond); Review comment: We'll probably need to calculate these to mimic pandas and R. We can calculate them here or later. [See discussion.](https://github.com/apache/arrow/pull/10176#discussion_r633564079) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
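Several of the expected values in the test above can be cross-checked against Python's standard library, which implements the same ISO 8601 week-date rules the kernel targets (only the first two test rows are checked here):

```python
from datetime import date

# iso_calendar is (iso_year, iso_week, iso_weekday), Monday = 1 ... Sunday = 7.
assert tuple(date(1970, 1, 1).isocalendar()) == (1970, 1, 4)   # weekday 4 = Thursday
assert tuple(date(2000, 2, 29).isocalendar()) == (2000, 9, 2)  # weekday 2 = Tuesday

# day_of_year for the same two rows, matching "[1, 60, ...]" above.
assert date(1970, 1, 1).timetuple().tm_yday == 1
assert date(2000, 2, 29).timetuple().tm_yday == 60
```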
[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified
thisisnic commented on a change in pull request #10326: URL: https://github.com/apache/arrow/pull/10326#discussion_r643448875 ## File path: r/R/util.R ## @@ -110,3 +110,15 @@ handle_embedded_nul_error <- function(e) { } stop(e) } + +handle_parquet_io_error <- function(e, format) { Review comment: Good point! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type
rok commented on a change in pull request #10176: URL: https://github.com/apache/arrow/pull/10176#discussion_r643448314 ## File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc ## (quotes the same license header and `TestSimpleTemporalComponentExtraction` hunk shown in the previous comment, followed by:)
+
+TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
+  std::string timezone = "Etc/UTC-2";
+  const char* times =
+      R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+      "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", null])";
+  auto unit = timestamp(TimeUnit::NANO, timezone);
+  auto timestamps = ArrayFromJSON(unit, times);
+
+  ASSERT_RAISES(Invalid, Year(timestamps));
+  ASSERT_RAISES(Invalid, Month(timestamps));
+  ASSERT_RAISES(Invalid, Day(timestamps));
+  ASSERT_RAISES(Invalid, DayOfWeek(timestamps));
+  ASSERT_RAISES(Invalid, DayOfYear(timestamps));
+  ASSERT_RAISES(Invalid, ISOYear(timestamps));
+  ASSERT_RAISES(Invalid, ISOWeek(timestamps));
+  ASSERT_RAISES(Invalid, ISOCalendar(timestamps));
+  ASSERT_RAISES(Invalid, Quarter(timestamps));
+  ASSERT_RAISES(Invalid, Hour(timestamps));
+  ASSERT_RAISES(Invalid, Minute(timestamps));
+  ASSERT_RAISES(Invalid, Second(timestamps));
+  ASSERT_RAISES(Invalid, Millisecond(timestamps));
+
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643448100 ## File path: cpp/src/arrow/compute/exec/expression.h ## @@ -207,11 +218,22 @@ Result<Expression> SimplifyWithGuarantee(Expression, // Execution -/// Execute a scalar expression against the provided state and input Datum. This +/// Ensure that a RecordBatch (which may have missing or incorrectly ordered columns) Review comment: will do, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643447263 ## File path: cpp/src/arrow/compute/exec/expression.cc ## @@ -510,7 +475,67 @@ Result<Expression> Expression::Bind(const Schema& in_schema, return Bind(ValueDescr::Array(struct_(in_schema.fields())), exec_context); } -Result<Datum> ExecuteScalarExpression(const Expression& expr, const Datum& input, +Result<ExecBatch> MakeExecBatch(const Schema& full_schema, const Datum& partial) { + ExecBatch out; + + if (partial.kind() == Datum::RECORD_BATCH) { +const auto& partial_batch = *partial.record_batch(); +out.length = partial_batch.num_rows(); + +for (const auto& field : full_schema.fields()) { + ARROW_ASSIGN_OR_RAISE(auto column, + FieldRef(field->name()).GetOneOrNone(partial_batch)); Review comment: It will raise Status::Invalid -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
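`MakeExecBatch` aligns a partial batch to a full schema: columns present in the batch pass through in schema order, `GetOneOrNone` errors when a name matches more than one column, and missing columns are presumably filled with nulls. A rough Python sketch of that contract, with plain lists standing in for Arrow columns (not the Arrow API):

```python
def make_exec_batch(full_schema, partial_columns):
    """full_schema: ordered field names; partial_columns: (name, values) pairs."""
    out = []
    for name in full_schema:
        matches = [values for col_name, values in partial_columns if col_name == name]
        if len(matches) > 1:
            # The ambiguity case FieldRef::GetOneOrNone guards against.
            raise ValueError(f"multiple columns match field {name!r}")
        # None stands in for a null column when the field is absent.
        out.append(matches[0] if matches else None)
    return out

print(make_exec_batch(["a", "b", "c"], [("b", [1, 2])]))  # -> [None, [1, 2], None]
```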
[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
bkietz commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643447166 ## File path: cpp/src/arrow/compute/exec/exec_plan.h ## @@ -225,22 +212,43 @@ class ARROW_EXPORT ExecNode {
   virtual void StopProducing() = 0;

  protected:
-  ExecNode(ExecPlan* plan, std::string label, std::vector<BatchDescr> input_descrs,
+  ExecNode(ExecPlan*, std::string label, NodeVector inputs,
           std::vector<std::string> input_labels, BatchDescr output_descr, int num_outputs);

   ExecPlan* plan_;
   std::string label_;
-  std::vector<BatchDescr> input_descrs_;
-  std::vector<std::string> input_labels_;
   NodeVector inputs_;
+  std::vector<std::string> input_labels_;
   BatchDescr output_descr_;
   int num_outputs_;
   NodeVector outputs_;
 };

+/// \brief Adapt an AsyncGenerator as a source node
+ARROW_EXPORT
+ExecNode* MakeSourceNode(ExecPlan*, std::string label, ExecNode::BatchDescr output_descr,
+                         AsyncGenerator<util::optional<ExecBatch>>);
+
+/// \brief Add a sink node which forwards to an AsyncGenerator
+ARROW_EXPORT
+AsyncGenerator<util::optional<ExecBatch>> MakeSinkNode(ExecNode* input,
+                                                       std::string label);
+
+/// \brief Make a node which excludes some rows from batches passed through it
+///
+/// filter Expression must be bound; no field references will be looked up by name
+ARROW_EXPORT
+ExecNode* MakeFilterNode(ExecNode* input, std::string label, Expression filter);
+
+/// \brief Make a node which executes expressions on input batches, producing new batches.
+///
+/// Expressions must be bound; no field references will be looked up by name
+ARROW_EXPORT
+ExecNode* MakeProjectNode(ExecNode* input, std::string label, Review comment: I'll add clarification -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
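The factory functions above assemble a dataflow: a source feeds batches into a filter, which feeds a projection, which a sink drains. Ignoring scheduling, asynchrony, and the real Arrow signatures, the topology alone can be sketched with Python generators (all names here are illustrative):

```python
def source_node(batches):
    yield from batches                      # like MakeSourceNode: adapt a generator

def filter_node(upstream, predicate):
    for batch in upstream:                  # like MakeFilterNode: drop failing rows
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept

def project_node(upstream, exprs):
    for batch in upstream:                  # like MakeProjectNode: evaluate expressions
        yield [[e(row) for e in exprs] for row in batch]

def sink_node(upstream):
    return list(upstream)                   # like MakeSinkNode: drain the plan

plan = sink_node(
    project_node(
        filter_node(source_node([[1, 2, 3], [4, 5]]), lambda x: x > 2),
        [lambda x: x * 10]))
print(plan)  # -> [[[30]], [[40], [50]]]
```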
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
Dandandan commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643441081

## File path: datafusion/src/optimizer/remove_duplicate_filters.rs ##

@@ -0,0 +1,611 @@
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+    match expr {
+        Expr::BinaryExpr {
+            left,
+            op: Operator::And,
+            right,
+        } => expr_contains(left, needle) || expr_contains(right, needle),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } => expr_contains(left, needle) || expr_contains(right, needle),
+        _ => expr == needle,
+    }
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr { .. } => Some(expr),
+        _ => None,
+    }
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+    op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+    match s {
+        Expr::Literal(ScalarValue::Int8(Some(1))) => true,
+        Expr::Literal(ScalarValue::Int16(Some(1))) => true,
+        Expr::Literal(ScalarValue::Int32(Some(1))) => true,
+        Expr::Literal(ScalarValue::Int64(Some(1))) => true,
+        Expr::Literal(ScalarValue::UInt8(Some(1))) => true,
+        Expr::Literal(ScalarValue::UInt16(Some(1))) => true,
+        Expr::Literal(ScalarValue::UInt32(Some(1))) => true,
+        Expr::Literal(ScalarValue::UInt64(Some(1))) => true,
+        Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 1. => true,
+        Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 1. => true,
+        _ => false
+    }
+}
+
+fn is_zero<'a>(s: &'a Expr) -> bool {
+    match s {
+        Expr::Literal(ScalarValue::Int8(Some(0))) => true,
+        Expr::Literal(ScalarValue::Int16(Some(0))) => true,
+        Expr::Literal(ScalarValue::Int32(Some(0))) => true,
+        Expr::Literal(ScalarValue::Int64(Some(0))) => true,
+        Expr::Literal(ScalarValue::UInt8(Some(0))) => true,
+        Expr::Literal(ScalarValue::UInt16(Some(0))) => true,
+        Expr::Literal(ScalarValue::UInt32(Some(0))) => true,
+        Expr::Literal(ScalarValue::UInt64(Some(0))) => true,
+        Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 0. => true,
+        Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 0. => true,
+        _ => false
+    }
+}
+
+fn is_true<'a>(expr: &'a Expr) -> bool {
+    match expr {
+        Expr::Literal(ScalarValue::Boolean(Some(v))) => *v,
+        _ => false,
+    }
+}
+
+fn is_false<'a>(expr: &'a Expr) -> bool {
+    match expr {
+        Expr::Literal(ScalarValue::Boolean(Some(v))) => *v == false,
+        _ => false,
+    }
+}
+
+fn simplify<'a>(expr: &'a Expr) -> Expr {
+    match expr {
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } if is_true(left) || is_true(right) => lit(true),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } if is_false(left) => simplify(right),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } if is_false(right) => simplify(left),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } if left == right => simplify(left),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::And,
+            right,
+        } if is_false(left) || is_false(right) => lit(false),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::And,
+            right,
+        } if is_true(right) =>
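The boolean-algebra identities that the (truncated) `simplify` function above applies can be seen in isolation on a toy expression type. This is a minimal sketch using a hypothetical stand-in `Expr` enum, not DataFusion's real `logical_plan::Expr`:

```rust
// Toy model of the rule's rewrites: true OR x => true, false OR x => x,
// x OR x => x, false AND x => false, true AND x => x, x AND x => x.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Lit(bool),
    Col(&'static str),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

fn simplify(e: &Expr) -> Expr {
    use Expr::*;
    match e {
        Or(l, r) => match (simplify(l), simplify(r)) {
            (Lit(true), _) | (_, Lit(true)) => Lit(true),
            (Lit(false), x) | (x, Lit(false)) => x,
            (a, b) if a == b => a, // duplicate disjunct collapses
            (a, b) => Or(Box::new(a), Box::new(b)),
        },
        And(l, r) => match (simplify(l), simplify(r)) {
            (Lit(false), _) | (_, Lit(false)) => Lit(false),
            (Lit(true), x) | (x, Lit(true)) => x,
            (a, b) if a == b => a, // duplicate conjunct collapses
            (a, b) => And(Box::new(a), Box::new(b)),
        },
        other => other.clone(),
    }
}

fn main() {
    // Models "#b > 2 AND #b > 2" collapsing to a single predicate.
    let dup = Expr::And(Box::new(Expr::Col("b_gt_2")), Box::new(Expr::Col("b_gt_2")));
    assert_eq!(simplify(&dup), Expr::Col("b_gt_2"));
    // true OR anything is true.
    let t = Expr::Or(Box::new(Expr::Lit(true)), Box::new(Expr::Col("c")));
    assert_eq!(simplify(&t), Expr::Lit(true));
    println!("ok");
}
```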
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
Dandandan commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643440084 ## File path: datafusion/src/optimizer/mod.rs ## @@ -25,4 +25,5 @@ pub mod hash_build_probe_order; pub mod limit_push_down; pub mod optimizer; pub mod projection_push_down; +pub mod remove_duplicate_filters; Review comment: I think that would be a more accurate name by now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
Dandandan commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643439834 ## File path: datafusion/src/optimizer/remove_duplicate_filters.rs ##
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
Dandandan commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643439659 ## File path: datafusion/src/optimizer/remove_duplicate_filters.rs ## Review comment: Or it might be better to just use `|` patterns (without `matches!`) in the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
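The suggestion above (collapse the repeated `is_one` arms with `|` patterns) could look roughly like this. The `ScalarValue` and `Expr` enums here are cut-down toy stand-ins with only a few variants, not the real DataFusion types:

```rust
#[derive(Debug)]
enum ScalarValue {
    Int32(Option<i32>),
    Int64(Option<i64>),
    Float64(Option<f64>),
}

#[derive(Debug)]
enum Expr {
    Literal(ScalarValue),
    Column(String),
}

fn is_one(s: &Expr) -> bool {
    match s {
        // Exact integer literals of 1 share a single arm via `|` patterns.
        Expr::Literal(ScalarValue::Int32(Some(1)))
        | Expr::Literal(ScalarValue::Int64(Some(1))) => true,
        // Float comparisons need a guard, so they keep a separate arm.
        Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 1. => true,
        _ => false,
    }
}

fn main() {
    assert!(is_one(&Expr::Literal(ScalarValue::Int32(Some(1)))));
    assert!(is_one(&Expr::Literal(ScalarValue::Float64(Some(1.0)))));
    assert!(!is_one(&Expr::Literal(ScalarValue::Int64(Some(2)))));
    assert!(!is_one(&Expr::Column("a".to_string())));
    println!("ok");
}
```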
[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans
westonpace commented on a change in pull request #10397: URL: https://github.com/apache/arrow/pull/10397#discussion_r643353334

## File path: cpp/src/arrow/compute/exec/exec_plan.h ##

@@ -225,22 +212,43 @@ class ARROW_EXPORT ExecNode {
   virtual void StopProducing() = 0;

  protected:
-  ExecNode(ExecPlan* plan, std::string label, std::vector input_descrs,
+  ExecNode(ExecPlan*, std::string label, NodeVector inputs,

Review comment: Nit: Seems odd to have no name for `ExecPlan` but then have a name for everything else.

## File path: cpp/src/arrow/compute/exec/expression.cc ##

@@ -61,13 +61,22 @@ Expression call(std::string function, std::vector arguments,
   call.function_name = std::move(function);
   call.arguments = std::move(arguments);
   call.options = std::move(options);
+
+  call.hash = std::hash<std::string>{}(call.function_name);
+  for (const auto& arg : call.arguments) {
+    call.hash ^= arg.hash();
+  }
   return Expression(std::move(call));
 }

 const Datum* Expression::literal() const { return util::get_if<Datum>(impl_.get()); }

+const Expression::Parameter* Expression::parameter() const {

Review comment: Checking my knowledge. A parameter is an index + an expected type?

## File path: cpp/src/arrow/compute/exec/expression.cc ##

@@ -613,6 +639,22 @@ std::vector FieldsInExpression(const Expression& expr) {
   return fields;
 }

+std::vector ParametersInExpression(const Expression& expr) {

Review comment: It seems this could return duplicate indices (e.g. in something like `x < 5 && x > 0`). Is that a problem?

## File path: cpp/src/arrow/dataset/dataset_internal.h ##

@@ -204,5 +204,35 @@ arrow::Result> GetFragmentScanOptions(
   return internal::checked_pointer_cast(source);
 }

+class FragmentDataset : public Dataset {

Review comment: Should this be a base type of `InMemoryDataset`?
## File path: cpp/src/arrow/compute/exec/test_util.cc ##

@@ -124,277 +130,42 @@ struct DummyNode : ExecNode {
   bool started_ = false;
 };

-struct RecordBatchReaderNode : ExecNode {
-  RecordBatchReaderNode(ExecPlan* plan, std::string label,
-                        std::shared_ptr<RecordBatchReader> reader, Executor* io_executor)
-      : ExecNode(plan, std::move(label), {}, {},
-                 DescrFromSchemaColumns(*reader->schema()), /*num_outputs=*/1),
-        schema_(reader->schema()),
-        reader_(std::move(reader)),
-        io_executor_(io_executor) {}
-
-  RecordBatchReaderNode(ExecPlan* plan, std::string label, std::shared_ptr<Schema> schema,
-                        RecordBatchGenerator generator, Executor* io_executor)
-      : ExecNode(plan, std::move(label), {}, {}, DescrFromSchemaColumns(*schema),
-                 /*num_outputs=*/1),
-        schema_(std::move(schema)),
-        generator_(std::move(generator)),
-        io_executor_(io_executor) {}
-
-  const char* kind_name() override { return "RecordBatchReader"; }
-
-  void InputReceived(ExecNode* input, int seq_num, compute::ExecBatch batch) override {}
-
-  void ErrorReceived(ExecNode* input, Status error) override {}
-
-  void InputFinished(ExecNode* input, int seq_stop) override {}
-
-  Status StartProducing() override {
-    next_batch_index_ = 0;
-    if (!generator_) {
-      auto it = MakeIteratorFromReader(reader_);
-      ARROW_ASSIGN_OR_RAISE(generator_,
-                            MakeBackgroundGenerator(std::move(it), io_executor_));
-    }
-    GenerateOne(std::unique_lock<std::mutex>{mutex_});
-    return Status::OK();
-  }
-
-  void PauseProducing(ExecNode* output) override {}
-
-  void ResumeProducing(ExecNode* output) override {}
-
-  void StopProducing(ExecNode* output) override {
-    ASSERT_EQ(output, outputs_[0]);
-    std::unique_lock<std::mutex> lock(mutex_);
-    generator_ = nullptr;  // null function
-  }
-
-  void StopProducing() override { StopProducing(outputs_[0]); }
-
- private:
-  void GenerateOne(std::unique_lock<std::mutex>&& lock) {
-    if (!generator_) {
-      // Stopped
-      return;
-    }
-    auto plan = this->plan()->shared_from_this();
-    auto fut = generator_();
-    const auto batch_index = next_batch_index_++;
-
-    lock.unlock();
-    // TODO we want to transfer always here
-    io_executor_->Transfer(std::move(fut))
-        .AddCallback(
-            [plan, batch_index, this](const Result<std::shared_ptr<RecordBatch>>& res) {
-              std::unique_lock<std::mutex> lock(mutex_);
-              if (!res.ok()) {
-                for (auto out : outputs_) {
-                  out->ErrorReceived(this, res.status());
-                }
-                return;
-              }
-              const auto& batch = *res;
-              if (IsIterationEnd(batch)) {
-                lock.unlock();
-                for (auto out : outputs_) {
-                  out->InputFinished(this, batch_index);
-                }
-              } else {
-                lock.unlock();
-                for (auto out : outputs_) {
-                  out->InputReceived(this,
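The `expression.cc` diff quoted earlier in this comment seeds `call.hash` with the hash of the function name and then XORs in each argument's hash. A hedged Rust sketch of that combining scheme (toy helpers, not Arrow's API) also shows the property behind the reviewer's duplicate-parameter question: XOR combining is order-insensitive, so argument order does not affect the result:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a single value with the standard library's default hasher.
fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Combine: start from the function name's hash, XOR in each argument hash,
// mirroring `call.hash = hash(name); call.hash ^= arg.hash();` in the diff.
fn call_hash(function_name: &str, arg_hashes: &[u64]) -> u64 {
    let mut hash = hash_of(&function_name);
    for a in arg_hashes {
        hash ^= a;
    }
    hash
}

fn main() {
    let h1 = call_hash("add", &[1, 2]);
    let h2 = call_hash("add", &[2, 1]);
    // XOR combining ignores argument order, so these collide.
    assert_eq!(h1, h2);
    println!("{}", h1);
}
```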
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)
Dandandan commented on a change in pull request #436: URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643438683 ## File path: datafusion/src/optimizer/remove_duplicate_filters.rs ## Review comment: A `matches!` returns a boolean so you can also use `||` / boolean or to cover the float cases and keep the rest in a single `matches!` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
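The `matches!`-plus-`||` variant suggested above could look roughly like this, again on cut-down toy stand-ins for `ScalarValue` and `Expr` (the real enums have many more variants):

```rust
#[derive(Debug)]
enum ScalarValue {
    Int32(Option<i32>),
    Int64(Option<i64>),
    Float64(Option<f64>),
}

#[derive(Debug)]
enum Expr {
    Literal(ScalarValue),
    Column(String),
}

fn is_one(s: &Expr) -> bool {
    // Exact integer cases stay in one matches!; the float case is OR'ed on
    // because it needs an `if` guard for the comparison.
    matches!(
        s,
        Expr::Literal(ScalarValue::Int32(Some(1)))
            | Expr::Literal(ScalarValue::Int64(Some(1)))
    ) || matches!(s, Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 1.)
}

fn main() {
    assert!(is_one(&Expr::Literal(ScalarValue::Int64(Some(1)))));
    assert!(is_one(&Expr::Literal(ScalarValue::Float64(Some(1.0)))));
    assert!(!is_one(&Expr::Literal(ScalarValue::Int32(Some(0)))));
    assert!(!is_one(&Expr::Column("a".to_string())));
    println!("ok");
}
```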
[GitHub] [arrow-datafusion] alamb merged pull request #434: fix: display the content of debug explain
alamb merged pull request #434: URL: https://github.com/apache/arrow-datafusion/pull/434 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #430: LogicalPlan::inputs function should return the input plan for Explain enum
alamb closed issue #430: URL: https://github.com/apache/arrow-datafusion/issues/430 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] anthonylouisbsb commented on pull request #10350: ARROW-12814: [C++][Gandiva] Implements the ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT, CEIL, TRUNC and LN functions
anthonylouisbsb commented on pull request #10350: URL: https://github.com/apache/arrow/pull/10350#issuecomment-852402117 The `LN` function was added in this PR too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-datafusion] NGA-TRAN commented on pull request #434: fix: display the content of debug explain
NGA-TRAN commented on pull request #434: URL: https://github.com/apache/arrow-datafusion/pull/434#issuecomment-852397463 @alamb Finally all checks have passed :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API
github-actions[bot] commented on pull request #10431: URL: https://github.com/apache/arrow/pull/10431#issuecomment-852396357 https://issues.apache.org/jira/browse/ARROW-12921 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] JayjeetAtGithub opened a new pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API
JayjeetAtGithub opened a new pull request #10431: URL: https://github.com/apache/arrow/pull/10431 The implementation includes a new `RadosParquetFileFormat` class that inherits from the `ParquetFileFormat` class to defer the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. This new file format plugs into the `FileSystemDataset` API, converts filenames to object IDs using FS metadata, and uses the [librados](https://docs.ceph.com/en/latest/rados/api/librados-intro/) C++ library to execute storage-side functions that scan the files on the [Ceph](https://ceph.io) storage nodes (OSDs) using Arrow libraries. We ship unit and integration tests with our implementation; the tests run against a single-node Ceph cluster. The storage-side code is implemented as a RADOS CLS (object storage class) using [Ceph's Object Class SDK](https://docs.ceph.com/en/latest/rados/api/objclass-sdk/). The code lives in `cpp/src/arrow/adapters/arrow-rados-cls` and is expected to be deployed on the storage nodes (Ceph's OSDs) prior to operating on tables via the `RadosParquetFileFormat` implementation. This PR includes a CMake configuration for building this library if desired (the `ARROW_CLS` CMake option). We have also added Python bindings for our C++ implementations and added integration tests for them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org