[GitHub] [arrow-datafusion] houqp commented on a change in pull request #443: add invariants spec

2021-06-01 Thread GitBox


houqp commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643661190



##
File path: docs/specification/invariants.md
##
@@ -0,0 +1,327 @@
+
+
+# DataFusion's Invariants
+
+This document enumerates invariants of DataFusion's logical and physical planes
+(functions and nodes). Some of these invariants are currently not enforced.
+This document assumes that the reader is familiar with some of the codebase,
+including Rust Arrow's `RecordBatch` and `Array`.
+
+## Rationale
+
+DataFusion's computational model is built on top of a dynamically typed arrow
+object, Array, that offers the interface `Array::as_any` to downcast itself to
+its statically typed versions (e.g. `Int32Array`). DataFusion uses
+`Array::data_type` to perform the respective downcasting on its physical
+operations. DataFusion uses a dynamic type system because the queries being
+executed are not always known at compile time: they are only known during the
+runtime (or query time) of programs built with DataFusion. This document is
+built on top of this principle.
+
+In dynamically typed interfaces, it is up to developers to enforce type
+invariants. This document declares some of these invariants, so that users
+know what they can expect from a query in DataFusion, and DataFusion developers
+know what they need to enforce at the coding level.
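As a self-contained sketch of this downcasting pattern (mimicking arrow's `Array::as_any`/`Array::data_type` interface with stand-in types, not the actual arrow-rs definitions):

```rust
use std::any::Any;

// Stand-ins for arrow's dynamically typed Array trait: a runtime type tag
// plus an `as_any` escape hatch for downcasting to the static type.
#[allow(dead_code)]
enum DataType { Int32, Utf8 }

trait Array {
    fn data_type(&self) -> DataType;
    fn as_any(&self) -> &dyn Any;
}

struct Int32Array { values: Vec<i32> }

impl Array for Int32Array {
    fn data_type(&self) -> DataType { DataType::Int32 }
    fn as_any(&self) -> &dyn Any { self }
}

// A physical operation matches on `data_type` and downcasts accordingly.
fn sum(array: &dyn Array) -> Option<i64> {
    match array.data_type() {
        DataType::Int32 => {
            let typed = array.as_any().downcast_ref::<Int32Array>()?;
            Some(typed.values.iter().map(|&v| v as i64).sum())
        }
        _ => None, // every other type must be handled explicitly
    }
}

fn main() {
    let a = Int32Array { values: vec![1, 2, 3] };
    assert_eq!(sum(&a), Some(6));
}
```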
+
+## Notation
+
+* Field or physical field: the tuple of name, `arrow::DataType` and nullability flag (a bool indicating whether values can be null), represented in this document by `PF(name, type, nullable)`
+* Logical field: Field with a relation name. Represented in this document by 
`LF(relation, name, type, nullable)`
+* Projected plan: plan with projection as the root node.
+* Logical schema: a vector of logical fields, used by logical plan.
+* Physical schema: a vector of physical fields, used by both physical plan and 
Arrow record batch.
+
+### Logical
+
+#### Function
+
+An object that knows its valid incoming logical fields and how to derive its
+output logical field from its arguments' logical fields. A function's output
+field is itself a function of its input fields:
+
+```
+logical_field(lf1: LF, lf2: LF, ...) -> LF
+```
+
+Examples:
+
+* `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` is the function mapping the input types to the output type (`get_supertype` in our current implementation).
+* `length(a) -> LF(None, "length({a})", u32, a.nullable)`
+
+#### Plan
+
+A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - 
c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its schema.
+
+Certain plans have a frozen schema (e.g. Scan), while others derive their
+schema from their child nodes.
+
+#### Column
+
+A type of logical node in a logical plan, consisting of a field name and a relation name.
+
+### Physical
+
+#### Function
+
+An object that knows how to derive its physical field from its arguments'
+physical fields, and also how to actually perform the computation on data. A
+function's output physical field is a function of its input physical fields:
+
+```
+physical_field(PF1, PF2, ...) -> PF
+```
+
+Examples:
+
+* `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` maps the input types to the output type (`get_supertype` in our current implementation), and whose computation, for each element in the columns, sums the two entries and returns the result in the smallest common supertype of both columns.
+* `length(a) -> PF("length({a})", u32, a.nullable)` whose computation is "count the number of bytes in the string".
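For illustration, the field-derivation half of such a physical function could be sketched as follows (a hypothetical `PhysicalField`/`length_field` pair, not DataFusion's actual API — the point is that the output field is a pure function of the input fields):

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType { UInt32, Utf8 }

#[derive(Debug, Clone, PartialEq)]
struct PhysicalField { name: String, data_type: DataType, nullable: bool }

// Derive length()'s output field from its argument's field, per the spec:
// PF("length({a})", u32, a.nullable).
fn length_field(a: &PhysicalField) -> PhysicalField {
    PhysicalField {
        name: format!("length({})", a.name),
        data_type: DataType::UInt32, // length() always yields u32
        nullable: a.nullable,        // null in -> null out
    }
}

fn main() {
    let input = PhysicalField { name: "c1".to_string(), data_type: DataType::Utf8, nullable: true };
    let out = length_field(&input);
    assert_eq!(out.name, "length(c1)");
    assert_eq!(out.data_type, DataType::UInt32);
    assert!(out.nullable);
}
```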
+
+#### Plan
+
+A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its metadata and compute itself.
+
+Note how the physical plane does not know how to derive field names: field
+names are solely a property of the logical plane, as they are not needed in the
+physical plane.
+
+#### Column
+
+A type of physical node in a physical plan, consisting of a field name and a unique index.
+
+### Data Sources' registry
+
+A map of source name/relation -> Schema plus associated properties necessary 
to read data from it (e.g. file path).
+
+### Functions' registry
+
+A map of function name -> logical + physical function.
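Such a registry can be sketched as a plain map (illustrative names only, not DataFusion's API; plain fns stand in for the real logical and physical function objects):

```rust
use std::collections::HashMap;

// Logical half: derives the output field name from the argument's name.
fn length_name(a: &str) -> String { format!("length({a})") }

// Physical half: actually computes on data (here: count bytes).
fn byte_length(a: &str) -> usize { a.len() }

fn main() {
    // function name -> (logical, physical) pair, per the spec.
    let mut registry: HashMap<&str, (fn(&str) -> String, fn(&str) -> usize)> =
        HashMap::new();
    registry.insert("length", (length_name, byte_length));

    let (logical, physical) = registry["length"];
    assert_eq!(logical("c1"), "length(c1)");
    assert_eq!(physical("héllo"), 6); // counts bytes, not chars
}
```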
+
+### Physical Planner
+
+A function that knows how to derive a physical plan from a logical plan:
+
+```
+plan(LogicalPlan) -> PhysicalPlan
+```
+
+### Logical Optimizer
+
+A function that accepts a logical plan and returns an (optimized) logical plan
+which computes the same results, but in a more efficient manner:
+
+```
+optimize(LogicalPlan) -> LogicalPlan
+```
+
+### Physical Optimizer
+
+A function that accepts a physical plan and returns an (optimized) physical
+plan which computes the same results, but may differ based on the actual
+hardware or execution environment being run:
+
+```
+optimize(PhysicalPlan) -> PhysicalPlan
+```
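The optimizer signatures above share one invariant: the output plan must compute the same results as the input plan. A minimal sketch, with illustrative type and trait names (not DataFusion's API), where the identity optimizer trivially satisfies the invariant:

```rust
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan { description: String }

trait LogicalOptimizer {
    // Invariant: optimize(p) must compute the same results as p,
    // ideally more efficiently.
    fn optimize(&self, plan: LogicalPlan) -> LogicalPlan;
}

// The no-op optimizer returns the plan unchanged, so the invariant
// holds by construction.
struct NoopOptimizer;

impl LogicalOptimizer for NoopOptimizer {
    fn optimize(&self, plan: LogicalPlan) -> LogicalPlan { plan }
}

fn main() {
    let plan = LogicalPlan { description: "Projection c1 + c2; Scan c1, c2".to_string() };
    let optimized = NoopOptimizer.optimize(plan.clone());
    assert_eq!(optimized, plan); // same results guaranteed
}
```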
+
+### Builder
+
+A function 

[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #443: add invariants spec

2021-06-01 Thread GitBox


jorgecarleitao commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643655420



##
File path: docs/specification/invariants.md
##
@@ -0,0 +1,327 @@
+
+
+# DataFusion's Invariants
+
+This document enumerates invariants of DataFusion's logical and physical planes
+(functions and nodes). Some of these invariants are currently not enforced.
+This document assumes that the reader is familiar with some of the codebase,
+including Rust Arrow's `RecordBatch` and `Array`.
+
+## Rationale
+
+DataFusion's computational model is built on top of a dynamically typed arrow
+object, Array, that offers the interface `Array::as_any` to downcast itself to
+its statically typed versions (e.g. `Int32Array`). DataFusion uses
+`Array::data_type` to perform the respective downcasting on its physical
+operations. DataFusion uses a dynamic type system because the queries being
+executed are not always known at compile time: they are only known during the
+runtime (or query time) of programs built with DataFusion. This document is
+built on top of this principle.
+
+In dynamically typed interfaces, it is up to developers to enforce type
+invariants. This document declares some of these invariants, so that users
+know what they can expect from a query in DataFusion, and DataFusion developers
+know what they need to enforce at the coding level.
+
+## Notation
+
+* Field or physical field: the tuple of name, `arrow::DataType` and nullability flag (a bool indicating whether values can be null), represented in this document by `PF(name, type, nullable)`
+* Logical field: Field with a relation name. Represented in this document by 
`LF(relation, name, type, nullable)`
+* Projected plan: plan with projection as the root node.
+* Logical schema: a vector of logical fields, used by logical plan.
+* Physical schema: a vector of physical fields, used by both physical plan and 
Arrow record batch.
+
+### Logical
+
+#### Function
+
+An object that knows its valid incoming logical fields and how to derive its
+output logical field from its arguments' logical fields. A function's output
+field is itself a function of its input fields:
+
+```
+logical_field(lf1: LF, lf2: LF, ...) -> LF
+```
+
+Examples:
+
+* `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` is the function mapping the input types to the output type (`get_supertype` in our current implementation).
+* `length(a) -> LF(None, "length({a})", u32, a.nullable)`
+
+#### Plan
+
+A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - 
c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its schema.
+
+Certain plans have a frozen schema (e.g. Scan), while others derive their
+schema from their child nodes.
+
+#### Column
+
+A type of logical node in a logical plan, consisting of a field name and a relation name.
+
+### Physical
+
+#### Function
+
+An object that knows how to derive its physical field from its arguments'
+physical fields, and also how to actually perform the computation on data. A
+function's output physical field is a function of its input physical fields:
+
+```
+physical_field(PF1, PF2, ...) -> PF
+```
+
+Examples:
+
+* `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` maps the input types to the output type (`get_supertype` in our current implementation), and whose computation, for each element in the columns, sums the two entries and returns the result in the smallest common supertype of both columns.
+* `length(a) -> PF("length({a})", u32, a.nullable)` whose computation is "count the number of bytes in the string".
+
+#### Plan
+
+A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its metadata and compute itself.
+
+Note how the physical plane does not know how to derive field names: field
+names are solely a property of the logical plane, as they are not needed in the

Review comment:
   That is my understanding also: naming is in the logical plane; in the physical plane, people can use whatever they want as long as the record batches respect the output schema declared on the corresponding logical nodes.
   
   People can derive a debug format for the physical expressions, but the naming must be set beforehand, or we risk a divergence between the names declared in the output schema of the logical node and the record batches returned by the physical node.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-datafusion] houqp commented on a change in pull request #443: add invariants spec

2021-06-01 Thread GitBox


houqp commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643649837



##
File path: docs/specification/invariants.md
##
@@ -0,0 +1,327 @@
+
+
+# DataFusion's Invariants
+
+This document enumerates invariants of DataFusion's logical and physical planes
+(functions and nodes). Some of these invariants are currently not enforced.
+This document assumes that the reader is familiar with some of the codebase,
+including Rust Arrow's `RecordBatch` and `Array`.
+
+## Rationale
+
+DataFusion's computational model is built on top of a dynamically typed arrow
+object, Array, that offers the interface `Array::as_any` to downcast itself to
+its statically typed versions (e.g. `Int32Array`). DataFusion uses
+`Array::data_type` to perform the respective downcasting on its physical
+operations. DataFusion uses a dynamic type system because the queries being
+executed are not always known at compile time: they are only known during the
+runtime (or query time) of programs built with DataFusion. This document is
+built on top of this principle.
+
+In dynamically typed interfaces, it is up to developers to enforce type
+invariants. This document declares some of these invariants, so that users
+know what they can expect from a query in DataFusion, and DataFusion developers
+know what they need to enforce at the coding level.
+
+## Notation
+
+* Field or physical field: the tuple of name, `arrow::DataType` and nullability flag (a bool indicating whether values can be null), represented in this document by `PF(name, type, nullable)`
+* Logical field: Field with a relation name. Represented in this document by 
`LF(relation, name, type, nullable)`
+* Projected plan: plan with projection as the root node.
+* Logical schema: a vector of logical fields, used by logical plan.
+* Physical schema: a vector of physical fields, used by both physical plan and 
Arrow record batch.
+
+### Logical
+
+#### Function
+
+An object that knows its valid incoming logical fields and how to derive its
+output logical field from its arguments' logical fields. A function's output
+field is itself a function of its input fields:
+
+```
+logical_field(lf1: LF, lf2: LF, ...) -> LF
+```
+
+Examples:
+
+* `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` is the function mapping the input types to the output type (`get_supertype` in our current implementation).
+* `length(a) -> LF(None, "length({a})", u32, a.nullable)`
+
+#### Plan
+
+A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - 
c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its schema.
+
+Certain plans have a frozen schema (e.g. Scan), while others derive their
+schema from their child nodes.
+
+#### Column
+
+A type of logical node in a logical plan, consisting of a field name and a relation name.
+
+### Physical
+
+#### Function
+
+An object that knows how to derive its physical field from its arguments'
+physical fields, and also how to actually perform the computation on data. A
+function's output physical field is a function of its input physical fields:
+
+```
+physical_field(PF1, PF2, ...) -> PF
+```
+
+Examples:
+
+* `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where `d` maps the input types to the output type (`get_supertype` in our current implementation), and whose computation, for each element in the columns, sums the two entries and returns the result in the smallest common supertype of both columns.
+* `length(a) -> PF("length({a})", u32, a.nullable)` whose computation is "count the number of bytes in the string".
+
+#### Plan
+
+A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its metadata and compute itself.
+
+Note how the physical plane does not know how to derive field names: field
+names are solely a property of the logical plane, as they are not needed in the

Review comment:
   @jorgecarleitao please correct me if I am wrong. I think the focus of this statement is on the generation of field names. In the current code base, physical field names are generated using metadata from the logical plane, more specifically from logical `Expr`s. The actual physical plan and physical expressions are not used to generate the field names.
   
   > names are used to line up inputs to PhysicalExpr
   
   This used to be true at the time when Jorge wrote it, but with https://github.com/apache/arrow-datafusion/pull/55, we will be changing it to line up inputs using the field index instead. But regardless of whether we make this change, it doesn't change the fact that physical field names are only generated using metadata from the logical plane.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:


[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #463: Add sort in window functions

2021-06-01 Thread GitBox


codecov-commenter commented on pull request #463:
URL: https://github.com/apache/arrow-datafusion/pull/463#issuecomment-852716291


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/463?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#463](https://codecov.io/gh/apache/arrow-datafusion/pull/463?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (50f848d) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/16011120a1b73798049c5be49f9548b00f8a0a00?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (1601112) will **increase** coverage by `0.03%`.
   > The diff coverage is `76.23%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/463/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/463?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #463      +/-   ##
   ==========================================
   + Coverage   75.84%   75.87%   +0.03%
   ==========================================
     Files         153      153
     Lines       25872    26030     +158
   ==========================================
   + Hits        19622    19751     +129
   - Misses       6250     6279      +29
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/463?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...sta/rust/core/src/serde/logical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vZnJvbV9wcm90by5ycw==)
 | `36.00% <0.00%> (-0.22%)` | :arrow_down: |
   | 
[...ta/rust/core/src/serde/physical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9waHlzaWNhbF9wbGFuL2Zyb21fcHJvdG8ucnM=)
 | `38.92% <0.00%> (-0.86%)` | :arrow_down: |
   | 
[datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3V0aWxzLnJz)
 | `47.51% <0.00%> (-2.49%)` | :arrow_down: |
   | 
[...lista/rust/core/src/serde/logical\_plan/to\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vdG9fcHJvdG8ucnM=)
 | `62.56% <18.18%> (+0.15%)` | :arrow_up: |
   | 
[datafusion/src/optimizer/projection\_push\_down.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3Byb2plY3Rpb25fcHVzaF9kb3duLnJz)
 | `98.46% <87.50%> (+<0.01%)` | :arrow_up: |
   | 
[datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3V0aWxzLnJz)
 | `64.92% <91.26%> (+17.02%)` | :arrow_up: |
   | 
[datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2V4cHIucnM=)
 | `84.60% <91.66%> (+0.07%)` | :arrow_up: |
   | 
[datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3BsYW5uZXIucnM=)
 | `84.37% <98.07%> (+0.31%)` | :arrow_up: |
   | 
[datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/463/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2J1aWxkZXIucnM=)
 | `90.43% <100.00%> (-0.05%)` | :arrow_down: |
   | 

[GitHub] [arrow] cyb70289 edited a comment on pull request #10364: ARROW-12074: [C++][Compute] Add scalar arithmetic kernels for decimal

2021-06-01 Thread GitBox


cyb70289 edited a comment on pull request #10364:
URL: https://github.com/apache/arrow/pull/10364#issuecomment-852709410


   @bkietz , met with one problem, would like to hear your comments. Thanks.
   
   Decimal upscaling is operation dependent. E.g., `+`/`-` will upscale the arg with the smaller scale to align digits, `*` needs no scaling, and `/` is more complicated.
   
   Implicit argument casting happens before the kernel is created. `DispatchBest` only knows the arg types; no operation type (kernel dependent) is available. So we cannot figure out the "to be casted" arg type (new precision, scale).
   
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.cc#L175
   
   Maybe add another callback `kernel->explicit_cast()` and call it after or inside `DispatchBest`? Or create a different `ScalarFunction` struct (and `DispatchBest`) for each decimal operation?
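For reference, the `+`/`-` scale alignment under discussion amounts to the following (a generic sketch with `i128` coefficients; precision bookkeeping and Arrow's kernel machinery are omitted, and the function names are illustrative):

```rust
// Rescale a decimal coefficient upward. Assumes to_scale >= from_scale,
// which decimal_add guarantees by always targeting the larger scale.
fn rescale(value: i128, from_scale: u32, to_scale: u32) -> i128 {
    value * 10i128.pow(to_scale - from_scale)
}

/// Add two decimals given as (coefficient, scale): upscale the arg with
/// the smaller scale to align digits, then add coefficient-wise.
fn decimal_add(a: (i128, u32), b: (i128, u32)) -> (i128, u32) {
    let scale = a.1.max(b.1);
    (rescale(a.0, a.1, scale) + rescale(b.0, b.1, scale), scale)
}

fn main() {
    // 1.5 (coeff 15, scale 1) + 0.25 (coeff 25, scale 2)
    //   = 1.75 (coeff 175, scale 2)
    assert_eq!(decimal_add((15, 1), (25, 2)), (175, 2));
}
```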


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-datafusion] msathis commented on issue #472: [Ballista] Improve task and job metadata

2021-06-01 Thread GitBox


msathis commented on issue #472:
URL: 
https://github.com/apache/arrow-datafusion/issues/472#issuecomment-852710260


   This will be a great addition. We can expose this information to the UI as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org








[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643637975



##
File path: cpp/src/arrow/compute/exec/expression_test.cc
##
@@ -165,6 +165,56 @@ TEST(ExpressionUtils, StripOrderPreservingCasts) {
   Expect(cast(field_ref("i32"), uint64()), no_change);
 }
 
+TEST(ExpressionUtils, MakeExecBatch) {
+  auto Expect = [](std::shared_ptr<RecordBatch> partial_batch) {
+    SCOPED_TRACE(partial_batch->ToString());
+    ASSERT_OK_AND_ASSIGN(auto batch, MakeExecBatch(*kBoringSchema, partial_batch));
+
+    ASSERT_EQ(batch.num_values(), kBoringSchema->num_fields());
+    for (int i = 0; i < kBoringSchema->num_fields(); ++i) {
+      const auto& field = *kBoringSchema->field(i);
+
+      SCOPED_TRACE("Field#" + std::to_string(i) + " " + field.ToString());
+
+      EXPECT_TRUE(batch[i].type()->Equals(field.type()))
+          << "Incorrect type " << batch[i].type()->ToString();
+
+      ASSERT_OK_AND_ASSIGN(auto col, FieldRef(field.name()).GetOneOrNone(*partial_batch));

Review comment:
   Ah, I expected `GetColumnByName` to just return the first instance of the field (I thought duplicate field names were generally allowed outside of `compute`), but you are correct: it treats it the same as "not found".








[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643637072



##
File path: cpp/src/arrow/dataset/dataset_internal.h
##
@@ -204,5 +204,35 @@ arrow::Result<std::shared_ptr<T>> GetFragmentScanOptions(
   return internal::checked_pointer_cast<T>(source);
 }
 
+class FragmentDataset : public Dataset {

Review comment:
   Right, sorry, I meant the other way around.  `InMemoryDataset : public 
FragmentDataset`.








[GitHub] [arrow] westonpace commented on pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#issuecomment-852702836


   @pitrou Don't worry about the delay, I've been plenty busy elsewhere.  I have just a few follow-up questions and then I'll make the changes.






[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643636361



##
File path: cpp/src/arrow/util/future_test.cc
##
@@ -952,6 +951,85 @@ TEST(FutureCompletionTest, FutureVoid) {
   }
 }
 
+class FutureSchedulingTest : public testing::Test {
+ public:
+  internal::Executor* executor() { return mock_executor.get(); }
+  int spawn_count() { return mock_executor->spawn_count; }
+
+  std::function<void(const Status&)> callback = [](const Status&) {};
+  std::shared_ptr<MockExecutor> mock_executor = std::make_shared<MockExecutor>();
+};
+
+TEST_F(FutureSchedulingTest, ScheduleAlways) {
+  CallbackOptions options;
+  options.should_schedule = ShouldSchedule::ALWAYS;
+  options.executor = executor();
+  // Successful future
+  {
+auto fut = Future<>::Make();
+fut.AddCallback(callback, options);
+fut.MarkFinished();
+fut.AddCallback(callback, options);
+ASSERT_EQ(2, spawn_count());
+  }
+  // Failing future
+  {
+auto fut = Future<>::Make();
+fut.AddCallback(callback, options);
+fut.MarkFinished(Status::Invalid("XYZ"));
+fut.AddCallback(callback, options);
+ASSERT_EQ(4, spawn_count());
+  }
+}
+
+TEST_F(FutureSchedulingTest, ScheduleIfUnfinished) {
+  CallbackOptions options;
+  options.should_schedule = ShouldSchedule::IF_UNFINISHED;
+  options.executor = executor();
+  // Successful future
+  {
+auto fut = Future<>::Make();
+fut.AddCallback(callback, options);
+fut.MarkFinished();
+fut.AddCallback(callback, options);
+ASSERT_EQ(1, spawn_count());
+  }
+  // Failing future
+  {
+auto fut = Future<>::Make();
+fut.AddCallback(callback, options);
+fut.MarkFinished(Status::Invalid("XYZ"));
+fut.AddCallback(callback, options);
+ASSERT_EQ(2, spawn_count());
+  }
+}
+
+class DelayedExecutor : public internal::Executor {

Review comment:
   My rationale was that `DelayedExecutor` is only used in `future_test.cc` while `MockExecutor` is used in both `future_test.cc` and `thread_pool_test.cc`, but I see your point.  I'll move this into `test_common.h`.








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643636029



##
File path: cpp/src/arrow/util/test_common.h
##
@@ -85,4 +88,18 @@ inline void AssertIteratorExhausted(Iterator& it) {
 
 Transformer<TestInt, TestInt> MakeFilter(std::function<bool(TestInt)> filter);
 
+class MockExecutor : public internal::Executor {

Review comment:
   There is a bit of categorization based on folder (e.g. 
`util/test_common.h` vs `io/test_common.h`).  This file isn't really all that 
large.  However, it could probably be split into `test_iterator_common.h` and 
`test_thread_pool_common.h`.  Or am I missing the point?








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643634671



##
File path: cpp/src/arrow/util/future.h
##
@@ -453,30 +480,35 @@ class Future {
   /// cyclic reference to itself through the callback.
   template 
   typename 
std::enable_if::value>::type
-  AddCallback(OnComplete on_complete) const {
+  AddCallback(OnComplete on_complete,
+  CallbackOptions opts = CallbackOptions::Defaults()) const {
 // We know impl_ will not be dangling when invoking callbacks because at 
least one
 // thread will be waiting for MarkFinished to return. Thus it's safe to 
keep a
 // weak reference to impl_ here
     struct Callback {
-      void operator()() && { std::move(on_complete)(weak_self.get().result()); }
-      WeakFuture<T> weak_self;
+      void operator()(const FutureImpl& impl) && {
+        std::move(on_complete)(*static_cast<Result<ValueType>*>(impl.result_.get()));
+      }
       OnComplete on_complete;
     };
-    impl_->AddCallback(Callback{WeakFuture<T>(*this), std::move(on_complete)});
+    impl_->AddCallback(Callback{std::move(on_complete)}, opts);
   }
 
   /// Overload for callbacks accepting a Status
   template 
   typename std::enable_if::value>::type
-  AddCallback(OnComplete on_complete) const {
+  AddCallback(OnComplete on_complete,
+  CallbackOptions opts = CallbackOptions::Defaults()) const {
 static_assert(std::is_same::value,
   "Callbacks for Future<> should accept Status and not 
Result");
     struct Callback {
-      void operator()() && { std::move(on_complete)(weak_self.get().status()); }
-      WeakFuture<T> weak_self;
+      void operator()(const FutureImpl& impl) && {
+        std::move(on_complete)(
+            static_cast<Result<ValueType>*>(impl.result_.get())->status());

Review comment:
   It is a method on `Future` (named `GetResult`), but `FutureImpl` is type-erased and so it has no reference to `ValueType`.  If it were a `.cc` file I could extract it into an anonymous function, but no luck there because of templates.  I think I can create a `future_internal.h`.  Would it be acceptable to create a method `GetResultFromFutureImpl` inside of `arrow::detail`?  Or is there some other trick I can use?











[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #443: add invariants spec

2021-06-01 Thread GitBox


codecov-commenter edited a comment on pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#issuecomment-850944261


   # 
[Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/443?src=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#443](https://codecov.io/gh/apache/arrow-datafusion/pull/443?src=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (5ba60e2) into 
[master](https://codecov.io/gh/apache/arrow-datafusion/commit/321fda40a47bcc494c5d2511b6e8b02c9ea975b4?el=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 (321fda4) will **increase** coverage by `0.67%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-datafusion/pull/443/graphs/tree.svg?width=650=150=pr=JXwWBKD3D9_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/443?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #443      +/-   ##
   ==========================================
   + Coverage   75.16%   75.84%   +0.67%
   ==========================================
     Files         150      153       +3
     Lines       25144    25872     +728
   ==========================================
   + Hits        18899    19622     +723
   - Misses       6245     6250       +5
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-datafusion/pull/443?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jb21tb24ucnM=)
 | `84.21% <0.00%> (-2.00%)` | :arrow_down: |
   | 
[ballista/rust/scheduler/src/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9zY2hlZHVsZXIvc3JjL3BsYW5uZXIucnM=)
 | `66.91% <0.00%> (-0.74%)` | :arrow_down: |
   | 
[datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=)
 | `99.27% <0.00%> (-0.62%)` | :arrow_down: |
   | 
[datafusion-cli/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1jbGkvc3JjL21haW4ucnM=)
 | `0.00% <0.00%> (ø)` | |
   | 
[ballista/rust/executor/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvbWFpbi5ycw==)
 | `0.00% <0.00%> (ø)` | |
   | 
[ballista/rust/executor/src/execution\_loop.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvZXhlY3V0aW9uX2xvb3AucnM=)
 | `0.00% <0.00%> (ø)` | |
   | 
[ballista/rust/executor/src/flight\_service.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvZmxpZ2h0X3NlcnZpY2UucnM=)
 | `0.00% <0.00%> (ø)` | |
   | 
[datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9tb2QucnM=)
 | `71.42% <0.00%> (ø)` | |
   | 
[ballista/rust/core/src/serde/logical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/443/diff?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vbW9kLnJz)
 | `99.40% <0.00%> (ø)` | |
   | 

[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643632996



##
File path: cpp/src/arrow/util/future.cc
##
@@ -272,8 +315,8 @@ class ConcreteFutureImpl : public FutureImpl {
 //
 // In fact, it is important not to hold the locks because the callback
 // may be slow or do its own locking on other resources
-for (auto&& callback : callbacks_) {
-  std::move(callback)();
+for (auto& callback_record : callbacks_) {
+  RunOrScheduleCallback(callback_record, /*from_unfinished=*/true);

Review comment:
   On the bright side, if we remove `IF_UNFINISHED` then we can change 
`ShouldSchedule` to a `bool`.








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643632152



##
File path: cpp/src/arrow/util/future.h
##
@@ -202,8 +202,30 @@ enum class FutureState : int8_t { PENDING, SUCCESS, FAILURE };
 
 inline bool IsFutureFinished(FutureState state) { return state != FutureState::PENDING; }
 
+/// \brief Describes whether the callback should be scheduled or run synchronously
+enum ShouldSchedule {
+  /// Always run the callback synchronously (the default)
+  NEVER = 0,

Review comment:
   Hmm, technically the style guide prefers `kAlways` but I see `Always` 
used more often in Arrow.  Although some of the gandiva code uses kAlways. 
(https://google.github.io/styleguide/cppguide.html#Enumerator_Names).  Any 
preference?








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643630671



##
File path: cpp/src/arrow/util/future.cc
##
@@ -272,8 +315,8 @@ class ConcreteFutureImpl : public FutureImpl {
 //
 // In fact, it is important not to hold the locks because the callback
 // may be slow or do its own locking on other resources
-for (auto&& callback : callbacks_) {
-  std::move(callback)();
+for (auto& callback_record : callbacks_) {
+  RunOrScheduleCallback(callback_record, /*from_unfinished=*/true);

Review comment:
   Hmm, `from_unfinished` is supposed to mean "was the callback added when 
the future was unfinished" but I can see how that is vague.  I could just 
remove this option entirely.  The `TryAddCallback` function already gives you 
this capability (it is used in `Transfer`) and so it isn't used anywhere.  Any 
strong opinion?








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643628821



##
File path: cpp/src/arrow/util/future.cc
##
@@ -231,26 +232,68 @@ class ConcreteFutureImpl : public FutureImpl {
 
   void DoMarkFailed() { DoMarkFinishedOrFailed(FutureState::FAILURE); }
 
-  void AddCallback(Callback callback) {
+  void CheckOptions(const CallbackOptions& opts) {
+if (opts.should_schedule != ShouldSchedule::NEVER) {
+  DCHECK_NE(opts.executor, NULL)
+  << "An executor must be specified when adding a callback that might 
schedule";
+}
+  }
+
+  void AddCallback(Callback callback, CallbackOptions opts) {
+CheckOptions(opts);
 std::unique_lock lock(mutex_);
+CallbackRecord callback_record{std::move(callback), opts};
 if (IsFutureFinished(state_)) {
   lock.unlock();
-  std::move(callback)();
+  RunOrScheduleCallback(callback_record, /*from_unfinished=*/false);
 } else {
-  callbacks_.push_back(std::move(callback));
+  callbacks_.push_back(std::move(callback_record));
 }
   }
 
-  bool TryAddCallback(const std::function<Callback()>& callback_factory) {
+  bool TryAddCallback(const std::function<Callback()>& callback_factory,
+                      CallbackOptions opts) {
+CheckOptions(opts);
 std::unique_lock lock(mutex_);
 if (IsFutureFinished(state_)) {
   return false;
 } else {
-  callbacks_.push_back(callback_factory());
+  callbacks_.push_back({callback_factory(), opts});
   return true;
 }
   }
 
+  bool ShouldSchedule(const CallbackRecord& callback_record, bool 
from_unfinished) {
+switch (callback_record.options.should_schedule) {
+  case ShouldSchedule::NEVER:
+return false;
+  case ShouldSchedule::ALWAYS:
+return true;
+  case ShouldSchedule::IF_UNFINISHED:
+return from_unfinished;
+  default:
+DCHECK(false) << "Unrecognized ShouldSchedule option";
+return false;
+}
+  }
+
+  void RunOrScheduleCallback(CallbackRecord& callback_record, bool 
from_unfinished) {
+if (ShouldSchedule(callback_record, from_unfinished)) {
+  // Need to make a copy of this to keep it alive until the callback has a 
chance
+  // to be scheduled.
+      struct CallbackTask {
+        void operator()() { std::move(callback)(*self); }
+
+        Callback callback;
+        std::shared_ptr<FutureImpl> self;

Review comment:
   The weak pointer in the old implementation was used to prevent futures 
from creating a circular reference on themselves (callback references future 
which references callback).
   
   Unfortunately, the weak pointer relied on the future remaining valid until all callbacks had run.  If all callbacks run synchronously this is easy (whoever is calling `MarkFinished` must hold a valid reference until all callbacks finish).  Once we start scheduling callbacks we run into the problem where `MarkFinished` can return before some callbacks have run, and then when those callbacks get scheduled the future has been deleted.
   
   This fix isn't just a change to a strong pointer though (that would 
introduce the circular reference problem again).  Instead of the callback 
itself having a reference to the future I changed it so that the callback took 
the FutureImpl in as an argument (note, this is the internal `FutureImpl` 
callback and not the publicly exposed `Future` callback).  This allowed me to 
avoid the circular reference because the strong pointer is created when the 
callback is being triggered and not when the callback is being added.
   
   Also, the strong pointer is only created if it is a scheduled callback.  Existing performance should remain the same since no strong pointer or `shared_from_this` call is made.








[GitHub] [arrow] westonpace commented on a change in pull request #10258: ARROW-12560: [C++] Investigate utilizing aggressive thread task creation when adding callback to finished future

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10258:
URL: https://github.com/apache/arrow/pull/10258#discussion_r643625949



##
File path: cpp/src/arrow/util/future.cc
##
@@ -231,26 +232,68 @@ class ConcreteFutureImpl : public FutureImpl {
 
   void DoMarkFailed() { DoMarkFinishedOrFailed(FutureState::FAILURE); }
 
-  void AddCallback(Callback callback) {
+  void CheckOptions(const CallbackOptions& opts) {
+if (opts.should_schedule != ShouldSchedule::NEVER) {
+  DCHECK_NE(opts.executor, NULL)
+  << "An executor must be specified when adding a callback that might 
schedule";
+}
+  }
+
+  void AddCallback(Callback callback, CallbackOptions opts) {
+CheckOptions(opts);
 std::unique_lock lock(mutex_);
+CallbackRecord callback_record{std::move(callback), opts};
 if (IsFutureFinished(state_)) {
   lock.unlock();
-  std::move(callback)();
+  RunOrScheduleCallback(callback_record, /*from_unfinished=*/false);
 } else {
-  callbacks_.push_back(std::move(callback));
+  callbacks_.push_back(std::move(callback_record));
 }
   }
 
-  bool TryAddCallback(const std::function<Callback()>& callback_factory) {
+  bool TryAddCallback(const std::function<Callback()>& callback_factory,
+                      CallbackOptions opts) {
+CheckOptions(opts);
 std::unique_lock lock(mutex_);
 if (IsFutureFinished(state_)) {
   return false;
 } else {
-  callbacks_.push_back(callback_factory());
+  callbacks_.push_back({callback_factory(), opts});
   return true;
 }
   }
 
+  bool ShouldSchedule(const CallbackRecord& callback_record, bool 
from_unfinished) {
+switch (callback_record.options.should_schedule) {
+  case ShouldSchedule::NEVER:
+return false;
+  case ShouldSchedule::ALWAYS:
+return true;
+  case ShouldSchedule::IF_UNFINISHED:
+return from_unfinished;
+  default:
+DCHECK(false) << "Unrecognized ShouldSchedule option";
+return false;
+}
+  }
+
+  void RunOrScheduleCallback(CallbackRecord& callback_record, bool 
from_unfinished) {
+if (ShouldSchedule(callback_record, from_unfinished)) {
+  // Need to make a copy of this to keep it alive until the callback has a 
chance

Review comment:
   The copy is a few lines down when we call `shared_from_this`.  I'll move 
the comment and make it more explicit.








[GitHub] [arrow] westonpace commented on a change in pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10421:
URL: https://github.com/apache/arrow/pull/10421#discussion_r643623285



##
File path: cpp/src/arrow/util/thread_pool.h
##
@@ -288,6 +288,10 @@ class ARROW_EXPORT ThreadPool : public Executor {
   // tasks are finished.
   Status Shutdown(bool wait = true);
 
+  // Waits for the thread pool to reach a quiet state where all workers are

Review comment:
   Fixed.








[GitHub] [arrow] westonpace commented on a change in pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10421:
URL: https://github.com/apache/arrow/pull/10421#discussion_r643622911



##
File path: cpp/src/arrow/util/thread_pool_benchmark.cc
##
@@ -103,6 +103,52 @@ static void ThreadPoolSpawn(benchmark::State& state) {  // NOLINT non-const reference
   state.SetItemsProcessed(state.iterations() * nspawns);
 }
 
+// The ThreadPoolSpawn benchmark submits all tasks from a single outside thread.  This
+// ends up causing a worst-case scenario for the current simple thread pool.  All threads
+// compete over the task queue mutex trying to grab the next task off the queue and the
+// result is a large amount of contention.
+//
+// By spreading out the scheduling across multiple threads we can help reduce that
+// contention.  This benchmark demonstrates the ideal case where we are able to perfectly
+// partition the scheduling across the available threads.
+//
+// Both situations could be encountered (the thread pool can't choose how it is used) but
+// by having both benchmarks we can express the importance of distributed scheduling.
+static void ThreadPoolIdealSpawn(benchmark::State& state) {  // NOLINT non-const reference
+  const auto nthreads = static_cast<int>(state.range(0));
+  const auto workload_size = static_cast<int32_t>(state.range(1));
+
+  Workload workload(workload_size);
+
+  // Spawn enough tasks to make the pool start up overhead negligible
+  const int32_t nspawns = 2 / workload_size + 1;
+  const int32_t nspawns_per_thread = nspawns / nthreads;
+
+  for (auto _ : state) {
+state.PauseTiming();
+    std::shared_ptr<ThreadPool> pool;
+    pool = *ThreadPool::Make(nthreads);
+state.ResumeTiming();
+
+for (int32_t i = 0; i < nthreads; ++i) {
+  // Pass the task by reference to avoid copying it around
+      ABORT_NOT_OK(pool->Spawn([&pool, &workload, nspawns_per_thread] {
+for (int32_t j = 0; j < nspawns_per_thread; j++) {
+  ABORT_NOT_OK(pool->Spawn(std::ref(workload)));
+}
+  }));
+}
+
+// Wait for all tasks to finish
+pool->WaitForIdle();

Review comment:
   At this point we cannot know that all the tasks have been spawned.  If I 
call `Shutdown(wait=true)` then a slow spawner will fail because `SpawnReal` 
returns `Status::Invalid` if `please_shutdown_` is `true`.








[GitHub] [arrow] westonpace edited a comment on pull request #10421: ARROW-12903: [C++] Create new thread pool benchmark demonstrating the "scheduling" bottleneck

2021-06-01 Thread GitBox


westonpace edited a comment on pull request #10421:
URL: https://github.com/apache/arrow/pull/10421#issuecomment-852682551


   Just adding the benchmark...
   
   ```
   ThreadPoolSpawn/threads:1/task_cost:1000/real_time        104576026 ns  39527736 ns    7 items_per_second=1.91249M/s
   ThreadPoolSpawn/threads:2/task_cost:1000/real_time         81736943 ns  69631881 ns    8 items_per_second=2.44689M/s
   ThreadPoolSpawn/threads:4/task_cost:1000/real_time        395577537 ns 337000146 ns    2 items_per_second=505.592k/s
   ThreadPoolSpawn/threads:8/task_cost:1000/real_time        326650393 ns 290524204 ns    2 items_per_second=612.278k/s
   ThreadPoolSpawn/threads:1/task_cost:1/real_time            81345849 ns   2355243 ns    8 items_per_second=245.876k/s
   ThreadPoolSpawn/threads:2/task_cost:1/real_time            43399109 ns   2694481 ns   16 items_per_second=460.862k/s
   ThreadPoolSpawn/threads:4/task_cost:1/real_time            22975768 ns   3457033 ns   31 items_per_second=870.526k/s
   ThreadPoolSpawn/threads:8/task_cost:1/real_time            21717680 ns  14331911 ns   37 items_per_second=920.955k/s
   ThreadPoolSpawn/threads:1/task_cost:10/real_time           81517332 ns    270745 ns    8 items_per_second=24.5469k/s
   ThreadPoolSpawn/threads:2/task_cost:10/real_time           41615534 ns    281452 ns   17 items_per_second=48.083k/s
   ThreadPoolSpawn/threads:4/task_cost:10/real_time           21324149 ns    316989 ns   33 items_per_second=93.8373k/s
   ThreadPoolSpawn/threads:8/task_cost:10/real_time           12702954 ns    443910 ns   55 items_per_second=157.522k/s
   ThreadPoolIdealSpawn/threads:1/task_cost:1000/real_time   107253606 ns     91704 ns    6 items_per_second=1.86475M/s
   ThreadPoolIdealSpawn/threads:2/task_cost:1000/real_time    88176114 ns    117639 ns    8 items_per_second=2.2682M/s
   ThreadPoolIdealSpawn/threads:4/task_cost:1000/real_time    86717904 ns    107266 ns    8 items_per_second=2.30634M/s
   ThreadPoolIdealSpawn/threads:8/task_cost:1000/real_time    98762117 ns    209733 ns    7 items_per_second=2.02508M/s
   ThreadPoolIdealSpawn/threads:1/task_cost:1/real_time       84566727 ns     64819 ns    8 items_per_second=236.511k/s
   ThreadPoolIdealSpawn/threads:2/task_cost:1/real_time       46885981 ns     70276 ns   15 items_per_second=426.588k/s
   ThreadPoolIdealSpawn/threads:4/task_cost:1/real_time       27807860 ns     94742 ns   26 items_per_second=719.257k/s
   ThreadPoolIdealSpawn/threads:8/task_cost:1/real_time       16994645 ns    148328 ns   41 items_per_second=1.1769M/s
   ThreadPoolIdealSpawn/threads:1/task_cost:10/real_time      81094164 ns     52531 ns    8 items_per_second=24.675k/s
   ThreadPoolIdealSpawn/threads:2/task_cost:10/real_time      42196762 ns     85439 ns   16 items_per_second=47.4207k/s
   ThreadPoolIdealSpawn/threads:4/task_cost:10/real_time      22380385 ns    120278 ns   32 items_per_second=89.4086k/s
   ThreadPoolIdealSpawn/threads:8/task_cost:10/real_time      12873517 ns    180938 ns   56 items_per_second=155.435k/s
   ```
   
   Early results from work-stealing (note, impl:1 is the single queue implementation, it benefits quite a bit from the generalized refactor (#10401) which shrinks the critical section)...
   
   ```
   ThreadPoolSpawn/impl:1/threads:1/task_cost:1000/real_time 109095290 ns  45507459 ns    6 items_per_second=1.83327M/s
   ThreadPoolSpawn/impl:1/threads:2/task_cost:1000/real_time  84445897 ns  73408467 ns    8 items_per_second=2.36839M/s
   ThreadPoolSpawn/impl:1/threads:4/task_cost:1000/real_time 384508473 ns     33388 ns    2 items_per_second=520.147k/s
   ThreadPoolSpawn/impl:1/threads:8/task_cost:1000/real_time 340298964 ns 310431590 ns    2 items_per_second=587.721k/s
   ThreadPoolSpawn/impl:1/threads:1/task_cost:1/real_time     84889601 ns   2927850 ns    8 items_per_second=235.612k/s
   ThreadPoolSpawn/impl:1/threads:2/task_cost:1/real_time     46962168 ns   4429182 ns   16 items_per_second=425.896k/s
   ThreadPoolSpawn/impl:1/threads:4/task_cost:1/real_time     27891032 ns   5498450 ns   24 items_per_second=717.112k/s
   ThreadPoolSpawn/impl:1/threads:8/task_cost:1/real_time     23484115 ns  15697174 ns   29 items_per_second=851.682k/s
   ThreadPoolSpawn/impl:1/threads:1/task_cost:10/real_time    86121178 ns    466594 ns    8 items_per_second=23.2347k/s
   ThreadPoolSpawn/impl:1/threads:2/task_cost:10/real_time    47425209 ns    563522 ns   14 items_per_second=42.1928k/s
   ```
   


[GitHub] [arrow] liyafan82 commented on a change in pull request #10423: ARROW-12907: [Java] Fix memory leak on deserialization errors

2021-06-01 Thread GitBox


liyafan82 commented on a change in pull request #10423:
URL: https://github.com/apache/arrow/pull/10423#discussion_r643606492



##
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java
##
@@ -197,12 +199,30 @@ public void testSerializeRecordBatchV5() throws IOException {
 IpcOption option = new IpcOption(false, MetadataVersion.V5);
 ByteArrayOutputStream out = new ByteArrayOutputStream();
 MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), batch, option);
+validityb.close();
+valuesb.close();
+batch.close();
+
+{
+  ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());

Review comment:
   It would be nice to wrap this into a try-with-resources block.
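For reference, a minimal sketch of the suggested pattern (the class and method names here are illustrative, not the actual test's): a stream opened in a try-with-resources header is closed automatically on both the normal and the exceptional path.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class TryWithResourcesDemo {
    // Reads the first byte of an in-memory stream; the stream's close() runs
    // automatically when the try block exits, even if an exception is thrown.
    public static int readFirstByte(byte[] data) {
        try (ByteArrayInputStream in = new ByteArrayInputStream(data)) {
            return in.read();  // -1 when the input is empty
        } catch (IOException e) {
            // close() declares IOException; it cannot occur for an in-memory stream
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFirstByte(new byte[] {42}));  // prints "42"
    }
}
```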








[GitHub] [arrow] liyafan82 commented on a change in pull request #10423: ARROW-12907: [Java] Fix memory leak on deserialization errors

2021-06-01 Thread GitBox


liyafan82 commented on a change in pull request #10423:
URL: https://github.com/apache/arrow/pull/10423#discussion_r643604065



##
File path: java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java
##
@@ -723,8 +723,13 @@ public static MessageMetadataResult readMessage(ReadChannel in) throws IOExcepti
   public static ArrowBuf readMessageBody(ReadChannel in, long bodyLength,
       BufferAllocator allocator) throws IOException {
     ArrowBuf bodyBuffer = allocator.buffer(bodyLength);

Review comment:
   Nice catch!
   A more conventional way would be to wrap the statements in a try-with-resources block?
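The leak being fixed can also be avoided with an explicit close-on-error; a hedged sketch (a stand-in `Buffer` class rather than Arrow's `ArrowBuf`, and not `MessageSerializer`'s real signature):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseOnErrorDemo {
    // Minimal stand-in for an allocated buffer: only tracks whether close() ran.
    public static class Buffer implements AutoCloseable {
        public final AtomicBoolean closed = new AtomicBoolean(false);
        @Override public void close() { closed.set(true); }
    }

    // If filling the freshly allocated buffer fails, close it before
    // propagating the error so the allocation is not leaked.
    public static void fillOrClose(Buffer body, boolean failDuringRead) {
        try {
            if (failDuringRead) {
                throw new RuntimeException("truncated message body");
            }
        } catch (RuntimeException e) {
            body.close();  // release the allocation on the error path
            throw e;
        }
    }

    public static void main(String[] args) {
        Buffer body = new Buffer();
        try {
            fillOrClose(body, true);
        } catch (RuntimeException expected) {
        }
        System.out.println(body.closed.get());  // prints "true"
    }
}
```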








[GitHub] [arrow-rs] nevi-me commented on a change in pull request #384: Implement faster arrow array reader

2021-06-01 Thread GitBox


nevi-me commented on a change in pull request #384:
URL: https://github.com/apache/arrow-rs/pull/384#discussion_r643577679



##
File path: parquet/src/arrow/arrow_array_reader.rs
##
@@ -0,0 +1,1394 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, collections::VecDeque, marker::PhantomData};
+use std::{rc::Rc, cell::RefCell};
+use arrow::{array::{ArrayRef, Int16Array}, buffer::MutableBuffer, datatypes::{DataType as ArrowType, ToByteSlice}};
+use crate::{column::page::{Page, PageIterator}, memory::ByteBufferPtr, schema::types::{ColumnDescPtr, ColumnDescriptor}};
+use crate::arrow::schema::parquet_to_arrow_field;
+use crate::errors::{ParquetError, Result};
+use crate::basic::Encoding;
+use super::array_reader::ArrayReader;
+
+struct UnzipIter<Source, Target, State>
+{
+    shared_state: Rc<RefCell<State>>,
+    select_item_buffer: fn(&mut State) -> &mut VecDeque<Target>,
+    consume_source_item: fn(source_item: Source, state: &mut State) -> Target,
+}
+
+impl<Source, Target, State> UnzipIter<Source, Target, State>
+{
+    fn new(
+        shared_state: Rc<RefCell<State>>, 
+        item_buffer_selector: fn(&mut State) -> &mut VecDeque<Target>, 
+        source_item_consumer: fn(source_item: Source, state: &mut State) -> Target
+    ) -> Self {
+        Self {
+            shared_state,
+            select_item_buffer: item_buffer_selector,
+            consume_source_item: source_item_consumer,
+        }
+    }
+}
+
+trait UnzipIterState<T> {
+    type SourceIter: Iterator<Item = T>;
+    fn source_iter(&mut self) -> &mut Self::SourceIter;
+}
+
+impl<Source, Target, State: UnzipIterState<Source>> Iterator for UnzipIter<Source, Target, State> {
+    type Item = Target;
+
+    fn next(&mut self) -> Option<Self::Item> {
+        let mut inner = self.shared_state.borrow_mut();
+        // try to get one from the stored data
+        (self.select_item_buffer)(&mut *inner).pop_front().or_else(|| 
+            // nothing stored, we need a new element.
+            inner.source_iter().next().map(|s| {
+                (self.consume_source_item)(s, &mut inner)
+            }))
+    }
+}
+
+struct PageBufferUnzipIterState<V, L, It> {
+    iter: It,
+    value_iter_buffer: VecDeque<V>,
+    def_level_iter_buffer: VecDeque<L>,
+    rep_level_iter_buffer: VecDeque<L>,
+}
+
+impl<V, L, It: Iterator<Item = (V, L, L)>> UnzipIterState<(V, L, L)> for PageBufferUnzipIterState<V, L, It> {
+    type SourceIter = It;
+
+    #[inline]
+    fn source_iter(&mut self) -> &mut Self::SourceIter {
+        &mut self.iter
+    }
+}
+
+fn unzip_iter<V, L, It: Iterator<Item = (V, L, L)>>(it: It) -> (
+    UnzipIter<(V, L, L), V, PageBufferUnzipIterState<V, L, It>>, 
+    UnzipIter<(V, L, L), L, PageBufferUnzipIterState<V, L, It>>,
+    UnzipIter<(V, L, L), L, PageBufferUnzipIterState<V, L, It>>,
+) {
+    let shared_data = Rc::new(RefCell::new(PageBufferUnzipIterState { 
+        iter: it,
+        value_iter_buffer: VecDeque::new(),
+        def_level_iter_buffer: VecDeque::new(),
+        rep_level_iter_buffer: VecDeque::new(),
+    }));
+
+    let value_iter = UnzipIter::new(
+        shared_data.clone(),
+        |state| &mut state.value_iter_buffer,
+        |(v, d, r), state| { 
+            state.def_level_iter_buffer.push_back(d); 
+            state.rep_level_iter_buffer.push_back(r);
+            v
+        }, 
+    );
+
+    let def_level_iter = UnzipIter::new(
+        shared_data.clone(),
+        |state| &mut state.def_level_iter_buffer,
+        |(v, d, r), state| {
+            state.value_iter_buffer.push_back(v);
+            state.rep_level_iter_buffer.push_back(r);
+            d
+        }, 
+    );
+
+    let rep_level_iter = UnzipIter::new(
+        shared_data,
+        |state| &mut state.rep_level_iter_buffer,
+        |(v, d, r), state| {
+            state.value_iter_buffer.push_back(v);
+            state.def_level_iter_buffer.push_back(d);
+            r
+        }, 
+    );
+
+    (value_iter, def_level_iter, rep_level_iter)
+}
+
+pub trait ArrayConverter {
+    fn convert_value_bytes(&self, value_decoder: &mut impl ValueDecoder, num_values: usize) -> Result<ArrayData>;
+}
+
+pub struct ArrowArrayReader<'a, C: ArrayConverter + 'a> {
+    column_desc: ColumnDescPtr,
+    data_type: ArrowType,
+    def_level_decoder: Box<dyn ValueDecoder + 'a>,
+    rep_level_decoder: Box<dyn ValueDecoder + 'a>,
+    value_decoder: Box<dyn ValueDecoder + 'a>,
+    last_def_levels: Option<Int16Array>,
+    last_rep_levels: Option<Int16Array>,
+    array_converter: C,
+}
+
+pub(crate) struct ColumnChunkContext {
+    dictionary_values: Option<Vec<ArrayRef>>,
+}
+
+impl ColumnChunkContext {
+    fn new() -> Self {
+        Self {
+

[GitHub] [arrow] github-actions[bot] commented on pull request #10433: ARROW-12911: [Python] Export scalar aggregate options to pc.sum

2021-06-01 Thread GitBox


github-actions[bot] commented on pull request #10433:
URL: https://github.com/apache/arrow/pull/10433#issuecomment-852644263


   https://issues.apache.org/jira/browse/ARROW-12911






[GitHub] [arrow] cyb70289 opened a new pull request #10433: ARROW-12911: [Python] Export scalar aggregate options to pc.sum

2021-06-01 Thread GitBox


cyb70289 opened a new pull request #10433:
URL: https://github.com/apache/arrow/pull/10433


   






[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643590723



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/builder.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/time.h"
+#include "arrow/vendored/datetime.h"
+
+namespace arrow {
+
+namespace compute {
+namespace internal {
+
+using applicator::ScalarUnaryNotNull;
+using applicator::SimpleUnary;
+using arrow_vendored::date::days;
+using arrow_vendored::date::floor;
+using arrow_vendored::date::hh_mm_ss;
+using arrow_vendored::date::sys_days;
+using arrow_vendored::date::sys_time;
+using arrow_vendored::date::trunc;
+using arrow_vendored::date::weekday;
+using arrow_vendored::date::weeks;
+using arrow_vendored::date::year_month_day;
+using arrow_vendored::date::years;
+using arrow_vendored::date::literals::dec;
+using arrow_vendored::date::literals::jan;
+using arrow_vendored::date::literals::last;
+using arrow_vendored::date::literals::mon;
+using arrow_vendored::date::literals::thu;
+
+// Based on ScalarUnaryNotNullStateful. Adds timezone awareness.
+template <typename OutType, typename Op>
+struct ScalarUnaryStatefulTemporal {
+  using ThisType = ScalarUnaryStatefulTemporal<OutType, Op>;
+  using OutValue = typename GetOutputType<OutType>::T;
+
+  Op op;
+  explicit ScalarUnaryStatefulTemporal(Op op) : op(std::move(op)) {}
+
+  template <typename Type, typename Enable = void>
+  struct ArrayExec {
+    static Status Exec(const ThisType& functor, KernelContext* ctx, const ArrayData& arg0,
+                       Datum* out) {
+      const std::string timezone =
+          std::static_pointer_cast<const TimestampType>(arg0.type)->timezone();
+      Status st = Status::OK();
+      ArrayData* out_arr = out->mutable_array();
+      auto out_data = out_arr->GetMutableValues<OutValue>(1);
+
+      if (timezone.empty()) {
+        VisitArrayValuesInline<Int64Type>(
+            arg0,
+            [&](int64_t v) {
+              *out_data++ = functor.op.template Call<OutValue>(ctx, v, &st);
+            },
+            [&]() {
+              // null
+              ++out_data;
+            });
+      } else {
+        st = Status::Invalid("Timezone aware timestamps not supported. Timezone found: ",
+                             timezone);
+      }
+      return st;
+    }
+  };
+
+  Status Scalar(KernelContext* ctx, const Scalar& arg0, Datum* out) {
+    const std::string timezone =
+        std::static_pointer_cast<const TimestampType>(arg0.type)->timezone();
+    Status st = Status::OK();
+    if (timezone.empty()) {
+      if (arg0.is_valid) {
+        int64_t arg0_val = UnboxScalar<TimestampType>::Unbox(arg0);
+        BoxScalar<OutType>::Box(this->op.template Call<OutValue>(ctx, arg0_val, &st),
+                                out->scalar().get());
+      }
+    } else {
+      st = Status::Invalid("Timezone aware timestamps not supported. Timezone found: ",
+                           timezone);
+    }
+    return st;
+  }
+
+  Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    if (batch[0].kind() == Datum::ARRAY) {
+      return ArrayExec<OutType>::Exec(*this, ctx, *batch[0].array(), out);
+    } else {
+      return Scalar(ctx, *batch[0].scalar(), out);
+    }
+  }
+};
+
+template <typename OutType, typename Op>
+struct ScalarUnaryTemporal {
+  using OutValue = typename GetOutputType<OutType>::T;
+
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    // Seed kernel with dummy state
+    ScalarUnaryStatefulTemporal<OutType, Op> kernel({});
+    return kernel.Exec(ctx, batch, out);
+  }
+};
+
+// ----------------------------------------------------------------------
+// Extract year from timestamp
+
+template <typename Duration>
+struct Year {
+  template <typename T>
+  static T Call(KernelContext*, int64_t arg, Status*) {
+    return static_cast<T>(static_cast<int32_t>(
+        year_month_day(floor<days>(sys_time<Duration>(Duration{arg}))).year()));
+  }
+};
+
+// ----------------------------------------------------------------------
+// Extract month from timestamp
+
+template <typename Duration>
+struct Month {
+  template <typename T>
+  static T Call(KernelContext*, int64_t arg, Status*) {
+    return static_cast<T>(static_cast<uint32_t>(
+        year_month_day(floor<days>(sys_time<Duration>(Duration{arg}))).month()));
+  }
+};
+
+// ----------------------------------------------------------------------
+// 

[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643590630



##
File path: docs/source/cpp/compute.rst
##
@@ -637,6 +637,54 @@ String extraction
   e.g. 'letter' and 'digit' for the regular expression
   ``(?P<letter>[ab])(?P<digit>\\d)``.
 
+Temporal component extraction
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions extract datetime components (year, month, day, etc) from timestamp type.
+Note: timezone information is currently ignored if present.
+
++----------------+--------+---------------+-----------------+--------+
+| Function name  | Arity  | Input types   | Output type     | Notes  |
++================+========+===============+=================+========+
+| year           | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| month          | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| day            | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| day_of_week    | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| day_of_year    | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| iso_year       | Unary  | Temporal      | Numeric         | \(1)   |
++----------------+--------+---------------+-----------------+--------+
+| iso_week       | Unary  | Temporal      | Numeric         | \(1)   |
++----------------+--------+---------------+-----------------+--------+
+| iso_calendar   | Unary  | Temporal      | Scalar Struct   | \(2)   |
++----------------+--------+---------------+-----------------+--------+
+| quarter        | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| hour           | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| minute         | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| second         | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| millisecond    | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| microsecond    | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| nanosecond     | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+| subsecond      | Unary  | Temporal      | Numeric         |        |
++----------------+--------+---------------+-----------------+--------+
+
+* \(1) The ISO 8601 definition for week 01 is the week with the first Thursday
+  of the Gregorian year (i.e. of January) in it.
+  .. _Wikipedia ISO Week date: https://en.wikipedia.org/wiki/ISO_week_date#First_week

Review comment:
   Changed.
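As a cross-check of the week-01 rule quoted above, a hedged sketch with `java.time` (not Arrow's API; the class and method names here are illustrative):

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;

public class IsoWeekDemo {
    // ISO week number: week 01 is the week containing the first Thursday of January.
    public static int isoWeek(int year, int month, int day) {
        return LocalDate.of(year, month, day).get(WeekFields.ISO.weekOfWeekBasedYear());
    }

    public static int isoYear(int year, int month, int day) {
        return LocalDate.of(year, month, day).get(WeekFields.ISO.weekBasedYear());
    }

    public static void main(String[] args) {
        // 1970-01-01 was a Thursday, so it lies in week 1 of ISO year 1970.
        System.out.println(isoWeek(1970, 1, 1) + "/" + isoYear(1970, 1, 1));  // prints "1/1970"
        // 2021-01-01 was a Friday; the first Thursday of 2021 is Jan 7,
        // so Jan 1 still belongs to week 53 of ISO year 2020.
        System.out.println(isoWeek(2021, 1, 1) + "/" + isoYear(2021, 1, 1));  // prints "53/2020"
    }
}
```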








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643590434



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/formatting.h"
+
+namespace arrow {
+
+using internal::StringFormatter;
+
+class ScalarTemporalTest : public ::testing::Test {};
+
+namespace compute {
+
+TEST(ScalarTemporalTest, TestSimpleTemporalComponentExtraction) {
+  const char* times =
+      R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+          "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", null])";
+  auto unit = timestamp(TimeUnit::NANO);
+  auto timestamps = ArrayFromJSON(unit, times);
+  auto iso_calendar_type =
+      struct_({field("iso_year", int64()), field("iso_week", int64()),
+               field("weekday", int64())});
+
+  auto year = "[1970, 2000, 1899, 2033, null]";
+  auto month = "[1, 2, 1, 5, null]";
+  auto day = "[1, 29, 1, 18, null]";
+  auto day_of_week = "[4, 2, 7, 3, null]";
+  auto day_of_year = "[1, 60, 1, 138, null]";
+  auto iso_year = "[1970, 2000, 1899, 2033, null]";
+  auto iso_week = "[1, 9, 52, 20, null]";
+  auto iso_calendar = ArrayFromJSON(iso_calendar_type,
+                                    R"([{"iso_year": 1970, "iso_week": 1, "weekday": 4},

Review comment:
   `day_of_week`
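For context on the field name, the test data above uses ISO weekday numbering (Monday = 1 through Sunday = 7); a hedged cross-check with `java.time` rather than Arrow's API:

```java
import java.time.LocalDate;

public class DayOfWeekDemo {
    // ISO day-of-week numbering: Monday = 1 ... Sunday = 7.
    public static int isoDayOfWeek(int year, int month, int day) {
        return LocalDate.of(year, month, day).getDayOfWeek().getValue();
    }

    public static void main(String[] args) {
        System.out.println(isoDayOfWeek(1970, 1, 1));  // prints "4" (Thursday)
        System.out.println(isoDayOfWeek(1899, 1, 1));  // prints "7" (Sunday)
    }
}
```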








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643590355



##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -462,5 +462,177 @@ ARROW_EXPORT
 Result<Datum> FillNull(const Datum& values, const Datum& fill_value,
                        ExecContext* ctx = NULLPTR);
 
+/// \brief Year returns year value for each element of `values`
+///
+/// \param[in] values input to extract year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Year(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Month returns month value for each element of `values`
+///
+/// \param[in] values input to extract month from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Month(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Day returns day value for each element of `values`
+///
+/// \param[in] values input to extract day from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Day(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief DayOfWeek returns day of the week value for each element of `values`
+///
+/// \param[in] values input to extract day of the week from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT Result<Datum> DayOfWeek(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief DayOfYear returns day of year value for each element of `values`
+///
+/// \param[in] values input to extract day of year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT Result<Datum> DayOfYear(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief ISOYear returns ISO year value for each element of `values`
+///
+/// \param[in] values input to extract ISO year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> ISOYear(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief ISOWeek returns ISO week of year value for each element of `values`
+///
+/// \param[in] values input to extract ISO week of year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT Result<Datum> ISOWeek(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief ISOCalendar returns a (year, ISO week, weekday) struct for each element of
+/// `values`
+///
+/// \param[in] values input to extract the ISO calendar struct from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT Result<Datum> ISOCalendar(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Quarter returns quarter of year value for each element of `values`

Review comment:
   Done. :)








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643590073



##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -462,5 +462,177 @@ ARROW_EXPORT
 Result<Datum> FillNull(const Datum& values, const Datum& fill_value,
                        ExecContext* ctx = NULLPTR);
 
+/// \brief Year returns year value for each element of `values`
+///
+/// \param[in] values input to extract year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Year(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Month returns month value for each element of `values`

Review comment:
   Done.

##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -462,5 +462,177 @@ ARROW_EXPORT
 Result<Datum> FillNull(const Datum& values, const Datum& fill_value,
                        ExecContext* ctx = NULLPTR);
 
+/// \brief Year returns year value for each element of `values`
+///
+/// \param[in] values input to extract year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Year(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Month returns month value for each element of `values`
+///
+/// \param[in] values input to extract month from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Month(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Day returns day value for each element of `values`

Review comment:
   Done.

##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -462,5 +462,177 @@ ARROW_EXPORT
 Result FillNull(const Datum& values, const Datum& fill_value,
ExecContext* ctx = NULLPTR);
 
+/// \brief Year returns year value for each element of `values`
+///
+/// \param[in] values input to extract year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Year(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Month returns month value for each element of `values`
+///
+/// \param[in] values input to extract month from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Month(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Day returns day value for each element of `values`
+///
+/// \param[in] values input to extract day from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Day(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief DayOfWeek returns day of the week value for each element of `values`

Review comment:
   Done.

##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -462,5 +462,177 @@ ARROW_EXPORT
 Result<Datum> FillNull(const Datum& values, const Datum& fill_value,
                        ExecContext* ctx = NULLPTR);
 
+/// \brief Year returns year value for each element of `values`
+///
+/// \param[in] values input to extract year from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Year(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Month returns month value for each element of `values`
+///
+/// \param[in] values input to extract month from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Month(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief Day returns day value for each element of `values`
+///
+/// \param[in] values input to extract day from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT
+Result<Datum> Day(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief DayOfWeek returns day of the week value for each element of `values`
+///
+/// \param[in] values input to extract day of the week from
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since 4.0.0
+/// \note API not yet finalized
+ARROW_EXPORT Result<Datum> DayOfWeek(const Datum& values, ExecContext* ctx = NULLPTR);
+
+/// \brief DayOfYear returns day of year value for each element of `values`

Review comment:
   Done.

##
File path: 

[GitHub] [arrow-rs] nevi-me commented on a change in pull request #386: add more tests for window::shift and handle boundary cases

2021-06-01 Thread GitBox


nevi-me commented on a change in pull request #386:
URL: https://github.com/apache/arrow-rs/pull/386#discussion_r643581616



##
File path: arrow/src/compute/kernels/window.rs
##
@@ -33,56 +32,120 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result}
 /// use arrow::compute::shift;
 ///
 /// let a: Int32Array = vec![Some(1), None, Some(4)].into();
+///
 /// // shift array 1 element to the right
 /// let res = shift(&a, 1).unwrap();
 /// let expected: Int32Array = vec![None, Some(1), None].into();
-/// assert_eq!(res.as_ref(), &expected)
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 1 element to the left
+/// let res = shift(&a, -1).unwrap();
+/// let expected: Int32Array = vec![None, Some(4), None].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 0 elements, although not recommended
+/// let res = shift(&a, 0).unwrap();
+/// let expected: Int32Array = vec![Some(1), None, Some(4)].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 3 elements to the right
+/// let res = shift(&a, 3).unwrap();
+/// let expected: Int32Array = vec![None, None, None].into();
+/// assert_eq!(res.as_ref(), &expected);
 /// ```
 pub fn shift<T>(values: &PrimitiveArray<T>, offset: i64) -> Result<ArrayRef>
 where
     T: ArrowPrimitiveType,
 {
-    // Compute slice
-    let slice_offset = clamp(-offset, 0, values.len() as i64) as usize;
-    let length = values.len() - abs(offset) as usize;
-    let slice = values.slice(slice_offset, length);
-
-    // Generate array with remaining `null` items
-    let nulls = abs(offset as i64) as usize;
+    let value_len = values.len() as i64;
+    if offset == 0 {
+        Ok(values.slice(0, values.len()))

Review comment:
   Same comment as #388 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-rs] nevi-me commented on a change in pull request #388: window::shift to work for all array types

2021-06-01 Thread GitBox


nevi-me commented on a change in pull request #388:
URL: https://github.com/apache/arrow-rs/pull/388#discussion_r643581161



##
File path: arrow/src/compute/kernels/window.rs
##
@@ -33,56 +32,161 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result}
 /// use arrow::compute::shift;
 ///
 /// let a: Int32Array = vec![Some(1), None, Some(4)].into();
+///
 /// // shift array 1 element to the right
 /// let res = shift(&a, 1).unwrap();
 /// let expected: Int32Array = vec![None, Some(1), None].into();
-/// assert_eq!(res.as_ref(), &expected)
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 1 element to the left
+/// let res = shift(&a, -1).unwrap();
+/// let expected: Int32Array = vec![None, Some(4), None].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 0 elements, although not recommended
+/// let res = shift(&a, 0).unwrap();
+/// let expected: Int32Array = vec![Some(1), None, Some(4)].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 3 elements to the right
+/// let res = shift(&a, 3).unwrap();
+/// let expected: Int32Array = vec![None, None, None].into();
+/// assert_eq!(res.as_ref(), &expected);
 /// ```
-pub fn shift<T>(values: &PrimitiveArray<T>, offset: i64) -> Result<ArrayRef>
-where
-    T: ArrowPrimitiveType,
-{
-    // Compute slice
-    let slice_offset = clamp(-offset, 0, values.len() as i64) as usize;
-    let length = values.len() - abs(offset) as usize;
-    let slice = values.slice(slice_offset, length);
-
-    // Generate array with remaining `null` items
-    let nulls = abs(offset as i64) as usize;
-
-    let null_arr = new_null_array(&T::DATA_TYPE, nulls);
-
-    // Concatenate both arrays, add nulls after if shift > 0 else before
-    if offset > 0 {
-        concat(&[null_arr.as_ref(), slice.as_ref()])
+pub fn shift(array: &dyn Array, offset: i64) -> Result<ArrayRef> {

Review comment:
   This looks great! Could you please add tests for struct and list types?

##
File path: arrow/src/compute/kernels/window.rs
##
@@ -33,56 +32,161 @@ use crate::{array::PrimitiveArray, datatypes::ArrowPrimitiveType, error::Result}
 /// use arrow::compute::shift;
 ///
 /// let a: Int32Array = vec![Some(1), None, Some(4)].into();
+///
 /// // shift array 1 element to the right
 /// let res = shift(&a, 1).unwrap();
 /// let expected: Int32Array = vec![None, Some(1), None].into();
-/// assert_eq!(res.as_ref(), &expected)
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 1 element to the left
+/// let res = shift(&a, -1).unwrap();
+/// let expected: Int32Array = vec![None, Some(4), None].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 0 elements, although not recommended
+/// let res = shift(&a, 0).unwrap();
+/// let expected: Int32Array = vec![Some(1), None, Some(4)].into();
+/// assert_eq!(res.as_ref(), &expected);
+///
+/// // shift array 3 elements to the right
+/// let res = shift(&a, 3).unwrap();
+/// let expected: Int32Array = vec![None, None, None].into();
+/// assert_eq!(res.as_ref(), &expected);
 /// ```
-pub fn shift<T>(values: &PrimitiveArray<T>, offset: i64) -> Result<ArrayRef>
-where
-    T: ArrowPrimitiveType,
-{
-    // Compute slice
-    let slice_offset = clamp(-offset, 0, values.len() as i64) as usize;
-    let length = values.len() - abs(offset) as usize;
-    let slice = values.slice(slice_offset, length);
-
-    // Generate array with remaining `null` items
-    let nulls = abs(offset as i64) as usize;
-
-    let null_arr = new_null_array(&T::DATA_TYPE, nulls);
-
-    // Concatenate both arrays, add nulls after if shift > 0 else before
-    if offset > 0 {
-        concat(&[null_arr.as_ref(), slice.as_ref()])
+pub fn shift(array: &dyn Array, offset: i64) -> Result<ArrayRef> {
+    let value_len = array.len() as i64;
+    if offset == 0 {
+        Ok(array.slice(0, array.len()))

Review comment:
   If offset == 0, why not return `array.clone()`? `array.slice` does incur 
small overhead as it might have to descend into the array children for nested 
types (see #389 )
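The boundary semantics under discussion (|offset| clamped to the array length, vacated slots filled with nulls, zero offset a no-op) can be modeled on a plain `Vec<Option<T>>`. A stdlib-only sketch with a hypothetical name, not the arrow-rs kernel:

```rust
fn shift_vec<T: Clone>(values: &[Option<T>], offset: i64) -> Vec<Option<T>> {
    // Clamp |offset| to the array length so over-shifting yields all nulls.
    let n = offset.unsigned_abs().min(values.len() as u64) as usize;
    let mut out = Vec::with_capacity(values.len());
    if offset >= 0 {
        out.extend(std::iter::repeat(None).take(n)); // nulls in front
        out.extend(values[..values.len() - n].iter().cloned());
    } else {
        out.extend(values[n..].iter().cloned());
        out.extend(std::iter::repeat(None).take(n)); // nulls behind
    }
    out
}

fn main() {
    let a = vec![Some(1), None, Some(4)];
    assert_eq!(shift_vec(&a, 1), vec![None, Some(1), None]);
    assert_eq!(shift_vec(&a, -1), vec![None, Some(4), None]);
    assert_eq!(shift_vec(&a, 0), a);
    assert_eq!(shift_vec(&a, 5), vec![None, None, None]); // over-shift clamps
}
```

The zero-offset branch here copies the input, mirroring the cost question raised above about `slice` versus `clone` for that case.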








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643578583



##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -613,6 +639,22 @@ std::vector<FieldRef> FieldsInExpression(const Expression& expr) {
   return fields;
 }
 
+std::vector<FieldRef> ParametersInExpression(const Expression& expr) {

Review comment:
   I may remove this, I don't think I'm using it anymore








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643578392



##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -61,13 +61,22 @@ Expression call(std::string function, std::vector<Expression> arguments,
   call.function_name = std::move(function);
   call.arguments = std::move(arguments);
   call.options = std::move(options);
+
+  call.hash = std::hash<std::string>{}(call.function_name);
+  for (const auto& arg : call.arguments) {
+call.hash ^= arg.hash();
+  }
   return Expression(std::move(call));
 }
 
 const Datum* Expression::literal() const { return util::get_if<Datum>(impl_.get()); }
 
+const Expression::Parameter* Expression::parameter() const {

Review comment:
   a Parameter is a {field_ref, type, index} (with the last two properties 
only available after `Bind`). During simplification against guarantees (which 
are unbound), parameters must be compared by their field_refs
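The quoted diff above pre-computes a call expression's hash by hashing the function name and folding in each argument's hash with XOR. A stdlib-only Rust sketch of that combining scheme (illustrative, not Arrow's C++ code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the function name, then XOR-combine each argument's hash,
// mirroring `call.hash ^= arg.hash()` in the quoted diff.
fn call_hash(function_name: &str, arg_hashes: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    function_name.hash(&mut h);
    let mut out = h.finish();
    for &a in arg_hashes {
        out ^= a; // XOR makes the combination order-insensitive
    }
    out
}

fn main() {
    // XOR combining commutes, so argument order does not change the hash.
    assert_eq!(call_hash("add", &[7, 9]), call_hash("add", &[9, 7]));
    // Equal argument hashes cancel under XOR -- a known trade-off of the scheme.
    assert_eq!(call_hash("add", &[7, 7]), call_hash("add", &[]));
}
```

The second assertion shows the price of XOR combining: duplicate children cancel out, which is acceptable for a cache key but worth knowing about.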








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643577382



##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -510,7 +475,67 @@ Result<Expression> Expression::Bind(const Schema& in_schema,
   return Bind(ValueDescr::Array(struct_(in_schema.fields())), exec_context);
 }
 
-Result<Datum> ExecuteScalarExpression(const Expression& expr, const Datum& input,
+Result<ExecBatch> MakeExecBatch(const Schema& full_schema, const Datum& partial) {
+  ExecBatch out;
+
+  if (partial.kind() == Datum::RECORD_BATCH) {
+const auto& partial_batch = *partial.record_batch();
+out.length = partial_batch.num_rows();
+
+for (const auto& field : full_schema.fields()) {
+  ARROW_ASSIGN_OR_RAISE(auto column,
+
FieldRef(field->name()).GetOneOrNone(partial_batch));
+
+  if (column) {
+if (!column->type()->Equals(field->type())) {
+  // Referenced field was present but didn't have the expected type.
+  // This *should* be handled by readers, and will just be an error in 
the future.
+  ARROW_ASSIGN_OR_RAISE(
+  auto converted,
+  compute::Cast(column, field->type(), 
compute::CastOptions::Safe()));
+  column = converted.make_array();
+}
+out.values.emplace_back(std::move(column));
+  } else {
+out.values.emplace_back(MakeNullScalar(field->type()));
+  }
+}
+return out;
+  }
+
+  // wasteful but useful for testing:
+  if (partial.type()->id() == Type::STRUCT) {
+if (partial.is_array()) {
+  ARROW_ASSIGN_OR_RAISE(auto partial_batch,
+
RecordBatch::FromStructArray(partial.make_array()));
+
+  return MakeExecBatch(full_schema, partial_batch);
+}
+
+if (partial.is_scalar()) {
+  ARROW_ASSIGN_OR_RAISE(auto partial_array,
+MakeArrayFromScalar(*partial.scalar(), 1));
+  ARROW_ASSIGN_OR_RAISE(auto out, MakeExecBatch(full_schema, 
partial_array));
+
+  for (Datum& value : out.values) {
+if (value.is_scalar()) continue;
+ARROW_ASSIGN_OR_RAISE(value, value.make_array()->GetScalar(0));
+  }

Review comment:
   This was as compact as I could write this case; if you see a way to 
compress/simplify it then I'll take it but the scalar/array cases are really 
just for testing purposes
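The record-batch path above pads a partial batch out to the full schema: present columns are taken as-is, absent ones become typed null scalars. That lookup-or-null behavior can be modeled with a map (hypothetical stdlib sketch, with `None` standing in for Arrow's null scalar):

```rust
use std::collections::HashMap;

// For each field of the full schema, take the partial value if present,
// otherwise fill with None (Arrow fills with a typed null scalar).
fn make_full_row(full_schema: &[&str], partial: &HashMap<&str, i64>) -> Vec<Option<i64>> {
    full_schema.iter().map(|name| partial.get(name).copied()).collect()
}

fn main() {
    let partial = HashMap::from([("a", 1), ("c", 3)]);
    let row = make_full_row(&["a", "b", "c"], &partial);
    assert_eq!(row, vec![Some(1), None, Some(3)]);
}
```

The output always has one entry per schema field, in schema order, regardless of which columns the partial input carried.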








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643576195



##
File path: cpp/src/arrow/compute/exec/expression_test.cc
##
@@ -165,6 +165,56 @@ TEST(ExpressionUtils, StripOrderPreservingCasts) {
   Expect(cast(field_ref("i32"), uint64()), no_change);
 }
 
+TEST(ExpressionUtils, MakeExecBatch) {
+  auto Expect = [](std::shared_ptr<RecordBatch> partial_batch) {
+SCOPED_TRACE(partial_batch->ToString());
+ASSERT_OK_AND_ASSIGN(auto batch, MakeExecBatch(*kBoringSchema, 
partial_batch));
+
+ASSERT_EQ(batch.num_values(), kBoringSchema->num_fields());
+for (int i = 0; i < kBoringSchema->num_fields(); ++i) {
+  const auto& field = *kBoringSchema->field(i);
+
+  SCOPED_TRACE("Field#" + std::to_string(i) + " " + field.ToString());
+
+  EXPECT_TRUE(batch[i].type()->Equals(field.type()))
+  << "Incorrect type " << batch[i].type()->ToString();
+
+  ASSERT_OK_AND_ASSIGN(auto col, 
FieldRef(field.name()).GetOneOrNone(*partial_batch));

Review comment:
   GetOneOrNone raises a descriptive error if duplicate field names are 
found, whereas GetFieldByName will just return null IIRC
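The `GetOneOrNone` contract described here — zero matches yields none, one match yields the column, duplicates yield a descriptive error — can be sketched in plain Rust (hypothetical helper, not Arrow's C++ code):

```rust
// Zero matches -> Ok(None); one match -> Ok(Some(v)); duplicates -> Err.
fn get_one_or_none(fields: &[(&str, i32)], name: &str) -> Result<Option<i32>, String> {
    let mut found = None;
    for &(n, v) in fields {
        if n == name {
            if found.is_some() {
                return Err(format!("duplicate field name: {}", name));
            }
            found = Some(v);
        }
    }
    Ok(found)
}

fn main() {
    let fields = [("a", 1), ("b", 2), ("a", 3)];
    assert_eq!(get_one_or_none(&fields, "b"), Ok(Some(2)));
    assert_eq!(get_one_or_none(&fields, "z"), Ok(None));
    assert!(get_one_or_none(&fields, "a").is_err()); // duplicate "a" is reported
}
```

A plain by-name lookup would silently return the first (or no) match in the duplicate case, which is exactly the failure mode the comment is guarding against.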








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643575745



##
File path: cpp/src/arrow/dataset/dataset_internal.h
##
@@ -204,5 +204,35 @@ arrow::Result<std::shared_ptr<T>> GetFragmentScanOptions(
   return internal::checked_pointer_cast<T>(source);
 }
 
+class FragmentDataset : public Dataset {

Review comment:
   Whether or not the component fragments are in memory is up to the 
fragments, so I don't think it's appropriate to inherit that here








[GitHub] [arrow-rs] nevi-me commented on pull request #389: make slice work for nested types

2021-06-01 Thread GitBox


nevi-me commented on pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389#issuecomment-852610400


   @jorgecarleitao I have the below failures, which are mostly related to 
`MutableArrayData`. I need your help when you have some time to spare.
   
   I worked on this over the weekend out of curiosity, so I'll continue 
working on the other tests this week.
   
   ```
   failures:
   array::array_union::tests::test_dense_mixed_with_nulls_and_offset
   array::array_union::tests::test_sparse_mixed_with_nulls_and_offset
   array::transform::tests::test_list_append
   array::transform::tests::test_list_nulls_append
   array::transform::tests::test_struct_offset
   compute::kernels::cast::tests::test_list_cast_offsets
   compute::kernels::concat::tests::test_concat_struct_array_slices
   json::writer::tests::write_dictionary
   json::writer::tests::write_nested_list
   ```






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #381: Respect max rowgroup size in Arrow writer

2021-06-01 Thread GitBox


codecov-commenter edited a comment on pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381#issuecomment-850950657


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/381) Report
   > Merging [#381](https://codecov.io/gh/apache/arrow-rs/pull/381) (530949e) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f26ffb3091ae355d246edc4a6fcc2c8e5b9bc570) (f26ffb3) will **increase** coverage by `0.02%`.
   > The diff coverage is `93.33%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/381/graphs/tree.svg)](https://codecov.io/gh/apache/arrow-rs/pull/381)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #381      +/-   ##
   ==========================================
   + Coverage   82.60%   82.62%   +0.02%     
   ==========================================
     Files         162      162              
     Lines       44199    44275      +76     
   ==========================================
   + Hits        36509    36583      +74     
   - Misses       7690     7692       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/381) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `82.78% <89.47%> (+0.19%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.04% <96.77%> (-0.08%)` | :arrow_down: |
   | [parquet/src/file/properties.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-cGFycXVldC9zcmMvZmlsZS9wcm9wZXJ0aWVzLnJz) | `95.74% <100.00%> (+0.01%)` | :arrow_up: |
   | [arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz) | `76.92% <0.00%> (-7.70%)` | :arrow_down: |
   | [arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `85.29% <0.00%> (ø)` | |
   | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <0.00%> (ø)` | |
   | [arrow/src/compute/kernels/window.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy93aW5kb3cucnM=) | `100.00% <0.00%> (ø)` | |
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `77.18% <0.00%> (+0.09%)` | :arrow_up: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `95.04% <0.00%> (+0.19%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/381/diff#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `89.18% <0.00%> (+0.24%)` | :arrow_up: |
   | ... and [1 

[GitHub] [arrow-rs] nevi-me opened a new pull request #389: make slice work for nested types

2021-06-01 Thread GitBox


nevi-me opened a new pull request #389:
URL: https://github.com/apache/arrow-rs/pull/389


   # Which issue does this PR close?
   
   Corresponding issue might not yet exist, will check when finalising this PR
   
   # Rationale for this change

   `ArrayData::slice()` does not work for nested types, because only the 
`ArrayData::buffers` are updated with the new offset and length. This has 
caused a lot of issues in the past.
   
   This blocks us from being able to implement `RecordBatch::slice()`, and has 
led to creating #381 to sidestep this issue.
   
   # What changes are included in this PR?
   
   I propose a manual slice where we check if `ArrayData::child_data` is not 
empty, and then propagate the offset and length down to them. The only trick is 
with list and largelist, where we have to inspect the offset buffers to 
determine what the new offset and length should be.
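The list/largelist trick described above can be illustrated with a plain offsets-plus-values representation. This is a hypothetical sketch of why list slicing must consult the offset buffer, not the arrow-rs implementation:

```rust
// A "list array" as (offsets, values): list i spans values[offsets[i]..offsets[i + 1]].
fn slice_list(offsets: &[i32], values: &[i64], start: usize, len: usize) -> (Vec<i32>, Vec<i64>) {
    // The child value range comes from the offset buffer, not from start/len directly.
    let v_start = offsets[start] as usize;
    let v_end = offsets[start + len] as usize;
    // Rebase offsets so the sliced list starts at zero.
    let new_offsets: Vec<i32> =
        offsets[start..=start + len].iter().map(|o| o - offsets[start]).collect();
    (new_offsets, values[v_start..v_end].to_vec())
}

fn main() {
    // Lists: [1, 2], [], [3, 4, 5]
    let offsets = [0, 2, 2, 5];
    let values = [1, 2, 3, 4, 5];
    // Take the last two lists: [] and [3, 4, 5]
    let (o, v) = slice_list(&offsets, &values, 1, 2);
    assert_eq!(o, vec![0, 0, 3]);
    assert_eq!(v, vec![3, 4, 5]);
}
```

For struct children, by contrast, the parent's offset and length can simply be propagated down unchanged, which is the easy half of the proposal.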
   
   # Are there any user-facing changes?
   
   No UFC






[GitHub] [arrow-rs] nevi-me commented on pull request #381: Respect max rowgroup size in Arrow writer

2021-06-01 Thread GitBox


nevi-me commented on pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381#issuecomment-852597694


   I've addressed feedback, PTAL @alamb 






[GitHub] [arrow-rs] nevi-me commented on a change in pull request #381: Respect max rowgroup size in Arrow writer

2021-06-01 Thread GitBox


nevi-me commented on a change in pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381#discussion_r643570618



##
File path: parquet/src/arrow/arrow_writer.rs
##
@@ -1176,31 +1236,51 @@ mod tests {
 let raw_values: Vec<_> = (0..SMALL_SIZE as i64).collect();
 let values = Arc::new(TimestampSecondArray::from_vec(raw_values, 
None));
 
-one_column_roundtrip("timestamp_second_single_column", values, false);
+one_column_roundtrip(

Review comment:
   I've changed the `SMALL_SIZE` to an odd number, and then changed the 
batch sizes of some tests.

##
File path: parquet/src/arrow/arrow_writer.rs
##
@@ -87,17 +92,31 @@ impl ArrowWriter {
 "Record batch schema does not match writer schema".to_string(),
 ));
 }
-// compute the definition and repetition levels of the batch
-let batch_level = LevelInfo::new_from_batch(batch);
-let mut row_group_writer = self.writer.next_row_group()?;
-for (array, field) in 
batch.columns().iter().zip(batch.schema().fields()) {
-let mut levels = batch_level.calculate_array_levels(array, field);
-// Reverse levels as we pop() them when writing arrays
-levels.reverse();
-        write_leaves(&mut row_group_writer, array, &mut levels)?;
+// Track the number of rows being written in the batch.
+// We currently do not have a way of slicing nested arrays, thus we
+// track this manually.
+let num_rows = batch.num_rows();
+        let batches = (num_rows + self.max_row_group_size - 1) / self.max_row_group_size;

Review comment:
   Added a note, thanks :)
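The ceil division in the quoted diff computes how many row groups a batch splits into. A stdlib sketch of that chunking with a hypothetical helper name, not the parquet crate's API:

```rust
// Split num_rows into row groups of at most max_row_group_size rows each.
fn row_group_bounds(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    // Ceil division, as in the quoted diff: (n + m - 1) / m.
    let batches = (num_rows + max_row_group_size - 1) / max_row_group_size;
    (0..batches)
        .map(|i| {
            let start = i * max_row_group_size;
            (start, (start + max_row_group_size).min(num_rows)) // last group may be short
        })
        .collect()
}

fn main() {
    assert_eq!(row_group_bounds(1024, 512), vec![(0, 512), (512, 1024)]);
    assert_eq!(row_group_bounds(1025, 512), vec![(0, 512), (512, 1024), (1024, 1025)]);
    assert!(row_group_bounds(0, 512).is_empty());
}
```

Only the final row group can be shorter than the maximum, so every group respects the configured size limit.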








[GitHub] [arrow] jonkeane commented on a change in pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift

2021-06-01 Thread GitBox


jonkeane commented on a change in pull request #10430:
URL: https://github.com/apache/arrow/pull/10430#discussion_r643569671



##
File path: ci/docker/ubuntu-20.04-cpp.dockerfile
##
@@ -75,6 +75,7 @@ RUN apt-get update -y -q && \
 libcurl4-openssl-dev \
 libgflags-dev \
 libgoogle-glog-dev \
+libgrpc++-dev \

Review comment:
   Ah yes, thanks for that! Cruft leftover from me trying to get it working








[GitHub] [arrow-datafusion] BohuTANG edited a comment on issue #354: Implement some way to self assign tickets without having full edit access to github

2021-06-01 Thread GitBox


BohuTANG edited a comment on issue #354:
URL: 
https://github.com/apache/arrow-datafusion/issues/354#issuecomment-852561940


   Hello, 
   
   we can consider this bot https://github.com/datafuselabs/fusebots , who has 
an `/assignme` command to take the current issue away and then automatically 
add a `community-take` label, demo: 
https://github.com/datafuselabs/datafuse/issues/663#issuecomment-851260591






[GitHub] [arrow] kou commented on a change in pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift

2021-06-01 Thread GitBox


kou commented on a change in pull request #10430:
URL: https://github.com/apache/arrow/pull/10430#discussion_r643565782



##
File path: ci/docker/ubuntu-20.04-cpp.dockerfile
##
@@ -75,6 +75,7 @@ RUN apt-get update -y -q && \
 libcurl4-openssl-dev \
 libgflags-dev \
 libgoogle-glog-dev \
+libgrpc++-dev \

Review comment:
   We don't need to install `libgrpc++-dev` and `protobuf-compiler-grpc` 
when we use bundled gRPC and ProtoBuf.








[GitHub] [arrow-datafusion] BohuTANG edited a comment on issue #354: Implement some way to self assign tickets without having full edit access to github

2021-06-01 Thread GitBox


BohuTANG edited a comment on issue #354:
URL: 
https://github.com/apache/arrow-datafusion/issues/354#issuecomment-852561940


   Hello, we can consider this bot https://github.com/datafuselabs/fusebots , 
who has an '/assignme' command to take the current issue away and then 
automatically add a 'community-take' label, demo: 
https://github.com/datafuselabs/datafuse/issues/663#issuecomment-851260591






[GitHub] [arrow-datafusion] BohuTANG commented on issue #354: Implement some way to self assign tickets without having full edit access to github

2021-06-01 Thread GitBox


BohuTANG commented on issue #354:
URL: 
https://github.com/apache/arrow-datafusion/issues/354#issuecomment-852561940


   Hello, we can consider this bot https://github.com/datafuselabs/fusebots , 
who has an '/assignme' command to take the current issue away and then 
automatically add a 'community-take' label, demo: 
https://github.com/datafuselabs/datafuse/issues/663#issuecomment-851260591






[GitHub] [arrow] lidavidm commented on a change in pull request #9620: ARROW-11843: [C++] Provide async Parquet reader

2021-06-01 Thread GitBox


lidavidm commented on a change in pull request #9620:
URL: https://github.com/apache/arrow/pull/9620#discussion_r643540804



##
File path: cpp/src/parquet/file_reader.cc
##
@@ -264,23 +264,92 @@ class SerializedFile : public ParquetFileReader::Contents 
{
   }
 }
 PARQUET_THROW_NOT_OK(cached_source_->Cache(ranges));
-return cached_source_->Wait();
   }
 
+  ::arrow::Future<> WhenBuffered(const std::vector<int>& row_groups,
+                                 const std::vector<int>& column_indices) const {
+if (!cached_source_) {
+  return ::arrow::Status::Invalid("Must call PreBuffer before 
WhenBuffered");
+}
+std::vector<::arrow::io::ReadRange> ranges;
+for (int row : row_groups) {
+  for (int col : column_indices) {
+ranges.push_back(
+ComputeColumnChunkRange(file_metadata_.get(), source_size_, row, 
col));
+  }
+}
+return cached_source_->WaitFor(ranges);
+  }
+
+  // Metadata/footer parsing. Divided up to separate sync/async paths, and to 
use
+  // exceptions for error handling (with the async path converting to 
Future/Status).
+
   void ParseMetaData() {

Review comment:
   I've updated the current encryption tests to test both Open and 
OpenAsync (though let's hope they don't make the test flaky again).
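The quoted `WhenBuffered` helper enumerates one read range per (row group, column) pair before waiting on them. That cross product can be sketched as follows (hypothetical helper, not the parquet API):

```rust
// One key per (row group, column) pair, in row-group-major order,
// mirroring the nested loops in the quoted C++.
fn chunk_keys(row_groups: &[i32], column_indices: &[i32]) -> Vec<(i32, i32)> {
    let mut out = Vec::with_capacity(row_groups.len() * column_indices.len());
    for &rg in row_groups {
        for &col in column_indices {
            out.push((rg, col));
        }
    }
    out
}

fn main() {
    let keys = chunk_keys(&[0, 1], &[2, 5]);
    assert_eq!(keys, vec![(0, 2), (0, 5), (1, 2), (1, 5)]);
}
```

Each pair maps to one column chunk's byte range in the file, so the future completes only when every requested chunk is cached.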

##
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##
@@ -2331,6 +2330,63 @@ TEST(TestArrowReadWrite, GetRecordBatchReaderNoColumns) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+TEST(TestArrowReadWrite, GetRecordBatchGenerator) {
+  ArrowReaderProperties properties = default_arrow_reader_properties();
+  const int num_rows = 1024;
+  const int row_group_size = 512;
+  const int num_columns = 2;
+
+  std::shared_ptr<Table> table;
+  ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(num_columns, num_rows, 1, &table));
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(table, row_group_size,
+                                             default_arrow_writer_properties(), &buffer));
+
+  std::shared_ptr<FileReader> reader;
+  {
+    std::unique_ptr<FileReader> unique_reader;
+    FileReaderBuilder builder;
+    ASSERT_OK(builder.Open(std::make_shared<BufferReader>(buffer)));
+    ASSERT_OK(builder.properties(properties)->Build(&unique_reader));
+    reader = std::move(unique_reader);
+  }
+
+  auto check_batches = [](const std::shared_ptr<::arrow::RecordBatch>& batch,
+                          int num_columns, int num_rows) {
+    ASSERT_NE(batch, nullptr);
+    ASSERT_EQ(batch->num_columns(), num_columns);
+    ASSERT_EQ(batch->num_rows(), num_rows);

Review comment:
   I've adjusted the test to check equality below.








[GitHub] [arrow-datafusion] alamb commented on pull request #429: implement lead and lag built-in window function

2021-06-01 Thread GitBox


alamb commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-852533565


   I plan to review this PR tomorrow






[GitHub] [arrow-datafusion] alamb commented on pull request #470: Support semi join

2021-06-01 Thread GitBox


alamb commented on pull request #470:
URL: https://github.com/apache/arrow-datafusion/pull/470#issuecomment-852533237


   I plan to review this tomorrow






[GitHub] [arrow-rs] alamb commented on pull request #384: Implement faster arrow array reader

2021-06-01 Thread GitBox


alamb commented on pull request #384:
URL: https://github.com/apache/arrow-rs/pull/384#issuecomment-852532332


   Thanks @yordan-pavlov  -- I will try and set time aside tomorrow to review 
this PR. Sorry for the delay






[GitHub] [arrow] nealrichardson closed pull request #10381: ARROW-12722: [R] Raise error when attemping to print table with duplicated naming

2021-06-01 Thread GitBox


nealrichardson closed pull request #10381:
URL: https://github.com/apache/arrow/pull/10381


   






[GitHub] [arrow] github-actions[bot] commented on pull request #10432: ARROW-12924: [Gandiva][C++] Implement CONVERT_TIMEZONE SQL function in Gandiva

2021-06-01 Thread GitBox


github-actions[bot] commented on pull request #10432:
URL: https://github.com/apache/arrow/pull/10432#issuecomment-852494029


   https://issues.apache.org/jira/browse/ARROW-12924






[GitHub] [arrow] jvictorhuguenin opened a new pull request #10432: ARROW-12924: [Gandiva][C++] Implement CONVERT_TIMEZONE SQL function in Gandiva

2021-06-01 Thread GitBox


jvictorhuguenin opened a new pull request #10432:
URL: https://github.com/apache/arrow/pull/10432


   Converts timestamp to specified timezone. If the sourceTimezone parameter is 
not present, Dremio assumes the timestamp provided in the third parameter is in 
UTC format. The sourceTimezone and destinationTimezone parameters accept any of 
the following values: timezone name from sys.timezone_names, timezone 
abbreviation from sys.timezone_abbrevs, and offset, such as +02:00.
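
   For the fixed-offset case, the arithmetic behind such a conversion can be sketched as: a wall-clock value in the source zone is shifted to UTC, then to the destination zone. This is an illustrative sketch only, not Gandiva's code; all names below are hypothetical.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Illustrative sketch (not Gandiva's implementation): convert a wall-clock
   // timestamp (in seconds) between two fixed UTC offsets, also in seconds
   // (e.g. +02:00 == 7200). Source wall-clock -> UTC -> destination wall-clock.
   int64_t ConvertTimezone(int64_t ts_seconds, int64_t src_offset_seconds,
                           int64_t dst_offset_seconds) {
     return ts_seconds - src_offset_seconds + dst_offset_seconds;
   }

   int main() {
     // 12:00 UTC (+00:00) converted to +02:00 reads as 14:00 wall-clock.
     assert(ConvertTimezone(12 * 3600, 0, 2 * 3600) == 14 * 3600);
     return 0;
   }
   ```

   Named timezones (from sys.timezone_names) and DST are not fixed offsets; a real implementation would look the offset up per instant in a timezone database rather than use a constant.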






[GitHub] [arrow] jonkeane commented on pull request #10430: ARROW-12915: [Release] Build of ubuntu-docs is failing on thrift

2021-06-01 Thread GitBox


jonkeane commented on pull request #10430:
URL: https://github.com/apache/arrow/pull/10430#issuecomment-852480193


   I ran this on crossbow (locally since the GH actions comment bot isn't 
working right now https://issues.apache.org/jira/browse/ARROW-12919) and the 
only ubuntu crossbow builds that failed were all ones that have other current 
nightly failures.
   
   I've also added a defaults docs nightly to catch if we have a mismatch like 
this in the future.






[GitHub] [arrow] lidavidm commented on pull request #10386: ARROW-12859: [C++] Add ScalarFromJSON for testing

2021-06-01 Thread GitBox


lidavidm commented on pull request #10386:
URL: https://github.com/apache/arrow/pull/10386#issuecomment-852463069


   For CastTo: I think it was actually the Cast kernel (string->type cast).
   
   I added some basic tests. I don't think there's a good way to hit the DCHECK 
because that would imply a JSON value converter appended two array values for a 
single JSON value.






[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #474: Update k8s user guide to use deployments

2021-06-01 Thread GitBox


codecov-commenter commented on pull request #474:
URL: https://github.com/apache/arrow-datafusion/pull/474#issuecomment-852435958


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474) Report
   > Merging [#474](https://codecov.io/gh/apache/arrow-datafusion/pull/474) (56aeaea) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/16011120a1b73798049c5be49f9548b00f8a0a00) (1601112) will **not change** coverage.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/474/graphs/tree.svg)](https://codecov.io/gh/apache/arrow-datafusion/pull/474)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #474   +/-   ##
   =======================================
     Coverage   75.84%   75.84%           
   =======================================
     Files         153      153           
     Lines       25872    25872           
   =======================================
     Hits        19622    19622           
     Misses       6250     6250           
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/474) | Coverage Δ | |
   |---|---|---|
   | [ballista/rust/executor/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/474/diff#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvbWFpbi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [ballista/rust/scheduler/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/474/diff#diff-YmFsbGlzdGEvcnVzdC9zY2hlZHVsZXIvc3JjL21haW4ucnM=) | `0.00% <0.00%> (ø)` | |
   
   --
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/474). Last update [1601112...56aeaea](https://codecov.io/gh/apache/arrow-datafusion/pull/474).
   Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).






[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #436: Remove redundant filters (e.g. c > 5 AND c > 5 --> c > 5)

2021-06-01 Thread GitBox


codecov-commenter commented on pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#issuecomment-852434776


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/436) Report
   > Merging [#436](https://codecov.io/gh/apache/arrow-datafusion/pull/436) (25adf53) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/db4f098d38993b96ce1134c4bc7bf5c6579509cf) (db4f098) will **increase** coverage by `1.00%`.
   > The diff coverage is `86.61%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/436/graphs/tree.svg)](https://codecov.io/gh/apache/arrow-datafusion/pull/436)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #436      +/-   ##
   ==========================================
   + Coverage   74.94%   75.94%   +1.00%     
   ==========================================
     Files         146      154       +8     
     Lines       24314    26126    +1812     
   ==========================================
   + Hits        18221    19842    +1621     
   - Misses       6093     6284     +191     
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/436) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvZXhlY3V0aW9uL2NvbnRleHQucnM=) | `92.08% <ø> (+0.03%)` | :arrow_up: |
   | [datafusion/src/optimizer/simplify\_expressions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL3NpbXBsaWZ5X2V4cHJlc3Npb25zLnJz) | `86.61% <86.61%> (ø)` | |
   | [datafusion-cli/src/print\_format.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi1jbGkvc3JjL3ByaW50X2Zvcm1hdC5ycw==) | `81.25% <0.00%> (-9.17%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9tb2QucnM=) | `78.70% <0.00%> (-4.06%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/window\_functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dfZnVuY3Rpb25zLnJz) | `85.71% <0.00%> (-3.01%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9kYXRldGltZV9leHByZXNzaW9ucy5ycw==) | `67.29% <0.00%> (-2.52%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/common.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jb21tb24ucnM=) | `84.21% <0.00%> (-2.00%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/repartition.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9yZXBhcnRpdGlvbi5ycw==) | `82.45% <0.00%> (-1.89%)` | :arrow_down: |
   | [ballista/rust/scheduler/src/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/436/diff#diff-YmFsbGlzdGEvcnVzdC9zY2hlZHVsZXIvc3JjL3BsYW5uZXIucnM=) | `66.91% <0.00%> (-0.74%)` | :arrow_down: |
   | 

[GitHub] [arrow-datafusion] edrevo commented on issue #72: Update link in Ballista donation blog post

2021-06-01 Thread GitBox


edrevo commented on issue #72:
URL: https://github.com/apache/arrow-datafusion/issues/72#issuecomment-852428471


   I'd say this is fixed, right?






[GitHub] [arrow-datafusion] edrevo commented on pull request #474: Update k8s user guide to use deployments

2021-06-01 Thread GitBox


edrevo commented on pull request #474:
URL: https://github.com/apache/arrow-datafusion/pull/474#issuecomment-852426880


   cc @andygrove 






[GitHub] [arrow] bkietz commented on a change in pull request #10410: ARROW-10640: [C++] A "where" kernel to combine two arrays based on a mask

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10410:
URL: https://github.com/apache/arrow/pull/10410#discussion_r643454193



##
File path: cpp/src/arrow/compute/kernels/scalar_if_else_test.cc
##
@@ -0,0 +1,264 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+namespace arrow {
+namespace compute {
+
+void CheckIfElseOutput(const Datum& cond, const Datum& left, const Datum& right,
+                       const Datum& expected) {
+  ASSERT_OK_AND_ASSIGN(Datum datum_out, IfElse(cond, left, right));
+  if (datum_out.is_array()) {
+    std::shared_ptr<Array> result = datum_out.make_array();
+    ASSERT_OK(result->ValidateFull());
+    std::shared_ptr<Array> expected_ = expected.make_array();
+    AssertArraysEqual(*expected_, *result, /*verbose=*/true);
+  } else {  // expecting scalar
+    const std::shared_ptr<Scalar>& result = datum_out.scalar();
+    const std::shared_ptr<Scalar>& expected_ = expected.scalar();
+    AssertScalarsEqual(*expected_, *result, /*verbose=*/true);
+  }
+}
+
+class TestIfElseKernel : public ::testing::Test {};
+
+template <typename Type>
+class TestIfElsePrimitive : public ::testing::Test {};
+
+using PrimitiveTypes = ::testing::Types;
+
+TYPED_TEST_SUITE(TestIfElsePrimitive, PrimitiveTypes);
+
+TYPED_TEST(TestIfElsePrimitive, IfElseFixedSizeRand) {
+  using ArrayType = typename TypeTraits<TypeParam>::ArrayType;
+  auto type = TypeTraits<TypeParam>::type_singleton();
+
+  random::RandomArrayGenerator rand(/*seed=*/0);
+  int64_t len = 1000;
+  auto cond = std::static_pointer_cast<BooleanArray>(
+      rand.ArrayOf(boolean(), len, /*null_probability=*/0.01));
+  auto left = std::static_pointer_cast<ArrayType>(
+      rand.ArrayOf(type, len, /*null_probability=*/0.01));
+  auto right = std::static_pointer_cast<ArrayType>(
+      rand.ArrayOf(type, len, /*null_probability=*/0.01));
+
+  typename TypeTraits<TypeParam>::BuilderType builder;
+
+  for (int64_t i = 0; i < len; ++i) {
+    if (!cond->IsValid(i) || (cond->Value(i) && !left->IsValid(i)) ||
+        (!cond->Value(i) && !right->IsValid(i))) {
+      ASSERT_OK(builder.AppendNull());
+      continue;
+    }
+
+    if (cond->Value(i)) {
+      ASSERT_OK(builder.Append(left->Value(i)));
+    } else {
+      ASSERT_OK(builder.Append(right->Value(i)));
+    }
+  }
+  ASSERT_OK_AND_ASSIGN(auto expected_data, builder.Finish());
+
+  CheckIfElseOutput(cond, left, right, expected_data);
+}
+
+void CheckWithDifferentShapes(const std::shared_ptr<Array>& cond,
+                              const std::shared_ptr<Array>& left,
+                              const std::shared_ptr<Array>& right,
+                              const std::shared_ptr<Array>& expected) {
+  // this will check for whole arrays, every scalar at i'th index and slicing (offset)
+  CheckScalar("if_else", {cond, left, right}, expected);
+
+  auto len = left->length();
+
+  enum { COND_SCALAR = 1, LEFT_SCALAR = 2, RIGHT_SCALAR = 4 };
+  for (int mask = 0; mask < (COND_SCALAR | LEFT_SCALAR | RIGHT_SCALAR); ++mask) {
+    for (int64_t cond_idx = 0; cond_idx < len; ++cond_idx) {
+      Datum cond_in, cond_bcast;
+      if (mask & COND_SCALAR) {
+        ASSERT_OK_AND_ASSIGN(cond_in, cond->GetScalar(cond_idx));
+        ASSERT_OK_AND_ASSIGN(cond_bcast, MakeArrayFromScalar(*cond_in.scalar(), len));
+      } else {
+        cond_in = cond_bcast = cond;
+      }

Review comment:
   Please add SCOPED_TRACEs here and below for left/right so that failures 
within the loop are more informative about which iteration failed
   ```suggestion
     Datum cond_in, cond_bcast;
     std::string trace_msg = "Cond";
     if (mask & COND_SCALAR) {
       ASSERT_OK_AND_ASSIGN(cond_in, cond->GetScalar(cond_idx));
       ASSERT_OK_AND_ASSIGN(cond_bcast, MakeArrayFromScalar(*cond_in.scalar(), len));
       trace_msg += "@" + std::to_string(cond_idx) + "=" + cond_in.scalar()->ToString();
     } else {
       cond_in = cond_bcast = cond;
     }
     SCOPED_TRACE(trace_msg);
   ```

##
File path: cpp/src/arrow/compute/kernels/scalar_if_else.cc
##
@@ -699,14 +699,58 @@ struct IfElseFunctor> {
   // AAS
+  static Status Call(KernelContext* ctx, const ArrayData& cond, const ArrayData& left,
+                     const Scalar& right, 

[GitHub] [arrow-datafusion] edrevo opened a new pull request #474: K8s deployments

2021-06-01 Thread GitBox


edrevo opened a new pull request #474:
URL: https://github.com/apache/arrow-datafusion/pull/474


   # Which issue does this PR close?
   
   Closes #473.
   
   # What changes are included in this PR?
   
   - Rename port to bind-port since the configure_me crate was trying to parse an env var that was being auto-populated by Kubernetes
   - Change the user guide for k8s
   






[GitHub] [arrow-datafusion] edrevo opened a new issue #473: [Ballista] Use deployments in k8s user guide

2021-06-01 Thread GitBox


edrevo opened a new issue #473:
URL: https://github.com/apache/arrow-datafusion/issues/473


   The executors can now be used as a k8s deployment, which is a more flexible 
and simpler k8s primitive.






[GitHub] [arrow] rok commented on pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#issuecomment-852423550


   Thanks for the review @pitrou! I've addressed some of the comments and I'll 
try to finish the rest today.






[GitHub] [arrow-datafusion] jgoday commented on a change in pull request #436: Remove redundant filters (e.g. c > 5 AND c > 5 --> c > 5)

2021-06-01 Thread GitBox


jgoday commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643455939



##
File path: datafusion/src/optimizer/mod.rs
##
@@ -25,4 +25,5 @@ pub mod hash_build_probe_order;
 pub mod limit_push_down;
 pub mod optimizer;
 pub mod projection_push_down;
+pub mod remove_duplicate_filters;

Review comment:
   Ok, changed to simplify_expressions.
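
   For context on what this pass covers (the `c > 5 AND c > 5 --> c > 5` case from the PR title), removing a redundant filter amounts to deduplicating the conjuncts of a predicate. A minimal sketch in C++ with hypothetical names, not DataFusion's actual Rust implementation:

   ```cpp
   #include <cassert>
   #include <set>
   #include <string>
   #include <vector>

   // Hypothetical sketch: drop duplicate predicates from a conjunction,
   // keeping the first occurrence of each distinct predicate.
   std::vector<std::string> SimplifyConjunction(const std::vector<std::string>& predicates) {
     std::set<std::string> seen;
     std::vector<std::string> simplified;
     for (const auto& p : predicates) {
       // set::insert reports whether the predicate was newly seen
       if (seen.insert(p).second) simplified.push_back(p);
     }
     return simplified;
   }

   int main() {
     // "c > 5 AND c > 5" simplifies to the single predicate "c > 5".
     auto out = SimplifyConjunction({"c > 5", "c > 5"});
     assert(out.size() == 1 && out[0] == "c > 5");
     return 0;
   }
   ```

   A real optimizer would of course compare expression trees rather than strings, and handle more rewrites than duplicate removal, which is presumably why the broader module name was preferred.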








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643455290



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/builder.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/time.h"
+#include "arrow/vendored/datetime.h"
+
+namespace arrow {
+
+namespace compute {
+namespace internal {
+
+using applicator::ScalarUnaryNotNull;
+using applicator::SimpleUnary;
+using arrow_vendored::date::days;
+using arrow_vendored::date::floor;
+using arrow_vendored::date::hh_mm_ss;
+using arrow_vendored::date::sys_days;
+using arrow_vendored::date::sys_time;
+using arrow_vendored::date::trunc;
+using arrow_vendored::date::weekday;
+using arrow_vendored::date::weeks;
+using arrow_vendored::date::year_month_day;
+using arrow_vendored::date::years;
+using arrow_vendored::date::literals::dec;
+using arrow_vendored::date::literals::jan;
+using arrow_vendored::date::literals::last;
+using arrow_vendored::date::literals::mon;
+using arrow_vendored::date::literals::thu;
+
+// Based on ScalarUnaryNotNullStateful. Adds timezone awareness.
+template 
+struct ScalarUnaryStatefulTemporal {
+  using ThisType = ScalarUnaryStatefulTemporal;
+  using OutValue = typename GetOutputType::T;
+
+  Op op;
+  explicit ScalarUnaryStatefulTemporal(Op op) : op(std::move(op)) {}
+
+  template 
+  struct ArrayExec {
+static Status Exec(const ThisType& functor, KernelContext* ctx, const 
ArrayData& arg0,
+   Datum* out) {
+  const std::string timezone =
+  std::static_pointer_cast(arg0.type)->timezone();
+  Status st = Status::OK();
+  ArrayData* out_arr = out->mutable_array();
+  auto out_data = out_arr->GetMutableValues(1);
+
+  if (timezone.empty()) {
+VisitArrayValuesInline(
+arg0,
+[&](int64_t v) {
+  *out_data++ = functor.op.template Call(ctx, v, );
+},
+[&]() {
+  // null
+  ++out_data;
+});
+  } else {
+st = Status::Invalid("Timezone aware timestamps not supported. 
Timezone found: ",
+ timezone);
+  }
+  return st;
+}
+  };
+
+  Status Scalar(KernelContext* ctx, const Scalar& arg0, Datum* out) {
+const std::string timezone =
+std::static_pointer_cast(arg0.type)->timezone();
+Status st = Status::OK();
+if (timezone.empty()) {
+  if (arg0.is_valid) {
+int64_t arg0_val = UnboxScalar::Unbox(arg0);
+BoxScalar::Box(this->op.template Call(ctx, 
arg0_val, ),
+out->scalar().get());
+  }
+} else {
+  st = Status::Invalid("Timezone aware timestamps not supported. Timezone 
found: ",
+   timezone);
+}
+return st;
+  }
+
+  Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  return ArrayExec::Exec(*this, ctx, *batch[0].array(), out);
+} else {
+  return Scalar(ctx, *batch[0].scalar(), out);
+}
+  }
+};
+
+template 
+struct ScalarUnaryTemporal {
+  using OutValue = typename GetOutputType::T;
+
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+// Seed kernel with dummy state
+ScalarUnaryStatefulTemporal kernel({});
+return kernel.Exec(ctx, batch, out);
+  }
+};
+
+// --
+// Extract year from timestamp
+
+template 
+struct Year {
+  template 
+  static T Call(KernelContext*, int64_t arg, Status*) {
+return static_cast(static_cast(
+
year_month_day(floor(sys_time(Duration{arg}))).year()));
+  }
+};
+
+// --
+// Extract month from timestamp
+
+template 
+struct Month {
+  template 
+  static T Call(KernelContext*, int64_t arg, Status*) {
+return static_cast(static_cast(
+
year_month_day(floor(sys_time(Duration{arg}))).month()));
+  }
+};
+
+// --
+// 

[GitHub] [arrow-datafusion] edrevo commented on issue #472: [Ballista] Improve task and job metadata

2021-06-01 Thread GitBox


edrevo commented on issue #472:
URL: https://github.com/apache/arrow-datafusion/issues/472#issuecomment-852422276


   cc @pradomota. I'm opening this one in case you want to take a stab at it. We can do pair programming if you want.






[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643454925




[GitHub] [arrow-datafusion] edrevo opened a new issue #472: [Ballista] Improve task and job metadata

2021-06-01 Thread GitBox


edrevo opened a new issue #472:
URL: https://github.com/apache/arrow-datafusion/issues/472


   The task and job status we save in the scheduler state is currently lacking. 
See:
   
   
https://github.com/apache/arrow-datafusion/blob/16011120a1b73798049c5be49f9548b00f8a0a00/ballista/rust/core/proto/ballista.proto#L669-L680
   
   And
   
   
https://github.com/apache/arrow-datafusion/blob/16011120a1b73798049c5be49f9548b00f8a0a00/ballista/rust/core/proto/ballista.proto#L614-L624
   
   The job status should include:
   
   - Start time for the job (the moment it was queued)
   - Finish time for the job (for the completed state)
   - Number of pending/running/completed/failed tasks and stages (for the 
running state)
   
   The task status should include:
   
   - Start time for the task
   - Finish time for the task (for the completed state)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643453195



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@

[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643453054



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@

[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643452934



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@

[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643452234



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@
Review comment:
   date.h won't allow casting to types other than int32 and throws at compile-time:
   ```
   ../src/arrow/compute/kernels/scalar_temporal.cc:144:78: error: invalid static_cast from type 'arrow_vendored::date::year' to type 'const int64_t' {aka 'const long int'}
     144 |         year_month_day(floor<days>(sys_time<Duration>(Duration{arg}))).year()));
   ```
   I'll take a look if there's a recommended way to deal with this.





[GitHub] [arrow-datafusion] jgoday commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


jgoday commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643450066



##
File path: datafusion/src/optimizer/remove_duplicate_filters.rs
##
@@ -0,0 +1,611 @@
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+    match expr {
+        Expr::BinaryExpr {
+            left,
+            op: Operator::And,
+            right,
+        } => expr_contains(left, needle) || expr_contains(right, needle),
+        Expr::BinaryExpr {
+            left,
+            op: Operator::Or,
+            right,
+        } => expr_contains(left, needle) || expr_contains(right, needle),
+        _ => expr == needle,
+    }
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+    match expr {
+        Expr::BinaryExpr { .. } => Some(expr),
+        _ => None,
+    }
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+    op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+    match s {

Review comment:
   I like that last option :)








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643449818



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##
@@ -0,0 +1,614 @@
+// Based on ScalarUnaryNotNullStateful. Adds timezone awareness.

Review comment:
   Done.








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643449669



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/formatting.h"
+
+namespace arrow {
+
+using internal::StringFormatter;
+
+class ScalarTemporalTest : public ::testing::Test {};
+
+namespace compute {
+
+TEST(ScalarTemporalTest, TestSimpleTemporalComponentExtraction) {
+  const char* times =
+      R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+          "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", null])";
+  auto unit = timestamp(TimeUnit::NANO);
+  auto timestamps = ArrayFromJSON(unit, times);
+  auto iso_calendar_type =
+      struct_({field("iso_year", int64()), field("iso_week", int64()),
+               field("weekday", int64())});
+
+  auto year = "[1970, 2000, 1899, 2033, null]";
+  auto month = "[1, 2, 1, 5, null]";
+  auto day = "[1, 29, 1, 18, null]";
+  auto day_of_week = "[4, 2, 7, 3, null]";
+  auto day_of_year = "[1, 60, 1, 138, null]";
+  auto iso_year = "[1970, 2000, 1899, 2033, null]";
+  auto iso_week = "[1, 9, 52, 20, null]";
+  auto iso_calendar = ArrayFromJSON(iso_calendar_type,
+                                    R"([{"iso_year": 1970, "iso_week": 1, "weekday": 4},
+                                        {"iso_year": 2000, "iso_week": 9, "weekday": 2},
+                                        {"iso_year": 1899, "iso_week": 52, "weekday": 7},
+                                        {"iso_year": 2033, "iso_week": 20, "weekday": 3}, null])");
+  auto quarter = "[1, 1, 1, 2, null]";
+  auto hour = "[0, 23, 0, 3, null]";
+  auto minute = "[0, 23, 59, 33, null]";
+  auto second = "[59.123456789, 23.9, 20.001001001, 20.0, null]";
+  auto millisecond = "[123, 999, 1, 0, null]";
+  auto microsecond = "[456, 999, 1, 0, null]";
+  auto nanosecond = "[789, 999, 1, 0, null]";
+  auto subsecond = "[123456789, 9, 1001001, 0, null]";
+
+  CheckScalarUnary("year", unit, times, int64(), year);
+  CheckScalarUnary("month", unit, times, int64(), month);
+  CheckScalarUnary("day", unit, times, int64(), day);
+  CheckScalarUnary("day_of_week", unit, times, int64(), day_of_week);
+  CheckScalarUnary("day_of_year", unit, times, int64(), day_of_year);
+  CheckScalarUnary("iso_year", unit, times, int64(), iso_year);
+  CheckScalarUnary("iso_week", unit, times, int64(), iso_week);
+  CheckScalarUnary("iso_calendar", timestamps, iso_calendar);
+  CheckScalarUnary("quarter", unit, times, int64(), quarter);
+  CheckScalarUnary("hour", unit, times, int64(), hour);
+  CheckScalarUnary("minute", unit, times, int64(), minute);
+  CheckScalarUnary("second", unit, times, float64(), second);
+  CheckScalarUnary("millisecond", unit, times, int64(), millisecond);
+  CheckScalarUnary("microsecond", unit, times, int64(), microsecond);
+  CheckScalarUnary("nanosecond", unit, times, int64(), nanosecond);
+  CheckScalarUnary("subsecond", unit, times, int64(), subsecond);

Review comment:
   We'll probably need to calculate these to mimic pandas and R. We can calculate them here or later. [See discussion.](https://github.com/apache/arrow/pull/10176#discussion_r633564079)








[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

2021-06-01 Thread GitBox


thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r643448875



##
File path: r/R/util.R
##
@@ -110,3 +110,15 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {

Review comment:
   Good point!








[GitHub] [arrow] rok commented on a change in pull request #10176: ARROW-11759: [C++] Kernel to extract datetime components (year, month, day, etc) from timestamp type

2021-06-01 Thread GitBox


rok commented on a change in pull request #10176:
URL: https://github.com/apache/arrow/pull/10176#discussion_r643448314



##
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/formatting.h"
+
+namespace arrow {
+
+using internal::StringFormatter;
+
+class ScalarTemporalTest : public ::testing::Test {};
+
+namespace compute {
+
+TEST(ScalarTemporalTest, TestSimpleTemporalComponentExtraction) {
+  const char* times =
+  R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+  "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", 
null])";
+  auto unit = timestamp(TimeUnit::NANO);
+  auto timestamps = ArrayFromJSON(unit, times);
+  auto iso_calendar_type =
+  struct_({field("iso_year", int64()), field("iso_week", int64()),
+   field("weekday", int64())});
+
+  auto year = "[1970, 2000, 1899, 2033, null]";
+  auto month = "[1, 2, 1, 5, null]";
+  auto day = "[1, 29, 1, 18, null]";
+  auto day_of_week = "[4, 2, 7, 3, null]";
+  auto day_of_year = "[1, 60, 1, 138, null]";
+  auto iso_year = "[1970, 2000, 1899, 2033, null]";
+  auto iso_week = "[1, 9, 52, 20, null]";
+  auto iso_calendar = ArrayFromJSON(iso_calendar_type,
+R"([{"iso_year": 1970, "iso_week": 1, 
"weekday": 4},
+{"iso_year": 2000, "iso_week": 9, "weekday": 2},
+{"iso_year": 1899, "iso_week": 52, "weekday": 7},
+{"iso_year": 2033, "iso_week": 20, "weekday": 3}, 
null])");
+  auto quarter = "[1, 1, 1, 2, null]";
+  auto hour = "[0, 23, 0, 3, null]";
+  auto minute = "[0, 23, 59, 33, null]";
+  auto second = "[59.123456789, 23.9, 20.001001001, 20.0, null]";
+  auto millisecond = "[123, 999, 1, 0, null]";
+  auto microsecond = "[456, 999, 1, 0, null]";
+  auto nanosecond = "[789, 999, 1, 0, null]";
+  auto subsecond = "[123456789, 9, 1001001, 0, null]";
+
+  CheckScalarUnary("year", unit, times, int64(), year);
+  CheckScalarUnary("month", unit, times, int64(), month);
+  CheckScalarUnary("day", unit, times, int64(), day);
+  CheckScalarUnary("day_of_week", unit, times, int64(), day_of_week);
+  CheckScalarUnary("day_of_year", unit, times, int64(), day_of_year);
+  CheckScalarUnary("iso_year", unit, times, int64(), iso_year);
+  CheckScalarUnary("iso_week", unit, times, int64(), iso_week);
+  CheckScalarUnary("iso_calendar", timestamps, iso_calendar);
+  CheckScalarUnary("quarter", unit, times, int64(), quarter);
+  CheckScalarUnary("hour", unit, times, int64(), hour);
+  CheckScalarUnary("minute", unit, times, int64(), minute);
+  CheckScalarUnary("second", unit, times, float64(), second);
+  CheckScalarUnary("millisecond", unit, times, int64(), millisecond);
+  CheckScalarUnary("microsecond", unit, times, int64(), microsecond);
+  CheckScalarUnary("nanosecond", unit, times, int64(), nanosecond);
+  CheckScalarUnary("subsecond", unit, times, int64(), subsecond);
+}
+
+TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
+  std::string timezone = "Etc/UTC-2";
+  const char* times =
+  R"(["1970-01-01T00:00:59.123456789","2000-02-29T23:23:23.9",
+  "1899-01-01T00:59:20.001001001","2033-05-18T03:33:20.0", 
null])";
+  auto unit = timestamp(TimeUnit::NANO, timezone);
+  auto timestamps = ArrayFromJSON(unit, times);
+
+  ASSERT_RAISES(Invalid, Year(timestamps));
+  ASSERT_RAISES(Invalid, Month(timestamps));
+  ASSERT_RAISES(Invalid, Day(timestamps));
+  ASSERT_RAISES(Invalid, DayOfWeek(timestamps));
+  ASSERT_RAISES(Invalid, DayOfYear(timestamps));
+  ASSERT_RAISES(Invalid, ISOYear(timestamps));
+  ASSERT_RAISES(Invalid, ISOWeek(timestamps));
+  ASSERT_RAISES(Invalid, ISOCalendar(timestamps));
+  ASSERT_RAISES(Invalid, Quarter(timestamps));
+  ASSERT_RAISES(Invalid, Hour(timestamps));
+  ASSERT_RAISES(Invalid, Minute(timestamps));
+  ASSERT_RAISES(Invalid, Second(timestamps));
+  ASSERT_RAISES(Invalid, Millisecond(timestamps));
+  

[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643448100



##
File path: cpp/src/arrow/compute/exec/expression.h
##
@@ -207,11 +218,22 @@ Result<Expression> SimplifyWithGuarantee(Expression,
 
 // Execution
 
-/// Execute a scalar expression against the provided state and input Datum. 
This
+/// Ensure that a RecordBatch (which may have missing or incorrectly ordered 
columns)

Review comment:
   will do, thanks








[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643447263



##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -510,7 +475,67 @@ Result<Expression> Expression::Bind(const Schema& in_schema,
   return Bind(ValueDescr::Array(struct_(in_schema.fields())), exec_context);
 }
 
-Result<Datum> ExecuteScalarExpression(const Expression& expr, const Datum& input,
+Result<ExecBatch> MakeExecBatch(const Schema& full_schema, const Datum& partial) {
+  ExecBatch out;
+
+  if (partial.kind() == Datum::RECORD_BATCH) {
+const auto& partial_batch = *partial.record_batch();
+out.length = partial_batch.num_rows();
+
+for (const auto& field : full_schema.fields()) {
+  ARROW_ASSIGN_OR_RAISE(auto column,
+
FieldRef(field->name()).GetOneOrNone(partial_batch));

Review comment:
   It will raise Status::Invalid
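The zero/one/many behavior being discussed for `GetOneOrNone` can be sketched like this (a Rust sketch with hypothetical names, not Arrow's actual C++ signatures): zero matches is `Ok(None)`, exactly one is `Ok(Some(..))`, and an ambiguous duplicated name is the error case that mirrors `Status::Invalid`.

```rust
/// Look up `name` among `fields`: Ok(None) if absent, Ok(Some) if unique,
/// Err if the name is ambiguous (the Status::Invalid case in the C++ code).
fn get_one_or_none<'a>(fields: &[&'a str], name: &str) -> Result<Option<&'a str>, String> {
    let mut matches = fields.iter().copied().filter(|f| *f == name);
    // Pulling two items is enough to distinguish zero / one / many.
    match (matches.next(), matches.next()) {
        (Some(f), None) => Ok(Some(f)),
        (None, _) => Ok(None),
        (Some(_), Some(_)) => Err(format!("multiple fields named {:?}", name)),
    }
}
```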












[GitHub] [arrow] bkietz commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


bkietz commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643447166



##
File path: cpp/src/arrow/compute/exec/exec_plan.h
##
@@ -225,22 +212,43 @@ class ARROW_EXPORT ExecNode {
   virtual void StopProducing() = 0;
 
  protected:
-  ExecNode(ExecPlan* plan, std::string label, std::vector<BatchDescr> input_descrs,
+  ExecNode(ExecPlan*, std::string label, NodeVector inputs,
std::vector<std::string> input_labels, BatchDescr output_descr,
int num_outputs);
 
   ExecPlan* plan_;
-
   std::string label_;
 
-  std::vector<BatchDescr> input_descrs_;
-  std::vector<std::string> input_labels_;
   NodeVector inputs_;
+  std::vector<std::string> input_labels_;
 
   BatchDescr output_descr_;
   int num_outputs_;
   NodeVector outputs_;
 };
 
+/// \brief Adapt an AsyncGenerator as a source node
+ARROW_EXPORT
+ExecNode* MakeSourceNode(ExecPlan*, std::string label, ExecNode::BatchDescr output_descr,
+ AsyncGenerator<util::optional<ExecBatch>>);
+
+/// \brief Add a sink node which forwards to an AsyncGenerator
+ARROW_EXPORT
+AsyncGenerator<util::optional<ExecBatch>> MakeSinkNode(ExecNode* input,
+   std::string label);
+
+/// \brief Make a node which excludes some rows from batches passed through it
+///
+/// filter Expression must be bound; no field references will be looked up by 
name
+ARROW_EXPORT
+ExecNode* MakeFilterNode(ExecNode* input, std::string label, Expression 
filter);
+
+/// \brief Make a node which executes expressions on input batches, producing 
new batches.
+///
+/// Expressions must be bound; no field references will be looked up by name
+ARROW_EXPORT
+ExecNode* MakeProjectNode(ExecNode* input, std::string label,

Review comment:
   I'll add clarification
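The doc comments under discussion say expressions "must be bound; no field references will be looked up by name". A hypothetical Rust sketch of what binding means here (names and types are illustrative, not Arrow's API): field names are resolved to column indices against a schema once, so execution never does string lookup per batch.

```rust
/// Resolve a field name to its column index in a schema, once, ahead of
/// execution. After this "binding" step the plan only carries indices.
fn bind(name: &str, schema: &[&str]) -> Result<usize, String> {
    schema
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| format!("no field named {:?}", name))
}
```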








[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


Dandandan commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643441081



##
File path: datafusion/src/optimizer/remove_duplicate_filters.rs
##
@@ -0,0 +1,611 @@
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in 
expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+_ => expr == needle,
+}
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+match expr {
+Expr::BinaryExpr { .. } => Some(expr),
+_ => None,
+}
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+match s {
+Expr::Literal(ScalarValue::Int8(Some(1))) => true,
+Expr::Literal(ScalarValue::Int16(Some(1))) => true,
+Expr::Literal(ScalarValue::Int32(Some(1))) => true,
+Expr::Literal(ScalarValue::Int64(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt8(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt16(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt32(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt64(Some(1))) => true,
+Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 1. => true,
+Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 1. => true,
+_ => false
+}
+}
+
+fn is_zero<'a>(s: &'a Expr) -> bool {
+match s {
+Expr::Literal(ScalarValue::Int8(Some(0))) => true,
+Expr::Literal(ScalarValue::Int16(Some(0))) => true,
+Expr::Literal(ScalarValue::Int32(Some(0))) => true,
+Expr::Literal(ScalarValue::Int64(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt8(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt16(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt32(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt64(Some(0))) => true,
+Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 0. => true,
+Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 0. => true,
+_ => false
+}
+}
+
+fn is_true<'a>(expr: &'a Expr) -> bool {
+match expr {
+Expr::Literal(ScalarValue::Boolean(Some(v))) => *v,
+_ => false,
+}
+}
+
+fn is_false<'a>(expr: &'a Expr) -> bool {
+match expr {
+Expr::Literal(ScalarValue::Boolean(Some(v))) => *v == false,
+_ => false,
+}
+}
+
+fn simplify<'a>(expr: &'a Expr) -> Expr {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_true(left) || is_true(right) => lit(true),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_false(left) => simplify(right),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_false(right) => simplify(left),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if left == right => simplify(left),
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} if is_false(left) || is_false(right) => lit(false),
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} if is_true(right) => 

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


Dandandan commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643440084



##
File path: datafusion/src/optimizer/mod.rs
##
@@ -25,4 +25,5 @@ pub mod hash_build_probe_order;
 pub mod limit_push_down;
 pub mod optimizer;
 pub mod projection_push_down;
+pub mod remove_duplicate_filters;

Review comment:
   I think that would be a more accurate name by now  








[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


Dandandan commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643439834



##
File path: datafusion/src/optimizer/remove_duplicate_filters.rs
##
@@ -0,0 +1,611 @@
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in 
expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+_ => expr == needle,
+}
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+match expr {
+Expr::BinaryExpr { .. } => Some(expr),
+_ => None,
+}
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+match s {
+Expr::Literal(ScalarValue::Int8(Some(1))) => true,
+Expr::Literal(ScalarValue::Int16(Some(1))) => true,
+Expr::Literal(ScalarValue::Int32(Some(1))) => true,
+Expr::Literal(ScalarValue::Int64(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt8(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt16(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt32(Some(1))) => true,
+Expr::Literal(ScalarValue::UInt64(Some(1))) => true,
+Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 1. => true,
+Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 1. => true,
+_ => false
+}
+}
+
+fn is_zero<'a>(s: &'a Expr) -> bool {
+match s {
+Expr::Literal(ScalarValue::Int8(Some(0))) => true,
+Expr::Literal(ScalarValue::Int16(Some(0))) => true,
+Expr::Literal(ScalarValue::Int32(Some(0))) => true,
+Expr::Literal(ScalarValue::Int64(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt8(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt16(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt32(Some(0))) => true,
+Expr::Literal(ScalarValue::UInt64(Some(0))) => true,
+Expr::Literal(ScalarValue::Float32(Some(v))) if *v == 0. => true,
+Expr::Literal(ScalarValue::Float64(Some(v))) if *v == 0. => true,
+_ => false
+}
+}
+
+fn is_true<'a>(expr: &'a Expr) -> bool {
+match expr {
+Expr::Literal(ScalarValue::Boolean(Some(v))) => *v,
+_ => false,
+}
+}
+
+fn is_false<'a>(expr: &'a Expr) -> bool {
+match expr {
+Expr::Literal(ScalarValue::Boolean(Some(v))) => *v == false,
+_ => false,
+}
+}
+
+fn simplify<'a>(expr: &'a Expr) -> Expr {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_true(left) || is_true(right) => lit(true),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_false(left) => simplify(right),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if is_false(right) => simplify(left),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} if left == right => simplify(left),
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} if is_false(left) || is_false(right) => lit(false),
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} if is_true(right) => 

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


Dandandan commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643439659



##
File path: datafusion/src/optimizer/remove_duplicate_filters.rs
##
@@ -0,0 +1,611 @@
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in 
expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+_ => expr == needle,
+}
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+match expr {
+Expr::BinaryExpr { .. } => Some(expr),
+_ => None,
+}
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+match s {

Review comment:
   Or it might be better to just use `|` patterns (without `matches!`) in 
the code.
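The `|`-pattern alternative being suggested can be sketched as follows (with stand-in `Scalar`/`Expr` types so the snippet compiles on its own; DataFusion's real `ScalarValue` has more variants):

```rust
// Minimal stand-ins for DataFusion's ScalarValue/Expr.
enum Scalar {
    Int32(Option<i32>),
    Int64(Option<i64>),
}
enum Expr {
    Literal(Scalar),
}

// Or-patterns fold the per-type match arms into one, with no macro involved.
fn is_one(e: &Expr) -> bool {
    match e {
        Expr::Literal(Scalar::Int32(Some(1)))
        | Expr::Literal(Scalar::Int64(Some(1))) => true,
        _ => false,
    }
}
```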








[GitHub] [arrow] westonpace commented on a change in pull request #10397: ARROW-11930: [C++][Dataset][Compute] Use an ExecPlan for dataset scans

2021-06-01 Thread GitBox


westonpace commented on a change in pull request #10397:
URL: https://github.com/apache/arrow/pull/10397#discussion_r643353334



##
File path: cpp/src/arrow/compute/exec/exec_plan.h
##
@@ -225,22 +212,43 @@ class ARROW_EXPORT ExecNode {
   virtual void StopProducing() = 0;
 
  protected:
-  ExecNode(ExecPlan* plan, std::string label, std::vector<BatchDescr> input_descrs,
+  ExecNode(ExecPlan*, std::string label, NodeVector inputs,

Review comment:
   Nit: Seems odd to have no name for `ExecPlan` but then have a name for 
everything else.

##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -61,13 +61,22 @@ Expression call(std::string function, std::vector<Expression> arguments,
   call.function_name = std::move(function);
   call.arguments = std::move(arguments);
   call.options = std::move(options);
+
+  call.hash = std::hash<std::string>{}(call.function_name);
+  for (const auto& arg : call.arguments) {
+call.hash ^= arg.hash();
+  }
   return Expression(std::move(call));
 }
 
 const Datum* Expression::literal() const { return util::get_if<Datum>(impl_.get()); }
 
+const Expression::Parameter* Expression::parameter() const {

Review comment:
   Checking my knowledge.  A parameter is an index + an expected type?

##
File path: cpp/src/arrow/compute/exec/expression.cc
##
@@ -613,6 +639,22 @@ std::vector<FieldRef> FieldsInExpression(const Expression& expr) {
   return fields;
 }
 
+std::vector ParametersInExpression(const Expression& expr) {

Review comment:
   It seems this could return duplicate indices (e.g. in something like `x 
< 5 && x > 0`).  Is that a problem?

##
File path: cpp/src/arrow/dataset/dataset_internal.h
##
@@ -204,5 +204,35 @@ arrow::Result> GetFragmentScanOptions(
   return internal::checked_pointer_cast(source);
 }
 
+class FragmentDataset : public Dataset {

Review comment:
   Should this be a base type of `InMemoryDataset`?

##
File path: cpp/src/arrow/compute/exec/test_util.cc
##
@@ -124,277 +130,42 @@ struct DummyNode : ExecNode {
   bool started_ = false;
 };
 
-struct RecordBatchReaderNode : ExecNode {
-  RecordBatchReaderNode(ExecPlan* plan, std::string label,
-std::shared_ptr<RecordBatchReader> reader, Executor* io_executor)
-  : ExecNode(plan, std::move(label), {}, {},
- DescrFromSchemaColumns(*reader->schema()), /*num_outputs=*/1),
-schema_(reader->schema()),
-reader_(std::move(reader)),
-io_executor_(io_executor) {}
-
-  RecordBatchReaderNode(ExecPlan* plan, std::string label, std::shared_ptr<Schema> schema,
-RecordBatchGenerator generator, Executor* io_executor)
-  : ExecNode(plan, std::move(label), {}, {}, 
DescrFromSchemaColumns(*schema),
- /*num_outputs=*/1),
-schema_(std::move(schema)),
-generator_(std::move(generator)),
-io_executor_(io_executor) {}
-
-  const char* kind_name() override { return "RecordBatchReader"; }
-
-  void InputReceived(ExecNode* input, int seq_num, compute::ExecBatch batch) 
override {}
-
-  void ErrorReceived(ExecNode* input, Status error) override {}
-
-  void InputFinished(ExecNode* input, int seq_stop) override {}
-
-  Status StartProducing() override {
-next_batch_index_ = 0;
-if (!generator_) {
-  auto it = MakeIteratorFromReader(reader_);
-  ARROW_ASSIGN_OR_RAISE(generator_,
-MakeBackgroundGenerator(std::move(it), 
io_executor_));
-}
-GenerateOne(std::unique_lock<std::mutex>{mutex_});
-return Status::OK();
-  }
-
-  void PauseProducing(ExecNode* output) override {}
-
-  void ResumeProducing(ExecNode* output) override {}
-
-  void StopProducing(ExecNode* output) override {
-ASSERT_EQ(output, outputs_[0]);
-std::unique_lock<std::mutex> lock(mutex_);
-generator_ = nullptr;  // null function
-  }
-
-  void StopProducing() override { StopProducing(outputs_[0]); }
-
- private:
-  void GenerateOne(std::unique_lock&& lock) {
-if (!generator_) {
-  // Stopped
-  return;
-}
-auto plan = this->plan()->shared_from_this();
-auto fut = generator_();
-const auto batch_index = next_batch_index_++;
-
-lock.unlock();
-// TODO we want to transfer always here
-io_executor_->Transfer(std::move(fut))
-.AddCallback(
-[plan, batch_index, this](const Result<std::shared_ptr<RecordBatch>>& res) {
-  std::unique_lock<std::mutex> lock(mutex_);
-  if (!res.ok()) {
-for (auto out : outputs_) {
-  out->ErrorReceived(this, res.status());
-}
-return;
-  }
-  const auto& batch = *res;
-  if (IsIterationEnd(batch)) {
-lock.unlock();
-for (auto out : outputs_) {
-  out->InputFinished(this, batch_index);
-}
-  } else {
-lock.unlock();
-for (auto out : outputs_) {
-  out->InputReceived(this, 

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #436: Remove reundant filters (e.g. c> 5 AND c>5 --> c>5)

2021-06-01 Thread GitBox


Dandandan commented on a change in pull request #436:
URL: https://github.com/apache/arrow-datafusion/pull/436#discussion_r643438683



##
File path: datafusion/src/optimizer/remove_duplicate_filters.rs
##
@@ -0,0 +1,611 @@
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Remove duplicate filters optimizer rule
+
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::LogicalPlan;
+use crate::logical_plan::{lit, Expr};
+use crate::optimizer::optimizer::OptimizerRule;
+use crate::optimizer::utils;
+use crate::optimizer::utils::optimize_explain;
+use crate::scalar::ScalarValue;
+use crate::{error::Result, logical_plan::Operator};
+
+/// Remove duplicate filters optimizer.
+/// # Introduction
+/// It uses boolean algebra laws to simplify or reduce the number of terms in 
expressions.
+///
+/// Filter: #b Gt Int32(2) And #b Gt Int32(2)
+/// is optimized to
+/// Filter: #b Gt Int32(2)
+pub struct RemoveDuplicateFilters {}
+
+fn expr_contains<'a>(expr: &'a Expr, needle: &'a Expr) -> bool {
+match expr {
+Expr::BinaryExpr {
+left,
+op: Operator::And,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+Expr::BinaryExpr {
+left,
+op: Operator::Or,
+right,
+} => expr_contains(left, needle) || expr_contains(right, needle),
+_ => expr == needle,
+}
+}
+
+fn as_binary_expr<'a>(expr: &'a Expr) -> Option<&'a Expr> {
+match expr {
+Expr::BinaryExpr { .. } => Some(expr),
+_ => None,
+}
+}
+
+fn operator_is_boolean(op: &Operator) -> bool {
+op == &Operator::And || op == &Operator::Or
+}
+
+fn is_one<'a>(s: &'a Expr) -> bool {
+match s {

Review comment:
   A `matches!` returns a boolean so you can also use `||` / boolean or to 
cover the float cases and keep the rest in a single `matches!`
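The combination being described — one `matches!` for the integer cases, `||`-chained with guarded `matches!` calls for the float cases — looks roughly like this (stand-in `Scalar`/`Expr` types; DataFusion's `ScalarValue` has more integer widths that would join the first pattern with `|`):

```rust
// Minimal stand-ins for DataFusion's ScalarValue/Expr.
enum Scalar {
    Int32(Option<i32>),
    Float64(Option<f64>),
}
enum Expr {
    Literal(Scalar),
}

// `matches!` returns a bool, so exact-literal cases collapse into one macro
// call and the float cases (which need an `if` guard) are chained with `||`.
fn is_zero(e: &Expr) -> bool {
    matches!(e, Expr::Literal(Scalar::Int32(Some(0))))
        || matches!(e, Expr::Literal(Scalar::Float64(Some(v))) if *v == 0.)
}
```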








[GitHub] [arrow-datafusion] alamb merged pull request #434: fix: display the content of debug explain

2021-06-01 Thread GitBox


alamb merged pull request #434:
URL: https://github.com/apache/arrow-datafusion/pull/434


   






[GitHub] [arrow-datafusion] alamb closed issue #430: LogicalPlan::inputs function should return the input plan for Explain enum

2021-06-01 Thread GitBox


alamb closed issue #430:
URL: https://github.com/apache/arrow-datafusion/issues/430


   






[GitHub] [arrow] anthonylouisbsb commented on pull request #10350: ARROW-12814: [C++][Gandiva] Implements the ABS, FLOOR, PI, SQRT, SIGN, LSHIFT, RSHIFT, CEIL, TRUNC and LN functions

2021-06-01 Thread GitBox


anthonylouisbsb commented on pull request #10350:
URL: https://github.com/apache/arrow/pull/10350#issuecomment-852402117


   The `LN` function was added in this PR too.






[GitHub] [arrow-datafusion] NGA-TRAN commented on pull request #434: fix: display the content of debug explain

2021-06-01 Thread GitBox


NGA-TRAN commented on pull request #434:
URL: https://github.com/apache/arrow-datafusion/pull/434#issuecomment-852397463


   @alamb Finally all checks have passed :)
   






[GitHub] [arrow] github-actions[bot] commented on pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API

2021-06-01 Thread GitBox


github-actions[bot] commented on pull request #10431:
URL: https://github.com/apache/arrow/pull/10431#issuecomment-852396357


   https://issues.apache.org/jira/browse/ARROW-12921





[GitHub] [arrow] JayjeetAtGithub opened a new pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API

2021-06-01 Thread GitBox


JayjeetAtGithub opened a new pull request #10431:
URL: https://github.com/apache/arrow/pull/10431


   The implementation includes a new `RadosParquetFileFormat` class that inherits from the `ParquetFileFormat` class to defer the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. This new file format plugs into the `FileSystemDataset` API, converts filenames to object IDs using filesystem metadata, and uses the [librados](https://docs.ceph.com/en/latest/rados/api/librados-intro/) C++ library to execute storage-side functions that scan the files on the [Ceph](https://ceph.io) storage nodes (OSDs) using Arrow libraries. We ship unit and integration tests with our implementation; the tests run against a single-node Ceph cluster.
   
   The storage-side code is implemented as a RADOS CLS (object storage class) using [Ceph's Object Class SDK](https://docs.ceph.com/en/latest/rados/api/objclass-sdk/). The code lives in `cpp/src/arrow/adapters/arrow-rados-cls` and is expected to be deployed on the storage nodes (Ceph's OSDs) prior to operating on tables via the `RadosParquetFileFormat` implementation. This PR includes a CMake configuration for building this library if desired (the `ARROW_CLS` CMake option). We have also added Python bindings for our C++ implementations, along with integration tests for them.
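   A minimal sketch of the filename-to-object-ID idea mentioned above. This is a hypothetical helper, not the PR's actual code: the real `RadosParquetFileFormat` resolves object IDs from filesystem metadata in C++. The sketch only illustrates that the mapping must be deterministic, so a scan for a given file is always routed to the same RADOS object:

   ```python
   import hashlib

   def filename_to_object_id(path: str) -> str:
       """Hypothetical mapping from a dataset file path to a stable RADOS
       object ID. Deterministic by construction: the same path always
       yields the same ID, so storage-side scans hit the same object."""
       return hashlib.sha256(path.encode("utf-8")).hexdigest()[:32]

   # Example: two lookups of the same path agree; distinct paths differ.
   oid = filename_to_object_id("dataset/part-0.parquet")
   print(oid)
   ```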





