Re: [I] Cast from DATE to VARCHAR fails [datafusion]

2025-09-12 Thread via GitHub


findepi commented on issue #17533:
URL: https://github.com/apache/datafusion/issues/17533#issuecomment-3284008393

   `arrow_cast` to `Utf8` works
   `arrow_cast` to `Utf8View` fails
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Cast from DATE to VARCHAR fails [datafusion]

2025-09-12 Thread via GitHub


findepi commented on issue #17533:
URL: https://github.com/apache/datafusion/issues/17533#issuecomment-3284075814

   We probably need to fix `can_cast_types` too
   
   - https://github.com/apache/arrow-rs/pull/8328


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Prepare for Merge Queue [datafusion]

2025-09-12 Thread via GitHub


Jefffrey commented on PR #17183:
URL: https://github.com/apache/datafusion/pull/17183#issuecomment-3284090122

   > Merge Queue only waits for the required steps - us all checks are 
important hence I made them all required.
   It'll become impossible to merge a PR if some of the checks are not passed, 
but I believe that's the flow we use now anyways.
   
   I think there's an edge case where if a PR doesn't kick off all the checks 
(e.g. a doc PR) then it is blocked since all checks must be successful now, but 
it doesn't trigger all the checks to run. For example, this simple PR is 
blocked from merging at the moment: #17532
   
   @blaginin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Panic when cast from `DATE` to `TIMESTAMP` overflows [datafusion]

2025-09-12 Thread via GitHub


findepi opened a new issue, #17534:
URL: https://github.com/apache/datafusion/issues/17534

   ### Describe the bug
   
   Casting a date to timestamp involves multiplication and may overflow.
   DataFusion should report query error in such case, rather than panicking.
   
   ### To Reproduce
   
   ```shell
   cargo run --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' 
As timestamp);"
   ```
   
   
   ```
   datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT 
CAST(DATE '-12-31' As timestamp);"
   Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/datafusion-cli --command 'SELECT CAST(DATE 
'\''-12-31'\'' As timestamp);'`
   DataFusion CLI v50.0.0
   
   thread 'main' panicked at 
/Users/findepi/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-cast-56.0.0/src/cast/mod.rs:1881:58:
   attempt to multiply with overflow
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   
   ### Expected behavior
   
   - query error informing about overflow
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]

2025-09-12 Thread via GitHub


KR-bluejay commented on code in PR #1314:
URL: 
https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336556200


##
ballista/executor/src/execution_loop.rs:
##
@@ -88,8 +90,29 @@ pub async fn poll_loop
 
 match poll_work_result {
 Ok(result) => {
-let tasks = result.into_inner().tasks;
+let PollWorkResult {
+tasks,
+jobs_to_clean,
+} = result.into_inner();
 active_job = !tasks.is_empty();
+let work_dir = PathBuf::from(&executor.work_dir);
+
+// Clean up any state related to the listed jobs

Review Comment:
   Hi, sorry for the delay — I’ve pushed the changes.
   
   I think I should add a test before merge, but I’m not sure how.  
   There is a `test_executor_clean_up`, but it seems to cover different 
functionality, not `remove_job_dir`.
   
   Could you suggest how I should write such a test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat(spark): implement Spark `try_parse_url` function [datafusion]

2025-09-12 Thread via GitHub


Jefffrey commented on code in PR #17485:
URL: https://github.com/apache/datafusion/pull/17485#discussion_r2340394000


##
datafusion/spark/src/function/url/parse_url.rs:
##
@@ -47,23 +46,7 @@ impl Default for ParseUrl {
 impl ParseUrl {
 pub fn new() -> Self {
 Self {
-signature: Signature::one_of(
-vec![
-TypeSignature::Uniform(
-1,
-vec![DataType::Utf8View, DataType::Utf8, 
DataType::LargeUtf8],
-),
-TypeSignature::Uniform(
-2,
-vec![DataType::Utf8View, DataType::Utf8, 
DataType::LargeUtf8],
-),
-TypeSignature::Uniform(
-3,
-vec![DataType::Utf8View, DataType::Utf8, 
DataType::LargeUtf8],
-),
-],
-Volatility::Immutable,
-),
+signature: Signature::user_defined(Volatility::Immutable),

Review Comment:
   Dictionary type is another way to represent strings, similar to how view 
type is a different way to represent strings.
   
   Can see here in the doc there's a reference to handling dictionary types:
   
   
https://github.com/apache/datafusion/blob/351675ddc27c42684a079b3a89fe2dee581d89a2/datafusion/expr-common/src/signature.rs#L229-L239



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]

2025-09-12 Thread via GitHub


milenkovicm commented on code in PR #1314:
URL: 
https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336689475


##
ballista/executor/src/execution_loop.rs:
##
@@ -88,8 +90,29 @@ pub async fn poll_loop
 
 match poll_work_result {
 Ok(result) => {
-let tasks = result.into_inner().tasks;
+let PollWorkResult {
+tasks,
+jobs_to_clean,
+} = result.into_inner();
 active_job = !tasks.is_empty();
+let work_dir = PathBuf::from(&executor.work_dir);
+
+// Clean up any state related to the listed jobs

Review Comment:
   thanks @KR-bluejay, no delays at all! will have a look later today hopefully 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Cast from DATE to VARCHAR fails [datafusion]

2025-09-12 Thread via GitHub


findepi opened a new issue, #17533:
URL: https://github.com/apache/datafusion/issues/17533

   ### Describe the bug
   
   ```
   DataFusion CLI v50.0.0
   > SELECT CAST(DATE '-12-31' AS varchar);
   This feature is not implemented: Unsupported CAST from Date32 to Utf8View
   ```
   
   ### To Reproduce
   
   ```shell
   cargo run --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' 
AS varchar);"
   ```
   
   ### Expected behavior
   
   ```
   ++
   | Utf8("-12-31") |
   ++
   | -12-31 |
   ++
   1 row(s) fetched.
   Elapsed 0.057 seconds.
   ```
   
   ### Additional context
   
   This used to work in DataFusion 47.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Prepare for Merge Queue [datafusion]

2025-09-12 Thread via GitHub


blaginin commented on PR #17183:
URL: https://github.com/apache/datafusion/pull/17183#issuecomment-3284132932

   Will resolve now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Panic when cast from `DATE` to `TIMESTAMP` overflows [datafusion]

2025-09-12 Thread via GitHub


findepi commented on issue #17534:
URL: https://github.com/apache/datafusion/issues/17534#issuecomment-3284162376

   "attempt to multiply with overflow" is a standard Rust check not present in 
release builds.:
   
   ```
   datafusion main$ cargo run --release --bin datafusion-cli -- --command 
"SELECT CAST(DATE '-12-31' As timestamp);"
   Finished `release` profile [optimized] target(s) in 0.21s
Running `target/release/datafusion-cli --command 'SELECT CAST(DATE 
'\''-12-31'\'' As timestamp);'`
   DataFusion CLI v50.0.0
   +---+
   | Utf8("-12-31")|
   +---+
   | 1816-03-29T05:56:08.066277376 |
   +---+
   ```
   
   
   However, i believe that DataFusion debug builds should strive to produce 
same results as release builds, in particular they should not panic. If silent 
overflow is intentional, the code should tell compiler that.
   
   FWIW, numeric multiplication silently overflows in debug builds, does not 
panic
   
   ```
   datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT 
100 * 100 * 100;"
   Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.72s
Running `target/debug/datafusion-cli --command 'SELECT 100 * 
100 * 100;'`
   DataFusion CLI v50.0.0
   +--+
   | Int64(100) * Int64(100) * Int64(100) |
   +--+
   | 5076944270305263616  |
   +--+
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Support `CAST` from temporal to `Utf8View` [datafusion]

2025-09-12 Thread via GitHub


findepi opened a new pull request, #17535:
URL: https://github.com/apache/datafusion/pull/17535

   - fixes https://github.com/apache/datafusion/issues/17533
   - requires https://github.com/apache/arrow-rs/pull/8328


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Push down entire hash table from HashJoinExec into scans [datafusion]

2025-09-12 Thread via GitHub


LiaCastaneda commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3284143582

   > I think building an IN LIST in the dynamic filter (in addition to min/max) 
allows us to leverage the existing machinery that builds literal guarantees in 
PruningPredicate. The parquet data source already takes advantage of this in 
the case that bloom filters are present and can help us prune entire row 
groups, which definitely seems valuable to me.
   
   I think this would also help users with custom `ExecutionPlan` scan nodes 
that aren’t Parquet or don’t support bloom predicates.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: pass the ordering information to native Scan [datafusion-comet]

2025-09-12 Thread via GitHub


codecov-commenter commented on PR #2375:
URL: 
https://github.com/apache/datafusion-comet/pull/2375#issuecomment-3276620996

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   :x: Patch coverage is `41.17647%` with `30 lines` in your changes missing 
coverage. Please review.
   :white_check_mark: Project coverage is 40.16%. Comparing base 
([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache))
 to head 
([`4d0c2ab`](https://app.codecov.io/gh/apache/datafusion-comet/commit/4d0c2ab573ab55219358c9de2a4428b703ba601b?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)).
   :warning: Report is 490 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==)
 | 51.21% | [14 Missing and 6 partials :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...la/org/apache/spark/sql/comet/CometExecUtils.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometExecUtils.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRFeGVjVXRpbHMuc2NhbGE=)
 | 0.00% | [7 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...apache/spark/sql/comet/CometCollectLimitExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometCollectLimitExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRDb2xsZWN0TGltaXRFeGVjLnNjYWxh)
 | 0.00% | [2 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...ark/sql/comet/CometTakeOrderedAndProjectExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometTakeOrderedAndProjectExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRUYWtlT3JkZXJlZEFuZFByb2plY3RFeGVjLnNjYWxh)
 | 0.00% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#2375   +/-   ##
   =
   - Coverage 56.12%   40.16%   -15.97% 
   - Complexity  976  993   +17 
   =
 Files   119  147   +28 
 Lines 1174313520 +1777 
 Branches   2251 2396  +145 
   =
   - Hits   6591 5430 -1161 
   - Misses 4012 7136 +3124 
   + Partials   1140  954  -186 
   ```
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](ht

[PR] fix: Change `OuterReferenceColumn` to contain the entire outer field to prevent metadata loss [datafusion]

2025-09-12 Thread via GitHub


Kontinuation opened a new pull request, #17524:
URL: https://github.com/apache/datafusion/pull/17524

   ## Which issue does this PR close?
   
   - Closes #17422.
   
   ## Rationale for this change
   
   As reported by #17422, extension metadata was dropped in some queries 
involving subqueries. This happens when a column references another column in a 
correlated subquery, which creates an `OuterReferenceColumn` in DataFusion's 
logical plan.
   
   `OuterReferenceColumn` is defined to contain a data type and a referenced 
column. As we know, outer reference column refers to a field in another table, 
so there's no way to retrieve the field metadata of referenced column given the 
schema of the current table. To overcome this problem, we define 
`OuterReferenceColumn` to contain the entire referenced field as a `FieldRef`. 
We also updated the implementation of `to_field` to return the referenced 
field, which contains the original metadata of referenced field.
   
   ## What changes are included in this PR?
   
   Changed `OuterReferenceColumn` to contain the referenced outer field as a 
`FieldRef`.
   
   ## Are these changes tested?
   
   * Added a unit test
   * Added the repro in the issue to `user_defined_scalar_functions` test
   
   ## Are there any user-facing changes?
   
   Yes. This changes part of the public API.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Introduce `avg_distinct()` and `sum_distinct()` functions to DataFrame API [datafusion]

2025-09-12 Thread via GitHub


Jefffrey opened a new pull request, #17536:
URL: https://github.com/apache/datafusion/pull/17536

   ## Which issue does this PR close?
   
   
   
   - Closes #2409 and #2407
   
   ## Rationale for this change
   
   
   
   So we can use these distinct aggregates via DataFrames
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: synchronize partition bounds reporting in HashJoin [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on PR #17452:
URL: https://github.com/apache/datafusion/pull/17452#issuecomment-3271833195

   > @adriangb I think there is opportunity to simplify the bounds collection 
for each partition. That is, we can probably just track the min/max across all 
partitions and build a single `AND` binary expr once we have the final min/max 
(i.e. all partition bounds have been reported).
   > 
   > Aside from one less mutex, I think it'll help reduce output in `EXPLAIN` 
as well. Happy to tackle in a follow-up PR
   
   I think that will regress performance: imagine partition 1 has bounds (0, 1) 
and partition 2 has bounds (98, 99). With bounds per partition the 
value 1234 is filtered out. The merged bounds of (0, 99) would include that 
value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Introduce `avg_distinct()` and `sum_distinct()` functions to DataFrame API [datafusion]

2025-09-12 Thread via GitHub


Jefffrey commented on code in PR #17536:
URL: https://github.com/apache/datafusion/pull/17536#discussion_r2343365209


##
datafusion/functions-aggregate/src/average.rs:
##
@@ -62,6 +62,17 @@ make_udaf_expr_and_func!(
 avg_udaf
 );
 
+pub fn avg_distinct(expr: Expr) -> Expr {
+Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
+avg_udaf(),
+vec![expr],
+true,
+None,
+vec![],
+None,
+))
+}

Review Comment:
   Same as how count handles it:
   
   
https://github.com/apache/datafusion/blob/bfc5067718a3ddcb87531b5a9633605792078546/datafusion/functions-aggregate/src/count.rs#L71-L80



##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -496,32 +497,35 @@ async fn drop_with_periods() -> Result<()> {
 #[tokio::test]
 async fn aggregate() -> Result<()> {
 // build plan using DataFrame API
-let df = test_table().await?;
+// union so some of the distincts have a clearly distinct result
+let df = test_table().await?.union(test_table().await?)?;
 let group_expr = vec![col("c1")];
 let aggr_expr = vec![
-min(col("c12")),
-max(col("c12")),
-avg(col("c12")),
-sum(col("c12")),
-count(col("c12")),
-count_distinct(col("c12")),
+min(col("c4")).alias("min(c4)"),
+max(col("c4")).alias("max(c4)"),
+avg(col("c4")).alias("avg(c4)"),
+avg_distinct(col("c4")).alias("avg_distinct(c4)"),
+sum(col("c4")).alias("sum(c4)"),
+sum_distinct(col("c4")).alias("sum_distinct(c4)"),
+count(col("c4")).alias("count(c4)"),
+count_distinct(col("c4")).alias("count_distinct(c4)"),

Review Comment:
   I switched to `c4` from `c12` as `c12` had some precision variations for 
avg_distinct leading to inconsistent test results, and figured it was easier to 
switch columns than slap `round` on the outputs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] High CPU during dynamic filter bound computation: min_batch/max_batch [datafusion]

2025-09-12 Thread via GitHub


LiaCastaneda commented on issue #17486:
URL: https://github.com/apache/datafusion/issues/17486#issuecomment-3270213815

   cc @adriangb -- sorry for the direct ping. Since you did most of the dynamic 
filtering work, do you know if this behavior is expected? 🙇‍♀️ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]

2025-09-12 Thread via GitHub


KR-bluejay commented on code in PR #1314:
URL: 
https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336699580


##
ballista/executor/src/execution_loop.rs:
##
@@ -88,8 +90,29 @@ pub async fn poll_loop
 
 match poll_work_result {
 Ok(result) => {
-let tasks = result.into_inner().tasks;
+let PollWorkResult {
+tasks,
+jobs_to_clean,
+} = result.into_inner();
 active_job = !tasks.is_empty();
+let work_dir = PathBuf::from(&executor.work_dir);
+
+// Clean up any state related to the listed jobs

Review Comment:
   Thank you for reviewing this, @milenkovicm — I really appreciate it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on PR #17444:
URL: https://github.com/apache/datafusion/pull/17444#issuecomment-3271240054

   > Benchmark clickbench_extended.json
   > │ QQuery 1 │  1216.87 ms │ 1446.94 ms │ 1.19x slower │
   
   I think this is noise, Q1 is:
   
   
https://github.com/apache/datafusion/blob/fc5888b49b9757b20289e479f8b41440f383deb4/benchmarks/queries/clickbench/extended/q1.sql#L4C1-L4C115
   
   Which has no join.
   
   > Benchmark tpch_mem_sf1.json
   > QQuery 8 │  37.27 ms │   29.13 ms │ +1.28x faster
   
   This looks real, lots of joins!
   
   
https://github.com/apache/datafusion/blob/fc5888b49b9757b20289e479f8b41440f383deb4/benchmarks/queries/q8.sql#L13-L21


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Refactor and improve job data cleanup logic [datafusion-ballista]

2025-09-12 Thread via GitHub


milenkovicm commented on issue #1316:
URL: 
https://github.com/apache/datafusion-ballista/issues/1316#issuecomment-3279895972

   I think you could make an epic in this area
   
   There are few things I would like to propose regarding to files: 
   
   1. shuffle files are absolute paths, i would like to make them relative to 
flight store directory (improves security)
   2. I would like to change how files are organised, instead of having 
`job_id/shuffle...` i would like to prefix files with 
`session_id/job_id/shuffle...` or `session_id/cache/`  or 
`session_id/job_id/operator_state`  ... i would like to be able to delete any 
of them, but also delete all files for given session_id when client disconnects 
   3. should we make arrow flight server owner of files, and expose actions on 
them to delete files ?
   4.  #558 
   
   wdyt @KR-bluejay ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Refactor and improve job data cleanup logic [datafusion-ballista]

2025-09-12 Thread via GitHub


KR-bluejay commented on issue #1316:
URL: 
https://github.com/apache/datafusion-ballista/issues/1316#issuecomment-3279848468

   I’d be happy to work on this as a follow-up, if this direction sounds good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Disable `required_status_checks` for now [datafusion]

2025-09-12 Thread via GitHub


blaginin opened a new pull request, #17537:
URL: https://github.com/apache/datafusion/pull/17537

   Follow up 
https://github.com/apache/datafusion/pull/17183#issuecomment-3284090122
   
   As a hotfix, disabled the checks to unblock  @Jefffrey
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Update Bug issue template to use Bug issue type [datafusion]

2025-09-12 Thread via GitHub


findepi opened a new pull request, #17540:
URL: https://github.com/apache/datafusion/pull/17540

   Same for Feature issue template.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Consumer receives duplicate predicates when join mode is CollectLeft [datafusion]

2025-09-12 Thread via GitHub


LiaCastaneda opened a new issue, #17541:
URL: https://github.com/apache/datafusion/issues/17541

   ### Describe the bug
   
   I see duplicated OR clauses on the DynamicPhysicalExpr I get in the consumer
   
   for an execution plan like this:
   ```
   
   ProjectionExec: expr=[c0@0 as c0, c1@1 as c1, c2@2 as c2]
 CoalescePartitionsExec: fetch=5
   CoalesceBatchesExec: target_batch_size=8192, fetch=5
 HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(c0@0, c32@32)]
   CoalesceBatchesExec: target_batch_size=8192
 FilterExec: c0@0 IS NOT NULL
   DataSourceExec: partitions=1, partition_sizes=[1]
RepartitionExec: partitioning=RoundRobinBatch(16), 
input_partitions=1
  CooperativeExec
DataSourceExec: partitions=1
   ```
   
   The bounds  predicates arrive as 16 identical conjuncts, 1 per (right) 
output partition it seems:
   
   ```
   (
 ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
 OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
   )
   ```
   
   
   This is probably related to 
[this](https://github.com/apache/datafusion/pull/17197#pullrequestreview-3135737978)
 comment. I wrote some logic in the consumer node to dedup the predicates but 
it seems worth handling in DataFusion.
   
   Following the code, in `CollectLeft` we derive the number of output 
predicates from the 
[right](https://github.com/apache/datafusion/blob/50733bcb801e21766dbed6e2403e87cfad7f8007/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L157)
 side’s partition count. But iiuc `CollectLeft` collects the left into a single 
partition, so every right-side partition will see the same bounds in theory? 
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Blog: Add table of contents to blog article [datafusion-site]

2025-09-12 Thread via GitHub


alamb merged PR #107:
URL: https://github.com/apache/datafusion-site/pull/107


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Disable `required_status_checks` for now [datafusion]

2025-09-12 Thread via GitHub


blaginin merged PR #17537:
URL: https://github.com/apache/datafusion/pull/17537


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Rename Blaze to Auron [datafusion]

2025-09-12 Thread via GitHub


Jefffrey merged PR #17532:
URL: https://github.com/apache/datafusion/pull/17532


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update Bug issue template to use Bug issue type [datafusion]

2025-09-12 Thread via GitHub


Jefffrey merged PR #17540:
URL: https://github.com/apache/datafusion/pull/17540


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Blog: Add table of contents to blog article [datafusion-site]

2025-09-12 Thread via GitHub


nuno-faria commented on PR #107:
URL: https://github.com/apache/datafusion-site/pull/107#issuecomment-3285025510

   > Once we merge this PR, should we add TOC's to the older posts as well?
   
   Yeah I think so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]

2025-09-12 Thread via GitHub


alamb commented on code in PR #17521:
URL: https://github.com/apache/datafusion/pull/17521#discussion_r2343741787


##
datafusion/sqllogictest/test_files/push_down_filter.slt:
##
@@ -395,3 +395,28 @@ order by t1.k, t2.v;
 
 1 1 1
 1000 1000 1000
+
+# Regression test for https://github.com/apache/datafusion/issues/17512
+
+query I
+COPY (
+SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 
'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp
+)
+TO 'test_files/scratch/push_down_filter/17512.parquet'
+STORED AS PARQUET;
+
+1
+
+statement ok
+CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 
'test_files/scratch/push_down_filter/17512.parquet';
+
+query I
+SELECT 1
+FROM (
+SELECT start_timestamp
+FROM records
+WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz
+) AS t
+WHERE t.start_timestamp::time < '00:00:01'::time;

Review Comment:
   > They need to take the f into account. The optimizer that does that is 
called "unwrap cast in comparison" AFAICT. 
   
   Yes I agree (assuming `f` is a `cast` expression)
   
   > The find_most_restrictive_predicate should operate only on predicates 
comparing column c directly, ignoring those which compare f(c).
   
   That is my understanding of what this PR does. I am not sure if you are just 
confirming this change or if you are proposing / suggesting something more
   
   



##
datafusion/optimizer/src/simplify_expressions/simplify_predicates.rs:
##
@@ -239,8 +239,99 @@ fn find_most_restrictive_predicate(
 fn extract_column_from_expr(expr: &Expr) -> Option {
 match expr {
 Expr::Column(col) => Some(col.clone()),
-// Handle cases where the column might be wrapped in a cast or other 
operation

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add Binary/LargeBinary/BinaryView/FixedSizeBinary to join_fuzz [datafusion]

2025-09-12 Thread via GitHub


alamb closed issue #17447: Add Binary/LargeBinary/BinaryView/FixedSizeBinary  
to join_fuzz
URL: https://github.com/apache/datafusion/issues/17447


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add binary to `join_fuzz` testing [datafusion]

2025-09-12 Thread via GitHub


alamb merged PR #17497:
URL: https://github.com/apache/datafusion/pull/17497


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add binary to `join_fuzz` testing [datafusion]

2025-09-12 Thread via GitHub


alamb commented on PR #17497:
URL: https://github.com/apache/datafusion/pull/17497#issuecomment-3285061792

   Thank you @jonathanc-n and @findepi 
   
   I agree it might be worth figuring out some way to expand coverage that 
doesn't require so many individual tests to be added / updated 🤔  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] application of simple optimizer rule produces incorrect results (DF 49 regression) [datafusion]

2025-09-12 Thread via GitHub


alamb commented on issue #17510:
URL: https://github.com/apache/datafusion/issues/17510#issuecomment-3285068318

   I think we can close this ticket as not a bug, is that ok @wkalt ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Nested Loop Join: Performance Regression in DataFusion 50 for Suboptimal Join Orderings [datafusion]

2025-09-12 Thread via GitHub


alamb closed issue #17488: Nested Loop Join: Performance Regression in 
DataFusion 50 for Suboptimal Join Orderings
URL: https://github.com/apache/datafusion/issues/17488


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Release DataFusion `50.0.0` (Aug/Sep 2025) [datafusion]

2025-09-12 Thread via GitHub


xudong963 commented on issue #16799:
URL: https://github.com/apache/datafusion/issues/16799#issuecomment-3284639539

   Hi @comphead, i have a severe headache today and i'm be off. 
   
   We may need to start the voting process today, could you please help me do 
it? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Refactor TableProvider::scan into TableProvider::scan_with_args [datafusion]

2025-09-12 Thread via GitHub


alamb commented on code in PR #17336:
URL: https://github.com/apache/datafusion/pull/17336#discussion_r2344142167


##
datafusion/catalog/src/table.rs:
##
@@ -299,6 +334,119 @@ pub trait TableProvider: Debug + Sync + Send {
 }
 }
 
+/// Arguments for scanning a table with [`TableProvider::scan_with_args`].
+///
+/// `ScanArgs` provides a structured way to pass scan parameters to table 
providers,
+/// replacing the multiple individual parameters used by 
[`TableProvider::scan`].
+/// This struct uses the builder pattern for convenient construction.
+///
+/// # Examples
+///
+/// ```
+/// # use datafusion_catalog::ScanArgs;
+/// # use datafusion_expr::Expr;
+/// let args = ScanArgs::default()
+/// .with_projection(Some(vec![0, 2, 4]))
+/// .with_limit(Some(1000));
+/// ```
+#[derive(Debug, Clone, Default)]
+pub struct ScanArgs {
+filters: Option>,
+projection: Option>,
+limit: Option,
+}
+
+impl ScanArgs {
+/// Set the column projection for the scan.
+///
+/// The projection is a list of column indices from 
[`TableProvider::schema`]
+/// that should be included in the scan results. If `None`, all columns 
are included.
+///
+/// # Arguments
+/// * `projection` - Optional list of column indices to project
+pub fn with_projection(mut self, projection: Option>) -> Self {
+self.projection = projection;
+self
+}
+
+/// Get the column projection for the scan.
+///
+/// Returns a cloned copy of the projection column indices, or `None` if
+/// no projection was specified (meaning all columns should be included).
+pub fn projection(&self) -> Option> {

Review Comment:
   ```suggestion
   pub fn projection(&self) -> Option<&Vec> {
   ```



##
datafusion/catalog/src/table.rs:
##
@@ -299,6 +334,119 @@ pub trait TableProvider: Debug + Sync + Send {
 }
 }
 
+/// Arguments for scanning a table with [`TableProvider::scan_with_args`].
+///
+/// `ScanArgs` provides a structured way to pass scan parameters to table 
providers,
+/// replacing the multiple individual parameters used by 
[`TableProvider::scan`].
+/// This struct uses the builder pattern for convenient construction.
+///
+/// # Examples
+///
+/// ```
+/// # use datafusion_catalog::ScanArgs;
+/// # use datafusion_expr::Expr;
+/// let args = ScanArgs::default()
+/// .with_projection(Some(vec![0, 2, 4]))
+/// .with_limit(Some(1000));
+/// ```
+#[derive(Debug, Clone, Default)]
+pub struct ScanArgs {
+filters: Option>,
+projection: Option>,
+limit: Option,
+}
+
+impl ScanArgs {
+/// Set the column projection for the scan.
+///
+/// The projection is a list of column indices from 
[`TableProvider::schema`]
+/// that should be included in the scan results. If `None`, all columns 
are included.
+///
+/// # Arguments
+/// * `projection` - Optional list of column indices to project
+pub fn with_projection(mut self, projection: Option>) -> Self {
+self.projection = projection;
+self
+}
+
+/// Get the column projection for the scan.
+///
+/// Returns a cloned copy of the projection column indices, or `None` if
+/// no projection was specified (meaning all columns should be included).
+pub fn projection(&self) -> Option> {

Review Comment:
   I think this should return a reference (`Option<&Vec>`) to avoid 
forcing clones even if the caller doesn't need a copy



##
datafusion/core/src/physical_planner.rs:
##
@@ -459,9 +460,12 @@ impl DefaultPhysicalPlanner {
 // doesn't know (nor should care) how the relation was
 // referred to in the query
 let filters = unnormalize_cols(filters.iter().cloned());
-source
-.scan(session_state, projection.as_ref(), &filters, *fetch)
-.await?
+let opts = ScanArgs::default()
+.with_projection(projection.clone())

Review Comment:
   is there a reason we need to cone the projection here? I think it would be 
better and avoid the clone and pass in the arguments via reference (so 
something like `ScanArgs<'a>`)



##
datafusion/catalog/src/table.rs:
##
@@ -299,6 +334,119 @@ pub trait TableProvider: Debug + Sync + Send {
 }
 }
 
+/// Arguments for scanning a table with [`TableProvider::scan_with_args`].
+///
+/// `ScanArgs` provides a structured way to pass scan parameters to table 
providers,
+/// replacing the multiple individual parameters used by 
[`TableProvider::scan`].
+/// This struct uses the builder pattern for convenient construction.
+///
+/// # Examples
+///
+/// ```
+/// # use datafusion_catalog::ScanArgs;
+/// # use datafusion_expr::Expr;
+/// let args = ScanArgs::default()
+/// .with_projection(Some(vec![0, 2, 4]))
+/// .with_limit(Some(1000));
+/// ```
+#[derive(Debug, Clone, Default)]
+pub struct ScanArgs {
+filters: O

Re: [I] Nested Loop Join: Performance Regression in DataFusion 50 for Suboptimal Join Orderings [datafusion]

2025-09-12 Thread via GitHub


alamb commented on issue #17488:
URL: https://github.com/apache/datafusion/issues/17488#issuecomment-3285079247

   Per the comments above, let's close this ticket.
   
   @tobixdev  / @2010YOUY01  do you think it is worth tracking a  potential 
performance improvement as a separate ticket, or is there mot much to do 
without causing other tradeoffs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Introduce wildcard const for FixedSizeBinary type signature [datafusion]

2025-09-12 Thread via GitHub


findepi commented on PR #17531:
URL: https://github.com/apache/datafusion/pull/17531#issuecomment-3285251362

   I realized we may get away without introducing a generic type concept.
   The existing `TypeSignature::Coercible`'s `TypeSignatureClass` seems to be 
more or less it.
   did you try to use it?
   
   cc @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]

2025-09-12 Thread via GitHub


findepi merged PR #17521:
URL: https://github.com/apache/datafusion/pull/17521


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Logical optimizer pushdown_filters rule fails with relatively simple query [datafusion]

2025-09-12 Thread via GitHub


findepi closed issue #17512: Logical optimizer pushdown_filters rule fails with 
relatively simple query
URL: https://github.com/apache/datafusion/issues/17512


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add assertion that ScalarUDFImpl implementation is consistent with declared return type [datafusion]

2025-09-12 Thread via GitHub


findepi commented on code in PR #17515:
URL: https://github.com/apache/datafusion/pull/17515#discussion_r2344236784


##
datafusion/expr/src/udf.rs:
##
@@ -233,7 +233,25 @@ impl ScalarUDF {
 ///
 /// See [`ScalarUDFImpl::invoke_with_args`] for details.
 pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> 
Result {
-self.inner.invoke_with_args(args)
+#[cfg(debug_assertions)]

Review Comment:
   yeah, match on invoke_with_args result is useless, the `?` is the way to go.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] add a ci job for typo checking [datafusion]

2025-09-12 Thread via GitHub


findepi commented on PR #17339:
URL: https://github.com/apache/datafusion/pull/17339#issuecomment-3285281075

   This new check just got a stupid typo i made in a hurry. Thank you for 
catching me!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Using `encode_arrow_schema` from arrow-rs. [datafusion]

2025-09-12 Thread via GitHub


samueleresca opened a new pull request, #17543:
URL: https://github.com/apache/datafusion/pull/17543

   ## Which issue does this PR close?
   
   - Closes #17542 17542
   
   ## What changes are included in this PR?
   
   - Removing the custom `encode_arrow_schema` implementation.
   - Removing unused dependencies.
   
   ## Are these changes tested?
   
   Unit test coverage
   
   ## Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] POC: datafusion-cli instrumented object store [datafusion]

2025-09-12 Thread via GitHub


alamb commented on PR #17266:
URL: https://github.com/apache/datafusion/pull/17266#issuecomment-3285476190

   I have not forgotten about this PR -- I am just busy with other projects 
now. I will come back to this soon (TM)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic]: Google Summer of Code 2025 Correlated Subquery Support [datafusion]

2025-09-12 Thread via GitHub


alamb commented on issue #16059:
URL: https://github.com/apache/datafusion/issues/16059#issuecomment-3285523017

   the PR to add correlated subquery support is here:
   -  https://github.com/apache/datafusion/pull/17110
   
   I think with that let's claim this project is done. 
   
   Thank you for all your effort @irenjj 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic]: Google Summer of Code 2025 Correlated Subquery Support [datafusion]

2025-09-12 Thread via GitHub


alamb closed issue #16059: [Epic]: Google Summer of Code 2025 Correlated 
Subquery Support
URL: https://github.com/apache/datafusion/issues/16059


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: [1941-Part2]: Introduce map_to_list scalar function [datafusion-comet]

2025-09-12 Thread via GitHub


comphead commented on PR #2312:
URL: 
https://github.com/apache/datafusion-comet/pull/2312#issuecomment-3285609973

   Thanks @rishvin appreciate if you can take #2388  fill up the context and we 
can go through PRs once again 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on PR #17520:
URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3285732843

   > Need to revert doc changes too:
   
   done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Numeric overflow should result in query error [datafusion]

2025-09-12 Thread via GitHub


findepi opened a new issue, #17539:
URL: https://github.com/apache/datafusion/issues/17539

   ### Describe the bug
   
   In programming languages it's rather typical that numeric overflows are not 
a runtime error.
   
   In SQL, it's rather typical that numeric overflows are identified during 
query execution and are treated as an error. 
   
   DataFusion is a SQL engine and therefore should report overflows as query 
errors. Continuing execution after overflows happened is a correctness issue 
(query results a wrong result).
   
   ### To Reproduce
   
   ```shell
   cargo run --bin datafusion-cli -- --command "SELECT 100 * 
100;"
   ```
   
   ```
   datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT 
100 * 100;"
   Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.17s
Running `target/debug/datafusion-cli --command 'SELECT 100 * 
100;'`
   DataFusion CLI v50.0.0
   +-+
   | Int64(100) * Int64(100) |
   +-+
   | 7766279631452241920 |
   +-+
   ```
   
   
   
   ### Expected behavior
   
   ### PostgreSQL
   
   ```
   postgres=# SELECT 100 * 100;
   ERROR:  bigint out of range
   ```
   
   ### Trino
   ```
   trino> SELECT 100 * 100;
   Query 20250912_093345_0_kgiyz failed: bigint multiplication overflow: 
100 * 100
   ```
   
   ### Snowflake
   ```
   SELECT 100 * 100 * 100 * 100;
   -- Number out of representable range: type FIXED[SB16](38,0){not null}, 
value 1e+40
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] TPC-DS query #88 fails with disabled AQE [datafusion-comet]

2025-09-12 Thread via GitHub


and124578963 opened a new issue, #2389:
URL: https://github.com/apache/datafusion-comet/issues/2389

   ### Describe the bug
   
   When running the [TPC-DS query 
88](https://github.com/apache/doris/blob/master/tools/tpcds-tools/queries/sf1000/query88.sql)
 with **"spark.sql.adaptive.enabled": "false"**, it fails with the following 
error:
**java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers 
of partitions: ArrayBuffer(6958, 6975, 6958, 6958).**
   
   The issue is resolved by either setting 
"spark.comet.exec.broadcastHashJoin.enabled": "false" or re-enabling AQE.
   
   My hypothesis is that the query plan is built incorrectly, leading to RDDs 
from different subqueries with varying partition counts being zipped together. 
The query also runs successfully if rewritten to use a single common SELECT 
statement or if all subqueries are forced to have identical conditions, which 
results in RDDs of the same size.
   
   Stack Trace
   ```Problem with SQL: An error occurred while calling o384.showString.
   : java.util.concurrent.ExecutionException: 
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions: ArrayBuffer(6958, 6975, 6958, 6958)
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:212)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:517)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:208)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:204)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.prepareBroadcast(BroadcastNestedLoopJoinExec.scala:444)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.codegenInner(BroadcastNestedLoopJoinExec.scala:454)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:428)
at 
org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196)
at 
org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.consume(BroadcastNestedLoopJoinExec.scala:32)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.codegenInner(BroadcastNestedLoopJoinExec.scala:469)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:428)
at 
org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196)
at 
org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151)
at 
org.apache.spark.sql.comet.CometColumnarToRowExec.consume(CometColumnarToRowExec.scala:54)
at 
org.apache.spark.sql.comet.CometColumnarToRowExec.doProduce(CometColumnarToRowExec.scala:277)
at 
org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92)
at 
org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92)
at 
org.apache.spark.sql.comet.CometColumnarToRowExec.produce(CometColumnarToRowExec.scala:54)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doProduce(BroadcastNestedLoopJoinExec.scala:423)
at 
org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
at 
org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92)
at 
org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.produce(BroadcastNestedLoopJoinExec.scala:32)
at 
org.apache.spark.sql

Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]

2025-09-12 Thread via GitHub


alamb commented on PR #17520:
URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3285898307

   Thank you @adriangb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [iceberg] Tracking PR for deleted rows support [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra opened a new issue, #2390:
URL: https://github.com/apache/datafusion-comet/issues/2390

   Tracking https://github.com/apache/iceberg/pull/14062


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Preserves field metadata when creating logical plan for VALUES expression [datafusion]

2025-09-12 Thread via GitHub


Kontinuation commented on PR #17525:
URL: https://github.com/apache/datafusion/pull/17525#issuecomment-3286053055

   CC @paleolimbot


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Implement CometInMemoryTableScanExec [datafusion-comet]

2025-09-12 Thread via GitHub


andygrove opened a new issue, #2391:
URL: https://github.com/apache/datafusion-comet/issues/2391

   ### What is the problem the feature request solves?
   
   In queries with InMemoryTableScanExec, we have to perform a 
CometColumnarToRowExec to convert to row format, and then the rest of the query 
falls back to Spark.
   
   We should investigate implementing a CometInMemoryTableScanExec that loads 
data in columnar format.
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Only compute bounds/ dynamic filters if consumer asks for it [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on issue #17527:
URL: https://github.com/apache/datafusion/issues/17527#issuecomment-3285593615

   I think the pro of the subscriber option is:
   1. Like you say it can be applied to only some filters and not others.
   2. It avoids making the general filter pushdown API more complex (although 
the complexity gets pushed somewhere else).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Blog: Add table of contents to blog article [datafusion-site]

2025-09-12 Thread via GitHub


alamb commented on PR #107:
URL: https://github.com/apache/datafusion-site/pull/107#issuecomment-3284773235

   Once we merge this PR, should we add TOC's to the older posts as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Removing ad-hoc implementation of `encode_arrow_schema` [datafusion]

2025-09-12 Thread via GitHub


samueleresca opened a new issue, #17542:
URL: https://github.com/apache/datafusion/issues/17542

   ### Is your feature request related to a problem or challenge?
   
   The implementation of `encode_arrow_schema` is now included and exposed by 
arrow-rs (see: https://github.com/apache/arrow-rs/pull/6916). The datafusion 
ad-hoc implementation is not needed anymore
   
   ### Describe the solution you'd like
   
   Remove the ad-hoc datafusion implementation of `encode_arrow_schema`.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]

2025-09-12 Thread via GitHub


alamb commented on code in PR #17521:
URL: https://github.com/apache/datafusion/pull/17521#discussion_r2344321562


##
datafusion/sqllogictest/test_files/push_down_filter.slt:
##
@@ -395,3 +395,28 @@ order by t1.k, t2.v;
 
 1 1 1
 1000 1000 1000
+
+# Regression test for https://github.com/apache/datafusion/issues/17512
+
+query I
+COPY (
+SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 
'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp
+)
+TO 'test_files/scratch/push_down_filter/17512.parquet'
+STORED AS PARQUET;
+
+1
+
+statement ok
+CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 
'test_files/scratch/push_down_filter/17512.parquet';
+
+query I
+SELECT 1
+FROM (
+SELECT start_timestamp
+FROM records
+WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz
+) AS t
+WHERE t.start_timestamp::time < '00:00:01'::time;

Review Comment:
   makes sense -- thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: output_ordering converted to Vec> [datafusion]

2025-09-12 Thread via GitHub


destrex271 commented on code in PR #17439:
URL: https://github.com/apache/datafusion/pull/17439#discussion_r2345062474


##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -383,7 +383,7 @@ impl FileScanConfigBuilder {
 
 /// Set the output ordering of the files
 pub fn with_output_ordering(mut self, output_ordering: Vec) 
-> Self {

Review Comment:
   @crepererum 
   
   I guess we should skip modifying the type of expr to Option 
since at this stage having None seems to be a little useless.
   
   ```rust
   #[derive(Debug, Clone)]
   pub struct SortExec {
   /// Input schema
   pub(crate) input: Arc,
   /// Sort expressions
   expr: LexOrdering,
   /// Containing all metrics set created during sort
   metrics_set: ExecutionPlanMetricsSet,
   /// Preserve partitions of input plan. If false, the input partitions
   /// will be sorted and merged into a single output partition.
   preserve_partitioning: bool,
   /// Fetch highest/lowest n results
   fetch: Option,
   /// Normalized common sort prefix between the input and the sort 
expressions (only used with fetch)
   common_sort_prefix: Vec,
   /// Cache holding plan properties like equivalences, output partitioning 
etc.
   cache: PlanProperties,
   /// Filter matching the state of the sort for dynamic filter pushdown.
   /// If `fetch` is `Some`, this will also be set and a TopK operator may 
be used.
   /// If `fetch` is `None`, this will be `None`.
   filter: Option>>,
   }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: feature specific tests [datafusion-comet]

2025-09-12 Thread via GitHub


codecov-commenter commented on PR #2372:
URL: 
https://github.com/apache/datafusion-comet/pull/2372#issuecomment-3286531985

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2372?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   :white_check_mark: All modified and coverable lines are covered by tests.
   :white_check_mark: Project coverage is 33.05%. Comparing base 
([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache))
 to head 
([`149a265`](https://app.codecov.io/gh/apache/datafusion-comet/commit/149a2654456f1b4f288444f60c9a5511cdd3bd39?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)).
   :warning: Report is 497 commits behind head on main.
   
   Additional details and impacted files
   
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#2372   +/-   ##
   =
   - Coverage 56.12%   33.05%   -23.07% 
   + Complexity  976  736  -240 
   =
 Files   119  147   +28 
 Lines 1174313409 +1666 
 Branches   2251 2332   +81 
   =
   - Hits   6591 4433 -2158 
   - Misses 4012 8184 +4172 
   + Partials   1140  792  -348 
   ```
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/2372?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Regression: projection pushdown doesn't work as expected in DF50 [datafusion]

2025-09-12 Thread via GitHub


alamb closed issue #17513: Regression: projection pushdown doesn't work as 
expected in DF50
URL: https://github.com/apache/datafusion/issues/17513


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]

2025-09-12 Thread via GitHub


alamb commented on PR #17520:
URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3286659694

   This one looks good thanks everyone!
   
   I also made a backport PR to `branch-50` branch so we can include it in the 
RC
   - https://github.com/apache/datafusion/pull/17544


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] [branch-50] fix: Implement AggregateUDFImpl::reverse_expr for StringAgg (#17165) (#17473) [datafusion]

2025-09-12 Thread via GitHub


alamb opened a new pull request, #17544:
URL: https://github.com/apache/datafusion/pull/17544

   ## Which issue does this PR close?
   
   - related to https://github.com/apache/datafusion/issues/17513
   - related to #16799 
   
   
   ## Rationale for this change
   
   We fixed a bug found in testing DataFusion 50, so let's backport the change
   
   ## What changes are included in this PR?
   
   - backport  https://github.com/apache/datafusion/pull/17520 to [branch-50]
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] API Suggestion match casing for isnan and is_null [datafusion-python]

2025-09-12 Thread via GitHub


ntjohnson1 opened a new issue, #1235:
URL: https://github.com/apache/datafusion-python/issues/1235

   Maybe this is more of a question than a suggestion. Is there a reason they 
both aren't `is_nan` and `is_null`? It looks like this is a carry over from the 
underlying rust and it isn't clear to me why that interface is different either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Trying cargo machete to prune unused deps. [datafusion]

2025-09-12 Thread via GitHub


samueleresca opened a new pull request, #17545:
URL: https://github.com/apache/datafusion/pull/17545

   Very much a test with `cargo machete` to understand all the false positives 
detected by the tooling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Using `encode_arrow_schema` from arrow-rs. [datafusion]

2025-09-12 Thread via GitHub


alamb merged PR #17543:
URL: https://github.com/apache/datafusion/pull/17543


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on PR #17520:
URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3286713964

   @simonvandel I'm sorry we had to revert your contribution. I hope you're 
able to contribute again and now that we have the regression test and clarity 
about the behavior it should be okay to re-add your feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Removing ad-hoc implementation of `encode_arrow_schema` [datafusion]

2025-09-12 Thread via GitHub


alamb closed issue #17542: Removing ad-hoc implementation of 
`encode_arrow_schema`
URL: https://github.com/apache/datafusion/issues/17542


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `TableProvider::scan_with_args` [datafusion]

2025-09-12 Thread via GitHub


adriangb commented on PR #17336:
URL: https://github.com/apache/datafusion/pull/17336#issuecomment-3286697680

   @alamb I believe I've addressed your feedback, thank you for the review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add assertion that ScalarUDFImpl implementation is consistent with declared return type [datafusion]

2025-09-12 Thread via GitHub


alamb commented on code in PR #17515:
URL: https://github.com/apache/datafusion/pull/17515#discussion_r2343788618


##
datafusion/spark/src/function/url/parse_url.rs:
##
@@ -222,7 +223,7 @@ fn spark_parse_url(args: &[ArrayRef]) -> Result {
 )
 }
 (DataType::Utf8View, DataType::Utf8View, DataType::Utf8View) => {
-process_parse_url::<_, _, _, StringArray>(
+process_parse_url::<_, _, _, StringViewArray>(

Review Comment:
   nice



##
datafusion/expr/src/udf.rs:
##
@@ -233,7 +233,25 @@ impl ScalarUDF {
 ///
 /// See [`ScalarUDFImpl::invoke_with_args`] for details.
 pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> 
Result {
-self.inner.invoke_with_args(args)
+#[cfg(debug_assertions)]

Review Comment:
   I played around with this code a little bit and I found it could be made 
more concise with less indenting. Not needed, just for your consideration
   
   ```rust
   /// Invoke the function on `args`, returning the appropriate result.
   ///
   /// See [`ScalarUDFImpl::invoke_with_args`] for details.
   pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> 
Result {
   #[cfg(debug_assertions)]
   let return_field = Arc::clone(&args.return_field);
   let value = self.inner.invoke_with_args(args)?;
   
   // Maybe this could be enabled always?
   // This doesn't use debug_assert!, but it's meant to run anywhere 
except on production, so it's same in spirit, so conditioning on 
debug_assertions.
   #[cfg(debug_assertions)]
   if &value.data_type() != return_field.data_type() {
   return datafusion_common::exec_err!("Function '{}' returned 
value of type '{:?}' while the following type was promised at planning time and 
expected: '{:?}'",
   self.name(),
   value.data_type(),
   return_field.data_type()
   );
   }
   // TODO verify return data is non-null when it was promised to be?
   Ok(value)
   }
   ```



##
datafusion/expr/src/udf.rs:
##
@@ -233,7 +233,25 @@ impl ScalarUDF {
 ///
 /// See [`ScalarUDFImpl::invoke_with_args`] for details.
 pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> 
Result {
-self.inner.invoke_with_args(args)
+#[cfg(debug_assertions)]
+let return_field = Arc::clone(&args.return_field);
+match self.inner.invoke_with_args(args) {
+Ok(value) => {
+// Maybe this could be enabled always?
+// This doesn't use debug_assert!, but it's meant to run 
anywhere except on production, so it's same in spirit, so conditioning on 
debug_assertions.
+#[cfg(debug_assertions)]
+if &value.data_type() != return_field.data_type() {
+return datafusion_common::exec_err!("Function '{}' 
returned value of type '{:?}' while the following type was promised at planning 
time and expected: '{:?}'",

Review Comment:
   It probably doesn't matter, but I would suggeset making this `internal_err` 
instead of `exec_err` as it signifies a bug rather than some other potentially 
expected runtime error
   
   ```suggestion
   return datafusion_common::internal_err!("Function '{}' 
returned value of type '{:?}' while the following type was promised at planning 
time and expected: '{:?}'",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] use or instead of and for hash join filter pushdown [datafusion]

2025-09-12 Thread via GitHub


adriangb closed pull request #17461: use or instead of and for hash join filter 
pushdown
URL: https://github.com/apache/datafusion/pull/17461


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `TableProvider::scan_with_args` [datafusion]

2025-09-12 Thread via GitHub


alamb commented on code in PR #17336:
URL: https://github.com/apache/datafusion/pull/17336#discussion_r2345346409


##
datafusion/catalog/src/table.rs:
##
@@ -171,6 +171,38 @@ pub trait TableProvider: Debug + Sync + Send {
 limit: Option,
 ) -> Result>;
 
+/// Create an [`ExecutionPlan`] for scanning the table using structured 
arguments.
+///
+/// This method uses [`ScanArgs`] to pass scan parameters in a structured 
way
+/// and returns a [`ScanResult`] containing the execution plan. This 
approach
+/// allows for extensible parameter passing and result handling.

Review Comment:
   x
   ```suggestion
   /// and returns a [`ScanResult`] containing the execution plan. 
   ```



##
datafusion/catalog/src/table.rs:
##
@@ -171,6 +171,38 @@ pub trait TableProvider: Debug + Sync + Send {
 limit: Option,
 ) -> Result>;
 
+/// Create an [`ExecutionPlan`] for scanning the table using structured 
arguments.
+///
+/// This method uses [`ScanArgs`] to pass scan parameters in a structured 
way
+/// and returns a [`ScanResult`] containing the execution plan. This 
approach
+/// allows for extensible parameter passing and result handling.
+///
+/// Table providers can override this method to take advantage of 
additional
+/// parameters like `preferred_ordering` that may not be available through

Review Comment:
   ```suggestion
   /// parameters like the upcoming `preferred_ordering` that may not be 
available through
   ```



##
datafusion/core/src/datasource/listing/table.rs:
##
@@ -1169,6 +1169,24 @@ impl TableProvider for ListingTable {
 filters: &[Expr],
 limit: Option,
 ) -> Result> {
+let options = ScanArgs::default()
+.with_projection(projection.map(|p| p.as_slice()))
+.with_filters(Some(filters))
+.with_limit(limit);
+Ok(Arc::clone(
+self.scan_with_args(state, options).await?.plan(),
+))

Review Comment:
   ```suggestion
   Ok(self.scan_with_args(state, options).await?.into_inner())
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]

2025-09-12 Thread via GitHub


findepi commented on code in PR #17521:
URL: https://github.com/apache/datafusion/pull/17521#discussion_r2344231059


##
datafusion/sqllogictest/test_files/push_down_filter.slt:
##
@@ -395,3 +395,28 @@ order by t1.k, t2.v;
 
 1 1 1
 1000 1000 1000
+
+# Regression test for https://github.com/apache/datafusion/issues/17512
+
+query I
+COPY (
+SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 
'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp
+)
+TO 'test_files/scratch/push_down_filter/17512.parquet'
+STORED AS PARQUET;
+
+1
+
+statement ok
+CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 
'test_files/scratch/push_down_filter/17512.parquet';
+
+query I
+SELECT 1
+FROM (
+SELECT start_timestamp
+FROM records
+WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz
+) AS t
+WHERE t.start_timestamp::time < '00:00:01'::time;

Review Comment:
   > > The find_most_restrictive_predicate should operate only on predicates 
comparing column c directly, ignoring those which compare f(c).
   > 
   > That is my understanding of what this PR does. I am not sure if you are 
just confirming this change or if you are proposing / suggesting something more
   
   At the time of writing it was a proposal.
   Now that the proposed change has been applied, it can be read as a 
confirmation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]

2025-09-12 Thread via GitHub


rkrishn7 opened a new pull request, #17546:
URL: https://github.com/apache/datafusion/pull/17546

   ## Which issue does this PR close?
   
   
   
   Closes https://github.com/apache/datafusion/issues/17511
   
   ## Rationale for this change
   
   When building equal conditions in a data source node, we want to ignore any 
stale references to columns that may have been swapped out (e.g. from 
`try_swapping_with_projection`).
   
   The current code reassigns predicate columns from the filter to refer to the 
corresponding ones in the updated schema. However, it only ignores 
non-projected columns. `reassign_predicate_columns` builds an invalid column 
expression (with index `usize::MAX`) if the column is not projected in the 
current schema. We don't want to refer to this in the equal conditions we build.
   
   
   ## What changes are included in this PR?
   
   Ignores any binary expressions that reference non-existent columns in the 
current schema (e.g. due to unnecessary projections being removed).
   
   ## Are these changes tested?
   
   ## Are there any user-facing changes?
   
   N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]

2025-09-12 Thread via GitHub


rkrishn7 commented on PR #17546:
URL: https://github.com/apache/datafusion/pull/17546#issuecomment-3286735863

   The fact that `reassign_predicate_columns` can return invalid column 
expressions seems like a big footgun. I didn't change within this PR due to 
usage elsewhere but we might want to think about refactoring it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]

2025-09-12 Thread via GitHub


rkrishn7 commented on PR #17546:
URL: https://github.com/apache/datafusion/pull/17546#issuecomment-3286768072

   Re-ran TPCH benchmark with the same configuration as the referenced issue 
and all the tests pass now.
   
   Will add a regression test here in a bit!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: Update documentation on Epics and Sponsoring Maintainers [datafusion]

2025-09-12 Thread via GitHub


comphead commented on PR #17505:
URL: https://github.com/apache/datafusion/pull/17505#issuecomment-3285580676

   > > Is it worth changing the term "sponsoring" to something like "driving" 
or "championing", to avoid any confusion with the monetary meaning of 
sponsoring?
   > 
   > That is a really good point. Here are some brainstorm options
   > 
   > * "Driving Committer"
   > * "Championing Committer"
   > * "Shepherding Committer" (my personal favorite)
   > * "Supervising Committer"
   > * "Herding Committer"
   > * "Stewarding Committer"
   > * "Managing Committer"
   > 
   > 🤔
   
   `supervising` sounds great IMO. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]

2025-09-12 Thread via GitHub


hsiang-c opened a new pull request, #2393:
URL: https://github.com/apache/datafusion-comet/pull/2393

   ## Which issue does this PR close?
   
   
   
   Partially closes #. https://github.com/apache/datafusion-comet/issues/2255
   
   ## Rationale for this change
   
   
   
- The scope of https://github.com/apache/datafusion-comet/issues/2255 is 
huge, so fix the hygiene issues bit by bit.
   
   ## What changes are included in this PR?
   
   
- Depend on 
[scala-collection-compat](https://github.com/scala/scala-collection-compat) b/c 
Comet still have Scala 2.12 builds.
- `scala.collection.JavaConverters` is deprecated, replace it with 
`scala.jdk.CollectionConverters` (scala) and 
`scala.jdk.javaapi.CollectionConverters` (Java)
   
   ## How are these changes tested?
   
   
   
- Tested locally w/ build profile `-Pscala-2.12` and `-Pscala-2.13`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Enable feature specific unit tests [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on issue #2360:
URL: 
https://github.com/apache/datafusion-comet/issues/2360#issuecomment-3286886609

   @andygrove @wForget this is ready for review. The test referred to above 
passes with `hdfs-opendal`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]

2025-09-12 Thread via GitHub


codecov-commenter commented on PR #2393:
URL: 
https://github.com/apache/datafusion-comet/pull/2393#issuecomment-3286897970

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   :x: Patch coverage is `0%` with `13 lines` in your changes missing coverage. 
Please review.
   :white_check_mark: Project coverage is 33.06%. Comparing base 
([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache))
 to head 
([`8beb445`](https://app.codecov.io/gh/apache/datafusion-comet/commit/8beb4459d6c78945c4871962b8c5241ccc722957?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)).
   :warning: Report is 497 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[...va/org/apache/comet/parquet/NativeBatchReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FNativeBatchReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L05hdGl2ZUJhdGNoUmVhZGVyLmphdmE=)
 | 0.00% | [7 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[.../apache/comet/parquet/CometParquetFileFormat.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetFileFormat.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldEZpbGVGb3JtYXQuc2NhbGE=)
 | 0.00% | [2 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...a/org/apache/comet/parquet/NativeColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FNativeColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L05hdGl2ZUNvbHVtblJlYWRlci5qYXZh)
 | 0.00% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...et/execution/shuffle/CometUnsafeShuffleWriter.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2Fexecution%2Fshuffle%2FCometUnsafeShuffleWriter.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jb21ldC9leGVjdXRpb24vc2h1ZmZsZS9Db21ldFVuc2FmZVNodWZmbGVXcml0ZXIuamF2YQ==)
 | 0.00% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...t/parquet/CometParquetPartitionReaderFactory.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetPartitionReaderFactory.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldFBhcnRpdGlvblJlYWRlckZhY3Rvcnkuc2NhbGE=)
 | 0.00% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW

Re: [I] Enable the `ListFilesCache` to be available for partitioned tables [datafusion]

2025-09-12 Thread via GitHub


BlakeOrth commented on issue #17211:
URL: https://github.com/apache/datafusion/issues/17211#issuecomment-3286924997

   @alamb I've found the in-review documentation on process updates in: 
- https://github.com/apache/datafusion/pull/17505
   
   In an effort to follow the project's processes, considering our existing 
discussions about this issue, would you be willing to be the 
" Committer" for this effort?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA - Sep 15, 2025 [datafusion]

2025-09-12 Thread via GitHub


GitHub user alamb added a comment to the discussion: DISCUSSION: DataFusion 
Meetup in New York, NY, USA - Sep 15, 2025

Looking forward to next week!

GitHub link: 
https://github.com/apache/datafusion/discussions/16265#discussioncomment-14388367


This is an automatically sent email for github@datafusion.apache.org.
To unsubscribe, please send an email to: 
github-unsubscr...@datafusion.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Release sqlparser-rs version `0.59.0` around 2025-09-15 [datafusion-sqlparser-rs]

2025-09-12 Thread via GitHub


alamb commented on issue #1956:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/1956#issuecomment-3286957810

   @iffyio  -- I am thinking of making a release next week -- is there anything 
in particular you think we should wait for?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Slow aggregrate query with `array_agg`, Polars is 4 times faster for equal query [datafusion]

2025-09-12 Thread via GitHub


Omega359 commented on issue #17446:
URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3286959549

   There was a previous attempt at a GroupsAccumulator for ArrayAgg 
(https://github.com/apache/datafusion/pull/11096) but the performance of that 
PR was mixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]

2025-09-12 Thread via GitHub


hsiang-c commented on PR #2393:
URL: 
https://github.com/apache/datafusion-comet/pull/2393#issuecomment-3286990162

   SparkSQL tests failed b/c of `java.lang.NoClassDefFoundError`
   
   ```
   java.lang.NoClassDefFoundError: scala/jdk/javaapi/CollectionConverters
   [info]   at 
org.apache.comet.parquet.NativeBatchReader.init(NativeBatchReader.java:259)
   [info]   at 
org.apache.comet.parquet.CometParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(CometParquetFileFormat.scala:169)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Improve Initcap test and docs [datafusion-comet]

2025-09-12 Thread via GitHub


andygrove merged PR #2387:
URL: https://github.com/apache/datafusion-comet/pull/2387


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Panic happens when adding a decimal256 to a float (SQLancer) [datafusion]

2025-09-12 Thread via GitHub


Jefffrey closed issue #16689: Panic happens when adding a decimal256 to a float 
(SQLancer)
URL: https://github.com/apache/datafusion/issues/16689


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: implement lazy evaluation in Coalesce function [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on code in PR #2270:
URL: https://github.com/apache/datafusion-comet/pull/2270#discussion_r2345625780


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -394,6 +394,20 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("test coalesce lazy eval") {

Review Comment:
   Does this test verify that lazy evaluation did, in fact, occur? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: feature specific tests [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on PR #2372:
URL: 
https://github.com/apache/datafusion-comet/pull/2372#issuecomment-3287165660

   @wForget ptal


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add test for decimal256 and float math [datafusion]

2025-09-12 Thread via GitHub


Jefffrey merged PR #17530:
URL: https://github.com/apache/datafusion/pull/17530


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [ANSI] Include original SQL in error messages [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on issue #2215:
URL: 
https://github.com/apache/datafusion-comet/issues/2215#issuecomment-3287216142

   I think a different approach is required here. The error in this example 
comes from 
[QueryExecutionErrors](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala).
 Afaik, when an error of this type is thrown, Spark will handle the proper 
reporting of the error and the query.
   
   We should ideally have a mapping for Comet/Datafusion errors that matches 
the error to the relevant `QueryExecutionError` and in `CometExecIterator` we 
should catch the `CometNativeException`, convert it into a 
`QueryExecutionError`, and rethrow it.
   In `CometExecIterator` we already catch some `CometNativeException`s which 
are then converted into a `SparkException`. This currently is specific to a set 
of errors which are matched by checking the contents of the error message 
(which is not quite ideal). It would be good to change these too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [native_iceberg_compat] Add support for custom S3 endpoints [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on issue #2261:
URL: 
https://github.com/apache/datafusion-comet/issues/2261#issuecomment-3287221701

   @andygrove I don't think this is a problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Support more date part expressions [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on code in PR #2316:
URL: https://github.com/apache/datafusion-comet/pull/2316#discussion_r2345685485


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1681,14 +1681,17 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("Year") {
+  test("DatePart functions: 
Year/Month/DayOfMonth/DayOfWeek/DayOfYear/WeekOfYear/Quarter") {
 Seq(false, true).foreach { dictionary =>
   withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {
 val table = "test"
 withTable(table) {
   sql(s"create table $table(col timestamp) using parquet")
   sql(s"insert into $table values (now()), (null)")
-  checkSparkAnswerAndOperator(s"SELECT year(col) FROM $table")
+  // TODO: weekday(col) 
https://github.com/apache/datafusion-comet/issues/2330
+  checkSparkAnswerAndOperator(

Review Comment:
   Can we change the session timezone when we read the values back? Also, maybe 
add some test values corresponding to days before the Unix Epoch.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add hdfs feature test job [datafusion-comet]

2025-09-12 Thread via GitHub


parthchandra commented on PR #2350:
URL: 
https://github.com/apache/datafusion-comet/pull/2350#issuecomment-3287286359

   oops. Yes, I did mean #2372. Let's get that merged and update the test in 
this PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Always run CI checks [datafusion]

2025-09-12 Thread via GitHub


blaginin opened a new pull request, #17538:
URL: https://github.com/apache/datafusion/pull/17538

   Follow up on https://github.com/apache/datafusion/pull/17183
   
   There are two ways to fix the "steps not run" problem:
   - 1: always run them. CI will become slower but we'll have merge queue 
anyway. On the positive side, we'll have consistent CI runs (for example, now 
when you just change `*.md` we won't run `check configs.md and ***_functions.md 
is up-to-date` although we should
   - 2: not run them (in a bit hacky way because of 
https://github.com/orgs/community/discussions/45899)
   
   2 consumes less resources, so going with it for now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: feature specific tests [datafusion-comet]

2025-09-12 Thread via GitHub


wForget commented on code in PR #2372:
URL: https://github.com/apache/datafusion-comet/pull/2372#discussion_r2345712676


##
spark/src/test/scala/org/apache/comet/parquet/ParquetReadFromFakeHadoopFsSuite.scala:
##
@@ -74,7 +74,18 @@ class ParquetReadFromFakeHadoopFsSuite extends CometTestBase 
with AdaptiveSparkP
 .startsWith(FakeHDFSFileSystem.PREFIX))
   }
 
-  ignore("test native_datafusion scan on fake fs") {
+  // This test fails for 'hdfs' but succeeds for 'open-dal'. 'hdfs' requires 
this fix
+  // https://github.com/datafusion-contrib/fs-hdfs/pull/29
+  test("test native_datafusion scan on fake fs") {
+// Skip test if HDFS feature is not enabled in native library
+val hdfsEnabled =
+  try {

Review Comment:
   Should we extract this try...catch block into a  common method in NativeBase?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add hdfs feature test job [datafusion-comet]

2025-09-12 Thread via GitHub


wForget commented on PR #2350:
URL: 
https://github.com/apache/datafusion-comet/pull/2350#issuecomment-3287309576

   > Should we also be adding the test suites to pr_build_macos.yml ?
   
   Thanks, I will add it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Support more date part expressions [datafusion-comet]

2025-09-12 Thread via GitHub


wForget commented on code in PR #2316:
URL: https://github.com/apache/datafusion-comet/pull/2316#discussion_r2345737849


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1681,14 +1681,17 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("Year") {
+  test("DatePart functions: 
Year/Month/DayOfMonth/DayOfWeek/DayOfYear/WeekOfYear/Quarter") {
 Seq(false, true).foreach { dictionary =>
   withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {
 val table = "test"
 withTable(table) {
   sql(s"create table $table(col timestamp) using parquet")
   sql(s"insert into $table values (now()), (null)")
-  checkSparkAnswerAndOperator(s"SELECT year(col) FROM $table")
+  // TODO: weekday(col) 
https://github.com/apache/datafusion-comet/issues/2330
+  checkSparkAnswerAndOperator(

Review Comment:
   > Can we change the session timezone when we read the values back? 
   
   The input of these methods is `DateType` and output is `IntegerType`, , so 
it doesn’t seem to be related to the timezone.
   
   > Also, maybe add some test values corresponding to days before the Unix 
Epoch.
   
   Thanks, I will try it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



  1   2   >