Re: [I] Cast from DATE to VARCHAR fails [datafusion]
findepi commented on issue #17533: URL: https://github.com/apache/datafusion/issues/17533#issuecomment-3284008393

`arrow_cast` to `Utf8` works; `arrow_cast` to `Utf8View` fails.
Re: [I] Cast from DATE to VARCHAR fails [datafusion]
findepi commented on issue #17533: URL: https://github.com/apache/datafusion/issues/17533#issuecomment-3284075814

We probably need to fix `can_cast_types` too - https://github.com/apache/arrow-rs/pull/8328
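For context, a minimal arrow-rs-level sketch of the gap described above. The array values are made up, and it assumes arrow's public `cast` / `can_cast_types` kernels; at the arrow version current when this was filed, the `Utf8View` target is the unsupported one.

```rust
use arrow::array::Date32Array;
use arrow::compute::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() {
    // Illustrative values: Date32 stores days since 1970-01-01
    let dates = Date32Array::from(vec![Some(0), Some(20_000)]);

    // Date32 -> Utf8 is supported by the cast kernel
    println!("Date32 -> Utf8: {}", can_cast_types(&DataType::Date32, &DataType::Utf8));
    let _utf8 = cast(&dates, &DataType::Utf8).expect("Date32 -> Utf8 should work");

    // Date32 -> Utf8View is the combination reported as unsupported here;
    // apache/arrow-rs#8328 also touches `can_cast_types` for this case.
    println!("Date32 -> Utf8View: {}", can_cast_types(&DataType::Date32, &DataType::Utf8View));
    println!("cast ok? {}", cast(&dates, &DataType::Utf8View).is_ok());
}
```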
Re: [PR] Prepare for Merge Queue [datafusion]
Jefffrey commented on PR #17183: URL: https://github.com/apache/datafusion/pull/17183#issuecomment-3284090122

> Merge Queue only waits for the required steps - us all checks are important hence I made them all required. It'll become impossible to merge a PR if some of the checks are not passed, but I believe that's the flow we use now anyways.

I think there's an edge case where, if a PR doesn't kick off all the checks (e.g. a doc PR), it is blocked, since all checks must now be successful but the PR doesn't trigger all of them to run. For example, this simple PR is blocked from merging at the moment: #17532

@blaginin
[I] Panic when cast from `DATE` to `TIMESTAMP` overflows [datafusion]
findepi opened a new issue, #17534: URL: https://github.com/apache/datafusion/issues/17534

### Describe the bug

Casting a date to a timestamp involves a multiplication and may overflow. DataFusion should report a query error in such cases, rather than panicking.

### To Reproduce

```shell
cargo run --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' As timestamp);"
```

```
datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' As timestamp);"
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/datafusion-cli --command 'SELECT CAST(DATE '\''-12-31'\'' As timestamp);'`
DataFusion CLI v50.0.0
thread 'main' panicked at /Users/findepi/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-cast-56.0.0/src/cast/mod.rs:1881:58:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

### Expected behavior

- query error informing about overflow

### Additional context

_No response_
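A sketch of why this particular cast can overflow. This is not the arrow-cast code itself; the constant and function names are illustrative.

```rust
// Date32 is days since 1970-01-01 stored as i32; a nanosecond timestamp is an i64.
// The conversion multiplies by nanoseconds-per-day, which overflows i64 for dates
// far enough from the epoch. checked_mul turns that into a reportable error
// instead of a panic (debug) or a silent wrap (release).
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

fn date32_to_timestamp_nanos(days: i32) -> Result<i64, String> {
    (days as i64)
        .checked_mul(NANOS_PER_DAY)
        .ok_or_else(|| format!("Cast overflow: {days} days cannot be represented as a nanosecond timestamp"))
}

fn main() {
    assert_eq!(date32_to_timestamp_nanos(1), Ok(86_400_000_000_000));
    // Very large day counts overflow and are reported as an error:
    assert!(date32_to_timestamp_nanos(i32::MAX).is_err());
}
```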
Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]
KR-bluejay commented on code in PR #1314: URL: https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336556200

## ballista/executor/src/execution_loop.rs:

```diff
@@ -88,8 +90,29 @@ pub async fn poll_loop
     match poll_work_result {
         Ok(result) => {
-            let tasks = result.into_inner().tasks;
+            let PollWorkResult {
+                tasks,
+                jobs_to_clean,
+            } = result.into_inner();
             active_job = !tasks.is_empty();
+            let work_dir = PathBuf::from(&executor.work_dir);
+
+            // Clean up any state related to the listed jobs
```

Review Comment: Hi, sorry for the delay — I’ve pushed the changes. I think I should add a test before merge, but I’m not sure how. There is a `test_executor_clean_up`, but it seems to cover different functionality, not `remove_job_dir`. Could you suggest how I should write such a test?
Re: [PR] feat(spark): implement Spark `try_parse_url` function [datafusion]
Jefffrey commented on code in PR #17485: URL: https://github.com/apache/datafusion/pull/17485#discussion_r2340394000

## datafusion/spark/src/function/url/parse_url.rs:

```diff
@@ -47,23 +46,7 @@ impl Default for ParseUrl {
 impl ParseUrl {
     pub fn new() -> Self {
         Self {
-            signature: Signature::one_of(
-                vec![
-                    TypeSignature::Uniform(
-                        1,
-                        vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-                    ),
-                    TypeSignature::Uniform(
-                        2,
-                        vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-                    ),
-                    TypeSignature::Uniform(
-                        3,
-                        vec![DataType::Utf8View, DataType::Utf8, DataType::LargeUtf8],
-                    ),
-                ],
-                Volatility::Immutable,
-            ),
+            signature: Signature::user_defined(Volatility::Immutable),
```

Review Comment: Dictionary type is another way to represent strings, similar to how view type is a different way to represent strings. Can see here in the doc there's a reference to handling dictionary types: https://github.com/apache/datafusion/blob/351675ddc27c42684a079b3a89fe2dee581d89a2/datafusion/expr-common/src/signature.rs#L229-L239
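A rough sketch of the kind of coercion a `Signature::user_defined` function can then perform on its string arguments, including dictionary-encoded strings. This is illustrative only, not the actual `parse_url` implementation.

```rust
use arrow::datatypes::DataType;

// Accept the native string representations and unwrap dictionary-encoded strings
// to their value type; anything else is rejected by this (hypothetical) function.
fn coerce_string_arg(dt: &DataType) -> Option<DataType> {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => Some(dt.clone()),
        // Dictionary<_, Utf8> is just another physical encoding of strings
        DataType::Dictionary(_, value) => coerce_string_arg(value),
        _ => None,
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(coerce_string_arg(&dict), Some(DataType::Utf8));
    assert_eq!(coerce_string_arg(&DataType::Int64), None);
}
```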
Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]
milenkovicm commented on code in PR #1314: URL: https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336689475

## ballista/executor/src/execution_loop.rs:

Review Comment: thanks @KR-bluejay, no delays at all! Will have a look later today hopefully.
[I] Cast from DATE to VARCHAR fails [datafusion]
findepi opened a new issue, #17533: URL: https://github.com/apache/datafusion/issues/17533

### Describe the bug

```
DataFusion CLI v50.0.0
> SELECT CAST(DATE '-12-31' AS varchar);
This feature is not implemented: Unsupported CAST from Date32 to Utf8View
```

### To Reproduce

```shell
cargo run --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' AS varchar);"
```

### Expected behavior

```
+----------------+
| Utf8("-12-31") |
+----------------+
| -12-31         |
+----------------+
1 row(s) fetched.
Elapsed 0.057 seconds.
```

### Additional context

This used to work in DataFusion 47.
Re: [PR] Prepare for Merge Queue [datafusion]
blaginin commented on PR #17183: URL: https://github.com/apache/datafusion/pull/17183#issuecomment-3284132932 Will resolve now
Re: [I] Panic when cast from `DATE` to `TIMESTAMP` overflows [datafusion]
findepi commented on issue #17534: URL: https://github.com/apache/datafusion/issues/17534#issuecomment-3284162376

"attempt to multiply with overflow" is a standard Rust check that is not present in release builds:

```
datafusion main$ cargo run --release --bin datafusion-cli -- --command "SELECT CAST(DATE '-12-31' As timestamp);"
    Finished `release` profile [optimized] target(s) in 0.21s
     Running `target/release/datafusion-cli --command 'SELECT CAST(DATE '\''-12-31'\'' As timestamp);'`
DataFusion CLI v50.0.0
+-------------------------------+
| Utf8("-12-31")                |
+-------------------------------+
| 1816-03-29T05:56:08.066277376 |
+-------------------------------+
```

However, I believe that DataFusion debug builds should strive to produce the same results as release builds; in particular, they should not panic. If silent overflow is intentional, the code should tell the compiler that.

FWIW, numeric multiplication silently overflows in debug builds and does not panic:

```
datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT 100 * 100 * 100;"
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.72s
     Running `target/debug/datafusion-cli --command 'SELECT 100 * 100 * 100;'`
DataFusion CLI v50.0.0
+--------------------------------------+
| Int64(100) * Int64(100) * Int64(100) |
+--------------------------------------+
| 5076944270305263616                  |
+--------------------------------------+
```
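For reference, a small sketch of the distinction being drawn: plain `*` behaves differently between profiles, while the explicit std methods behave the same in debug and release builds. The values are made up.

```rust
fn main() {
    let days: i64 = i64::MAX / 2;
    let factor: i64 = 4;

    // `days * factor` would panic under debug overflow checks and wrap in release,
    // so the two profiles disagree.

    // Explicitly wrapping: silently wraps in both profiles.
    let wrapped = days.wrapping_mul(factor);

    // Explicitly checked: reports the overflow in both profiles.
    let checked = days.checked_mul(factor);

    println!("wrapped = {wrapped}, checked = {checked:?}"); // checked = None
}
```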
[PR] Support `CAST` from temporal to `Utf8View` [datafusion]
findepi opened a new pull request, #17535: URL: https://github.com/apache/datafusion/pull/17535

- fixes https://github.com/apache/datafusion/issues/17533
- requires https://github.com/apache/arrow-rs/pull/8328
Re: [I] Push down entire hash table from HashJoinExec into scans [datafusion]
LiaCastaneda commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3284143582

> I think building an IN LIST in the dynamic filter (in addition to min/max) allows us to leverage the existing machinery that builds literal guarantees in PruningPredicate. The parquet data source already takes advantage of this in the case that bloom filters are present and can help us prune entire row groups, which definitely seems valuable to me.

I think this would also help users with custom `ExecutionPlan` scan nodes that aren’t Parquet or don’t support bloom predicates.
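A rough sketch of the idea using the logical `Expr` API; the column name and values are made up, and a real implementation would build a physical expression inside the dynamic filter rather than a logical one.

```rust
use datafusion::prelude::{col, lit, Expr};

// Once the build side of the join is collected, its distinct join-key values could be
// turned into an IN LIST predicate and pushed to the probe-side scan, where
// PruningPredicate's literal guarantees (and Parquet bloom filters) can use it.
fn in_list_filter(join_key: &str, build_side_keys: &[&str]) -> Expr {
    let list: Vec<Expr> = build_side_keys.iter().map(|v| lit(*v)).collect();
    col(join_key).in_list(list, false)
}

fn main() {
    let filter = in_list_filter("c32", &["db-01", "keb-03"]);
    println!("{filter}");
}
```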
Re: [PR] feat: pass the ordering information to native Scan [datafusion-comet]
codecov-commenter commented on PR #2375: URL: https://github.com/apache/datafusion-comet/pull/2375#issuecomment-3276620996 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report :x: Patch coverage is `41.17647%` with `30 lines` in your changes missing coverage. Please review. :white_check_mark: Project coverage is 40.16%. Comparing base ([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)) to head ([`4d0c2ab`](https://app.codecov.io/gh/apache/datafusion-comet/commit/4d0c2ab573ab55219358c9de2a4428b703ba601b?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)). :warning: Report is 490 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==) | 51.21% | [14 Missing and 6 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...la/org/apache/spark/sql/comet/CometExecUtils.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometExecUtils.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRFeGVjVXRpbHMuc2NhbGE=) | 0.00% | [7 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...apache/spark/sql/comet/CometCollectLimitExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometCollectLimitExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRDb2xsZWN0TGltaXRFeGVjLnNjYWxh) | 0.00% | [2 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | 
[...ark/sql/comet/CometTakeOrderedAndProjectExec.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2FCometTakeOrderedAndProjectExec.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvY29tZXQvQ29tZXRUYWtlT3JkZXJlZEFuZFByb2plY3RFeGVjLnNjYWxh) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |

Additional details and impacted files

```diff
@@              Coverage Diff              @@
##               main    #2375       +/-   ##
=============================================
- Coverage     56.12%   40.16%   -15.97%
- Complexity      976      993       +17
=============================================
  Files           119      147       +28
  Lines         11743    13520     +1777
  Branches       2251     2396      +145
=============================================
- Hits           6591     5430     -1161
- Misses         4012     7136     +3124
+ Partials       1140      954      -186
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/2375?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](ht
[PR] fix: Change `OuterReferenceColumn` to contain the entire outer field to prevent metadata loss [datafusion]
Kontinuation opened a new pull request, #17524: URL: https://github.com/apache/datafusion/pull/17524

## Which issue does this PR close?

- Closes #17422.

## Rationale for this change

As reported by #17422, extension metadata was dropped in some queries involving subqueries. This happens when a column references another column in a correlated subquery, which creates an `OuterReferenceColumn` in DataFusion's logical plan. `OuterReferenceColumn` is defined to contain a data type and a referenced column. Since an outer reference column refers to a field in another table, there is no way to retrieve the field metadata of the referenced column given the schema of the current table. To overcome this problem, we define `OuterReferenceColumn` to contain the entire referenced field as a `FieldRef`. We also updated the implementation of `to_field` to return the referenced field, which contains the original metadata of the referenced field.

## What changes are included in this PR?

Changed `OuterReferenceColumn` to contain the referenced outer field as a `FieldRef`.

## Are these changes tested?

* Added a unit test
* Added the repro in the issue to the `user_defined_scalar_functions` test

## Are there any user-facing changes?

Yes. This changes part of the public API.
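A simplified sketch of the shape change being described; these are stand-in types, not the real DataFusion `Expr` enum or `Column` struct. Carrying the whole `FieldRef` preserves metadata that a bare `DataType` cannot.

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, FieldRef};

// Stand-in for datafusion_common::Column, for illustration only.
#[derive(Debug, Clone)]
struct Column {
    name: String,
}

// Before: only the data type of the outer column was kept, so field metadata was lost.
// After: the entire referenced field travels with the expression.
#[allow(dead_code)]
enum OuterRef {
    Before(DataType, Column),
    After(FieldRef, Column),
}

fn main() {
    let field: FieldRef = Arc::new(
        Field::new("c0", DataType::Utf8, true)
            .with_metadata([("ARROW:extension:name".to_string(), "my_type".to_string())].into()),
    );
    let outer = OuterRef::After(Arc::clone(&field), Column { name: "c0".to_string() });
    if let OuterRef::After(f, _) = &outer {
        // The metadata of the referenced field is still available downstream.
        assert_eq!(
            f.metadata().get("ARROW:extension:name").map(String::as_str),
            Some("my_type")
        );
    }
}
```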
[PR] Introduce `avg_distinct()` and `sum_distinct()` functions to DataFrame API [datafusion]
Jefffrey opened a new pull request, #17536: URL: https://github.com/apache/datafusion/pull/17536

## Which issue does this PR close?

- Closes #2409 and #2407

## Rationale for this change

So we can use these distinct aggregates via DataFrames

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
Re: [PR] fix: synchronize partition bounds reporting in HashJoin [datafusion]
adriangb commented on PR #17452: URL: https://github.com/apache/datafusion/pull/17452#issuecomment-3271833195

> @adriangb I think there is opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single `AND` binary expr once we have the final min/max (i.e. all partition bounds have been reported).
>
> Aside from one less mutex, I think it'll help reduce output in `EXPLAIN` as well. Happy to tackle in a follow-up PR

I think that will regress performance: imagine partition 1 has bounds (0, 1) and partition 2 has bounds (98, 99). With bounds per partition, a value like 50 is filtered out, while the merged bounds of (0, 99) would include it.
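A tiny illustration of that trade-off, using the numbers from the comment above.

```rust
// Per-partition (min, max) pairs filter more rows than a single merged range.
fn passes(bounds: &[(i64, i64)], v: i64) -> bool {
    bounds.iter().any(|(lo, hi)| (*lo..=*hi).contains(&v))
}

fn main() {
    let per_partition = [(0, 1), (98, 99)];
    let merged = [(0, 99)];
    let probe_value = 50;
    assert!(!passes(&per_partition, probe_value)); // filtered out by per-partition bounds
    assert!(passes(&merged, probe_value)); // kept by the merged bounds despite never matching
}
```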
Re: [PR] Introduce `avg_distinct()` and `sum_distinct()` functions to DataFrame API [datafusion]
Jefffrey commented on code in PR #17536: URL: https://github.com/apache/datafusion/pull/17536#discussion_r2343365209

## datafusion/functions-aggregate/src/average.rs:

```diff
@@ -62,6 +62,17 @@ make_udaf_expr_and_func!(
     avg_udaf
 );
+
+pub fn avg_distinct(expr: Expr) -> Expr {
+    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
+        avg_udaf(),
+        vec![expr],
+        true,
+        None,
+        vec![],
+        None,
+    ))
+}
```

Review Comment: Same as how count handles it: https://github.com/apache/datafusion/blob/bfc5067718a3ddcb87531b5a9633605792078546/datafusion/functions-aggregate/src/count.rs#L71-L80

## datafusion/core/tests/dataframe/mod.rs:

```diff
@@ -496,32 +497,35 @@ async fn drop_with_periods() -> Result<()> {
 #[tokio::test]
 async fn aggregate() -> Result<()> {
     // build plan using DataFrame API
-    let df = test_table().await?;
+    // union so some of the distincts have a clearly distinct result
+    let df = test_table().await?.union(test_table().await?)?;
     let group_expr = vec![col("c1")];
     let aggr_expr = vec![
-        min(col("c12")),
-        max(col("c12")),
-        avg(col("c12")),
-        sum(col("c12")),
-        count(col("c12")),
-        count_distinct(col("c12")),
+        min(col("c4")).alias("min(c4)"),
+        max(col("c4")).alias("max(c4)"),
+        avg(col("c4")).alias("avg(c4)"),
+        avg_distinct(col("c4")).alias("avg_distinct(c4)"),
+        sum(col("c4")).alias("sum(c4)"),
+        sum_distinct(col("c4")).alias("sum_distinct(c4)"),
+        count(col("c4")).alias("count(c4)"),
+        count_distinct(col("c4")).alias("count_distinct(c4)"),
```

Review Comment: I switched to `c4` from `c12` as `c12` had some precision variations for avg_distinct leading to inconsistent test results, and figured it was easier to switch columns than slap `round` on the outputs
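For illustration, DataFrame usage along the lines this PR enables. It assumes `avg_distinct` / `sum_distinct` end up exported next to the other aggregates in `functions_aggregate::expr_fn`, and the CSV path is a placeholder.

```rust
use datafusion::error::Result;
use datafusion::functions_aggregate::expr_fn::{avg_distinct, sum_distinct};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Placeholder path; any table with a numeric column `c4` and a grouping column `c1` works.
    let df = ctx
        .read_csv("aggregate_test_100.csv", CsvReadOptions::new())
        .await?;
    df.aggregate(
        vec![col("c1")],
        vec![
            avg_distinct(col("c4")).alias("avg_distinct(c4)"),
            sum_distinct(col("c4")).alias("sum_distinct(c4)"),
        ],
    )?
    .show()
    .await?;
    Ok(())
}
```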
Re: [I] High CPU during dynamic filter bound computation: min_batch/max_batch [datafusion]
LiaCastaneda commented on issue #17486: URL: https://github.com/apache/datafusion/issues/17486#issuecomment-3270213815 cc @adriangb -- sorry for the direct ping. Since you did most of the dynamic filtering work, do you know if this behavior is expected? 🙇♀️
Re: [PR] feat: implement job data cleanup in pull-staged strategy #1219 [datafusion-ballista]
KR-bluejay commented on code in PR #1314: URL: https://github.com/apache/datafusion-ballista/pull/1314#discussion_r2336699580

## ballista/executor/src/execution_loop.rs:

Review Comment: Thank you for reviewing this, @milenkovicm — I really appreciate it.
Re: [PR] Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated [datafusion]
adriangb commented on PR #17444: URL: https://github.com/apache/datafusion/pull/17444#issuecomment-3271240054

> Benchmark clickbench_extended.json
> │ QQuery 1 │ 1216.87 ms │ 1446.94 ms │ 1.19x slower │

I think this is noise; Q1 is https://github.com/apache/datafusion/blob/fc5888b49b9757b20289e479f8b41440f383deb4/benchmarks/queries/clickbench/extended/q1.sql#L4C1-L4C115 which has no join.

> Benchmark tpch_mem_sf1.json
> QQuery 8 │ 37.27 ms │ 29.13 ms │ +1.28x faster

This looks real, lots of joins! https://github.com/apache/datafusion/blob/fc5888b49b9757b20289e479f8b41440f383deb4/benchmarks/queries/q8.sql#L13-L21
Re: [I] Refactor and improve job data cleanup logic [datafusion-ballista]
milenkovicm commented on issue #1316: URL: https://github.com/apache/datafusion-ballista/issues/1316#issuecomment-3279895972

I think you could make an epic in this area. There are a few things I would like to propose regarding files:

1. shuffle files are absolute paths; I would like to make them relative to the flight store directory (improves security)
2. I would like to change how files are organised: instead of having `job_id/shuffle...` I would like to prefix files with `session_id/job_id/shuffle...` or `session_id/cache/` or `session_id/job_id/operator_state` ... I would like to be able to delete any of them, but also delete all files for a given session_id when the client disconnects
3. should we make the arrow flight server the owner of files, and expose actions on them to delete files?
4. #558

wdyt @KR-bluejay ?
Re: [I] Refactor and improve job data cleanup logic [datafusion-ballista]
KR-bluejay commented on issue #1316: URL: https://github.com/apache/datafusion-ballista/issues/1316#issuecomment-3279848468 I’d be happy to work on this as a follow-up, if this direction sounds good.
[PR] Disable `required_status_checks` for now [datafusion]
blaginin opened a new pull request, #17537: URL: https://github.com/apache/datafusion/pull/17537

Follow-up to https://github.com/apache/datafusion/pull/17183#issuecomment-3284090122

As a hotfix, disabled the checks to unblock @Jefffrey
[PR] Update Bug issue template to use Bug issue type [datafusion]
findepi opened a new pull request, #17540: URL: https://github.com/apache/datafusion/pull/17540 Same for Feature issue template.
[I] Consumer receives duplicate predicates when join mode is CollectLeft [datafusion]
LiaCastaneda opened a new issue, #17541: URL: https://github.com/apache/datafusion/issues/17541

### Describe the bug

I see duplicated OR clauses on the DynamicPhysicalExpr I get in the consumer for an execution plan like this:

```
ProjectionExec: expr=[c0@0 as c0, c1@1 as c1, c2@2 as c2]
  CoalescePartitionsExec: fetch=5
    CoalesceBatchesExec: target_batch_size=8192, fetch=5
      HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(c0@0, c32@32)]
        CoalesceBatchesExec: target_batch_size=8192
          FilterExec: c0@0 IS NOT NULL
            DataSourceExec: partitions=1, partition_sizes=[1]
        RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
          CooperativeExec
            DataSourceExec: partitions=1
```

The bounds predicates arrive as 16 identical conjuncts, 1 per (right) output partition it seems:

```
(
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03') OR
  ("c32" >= 'db-01' AND "c32" <= 'keb-03')
)
```

This is probably related to [this](https://github.com/apache/datafusion/pull/17197#pullrequestreview-3135737978) comment. I wrote some logic in the consumer node to dedup the predicates, but it seems worth handling in DataFusion. Following the code, in `CollectLeft` we derive the number of output predicates from the [right](https://github.com/apache/datafusion/blob/50733bcb801e21766dbed6e2403e87cfad7f8007/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L157) side’s partition count. But iiuc `CollectLeft` collects the left into a single partition, so every right-side partition will see the same bounds in theory?

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
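A sketch of the consumer-side workaround mentioned above, deduplicating identical disjuncts by their display form. This is illustrative only; the proper fix would be in how the bounds predicate is built per partition.

```rust
use std::collections::HashSet;
use datafusion::prelude::{col, lit, Expr};

fn dedup_disjuncts(disjuncts: Vec<Expr>) -> Vec<Expr> {
    let mut seen = HashSet::new();
    // Keep only the first occurrence of each textually identical clause.
    disjuncts
        .into_iter()
        .filter(|e| seen.insert(e.to_string()))
        .collect()
}

fn main() {
    let bound = col("c32")
        .gt_eq(lit("db-01"))
        .and(col("c32").lt_eq(lit("keb-03")));
    let duplicated = vec![bound; 16]; // what the consumer currently receives
    assert_eq!(dedup_disjuncts(duplicated).len(), 1);
}
```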
Re: [PR] Blog: Add table of contents to blog article [datafusion-site]
alamb merged PR #107: URL: https://github.com/apache/datafusion-site/pull/107
Re: [PR] Disable `required_status_checks` for now [datafusion]
blaginin merged PR #17537: URL: https://github.com/apache/datafusion/pull/17537
Re: [PR] Rename Blaze to Auron [datafusion]
Jefffrey merged PR #17532: URL: https://github.com/apache/datafusion/pull/17532
Re: [PR] Update Bug issue template to use Bug issue type [datafusion]
Jefffrey merged PR #17540: URL: https://github.com/apache/datafusion/pull/17540
Re: [PR] Blog: Add table of contents to blog article [datafusion-site]
nuno-faria commented on PR #107: URL: https://github.com/apache/datafusion-site/pull/107#issuecomment-3285025510

> Once we merge this PR, should we add TOC's to the older posts as well?

Yeah I think so.
Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]
alamb commented on code in PR #17521: URL: https://github.com/apache/datafusion/pull/17521#discussion_r2343741787

## datafusion/sqllogictest/test_files/push_down_filter.slt:

```diff
@@ -395,3 +395,28 @@ order by t1.k, t2.v;
 1 1 1
 1000 1000 1000
+
+# Regression test for https://github.com/apache/datafusion/issues/17512
+
+query I
+COPY (
+    SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp
+)
+TO 'test_files/scratch/push_down_filter/17512.parquet'
+STORED AS PARQUET;
+----
+1
+
+statement ok
+CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 'test_files/scratch/push_down_filter/17512.parquet';
+
+query I
+SELECT 1
+FROM (
+    SELECT start_timestamp
+    FROM records
+    WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz
+) AS t
+WHERE t.start_timestamp::time < '00:00:01'::time;
```

Review Comment:

> They need to take the f into account. The optimizer that does that is called "unwrap cast in comparison" AFAICT.

Yes I agree (assuming `f` is a `cast` expression)

> The find_most_restrictive_predicate should operate only on predicates comparing column c directly, ignoring those which compare f(c).

That is my understanding of what this PR does. I am not sure if you are just confirming this change or if you are proposing / suggesting something more

## datafusion/optimizer/src/simplify_expressions/simplify_predicates.rs:

```diff
@@ -239,8 +239,99 @@ fn find_most_restrictive_predicate(
 fn extract_column_from_expr(expr: &Expr) -> Option<Column> {
     match expr {
         Expr::Column(col) => Some(col.clone()),
-        // Handle cases where the column might be wrapped in a cast or other operation
```

Review Comment: 👍
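A simplified sketch of the behavior the fix moves to (the real function lives in `simplify_predicates.rs`): only a bare column reference identifies a predicate as being on `c`; predicates on `f(c)` such as casts are left alone rather than merged.

```rust
use datafusion::common::Column;
use datafusion::logical_expr::Expr;

fn extract_column_from_expr(expr: &Expr) -> Option<Column> {
    match expr {
        Expr::Column(col) => Some(col.clone()),
        // Casts and other wrappers are intentionally not unwrapped any more:
        // a bound on CAST(c AS ...) is not a bound on c.
        _ => None,
    }
}

fn main() {
    let direct = Expr::Column(Column::from_name("start_timestamp"));
    assert!(extract_column_from_expr(&direct).is_some());
}
```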
Re: [I] Add Binary/LargeBinary/BinaryView/FixedSizeBinary to join_fuzz [datafusion]
alamb closed issue #17447: Add Binary/LargeBinary/BinaryView/FixedSizeBinary to join_fuzz URL: https://github.com/apache/datafusion/issues/17447
Re: [PR] feat: Add binary to `join_fuzz` testing [datafusion]
alamb merged PR #17497: URL: https://github.com/apache/datafusion/pull/17497
Re: [PR] feat: Add binary to `join_fuzz` testing [datafusion]
alamb commented on PR #17497: URL: https://github.com/apache/datafusion/pull/17497#issuecomment-3285061792

Thank you @jonathanc-n and @findepi. I agree it might be worth figuring out some way to expand coverage that doesn't require so many individual tests to be added / updated 🤔
Re: [I] application of simple optimizer rule produces incorrect results (DF 49 regression) [datafusion]
alamb commented on issue #17510: URL: https://github.com/apache/datafusion/issues/17510#issuecomment-3285068318 I think we can close this ticket as not a bug, is that ok @wkalt ?
Re: [I] Nested Loop Join: Performance Regression in DataFusion 50 for Suboptimal Join Orderings [datafusion]
alamb closed issue #17488: Nested Loop Join: Performance Regression in DataFusion 50 for Suboptimal Join Orderings URL: https://github.com/apache/datafusion/issues/17488
Re: [I] Release DataFusion `50.0.0` (Aug/Sep 2025) [datafusion]
xudong963 commented on issue #16799: URL: https://github.com/apache/datafusion/issues/16799#issuecomment-3284639539

Hi @comphead, I have a severe headache today and I'll be off. We may need to start the voting process today; could you please help me do it? Thanks!
Re: [PR] Refactor TableProvider::scan into TableProvider::scan_with_args [datafusion]
alamb commented on code in PR #17336: URL: https://github.com/apache/datafusion/pull/17336#discussion_r2344142167

## datafusion/catalog/src/table.rs:

```diff
@@ -299,6 +334,119 @@ pub trait TableProvider: Debug + Sync + Send {
     }
 }
+
+/// Arguments for scanning a table with [`TableProvider::scan_with_args`].
+///
+/// `ScanArgs` provides a structured way to pass scan parameters to table providers,
+/// replacing the multiple individual parameters used by [`TableProvider::scan`].
+/// This struct uses the builder pattern for convenient construction.
+///
+/// # Examples
+///
+/// ```
+/// # use datafusion_catalog::ScanArgs;
+/// # use datafusion_expr::Expr;
+/// let args = ScanArgs::default()
+///     .with_projection(Some(vec![0, 2, 4]))
+///     .with_limit(Some(1000));
+/// ```
+#[derive(Debug, Clone, Default)]
+pub struct ScanArgs {
+    filters: Option<Vec<Expr>>,
+    projection: Option<Vec<usize>>,
+    limit: Option<usize>,
+}
+
+impl ScanArgs {
+    /// Set the column projection for the scan.
+    ///
+    /// The projection is a list of column indices from [`TableProvider::schema`]
+    /// that should be included in the scan results. If `None`, all columns are included.
+    ///
+    /// # Arguments
+    /// * `projection` - Optional list of column indices to project
+    pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
+        self.projection = projection;
+        self
+    }
+
+    /// Get the column projection for the scan.
+    ///
+    /// Returns a cloned copy of the projection column indices, or `None` if
+    /// no projection was specified (meaning all columns should be included).
+    pub fn projection(&self) -> Option<Vec<usize>> {
```

Review Comment:

```suggestion
    pub fn projection(&self) -> Option<&Vec<usize>> {
```

Review Comment: I think this should return a reference (`Option<&Vec<usize>>`) to avoid forcing clones even if the caller doesn't need a copy

## datafusion/core/src/physical_planner.rs:

```diff
@@ -459,9 +460,12 @@ impl DefaultPhysicalPlanner {
             // doesn't know (nor should care) how the relation was
             // referred to in the query
             let filters = unnormalize_cols(filters.iter().cloned());
-            source
-                .scan(session_state, projection.as_ref(), &filters, *fetch)
-                .await?
+            let opts = ScanArgs::default()
+                .with_projection(projection.clone())
```

Review Comment: is there a reason we need to clone the projection here? I think it would be better to avoid the clone and pass in the arguments via reference (so something like `ScanArgs<'a>`)
Re: [I] Nested Loop Join: Performance Regression in DataFusion 50 for Suboptimal Join Orderings [datafusion]
alamb commented on issue #17488: URL: https://github.com/apache/datafusion/issues/17488#issuecomment-3285079247

Per the comments above, let's close this ticket. @tobixdev / @2010YOUY01 do you think it is worth tracking a potential performance improvement as a separate ticket, or is there not much to do without causing other tradeoffs?
Re: [PR] Introduce wildcard const for FixedSizeBinary type signature [datafusion]
findepi commented on PR #17531: URL: https://github.com/apache/datafusion/pull/17531#issuecomment-3285251362

I realized we may get away without introducing a generic type concept. The existing `TypeSignature::Coercible`'s `TypeSignatureClass` seems to be more or less it. Did you try to use it?

cc @jayzhan211
Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]
findepi merged PR #17521: URL: https://github.com/apache/datafusion/pull/17521
Re: [I] Logical optimizer pushdown_filters rule fails with relatively simple query [datafusion]
findepi closed issue #17512: Logical optimizer pushdown_filters rule fails with relatively simple query URL: https://github.com/apache/datafusion/issues/17512
Re: [PR] Add assertion that ScalarUDFImpl implementation is consistent with declared return type [datafusion]
findepi commented on code in PR #17515: URL: https://github.com/apache/datafusion/pull/17515#discussion_r2344236784

## datafusion/expr/src/udf.rs:

```diff
@@ -233,7 +233,25 @@ impl ScalarUDF {
     ///
     /// See [`ScalarUDFImpl::invoke_with_args`] for details.
     pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-        self.inner.invoke_with_args(args)
+        #[cfg(debug_assertions)]
```

Review Comment: yeah, match on invoke_with_args result is useless, the `?` is the way to go.
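A sketch of the pattern being suggested, as a stand-alone function rather than the actual `ScalarUDF` wrapper; the declared type is passed in explicitly here for illustration.

```rust
use arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::ColumnarValue;
use datafusion::scalar::ScalarValue;

fn check_declared_type(
    result: Result<ColumnarValue>,
    declared: &DataType,
) -> Result<ColumnarValue> {
    // Propagate the inner error with `?` rather than matching on it...
    let value = result?;
    // ...and only then run the debug-only consistency check.
    debug_assert_eq!(
        &value.data_type(),
        declared,
        "function returned a value whose type differs from the declared return type"
    );
    Ok(value)
}

fn main() {
    let ok = check_declared_type(
        Ok(ColumnarValue::Scalar(ScalarValue::Int64(Some(1)))),
        &DataType::Int64,
    );
    assert!(ok.is_ok());
}
```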
Re: [PR] add a ci job for typo checking [datafusion]
findepi commented on PR #17339: URL: https://github.com/apache/datafusion/pull/17339#issuecomment-3285281075

This new check just caught a stupid typo I made in a hurry. Thank you for catching me!
[PR] Using `encode_arrow_schema` from arrow-rs. [datafusion]
samueleresca opened a new pull request, #17543: URL: https://github.com/apache/datafusion/pull/17543

## Which issue does this PR close?

- Closes #17542

## What changes are included in this PR?

- Removing the custom `encode_arrow_schema` implementation.
- Removing unused dependencies.

## Are these changes tested?

Unit test coverage

## Are there any user-facing changes?

No
Re: [PR] POC: datafusion-cli instrumented object store [datafusion]
alamb commented on PR #17266: URL: https://github.com/apache/datafusion/pull/17266#issuecomment-3285476190 I have not forgotten about this PR -- I am just busy with other projects now. I will come back to this soon (TM)
Re: [I] [Epic]: Google Summer of Code 2025 Correlated Subquery Support [datafusion]
alamb commented on issue #16059: URL: https://github.com/apache/datafusion/issues/16059#issuecomment-3285523017

The PR to add correlated subquery support is here:

- https://github.com/apache/datafusion/pull/17110

I think with that we can claim this project is done. Thank you for all your effort @irenjj
Re: [I] [Epic]: Google Summer of Code 2025 Correlated Subquery Support [datafusion]
alamb closed issue #16059: [Epic]: Google Summer of Code 2025 Correlated Subquery Support URL: https://github.com/apache/datafusion/issues/16059
Re: [PR] feat: [1941-Part2]: Introduce map_to_list scalar function [datafusion-comet]
comphead commented on PR #2312: URL: https://github.com/apache/datafusion-comet/pull/2312#issuecomment-3285609973

Thanks @rishvin, appreciate it if you can take #2388, fill in the context, and we can go through the PRs once again.
Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]
adriangb commented on PR #17520: URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3285732843

> Need to revert doc changes too:

done!
[I] Numeric overflow should result in query error [datafusion]
findepi opened a new issue, #17539: URL: https://github.com/apache/datafusion/issues/17539

### Describe the bug

In programming languages it's rather typical that numeric overflows are not a runtime error. In SQL, it's rather typical that numeric overflows are identified during query execution and are treated as an error. DataFusion is a SQL engine and therefore should report overflows as query errors. Continuing execution after an overflow is a correctness issue (the query returns a wrong result).

### To Reproduce

```shell
cargo run --bin datafusion-cli -- --command "SELECT 10000000000 * 10000000000;"
```

```
datafusion main$ cargo run --bin datafusion-cli -- --command "SELECT 10000000000 * 10000000000;"
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running `target/debug/datafusion-cli --command 'SELECT 10000000000 * 10000000000;'`
DataFusion CLI v50.0.0
+------------------------------------------+
| Int64(10000000000) * Int64(10000000000)  |
+------------------------------------------+
| 7766279631452241920                      |
+------------------------------------------+
```

### Expected behavior

### PostgreSQL

```
postgres=# SELECT 10000000000 * 10000000000;
ERROR: bigint out of range
```

### Trino

```
trino> SELECT 10000000000 * 10000000000;
Query 20250912_093345_0_kgiyz failed: bigint multiplication overflow: 10000000000 * 10000000000
```

### Snowflake

```
SELECT 10000000000 * 10000000000 * 10000000000 * 10000000000;
-- Number out of representable range: type FIXED[SB16](38,0){not null}, value 1e+40
```

### Additional context

_No response_

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
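A minimal sketch of the behavior the issue asks for (report the overflow instead of returning the wrapped value), using only std Rust's checked multiplication. This illustrates the idea only and is not DataFusion's or arrow-rs's actual arithmetic kernel.

```rust
// Surface an i64 overflow as an error instead of silently wrapping.
fn multiply_i64(a: i64, b: i64) -> Result<i64, String> {
    a.checked_mul(b)
        .ok_or_else(|| format!("bigint multiplication overflow: {a} * {b}"))
}

fn main() {
    // 10^10 * 10^10 = 10^20 does not fit in an i64, so this reports an error
    // rather than returning the wrapped value 7766279631452241920.
    match multiply_i64(10_000_000_000, 10_000_000_000) {
        Ok(v) => println!("{v}"),
        Err(e) => println!("query error: {e}"),
    }
}
```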
[I] TPC-DS query #88 fails with disabled AQE [datafusion-comet]
and124578963 opened a new issue, #2389: URL: https://github.com/apache/datafusion-comet/issues/2389 ### Describe the bug When running the [TPC-DS query 88](https://github.com/apache/doris/blob/master/tools/tpcds-tools/queries/sf1000/query88.sql) with **"spark.sql.adaptive.enabled": "false"**, it fails with the following error: **java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: ArrayBuffer(6958, 6975, 6958, 6958).** The issue is resolved by either setting "spark.comet.exec.broadcastHashJoin.enabled": "false" or re-enabling AQE. My hypothesis is that the query plan is built incorrectly, leading to RDDs from different subqueries with varying partition counts being zipped together. The query also runs successfully if rewritten to use a single common SELECT statement or if all subqueries are forced to have identical conditions, which results in RDDs of the same size. Stack Trace ```Problem with SQL: An error occurred while calling o384.showString. : java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: ArrayBuffer(6958, 6975, 6958, 6958) at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:212) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:517) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:208) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:204) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.prepareBroadcast(BroadcastNestedLoopJoinExec.scala:444) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.codegenInner(BroadcastNestedLoopJoinExec.scala:454) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:428) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.consume(BroadcastNestedLoopJoinExec.scala:32) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.codegenInner(BroadcastNestedLoopJoinExec.scala:469) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:428) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151) at org.apache.spark.sql.comet.CometColumnarToRowExec.consume(CometColumnarToRowExec.scala:54) at org.apache.spark.sql.comet.CometColumnarToRowExec.doProduce(CometColumnarToRowExec.scala:277) at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92) at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92) at org.apache.spark.sql.comet.CometColumnarToRowExec.produce(CometColumnarToRowExec.scala:54) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doProduce(BroadcastNestedLoopJoinExec.scala:423) at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92) at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.produce(BroadcastNestedLoopJoinExec.scala:32) at org.apache.spark.sql
Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]
alamb commented on PR #17520: URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3285898307 Thank you @adriangb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [iceberg] Tracking PR for deleted rows support [datafusion-comet]
parthchandra opened a new issue, #2390: URL: https://github.com/apache/datafusion-comet/issues/2390 Tracking https://github.com/apache/iceberg/pull/14062 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Preserves field metadata when creating logical plan for VALUES expression [datafusion]
Kontinuation commented on PR #17525: URL: https://github.com/apache/datafusion/pull/17525#issuecomment-3286053055 CC @paleolimbot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Implement CometInMemoryTableScanExec [datafusion-comet]
andygrove opened a new issue, #2391: URL: https://github.com/apache/datafusion-comet/issues/2391 ### What is the problem the feature request solves? In queries with InMemoryTableScanExec, we have to perform a CometColumnarToRowExec to convert to row format, and then the rest of the query falls back to Spark. We should investigate implementing a CometInMemoryTableScanExec that loads data in columnar format. ### Describe the potential solution _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Only compute bounds/ dynamic filters if consumer asks for it [datafusion]
adriangb commented on issue #17527: URL: https://github.com/apache/datafusion/issues/17527#issuecomment-3285593615 I think the pro of the subscriber option is: 1. Like you say it can be applied to only some filters and not others. 2. It avoids making the general filter pushdown API more complex (although the complexity gets pushed somewhere else). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
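To make the "subscriber option" discussed above more concrete, here is a hypothetical sketch of what such an API could look like. None of these names (`BoundsSubscriber`, `FilterSource`, `maybe_publish`) exist in DataFusion; the point is only that bounds are computed when at least one consumer has registered interest.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical subscriber interface: a consumer that wants filter bounds.
trait BoundsSubscriber: Send + Sync {
    fn on_bounds(&self, min: i64, max: i64);
}

// Hypothetical producer side: only computes bounds when someone subscribed.
#[derive(Default)]
struct FilterSource {
    subscribers: Mutex<Vec<Arc<dyn BoundsSubscriber>>>,
}

impl FilterSource {
    fn subscribe(&self, subscriber: Arc<dyn BoundsSubscriber>) {
        self.subscribers.lock().unwrap().push(subscriber);
    }

    fn maybe_publish(&self, compute_bounds: impl Fn() -> (i64, i64)) {
        let subscribers = self.subscribers.lock().unwrap();
        if subscribers.is_empty() {
            return; // nobody asked for bounds, skip the computation entirely
        }
        let (min, max) = compute_bounds();
        for s in subscribers.iter() {
            s.on_bounds(min, max);
        }
    }
}

struct PrintSubscriber;

impl BoundsSubscriber for PrintSubscriber {
    fn on_bounds(&self, min: i64, max: i64) {
        println!("received bounds [{min}, {max}]");
    }
}

fn main() {
    let source = FilterSource::default();
    source.maybe_publish(|| (0, 100)); // no subscribers: bounds are never computed
    source.subscribe(Arc::new(PrintSubscriber));
    source.maybe_publish(|| (0, 100)); // now the bounds are computed and delivered
}
```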
Re: [PR] Blog: Add table of contents to blog article [datafusion-site]
alamb commented on PR #107: URL: https://github.com/apache/datafusion-site/pull/107#issuecomment-3284773235 Once we merge this PR, should we add TOCs to the older posts as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Removing ad-hoc implementation of `encode_arrow_schema` [datafusion]
samueleresca opened a new issue, #17542: URL: https://github.com/apache/datafusion/issues/17542 ### Is your feature request related to a problem or challenge? The implementation of `encode_arrow_schema` is now included and exposed by arrow-rs (see: https://github.com/apache/arrow-rs/pull/6916). The datafusion ad-hoc implementation is not needed anymore ### Describe the solution you'd like Remove the ad-hoc datafusion implementation of `encode_arrow_schema`. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]
alamb commented on code in PR #17521: URL: https://github.com/apache/datafusion/pull/17521#discussion_r2344321562 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -395,3 +395,28 @@ order by t1.k, t2.v; 1 1 1 1000 1000 1000 + +# Regression test for https://github.com/apache/datafusion/issues/17512 + +query I +COPY ( +SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp +) +TO 'test_files/scratch/push_down_filter/17512.parquet' +STORED AS PARQUET; + +1 + +statement ok +CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 'test_files/scratch/push_down_filter/17512.parquet'; + +query I +SELECT 1 +FROM ( +SELECT start_timestamp +FROM records +WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz +) AS t +WHERE t.start_timestamp::time < '00:00:01'::time; Review Comment: makes sense -- thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: output_ordering converted to Vec> [datafusion]
destrex271 commented on code in PR #17439: URL: https://github.com/apache/datafusion/pull/17439#discussion_r2345062474

## datafusion/datasource/src/file_scan_config.rs:
## @@ -383,7 +383,7 @@ impl FileScanConfigBuilder {
    /// Set the output ordering of the files
    pub fn with_output_ordering(mut self, output_ordering: Vec<LexOrdering>) -> Self {

Review Comment:
@crepererum I guess we should skip modifying the type of expr to Option since at this stage having None seems to be a little useless.

```rust
#[derive(Debug, Clone)]
pub struct SortExec {
    /// Input schema
    pub(crate) input: Arc<dyn ExecutionPlan>,
    /// Sort expressions
    expr: LexOrdering,
    /// Containing all metrics set created during sort
    metrics_set: ExecutionPlanMetricsSet,
    /// Preserve partitions of input plan. If false, the input partitions
    /// will be sorted and merged into a single output partition.
    preserve_partitioning: bool,
    /// Fetch highest/lowest n results
    fetch: Option<usize>,
    /// Normalized common sort prefix between the input and the sort expressions (only used with fetch)
    common_sort_prefix: Vec<PhysicalSortExpr>,
    /// Cache holding plan properties like equivalences, output partitioning etc.
    cache: PlanProperties,
    /// Filter matching the state of the sort for dynamic filter pushdown.
    /// If `fetch` is `Some`, this will also be set and a TopK operator may be used.
    /// If `fetch` is `None`, this will be `None`.
    filter: Option<Arc<DynamicFilterPhysicalExpr>>,
}
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: feature specific tests [datafusion-comet]
codecov-commenter commented on PR #2372: URL: https://github.com/apache/datafusion-comet/pull/2372#issuecomment-3286531985

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2372?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 33.05%. Comparing base ([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)) to head ([`149a265`](https://app.codecov.io/gh/apache/datafusion-comet/commit/149a2654456f1b4f288444f60c9a5511cdd3bd39?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)). :warning: Report is 497 commits behind head on main.

Additional details and impacted files

```diff
@@              Coverage Diff              @@
##               main    #2372       +/-   ##
=============================================
- Coverage     56.12%   33.05%    -23.07%
+ Complexity      976      736       -240
=============================================
  Files           119      147        +28
  Lines         11743    13409      +1666
  Branches       2251     2332        +81
=============================================
- Hits           6591     4433      -2158
- Misses         4012     8184      +4172
+ Partials       1140      792       -348
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/2372?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Regression: projection pushdown doesn't work as expected in DF50 [datafusion]
alamb closed issue #17513: Regression: projection pushdown doesn't work as expected in DF50 URL: https://github.com/apache/datafusion/issues/17513 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]
alamb commented on PR #17520: URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3286659694 This one looks good thanks everyone! I also made a backport PR to `branch-50` branch so we can include it in the RC - https://github.com/apache/datafusion/pull/17544 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] [branch-50] fix: Implement AggregateUDFImpl::reverse_expr for StringAgg (#17165) (#17473) [datafusion]
alamb opened a new pull request, #17544: URL: https://github.com/apache/datafusion/pull/17544 ## Which issue does this PR close? - related to https://github.com/apache/datafusion/issues/17513 - related to #16799 ## Rationale for this change We fixed a bug found in testing DataFusion 50, so let's backport the change ## What changes are included in this PR? - backport https://github.com/apache/datafusion/pull/17520 to [branch-50] ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] API Suggestion match casing for isnan and is_null [datafusion-python]
ntjohnson1 opened a new issue, #1235: URL: https://github.com/apache/datafusion-python/issues/1235 Maybe this is more of a question than a suggestion. Is there a reason they both aren't `is_nan` and `is_null`? It looks like this is a carry over from the underlying rust and it isn't clear to me why that interface is different either. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Trying cargo machete to prune unused deps. [datafusion]
samueleresca opened a new pull request, #17545: URL: https://github.com/apache/datafusion/pull/17545 Very much a test with `cargo machete` to understand all the false positives detected by the tooling. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Using `encode_arrow_schema` from arrow-rs. [datafusion]
alamb merged PR #17543: URL: https://github.com/apache/datafusion/pull/17543 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Revert #17295 (Support from-first SQL syntax) [datafusion]
adriangb commented on PR #17520: URL: https://github.com/apache/datafusion/pull/17520#issuecomment-3286713964 @simonvandel I'm sorry we had to revert your contribution. I hope you're able to contribute again and now that we have the regression test and clarity about the behavior it should be okay to re-add your feature. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Removing ad-hoc implementation of `encode_arrow_schema` [datafusion]
alamb closed issue #17542: Removing ad-hoc implementation of `encode_arrow_schema` URL: https://github.com/apache/datafusion/issues/17542 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `TableProvider::scan_with_args` [datafusion]
adriangb commented on PR #17336: URL: https://github.com/apache/datafusion/pull/17336#issuecomment-3286697680 @alamb I believe I've addressed your feedback, thank you for the review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add assertion that ScalarUDFImpl implementation is consistent with declared return type [datafusion]
alamb commented on code in PR #17515: URL: https://github.com/apache/datafusion/pull/17515#discussion_r2343788618

## datafusion/spark/src/function/url/parse_url.rs:
## @@ -222,7 +223,7 @@ fn spark_parse_url(args: &[ArrayRef]) -> Result<ArrayRef> {
            )
        }
        (DataType::Utf8View, DataType::Utf8View, DataType::Utf8View) => {
-            process_parse_url::<_, _, _, StringArray>(
+            process_parse_url::<_, _, _, StringViewArray>(

Review Comment:
nice

## datafusion/expr/src/udf.rs:
## @@ -233,7 +233,25 @@ impl ScalarUDF {
    ///
    /// See [`ScalarUDFImpl::invoke_with_args`] for details.
    pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-        self.inner.invoke_with_args(args)
+        #[cfg(debug_assertions)]

Review Comment:
I played around with this code a little bit and I found it could be made more concise with less indenting. Not needed, just for your consideration

```rust
    /// Invoke the function on `args`, returning the appropriate result.
    ///
    /// See [`ScalarUDFImpl::invoke_with_args`] for details.
    pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        #[cfg(debug_assertions)]
        let return_field = Arc::clone(&args.return_field);

        let value = self.inner.invoke_with_args(args)?;

        // Maybe this could be enabled always?
        // This doesn't use debug_assert!, but it's meant to run anywhere except on production, so it's same in spirit, so conditioning on debug_assertions.
        #[cfg(debug_assertions)]
        if &value.data_type() != return_field.data_type() {
            return datafusion_common::exec_err!("Function '{}' returned value of type '{:?}' while the following type was promised at planning time and expected: '{:?}'",
                self.name(),
                value.data_type(),
                return_field.data_type()
            );
        }

        // TODO verify return data is non-null when it was promised to be?
        Ok(value)
    }
```

## datafusion/expr/src/udf.rs:
## @@ -233,7 +233,25 @@ impl ScalarUDF {
    ///
    /// See [`ScalarUDFImpl::invoke_with_args`] for details.
    pub fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
-        self.inner.invoke_with_args(args)
+        #[cfg(debug_assertions)]
+        let return_field = Arc::clone(&args.return_field);
+        match self.inner.invoke_with_args(args) {
+            Ok(value) => {
+                // Maybe this could be enabled always?
+                // This doesn't use debug_assert!, but it's meant to run anywhere except on production, so it's same in spirit, so conditioning on debug_assertions.
+                #[cfg(debug_assertions)]
+                if &value.data_type() != return_field.data_type() {
+                    return datafusion_common::exec_err!("Function '{}' returned value of type '{:?}' while the following type was promised at planning time and expected: '{:?}'",

Review Comment:
It probably doesn't matter, but I would suggest making this `internal_err` instead of `exec_err` as it signifies a bug rather than some other potentially expected runtime error

```suggestion
                    return datafusion_common::internal_err!("Function '{}' returned value of type '{:?}' while the following type was promised at planning time and expected: '{:?}'",
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
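For readers skimming the thread, the essence of the check being reviewed is simply "compare the Arrow type the function actually produced with the type it promised at planning time". Below is a standalone, simplified illustration of that idea; it assumes only the `arrow-schema` crate and is not the actual `ScalarUDF` code from the PR.

```rust
use arrow_schema::DataType;

// Simplified stand-in for the debug-only consistency check discussed above:
// a function's actual output type must match the type promised at planning time.
fn check_return_type(
    function_name: &str,
    promised: &DataType,
    actual: &DataType,
) -> Result<(), String> {
    if promised != actual {
        return Err(format!(
            "Function '{function_name}' returned value of type '{actual:?}' while \
             '{promised:?}' was promised at planning time"
        ));
    }
    Ok(())
}

fn main() {
    // A UDF that declared Utf8View but produced Utf8 (as in the parse_url fix above)
    // would be caught by a check like this.
    let result = check_return_type("parse_url", &DataType::Utf8View, &DataType::Utf8);
    println!("{result:?}");
}
```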
Re: [PR] use or instead of and for hash join filter pushdown [datafusion]
adriangb closed pull request #17461: use or instead of and for hash join filter pushdown URL: https://github.com/apache/datafusion/pull/17461 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `TableProvider::scan_with_args` [datafusion]
alamb commented on code in PR #17336: URL: https://github.com/apache/datafusion/pull/17336#discussion_r2345346409 ## datafusion/catalog/src/table.rs: ## @@ -171,6 +171,38 @@ pub trait TableProvider: Debug + Sync + Send { limit: Option, ) -> Result>; +/// Create an [`ExecutionPlan`] for scanning the table using structured arguments. +/// +/// This method uses [`ScanArgs`] to pass scan parameters in a structured way +/// and returns a [`ScanResult`] containing the execution plan. This approach +/// allows for extensible parameter passing and result handling. Review Comment: x ```suggestion /// and returns a [`ScanResult`] containing the execution plan. ``` ## datafusion/catalog/src/table.rs: ## @@ -171,6 +171,38 @@ pub trait TableProvider: Debug + Sync + Send { limit: Option, ) -> Result>; +/// Create an [`ExecutionPlan`] for scanning the table using structured arguments. +/// +/// This method uses [`ScanArgs`] to pass scan parameters in a structured way +/// and returns a [`ScanResult`] containing the execution plan. This approach +/// allows for extensible parameter passing and result handling. +/// +/// Table providers can override this method to take advantage of additional +/// parameters like `preferred_ordering` that may not be available through Review Comment: ```suggestion /// parameters like the upcoming `preferred_ordering` that may not be available through ``` ## datafusion/core/src/datasource/listing/table.rs: ## @@ -1169,6 +1169,24 @@ impl TableProvider for ListingTable { filters: &[Expr], limit: Option, ) -> Result> { +let options = ScanArgs::default() +.with_projection(projection.map(|p| p.as_slice())) +.with_filters(Some(filters)) +.with_limit(limit); +Ok(Arc::clone( +self.scan_with_args(state, options).await?.plan(), +)) Review Comment: ```suggestion Ok(self.scan_with_args(state, options).await?.into_inner()) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix predicate simplification for incompatible types in push_down_filter [datafusion]
findepi commented on code in PR #17521: URL: https://github.com/apache/datafusion/pull/17521#discussion_r2344231059 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -395,3 +395,28 @@ order by t1.k, t2.v; 1 1 1 1000 1000 1000 + +# Regression test for https://github.com/apache/datafusion/issues/17512 + +query I +COPY ( +SELECT arrow_cast('2025-01-01T00:00:00Z'::timestamptz, 'Timestamp(Microsecond, Some("UTC"))') AS start_timestamp +) +TO 'test_files/scratch/push_down_filter/17512.parquet' +STORED AS PARQUET; + +1 + +statement ok +CREATE EXTERNAL TABLE records STORED AS PARQUET LOCATION 'test_files/scratch/push_down_filter/17512.parquet'; + +query I +SELECT 1 +FROM ( +SELECT start_timestamp +FROM records +WHERE start_timestamp <= '2025-01-01T00:00:00Z'::timestamptz +) AS t +WHERE t.start_timestamp::time < '00:00:01'::time; Review Comment: > > The find_most_restrictive_predicate should operate only on predicates comparing column c directly, ignoring those which compare f(c). > > That is my understanding of what this PR does. I am not sure if you are just confirming this change or if you are proposing / suggesting something more At the time of writing it was a proposal. Now that the proposed change has been applied, it can be read as a confirmation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
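A toy illustration of the rule findepi describes above: only predicates comparing the bare column participate in the "most restrictive" selection, while predicates over a derived expression such as a cast are ignored. The types and names here are simplified stand-ins, not DataFusion's `Expr` or the actual `find_most_restrictive_predicate`.

```rust
// Simplified expression model: either a bare column or a cast over an expression.
#[allow(dead_code)]
enum Expr {
    Column(String),
    Cast(Box<Expr>),
}

// A predicate of the form `lhs <= rhs`.
struct LessThanOrEqual {
    lhs: Expr,
    rhs: i64,
}

// Pick the tightest upper bound for `column`, looking only at predicates that
// compare the bare column; `CAST(column AS ...) <= literal` is skipped.
fn most_restrictive_upper_bound(column: &str, predicates: &[LessThanOrEqual]) -> Option<i64> {
    predicates
        .iter()
        .filter(|p| matches!(&p.lhs, Expr::Column(name) if name.as_str() == column))
        .map(|p| p.rhs)
        .min()
}

fn main() {
    let preds = vec![
        LessThanOrEqual { lhs: Expr::Column("start_timestamp".into()), rhs: 100 },
        LessThanOrEqual {
            lhs: Expr::Cast(Box::new(Expr::Column("start_timestamp".into()))),
            rhs: 1,
        },
    ];
    // The cast-based predicate must not shadow the direct comparison.
    assert_eq!(most_restrictive_upper_bound("start_timestamp", &preds), Some(100));
    println!("ok");
}
```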
[PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]
rkrishn7 opened a new pull request, #17546: URL: https://github.com/apache/datafusion/pull/17546 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/17511 ## Rationale for this change When building equal conditions in a data source node, we want to ignore any stale references to columns that may have been swapped out (e.g. from `try_swapping_with_projection`). The current code reassigns predicate columns from the filter to refer to the corresponding ones in the updated schema. However, it only ignores non-projected columns. `reassign_predicate_columns` builds an invalid column expression (with index `usize::MAX`) if the column is not projected in the current schema. We don't want to refer to this in the equal conditions we build. ## What changes are included in this PR? Ignores any binary expressions that reference non-existent columns in the current schema (e.g. due to unnecessary projections being removed). ## Are these changes tested? ## Are there any user-facing changes? N/A -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
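A hedged sketch of the defensive idea described in this PR: before using a column reference that may have been rewritten against an older schema, check that it still resolves (by name and index) in the current schema. The function name is illustrative, this is not the code in the PR, and only the `arrow-schema` crate is assumed.

```rust
use arrow_schema::{DataType, Field, Schema};

// Returns true only if `index` points at a field in `schema` with the given name,
// i.e. the column reference is still valid against the current schema.
fn resolves_in(schema: &Schema, name: &str, index: usize) -> bool {
    schema
        .fields()
        .get(index)
        .map(|f| f.name() == name)
        .unwrap_or(false)
}

fn main() {
    let schema = Schema::new(vec![Field::new("a", DataType::Int64, false)]);
    // A stale reference rewritten to index usize::MAX must be ignored rather
    // than used to build filter equivalence info.
    assert!(resolves_in(&schema, "a", 0));
    assert!(!resolves_in(&schema, "a", usize::MAX));
    println!("ok");
}
```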
Re: [PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]
rkrishn7 commented on PR #17546: URL: https://github.com/apache/datafusion/pull/17546#issuecomment-3286735863 The fact that `reassign_predicate_columns` can return invalid column expressions seems like a big footgun. I didn't change it within this PR due to its usage elsewhere, but we might want to think about refactoring it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` [datafusion]
rkrishn7 commented on PR #17546: URL: https://github.com/apache/datafusion/pull/17546#issuecomment-3286768072 Re-ran TPCH benchmark with the same configuration as the referenced issue and all the tests pass now. Will add a regression test here in a bit! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: Update documentation on Epics and Sponsoring Maintainers [datafusion]
comphead commented on PR #17505: URL: https://github.com/apache/datafusion/pull/17505#issuecomment-3285580676 > > Is it worth changing the term "sponsoring" to something like "driving" or "championing", to avoid any confusion with the monetary meaning of sponsoring? > > That is a really good point. Here are some brainstorm options > > * "Driving Committer" > * "Championing Committer" > * "Shepherding Committer" (my personal favorite) > * "Supervising Committer" > * "Herding Committer" > * "Stewarding Committer" > * "Managing Committer" > > 🤔 `supervising` sounds great IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]
hsiang-c opened a new pull request, #2393: URL: https://github.com/apache/datafusion-comet/pull/2393 ## Which issue does this PR close? Partially closes #. https://github.com/apache/datafusion-comet/issues/2255 ## Rationale for this change - The scope of https://github.com/apache/datafusion-comet/issues/2255 is huge, so fix the hygiene issues bit by bit. ## What changes are included in this PR? - Depend on [scala-collection-compat](https://github.com/scala/scala-collection-compat) b/c Comet still has Scala 2.12 builds. - `scala.collection.JavaConverters` is deprecated, so replace it with `scala.jdk.CollectionConverters` (Scala) and `scala.jdk.javaapi.CollectionConverters` (Java) ## How are these changes tested? - Tested locally w/ build profile `-Pscala-2.12` and `-Pscala-2.13` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable feature specific unit tests [datafusion-comet]
parthchandra commented on issue #2360: URL: https://github.com/apache/datafusion-comet/issues/2360#issuecomment-3286886609 @andygrove @wForget this is ready for review. The test referred to above passes with `hdfs-opendal` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]
codecov-commenter commented on PR #2393: URL: https://github.com/apache/datafusion-comet/pull/2393#issuecomment-3286897970 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report :x: Patch coverage is `0%` with `13 lines` in your changes missing coverage. Please review. :white_check_mark: Project coverage is 33.06%. Comparing base ([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)) to head ([`8beb445`](https://app.codecov.io/gh/apache/datafusion-comet/commit/8beb4459d6c78945c4871962b8c5241ccc722957?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)). :warning: Report is 497 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...va/org/apache/comet/parquet/NativeBatchReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FNativeBatchReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L05hdGl2ZUJhdGNoUmVhZGVyLmphdmE=) | 0.00% | [7 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../apache/comet/parquet/CometParquetFileFormat.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetFileFormat.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldEZpbGVGb3JtYXQuc2NhbGE=) | 0.00% | [2 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...a/org/apache/comet/parquet/NativeColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FNativeColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L05hdGl2ZUNvbHVtblJlYWRlci5qYXZh) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | 
[...et/execution/shuffle/CometUnsafeShuffleWriter.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fspark%2Fsql%2Fcomet%2Fexecution%2Fshuffle%2FCometUnsafeShuffleWriter.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jb21ldC9leGVjdXRpb24vc2h1ZmZsZS9Db21ldFVuc2FmZVNodWZmbGVXcml0ZXIuamF2YQ==) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...t/parquet/CometParquetPartitionReaderFactory.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fparquet%2FCometParquetPartitionReaderFactory.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbWV0UGFycXVldFBhcnRpdGlvblJlYWRlckZhY3Rvcnkuc2NhbGE=) | 0.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...n/scala/org/apache/comet/rules/CometScanRule.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2393?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Frules%2FCometScanRule.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW
Re: [I] Enable the `ListFilesCache` to be available for partitioned tables [datafusion]
BlakeOrth commented on issue #17211: URL: https://github.com/apache/datafusion/issues/17211#issuecomment-3286924997 @alamb I've found the in-review documentation on process updates in: - https://github.com/apache/datafusion/pull/17505 In an effort to follow the project's processes, considering our existing discussions about this issue, would you be willing to be the " Committer" for this effort? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA - Sep 15, 2025 [datafusion]
GitHub user alamb added a comment to the discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA - Sep 15, 2025 Looking forward to next week! GitHub link: https://github.com/apache/datafusion/discussions/16265#discussioncomment-14388367 This is an automatically sent email for github@datafusion.apache.org. To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Release sqlparser-rs version `0.59.0` around 2025-09-15 [datafusion-sqlparser-rs]
alamb commented on issue #1956: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1956#issuecomment-3286957810 @iffyio -- I am thinking of making a release next week -- is there anything in particular you think we should wait for? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Slow aggregrate query with `array_agg`, Polars is 4 times faster for equal query [datafusion]
Omega359 commented on issue #17446: URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3286959549 There was a previous attempt at a GroupsAccumulator for ArrayAgg (https://github.com/apache/datafusion/pull/11096) but the performance of that PR was mixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor: Scala hygiene - remove `scala.collection.JavaConverters` [datafusion-comet]
hsiang-c commented on PR #2393: URL: https://github.com/apache/datafusion-comet/pull/2393#issuecomment-3286990162 SparkSQL tests failed b/c of `java.lang.NoClassDefFoundError` ``` java.lang.NoClassDefFoundError: scala/jdk/javaapi/CollectionConverters [info] at org.apache.comet.parquet.NativeBatchReader.init(NativeBatchReader.java:259) [info] at org.apache.comet.parquet.CometParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(CometParquetFileFormat.scala:169) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Improve Initcap test and docs [datafusion-comet]
andygrove merged PR #2387: URL: https://github.com/apache/datafusion-comet/pull/2387 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Panic happens when adding a decimal256 to a float (SQLancer) [datafusion]
Jefffrey closed issue #16689: Panic happens when adding a decimal256 to a float (SQLancer) URL: https://github.com/apache/datafusion/issues/16689 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: implement lazy evaluation in Coalesce function [datafusion-comet]
parthchandra commented on code in PR #2270: URL: https://github.com/apache/datafusion-comet/pull/2270#discussion_r2345625780 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -394,6 +394,20 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } + test("test coalesce lazy eval") { Review Comment: Does this test verify that lazy evaluation did, in fact, occur? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: feature specific tests [datafusion-comet]
parthchandra commented on PR #2372: URL: https://github.com/apache/datafusion-comet/pull/2372#issuecomment-3287165660 @wForget ptal -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add test for decimal256 and float math [datafusion]
Jefffrey merged PR #17530: URL: https://github.com/apache/datafusion/pull/17530 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [ANSI] Include original SQL in error messages [datafusion-comet]
parthchandra commented on issue #2215: URL: https://github.com/apache/datafusion-comet/issues/2215#issuecomment-3287216142 I think a different approach is required here. The error in this example comes from [QueryExecutionErrors](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala). Afaik, when an error of this type is thrown, Spark will handle the proper reporting of the error and the query. We should ideally have a mapping for Comet/Datafusion errors that matches the error to the relevant `QueryExecutionError` and in `CometExecIterator` we should catch the `CometNativeException`, convert it into a `QueryExecutionError`, and rethrow it. In `CometExecIterator` we already catch some `CometNativeException`s which are then converted into a `SparkException`. This currently is specific to a set of errors which are matched by checking the contents of the error message (which is not quite ideal). It would be good to change these too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [native_iceberg_compat] Add support for custom S3 endpoints [datafusion-comet]
parthchandra commented on issue #2261: URL: https://github.com/apache/datafusion-comet/issues/2261#issuecomment-3287221701 @andygrove I don't think this is a problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Support more date part expressions [datafusion-comet]
parthchandra commented on code in PR #2316: URL: https://github.com/apache/datafusion-comet/pull/2316#discussion_r2345685485 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1681,14 +1681,17 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } - test("Year") { + test("DatePart functions: Year/Month/DayOfMonth/DayOfWeek/DayOfYear/WeekOfYear/Quarter") { Seq(false, true).foreach { dictionary => withSQLConf("parquet.enable.dictionary" -> dictionary.toString) { val table = "test" withTable(table) { sql(s"create table $table(col timestamp) using parquet") sql(s"insert into $table values (now()), (null)") - checkSparkAnswerAndOperator(s"SELECT year(col) FROM $table") + // TODO: weekday(col) https://github.com/apache/datafusion-comet/issues/2330 + checkSparkAnswerAndOperator( Review Comment: Can we change the session timezone when we read the values back? Also, maybe add some test values corresponding to days before the Unix Epoch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Add hdfs feature test job [datafusion-comet]
parthchandra commented on PR #2350: URL: https://github.com/apache/datafusion-comet/pull/2350#issuecomment-3287286359 oops. Yes, I did mean #2372. Let's get that merged and update the test in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Always run CI checks [datafusion]
blaginin opened a new pull request, #17538: URL: https://github.com/apache/datafusion/pull/17538

Follow-up on https://github.com/apache/datafusion/pull/17183

There are two ways to fix the "steps not run" problem:
- 1: always run them. CI will become slower, but we will have a merge queue anyway. On the positive side, CI runs become consistent (for example, today when you change only `*.md` files we don't run `check configs.md and ***_functions.md is up-to-date`, although we should).
- 2: not run them (in a slightly hacky way, because of https://github.com/orgs/community/discussions/45899).

Option 2 consumes fewer resources, so I'm going with it for now.
Re: [PR] feat: feature specific tests [datafusion-comet]
wForget commented on code in PR #2372: URL: https://github.com/apache/datafusion-comet/pull/2372#discussion_r2345712676

## spark/src/test/scala/org/apache/comet/parquet/ParquetReadFromFakeHadoopFsSuite.scala: ##

@@ -74,7 +74,18 @@ class ParquetReadFromFakeHadoopFsSuite extends CometTestBase with AdaptiveSparkP
       .startsWith(FakeHDFSFileSystem.PREFIX))
   }
-  ignore("test native_datafusion scan on fake fs") {
+  // This test fails for 'hdfs' but succeeds for 'open-dal'. 'hdfs' requires this fix
+  // https://github.com/datafusion-contrib/fs-hdfs/pull/29
+  test("test native_datafusion scan on fake fs") {
+    // Skip test if HDFS feature is not enabled in native library
+    val hdfsEnabled =
+      try {

Review Comment:
   Should we extract this try...catch block into a common method in NativeBase?
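A sketch of the refactor being suggested here, under the assumption that the feature check stays a try/catch around a native call (the `Native.isFeatureEnabled` entry point below is hypothetical; the PR currently probes an existing native call instead):

```scala
// Hypothetical shared helper, roughly what "a common method in NativeBase"
// could look like: report whether the native library was built with a given
// Cargo feature, treating link/runtime failures as "not enabled".
def isNativeFeatureEnabled(feature: String): Boolean =
  try {
    Native.isFeatureEnabled(feature) // hypothetical JNI entry point
  } catch {
    case _: UnsatisfiedLinkError | _: RuntimeException => false
  }
```

Individual suites could then guard with something like `assume(isNativeFeatureEnabled("hdfs"))` instead of repeating the try/catch inline.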
Re: [PR] chore: Add hdfs feature test job [datafusion-comet]
wForget commented on PR #2350: URL: https://github.com/apache/datafusion-comet/pull/2350#issuecomment-3287309576

> Should we also be adding the test suites to pr_build_macos.yml ?

Thanks, I will add it
Re: [PR] feat: Support more date part expressions [datafusion-comet]
wForget commented on code in PR #2316: URL: https://github.com/apache/datafusion-comet/pull/2316#discussion_r2345737849

## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ##

@@ -1681,14 +1681,17 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     }
   }
-  test("Year") {
+  test("DatePart functions: Year/Month/DayOfMonth/DayOfWeek/DayOfYear/WeekOfYear/Quarter") {
     Seq(false, true).foreach { dictionary =>
       withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {
         val table = "test"
         withTable(table) {
           sql(s"create table $table(col timestamp) using parquet")
           sql(s"insert into $table values (now()), (null)")
-          checkSparkAnswerAndOperator(s"SELECT year(col) FROM $table")
+          // TODO: weekday(col) https://github.com/apache/datafusion-comet/issues/2330
+          checkSparkAnswerAndOperator(

Review Comment:
   > Can we change the session timezone when we read the values back?

   The input of these methods is `DateType` and the output is `IntegerType`, so it doesn't seem to be related to the timezone.

   > Also, maybe add some test values corresponding to days before the Unix Epoch.

   Thanks, I will try it
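For the pre-epoch suggestion, a minimal sketch (illustrative only: it assumes the same single-`timestamp`-column `test` table created in the diff above, and the specific dates are arbitrary) might look like:

```scala
// Illustrative only: add rows before the Unix epoch (1970-01-01) so the
// date-part kernels are exercised with negative day offsets.
sql(s"insert into $table values (timestamp '1969-12-31 23:59:59'), (timestamp '1900-02-28 12:00:00')")
checkSparkAnswerAndOperator(
  s"SELECT year(col), dayofweek(col), weekofyear(col), quarter(col) FROM $table")
```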