adriangb opened a new pull request, #22383:
URL: https://github.com/apache/datafusion/pull/22383

   ## Which issue does this PR close?
   
   - Closes #. <!-- TODO: link tracking issue -->
   
   ## Rationale for this change
   
   DuckDB supports rich glob expressions in file paths (`SELECT * FROM 
'/data/**/x.parquet'`, `CREATE EXTERNAL TABLE … LOCATION 
'/data/**/x.parquet'`), and DataFusion users routinely expect the same. Today 
DataFusion has *most* of the glob infrastructure (`glob::Pattern`, 
`ListingTableUrl` prefix splitting), but the matching semantics diverge from 
DuckDB in two important ways:
   
   1. **`*` crosses `/`.** DataFusion uses the `glob` crate's default 
`MatchOptions` (`require_literal_separator: false`), so `*.parquet` will 
silently match `subdir/3.parquet` under `listing_table_ignore_subdirectory = 
false`. DuckDB only crosses `/` for `**`.
   2. **Recursive globs silently return wrong results under the default 
config.** With `listing_table_ignore_subdirectory = true` (default) and an 
explicit glob, `ListingTableUrl::contains` only matched the glob against the 
*first* path segment. That meant `/data/**/x.parquet` returned only top-level 
matches and silently dropped everything under any subdirectory — no error, just 
wrong counts.
   3. Globs are not detected at all inside scheme URLs, so 
`s3://bucket/**/x.parquet` was treated as a literal object name.
   
   This PR brings DataFusion's listing-table glob semantics into line with 
DuckDB (verified empirically against duckdb v1.5.2 on the same patterns and 
paths) for both filesystem paths and scheme URLs.
   
   ## What changes are included in this PR?
   
   Five small, reviewable, stacked commits, each builds green on its own:
   
   1. **`refactor(datasource)`** — Extract `split_glob_expression` (and its 
unit tests) from `url.rs` into a new 
`datafusion/datasource/src/listing_glob.rs` module. No behaviour change; sets 
up the subsequent commits.
   2. **`feat(datasource): DuckDB-compatible glob semantics`** — Add 
`LISTING_GLOB_MATCH_OPTIONS` (`require_literal_separator: true`) and 
`normalize_glob_separators` (Windows `\` → `/`). Rewrite 
`ListingTableUrl::contains` to match the **full relative path** with these 
options when an explicit glob is present, and to ignore 
`listing_table_ignore_subdirectory` (the pattern is authoritative). Stop 
injecting an auto-glob in `ListingTableFactory` for folder LOCATIONs — that 
injection broke `ignore_subdirectory=false` recursion under the new 
`*`-doesn't-cross-`/` semantics; the extension filter on the listing options 
does the same job without breaking recursion.
   3. **`feat(datasource): detect globs in scheme URLs`** — Add 
`find_url_path_start` (handles IPv6 hosts via `[...]` bracket tracking) and a 
new `parse_url` path that glob-splits the raw URL path *before* `Url::parse` 
runs (so glob characters like `?` don't get consumed as the URL query 
delimiter). The URL path is sliced at the first `?` or `#` so HTTP-style 
queries are preserved.
   4. **`test(sqllogictest)`** — New 
`datafusion/sqllogictest/test_files/glob.slt` end-to-end exercising `*` 
not-crossing-`/`, `**` recursive, single-segment `*/dir`, partition-style 
`year=*/x.parquet`, character classes, and `?` — through `CREATE EXTERNAL 
TABLE`. Equivalent dynamic-file (`SELECT * FROM '<glob>'`) coverage added to 
`dynamic_file.slt` (which carries the `enable_url_table` SessionContext flag).
   5. **`docs`** — Rewrite the (previously incorrect) wildcard section in 
`docs/source/user-guide/cli/datasources.md` with the supported syntax, 
examples, and the `?`-vs-URL-query caveat. Upgrade-guide entry in `54.0.0.md`.
   
   ### Semantics summary
   
   | Wildcard         | Meaning                                                 
                                           |
   | ---------------- | 
--------------------------------------------------------------------------------------------------
 |
   | `*`              | any chars within a single segment; does **not** cross 
`/`                                          |
   | `**`             | zero or more directory levels                           
                                           |
   | `?`              | exactly one non-separator char. Filesystem paths only — 
reserved for URL query delimiter in scheme URLs |
   | `[abc]`, `[a-z]` | one char from class                                     
                                           |
   | `key=value` dirs | matched literally, so `year=*/x.parquet` works          
                                           |
   
   ### Justifiable differences from DuckDB
   
   - **`?` not supported in scheme URLs** — it is the URL query delimiter and 
the ambiguity with HTTP-style queries (`https://host/p?a=b`) is worse than the 
limitation. Use `[abc]` or `*` for single-char matching in remote globs. `?` 
continues to work as a glob in scheme-less filesystem paths.
   - **Braces `{a,b}` not supported** — DuckDB doesn't support them either; use 
separate tables or `UNION ALL`.
   
   ### Breaking change
   
   `*` no longer crosses `/`. Blast radius is small (only manifests under 
non-default `listing_table_ignore_subdirectory = false`, and the previous 
behaviour was already non-standard and DuckDB-incompatible). Existing 
`parquet.slt` glob test updated and a sibling `**/*.parquet` test added that 
returns the full set. The upgrade guide has migration guidance.
   
   ## Are these changes tested?
   
   Yes:
   
   - **Rust unit tests** in `listing_glob.rs` cover 
`LISTING_GLOB_MATCH_OPTIONS` (cross-checked against duckdb v1.5.2 directly), 
`find_url_path_start` (s3, file, http, IPv6), `split_glob_expression`, and 
separator normalisation.
   - **`test_url_contains_glob`** and **`test_url_parse_glob_in_scheme_urls`** 
in `datafusion/datasource/src/mod.rs` cover the public-API `contains()` and 
`parse()` end-to-end including partition-dir globs, percent-encoded literals, 
URL-query preservation, and IPv6 hosts.
   - **`glob.slt`** is a new end-to-end sqllogictest covering all DuckDB-target 
cases through `CREATE EXTERNAL TABLE`.
   - **`dynamic_file.slt`** extended with the `SELECT * FROM '<glob>'` 
equivalents.
   - **Factory unit tests** 
(`test_create_using_folder_{with,without}_compression`, 
`test_odd_directory_names`) updated to assert the new contract — folder 
LOCATIONs set `file_extension` on the listing options and leave `url.glob` as 
`None`.
   - **Empirical end-to-end** against the built `datafusion-cli` on a nested 
local filesystem fixture, every count matched DuckDB's reference output.
   
   ## Are there any user-facing changes?
   
   Yes — see the **Breaking change** section above and the new upgrade-guide 
entry at `docs/source/library-user-guide/upgrading/54.0.0.md`. Tagging this PR 
with the `api change` label.
   
   Library callers: `ListingTableUrl::contains` now matches the full relative 
path against the glob (rather than just the first segment) and uses 
`glob::MatchOptions { require_literal_separator: true, .. }`. Public signatures 
are unchanged.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to