adriangb opened a new pull request, #22383:
URL: https://github.com/apache/datafusion/pull/22383
## Which issue does this PR close?
- Closes #. <!-- TODO: link tracking issue -->
## Rationale for this change
DuckDB supports rich glob expressions in file paths (`SELECT * FROM
'/data/**/x.parquet'`, `CREATE EXTERNAL TABLE … LOCATION
'/data/**/x.parquet'`), and DataFusion users routinely expect the same. Today
DataFusion has *most* of the glob infrastructure (`glob::Pattern`,
`ListingTableUrl` prefix splitting), but the matching semantics diverge from
DuckDB in two important ways:
1. **`*` crosses `/`.** DataFusion uses the `glob` crate's default
`MatchOptions` (`require_literal_separator: false`), so `*.parquet` will
silently match `subdir/3.parquet` under `listing_table_ignore_subdirectory =
false`. DuckDB only crosses `/` for `**`.
2. **Recursive globs silently return wrong results under the default
config.** With `listing_table_ignore_subdirectory = true` (default) and an
explicit glob, `ListingTableUrl::contains` only matched the glob against the
*first* path segment. That meant `/data/**/x.parquet` returned only top-level
matches and silently dropped everything under any subdirectory — no error, just
wrong counts.
3. Globs are not detected at all inside scheme URLs, so
`s3://bucket/**/x.parquet` was treated as a literal object name.
This PR brings DataFusion's listing-table glob semantics into line with
DuckDB (verified empirically against duckdb v1.5.2 on the same patterns and
paths) for both filesystem paths and scheme URLs.
## What changes are included in this PR?
Five small, reviewable, stacked commits, each builds green on its own:
1. **`refactor(datasource)`** — Extract `split_glob_expression` (and its
unit tests) from `url.rs` into a new
`datafusion/datasource/src/listing_glob.rs` module. No behaviour change; sets
up the subsequent commits.
2. **`feat(datasource): DuckDB-compatible glob semantics`** — Add
`LISTING_GLOB_MATCH_OPTIONS` (`require_literal_separator: true`) and
`normalize_glob_separators` (Windows `\` → `/`). Rewrite
`ListingTableUrl::contains` to match the **full relative path** with these
options when an explicit glob is present, and to ignore
`listing_table_ignore_subdirectory` (the pattern is authoritative). Stop
injecting an auto-glob in `ListingTableFactory` for folder LOCATIONs — that
injection broke `ignore_subdirectory=false` recursion under the new
`*`-doesn't-cross-`/` semantics; the extension filter on the listing options
does the same job without breaking recursion.
3. **`feat(datasource): detect globs in scheme URLs`** — Add
`find_url_path_start` (handles IPv6 hosts via `[...]` bracket tracking) and a
new `parse_url` path that glob-splits the raw URL path *before* `Url::parse`
runs (so glob characters like `?` don't get consumed as the URL query
delimiter). The URL path is sliced at the first `?` or `#` so HTTP-style
queries are preserved.
4. **`test(sqllogictest)`** — New
`datafusion/sqllogictest/test_files/glob.slt` end-to-end exercising `*`
not-crossing-`/`, `**` recursive, single-segment `*/dir`, partition-style
`year=*/x.parquet`, character classes, and `?` — through `CREATE EXTERNAL
TABLE`. Equivalent dynamic-file (`SELECT * FROM '<glob>'`) coverage added to
`dynamic_file.slt` (which carries the `enable_url_table` SessionContext flag).
5. **`docs`** — Rewrite the (previously incorrect) wildcard section in
`docs/source/user-guide/cli/datasources.md` with the supported syntax,
examples, and the `?`-vs-URL-query caveat. Upgrade-guide entry in `54.0.0.md`.
### Semantics summary
| Wildcard | Meaning
|
| ---------------- |
--------------------------------------------------------------------------------------------------
|
| `*` | any chars within a single segment; does **not** cross
`/` |
| `**` | zero or more directory levels
|
| `?` | exactly one non-separator char. Filesystem paths only —
reserved for URL query delimiter in scheme URLs |
| `[abc]`, `[a-z]` | one char from class
|
| `key=value` dirs | matched literally, so `year=*/x.parquet` works
|
### Justifiable differences from DuckDB
- **`?` not supported in scheme URLs** — it is the URL query delimiter and
the ambiguity with HTTP-style queries (`https://host/p?a=b`) is worse than the
limitation. Use `[abc]` or `*` for single-char matching in remote globs. `?`
continues to work as a glob in scheme-less filesystem paths.
- **Braces `{a,b}` not supported** — DuckDB doesn't support them either; use
separate tables or `UNION ALL`.
### Breaking change
`*` no longer crosses `/`. Blast radius is small (only manifests under
non-default `listing_table_ignore_subdirectory = false`, and the previous
behaviour was already non-standard and DuckDB-incompatible). Existing
`parquet.slt` glob test updated and a sibling `**/*.parquet` test added that
returns the full set. The upgrade guide has migration guidance.
## Are these changes tested?
Yes:
- **Rust unit tests** in `listing_glob.rs` cover
`LISTING_GLOB_MATCH_OPTIONS` (cross-checked against duckdb v1.5.2 directly),
`find_url_path_start` (s3, file, http, IPv6), `split_glob_expression`, and
separator normalisation.
- **`test_url_contains_glob`** and **`test_url_parse_glob_in_scheme_urls`**
in `datafusion/datasource/src/mod.rs` cover the public-API `contains()` and
`parse()` end-to-end including partition-dir globs, percent-encoded literals,
URL-query preservation, and IPv6 hosts.
- **`glob.slt`** is a new end-to-end sqllogictest covering all DuckDB-target
cases through `CREATE EXTERNAL TABLE`.
- **`dynamic_file.slt`** extended with the `SELECT * FROM '<glob>'`
equivalents.
- **Factory unit tests**
(`test_create_using_folder_{with,without}_compression`,
`test_odd_directory_names`) updated to assert the new contract — folder
LOCATIONs set `file_extension` on the listing options and leave `url.glob` as
`None`.
- **Empirical end-to-end** against the built `datafusion-cli` on a nested
local filesystem fixture, every count matched DuckDB's reference output.
## Are there any user-facing changes?
Yes — see the **Breaking change** section above and the new upgrade-guide
entry at `docs/source/library-user-guide/upgrading/54.0.0.md`. Tagging this PR
with the `api change` label.
Library callers: `ListingTableUrl::contains` now matches the full relative
path against the glob (rather than just the first segment) and uses
`glob::MatchOptions { require_literal_separator: true, .. }`. Public signatures
are unchanged.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]