huan233usc opened a new pull request, #22839: URL: https://github.com/apache/datafusion/pull/22839
## Which issue does this PR close?

- Closes #9430.

## Rationale for this change

Users frequently want to pipe data into the CLI, e.g. `cat data.csv | datafusion-cli`, but pointing `LOCATION` at `/dev/stdin` did not work:

- CSV failed with `Illegal seek` (a pipe is not seekable).
- Parquet failed with `file size of 0 is less than footer` (a pipe reports size 0).
- JSON silently returned 0 rows.

This PR makes reading from standard input work for CSV, JSON, and Parquet.

## What changes are included in this PR?

stdin is exposed as a `stdin://` object store, dispatched alongside the other schemes (`s3`, `gs`, `http`, ...) in `get_object_store`. This is conceptually similar to DuckDB's `PipeFileSystem`.

- `rewrite_stdin_location` maps the well-known stdin pseudo-paths (`/dev/stdin`, `/dev/fd/0`, `/proc/self/fd/0`) to a canonical `stdin:///stdin.<ext>` URL, so they flow through the normal object-store/listing code path. The extension matches the declared `STORED AS` format because the listing layer filters candidate files by extension.
- The `stdin://` store reads all of standard input into an in-memory object store. Buffering up front is required because a pipe is not seekable and Parquet stores its metadata at the end of the file.

Known scope/limitations (left as potential follow-ups):

- Only `CREATE EXTERNAL TABLE` is supported (not dynamic `SELECT * FROM '/dev/stdin'`).
- Input is fully buffered in memory, so it must fit in memory.
- stdin can only be consumed once per session.
- Unix-only (`/dev/stdin` does not exist on Windows); writing to `/dev/stdout` is out of scope.

## Are these changes tested?

Yes:

- Unit tests in `object_storage.rs` cover `rewrite_stdin_location` and end-to-end reads for CSV, JSON, and Parquet via the in-memory store.
- A `#[cfg(unix)]` integration test in `cli_integration.rs` drives the real binary through an actual pipe, exercising the real stdin read.
- Manually verified all three formats via real pipes, and confirmed normal local-file reads are unaffected.

## Are there any user-facing changes?

Yes: reading from stdin via `LOCATION '/dev/stdin'` is now supported. Documented in `docs/source/user-guide/cli/datasources.md` (new "Reading from standard input" section).

No breaking changes.
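
For readers who want a concrete picture of the path rewrite described under "What changes are included in this PR?", here is a minimal sketch using only the standard library. The signature, the `Option<String>` return type, and deriving the extension by lowercasing the `STORED AS` format are assumptions made for illustration; the actual `rewrite_stdin_location` in the PR may differ.

```rust
/// Hypothetical sketch (not the PR's code): map a well-known stdin
/// pseudo-path to a canonical `stdin:///stdin.<ext>` URL.
/// `file_type` is the declared `STORED AS` format, e.g. "CSV".
/// Returns `None` when `location` is not a stdin pseudo-path.
fn rewrite_stdin_location(location: &str, file_type: &str) -> Option<String> {
    // The pseudo-paths named in the PR description.
    const STDIN_PATHS: [&str; 3] = ["/dev/stdin", "/dev/fd/0", "/proc/self/fd/0"];

    if !STDIN_PATHS.iter().any(|p| *p == location) {
        return None;
    }

    // Assumption: the extension is the lowercased format name, so that a
    // listing layer filtering by extension still accepts the object.
    let ext = file_type.to_ascii_lowercase();
    Some(format!("stdin:///stdin.{}", ext))
}
```

Under these assumptions, `rewrite_stdin_location("/dev/stdin", "CSV")` yields `stdin:///stdin.csv`, while an ordinary path such as `data.csv` is returned as `None` and left to the normal local-file handling.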

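
The reason for buffering stdin up front can be illustrated with a small standard-library sketch. A Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`, so a reader must look at the end of the data first, which a pipe cannot offer without being drained. The helper names `buffer_all` and `parquet_footer_len` are hypothetical and are not taken from the PR.

```rust
use std::io::{self, Read};

/// Drain a non-seekable reader (such as a pipe) fully into memory.
fn buffer_all<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Read the footer metadata length from the last 8 bytes of a buffered
/// Parquet file: a 4-byte little-endian length, then the magic `PAR1`.
/// Returns `None` if the data is too short or the magic is missing.
fn parquet_footer_len(data: &[u8]) -> Option<u32> {
    if data.len() < 8 || !data.ends_with(b"PAR1") {
        return None;
    }
    let tail = &data[data.len() - 8..];
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

Once the bytes are in memory, random access to the tail is trivial, which is why an in-memory store works for Parquet where a raw pipe does not. The cost is the limitation noted above: the whole input must fit in memory.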