huan233usc opened a new pull request, #22839:
URL: https://github.com/apache/datafusion/pull/22839

   ## Which issue does this PR close?
   
   - Closes #9430.
   
   ## Rationale for this change
   
   Users frequently want to pipe data into the CLI, e.g. `cat data.csv | datafusion-cli`, but pointing `LOCATION` at `/dev/stdin` did not work:
   
   - CSV failed with `Illegal seek` (a pipe is not seekable).
   - Parquet failed with `file size of 0 is less than footer` (a pipe reports 
size 0).
   - JSON silently returned 0 rows.
   
   This PR makes reading from standard input work for CSV, JSON, and Parquet.
   
   ## What changes are included in this PR?
   
   stdin is exposed as a `stdin://` object store, dispatched alongside the 
other schemes (`s3`, `gs`, `http`, ...) in `get_object_store` — conceptually 
similar to DuckDB's `PipeFileSystem`.
   
   - `rewrite_stdin_location` maps the well-known stdin pseudo-paths 
(`/dev/stdin`, `/dev/fd/0`, `/proc/self/fd/0`) to a canonical 
`stdin:///stdin.<ext>` URL, so they flow through the normal 
object-store/listing code path. The extension matches the declared `STORED AS` 
format because the listing layer filters candidate files by extension.
   - The `stdin://` store reads all of standard input into an in-memory object 
store. Buffering up front is required because a pipe is not seekable and 
Parquet stores its metadata at the end of the file.
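
The mapping described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the signature (taking the extension as a parameter and returning `Option<String>`) and the `STDIN_PATHS` constant are assumptions. Only the function name, the three pseudo-paths, and the `stdin:///stdin.<ext>` URL shape come from the description above.

```rust
/// Pseudo-paths that refer to standard input on Unix-like systems.
/// (Constant name is hypothetical; the paths are those listed in the PR.)
const STDIN_PATHS: [&str; 3] = ["/dev/stdin", "/dev/fd/0", "/proc/self/fd/0"];

/// Sketch: map a stdin pseudo-path to a canonical `stdin://` URL.
///
/// `extension` is assumed to have been derived from the declared
/// `STORED AS` format (for example "csv", "json", or "parquet"), so the
/// listing layer's extension filter accepts the synthetic file name.
/// Returns `None` when `location` is not a stdin pseudo-path.
fn rewrite_stdin_location(location: &str, extension: &str) -> Option<String> {
    if STDIN_PATHS.contains(&location) {
        Some(format!("stdin:///stdin.{}", extension))
    } else {
        None
    }
}
```

Under these assumptions, `rewrite_stdin_location("/dev/stdin", "csv")` yields `stdin:///stdin.csv`, while an ordinary path such as `/tmp/data.csv` yields `None` and is left to the regular local-file code path.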
   
   Known scope/limitations (left as potential follow-ups):
   - Only `CREATE EXTERNAL TABLE` is supported (not dynamic `SELECT * FROM '/dev/stdin'`).
   - Input is fully buffered in memory, so it must fit in memory.
   - stdin can only be consumed once per session.
   - Unix-only (`/dev/stdin` does not exist on Windows); writing to 
`/dev/stdout` is out of scope.
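
The in-memory and consume-once limitations follow from how a pipe behaves. Here is a rough sketch using only the standard library; the helper names `buffer_all` and `parquet_metadata_len` are hypothetical and not part of this PR. A reader is drained once into memory, after which the Parquet footer can be located from the end of the buffer. Its final 8 bytes hold a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::io::{self, Read};

/// Drain a non-seekable reader (such as a pipe) into memory.
/// A second call on the same reader returns an empty buffer,
/// which is why stdin can only be consumed once.
fn buffer_all<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Read the Parquet footer metadata length from the last 8 bytes of a
/// fully buffered file. Returns `None` if the data is too short or does
/// not end with the `PAR1` magic. This needs random access to the end of
/// the data, which a pipe cannot provide without buffering.
fn parquet_metadata_len(data: &[u8]) -> Option<u32> {
    if data.len() < 8 || !data.ends_with(b"PAR1") {
        return None;
    }
    let tail = &data[data.len() - 8..];
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

In the CLI the reader would be `std::io::stdin()`; the sketch is generic over `Read` so that the same behaviour can be exercised with an in-memory `std::io::Cursor`.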
   
   ## Are these changes tested?
   
   Yes:
   - Unit tests in `object_storage.rs` cover `rewrite_stdin_location` and 
end-to-end reads for CSV, JSON, and Parquet via the in-memory store.
   - A `#[cfg(unix)]` integration test in `cli_integration.rs` drives the real 
binary through an actual pipe, exercising the real stdin read.
   - Manually verified all three formats via real pipes, and confirmed normal 
local-file reads are unaffected.
   
   ## Are there any user-facing changes?
   
   Yes — reading from stdin via `LOCATION '/dev/stdin'` is now supported. 
Documented in `docs/source/user-guide/cli/datasources.md` (new "Reading from 
standard input" section). No breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

