corneliusroemer opened a new issue #1059: URL: https://github.com/apache/arrow-rs/issues/1059
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I would like to stream 100+ GB of SARS-CoV-2 sequence data into `.parquet` with zstd compression (which works really well on these sequences). I would like to do this without having to hard-code the schema, for example through a CLI like https://github.com/domoritz/csv2parquet/blob/main/src/main.rs

However, that CLI requires me to provide a `file` and does not allow me to read from `stdin`. Why? Because the reader builder requires its input to be seekable, which stdin is not.

**Describe the solution you'd like**

It'd be good if the reader builder were more flexible and could infer the schema from, say, the first 100 lines, which can still be kept in memory.

**Describe alternatives you've considered**

I could add a schema option to the CLI tool, but that's annoying and unnecessary because I just want a very simple schema: str/str. I could also do schema inference myself, but that is quite difficult and would be better provided by arrow-rs directly.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
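To make the request concrete, here is a minimal sketch of the "buffer the first N lines, infer, then replay" idea using only the Rust standard library. It is not arrow-rs code and does not reflect how the arrow-rs reader builder works: `ColType`, `infer_and_replay`, and `Replay` are hypothetical names, and the inference is deliberately naive (no quoting, delimiter-separated UTF-8 text with a header row assumed). The point is only that a non-seekable input such as stdin can be handled by keeping the peeked bytes in memory and chaining them in front of the rest of the stream.

```rust
use std::io::{self, BufRead, BufReader, Chain, Cursor, Read};

/// Toy column types for this sketch (hypothetical, not arrow-rs data types).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColType {
    Int,
    Float,
    Str,
}

/// Classify a single field by trying progressively looser parses.
fn classify(field: &str) -> ColType {
    if field.parse::<i64>().is_ok() {
        ColType::Int
    } else if field.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Str
    }
}

/// Merge two observed types into the narrowest type that fits both.
fn widen(a: ColType, b: ColType) -> ColType {
    match (a, b) {
        (ColType::Int, ColType::Int) => ColType::Int,
        (ColType::Str, _) | (_, ColType::Str) => ColType::Str,
        _ => ColType::Float,
    }
}

/// The peeked bytes, followed by whatever is left in the original stream.
pub type Replay<R> = Chain<Cursor<Vec<u8>>, BufReader<R>>;

/// Read a header line plus up to `max_lines` data lines from a non-seekable
/// reader, infer a (name, type) pair per column, and return a reader that
/// yields the complete input again, starting from the first byte.
pub fn infer_and_replay<R: Read>(
    input: R,
    delimiter: char,
    max_lines: usize,
) -> io::Result<(Vec<(String, ColType)>, Replay<R>)> {
    let mut reader = BufReader::new(input);
    let mut peeked: Vec<u8> = Vec::new(); // in-memory copy of the sampled lines
    let mut names: Vec<String> = Vec::new();
    let mut types: Vec<ColType> = Vec::new();
    let mut line = String::new();

    for line_no in 0..=max_lines {
        line.clear();
        // `read_line` returns an error on invalid UTF-8; this sketch assumes text input.
        if reader.read_line(&mut line)? == 0 {
            break; // end of input before the sample was full
        }
        peeked.extend_from_slice(line.as_bytes());
        let row = line.trim_end_matches(&['\r', '\n'][..]);
        if line_no == 0 {
            names = row.split(delimiter).map(str::to_owned).collect();
        } else {
            for (i, field) in row.split(delimiter).enumerate() {
                let t = classify(field);
                if i < types.len() {
                    types[i] = widen(types[i], t);
                } else {
                    types.push(t);
                }
            }
        }
    }

    // Columns that never appeared in a data row fall back to Str.
    types.resize(names.len(), ColType::Str);
    let schema = names.into_iter().zip(types).collect();

    // Chain the BufReader itself (not its inner reader) so that bytes it has
    // already buffered past the sampled lines are not lost.
    Ok((schema, Cursor::new(peeked).chain(reader)))
}
```

A caller could pass `std::io::stdin().lock()` as `input` and hand the returned reader to whatever consumes the rows; only the sampled lines are held in memory, so the size of the full stream does not matter.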
