corneliusroemer opened a new issue #1059:
URL: https://github.com/apache/arrow-rs/issues/1059


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I would like to stream 100+ GB of SARS-CoV-2 sequence data into Parquet 
files with zstd compression (which works really well on these sequences).
   
   I would like to do this without having to hard-code the schema, for example 
through a CLI like https://github.com/domoritz/csv2parquet/blob/main/src/main.rs
   
   However, that CLI requires me to provide a `file` and does not allow me to 
read from `stdin`. The reason is that the reader builder requires its input to 
be seekable, which `stdin` is not.
   
   **Describe the solution you'd like**
   It would be good if the reader builder were more flexible and could infer 
the schema from the first, say, 100 lines, which are small enough to keep in 
memory.
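
   A minimal sketch of the buffering idea, using only the Rust standard 
library and not the arrow-rs API: read the first N lines into memory, inspect 
them, then replay them in front of the rest of the stream so that no seeking 
is needed. The helper name `peek_lines` and the column-counting step are 
hypothetical and stand in for real schema inference; the in-memory `Cursor` 
stands in for `stdin`.

```rust
use std::io::{self, BufRead, Chain, Cursor, Read};

/// Hypothetical helper: reads up to `max_lines` lines from `reader` into
/// memory and returns them together with a reader that replays those bytes
/// before continuing with the rest of the (non-seekable) stream.
fn peek_lines<R: BufRead>(
    mut reader: R,
    max_lines: usize,
) -> io::Result<(Vec<u8>, Chain<Cursor<Vec<u8>>, R>)> {
    let mut head = Vec::new();
    for _ in 0..max_lines {
        // `read_until` returns 0 once the input is exhausted.
        if reader.read_until(b'\n', &mut head)? == 0 {
            break;
        }
    }
    // Replay the sampled bytes first, then the untouched remainder.
    let replay = Cursor::new(head.clone()).chain(reader);
    Ok((head, replay))
}

fn main() -> io::Result<()> {
    // Stand-in for stdin; `io::stdin().lock()` can be passed the same way.
    let input = Cursor::new(b"strain,sequence\nA,ACGT\nB,TTGA\n".to_vec());
    let (head, mut full) = peek_lines(input, 2)?;

    // Placeholder for schema inference: count the columns in the first line.
    let first_line = head.split(|&b| b == b'\n').next().unwrap_or_default();
    let columns = first_line.split(|&b| b == b',').count();
    println!("sampled {} bytes, {} column(s)", head.len(), columns);

    // A real converter would hand `full` to a CSV reader at this point.
    let mut all = Vec::new();
    full.read_to_end(&mut all)?;
    println!("streamed {} bytes in total", all.len());
    Ok(())
}
```

   Only the sampled lines are held in memory; the rest of the input is read 
once, front to back.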
   
   **Describe alternatives you've considered**
   I could add a schema option to the CLI tool, but that is annoying and 
unnecessary because I just want a very simple schema: str/str. I could also do 
schema inference myself, but that is quite difficult and would be better 
provided by arrow-rs directly.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.