[GitHub] [arrow-rs] corneliusroemer commented on issue #1059: Allow csv reader builder to do schema inference even when reading csv from stdin

GitBox Fri, 17 Dec 2021 23:46:47 -0800


corneliusroemer commented on issue #1059:
URL: https://github.com/apache/arrow-rs/issues/1059#issuecomment-997162988



   @jorgecarleitao I agree that implementing a custom `seekable` `Reader` would 
be necessary to solve this without dropping the `seekable` requirement. But I'm 
not sure your suggestion is the way forward since we shouldn't read all of 
`stdin` into memory. Memory is even scarcer than storage.
   
   Say you have 1TB of uncompressed data that compresses to 1GB. I can 
decompress the 1GB to storage, then read it in with a normal file reader (which 
is seekable). Problem: I need 1TB of disk space, plus the time to write it.
   
   What doesn't work is reading the 1TB into memory. No way.
   
   Alternative: read 1GB into memory, infer schema, then stream.
   
   Your suggestion seems to read it all into memory, doesn't it? When would 
you be allowed to drop early parts of the buffer?
   
   Couldn't one drop the whole back-buffer once one has gone beyond 
`max_read_records` in `builder = builder.infer_schema(opts.max_read_records);`? 
Once the max has been read, there's no need to seek anymore, so one can drop 
whatever is behind the current position.
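
   To make the idea concrete, here is a rough sketch using only `std::io`. 
Everything in it is hypothetical: `PrefixReplay`, `finish_inference` and 
`buffered_len` are names I made up, not arrow-rs API, and it does not implement 
`Seek`. It assumes the caller signals explicitly when inference is done. The 
wrapper records whatever is read during inference, replays it once from the 
start, and frees the recording as soon as the replay has passed it, so memory 
use is bounded by what inference consumed.

```rust
use std::io::{self, Read};

/// Hypothetical wrapper (not part of arrow-rs): records what is read during
/// schema inference so it can be replayed once, then frees the recording.
pub struct PrefixReplay<R: Read> {
    inner: R,
    prefix: Vec<u8>, // bytes consumed while inferring the schema
    pos: usize,      // replay position inside `prefix`
    recording: bool, // true until inference is finished
}

impl<R: Read> PrefixReplay<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, prefix: Vec::new(), pos: 0, recording: true }
    }

    /// Call after inference: rewind to the first byte and stop recording.
    pub fn finish_inference(&mut self) {
        self.recording = false;
        self.pos = 0;
    }

    /// Number of bytes currently held in memory for replay.
    pub fn buffered_len(&self) -> usize {
        self.prefix.len()
    }
}

impl<R: Read> Read for PrefixReplay<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.pos < self.prefix.len() {
            // Replay phase: serve bytes that were recorded during inference.
            let n = buf.len().min(self.prefix.len() - self.pos);
            buf[..n].copy_from_slice(&self.prefix[self.pos..self.pos + n]);
            self.pos += n;
            if !self.recording && self.pos == self.prefix.len() {
                // Replay is past the recorded prefix: drop the back-buffer.
                self.prefix = Vec::new();
                self.pos = 0;
            }
            return Ok(n);
        }
        // Nothing left to replay: read from the underlying stream.
        let n = self.inner.read(buf)?;
        if self.recording {
            // Still inferring: keep a copy so it can be replayed later.
            self.prefix.extend_from_slice(&buf[..n]);
            self.pos += n;
        }
        Ok(n)
    }
}
```

   Usage would be: wrap `stdin` in `PrefixReplay::new`, let inference read its 
`max_read_records` rows through it, call `finish_inference()`, then hand the 
same reader to the streaming CSV reader. Whether this can be plugged into the 
builder without loosening the `Seek` bound is a separate question.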


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
