corneliusroemer commented on issue #1059: URL: https://github.com/apache/arrow-rs/issues/1059#issuecomment-997162988
@jorgecarleitao I agree that implementing a custom `seekable` `Reader` would be necessary to solve this without dropping the `seekable` requirement. But I'm not sure your suggestion is the way forward, since we shouldn't read all of `stdin` into memory. Memory is even scarcer than storage.

Say you have 1TB of uncompressed data that compresses to 1GB. I can decompress the 1GB to storage and then read it in with a normal file reader (which is seekable). Problem: I need 1TB of disk space, plus the time to write it. What doesn't work is reading the 1TB into memory. No way. Alternative: read 1GB into memory, infer the schema, then stream.

Your suggestion seems to read it all into memory, doesn't it? When would you be allowed to drop the early parts of the buffer?

Couldn't one drop the whole back-buffer once one has gone beyond `max_read_records` in `builder = builder.infer_schema(opts.max_read_records);`? Once the maximum has been read, there is no need to seek anymore, so one can drop whatever is behind the current position.
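
To make the "drop the back-buffer once inference is done" idea concrete, here is a rough sketch of what such a wrapper might look like. It is only an illustration under my own assumptions, not existing arrow-rs code: `ReplayReader`, `discard_history`, and `demo` are hypothetical names, only `SeekFrom::Start` inside the retained bytes is supported, and a byte slice stands in for `stdin`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper: records bytes read from a non-seekable source so the
/// caller can seek back into them until `discard_history` is called.
struct ReplayReader<R: Read> {
    inner: R,
    history: Vec<u8>, // bytes retained for replay
    base: u64,        // absolute stream offset of history[0]
    pos: u64,         // current absolute read position
    keep: bool,       // whether freshly read bytes are still recorded
}

impl<R: Read> ReplayReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, history: Vec::new(), base: 0, pos: 0, keep: true }
    }

    /// Call once schema inference is finished: drop everything behind the
    /// current position and stop recording new bytes.
    fn discard_history(&mut self) {
        let consumed = (self.pos - self.base) as usize;
        self.history.drain(..consumed);
        self.base = self.pos;
        self.keep = false;
    }
}

impl<R: Read> Read for ReplayReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let offset = (self.pos - self.base) as usize;
        if offset < self.history.len() {
            // Replay retained bytes before touching the underlying source.
            let n = out.len().min(self.history.len() - offset);
            out[..n].copy_from_slice(&self.history[offset..offset + n]);
            self.pos += n as u64;
            return Ok(n);
        }
        let n = self.inner.read(out)?;
        self.pos += n as u64;
        if self.keep {
            self.history.extend_from_slice(&out[..n]);
        } else {
            // Past the retained region: free it and stream straight through.
            self.history = Vec::new();
            self.base = self.pos;
        }
        Ok(n)
    }
}

impl<R: Read> Seek for ReplayReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        let end = self.base + self.history.len() as u64;
        match target {
            SeekFrom::Start(p) if p >= self.base && p <= end => {
                self.pos = p;
                Ok(p)
            }
            _ => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "seek target is outside the retained history",
            )),
        }
    }
}

/// Usage sketch: sample a prefix, rewind, discard the history, then stream.
fn demo() -> io::Result<(String, String, bool)> {
    let source: &[u8] = b"a,b\n1,2\n3,4\n"; // stand-in for stdin
    let mut reader = ReplayReader::new(source);
    let mut sample = [0u8; 8];
    reader.read_exact(&mut sample)?; // the "schema inference" read
    reader.seek(SeekFrom::Start(0))?; // rewind to the beginning
    reader.discard_history(); // nothing behind us yet; stops recording
    let mut all = String::new();
    reader.read_to_string(&mut all)?;
    let rewind_fails = reader.seek(SeekFrom::Start(0)).is_err();
    Ok((String::from_utf8_lossy(&sample).into_owned(), all, rewind_fails))
}
```

With this shape, memory use is bounded by however many bytes schema inference consumed, not by the size of the whole input. The open question is whether the CSV reader ever seeks anywhere other than back to the start after inference; if it does, this sketch would need to support more than `SeekFrom::Start`.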
