Alexx-G commented on issue #746: URL: https://github.com/apache/arrow-rs/issues/746#issuecomment-916702566
Thanks a lot, folks! ❤️

@houqp Can you please send an invite to `[email protected]`? Thanks! The example is really useful. I think a dynamic schema is out of scope for a PoC. I'd even require explicitly specifying all fields for formats such as Parquet.

@jorgecarleitao Thanks for the feedback! You're right, buffering and writing to Parquet should be separate phases. In this case, writing to Parquet (regardless of whether the target is a file or an in-memory buffer to send over the wire) is the last step, and it should be relatively easy.

@alamb This is super helpful! The MVP I'm interested in is relatively simple: buffer events in memory (at this point the format doesn't really matter), split the events into batches, serialize each batch into its own Parquet file, and store it in S3. Then I can use something like Athena to query the data. I suppose that individual batches _may_ have slightly different schemas, but it should work as long as the Athena table I create has columns that are common to all batches.

Currently, when there's a breaking change in the log format, I just change the version at the path level (e.g. `service=foo/logger=bar-v1/date=X/*.tar.gz` -> `service=foo/logger=bar-v2/date=X/*.tar.gz`). It's a bit inconvenient when analyzing a time period that includes both versions, but proper schema evolution and compatibility layers just add tons of complexity.
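
For illustration only, the batching and path-level versioning steps described in the comment could be sketched roughly as follows. This is not code from the thread: `Event`, `split_into_batches`, and `object_key` are hypothetical names, the `batch-NNNN.parquet` key layout is an assumption modelled on the `service=.../logger=...-vN/date=...` pattern, and the Parquet serialization and S3 upload are deliberately left out.

```rust
/// Hypothetical in-memory event; a real log record would carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp_ms: u64,
    pub message: String,
}

/// Split buffered events into batches of at most `batch_size` events.
/// Each batch would later be serialized into its own Parquet file.
pub fn split_into_batches(events: &[Event], batch_size: usize) -> Vec<Vec<Event>> {
    assert!(batch_size > 0, "batch_size must be positive");
    events
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Build the object key for one batch, with the format version encoded
/// in the logger path segment. The `batch-NNNN.parquet` file name is an
/// assumption made for this sketch, not a convention from the thread.
pub fn object_key(
    service: &str,
    logger: &str,
    version: u32,
    date: &str,
    batch_index: usize,
) -> String {
    format!(
        "service={}/logger={}-v{}/date={}/batch-{:04}.parquet",
        service, logger, version, date, batch_index
    )
}
```

With this shape, a breaking change in the log format only requires bumping `version`, so old and new batches land under different prefixes, which mirrors the `bar-v1` to `bar-v2` switch mentioned above.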
