[GitHub] [arrow-rs] Alexx-G commented on issue #746: Examples (or guidance) on parquet usage

GitBox Fri, 10 Sep 2021 00:45:08 -0700


Alexx-G commented on issue #746:
URL: https://github.com/apache/arrow-rs/issues/746#issuecomment-916702566



   Thanks a lot folks! ❤️ 
   
   @houqp Can you please send an invite to `[email protected]`? Thanks! The 
example is really useful. I think a dynamic schema is out of scope for a PoC. 
I'd even require all fields to be specified explicitly for formats such as 
Parquet.
   
   @jorgecarleitao Thanks for the feedback! You're right, buffering and writing 
to parquet should be separate phases. In this case writing to parquet 
(regardless of whether the target is a file or an in-memory buffer to send 
over the wire) is the last step, and it should be relatively easy.
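   
   To make the two phases concrete, here is a rough std-only sketch. The 
`Event` struct, `EventBuffer` and `write_events` are hypothetical names I made 
up for illustration, and the tab-separated output is only a stand-in for a 
real Parquet writer. The point is just that the write step takes any 
`std::io::Write`, so a file and an in-memory buffer look the same to it.

```rust
use std::io::{self, Write};

/// Hypothetical event type; the real log record layout is not specified here.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp_ms: u64,
    pub message: String,
}

/// Phase 1: buffering. Nothing in here knows about the output format.
#[derive(Default)]
pub struct EventBuffer {
    events: Vec<Event>,
}

impl EventBuffer {
    pub fn push(&mut self, event: Event) {
        self.events.push(event);
    }

    /// Hand over everything buffered so far and leave the buffer empty.
    pub fn drain(&mut self) -> Vec<Event> {
        std::mem::take(&mut self.events)
    }
}

/// Phase 2: writing. The sink is any `Write`, e.g. a `File` or a `Vec<u8>`.
/// The tab-separated encoding is a placeholder for a Parquet serializer.
pub fn write_events<W: Write>(events: &[Event], sink: &mut W) -> io::Result<()> {
    for e in events {
        writeln!(sink, "{}\t{}", e.timestamp_ms, e.message)?;
    }
    sink.flush()
}
```

   Passing `&mut Vec::<u8>::new()` as the sink gives the bytes to send over the 
wire, while passing a `std::fs::File` writes them to disk, with no change to 
the buffering code.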
   
   @alamb This is super helpful! The MVP I'm interested in is relatively 
simple: buffer events in memory (at this point the format doesn't really 
matter), split the events into batches, then serialize each batch into its own 
parquet file and store it in S3. Then I can use something like Athena to query 
the data. I suppose that individual batches _may_ have slightly different 
schemas, but it should work as long as the Athena table I create declares only 
the columns common to all batches. Currently, when there's a breaking change 
in the log format, I just bump the version at the path level (e.g. 
`service=foo/logger=bar-v1/date=X/*.tar.gz` -> 
`service=foo/logger=bar-v2/date=X/*.tar.gz`). It's a bit inconvenient when 
analyzing a time period that includes both versions, but proper schema 
evolution and compatibility layers just add tons of complexity.
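   
   A rough sketch of the batching and path-level versioning, std-only. The 
function names, the fixed batch size and the `batch-NNNNN.parquet` file name 
are assumptions for illustration; only the `service=/logger=-vN/date=` layout 
comes from my current setup.

```rust
/// Split buffered events into batches of at most `batch_size` items.
/// Each returned batch would become one Parquet file.
pub fn split_into_batches<T: Clone>(events: &[T], batch_size: usize) -> Vec<Vec<T>> {
    assert!(batch_size > 0, "batch_size must be positive");
    events.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Build the object key for one batch. A breaking change in the log format
/// is handled by bumping `version`, which changes the `logger=` path segment.
/// The `batch-NNNNN.parquet` file name is a hypothetical convention.
pub fn object_key(
    service: &str,
    logger: &str,
    version: u32,
    date: &str,
    batch_index: usize,
) -> String {
    format!(
        "service={}/logger={}-v{}/date={}/batch-{:05}.parquet",
        service, logger, version, date, batch_index
    )
}
```

   With this layout, a query spanning both `v1` and `v2` still has to read two 
prefixes, which is the inconvenience mentioned above.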


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
