jtuglu1 commented on PR #18880: URL: https://github.com/apache/druid/pull/18880#issuecomment-3721354011
> > 👍 A few questions:
> >
> > * Will this support reading a single column from a segment (instead of needing to download + scan entire segment)? I guess mapping offset ranges of a segment file is analogous to the row-group concept in Parquet.
>
> Definitely supporting partial downloads at the level of columns and/or projections is a goal of this format, and something it would enable doing.
>
> > * Are there any thoughts to make Druid formats Arrow-compatible? This would open up many more integrations with existing big data ecosystem externally, as well as making intra-cluster data transfer potentially much faster (send everything as RecordBatch).
>
> For intra-cluster data transfer, the MSQ query paths (which to me are the ones I want to focus on 😄) are using Frames, which are similar to Arrow in efficiency. For integrating with the big data ecosystem in ways that require actually using Arrow, there is a question about whether we're doing something for data in flight (RPC) or for data at rest (in object storage). For RPC I think an API that returns Arrow streams can make sense in theory. It wouldn't be related to the segment format, it would be more related to the query side. For data at rest, I don't know how much sense that makes. I haven't heard much of people using Arrow for data at rest.

Check out https://github.com/lancedb/lance!
