jnturton commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1004749499
Paul Rogers wrote:

Hi All,

Thanks Charles for dredging up that old discussion; your memory is better than mine! And thanks Ted for that summary of MapR history. As one of the "replacement crew" brought in after the original folks left, I can say your description is consistent with my memory of events. Moreover, as we looked at what was needed to run Drill in production, an Arrow port was far down on the list: it would not have solved actual customer problems.

Before we get too excited about Arrow, I think we should have a discussion about what we want in an internal storage format. I added a long (sorry) set of comments in the PR that Charles mentioned, which tries to debunk the myths that have grown up around using a columnar format as the internal representation for a query engine. (Columnar is great for storage.) The note presents the many issues we've encountered over the years that have caused us to layer ever more code on top of vectors to solve various problems. It also highlights a distributed-systems problem which vectors make far worse.

Arrow is meant to be portable, as Ted discussed, but it is still columnar, and this is the source of endless problems in an execution engine. So, we want to ask: what is the optimal format for what Drill actually does? I'm now of the opinion that Drill might benefit more from a row-based format, similar to what Impala uses. The notes even paint a path forward.

Ted's description of the goal for Dremio suggests that Arrow might be the right answer for that market. Drill, however, tends to be used to query myriad data sources at scale and as a "query integrator" across systems. This use case has different needs, which may be better served with a row-based format.

The upshot is that "value vectors vs. Arrow" is the wrong place to start the discussion. The right place is "what do our many years of experience with Drill suggest is the most efficient format for how Drill is actually used?"

Note that Drill could have an Arrow-based API independent of the internal format. The quote from Charles explains how we could do that.

Thanks,

- Paul