jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004749499


   Paul Rogers wrote:
   
   Hi All,
   
   Thanks, Charles, for dredging up that old discussion; your memory is better
   than mine! And thanks, Ted, for that summary of MapR history. As one of the
   "replacement crew" brought in after the original folks left, your
   description is consistent with my memory of events. Moreover, as we looked
   at what was needed to run Drill in production, an Arrow port was far down
   on the list: it would not have solved actual customer problems.
   
   Before we get too excited about Arrow, I think we should have a discussion
   about what we want in an internal storage format. I added a long (sorry)
   set of comments in that PR that Charles mentioned that tries to debunk the
   myths that have grown up around using a columnar format as the internal
   representation for a query engine. (Columnar is great for storage.) The
   note presents the many issues we've encountered over the years that have
   caused us to layer ever more code on top of vectors to solve various
   problems. It also highlights a distributed-systems problem which vectors
   make far worse.
   
   Arrow is meant to be portable, as Ted discussed, but it is still columnar,
   and this is the source of endless problems in an execution engine. So, we
   want to ask, what is the optimal format for what Drill actually does? I'm
   now of the opinion that Drill might actually benefit more from a
   row-based format, similar to what Impala uses. The notes even paint a path
   forward.
   
   Ted's description of the goal for Dremio suggests that Arrow might be the
   right answer for that market. Drill, however, tends to be used to query
   myriad data sources at scale and as a "query integrator" across systems.
   This use case has different needs, which may be better served with a
   row-based format.
   
   The upshot is that "value vectors vs. Arrow" is the wrong place to start
   the discussion. The right place is "what do our many years of experience
   with Drill suggest is the most efficient format for how Drill is actually
   used?"
   
   Note that Drill could have an Arrow-based API independent of the internal
   format. The quote from Charles explains how we could do that.
   
   Thanks,
   
   - Paul
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

