I just started working with Drill, and I am a PMC member of Apache Arrow. I am in the process of writing my first storage plugin for Drill, and I think it would be interesting to build one for the Apache Arrow Flight protocol as a way for Drill to query Arrow data, although I'm not yet sure what compelling use cases there are for something like this.
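To make that concrete, here is a rough, hypothetical sketch of the kind of client-side fetch such a plugin's record reader might perform against a Flight server, using the Arrow Flight Java API. The host, port, and dataset path are made up, and all of the real plugin wiring (storage config, schema negotiation, converting batches into Drill's own representation) is left out:

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.FlightEndpoint;
import org.apache.arrow.flight.FlightInfo;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class FlightFetchSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // Hypothetical endpoint; a real plugin would get this from its storage config.
         FlightClient client = FlightClient.builder(
             allocator, Location.forGrpcInsecure("flight-server.example.com", 8815)).build()) {

      // Ask the server where the batches for a (made-up) dataset live.
      FlightInfo info = client.getInfo(FlightDescriptor.path("some_dataset"));

      // Pull each endpoint's stream of Arrow record batches. A Drill reader would
      // translate these into Drill batches (or, in a deeper integration, use them directly).
      for (FlightEndpoint endpoint : info.getEndpoints()) {
        try (FlightStream stream = client.getStream(endpoint.getTicket())) {
          while (stream.next()) {
            VectorSchemaRoot root = stream.getRoot();
            System.out.println("Received a batch of " + root.getRowCount() + " rows");
          }
        }
      }
    }
  }
}

The open question, of course, is whether converting those Arrow batches into Drill's value vectors would eat most of the interchange benefit, which is exactly the trade-off discussed in the thread below.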
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

On Sun, Jan 12, 2020 at 7:15 PM Charles Givre <cgi...@gmail.com> wrote:

> Hello All,
> Glad to see this discussion happening! I apologize for the long email.
>
> I thought I would contribute my .02 here, which is more strategic than technical. When I first heard about Arrow a few years ago, I was very excited about the promise it had for enabling data interchange. From what I understood at the time, the main advantage of Arrow was that a user could query data in platform X, then use that same data in memory in platform Y without having to serialize/deserialize it.
>
> From a data science perspective, this interoperability would be a huge win. Consider the case of building a machine learning model from complicated data. Let's say you are trying to do this from data that happens to be stored in MongoDB and CSV files. Using conventional methods, a data scientist would probably write Python code to get the data from Mongo and get it into a pandas dataframe. Then they would do the same for the CSV and join these data sets together, hoping that they don't blow up their machine in the process.
>
> Once this data has been processed, they'd do a bunch of feature engineering and other cleanup before the data gets piped into a machine learning module like TensorFlow or scikit-learn for training. Once the model has been trained, they would then have to create some script to recreate that pipeline to make predictions on unknown data. This is simplified a bit, but you get the idea.
>
> At the time, I was imagining that a user could use Drill to do all the data prep (i.e., joining the CSV with the Mongo data and associated cleanup) and then pass that to Python to train the model. Likewise, the inverse could happen: once the model is trained, it could be exported to some degree such that its output could be included in queries.
>
> However, in the years since Arrow's debut, I've honestly not seen the value of it. Furthermore, John Omernik created a series of really amazing integrations for Jupyter Notebook that enable a user to type a query into a notebook cell and automatically get the results straight into a pandas dataframe.
>
> So, where does this leave us?
> From my optic, at the moment, I'm not convinced that Arrow offers a significant performance advantage over Drill's current data representation. (Correct me if I'm wrong here.)
> Furthermore, as is evident from this email thread, converting Drill to use Arrow would be extremely complicated and would likely mean rewriting a majority of Drill's codebase.
>
> I am not convinced that the benefit would be worth the effort. BUT...
>
> (Forgive me here because these questions are a bit out of my technical depth on this subject), what if Drill could consume Arrow data FROM external sources and/or export data TO external sources? Something like a storage plugin for Arrow. My reasoning is that if the key benefit of Arrow is the data interchange, then is it possible to add that functionality to Drill without having to rewrite all of Drill? I think Paul suggested something similar to this approach in a previous thread with the title "Crude but Effective Approach". This approach would capture most of the benefit without most of the cost of the conversion.
> Admittedly, I suspect this would not be as performant as a pure Arrow approach, and I think there would have to be some experimentation to see how well this could actually work in practice.
>
> This is all based on the assumptions that fully integrating Arrow would be a major undertaking, that the data exchange is the key benefit, and that Arrow's performance is comparable to our current implementation. If these assumptions are incorrect, please let me know.
> Best,
> -- C
>
>
> On Jan 12, 2020, at 4:57 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> > Hi All,
> >
> > As you've seen, I've been suggesting that we consider multiple choices for our internal data representation beyond just the current value vector layout and the "obvious" Arrow layout. And, that we consider our options based on where we see Drill adding value in the open source community.
> >
> > Frankly, I'm not convinced that Arrow is the best solution. Arrow evolved from value vectors, by the same people who built value vectors. No doubt that Arrow is a more advanced, faster, higher-quality version of value vectors -- the Arrow team has put lots of effort into Arrow while Drill has more-or-less kept Value Vectors the same.
> >
> > So, we can agree that Arrow is likely better than value vectors. But is a direct-memory, columnar solution optimal for a tool like Drill? This is the real question we should ask.
> >
> > Why do we have value vectors and Arrow?
> >
> > * Industry experience shows that a columnar layout is more performant for some kinds of workloads. Parquet makes good use of the columnar format for data compression and to reduce I/O for unwanted columns.
> >
> > * Netty uses async I/O, which requires data to be in direct memory. Value vectors and Arrow thus store data in direct memory to avoid copies on network sends and receives.
> >
> > * Columnar layouts should be able to exploit CPU vector instructions. However, those instructions don't work with nullable data, so some clever work is needed. And, as noted earlier, Drill actually generates row-wise code, not column-wise.
> >
> > * The Java GC is a known evil (or was in Java 5 when Drill started). The Drill developers felt they could do a better job of memory allocation than Java could; hence our VERY complex memory allocator, buffers, ledgers and all the rest.
> >
> > * Java has no good way to efficiently lay out a row in memory as a generic data structure. Most row-based tools in Java use boxed objects for columns and an object array as the row. This seems an expensive solution.
> >
> > The case for vectors (Value or Arrow) seems compelling. But our job, as Drill leads, is to question and validate each of these assertions. The above are true, but they may not be the WHOLE truth.
> >
> > Columnar vs. row: As it turns out, Impala, Presto and Spark are all row based. Sadly, all three of these tools have more adoption than Drill. So, a row-based layout, per se, is not necessarily a bad thing.
> >
> > I saw a paper many years ago (which, alas, I can no longer find) that showed the results of experiments comparing columnar vs. row-based memory layouts. The gist was that, for numeric (analytic) queries with a small number of columns, columnar performed better. Beyond some number of columns, row-based seemed to be better.
> >
> > Rather than continue to pick at each point one by one (this e-mail is already too long), let's just offer an example.
> > Suppose we observe that 1) clients want a row-based data format, 2) most readers offer row-based data, 3) Drill uses Java code gen that processes data row-wise, 4) Java does have a very efficient row data structure: the class, and 5) Java offers a built-in, fragmentation-free memory allocator: the GC system.
> >
> > So, suppose we just generate a Java class that uses fields to hold data for columns. Now we can test the assertions about direct-memory columnar being faster. There is a class in Drill, PerformanceTool, which was used to compare column accessor performance vs. direct vector access. Let's re-purpose it to compare the column accessors (on value vectors) vs. simulated generated code that works with rows as Java classes.
> >
> > The test runs 300 batches of 4 million rows per batch (total of 1.23G rows). For each batch, it writes a single int to each row, finalizes the batch, then reads the int back from each row. That is, it simulates a very simple reader followed by a filter.
> >
> > First, we do this using vector accessors for a required and a nullable int and time the results (on an ancient Core i7 CPU, Ubuntu Linux):
> >
> > Required accessor: 10,434 ms (121M rows/sec)
> > Nullable accessor: 16,605 ms (76M rows/sec)
> >
> > The above makes sense: a nullable vector has a second "bits" vector that we must fiddle with. The numbers include all the work to allocate vectors, muck with ledgers and allocators, write to direct memory, read from direct memory, and all the rest.
> >
> > Now, let's do the same thing with Java. We "generate" a class to hold the int (and, for nullable, an "isSet" field). We use a Java array to hold the rows. Otherwise, the write-then-read pattern is the same. Now we get:
> >
> > Required object: 5,100 ms (247M rows/sec, 2x faster)
> > Nullable object: 6,656 ms (189M rows/sec, 2.5x faster)
> >
> > WTF? We get twice the speed WITHOUT any of our fancy direct memory, DrillBuf, allocator, ledger, vector, accessor, code gen work? Could it be true that the entire Java industry, working to perfect Java GC, has done a better job than the original Drill developers did with our memory allocator? One wonders how much better Arrow might be.
> >
> > Imagine how much simpler code gen will be if we work only with Java objects rather than with layers of holders, readers, vectors and buffers. Imagine how much simpler Drill would be if we scrap the allocators, ledgers and the rest. (And, yes, we would not even need much of the guts of the column accessors.)
> >
> > Imagine how much simpler UDFs will be if they just work with Java. Imagine how much simpler format and storage plugins could be. All of this would help Drill be the best query tool to integrate with custom applications, data sources and logic.
> >
> > Seems too easy. We must be overlooking something. For example, we have not measured the cost of network I/O. But remember that our exchanges are almost always hash or partition based. In a large cluster (100 nodes) with a beefy CPU (24 cores), each fragment must buffer an outgoing batch for each target fragment: that is 2,500 buffered batches (at 10-100 MB each) plus all the shuffling to move rows into the correct partition. Very little of that would be needed for rows based on Java objects. It could be that all our copying and buffering takes more resources than the copy-free approach saves. Could be that Java objects would transfer faster, with less memory overhead.
> > Still, can't be right, can it? IIRC, this is exactly the approach that Spark took before DataFrames, and Spark seemed to work out alright. Of course, the above test exercises only about 1% of Drill, and so is far from conclusive.
> >
> > Are Java objects the solution? I don't know, any more than I know that fixed-width blocks are the solution. I'm just asking the questions. The point is, we have to research and test to find what is best for the goals we identify for Drill.
> >
> > Thanks,
> > - Paul
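For what it's worth, the row-as-Java-object half of the experiment Paul describes is easy to reproduce in spirit. Below is a minimal, hypothetical sketch of it (this is not the actual PerformanceTool code; the Row class, batch count, and batch size are illustrative, and the value vector side of the comparison is omitted):

// Simulates the "generated" row class approach: a plain Java object per row,
// an array per batch, write an int into every row, then read it back.
public class RowObjectBench {

  // Stand-in for a code-generated row with one nullable int column.
  static final class Row {
    boolean isSet;   // null indicator, analogous to the "bits" vector
    int value;
  }

  public static void main(String[] args) {
    final int batches = 300;
    final int rowsPerBatch = 4_000_000;
    long checksum = 0;
    long start = System.nanoTime();

    for (int b = 0; b < batches; b++) {
      // "Allocate" a batch: plain objects, no DrillBuf, ledger, or allocator.
      Row[] batch = new Row[rowsPerBatch];
      for (int i = 0; i < rowsPerBatch; i++) {   // simulated reader: write each row
        Row row = new Row();
        row.isSet = true;
        row.value = i;
        batch[i] = row;
      }
      for (int i = 0; i < rowsPerBatch; i++) {   // simulated filter: read each row back
        if (batch[i].isSet) {
          checksum += batch[i].value;
        }
      }
      // The batch goes out of scope here; the GC reclaims it with no manual release.
    }

    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    long totalRows = (long) batches * rowsPerBatch;
    System.out.printf("%,d rows in %,d ms (%,d rows/sec), checksum %d%n",
        totalRows, elapsedMs, totalRows * 1000 / Math.max(elapsedMs, 1), checksum);
  }
}

A toy loop like this says nothing about exchanges, spilling, or memory pressure under concurrency, which is where the direct-memory design is supposed to pay off, so I would treat any numbers it produces as a prompt for more testing rather than a conclusion.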