I just started working with Drill, and I am a PMC member of Apache Arrow. I am in the process of writing my first storage plugin for Drill, and I think it would be interesting to build one for the Apache Arrow Flight protocol as a way for Drill to query Arrow data, although I'm not yet sure what compelling use cases there are for something like this.
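To make that concrete, here is a rough, hypothetical sketch of the kind of client-side fetch such a plugin's record reader might perform against a Flight server, using the Arrow Flight Java API. The host, port, and dataset path are made up, and all of the real plugin wiring (storage config, schema negotiation, converting batches into Drill's own representation) is left out:

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.FlightEndpoint;
import org.apache.arrow.flight.FlightInfo;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class FlightFetchSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // Hypothetical endpoint; a real plugin would get this from its storage config.
         FlightClient client = FlightClient.builder(
             allocator, Location.forGrpcInsecure("flight-server.example.com", 8815)).build()) {

      // Ask the server where the batches for a (made-up) dataset live.
      FlightInfo info = client.getInfo(FlightDescriptor.path("some_dataset"));

      // Pull each endpoint's stream of Arrow record batches. A Drill reader would
      // translate these into Drill batches (or, in a deeper integration, use them directly).
      for (FlightEndpoint endpoint : info.getEndpoints()) {
        try (FlightStream stream = client.getStream(endpoint.getTicket())) {
          while (stream.next()) {
            VectorSchemaRoot root = stream.getRoot();
            System.out.println("Received a batch of " + root.getRowCount() + " rows");
          }
        }
      }
    }
  }
}

The open question, of course, is whether converting those Arrow batches into Drill's value vectors would eat most of the interchange benefit, which is exactly the trade-off discussed in the thread below.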
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

On Sun, Jan 12, 2020 at 7:15 PM Charles Givre <cgi...@gmail.com> wrote:

> Hello All,
> Glad to see this discussion happening! I apologize for the long email.
>
> I thought I would contribute my .02 here, which is more strategic than technical. When I first heard about Arrow a few years ago, I was very excited about the promise it had for enabling data interchange. From what I understood at the time, the main advantage of Arrow was that a user could query data in platform X, then use that same data in memory in platform Y without having to serialize/deserialize it.
>
> From a data science perspective, this interoperability would be a huge win. Consider the case of building a machine learning model from complicated data. Let's say you are trying to do this from data that happens to be stored in MongoDB and CSV files. Using conventional methods, a data scientist would probably write Python code to get the data from Mongo and get it into a pandas dataframe. Then they would do the same for the CSV and join these data sets together, hoping that they don't blow up their machine in the process.
>
> Once this data has been processed, they'd do a bunch of feature engineering and other cleanup before the data gets piped into a machine learning module like TensorFlow or scikit-learn for training. Once the model has been trained, they would then have to create some script to recreate that pipeline to make predictions on unknown data. This is simplified a bit, but you get the idea.
>
> At the time, I was imagining that a user could use Drill to do all the data prep (i.e., joining the CSV with the Mongo data and associated cleanup) and then pass that to Python to train the model. Likewise, the inverse could happen: once the model is trained, it could be exported to some degree such that its output could be included in queries.
>
> However, in the years since Arrow's debut, I've honestly not seen the value of it. Furthermore, John Omernik created a series of really amazing integrations for Jupyter Notebook that enable a user to type a query into a notebook cell and automatically get the results straight into a pandas dataframe.
>
> So, where does this leave us?
> From my optic, at the moment, I'm not convinced that Arrow offers a significant performance advantage over Drill's current data representation. (Correct me if I'm wrong here.)
> Furthermore, as is evident from this email thread, converting Drill to use Arrow would be extremely complicated and would likely mean rewriting a majority of Drill's codebase.
>
> I am not convinced that the benefit would be worth the effort. BUT...
>
> (Forgive me here because these questions are a bit out of my technical depth on this subject), what if Drill could consume Arrow data FROM external sources and/or export data TO external sources? Something like a storage plugin for Arrow. My reasoning is that if the key benefit of Arrow is the data interchange, then is it possible to add that functionality to Drill without having to rewrite all of Drill? I think Paul suggested something similar to this approach in a previous thread with the title "Crude but Effective Approach". This approach would capture most of the benefit without most of the cost of the conversion.
> Admittedly, I suspect this would not be as performant as a pure Arrow approach, and I think there would have to be some experimentation to see how well this could actually work in practice.
>
> This is all based on the assumptions that fully integrating Arrow would be a major undertaking, that the data exchange is the key benefit, and that Arrow's performance is comparable to our current implementation. If these assumptions are incorrect, please let me know.
> Best,
> -- C
>
>
> On Jan 12, 2020, at 4:57 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
>
> > Hi All,
> >
> > As you've seen, I've been suggesting that we consider multiple choices for our internal data representation beyond just the current value vector layout and the "obvious" Arrow layout. And, that we consider our options based on where we see Drill adding value in the open source community.
> >
> > Frankly, I'm not convinced that Arrow is the best solution. Arrow evolved from value vectors, by the same people who built value vectors. No doubt that Arrow is a more advanced, faster, higher-quality version of value vectors -- the Arrow team has put lots of effort into Arrow while Drill has more-or-less kept Value Vectors the same.
> >
> > So, we can agree that Arrow is likely better than value vectors. But is a direct-memory, columnar solution optimal for a tool like Drill? This is the real question we should ask.
> >
> > Why do we have value vectors and Arrow?
> >
> > * Industry experience shows that a columnar layout is more performant for some kinds of workloads. Parquet makes good use of the columnar format for data compression and to reduce I/O for unwanted columns.
> >
> > * Netty uses async I/O, which requires data to be in direct memory. Value vectors and Arrow thus store data in direct memory to avoid copies on network sends and receives.
> >
> > * Columnar layouts should be able to exploit CPU vector instructions. However, those instructions don't work with nullable data, so some clever work is needed. And, as noted earlier, Drill actually generates row-wise code, not column-wise.
> >
> > * The Java GC is a known evil (or was in Java 5 when Drill started). The Drill developers felt they could do a better job of memory allocation than Java could; hence our VERY complex memory allocator, buffers, ledgers and all the rest.
> >
> > * Java has no good way to efficiently lay out a row in memory as a generic data structure. Most row-based tools in Java use boxed objects for columns and an object array as the row. This seems an expensive solution.
> >
> > The case for vectors (Value or Arrow) seems compelling. But our job, as Drill leads, is to question and validate each of these assertions. The above are true, but they may not be the WHOLE truth.
> >
> > Columnar vs. row: As it turns out, Impala, Presto and Spark are all row based. Sadly, all three of these tools have more adoption than Drill. So, a row-based layout, per se, is not necessarily a bad thing.
> >
> > I saw a paper many years ago (which, alas, I can no longer find) that showed the results of experiments comparing columnar vs. row-based memory layouts. The gist was that, for numeric (analytic) queries with a small number of columns, columnar performed better. Beyond some number of columns, row-based seemed to be better.
> >
> > Rather than continue to pick at each point one by one (this e-mail is already too long), let's just offer an example.
> > Suppose we observe that 1) clients want a row-based data format, 2) most readers offer row-based data, 3) Drill uses Java code gen that processes data row-wise, 4) Java does have a very efficient row data structure: the class, and 5) Java offers a built-in, fragmentation-free memory allocator: the GC system.
> >
> > So, suppose we just generate a Java class that uses fields to hold data for columns. Now we can test the assertions about direct-memory columnar being faster. There is a class in Drill, PerformanceTool, which was used to compare column accessor performance vs. direct vector access. Let's re-purpose it to compare the column accessors (on value vectors) vs. simulated generated code that works with rows as Java classes.
> >
> > The test runs 300 batches of 4 million rows per batch (total of 1.23G rows). For each batch, it writes a single int to each row, finalizes the batch, then reads the int back from each row. That is, it simulates a very simple reader followed by a filter.
> >
> > First, we do this using vector accessors for a required and a nullable int and time the results (on an ancient Core i7 CPU, Ubuntu Linux):
> >
> > Required accessor: 10,434 ms (121M rows/sec)
> > Nullable accessor: 16,605 ms (76M rows/sec)
> >
> > The above makes sense: a nullable vector has a second "bits" vector that we must fiddle with. The numbers include all the work to allocate vectors, muck with ledgers and allocators, write to direct memory, read from direct memory, and all the rest.
> >
> > Now, let's do the same thing with Java. We "generate" a class to hold the int (and, for nullable, an "isSet" field). We use a Java array to hold the rows. Otherwise, the write-then-read pattern is the same. Now we get:
> >
> > Required object: 5,100 ms (247M rows/sec, 2x faster)
> > Nullable object: 6,656 ms (189M rows/sec, 2.5x faster)
> >
> > WTF? We get twice the speed WITHOUT any of our fancy direct memory, DrillBuf, allocator, ledger, vector, accessor, code gen work? Could it be true that the entire Java industry, working to perfect Java GC, has done a better job than the original Drill developers did with our memory allocator? One wonders how much better Arrow might be.
> >
> > Imagine how much simpler code gen will be if we work only with Java objects rather than with layers of holders, readers, vectors and buffers. Imagine how much simpler Drill would be if we scrap the allocators, ledgers and the rest. (And, yes, we would not even need much of the guts of the column accessors.)
> >
> > Imagine how much simpler UDFs will be if they just work with Java. Imagine how much simpler format and storage plugins could be. All of this would help Drill be the best query tool to integrate with custom applications, data sources and logic.
> >
> > Seems too easy. We must be overlooking something. For example, we have not measured the cost of network I/O. But remember that our exchanges are almost always hash or partition based. In a large cluster (100 nodes) with a beefy CPU (24 cores), each fragment must buffer an outgoing batch for each target fragment: that is 2,500 buffered batches (at 10-100 MB each) plus all the shuffling to move rows into the correct partition. Very little of that would be needed for rows based on Java objects. It could be that all our copying and buffering takes more resources than the copy-free approach saves. Could be that Java objects would transfer faster, with less memory overhead.
> > Still, can't be right, can it? IIRC, this is exactly the approach that Spark took before DataFrames, and Spark seemed to work out alright. Of course, the above test exercises only about 1% of Drill, and so is far from conclusive.
> >
> > Are Java objects the solution? I don't know, any more than I know that fixed-width blocks are the solution. I'm just asking the questions. The point is, we have to research and test to find what is best for the goals we identify for Drill.
> >
> > Thanks,
> > - Paul
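For what it's worth, the row-as-Java-object half of the experiment Paul describes is easy to reproduce in spirit. Below is a minimal, hypothetical sketch of it (this is not the actual PerformanceTool code; the Row class, batch count, and batch size are illustrative, and the value vector side of the comparison is omitted):

// Simulates the "generated" row class approach: a plain Java object per row,
// an array per batch, write an int into every row, then read it back.
public class RowObjectBench {

  // Stand-in for a code-generated row with one nullable int column.
  static final class Row {
    boolean isSet;   // null indicator, analogous to the "bits" vector
    int value;
  }

  public static void main(String[] args) {
    final int batches = 300;
    final int rowsPerBatch = 4_000_000;
    long checksum = 0;
    long start = System.nanoTime();

    for (int b = 0; b < batches; b++) {
      // "Allocate" a batch: plain objects, no DrillBuf, ledger, or allocator.
      Row[] batch = new Row[rowsPerBatch];
      for (int i = 0; i < rowsPerBatch; i++) {   // simulated reader: write each row
        Row row = new Row();
        row.isSet = true;
        row.value = i;
        batch[i] = row;
      }
      for (int i = 0; i < rowsPerBatch; i++) {   // simulated filter: read each row back
        if (batch[i].isSet) {
          checksum += batch[i].value;
        }
      }
      // The batch goes out of scope here; the GC reclaims it with no manual release.
    }

    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    long totalRows = (long) batches * rowsPerBatch;
    System.out.printf("%,d rows in %,d ms (%,d rows/sec), checksum %d%n",
        totalRows, elapsedMs, totalRows * 1000 / Math.max(elapsedMs, 1), checksum);
  }
}

A toy loop like this says nothing about exchanges, spilling, or memory pressure under concurrency, which is where the direct-memory design is supposed to pay off, so I would treat any numbers it produces as a prompt for more testing rather than a conclusion.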