Thanks Brock and Jason. I just drafted a proposed API for the vectorized Parquet reader (attached to this email). Any comments and suggestions are appreciated.
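[Editor's note: the attached draft is not reproduced here. As a stand-in, below is a minimal, hypothetical sketch of what a batch-oriented read interface of this kind might look like, modeled on the ORC nextBatch() API quoted later in this thread. All names and the toy in-memory reader are illustrative only, not the actual proposal.]

```java
// Hypothetical sketch only: these names are illustrative and are NOT
// the drafted proposal (which was an email attachment). The shape is
// modeled on the ORC nextBatch() API quoted later in this thread.
public class BatchReaderDemo {

    // A flat, reusable container of column vectors, in the spirit of
    // ORC's VectorizedRowBatch.
    static class VectorizedRowBatch {
        int size;          // number of valid rows in this batch
        long[] longColumn; // one flat vector per projected column

        VectorizedRowBatch(int capacity) {
            this.longColumn = new long[capacity];
        }
    }

    interface BatchReader {
        // Reads the next batch, reusing previousBatch when possible so
        // the caller does not allocate per call. A batch with size == 0
        // signals end of input.
        VectorizedRowBatch nextBatch(VectorizedRowBatch previousBatch);
    }

    // Toy in-memory reader standing in for a Parquet column chunk.
    static class InMemoryBatchReader implements BatchReader {
        private final long[] data;
        private final int batchSize;
        private int pos = 0;

        InMemoryBatchReader(long[] data, int batchSize) {
            this.data = data;
            this.batchSize = batchSize;
        }

        public VectorizedRowBatch nextBatch(VectorizedRowBatch previous) {
            VectorizedRowBatch batch =
                (previous != null && previous.longColumn.length >= batchSize)
                    ? previous
                    : new VectorizedRowBatch(batchSize);
            int n = Math.min(batchSize, data.length - pos);
            System.arraycopy(data, pos, batch.longColumn, 0, n);
            pos += n;
            batch.size = n;
            return batch;
        }
    }

    public static void main(String[] args) {
        BatchReader reader = new InMemoryBatchReader(
            new long[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, 4);
        VectorizedRowBatch batch = null;
        long sum = 0;
        while ((batch = reader.nextBatch(batch)).size > 0) {
            for (int i = 0; i < batch.size; i++) sum += batch.longColumn[i];
        }
        System.out.println(sum); // prints 45
    }
}
```

The key design point, carried over from ORC, is that the caller passes the previous batch back in so vectors can be reused rather than reallocated on every call.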
Thanks,
Zhenxiao

On Tue, Oct 7, 2014 at 5:34 PM, Brock Noland <[email protected]> wrote:

> Hi,
>
> The Hive + Parquet community is very interested in improving the
> performance of Hive + Parquet, and of Parquet generally. We are very
> interested in contributing to the Parquet vectorization and lazy
> materialization effort. Please add me to any future meetings on this
> topic.
>
> BTW, here is the JIRA tracking this effort from the Hive side:
> https://issues.apache.org/jira/browse/HIVE-8120
>
> Brock
>
> On Tue, Oct 7, 2014 at 2:04 PM, Zhenxiao Luo <[email protected]> wrote:
>
> > Thanks Jason.
> >
> > Yes, Netflix is using Presto and Parquet for our Big Data Platform
> > (http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html).
> >
> > The fastest format currently in Presto is ORC, not DWRF (Parquet is
> > fast, but not as fast as ORC). We are referring to ORC, not Facebook's
> > DWRF implementation.
> >
> > We already have Parquet working in Presto. We definitely would like to
> > get it as fast as ORC.
> >
> > Facebook has added native support for ORC in Presto, which does not use
> > the ORCRecordReader at all. They parse the ORC footer, do predicate
> > pushdown by skipping row groups, vectorization by introducing
> > type-specific vectors, and lazy materialization by introducing
> > LazyVectors (their code has not been committed yet; I mean their pull
> > request). We are planning to do similar optimizations for Parquet in
> > Presto.
> >
> > For the ParquetRecordReader, we need additional APIs to read the next
> > batch of values, and to read in a vector of values. For example, here
> > is the related API in the ORC code:
> >
> > /**
> >  * Read the next row batch. The size of the batch to read cannot be
> >  * controlled by the callers. Callers need to look at
> >  * VectorizedRowBatch.size of the returned object to know the batch
> >  * size read.
> >  * @param previousBatch a row batch object that can be reused by the
> >  *                      reader
> >  * @return the row batch that was read
> >  * @throws java.io.IOException
> >  */
> > VectorizedRowBatch nextBatch(VectorizedRowBatch previousBatch)
> >     throws IOException;
> >
> > And here is the related API in the Presto code, which is used for ORC
> > support in Presto:
> >
> > public void readVector(int columnIndex, Object vector);
> >
> > For lazy materialization, we may also consider adding LazyVectors or
> > LazyBlocks, so that values are not materialized until they are
> > accessed by the operator.
> >
> > Any comments and suggestions are appreciated.
> >
> > Thanks,
> > Zhenxiao
> >
> > On Tue, Oct 7, 2014 at 1:05 PM, Jason Altekruse <[email protected]> wrote:
> >
> > > Hello All,
> > >
> > > No updates from me yet, just sending out another message for some of
> > > the Netflix engineers that were still just subscribed to the Google
> > > Group mail. This will allow them to respond directly with their
> > > research on the optimized ORC reader for consideration in the design
> > > discussion.
> > >
> > > -Jason
> > >
> > > On Mon, Oct 6, 2014 at 10:51 PM, Jason Altekruse <[email protected]> wrote:
> > >
> > > > Hello Parquet team,
> > > >
> > > > I wanted to report the results of a discussion between the Drill
> > > > team and the engineers at Netflix working to make Parquet run
> > > > faster with Presto. As we have said in the last few hangouts, we
> > > > both want to make contributions back to parquet-mr to add features
> > > > and performance. We thought it would be good to sit down and speak
> > > > directly about our real goals and the best next steps to get an
> > > > engineering effort started to accomplish these goals.
> > > >
> > > > Below is a summary of the meeting.
> > > > - Meeting notes
> > > >   - Attendees:
> > > >     - Netflix: Eva Tse, Daniel Weeks, Zhenxiao Luo
> > > >     - MapR (Drill team): Jacques Nadeau, Jason Altekruse, Parth Chandra
> > > >   - Minutes
> > > >     - Introductions / background
> > > >       - Netflix
> > > >         - Working on providing interactive SQL querying to users
> > > >         - Have chosen Presto as the query engine and Parquet as the
> > > >           high-performance data storage format
> > > >         - Presto is providing the needed speed in some cases, but other
> > > >           cases are missing optimizations that could be avoiding reads
> > > >         - Have already started some development and investigation, and
> > > >           have identified key goals
> > > >         - Some initial benchmarks with a modified ORC reader (DWRF),
> > > >           written by the Presto team, show that such gains are possible
> > > >           with a different reader implementation
> > > >         - Goals
> > > >           - Filter pushdown
> > > >             - Skipping reads based on filter evaluation on one or more
> > > >               columns
> > > >             - This can happen at several granularities: row group,
> > > >               page, record/value
> > > >           - Late/lazy materialization
> > > >             - For columns not involved in a filter, avoid materializing
> > > >               them entirely until they are known to be needed after
> > > >               evaluating a filter on other columns
> > > >       - Drill
> > > >         - The Drill engine uses an in-memory vectorized representation
> > > >           of records
> > > >         - For scalar and repeated types we have implemented a fast
> > > >           vectorized reader that is optimized to transform between
> > > >           Parquet's on-disk format and our in-memory format
> > > >         - This is currently producing performant table scans, but has
> > > >           no facility for filter pushdown
> > > >     - Major goals going forward
> > > >       - Filter pushdown
> > > >         - Decide the best way to incorporate filter pushdown into our
> > > >           current implementation, or figure out a way to leverage
> > > >           existing work in the parquet-mr library to accomplish this
> > > >           goal
> > > >       - Late/lazy materialization
> > > >         - See above
> > > >       - Contribute existing code back to Parquet
> > > >         - The Drill Parquet reader has a very strong emphasis on
> > > >           performance and a clear interface to consume; sufficiently
> > > >           separated from Drill, it could prove very useful for other
> > > >           projects
> > > >     - First steps
> > > >       - The Netflix team will share some of their thoughts and research
> > > >         from working with the DWRF code
> > > >         - We can have a discussion based off of this: which aspects are
> > > >           done well, and any opportunities they may have missed that we
> > > >           can incorporate into our design
> > > >         - Do further investigation and ask the existing community for
> > > >           guidance on existing parquet-mr features or planned APIs that
> > > >           may provide the desired functionality
> > > >       - We will begin a discussion of an API for the new functionality
> > > >     - Some outstanding thoughts for down the road
> > > >       - The Drill team has an interest in very late materialization for
> > > >         data stored in dictionary-encoded pages, such as running a join
> > > >         or filter on the dictionary and then going back to the reader
> > > >         to grab all of the values in the data that match the needed
> > > >         members of the dictionary
> > > >         - This is a later consideration, but it is part of the reason
> > > >           we are opening up the design discussion early: so that the
> > > >           API can be flexible enough to allow it in the future, even if
> > > >           it is not implemented right away
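[Editor's note: the row-group-granularity filter pushdown described in the notes above can be sketched roughly as follows. The RowGroupStats type and groupsToRead helper are hypothetical stand-ins; real Parquet keeps per-row-group column statistics in the file footer metadata.]

```java
// Illustration of the row-group-skipping flavor of filter pushdown
// from the meeting notes: evaluate the predicate against per-row-group
// min/max statistics and skip groups that cannot possibly match.
// Hypothetical types, not actual parquet-mr API.
import java.util.ArrayList;
import java.util.List;

public class RowGroupSkipDemo {
    static class RowGroupStats {
        final long min, max;
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Returns indices of row groups that might contain rows where the
    // column equals target; every other group is skipped with no I/O.
    static List<Integer> groupsToRead(List<RowGroupStats> stats, long target) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            RowGroupStats s = stats.get(i);
            if (target >= s.min && target <= s.max) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> stats = List.of(
            new RowGroupStats(0, 99),
            new RowGroupStats(100, 199),
            new RowGroupStats(200, 299));
        // Only the middle group can contain the value 150.
        System.out.println(groupsToRead(stats, 150)); // prints [1]
    }
}
```

The same min/max pruning idea extends to page granularity, and record-level filtering then handles the rows that survive within the groups actually read.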

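[Editor's note: the late/lazy materialization idea (LazyVectors/LazyBlocks) that recurs through this thread can be sketched as below. This is not the Presto implementation, whose pull request had not been merged at the time of this thread; it only illustrates deferring decode work until a value is first accessed.]

```java
// Rough, hypothetical sketch of lazy materialization: a vector holds
// its encoded bytes and decodes them only on first access, so columns
// that a filter eliminates entirely never pay the decode cost.
public class LazyVectorDemo {
    static class LazyLongVector {
        static int decodeCount = 0;       // instrumentation for the demo
        private final byte[] encodedPage; // stand-in for an encoded Parquet page
        private long[] decoded;           // null until materialized

        LazyLongVector(byte[] encodedPage) {
            this.encodedPage = encodedPage;
        }

        long get(int i) {
            if (decoded == null) {
                // Materialize the whole vector on first access only.
                decodeCount++;
                decoded = new long[encodedPage.length];
                for (int j = 0; j < encodedPage.length; j++) {
                    decoded[j] = encodedPage[j];
                }
            }
            return decoded[i];
        }
    }

    public static void main(String[] args) {
        LazyLongVector needed = new LazyLongVector(new byte[]{1, 2, 3});
        LazyLongVector filteredAway = new LazyLongVector(new byte[]{4, 5, 6});
        // Only the column an operator actually touches is decoded;
        // filteredAway is never materialized.
        System.out.println(needed.get(1) + " " + LazyLongVector.decodeCount); // prints "2 1"
    }
}
```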