GPL dependencies are always a problem. This support would be a great candidate for an external project.
On Wed, Mar 13, 2013 at 2:08 PM, Tsuyoshi OZAWA <[email protected]> wrote:
> One alternative columnar storage is WiredTiger, used by amazon.com.
> It provides a columnar storage and record-style storage library API
> like Berkeley DB.
>
> One concern is that WiredTiger is licensed under the GPL and BSD.
> However, supporting it could empower the Drill project.
>
> http://wiredtiger.com/
>
> On Wed, Mar 13, 2013 at 4:22 PM, Ted Dunning <[email protected]> wrote:
> > Can you bring 5 slides on Parquet? (ppt or pptx?)
> >
> > On Tue, Mar 12, 2013 at 8:59 PM, Julien Le Dem <[email protected]> wrote:
> >
> >> I should be able to come to the Drill meetup tomorrow.
> >> We can chat about it then.
> >> Julien
> >>
> >> On Tue, Mar 12, 2013 at 1:43 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >> > ColumnIO implementations return values from a column independently of
> >> > other columns; RecordReaderImplementation does materialize the whole
> >> > record (by using a bunch of column readers at the same time). You could
> >> > construct a column-at-a-time, late-materialization API by dropping
> >> > directly into using column readers; so it just depends on which level
> >> > of abstraction you want to hook up with.
> >> >
> >> > We were initially concerned with "record-oriented" frameworks, so we
> >> > built the record materialization machinery for them first; a more truly
> >> > columnar engine should work with ColumnIO instead of RecordReaders.
> >> >
> >> > Also, since the API is still young, it's certainly open to discussion
> >> > and improvement.
> >> >
> >> > D
> >> >
> >> > On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <[email protected]> wrote:
> >> >
> >> >> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <[email protected]> wrote:
> >> >>
> >> >> > Joined, thanks. I'm glad that the approach was open for this. I think
> >> >> > that helps its chances of becoming ubiquitous. As much as this might
> >> >> > be blasphemous to some, I really hope that the final solution to the
> >> >> > query wars is a collaborative solution as opposed to a competitive one.
> >> >> >
> >> >> > Having not looked at the code yet, do the existing read interfaces
> >> >> > support working with "late materialization" execution strategies
> >> >> > similar to some of the ideas at [1]? It definitely seems harder to
> >> >> > implement in a nested/repeated environment, but I wanted to get a
> >> >> > sense of the thinking behind the initial efforts.
> >> >> >
> >> >>
> >> >> The existing read interface in Java is tuple-at-a-time, but there's no
> >> >> reason one couldn't build a column-at-a-time late materialization
> >> >> approach. It would just be a lot more "custom", and not directly
> >> >> user-usable, so there's none in the initial implementation.
> >> >>
> >> >> Like you said, it's a little tougher with arbitrary nesting, but I
> >> >> think still doable.
> >> >>
> >> >> -Todd
> >> >>
> >> >> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> >> >> >
> >> >> > > Hey Jacques,
> >> >> > >
> >> >> > > Feel free to ping us with any questions. Despite some of the _users_
> >> >> > > of Parquet competing with each other (e.g. query engines), we hope
> >> >> > > the file format itself can be easily implemented by everyone and
> >> >> > > become ubiquitous.
> >> >> > >
> >> >> > > There are a few changes still in flight that we're working on, so
> >> >> > > you may want to join the Parquet dev mailing list as well to follow
> >> >> > > along.
> >> >> > >
> >> >> > > Thanks
> >> >> > > -Todd
> >> >> > >
> >> >> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
> >> >> > >
> >> >> > > > When you said soon, you meant very soon. This looks like great
> >> >> > > > work. Thanks for sharing it with the world. Will come back after
> >> >> > > > spending some time with it.
> >> >> > > >
> >> >> > > > thanks again,
> >> >> > > > Jacques
> >> >> > > >
> >> >> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
> >> >> > > >
> >> >> > > > > The repo is now available: http://parquet.github.com/
> >> >> > > > > Let me know if you have questions
> >> >> > > > >
> >> >> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
> >> >> > > > > > There definitely seem to be some new kids on the block. I
> >> >> > > > > > really hope that Drill can adopt either ORC or Parquet as a
> >> >> > > > > > closely related "native" format. At the moment, I'm actually
> >> >> > > > > > more focused on the in-memory execution format and the right
> >> >> > > > > > abstraction to support compressed columnar execution and
> >> >> > > > > > vectorization. Historically, the biggest gaps I'd worry about
> >> >> > > > > > are Java-centricity and an expectation of early materialization
> >> >> > > > > > & decompression. Once we get some execution stuff working,
> >> >> > > > > > let's see how each fits in. Rather than start a third competing
> >> >> > > > > > format (or fourth if you count Trevni), let's either use or
> >> >> > > > > > extend/contribute back to one of the existing new kids.
> >> >> > > > > >
> >> >> > > > > > Julien, do you think more will be shared about Parquet before
> >> >> > > > > > the Hadoop Summit so we can start toying with using it inside
> >> >> > > > > > of Drill?
> >> >> > > > > >
> >> >> > > > > > J
> >> >> > > > > >
> >> >> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
> >> >> > > > > >
> >> >> > > > > >> Hi all,
> >> >> > > > > >>
> >> >> > > > > >> I've been trying to track down status/comparisons of various
> >> >> > > > > >> columnar formats, and just heard about Parquet.
> >> >> > > > > >>
> >> >> > > > > >> I don't have any direct experience with Parquet, but Really
> >> >> > > > > >> Smart Guy said:
> >> >> > > > > >>
> >> >> > > > > >> > From what I hear there are two key features that
> >> >> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be
> >> >> > > > > >> > optionally split into separate files, and 2) the mechanism
> >> >> > > > > >> > for shredding nested fields into columns is taken almost
> >> >> > > > > >> > verbatim from Dremel. Feature (1) won't be practical to use
> >> >> > > > > >> > until Hadoop introduces support for a file group locality
> >> >> > > > > >> > feature, but once it does, this feature should enable more
> >> >> > > > > >> > efficient use of the buffer cache for predicate pushdown
> >> >> > > > > >> > operations.
> >> >> > > > > >>
> >> >> > > > > >> -- Ken
> >> >> > > > > >>
> >> >> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> >> >> > > > > >>
> >> >> > > > > >> > Parquet is actually implementing the algorithm described
> >> >> > > > > >> > in the "Nested Columnar Storage" section of the Dremel
> >> >> > > > > >> > paper [1].
> >> >> > > > > >> >
> >> >> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
> >> >> > > > > >> >
> >> >> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
> >> >> > > > > >> >> Just saw this:
> >> >> > > > > >> >>
> >> >> > > > > >> >> http://t.co/ES1dGDZlKA
> >> >> > > > > >> >>
> >> >> > > > > >> >> I know Trevni is another Dremel-inspired columnar format
> >> >> > > > > >> >> as well; has anyone seen much info on Parquet and how
> >> >> > > > > >> >> it's different?
> >> >> > > > > >> >>
> >> >> > > > > >> >> Tim
> >> >> > > > > >>
> >> >> > > > > >> --------------------------
> >> >> > > > > >> Ken Krugler
> >> >> > > > > >> +1 530-210-6378
> >> >> > > > > >> http://www.scaleunlimited.com
> >> >> > > > > >> custom big data solutions & training
> >> >> > > > > >> Hadoop, Cascading, Cassandra & Solr
> >> >> > >
> >> >> > > --
> >> >> > > Todd Lipcon
> >> >> > > Software Engineer, Cloudera
> >> >>
> >> >> --
> >> >> Todd Lipcon
> >> >> Software Engineer, Cloudera
>
> --
> - Tsuyoshi
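
A minimal sketch of the distinction Dmitriy and Todd draw above between record materialization and a column-at-a-time, late-materialization read path. The IntColumn interface and both methods are invented for illustration; this is not Parquet's actual ColumnIO/ColumnReader API. The point is only that the predicate scans one column and the other column is fetched just for the surviving row ids.

import java.util.ArrayList;
import java.util.List;

public final class LateMaterializationSketch {

  /** Stand-in for a per-column reader; real readers for nested data
      would also expose repetition/definition levels. */
  interface IntColumn {
    int size();
    int get(int row);
  }

  /** Record-at-a-time ("early") materialization: assemble every tuple, then filter. */
  static List<int[]> earlyMaterialize(IntColumn a, IntColumn b, int threshold) {
    List<int[]> out = new ArrayList<>();
    for (int row = 0; row < a.size(); row++) {
      int[] record = { a.get(row), b.get(row) };  // build the whole record up front
      if (record[0] > threshold) out.add(record);
    }
    return out;
  }

  /** Column-at-a-time ("late") materialization: run the predicate over column a
      alone, then fetch column b only for the surviving row ids. */
  static List<int[]> lateMaterialize(IntColumn a, IntColumn b, int threshold) {
    List<Integer> survivors = new ArrayList<>();
    for (int row = 0; row < a.size(); row++) {
      if (a.get(row) > threshold) survivors.add(row);   // keep row ids, not values
    }
    List<int[]> out = new ArrayList<>();
    for (int row : survivors) {
      out.add(new int[] { a.get(row), b.get(row) });    // touch b only when needed
    }
    return out;
  }

  /** Trivial array-backed column so the sketch runs end to end. */
  static IntColumn column(int... values) {
    return new IntColumn() {
      public int size() { return values.length; }
      public int get(int row) { return values[row]; }
    };
  }

  public static void main(String[] args) {
    IntColumn a = column(1, 7, 3, 9);
    IntColumn b = column(10, 20, 30, 40);
    System.out.println(lateMaterialize(a, b, 5).size()); // 2 surviving rows
  }
}

With nesting the same idea applies, but as Todd notes it gets tougher: the surviving positions would presumably have to be tracked through repetition/definition levels rather than flat row ids.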

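For readers new to the "Nested Columnar Storage" algorithm Julien cites, here is a toy shredder for a single column (Name.Language.Code from the Dremel paper's example schema). It assumes a made-up document model of nested lists rather than Parquet's real writers: each emitted value carries a repetition level (where in the path repetition happened) and a definition level (how much of the path is present).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public final class DremelShreddingSketch {

  /** One entry in the shredded Name.Language.Code column. */
  static final class ColumnEntry {
    final String value;  // null when the path is only partially defined
    final int r;         // repetition level
    final int d;         // definition level
    ColumnEntry(String value, int r, int d) { this.value = value; this.r = r; this.d = d; }
    @Override public String toString() { return value + " r=" + r + " d=" + d; }
  }

  /**
   * Shreds one document into the Name.Language.Code column.
   * Document model: outer list = repeated Name, inner list = repeated Language
   * (each entry is its required Code).  Max repetition level is 2 (Name, Language);
   * max definition level is 2 (Code itself is required, so it adds nothing).
   */
  static List<ColumnEntry> shredCodes(List<List<String>> document) {
    List<ColumnEntry> column = new ArrayList<>();
    if (document.isEmpty()) {
      column.add(new ColumnEntry(null, 0, 0));            // no Name at all
      return column;
    }
    for (int n = 0; n < document.size(); n++) {
      int nameRep = (n == 0) ? 0 : 1;                     // a new Name repeats at level 1
      List<String> languages = document.get(n);
      if (languages.isEmpty()) {
        column.add(new ColumnEntry(null, nameRep, 1));    // Name defined, Language missing
        continue;
      }
      for (int l = 0; l < languages.size(); l++) {
        int rep = (l == 0) ? nameRep : 2;                 // a new Language repeats at level 2
        column.add(new ColumnEntry(languages.get(l), rep, 2));
      }
    }
    return column;
  }

  public static void main(String[] args) {
    // Record r1 from the Dremel paper: three Names; the second has no Language.
    List<List<String>> r1 = new ArrayList<>();
    r1.add(Arrays.asList("en-us", "en"));  // first Name: two Languages
    r1.add(Collections.emptyList());       // second Name: no Language
    r1.add(Arrays.asList("en-gb"));        // third Name: one Language
    shredCodes(r1).forEach(System.out::println);
  }
}

Running main on the paper's record r1 prints en-us (r=0, d=2), en (r=2, d=2), a null at (r=1, d=1) for the Name with no Language, and en-gb (r=1, d=2), which matches the column striping shown in the paper.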