+1 for having a separate storage-api project to define common interfaces for people to develop against. It will make generic development much easier.
I'm okay (+0) with the sub-project idea rather than enthusiastic about it, mostly because I worry that it will encourage laziness: in practice it will wind up tied to Hive releases and development, and over time assumptions about how Hive works and what is available will bleed in. Still, having a motion toward separation will definitely help.

On Aug 17, 2016 11:39, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:

> +1 for making it a subproject with separate (preferably shorter) release
> cycle. The module in itself is too small for a separate project. Also
> having a faster release cycle will resolve circular dependency and will
> help other projects make use of vectorization, sarg, bloom filter etc.
>
> For version management, how about adding another version after patch
> version i.e sub-project version?
> Example: 2.2.0.[0] will be storage api’s release version. Hive will always
> depend on 2.2.0-SNAPSHOT. I think maven will let us release modules with
> different versions.
> https://dev.c-ware.de/confluence/display/PUBLIC/Releasing+modules+of+a+multi-module+project+with+independent+version+numbers
>
> Thanks
> Prasanth
>
> > On Aug 17, 2016, at 10:46 AM, Alan Gates <alanfga...@gmail.com> wrote:
> >
> > +1 for making the API clean and easy for other projects to work with. A
> > few questions:
> >
> > 1) Would this also make it easier for Parquet and others to implement
> > Hive’s ACID interfaces?
> >
> > 2) Would we make any attempt to coordinate version numbers between Hive
> > and the storage module, or would a given version of Hive just depend on a
> > given version of the storage module?
> >
> > Alan.
> >
> >> On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org> wrote:
> >>
> >> All,
> >>
> >> As part of moving ORC out of Hive, we pulled all of the vectorization
> >> storage and sarg classes into a separate module, which is named
> >> storage-api. Although it is currently only used by ORC, it could be used
> >> by Parquet or Avro if they wanted to make a fast vectorized reader that
> >> read directly in to Hive's VectorizedRowBatch without needing a shim or
> >> data copy. Note that this is in many ways similar to pulling the Arrow
> >> project out of Drill.
> >>
> >> This unfortunately still leaves us with a circular dependency between Hive
> >> and ORC. I'd hoped that storage-api wouldn't change that much, but that
> >> doesn't seem to be happening. As a result, ORC ends up shipping its own
> >> fork of storage-api.
> >>
> >> Although we could make a new project for just the storage-api, I think it
> >> would be better to make it a subproject of Hive that is released
> >> independently.
> >>
> >> What do others think?
> >>
> >> Owen