I see Parquet is currently using the SearchArgument class for predicate pushdown. Will this class be part of the new sub-module or project?
Following Sushanth's idea, can we have other API interfaces in the new project that other components can use? Having those might be a good reason to create a project.

I'm -1 on the 4th-level version number. As Owen mentioned, changing the 4th version number for incompatible changes is ugly and confusing.

I like the new project idea more, +1, but the storage-api may be too small for a new project.

- Sergio

On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <omal...@apache.org> wrote:

> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfga...@gmail.com> wrote:
>
> > +1 for making the API clean and easy for other projects to work with. A
> > few questions:
> >
> > 1) Would this also make it easier for Parquet and others to implement
> > Hive’s ACID interfaces?
> >
>
> Currently the ACID interfaces haven't been moved over to storage-api,
> although it would make sense to do so at some point.
>
> >
> > 2) Would we make any attempt to coordinate version numbers between Hive
> > and the storage module, or would a given version of Hive just depend on a
> > given version of the storage module?
> >
>
> The two options that I see are:
> * Let the numbers run separately starting from 2.2.0.
> * Tie the numbers together with an additional level of versioning (eg.
>   2.2.0.0).
>
> I think that letting the two version numbers diverge is better in the long
> term. For example, if you need to make an incompatible change, it is pretty
> ugly to do it as a fourth level version number (eg. an incompatible change
> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api would
> move faster than Hive, but as it stabilizes I expect it might start moving
> slower than Hive.
>
> I'd propose that we have Hive's build use a released version of storage-api
> rather than a snapshot.
>
> Thoughts?
>
> Owen
>
> > Alan.
> >
> > > On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org> wrote:
> > >
> > > All,
> > >
> > > As part of moving ORC out of Hive, we pulled all of the vectorization
> > > storage and sarg classes into a separate module, which is named
> > > storage-api. Although it is currently only used by ORC, it could be used
> > > by Parquet or Avro if they wanted to make a fast vectorized reader that
> > > read directly in to Hive's VectorizedRowBatch without needing a shim or
> > > data copy. Note that this is in many ways similar to pulling the Arrow
> > > project out of Drill.
> > >
> > > This unfortunately still leaves us with a circular dependency between Hive
> > > and ORC. I'd hoped that storage-api wouldn't change that much, but that
> > > doesn't seem to be happening. As a result, ORC ends up shipping its own
> > > fork of storage-api.
> > >
> > > Although we could make a new project for just the storage-api, I think it
> > > would be better to make it a subproject of Hive that is released
> > > independently.
> > >
> > > What do others think?
> > >
> > > Owen