Re: [DISCUSS] Making storage-api a separately released artifact

Sergio Pena Fri, 19 Aug 2016 09:05:29 -0700

I see Parquet currently uses the SearchArgument class for predicate
pushdown.
Will this class be part of the new sub-module or project?


Following Sushanth's idea, can we put other interfaces in the new
project for other components to use?
Having them might be a good reason to create a project.

I'm -1 on the fourth-level version number. As Owen mentioned, bumping the
fourth version number for incompatible changes is ugly and confusing.
I like the new project idea more (+1), but the storage-api may be too small
for a new project.

- Sergio

On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <omal...@apache.org> wrote:

> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfga...@gmail.com> wrote:
>
> > +1 for making the API clean and easy for other projects to work with.  A
> > few questions:
> >
> > 1) Would this also make it easier for Parquet and others to implement
> > Hive’s ACID interfaces?
> >
>
> Currently the ACID interfaces haven't been moved over to storage-api,
> although it would make sense to do so at some point.
>
>
> >
> > 2) Would we make any attempt to coordinate version numbers between Hive
> > and the storage module, or would a given version of Hive just depend on a
> > given version of the storage module?
> >
>
> The two options that I see are:
>
> * Let the numbers run separately starting from 2.2.0.
> * Tie the numbers together with an additional level of versioning (eg.
> 2.2.0.0).
>
> I think that letting the two version numbers diverge is better in the long
> term. For example, if you need to make an incompatible change, it is pretty
> ugly to do it as a fourth level version number (eg. an incompatible change
> from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api would
> move faster than Hive, but as it stabilizes I expect it might start moving
> slower than Hive.
>
> I'd propose that we have Hive's build use a released version of storage-api
> rather than a snapshot.
>
> Thoughts?
>
>    Owen
>
>
> > Alan.
> >
> > > On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org> wrote:
> > >
> > > All,
> > >
> > > As part of moving ORC out of Hive, we pulled all of the vectorization
> > > storage and sarg classes into a separate module, which is named
> > > storage-api.  Although it is currently only used by ORC, it could be used
> > > by Parquet or Avro if they wanted to make a fast vectorized reader that
> > > read directly into Hive's VectorizedRowBatch without needing a shim or
> > > data copy. Note that this is in many ways similar to pulling the Arrow
> > > project out of Drill.
> > >
> > > This unfortunately still leaves us with a circular dependency between Hive
> > > and ORC. I'd hoped that storage-api wouldn't change that much, but that
> > > doesn't seem to be happening. As a result, ORC ends up shipping its own
> > > fork of storage-api.
> > >
> > > Although we could make a new project for just the storage-api, I think it
> > > would be better to make it a subproject of Hive that is released
> > > independently.
> > >
> > > What do others think?
> > >
> > >   Owen
> >
> >
>
