RE: [DISCUSS] Making storage-api a separately released artifact

Xu, Cheng A Sun, 28 Aug 2016 22:14:52 -0700

Hi Sergio,
For vectorization, it works for most types except decimal. A Hive row 
batch contains an array of HiveDecimal, which can't be initialized on the 
Parquet side. We have to do a conversion, which will impact performance.


-----Original Message-----
From: Sergio Pena [mailto:[email protected]] 
Sent: Saturday, August 27, 2016 3:59 AM
To: dev <[email protected]>
Subject: Re: [DISCUSS] Making storage-api a separately released artifact

Question:

Wouldn't it be better to move part of the implementations to Orc, Parquet and 
Avro, and just have some interfaces and basic implementations in Hive? This way 
we could avoid having Orc, Parquet and/or Avro depend on Hive. I saw this in 
Parquet, where they created a RowBatch class internally and return that to 
Hive; then in Hive we just bind it to the Hive vectorized interface to 
support vectorization. It's just an idea; I am not sure exactly what I am 
trying to say :)


On Fri, Aug 19, 2016 at 11:01 PM, Lefty Leverenz <[email protected]>
wrote:

> Sergey's idea is creative, although it leads to confusion about JIRA 
> fix versions.  Issues would be given fix versions based on assumptions 
> about whether SA or Hive will be released first.  (That's hard to 
> predict when it's months away.)
>
> Keeping the version numbers tied together is very appealing.  Would it 
> be possible to have incompatible changes in SA force a bump in the 
> Hive release number?  Hm, I guess that means Hive would need a release 
> at the same time as SA, but only for incompatible changes.
>
> What's the likelihood of another subproject getting spun off eventually?
> If that happened, the 4th minor version wouldn't make sense.  A 5th 
> minor version wouldn't work either.
>
> -- Lefty
>
>
> On Fri, Aug 19, 2016 at 9:46 PM, Sergey Shelukhin 
> <[email protected]>
> wrote:
>
> > I am suggesting we always skip the number. So only one component 
> > gets the next one :) In your example Hive trunk would be 2.3, and if 
> > SA is released again it would become 2.4. Otherwise we’d need a compat
> > table cause versions will be totally out of sync.
> >
> > On 16/8/19, 16:31, "Owen O'Malley" <[email protected]> wrote:
> >
> > >That won't necessarily work, especially in the beginning. If we 
> > >release SA 2.2.0 and use it for Hive trunk with the assumption that
> > >the next Hive release will be 2.2, what do we do when we need to make
> > >an incompatible change in SA? I guess we could release SA as 2.3.0 and 
> > >when Hive makes its next release skip over Hive 2.2 and go straight to
> > >Hive 2.3.0. In general I think that we'd be better off with the
> > >release numbers not tied together.
> > >
> > >.. Owen
> > >
> > >On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <[email protected]>
> > >wrote:
> > >
> > >> Can we just run the versions thru? I.e. increment it every time 
> > >> but release only one component (or both if they happen to align I guess).
> > >> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
> > >> then Hive 2.4, then storage-api 2.5, etc.
> > >> That might make it easier to reason about compatibility because 
> > >> the order is obvious.
> > >>
> > >> On 16/8/19, 09:04, "Sergio Pena" <[email protected]> wrote:
> > >>
> > >> >I see Parquet is currently using the SearchArgument class for
> > >> >predicate pushdown.
> > >> >Will this class be part of the new sub-module or project?
> > >> >
> > >> >Following Sushanth's idea, can we have other API interfaces in the 
> > >> >new project that other components can use?
> > >> >Perhaps having this may be a good reason to create a project.
> > >> >
> > >> >I'm -1 on the 4th minor version. As Owen mentioned, changing the
> > >> >4th version number for incompatible changes is ugly and confusing.
> > >> >I like the new project idea more, +1, but the storage-api may 
> > >> >be too small for a new project.
> > >> >
> > >> >- Sergio
> > >> >
> > >> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <[email protected]>
> > >> >wrote:
> > >> >
> > >> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <[email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> > +1 for making the API clean and easy for other projects to
> > >> >> > work with. A few questions:
> > >> >> >
> > >> >> > 1) Would this also make it easier for Parquet and others to
> > >> >> > implement Hive’s ACID interfaces?
> > >> >> >
> > >> >>
> > >> >> Currently the ACID interfaces haven't been moved over to
> > >> >> storage-api, although it would make sense to do so at some point.
> > >> >>
> > >> >>
> > >> >> >
> > >> >> > 2) Would we make any attempt to coordinate version numbers
> > >> >> > between Hive and the storage module, or would a given version
> > >> >> > of Hive just depend on a given version of the storage module?
> > >> >> >
> > >> >>
> > >> >> The two options that I see are:
> > >> >>
> > >> >> * Let the numbers run separately starting from 2.2.0.
> > >> >> * Tie the numbers together with an additional level of 
> > >> >> versioning (e.g. 2.2.0.0).
> > >> >>
> > >> >> I think that letting the two version numbers diverge is better 
> > >> >> in the long term. For example, if you need to make an
> > >> >> incompatible change, it is pretty ugly to do it as a fourth
> > >> >> level version number (e.g. an incompatible change from 2.2.0.0
> > >> >> to 2.2.0.1). At the beginning, I expect that storage-api would
> > >> >> move faster than Hive, but as it stabilizes I expect it might
> > >> >> start moving slower than Hive.
> > >> >>
> > >> >> I'd propose that we have Hive's build use a released version 
> > >> >> of storage-api rather than a snapshot.
> > >> >>
> > >> >> Thoughts?
> > >> >>
> > >> >>    Owen
> > >> >>
> > >> >>
> > >> >> > Alan.
> > >> >> >
> > >> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <[email protected]>
> > >> >> > > wrote:
> > >> >> > >
> > >> >> > > All,
> > >> >> > >
> > >> >> > > As part of moving ORC out of Hive, we pulled all of the
> > >> >> > > vectorization storage and sarg classes into a separate
> > >> >> > > module, which is named storage-api.  Although it is currently
> > >> >> > > only used by ORC, it could be used by Parquet or Avro if they
> > >> >> > > wanted to make a fast vectorized reader that read directly
> > >> >> > > into Hive's VectorizedRowBatch without needing a shim or data
> > >> >> > > copy. Note that this is in many ways similar to pulling the
> > >> >> > > Arrow project out of Drill.
> > >> >> > >
> > >> >> > > This unfortunately still leaves us with a circular 
> > >> >> > > dependency between Hive and ORC. I'd hoped that storage-api
> > >> >> > > wouldn't change that much, but that doesn't seem to be
> > >> >> > > happening. As a result, ORC ends up shipping its own fork of
> > >> >> > > storage-api.
> > >> >> > >
> > >> >> > > Although we could make a new project for just the 
> > >> >> > > storage-api, I think it would be better to make it a
> > >> >> > > subproject of Hive that is released independently.
> > >> >> > >
> > >> >> > > What do others think?
> > >> >> > >
> > >> >> > >   Owen
> > >> >> >
> > >> >> >
> > >> >>
> > >>
> > >>
> >
> >
>
