Re: Impala 4.0 breaking changes

Shant Hovsepian Tue, 17 Mar 2020 17:47:36 -0700

+1 on increasing RUNTIME_FILTER_WAIT_TIME_MS.


On Tue, Mar 17, 2020 at 5:43 PM Tim Armstrong <[email protected]> wrote:
>
> I think we should consider changing a couple more defaults, after having an
> offline conversation with Shant.
>
> We could change COMPRESSION_CODEC to LZ4 or ZSTD as the default. I think
> LZ4 is the safest option perf-wise, because it will be faster across the
> board and the decompression is now one of the main CPU bottlenecks for
> Parquet scanning. We might need to double-check that enough of the
> ecosystem supports LZ4, but this seems like it would be a good improvement.
>
> It *might* be worth enabling compute stats table sampling by default, but I
> think that could be open for discussion.
>
> We could also consider bumping RUNTIME_FILTER_WAIT_TIME_MS to a higher
> value, since I think generally higher values have proven to be more robust
> for complex queries (TPC-DS, etc).
>
> On Tue, Mar 17, 2020 at 11:56 AM Tim Armstrong <[email protected]>
> wrote:
>
> > >   - Do we still need the DECIMAL_V2 query option? Seems like this has
> > > been true for a while. Maybe we can add it to the list of deprecated flags?
> > Maybe we could officially deprecate it and phase it out soonish? It really
> > only exists as a workaround for people upgrading from the old behaviour in
> > 2.x. It hasn't been terribly bad maintaining the two code paths, but it
> > would be nice to simplify it.
> >
> > >   - Deprecate support for ADLS, since it has effectively been replaced
> > > by ABFS
> > Makes sense. It probably isn't too much overhead to keep the old code
> > around for a while, is it? Just in case users have a bunch of data still
> > sitting in the old ADLS.
> >
> > >   - Deprecate (or even remove) support for HDFS caching? Not sure how
> > > extensively this is used, removing the code would be nice as it simplifies
> > > part of the HDFS read path
> > Anecdotally I do see it used, but a lot of times it's to affect scheduling
> > rather than because saving memcpy() makes a real difference (with
> > compressed Parquet, that's rarely the bottleneck). A compromise or
> > in-between step would be to remove the special-casing of the zero-copy code
> > path in the backend, but keep the scheduling behaviour.
> >
> > On Tue, Mar 17, 2020 at 11:50 AM Tim Armstrong <[email protected]>
> > wrote:
> >
> >> I think I generally support this. A few specific comments.
> >>
> >> > Proposal 3: Impala-lzo
> >> > Drop support for Impala-lzo/hadoop-lzo
> >>
> >> Does this mean dropping the plugin text scanner interface entirely? LZO
> >> is the only implementation of that that I'm aware of (and we rely on it to
> > >> test the interface), so it seems reasonable to me to remove something that
> > >> has minimal adoption and is not cleanly separated from the scanner
> > >> implementation of core Impala.
> >>
> >> > Proposal 5: Sentry
> >> > Drop support for Sentry in favor of Ranger.
> >>
> > >> I think moving in this direction makes a lot of sense given that activity in
> >> the Sentry project has declined a lot (just look at the activity level on
> >> the two projects, it's dramatically different), unless someone in the
> >> community wants to step up and maintain the integration.
> >>
> >> > Proposal 6: Metadata
> >> > Metadata V2 will become the default. Metadata V1 will be deprecated.
> >> Maybe we should set a goal of removing the support in Impala 4.1 or 4.2?
> > >> That would allow us to remove a lot of complex code.
> >>
> >> On Mon, Mar 16, 2020 at 10:07 AM Joe McDonnell <[email protected]>
> >> wrote:
> >>
> > >>> Now that Impala 3.4 is branched and master is Impala 4.0, we need to decide
> > >>> what breaking changes will happen in Impala 4.0. I have provided a series
> >>> of proposals below. I welcome feedback on them. Other proposals are also
> >>> welcome.
> >>>
> >>> Thanks,
> >>> Joe
> >>>
> >>> Proposal 0: Hadoop component versions
> >>>
> > >>> Switch to CDP versions of components by default. This means that Impala
> > >>> will use Hive 3+ (which is already essentially Hive 4 and may be renamed
> > >>> to Hive 4).
> > >>> Remove support for CDH versions of components.
> > >>> This was already discussed in the original thread for Impala 4, so this is
> > >>> not new.
> >>>
> >>> Proposal 1: OS support
> >>>
> > >>> Drop support for CentOS 6, Ubuntu 14, and Debian (all versions)
> > >>> Retain support for Ubuntu 16, Ubuntu 18, CentOS 7, and SLES 12
> > >>> CentOS 7 development will be focused on newer CentOS 7 versions such as 7.6
> > >>> and 7.7.
> > >>> Add support for CentOS 8
> >>> Move main development from Ubuntu 16 to Ubuntu 18 over time.
> >>>
> >>> Proposal 2: Python support
> >>>
> >>> Drop support for Python 2.6
> >>> Add support for Python 3 over time.
> >>>
> >>> Proposal 3: Impala-lzo
> >>>
> >>> Drop support for Impala-lzo/hadoop-lzo
> >>>
> >>> Proposal 4: Clients
> >>>
> >>> Deprecate beeswax protocol. This means that it can be removed in the next
> >>> major version number, but it would not be removed in Impala 4. Current
> >>> users of beeswax would need to start migrating to HS2.
> >>>
> >>> Proposal 5: Sentry
> >>>
> >>> Drop support for Sentry in favor of Ranger.
> >>>
> >>> Proposal 6: Metadata
> >>>
> >>> Metadata V2 will become the default. Metadata V1 will be deprecated.
> >>>
> >>> Thanks,
> >>> Joe
> >>>
> >>
