Re: Impala 4.0 breaking changes

Zoltán Borók-Nagy Fri, 08 May 2020 01:45:08 -0700

About transactional tables:
If there's an ACID base directory in the table (due to compaction or INSERT
OVERWRITE), then files at table/partition-root level will be ignored.
So in that case Spark would need to do ACID-aware inserts.
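
The visibility rule above can be sketched roughly as follows. This is a
simplified model for illustration, not Impala's or Hive's actual
implementation: the function name `visible_files`, the example paths, and the
plain `base_` prefix check are all assumptions, and the sketch ignores write-ID
validity and the question of which base directory is the newest.

```python
from pathlib import PurePosixPath


def visible_files(relative_paths):
    """Return the files a reader would see under one table/partition directory.

    Simplified assumption: if any file lives under a top-level directory whose
    name starts with "base_", files directly at the root level are ignored.
    """
    paths = [PurePosixPath(p) for p in relative_paths]
    has_base = any(
        len(p.parts) > 1 and p.parts[0].startswith("base_") for p in paths
    )
    if not has_base:
        # No base directory: root-level files are still read.
        return [str(p) for p in paths]
    # Root-level files have exactly one path component (the file name).
    return [str(p) for p in paths if len(p.parts) > 1]


# Hypothetical listing: one file dropped at the root by a non-ACID writer,
# plus a base and a delta directory.
listing = [
    "part-00000.parquet",
    "base_0000005/000000_0",
    "delta_0000006_0000006/000000_0",
]
print(visible_files(listing))
```

In this model the root-level `part-00000.parquet` is dropped from the result
as soon as a base directory exists, which is why a plain file drop from Spark
would not be enough.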


Another aspect is that ACID-inserts are probably faster, especially on
object stores like S3.
The reason for this is that we don't need to create a staging directory and
move (which is a copy on S3) files to their final location.
However, read amplification is definitely greater for ACID tables.
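
A back-of-the-envelope model of the write-side difference, assuming a store
where a move is implemented as a copy. The function `bytes_transferred` and its
parameters are hypothetical, and the model only counts data bytes; it ignores
request latency, deletes, and metadata operations.

```python
def bytes_transferred(file_sizes, staged, rename_is_copy):
    """Total bytes moved to land the files in their final location."""
    total = sum(file_sizes)  # every file is written once
    if staged and rename_is_copy:
        # Moving out of the staging directory copies the data a second time.
        total += sum(file_sizes)
    return total


sizes = [100, 200]  # hypothetical file sizes in MB
staged_on_s3 = bytes_transferred(sizes, staged=True, rename_is_copy=True)
direct_on_s3 = bytes_transferred(sizes, staged=False, rename_is_copy=True)
print(staged_on_s3, direct_on_s3)
```

On a filesystem with a cheap rename (`rename_is_copy=False`) the two paths
cost the same in this model, so the gap only appears on object stores.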

Btw, do we want to achieve consistent default behavior with an upstream
Hive version?

That said, I think creating non-transactional tables is a good default.
Especially because Impala will probably support Hudi and Iceberg in the
future, so it's probably better to let the users choose explicitly.

- Zoltan


On Thu, May 7, 2020 at 11:46 PM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> That's a pretty good argument against defaulting to transactional tables.
> You are right that it doesn't work out of the box with most other engines -
> writing files into the base directory of the table/partition will not work
> as intended, afaik.
>
> On Thu, May 7, 2020 at 1:10 PM Shant Hovsepian <sh...@cloudera.com> wrote:
>
> > How compatible with other engines is the insert-only transaction type?
> >
> > Very often data is loaded with Spark, especially for cases with complex
> > types where it's the only option. Will landing Parquet files in the table
> > path just work even if we don't get consistent inserts, or does Spark need
> > to be aware of the table format in either case?
> >
> > -Shant
> >
> > On Thu, May 7, 2020 at 3:09 PM Sahil Takiar <takiar.sa...@gmail.com>
> > wrote:
> >
> > > +1 on query results spooling, I've been thinking about enabling it by
> > > default recently since it seems to be relatively stable.
> > >
> > > On Thu, May 7, 2020 at 11:41 AM Tim Armstrong <tarmstr...@cloudera.com>
> > > wrote:
> > >
> > > > I'm going to revive this thread. I thought of a few more defaults that
> > > > we might want to change. These are default changes we (putting on
> > > > Cloudera hat temporarily) have made for some new production deployments
> > > > and have been happy with.
> > > >
> > > > Query result spooling has a bunch of advantages for resource
> > > > consumption and fetch speed. It uses a bounded amount of memory and
> > > > scratch space, but I think it's overall a better default. We've been
> > > > using it in production for a while now and haven't had any issues.
> > > >
> > > > https://impala.apache.org/docs/build/html/topics/impala_spool_query_results.html
> > > >
> > > > I think we should also switch the default file format to parquet,
> > > > because it's more correct (default text has some issues with escaping)
> > > > and because it's more performant.
> > > >
> > > > https://impala.apache.org/docs/build/html/topics/impala_default_file_format.html
> > > >
> > > > We could also consider creating insert_only transactional tables by
> > > > default:
> > > >
> > > > https://impala.apache.org/docs/build/html/topics/impala_default_transactional_type.html
> > > >
> > > > The pros and cons here are more complex - we get more consistent
> > > > behaviour by default, but there can be perf/scalability consequences.
> > > >
> > > > Any objections or thoughts on these?
> > > >
> > > > On Thu, Mar 19, 2020 at 4:44 PM Tim Armstrong <tarmstr...@cloudera.com>
> > > > wrote:
> > > >
> > > > > I think ARM support can ship in whatever release it's ready in,
> > > > > since it's not a breaking change.
> > > > >
> > > > > On Wed, Mar 18, 2020 at 9:43 PM 赵 仁海 <zhaoren...@hotmail.com> wrote:
> > > > >
> > > > >> Thanks
> > > > >> I will work hard on this ^_^
> > > > >>
> > > > >> ________________________________
> > > > >> From: Jim Apple <apa...@jbapple.com>
> > > > >> Sent: March 19, 2020 10:21
> > > > >> To: dev@impala.apache.org <dev@impala.apache.org>
> > > > >> Subject: Re: Impala 4.0 breaking changes
> > > > >>
> > > > >> I agree. I don’t know how far we are from having arm64 support,
> > > > >> though, and we might not get there for a 4.0 release, I’d guess. But
> > > > >> that doesn’t mean it couldn’t arrive in time for 4.1 or 4.7 or 5.55
> > > > >> or whatever.
> > > > >>
> > > > >> On Wed, Mar 18, 2020 at 6:32 PM Joe McDonnell <joemcdonn...@cloudera.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Patches to add support for arm64 are definitely welcome in any
> > > > >> > release.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Joe
> > > > >> >
> > > > >> > On Mon, Mar 16, 2020 at 6:11 PM 赵 仁海 <zhaoren...@hotmail.com> wrote:
> > > > >> >
> > > > >> > > Hi
> > > > >> > >
> > > > >> > > Could we add support for arm64?
> > > > >> > >
> > > > >> > > Thanks
> > > > >> > > Zhao Renhai
> > > > >> > >
> > > > >> > > ________________________________
> > > > >> > > From: Joe McDonnell <joemcdonn...@cloudera.com>
> > > > >> > > Sent: March 17, 2020 1:07
> > > > >> > > To: dev@impala.apache.org <dev@impala.apache.org>
> > > > >> > > Subject: Impala 4.0 breaking changes
> > > > >> > >
> > > > >> > > Now that Impala 3.4 is branched and master is Impala 4.0, we
> > > > >> > > need to decide what breaking changes will happen in Impala 4.0.
> > > > >> > > I have provided a series of proposals below. I welcome feedback
> > > > >> > > on them. Other proposals are also welcome.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Joe
> > > > >> > >
> > > > >> > > Proposal 0: Hadoop component versions
> > > > >> > >
> > > > >> > > Switch to CDP versions of components by default. This means
> > > > >> > > that Impala will use Hive 3+ (which is already essentially
> > > > >> > > Hive 4 and may be renamed to Hive 4).
> > > > >> > > Remove support for CDH versions of components.
> > > > >> > > This was already discussed in the original thread for Impala 4,
> > > > >> > > so this is not new.
> > > > >> > >
> > > > >> > > Proposal 1: OS support
> > > > >> > >
> > > > >> > > Drop support for CentOS 6, Ubuntu 14, and Debian (all versions).
> > > > >> > > Retain support for Ubuntu 16, Ubuntu 18, CentOS 7, and SLES 12.
> > > > >> > > CentOS 7 development will be focused on newer CentOS 7 versions
> > > > >> > > such as 7.6 and 7.7.
> > > > >> > > Add support for CentOS 8.
> > > > >> > > Move main development from Ubuntu 16 to Ubuntu 18 over time.
> > > > >> > >
> > > > >> > > Proposal 2: Python support
> > > > >> > >
> > > > >> > > Drop support for Python 2.6
> > > > >> > > Add support for Python 3 over time.
> > > > >> > >
> > > > >> > > Proposal 3: Impala-lzo
> > > > >> > >
> > > > >> > > Drop support for Impala-lzo/hadoop-lzo
> > > > >> > >
> > > > >> > > Proposal 4: Clients
> > > > >> > >
> > > > >> > > Deprecate beeswax protocol. This means that it can be removed
> > > > >> > > in the next major version number, but it would not be removed
> > > > >> > > in Impala 4. Current users of beeswax would need to start
> > > > >> > > migrating to HS2.
> > > > >> > >
> > > > >> > > Proposal 5: Sentry
> > > > >> > >
> > > > >> > > Drop support for Sentry in favor of Ranger.
> > > > >> > >
> > > > >> > > Proposal 6: Metadata
> > > > >> > >
> > > > >> > > Metadata V2 will become the default. Metadata V1 will be
> > > > >> > > deprecated.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Joe
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Sahil Takiar
> > > Software Engineer
> > > takiar.sa...@gmail.com | (510) 673-0309
> > >
> >
>
