+1 on increasing RUNTIME_FILTER_WAIT_TIME_MS.
On Tue, Mar 17, 2020 at 5:43 PM Tim Armstrong <tarmstr...@cloudera.com> wrote:
>
> I think we should consider changing a couple more defaults, after having an
> offline conversation with Shant.
>
> We could change COMPRESSION_CODEC to LZ4 or ZSTD as the default. I think
> LZ4 is the safest option perf-wise, because it will be faster across the
> board and the decompression is now one of the main CPU bottlenecks for
> Parquet scanning. We might need to double-check that enough of the
> ecosystem supports LZ4, but this seems like it would be a good improvement.
>
> It *might* be worth enabling compute stats table sampling by default, but I
> think that could be open for discussion.
>
> We could also consider bumping RUNTIME_FILTER_WAIT_TIME_MS to a higher
> value, since I think generally higher values have proven to be more robust
> for complex queries (TPC-DS, etc).
>
> On Tue, Mar 17, 2020 at 11:56 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > > - Do we still need the DECIMAL_V2 query option? Seems like this has
> > > been true for a while. Maybe we can add it to the list of deprecated
> > > flags?
> > Maybe we could officially deprecate it and phase it out soonish? It really
> > only exists as a workaround for people upgrading from the old behaviour in
> > 2.x. It hasn't been terribly bad maintaining the two code paths, but it
> > would be nice to simplify it.
> >
> > > - Deprecate support for ADLS, since it has effectively been replaced
> > > by ABFS
> > Makes sense. It probably isn't too much overhead to keep the old code
> > around for a while, is it? Just in case users have a bunch of data still
> > sitting in the old ADLS.
> >
> > > - Deprecate (or even remove) support for HDFS caching?
> > > Not sure how extensively this is used, removing the code would be nice
> > > as it simplifies part of the HDFS read path
> > Anecdotally I do see it used, but a lot of times it's to affect scheduling
> > rather than because saving memcpy() makes a real difference (with
> > compressed Parquet, that's rarely the bottleneck). A compromise or
> > in-between step would be to remove the special-casing of the zero-copy code
> > path in the backend, but keep the scheduling behaviour.
> >
> > On Tue, Mar 17, 2020 at 11:50 AM Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> >> I think I generally support this. A few specific comments.
> >>
> >> > Proposal 3: Impala-lzo
> >> > Drop support for Impala-lzo/hadoop-lzo
> >>
> >> Does this mean dropping the plugin text scanner interface entirely? LZO
> >> is the only implementation of that that I'm aware of (and we rely on it to
> >> test the interface) so it seems reasonable to me to remove something that
> >> has minimal adoption and is not cleanly separated from the scanner
> >> implementation of core Impala.
> >>
> >> > Proposal 5: Sentry
> >> > Drop support for Sentry in favor of Ranger.
> >>
> >> I think moving in this direction makes a lot of sense given that activity
> >> in the Sentry project has declined a lot (just look at the activity level
> >> on the two projects, it's dramatically different), unless someone in the
> >> community wants to step up and maintain the integration.
> >>
> >> > Proposal 6: Metadata
> >> > Metadata V2 will become the default. Metadata V1 will be deprecated.
> >> Maybe we should set a goal of removing the support in Impala 4.1 or 4.2?
> >> That would allow us to remove a lot of complex code.
> >>
> >> On Mon, Mar 16, 2020 at 10:07 AM Joe McDonnell <joemcdonn...@cloudera.com>
> >> wrote:
> >>
> >>> Now that Impala 3.4 is branched and master is Impala 4.0, we need to
> >>> decide what breaking changes will happen in Impala 4.0. I have provided
> >>> a series of proposals below.
> >>> I welcome feedback on them. Other proposals are also welcome.
> >>>
> >>> Thanks,
> >>> Joe
> >>>
> >>> Proposal 0: Hadoop component versions
> >>>
> >>> Switch to CDP versions of components by default. This means that Impala
> >>> will use Hive 3+ (which is already essentially Hive 4 and may change
> >>> names to being Hive 4).
> >>> Remove support for CDH versions of components.
> >>> This was already discussed in the original thread for Impala 4, so this
> >>> is not new.
> >>>
> >>> Proposal 1: OS support
> >>>
> >>> Drop support for Centos 6, Ubuntu 14, and Debian (all versions)
> >>> Retain support for Ubuntu 16, Ubuntu 18, Centos 7, and SLES 12
> >>> Centos 7 development will be focused on newer Centos 7 versions such as
> >>> 7.6 and 7.7.
> >>> Add support for Centos 8
> >>> Move main development from Ubuntu 16 to Ubuntu 18 over time.
> >>>
> >>> Proposal 2: Python support
> >>>
> >>> Drop support for Python 2.6
> >>> Add support for Python 3 over time.
> >>>
> >>> Proposal 3: Impala-lzo
> >>>
> >>> Drop support for Impala-lzo/hadoop-lzo
> >>>
> >>> Proposal 4: Clients
> >>>
> >>> Deprecate beeswax protocol. This means that it can be removed in the next
> >>> major version number, but it would not be removed in Impala 4. Current
> >>> users of beeswax would need to start migrating to HS2.
> >>>
> >>> Proposal 5: Sentry
> >>>
> >>> Drop support for Sentry in favor of Ranger.
> >>>
> >>> Proposal 6: Metadata
> >>>
> >>> Metadata V2 will become the default. Metadata V1 will be deprecated.
> >>>
> >>> Thanks,
> >>> Joe
> >>>
> >>