[ https://issues.apache.org/jira/browse/IMPALA-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715788#comment-16715788 ]
ASF subversion and git services commented on IMPALA-7659:
---------------------------------------------------------

Commit 04d027df13e1c3c5c654b5a0bc965b670483b535 in impala's branch refs/heads/master from Bharath Vissapragada
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=04d027d ]

IMPALA-7659: Populate NULL count while computing column stats

It was disabled for performance reasons (IMPALA-1003) and this patch
re-enables it since a lot of codegen improvements have happened since
then. This patch switches the aggregation to use the CASE conditional
instead of IF since the former has proper codegen support (IMPALA-7655).

Tests:
=====
- Updated the affected tests to include the null counts.
- Added unit tests that verify IS [NOT] NULL predicates' cardinality
  estimation.

Perf note:
=========
I reran the compute stats child query with null counts included on
the store_sales table from 1000 SF (1TB) tpcds dataset. The table had
22 non-partitioned columns (on which null counts were computed) and
~2.8B rows. This experiment showed around 7-8% perf drop compared to
the same child query without null counts for these columns.

Change-Id: Ic68f8b4c3756eb1980ce299a602a7d56db1e507a
Reviewed-on: http://gerrit.cloudera.org:8080/11565
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Collect count of nulls when collecting stats
> --------------------------------------------
>
>                 Key: IMPALA-7659
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7659
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>   Affects Versions: Impala 3.0, Impala 2.12.0, Impala 3.1.0
>            Reporter: Piotr Findeisen
>            Assignee: bharath v
>            Priority: Major
>             Fix For: Impala 3.2.0
>
>
> When Impala calculates table stats, NULL count gets overridden with -1.
> Number of NULLs in a table is a useful information. Even if Impala does not
> benefit from this information, some other tools do. Thus, not collecting this
> information may pose a problem for Impala users (potentially forcing them to
> run COMPUTE STATS elsewhere).
> Now, counting NULLs should be an operation that is cheaper than counting
> NDVs. However, code comment in {{ComputeStatsStmt.java}} suggests otherwise
> ([~tarmstrong] suggested this is because of IMPALA-7655).
> My suggestion would be to
> - improve expression used to collect NULL count
> - collect NULL count during COMPUTE STATS
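
For readers who want to see what a CASE-based NULL count looks like, here is a minimal sketch. It is an illustration only: it runs against an in-memory SQLite database rather than Impala, the helper names (null_count_expr, count_nulls) and the sample table are hypothetical, and the exact expression emitted by ComputeStatsStmt.java is not shown in this message, so the SQL text below is an assumption about its general shape.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Quote an SQL identifier for SQLite (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def null_count_expr(column: str) -> str:
    """Build a CASE-based aggregate that counts NULLs in one column.

    COUNT(expr) only counts non-NULL values of expr, so mapping NULL
    inputs to 1 and everything else to NULL yields the NULL count.
    Assumed shape; not copied from Impala's source.
    """
    col = quote_ident(column)
    return f"COUNT(CASE WHEN {col} IS NULL THEN 1 ELSE NULL END)"


def count_nulls(conn, table, columns):
    """Compute NULL counts for several columns in a single scan."""
    select_list = ", ".join(null_count_expr(c) for c in columns)
    query = f"SELECT {select_list} FROM {quote_ident(table)}"
    row = conn.execute(query).fetchone()
    return dict(zip(columns, row))


# Tiny stand-in table; SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 9.5), (None, 3.0), (2, None), (None, None), (3, None)],
)

null_counts = count_nulls(conn, "sales", ["item_id", "price"])
print(null_counts)
```

All per-column aggregates go into one SELECT list, so the table is scanned once no matter how many columns are included, which is how a stats query can afford to add NULL counts alongside its other per-column aggregates.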