[
https://issues.apache.org/jira/browse/IMPALA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844229#comment-16844229
]
Todd Lipcon commented on IMPALA-8458:
-------------------------------------
I paged this code back into my head and remember why we had the weird
workaround. The hack there was to deal with our odd handling of boolean stats.
The LocalCatalog flow is:
- catalogd fetches stats from Hive, and converts them to our own internal
ColumnStats object via ColumnStats.update:
{code}
BooleanColumnStatsData boolStats = statsData.getBooleanStats();
numNulls_ = boolStats.getNumNulls();
numDistinctValues_ = (numNulls_ > 0) ? 3 : 2;
{code}
- impalad fetches stats from catalogd in CatalogdMetaProvider. This interface
was originally built towards the "fetch directly from HMS" code path, so in
this case, the wire protocol consists of the catalogd needing to send back the
Hive ColumnStatisticsObj type. So, we call
ColumnStats.createHiveColStatsData() to convert the bool stats back to the Hive
type:
{code}
case BOOLEAN:
colStatsData.setBooleanStats(new BooleanColumnStatsData(1, -1,
numNulls));
break;
{code}
When this hive object gets to the Impalad, it gets converted _back_ to Impala's
ColumnStats type with the first code snippet above.
This Hive->Impala->Hive->Impala conversion round tripping is somewhat lossy,
particularly for bools since Hive stores a numFalse/numTrue whereas we want to
have an NDV. I think we also end up with "lossiness" in the case that we didn't
find compatible stats in the HMS, since we don't really have a clear
distinction between "we have stats with unknown NDV" and "we don't have stats at
all".
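To make the lossiness concrete, here is a minimal, self-contained sketch of the bool round trip. HiveBoolStats and ImpalaStats are hypothetical stand-ins for Hive's BooleanColumnStatsData and Impala's ColumnStats, not the real classes, and the sketch assumes the three constructor arguments in the second snippet are (numTrues, numFalses, numNulls). Only the two conversion bodies are taken from the snippets above.

```java
public class BoolStatsRoundTrip {
  // Hypothetical stand-in for Hive's BooleanColumnStatsData.
  static final class HiveBoolStats {
    final long numTrues;
    final long numFalses;
    final long numNulls;

    HiveBoolStats(long numTrues, long numFalses, long numNulls) {
      this.numTrues = numTrues;
      this.numFalses = numFalses;
      this.numNulls = numNulls;
    }
  }

  // Hypothetical stand-in for Impala's ColumnStats (only the fields discussed here).
  static final class ImpalaStats {
    final long numNulls;
    final long numDistinctValues;

    ImpalaStats(long numNulls, long numDistinctValues) {
      this.numNulls = numNulls;
      this.numDistinctValues = numDistinctValues;
    }
  }

  // Mirrors the first snippet: the NDV is derived from numNulls alone,
  // so numTrues and numFalses are dropped at this point.
  static ImpalaStats fromHive(HiveBoolStats h) {
    long numNulls = h.numNulls;
    long ndv = (numNulls > 0) ? 3 : 2;
    return new ImpalaStats(numNulls, ndv);
  }

  // Mirrors the second snippet: the counts are refilled with the
  // placeholder values 1 and -1, since the originals are no longer known.
  static HiveBoolStats toHive(ImpalaStats s) {
    return new HiveBoolStats(1, -1, s.numNulls);
  }

  public static void main(String[] args) {
    HiveBoolStats fromHms = new HiveBoolStats(700, 300, 0);  // what HMS stored
    HiveBoolStats onWire = toHive(fromHive(fromHms));        // catalogd -> impalad
    ImpalaStats atImpalad = fromHive(onWire);                // converted back again

    System.out.println("wire: numTrues=" + onWire.numTrues
        + " numFalses=" + onWire.numFalses + " numNulls=" + onWire.numNulls);
    System.out.println("impalad: ndv=" + atImpalad.numDistinctValues
        + " numNulls=" + atImpalad.numNulls);
  }
}
```

In this sketch the 700/300 true/false counts never reach the impalad; only numNulls survives, and the NDV is re-derived from it on each hop.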
I'll see if I can clean this up. Perhaps the easiest route is to have the
wire-protocol for fetch-from-catalogd just use the impala-internal stats object.
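As a rough illustration of that idea, the sketch below sends every internal field explicitly, so the receiving side never has to re-derive one. Everything in it is hypothetical: WireColumnStats, the hasStats flag, and the comma-separated encoding are assumptions made for the example and are not Impala's actual Thrift structures or the eventual fix.

```java
public class InternalStatsWireSketch {
  // Hypothetical wire form of Impala-internal column stats.
  // -1 means "unknown" for an individual field; hasStats is a hypothetical
  // flag that separates "no stats at all" from "stats with unknown NDV".
  static final class WireColumnStats {
    final boolean hasStats;
    final long numDistinctValues;
    final long numNulls;
    final long maxSize;
    final double avgSize;

    WireColumnStats(boolean hasStats, long numDistinctValues, long numNulls,
        long maxSize, double avgSize) {
      this.hasStats = hasStats;
      this.numDistinctValues = numDistinctValues;
      this.numNulls = numNulls;
      this.maxSize = maxSize;
      this.avgSize = avgSize;
    }

    // Encode each field as-is; nothing is converted to a Hive type.
    String encode() {
      return hasStats + "," + numDistinctValues + "," + numNulls + ","
          + maxSize + "," + avgSize;
    }

    // Decode the fields in the same order they were encoded.
    static WireColumnStats decode(String s) {
      String[] p = s.split(",");
      return new WireColumnStats(Boolean.parseBoolean(p[0]), Long.parseLong(p[1]),
          Long.parseLong(p[2]), Long.parseLong(p[3]), Double.parseDouble(p[4]));
    }
  }

  public static void main(String[] args) {
    // Stats where only avgSize is known, as in the repro below.
    WireColumnStats sent = new WireColumnStats(true, -1, -1, -1, 1234.0);
    WireColumnStats received = WireColumnStats.decode(sent.encode());
    System.out.println("hasStats=" + received.hasStats
        + " ndv=" + received.numDistinctValues
        + " avgSize=" + received.avgSize);
  }
}
```

With a representation along these lines, a column whose only known stat is avgSize would stay distinguishable from a column with no stats, which is the distinction the Hive type does not carry cleanly.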
> Can't set numNull/maxSize/avgSize column stats with local catalog without
> also setting NDV
> ------------------------------------------------------------------------------------------
>
> Key: IMPALA-8458
> URL: https://issues.apache.org/jira/browse/IMPALA-8458
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Affects Versions: Impala 3.3.0
> Reporter: Tim Armstrong
> Assignee: Todd Lipcon
> Priority: Critical
>
> Repro:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> create table test_stats2(s
> string);
> +-------------------------+
> | summary |
> +-------------------------+
> | Table has been created. |
> +-------------------------+
> Fetched 1 row(s) in 0.36s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats
> test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s | STRING | -1 | -1 | -1 | -1 |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set
> column stats s('avgSize'='1234');
> +-----------------------------------------+
> | summary |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.14s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats
> test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s | STRING | -1 | -1 | -1 | -1 |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set
> column stats s('maxSize'='1234');
> +-----------------------------------------+
> | summary |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.10s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats
> test_stats2;
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s | STRING | -1 | -1 | -1 | -1 |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> [tarmstrong-box2.ca.cloudera.com:21000] default> invalidate metadata
> test_stats2;
> Fetched 0 row(s) in 0.03s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats
> test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s | STRING | -1 | -1 | -1 | -1 |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.07s
> {noformat}
> I expected that the updates would take effect. Weirdly it doesn't happen for
> NDV and NULLS:
> {noformat}
> [tarmstrong-box2.ca.cloudera.com:21000] default> alter table test_stats2 set
> column stats s('numDVs'='1234','numNulls'='12345');
> Query: alter table test_stats2 set column stats
> s('numDVs'='1234','numNulls'='12345')
> +-----------------------------------------+
> | summary |
> +-----------------------------------------+
> | Updated 0 partition(s) and 1 column(s). |
> +-----------------------------------------+
> Fetched 1 row(s) in 0.12s
> [tarmstrong-box2.ca.cloudera.com:21000] default> show column stats
> test_stats2;
> Query: show column stats test_stats2
> +--------+--------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +--------+--------+------------------+--------+----------+----------+
> | s | STRING | 1234 | 12345 | -1 | -1 |
> +--------+--------+------------------+--------+----------+----------+
> Fetched 1 row(s) in 0.02s
> {noformat}