Hi,

+1. NDV (number of distinct values) is a widely used term in table statistics.
I've also seen it called `distinctCount`. This name can be found in
databases such as Oracle too [1], so it is not a good idea to switch to a
completely different name.

[1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922

Best,
Jingsong

On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <imj...@gmail.com> wrote:

> Hi Jing,
>
> I can see there might be developers who don't understand the meaning at
> the first glance.
> However, NDV is a widely used terminology in table statistics, see
> [1][2][3].
> If we use another name, it may confuse developers who are familiar with
> stats and optimization.
> I think at least, the Javadoc is needed to explain the meaning and full
> name.
> If we want to change the name, we can use the full name
> "numberOfDistinctValues()".
>
> Best,
> Jark
>
> [1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> [2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> [3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
>
> On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Jing,
> >
> > While I do agree that NDV is a little confusing at first sight, it
> > seems quite concise once I got the meaning. So personally I am OK with
> > keeping it as is, but proper documentation would be helpful. If we
> > really want to replace it with a more professional name, *cardinality*
> > might be a good alternative.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
> >
> > > Hi Dev,
> > >
> > > I am not really sure if it is feasible to start this discussion.
> > > According to the contribution guidelines, dev ml is the right place
> > > to reach consensus.
> > >
> > > In ColumnStats, currently ndv, which stands for "number of distinct
> > > values", is used. First of all, it is difficult to understand the
> > > meaning with the abbreviation. Second, it might be good to use a
> > > professional naming instead.
> > >
> > > Suggestion:
> > >
> > > replace ndv with granularityNumber:
> > >
> > > The good news, afaik, is that the method getNdv() hasn't been used
> > > within Flink which means the renaming will have very limited impact.
> > >
> > > ColumnStats {
> > >
> > >     /** number of distinct values. */
> > >     @Deprecated
> > >     private final Long ndv;
> > >
> > >     /**
> > >      * Granularity refers to the level of details used to sort and
> > >      * separate data at column level. Highly granular data is
> > >      * categorized or separated very precisely. For example, the
> > >      * granularity number of gender columns should normally be 2.
> > >      * The granularity number of the month column will be 12. In the
> > >      * SQL world, it means the number of distinct values.
> > >      */
> > >     private final Long granularityNumber;
> > >
> > >     @Deprecated
> > >     public Long getNdv() { return ndv; }
> > >
> > >     public Long getGranularityNumber() { return granularityNumber; }
> > > }
> > >
> > > Best regards,
> > > --
> > >
> > > Jing
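
P.S. For what it's worth, here is a minimal sketch of the "keep the name, improve the Javadoc" option discussed above. The class name `ColumnStatsSketch` is hypothetical and this is not Flink's actual `ColumnStats`; treating `null` as "unknown" is also my assumption, not something stated in the thread.

```java
/** Hypothetical stand-in for a column statistics holder; not Flink's actual ColumnStats. */
final class ColumnStatsSketch {

    /**
     * NDV: number of distinct values in the column.
     * For example, a month column normally has an NDV of 12.
     * Assumption for this sketch: null means the statistic is unknown.
     */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the NDV (number of distinct values) of the column.
     *
     * @return the number of distinct values, or null if unknown (sketch assumption)
     */
    Long getNdv() {
        return ndv;
    }
}
```

With the abbreviation spelled out on both the field and the getter, the short name stays familiar to people who work with optimizer statistics while remaining readable to everyone else.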