Hi,

+1. NDV (number of distinct values) is a widely used term in
table statistics.

I've also seen it called `distinctCount`.

This name can be found in databases like Oracle too. [1]

So it would not be good to change it to a completely different name.
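
For readers new to the term, here is a minimal sketch of what NDV measures and of how a Javadoc comment could spell out the abbreviation. The class `ColumnStatsSketch` and the method `fromValues` are hypothetical names used only for illustration, not Flink's actual `ColumnStats` API, and leaving nulls out of the count is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a column statistics holder; not Flink's ColumnStats. */
final class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column, or null if unknown. */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the NDV (number of distinct values) of the column.
     *
     * @return the NDV, or null if it has not been collected
     */
    Long getNdv() {
        return ndv;
    }

    /** Computes the exact NDV of an in-memory column, ignoring nulls (an assumption). */
    static ColumnStatsSketch fromValues(List<String> columnValues) {
        Set<String> distinct = new HashSet<>();
        for (String value : columnValues) {
            if (value != null) {
                distinct.add(value);
            }
        }
        return new ColumnStatsSketch((long) distinct.size());
    }

    public static void main(String[] args) {
        List<String> months = Arrays.asList("Jan", "Feb", "Jan", null, "Mar", "Feb");
        System.out.println(fromValues(months).getNdv()); // prints 3
    }
}
```

With the abbreviation expanded in the Javadoc, the short name stays familiar to people who work with statistics while remaining understandable to everyone else.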

[1]
https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922

Best,
Jingsong

On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <imj...@gmail.com> wrote:

> Hi Jing,
>
> I can see there might be developers who don't understand the meaning at
> first glance.
> However, NDV is a widely used term in table statistics, see
> [1][2][3].
> If we use another name, it may confuse developers who are familiar with
> stats and optimization.
> I think that, at the very least, Javadoc is needed to explain the meaning
> and full name.
> If we want to change the name, we can use the full name
> "numberOfDistinctValues()".
>
> Best,
> Jark
>
> [1]:
>
> https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> [2]:
> https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> [3]:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
>
> On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Jing,
> >
> > While I do agree that NDV is a little confusing at first sight, it seems
> > quite concise once I got the meaning. So personally I am OK with keeping
> > it as is, but proper documentation would be helpful. If we really want to
> > replace it with a more professional name, *cardinality* might be a good
> > alternative.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
> >
> > > Hi Dev,
> > >
> > > I am not really sure whether it is appropriate to start this discussion
> > > here. According to the contribution guidelines, the dev ML is the right
> > > place to reach consensus.
> > >
> > > In ColumnStats, ndv, which stands for "number of distinct values", is
> > > currently used. First of all, it is difficult to understand the meaning
> > > from the abbreviation. Second, it might be good to use a more
> > > professional name instead.
> > >
> > >
> > >
> > > Suggestion:
> > >
> > > replace ndv with granularityNumber:
> > >
> > >
> > >
> > > The good news, afaik, is that the method getNdv() hasn't been used
> > > within Flink, which means the renaming will have very limited impact.
> > >
> > >
> > >
> > > ColumnStats {
> > >
> > >     /** Number of distinct values. */
> > >     @Deprecated
> > >     private final Long ndv;
> > >
> > >     /**
> > >      * Granularity refers to the level of detail used to sort and
> > >      * separate data at column level. Highly granular data is categorized
> > >      * or separated very precisely. For example, the granularity number
> > >      * of a gender column should normally be 2. The granularity number of
> > >      * a month column will be 12. In the SQL world, it means the number
> > >      * of distinct values.
> > >      */
> > >     private final Long granularityNumber;
> > >
> > >     @Deprecated
> > >     public Long getNdv() { return ndv; }
> > >
> > >     public Long getGranularityNumber() { return granularityNumber; }
> > > }
> > >
> > > Best regards,
> > > --
> > >
> > > Jing
> > >
> >
>
