Myracle opened a new pull request, #27555:
URL: https://github.com/apache/flink/pull/27555

   ## What is the purpose of the change
   
   This pull request extends the `APPROX_COUNT_DISTINCT` aggregate function to 
streaming mode, including Window TVF aggregations (TUMBLE, HOP, CUMULATE). 
Previously, the function was available only in batch mode. The implementation 
uses the HyperLogLog++ algorithm, which estimates distinct counts with a 
relative standard error of about 1% in constant memory (about 16 KB). This 
enables efficient cardinality estimation in streaming scenarios where exact 
counts are not required.
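
To make the memory and error figures concrete, here is a minimal Python sketch of plain HyperLogLog. It is not the HyperLogLog++ variant this PR uses and not Flink's code. It assumes 2^14 one-byte registers, which would match the roughly 16 KB stated above and gives a theoretical standard error of about 1.04/sqrt(16384), or roughly 0.8%. The `HyperLogLog` class name, the register count, and the BLAKE2b hash are illustrative assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog with a small-range (linear counting) correction.

    Illustrative only: the class name, p=14, and the hash are assumptions.
    """

    def __init__(self, p: int = 14) -> None:
        self.p = p
        self.m = 1 << p                     # number of registers (16384 for p=14)
        self.registers = bytearray(self.m)  # one byte per register, about 16 KB

    @staticmethod
    def _hash64(value: object) -> int:
        # Deterministic 64-bit hash of the value's textual representation.
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value: object) -> None:
        if value is None:  # skip NULLs, as SQL aggregates usually do
            return
        h = self._hash64(value)
        width = 64 - self.p
        idx = h >> width                       # top p bits select the register
        rest = h & ((1 << width) - 1)          # remaining bits
        rank = width - rest.bit_length() + 1   # leading zeros in `rest`, plus 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small cardinalities: linear counting on the empty registers.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = HyperLogLog()
for i in range(10_000):
    hll.add(i)
    hll.add(i)  # duplicates never raise a register, so they do not count twice
estimate = hll.estimate()
```

Memory stays fixed at `m` bytes regardless of how many values are added, which is the property that makes this family of sketches attractive for unbounded streams.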
   
   ## Brief change log
   
   - Added a new unified `ApproxCountDistinctAggFunctions` class that supports 
both batch and streaming modes
   - Deprecated the existing `BatchApproxCountDistinctAggFunctions` class, 
which is retained for backward compatibility
   - Updated `AggFunctionFactory` to use the new implementation and added 
validation to reject retraction scenarios in non-windowed streaming aggregations
   - Implemented the `merge()` method required for Window TVF aggregation
   - Optimized timestamp hashing to support high-precision TIMESTAMP values 
(nanosecond precision)
   - Added SQL function documentation in both English and Chinese
   - Added release notes for Flink 2.2
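
The `merge()` item above can be illustrated with a short Python sketch. HyperLogLog-style sketches are typically merged by taking the register-wise maximum, so partial results (for example, accumulators built over separate parts of a HOP or CUMULATE window) can be combined without revisiting the input. The function name `merge_registers` and the plain `bytearray` representation are hypothetical and are not Flink's API.

```python
def merge_registers(a: bytearray, b: bytearray) -> bytearray:
    """Combine two sketches by keeping the larger value of each register pair."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return bytearray(max(x, y) for x, y in zip(a, b))


# Two tiny partial sketches, e.g. built from two disjoint parts of one window.
slice_1 = bytearray([1, 0, 3, 2])
slice_2 = bytearray([0, 2, 2, 5])
window = merge_registers(slice_1, slice_2)
```

Because each register only keeps a maximum, a value that was added cannot later be subtracted back out. That is one plausible reason for rejecting retraction in non-windowed streaming aggregations.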
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   - Added `ApproxCountDistinctAggFunctionTest` with unit tests covering all 
supported data types (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, 
DATE, TIME, TIMESTAMP, TIMESTAMP_LTZ, VARCHAR/CHAR) and merge functionality
   - Added `StreamApproxCountDistinctITCase` with integration tests for:
     - TUMBLE window aggregation
     - HOP window aggregation
     - CUMULATE window aggregation
     - Different data types in streaming mode
     - NULL value handling
     - Precision validation with 10,000 records (relative error < 2%)
   - Added `BatchApproxCountDistinctITCase` with integration tests for:
     - All supported data types in batch mode
     - GROUP BY aggregation
     - NULL value handling
     - Empty table handling
     - Precision validation with 10,000 records
   
   ## Does this pull request potentially affect one of the following parts:
   
   - Dependencies (does it add or upgrade a dependency): **no**
   - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
   - The serializers: **no**
   - The runtime per-record code paths (performance sensitive): **no**
   - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **no**
   - The S3 file system connector: **no**
   
   ## Documentation
   
   - Does this pull request introduce a new feature? **yes**
   - If yes, how is the feature documented? **docs / JavaDocs**
     - Added function documentation in `docs/data/sql_functions.yml` and 
`docs/data/sql_functions_zh.yml`
     - Added comprehensive JavaDocs to the `ApproxCountDistinctAggFunctions` class
   

