voonhous commented on code in PR #8418:
URL: https://github.com/apache/hudi/pull/8418#discussion_r1162484156
##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java:
##########
@@ -419,4 +425,179 @@ public void testHoodieFlinkClusteringScheduleAfterArchive() throws Exception {
.stream().anyMatch(fg -> fg.getSlices()
.stream().anyMatch(s ->
s.getDataFilePath().contains(firstClusteringInstant))));
}
+
+  /**
+   * Test to ensure that creating a table with a column of type TIMESTAMP(9) throws an error.
+   */
+  @Test
+  public void testHoodieFlinkClusteringWithTimestampNanos() {
+    // create hoodie table and insert data into it
Review Comment:
> Can the append mode write timestamp(9) then?
Nope, APPEND can't write TIMESTAMP(9).
> BTW, Spark use the INT96 as the default output timestamp type in their parquet writer: https://github.com/apache/spark/blob/0a63a496bdced946a5d4825ca66df12de51d3a87/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L970
I don't think we are using INT96 by default; writing a nanos timestamp with Hudi-on-Spark falls back to INT64 (TIMESTAMP_MICROS).
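For context, the setting the linked SQLConf line defines is `spark.sql.parquet.outputTimestampType` (default `INT96`, with `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` as alternatives). A hedged config sketch of overriding it, should anyone want Spark itself to emit INT64 micros (the key and values are from the linked SQLConf; placing it in `spark-defaults.conf` is just one option):

```
# spark-defaults.conf -- override Spark's default parquet timestamp output
spark.sql.parquet.outputTimestampType  TIMESTAMP_MICROS
```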
Example:
```sql
CREATE TABLE `dev_hudi`.`timestamp_test`
(
`id` INTEGER,
`bigint_col` BIGINT,
`string_col` STRING,
`double_col` DOUBLE,
`timestamp_col` TIMESTAMP,
`operation` STRING
) USING hudi
TBLPROPERTIES (
'primaryKey' = 'id',
'type' = 'cow',
'preCombineField' = 'bigint_col'
)
LOCATION 'hdfs://path/to/timestamp_test';
-- use nanos; however, this will fall back to micros
INSERT INTO `dev_hudi`.`timestamp_test`
VALUES (1, 1000, "string_col_1", 1.1, TIMESTAMP "1970-01-01 00:00:01.001001001", "init"),
       (2, 2000, "string_col_2", 2.2, TIMESTAMP "1970-01-01 00:00:02.001001001", "init");

SELECT * FROM `dev_hudi`.`timestamp_test`;
```
Query output (note `timestamp_col` is truncated to microsecond precision):
```
20230411163354949  20230411163354949_0_0  1  5ea1112a-3f7d-4c6a-8f20-5275055ee330-0_0-17-20_20230411163354949.parquet  1  1000  string_col_1  1.1  1970-01-01 00:00:01.001001  init
20230411163354949  20230411163354949_0_1  2  5ea1112a-3f7d-4c6a-8f20-5275055ee330-0_0-17-20_20230411163354949.parquet  2  2000  string_col_2  2.2  1970-01-01 00:00:02.001001  init
```
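The truncation visible in the query output can be reproduced with plain `java.time` (a standalone sketch, not Hudi code): TIMESTAMP_MICROS cannot hold the last three nano digits, so `.001001001` becomes `.001001`.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MicrosFallbackDemo {
    public static void main(String[] args) {
        // Nano-precision literal from the INSERT above.
        Instant nanos = Instant.parse("1970-01-01T00:00:01.001001001Z");
        // TIMESTAMP_MICROS keeps microsecond precision only, so the
        // trailing three nano digits are dropped on write.
        Instant micros = nanos.truncatedTo(ChronoUnit.MICROS);
        System.out.println(micros); // 1970-01-01T00:00:01.001001Z
    }
}
```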
parquet-tools snippet:
```
############ Column(timestamp_col)[row group 0] ############
name: timestamp_col
path: timestamp_col
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: GZIP (space_saved: -20%)
total_compressed_size: 100
total_uncompressed_size: 83
```
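To spell out what that INT64 physical type holds: TIMESTAMP_MICROS stores microseconds since the Unix epoch, so the first row's value is 1,001,001 µs. A quick standalone check (again plain `java.time`, not Hudi or Parquet code):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class EpochMicrosDemo {
    public static void main(String[] args) {
        // The micros value actually stored for row 1 after the fallback.
        Instant t = Instant.parse("1970-01-01T00:00:01.001001Z");
        // TIMESTAMP_MICROS is a signed INT64 of microseconds since epoch.
        long epochMicros = ChronoUnit.MICROS.between(Instant.EPOCH, t);
        System.out.println(epochMicros); // 1001001
    }
}
```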
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]