[ANNOUNCE] Hudi Community Update(2023-05-29 ~ 2023-06-11)

2023-06-11 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-05-29 ~ 2023-06-11 with
updates on features and bug fixes.


===
Features

[Core] A more effective HoodieMergeHandler for COW tables with Parquet [1]
[Core] Support Hudi on Spark 3.4.0 [2]


[1] https://issues.apache.org/jira/browse/HUDI-4790
[2] https://issues.apache.org/jira/browse/HUDI-6198
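
As a quick illustration of the copy-on-write merge path that [1] optimizes, here
is a minimal Spark SQL sketch (table and column names are made up for
illustration; exact behavior depends on your Hudi and Spark versions):

CREATE TABLE hudi_trips_cow (
  uuid STRING,
  rider STRING,
  fare DOUBLE,
  ts BIGINT
) USING hudi
TBLPROPERTIES (type = 'cow', primaryKey = 'uuid', preCombineField = 'ts');

INSERT INTO hudi_trips_cow VALUES ('id-1', 'rider-A', 19.10, 1000);

-- Updating an existing key rewrites the affected Parquet base file through the
-- merge handle, which is the code path [1] speeds up.
UPDATE hudi_trips_cow SET fare = 25.00 WHERE uuid = 'id-1';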


Bugs

[Spark] Clustering enhancements [1]
[Core] Fix Memory Leak in RealtimeCompactedRecordReader [2]
[Flink] Make HoodieFlinkCompactor's parallelism of compact_task more
reasonable [3]
[Core] Fix the data table archiving and MDT cleaning configs [4]
[Core] Fix bug when hive queries Array type [5]
[Core] Sync TIMESTAMP_MILLIS to hive [6]
[Core] Enhancements to the MDT for improving performance of larger indexes
[7]
[Core] Hive sync uses the state transition time to avoid losing partitions [8]
[Core] Integrate the log compaction table service into the metadata table and
provide various bug fixes for it [10]


[1] https://issues.apache.org/jira/browse/HUDI-6277
[2] https://issues.apache.org/jira/browse/HUDI-6287
[3] https://issues.apache.org/jira/browse/HUDI-6293
[4] https://issues.apache.org/jira/browse/HUDI-6256
[5] https://issues.apache.org/jira/browse/HUDI-6309
[6] https://issues.apache.org/jira/browse/HUDI-6307
[7] https://issues.apache.org/jira/browse/HUDI-5238
[8] https://issues.apache.org/jira/browse/HUDI-6182
[9] https://issues.apache.org/jira/browse/HUDI-3775
[10] https://issues.apache.org/jira/browse/HUDI-6334


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-05-15 ~ 2023-05-28)

2023-05-28 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-05-15 ~ 2023-05-28 with
updates on features and bug fixes.


===
Features

[Flink] Update and Delete statements for Flink [1]
[Core] Bucket index supports bulk insert row writer [2]
[Core] Use Spark 3.2 as default Spark version [3]

[1] https://issues.apache.org/jira/browse/HUDI-6235
[2] https://issues.apache.org/jira/browse/HUDI-5994
[3] https://issues.apache.org/jira/browse/HUDI-3088
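
A hedged Flink SQL sketch of the new UPDATE/DELETE statement support in [1];
the table name, path, and options are illustrative, and Flink batch execution
mode is assumed:

CREATE TABLE hudi_users (
  id INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_users',
  'table.type' = 'MERGE_ON_READ'
);

-- Row-level changes issued directly from Flink SQL.
UPDATE hudi_users SET name = 'renamed' WHERE id = 1;
DELETE FROM hudi_users WHERE id = 2;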


Bugs

[Spark] Failed to add fields in BUCKET index table [1]
[Core] Prevent clean from running concurrently in Flink [2]
[Core] Clean deleted partition with clean policy [3]
[Flink] HoodieInternalWriteStatus marks failure with totalErrorRecords
increment [4]
[Core] Fix lazy clean scheduling rollback on a completed instant [5]
[Core] Add hardening checks on the transformer output schema when quarantine
is enabled/disabled [6]
[Flink] Fix HoodieMergeHandle shutdown sequence [7]
[Hive] Limit MDT deltacommits when data table has pending action [8]
[Core] Allow for offline compaction of MOR tables via spark streaming [9]
[Core] Parallelize deletion of files during rollback [10]
[Core] Create a marker file for every log file [11]


[1] https://issues.apache.org/jira/browse/HUDI-6210
[2] https://issues.apache.org/jira/browse/HUDI-6134
[3] https://issues.apache.org/jira/browse/HUDI-6104
[4] https://issues.apache.org/jira/browse/HUDI-6229
[5] https://issues.apache.org/jira/browse/HUDI-5675
[6] https://issues.apache.org/jira/browse/HUDI-6115
[7] https://issues.apache.org/jira/browse/HUDI-5238
[8] https://issues.apache.org/jira/browse/HUDI-5520
[9] https://issues.apache.org/jira/browse/HUDI-3775
[10] https://issues.apache.org/jira/browse/HUDI-6213
[11] https://issues.apache.org/jira/browse/HUDI-1517


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-05-01 ~ 2023-05-14)

2023-05-14 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-05-01 ~ 2023-05-14 with
updates on features and bug fixes.


===
Features

[Core] Adding auto generation of record keys support to Hudi/Spark [1]
[Core] Support partial insert in MERGE INTO command [2]

[1] https://issues.apache.org/jira/browse/HUDI-5514
[2] https://issues.apache.org/jira/browse/HUDI-6105
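
For the partial-insert support in [2], a minimal hedged Spark SQL sketch with
hypothetical table and column names; the INSERT clause lists only a subset of
the target columns:

MERGE INTO hudi_prices AS t
USING staged_prices AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.price = s.price
-- Partial insert: only id and price are supplied; the remaining target columns
-- are left to their defaults.
WHEN NOT MATCHED THEN INSERT (id, price) VALUES (s.id, s.price);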


Bugs

[Flink] Strengthen Flink clustering commit and rollback strategy [1]
[Core] Fixing checkpoint management for multiple streaming writers [2]
[Core] Support multiple transformers using the same config keys in
DeltaStreamer [3]
[Flink] Fix potential data loss for flink streaming source from table with
multi writer [4]
[Core] Fix global index duplicates and handle custom payloads when updating
partitions [5]
[Core] DeltaStreamer finishes failed compactions before ingestion [6]
[Flink] Clustering operation on consistent hashing index resulting in
duplicate data [7]
[Hive] Hive3 query returns null when the where clause has a partition field
[8]
[Core] Unify call procedure options [9]


[1] https://issues.apache.org/jira/browse/HUDI-6158
[2] https://issues.apache.org/jira/browse/HUDI-6071
[3] https://issues.apache.org/jira/browse/HUDI-6113
[4] https://issues.apache.org/jira/browse/HUDI-6157
[5] https://issues.apache.org/jira/browse/HUDI-5968
[6] https://issues.apache.org/jira/browse/HUDI-6147
[7] https://issues.apache.org/jira/browse/HUDI-6047
[8] https://issues.apache.org/jira/browse/HUDI-5308
[9] https://issues.apache.org/jira/browse/HUDI-6122



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-04-17 ~ 2023-04-30)

2023-04-30 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-04-17 ~ 2023-04-30 with
updates on features and bug fixes.


===
Features

[Flink] Files pruning for bucket index table pk filtering queries [1]
[Core] Add Java 11 and 17 to bundle validation [2]
[Flink] Support Flink 1.17 [3]

[1] https://issues.apache.org/jira/browse/HUDI-6070
[2] https://issues.apache.org/jira/browse/HUDI-6091
[3] https://issues.apache.org/jira/browse/HUDI-6057
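
A hedged Flink SQL sketch of the primary-key point lookup that the bucket-index
file pruning in [1] speeds up (names and option values are illustrative):

CREATE TABLE hudi_orders (
  order_id BIGINT PRIMARY KEY NOT ENFORCED,
  amount DOUBLE,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_orders',
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '8'
);

-- With the bucket index, an equality filter on the primary key can be pruned
-- down to the single bucket (file group) that may contain the key.
SELECT * FROM hudi_orders WHERE order_id = 1001;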


Bugs

[Core] Standardise TIMESTAMP(6) format when writing to Parquet files [1]
[Flink] Flink Hive Catalog throws exception through SQL Client when table
type contains smallint [2]
[Core] Let the jetty server in TimelineService create daemon threads [3]
[Spark] Improve the performance of checking for valid commits when tagging
record locations [4]
[Core] Fix 'table does not exist' error when using 'db.table' in
createHoodieClientFromPath [5]
[Core] Support configuring minPartitions when reading from Kafka [6]
[Flink] Flink Hudi write supports committing on an empty batch [7]



[1] https://issues.apache.org/jira/browse/HUDI-6052
[2] https://issues.apache.org/jira/browse/HUDI-6071
[3] https://issues.apache.org/jira/browse/HUDI-6009
[4] https://issues.apache.org/jira/browse/HUDI-6099
[5] https://issues.apache.org/jira/browse/HUDI-5957
[6] https://issues.apache.org/jira/browse/HUDI-6019
[7] https://issues.apache.org/jira/browse/HUDI-6127




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-04-03 ~ 2023-04-16)

2023-04-16 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-04-03 ~ 2023-04-16 with
updates on features and bug fixes.


===
Features

[Flink] Support partition pruning for the Flink streaming source at runtime [1]
[Core] Support append mode by default for MOR table with INSERT operation
[2]


[1] https://issues.apache.org/jira/browse/HUDI-5880
[2] https://issues.apache.org/jira/browse/HUDI-6045
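
A hedged Flink SQL sketch of a streaming read with a partition predicate, the
case that the runtime partition pruning in [1] targets (table name, path, and
option values are illustrative):

CREATE TABLE hudi_events (
  id STRING,
  dt STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_events',
  'read.streaming.enabled' = 'true',
  'read.start-commit' = 'earliest'
);

-- The partition filter can be applied while the streaming source monitors new
-- commits, instead of listing every partition.
SELECT * FROM hudi_events WHERE dt = '2023-04-10';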


Bugs

[Flink] Clean the checkpoint metadata when the JM restarts [1]
[Flink] Fix async compact/clustering serdes conflicts caused by
WatermarkStatus [2]
[Core] Fix incremental clean not working due to archiving [3]
[Spark] Fix date conversion issue when performing partition pruning on
Spark [4]
[Core] Avoid missing data during incremental queries [5]
[Core] Add simpleBucketPartitioner to support using the simple bucket index
with bulk insert [6]



[1] https://issues.apache.org/jira/browse/HUDI-6030
[2] https://issues.apache.org/jira/browse/HUDI-6038
[3] https://issues.apache.org/jira/browse/HUDI-5955
[4] https://issues.apache.org/jira/browse/HUDI-5989
[5] https://issues.apache.org/jira/browse/HUDI-5990
[6] https://issues.apache.org/jira/browse/HUDI-5690




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-03-20 ~ 2023-04-02)

2023-04-02 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-03-20 ~ 2023-04-02 with
updates on features and bug fixes.


===
Features

[Flink] Automatically infer key generator type [1]
[Core] Infer cleaning policy based on clean configs [2]


[1] https://issues.apache.org/jira/browse/HUDI-5929
[2] https://issues.apache.org/jira/browse/HUDI-5954
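
For the cleaning-policy inference in [2], a hedged Spark SQL sketch of the
related cleaner configs; the keys below exist in Hudi, but the exact inference
rules are an assumption here:

-- Instead of setting hoodie.cleaner.policy explicitly, only a retention config
-- is provided and the policy is inferred from it.
SET hoodie.cleaner.hours.retained = 72;          -- implies KEEP_LATEST_BY_HOURS
-- SET hoodie.cleaner.commits.retained = 10;     -- would imply KEEP_LATEST_COMMITS
-- SET hoodie.cleaner.fileversions.retained = 2; -- would imply KEEP_LATEST_FILE_VERSIONS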


Bugs

[Core] Fixing pending instant deduction to trigger compaction in MDT [1]
[Flink] Fix bucket stream writer fileId not found exception [2]
[Spark] Add partition ordering for full table scans [3]
[Spark] Support savepoint call procedure with base path in Spark [4]
[Spark] Update the timeline timezone when writing in Spark [5]
[Core] Fix clustering on bootstrapped tables [6]
[Core] Fix Date to String column schema evolution [7]
[Core] Empty preCombineKey should never be stored in hoodie.properties [8]
[Core] Fix DeltaStreamer shutdown when the post-write termination strategy is
enabled [9]
[Core] Connection leak for lock provider [10]
[Flink] Auto generate client id for Flink multi writer [11]
[Flink] Always write Parquet files for the insert overwrite operation [12]




[1] https://issues.apache.org/jira/browse/HUDI-5950
[2] https://issues.apache.org/jira/browse/HUDI-5822
[3] https://issues.apache.org/jira/browse/HUDI-5967
[4] https://issues.apache.org/jira/browse/HUDI-5941
[5] https://issues.apache.org/jira/browse/HUDI-5978
[6] https://issues.apache.org/jira/browse/HUDI-5891
[7] https://issues.apache.org/jira/browse/HUDI-5977
[8] https://issues.apache.org/jira/browse/HUDI-5986
[9] https://issues.apache.org/jira/browse/HUDI-5928
[10] https://issues.apache.org/jira/browse/HUDI-5993
[11] https://issues.apache.org/jira/browse/HUDI-6005
[12] https://issues.apache.org/jira/browse/HUDI-6010




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-03-06 ~ 2023-03-19)

2023-03-19 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-03-06 ~ 2023-03-19 with
updates on features and bug fixes.


===
Feature

[Flink] Enable metadata table by default for Flink [1]


[1] https://issues.apache.org/jira/browse/HUDI-4372


Bugs

[Core] Avoid throwing error if data table does not exist in Metadata Table
Validator [1]
[Spark] Insert overwrite into bucket table would generate new file group id
[2]
[Spark] Support more than one update action in MERGE INTO [3]
[Core] Fix table not being read correctly when a computed column is in the
middle [4]
[Core] Fix the validation of partition listing in metadata table validator
[5]



[1] https://issues.apache.org/jira/browse/HUDI-5883
[2] https://issues.apache.org/jira/browse/HUDI-5857
[3] https://issues.apache.org/jira/browse/HUDI-5904
[4] https://issues.apache.org/jira/browse/HUDI-5913
[5] https://issues.apache.org/jira/browse/HUDI-5919




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-02-20 ~ 2023-03-05)

2023-03-05 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-02-20 ~ 2023-03-05 with
updates on bug fixes.


===


Bugs

[Core] Fix timestamp(6) field long overflow  [1]
[Spark] Fix Hudi table not being readable by Spark after an update
operation [2]
[Core] Handle empty payloads for AbstractDebeziumAvroPayload [3]
[Core] HoodieTimelineArchiver archives the latest instant before inflight
replacecommit [4]
[Core] Add support for multiple metric reporters and metric labels [5]
[Spark] Adding auto inferring partition from incoming df [6]
[Core] Fix HoodieMetadataFileSystemView serving stale view at the timeline
server [7]



[1] https://issues.apache.org/jira/browse/HUDI-5329
[2] https://issues.apache.org/jira/browse/HUDI-5557
[3] https://issues.apache.org/jira/browse/HUDI-5791
[4] https://issues.apache.org/jira/browse/HUDI-5728
[5] https://issues.apache.org/jira/browse/HUDI-5847
[6] https://issues.apache.org/jira/browse/HUDI-5796
[7] https://issues.apache.org/jira/browse/HUDI-5863




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-02-06 ~ 2023-02-19)

2023-02-19 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-02-06 ~ 2023-02-19 with
updates on bug fixes.


===
Features

[Core] [RFC-48] Create RFC for LogCompaction support to Hudi [1]
[Core] Support archive command for spark sql [2]


[1] https://issues.apache.org/jira/browse/HUDI-3580
[2] https://issues.apache.org/jira/browse/HUDI-5773



Bugs

[Spark] Fix error when Spark reads a Hudi table created by Flink without
preCombine fields [1]
[Flink] Fix NPE when the filter condition contains a null literal while using
column stats data skipping for Flink [2]
[Flink] Fix duplicate key error when insert_overwrite writes the same partition
with multiple writers [3]
[Flink] Fix data loss when Flink batch read skips clustering data [4]
[Spark] Fix Deletes issued without any prior commits [5]
[Flink] Support multi-writer for bucket index with a guarded lock [6]



[1] https://issues.apache.org/jira/browse/HUDI-5329
[2] https://issues.apache.org/jira/browse/HUDI-5557
[3] https://issues.apache.org/jira/browse/HUDI-5270
[4] https://issues.apache.org/jira/browse/HUDI-5734
[5] https://issues.apache.org/jira/browse/HUDI-5737
[6] https://issues.apache.org/jira/browse/HUDI-5673




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2023-01-02 ~ 2023-02-05)

2023-02-05 Thread leesf
Dear community,

Nice to share Hudi community updates for 2023-01-02 ~ 2023-02-05 with
updates on bug fixes.


===
Features

[Core] Add support for a keyless workflow by building an ID from values within
the record [1]
[Core] Add client for Hudi table service manager (TSM) [2]
[Flink] Support CDC for flink bounded source [3]
[Core] Early Conflict Detection For Multi-writer [4]

[1] https://issues.apache.org/jira/browse/HUDI-5514
[2] https://issues.apache.org/jira/browse/HUDI-4148
[3] https://issues.apache.org/jira/browse/HUDI-5559
[4] https://issues.apache.org/jira/browse/HUDI-1575
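
For the early conflict detection in [4], a hedged Spark SQL sketch of the
multi-writer configs involved; the lock provider is illustrative and the
early-conflict-detection key name is my assumption of what this work introduces:

SET hoodie.write.concurrency.mode = optimistic_concurrency_control;
SET hoodie.write.lock.provider = org.apache.hudi.client.transaction.lock.InProcessLockProvider;
-- Detect conflicting writes while they are still in progress instead of
-- failing only at commit time.
SET hoodie.write.concurrency.early.conflict.detection.enable = true;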


Bugs

[Flink] Fix concurrency conflict for flink async compaction with latency
marker [1]
[Spark] Fix insert overwrite table for partitioned table [2]
[Flink] Fix issue where reading data via the HoodieHiveCatalog causes Spark
writes to fail [3]
[Spark] Support json schema in SchemaRegistryProvider [4]
[Spark] Closing write client for spark ds writer in all cases (including
exception) [5]
[Flink] Fix error where a table created and written by Flink causes Spark
ALTER TABLE to fail [6]
[Spark] Fix CTAS and Insert Into to avoid combine-on-insert by default [7]
[Flink] BucketIndexPartitioner partition algorithm skew [8]
[Flink] Bucket index does not work correctly for multi-writer scenarios [9]


[1] https://issues.apache.org/jira/browse/HUDI-5504
[2] https://issues.apache.org/jira/browse/HUDI-5317
[3] https://issues.apache.org/jira/browse/HUDI-5275
[4] https://issues.apache.org/jira/browse/HUDI-2608
[5] https://issues.apache.org/jira/browse/HUDI-5655
[6] https://issues.apache.org/jira/browse/HUDI-5585
[7] https://issues.apache.org/jira/browse/HUDI-5684
[8] https://issues.apache.org/jira/browse/HUDI-5671
[9] https://issues.apache.org/jira/browse/HUDI-5682



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-12-19 ~ 2023-01-01)

2023-01-01 Thread leesf
Dear community,

Happy New Year. Nice to share Hudi community bi-weekly updates for
2022-12-19 ~ 2023-01-01 with updates on bug fixes.


===
Bugs

[Flink] Send the bootstrap event if the JM also rebooted [1]
[Flink] Flink MOR table streaming read throws NPE [2]
[Flink] Flink streaming read skips uncommitted instants [3]
[Spark] Avoid virtual key info for COW table in the input format [4]
[Flink] HoodieFlinkStreamer supports async clustering for append mode [5]



[1] https://issues.apache.org/jira/browse/HUDI-5412
[2] https://issues.apache.org/jira/browse/HUDI-5399
[3] https://issues.apache.org/jira/browse/HUDI-5456
[4] https://issues.apache.org/jira/browse/HUDI-5411
[5] https://issues.apache.org/jira/browse/HUDI-5343



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-12-05 ~ 2022-12-18)

2022-12-18 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-12-05 ~ 2022-12-18
with updates on bug fixes.


===
Features


[Core] Add call help procedure [1]
[Spark] Hudi supports Spark TVF [2]
[Spark] Add new bulk insert sort modes repartitioning data by partition
path [3]
[Spark] Upgrade to spark 3.3.1 & 3.2.2 [4]



[1] https://issues.apache.org/jira/browse/HUDI-5314
[2] https://issues.apache.org/jira/browse/HUDI-5340
[3] https://issues.apache.org/jira/browse/HUDI-5342
[4] https://issues.apache.org/jira/browse/HUDI-4411

===
Bugs

[Spark] Support type change for schema on read + reconcile schema [1]
[Spark] Fix checkpoint reading for structured streaming [2]
[Core] Flink async compaction is not thread safe when using watermarks [3]
[Spark] Fix failure handling with spark datasource write [4]
[Spark] Fixing performance traps in the Spark SQL MERGE INTO implementation [5]
[Spark] Fixing Create Table as Select (CTAS) performance gaps [6]
[Flink] Fix OOM causing compaction events to be lost [7]
[Spark] Checkpoint management for multi-writer scenarios [8]


[1] https://issues.apache.org/jira/browse/HUDI-5294
[2] https://issues.apache.org/jira/browse/HUDI-5334
[3] https://issues.apache.org/jira/browse/HUDI-3661
[4] https://issues.apache.org/jira/browse/HUDI-5163
[5] https://issues.apache.org/jira/browse/HUDI-5347
[6] https://issues.apache.org/jira/browse/HUDI-5346
[7] https://issues.apache.org/jira/browse/HUDI-5350
[8] https://issues.apache.org/jira/browse/HUDI-4432


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-11-21 ~ 2022-12-04)

2022-12-04 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-11-21 ~ 2022-12-04
with updates on bug fixes.


===
Features


[Flink] Flink engine support for comprehensive schema evolution [1]



[1] https://issues.apache.org/jira/browse/HUDI-3981



===
Bugs

[Core] Allow user specified start offset for streaming query [1]
[Core] Fix bugs in schema evolution client with lost operation field and
not found schema [2]
[Core] Improve exporter file listing and copy perf [3]
[Flink] ClusteringCommitSink supports to rollback clustering [4]
[Core] Fix insert into sql command with strict sql insert mode [5]
[Core] Addressing schema handling issues in the write path [6]
[Flink] Streaming read skip clustering [7]
[Flink] Prevent Hudi from reading the entire timeline when performing a LATEST
streaming read [8]
[Core] Support more configs in the cluster procedure [9]


[1] https://issues.apache.org/jira/browse/HUDI-5162
[2] https://issues.apache.org/jira/browse/HUDI-5244
[3] https://issues.apache.org/jira/browse/HUDI-712
[4] https://issues.apache.org/jira/browse/HUDI-5252
[5] https://issues.apache.org/jira/browse/HUDI-5260
[6] https://issues.apache.org/jira/browse/HUDI-4588
[7] https://issues.apache.org/jira/browse/HUDI-5234
[8] https://issues.apache.org/jira/browse/HUDI-5007
[9] https://issues.apache.org/jira/browse/HUDI-5278


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-11-07 ~ 2022-11-20)

2022-11-20 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-11-07 ~ 2022-11-20
with updates on bug fixes.


===
Features


[Flink] Add Call show_table_properties for Spark SQL [1]
[Core] [RFC-60] Optimized storage layout for Cloud Object Stores [2]



[1] https://issues.apache.org/jira/browse/HUDI-5178
[2] https://issues.apache.org/jira/browse/HUDI-3625
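
A hedged Spark SQL sketch of the new call procedure in [1]; the named-argument
form follows the usual convention of Hudi procedures and the table name is
illustrative:

-- Inspect the hoodie.properties of a table from SQL.
CALL show_table_properties(table => 'hudi_trips_cow');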



===
Bugs

[Core] Rollback failed with log file not found when rollOver in rollback
process [1]
[Core] Fix incremental source to consider inflight commits before completed
commits [2]
[Spark] Fix Orc support broken for Spark 3.x and more [3]
[Flink] Flink table service job fs view conf overwrites that of the writing
job [4]
[Core] Lazy fetching partition path & file slice for HoodieFileIndex [5]
[Core] Fixing FileIndex impls to properly batch partitions listing [6]



[1] https://issues.apache.org/jira/browse/HUDI-5025
[2] https://issues.apache.org/jira/browse/HUDI-5176
[3] https://issues.apache.org/jira/browse/HUDI-4496
[4] https://issues.apache.org/jira/browse/HUDI-5228
[5] https://issues.apache.org/jira/browse/HUDI-4812
[6] https://issues.apache.org/jira/browse/HUDI-4812


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-10-24 ~ 2022-11-06)

2022-11-06 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-10-24 ~ 2022-11-06
with updates on bug fixes.


===
Features


[Flink] Supports dropPartition for Flink catalog [1]
[Core] Glue supports dropping partitions [2]
[Core] Support schema evolution for Hive/Presto [3]
[Flink] Source operators (monitor and reader) support user-specified uid [4]
[Flink] Add Call show_commit_extra_metadata for Spark SQL [5]
[Core] Use the lock-free Disruptor message queue to improve Hudi writing
efficiency [6]



[1] https://issues.apache.org/jira/browse/HUDI-5049
[2] https://issues.apache.org/jira/browse/HUDI-4809
[3] https://issues.apache.org/jira/browse/HUDI-5000
[4] https://issues.apache.org/jira/browse/HUDI-5102
[5] https://issues.apache.org/jira/browse/HUDI-5105
[6] https://issues.apache.org/jira/browse/HUDI-3963
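
For the drop-partition support in [1] and [2], a hedged Spark SQL sketch (table
name and partition value are illustrative):

-- Drop a single partition of a Hudi table; the matching file groups are
-- deleted later by the cleaner.
ALTER TABLE hudi_events DROP PARTITION (dt = '2022-10-01');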


===
Bugs

[Core] Using hudi to build a large number of tables in spark on hive causes
OOM [1]
[Core] Fix clustering schedule problem in Flink when schedule clustering is
enabled and async clustering is disabled [2]
[Flink] Fix Flink catalog error when reading a Spark table: the primary key
column cannot be nullable [3]
[Core] Fix MERGE INTO producing duplicate rows with insert-only when no
preCombineField is set [4]
[Core] Fix msck repair for external Hudi tables [5]
[Core] Presto/Hive respect the payload when merging Parquet and log files while
reading MOR tables [6]



[1] https://issues.apache.org/jira/browse/HUDI-4281
[2] https://issues.apache.org/jira/browse/HUDI-5042
[3] https://issues.apache.org/jira/browse/HUDI-5058
[4] https://issues.apache.org/jira/browse/HUDI-4946
[5] https://issues.apache.org/jira/browse/HUDI-5057
[6] https://issues.apache.org/jira/browse/HUDI-4898


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-09-26 ~ 2022-10-23)

2022-10-23 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-09-26 ~ 2022-10-23
with updates on bug fixes.


===
Features


[Core] Implement CDC Read in Spark [1]
[Core] Add Kerberos kinit command support [2]
[Core] Add support for unraveling proto schemas in
ProtoClassBasedSchemaProvider [3]
[Core] Add incremental source from GCS to Hudi [4]
[Flink] Implement change log feed for Flink [5]
[Core] Support log compaction action for MOR tables [6]
[Core] Extend InProcessLockProvider to support multiple table ingestion [7]
[Core] Implement Create/Drop/Show/Refresh Secondary Index [8]
[Core] Early Conflict Detection For Multi-writer [9]
[Flink] Support all the hive sync options for flink sql [10]



[1] https://issues.apache.org/jira/browse/HUDI-3478
[2] https://issues.apache.org/jira/browse/HUDI-4718
[3] https://issues.apache.org/jira/browse/HUDI-4904
[4] https://issues.apache.org/jira/browse/HUDI-4850
[5] https://issues.apache.org/jira/browse/HUDI-4916
[6] https://issues.apache.org/jira/browse/HUDI-3900
[7] https://issues.apache.org/jira/browse/HUDI-4963
[8] https://issues.apache.org/jira/browse/HUDI-4293
[9] https://issues.apache.org/jira/browse/HUDI-1575
[10] https://issues.apache.org/jira/browse/HUDI-5046

===
Bugs

[Core] Fixing repeated trigger of data file creations w/ clustering [1]
[Core] Fix HoodieSnapshotExporter for writing to a different S3 bucket or
FS [2]
[Core] Fix schema to include partition columns in bootstrap operation [3]
[Core] Fix the issue of Mor log skipping complete blocks when reading data
[4]
[Core] Relaxing MERGE INTO constraints to permit limited casting operations
w/in matched-on conditions [5]
[Core] Fix temporary data loss in READ_OPTIMIZED read mode during
compaction [6]
[Core] Fixing invalid min/max record key stats in Parquet metadata [7]
[Core] Fixing reading from metadata table when there are no inflight
commits [8]


[1] https://issues.apache.org/jira/browse/HUDI-4760
[2] https://issues.apache.org/jira/browse/HUDI-4913
[3] https://issues.apache.org/jira/browse/HUDI-4453
[4] https://issues.apache.org/jira/browse/HUDI-2780
[5] https://issues.apache.org/jira/browse/HUDI-4861
[6] https://issues.apache.org/jira/browse/HUDI-4308
[7] https://issues.apache.org/jira/browse/HUDI-4992
[8] https://issues.apache.org/jira/browse/HUDI-4952

Best,
Leesf


Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread leesf
Great job!

Alexey Kudinkin wrote on Thursday, Oct 20, 2022 at 03:45:

> Thanks Zhaojing for masterfully navigating this release!
>
> On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar  wrote:
>
> > Great job everyone!
> >
> > On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:
> >
> > > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > > 0.12.1.
> > >
> > > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> > > and Incrementals. Apache Hudi manages storage of large analytical
> > > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> > > storage) and provides the ability to query them.
> > >
> > > This release comes 2 months after 0.12.0. It includes more than
> > > 150 resolved issues, comprising of a few new features as well as
> > > general improvements and bug fixes. You can read the release
> > > highlights at https://hudi.apache.org/releases/release-0.12.1.
> > >
> > > For details on how to use Hudi, please look at the quick start page
> > located
> > > at https://hudi.apache.org/docs/quick-start-guide.html
> > >
> > > If you'd like to download the source release, you can find it here:
> > > https://github.com/apache/hudi/releases/tag/release-0.12.1
> > >
> > > Release notes including the resolved issues can be found here:
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
> > >
> > > We welcome your help and feedback. For more information on how to
> report
> > > problems, and to get involved, visit the project website at
> > > https://hudi.apache.org
> > >
> > > Thanks to everyone involved!
> > >
> > > Release Manager
> > >
> >
>


[ANNOUNCE] Hudi Community Update(2022-09-12 ~ 2022-09-25)

2022-09-25 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-09-12 ~ 2022-09-25
with updates on bug fixes.


===
Features


[Core] Consistent bucket index: bucket resizing (split) & concurrent
write during resizing [1]
[Core] Add Postgres Schema Name to Postgres Debezium Source [2]
[Core] Support compaction strategy based on delta log file num [3]
[Core] Support partial update payload [4]
[Core] Implement CDC Write in Spark [5]
[Core] Supporting delete savepoint for MOR [6]
[Core] Support hiveSync command based on Call Produce Command [7]



[1] https://issues.apache.org/jira/browse/HUDI-3558
[2] https://issues.apache.org/jira/browse/HUDI-4833
[3] https://issues.apache.org/jira/browse/HUDI-4842
[4] https://issues.apache.org/jira/browse/HUDI-3304
[5] https://issues.apache.org/jira/browse/HUDI-3478
[6] https://issues.apache.org/jira/browse/HUDI-4883
[7] https://issues.apache.org/jira/browse/HUDI-4559
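
For the partial update payload in [4], a hedged Spark SQL sketch; the payload
class name is the one I believe this feature adds, and passing it through table
properties is an assumption (table and columns are illustrative):

CREATE TABLE hudi_profiles (
  id STRING,
  name STRING,
  email STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload'
);

-- With the partial update payload, an upsert that carries null for some columns
-- keeps the existing values instead of overwriting them with null.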

===
Bugs

[Core] Fix AWSDmsAvroPayload#getInsertValue/combineAndGetUpdateValue to invoke
the correct API [1]
[Flink] Hudi Flink supports GLOBAL_BLOOM, GLOBAL_SIMPLE, and BUCKET index types [2]
[Core] Fix hoodie.logfile.max.size not taking effect, causing log files to grow
too large [3]
[Spark] Fix key generator inference not working on the Spark SQL side [4]
[Core] Fix HoodieSimpleBucketIndex not considering the bucket number in log
files [5]
[Core] Fix file groups pending compaction not being queryable via the _ro
table [6]
[Spark] Support Clustering row writer to improve performance [7]



[1] https://issues.apache.org/jira/browse/HUDI-4831
[2] https://issues.apache.org/jira/browse/HUDI-4628
[3] https://issues.apache.org/jira/browse/HUDI-4780
[4] https://issues.apache.org/jira/browse/HUDI-4813
[5] https://issues.apache.org/jira/browse/HUDI-4808
[6] https://issues.apache.org/jira/browse/HUDI-4729
[7] https://issues.apache.org/jira/browse/HUDI-4363


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-08-29 ~ 2022-09-11)

2022-09-11 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-08-29 ~ 2022-09-11
with updates on bug fixes.


===
Features


[Core] Add support for ProtoKafkaSource [1]
[CLI] Adding support to hudi-cli to repair deprecated partition [2]
[CLI] Support rename partition through CLI [3]
[Flink] Support TIMESTAMP_LTZ type for flink [4]


[1] https://issues.apache.org/jira/browse/HUDI-4418
[2] https://issues.apache.org/jira/browse/HUDI-4642
[3] https://issues.apache.org/jira/browse/HUDI-4648
[4] https://issues.apache.org/jira/browse/HUDI-4782


===
Bugs

[Core] Support batch synchronization of partitions to HMS to avoid timeouts
[1]
[Core] Fix wrong AWS Glue partition location in updatePartition [2]
[Core] Fixing incremental source for MOR table [3]
[Spark] Make HoodieStreamingSink idempotent [4]
[Spark] Fix merge into use unresolved assignment [5]
[Core] Fix KryoException when bulk inserting into a non-bucket-index Hudi
table [6]
[Spark] Fix MERGE INTO for source tables with a different column order [7]
[Core] Fix HoodieBackedTableMetadata concurrent reading issue [8]
[Core] Allow hoodie read client to choose index [9]


[1] https://issues.apache.org/jira/browse/HUDI-4582
[2] https://issues.apache.org/jira/browse/HUDI-4742
[3] https://issues.apache.org/jira/browse/HUDI-4775
[4] https://issues.apache.org/jira/browse/HUDI-4389
[5] https://issues.apache.org/jira/browse/HUDI-4776
[6] https://issues.apache.org/jira/browse/HUDI-4795
[7] https://issues.apache.org/jira/browse/HUDI-4797
[8] https://issues.apache.org/jira/browse/HUDI-3453
[9] https://issues.apache.org/jira/browse/HUDI-4763

Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-08-15 ~ 2022-08-28)

2022-08-28 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-08-15 ~ 2022-08-28
with updates on bug fixes.


===
Features


[DeltaStreamer] Adding PulsarSource to DeltaStreamer to support ingesting
from Apache Pulsar [1]
[CLI] Add timeline commands in hudi-cli [2]



[1] https://issues.apache.org/jira/browse/HUDI-4616
[2] https://issues.apache.org/jira/browse/HUDI-3579


===
Bugs

[Core] Fallback to full table scan with incremental query when files are
cleaned up or archived for MOR table [1]
[Core] Fixed timeline based marker thread safety issue [2]
[Core] Read error from MOR table after compaction with timestamp
partitioning [3]
[Spark] MergeInto syntax WHEN MATCHED is optional but must be set [4]
[Core] Infer cleaner policy when the write concurrency mode is OCC [5]
[Core] Fix savepoints will be cleaned in keeping latest versions policy [6]
[Core] Fixing DebeziumSource to properly commit offsets [7]


[1] https://issues.apache.org/jira/browse/HUDI-3189
[2] https://issues.apache.org/jira/browse/HUDI-4574
[3] https://issues.apache.org/jira/browse/HUDI-4601
[3] https://issues.apache.org/jira/browse/HUDI-4477
[4] https://issues.apache.org/jira/browse/HUDI-4643
[5] https://issues.apache.org/jira/browse/HUDI-4676
[6] https://issues.apache.org/jira/browse/HUDI-4515
[6] https://issues.apache.org/jira/browse/HUDI-4616


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-08-01 ~ 2022-08-14)

2022-08-14 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-08-01 ~ 2022-08-14
with updates on bug fixes.


===
Features

[Spark] Support creating RO/RT tables via Spark SQL [1]
[Flink] Support online compaction in the Flink batch mode write [2]
[Flink] Support the retain-hours cleaning policy for Flink [3]



[1] https://issues.apache.org/jira/browse/HUDI-4487
[2] https://issues.apache.org/jira/browse/HUDI-4385
[3] https://issues.apache.org/jira/browse/HUDI-4544


===
Bugs

[Flink] Repair config "hive_sync.metastore.uris" in flink sql hive schema
sync is not effective [1]
[Spark] Throwing exception when restore is attempted with
hoodie.arhive.beyond.savepoint is enabled [2]
[Flink] Adjust partition number of flink sink task [3]
[Spark] optimize CTAS to adapt to saveAsTable api in different modes [4]
[Spark] Repair the exception when reading optimized query for mor in hive
and presto/trino [5]
[Flink] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if
precombine field is not ts [6]



[1] https://issues.apache.org/jira/browse/HUDI-4510
[2] https://issues.apache.org/jira/browse/HUDI-4501
[3] https://issues.apache.org/jira/browse/HUDI-4477
[4] https://issues.apache.org/jira/browse/HUDI-4514
[5] https://issues.apache.org/jira/browse/HUDI-4508
[6] https://issues.apache.org/jira/browse/HUDI-4572


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-07-18 ~ 2022-07-31)

2022-07-31 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-07-18 ~ 2022-07-31
with updates on bug fixes.


===
Features

[Core] Add FileBasedLockProvider [1]
[Spark] Allow loading external configs while querying Hudi tables with
Spark [2]
[Spark] Add sync validate procedure [3]
[Spark] Support Hudi with Spark 3.3.0 [4]


[1] https://issues.apache.org/jira/browse/HUDI-4065
[2] https://issues.apache.org/jira/browse/HUDI-3764
[3] https://issues.apache.org/jira/browse/HUDI-3510
[4] https://issues.apache.org/jira/browse/HUDI-4186


===
Bugs

[Spark] Porting Nested Schema Pruning optimization for Hudi's custom
Relations [1]
[Spark] Replacing UDF in Bulk Insert w/ RDD transformation [2]
[Spark] Fix missing bloom filters in metadata table in non-partitioned
table [3]
[Spark] Fix insert into dynamic partition write misalignment [4]
[Spark] Make NONE sort mode as default for bulk insert [5]
[Spark] Fix MERGE INTO SQL data quality issues in concurrent scenarios [6]
[Core] Optimize performance of Column Stats Index reading in Data Skipping
[7]
[Spark] Addressing Spark SQL vs Spark DS performance gap [8]



[1] https://issues.apache.org/jira/browse/HUDI-3896
[2] https://issues.apache.org/jira/browse/HUDI-3993
[3] https://issues.apache.org/jira/browse/HUDI-4400
[4] https://issues.apache.org/jira/browse/HUDI-4404
[5] https://issues.apache.org/jira/browse/HUDI-4071
[6] https://issues.apache.org/jira/browse/HUDI-4348
[7] https://issues.apache.org/jira/browse/HUDI-4250
[8] https://issues.apache.org/jira/browse/HUDI-4081


Best,
Leesf


Re: Request for contribution permission

2022-07-26 Thread leesf
done and welcome

chankyeong won wrote on Tuesday, Jul 26, 2022 at 08:17:

> Hello, Hudi.
>
> I made the JIRA Issues about hudi-cli. (
> https://issues.apache.org/jira/browse/HUDI-4433)
> I want to contribute to it, so I am requesting self-assign permission for the
> JIRA ticket.
>
> Thank you!
>


Re: Request for contribution permission

2022-07-26 Thread leesf
done and welcome

Lewin Ma wrote on Tuesday, Jul 26, 2022 at 14:58:

> Hello, Hudi.
>
> I made the JIRA Issues about Hudi insert. (
> https://issues.apache.org/jira/browse/HUDI-4477)
> I want to make a contribution to it, so I am requesting self-assign permission
> for the JIRA ticket.
>
> --
> Warmest Regards~
> From:  Lewin Ma
>


Re: Apply to be a Contributor

2022-07-19 Thread leesf
done and welcome.

JerryYue <272614...@qq.com.invalid> wrote on Tuesday, Jul 19, 2022 at 13:02:

> Hi everyone,
>
> I have been using Apache Hudi for a while and would like to make
> more contributions to it. Would you please add me as a contributor? My Jira
> username is YueMeng. I have already contributed some PRs. Can you give me
> self-assign permission?
>
>
>
> The PRs that were merged with my contributions:
>
>
>
> https://github.com/apache/hudi/pull/5445/files
>
> https://github.com/apache/hudi/pull/5049
>
> https://github.com/apache/hudi/pull/6106
>
>
>
> The PRs that are still in progress:
>
> https://github.com/apache/hudi/pull/6100
>
> https://github.com/apache/hudi/pull/6134
>
> https://github.com/apache/hudi/pull/6108
>
>
>
>


[ANNOUNCE] Hudi Community Update(2022-07-04 ~ 2022-07-17)

2022-07-17 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-07-04 ~ 2022-07-17
with updates on bug fixes.


===
Features

[Flink] Column stats data skipping for flink [1]
[Spark] Add call procedure for UpgradeOrDowngradeCommand [2]
[Spark] Add call procedure for MetadataCommand [3]
[Spark] Add a new HoodieDropPartitionsTool to let users drop table
partitions through a standalone job [4]
[Spark] Support show_fs_path_detail command on Call Produce Command [5]
[Flink] Support flink 1.15.x [6]
[Flink] Flink offline compaction support compacting multi compaction plan
at once  [7]
[Spark] Support copyToTable on call [8]
[Spark] Add call procedure for RepairsCommand [9]
[Flink] Bump Flink versions to 1.14.5 and 1.15.1 [10]
[Flink] Flink Inline Cluster and Compact plan distribute strategy changed
from rebalance to hash to avoid potential multiple threads accessing the
same file [11]
[Spark] Add call procedure for CleanCommand [12]


[1] https://issues.apache.org/jira/browse/HUDI-4353
[2] https://issues.apache.org/jira/browse/HUDI-3505
[1] https://issues.apache.org/jira/browse/HUDI-3511
[2] https://issues.apache.org/jira/browse/HUDI-3116
[3] https://issues.apache.org/jira/browse/HUDI-4359
[4] https://issues.apache.org/jira/browse/HUDI-4357
[5] https://issues.apache.org/jira/browse/HUDI-4152
[6] https://issues.apache.org/jira/browse/HUDI-4367
[7] https://issues.apache.org/jira/browse/HUDI-4353
[8] https://issues.apache.org/jira/browse/HUDI-3505
[9] https://issues.apache.org/jira/browse/HUDI-3500
[10] https://issues.apache.org/jira/browse/HUDI-4379
[11] https://issues.apache.org/jira/browse/HUDI-4397
[12] https://issues.apache.org/jira/browse/HUDI-3503

===
Bugs

[Spark] Fix exception when a MERGE INTO update expression such as
"col = s.col + 2" targets the precombine field [1]
[Spark] Fix Spark 3.2 repartition error [2]
[Spark] Reconcile schema: inject null values for missing fields and add new
fields [3]
[Spark] Allow users to use hoodie.datasource.read.paths to read only the
necessary files [4]



[1] https://issues.apache.org/jira/browse/HUDI-4219
[2] https://issues.apache.org/jira/browse/HUDI-4309
[3] https://issues.apache.org/jira/browse/HUDI-4267
[4] https://issues.apache.org/jira/browse/HUDI-4170



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-06-20 ~ 2022-07-03)

2022-07-03 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-06-20 ~ 2022-07-03
with updates on bug fixes.


===
Features

[Spark] Support export command based on Call Produce Command [1]
[Spark] Initialize hudi table management module [2]
[Spark] Add call procedure for FileSystemViewCommand [3]
[Spark] Add call procedure for HoodieLogFileCommand [4]
[Flink] Support inline schedule clustering for Flink stream [5]
[Spark] Add call procedure for StatsCommand [6]
[Spark] Support hdfs parquet import command based on Call Produce Command
[7]
[Spark] Add call procedure for CommitsCommand [8]
[Flink] Column stats data skipping for flink [9]
[Spark] Add call procedure for UpgradeOrDowngradeCommand [10]


[1] https://issues.apache.org/jira/browse/HUDI-3507
[2] https://issues.apache.org/jira/browse/HUDI-3475
[3] https://issues.apache.org/jira/browse/HUDI-3508
[4] https://issues.apache.org/jira/browse/HUDI-3509
[5] https://issues.apache.org/jira/browse/HUDI-4273
[6] https://issues.apache.org/jira/browse/HUDI-3512
[7] https://issues.apache.org/jira/browse/HUDI-3502
[8] https://issues.apache.org/jira/browse/HUDI-3506
[9] https://issues.apache.org/jira/browse/HUDI-4353
[10] https://issues.apache.org/jira/browse/HUDI-3505
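
These commands surface as Spark SQL call procedures. As a hedged example for
the CommitsCommand procedure in [8] (the procedure and argument names follow
what I believe ships as show_commits; treat them as assumptions):

-- Show the most recent commits of a table from Spark SQL.
CALL show_commits(table => 'hudi_trips_cow', limit => 5);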

===
Bugs

[Flink] Fix HoodieTable removing data files before the end of the Flink job [1]
[Spark] Fix wrong results when reading a Hudi table with no base files via glob
paths [2]
[Core] Fix missing data loading in the bootstrap operation [3]
[Flink] Fix Flink losing data in some rollback scenarios [4]
[Core] Fix records overwritten bug with binary primary key [5]
[Flink] Flink Hudi module should support low-level source and sink APIs [6]



[1] https://issues.apache.org/jira/browse/HUDI-4258
[2] https://issues.apache.org/jira/browse/HUDI-4173
[3] https://issues.apache.org/jira/browse/HUDI-4270
[4] https://issues.apache.org/jira/browse/HUDI-4311
[5] https://issues.apache.org/jira/browse/HUDI-4336
[6] https://issues.apache.org/jira/browse/HUDI-3953



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-06-06 ~ 2022-06-19)

2022-06-19 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-06-06 ~ 2022-06-19
with updates on bug fixes.


===
Features

[Spark] Add Call Procedure for marker deletion [1]
[Spark] Add Call Procedure for show rollbacks [2]
[Spark] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL [3]


[1] https://issues.apache.org/jira/browse/HUDI-4168
[2] https://issues.apache.org/jira/browse/HUDI-3499
[3] https://issues.apache.org/jira/browse/HUDI-4165

===
Bugs

[Spark] Fix using HoodieCatalog to create non-hudi tables [1]
[Core] Fix partition order in aws glue sync [2]
[Core] Fixing TableSchemaResolver to avoid repeated `HoodieCommitMetadata`
parsing  [3]
[Core] Fix Async indexer to support building FILES partition [4]
[Core] Fixing Non partitioned with virtual keys in read path [5]
[Spark] Addressing performance regressions in Spark DataSourceV2
Integration [6]
[Spark] Infer keygen clazz for Spark SQL [7]
[Flink] Improve the Flink write operator name to identify tables easily [8]
[Core] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata [9]
[Flink] Make the flink merge and replace handle intermediate file visible
[10]


[1] https://issues.apache.org/jira/browse/HUDI-4183
[2] https://issues.apache.org/jira/browse/HUDI-4187
[3] https://issues.apache.org/jira/browse/HUDI-4176
[4] https://issues.apache.org/jira/browse/HUDI-4197
[5] https://issues.apache.org/jira/browse/HUDI-4171
[6] https://issues.apache.org/jira/browse/HUDI-4187
[7] https://issues.apache.org/jira/browse/HUDI-4213
[8] https://issues.apache.org/jira/browse/HUDI-4139
[9] https://issues.apache.org/jira/browse/HUDI-4221
[10] https://issues.apache.org/jira/browse/HUDI-4255


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-05-23 ~ 2022-06-05)

2022-06-05 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-05-23 ~ 2022-06-05
with updates on bug fixes.


===
Features

[Flink] Support independent flink hudi clustering function [1]
[Flink] flink split_reader supports rocksdb [2]
[Core] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme
[3]
[Core] Add Call Procedure for marker deletion [4]


[1] https://issues.apache.org/jira/browse/HUDI-2207
[2] https://issues.apache.org/jira/browse/HUDI-4151
[3] https://issues.apache.org/jira/browse/HUDI-3551
[4] https://issues.apache.org/jira/browse/HUDI-4168

===
Bugs

[Flink] Fix the concurrency modification of hoodie table config for flink
[1]
[Core] Fixing compaction write operation in commit metadata [2]
[Core] Archives the metadata file in HoodieInstant.State sequence [3]
[Deltastreamer] Fixing determining target table schema for delta sync with
empty batch [4]
[Deltastreamer] Fix NULL schema for empty batches in deltastreamer [5]
[Core] Bulk insert Support CustomColumnsSortPartitioner with Row [6]
[Spark] Fix using HoodieCatalog to create non-hudi tables [7]
[Core] Fix partition order in aws glue sync [8]


[1] https://issues.apache.org/jira/browse/HUDI-4138
[2] https://issues.apache.org/jira/browse/HUDI-2473
[3] https://issues.apache.org/jira/browse/HUDI-4145
[4] https://issues.apache.org/jira/browse/HUDI-4132
[5] https://issues.apache.org/jira/browse/HUDI-4072
[6] https://issues.apache.org/jira/browse/HUDI-4040
[7] https://issues.apache.org/jira/browse/HUDI-4183
[8] https://issues.apache.org/jira/browse/HUDI-4187


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-05-09 ~ 2022-05-22)

2022-05-22 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-05-09 ~ 2022-05-22
with updates on bug fixes.


===
Features

[Core] consistent hashing index: basic write path (upsert/insert) [1]
[Core] Preparations for hudi metastore. [2]
[Core] Allow nested field as primary key and preCombineField in spark sql
[3]


[1] https://issues.apache.org/jira/browse/HUDI-3123
[2] https://issues.apache.org/jira/browse/HUDI-3654
[3] https://issues.apache.org/jira/browse/HUDI-4051
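
For the nested-field primary key support in [3], a hedged Spark SQL sketch
(struct and column names are illustrative):

CREATE TABLE hudi_nested (
  profile STRUCT<id: STRING, ts: BIGINT>,
  payload STRING
) USING hudi
TBLPROPERTIES (
  primaryKey = 'profile.id',
  preCombineField = 'profile.ts'
);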

===
Bugs

[Core] Refactor ratelimiter to avoid stack overflow [1]
[Core] Making perf optimizations for bulk insert row writer path [2]
[Core] Avoid calling getDataSize after every record written [3]
[Spark] Supports showing table comment for hudi with spark3 [4]
[Deltastreamer] Fix NULL schema for empty batches in deltastreamer [5]
[Core] Support Kerberos for the HBase index [6]
[Spark] Support dropping RO and RT table in DropHoodieTableCommand [7]



[1] https://issues.apache.org/jira/browse/HUDI-4055
[2] https://issues.apache.org/jira/browse/HUDI-3995
[3] https://issues.apache.org/jira/browse/HUDI-4038
[4] https://issues.apache.org/jira/browse/HUDI-4097
[5] https://issues.apache.org/jira/browse/HUDI-4072
[6] https://issues.apache.org/jira/browse/HUDI-3980
[7] https://issues.apache.org/jira/browse/HUDI-4087


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-04-25 ~ 2022-05-08)

2022-05-08 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-04-25 ~ 2022-05-08
with updates on bug fixes.

Apache Hudi 0.11.0 released. [1]


[1] https://hudi.apache.org/releases/release-0.11.0

===
Features

[Core] [RFC-44] Add RFC for Hudi Connector for Presto [1]
[Spark] Support truncate-partition for Spark-3.2 [2]
[Deltastreamer] Adding post write termination strategy to deltastreamer
continuous mode [3]


[1] https://issues.apache.org/jira/browse/HUDI-3211
[2] https://issues.apache.org/jira/browse/HUDI-4042
[3] https://issues.apache.org/jira/browse/HUDI-3675
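
For the partition-level truncate in [2], a hedged Spark SQL sketch (table name
and partition spec are illustrative):

-- Truncate a single partition of a Hudi table instead of the whole table.
TRUNCATE TABLE hudi_events PARTITION (dt = '2022-05-01');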

===
Bugs

[Core] Avoid clustering update handling when no pending replacecommit [1]
[Core] Make HoodieParquetWriter Thread safe and memory executor exit
gracefully [2]
[Core] AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration [3]
[Flink] Flink hudi table with date type partition path throws
HoodieNotSupportedException [4]
[Core] Fixing hoodie.properties/tableConfig for no preCombine field with
writes [5]



[1] https://issues.apache.org/jira/browse/HUDI-4031
[2] https://issues.apache.org/jira/browse/HUDI-2875
[3] https://issues.apache.org/jira/browse/HUDI-3849
[4] https://issues.apache.org/jira/browse/HUDI-3977
[5] https://issues.apache.org/jira/browse/HUDI-3972


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-04-11 ~ 2022-04-24)

2022-04-24 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-04-11 ~ 2022-04-24
with updates on bug fixes.


===
Bugs

[Core] Adding default null for some of the fields in col stats in MDT
schema [1]
[Core] Fix target schema handling in HoodieSparkUtils while creating RDD [2]
[Core] Fixing file-partitioning seq for base-file only views to make sure
we bucket the files efficiently [3]
[Core] Drop index to delete pending index instants from timeline if
applicable [4]
[Flink] Flink write task hangs if last checkpoint has no data input [5]
[Flink] Fix lose data when rollback in flink async compact [6]
[Core] Fixing partition-values being derived from partition-path instead of
source columns [7]
[Core] Fix cast exception while reading boolean type of partitioned field
[8]



[1] https://issues.apache.org/jira/browse/HUDI-3886
[2] https://issues.apache.org/jira/browse/HUDI-3707
[3] https://issues.apache.org/jira/browse/HUDI-3895
[4] https://issues.apache.org/jira/browse/HUDI-3899
[5] https://issues.apache.org/jira/browse/HUDI-3917
[6] https://issues.apache.org/jira/browse/HUDI-3912
[7] https://issues.apache.org/jira/browse/HUDI-3204
[8] https://issues.apache.org/jira/browse/HUDI-3923



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-03-28 ~ 2022-04-10)

2022-04-10 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-03-28 ~ 2022-04-10
with updates on features, bug fixes.


===
Features


[Core] Support Compaction Command Based on Call Procedure Command for Spark
SQL [1]
[Core] Implement Hudi AWS Glue sync [2]
[Core] Add hudi-datahub-sync implementation [3]
[Core] Implement async metadata indexing [4]
[Core] Support full Schema evolution for Spark [5]
[Flink] Flink supports syncing table information to AWS Glue [6]
[Core] MVP implementation of BigQuerySyncTool [7]



[1] https://issues.apache.org/jira/browse/HUDI-3538
[2] https://issues.apache.org/jira/browse/HUDI-2757
[3] https://issues.apache.org/jira/browse/HUDI-3536
[4] https://issues.apache.org/jira/browse/HUDI-3175
[5] https://issues.apache.org/jira/browse/HUDI-2429
[6] https://issues.apache.org/jira/browse/HUDI-3771
[7] https://issues.apache.org/jira/browse/HUDI-3357
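
For the full schema evolution in [5], a hedged Spark SQL sketch; the
hoodie.schema.on.read.enable flag is what I believe gates the feature, so treat
it as an assumption (table and column names are illustrative):

SET hoodie.schema.on.read.enable = true;

-- Add a new column to an existing Hudi table; existing rows read it as null.
ALTER TABLE hudi_trips_cow ADD COLUMNS (tip_amount DOUBLE);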


===
Bugs

[Core] Fix high performance cost of AvroSerializer in DataSource writing [1]
[Flink] Flink bucket index bucketID bootstrap optimization [2]
[Core] Fix the logic of reattempting pending rollback [3]
[Core] Fix truncate hudi table's error [4]
[Core] Fix drop table issue when sync to Hive [5]
[Core] Fixing Column Stats Index record Merging sequence missing
`columnName` [6]
[Core] Fix drop partition issue when sync to hive [7]
[Core] Removing dependency on "spark-avro" [8]
[Core] Fix CTAS statment issue when sync to hive [9]
[Flink] Fix flink bucket index bulk insert generates too many small files
[10]
[Core] Fix issue with out-of-order commits in the timeline when ingestion
writers use SparkAllowUpdateStrategy [11]
[Core] Fix the performance regression by enabling the vectorized reader for
Parquet files [12]
[Core] Improve HoodieSparkSqlWriter write performance [13]
[Core] The MOR DELETE block breaks the event time sequence of CDC [14]
[Core] Fix the bug that a COW table (containing DecimalType) written by Flink
cannot be read by Spark [15]


[1] https://issues.apache.org/jira/browse/HUDI-3719
[2] https://issues.apache.org/jira/browse/HUDI-3539
[3] https://issues.apache.org/jira/browse/HUDI-3720
[4] https://issues.apache.org/jira/browse/HUDI-3722
[5] https://issues.apache.org/jira/browse/HUDI-2520
[6] https://issues.apache.org/jira/browse/HUDI-3731
[7] https://issues.apache.org/jira/browse/HUDI-2520
[8] https://issues.apache.org/jira/browse/HUDI-3549
[9] https://issues.apache.org/jira/browse/HUDI-2520
[10] https://issues.apache.org/jira/browse/HUDI-3741
[11] https://issues.apache.org/jira/browse/HUDI-3355
[12] https://issues.apache.org/jira/browse/HUDI-3729
[13] https://issues.apache.org/jira/browse/HUDI-2777
[14] https://issues.apache.org/jira/browse/HUDI-2752
[15] https://issues.apache.org/jira/browse/HUDI-3096


Best,
Leesf


Re: [ANNOUNCE] New Apache Hudi Committer - Zhaojing Yu

2022-03-31 Thread leesf
Congrats!

Vino Yang wrote on Thursday, Mar 31, 2022 at 17:03:

> Congrats!
>
> Best,
> Vino
>
> Gary Li wrote on Friday, Mar 25, 2022 at 19:11:
> >
> > Congrats!
> >
> > Best,
> > Gary
> >
> > On Fri, Mar 25, 2022 at 4:07 PM Shiyan Xu 
> > wrote:
> >
> > > Congrats!
> > >
> > > On Fri, Mar 25, 2022 at 1:40 PM Danny Chan 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > On behalf of the PMC, I'm very happy to announce Zhaojing Yu as a new
> > > > Hudi committer.
> > > >
> > > > Zhaojing is very active in Flink Hudi contributions, many cool
> > > > features such as the flink streaming bootstrap, compaction service
> and
> > > > all kinds of writing modes are contributed by him. He also fixed many
> > > > critical bugs from the Flink side.
> > > >
> > > > Besides that, Zhaojing is also active in use case publicity of Hudi
> in
> > > > China, he is very active in answering user questions in our Dingtalk
> > > > group. Now he is working in Bytedance for pushing forward the
> Volcanic
> > > > cloud service Hudi products !
> > > >
> > > > Please join me in congratulating Zhaojing for becoming a Hudi
> committer!
> > > >
> > > > Cheers,
> > > > Danny
> > > >
> > >
> > >
> > > --
> > > --
> > > Best,
> > > Shiyan
> > >
>


[ANNOUNCE] Hudi Community Update(2022-03-14 ~ 2022-03-27)

2022-03-27 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2022-03-14 ~ 2022-03-27
with updates on features, bug fixes.


===
Features


[Core] Rebase Data Skipping flow to rely on MT Column Stats index [1]
[Flink] Support backend switch in HoodieFlinkStreamer [2]
[Flink] Support flink multiple versions [3]
[Core] Provide an option to trigger clean every nth commit [4]
[Flink] Flink bulk_insert support bucket hash index [5]
[Core] Supporting Composite Expressions over Data Table Columns in Data
Skipping flow [6]



[1] https://issues.apache.org/jira/browse/HUDI-3514
[2] https://issues.apache.org/jira/browse/HUDI-3607
[3] https://issues.apache.org/jira/browse/HUDI-3665
[4] https://issues.apache.org/jira/browse/HUDI-1436
[5] https://issues.apache.org/jira/browse/HUDI-3701
[6] https://issues.apache.org/jira/browse/HUDI-3594
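
For the column-stats-based data skipping in [1] and [6], a hedged Spark SQL
sketch of the configs involved; the keys are the ones I believe are used, so
treat them as assumptions:

-- Build the column stats partition of the metadata table on the write side ...
SET hoodie.metadata.enable = true;
SET hoodie.metadata.index.column.stats.enable = true;
-- ... and let queries prune files using those stats.
SET hoodie.enable.data.skipping = true;
SELECT * FROM hudi_trips_cow WHERE fare > 100.0;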


===
Bugs

[Flink] Flink Hive metadata sync supports table properties and serde
properties [1]
[Core] Automatically adjust write configs based on metadata table and write
concurrency mode [2]
[Core] Replace RDD with HoodieData in HoodieSparkTable and commit executors
[3]
[Core] Refactored Spark DataSource Relations to avoid code duplication [4]
[Core] Fixing Column Stats index to properly handle first Data Table commit
[5]
[Core] Refactor hive sync tool / config to use reflection and standardize
configs [6]
[Core] Refactoring MergeOnReadRDD to avoid duplication, fetch only
projected columns [7]
[Core] Do not throw exception when instant to rollback does not exist in
metadata table active timeline [8]
[Core] Fix OOM when using bulk_insert on a COW table with the Flink BUCKET
index [9]


[1] https://issues.apache.org/jira/browse/HUDI-3589
[2] https://issues.apache.org/jira/browse/HUDI-3404
[3] https://issues.apache.org/jira/browse/HUDI-2439
[4] https://issues.apache.org/jira/browse/HUDI-3457
[5] https://issues.apache.org/jira/browse/HUDI-3663
[6] https://issues.apache.org/jira/browse/HUDI-2883
[7] https://issues.apache.org/jira/browse/HUDI-3396
[8] https://issues.apache.org/jira/browse/HUDI-3435
[9] https://issues.apache.org/jira/browse/HUDI-3716




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-02-28 ~ 2022-03-13)

2022-03-13 Thread leesf
[7] https://issues.apache.org/jira/browse/HUDI-3418
[8] https://issues.apache.org/jira/browse/HUDI-3465
[9] https://issues.apache.org/jira/browse/HUDI-3516
[10] https://issues.apache.org/jira/browse/HUDI-2631
[11] https://issues.apache.org/jira/browse/HUDI-3264
[12] https://issues.apache.org/jira/browse/HUDI-3544
[13] https://issues.apache.org/jira/browse/HUDI-3548
[14] https://issues.apache.org/jira/browse/HUDI-3460
[15] https://issues.apache.org/jira/browse/HUDI-2761
[16] https://issues.apache.org/jira/browse/HUDI-3130
[17] https://issues.apache.org/jira/browse/HUDI-3069
[18] https://issues.apache.org/jira/browse/HUDI-3213
[19] https://issues.apache.org/jira/browse/HUDI-3561
[20] https://issues.apache.org/jira/browse/HUDI-3365
[21] https://issues.apache.org/jira/browse/HUDI-2747
[22] https://issues.apache.org/jira/browse/HUDI-3576
[23] https://issues.apache.org/jira/browse/HUDI-3573
[24] https://issues.apache.org/jira/browse/HUDI-3574
[25] https://issues.apache.org/jira/browse/HUDI-3356
[26] https://issues.apache.org/jira/browse/HUDI-3383
[27] https://issues.apache.org/jira/browse/HUDI-3396
[28] https://issues.apache.org/jira/browse/HUDI-3595
[29] https://issues.apache.org/jira/browse/HUDI-3567
[30] https://issues.apache.org/jira/browse/HUDI-3513
[31] https://issues.apache.org/jira/browse/HUDI-3592
[32] https://issues.apache.org/jira/browse/HUDI-3556
[33] https://issues.apache.org/jira/browse/HUDI-3593
[34] https://issues.apache.org/jira/browse/HUDI-3583


===
Tests

[Tests] Refactor HoodieTestDataGenerator to provide for reproducible Builds
[1]
[Tests] Add UT to verify HoodieRealtimeFileSplit serde  [2]
[Tests] Skip integ test modules by default [3]
[Tests] Add Trino Queries in integration tests [4]
[Tests] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in
TestSchemaPostProcessor [5]


[1] https://issues.apache.org/jira/browse/HUDI-3469
[2] https://issues.apache.org/jira/browse/HUDI-3348
[3] https://issues.apache.org/jira/browse/HUDI-3584
[4] https://issues.apache.org/jira/browse/HUDI-3586
[5] https://issues.apache.org/jira/browse/HUDI-3575



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-02-13 ~ 2022-02-27)

2022-02-27 Thread leesf
https://issues.apache.org/jira/browse/HUDI-3515


===
Tests

[Tests] Remove hardcoded logic of disabling metadata table in tests [1]
[Tests] Enhancements to the integ test suite [2]
[Tests] Support clustering scheduleAndExecute for hudi-cli and add
clustering-cli Tests [3]



[1] https://issues.apache.org/jira/browse/HUDI-3366
[2] https://issues.apache.org/jira/browse/HUDI-3480
[3] https://issues.apache.org/jira/browse/HUDI-3429



Best,
Leesf


Re: [DISCUSS] Change data feed for spark sql

2022-02-13 Thread leesf
+1 for the feature.

vino yang wrote on Saturday, Feb 12, 2022 at 22:14:

> +1 for this feature, looking forward to share more details or design doc.
>
> Best,
> Vino
>
> Xianghu Wang wrote on Saturday, Feb 12, 2022 at 17:06:
>
> > this is definitely a great feature
> >  +1
> >
> > On 2022/02/12 02:32:32 Forward Xu wrote:
> > > Hi All,
> > >
> > > I want to support change data feed for Spark SQL. This feature can be
> > > achieved in two ways.
> > >
> > > 1. Call Procedure Command
> > > sql syntax
> > > CALL system.table_changes('tableName',  start_timestamp, end_timestamp)
> > > example:
> > > CALL system.table_changes('tableName', TIMESTAMP '2021-01-23 04:30:45',
> > > TIMESTAMP '2021-02-23 6:00:00')
> > >
> > > 2. Support querying MOR(CDC) table as of a savepoint
> > > SELECT * FROM A.B TIMESTAMP AS OF 1643119574;
> > > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58' ;
> > >
> > > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58'  AND
> '2021-02-23
> > > 6:00:00' ;
> > > SELECT * FROM A.B VERSION AS OF 'Snapshot123456789';
> > >
> > > Any feedback is welcome!
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Forward Xu
> > >
> > > Related Links:
> > > [1] Call Procedure Command <
> > https://issues.apache.org/jira/browse/HUDI-3161>
> > > [2] Support querying a table as of a savepoint
> > > 
> > > [3] Change data feed
> > > <
> >
> https://docs.databricks.com/delta/delta-change-data-feed.html#language-sql
> > >
> > >
> >
>
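
For illustration, a rough Scala sketch of how the two entry points proposed
above could be exercised once (and if) they land; the database, table and
timestamps are placeholders, and system.table_changes is only the syntax
proposed in this thread, not a shipped Hudi procedure.

import org.apache.spark.sql.SparkSession

// Assumes a Spark session with the Hudi SQL extensions on the classpath.
val spark = SparkSession.builder()
  .appName("hudi-change-feed-sketch")
  .master("local[*]")
  .getOrCreate()

// 1. Call-procedure form: pull the change feed between two instants.
val changes = spark.sql(
  """CALL system.table_changes('hudi_db.hudi_trips',
    |  TIMESTAMP '2021-01-23 04:30:45', TIMESTAMP '2021-02-23 06:00:00')""".stripMargin)

// 2. Time-travel form: read the table as of a point in time.
val asOf = spark.sql(
  "SELECT * FROM hudi_db.hudi_trips TIMESTAMP AS OF '2021-01-29 00:37:58'")

changes.show()
asOf.show()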


[ANNOUNCE] Hudi Community Update(2022-01-16 ~ 2022-02-13)

2022-02-13 Thread leesf
]
[Core] Fix restore to roll back pending clustering operations followed by
rolling back other commits [57]
[Deltastreamer] fix jackson parse error on empty messages from
JsonKafkaSource when using HoodieDeltaStreamer [58]

[1] https://issues.apache.org/jira/browse/HUDI-3179
[2] https://issues.apache.org/jira/browse/HUDI-3257
[3] https://issues.apache.org/jira/browse/HUDI-3194
[4] https://issues.apache.org/jira/browse/HUDI-3252
[5] https://issues.apache.org/jira/browse/HUDI-3261
[6] https://issues.apache.org/jira/browse/HUDI-3263
[7] https://issues.apache.org/jira/browse/HUDI-2903
[8] https://issues.apache.org/jira/browse/HUDI-3245
[9] https://issues.apache.org/jira/browse/HUDI-3191
[10] https://issues.apache.org/jira/browse/HUDI-3277
[11] https://issues.apache.org/jira/browse/HUDI-3236
[12] https://issues.apache.org/jira/browse/HUDI-3283
[13] https://issues.apache.org/jira/browse/HUDI-3285
[14] https://issues.apache.org/jira/browse/HUDI-3281
[15] https://issues.apache.org/jira/browse/HUDI-3268
[16] https://issues.apache.org/jira/browse/HUDI-2837
[17] https://issues.apache.org/jira/browse/HUDI-1850
[18] https://issues.apache.org/jira/browse/HUDI-3282
[19] https://issues.apache.org/jira/browse/HUDI-3072
[20] https://issues.apache.org/jira/browse/HUDI-2872
[21] https://issues.apache.org/jira/browse/HUDI-3237
[22] https://issues.apache.org/jira/browse/HUDI-1822
[23] https://issues.apache.org/jira/browse/HUDI-2763
[24] https://issues.apache.org/jira/browse/HUDI-2596
[25] https://issues.apache.org/jira/browse/HUDI-2688
[26] https://issues.apache.org/jira/browse/HUDI-2943
[27] https://issues.apache.org/jira/browse/HUDI-1977
[28] https://issues.apache.org/jira/browse/HUDI-3253
[29] https://issues.apache.org/jira/browse/HUDI-3318
[30] https://issues.apache.org/jira/browse/HUDI-3292
[31] https://issues.apache.org/jira/browse/HUDI-2711
[32] https://issues.apache.org/jira/browse/HUDI-3346
[33] https://issues.apache.org/jira/browse/HUDI-3293
[34] https://issues.apache.org/jira/browse/HUDI-2589
[35] https://issues.apache.org/jira/browse/HUDI-3322
[36] https://issues.apache.org/jira/browse/HUDI-3337
[37] https://issues.apache.org/jira/browse/HUDI-1295
[38] https://issues.apache.org/jira/browse/HUDI-3191
[39] https://issues.apache.org/jira/browse/HUDI-2656
[40] https://issues.apache.org/jira/browse/HUDI-2491
[41] https://issues.apache.org/jira/browse/HUDI-3360
[42] https://issues.apache.org/jira/browse/HUDI-2941
[43] https://issues.apache.org/jira/browse/HUDI-3206
[44] https://issues.apache.org/jira/browse/HUDI-3058
[45] https://issues.apache.org/jira/browse/HUDI-3373
[46] https://issues.apache.org/jira/browse/HUDI-3320
[47] https://issues.apache.org/jira/browse/HUDI-3091
[48] https://issues.apache.org/jira/browse/HUDI-3361
[49] https://issues.apache.org/jira/browse/HUDI-3276
[50] https://issues.apache.org/jira/browse/HUDI-3239
[51] https://issues.apache.org/jira/browse/HUDI-
[52] https://issues.apache.org/jira/browse/HUDI-3395
[53] https://issues.apache.org/jira/browse/HUDI-2610
[54] https://issues.apache.org/jira/browse/HUDI-2987
[55] https://issues.apache.org/jira/browse/HUDI-3402
[56] https://issues.apache.org/jira/browse/HUDI-3338
[57] https://issues.apache.org/jira/browse/HUDI-3362
[58] https://issues.apache.org/jira/browse/HUDI-3413

===
Tests

[Tests] add UT for update/delete on non-pk condition [1]
[Tests] Fixing utilities and integ test suite bundle to include hudi spark
datasource [2]
[Tests] Solve UT for Spark 3.2 [3]
[Tests] Remove fixture test tables for multi writer tests [4]
[Tests] Fixing spark yaml and adding hive validation to integ test suite [5]


[1] https://issues.apache.org/jira/browse/HUDI-2968
[2] https://issues.apache.org/jira/browse/HUDI-3262
[3] https://issues.apache.org/jira/browse/HUDI-3215
[4] https://issues.apache.org/jira/browse/HUDI-3330
[5] https://issues.apache.org/jira/browse/HUDI-3312


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2022-01-02 ~ 2022-01-16)

2022-01-16 Thread leesf
[Tests] Use InProcessLockProvider for all multi-writer tests instead
of FileSystemBasedLockProviderTestClass [2]


[1] https://issues.apache.org/jira/browse/HUDI-3138
[2] https://issues.apache.org/jira/browse/HUDI-3165


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-12-19 ~ 2022-01-02)

2022-01-02 Thread leesf
Dear community,

Happy new year. Nice to share Hudi community bi-weekly updates for
2021-12-19 ~ 2022-01-02 with updates on features, bug fixes.


===
Features


[Core] Add table option to set utc timezone [1]
[Spark] Purge drop partition for spark sql [2]
[Spark] Support Spark 3.2 [3]
[Flink] Support component data types for flink bulk_insert [4]
[Spark] Add bucket hash index, compatible with the hive bucket [5]



[1] https://issues.apache.org/jira/browse/HUDI-3014
[2] https://issues.apache.org/jira/browse/HUDI-3099
[3] https://issues.apache.org/jira/browse/HUDI-2811
[4] https://issues.apache.org/jira/browse/HUDI-3083
[5] https://issues.apache.org/jira/browse/HUDI-1951


===
Bugs

[Core] Fixing HoodieFileIndex partition column parsing for nested fields [1]
[Core] Do not clean the log files right after compaction for metadata table
[2]
[Flink] Schedule Flink compaction in service [3]
[Core] Adding ability to read entire data with HoodieIncrSource with empty
checkpoint [4]
[Core] drop table for spark sql [5]
[Core] Excluding compaction instants from pending rollback info [6]
[Core] Do not store rollback plan in inflight instant [7]
[Core] Fix AvroDFSSource not using the overridden schema to deserialize
Avro binaries [8]
[Core] fix spark-sql queries on tables written with TimestampBasedKeyGenerator
[9]
[Core] Fix HiveSyncTool not syncing schema [10]
[Core] Fix the exception 'Not an Avro data file' when archive and clean [11]
[Flink] Bootstrap when timeline have completed instant [12]
[Flink] Cache compactionPlan in buffer [13]
[Core] abstract partition filter logic to enable code reuse [14]
[Core] Fix HiveSyncTool drop partitions using JDBC or hivesql or hms [15]
[Spark] Fix insert error after adding columns on Spark 3.2.0 [16]
[Spark] Fix merge/insert/show partitions error on Spark3.2 [17]
[Spark] fix ctas error in spark3.1.1 [18]



[1] https://issues.apache.org/jira/browse/HUDI-3008
[2] https://issues.apache.org/jira/browse/HUDI-3032
[3] https://issues.apache.org/jira/browse/HUDI-2547
[4] https://issues.apache.org/jira/browse/HUDI-3011
[5] https://issues.apache.org/jira/browse/HUDI-3060
[6] https://issues.apache.org/jira/browse/HUDI-3101
[7] https://issues.apache.org/jira/browse/HUDI-3102
[8] https://issues.apache.org/jira/browse/HUDI-2374
[9] https://issues.apache.org/jira/browse/HUDI-3093
[10] https://issues.apache.org/jira/browse/HUDI-3106
[11] https://issues.apache.org/jira/browse/HUDI-2675
[12] https://issues.apache.org/jira/browse/HUDI-3124
[13] https://issues.apache.org/jira/browse/HUDI-3120
[14] https://issues.apache.org/jira/browse/HUDI-3095
[15] https://issues.apache.org/jira/browse/HUDI-3107
[16] https://issues.apache.org/jira/browse/HUDI-3134
[17] https://issues.apache.org/jira/browse/HUDI-3136
[18] https://issues.apache.org/jira/browse/HUDI-3131





Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-12-05 ~ 2021-12-19)

2021-12-19 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-12-05 ~ 2021-12-19
with updates on features, bug fixes and tests.


===
Features


[Core] Add a hudi-trino-bundle for Trino [1]
[Core] Add a repair util to clean up dangling data and log files [2]



[1] https://issues.apache.org/jira/browse/HUDI-2784
[2] https://issues.apache.org/jira/browse/HUDI-2906


===
Bugs

[Core] Fix corrupt block end position [1]
[Core] for hive/presto hudi should remove the temp file created by
HoodieMergedLogRecordScanner when the query finishes [2]
[Core] Fixing aws lock configs to inherit from HoodieConfig [3]
[Flink] Shade kryo jar for flink bundle jar [4]
[Core] Fix overflow of huge log file in HoodieLogFormatWriter [5]
[Core] Cache BaseDir if HudiTableNotFound Exception thrown [6]
[Core] Add TaskCompletionListener for HoodieMergeOnReadRDD to close the
logScanner when the query finishes [7]
[Core] Fix the bug that clustering jobs cannot run in parallel [8]
[Core] Improve SparkUI job description for write path [9]
[Core] Fixing metadata table for non-partitioned dataset [10]
[Core] Make Z-index more generic Column-Stats Index [11]
[Core] Make the prefix for metrics name configurable [12]
[Core] Implement #close for AbstractTableFileSystemView [13]
[Build] Upgrade maven plugins to be compatible with higher Java versions
[14]
[Core] Metadata table util to get latest file slices for reader/writers [15]
[Core] Sync to HMS when deleting partitions [16]
[Core] Add judgement to existed partitionPath in the catch code block [17]
[Flink] Flink streaming reader 'skip_compaction' option does not work [18]
[Flink] Skip the corrupt meta file for pending rollback action [19]
[Flink] Add explicit write handler for flink [20]
[Core] Implement #reset and #sync for metadata filesystem view [21]
[Core] Clean up the marker directory when the bootstrap operation finishes [22]
[Core] Automatically set spark.sql.parquet.writeLegacyFormat when using
bulk insert to write data which contains DecimalType [23]
[Core] InProcess lock provider to guard single writer process with async
table operations [24]
[Core] Transaction manager: avoid deadlock when doing begin and end
transactions [25]



[1] https://issues.apache.org/jira/browse/HUDI-2900
[2] https://issues.apache.org/jira/browse/HUDI-2876
[3] https://issues.apache.org/jira/browse/HUDI-2964
[4] https://issues.apache.org/jira/browse/HUDI-2957
[5] https://issues.apache.org/jira/browse/HUDI-2665
[6] https://issues.apache.org/jira/browse/HUDI-2779
[7] https://issues.apache.org/jira/browse/HUDI-2966
[8] https://issues.apache.org/jira/browse/HUDI-2901
[9] https://issues.apache.org/jira/browse/HUDI-2849
[10] https://issues.apache.org/jira/browse/HUDI-2952
[11] https://issues.apache.org/jira/browse/HUDI-2814
[12] https://issues.apache.org/jira/browse/HUDI-2974
[13] https://issues.apache.org/jira/browse/HUDI-2984
[14] https://issues.apache.org/jira/browse/HUDI-2946
[15] https://issues.apache.org/jira/browse/HUDI-2938
[16] https://issues.apache.org/jira/browse/HUDI-2990
[17] https://issues.apache.org/jira/browse/HUDI-2994
[18] https://issues.apache.org/jira/browse/HUDI-2996
[19] https://issues.apache.org/jira/browse/HUDI-2997
[20] https://issues.apache.org/jira/browse/HUDI-3024
[21] https://issues.apache.org/jira/browse/HUDI-3015
[22] https://issues.apache.org/jira/browse/HUDI-3001
[23] https://issues.apache.org/jira/browse/HUDI-2958
[24] https://issues.apache.org/jira/browse/HUDI-2962
[25] https://issues.apache.org/jira/browse/HUDI-3029


==
Tests

[Tests] Add data count checks in async clustering tests [1]
[Tests] Multi writer test with conflicting async table services [2]
[Tests] Adding some test fixes to continuous mode multi writer tests [3]
[Tests] De-coupling multi writer tests [4]
[Tests] Fixing a bug in TransactionManager and FileSystemTestLock [5]
[Tests] Fixing default lock configs for FileSystemBasedLock and fixing a
flaky test [6]
[Tests] Fix flaky testJsonKafkaSourceResetStrategy [7]
[Tests] Adding tests for archival of replace commit actions [8]


[1] https://issues.apache.org/jira/browse/HUDI-2936
[2] https://issues.apache.org/jira/browse/HUDI-2527
[3] https://issues.apache.org/jira/browse/HUDI-3043
[4] https://issues.apache.org/jira/browse/HUDI-3043
[5] https://issues.apache.org/jira/browse/HUDI-3064
[6] https://issues.apache.org/jira/browse/HUDI-3054
[7] https://issues.apache.org/jira/browse/HUDI-3052
[8] https://issues.apache.org/jira/browse/HUDI-2970




Best,
Leesf


Re: Regular minor/patch releases

2021-12-15 Thread leesf
+1

We could create new branches such as release-0.10 as the base branch for
the 0.10.0, 0.10.1, etc. releases, and when fixing bugs against the master
branch, the contributors/committers should also open a new PR against the
release-0.10 branch if needed. That would avoid cherry-picking all bug
fixes from master to release-0.10 at one time and causing so many
conflicts. The Spark [1] and Flink [2] communities maintain multiple
release branches in the same way.

[1] https://github.com/apache/spark/tree/branch-3.1
https://github.com/apache/spark/tree/branch-3.2
[2] https://github.com/apache/flink/tree/release-1.12
https://github.com/apache/flink/tree/release-1.13

vino yang  于2021年12月15日周三 18:12写道:

> +1
>
> Agree that minor release mostly for bug fix purpose.
>
> Best,
> Vino
>
> Danny Chan  于2021年12月15日周三 10:35写道:
>
> > I guess we must do that for current rapid development and iteration. As
> for
> > the release 0.10.0, after the announcement of only a few days we have
> > received a bunch of bugs reported by the github issues: such as
> >
> > - the empty meta file: https://github.com/apache/hudi/issues/4249
> > - and the timeline based marker files:
> > https://github.com/apache/hudi/issues/4230
> >
> > With the rush in features without enough tests, I'm afraid the major
> > release version is never ready for production, unless there is production
> > validation like in Uber internal.
> >
> > And for minor releases, there should only include the bug fixes, no
> > breaking change, no feature, it should not be a hard work i think.
> >
> > Best,
> > Danny
> >
> > Sivabalan 于2021年12月14日 周二上午4:06写道:
> >
> > > +1 in general. but yeah, not sure if we have resources to do this for
> > every
> > > major release.
> > >
> > > On Mon, Dec 13, 2021 at 10:01 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > In the past we had plans for minor releases [1], but invariably we
> end
> > up
> > > > doing major ones, which also deliver the bug fixes.
> > > >
> > > > The reason was the cost involved in doing a release. We have made
> some
> > > good
> > > > progress towards regression/integration test, which prompts me to
> > revive
> > > > this.
> > > >
> > > > What does everyone think about a monthly bugfix release on the last
> > > > major/minor version. (not on every major release, we still don't have
> > > > enough contributors to pull that off IMO). So we would be trying to
> do
> > a
> > > > 0.10.1 early jan for e.g, in this model?
> > > >
> > > > [1]
> > https://cwiki.apache.org/confluence/display/HUDI/Release+Management
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: [DISCUSS] Propose Consistent Hashing Indexing for Dynamic Bucket Number

2021-12-12 Thread leesf
+1 for the improvement to make the bucket index more comprehensive; looking
forward to the RFC for more details.

Yuwei Xiao  于2021年12月10日周五 16:22写道:

> Dear Hudi Community,
>
> I would like to propose Consistent Hashing Indexing to enable dynamic
> bucket number, saving hyper-parameter tuning for Hudi users.
>
> Currently, we have Bucket Index on landing [1]. It is an effective index
> approach to address the performance issue during Upsert. I observed ~3x
> throughput improvement for Upsert in my local setup compared to the Bloom
> Filter approach. However, it requires pre-configure a bucket number when
> creating the table. As described in [1], this imposes two limitations:
>
> - Due to the one-one mapping between buckets and file groups, the size of a
> single file group may grow infinitely. Services like compaction will take
> longer because of the larger read/write amplification.
>
> - There may exist data skew because of imbalance data distribution,
> resulting in long-tail read/write.
>
> Based on the above observation, supporting dynamic bucket number is
> necessary, especially for rapidly changing hudi tables. Looking at the
> market, Consistent Hashing has been adopted in DB systems[2][3]. The main
> idea of it is to turn the "key->bucket" mapping into
> "key->hash_value->(range mapping)->bucket", constraining the re-hashing
> process to touch only several local buckets (e.g., only large file groups)
> rather than shuffling the whole hash table.
>
> In order to introduce Consistent Hashing to Hudi, we need to consider the
> following issues:
>
> - Storing hashing metadata, such as range mapping infos. Metadata size and
> concurrent updates to metadata should also be considered.
>
> - Splitting & Merging criteria. We need to design a (or several) policies
> to manage 'when and how to split & merge bucket'. A simple policy would be
> splitting in the middle when the file group reaches the size threshold.
>
> - Supporting concurrent write & read. The splitting or merging must not
> block concurrent writer & reader, and the whole process should be fast
> enough (e.g., one bucket at a time) to minimize the impact on other
> operations.
>
> - Integrating splitting & merging process into existing hudi table service
> pipelines.
>
> I have sketched a prototype design to address the above problems:
>
> - Maintain hashing metadata for each partition (persisted as files), and
> use instant to manage multi-version and concurrent updates of it.
>
> - A flexible framework will be implemented for different pluggable
> policies. The splitting plan, specifying which and how the bucket to split
> (merge), will be generated during the scheduling (just like how compaction
> does).
>
> - Dual-write will be activated once the writer observes the splitting(or
> merging) process, upserting records as log files into both old and new
> buckets (file groups). Readers can see records once the writer completes,
> regardless of the splitting process.
>
> - The splitting & merging could be integrated as a sub-task into the
> Clustering service, because we could view them as a special case of the
> Clustering's goal (i.e., managing file groups based on file size). Though
> we need to modify Clustering to handle log files, the bucket index enhances
> Clustering by allowing concurrent updates.
>
>
> Would love to hear your thoughts and any feedback about the proposal. I can
> draft an RFC with a detailed design once we reach an agreement.
>
> [1]
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>
> [2] YugabyteDB
>
> https://docs.yugabyte.com/latest/architecture/docdb-sharding/sharding/#example
>
> [3] PolarDB-X
> https://help.aliyun.com/document_detail/316603.html#title-y5n-2i1-5ws
>
>
>
> Best,
>
> Yuwei Xiao
>
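
To make the "key -> hash_value -> (range mapping) -> bucket" idea above
concrete, here is a small, self-contained Scala sketch of range-based hashing
with a mid-point split; the class and method names are made up for
illustration (they are not Hudi code), and metadata persistence, merging and
concurrency control are left out.

import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

// Each bucket (file group) owns a contiguous hash range [start, end).
final case class RangeBucket(id: Int, start: Long, end: Long)

final class ConsistentBucketIndex(initialBuckets: Int) {
  private val hashSpace: Long = 1L << 32

  // Start with equally sized ranges, one per bucket.
  private var buckets: Vector[RangeBucket] = {
    val step = hashSpace / initialBuckets
    (0 until initialBuckets).toVector.map { i =>
      val end = if (i == initialBuckets - 1) hashSpace else (i + 1) * step
      RangeBucket(i, i * step, end)
    }
  }

  private def hash(key: String): Long = {
    val crc = new CRC32()
    crc.update(key.getBytes(StandardCharsets.UTF_8))
    crc.getValue % hashSpace
  }

  // key -> hash_value -> (range mapping) -> bucket
  def bucketFor(key: String): RangeBucket = {
    val h = hash(key)
    buckets.find(b => h >= b.start && h < b.end).get
  }

  // Split only the oversized bucket at the middle of its range; every other
  // bucket keeps its range, so re-hashing touches one file group instead of
  // the whole table.
  def splitInMiddle(bucketId: Int, newBucketId: Int): Unit = {
    val idx = buckets.indexWhere(_.id == bucketId)
    val b   = buckets(idx)
    val mid = b.start + (b.end - b.start) / 2
    buckets = buckets
      .updated(idx, RangeBucket(b.id, b.start, mid))
      .patch(idx + 1, Seq(RangeBucket(newBucketId, mid, b.end)), 0)
  }
}

// Usage: routing stays stable for all keys outside the split range.
object ConsistentBucketDemo extends App {
  val index = new ConsistentBucketIndex(initialBuckets = 4)
  println(index.bucketFor("record-key-001"))
  index.splitInMiddle(bucketId = 0, newBucketId = 4)
  println(index.bucketFor("record-key-001"))
}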


[ANNOUNCE] Hudi Community Update(2021-11-21 ~ 2021-12-05)

2021-12-05 Thread leesf
[1] https://issues.apache.org/jira/browse/HUDI-2702
[2] https://issues.apache.org/jira/browse/HUDI-2533
[3] https://issues.apache.org/jira/browse/HUDI-2559
[4] https://issues.apache.org/jira/browse/HUDI-2550
[5] https://issues.apache.org/jira/browse/HUDI-2737
[6] https://issues.apache.org/jira/browse/HUDI-1937
[7] https://issues.apache.org/jira/browse/HUDI-2743
[8] https://issues.apache.org/jira/browse/HUDI-2778
[9] https://issues.apache.org/jira/browse/HUDI-2409
[10] https://issues.apache.org/jira/browse/HUDI-2332
[11] https://issues.apache.org/jira/browse/HUDI-2325
[12] https://issues.apache.org/jira/browse/HUDI-2831
[13] https://issues.apache.org/jira/browse/HUDI-2818
[14] https://issues.apache.org/jira/browse/HUDI-2838
[15] https://issues.apache.org/jira/browse/HUDI-2847
[16] https://issues.apache.org/jira/browse/HUDI-2671
[17] https://issues.apache.org/jira/browse/HUDI-2443
[18] https://issues.apache.org/jira/browse/HUDI-2778
[19] https://issues.apache.org/jira/browse/HUDI-2766
[20] https://issues.apache.org/jira/browse/HUDI-2793
[21] https://issues.apache.org/jira/browse/HUDI-2853
[22] https://issues.apache.org/jira/browse/HUDI-2844
[23] https://issues.apache.org/jira/browse/HUDI-2792
[24] https://issues.apache.org/jira/browse/HUDI-2480
[25] https://issues.apache.org/jira/browse/HUDI-1290
[26] https://issues.apache.org/jira/browse/HUDI-2800
[27] https://issues.apache.org/jira/browse/HUDI-2794
[28] https://issues.apache.org/jira/browse/HUDI-2858
[29] https://issues.apache.org/jira/browse/HUDI-2841
[30] https://issues.apache.org/jira/browse/HUDI-2840
[31] https://issues.apache.org/jira/browse/HUDI-2005
[32] https://issues.apache.org/jira/browse/HUDI-2852
[33] https://issues.apache.org/jira/browse/HUDI-2850
[34] https://issues.apache.org/jira/browse/HUDI-2814
[35] https://issues.apache.org/jira/browse/HUDI-2861
[36] https://issues.apache.org/jira/browse/HUDI-2767
[37] https://issues.apache.org/jira/browse/HUDI-2845
[38] https://issues.apache.org/jira/browse/HUDI-2475
[39] https://issues.apache.org/jira/browse/HUDI-2642
[40] https://issues.apache.org/jira/browse/HUDI-2891
[41] https://issues.apache.org/jira/browse/HUDI-2880
[42] https://issues.apache.org/jira/browse/HUDI-2881
[43] https://issues.apache.org/jira/browse/HUDI-2904
[44] https://issues.apache.org/jira/browse/HUDI-2914
[45] https://issues.apache.org/jira/browse/HUDI-2924
[46] https://issues.apache.org/jira/browse/HUDI-2902
[47] https://issues.apache.org/jira/browse/HUDI-2911
[48] https://issues.apache.org/jira/browse/HUDI-2894
[49] https://issues.apache.org/jira/browse/HUDI-2890
[50] https://issues.apache.org/jira/browse/HUDI-2923
[51] https://issues.apache.org/jira/browse/HUDI-2935


==
Tests

[Tests] Add more Spark CI build tasks [1]
[Tests] Fix skipped HoodieSparkSqlWriterSuite [2]



[1] https://issues.apache.org/jira/browse/HUDI-1870
[2] https://issues.apache.org/jira/browse/HUDI-2868




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-11-07 ~ 2021-11-21)

2021-11-21 Thread leesf
[15] https://issues.apache.org/jira/browse/HUDI-2738
[16] https://issues.apache.org/jira/browse/HUDI-2746
[17] https://issues.apache.org/jira/browse/HUDI-2151
[18] https://issues.apache.org/jira/browse/HUDI-2718
[19] https://issues.apache.org/jira/browse/HUDI-2741
[20] https://issues.apache.org/jira/browse/HUDI-2756
[21] https://issues.apache.org/jira/browse/HUDI-2706
[22] https://issues.apache.org/jira/browse/HUDI-2744
[23] https://issues.apache.org/jira/browse/HUDI-2683
[24] https://issues.apache.org/jira/browse/HUDI-2712
[25] https://issues.apache.org/jira/browse/HUDI-2769
[26] https://issues.apache.org/jira/browse/HUDI-2753
[27] https://issues.apache.org/jira/browse/HUDI-2151
[28] https://issues.apache.org/jira/browse/HUDI-2734
[29] https://issues.apache.org/jira/browse/HUDI-2789
[30] https://issues.apache.org/jira/browse/HUDI-2790
[31] https://issues.apache.org/jira/browse/HUDI-2641
[32] https://issues.apache.org/jira/browse/HUDI-2791
[33] https://issues.apache.org/jira/browse/HUDI-2798
[34] https://issues.apache.org/jira/browse/HUDI-2731
[35] https://issues.apache.org/jira/browse/HUDI-2796
[36] https://issues.apache.org/jira/browse/HUDI-2242
[37] https://issues.apache.org/jira/browse/HUDI-2804
[38] https://issues.apache.org/jira/browse/HUDI-2392
[39] https://issues.apache.org/jira/browse/HUDI-1932


==
Tests

[Tests] Enabling metadata table in TestHoodieIndex and
TestMergeOnReadRollbackActionExecutor [1]
[Tests] Enabling metadata table for TestHoodieMergeOnReadTable and
TestHoodieCompactor [2]



[1] https://issues.apache.org/jira/browse/HUDI-2472
[2] https://issues.apache.org/jira/browse/HUDI-2472




Best,
Leesf


Re: [DISCUSS] Move to Spark DataSource V2 API

2021-11-15 Thread leesf
Thanks Raymond for sharing the work that has been done. I agree that the 1st
approach would need more work and time to adapt fully to the V2 interfaces
and would differ across engines. Thus, abstracting the Hudi core
writing/reading framework to adapt to different engines (approach 2) looks
good to me at the moment, since we do not need extra work to adapt to other
engines and can focus on the Spark writing/reading side.

Raymond Xu  于2021年11月14日周日 下午5:44写道:

> Great initiative and idea, Leesf.
>
> Totally agreed on the benefits of adopting V2 APIs. On the 4th point "Total
> use V2 writing interface"
>
> I have previously worked on implementing upsert with V2 writing interface
> with SimpleIndex using broadcast join. The POC worked without fully
> integrating with other table services. The downside of going this route
> would be re-implementing most of the logic we have today with the RDD
> writer path, including different indexing implementations, which are
> non-trivial.
>
> Another route I've PoC'ed is to treat the current RDD writer path as Hudi
> "writer framework": input Dataset going through different components
> as we see today Client -> Specific ActionExecutor -> Helper ->
> (dedup/indexing/tagging/build profile) -> Base Write ActionExecutor -> (map
> partitions and perform write on Row iterator via parquet writer/reader) ->
> return Dataset
>
> As you can see, the 1st approach is to adopt an engine-native framework (V2
> writing interface in this case) to realize Hudi operations while the 2nd
> approach is to adopt the Hudi "writer framework" by using engine-native
> data-level APIs to realize Hudi operations. The 2nd approach gives better
> flexibility in adopting different engines; it leverages engines'
> capabilities to manipulate data while ensuring write operations
> were realized in the "Hudi" way. The prerequisite to this is to have a
> flexible Hudi abstraction on top of different engines' data-level APIs.
> Ethan has landed 2 major abstraction PRs to pave the way for it, which will
> enable a great deal of code-reuse.
>
> The Hudi "writer framework" today consists of a bunch of Java classes. It
> can be formalized and refactored along the way while implementing Row
> writing. Once the "framework" is formalized, its flexibility can really
> shine on bringing in new processing engines to Hudi. Something similar
> could be done on the reader path too I suppose.
>
> On Tue, Nov 9, 2021 at 7:55 AM leesf  wrote:
>
> > Hi all,
> >
> > I did see the community discuss moving to V2 datasource API before [1]
> but
> > get no more progress. So I want to bring up the discussion again to move
> to
> > spark datasource V2 api, Hudi still uses V1 api and relies heavily on RDD
> > api to index, repartition and so on given the flexibility of RDD API.
> > However V2 api eliminates RDD usage and introduces CatalogPlugin
> mechanism
> > to give the ability to manage Hudi tables and totally new writing and
> > reading interface, so it caused some challenges since Hudi uses the RDD
> in
> > both writing and reading path, However I think it is still necessary to
> > integrate Hudi with V2 api as the V1 api is too old and the benefits from
> > V2 api optimization such as more pushdown filters regarding query side to
> > accelerate the query speed when integrating with RFC-27 [2].
> >
> > And here is work I think we should do when moving to V2 api.
> >
> > 1. Integrate with V2 writing interface(Bulk_insert row path already
> > implemented, but not for upsert/insert operations, would fallback to V1
> > writing code path)
> > 2. Integrate with V2 reading interface
> > 3. Introducing CatalogPlugin to manage Hudi tables
> > 4. Total use V2 writing interface(use Iterator that may need
> > some refactor to HoodieSparkWriteClient to make precombining, indexing
> etc
> > working fine).
> >
> > Please add other work that no mentioned above and would love to hear
> other
> > opinions and feedback from the community. I see there is already an
> > umbrella ticket to track datasource V2 [3] and I will put on a RFC for
> more
> > details, also you would join the channel #spark-datasource-v2 in Hudi
> slack
> > for more discussion
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> > [3] https://issues.apache.org/jira/browse/HUDI-1297
> >
> >
> >
> > Thanks
> > Leesf
> >
>


[DISCUSS] Move to Spark DataSource V2 API

2021-11-09 Thread leesf
Hi all,

I did see the community discuss moving to the V2 datasource API before [1]
but it got no further progress. So I want to bring up the discussion again to
move to the Spark datasource V2 API. Hudi still uses the V1 API and relies
heavily on the RDD API to index, repartition and so on, given the flexibility
of the RDD API. However, the V2 API eliminates RDD usage and introduces the
CatalogPlugin mechanism to give the ability to manage Hudi tables, plus
totally new writing and reading interfaces, so it poses some challenges since
Hudi uses RDDs in both the writing and reading paths. However, I think it is
still necessary to integrate Hudi with the V2 API, as the V1 API is too old
and the V2 API brings optimizations such as more pushdown filters on the
query side to accelerate queries when integrating with RFC-27 [2].

And here is the work I think we should do when moving to the V2 API.

1. Integrate with the V2 writing interface (the bulk_insert row path is
already implemented, but upsert/insert operations would fall back to the V1
writing code path)
2. Integrate with the V2 reading interface
3. Introduce CatalogPlugin to manage Hudi tables
4. Fully use the V2 writing interface (use Iterator, which may need
some refactoring of HoodieSparkWriteClient to make precombining, indexing etc.
work fine).

Please add other work not mentioned above; I would love to hear other
opinions and feedback from the community. I see there is already an umbrella
ticket to track datasource V2 [3] and I will put up an RFC for more details.
You can also join the channel #spark-datasource-v2 in the Hudi slack for
more discussion.

[1]
https://lists.apache.org/thread.html/r0411d53b46d8bb2a57c697e295c83a274fa0bc817a2a8ca8eb103a3d%40%3Cdev.hudi.apache.org%3E
[2]
https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
[3] https://issues.apache.org/jira/browse/HUDI-1297



Thanks
Leesf
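
As a rough illustration of what item 3 above (a CatalogPlugin to manage Hudi
tables) asks for on the Spark side, here is a bare TableCatalog skeleton in
Scala. HudiDemoCatalog is a made-up name rather than an existing Hudi class,
the real table resolution/creation logic is elided, and only the registration
property at the end is standard Spark V2 catalog configuration.

import java.util
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class HudiDemoCatalog extends TableCatalog {
  private var catalogName: String = _

  // Called by Spark with the options under spark.sql.catalog.<name>.*
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
    catalogName = name
  override def name(): String = catalogName

  // The V2 entry points Spark calls instead of the V1 relation provider.
  override def listTables(namespace: Array[String]): Array[Identifier] = Array.empty
  override def loadTable(ident: Identifier): Table =
    throw new UnsupportedOperationException("resolve a Hudi table from its base path here")
  override def createTable(ident: Identifier, schema: StructType,
      partitions: Array[Transform], properties: util.Map[String, String]): Table =
    throw new UnsupportedOperationException("create the Hudi table layout here")
  override def alterTable(ident: Identifier, changes: TableChange*): Table =
    throw new UnsupportedOperationException("apply schema/property changes here")
  override def dropTable(ident: Identifier): Boolean = false
  override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit =
    throw new UnsupportedOperationException("rename is not sketched here")
}

// Registered via: spark.sql.catalog.hudi_demo = <fully.qualified.HudiDemoCatalog>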


[ANNOUNCE] Hudi Community Update(2021-10-24 ~ 2021-11-07)

2021-11-07 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-10-24 ~ 2021-11-07
with updates on features, bug fixes and tests.


===
Features

[Spark SQL] Support replace commit in DeltaSync with commit metadata
preserved [1]
[Flink Integration] Adding inline read and seek based read(batch get) for
hfile log blocks in metadata table [2]
[Core] Hash ID generator util for Hudi table columns, partition and file [3]
[Core] support z-order for hudi [4]
[Core] Support concurrent key gen for different tables with row writer path
[5]
[Spark] Upgrading Spark3 To 3.1 [6]
[Spark SQL] Add support ignoring case in merge into [7]
[Core] Add ORC support in Bootstrap Op [8]


[1] https://issues.apache.org/jira/browse/HUDI-1500
[2] https://issues.apache.org/jira/browse/HUDI-1294
[3] https://issues.apache.org/jira/browse/HUDI-1295
[4] https://issues.apache.org/jira/browse/HUDI-2101
[5] https://issues.apache.org/jira/browse/HUDI-2582
[6] https://issues.apache.org/jira/browse/HUDI-1869
[7] https://issues.apache.org/jira/browse/HUDI-2471
[8] https://issues.apache.org/jira/browse/HUDI-1827

===
Bugs

[Core] Avoiding direct fs calls in HoodieLogFileReader [1]
[Core] Remove duplicated hadoop-common with tests classifier exists in
bundles [2]
[Core] Remove duplicated hadoop-hdfs with tests classifier exists in
bundles [3]
[Flink] Schema evolution for flink parquet reader [4]
[Flink] Make precombine field optional for flink [5]
[Core] Refactor index in hudi-client module [6]
[Core] Fixing double locking with multi-writers [7]
[Flink Integration] Schedules the compaction from earliest for flink [8]
[Flink Integration] Add compaction failed event(part2) [9]
[Core] Remove duplicated hbase-common with tests classifier exists in
bundles [10]
[Core] Add close when producing records failed [11]
[Core] persist some configs to hoodie.properties when the first write [12]
[Hive Integration] hudi hive reader should not print read values [13]
[Flink Integration] Delete the view storage properties first before
creation [14]
[Hive Integration] Hudi should synchronize owner information to hudi
_rt/_ro table [15]
[Flink Integration] flink writer writes huge log file [16]
[Core] Use DefaultHoodieRecordPayload when precombine field is specified
specifically [17]
[Flink Integration] Sync all the missing sql options for
HoodieFlinkStreamer [18]
[Flink Integration] Process records after all bootstrap operators are ready [19]
[Flink Integration] Remove the aborted checkpoint notification from
coordinator [20]
[Core] Moved static COMMIT_FORMATTER to thread local variable as
SimpleDateFormat is not thread safe [21]
[Core] Make spark.sql.parquet.writeLegacyFormat configurable [22]
[Flink Integration] Set up keygen class explicit for write config for flink
table upgrade [23]
[Hive Integration] bugfix: NPE when running select count(*) from a realtime
table with Tez [24]



[1] https://issues.apache.org/jira/browse/HUDI-2005
[2] https://issues.apache.org/jira/browse/HUDI-2600
[3] https://issues.apache.org/jira/browse/HUDI-2614
[4] https://issues.apache.org/jira/browse/HUDI-2632
[5] https://issues.apache.org/jira/browse/HUDI-2633
[6] https://issues.apache.org/jira/browse/HUDI-2502
[7] https://issues.apache.org/jira/browse/HUDI-2573
[8] https://issues.apache.org/jira/browse/HUDI-2654
[9] https://issues.apache.org/jira/browse/HUDI-2654
[10] https://issues.apache.org/jira/browse/HUDI-2643
[11] https://issues.apache.org/jira/browse/HUDI-2515
[12] https://issues.apache.org/jira/browse/HUDI-2538
[13] https://issues.apache.org/jira/browse/HUDI-2674
[14] https://issues.apache.org/jira/browse/HUDI-2660
[15] https://issues.apache.org/jira/browse/HUDI-2676
[16] https://issues.apache.org/jira/browse/HUDI-2678
[17] https://issues.apache.org/jira/browse/HUDI-2684
[18] https://issues.apache.org/jira/browse/HUDI-2651
[19] https://issues.apache.org/jira/browse/HUDI-2686
[20] https://issues.apache.org/jira/browse/HUDI-2696
[21] https://issues.apache.org/jira/browse/HUDI-1794
[22] https://issues.apache.org/jira/browse/HUDI-2526
[23] https://issues.apache.org/jira/browse/HUDI-2702
[24] https://issues.apache.org/jira/browse/HUDI-313


==
Tests

[Tests] Fix TestHoodieDeltaStreamerWithMultiWriter [1]
[Tests] Enabling Metadata table for some of TestCleaner unit tests [2]



[1] https://issues.apache.org/jira/browse/HUDI-2077
[2] https://issues.apache.org/jira/browse/HUDI-2472




Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-10-10 ~ 2021-10-24)

2021-10-24 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-10-10 ~ 2021-10-24
with updates on features, bug fixes and tests.


===
Features

[Spark SQL] support 'drop partition' sql [1]
[Flink Integration] Support merging small files for flink insert operation
[2]
[Core] Add HoodieData abstraction and refactor compaction actions in
hudi-client module [3]




[1] https://issues.apache.org/jira/browse/HUDI-2482
[2] https://issues.apache.org/jira/browse/HUDI-2578
[3] https://issues.apache.org/jira/browse/HUDI-2501


===
Bugs

[Flink] Fix metadata table for flink [1]
[Core] Insert duplicate records when precombined is deactivated for
"insert" operation [2]
[Flink] AppendWriteFunction throws NPE when checkpointing without written
data [3]
[Core] Fixed wrong validation for metadataTableEnabled in Hoodie Table [4]
[Core] Metadata table compaction trigger max delta commits [5]
[Core] Fixing glob pattern to skip all hoodie meta paths [6]
[Core] Fix clustering handle errors [7]
[Flink Integration] Flink streaming reader misses the rolling over file
handles [8]
[Flink Integration] Support DefaultHoodieRecordPayload for flink [9]
[Flink Integration] Tweak some default config options for flink [10]
[Flink Integration] Embedded timeline server on JobManager [11]
[Flink Integration] Shade javax.servlet for flink bundle jar [12]
[Flink Integration] Simplify the view storage config properties [13]
[Flink Integration] Shaded hive for flink bundle jar [14]
[Flink Integration] Remove include-flink-sql-connector-hive profile from
flink bundle [15]
[Core] BitCaskDiskMap - avoiding hostname resolution when logging messages
[16]
[Flink Integration] Strengthen flink compaction rollback strategy [17]
[Core] Replace json based payload with protobuf for Transaction protocol
[18]
[CI] Generate more dependency list file for other bundles [19]
[Core] Metadata table compaction trigger max delta commits [20]
[Core] Fix write empty array when write.precombine.field is decimal type
[21]
[Core] Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView,
aiming to reduce unnecessary list/get requests [22]
[Core] Metadata table support for rolling back the first commit [23]



[1] https://issues.apache.org/jira/browse/HUDI-2537
[2] https://issues.apache.org/jira/browse/HUDI-2496
[3] https://issues.apache.org/jira/browse/HUDI-2542
[4] https://issues.apache.org/jira/browse/HUDI-2540
[5] https://issues.apache.org/jira/browse/HUDI-2532
[6] https://issues.apache.org/jira/browse/HUDI-2494
[7] https://issues.apache.org/jira/browse/HUDI-2435
[8] https://issues.apache.org/jira/browse/HUDI-2548
[9] https://issues.apache.org/jira/browse/HUDI-2551
[10] https://issues.apache.org/jira/browse/HUDI-2556
[11] https://issues.apache.org/jira/browse/HUDI-2562
[12] https://issues.apache.org/jira/browse/HUDI-2557
[13] https://issues.apache.org/jira/browse/HUDI-2568
[14] https://issues.apache.org/jira/browse/HUDI-2569
[15] https://issues.apache.org/jira/browse/HUDI-2571
[16] https://issues.apache.org/jira/browse/HUDI-2561
[17] https://issues.apache.org/jira/browse/HUDI-2572
[18] https://issues.apache.org/jira/browse/HUDI-2469
[19] https://issues.apache.org/jira/browse/HUDI-2507
[20] https://issues.apache.org/jira/browse/HUDI-2553
[21] https://issues.apache.org/jira/browse/HUDI-2592
[22] https://issues.apache.org/jira/browse/HUDI-2489
[23] https://issues.apache.org/jira/browse/HUDI-2468

==
Tests

[Tests] Fixing some test failures to unblock broken CI master [1]
[Tests] Fix few Cleaner tests with metadata table enabled [2]
[Tests] Fix flakiness in TestHoodieDeltaStreamer [3]
[Tests] Refactor TestWriteCopyOnWrite test cases [4]



[1] https://issues.apache.org/jira/browse/HUDI-2552
[2] https://issues.apache.org/jira/browse/HUDI-2472
[3] https://issues.apache.org/jira/browse/HUDI-2077
[4] https://issues.apache.org/jira/browse/HUDI-2583



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-09-26 ~ 2021-10-10)

2021-10-10 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-09-26 ~ 2021-10-10
with updates on features, bug fixes and tests.


===
Features

[Core] Add dependency change diff script for dependency governace [1]
[Spark SQL] support 'show partitions' sql [2]



[1] https://issues.apache.org/jira/browse/HUDI-2440
[2] https://issues.apache.org/jira/browse/HUDI-2456


===
Bugs

[Core] Fix JsonKafkaSource cannot filter empty messages from kafka [1]
[Core] Refreshing timeline for every operation in Hudi when metadata is
enabled [2]
[DeltaStreamer] HoodieDeltaStreamer reading ORC files directly using
ORCDFSSource [3]
[Hive Integration] Making jdbc-url, user and pass as non-required field for
other sync modes [4]
[Core] Refactor clean and restore actions in hudi-client module [5]
[Core] Metadata table synchronous design [6]
[Core] Refactor table upgrade and downgrade actions in hudi-client module
[7]
[Flink Integration] Remove the sort operation when bulk_insert in batch
mode [8]


[1] https://issues.apache.org/jira/browse/HUDI-2487
[2] https://issues.apache.org/jira/browse/HUDI-2474
[3] https://issues.apache.org/jira/browse/HUDI-2277
[4] https://issues.apache.org/jira/browse/HUDI-2499
[5] https://issues.apache.org/jira/browse/HUDI-2497
[6] https://issues.apache.org/jira/browse/HUDI-2285
[7] https://issues.apache.org/jira/browse/HUDI-2513
[8] https://issues.apache.org/jira/browse/HUDI-2534



==
Tests

[Tests] Adding async compaction support to integ test suite framework [1]



[1] https://issues.apache.org/jira/browse/HUDI-2530



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-09-12 ~ 2021-09-26)

2021-09-26 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-09-12 ~ 2021-09-26
with updates on features, bug fixes and tests.


===
Features

[Core] Add metrics-jmx to spark and flink bundles [1]
[Java Client] Adding support for merge-on-read tables [2]
[Flink Integration] Incremental read for Flink [3]
[Flink Integration] Consume as mini-batch for flink stream reader [4]


[1] https://issues.apache.org/jira/browse/HUDI-2404
[2] https://issues.apache.org/jira/browse/HUDI-2335
[3] https://issues.apache.org/jira/browse/HUDI-2449
[4] https://issues.apache.org/jira/browse/HUDI-2485


===
Bugs

[Hive Integration] Add --enable-sync parameter [1]
[Core] Fix getDefaultBootstrapIndexClass logical error [2]
[Flink Integration] Catch the throwable when scheduling the cleaning task
for flink writer [3]
[Kafka Connect] Fix protocol and other issues after stress testing Hudi
Kafka Connect [4]
[Flink Integration] Make decimal compatible with hudi for flink writer [5]
[Core] Refactor rollback actions in hudi-client module [6]
[Core] Archive service executed after cleaner finished [7]
[Core] Separate some config logic from HoodieMetricsConfig into
HoodieMetricsGraphiteConfig and HoodieMetricsJmxConfig [8]
[Core] Adding rollback plan and rollback requested instant [9]
[Core] Make periodSeconds of GraphiteReporter configurable [10]
[Core] Fixing delete files corner cases wrt cleaning and rollback when
applying changes to metadata [11]
[Spark SQL] Fix the exception for mergeInto when the primaryKey and
preCombineField of source table and target table differ in case only [12]
[Flink Integration] HoodieFileIndex throws NPE for FileSlice with pure log
files [13]
[Spark Integration] Clean the marker files after compaction [14]
[Hive Integration] Fixing the closing of hms client [15]
[Core] Make parquet dictionary encoding configurable [16]
[Flink Integration] Infer changelog mode for flink compactor [17]
[Deltastreamer] Fix hive sync mode setting in Deltastreamer [18]
[Core] Fix wrong file separator on windows client with hdfs server [19]


[1] https://issues.apache.org/jira/browse/HUDI-2397
[2] https://issues.apache.org/jira/browse/HUDI-2410
[3] https://issues.apache.org/jira/browse/HUDI-2421
[4] https://issues.apache.org/jira/browse/HUDI-2428
[5] https://issues.apache.org/jira/browse/HUDI-2430
[6] https://issues.apache.org/jira/browse/HUDI-2433
[7] https://issues.apache.org/jira/browse/HUDI-2355
[8] https://issues.apache.org/jira/browse/HUDI-2423
[9] https://issues.apache.org/jira/browse/HUDI-2422
[10] https://issues.apache.org/jira/browse/HUDI-2434
[11] https://issues.apache.org/jira/browse/HUDI-2444
[12] https://issues.apache.org/jira/browse/HUDI-2343
[13] https://issues.apache.org/jira/browse/HUDI-2479
[14] https://issues.apache.org/jira/browse/HUDI-2383
[15] https://issues.apache.org/jira/browse/HUDI-2248
[16] https://issues.apache.org/jira/browse/HUDI-2385
[17] https://issues.apache.org/jira/browse/HUDI-2483
[18] https://issues.apache.org/jira/browse/HUDI-2484
[19] https://issues.apache.org/jira/browse/HUDI-2451


==
Tests

[Tests] TestHoodieMultiTableDeltaStreamer CI failed due to exception [1]
[Tests] Add DAG nodes for Spark SQL in integration test suite [2]
[Tests] Metadata tests rewrite [3]


[1] https://issues.apache.org/jira/browse/HUDI-2425
[2] https://issues.apache.org/jira/browse/HUDI-2388
[3] https://issues.apache.org/jira/browse/HUDI-2395


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-08-29 ~ 2021-09-12)

2021-09-12 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-08-29 ~ 2021-09-12
with updates on features, bug fixes and tests.


===
Features

[Core] Add support ByteArrayDeserializer in AvroKafkaSource [1]
[Core] Add configs for common and pre validate [2]
[CI] Use GitHub Actions to build different scala spark versions [3]
[Flink Integration] Add pipeline for Append mode [4]
[Flink Integration] Add metadata table listing for flink query source [5]
[Core] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data
[6]
[Flink Integration] Add timestamp based partitioning for flink writer [7]


[1] https://issues.apache.org/jira/browse/HUDI-2320
[2] https://issues.apache.org/jira/browse/HUDI-2378
[3] https://issues.apache.org/jira/browse/HUDI-2280
[4] https://issues.apache.org/jira/browse/HUDI-2376
[5] https://issues.apache.org/jira/browse/HUDI-2403
[6] https://issues.apache.org/jira/browse/HUDI-2394
[7] https://issues.apache.org/jira/browse/HUDI-2412


===
Bugs

[Flink Integration] Include the pending compaction file groups for flink
streaming reader [1]
[Core] Change log file size config to long [2]
[Flink Integration] Do not send partition delete record when changelog mode
enabled [3]
[Core] The default archive folder should be 'archived' [4]
[Flink Integration] Load archived instants for flink streaming reader [5]
[Core] Extract common FS and IO utils for marker mechanism [6]
[Core] Fix TimelineServer error because of replacecommit archive [7]
[Core] Collect event time for inserts in DefaultHoodieRecordPayload [8]


[1] https://issues.apache.org/jira/browse/HUDI-2379
[2] https://issues.apache.org/jira/browse/HUDI-2384
[3] https://issues.apache.org/jira/browse/HUDI-2392
[4] https://issues.apache.org/jira/browse/HUDI-2380
[5] https://issues.apache.org/jira/browse/HUDI-2401
[6] https://issues.apache.org/jira/browse/HUDI-2351
[7] https://issues.apache.org/jira/browse/HUDI-2354
[8] https://issues.apache.org/jira/browse/HUDI-2398


==
Tests

[Tests] Fix flakiness in TestHoodieMergeOnReadTable [1]
[Tests] Disable HDFSParquetImporter related tests [2]
[Tests] Rebalance CI jobs for shorter wait time [3]
[Tests] Make CLI command tests functional [4]
[Tests] Move to ubuntu-18.04 for Azure CI [5]
[Tests] Deprecate FunctionalTestHarness to avoid init DFS [6]
[Tests]  Add yamls for large scale testing [7]

[1] https://issues.apache.org/jira/browse/HUDI-1989
[2] https://issues.apache.org/jira/browse/HUDI-1989
[3] https://issues.apache.org/jira/browse/HUDI-2399
[4] https://issues.apache.org/jira/browse/HUDI-2079
[5] https://issues.apache.org/jira/browse/HUDI-2080
[6] https://issues.apache.org/jira/browse/HUDI-2408
[7] https://issues.apache.org/jira/browse/HUDI-2393

Best,
Leesf


Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread leesf
Thanks Udit for driving the release, Great news!

Rubens Rodrigues  于2021年9月4日周六 上午10:12写道:

> Hello
>
> Im from Brazil and Im follow hudi since version 0.5, congratulations for
> everyone, The hudi evolution in only one year is impressive.
>
> Me and my folks are very happy to choose hudi for our datalake.
>
> Thank you so much for this wonderfull work
>
> Em sex., 3 de set. de 2021 22:57, Raymond Xu 
> escreveu:
>
> > Congrats! Another awesome release.
> >
> > On Wed, Sep 1, 2021 at 11:49 AM Pratyaksh Sharma 
> > wrote:
> >
> > > Great news! This one really feels like a major release with so many
> good
> > > features getting added. :)
> > >
> > > On Wed, Sep 1, 2021 at 7:19 AM Udit Mehrotra 
> wrote:
> > >
> > > > The Apache Hudi team is pleased to announce the release of Apache
> Hudi
> > > > 0.9.0.
> > > >
> > > > This release comes almost 5 months after 0.8.0. It includes 387
> > resolved
> > > > issues, comprising new features as well as
> > > > general improvements and bug-fixes. Here are a few quick highlights:
> > > >
> > > > *Spark SQL DML and DDL Support*
> > > > We have added experimental support for DDL/DML using Spark SQL
> taking a
> > > > huge step towards making Hudi more
> > > > easily accessible and operable by all personas (non-engineers,
> analysts
> > > > etc). Users can now use SQL statements like
> > > > "CREATE TABLEUSING HUDI" and "CREATE TABLE .. AS SELECT" to
> > > > create/manage tables in catalogs like Hive,
> > > > and "INSERT", "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE"
> > > > statements to manipulate data.
> > > > For more information, checkout our docs here
> > > >  clicking on the
> > > SparkSQL
> > > > tab.
> > > >
> > > > *Query Side Improvements*
> > > > Hudi tables are now registered with Hive as spark datasource tables,
> > > > meaning Spark SQL on these tables now uses the
> > > > datasource as well, instead of relying on the Hive fallbacks within
> > > Spark,
> > > > which are ill-maintained/cumbersome. This
> > > > unlocks many optimizations such as the use of Hudi's own FileIndex
> > > > <
> > > >
> > >
> >
> https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46
> > > > >
> > > > implementation for optimized caching and the use
> > > > of the Hudi metadata table, for faster listing of large tables. We
> have
> > > > also added support for time travel query
> > > > ,
> > for
> > > > spark
> > > > datasource.
> > > >
> > > > *Writer Side Improvements*
> > > > This release has several major writer side improvements. Virtual key
> > > > support has been added to avoid populating meta
> > > > fields and leverage existing fields to populate record keys and
> > partition
> > > > paths.
> > > > Bulk Insert operation using row writer is now enabled by default for
> > > faster
> > > > inserts.
> > > > Hudi's automatic cleaning of uncommitted data has been enhanced to be
> > > > performant over cloud stores. You can learn
> > > > more about this new centrally coordinated marker mechanism in this
> blog
> > > >  >.
> > > > Async Clustering support has been added to both DeltaStreamer and
> Spark
> > > > Structured Streaming Sink. More on this
> > > > can be found in this blog
> > > > .
> > > > Users can choose to drop fields used to generate partition paths.
> > > > Added a new write operation "delete_partition" support in spark.
> Users
> > > can
> > > > leverage this to delete older partitions in
> > > > bulk, in addition to record level deletes.
> > > > Added Support for Huawei Cloud Object Storage, BAIDU AFS storage
> > format,
> > > > Baidu BOS storage in Hudi.
> > > > A pre commit validator framework
> > > > <
> > > >
> > >
> >
> https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java
> > > > >
> > > > has been added for spark engine, which can be used for DeltaStreamer
> > and
> > > > Spark
> > > > Datasource writers. Users can leverage this to add any validations to
> > be
> > > > executed before committing writes to Hudi.
> > > > Few out of the box validators are available like
> > > > SqlQueryEqualityPreCommitValidator
> > > > <
> > > >
> > >
> >
> https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java
> > > > >,
> > > > SqlQueryInequalityPreCommitValidator
> > > > <
> > > >
> > >
> >
> https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java
> > > > >
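
To ground the Spark SQL DDL/DML highlight above, a minimal Scala sketch of
the kind of statements it enables is shown below, assuming a 0.9.0 Hudi Spark
bundle on the classpath and the Hudi SQL session extension enabled; the
table, columns and values are illustrative only and do not come from the
announcement.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-sql-sketch")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .getOrCreate()

// DDL: create a COW table managed by Hudi.
spark.sql(
  """CREATE TABLE IF NOT EXISTS hudi_trips (
    |  uuid STRING, rider STRING, fare DOUBLE, ts BIGINT
    |) USING hudi
    |TBLPROPERTIES (type = 'cow', primaryKey = 'uuid', preCombineField = 'ts')""".stripMargin)

// DML: insert, then upsert via MERGE INTO.
spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 19.10, 1000)")
spark.sql(
  """MERGE INTO hudi_trips t
    |USING (SELECT 'id-1' AS uuid, 'rider-B' AS rider, 27.70 AS fare, 2000 AS ts) s
    |ON t.uuid = s.uuid
    |WHEN MATCHED THEN UPDATE SET *
    |WHEN NOT MATCHED THEN INSERT *""".stripMargin)

spark.sql("SELECT uuid, rider, fare FROM hudi_trips").show()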

[ANNOUNCE] Hudi Community Update(2021-08-15 ~ 2021-08-29)

2021-08-29 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-08-15 ~ 2021-08-29
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Support Flink batch upsert [1]
[Core] A Hoodie columns sort partitioner for bulk insert [2]



[1] https://issues.apache.org/jira/browse/HUDI-2316
[2] https://issues.apache.org/jira/browse/HUDI-2345


===
Bugs

[Core] fix FileSliceMetrics utils bug [1]
[Core] HoodieCompactionConfig get HoodieCleaningPolicy NullPointerException
[2]
[Core] Include _hoodie_operation meta column in removeMetadataFields [3]
[Core] Use correct meta columns while preparing dataset for bulk insert [4]
[Spark Integration] Create Table If Not Exists Failed After Alter Table [5]
[Flink Integration] Merge the data set for flink bounded source when
changelog mode turns off [6]
[Flink Integration] Optimize Bootstrap operator [7]
[Spark Integration] Support referencing subquery with column aliases by
table alias [8]
[Flink Integration] The upgrade downgrade action of flink writer should be
singleton [9]
[Spark Integration] MERGE INTO doesn't work for tables created using CTAS
[10]
[Core] Catch Throwable in BoundedInMemoryExecutor  [11]
[Core] Use the caller classloader for ReflectionUtils [12]
[Core] Refactor HoodieFlinkStreamer to reuse the pipeline of HoodieTableSink
[13]
[Core] Optimizing overwriteField method with Objects.equals [14]
[Flink Integration] Improve flink streaming reader [15]


[1] https://issues.apache.org/jira/browse/HUDI-2301
[2] https://issues.apache.org/jira/browse/HUDI-2167
[3] https://issues.apache.org/jira/browse/HUDI-1363
[4] https://issues.apache.org/jira/browse/HUDI-2322
[5] https://issues.apache.org/jira/browse/HUDI-2339
[6] https://issues.apache.org/jira/browse/HUDI-2340
[7] https://issues.apache.org/jira/browse/HUDI-2342
[8] https://issues.apache.org/jira/browse/HUDI-2259
[9] https://issues.apache.org/jira/browse/HUDI-2352
[10] https://issues.apache.org/jira/browse/HUDI-2357
[11] https://issues.apache.org/jira/browse/HUDI-2368
[12] https://issues.apache.org/jira/browse/HUDI-2321
[13] https://issues.apache.org/jira/browse/HUDI-2229
[14] https://issues.apache.org/jira/browse/HUDI-2365
[15] https://issues.apache.org/jira/browse/HUDI-2371


==
Tests

[Tests] Adding spark delete node to integ test suite [1]
[Tests] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes
[2]
[Tests] Refactor HoodieSparkSqlWriterSuite to add setup and teardown [3]

[1] https://issues.apache.org/jira/browse/HUDI-2349
[2] https://issues.apache.org/jira/browse/HUDI-2359
[3] https://issues.apache.org/jira/browse/HUDI-2264

Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-08-01 ~ 2021-08-15)

2021-08-15 Thread leesf
[11] https://issues.apache.org/jira/browse/HUDI-2292
[12] https://issues.apache.org/jira/browse/HUDI-2286
[13] https://issues.apache.org/jira/browse/HUDI-2298
[14] https://issues.apache.org/jira/browse/HUDI-1518
[15] https://issues.apache.org/jira/browse/HUDI-1292
[16] https://issues.apache.org/jira/browse/HUDI-2151
[17] https://issues.apache.org/jira/browse/HUDI-2119
[18] https://issues.apache.org/jira/browse/HUDI-2307
[19] https://issues.apache.org/jira/browse/HUDI-2305


==
Tests

[Tests] Migrating some long running tests to functional test profile [1]

[1] https://issues.apache.org/jira/browse/HUDI-2273


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-07-18 ~ 2021-08-01)

2021-08-01 Thread leesf

==
Tests

[Tests] Fixing hudi_test_suite for spark nodes and adding spark bulk_insert
node [1]
[Tests] Fix NullPointerException in TestHoodieConsoleMetrics [2]
[Tests] Refactoring a few tests to reduce running time: DeltaStreamer and
MultiDeltaStreamer tests, bulk insert row writer tests [3]

[1] https://issues.apache.org/jira/browse/HUDI-2007
[2] https://issues.apache.org/jira/browse/HUDI-2211
[3] https://issues.apache.org/jira/browse/HUDI-2253

Best,
Leesf


Re: [DISCUSS] Disable ASF GitHub Bot comments under the JIRA issue

2021-08-01 Thread leesf
+1 to disable.

Vinoth Chandar  于2021年7月28日周三 上午12:37写道:

> Anybody with strong opinions to keep them?
> I am happy to go back to clicking to get to github links.
>
> On Tue, Jul 27, 2021 at 6:33 AM xuedong luan 
> wrote:
>
> > +1
> >
> > Danny Chan  于2021年7月27日周二 上午10:38写道:
> >
> > > I found that there are many ASF GitHub Bot comments under our issue
> now,
> > it
> > > messes up with the design discussions and is hard to read. The normal
> > > comments are drowned in these junk messages.
> > >
> > > So i request to disable it to make the JIRA comments clear and clean.
> > >
> > > Best,
> > > Danny Chan
> > >
> >
>


[ANNOUNCE] Hudi Community Update(2021-07-04 ~ 2021-07-18)

2021-07-18 Thread leesf
[3] https://issues.apache.org/jira/browse/HUDI-2129
[4] https://issues.apache.org/jira/browse/HUDI-2131
[5] https://issues.apache.org/jira/browse/HUDI-2122
[6] https://issues.apache.org/jira/browse/HUDI-2132
[7] https://issues.apache.org/jira/browse/HUDI-2106
[8] https://issues.apache.org/jira/browse/HUDI-2098
[9] https://issues.apache.org/jira/browse/HUDI-2046
[10] https://issues.apache.org/jira/browse/HUDI-2093
[11] https://issues.apache.org/jira/browse/HUDI-2061
[12] https://issues.apache.org/jira/browse/HUDI-2016
[13] https://issues.apache.org/jira/browse/HUDI-2115
[14] https://issues.apache.org/jira/browse/HUDI-2069
[15] https://issues.apache.org/jira/browse/HUDI-2134
[16] https://issues.apache.org/jira/browse/HUDI-2009
[17] https://issues.apache.org/jira/browse/HUDI-2099
[18] https://issues.apache.org/jira/browse/HUDI-2136
[19] https://issues.apache.org/jira/browse/HUDI-2143
[20] https://issues.apache.org/jira/browse/HUDI-2142
[21] https://issues.apache.org/jira/browse/HUDI-2144
[22] https://issues.apache.org/jira/browse/HUDI-2168
[23] https://issues.apache.org/jira/browse/HUDI-2180
[24] https://issues.apache.org/jira/browse/HUDI-2149
[25] https://issues.apache.org/jira/browse/HUDI-2153
[26] https://issues.apache.org/jira/browse/HUDI-2185

==
Tests

[Tests] Fix integration testing failure caused by sql results out of order
[1]
[Tests] Fixed the unit test
TestHoodieBackedMetadata.testOnlyValidPartitionsAdded [2]
[Tests] Update unit tests to support ORC as the base file format [3]

[1] https://issues.apache.org/jira/browse/HUDI-2113
[2] https://issues.apache.org/jira/browse/HUDI-2140
[3] https://issues.apache.org/jira/browse/HUDI-1828


Best,
Leesf


Welcome New Committers: Pengzhiwei and DannyChan

2021-07-16 Thread leesf
Hi all,

Please join me in congratulating our newest committers *Pengzhiwei *and
* DannyChan.*

*Pengzhiwei* has been a consistent contributor to Hudi. He has contributed
numerous features, such as the Spark SQL integration with Hudi, the Spark
Structured Streaming source for Hudi and the Spark FileIndex for Hudi, along
with lots of other good contributions around Spark, and he is very active in
answering users' questions. He is a solid team player and an asset to the
project.

*DannyChan* has contributed many good features, such as the new streaming
write pipeline for Flink with automatic compaction and cleaning (COW and
MOR), batch and streaming readers for Flink (COW and MOR) and the Flink SQL
connectors (reader and writer). He actively joins the mailing list to answer
users' questions, and he also wrote a Hudi Flink integration guide and
launched a live show to promote the Hudi Flink integration for Chinese users.

Thanks so much for your continued contributions to make Hudi better and
better!

I would also like to introduce the current state of Hudi in China. Hudi
has become more and more popular in China with the help of all community
members and has been adopted by almost all top companies in China,
including Alibaba, Baidu, ByteDance, Huawei, Tencent and others, from
startups to large companies, with data scale ranging from TB to PB. You can
find the logo wall below (PS: *unofficial statistics*, only some are listed,
and you can contact me to add your company logo if you want).

We could not have achieved this without such a good community and the
contributions of all community members. Cheers and Go!

[image: poweredby-0706.png]

Thanks,
Leesf
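
For readers new to the Flink SQL connector mentioned in the note above, the
following is a minimal sketch (not part of the original announcement) of how
such a Hudi table can be declared and read in streaming mode via the Flink
Table API. The option names follow the Hudi Flink quick start guide and
should be verified there; the table path, schema, check interval and row
values are placeholders.

  import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

  // Sketch only: assumes a Flink 1.11+ job with the hudi-flink bundle on the
  // classpath; every path and column below is a placeholder.
  val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
  val tableEnv = TableEnvironment.create(settings)

  tableEnv.executeSql(
    """CREATE TABLE hudi_demo (
      |  uuid VARCHAR(20),
      |  name VARCHAR(10),
      |  age  INT,
      |  ts   TIMESTAMP(3),
      |  `partition` VARCHAR(20)
      |)
      |PARTITIONED BY (`partition`)
      |WITH (
      |  'connector' = 'hudi',                  -- Flink SQL connector (source and sink)
      |  'path' = 'file:///tmp/hudi_demo',      -- placeholder table base path
      |  'table.type' = 'MERGE_ON_READ',        -- COPY_ON_WRITE is the default
      |  'read.streaming.enabled' = 'true',     -- streaming read of new commits
      |  'read.streaming.check-interval' = '4'  -- seconds between checks
      |)""".stripMargin)

  // Write one row through the connector, then continuously read the table back.
  tableEnv.executeSql(
    "INSERT INTO hudi_demo VALUES " +
      "('id1', 'Danny', 27, TIMESTAMP '2021-07-16 00:00:01', 'par1')")
  tableEnv.executeSql("SELECT * FROM hudi_demo").print()

The same table definition, minus the read.streaming.* options, can be used
for plain batch reads of the COW or MOR table.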


[ANNOUNCE] Hudi Community Update(2021-06-20 ~ 2021-07-04)

2021-07-04 Thread leesf
jira/browse/HUDI-2038
[11] https://issues.apache.org/jira/browse/HUDI-2061
[12] https://issues.apache.org/jira/browse/HUDI-2053
[13] https://issues.apache.org/jira/browse/HUDI-2069
[14] https://issues.apache.org/jira/browse/HUDI-2062
[15] https://issues.apache.org/jira/browse/HUDI-2073
[16] https://issues.apache.org/jira/browse/HUDI-2074
[17] https://issues.apache.org/jira/browse/HUDI-2067
[18] https://issues.apache.org/jira/browse/HUDI-2084
[19] https://issues.apache.org/jira/browse/HUDI-2097
[20] https://issues.apache.org/jira/browse/HUDI-2092
[21] https://issues.apache.org/jira/browse/HUDI-2103
[22] https://issues.apache.org/jira/browse/HUDI-2088
[23] https://issues.apache.org/jira/browse/HUDI-2105
[24] https://issues.apache.org/jira/browse/HUDI-2114
[25] https://issues.apache.org/jira/browse/HUDI-2123
[26] https://issues.apache.org/jira/browse/HUDI-2057
[27] https://issues.apache.org/jira/browse/HUDI-2116

==
Tests

[Tests] Increase timeout for deltaStreamerTestRunner in
TestHoodieDeltaStreamer [1]
[Tests] Fix TestHoodieBackedMetadata#testOnlyValidPartitionsAdded [2]
[Tests] Added tests for KafkaOffsetGen [3]
[Tests] Move schema util tests out from TestHiveSyncTool [4]
[Tests] Adding more yaml templates to test suite [5]

[1] https://issues.apache.org/jira/browse/HUDI-1248
[2] https://issues.apache.org/jira/browse/HUDI-2064
[3] https://issues.apache.org/jira/browse/HUDI-2060
[4] https://issues.apache.org/jira/browse/HUDI-2081
[5] https://issues.apache.org/jira/browse/HUDI-2006

Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-06-06 ~ 2021-06-20)

2021-06-20 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-06-06 ~ 2021-06-20
with updates on features, bug fixes and tests.


===
Features

[CLI] Add fetching latest schema to table command in hudi-cli [1]
[Spark Integration] Added support for SqlFileBasedTransformer [2]
[Flink Integration] add BootstrapFunction to support index bootstrap [3]
[Spark Integration] Basic Implement Of Spark Sql Support For Hoodie [4]
[Core] Support configure KeyGenerator by type [5]
[Spark Integration] Added SqlSource to fetch data from any partitions for
backfill use case [6]
[Flink Integration] Support independent flink hudi compaction function [7]
[Core] ORC reader writer Implementation [8]
[Flink Integration] Support flink hive sync in batch mode [9]
[Flink Integration] Add metadata cache to WriteProfile to reduce IO [10]
[Flink Integration] Make Flink writer exactly-once by default [11]
[DeltaStreamer] Add JDBC source support for DeltaStreamer [12]


[1] https://issues.apache.org/jira/browse/HUDI-1914
[2] https://issues.apache.org/jira/browse/HUDI-1743
[3] https://issues.apache.org/jira/browse/HUDI-1924
[4] https://issues.apache.org/jira/browse/HUDI-1659
[5] https://issues.apache.org/jira/browse/HUDI-1929
[6] https://issues.apache.org/jira/browse/HUDI-1790
[7] https://issues.apache.org/jira/browse/HUDI-1984
[8] https://issues.apache.org/jira/browse/HUDI-765
[9] https://issues.apache.org/jira/browse/HUDI-2014
[10] https://issues.apache.org/jira/browse/HUDI-2030
[11] https://issues.apache.org/jira/browse/HUDI-2040
[12] https://issues.apache.org/jira/browse/HUDI-251


===
Bugs

[Spark Integration] Add Default value for HIVE_AUTO_CREATE_DATABASE_OPT_KEY
in HoodieSparkSqlWriter [1]
[Flink Integration] BucketAssignFunction use ValueState instead of MapState
[2]
[Flink Integration] Skip Commits with empty files [3]
[Core] Fix NPE when avro field value is null [4]
[Flink Integration] Skip creating marker files for flink merge handle [5]
[Flink Integration] Fix non partition table hive meta sync for flink writer
[6]
[Flink Integration] Release the new records map for merge handle #close [7]
[Flink Integration] Release the new records iterator for append handle
#close [8]
[Flink Integration] Release file writer for merge handle #close [9]
[Spark Integration] Fixing drop dups exception in bulk insert row writer
path [10]
[Flink Integration] Refresh the base file view cache for WriteProfile [11]
[Flink Integration] Release writer for append handle #close [12]
[Code Cleanup] Avoid the raw type usage in some classes under
hudi-utilities module [13]
[Core] Fix missing filter condition in the judgment condition of the
compaction instant [14]
[Flink Integration] Fix flink operator uid to allow multiple pipelines in
one job [15]
[Spark Integration] Fix RO Tables Returning Snapshot Result [16]
[Spark Integration] Set up the file system view storage config for
singleton embedded server write config every time [17]
[Flink Integration] Make keygen class and keygen type optional for
FlinkStreamerConfig [18]
[Spark Integration] ClassCastException Throw When PreCombineField Is String
Type [19]
[Flink Integration] Move the compaction plan scheduling out of flink writer
coordinator [20]


[1] https://issues.apache.org/jira/browse/HUDI-1942
[2] https://issues.apache.org/jira/browse/HUDI-1931
[3] https://issues.apache.org/jira/browse/HUDI-1909
[4] https://issues.apache.org/jira/browse/HUDI-1895
[5] https://issues.apache.org/jira/browse/HUDI-1723
[6] https://issues.apache.org/jira/browse/HUDI-1987
[7] https://issues.apache.org/jira/browse/HUDI-1992
[8] https://issues.apache.org/jira/browse/HUDI-1994
[9] https://issues.apache.org/jira/browse/HUDI-2000
[10] https://issues.apache.org/jira/browse/HUDI-1991
[11] https://issues.apache.org/jira/browse/HUDI-1999
[12] https://issues.apache.org/jira/browse/HUDI-2022
[13] https://issues.apache.org/jira/browse/HUDI-2008
[14] https://issues.apache.org/jira/browse/HUDI-1955
[15] https://issues.apache.org/jira/browse/HUDI-2015
[16] https://issues.apache.org/jira/browse/HUDI-1879
[17] https://issues.apache.org/jira/browse/HUDI-2019
[18] https://issues.apache.org/jira/browse/HUDI-2032
[19] https://issues.apache.org/jira/browse/HUDI-2033
[20] https://issues.apache.org/jira/browse/HUDI-2036

==
Tests

[Tests] Move TestHiveMetastoreBasedLockProvider to functional [1]
[Tests] Move CheckpointUtils test cases to independent class [2]
[Tests] Fix Azure CI failure in TestParquetUtils [3]

[1] https://issues.apache.org/jira/browse/HUDI-1950
[2] https://issues.apache.org/jira/browse/HUDI-2004
[3] https://issues.apache.org/jira/browse/HUDI-1950

Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-05-23 ~ 2021-06-06)

2021-06-06 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-05-22 ~ 2021-06-06
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Exactly-once write for flink writer [1]
[Spark Integration] Support Partition Prune For MergeOnRead Snapshot Table
 [2]
[Flink Integration] Improve HoodieFlinkStreamer [3]
[Flink Integration] Refactor BucketAssigner to make it more efficient [4]
[Flink Integration] Add target io option for flink compaction [5]


[1] https://issues.apache.org/jira/browse/HUDI-1923
[2] https://issues.apache.org/jira/browse/HUDI-1879
[3] https://issues.apache.org/jira/browse/HUDI-1927
[4] https://issues.apache.org/jira/browse/HUDI-1949
[5] https://issues.apache.org/jira/browse/HUDI-1921


===
Bugs

[Spark Integration] collect() call causing issues with very large upserts
[1]
[Flink Integration] Type mismatch when streaming read of copy_on_write table
using Flink [2]
[Core] Set archived as the default value of
HOODIE_ARCHIVELOG_FOLDER_PROP_NAME [3]
[Flink Integration] Close the file handles gracefully for flink write
function to avoid corrupted files [4]
[Core] Fix path selector listing files with the same mod date [5]
[Core] Bulk insert with row writer supports mor table [6]
[Flink Integration] Make embedded time line service singleton [7]
[Flink Integration] Exclude file slices in pending compaction when
performing small file sizing [8]
[Flink Integration] Shade kryo-shaded jar for hudi flink bundle [9]
[Flink Integration] Fix losing properties during HoodieWriteConfig
initialization [10]
[Flink Integration] Fix hive3 meta sync for flink writer [11]
[Flink Integration] Fix NPE due to not set the output type of the operator
[12]
[Flink Integration] Fix flink timeline service lack jetty dependency [13]
[Flink Integration] only reset bucket when flush bucket success [14]
[Core] Add deltacommit to ActionType [15]
[Hive Integration] Fix the NPE for MOR Hive rt table query [16]


[1] https://issues.apache.org/jira/browse/HUDI-1873
[2] https://issues.apache.org/jira/browse/HUDI-1919
[3] https://issues.apache.org/jira/browse/HUDI-1920
[4] https://issues.apache.org/jira/browse/HUDI-1895
[5] https://issues.apache.org/jira/browse/HUDI-1723
[6] https://issues.apache.org/jira/browse/HUDI-1922
[7] https://issues.apache.org/jira/browse/HUDI-1865
[8] https://issues.apache.org/jira/browse/HUDI-1800
[9] https://issues.apache.org/jira/browse/HUDI-1948
[10] https://issues.apache.org/jira/browse/HUDI-1943
[11] https://issues.apache.org/jira/browse/HUDI-1952
[12] https://issues.apache.org/jira/browse/HUDI-1953
[13] https://issues.apache.org/jira/browse/HUDI-1957
[14] https://issues.apache.org/jira/browse/HUDI-1917
[15] https://issues.apache.org/jira/browse/HUDI-1281
[16] https://issues.apache.org/jira/browse/HUDI-1967

==
Tests

[Tests] Add SqlQueryBasedTransformer unit test [1]
[Tests] Add a debezium json integration test case for flink [2]

[1] https://issues.apache.org/jira/browse/HUDI-1940
[2] https://issues.apache.org/jira/browse/HUDI-1961


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-05-09 ~ 2021-05-23)

2021-05-23 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-05-09 ~ 2021-05-22
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Avoid to generates corrupted files for flink sink [1]
[Core] Support reading older snapshots [2]
[Flink Integration] Global index for flink writer [3]
[Flink Integration] Reuse the partition path and file group id for flink
write data buffer [4]


[1] https://issues.apache.org/jira/browse/HUDI-1886
[2] https://issues.apache.org/jira/browse/HUDI-1789
[3] https://issues.apache.org/jira/browse/HUDI-1902
[4] https://issues.apache.org/jira/browse/HUDI-1911


===
Bugs

[Core] Reduce log level for overly verbose messages from info to debug level
[1]
[Flink Integration] FlinkCreateHandle and FlinkAppendHandle canWrite should
always return true [2]
[Flink Integration] Validate required fields for Flink HoodieTable [3]
[Flink Integration] Close the file handles gracefully for flink write
function to avoid corrupted files [4]
[Spark Integration] Fix hive beeline/spark-sql query specified field on mor
table occur NPE [5]
[Flink Integration] Always close the file handle for a flink mini-batch
write [6]
[Flink Integration] Support skip bootstrapIndex's init in abstract fs view
init [7]
[Flink Integration] Clean the corrupted files generated by
FlinkMergeAndReplaceHandle [8]
[Hive Integration] Honoring skipROSuffix in Spark datasource [9]
[Core] Using streams instead of loops for input/output [10]
[Flink Integration] Fix the file id for write data buffer before flushing
[11]
[Flink Integration] Fix hive conf for Flink writer hive meta sync [12]
[Hive Integration] Hive on Spark/MR: incremental query of the MOR table
returns an incorrect partition field [13]
[Flink Integration] Remove the metadata sync logic in
HoodieFlinkWriteClient#preWrite because it is not thread safe [14]
[Core] Fix NPE when the nested partition path field has a null value [15]
[Flink Integration] Fix incorrect keyBy field cause serious data skew, to
avoid multiple subtasks write to a partition at the same time [16]
[Core] Fix insert-overwrite API archival [17]


[1] https://issues.apache.org/jira/browse/HUDI-1707
[2] https://issues.apache.org/jira/browse/HUDI-1890
[3] https://issues.apache.org/jira/browse/HUDI-1818
[4] https://issues.apache.org/jira/browse/HUDI-1895
[5] https://issues.apache.org/jira/browse/HUDI-1722
[6] https://issues.apache.org/jira/browse/HUDI-1900
[7] https://issues.apache.org/jira/browse/HUDI-1446
[8] https://issues.apache.org/jira/browse/HUDI-1876
[9] https://issues.apache.org/jira/browse/HUDI-1806
[10] https://issues.apache.org/jira/browse/HUDI-1913
[11] https://issues.apache.org/jira/browse/HUDI-1915
[12] https://issues.apache.org/jira/browse/HUDI-1871
[13] https://issues.apache.org/jira/browse/HUDI-1719
[14] https://issues.apache.org/jira/browse/HUDI-1917
[15] https://issues.apache.org/jira/browse/HUDI-1888
[16] https://issues.apache.org/jira/browse/HUDI-1918
[17] https://issues.apache.org/jira/browse/HUDI-1740

==
Tests

[Tests] Adding test suite long running automate scripts for docker [1]
[Tests] Remove hardcoded parquet in tests [2]
[Tests] add spark datasource unit test for schema validate add column [3]

[1] https://issues.apache.org/jira/browse/HUDI-1851
[2] https://issues.apache.org/jira/browse/HUDI-1055
[3] https://issues.apache.org/jira/browse/HUDI-1768


Best,
Leesf


Re: Contributor Permission Application

2021-05-23 Thread leesf
Done and welcome!

manasa s  于2021年5月23日周日 下午6:47写道:

> HI ,
> I want to contribute to apache hudi .
> Could you please give me necessary permission.
> Jira id - manasaks
>
> Regards,
> Manasa.
>


Re: Contributor permission application

2021-05-23 Thread leesf
Done and welcome to the community.

Well Tang  于2021年5月23日周日 上午9:54写道:

> Hi,
>
>
>
>
> I want to contribute to Apache Hudi.
>
> Would you please give me the contributor permission?
>
> My JIRA ID is HUDI-1919. Grazie molto!


Re: Welcome new committers and PMC Members!

2021-05-12 Thread leesf
Congratulations Gary and Wenning

Nishith  于2021年5月12日周三 上午11:23写道:

> Congratulations Gary and Wenning!
>
> -Nishith
>
> > On May 11, 2021, at 7:18 PM, vino yang  wrote:
> >
> > Congrats to Gary and Wenning!
> >
> > wangxianghu  于2021年5月12日周三 上午8:40写道:
> >
> >> Congratulations @Gary Li and @Wenning Ding!
> >>
> >>> 2021年5月12日 上午7:18,Prashant Wason  写道:
> >>>
> >>> Congratulations Gary and Wenning!
> >>>
> >>> On Tue, May 11, 2021 at 3:59 PM Raymond Xu <
> xu.shiyan.raym...@gmail.com>
> >>> wrote:
> >>>
>  Big congrats to Gary and Wenning!
> 
>  On Tue, May 11, 2021 at 1:14 PM vbal...@apache.org <
> vbal...@apache.org>
>  wrote:
> 
> > Many Congratulations Gary Li and Wenning Ding. Well deserved !!
> > Balaji.V
> >   On Tuesday, May 11, 2021, 01:06:47 PM PDT, Bhavani Sudha <
> > bhavanisud...@gmail.com> wrote:
> >
> > Congratulations @Gary Li and @Wenning Ding!
> > On Tue, May 11, 2021 at 12:42 PM Vinoth Chandar 
>  wrote:
> >
> > Hello all,
> > Please join me in congratulating our newest set of committers and
> PMCs.
> > Wenning Ding (Committer) Wenning has been a consistent contributor to
> > Hudi, over the past year or so. He has added some critical bug fixes,
>  lots
> > of good contributions around Spark!
> > Gary Li (PMC Member) Gary is a regular feature on all our support
> > channels. He has contributed numerous features to Hudi, and
> evangelized
> > across many companies including Bosch/Bytedance. Most of all, he is a
>  solid
> > team player and an asset to the project.
> > Thanks so much for your continued contributions, to make Hudi better
> >> and
> > better!
> > Thanks, Vinoth
> >
> >
> 
> >>
> >>
>


[ANNOUNCE] Hudi Community Update(2021-04-25 ~ 2021-05-09)

2021-05-09 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-04-25 ~ 2021-05-09
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Add option to flush when total buckets memory exceeds
the threshold [1]
[Core] Add optional instant range to log record scanner for log [2]
[Deltastreamer] Improve table level config priority for
HoodieMultiTableDeltaStreamer [3]
[Flink Integration] Tweak the min max commits to keep when setting up
cleaning retain commits for Flink [4]
[Flink Integration] Logging consuming instant to
StreamReadOperator#processSplits [5]
[Spark Integration] use jsc union instead of rdd union [6]
[Flink Integration] Add rate limiter to Flink writer to avoid OOM for
bootstrap [7]
[Flink Integration] Streaming read for Flink COW table  [8]
[Deltastreamer] Add SCHEMA_REGISTRY_SOURCE_URL_SUFFIX and
SCHEMA_REGISTRY_TARGET_URL_SUFFIX property [9]
[Flink Integration] Remove legacy code for Flink writer [10]
[Flink Integration] Support streaming read with compaction and cleaning [11]
[Flink Integration] Add max memory option for flink writer task [12]



[1] https://issues.apache.org/jira/browse/HUDI-1844
[2] https://issues.apache.org/jira/browse/HUDI-1837
[3] https://issues.apache.org/jira/browse/HUDI-1742
[4] https://issues.apache.org/jira/browse/HUDI-1841
[5] https://issues.apache.org/jira/browse/HUDI-1836
[6] https://issues.apache.org/jira/browse/HUDI-1690
[7] https://issues.apache.org/jira/browse/HUDI-1863
[8] https://issues.apache.org/jira/browse/HUDI-1867
[9] https://issues.apache.org/jira/browse/HUDI-1852
[10] https://issues.apache.org/jira/browse/HUDI-1821
[11] https://issues.apache.org/jira/browse/HUDI-1880
[12] https://issues.apache.org/jira/browse/HUDI-1878


===
Bugs

[Core] Fixing kafka native config param for auto offset reset [1]
[Core] rollback pending clustering even if there is greater commit [2]
[Flink Integration] Fix cannot create table due to jar conflict [3]
[Hive Integration] Exception Throws When Sync Non-Partitioned Table To Hive
With MultiPartKeysValueExtractor [4]
[Spark Integration] Fix getting incorrect partition path while using incr
query by spark-sql [5]
[Flink Integration] Fix Flink streaming reader throws ClassCastException [6]
[Flink Integration] When querying the incremental view of a MOR table with
multi-level partitions, the query fails [7]
[Core] wiring in Hadoop Conf with AvroSchemaConverters instantiation [8]
[Hive Integration] Save one connection retry to Hive metastore when
HiveSyncTool runs with useJdbc=false [9]


[1] https://issues.apache.org/jira/browse/HUDI-1835
[2] https://issues.apache.org/jira/browse/HUDI-1833
[3] https://issues.apache.org/jira/browse/HUDI-1858
[4] https://issues.apache.org/jira/browse/HUDI-1798
[5] https://issues.apache.org/jira/browse/HUDI-1801
[6] https://issues.apache.org/jira/browse/HUDI-1781
[7] https://issues.apache.org/jira/browse/HUDI-1718
[8] https://issues.apache.org/jira/browse/HUDI-1876
[9] https://issues.apache.org/jira/browse/HUDI-1759


==
Tests

[Tests] Fix TestHoodieRealtimeRecordReader [1]
[Tests] Fix azure setting for integ tests [2]
[Tests] Fix Metrics UT [3]

[1] https://issues.apache.org/jira/browse/HUDI-1811
[2] https://issues.apache.org/jira/browse/HUDI-1810
[3] https://issues.apache.org/jira/browse/HUDI-1620


Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-04-11 ~ 2021-04-25)

2021-04-25 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-04-11 ~ 2021-04-25
with updates on features, bug fixes and tests.


===
Features

[Hudi Client] Move OperationConverter to hudi-client-common for code reuse
 [1]
[Flink Integration] Add option for merge max memory [2]
[Flink Integration] Insert overwrite (table) for Flink writer [3]
[Core] Support BAIDU AFS storage format in hudi [4]
[CLI] Add Hudi-CLI support for clustering [5]
[Spark Integration] Read Hoodie Table As Spark DataSource Table [5]
[Flink Integration] Non partitioned table for Flink writer  [6]
[Flink Integration] Add explicit index state TTL option for Flink writer
 [7]
[CLI] Added support for replace commits in commit showpartitions, commit
show_write_stats, commit showfiles [8]
[Core] Add support for BigDecimal and Integer when partitioning based on
time. [9]



[1] https://issues.apache.org/jira/browse/HUDI-1785
[2] https://issues.apache.org/jira/browse/HUDI-1786
[3] https://issues.apache.org/jira/browse/HUDI-1788
[4] https://issues.apache.org/jira/browse/HUDI-1803
[5] https://issues.apache.org/jira/browse/HUDI-1415
[6] https://issues.apache.org/jira/browse/HUDI-1814
[7] https://issues.apache.org/jira/browse/HUDI-1812
[8] https://issues.apache.org/jira/browse/HUDI-1746
[9] https://issues.apache.org/jira/browse/HUDI-1551


===
Bugs

[Flink Integration] Remove the rocksdb jar from hudi-flink-bundle  [1]
[Core] Fix RealtimeCompactedRecordReader StackOverflowError [2]
[Spark Integration] Fixing usage of NULL schema for delete operation in
HoodieSparkSqlWriter [3]
[Flink Integration] Flink streaming reader should always monitor the delta
commits files [4]
[Flink Integration] FlinkMergeHandle rolling over may fail to rename the
latest file handle [5]
[Flink Integration] flink-client query error when processing files larger
than 128 MB [6]
[Flink Integration] Continue to write when a Flink write task restarts
because of container killing [7]
[Spark Integration] Resolving default values for schema from dataframe [8]
[Timeline Server] Timeline Server Bundle needs to include the
com.esotericsoftware package [9]
[Core] Rollback fails on MOR table when the partition path has no files
[10]
[Flink Integration] Flink merge on read input split uses wrong base file
path for default merge type [11]
[Flink Integration] Use while loop instead of recursive call in
MergeOnReadInputFormat to avoid StackOverflow [12]

[1] https://issues.apache.org/jira/browse/HUDI-1787
[2] https://issues.apache.org/jira/browse/HUDI-1720
[3] https://issues.apache.org/jira/browse/HUDI-1751
[4] https://issues.apache.org/jira/browse/HUDI-1798
[5] https://issues.apache.org/jira/browse/HUDI-1801
[6] https://issues.apache.org/jira/browse/HUDI-1792
[7] https://issues.apache.org/jira/browse/HUDI-1804
[8] https://issues.apache.org/jira/browse/HUDI-1716
[9] https://issues.apache.org/jira/browse/HUDI-1802
[10] https://issues.apache.org/jira/browse/HUDI-1744
[11] https://issues.apache.org/jira/browse/HUDI-1809
[12] https://issues.apache.org/jira/browse/HUDI-1829

==
Tests

[Tests] Added tests to TestHoodieTimelineArchiveLog for the archival of
completed clean and rollback actions [1]

[1] https://issues.apache.org/jira/browse/HUDI-1714



Best,
Leesf


Re: confluence permission & jira permisson apply

2021-04-16 Thread leesf
done and welcome to the community.

Roc Marshal  于2021年4月16日周五 下午10:14写道:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> confluence permission  and  jira permisson ?
> My Confluence ID is roc-marshal. Full name is RocMarshal.
> My JIRA ID is RocMarshal.
> Thank you .
>
>
> Best, Roc.


Re: confluence permission apply

2021-04-15 Thread leesf
done and welcome.

Brit  于2021年4月16日周五 下午1:38写道:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> confluence permission? My Confluence ID is Xu Guang Lv


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread leesf
+1. Cool and promising.

Mehrotra, Udit  于2021年4月14日周三 上午2:57写道:

> Agree with the rebranding Vinoth. Hudi is not just a "table format" and we
> need to do justice to all the cool auxiliary features/services we have
> built.
>
> Also, timeline metadata service in particular would be a really big win if
> we move towards something like that.
>
> On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  wrote:
>
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender and
> know the content is safe.
>
>
>
> Definitely we are doing much more than only ingesting and managing data
> over DFS.
>
> +1 from my side as well. :)
>
> On Tue, Apr 13, 2021 at 10:02 PM Susu Dong 
> wrote:
>
> > I love this rebranding. Totally agree. +1
> >
> > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > +1 The vision looks fantastic.
> > >
> > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:
> > >
> > > > Awesome summary of Hudi! +1 as well.
> > > >
> > > > Gary Li
> > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> rubenssoto2...@gmail.com>
> > > > wrote:
> > > > > Excellent, I agree
> > > > >
> > > > > Em ter, 13 de abr de 2021 07:23, vino yang <
> yanghua1...@gmail.com>
> > > > escreveu:
> > > > >
> > > > > > +1 Excited by this new vision!
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > Dianjin Wang  于2021年4月13日周二
> > > 下午3:53写道:
> > > > > >
> > > > > > > +1  The new brand is straightforward, a better description
> of
> > Hudi.
> > > > > > >
> > > > > > > Best,
> > > > > > > Dianjin Wang
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > bhavanisud...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 . Cannot agree more. I think this makes total sense
> and will
> > > > provide
> > > > > > > for
> > > > > > > > a much better representation of the project.
> > > > > > > >
> > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > vin...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello all,
> > > > > > > > >
> > > > > > > > > Reading one more article today, positioning Hudi, as
> just a
> > > table
> > > > > > > format,
> > > > > > > > > made me wonder, if we have done enough justice in
> explaining
> > > > what we
> > > > > > > have
> > > > > > > > > built together here.
> > > > > > > > > I tend to think of Hudi as the data lake platform,
> which has
> > > the
> > > > > > > > following
> > > > > > > > > components, of which - one is a table format, one is a
> > > > transactional
> > > > > > > > > storage layer.
> > > > > > > > > But the whole stack we have is definitely worth more
> than the
> > > > sum of
> > > > > > > all
> > > > > > > > > the parts IMO (speaking from my own experience from
> the past
> > > 10+
> > > > > > years
> > > > > > > of
> > > > > > > > > open source software dev).
> > > > > > > > >
> > > > > > > > > Here's what we have built so far.
> > > > > > > > >
> > > > > > > > > a) *table format* : something that stores table
> schema, a
> > > > metadata
> > > > > > > table
> > > > > > > > > that stores file listing today, and being extended to
> store
> > > > column
> > > > > > > ranges
> > > > > > > > > and more in the future (RFC-27)
> > > > > > > > > b) *aux metadata* : bloom filters, external record
> level
> > > indexes
> > > > > > today,
> > > > > > > > > bitmaps/interval trees and other advanced on-disk data
> > > structures
> > > > > > > > tomorrow
> > > > > > > > > c) *concurrency control* : we always supported MVCC
> based log
> > > > based
> > > > > > > > > concurrency (serialize writes into a time ordered
> log), and
> > we
> > > > now
> > > > > > also
> > > > > > > > > have OCC for batch merge workloads with 0.8.0. We will
> have
> > > > > > multi-table
> > > > > > > > and
> > > > > > > > > fully non-blocking writers soon (see future work
> section of
> > > > RFC-22)
> > > > > > > > > d) *updates/deletes* : this is the bread-and-butter
> use-case
> > > for
> > > > > > Hudi,
> > > > > > > > but
> > > > > > > > > we support primary/unique key constraints and we could
> add
> > > > foreign
> > > > > > keys
> > > > > > > > as
> > > > > > > > > an extension, once our transactions can span tables.
> > > > > > > > > e) *table services*: a hudi pipeline today is
> self-managing -
> > > > sizes
> > > > > > > > files,
> > > > > > > > > cleans, compacts, clusters data, bootstraps existing
> data -
> > 

Re: Apache Hudi 0.8.0 Released

2021-04-09 Thread leesf
Thanks Gary for driving the release, great job.

Pratyaksh Sharma  于2021年4月9日周五 下午10:40写道:

> Great news!
>
> On Fri, Apr 9, 2021 at 11:42 AM Sivabalan  wrote:
>
> > Awesome! Great job Gary on the release work!
> >
> > On Fri, Apr 9, 2021 at 1:59 AM Gary Li  wrote:
> >
> > > Thanks Vinoth.
> > >
> > > The page for 0.8.0 is ready
> > > https://hudi.apache.org/docs/0.8.0-spark_quick-start-guide.html.
> > > The release note could be found here
> > https://hudi.apache.org/releases.html
> > >
> > > Best,
> > > Gary Li
> > >
> > > On Thu, Apr 8, 2021 at 12:15 AM Vinoth Chandar 
> > wrote:
> > >
> > > > This is awesome! Thanks for sharing, Gary!
> > > >
> > > > Are we waiting for the site to be rendered with 0.8.0 release info
> and
> > > > homepage update?
> > > >
> > > > On Wed, Apr 7, 2021 at 7:54 AM Gary Li 
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We are excited to share that Apache Hudi 0.8.0 was released. Since
> > the
> > > > > 0.7.0 release, we resolved 97 JIRA tickets and made 120 code
> commits.
> > > We
> > > > > implemented many new features, bugfix, and performance improvement.
> > > > Thanks
> > > > > to all the contributors who had made this happened.
> > > > >
> > > > > *Release Highlights*
> > > > >
> > > > > *Flink Integration*
> > > > > Since the initial support for the Hudi Flink Writer in the 0.7.0
> > > release,
> > > > > the Hudi community made great progress on improving the Flink/Hudi
> > > > > integration, including redesigning the Flink writer pipeline with
> > > better
> > > > > performance and scalability, state-backed indexing with bootstrap
> > > > support,
> > > > > Flink writer for MOR table, batch reader for COW table,
> streaming
> > > > > reader for MOR table, and Flink SQL connector for both source and
> > sink.
> > > > In
> > > > > the 0.8.0 release, the user is able to use all those features with
> > > Flink
> > > > > 1.11+.
> > > > >
> > > > > Please see [RFC-24](
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal
> > > > > )
> > > > > for more implementation details of the Flink writer and follow this
> > > > [page](
> > > > > https://hudi.apache.org/docs/flink-quick-start-guide.html) to get
> > > > started
> > > > > with Flink!
> > > > >
> > > > > *Parallel Writers Support*
> > > > > As many users requested, now Hudi supports multiple ingestion
> writers
> > > to
> > > > > the same Hudi Table with optimistic concurrency control. Hudi
> > supports
> > > > file
> > > > > level OCC, i.e., for any 2 commits (or writers) happening to the
> same
> > > > > table, if they do not have writes to overlapping files being
> changed,
> > > > both
> > > > > writers are allowed to succeed. This feature is currently
> > experimental
> > > > and
> > > > > requires either Zookeeper or HiveMetastore to acquire locks.
> > > > >
> > > > > Please see [RFC-22](
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers
> > > > > )
> > > > > for more implementation details and follow this [page](
> > > > > https://hudi.apache.org/docs/concurrency_control.html) to get
> > started
> > > > with
> > > > > concurrency control!
> > > > >
> > > > > *Writer side improvements*
> > > > > - InsertOverwrite Support for Flink writer client.
> > > > > - Support CopyOnWriteTable in Java writer client.
> > > > >
> > > > > *Query side improvements*
> > > > > - Support Spark Structured Streaming read from Hudi table.
> > > > > - Performance improvement of Metadata table.
> > > > > - Performance improvement of Clustering.
> > > > >
> > > > > *Raw Release Notes*
> > > > > The raw release notes are available [here](
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12349423
> > > > > )
> > > > >
> > > > > Thanks,
> > > > > Gary Li
> > > > > (on behalf of the Hudi community)
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
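
To make the parallel-writers announcement above a little more concrete, here
is a minimal, hypothetical sketch of a Spark datasource write with optimistic
concurrency control and the ZooKeeper-based lock provider. The config keys
follow the concurrency_control page linked in the announcement and should be
verified there; the ZooKeeper address, table name, paths and field names are
placeholders and not taken from the announcement itself.

  import org.apache.spark.sql.SaveMode

  // Sketch only: assumes a spark-shell session with the hudi-spark bundle and
  // a reachable ZooKeeper quorum; all hosts, paths and fields are placeholders.
  val df = spark.range(0, 5).selectExpr(
    "cast(id as string) as uuid",        // record key (placeholder)
    "'demo' as name",
    "current_timestamp() as ts",         // precombine field (placeholder)
    "'2021/04' as partition_path")       // partition path (placeholder)

  df.write.format("hudi").
    option("hoodie.table.name", "occ_demo").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "partition_path").
    option("hoodie.datasource.write.precombine.field", "ts").
    // Optimistic concurrency control: lazy cleaning of failed writes plus a
    // ZooKeeper-backed lock provider, as described in the announcement.
    option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
    option("hoodie.cleaner.policy.failed.writes", "LAZY").
    option("hoodie.write.lock.provider",
      "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
    option("hoodie.write.lock.zookeeper.url", "zk-host").
    option("hoodie.write.lock.zookeeper.port", "2181").
    option("hoodie.write.lock.zookeeper.lock_key", "occ_demo").
    option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").
    mode(SaveMode.Append).
    save("file:///tmp/hudi/occ_demo")

A second writer configured with the same lock options can append to the same
table concurrently; per the file-level OCC described above, both commits
succeed as long as they do not change overlapping files.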


[ANNOUNCE] Hudi Community Update(2021-03-14 ~ 2021-03-28)

2021-03-28 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-03-14 ~ 2021-03-28
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Tweak hudi-flink-bundle module pom and reorganize the
packages for hudi-flink module [1]
[Flink Integration] Bounded source for stream writer  [2]
[Metadata Table] Improve performance of key lookups from base file in
Metadata Table [3]
[Core] Added locking capability to allow multiple writers [4]
[Core] Implement HoodieTableSource.explainSource for all kinds of pushing
down [5]
[Flink Integration] Use PRIMARY KEY syntax to define record keys for Flink
Hudi table [6]
[Spark Integration] Hudi write should uncache the RDD when the write
operation is finished [7]
[Core] Add support for composite keys in NonpartitionedKeyGenerator [8]
[Flink Integration] Flush as per data bucket for mini-batch write  [9]
[Core] Custom avro kafka deserializer [10]
[Core] Improving config names and adding hive metastore uri config [11]
[Flink Integration] Read optimized query type for Flink batch reader [12]
[Core] Rename & standardize config to match other configs [13]
[Flink Integration] Bump Flink version to 1.12.2 [14]
[Java Client] Introduce HoodieBloomIndex to hudi-java-client [15]

[1] https://issues.apache.org/jira/browse/HUDI-1684
[2] https://issues.apache.org/jira/browse/HUDI-1692
[3] https://issues.apache.org/jira/browse/HUDI-1552
[4] https://issues.apache.org/jira/browse/HUDI-845
[5] https://issues.apache.org/jira/browse/HUDI-1701
[6] https://issues.apache.org/jira/browse/HUDI-1688
[7] https://issues.apache.org/jira/browse/HUDI-1663
[8] https://issues.apache.org/jira/browse/HUDI-1653
[9] https://issues.apache.org/jira/browse/HUDI-1705
[10] https://issues.apache.org/jira/browse/HUDI-1650
[11] https://issues.apache.org/jira/browse/HUDI-1709
[12] https://issues.apache.org/jira/browse/HUDI-1710
[13] https://issues.apache.org/jira/browse/HUDI-1712
[14] https://issues.apache.org/jira/browse/HUDI-1495
[15] https://issues.apache.org/jira/browse/HUDI-1478

===
Bugs

[GCS] Fixing input stream detection of GCS FileSystem [1]
[Core] Fixing null schema in bulk_insert row writer path [2]
[Core] Fixing spark3 bundles [3]
[Core] Fix a null value related bug for spark vectorized reader. [4]
[Core] Fix MethodNotFound for HiveMetastore Locks [5]


[1] https://issues.apache.org/jira/browse/HUDI-1496
[2] https://issues.apache.org/jira/browse/HUDI-1615
[3] https://issues.apache.org/jira/browse/HUDI-1568
[4] https://issues.apache.org/jira/browse/HUDI-1667
[5] https://issues.apache.org/jira/browse/HUDI-1728



Best,
Leesf


[ANNOUNCE] Hudi Community Update(2021-02-28 ~ 2021-03-14)

2021-03-14 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-02-28 ~ 2021-03-14
with updates on features, bug fixes and tests.


===
Features

[Flink Integration] Supports merge on read write mode for Flink writer [1]
[Timeline] Configuration and metrics for the TimelineService [2]
[Core] Add latency and freshness support [3]
[Flink Integration] Supports snapshot read for Flink [4]
[Core] Provide mechanism to read uncommitted data through InputFormat [5]
[DFS] Support custom date format and fix unsupported exception in
DatePartitionPathSelector [6]
[Flink Integration] Streaming read for Flink MOR table [7]
[Flink Integration] Row level delete for Flink sink [8]
[Flink Integration] Avro schema inference for Flink SQL table  [9]
[Flink Integration] Support object storage for Flink writer [10]


[1] https://issues.apache.org/jira/browse/HUDI-1632
[2] https://issues.apache.org/jira/browse/HUDI-1553
[3] https://issues.apache.org/jira/browse/HUDI-1587
[4] https://issues.apache.org/jira/browse/HUDI-1647
[5] https://issues.apache.org/jira/browse/HUDI-1646
[6] https://issues.apache.org/jira/browse/HUDI-1655
[7] https://issues.apache.org/jira/browse/HUDI-1663
[8] https://issues.apache.org/jira/browse/HUDI-1678
[9] https://issues.apache.org/jira/browse/HUDI-1664
[10] https://issues.apache.org/jira/browse/HUDI-1681


===
Bugs

[Core] Fix bug where Hudi skips the remaining log files if there is a
zero-size log file in the log file list during merge on read [1]
[Bundle] Fixing commons codec dependency in bundle jars [2]
[Core] Do not delete older rollback instants as part of rollback [3]
[Core] Re-bootstrap metadata table when un-synced instants have been
archived. [4]
[Flink Integration] Modify marker file path, which should start with the
target base path [5]
[Core] Support Builder Pattern To Build Table Properties For
HoodieTableConfig [6]
[Core] Excluding compaction and clustering instants from inflight rollback
[7]
[Core] Exclude clustering commits from getExtraMetadataFromLatest API [8]
[DeltaStreamer] Fixing NPE with Parquet src in multi table delta streamer
[9]
[Hive Integration] Fix hive date type conversion for mor table [10]
[Flink Integration] Replace scala.Tuple2 with Pair in FlinkHoodieBloomIndex
[11]
[Core] Fix archival of requested replacecommit [12]
[Core] keep updating current date for every batch [13]
[GCS] Fixing input stream detection of GCS FileSystem [14]


[1] https://issues.apache.org/jira/browse/HUDI-1583
[2] https://issues.apache.org/jira/browse/HUDI-1540
[3] https://issues.apache.org/jira/browse/HUDI-1644
[4] https://issues.apache.org/jira/browse/HUDI-1634
[5] https://issues.apache.org/jira/browse/HUDI-1584
[6] https://issues.apache.org/jira/browse/HUDI-1636
[7] https://issues.apache.org/jira/browse/HUDI-1660
[8] https://issues.apache.org/jira/browse/HUDI-1661
[9] https://issues.apache.org/jira/browse/HUDI-1618
[10] https://issues.apache.org/jira/browse/HUDI-1662
[11] https://issues.apache.org/jira/browse/HUDI-1673
[12] https://issues.apache.org/jira/browse/HUDI-1651
[13] https://issues.apache.org/jira/browse/HUDI-1685
[14] https://issues.apache.org/jira/browse/HUDI-1496

===
Tests

[Tests] Improvements to Hudi Test Suite [1]


[1] https://issues.apache.org/jira/browse/HUDI-1635

Best,
Leesf


Re: 0.8.0 Release discussion

2021-03-02 Thread leesf
+1 to release monthly if possible, and thanks Danny for the great work on
Flink.

Vinoth Chandar  于2021年3月2日周二 上午10:30写道:

> +1
>
> There are two more PRs to land for multi writers, and some bug fixes around
> the metadata table.
>
> This alongside all the great progress on Flink, would make for an exciting
> 0.8.0.
>
> On Mon, Mar 1, 2021 at 6:11 PM Danny Chan  wrote:
>
> > Thanks Gary Li for firing this discussion ~
> >
> > +1 for the date to be in the middle of March, before that, i would make
> > some local integration test and performance test.
> >
> > Best,
> > Danny
> >
> > Gary Li  于2021年3月1日周一 下午12:56写道:
> >
> > > Hi All,
> > >
> > > I’d like to start a discussion about the 0.8.0 release planning.
> > Recently,
> > > we made a great progress on the Flink writer(thank you Danny) and
> landed
> > > many bugfix/perf improvement commits. I think it’s a good time to make
> > > 0.8.0 release soon. Targeting the middle of March could be a good
> timing
> > > IMO.
> > >
> > > Also, I think we can release more frequently, like monthly or
> bi-monthly,
> > > to get the latest features out quickly. What do you guys think?
> > >
> > > Best Regards,
> > > Gary Li
> > >
> > >
> >
>


[ANNOUNCE] Hudi Community Update(2021-01-31 ~ 2021-02-28)

2021-02-28 Thread leesf
Dear community,

Nice to share Hudi community updates for 2021-01-31 ~ 2021-02-28 with
updates on features, bug fixes and tests.

===
Features

[Core] Improve minKey/maxKey computation in HoodieHFileWriter [1]
[Flink] Introduce FlinkHoodieSimpleIndex to hudi-flink-client [2]
[Flink Integration] InstantGenerateOperator support multiple parallelism [3]
[Flink Integration] Introduce FlinkHoodieBloomIndex to hudi-flink-client [4]
[CLI] Adding commit_show_records_info to display record sizes for commit [5]
[Flink Integration] Make Flink write pipeline write task scalable [6]
[Spark Integration] Translate the partitionBy API in the Spark datasource to
hoodie.datasource.write.partitionpath.field [7]
[Flink Integration] Write as mini-batches during one checkpoint interval
for the new writer [8]
[Spark Integration] Support Spark Structured Streaming read from Hudi table
[9]
[Flink Integration] Get the parallelism from the context when initializing
StreamWriteOperatorCoordinator [10]
[Core] Schedule compaction based on time elapsed [11]
[Metaclient] Adding builder for HoodieTableMetaClient initialization [12]
[Core] Remove inline inflight rollback in hoodie writer [13]
[Flink Integration] Reduce the coupling of hadoop [14]
[Flink Integration] The state based index should bootstrap from existing
base files [15]
[Java Client] Support copyOnWriteTable in java client [16]
[Flink Integration] Avoid renaming for bucket update when there is only
one flush action during a checkpoint [17]
[Flink Integration] Some improvements to BucketAssignFunction [18]
[DeltaStreamer] Make DeltaStreamer transition from dfsSource to kafkaSource
[19]
[Hive Integration] Make it configurable whether a failure to connect to Hive
affects the Hudi ingest process [20]
[Metadata Table] Added a configuration to allow specific directories to be
filtered out during Metadata Table bootstrap [21]


[1] https://issues.apache.org/jira/browse/HUDI-1519
[2] https://issues.apache.org/jira/browse/HUDI-1335
[3] https://issues.apache.org/jira/browse/HUDI-1511
[4] https://issues.apache.org/jira/browse/HUDI-1332
[5] https://issues.apache.org/jira/browse/HUDI-1571
[6] https://issues.apache.org/jira/browse/HUDI-1557
[7] https://issues.apache.org/jira/browse/HUDI-1526
[8] https://issues.apache.org/jira/browse/HUDI-1598
[9] https://issues.apache.org/jira/browse/HUDI-1109
[10] https://issues.apache.org/jira/browse/HUDI-1621
[11] https://issues.apache.org/jira/browse/HUDI-1381
[12] https://issues.apache.org/jira/browse/HUDI-1315
[13] https://issues.apache.org/jira/browse/HUDI-1486
[14] https://issues.apache.org/jira/browse/HUDI-1586
[15] https://issues.apache.org/jira/browse/HUDI-1624
[16] https://issues.apache.org/jira/browse/HUDI-1477
[17] https://issues.apache.org/jira/browse/HUDI-1637
[18] https://issues.apache.org/jira/browse/HUDI-1638
[19] https://issues.apache.org/jira/browse/HUDI-1367
[20] https://issues.apache.org/jira/browse/HUDI-1269
[21] https://issues.apache.org/jira/browse/HUDI-1611

===
Bugs

[Core] Honor ordering field for MOR Spark datasource reader [1]
[Core] Call mkdir(partition) only if not exists [2]
[Core] Try to init class trying different signatures instead of checking
its name [3]
[Core] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on
Windows client with HDFS server due to wrong file separator [4]
[Core] Fix Rollback Metadata AVRO backwards incompatibility [5]
[Core] fix DefaultHoodieRecordPayload serialization failure [6]
[Hive Integration] Throw a RuntimeException when syncHoodieTable() fails [7]
[Core] Fix bug in HoodieCombineRealtimeRecordReader with reading empty
iterators [8]
[HBase Index] Fix Hbase index to make rollback synchronous (via config) [9]


[1] https://issues.apache.org/jira/browse/HUDI-1550
[2] https://issues.apache.org/jira/browse/HUDI-1523
[3] https://issues.apache.org/jira/browse/HUDI-1538
[4] https://issues.apache.org/jira/browse/HUDI-1420
[5] https://issues.apache.org/jira/browse/HUDI-1589
[6] https://issues.apache.org/jira/browse/HUDI-1603
[7] https://issues.apache.org/jira/browse/HUDI-1582
[8] https://issues.apache.org/jira/browse/HUDI-1539
[9] https://issues.apache.org/jira/browse/HUDI-1347


===
Tests

[Tests] CI intermittent failure: TestJsonStringToHoodieRecordMapFunction [1]
[Tests] Add test cases for INSERT_OVERWRITE Operation [2]
[Tests] Fix write test flakiness in StreamWriteITCase [3]
[CI] Add azure pipelines configs [4]


[1] https://issues.apache.org/jira/browse/HUDI-1547
[2] https://issues.apache.org/jira/browse/HUDI-1545
[3] https://issues.apache.org/jira/browse/HUDI-1612
[4] https://issues.apache.org/jira/browse/HUDI-1620

Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2021-01-17 ~ 2021-01-31)

2021-01-31 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-01-17 ~ 2021-01-31
with updates on features, bug fixes and tests.

===
[Release] Apache Hudi 0.7.0 Released, this is a major release with many
features, you would check the release notes for more details [1]

[1] http://hudi.apache.org/docs/0.7.0-quick-start-guide.html

===
Features

[Metadata] Add block size to the FileStatus objects returned from metadata
table to avoid too many file splits [1]
[Metadata] Harden RFC-15 Implementation based on production testing [2]
[Flink Integration] InstantGenerateOperator support multiple parallelism [3]
[Flink Integration] Introduce FlinkHoodieBloomIndex to hudi-flink-client [4]
[Core] Insert new records to data files without merging for "Insert"
operation [5]
[Flink Integration] Add a new pipeline for Flink writer [6]
[Clustering] Remove isEmpty to improve clustering execution performance [7]


[1] https://issues.apache.org/jira/browse/HUDI-1529
[2] https://issues.apache.org/jira/browse/HUDI-1308
[3] https://issues.apache.org/jira/browse/HUDI-1511
[4] https://issues.apache.org/jira/browse/HUDI-1332
[5] https://issues.apache.org/jira/browse/HUDI-1234
[6] https://issues.apache.org/jira/browse/HUDI-1522
[7] https://issues.apache.org/jira/browse/HUDI-1555

===
Bugs

[Core] Make SerializableSchema work for large schemas and add ability to
sortBy numeric values [1]
[Core] Fixed suboptimal implementation of a magic sequence search [2]
[Spark Integration] Fixing commons codec shading in spark bundle [3]
[Flink Integration] Fix NPE when using HoodieFlinkStreamer to ETL data from
Kafka to Hudi [4]
[Core] Remove UpgradePayloadFromUberToApache [5]


[1] https://issues.apache.org/jira/browse/HUDI-1553
[2] https://issues.apache.org/jira/browse/HUDI-1532
[3] https://issues.apache.org/jira/browse/HUDI-1540
[4] https://issues.apache.org/jira/browse/HUDI-1453
[5] https://issues.apache.org/jira/browse/HUDI-623


===
Tests

[Tests] Fix spark 2 unit tests failure with Spark 3 [1]
[Tests] Introduce unit test infra for java client [2]
[Tests] Add unit test for validating replacecommit rollback [3]


[1] https://issues.apache.org/jira/browse/HUDI-1512
[2] https://issues.apache.org/jira/browse/HUDI-1476
[3] https://issues.apache.org/jira/browse/HUDI-1266


Best,
Leesf


Congrats to our newest committers!

2021-01-27 Thread leesf
Hi all,

I am very happy to announce our newest committers.

Wang Xianghu: Xianghu has done a great job in decoupling Hudi from Spark,
implemented the first version of the Flink integration and contributed bug
fixes. He is also very active in answering users' questions in the China
WeChat group.

Li Wei: Liwei has also done a great job in driving major features like
RFC-19 together with Satish, and has contributed many features and bug fixes
in the core modules.

Please join me in congratulating them!

Thanks,
Leesf


Re: [VOTE] Release 0.7.0, release candidate #2

2021-01-24 Thread leesf
+1 binding

- Build successful
- Ran quickstart successfully.
- Additional manual testing with and without Metadata based listing enabled
for COW and MOR tables against Aliyun OSS.

Sivabalan  于2021年1月23日周六 下午9:55写道:

> Got it, I didn't do -1, but just wanted to remind you, so that you don't
> miss it when you redo the steps again to promote the final one.
>
> +1 binding.
> But do ensure when you release, the staged repo (promoted candidate) has
> only one set of artifacts and it's a new repo.
>
>
> On Sat, Jan 23, 2021 at 2:03 AM nishith agarwal 
> wrote:
>
> > +1 binding
> >
> > - Build Successful
> > - Release validation script Successful
> > - Quick start runs Successfully
> >
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >   % Total% Received % Xferd  Average Speed   TimeTime Time
> >  Current
> >  Dload  Upload   Total   SpentLeft
> >  Speed
> > 100 34972  100 349720 0  96076  0 --:--:-- --:--:-- --:--:--
> > 96076
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> > Thanks,
> > Nishith
> >
> > On Fri, Jan 22, 2021 at 9:28 PM Vinoth Chandar 
> wrote:
> >
> > > Thanks Siva! I am not sure if thats a required aspect for the binding
> > vote.
> > > Its a minor aspect that does not interfere with testing/validation in
> > > anyway. The actual release artifact needs to be rebuilt and repushed
> > anyway
> > > from a separate repo. Like I noted, I found the wiki instructions bit
> > > ambiguous and I intend to make it clearer going forward so we can avoid
> > > this in future.
> > >
> > > I request everyone to consider this explanation, when casting your
> vote.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Fri, Jan 22, 2021 at 8:35 PM Sivabalan  wrote:
> > >
> > > > - checksums and signatures [OK]
> > > > - successfully built [OK]
> > > > - ran quick start guide [OK]
> > > > - Ran release validation guide [OK]
> > > > - Ran test suite job w/ inserts, upserts, deletes and
> validation(spark
> > > sql
> > > > and hive). Also same job w/ metadata enabled as well [OK]
> > > >
> > > > - Artifacts in staging repo : should be in separate repo where only
> rc2
> > > is
> > > > present. Right now, I see both rc1 and rc2 are present in the same
> > repo.
> > > >
> > > > Will add my binding vote once artifacts are fixed.
> > > >
> > > >
> > > >
> > > > On Fri, Jan 22, 2021 at 9:17 PM Udit Mehrotra 
> > wrote:
> > > >
> > > > > +1
> > > > > - Build successful
> > > > > - Ran quickstart against S3
> > > > > - Additional manual tests with MOR
> > > > > - Additional manual testing with and without Metadata based listing
> > > > enabled
> > > > > - Release validation script successful
> > > > >
> > > > > Validating hudi-0.7.0-rc2 with release type "dev"
> > > > > Checking Checksum of Source Release
> > > > > -e Checksum Check of Source Release - [OK]
> > > > >
> > > > >   % Total% Received % Xferd  Average Speed   TimeTime
> >  Time
> > > > >  Current
> > > > >  Dload  Upload   Total   Spent
> > Left
> > > > >  Speed
> > > > > 100 34972  100 349720 0  70937  0 --:--:-- --:--:--
> > > --:--:--
> > > > > 70793
> > > > > Checking Signature
> > > > > -e Signature Check - [OK]
> > > > >
> > > > > Checking for binary files in source release
> > > > > -e No Binary Files in Source Release? - [OK]
> > > > >
> > > > > Checking for DISCLAIMER
> > > > > -e DISCLAIMER file exists ? [OK]
> > > > >
> > > > > Checking for LICENSE and NOTICE
> > > > > -e License file exists ? [OK]
> > > > > -e Notice file exists ? [OK]
> > > > >
> > > > > Performing custom Licensing Check
> > > > > -e Licensing Check Passed [OK]
> > > > >
> > > > > Running RAT Check
> > > > > -e RAT Check Passed [OK]
> > > > >
> > > > > Thanks,
> > > > > Udit
> > > > >
> > > > > On Fri, Jan 22, 2021 at 12:41 PM Vinoth Chandar  >
> > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Please review and vote on the release candidate #2 for the
> version
> > > > 0.7.0,
> > > > > > as follows:
> > > > > >
> > > > > > [ ] +1, Approve the release
> > > > > >
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > >
> > > > > >
> > > > > > The complete staging area is available for your review, which
> > > includes:
> > > > > >
> > > > > > * JIRA release notes [1],
> > > > > >
> > > > > > * the official Apache source release and binary convenience
> > releases
> > > to
> > > > > be
> > > > > > deployed to dist.apache.org [2], which are signed with the key
> > with

Re: [VOTE] Release 0.7.0, release candidate #1

2021-01-22 Thread leesf
-1 binding

as users reported an issue when running a Flink Hudi job.
[image: image.png]
This patch https://github.com/apache/hudi/pull/2473 should fix it.


Bhavani Sudha  于2021年1月22日周五 下午4:12写道:

> +1 (binding)
>
> - compile ok
> - quickstart ok
> - checksum ok
> - ran some ide tests - ok
> - release validation script - ok
> ./release/validate_staged_release.sh --release=0.7.0 --rc_num=1
> /tmp/validation_scratch_dir_001 ~/Sudha/hudi/scripts
> Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> Validating hudi-0.7.0-rc1 with release type "dev"
> Checking Checksum of Source Release
> Checksum Check of Source Release - [OK]
>
>   % Total% Received % Xferd  Average Speed   TimeTime Time
>  Current
>  Dload  Upload   Total   SpentLeft
>  Speed
> 100 34972  100 349720 0  78237  0 --:--:-- --:--:-- --:--:--
> 78237
> Checking Signature
> Signature Check - [OK]
>
> Checking for binary files in source release
> No Binary Files in Source Release? - [OK]
>
> Checking for DISCLAIMER
> DISCLAIMER file exists ? [OK]
>
> Checking for LICENSE and NOTICE
> License file exists ? [OK]
> Notice file exists ? [OK]
>
> Performing custom Licensing Check
> Licensing Check Passed [OK]
>
> Running RAT Check
> RAT Check Passed [OK]
>
>
>
> On Thu, Jan 21, 2021 at 8:46 PM Sivabalan  wrote:
>
> > +1 binding
> >
> > - checksums and signatures [OK]
> > - successfully built [OK]
> > - ran quick start guide [OK]
> > - Ran release validation guide [OK]
> > - Verified artifacts in staging repo [OK]
> > - Ran test suite job w/ inserts, upserts, deletes and validation(spark
> sql
> > and hive). Also same job w/ metadata enabled as well [OK]
> >
> >
> > ./release/validate_staged_release.sh --release=0.7.0 --rc_num=1
> > /tmp/validation_scratch_dir_001
> >
> ~/Documents/personal/projects/siva_hudi/hudi_070_rc1/hudi-0.7.0-rc1/scripts
> > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > Validating hudi-0.7.0-rc1 with release type "dev"
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >   % Total% Received % Xferd  Average Speed   TimeTime Time
> >  Current
> >  Dload  Upload   Total   SpentLeft
> >  Speed
> > 100 34972  100 349720 0   105k  0 --:--:-- --:--:-- --:--:--
> >  104k
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> >
> > On Thu, Jan 21, 2021 at 8:21 PM Satish Kotha
>  > >
> > wrote:
> >
> > > +1,
> > >
> > > 1) Able to build
> > > 2) Integration tests pass
> > > 3) Unit tests pass locally
> > > 4) Successfully ran clustering on a small dataset (metadata table not
> > > enabled)
> > > 5) Verified insert, upsert, insert_overwrite works using QuickStart
> > > commands on COW table (metadata table not enabled)
> > >
> > >
> > >
> > > On Thu, Jan 21, 2021 at 12:44 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > 0.7.0,
> > > > as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > >
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 7F2A3BEB922181B06ACB1AA45F7D09E581D2BCB6 [3],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > >
> > > > * source code tag "release-0.7.0-rc1" [5],
> > > >
> > > >
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Release Manager
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12348721
> > > >
> > > >
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.7.0-rc1/
> > > >
> > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > >
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachehudi-1027/
> > > >
> > > > [5] https://github.com/apache/hudi/tree/release-0.7.0-rc1
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
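
For anyone reproducing the checks quoted above by hand, here is a minimal
sketch of verifying a staged release candidate. It assumes svn, curl, gpg and
sha512sum are available; the artifact file names are illustrative and should
be taken from the actual staging directory referenced in the vote email.

  # Check out the staged candidate from the dev dist area
  svn co https://dist.apache.org/repos/dist/dev/hudi/hudi-0.7.0-rc1/ hudi-0.7.0-rc1
  cd hudi-0.7.0-rc1

  # Import the release signing keys published by the project
  curl -s https://dist.apache.org/repos/dist/release/hudi/KEYS | gpg --import

  # Verify the detached signature and checksum of the source tarball
  # (file names below are assumed; use whatever the staging directory contains)
  gpg --verify hudi-0.7.0-rc1.src.tgz.asc hudi-0.7.0-rc1.src.tgz
  sha512sum -c hudi-0.7.0-rc1.src.tgz.sha512

The project's ./release/validate_staged_release.sh script quoted in the +1
votes above wraps these steps together with the license and RAT checks.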


[ANNOUNCE] Hudi Community Bi-Weekly Update(2021-01-03 ~ 2021-01-17)

2021-01-17 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2021-01-03 ~ 2021-01-17
with updates on features, bug fixes and tests.


===
Features

[Metadata] Implementation of HUDI RFC-15 [1]
[Metadata] Use metadata table for listing in HoodieROTablePathFilter [2]
[Metadata] Faster initialization of metadata table using parallelized
listing [3]
[Metadata] Merge updates of unsynced instants to metadata table [4]
[Metadata] Support for metadata listing for snapshot queries through
Hive/SparkSQL [5]
[Metadata] Allow log files generated during restore/rollback to be synced
as well [6]
[Metadata] Read clustering plan from requested file for inflight instant [7]
[Client] Introduce WriteClient#preWrite() and relocate metadata table
syncing [8]
[Common] Move HoodieEngineContext and its dependencies to hudi-common [9]
[Spark Integration] Support Incremental query for MOR table [10]
[Metadata] Make Clustering/ReplaceCommit and Metadata table be compatible
[11]
[Clustering] Support an independent clustering Spark job to run clustering
asynchronously [12]
[Core] Use HoodieEngineContext to parallelize fetching of partition paths
[13]
[Spark Integration] Add configuration for Spark SQL overwrite to use
INSERT_OVERWRITE_TABLE [14]
[Metadata] MOR rollback and restore support for metadata sync [15]


[1] https://issues.apache.org/jira/browse/HUDI-841
[2] https://issues.apache.org/jira/browse/HUDI-1450
[3] https://issues.apache.org/jira/browse/HUDI-1469
[4] https://issues.apache.org/jira/browse/HUDI-1325
[5] https://issues.apache.org/jira/browse/HUDI-1312
[6] https://issues.apache.org/jira/browse/HUDI-1504
[7] https://issues.apache.org/jira/browse/HUDI-1498
[8] https://issues.apache.org/jira/browse/HUDI-1513
[9] https://issues.apache.org/jira/browse/HUDI-1510
[10] https://issues.apache.org/jira/browse/HUDI-920
[11] https://issues.apache.org/jira/browse/HUDI-1459
[12] https://issues.apache.org/jira/browse/HUDI-1399
[13] https://issues.apache.org/jira/browse/HUDI-1479
[14] https://issues.apache.org/jira/browse/HUDI-1520
[15] https://issues.apache.org/jira/browse/HUDI-1502

===
Bugs

[Core] Fix wrong exception thrown in HoodieAvroUtils [1]
[Hive Integration] Fixing sorting of partition vals for hive sync
computation [2]
[Metadata] Change timeline utils to support reading replacecommit metadata
[3]
[Core] Avoid raw type use for parameter of Transformer interface [4]
[Hive Integration] Reverting LinkedHashSet changes to combine fields from
oldSchema and newSchema in favor of using only new schema for record
rewriting [5]


[1] https://issues.apache.org/jira/browse/HUDI-1506
[2] https://issues.apache.org/jira/browse/HUDI-1485
[3] https://issues.apache.org/jira/browse/HUDI-1507
[4] https://issues.apache.org/jira/browse/HUDI-1514
[5] https://issues.apache.org/jira/browse/HUDI-1509


===
Tests

[Tests] fix test hbase index [1]


[1] https://issues.apache.org/jira/browse/HUDI-1525



Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-12-20 ~ 2021-01-03)

2021-01-03 Thread leesf
Dear community,

Happy New Year everyone. Nice to share Hudi community bi-weekly updates for
2020-12-20 ~ 2021-01-03 with updates on features, bug fixes and tests.


===
Features

[Core] Adding DefaultHoodieRecordPayload to honor ordering with
combineAndGetUpdateValue [1]
[Client] Add base implementation for hudi java client [2]
[Core] Implement simple clustering strategies to create ClusteringPlan and
to run the plan [3]
[Core] Support bulk insert v2 with Spark 3.0.0 [4]
[Core] Block updates and replace on file groups in clustering [5]
[Core] Support Partition level delete API in HUDI [6]
[Core] Upgrade Flink version to 1.12.0 [7]
[Client] Support delete in hudi-java-client [8]

[1] https://issues.apache.org/jira/browse/HUDI-115
[2] https://issues.apache.org/jira/browse/HUDI-1419
[3] https://issues.apache.org/jira/browse/HUDI-1075
[4] https://issues.apache.org/jira/browse/HUDI-1451
[5] https://issues.apache.org/jira/browse/HUDI-1354
[6] https://issues.apache.org/jira/browse/HUDI-1350
[7] https://issues.apache.org/jira/browse/HUDI-1495
[8] https://issues.apache.org/jira/browse/HUDI-1423


===
Bugs

[QuickStart] Make QuickStartUtils generate deletes according to specific ts
[1]
[Core] Fix Deletes issued without any prior commits exception [2]
[Bootstrap] Fix null pointer exception when reading updated written
bootstrap table [3]
[Core] Incremental Query should work even when there are partitions that
have no incremental changes [4]
[Core] Align insert file size for reducing IO [5]
[Hive Integration] Escape the partition value in HiveSyncTool [6]
[Core] Modify GenericRecordFullPayloadGenerator to generate valid
timestamps [7]
[Core] Fixed schema compatibility check for fields [8]
[Core] fix incorrect log file path in HoodieWriteStat [9]


[1] https://issues.apache.org/jira/browse/HUDI-1471
[2] https://issues.apache.org/jira/browse/HUDI-1485
[3] https://issues.apache.org/jira/browse/HUDI-1489
[4] https://issues.apache.org/jira/browse/HUDI-1490
[5] https://issues.apache.org/jira/browse/HUDI-1398
[6] https://issues.apache.org/jira/browse/HUDI-1484
[7] https://issues.apache.org/jira/browse/HUDI-1147
[8] https://issues.apache.org/jira/browse/HUDI-1493
[9] https://issues.apache.org/jira/browse/HUDI-1434


===
Tests

[Tests] Fix Test Case Failure in TestHBaseIndex [1]
[Tests] fix unit test testCopyOnWriteStorage random failed [2]
[Tests] Adding support for validating entire dataset and long running tests
in test suite framework [3]
[Tests] add structured streaming and delta streamer clustering unit test [4]
[Tests] Add additional unit tests to TestHBaseIndex [5]
[Tests] Set up flink client unit test infra [6]

[1] https://issues.apache.org/jira/browse/HUDI-1488
[2] https://issues.apache.org/jira/browse/HUDI-1487
[3] https://issues.apache.org/jira/browse/HUDI-1331
[4] https://issues.apache.org/jira/browse/HUDI-1481
[5] https://issues.apache.org/jira/browse/HUDI-1474
[6] https://issues.apache.org/jira/browse/HUDI-1418


Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-12-06 ~ 2020-12-20)

2020-12-20 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2020-12-06 ~ 2020-12-20
with updates on features and bug fixes.

===
Discussion

[Best Practice] There is a discussion about Hudi record key best practices;
you could check the discussion if you have questions. [1]
[CI/CD] A discussion about speeding up CI/CD builds for PRs; please chime
in if you have new ideas. [2]
[Core] A discussion about supporting parallel writing to Hudi tables, which
would resolve some of the outstanding requirements [3]
[Core] Time Travel (querying historical versions of data) ability for Hudi
tables [4]
[Community] There is a proposal about organising an event to list the
accomplishments and roadmap of the community [5]
[Release] There is a discussion about 0.7.0 release planning, targeting a
release by Dec 31 [6]
[Core] There is a discussion about SQL support using Apache Calcite, which
would enable writing data to Hudi via SQL [7]


===
Features

[Config] Make HoodieWriteConfig support setting different default value
according to engine type  [8]
[Spark Integration] Make Hudi support Spark 3  [9]
[Core] Refactor AbstractHoodieLogRecordScanner to use Builder [10]
[Meta Sync] Hudi dla sync support skip rt table syncing [11]
[Spark Integration] Drop Hudi metadata cols at the beginning of Spark
datasource writing [12]
[DeltaStreamer] Add date partition based source input selector for
Deltastreamer [13]
[Core] Adding DefaultHoodieRecordPayload to honor ordering with
combineAndGetUpdateValue [14]
[Core] Add base implementation for hudi java client [15]


===
Bugs

[Writer Core] Fix partition path using FSUtils [16]
[Spark Integration] Remove scala dependency from hudi-client-common [17]
[Writer Core] Clean old fileslice is invalid [18]
[Index] Fix bug in Marker File Reconciliation for Non-Partitioned datasets
[19]
[Spark Integration] support more accurate spark JobGroup for better
performance tracking [20]
[Integration Test] Use the latest writer schema, when reading from existing
parquet files in the hudi-test-suite [21]



[1]
https://lists.apache.org/thread.html/r27792b6d0b354c7b6bbb7a258cdd7af14cbe3fdd777137fd619e9f63%40%3Cdev.hudi.apache.org%3E
[2]
https://lists.apache.org/thread.html/r1e69b6dac9b2d27a3f7c06491ac16dc5c0b5bd8e0807f4d9782b8e77%40%3Cdev.hudi.apache.org%3E
[3]
https://lists.apache.org/thread.html/r412c97452218f461e9bb52bc4a2f795609ec8eec5b3da1a60b9aa050%40%3Cdev.hudi.apache.org%3E
[4]
https://lists.apache.org/thread.html/rf978b608a5ebc3d7580b004da1a53f06ac3aaa2bb91ad069adc869f3%40%3Cdev.hudi.apache.org%3E
[5]
https://lists.apache.org/thread.html/r1d1b414c01cba2f127ab5e5b9aca314464ed433e11eae43b25d7c65a%40%3Cdev.hudi.apache.org%3E
[6]
https://lists.apache.org/thread.html/rf2ae5b4946440a0fea0e74f188db23f2099fbceaf9631bf35a4633ee%40%3Cdev.hudi.apache.org%3E
[7]
https://lists.apache.org/thread.html/ra04c70186f5880899ebbc8e87ed66c4b166c8e3ee062e0b8901ca6fc%40%3Cdev.hudi.apache.org%3E
[8] https://issues.apache.org/jira/browse/HUDI-1412
[9] https://issues.apache.org/jira/browse/HUDI-1040
[10] https://issues.apache.org/jira/browse/HUDI-1445
[11] https://issues.apache.org/jira/browse/HUDI-1448
[12] https://issues.apache.org/jira/browse/HUDI-1376
[13] https://issues.apache.org/jira/browse/HUDI-1406
[14] https://issues.apache.org/jira/browse/HUDI-115
[15] https://issues.apache.org/jira/browse/HUDI-1419
[16] https://issues.apache.org/jira/browse/HUDI-1395
[17] https://issues.apache.org/jira/browse/HUDI-1439
[18] https://issues.apache.org/jira/browse/HUDI-1428
[19] https://issues.apache.org/jira/browse/HUDI-1435
[20] https://issues.apache.org/jira/browse/HUDI-1437
[21] https://issues.apache.org/jira/browse/HUDI-1470


Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-11-22 ~ 2020-12-06)

2020-12-06 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly updates for 2020-11-22 ~ 2020-12-06
with updates on features and bug fixes.

===
Discussion

[Release] There is a discussion about the release time of the next version,
which will be cut at the end of December. [1]
[Committers] Congrats to our newest committers Satish and Prashant. [2]


===
Features

[CLI] Add compaction action in archive command  [3]
[Writer Core] Add HoodieJavaEngineContext to hudi-java-client  [4]
[Bootstrap] Fix for preventing bootstrap datasource jobs from hanging via
spark-submit [5]
[Writer Core] Add Support for OpenJ9 JVM [6]
[Writer Core] Added a check to validate records are not lost during merges
[7]
[Spark Integration] spark sql support overwrite use insert_overwrite_table
[8]
[Writer Core] Add standard schema postprocessor which would rewrite the
schema using spark-avro conversion [9]


===
Bugs

[Writer Core] Fix leaks in DiskBasedMap and LazyFileIterable [10]
[Spark Integration] lose partition info when using spark parameter basePath
[11]
[Writer Core] Write Type changed to BULK_INSERT when set
ENABLE_ROW_WRITER_OPT_KEY=true [12]
[Index] Update HoodieKey when deduplicating records with global index [13]
[Spark Integration] Fix FileAlreadyExistsException when set
HOODIE_AUTO_COMMIT_PROP to true [14]


[1]
https://lists.apache.org/thread.html/r40bca8747be3c7be5aa1067a9a978c928d39c0acd5d0b685d367fea1%40%3Cdev.hudi.apache.org%3E
[2]
https://lists.apache.org/thread.html/r9a6599ab39484644ec5b2df67f39b2725998797a5c733cd43e5d74b0%40%3Cdev.hudi.apache.org%3E
[3] https://issues.apache.org/jira/browse/HUDI-1397
[4] https://issues.apache.org/jira/browse/HUDI-1364
[5] https://issues.apache.org/jira/browse/HUDI-1396
[6] https://issues.apache.org/jira/browse/HUDI-1373
[7] https://issues.apache.org/jira/browse/HUDI-1357
[8] https://issues.apache.org/jira/browse/HUDI-1349
[9] https://issues.apache.org/jira/browse/HUDI-1343
[10] https://issues.apache.org/jira/browse/HUDI-1358
[11] https://issues.apache.org/jira/browse/HUDI-1396
[12] https://issues.apache.org/jira/browse/HUDI-1424
[13] https://issues.apache.org/jira/browse/HUDI-1196
[14] https://issues.apache.org/jira/browse/HUDI-1427


Best,
Leesf


Re: Re: Congrats to our newest committers!

2020-12-04 Thread leesf
Big congrats, Satish and Prashant!

Raymond Xu wrote on Sat, Dec 5, 2020 at 3:58 AM:

> Big congrats, Satish and Prashant! Very well deserved!
>
> On Thu, Dec 3, 2020 at 6:40 PM vino yang  wrote:
>
> > Congrats to both!
> >
> > Trevor wrote on Fri, Dec 4, 2020 at 10:18 AM:
> >
> > >
> > > Congratulations to the new committers! Excited about the next release!
> > >
> > > Best,
> > >
> > > Trevor
> > >
> > >
> > > wowtua...@gmail.com
> > >
> > > From: Sivabalan
> > > Date: 2020-12-04 09:59
> > > To: dev
> > > CC: us...@hudi.apache.org
> > > Subject: Re: Congrats to our newest committers!
> > > Congratz guys! Well deserved and excited for upcoming release.
> > >
> > > On Thu, Dec 3, 2020 at 5:58 PM Gary Li  wrote:
> > >
> > > > Congratulations Satish and Prashant! Excited about the next release!
> > > >
> > > > Gary Li
> > > > 
> > > > From: Mehrotra, Udit 
> > > > Sent: Friday, December 4, 2020 8:35:07 AM
> > > > To: dev@hudi.apache.org 
> > > > Cc: us...@hudi.apache.org 
> > > > Subject: Re: Congrats to our newest committers!
> > > >
> > > > Huge congrats guys ! Well deserved indeed.
> > > >
> > > > On 12/3/20, 11:44 AM, "Prashant Wason" 
> > wrote:
> > > >
> > > >
> > > > Thanks everyone.
> > > >
> > > > Over the past one year I have really enjoyed learning and
> > developing
> > > > with HUDI. Excited to be part of the group.
> > > >
> > > > > On Dec 3, 2020, at 11:37 AM, Balaji Varadarajan
> > > >  wrote:
> > > > >
> > > > > Very Well deserved !! Many congratulations to Satish and
> > Prashant.
> > > > > Balaji.V
> > > > >On Thursday, December 3, 2020, 11:07:09 AM PST, Bhavani
> Sudha
> > <
> > > > bhavanisud...@gmail.com> wrote:
> > > > >
> > > > > Congratulations Satish and Prashant!
> > > > > On Thu, Dec 3, 2020 at 11:03 AM Pratyaksh Sharma <
> > > > pratyaks...@gmail.com> wrote:
> > > > >
> > > > > Congratulations Satish and Prashant!
> > > > >
> > > > > On Fri, Dec 4, 2020 at 12:22 AM Vinoth Chandar <
> > vin...@apache.org>
> > > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> I am really happy to announce our newest set of committers.
> > > > >>
> > > > >> *Satish Kotha*: Satish has ramped very quickly across our
> entire
> > > > code base
> > > > >> and contributed bug fixes and also drove large, unique
> features
> > > like
> > > > >> clustering, replace/overwrite which are about to go out in the
> > > 0.7.0
> > > > >> release. These efforts largely complete parts of our vision
> and
> > it
> > > > could
> > > > >> have happened without Satish.
> > > > >>
> > > > >> *Prashant Wason*: In addition to a number of patches, Prashant
> > has
> > > > been
> > > > >> shouldering massive responsibility on RFC-15, and thanks to
> his
> > > > efforts, we
> > > > >> have a simplified design, very solid implementation right now,
> > > that
> > > > is
> > > > >> being tested now for 0.7.0 release again.
> > > > >>
> > > > >> Please join me in congratulating them on this great milestone!
> > > > >>
> > > > >> Thanks,
> > > > >> Vinoth
> > > > >>
> > > > >
> > > >
> > > >
> > > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-11-08 ~ 2020-11-22)

2020-11-22 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly update for 2020-11-08 ~ 2020-11-22
with updates on features and bug fixes.

===
Features

[Writer Core] Introduce base implementation of hudi-flink-client  [1]
[Writer Core] Replace Operation enum with WriteOperationType [2]
[Writer Core] Decoupling hive jdbc dependency when HIVE_USE_JDBC_OPT_KEY
set false [3]


===
Bugs

[Writer Core] Fix Memory Leak in HoodieLogFormatWriter [4]

[1] https://issues.apache.org/jira/browse/HUDI-1327
[2] https://issues.apache.org/jira/browse/HUDI-1400
[3] https://issues.apache.org/jira/browse/HUDI-1384
[4] https://issues.apache.org/jira/browse/HUDI-1358



Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-10-25 ~ 2020-11-08)

2020-11-08 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly update for 2020-10-25 ~ 2020-11-08
with updates on features, bug fixes and tests.

===
Features

[Writer Core] Cleanup rollback files residing in .hoodie folder  [1]
[Hive Integration] Make hive synchronization supports hourly partition [2]
[Writer Core] Use RateLimiter instead of sleep. Repartition WriteStatus to
optimize Hbase index writes [3]
[Writer Core] Refactor and relocate KeyGenerator to support more engines [4]
[Hive Integration] RealtimeParquetInputFormat skip adding projection
columns if there are no log files [4]
[Writer Core] Add FileSystemView APIs to query pending clustering
operations [5]


===
Bugs

[Writer Core] Fix bug in HoodieAvroUtils.removeMetadataFields() method [6]


==
Tests

[Test] Improvements to the hudi test suite for scalability and repeated
testing. [7]
[Test] Adding Delete support to test suite framework [8]


[1] https://issues.apache.org/jira/browse/HUDI-1118
[2] https://issues.apache.org/jira/browse/HUDI-1274
[3] https://issues.apache.org/jira/browse/HUDI-316
[4] https://issues.apache.org/jira/browse/HUDI-912
[5] https://issues.apache.org/jira/browse/HUDI-1352
[6] https://issues.apache.org/jira/browse/HUDI-1375
[7] https://issues.apache.org/jira/browse/HUDI-1351
[8] https://issues.apache.org/jira/browse/HUDI-1338



Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-10-11 ~ 2020-10-25)

2020-10-25 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly update for 2020-10-11 ~ 2020-10-25
with updates on features, bug fixes and tests.

===
Features

[Flink Integration] Introduce HoodieFlinkEngineContext to hudi-flink-client
[1]
[Hive Integration] Add support for timestamp field in HiveSync [2]
[Writer Core] IBM Cloud Object Storage Support [3]
[Metrics] Added an API to force publish metrics and flush them [4]


===
Bugs

[DeltaStreamer] Replace null by Option in Delta Streamer [5]
[Index] Fix the support of hbase index partition path change [6]
[Writer Core] Add better error messages when IOException occurs during log
file reading [7]
[Writer Core] Remove relocation of pattern for hbase dependencies and add
shading of guava in hadoop, spark, and presto bundles [8]
[Writer Core] Remove Hbase and htrace relocation from utilities bundle [9]
[Writer Core] handle prefix filtering at directory level [10]
[Writer Core] fixed NPE in CustomKeyGenerator [11]
[DeltaStreamer] Properties File must be optional when running deltastreamer
[12]


==
Tests

[Test] Migrate HoodieTestUtils APIs to HoodieTestTable [13]
[Test] Add unit test for testing compaction on replaced file groups [14]
[Test] add test to check timestamp date decimal type write and read
consistent [15]
[Test] add more test for UpdateSchemaEvolution [16]


[1] https://issues.apache.org/jira/browse/HUDI-1308
[2] https://issues.apache.org/jira/browse/HUDI-1302
[3] https://issues.apache.org/jira/browse/HUDI-1344
[4] https://issues.apache.org/jira/browse/HUDI-1326
[5] https://issues.apache.org/jira/browse/HUDI-791
[6] https://issues.apache.org/jira/browse/HUDI-1184
[7] https://issues.apache.org/jira/browse/HUDI-1298
[8] https://issues.apache.org/jira/browse/HUDI-1289
[9] https://issues.apache.org/jira/browse/HUDI-1345
[10] https://issues.apache.org/jira/browse/HUDI-1330
[11] https://issues.apache.org/jira/browse/HUDI-1200
[12] https://issues.apache.org/jira/browse/HUDI-1209
[13] https://issues.apache.org/jira/browse/HUDI-995
[14] https://issues.apache.org/jira/browse/HUDI-1304
[15] https://issues.apache.org/jira/browse/HUDI-307
[16] https://issues.apache.org/jira/browse/HUDI-284



Best,
Leesf


[ANNOUNCE] Hudi Community Bi-Weekly Update(2020-09-27 ~ 2020-10-11)

2020-10-11 Thread leesf
Dear community,

Nice to share Hudi community bi-weekly update for 2020-09-27 ~ 2020-10-11
with updates on features, bug fixes and tests.

===
Features

[Hive Sync] Make create hive database automatically configurable [1]
[Writer Core] Deltastreamer Kafka consumption delay reporting indicators [2]
[Writer Core] Introduce REPLACE top level action. Implement
insert_overwrite operation on top of replace action [3]
[Writer Core] Refactor hudi-client to support multi-engine [4]
[Metrics] Added an API to shutdown and remove the metrics reporter [5]
[Writer Core] add port configuration for EmbeddedTimelineService [6]
[Spark Integration] use spark INCREMENTAL mode query hudi dataset support
schema version [7]



===
Bugs

[Writer Core] Avoid blank file created by HoodieLogFormatWriter [8]
[Writer Core] relocated jetty in hudi-utilities-bundle pom [9]
[Writer Core]  Ordering Field should be optional when precombine is turned
off [10]
[DeltaStreamer] DeltaStreamer can now fetch schema before every run in
continuous mode [11]



==
Tests

[Test] Some improvements for the HUDI Test Suite [12]
[Test] Migrate HoodieTestUtils APIs to HoodieTestTable [13]


[1] https://issues.apache.org/jira/browse/HUDI-1192
[2] https://issues.apache.org/jira/browse/HUDI-1233
[3] https://issues.apache.org/jira/browse/HUDI-1072
[4] https://issues.apache.org/jira/browse/HUDI-1089
[5] https://issues.apache.org/jira/browse/HUDI-1305
[6] https://issues.apache.org/jira/browse/HUDI-1203
[7] https://issues.apache.org/jira/browse/HUDI-1301
[8] https://issues.apache.org/jira/browse/HUDI-840
[9] https://issues.apache.org/jira/browse/HUDI-1199
[10] https://issues.apache.org/jira/browse/HUDI-1208
[11] https://issues.apache.org/jira/browse/HUDI-603
[12] https://issues.apache.org/jira/browse/HUDI-1303
[13] https://issues.apache.org/jira/browse/HUDI-995


Best,
Leesf


[ANNOUNCE] Hudi Community Weekly Update(2020-09-20 ~ 2020-09-27)

2020-09-27 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-09-20 ~ 2020-09-27 with
updates on discussion, features and bugfixes.

===
Discussion

[Roadmap] There is a discussion on planning for releases 0.6.1 and
0.7.0; you could chime in to share your thoughts. [1]


===
Features

[Writer Core] Adding a way to post process schema after it is fetched [2]
[Bootstrap] Set Default for the bootstrap config :
hoodie.bootstrap.full.input.provider [3]


===
Bugs

[Spark Integration] fix UpgradeDowngrade fs Rename issue for hdfs and
aliyun oss [4]
[Code Cleanup] Archived commits command code cleanup [5]


==


[1]
https://lists.apache.org/thread.html/r172456c489b3e53f5d2143bebfd4fd69fe94331627c77a5032ede0ba%40%3Cdev.hudi.apache.org%3E
[2] https://issues.apache.org/jira/browse/HUDI-801
[3] https://issues.apache.org/jira/browse/HUDI-1213
[4] https://issues.apache.org/jira/browse/HUDI-1268
[5] https://issues.apache.org/jira/browse/HUDI-554


Best,
Leesf


Re: [DISCUSS] Planning for Releases 0.6.1 and 0.7.0

2020-09-23 Thread leesf
Thanks Vinoth. Also, we would consider supporting full schema evolution (such
as dropping some fields) in Hudi 0.7.0, since right now Hudi follows Avro
schema compatibility.

tanu dua wrote on Wed, Sep 23, 2020 at 12:38 PM:

> Thanks Vinoth. These are really exciting items, and hats off to you and the
> team for pushing releases swiftly and improving the framework all the time. I
> hope to start contributing someday, once I am free from my major
> deliverables and have a better understanding of the nitty-gritty details of Hudi.
>
> You have mentioned Spark3.0 support in next release. We were actually
> thinking of moving to Spark 3.0 but thought it’s too early with 0.6
> release. Is 0.6 not fully tested with Spark 3.0 ?
>
>
> On Wed, 23 Sep 2020 at 8:25 AM, Vinoth Chandar  wrote:
>
> > Hello all,
> >
> >
> >
> > Pursuant to our conversation around release planning, I am happy to share
> >
> > the initial set of proposals for the next minor/major releases (minor
> >
> > release ofc can go out based on time)
> >
> >
> >
> > *Next Minor version 0.6.1 (with stuff that did not make it to 0.6.0..) *
> >
> > Flink/Writer common refactoring for Flink
> >
> > Small file handling support w/o caching
> >
> > Spark3 Support
> >
> > Remaining bootstrap items
> >
> > Completing bulk_insertV2 (sort mode, de-dup etc)
> >
> > Full list here :
> >
> > https://issues.apache.org/jira/projects/HUDI/versions/12348168
> >
> > 
> >
> >
> >
> > *0.7.0 with major new features *
> >
> > RFC-15: metadata, range index (w/ spark support), bloom index (eliminate
> >
> > file listing, query pruning, improve bloom index perf)
> >
> > RFC-08: Record Index (to solve global index scalability/perf)
> >
> > RFC-18/19: Clustering/Insert overwrite
> >
> > Spark 3 based datasource rewrite (structured streaming sink/source,
> >
> > DELETE/MERGE)
> >
> > Incremental Query on logs (Hive, Spark)
> >
> > Parallel writing support
> >
> > Redesign of marker files for S3
> >
> > Stretch: ORC, PrestoSQL Support
> >
> >
> >
> > Full list here :
> >
> > https://issues.apache.org/jira/projects/HUDI/versions/12348721
> >
> >
> >
> > Please chime in with your thoughts. If you would like to commit to
> >
> > contributing a feature towards a release, please do so by marking *`Fix
> >
> > Version/s`* field with that release number.
> >
> >
> >
> > Thanks
> >
> > Vinoth
> >
> >
>


[ANNOUNCE] Hudi Community Weekly Update(2020-09-13 ~ 2020-09-20)

2020-09-20 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-09-13 ~ 2020-09-20 with
updates on features, bug fixes and tests.


===
Features

[Integration Test] Check whether the topic exists before DeltaStreamer
consumes Kafka [1]
[Hudi CLI] Add deduping logic for upserts case [2]
[Writer Core] Adding a way to post process schema after it is fetched [3]

===
Bugs

[Spark Integration] Fix for preventing MOR datasource jobs from hanging via
spark-submit [4]


==
Tests

[Test] Change timestamp field in HoodieTestDataGenerator from double to
long [5]
[Test] Use HoodieTestTable in more classes [6]
[Test] Migrate HoodieTestUtils APIs to HoodieTestTable [7]


[1] https://issues.apache.org/jira/browse/HUDI-1228
[2] https://issues.apache.org/jira/browse/HUDI-976
[3] https://issues.apache.org/jira/browse/HUDI-801
[4] https://issues.apache.org/jira/browse/HUDI-1230
[5] https://issues.apache.org/jira/browse/HUDI-1143
[6] https://issues.apache.org/jira/browse/HUDI-995
[7] https://issues.apache.org/jira/browse/HUDI-995


Best,
Leesf


[ANNOUNCE] Hudi Community Weekly Update(2020-09-06 ~ 2020-09-13)

2020-09-13 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-09-06 ~ 2020-09-13 with
updates on discussion, features, bug fixes and tests.


===
Discussion

[API] A discussion about standardizing Java date/time APIs in the codebase;
there are many different ways of manipulating dates and times, some of which
are inferior due to lack of thread safety. [1]


===
Features

[Integration Test] hudi-test-suite support for schema evolution (can be
triggered on any insert/upsert DAG node) [2]
[Writer Core] Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload)
for updating specified fields in storage [3]


===
Bugs

[Writer Core] Fix decimal type display issue for record key field [4]
[Writer Core] TypedProperties can not get values by initializing an
existing properties [5]
[Writer Core] AWSDmsTransformer does not handle insert and delete of a row
in a single batch correctly [6]


==
Tests

[Test] Add HoodieWriteableTestTable [7]


[1]
https://lists.apache.org/thread.html/ra0bccb431ee61f7580851560e0aa36f2ed81b5e30c3cfef84f33aaad%40%3Cdev.hudi.apache.org%3E
[2] https://issues.apache.org/jira/browse/HUDI-1130
[3] https://issues.apache.org/jira/browse/HUDI-1255
[4] https://issues.apache.org/jira/browse/HUDI-1181
[5] https://issues.apache.org/jira/browse/HUDI-1254
[6] https://issues.apache.org/jira/browse/HUDI-802
[7] https://issues.apache.org/jira/browse/HUDI-781

Best,
Leesf


[ANNOUNCE] Hudi Community Weekly Update(2020-08-30 ~ 2020-09-06)

2020-09-06 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-08-30 ~ 2020-09-06 with
updates on discussion, features and bug fixes.

===
Discussion

[Community] A discussion regarding DevX and test infra with the community; as
the community becomes larger, we need clearer roles to grow the
community [1]
[Writer Core] A discussion to introduce an incremental processing API in Hudi
to make incremental processing with other systems easier [2]
[Core] A discussion to enable cross-AZ consistency and quality checks of
Hudi datasets [3]
[Release] A discussion about formalizing the release process, which would
let the community release new versions in a more standard way [4]
[Community] Four new committers have joined the community [5]

===
Features

[Writer Core] Implementation of the HFile base and log file format [6]
[Writer Core] Let delete API use "hoodie.delete.shuffle.parallelism". [7]


===
Bugs

[Writer Core] Spark DataSource and Streaming Write must fail when operation
type is misconfigured [8]



[1]
https://lists.apache.org/thread.html/rfd0d7793e9def7485718d0228e24330dcf7a9dd02ccfbd27554bb9f4%40%3Cdev.hudi.apache.org%3E
[2]
https://lists.apache.org/thread.html/r7d3006dc5e6db85fbdc6ef714bf4eb1c99534aa52be0ba365a29c772%40%3Cdev.hudi.apache.org%3E
[3]
https://lists.apache.org/thread.html/r295aa716aac1bdc58a49ede7b50220b29d595531aad9058d7ee33da6%40%3Cdev.hudi.apache.org%3E
[4]
https://lists.apache.org/thread.html/r3ac49b5ae17734120f13b4403e7c4fc548ba4c772825d3a5c67e09eb%40%3Cdev.hudi.apache.org%3E
[5]
https://lists.apache.org/thread.html/r53a38909bec9905df096694a75de522fe7bf37d59cc4ce0500300938%40%3Cdev.hudi.apache.org%3E
[6] https://issues.apache.org/jira/browse/HUDI-960
[7] https://issues.apache.org/jira/browse/HUDI-993
[8] https://issues.apache.org/jira/browse/HUDI-1153


Best,
Leesf


Re: Congrats to our newest committers!

2020-09-03 Thread leesf
Congrats everyone, well deserved !

selvaraj periyasamy wrote on Fri, Sep 4, 2020 at 5:05 AM:

> Congrats everyone !
>
> On Thu, Sep 3, 2020 at 1:59 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > I am really excited to share the good news about our new committers on
> the
> > project!
> >
> > *Udit Mehrotra *: Udit has travelled with the project since sept/oct last
> > year and immensely helped us making Hudi work well with the AWS
> ecosystem.
> > His most notable contributions are towards driving large parts of the
> > implementation of RFC-12, Hive/Spark integration points. He has also
> helped
> > our users in various tricky issues.
> >
> > *Gary Li:* Gary is a great success story for the project, starting out as
> > an early user and steadily grown into a strong contributor, who has
> > demonstrated the ability to take up challenging implementations (e.g
> Impala
> > support, MOR snapshot query impl on Spark), as well as patiently
> > iterate through feedback and evolve the design/code. He has also been
> > helping users on Slack and mailing lists
> >
> > *Raymond Xu:* Raymond has also been a consistent feature on our mailing
> > lists, slack and github. He has been proposing immensely valuable
> > test/tooling improvements. He has contributed a great deal of code as
> well,
> > towards the same. Many many users thank Raymond for the generous help on
> > Slack.
> >
> > *Pratyaksh Sharma:* This is yet another great example of user ->
> > contributor -> committer. Pratyaksh has been a great champion for the
> > project, over the past year or so, steadily contributing many
> improvements
> > around the Delta Streamer tool.
> >
> > Please join me in, congratulating them on this well deserved milestone!
> >
> > Onwards and upwards,
> > Vinoth
> >
>


[ANNOUNCE] Hudi Community Weekly Update(2020-08-23 ~ 2020-08-30)

2020-08-30 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-08-23 ~ 2020-08-30 with
updates on discussion, features and bug fixes.

===
Discussion

[Release] Hudi 0.6.0 has been released; it contains many features and
bug fixes [1]


===
Features

[Writer Core] Add option to configure different path selector [2]
[Writer Core] Add back findInstantsAfterOrEquals to the HoodieTimeline
class. [3]
[Writer Core] Make timeline server timeout settings configurable [4]
[Writer Common] Add incremental meta client API to query partitions
modified in a time window [5]
[Writer Core] Tune buffer sizes for the diskbased external spillable map [6]
[Build] Specify version information for each component separately [7]
[Core] Add utility method to query extra metadata [8]


===
Bugs

[Writer Core] Fix unable to parse input partition field :1 exception when
using TimestampBasedKeyGenerator [9]
[Writer Core] Fix ComplexKeyGenerator for non-partitioned tables [10]
[Release] Fix release validate script for rc_num and release_type [11]
[Core] Fix: Avro Date logical type not handled correctly when converting to
Spark Row [12]


===
Tests

[DOCS] Add java doc for the test classes of hudi test suite [13]


[1]
https://lists.apache.org/thread.html/rb62934ceff46fc15800afa1947b15fa6f62c15d90c48fd56940a874d%40%3Cdev.hudi.apache.org%3E
[2] https://issues.apache.org/jira/browse/HUDI-1137
[3] https://issues.apache.org/jira/browse/HUDI-1136
[4] https://issues.apache.org/jira/browse/HUDI-1135
[5] https://issues.apache.org/jira/browse/HUDI-1191
[6] https://issues.apache.org/jira/browse/HUDI-1131
[7] https://issues.apache.org/jira/browse/HUDI-978
[8] https://issues.apache.org/jira/browse/HUDI-1228
[9] https://issues.apache.org/jira/browse/HUDI-1150
[10] https://issues.apache.org/jira/browse/HUDI-1226
[11] https://issues.apache.org/jira/browse/HUDI-1056
[12] https://issues.apache.org/jira/browse/HUDI-1225
[13] https://issues.apache.org/jira/browse/HUDI-532


Best,
Leesf


Re: [ANNOUNCE] Apache Hudi 0.6.0 released

2020-08-25 Thread leesf
Great, thanks Sudha and all involved.

Pratyaksh Sharma wrote on Tue, Aug 25, 2020 at 1:17 PM:

> Great news! :)
>
> On Tue, Aug 25, 2020 at 10:09 AM Vinoth Chandar  wrote:
>
> > - announce
> >
> > Folks, please keep the follow ups to dev@ and users@
> >
> >
> >
> > On Mon, Aug 24, 2020 at 9:26 PM vino yang  wrote:
> >
> > > Great news!
> > >
> > > Thanks to Bhavani Sudha for driving the release! And thanks to every
> one
> > of
> > > the whole community!
> > >
> > > Best,
> > > Vino
> > >
> > > Bhavani Sudha wrote on Tue, Aug 25, 2020 at 11:37 AM:
> > >
> > > > The Apache Hudi team is pleased to announce the release of Apache
> Hudi
> > > > 0.6.0.
> > > >
> > > > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> > > > Incrementals. Apache Hudi manages storage of large analytical
> datasets
> > on
> > > > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)
> > and
> > > > provides the ability to query them.
> > > >
> > > > This release comes 2 months after 0.5.3. It includes more than 200
> > > > resolved issues, comprising new features, perf improvements, as well
> as
> > > > general improvements and bug-fixes. Hudi 0.6.0 introduces mechanisms
> to
> > > > efficiently bootstrap large datasets into Hudi without having to copy
> > the
> > > > data (experimental feature), via both Spark datasource writer and
> > > > DeltaStreamer tool. A new index (HoodieSimpleIndex) is added that can
> > be
> > > > faster than bloom index for cases where updates/deletes spread
> across a
> > > > large portion of the table. With this version, rollbacks are done
> using
> > > > marker files and a supporting upgrade and downgrade infrastructure is
> > > > provided to users for smooth transition. HoodieMultiDeltaStreamer
> tool
> > > > (experimental feature) is added in this version to support ingesting
> > > > multiple kafka streams in a single DeltaStreamer deployment for
> > enhancing
> > > > operational experience. Bulk inserts are further improved by avoiding
> > any
> > > > dataframe-rdd conversions, accompanied with configurable sorting
> modes.
> > > > While this conversion of dataframe to rdd, is not a bottleneck for
> > > > upsert/deletes, subsequent releases will expand this to other write
> > > > operations. Other performance improvements include supporting async
> > > > compaction for spark streaming writes.
> > > >
> > > > For details on how to use Hudi, please look at the quick start page
> > > > located at:
> > > > https://hudi.apache.org/docs/quick-start-guide.html
> > > >
> > > > If you'd like to download the source release, you can find it here:
> > > > https://github.com/apache/hudi/releases/tag/release-0.6.0
> > > >
> > > > You can read more about the release (including release notes) here:
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12346663
> > > >
> > > > We would like to thank all contributors, the community, and the
> Apache
> > > > Software Foundation for enabling this release and we look forward to
> > > > continued collaboration. We welcome your help and feedback. For more
> > > > information on how to report problems, and to get involved, visit the
> > > > project website at:
> > > > http://hudi.apache.org/
> > > >
> > > > Thanks to everyone involved!
> > > > - Bhavani Sudha
> > > >
> > >
> >
>
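
As a companion to the 0.6.0 announcement above, here is a minimal sketch of
launching the quick start against the released artifacts. The coordinates
assume Spark 2.4.x with Scala 2.11 and follow the quick start guide of that
time; adjust them to your Spark/Scala versions.

  # Start a Spark shell with the released Hudi bundle
  spark-shell \
    --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

From that shell, the insert/update/query snippets on the quick start page
linked in the announcement can be run as-is.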


[ANNOUNCE] Hudi Community Weekly Update(2020-08-16 ~ 2020-08-23)

2020-08-23 Thread leesf
Dear community,

Nice to share Hudi community weekly update for 2020-08-16 ~ 2020-08-23 with
updates on discussion, features and bug fixes.

===
Discussion

[Release] Hudi 0.6.0 RC1 has been sent and approved by the community [1]
[Writer Core] A discussion about support for `_hoodie_record_key` as a
virtual column [2]


===
Features

[Writer Core] Meter RPC calls in HoodieWrapperFileSystem [3]
[Writer Core] Introduce a kafka implementation of hoodie write commit
callback [4]


===
Bugs

[Build] Fix import issue that fails scala 2.12 build [5]
[Index] Fix HBASE index MOR tables not considering record index valid [6]
[Core] fixed TaskNotSerializableException in TimestampBasedKeyGenerator [7]
[Core] Optimization in determining insert bucket location for a given key
[8]


===
Tests

[Test] Introduce HoodieTestTable for test preparation [9]


[1]
https://lists.apache.org/thread.html/r43ebf459a0c4f99cc918b7ef0110de3f5785346b57bf3f8ba2a64378%40%3Cdev.hudi.apache.org%3E
[2]
https://lists.apache.org/thread.html/r42b12fa44c87ea34898c9455c290c8ac86810c93bd8cbeed4b01be86%40%3Cdev.hudi.apache.org%3E
[3] https://issues.apache.org/jira/browse/HUDI-1025
[4] https://issues.apache.org/jira/browse/HUDI-1122
[5] https://issues.apache.org/jira/browse/HUDI-1197
[6] https://issues.apache.org/jira/browse/HUDI-1188
[7] https://issues.apache.org/jira/browse/HUDI-1177
[8] https://issues.apache.org/jira/browse/HUDI-1083
[9] https://issues.apache.org/jira/browse/HUDI-781


Best,
Leesf


Re: [VOTE] Release 0.6.0, release candidate #1

2020-08-22 Thread leesf
+1 (binding)
- mvn clean package -DskipTests OK
- ran quickstart guide OK (still get the exception ERROR
view.PriorityBasedFileSystemView: Got error running preferred function.
Trying secondary
org.apache.hudi.exception.HoodieRemoteException: 192.168.1.102:56544 failed
to respond
at
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFile(RemoteHoodieTableFileSystemView.java:426)
at
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:96)
at
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFile(PriorityBasedFileSystemView.java:139),
but still ran successfully)
- writing demos to sync to hive & dla OK

Sivabalan wrote on Sat, Aug 22, 2020 at 5:29 AM:

> +1 (non binding)
> - Compilation successful
> - Ran validation script which verifies checksum, keys, license, etc.
> - Ran quick start
> - Ran some tests from intellij.
>
> JFYI: when I ran mvn test, encountered some test failures due to multiple
> spark contexts. Have raised a ticket here
> . But all tests are
> succeeding in CI and I could run from within intellij. So, not blocking the
> RC.
>
> Checking Checksum of Source Release
> -e Checksum Check of Source Release - [OK]
>   % Total% Received % Xferd  Average Speed   TimeTime Time
> Current
>Dload  Upload
> Total   SpentLeft  Speed
> 100 30225  100 302250 0   106k  0 --:--:-- --:--:-- --:--:--
> 106k
> Checking Signature
> -e Signature Check - [OK]
> Checking for binary files in source release
> -e No Binary Files in Source Release? - [OK]
> Checking for DISCLAIMER
> -e DISCLAIMER file exists ? [OK]
> Checking for LICENSE and NOTICE
> -e License file exists ? [OK]
> -e Notice file exists ? [OK]
> Performing custom Licensing Check
> -e Licensing Check Passed [OK]
> Running RAT Check
> -e RAT Check Passed [OK]
>
>
>
> On Fri, Aug 21, 2020 at 12:37 PM Bhavani Sudha 
> wrote:
>
> > Vino yang,
> >
> > I am working on the release blog. While the RC is in progress, the doc
> and
> > site updates are happening this week.
> >
> > Thanks,
> > Sudha
> >
> > On Fri, Aug 21, 2020 at 4:23 AM vino yang  wrote:
> >
> > > +1 from my side
> > >
> > > I checked:
> > >
> > > - ran `mvn clean package` [OK]
> > > - ran `mvn test` in my local [OK]
> > > - signature [OK]
> > >
> > > BTW, where is the link to the release blog?
> > >
> > > Best,
> > > Vino
> > >
> > > Bhavani Sudha wrote on Thu, Aug 20, 2020 at 12:03 PM:
> > >
> > > > Hi everyone,
> > > > Please review and vote on the release candidate #1 for the version
> > 0.6.0,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 7F66CD4CE990983A284672293224F200E1FC2172 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-0.6.0-rc1" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12346663
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.6.0-rc1/
> > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachehudi-1025/
> > > > [5] https://github.com/apache/hudi/tree/release-0.6.0-rc1
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>
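
The votes in this thread repeat a small set of local checks; a rough sketch of
them, assuming the candidate's source tarball has already been downloaded and
extracted into hudi-0.6.0-rc1/ (directory name assumed), is:

  cd hudi-0.6.0-rc1
  mvn clean package -DskipTests     # compile and package all modules
  mvn test                          # optionally run the unit tests (slow)

  # The staged-release validation (checksum, signature, license, RAT) is run
  # from the scripts directory, as in the votes above
  cd scripts
  ./release/validate_staged_release.sh --release=0.6.0 --rc_num=1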

