vinothchandar commented on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-830440547
@pengzhiwei2018 I did some basic testing of the functionality. Most of it works well. Below is my summary. Can we look at each issue and see if we can fix/address it? I had problems with partitioned tables and a custom merge expression, actually, so I'm wondering what I am missing or if this is expected.

## Writing

```
-- Create gh sample data.
create table gh_raw using parquet location 'file:///Users/vs/Cache/lake-microbenchmarks/sample-parquet';
```

### Working

#### Creating a table over existing Hudi table

```
create table hudi_debug using hudi location 'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
```

#### Create table as select

```
create table hudi_managed using hudi as select type, public, payload, repo, actor, org, id, other from gh_raw;
```

Issues:

1) Even if it fails, it ends up creating the table (i.e. it's not atomic per se)

```
java.lang.RuntimeException: Table default.hudi_managed already exists. You need to drop it first.
    at org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:48)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
    at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
```

2) Hive sync fails when selecting all columns (we probably need more tests across data types)

```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert field Type from TIMESTAMP to bigint for field created_at
    at org.apache.hudi.hive.util.HiveSchemaUtil.getSchemaDifference(HiveSchemaUtil.java:98)
    at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:205)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:155)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:109)
    ... 52 more
```

#### Create table, external location

```
create table hudi_gh_ext using hudi location 'file:///tmp/hudi-gh' as select type, public, payload, repo, actor, org, id, other from gh_raw;
```

#### Create table with schema, no location

```
create table hudi_gh_managed_fixed (id int, name string, price double, ts long) using hudi options(primaryKey = 'id', precombineField = 'ts');
```

#### Truncate table

```
TRUNCATE table hudi_gh;
```

Issues:

1) Truncation succeeds, but throws an error; this may be confusing to the user

```
21/04/30 14:34:46 WARN TruncateTableCommand: Exception when attempting to uncache table `default`.`hudi_gh`
org.sparkproject.guava.util.concurrent.UncheckedExecutionException: org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path Unable to find a hudi table for the user provided paths.
    ...
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path Unable to find a hudi table for the user provided paths.
    at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
```

2) Querying the truncated table throws an error, instead of returning an empty result set

```
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path Unable to find a hudi table for the user provided paths.
    at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
    at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
    at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
```
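For reference, the sequence behind issue 2 is just this (a minimal repro sketch, using the same `hudi_gh` table that the insert examples further down create):

```
-- Repro sketch for issue 2: after truncation I'd expect an empty result,
-- but the query fails with TableNotFoundException instead.
TRUNCATE table hudi_gh;
select count(*) from hudi_gh; -- expected: 0, actual: TableNotFoundException
```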
#### Drop table

Managed table is deleted, external table is not deleted. All good.

```
drop table hudi_managed;
drop table hudi_gh_ext;
```

#### Insert INTO .. Values

```
create table hudi_gh_ext_fixed (id int, name string, price double, ts long) using hudi options(primaryKey = 'id', precombineField = 'ts') location 'file:///tmp/hudi-gh-fixed';

insert into hudi_gh_ext_fixed values(3, 'AMZN', 530, 120);
```

#### Insert INTO tbl SELECT * from anotherTbl

```
create table hudi_gh (type string, public boolean, payload string, id string) using hudi;

insert into hudi_gh select type, public, payload, id from gh_raw;
```

#### Insert overwrite

```
insert overwrite table hudi_gh select type, public, payload, id from gh_raw limit 10000;
```

#### Update Table

```
update hudi_gh_ext_fixed set price = 100.0 where name = 'UBER';
```

#### Merge Table

```
create table hudi_fixed (id int, name string, price double, ts long) using hudi options(primaryKey = 'id', precombineField = 'ts') location 'file:///tmp/hudi-fixed';

insert into hudi_fixed values(1, 'UBER', 200, 120);

MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

MERGE INTO hudi_fixed
USING (select id, name, price, ts from hudi_gh_ext_fixed) updates
ON hudi_fixed.id = updates.id
WHEN MATCHED THEN UPDATE SET hudi_fixed.price = hudi_fixed.price + updates.price
WHEN NOT MATCHED THEN INSERT *;
```

Issues:

1) Fails due to assignment field/schema mismatch, even though both tables have identical schemas:

```
spark-sql> describe hudi_fixed;
...
_hoodie_commit_time     string  NULL
_hoodie_commit_seqno    string  NULL
_hoodie_record_key      string  NULL
_hoodie_partition_path  string  NULL
_hoodie_file_name       string  NULL
id                      int     NULL
name                    string  NULL
price                   double  NULL
ts                      bigint  NULL
Time taken: 0.027 seconds, Fetched 9 row(s)
spark-sql> describe hudi_gh_ext_fixed;
...
_hoodie_commit_time     string  NULL
_hoodie_commit_seqno    string  NULL
_hoodie_record_key      string  NULL
_hoodie_partition_path  string  NULL
_hoodie_file_name       string  NULL
id                      int     NULL
name                    string  NULL
price                   double  NULL
ts                      bigint  NULL
Time taken: 0.028 seconds, Fetched 9 row(s)
spark-sql>
```

```
java.lang.AssertionError: assertion failed: The number of update assignments[1] must equal to the targetTable field size[4]
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1(MergeIntoHoodieTableCommand.scala:307)
    at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1$adapted(MergeIntoHoodieTableCommand.scala:305)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
```

2) Merges are only allowed on the primary key

```
java.lang.IllegalArgumentException: Merge Key[name] is not Equal to the defined primary key[id] in table hudi_fixed
    at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:429)
    at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:156)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
```

3) Merge not updating to the new value

```
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN UPDATE SET hudi_fixed.price = hudi_gh_ext_fixed.price + hudi_fixed.price
WHEN NOT MATCHED THEN INSERT *;
```

I see no effect on the hudi_fixed table. Old values remain.
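If I'm reading the assertion in issue 1 right, the check requires every target field to be assigned, so a partial update would have to be spelled out in full. A sketch of what I assume would pass the check (untested):

```
-- Assumed workaround sketch: assign all four target fields explicitly, so the
-- assignment count matches the target table field size[4].
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN UPDATE SET
  hudi_fixed.id = hudi_gh_ext_fixed.id,
  hudi_fixed.name = hudi_gh_ext_fixed.name,
  hudi_fixed.price = hudi_fixed.price + hudi_gh_ext_fixed.price,
  hudi_fixed.ts = hudi_gh_ext_fixed.ts
WHEN NOT MATCHED THEN INSERT *;
```

Ideally, though, partial assignments like the original statement should just work.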
#### Delete Table

```
delete from hudi_gh_ext_fixed where _hoodie_record_key = 'id:3';

delete from hudi_gh_ext where type = 'GollumEvent';
```

Issues:

1) Non-PK-based deletes are not working atm

```
java.lang.AssertionError: assertion failed: There are no primary key in table `default`.`hudi_gh_ext`, cannot execute delete operator
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.buildHoodieConfig(DeleteHoodieTableCommand.scala:68)
    at org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.run(DeleteHoodieTableCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

2) Why do we have to encode the column name into each record key? i.e. why is it `_hoodie_record_key = 'id:1'` instead of just `_hoodie_record_key = '1'`?
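To illustrate the point (values based on the earlier `insert into hudi_gh_ext_fixed values(3, ...)` example; output format not re-verified):

```
-- The column name is baked into every record key:
select id, _hoodie_record_key from hudi_gh_ext_fixed where id = 3;
-- 3    id:3    <-- rather than just '3'
```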
### Not working

#### Create or Replace table

```
spark-sql> create or replace table hudi_debug using hudi location 'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
Error in query: REPLACE TABLE is only supported with v2 tables.;
```

#### Create table, partitioned by

```
create table hudi_gh_ext using hudi partitioned by (type) location 'file:///tmp/hudi-gh-ext' as select type, public, payload, repo, actor, org, id, other from gh_raw;

select count(*) from hudi_gh_ext;
0
```

Issues:

1) Throws an error after running the SQL. (I physically deleted the external table basepath and dropped the table, and yet the create still fails.)

```
21/04/30 15:32:20 ERROR SparkSQLDriver: Failed in [create table hudi_gh_ext using hudi partitioned by (type) location 'file:///tmp/hudi-gh-ext' as select type, public, payload, repo, actor, org, id, other from gh_raw]
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:208)
    at org.apache.spark.sql.catalyst.catalog.CatalogTable.partitionSchema(interface.scala:259)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:104)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:85)
    at org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:64)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
```
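In case it helps narrow this down, I'd also like to confirm whether the non-CTAS path works for partitioned tables. A two-step sketch along these lines (hypothetical `hudi_part_fixed` table, untested):

```
-- Hypothetical two-step alternative to the failing partitioned CTAS:
-- create the partitioned table first, then insert into it.
create table hudi_part_fixed (id int, name string, price double, ts long, dt string) using hudi options(primaryKey = 'id', precombineField = 'ts') partitioned by (dt) location 'file:///tmp/hudi-part-fixed';

insert into hudi_part_fixed values(1, 'UBER', 200, 120, '2021-04-30');
```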