schema compatibility check and change column type
Hi All,

I want to change the type of one column in my COW table from int to long. When I set "hoodie.avro.schema.validate = true" and upsert new data with the long type, I got a "Failed upsert schema compatibility check" error. Does this break backwards compatibility? If I disable hoodie.avro.schema.validate, I can upsert and read normally.

code demo: https://gist.github.com/cadl/be433079747aeea88c9c1f45321cc2eb

stacktrace:

org.apache.hudi.exception.HoodieUpsertException: Failed upsert schema compatibility check.
    at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:572)
    at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:190)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:260)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    ... 69 elided
Caused by: org.apache.hudi.exception.HoodieException: Failed schema compatibility check for
writerSchema: {"type":"record","name":"foo_record","namespace":"hoodie.foo","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"__row_key","type":"int"},{"name":"__row_version","type":"int"}]},
table schema: {"type":"record","name":"foo_record","namespace":"hoodie.foo","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":"int"},{"name":"b","type":"string"},{"name":"__row_key","type":"int"},{"name":"__row_version","type":"int"}]},
base path: file:///jfs/cadl/hudi_data/schema/foo
    at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:564)
    at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:570)
    ... 94 more
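For what it's worth, Avro's own schema resolution rules treat int -> long as a legal promotion: a reader using the new long schema can still decode old files written with int (the reverse narrowing is what is forbidden), which is consistent with reads continuing to work once the validator is disabled. The sketch below is plain Python, not Hudi's actual validator (which lives in HoodieTable.validateSchema); it just encodes the numeric promotion table from the Avro spec to illustrate the asymmetry:

```python
# Avro schema resolution: the writer's type may be promoted to the
# reader's type along this lattice (see the Avro spec, "Schema Resolution").
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
    "string": {"string", "bytes"},
    "bytes": {"bytes", "string"},
}

def can_read(writer_type: str, reader_type: str) -> bool:
    """True if data written with writer_type is readable under reader_type."""
    return reader_type in PROMOTIONS.get(writer_type, {writer_type})

# Old files carry "a": int; the new schema declares "a": long.
assert can_read("int", "long")      # old int data is readable as long
assert not can_read("long", "int")  # narrowing long back to int is not allowed
```

So data-wise the change is backward compatible in the Avro sense; the upsert check is stricter because it also requires the incoming writer schema to match the table schema recorded in the latest commit.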
Re: Congrats to our newest committers!
Udit, Gary, Raymond and Pratyaksh, many congratulations :) Well deserved. Looking forward to your continued contributions.

Balaji.V
Re: Congrats to our newest committers!
Congrats to all 3. Much deserved and really excited to see more committers.

--
Regards,
-Sivabalan
Re: Congrats to our newest committers!
Congrats everyone, well deserved!
Re: Congrats to our newest committers!
Congrats everyone!
Congrats to our newest committers!
Hi all,

I am really excited to share the good news about our new committers on the project!

*Udit Mehrotra*: Udit has travelled with the project since Sept/Oct last year and has immensely helped us make Hudi work well with the AWS ecosystem. His most notable contributions are towards driving large parts of the implementation of RFC-12 and the Hive/Spark integration points. He has also helped our users through various tricky issues.

*Gary Li:* Gary is a great success story for the project, starting out as an early user and steadily growing into a strong contributor who has demonstrated the ability to take up challenging implementations (e.g. Impala support, the MOR snapshot query implementation on Spark), as well as to patiently iterate through feedback and evolve the design/code. He has also been helping users on Slack and the mailing lists.

*Raymond Xu:* Raymond has also been a consistent presence on our mailing lists, Slack and GitHub. He has proposed immensely valuable test/tooling improvements and has contributed a great deal of code towards them. Many, many users thank Raymond for his generous help on Slack.

*Pratyaksh Sharma:* This is yet another great example of user -> contributor -> committer. Pratyaksh has been a great champion for the project over the past year or so, steadily contributing many improvements around the Delta Streamer tool.

Please join me in congratulating them on this well deserved milestone!

Onwards and upwards,
Vinoth
[DISCUSS] enable cross AZ consistency and quality checks of hudi datasets
Hello folks,

We have a use case to make sure that data in the same Hudi datasets stored in different DCs (for high availability / disaster recovery) is strongly consistent, and that it passes all quality checks before it can be consumed by the users who query it. Currently, we have an offline service that runs quality checks and asynchronously syncs the Hudi datasets between the different DCs/AZs, but until the sync happens, queries running in these different DCs see inconsistent results. For some of our most critical datasets this inconsistency is causing many problems.

We want to support the following use cases: 1) data consistency, and 2) adding data quality checks post commit. Our flow looks like this:

1) write a new batch of data at t1
2) user queries will not see data at t1
3) data quality checks are done by setting a session property to include t1
4) optionally replicate t1 to other AZs and promote t1 so that regular user queries will see data at t1

We want to make the following changes to achieve this:

1. Change HoodieParquetInputFormat to look for a 'last_replication_timestamp' property in the JobConf and use it to create a new ActiveTimeline that limits the commits seen to those less than or equal to this timestamp. This can be overridden by a session property that allows us to make such data visible for quality checks.

2. We are storing this timestamp as a table property in the Hive Metastore. To make it easier to update, we want to extend HiveSyncTool to also update this table property when syncing a Hudi dataset to the HMS. The extended tool will take in a list of HMSes to be updated and will try to update each of them one by one. (In the case of a global HMS across all DCs this is just one, but if there is a region-local HMS per DC, the update of all HMSes is not truly transactional, so there is a small window of time in which queries can return inconsistent results.) If the tool can't update all of the HMSes, it will roll back the ones it has updated (again, not applicable for a global HMS).

We have made the above changes in our internal branch and we are successfully running them in production. Please let us know your feedback about this change.

Sanjay
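To make the visibility rule in change 1 concrete, here is a small illustrative sketch. It is plain Python with hypothetical names, not the actual HoodieParquetInputFormat/ActiveTimeline code: Hudi instant times are lexicographically ordered timestamp strings, regular queries see only commits at or below the replication watermark, and the quality-check session property raises the ceiling for a specific session:

```python
def visible_commits(commits, last_replication_ts=None, include_up_to_ts=None):
    """Filter commit instant times the way the modified input format would:
    only commits <= last_replication_ts are visible, unless a session
    override (include_up_to_ts) makes newer commits visible for QA."""
    if last_replication_ts is None:
        return list(commits)  # no replication watermark set: all commits visible
    visible = [c for c in commits if c <= last_replication_ts]
    if include_up_to_ts is not None:
        # Quality-check session: also expose the not-yet-promoted commits.
        visible += [c for c in commits if last_replication_ts < c <= include_up_to_ts]
    return visible

commits = ["20200901", "20200902", "20200903"]
# Regular user query: the replication watermark is still at 20200902.
assert visible_commits(commits, "20200902") == ["20200901", "20200902"]
# Quality-check session overrides to also see the new commit 20200903.
assert visible_commits(commits, "20200902", "20200903") == commits
```

Once the checks pass and replication completes, promoting t1 is just a matter of advancing last_replication_timestamp in the HMS table property, after which regular queries see the new commit.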
Re: [DISCUSS] Formalizing the release process
+1 on the general release policy. Realistically speaking, a bit skeptical about minor version releases every month, but nvm, I guess it's just a rough estimate.

On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan wrote:
> +1 on the process.
> Balaji.V
>
> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <garyli1...@outlook.com> wrote:
>
> +1
> Gary Li
>
> From: Bhavani Sudha
> Sent: Wednesday, September 2, 2020 3:11:06 AM
> To: us...@hudi.apache.org
> Cc: dev@hudi.apache.org
> Subject: Re: [DISCUSS] Formalizing the release process
>
> +1 on the release process formalization.
>
> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar wrote:
>
> > Hi all,
> >
> > Love to start a discussion around how we can formalize the release process and timelines, so that we can ensure timely and quality releases.
> >
> > Below is an outline of an idea that was discussed in the last community sync (also in the weekly sync notes).
> >
> > - We will do a "feature driven" major version release every 3 months or so, i.e. going from version x.y to x.y+1. The idea here is that this ships once all the committed features are code complete, tested and verified.
> > - We keep doing patches, bug fixes and usability improvements to the project always. So we will also do a "time driven" minor version release x.y.z → x.y.z+1 every month or so.
> > - We will always be releasing from master, and thus major release features need to be guarded by flags on minor versions.
> > - We will try to avoid patch releases, i.e. cherry-picking a few commits onto an earlier release version. (During 0.5.3 we actually found the cherry-picking of master onto 0.5.2 pretty tricky and even error-prone.) In some cases we may have to just make patch releases, but only in extenuating circumstances. Over time, with better tooling and a larger community, we might be able to do this.
> >
> > As for the major release planning process:
> >
> > - PMC/Committers can come up with an initial list sourced from user asks and support issues
> > - The list is shared with the community for feedback; the community can suggest new items and re-prioritizations
> > - Contributors are welcome to commit more features/asks (with due process)
> >
> > I would love to hear +1s, -1s and also any new, completely different ideas as well. Let's use this thread to align ourselves.
> >
> > Once we align ourselves, there are some release certification tools that need to be built out. Hopefully, we can do this together. :)
> >
> > Thanks
> > Vinoth

--
Regards,
-Sivabalan