schema compatibility check and change column type

2020-09-03 Thread cadl
Hi All,

I want to change the type of one column in my COW table, from int to long. When 
I set “hoodie.avro.schema.validate = true” and upsert new data with the long 
type, I get a “Failed upsert schema compatibility check” error. Does this 
change break backwards compatibility? If I disable hoodie.avro.schema.validate, 
I can upsert and read normally.


code demo: https://gist.github.com/cadl/be433079747aeea88c9c1f45321cc2eb
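
For context on why the data stays readable: Avro's schema-resolution rules 
treat int -> long as a legal type promotion, but only in one direction. Below 
is a minimal sketch (a hypothetical reproduction in Scala, assuming Avro 1.8+ 
and two trimmed stand-in schemas rather than the full ones from the stack 
trace) that checks both directions with org.apache.avro.SchemaCompatibility. 
If Hudi's validator requires the old table schema to be able to read data 
written with the new schema, an int field cannot hold a long, which would 
explain the failure even though the promotion is safe for new readers:

import org.apache.avro.{Schema, SchemaCompatibility}
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType

// Trimmed stand-ins for the table (old) and writer (new) schemas:
// column "a" changes from int to long.
val tableSchema = new Schema.Parser().parse(
  """{"type":"record","name":"foo_record","fields":[{"name":"a","type":"int"}]}""")
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"foo_record","fields":[{"name":"a","type":"long"}]}""")

// Reading old int data with the new long schema is a legal promotion...
val newReadsOld = SchemaCompatibility
  .checkReaderWriterCompatibility(writerSchema, tableSchema).getType
// ...but reading new long data with the old int schema is not.
val oldReadsNew = SchemaCompatibility
  .checkReaderWriterCompatibility(tableSchema, writerSchema).getType

println(newReadsOld == SchemaCompatibilityType.COMPATIBLE) // expected: true
println(oldReadsNew == SchemaCompatibilityType.COMPATIBLE) // expected: false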

stacktrace:


org.apache.hudi.exception.HoodieUpsertException: Failed upsert schema compatibility check.
  at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:572)
  at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:190)
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:260)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
  ... 69 elided
Caused by: org.apache.hudi.exception.HoodieException: Failed schema compatibility check for
 writerSchema: {"type":"record","name":"foo_record","namespace":"hoodie.foo","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"__row_key","type":"int"},{"name":"__row_version","type":"int"}]},
 table schema: {"type":"record","name":"foo_record","namespace":"hoodie.foo","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":"int"},{"name":"b","type":"string"},{"name":"__row_key","type":"int"},{"name":"__row_version","type":"int"}]},
 base path: file:///jfs/cadl/hudi_data/schema/foo
  at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:564)
  at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:570)
  ... 94 more

Re: Congrats to our newest committers!

2020-09-03 Thread Balaji Varadarajan
Udit, Gary, Raymond and Pratyaksh,
Many congratulations :) Well deserved. Looking forward to your continued 
contributions.
Balaji.V

Re: Congrats to our newest committers!

2020-09-03 Thread Sivabalan
Congrats to all 4. Much deserved and really excited to see more committers


--
Regards,
-Sivabalan


Re: Congrats to our newest committers!

2020-09-03 Thread leesf
Congrats everyone, well deserved !



Re: Congrats to our newest committers!

2020-09-03 Thread selvaraj periyasamy
Congrats everyone !



Congrats to our newest committers!

2020-09-03 Thread Vinoth Chandar
Hi all,

I am really excited to share the good news about our new committers on the
project!

*Udit Mehrotra*: Udit has travelled with the project since Sept/Oct last
year and has immensely helped us make Hudi work well with the AWS ecosystem.
His most notable contributions are driving large parts of the implementation
of RFC-12 and the Hive/Spark integration points. He has also helped our users
through various tricky issues.

*Gary Li:* Gary is a great success story for the project, starting out as
an early user and steadily growing into a strong contributor who has
demonstrated the ability to take up challenging implementations (e.g. Impala
support, the MOR snapshot query implementation on Spark), as well as to
patiently iterate through feedback and evolve the design/code. He has also
been helping users on Slack and the mailing lists.

*Raymond Xu:* Raymond has also been a consistent presence on our mailing
lists, Slack and GitHub. He has been proposing immensely valuable
test/tooling improvements, and has contributed a great deal of code towards
the same. Many, many users thank Raymond for his generous help on Slack.

*Pratyaksh Sharma:* This is yet another great example of user ->
contributor -> committer. Pratyaksh has been a great champion for the
project over the past year or so, steadily contributing many improvements
around the DeltaStreamer tool.

Please join me in congratulating them on this well-deserved milestone!

Onwards and upwards,
Vinoth


[DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-03 Thread Sanjay Sundaresan
Hello folks,

We have a use case where the same hudi datasets are stored in different DCs
(for high availability / disaster recovery) and must be strongly consistent,
as well as pass all quality checks, before they can be consumed by the users
who query them. Currently, we have an offline service that runs quality
checks and asynchronously syncs the hudi datasets between the different
DCs/AZs, but until the sync happens, queries running in these different DCs
see inconsistent results. For some of our most critical datasets this
inconsistency is causing serious problems.

We want to support the following use cases: 1) data consistency, and 2)
adding post-commit data quality checks.

Our flow looks like this:
1) write a new batch of data at t1
2) user queries will not see data at t1
3) data quality checks are run by setting a session property to include t1
4) optionally replicate t1 to other AZs and promote t1, so regular user
queries will see data at t1

We want to make the following changes to achieve this.

1. Change HoodieParquetInputFormat to look for a
'last_replication_timestamp' property in the JobConf and use it to create a
new ActiveTimeline that limits the commits seen to those less than or equal
to this timestamp. This can be overridden by a session property that allows
us to make such data visible for quality checks, as in the sketch below.
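
To make the visibility rule concrete, here is a hedged sketch (in Scala) of
just the commit-filtering logic. The property names, the session override
key, and the plain list of commit timestamps are our own stand-ins; this is
not Hudi's actual InputFormat or timeline code:

import org.apache.hadoop.mapred.JobConf

// Sketch of the visibility rule only. "last_replication_timestamp" and the
// override key below are assumed/internal names, not Hudi configuration.
def visibleCommits(allCommits: Seq[String], conf: JobConf): Seq[String] = {
  // Session override: quality checks set this to peek past the replication fence.
  val includeUnreplicated = conf.getBoolean("hoodie.include.unreplicated.commits", false)
  val fence = Option(conf.get("last_replication_timestamp"))
  if (includeUnreplicated) allCommits
  else fence match {
    case Some(ts) => allCommits.filter(_ <= ts) // Hudi instant times sort lexicographically
    case None     => allCommits                 // no fence yet: see everything
  }
}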

2. We store this timestamp as a table property in the HiveMetaStore. To make
it easier to update, we want to extend the HiveSyncTool to also update this
table property when syncing a hudi dataset to the HMS. The extended tool
will take in a list of HMSes to be updated and will try to update each of
them one by one. (In the case of a global HMS across all DCs this is just
one, but if there is a region-local HMS per DC, the update of all HMSes is
not truly transactional, so there is a small window of time where queries
can return inconsistent results.) If the tool can't update all the HMSes, it
will roll back the updated ones (again, not applicable for a global HMS),
roughly as sketched below.
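
A hedged sketch of the one-by-one update with best-effort rollback follows.
HmsClient and its two methods are hypothetical placeholders, not the real
HiveMetaStoreClient API; the actual tool goes through HiveSyncTool's
existing client:

// Hypothetical wrapper around one metastore connection; the method names
// are placeholders, not the actual HiveMetaStoreClient API.
trait HmsClient {
  def getTableProperty(db: String, table: String, key: String): Option[String]
  def setTableProperty(db: String, table: String, key: String, value: String): Unit
}

def updateReplicationTimestamp(clients: Seq[HmsClient], db: String,
                               table: String, newTs: String): Boolean = {
  val key = "last_replication_timestamp"
  // Remember previous values so the HMSes we already touched can be rolled back.
  val updated = scala.collection.mutable.ListBuffer.empty[(HmsClient, Option[String])]
  val allOk = clients.forall { client =>   // forall stops at the first failure
    try {
      val prev = client.getTableProperty(db, table, key)
      client.setTableProperty(db, table, key, newTs)
      updated += ((client, prev))
      true
    } catch { case _: Exception => false }
  }
  if (!allOk) {
    // Best-effort rollback; not needed when there is a single global HMS.
    updated.foreach { case (client, prev) =>
      prev.foreach(old => client.setTableProperty(db, table, key, old))
    }
  }
  allOk
}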

We have made the above changes in our internal branch and are successfully
running them in production.

Please let us know your feedback on this change.

Sanjay


Re: [DISCUSS] Formalizing the release process

2020-09-03 Thread Sivabalan
+1 on the general release policy. Realistically speaking, I'm a bit skeptical
about minor version releases every month, but nvm, I guess it's just a rough
estimate.

On Tue, Sep 1, 2020 at 8:41 PM Balaji Varadarajan
 wrote:

>
> +1 on the process.
> Balaji.V
>
> On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li <
> garyli1...@outlook.com> wrote:
>
> +1
> Gary Li
>
> From: Bhavani Sudha 
> Sent: Wednesday, September 2, 2020 3:11:06 AM
> To: us...@hudi.apache.org 
> Cc: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] Formalizing the release process
>
> +1 on the release process formalization.
>
> On Tue, Sep 1, 2020 at 10:21 AM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > Love to start a discussion around how we can formalize the release
> > process and timelines, so that we can ensure timely, quality releases.
> >
> > Below is an outline of an idea that was discussed in the last community
> > sync (also in the weekly sync notes).
> >
> > - We will do a "feature driven" major version release every 3 months or
> > so, i.e. going from version x.y to x.y+1. The idea here is that this
> > ships once all the committed features are code complete, tested and
> > verified.
> > - We keep doing patches, bug fixes and usability improvements to the
> > project always. So, we will also do a "time driven" minor version
> > release x.y.z → x.y.z+1 every month or so.
> > - We will always be releasing from master, and thus major release
> > features need to be guarded by flags on minor versions.
> > - We will try to avoid patch releases, i.e. cherry-picking a few
> > commits onto an earlier release version (during 0.5.3 we actually found
> > cherry-picking master onto 0.5.2 pretty tricky and even error-prone). In
> > some cases, we may have to just make patch releases, but only under
> > extenuating circumstances. Over time, with better tooling and a larger
> > community, we might be able to do this.
> >
> > As for the major release planning process:
> >
>    - PMC/Committers can come up with an initial list sourced from user
>    asks and support issues
>    - The list is shared with the community for feedback; the community
>    can suggest new items and re-prioritizations
>    - Contributors are welcome to commit more features/asks (with due
>    process)
> >
> > I would love to hear +1s, -1s and also any new, completely different
> > ideas as well. Let's use this thread to align ourselves.
> >
> > Once we align ourselves, there are some release certification tools that
> > need to be built out. Hopefully, we can do this together. :)
> >
> >
> > Thanks
> > Vinoth
> >
>



-- 
Regards,
-Sivabalan