Re: Datasource Writer Schema Evolution

Vinoth Chandar Thu, 06 Feb 2020 10:29:18 -0800

Hi,

When reading through the datasource API, as you are, schema merging and
related behavior is the same as with spark.read.parquet(). Hudi merely
filters the files on storage down to the latest snapshot.


https://hudi.apache.org/docs/querying_data.html#read-optimized-query-1
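
To make the plain-Parquet behavior concrete, here is a minimal sketch that uses
only Spark (no Hudi). The temp directory, the column names, and the local
SparkSession are assumptions made for illustration; the two folder names just
mirror the partitions discussed in this thread.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("merge-schema-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch location holding two folders of Parquet files.
val base = Files.createTempDirectory("merge_schema_sketch").toString

// Older files: written before the `direction` column was added.
Seq(("d1", 10L)).toDF("driver", "ts")
  .write.parquet(s"$base/20200205")

// Newer files: same schema plus the added `direction` column.
Seq(("d2", 20L, "north")).toDF("driver", "ts", "direction")
  .write.parquet(s"$base/20200206")

// Default read: Spark does not merge schemas across all files, so the
// added column may be missing depending on which file the schema comes from.
val plain = spark.read.parquet(s"$base/*")
plain.printSchema()

// With mergeSchema, the schemas of all files are unioned; rows from the
// older files carry null in the added column.
val merged = spark.read.option("mergeSchema", "true").parquet(s"$base/*")
merged.printSchema()
merged.show()
```

The same "mergeSchema" option is what leesf passes through the Hudi datasource
further down in this thread.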

thanks
Vinoth

On Thu, Feb 6, 2020 at 8:11 AM leesf <[email protected]> wrote:

> If you update the partition (20200205) after adding the fields, the added
> fields will show up with `val hudiDF2 =
> spark.read.format("org.apache.hudi").load("/tmp/hudi/drivers/*");
> hudiDF2.show`, without needing mergeSchema across all files.
>
> On Thu, Feb 6, 2020 at 8:35 PM, Igor Basko <[email protected]> wrote:
>
> > Thanks a lot for the answer.
> > I was sure Hudi would store the latest schema, instead of merging it from
> > all the files.
> >
> > On Thu, 6 Feb 2020 at 01:10, leesf <[email protected]> wrote:
> >
> > > Hi Igor,
> > >
> > > This happens because Spark's ParquetFileFormat infers the schema from the
> > > Parquet file under the 20200205 dir, and that file does not contain the
> > > added column (direction). You could try `val hudiDF2 =
> > > spark.read.format("org.apache.hudi").option("mergeSchema",
> > > "true").load("/tmp/hudi/drivers/*")` to get the schema merged from 20200205
> > > and 20200206, which shows the added column. I do not know whether it is a
> > > common solution, but it solves the problem.
> > >
> > > Best,
> > > Leesf
> > >
> > > On Wed, Feb 5, 2020 at 3:33 PM, Igor Basko <[email protected]> wrote:
> > >
> > > > Hi All,
> > > > I've tried to write data with some schema changes using the Datasource
> > > > Writer.
> > > > The procedure was:
> > > > First I wrote an event with a specific schema.
> > > > After that I wrote a different event with the same schema but with one
> > > > more added field.
> > > >
> > > > When I read from the Hudi table, I get both the events, with the original
> > > > schema.
> > > > I was expecting to get both events with the newer schema with some default
> > > > value in the new field for the first event.
> > > >
> > > > I've created a gist that describes my experience:
> > > > https://gist.github.com/igorbasko01/4a1d0cf7c06a5b216382260efaa1f333
> > > >
> > > > I would like to know whether schema evolution is supported using the
> > > > Datasource Writer.
> > > > Or maybe I'm doing something wrong.
> > > >
> > > > Thanks a lot.
> > > >
> > >
> >
>
