Re: HUDI Table Primary Key - UUID or Custom For Better Performance

2020-10-21 Thread tanu dua
Thanks, got it. Unfortunately it’s not very straightforward for me to
provide ordered keys. So far I am getting decent write performance, so I
will revisit this if required.

On Wed, 21 Oct 2020 at 7:45 AM, Vinoth Chandar <
mail.vinoth.chan...@gmail.com> wrote:

> For now, bloom filters are not actually leveraged in the read/query path
> but only by the writer performing the index lookup for upserting. Hudi is
> write optimized like an OLTP store and read optimized like OLAP, if
> that makes sense.
>
> As for bloom index performance, our tuning guide and FAQ talk about this.
> If you eventually want to support de-duplication, say, it might be good to
> pick a key that is ordered, something like _hoodie_seq_no that keeps
> increasing with new commits; then the bloom indexing mechanism will also be
> able to do range pruning effectively, improving performance significantly.
> Pure UUID keys are not very conducive to range pruning, i.e. files written
> during each commit will overlap in key range with almost every other file.
>
> Thanks
> Vinoth
>
> On Fri, Oct 16, 2020 at 8:42 PM Tanuj  wrote:
>
> > Thanks Prashant. To answer your questions -
> > 1) Yes, the keys are around 5-8 alphanumeric characters each, but since it
> > is a composite key of 3 domain keys, I believe it will be almost equal to
> > a UUID.
> > 4) That's the business need. We need to keep a track/audit for every
> > insertion of a new record. We had 2 options - update the existing record
> > and move old records to an audit table, or keep pushing into the same
> > table with a timestamp so that it always works in append mode. We chose
> > option 2.
> > 5) That's what I want to understand - how bloom filters will be useful
> > here. And in general, is the bloom filter used in HUDI for reads? I
> > understand the write process where it is being used, but is it used in
> > reads as well? I believe that after picking up the correct parquet file
> > Hudi delegates the read to Spark. Please correct me if I am wrong here.
> > 6) We will only query on domain object keys, excluding create_date.
> >
> > On 2020/10/16 18:53:21, Prashant Wason  wrote:
> > > Hi Tanu,
> > >
> > > Some points to consider:
> > > 1. UUID is fixed size compared to domain_object_keys (don't know the
> > > size). Smaller keys will reduce the storage requirements.
> > > 2. UUIDs don't compress. Your domain object keys may compress better.
> > > 3. From the bloom filter perspective, I don't think there is any
> > > difference unless the size difference of keys is very large.
> > > 4. If the domain object keys are already unique, what is the use of
> > > suffixing the create_date?
> > > 5. If you query by "primary key minus timestamp", the entire record key
> > > column will have to be read to match it. So bloom filters won't be
> useful
> > > here.
> > > 6. What do the domain object keys look like? Are they going to be
> > included
> > > in any other field in the record? Would you ever want to query on
> domain
> > > object keys?
> > >
> > > Thanks
> > > Prashant
> > >
> > >
> > > On Thu, Oct 15, 2020 at 8:21 PM tanu dua 
> wrote:
> > >
> > > > read query pattern will be (partition key + primary key minus
> > timestamp)
> > > > where my primary key is domain keys + timestamp.
> > > >
> > > > Read Write queries are as per dataset but mostly all the tables are
> > read
> > > > and write frequently and equally
> > > >
> > > > Read will be mostly done by providing the partitions and not by
> blanket
> > > > query.
> > > >
> > > > If we have to choose between read and write I will choose write but I
> > want
> > > > to stick only with COW table.
> > > >
> > > > Please let me know if you need more information.
> > > >
> > > >
> > > > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan 
> wrote:
> > > >
> > > > > Can you give us a sense of how your read workload looks like?
> > Depending
> > > > on
> > > > > that read perf could vary.
> > > > >
> > > > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj 
> wrote:
> > > > >
> > > > > > Hi all,
> > > > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > "INSERT"
> > > > > > so what is the best way to define PRIMARY key. As of now we have
> > > > designed
> > > > > > primary key as per domain object with create_date which is -
> > > > > > ,,
> > > > > >
> > > > > > Since its always an INSERT for us , I can potentially use UUID as
> > well
> > > > .
> > > > > >
> > > > > > We use keys for Bloom Index in HUDI so just wanted to know if I
> > get a
> > > > > > better performance in writing if I will have the UUID vs
> composite
> > > > domain
> > > > > > keys.
> > > > > >
> > > > > > I believe read is not impacted as per the Primary Key as its not
> > being
> > > > > > considered ?
> > > > > >
> > > > > > Please suggest
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> >
>
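
Vinoth's point above about ordered keys vs. pure UUIDs can be illustrated with a small write sketch. This is a minimal, hypothetical example, not from the thread: the dataframe, column names, table name and base path are assumptions. It simply puts a monotonically increasing timestamp at the front of the record key so that files written by later commits do not overlap earlier key ranges, which is what lets the bloom index prune by key range during the writer-side index lookup.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;

public class OrderedKeyWrite {
  // Sketch only: "events", the column names and basePath are illustrative.
  public static void write(Dataset<Row> eventDf, String basePath) {
    Dataset<Row> keyed = eventDf.withColumn(
        "record_key",
        // ordered prefix first, domain key as tie-breaker
        concat_ws("_", col("ingest_ts").cast("string"), col("domain_id")));

    keyed.write().format("hudi")
        .option("hoodie.table.name", "events")
        .option("hoodie.datasource.write.recordkey.field", "record_key")
        .option("hoodie.datasource.write.precombine.field", "ingest_ts")
        .option("hoodie.datasource.write.partitionpath.field", "year")
        .option("hoodie.datasource.write.operation", "insert")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}

With a pure UUID key the same write works, but every new file's key range overlaps every other file's, so the range pruning step in the bloom index lookup cannot skip any files.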


Re: Hudi - Concurrent Writes

2020-10-19 Thread tanu dua
Thank you so much.

On Mon, 19 Oct 2020 at 10:16 PM, Balaji Varadarajan
 wrote:

>
> We are planning to add parallel writing to Hudi (at different partition
> levels) in the next release.
> Balaji.V
> On Friday, October 16, 2020, 11:54:51 PM PDT, tanu dua <
> tanu.dua...@gmail.com> wrote:
>
>  Hi,
> Do we have support for concurrent writes in 0.6, as I have a similar
> requirement to ingest in parallel from multiple jobs? I am OK even if
> parallel writes are only supported across different partitions.
>
> On Thu, 9 Jul 2020 at 9:22 AM, Vinoth Chandar  wrote:
>
> > We are looking into adding support for parallel writers in 0.6.0. So that
> > should help.
> >
> > I am curious to understand though why you prefer to have 1000 different
> > writer jobs, as opposed to having just one writer. Typical use cases for
> > parallel writing I have seen are related to backfills and such.
> >
> > +1 to Mario’s comment. Can’t think of anything else if your users are
> happy
> > querying 1000 tables.
> >
> > On Wed, Jul 8, 2020 at 7:28 AM Mario de Sá Vera 
> > wrote:
> >
> > > hey Shayan,
> > >
> > > that seems actually a very good approach ... just curious with the glue
> > > metastore you mentioned. Would it be an external metastore for spark to
> > > query over ??? external in terms of not managed by Hudi ???
> > >
> > > that would be my only concern ... how to maintain the sync between all
> > > metadata partitions but , again, a very promising approach !
> > >
> > > regards,
> > >
> > > Mario.
> > >
> > > Em qua., 8 de jul. de 2020 às 15:20, Shayan Hati  >
> > > escreveu:
> > >
> > > > Hi folks,
> > > >
> > > > We have a use-case where we want to ingest data concurrently for
> > > different
> > > > partitions. Currently Hudi doesn't support concurrent writes on the
> > same
> > > > Hudi table.
> > > >
> > > > One of the approaches we were thinking was to use one hudi table per
> > > > partition of data. So let us say we have 1000 partitions, we will
> have
> > > 1000
> > > > Hudi tables which will enable us to write concurrently on each
> > partition.
> > > > And the metadata for each partition will be synced to a single
> > metastore
> > > > table (Assumption here is schema is same for all partitions). So this
> > > > single metastore table can be used for all the spark, hive queries
> when
> > > > querying data. Basically this metastore glues all the different hudi
> > > table
> > > > data together in a single table.
> > > >
> > > > We already tested this approach and its working fine and each
> partition
> > > > will have its own timeline and hudi table.
> > > >
> > > > We wanted to know if there are some gotchas or any other issues with
> > this
> > > > approach to enable concurrent writes? Or if there are any other
> > > approaches
> > > > we can take?
> > > >
> > > > Thanks,
> > > > Shayan
> > > >
> > >
> >
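
A minimal sketch of the one-table-per-data-partition workaround Shayan describes above; the table name, column names and paths are assumptions, not from the thread. Each concurrent job writes only its own base path, so every "partition" is an independent Hudi table with its own timeline, and a separate Hive/Glue sync step would then register all of them behind one metastore table.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PerPartitionTableWrite {
  // Sketch only: each job owns exactly one partition value, so jobs never
  // share a Hudi timeline and can run concurrently.
  public static void writePartition(Dataset<Row> partitionDf,
                                    String rootPath,
                                    String partitionValue) {
    partitionDf.write().format("hudi")
        .option("hoodie.table.name", "events_" + partitionValue)
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .mode(SaveMode.Append)
        // separate base path per partition => separate table and timeline
        .save(rootPath + "/" + partitionValue);
  }
}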


Re: Hudi - Concurrent Writes

2020-10-17 Thread tanu dua
Hi,
Do we have support for concurrent writes in 0.6, as I have a similar
requirement to ingest in parallel from multiple jobs? I am OK even if
parallel writes are only supported across different partitions.

On Thu, 9 Jul 2020 at 9:22 AM, Vinoth Chandar  wrote:

> We are looking into adding support for parallel writers in 0.6.0. So that
> should help.
>
> I am curious to understand though why you prefer to have 1000 different
> writer jobs, as opposed to having just one writer. Typical use cases for
> parallel writing I have seen are related to backfills and such.
>
> +1 to Mario’s comment. Can’t think of anything else if your users are happy
> querying 1000 tables.
>
> On Wed, Jul 8, 2020 at 7:28 AM Mario de Sá Vera 
> wrote:
>
> > hey Shayan,
> >
> > that seems actually a very good approach ... just curious with the glue
> > metastore you mentioned. Would it be an external metastore for spark to
> > query over ??? external in terms of not managed by Hudi ???
> >
> > that would be my only concern ... how to maintain the sync between all
> > metadata partitions but , again, a very promising approach !
> >
> > regards,
> >
> > Mario.
> >
> > Em qua., 8 de jul. de 2020 às 15:20, Shayan Hati 
> > escreveu:
> >
> > > Hi folks,
> > >
> > > We have a use-case where we want to ingest data concurrently for
> > different
> > > partitions. Currently Hudi doesn't support concurrent writes on the
> same
> > > Hudi table.
> > >
> > > One of the approaches we were thinking was to use one hudi table per
> > > partition of data. So let us say we have 1000 partitions, we will have
> > 1000
> > > Hudi tables which will enable us to write concurrently on each
> partition.
> > > And the metadata for each partition will be synced to a single
> metastore
> > > table (Assumption here is schema is same for all partitions). So this
> > > single metastore table can be used for all the spark, hive queries when
> > > querying data. Basically this metastore glues all the different hudi
> > table
> > > data together in a single table.
> > >
> > > We already tested this approach and its working fine and each partition
> > > will have its own timeline and hudi table.
> > >
> > > We wanted to know if there are some gotchas or any other issues with
> this
> > > approach to enable concurrent writes? Or if there are any other
> > approaches
> > > we can take?
> > >
> > > Thanks,
> > > Shayan
> > >
> >
>


Re: HUDI Table Primary Key - UUID or Custom For Better Performance

2020-10-15 Thread tanu dua
The read query pattern will be (partition key + primary key minus timestamp),
where my primary key is domain keys + timestamp.

Read/write queries vary per dataset, but mostly all the tables are read
and written frequently and about equally.

Reads will mostly be done by providing the partitions, not by a blanket
query.

If we have to choose between read and write I will choose write, but I want
to stick only with COW tables.

Please let me know if you need more information.


On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan  wrote:

> Can you give us a sense of what your read workload looks like? Depending on
> that, read perf could vary.
>
> On Thu, Oct 15, 2020 at 4:06 AM Tanuj  wrote:
>
> > Hi all,
> > We don't have an "UPDATE" use case and all ingested rows will be "INSERT",
> > so what is the best way to define the primary key? As of now we have
> > designed the primary key as per the domain object with create_date, which
> > is - ,,
> >
> > Since it's always an INSERT for us, I can potentially use a UUID as well.
> >
> > We use keys for the bloom index in HUDI, so I just wanted to know if I get
> > better write performance with a UUID vs composite domain keys.
> >
> > I believe reads are not impacted by the choice of primary key, as it is
> > not being considered?
> >
> > Please suggest.
> >
> >
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Planning for Releases 0.6.1 and 0.7.0

2020-09-22 Thread tanu dua
Thanks Vinoth. These are really exciting items, and hats off to you and the
team for pushing the releases swiftly and improving the framework all the
time. I hope someday I will start contributing, once I get free from my
major deliverables and have more understanding of the nitty-gritty details
of Hudi.

You have mentioned Spark 3.0 support in the next release. We were actually
thinking of moving to Spark 3.0 but thought it's too early with the 0.6
release. Is 0.6 not fully tested with Spark 3.0?


On Wed, 23 Sep 2020 at 8:25 AM, Vinoth Chandar  wrote:

> Hello all,
>
> Pursuant to our conversation around release planning, I am happy to share
> the initial set of proposals for the next minor/major releases (minor
> release ofc can go out based on time)
>
> *Next Minor version 0.6.1 (with stuff that did not make it to 0.6.0..) *
> Flink/Writer common refactoring for Flink
> Small file handling support w/o caching
> Spark3 Support
> Remaining bootstrap items
> Completing bulk_insertV2 (sort mode, de-dup etc)
> Full list here :
> https://issues.apache.org/jira/projects/HUDI/versions/12348168
>
> *0.7.0 with major new features *
> RFC-15: metadata, range index (w/ spark support), bloom index (eliminate
> file listing, query pruning, improve bloom index perf)
> RFC-08: Record Index (to solve global index scalability/perf)
> RFC-18/19: Clustering/Insert overwrite
> Spark 3 based datasource rewrite (structured streaming sink/source,
> DELETE/MERGE)
> Incremental Query on logs (Hive, Spark)
> Parallel writing support
> Redesign of marker files for S3
> Stretch: ORC, PrestoSQL Support
>
> Full list here :
> https://issues.apache.org/jira/projects/HUDI/versions/12348721
>
> Please chime in with your thoughts. If you would like to commit to
> contributing a feature towards a release, please do so by marking *`Fix
> Version/s`* field with that release number.
>
> Thanks
> Vinoth
>


Re: Enforcing Dataset Schema before pushing to HUDI

2020-09-19 Thread tanu dua
No, we don’t want the dataframe to be converted to the schema; we just need
a validation. The following logic, which I mentioned earlier, is the only
way I could find in Spark to validate, but I don’t find it very effective
as we are unnecessarily creating a dataframe:

 Dataset<Row> hudiDs = spark.createDataFrame(
     dataset.select(columnNamesToSelect.stream().map(s -> new Column(s))
         .toArray(Column[]::new)).rdd(),
     targetSchema); // second argument: the target schema (placeholder name)
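
If only validation is needed, the dataframe does not have to be recreated at all; the expected StructType can be checked against df.schema() directly before the Hudi write. A minimal sketch, not part of the thread (the class name and the fail-fast behaviour are illustrative choices):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public final class SchemaGuard {
  // Fails fast if the dataframe carries unexpected columns or mismatched types.
  // Nullability and metadata checks are intentionally omitted.
  public static void validate(Dataset<Row> df, StructType expected) {
    Set<String> expectedCols = new HashSet<>(Arrays.asList(expected.fieldNames()));
    for (StructField actual : df.schema().fields()) {
      if (!expectedCols.contains(actual.name())) {
        throw new IllegalArgumentException("Unexpected column: " + actual.name());
      }
      StructField want = expected.apply(actual.name());
      if (!want.dataType().equals(actual.dataType())) {
        throw new IllegalArgumentException("Type mismatch for " + actual.name()
            + ": expected " + want.dataType() + " but got " + actual.dataType());
      }
    }
  }
}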

On Sat, 19 Sep 2020 at 11:35 PM, Vinoth Chandar  wrote:

> We could add support to validate the data frame against a schema string
> passed to the data source writer. I guess you want the dataframe to be also
> converted into the provided schema?
>
> On Tue, Sep 15, 2020 at 9:02 PM tanu dua  wrote:
>
> > Hmm but our use case has multiple schemas, one for each dataset, as each
> > dataset is unique in our case, and hence the need to validate the schema
> > for each dataset while writing.
> >
> > On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar  wrote:
> >
> > > Hi,
> > >
> > > Typically when writing, people use a single schema, that's probably why.
> > > During read however, you are dealing with files written by different
> > > writes with different schemas.
> > > So the ability to pass in a schema is handy. Hope that makes sense.
> > >
> > > On Sat, Sep 12, 2020 at 12:38 PM tanu dua  wrote:
> > >
> > > > Thanks Vinoth. Yes that's always an option with me to validate myself.
> > > > I just wanted to confirm if Spark does it for me for all my datasets,
> > > > and I wonder why they haven't provided it for write but provided it
> > > > for read.
> > > >
> > > > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> > > > > Datasource V1 at least does not allow for passing in the schema.
> > > > > Hudi writing will just use the schema of the df you pass in.
> > > > >
> > > > > Just throwing it out there: can you write a step to drop all
> > > > > unnecessary columns before issuing the write, i.e.
> > > > > df.map(funcToDropExtraCols()).write.format("hudi")
> > > > >
> > > > > thanks

Re: Hudi Concurrent Ingestion with Spark Streaming

2020-09-17 Thread tanu dua
Thank you so much, Nishith. I understand now how it’s going to work.

On Wed, 16 Sep 2020 at 11:15 PM, nishith agarwal 
wrote:

> Tanu,
>
> I'm assuming you're talking about multiple Kafka partitions from a single
> Spark Streaming job. In this case, your job can read from multiple
> partitions, but at the end this data should be written to a single table.
> The dataset/RDD resulting from reading multiple partitions is passed as a
> whole to the Hudi writer, and Spark parallelism takes care of ensuring you
> don't lose the Kafka partition parallelism. In this case, there are no
> "multi-writers" to Hudi tables. Is your setup different from the one I
> described?
>
> -Nishith
>
> On Wed, Sep 16, 2020 at 9:50 AM tanu dua  wrote:
>
> > Hi,
> > I need to try this myself more, but how does Hudi concurrent ingestion
> > work with Spark Streaming?
> > We have multiple Kafka partitions that Spark is listening on, so there is
> > a possibility that at any given point in time multiple executors will be
> > reading the Kafka partitions and ingesting data. What behaviour can I
> > expect from Hudi? It's possible that they may be writing to the same Hudi
> > partition.
> >
> > Would both writes be successful? Would one overwrite another if both have
> > the same primary key?
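
A minimal sketch of the single-writer pattern Nishith describes above; the topic, broker, column names and paths are assumptions. One streaming query reads all Kafka partitions, and each micro-batch is handed to the Hudi writer as a whole, so Kafka parallelism never turns into multiple concurrent Hudi writers.

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaToHudiStream {
  public static StreamingQuery start(SparkSession spark, String basePath) throws Exception {
    // One source reading every partition of the topic.
    Dataset<Row> kafka = spark.readStream().format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load();

    Dataset<Row> records = kafka.selectExpr(
        "CAST(key AS STRING) AS record_key",
        "CAST(value AS STRING) AS payload",
        "timestamp AS ts");

    // The whole micro-batch is written by a single Hudi writer per trigger.
    VoidFunction2<Dataset<Row>, Long> writeBatch = (batch, batchId) ->
        batch.write().format("hudi")
            .option("hoodie.table.name", "events")
            .option("hoodie.datasource.write.recordkey.field", "record_key")
            .option("hoodie.datasource.write.precombine.field", "ts")
            .mode(SaveMode.Append)
            .save(basePath);

    return records.writeStream()
        .trigger(Trigger.ProcessingTime("30 seconds"))
        .option("checkpointLocation", basePath + "/_checkpoints")
        .foreachBatch(writeBatch)
        .start();
  }
}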


Hudi Concurrent Ingestion with Spark Streaming

2020-09-16 Thread tanu dua
Hi,
I need to try this myself more, but how does Hudi concurrent ingestion work
with Spark Streaming?
We have multiple Kafka partitions that Spark is listening on, so there is
a possibility that at any given point in time multiple executors will be
reading the Kafka partitions and ingesting data. What behaviour can I
expect from Hudi? It’s possible that they may be writing to the same Hudi
partition.

Would both writes be successful? Would one overwrite another if both have
the same primary key?


Re: Enforcing Dataset Schema before pushing to HUDI

2020-09-15 Thread tanu dua
Hmm, but our use case has multiple schemas, one for each dataset, as each
dataset is unique in our case; hence the need to validate the schema for
each dataset while writing.

On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar  wrote:

> Hi,
>
> Typically when writing, people use a single schema, that's probably why.
> During read however, you are dealing with files written by different
> writes with different schemas.
> So the ability to pass in a schema is handy. Hope that makes sense.
>
> On Sat, Sep 12, 2020 at 12:38 PM tanu dua  wrote:
>
> > Thanks Vinoth. Yes that’s always an option with me to validate myself. I
> > just wanted to confirm if Spark does it for me for all my datasets and I
> > wonder why they haven’t provided it for write but provided it for read.
> >
> > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar  wrote:
> >
> > > Hi,
> > >
> > > IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> > > Datasource V1 at least does not allow for passing in the schema.
> > > Hudi writing will just use the schema of the df you pass in.
> > >
> > > Just throwing it out there: can you write a step to drop all
> > > unnecessary columns before issuing the write, i.e.
> > > df.map(funcToDropExtraCols()).write.format("hudi")
> > >
> > > thanks
> > > Vinoth
> > >
> > > On Wed, Sep 9, 2020 at 6:59 AM Tanuj  wrote:
> > >
> > > > Hi,
> > > > We are working on Dataset which goes through a lot of transformations
> > > > and is then pushed to HUDI. Since HUDI follows an evolving schema, if
> > > > I am going to add a new column it will allow me to do so.
> > > >
> > > > When we write into HUDI using Spark I don't see any option where the
> > > > DS is validated against the schema (StructType) which we have in read,
> > > > which can cause us to write some unwanted columns, especially in lower
> > > > envs.
> > > >
> > > > For eg.
> > > > READ has an option like spark.read.format("hudi").schema()
> > > > which validates the schema, however
> > > > WRITE doesn't have a schema option spark.write.format("hudi"), so all
> > > > columns go out without validation against the schema.
> > > >
> > > > The workaround that I came up with is to recreate the dataset again
> > > > with the schema, but I don't like it as it has an overhead. Do we have
> > > > any other better option and am I missing something?
> > > >
> > > >  Dataset hudiDs = spark.createDataFrame(
> > > >   dataset.select(columnNamesToSelect.stream().map(s -> new
> > > > Column(s)).toArray(Column[]::new)).rdd(),
> > > >   );
>


Re: Enforcing Dataset Schema before pushing to HUDI

2020-09-12 Thread tanu dua
Thanks Vinoth. Yes, that’s always an option with me, to validate myself. I
just wanted to confirm whether Spark does it for me for all my datasets, and
I wonder why they haven’t provided it for write but have provided it for read.

On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar  wrote:

> Hi,
>
> IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> Datasource V1 at least does not allow for passing in the schema.
> Hudi writing will just use the schema of the df you pass in.
>
> Just throwing it out there: can you write a step to drop all unnecessary
> columns before issuing the write, i.e.
> df.map(funcToDropExtraCols()).write.format("hudi")
>
> thanks
> Vinoth
>
> On Wed, Sep 9, 2020 at 6:59 AM Tanuj  wrote:
>
> > Hi,
> > We are working on Dataset which goes through a lot of transformations
> > and is then pushed to HUDI. Since HUDI follows an evolving schema, if I am
> > going to add a new column it will allow me to do so.
> >
> > When we write into HUDI using Spark I don't see any option where the DS is
> > validated against the schema (StructType) which we have in read, which can
> > cause us to write some unwanted columns, especially in lower envs.
> >
> > For eg.
> > READ has an option like spark.read.format("hudi").schema()
> > which validates the schema, however
> > WRITE doesn't have a schema option spark.write.format("hudi"), so all
> > columns go out without validation against the schema.
> >
> > The workaround that I came up with is to recreate the dataset again with
> > the schema, but I don't like it as it has an overhead. Do we have any
> > other better option and am I missing something?
> >
> >  Dataset hudiDs = spark.createDataFrame(
> >   dataset.select(columnNamesToSelect.stream().map(s -> new
> > Column(s)).toArray(Column[]::new)).rdd(),
> >   );
>


Re: HUDI Read | Leverage Partitions

2020-08-19 Thread tanu dua
I am so sorry to bother you. It worked; there was a typo. I really
apologize.

On Wed, Aug 19, 2020 at 7:01 PM tanu dua  wrote:

> Hi Gary,
> I am getting an exception while loading HUDI tables using glob path. Does
> it work ? Have someone tried it ? If I use without {} it works
> Caused by: org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/C:/Hudi/data/co/A/2019/{3,4};
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:552)
> at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
>
> On Tue, Jun 30, 2020 at 7:39 PM Tanuj  wrote:
>
>> Thanks a lot. I understand now.
>>
>> On 2020/06/27 02:45:52, Gary Li  wrote:
>> > Hi,
>> >
>> > If you use year=xxx/month=xxx folder structure, you can use Dataset
>> > df=
>> >
>> spark.read().format("hudi").schema(schema).load(+).
>> > Without a glob postfix, Spark can automatically load the partition
>> > information, just like regular parquet files.
>> >
>> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
>> >
>> > If you use something like 2020/06, you may need to build the glob string
>> > and add it to the load() to skip the unnecessary partitions. e.g.
>> > .load(++"2020/{05,06}")
>> >
>> > Or list one parquet file from different partitions and use a map
>> function
>> > to load 1 row from each path with a limit clause.
>> >
>> > On Fri, Jun 26, 2020 at 8:33 AM Tanuj  wrote:
>> >
>> > > Hi,
>> > > We have created a table with partition depth of 2 as year/month. We
>> need
>> > > to read data from HUDI in Spark Streaming layer where we get the
>> batch data
>> > > of say 10 rows which we need to use to read from HUDI. We are reading
>> it
>> > > like -
>> > >
>> > > // Read from HUDI
>> > > Dataset df=
>> > >
>> spark.read().format("hudi").schema(schema).load(++"/*/*")
>> > >
>> > > //Apply filter
>> > >
>> > >
>> df=df.filter(df.col("year").isin().filter(df.col("month").isin()).filter(df.col("id").isin());
>> > >
>> > > Is it the best way to read the data ? Will HUDI take care of just
>> reading
>> > > from the partitions or we need to take care of ? For eg. If I need to
>> read
>> > > just 1 row we can build the full path and then read which will read
>> the
>> > > parquet file from that partition quickly but here our requirement is
>> to
>> > > read data from multiple partitions.
>> > >
>> > >
>> > >
>> >
>>
>
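
The pattern that ended up working here, per Gary's earlier suggestion, can be sketched as below; the base path, schema and partition values are illustrative assumptions. The brace glob limits the load to the listed partitions instead of scanning the whole table with "/*/*".

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class PartitionGlobRead {
  // e.g. readMonths(spark, "file:///C:/Hudi/data/co/A", schema, "2019", "{3,4}")
  // resolves to .../2019/{3,4} and only lists those two month partitions.
  public static Dataset<Row> readMonths(SparkSession spark, String basePath,
                                        StructType schema, String year, String monthGlob) {
    return spark.read().format("hudi")
        .schema(schema)
        .load(basePath + "/" + year + "/" + monthGlob);
  }
}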


Re: HUDI Read | Leverage Partitions

2020-08-19 Thread tanu dua
Hi Gary,
I am getting an exception while loading HUDI tables using a glob path. Does
it work? Has someone tried it? If I use it without {}, it works:
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist:
file:/C:/Hudi/data/co/A/2019/{3,4};
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:552)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)

On Tue, Jun 30, 2020 at 7:39 PM Tanuj  wrote:

> Thanks a lot. I understand now.
>
> On 2020/06/27 02:45:52, Gary Li  wrote:
> > Hi,
> >
> > If you use year=xxx/month=xxx folder structure, you can use Dataset
> > df=
> >
> spark.read().format("hudi").schema(schema).load(+).
> > Without a glob postfix, Spark can automatically load the partition
> > information, just like regular parquet files.
> >
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
> >
> > If you use something like 2020/06, you may need to build the glob string
> > and add it to the load() to skip the unnecessary partitions. e.g.
> > .load(++"2020/{05,06}")
> >
> > Or list one parquet file from different partitions and use a map function
> > to load 1 row from each path with a limit clause.
> >
> > On Fri, Jun 26, 2020 at 8:33 AM Tanuj  wrote:
> >
> > > Hi,
> > > We have created a table with partition depth of 2 as year/month. We
> need
> > > to read data from HUDI in Spark Streaming layer where we get the batch
> data
> > > of say 10 rows which we need to use to read from HUDI. We are reading
> it
> > > like -
> > >
> > > // Read from HUDI
> > > Dataset df=
> > >
> spark.read().format("hudi").schema(schema).load(++"/*/*")
> > >
> > > //Apply filter
> > >
> > >
> df=df.filter(df.col("year").isin().filter(df.col("month").isin()).filter(df.col("id").isin());
> > >
> > > Is it the best way to read the data ? Will HUDI take care of just
> reading
> > > from the partitions or we need to take care of ? For eg. If I need to
> read
> > > just 1 row we can build the full path and then read which will read the
> > > parquet file from that partition quickly but here our requirement is to
> > > read data from multiple partitions.
> > >
> > >
> > >
> >
>


Re: Recommendation to load HUDI data across partitions

2020-08-14 Thread tanu dua
Thanks Vinoth for the detailed explanation; I was about to reply that it
worked, and I followed most of the steps that you mentioned below:
used forEachBatch() on the stream to process the batch data from Kafka, then
found the affected partitions using aggregate functions on the Kafka Dataset,
and fed those partitions as a glob pattern to Hudi to get hudiDs.
Then I performed a join on both datasets. I had some complex logic to deduce
from both kafkaDs and hudiDs and hence was using flatMap, but I am now able
to remove the flatMap and use Dataset joins.

Thanks again for all your help as always !!




On Thu, Aug 13, 2020 at 1:42 PM Vinoth Chandar  wrote:

> Hi Tanuj,
>
> From this example, it appears as if you are trying to use sparkSession from
> within the executor? This will be problematic. Can you please open a
> support ticket with the full stack trace?
>
> I think what you are describing is a join between Kafka and Hudi tables. So
> I'd read from Kafka first, cache the 2K messages in memory, find out what
> partitions they belong to, and only load those affected partitions instead
> of the entire table.
> At this point, you will have two datasets : kafkaDF and hudiDF (or RDD or
> DataSet.. my suggestion remains valid)
> And instead of hand crafting the join at the record level, like you have.
> you can just use RDD/DataSet level join operations and then get a resultDF
>
> then you do a resultDF.write.format("hudi") and you are done?
>
> On Tue, Aug 11, 2020 at 2:33 AM Tanuj  wrote:
>
> > Hi,
> > I have a problem statement where I am consuming messages from Kafka and
> > then depending upon that Kafka message (2K records) I need to query Hudi
> > table and create a dataset (with both updates and inserts) and push them
> > back to Hudi table.
> >
> > I tried following but it threw NP exception from sparkSession scala code
> > and rightly so as sparkSession was used in Executor.
> >
> >  Dataset hudiDs = companyStatusDf.flatMap(new
> > FlatMapFunction() {
> > @Override
> > public Iterator call(KafkaRecord kafkaRecord)
> > throws Exception {
> > String prop1= kafkaRecord.getProp1();
> > String prop2= kafkaRecord.getProp2();
> > HudiRecord hudiRecord =  sparkSession.read()
> > .format(HUDI_DATASOURCE)
> > .schema()
> > .load()
> > .as(Encoders.bean((HudiRecord.class)))
> > .filter( say prop1);
> > hudiRecord = tranform();
> > // Modificiation in hudi record
> > return Arrays.asList(kafkaRecord, hudiRecord).iterator();
> > }
> >
> > }
> > }, Encoders.bean(CompanyStatusGoldenRecord.class));
> >
> > In HUDI, I have 2 level of partitions (year and month) so for eg if I get
> > 2K records from Kafka which will be spanned across multiple partitions -
> > what is advisable load first the full table like "/*/*/*" or first read
> > kafka record, find out which partitions need to be hit and then load only
> > those HUDI tables as per partitions .I believe 2nd option would be faster
> > i.e. loading the specific partitions and thats what I was trying in above
> > snippet of code. So if have to leverage partitions, is collect() on Kafka
> > Dataset to get the list of partitions  and then supply to HUDI is the
> only
> > option or I can do it just with the spark datasets ?
> >
>
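
A minimal sketch of the flow Vinoth outlines and tanu confirms above; the column names, partition layout and paths are assumptions. The touched partitions are derived from the Kafka micro-batch on the driver, only those Hudi partitions are loaded via a glob, and the two datasets are joined instead of querying Hudi from inside executors.

import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaHudiJoin {
  public static Dataset<Row> loadAffected(SparkSession spark, Dataset<Row> kafkaDf,
                                          String basePath) {
    kafkaDf.cache(); // small batch (~2K records), cheap to keep around for the join

    // Distinct partition values touched by this batch, collected on the driver.
    String years = kafkaDf.select("year").distinct().collectAsList().stream()
        .map(r -> String.valueOf(r.get(0)))
        .collect(Collectors.joining(",", "{", "}"));
    String months = kafkaDf.select("month").distinct().collectAsList().stream()
        .map(r -> String.valueOf(r.get(0)))
        .collect(Collectors.joining(",", "{", "}"));

    // Load only the affected partitions, e.g. <basePath>/{2019,2020}/{3,4}
    Dataset<Row> hudiDf = spark.read().format("hudi")
        .load(basePath + "/" + years + "/" + months);

    // Join at the Dataset level; the business transformation and the
    // resultDf.write().format("hudi")...save(basePath) step follow from here.
    return kafkaDf.join(hudiDf, kafkaDf.col("id").equalTo(hudiDf.col("id")), "left_outer");
  }
}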


Re: DISCUSS code, config, design walk through sessions

2020-07-30 Thread tanu dua
I missed it due to work commitments. Can we please have the recording ?

On Thu, 30 Jul 2020 at 11:52 PM, Zijing Guo 
wrote:

>  Thanks for the great session Vinoth!  Can we have those sessions on a
> regular basis? I personally found today's session super helpful!
> On Thursday, July 30, 2020, 01:36:06 PM EDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Thanks everyone who joined!
>
> I am hanging out in #general on slack, if we want to finish off any
> remaining questions. Please @vc me for questions.
>
> On Thu, Jul 30, 2020 at 8:00 AM Vinoth Chandar  wrote:
>
> > yes! Please join
> >
> > On Thu, Jul 30, 2020 at 7:35 AM Pratyaksh Sharma 
> > wrote:
> >
> >> Hi Vinoth,
> >>
> >> Is this happening now?
> >>
> >> On Mon, Jul 27, 2020 at 3:50 AM Vinoth Chandar 
> wrote:
> >>
> >> > Hi all,
> >> >
> >> > We will be using the conference link we use for the community sync.
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Community+Weekly+Sync
> >> >
> >> >
> >> > Once again, the date time is : July 30 8-10 AM PST
> >> > We will try to follow the following agenda
> >> >
> >> > - Hudi design overview (30 mins, with 5 mins Q)
> >> > - Hudi Code walkthrough (45 mins, talking questions as we go)
> >> >- Code structure
> >> >- Important classes
> >> >- Configs run through
> >> > - Hudi ongoing work, future direction (30 mins)
> >> >- Upcoming RFCs
> >> >- How code may be evolving to support them
> >> >
> >> > Thanks
> >> > Vinoth
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Jul 23, 2020 at 7:53 AM Adam Feldman 
> >> wrote:
> >> >
> >> > > Great! Thank you
> >> > >
> >> > > On Thu, Jul 23, 2020, 10:49 Vinoth Chandar 
> wrote:
> >> > >
> >> > > > Hi Adam,
> >> > > >
> >> > > > Next week. July 30th 8AM PST.
> >> > > >
> >> > > > I will be sending dial in information over the weekend.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Jul 23, 2020 at 7:47 AM Adam Feldman  >
> >> > > wrote:
> >> > > >
> >> > > > > Hey, was this decided for today or the 30th?
> >> > > > >
> >> > > > > On Thu, Jul 16, 2020, 06:32 Zijing Guo
>  >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > +1 for the time.
> >> > > > > >
> >> > > > > >
> >> > > > > > Sent from Yahoo Mail for iPhone
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wednesday, July 15, 2020, 11:42 PM, Vinoth Chandar <
> >> > > > vin...@apache.org
> >> > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > Great!  Moving on to date. Would July 23/30 Thursday 8 AM PST
> >> work
> >> > > for
> >> > > > > > everyone?
> >> > > > > >
> >> > > > > > On Tue, Jul 14, 2020 at 12:17 PM Shiyan Xu <
> >> > > > xu.shiyan.raym...@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > +1
> >> > > > > > >
> >> > > > > > > On Tue, Jul 14, 2020, 11:34 AM Vinoth Chandar <
> >> vin...@apache.org
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Typo: date TBD (not data :))
> >> > > > > > > >
> >> > > > > > > > On Tue, Jul 14, 2020 at 11:20 AM Adam Feldman <
> >> > > afeldm...@gmail.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > +1
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Jul 14, 2020, 14:09 Gary Li <
> >> > yanjia.gary...@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > +1. 8am works for me.
> >> > > > > > > > > >
> >> > > > > > > > > > On Tue, Jul 14, 2020 at 11:01 AM Vinoth Chandar <
> >> > > > > vin...@apache.org
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hello all,
> >> > > > > > > > > > >
> >> > > > > > > > > > > please chime in. We will plan to freeze Tuesday 8AM
> >> (data
> >> > > > TBD)
> >> > > > > by
> >> > > > > > > EOD
> >> > > > > > > > > PST
> >> > > > > > > > > > > today.
> >> > > > > > > > > > >
> >> > > > > > > > > > > thanks
> >> > > > > > > > > > > Vinoth
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Mon, Jul 13, 2020 at 12:38 AM Pratyaksh Sharma <
> >> > > > > > > > > pratyaks...@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > 8 AM PST works for me. This is actually more
> >> suitable
> >> > for
> >> > > > me
> >> > > > > > than
> >> > > > > > > > the
> >> > > > > > > > > > > > community sync time.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Will wait for others to respond. If 8 AM does not
> >> work
> >> > > for
> >> > > > > > > majority
> >> > > > > > > > > of
> >> > > > > > > > > > > > people, I will start a new thread for revoting.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Mon, Jul 13, 2020 at 11:55 AM David Sheard <
> >> > > > > > > > > > > > david.she...@datarefactory.com.au> wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > That is 01:00 Canberra Australia time. But that
> is
> >> > fine
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Cheers
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Mon, 13 Jul. 2020, 11:55 am 

Re: Date handling in HUDI

2020-07-21 Thread tanu dua
Thanks. Even I am struggling with all data types except String, with the
same decode exception. For e.g., for both double and int I got the
exception, and when I convert to String everything works fine in Spark SQL.

On Tue, 21 Jul 2020 at 1:38 PM, Balaji Varadarajan
 wrote:

>
> Gary/Udit,
> As you are familiar with this part of it, can you please answer this
> question?
> Thanks,
> Balaji.V
> On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua <
> tanu.dua...@gmail.com> wrote:
>
>  Hi Guys,
> May I know how do you guys handle date and time stamp in Hudi.
> When I set DataTypes as Date in StructType it’s getting ingested as int but
> when I query using spark sql I get the following
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557
>
> So not sure if it’s only me who face this. Do I need to change to String
> ?


Date handling in HUDI

2020-07-20 Thread tanu dua
Hi guys,
May I know how you handle date and timestamp types in Hudi?
When I set the DataType as Date in the StructType, it gets ingested as int,
but when I query using Spark SQL I get the following:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-17557

So I am not sure if it’s only me who faces this. Do I need to change to
String?
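
A minimal sketch of the workaround the thread converges on; the column name and the format are assumptions. The value is stored as a formatted string, or as an epoch-based long, rather than Spark's DateType, so it round-trips through the parquet write without hitting the int-backed decode issue on read.

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;
import static org.apache.spark.sql.functions.unix_timestamp;

public class DateColumnWorkaround {
  // Replace the DateType column with a string representation before the Hudi write.
  public static Dataset<Row> dateAsString(Dataset<Row> df) {
    Column asString = date_format(col("create_date"), "yyyy-MM-dd");
    return df.withColumn("create_date", asString);
  }

  // Alternative: keep it numeric, but as epoch seconds (a plain long).
  public static Dataset<Row> dateAsEpoch(Dataset<Row> df) {
    return df.withColumn("create_epoch", unix_timestamp(col("create_date")));
  }
}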


Re: Expose HUDI CLI as a Service

2020-07-19 Thread tanu dua
It’s tanudua

Thanks.

On Sun, 19 Jul 2020 at 9:10 PM, Vinoth Chandar  wrote:

> Absolutely. Please share your cwiki id.
>
> On Sat, Jul 18, 2020 at 11:21 PM tanu dua  wrote:
>
> > Can I please have an access of Confluence to post RFC
> >
> > On Sun, Jul 19, 2020 at 6:05 AM Vinoth Chandar 
> wrote:
> >
> > > Great. Please feel free to post more followup thoughts here or on an
> RFC,
> > > as you prefer.
> > >
> > > On Thu, Jul 16, 2020 at 9:46 PM tanu dua 
> wrote:
> > >
> > > > Thanks Vinoth. I understand now. I would also look timeline server to
> > > > understand more how it works.
> > > >
> > > > On Fri, Jul 17, 2020 at 9:33 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > We have support both modes. Standalone or embedded within driver?
> > > > >
> > > > > We run javalin timeline server today during the spark write. All
> the
> > > file
> > > > > listings during the writing actually are spark executors talking to
> > > this
> > > > > driver service. This way, we don't keep listing S3/HDFS repeatedly
> > > during
> > > > > the write process. This timeline server can also be run in a long
> > > running
> > > > > mode as a separate process. That is what the hudi-timeline-service
> > > module
> > > > > does..
> > > > >
> > > > > I was suggesting something similar for the UI itself.. We can just
> > get
> > > a
> > > > > working UI/service first may be.. and pulling this into Spark
> driver
> > > > won't
> > > > > be a big deal. Someone else may also be interested in taking it
> up..
> > > > >
> > > > > Overall +1 from me
> > > > >
> > > > >
> > > > > On Wed, Jul 15, 2020 at 9:10 PM tanu dua 
> > > wrote:
> > > > >
> > > > > > Sure we can go ahead with a rudimentary UI.
> > > > > > On running it as a part of Spark Driver itself we should first
> > > conclude
> > > > > > what are end goals of this service should be. I was rather
> thinking
> > > of
> > > > > > hosting it as separate service so that without running a spark
> > > program
> > > > I
> > > > > > can browse through the table metadata as I can do in Hudi CLI but
> > > since
> > > > > CLI
> > > > > > is shell based and everyone will not have an access to shell so
> > those
> > > > > > service can help there.
> > > > > > I noticed that we start a javalin server when we start Spark
> > program
> > > > but
> > > > > > honestly I don’t know where do we use it . Do we use it in hudi
> > spark
> > > > > code
> > > > > > ? Is it a good idea to access rest services from spark code ?
> > > > > >
> > > > > > On Thu, 16 Jul 2020 at 9:11 AM, Vinoth Chandar <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > Sorry, did not realize my response was still stuck in my
> outbox.
> > > > > > >
> > > > > > > At a high level, that sounds good to me. I would start with a
> > > > > rudimentary
> > > > > > > UI to begin with if possible. Having a service alone may not
> make
> > > > this
> > > > > > very
> > > > > > > readily consumable?
> > > > > > >
> > > > > > > Other random thought is, if we can host this UI service as a
> part
> > > of
> > > > > the
> > > > > > > spark driver itself? In terms of deployment model - it would be
> > > nice
> > > > > > > atleast for spark streaming/DeltaStreamer continuous mode to
> > > atleast
> > > > > have
> > > > > > > UI hosted by the spark driver. This way people don’t have to
> run
> > a
> > > > > > separate
> > > > > > > server per se..
> > > > > > > (we already have a timeline-server which we have not pursued
> > > actively
> > > > > as
> > > > > > a
> > > > > > > separate running service for the same reasons. )
> > > > > 

Re: Expose HUDI CLI as a Service

2020-07-19 Thread tanu dua
Can I please have access to Confluence to post an RFC?

On Sun, Jul 19, 2020 at 6:05 AM Vinoth Chandar  wrote:

> Great. Please feel free to post more followup thoughts here or on an RFC,
> as you prefer.
>
> On Thu, Jul 16, 2020 at 9:46 PM tanu dua  wrote:
>
> > Thanks Vinoth. I understand now. I would also look timeline server to
> > understand more how it works.
> >
> > On Fri, Jul 17, 2020 at 9:33 AM Vinoth Chandar 
> wrote:
> >
> > > We have support both modes. Standalone or embedded within driver?
> > >
> > > We run javalin timeline server today during the spark write. All the
> file
> > > listings during the writing actually are spark executors talking to
> this
> > > driver service. This way, we don't keep listing S3/HDFS repeatedly
> during
> > > the write process. This timeline server can also be run in a long
> running
> > > mode as a separate process. That is what the hudi-timeline-service
> module
> > > does..
> > >
> > > I was suggesting something similar for the UI itself.. We can just get
> a
> > > working UI/service first may be.. and pulling this into Spark driver
> > won't
> > > be a big deal. Someone else may also be interested in taking it up..
> > >
> > > Overall +1 from me
> > >
> > >
> > > On Wed, Jul 15, 2020 at 9:10 PM tanu dua 
> wrote:
> > >
> > > > Sure we can go ahead with a rudimentary UI.
> > > > On running it as a part of Spark Driver itself we should first
> conclude
> > > > what are end goals of this service should be. I was rather thinking
> of
> > > > hosting it as separate service so that without running a spark
> program
> > I
> > > > can browse through the table metadata as I can do in Hudi CLI but
> since
> > > CLI
> > > > is shell based and everyone will not have an access to shell so those
> > > > service can help there.
> > > > I noticed that we start a javalin server when we start Spark program
> > but
> > > > honestly I don’t know where do we use it . Do we use it in hudi spark
> > > code
> > > > ? Is it a good idea to access rest services from spark code ?
> > > >
> > > > On Thu, 16 Jul 2020 at 9:11 AM, Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hi,
> > > > > Sorry, did not realize my response was still stuck in my outbox.
> > > > >
> > > > > At a high level, that sounds good to me. I would start with a
> > > rudimentary
> > > > > UI to begin with if possible. Having a service alone may not make
> > this
> > > > very
> > > > > readily consumable?
> > > > >
> > > > > Other random thought is, if we can host this UI service as a part
> of
> > > the
> > > > > spark driver itself? In terms of deployment model - it would be
> nice
> > > > > atleast for spark streaming/DeltaStreamer continuous mode to
> atleast
> > > have
> > > > > UI hosted by the spark driver. This way people don’t have to run a
> > > > separate
> > > > > server per se..
> > > > > (we already have a timeline-server which we have not pursued
> actively
> > > as
> > > > a
> > > > > separate running service for the same reasons. )
> > > > >
> > > > > Any thoughts on this?
> > > > >
> > > > > Thanks for driving this forward
> > > > > Vinoth
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jul 10, 2020 at 7:02 AM Tanuj 
> wrote:
> > > > >
> > > > > > This is what my high level thought and design, please correct me
> > if I
> > > > am
> > > > > > wrong.
> > > > > > 1) We are using Spring Shell for hudi cli and for each command we
> > > have
> > > > > > class and methods annotated with CliCommand
> > > > > > 2) We initiate the static file system fs once we connect to the
> > table
> > > > and
> > > > > > then all operations interact with that fs
> > > > > >
> > > > > > On the similar lines, we can write a Spring Boot app  -
> > > > > > 1) Which will spin up a new microservices server and in place of
> > > > > > CliCommand  we will have Spring Boot end point
> > > > > > 2) Since micr

Re: Expose HUDI CLI as a Service

2020-07-16 Thread tanu dua
Thanks Vinoth. I understand now. I will also look at the timeline server to
understand more about how it works.

On Fri, Jul 17, 2020 at 9:33 AM Vinoth Chandar  wrote:

> We have support for both modes: standalone, or embedded within the driver.
>
> We run javalin timeline server today during the spark write. All the file
> listings during the writing actually are spark executors talking to this
> driver service. This way, we don't keep listing S3/HDFS repeatedly during
> the write process. This timeline server can also be run in a long running
> mode as a separate process. That is what the hudi-timeline-service module
> does..
>
> I was suggesting something similar for the UI itself.. We can just get a
> working UI/service first may be.. and pulling this into Spark driver won't
> be a big deal. Someone else may also be interested in taking it up..
>
> Overall +1 from me
>
>
> On Wed, Jul 15, 2020 at 9:10 PM tanu dua  wrote:
>
> > Sure we can go ahead with a rudimentary UI.
> > On running it as a part of Spark Driver itself we should first conclude
> > what are end goals of this service should be. I was rather thinking of
> > hosting it as separate service so that without running a spark program I
> > can browse through the table metadata as I can do in Hudi CLI but since
> CLI
> > is shell based and everyone will not have an access to shell so those
> > service can help there.
> > I noticed that we start a javalin server when we start Spark program but
> > honestly I don’t know where do we use it . Do we use it in hudi spark
> code
> > ? Is it a good idea to access rest services from spark code ?
> >
> > On Thu, 16 Jul 2020 at 9:11 AM, Vinoth Chandar 
> wrote:
> >
> > > Hi,
> > > Sorry, did not realize my response was still stuck in my outbox.
> > >
> > > At a high level, that sounds good to me. I would start with a
> rudimentary
> > > UI to begin with if possible. Having a service alone may not make this
> > very
> > > readily consumable?
> > >
> > > Other random thought is, if we can host this UI service as a part of
> the
> > > spark driver itself? In terms of deployment model - it would be nice
> > > atleast for spark streaming/DeltaStreamer continuous mode to atleast
> have
> > > UI hosted by the spark driver. This way people don’t have to run a
> > separate
> > > server per se..
> > > (we already have a timeline-server which we have not pursued actively
> as
> > a
> > > separate running service for the same reasons. )
> > >
> > > Any thoughts on this?
> > >
> > > Thanks for driving this forward
> > > Vinoth
> > >
> > >
> > >
> > > On Fri, Jul 10, 2020 at 7:02 AM Tanuj  wrote:
> > >
> > > > This is what my high level thought and design, please correct me if I
> > am
> > > > wrong.
> > > > 1) We are using Spring Shell for hudi cli and for each command we
> have
> > > > class and methods annotated with CliCommand
> > > > 2) We initiate the static file system fs once we connect to the table
> > and
> > > > then all operations interact with that fs
> > > >
> > > > On the similar lines, we can write a Spring Boot app  -
> > > > 1) Which will spin up a new microservices server and in place of
> > > > CliCommand  we will have Spring Boot end point
> > > > 2) Since microservices are stateless, we can't rely on static
> filesytem
> > > > variable fs. So in place of that we can have a
> map
> > > with
> > > > auto invalidation after specified time
> > > > 3) We will integrate this service with LDAP using Spring Security etc
> > and
> > > > authorisation at table and commands/endpoint level
> > > >
> > > > So we should be able to leverage most of the CLI code with some
> > > > modification.
> > > >
> > > > I am deferring UI as of now if we are ok with the service design but
> if
> > > we
> > > > go with the basic UI, we can just have a tree of tables on the left
> > with
> > > > all greyed out. Once user connects to the table, then relevant
> context
> > > menu
> > > > options will be enabled depending upon user authorisation. The output
> > of
> > > > the command can be printed on the right panel leveraging the CLI
> output
> > > > format.
> > > >
> > > >
> > > > On 2020/07/07 23:52:15, Vinoth Chandar  wrote:
> > > > >

Re: Expose HUDI CLI as a Service

2020-07-15 Thread tanu dua
Sure, we can go ahead with a rudimentary UI.
On running it as part of the Spark driver itself, we should first conclude
what the end goals of this service should be. I was rather thinking of
hosting it as a separate service, so that without running a Spark program I
can browse through the table metadata as I can do in the Hudi CLI; but since
the CLI is shell based and not everyone will have access to a shell, such a
service can help there.
I noticed that we start a Javalin server when we start a Spark program, but
honestly I don’t know where we use it. Do we use it in the Hudi Spark code?
Is it a good idea to access REST services from Spark code?

On Thu, 16 Jul 2020 at 9:11 AM, Vinoth Chandar  wrote:

> Hi,
> Sorry, did not realize my response was still stuck in my outbox.
>
> At a high level, that sounds good to me. I would start with a rudimentary
> UI to begin with if possible. Having a service alone may not make this very
> readily consumable?
>
> Other random thought is, if we can host this UI service as a part of the
> spark driver itself? In terms of deployment model - it would be nice
> atleast for spark streaming/DeltaStreamer continuous mode to atleast have
> UI hosted by the spark driver. This way people don’t have to run a separate
> server per se..
> (we already have a timeline-server which we have not pursued actively as a
> separate running service for the same reasons. )
>
> Any thoughts on this?
>
> Thanks for driving this forward
> Vinoth
>
>
>
> On Fri, Jul 10, 2020 at 7:02 AM Tanuj  wrote:
>
> > This is what my high level thought and design, please correct me if I am
> > wrong.
> > 1) We are using Spring Shell for hudi cli and for each command we have
> > class and methods annotated with CliCommand
> > 2) We initiate the static file system fs once we connect to the table and
> > then all operations interact with that fs
> >
> > On the similar lines, we can write a Spring Boot app  -
> > 1) Which will spin up a new microservices server and in place of
> > CliCommand  we will have Spring Boot end point
> > 2) Since microservices are stateless, we can't rely on a static filesystem
> > variable fs. So in place of that we can have a map
> with
> > auto invalidation after specified time
> > 3) We will integrate this service with LDAP using Spring Security etc and
> > authorisation at table and commands/endpoint level
> >
> > So we should be able to leverage most of the CLI code with some
> > modification.
> >
> > I am deferring UI as of now if we are ok with the service design but if
> we
> > go with the basic UI, we can just have a tree of tables on the left with
> > all greyed out. Once user connects to the table, then relevant context
> menu
> > options will be enabled depending upon user authorisation. The output of
> > the command can be printed on the right panel leveraging the CLI output
> > format.
> >
> >
> > On 2020/07/07 23:52:15, Vinoth Chandar  wrote:
> > > Nope. We can begin on a fresh slate. Feel free to even create a new
> RFC,
> > if
> > > that does not fit with what you have in mind..
> > >
> > >
> > >
> > > On Mon, Jul 6, 2020 at 6:31 AM tanu dua  wrote:
> > >
> > > > Sure me and my team can think of in contributing here. May I know if
> > > > something has already kicked off and the technologies that are used
> to
> > > > build the services and UI ?
> > > >
> > > > On Mon, 6 Jul 2020 at 5:26 PM, Vinoth Chandar 
> > wrote:
> > > >
> > > > > Hi Tanuj,
> > > > >
> > > > > Good idea to have a service/UI..  There is an inactive proposal
> > around
> > > > > this, if you want to revive and drive it forward.
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Sun, Jul 5, 2020 at 11:07 PM Tanuj 
> wrote:
> > > > >
> > > > > > Hi all,
> > > > > > HUDI CLI is a great tool but I believe the biggest limitation of
> > HUDI
> > > > CLI
> > > > > > is that you can only access it from shell and in the higher
> > > > environments
> > > > > we
> > > > > > may not get a shell to execute the commands.
> > > > > >
> > > > > > How about exposing HUDI CLI as a service backed by LDAP and with
> > all
> > > > > > proper authorisation may be as a Spring Boot service ?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > >
> > > >
> > >
> >
>
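
A minimal sketch of the proposed service shape; everything here is illustrative: the endpoint path, the cache, and TableMetadataService are hypothetical stand-ins for the reused hudi-cli command logic, and the LDAP/Spring Security wiring is omitted. The idea is one stateless REST endpoint per CLI command, with a per-table cache replacing the CLI's static FileSystem state.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HudiTableController {

  // Replaces the CLI's static FileSystem/table state; entries could be evicted on a timer.
  private final Map<String, TableMetadataService> tables = new ConcurrentHashMap<>();

  @GetMapping("/tables/{tableName}/commits")
  public Object listCommits(@PathVariable String tableName) {
    // Lazily "connect" to the table, analogous to the CLI's `connect --path ...`
    TableMetadataService svc = tables.computeIfAbsent(tableName, TableMetadataService::connect);
    return svc.listCommits(); // would mirror the output of the CLI's `commits show`
  }

  /** Hypothetical wrapper around the existing hudi-cli command classes. */
  interface TableMetadataService {
    static TableMetadataService connect(String tableName) {
      throw new UnsupportedOperationException("wire to hudi-cli command logic here");
    }

    Object listCommits();
  }
}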


Re: Expose HUDI CLI as a Service

2020-07-15 Thread tanu dua
Hi Vinoth,
Please let me know if you are OK with the high-level design or if you have
any suggestions.

Thanks.

On Fri, 10 Jul 2020 at 7:32 PM, Tanuj  wrote:

> This is what my high level thought and design, please correct me if I am
> wrong.
> 1) We are using Spring Shell for hudi cli and for each command we have
> class and methods annotated with CliCommand
> 2) We initiate the static file system fs once we connect to the table and
> then all operations interact with that fs
>
> On the similar lines, we can write a Spring Boot app  -
> 1) Which will spin up a new microservices server and in place of
> CliCommand  we will have Spring Boot end point
> 2) Since microservices are stateless, we can't rely on a static filesystem
> variable fs. So in place of that we can have a map with
> auto invalidation after specified time
> 3) We will integrate this service with LDAP using Spring Security etc and
> authorisation at table and commands/endpoint level
>
> So we should be able to leverage most of the CLI code with some
> modification.
>
> I am deferring UI as of now if we are ok with the service design but if we
> go with the basic UI, we can just have a tree of tables on the left with
> all greyed out. Once user connects to the table, then relevant context menu
> options will be enabled depending upon user authorisation. The output of
> the command can be printed on the right panel leveraging the CLI output
> format.
>
>
> On 2020/07/07 23:52:15, Vinoth Chandar  wrote:
> > Nope. We can begin on a fresh slate. Feel free to even create a new RFC,
> if
> > that does not fit with what you have in mind..
> >
> >
> >
> > On Mon, Jul 6, 2020 at 6:31 AM tanu dua  wrote:
> >
> > > Sure me and my team can think of in contributing here. May I know if
> > > something has already kicked off and the technologies that are used to
> > > build the services and UI ?
> > >
> > > On Mon, 6 Jul 2020 at 5:26 PM, Vinoth Chandar 
> wrote:
> > >
> > > > Hi Tanuj,
> > > >
> > > > Good idea to have a service/UI..  There is an inactive proposal
> around
> > > > this, if you want to revive and drive it forward.
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Sun, Jul 5, 2020 at 11:07 PM Tanuj  wrote:
> > > >
> > > > > Hi all,
> > > > > HUDI CLI is a great tool but I believe the biggest limitation of
> HUDI
> > > CLI
> > > > > is that you can only access it from shell and in the higher
> > > environments
> > > > we
> > > > > may not get a shell to execute the commands.
> > > > >
> > > > > How about exposing HUDI CLI as a service backed by LDAP and with
> all
> > > > > proper authorisation may be as a Spring Boot service ?
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > >
> >
>


Re: Expose HUDI CLI as a Service

2020-07-06 Thread tanu dua
Sure, my team and I can look into contributing here. May I know if
something has already kicked off, and which technologies are used to
build the services and UI?

On Mon, 6 Jul 2020 at 5:26 PM, Vinoth Chandar  wrote:

> Hi Tanuj,
>
> Good idea to have a service/UI..  There is an inactive proposal around
> this, if you want to revive and drive it forward.
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
>
> Thanks
> Vinoth
>
> On Sun, Jul 5, 2020 at 11:07 PM Tanuj  wrote:
>
> > Hi all,
> > HUDI CLI is a great tool, but I believe its biggest limitation is that
> > you can only access it from a shell, and in the higher environments we
> > may not get shell access to execute the commands.
> >
> > How about exposing the HUDI CLI as a service backed by LDAP, with all
> > proper authorisation, maybe as a Spring Boot service?
> >
> > Thanks.
> >
>


Re: DISCUSS code, config, design walk through sessions

2020-07-06 Thread tanu dua
+1
It will be really helpful

On Mon, 6 Jul 2020 at 11:53 AM, Shahida Khan  wrote:

> +1.
>
> *Regards,*
> *Shahida R. Khan*
> 
>
>
>
> On Mon, 6 Jul 2020 at 09:46, Gary Li  wrote:
>
> > +1. Technical deep dive will be very helpful.
> >
> > On Sun, Jul 5, 2020 at 8:30 PM Vinoth Chandar  wrote:
> >
> > > Hi all,
> > >
> > > As we scale the community, it's important that more of us are able to
> > > help users, and that users become contributors.
> > >
> > > In the past, we have drafted FAQs and troubleshooting guides. But I feel
> > > that, at times, more hands-on walk-through sessions over video could help.
> > >
> > > I am happy to spend 2 hours each on code/configs and
> > > design/perf/architecture, and have the sessions recorded for the future.
> > >
> > > What does everyone think?
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


Re: Prometheus Support

2020-06-30 Thread tanu dua
Thanks Vinoth. Will wait for it

On Tue, 30 Jun 2020 at 9:54 PM, Vinoth Chandar  wrote:

> Hi Tanu,
>
> It's in progress and will probably make its way into 0.6.0:
> https://github.com/apache/hudi/pull/1726
>
> Thanks
> Vinoth
>
> On Tue, Jun 30, 2020 at 7:10 AM Tanuj  wrote:
>
> > Hi,
> > Do we have any plan to support Prometheus? I can see Graphite and
> > Datadog.
> > Thanks.
> >
>
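For anyone planning ahead: once the pushgateway reporter from that PR lands, enabling it would presumably look something like the sketch below. The key names are assumptions based on that work and may differ in the released version; the host and port are placeholders.

import java.util.HashMap;
import java.util.Map;

public class PrometheusMetricsSketch {
  // Returns writer options that would switch Hudi's metrics on and point them
  // at a Prometheus pushgateway; pass each entry as .option(k, v) on the writer.
  public static Map<String, String> metricsOptions() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.metrics.on", "true");
    opts.put("hoodie.metrics.reporter.type", "PROMETHEUS_PUSHGATEWAY");
    opts.put("hoodie.metrics.pushgateway.host", "pushgateway.internal");  // placeholder host
    opts.put("hoodie.metrics.pushgateway.port", "9091");                  // placeholder port
    return opts;
  }
}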


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

2020-06-04 Thread tanu dua
Thanks a lot Vinoth for your suggestion. I will look into it.

On Thu, 4 Jun 2020 at 10:15 AM, Vinoth Chandar  wrote:

> This is a good conversation. The ask for bucketed-table support has not
> actually come up much, since if you are looking up things at that
> granularity, it almost feels like you are doing OLTP/database-like queries?
>
> Assuming you hash the primary key into a value that denotes the partition,
> a simple workaround is to always add a where clause using a UDF in
> Presto, i.e. where key = 123 and partition = hash_udf(123) (see the
> sketch below).
>
> But of course the downside is that your ops team needs to remember to add
> the second partition clause (which is not very different from querying
> large time-partitioned tables today).
>
> Our mid-term plan is to build out column indexes (RFC-15 has the details,
> if you are interested).
>
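A rough, purely illustrative sketch of that workaround: the column names (id, ts), the bucket count, the table name and the query-side hash_udf are all assumptions, and the UDF must reproduce the write-side hash exactly for the pruning to be correct.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.*;

public class HashBucketPartitionSketch {
  public static void write(Dataset<Row> df, String basePath) {
    // Derive a hash-bucket column at write time and use it as the Hudi partition path.
    Dataset<Row> withBucket =
        df.withColumn("bucket", pmod(hash(col("id")), lit(1024)));   // 1024 buckets, arbitrary

    withBucket.write().format("org.apache.hudi")
        .option("hoodie.table.name", "my_table")                           // assumed table name
        .option("hoodie.datasource.write.recordkey.field", "id")           // assumed key column
        .option("hoodie.datasource.write.precombine.field", "ts")          // assumed column
        .option("hoodie.datasource.write.partitionpath.field", "bucket")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
// Query side (Presto/Hive), as described above -- hash_udf is a hypothetical UDF
// that must mirror the write-side hash and bucket count:
//   SELECT * FROM my_table WHERE id = 123 AND bucket = hash_udf(123);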
> On Wed, Jun 3, 2020 at 2:54 AM tanu dua  wrote:
>
> > If I need to plugin this hashing algorithm to resolve the partitions in
> > Presto and hive what is the code I should look into ?
> >
> > On Wed, Jun 3, 2020, 12:04 PM tanu dua  wrote:
> >
> > > Yes, that's also on the cards, and for developers that's OK, but we need
> > > to provide an interface for our ops people to execute queries from
> > > Presto. So I need to figure out, if they fire a query on the primary key,
> > > how I can calculate the hash. They can fire a query including the primary
> > > key along with other fields. That is the only problem I see with hash
> > > partitions, and to get it to work I believe I need to go deeper into the
> > > Presto Hudi plugin.
> > >
> > > On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah 
> > > wrote:
> > >
> > >> Hi Tanu,
> > >>
> > >> If your primary key is an integer, you can add one more field as a hash
> > >> of that integer and partition on the hash field. It adds some complexity
> > >> to reads and writes because the hash has to be computed prior to each
> > >> read or write.
> > >> I am not sure whether the overhead of doing this exceeds the performance
> > >> gains from having fewer partitions. I wonder why HUDI doesn't directly
> > >> support hash-based partitions?
> > >>
> > >> Thanks
> > >> Jaimin
> > >>
> > >> On Wed, 3 Jun 2020 at 10:07, tanu dua  wrote:
> > >>
> > >> > Thanks Vinoth for the detailed explanation. I was thinking along the
> > >> > same lines and will take another look. We can reduce the 2nd and 3rd
> > >> > partitions, but it's very difficult to reduce the 1st partition, as that
> > >> > is the basic primary key of our domain model, on which analysts and
> > >> > developers need to query almost 90% of the time; it's an integer primary
> > >> > key and can't be decomposed further.
> > >> >
> > >> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar 
> > >> wrote:
> > >> >
> > >> > > Hi tanu,
> > >> > >
> > >> > > For good query performance, it's recommended to write optimally sized
> > >> > > files. Hudi already ensures that.
> > >> > >
> > >> > > Generally speaking, if you have too many partitions, then you also have
> > >> > > too many files. Most people limit themselves to thousands of partitions
> > >> > > in their datasets, since queries typically crunch data based on time or
> > >> > > a business domain (e.g. city for Uber). Partitioning too granularly -
> > >> > > say, based on user_id - is not very useful unless your queries only
> > >> > > crunch per user. And if you are using the Hive metastore, 25M partitions
> > >> > > mean 25M rows in your backing MySQL metastore db as well - not very
> > >> > > scalable.
> > >> > >
> > >> > > What I am trying to say is: even outside of Hudi, if analytics is your
> > >> > > use case, it might be worth partitioning at a coarser granularity and
> > >> > > increasing the number of rows per parquet file (see the config sketch
> > >> > > below).
> > >> > >
> > >> > > Thanks
> > >> > > Vinoth
> > >> > >
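An illustrative sketch of that advice: partition on a coarse column and let Hudi pack parquet files toward a larger target size. The column names, table name and sizes are assumptions; the two parquet size options take byte values.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class CoarsePartitionSketch {
  public static void write(Dataset<Row> df, String basePath) {
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "my_table")                                // assumed name
        .option("hoodie.datasource.write.recordkey.field", "id")                // assumed key column
        .option("hoodie.datasource.write.precombine.field", "ts")               // assumed column
        .option("hoodie.datasource.write.partitionpath.field", "ingest_date")   // coarse partition
        // Target ~128 MB files and treat anything under ~100 MB as a small file to top up.
        .option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024))
        .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
        .mode(SaveMode.Append)
        .save(basePath);
  }
}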
> > >> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj 
> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > > We have a requirement to ingest 30M records into S3 backed by HUDI.
> > >> > > > I am figuring out the partition strategy and ending up with a lot of
> > >> > > > partitions - roughly 25M (primary partition) --> 2.5M (secondary
> > >> > > > partition) --> 2.5M (third partition) - and each parquet file will
> > >> > > > hold fewer than 10 rows of data.
> > >> > > >
> > >> > > > Our dataset will be ingested in full at once and then incrementally
> > >> > > > each day with fewer than 1k updates. So it is more read-heavy than
> > >> > > > write-heavy.
> > >> > > >
> > >> > > > So what is the suggestion in terms of HUDI performance - go ahead
> > >> > > > with the above partition strategy, or should I reduce the number of
> > >> > > > partitions and increase the number of rows in each parquet file?
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>


Re: Suggestion needed - Hudi performance wrt no. and depth of partitions

2020-06-03 Thread tanu dua
If I need to plug this hashing algorithm in to resolve the partitions in
Presto and Hive, what code should I look into?

On Wed, Jun 3, 2020, 12:04 PM tanu dua  wrote:

> Yes that’s also on cards and for developers that’s ok but we need to
> provide an interface to our ops people to execute the queries from presto
> so I need to find out if they fire a query on primary key how can I
> calculate the hash. They can fire a query including primary key with other
> fields. So that is the only problem I see in hash partitions and to get if
> work I believe I need to go deeper into presto Hudi plugin
>
> On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah 
> wrote:
>
>> Hi Tanu,
>>
>> If your primary key is integer you can add one more field as hash of
>> integer and partition based on hash field. It will add some complexity to
>> read and write because hash has to be computed prior to each read or
>> write.
>> Not whether overhead of doing this exceeds performance gains due to less
>> partitions. I wonder why HUDI don't directly support hash based
>> partitions?
>>
>> Thanks
>> Jaimin
>>
>> On Wed, 3 Jun 2020 at 10:07, tanu dua  wrote:
>>
>> > Thanks Vinoth for detailed explanation. Even I was thinking on the same
>> > lines and I will relook. We can reduce the 2nd and 3rd partition but
>> it’s
>> > very difficult to reduce the 1st partition as that is the basic primary
>> key
>> > of our domain model on which analysts and developers need to query
>> almost
>> > 90% of time and its an integer primary key and can’t be decomposed
>> further.
>> >
>> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar 
>> wrote:
>> >
>> > > Hi tanu,
>> > >
>> > > For good query performance, its recommended to write optimally sized
>> > files.
>> > > Hudi already ensures that.
>> > >
>> > > Generally speaking, if you have too many partitions, then it also
>> means
>> > too
>> > > many files. Mostly people limit to 1000s of partitions in their
>> datasets,
>> > > since queries typically crunch data based on time or a business_domain
>> > (e.g
>> > > city for uber)..  Partitioning too granular - say based on user_id -
>> is
>> > not
>> > > very useful unless your queries only crunch per user.. if you are
>> using
>> > > Hive metastore then 25M partitions mean 25M rows in your backing mysql
>> > > metastore db as well - not very scalable.
>> > >
>> > > What I am trying to say is : even outside of Hudi, if analytics is
>> your
>> > use
>> > > case, might be worth partitioning at lower granularity and increase
>> rows
>> > > per parquet file.
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj  wrote:
>> > >
>> > > > Hi,
>> > > > We have a requirement to ingest 30M records in S3 backed up by
>> HUDI. I
>> > am
>> > > > figuring out the partition strategy and ending up with lot of
>> > partitions
>> > > > like 25M partitions (primary partition) --> 2.5 M (secondary
>> partition)
>> > > -->
>> > > > 2.5 M (third partition) and each parquet file will have the records
>> > with
>> > > > less than 10 rows of data.
>> > > >
>> > > > Our dataset will be ingested at once in full and then it will be
>> > > > incremental daily with less than 1k updates. So its more read heavy
>> > > rather
>> > > > than write heavy
>> > > >
>> > > > So what should be the suggestion in terms of HUDI performance - go
>> > ahead
>> > > > with the above partition strategy or shall I reduce my partitions
>> and
>> > > > increase  no of rows in each parquet file.
>> > > >
>> > >
>> >
>>
>


Re: Query Incremental Updates on same primary key

2020-05-29 Thread tanu dua
Yes, I followed those docs and wrote the queries accordingly.
I believe the difference is the primary key selection: in the examples
below, the primary key is always unique (like a UUID), which means every
ingestion is an insert, and hence both old and new records end up in the
latest parquet file.
In my case the primary key is not always unique, so an update is triggered
and the new file contains the updated value, not the old one.

But I can try again if you believe the incremental query scans through all
the parquet files and not just the latest one.

On Fri, 29 May 2020 at 10:48 PM, Satish Kotha 
wrote:

> Hi,
>
>
> > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as its in
> > old parquet file. So doesn't incremental query run on old parquet files ?
> >
>
> Could you share the command you are using for incremental query?  Specific
> config is required by hoodie for doing incremental queries. Please see the
> example here:
> https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql
> and more documentation here. Please try this and let me know if it works as
> expected.
>
> Thanks
> Satish
>
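For reference, a minimal incremental-read sketch along the lines of the linked docs: the base path and begin instant are placeholders, and older releases used the key hoodie.datasource.view.type instead of hoodie.datasource.query.type.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadSketch {
  // Returns the latest value of every record written after beginInstant,
  // e.g. (1 | Mickey) if key 1 was updated in a later commit -- not the older (1 | Tom).
  public static Dataset<Row> changesSince(SparkSession spark, String basePath, String beginInstant) {
    return spark.read().format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", beginInstant)   // e.g. "20200529000000"
        .load(basePath);
  }
}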
> On Fri, May 29, 2020 at 5:18 AM tanujdua  wrote:
>
> > Hi,
> > We have a requirement to keep an audit history of every change, and we
> > sometimes query it as well. In an RDBMS we have separate audit_history
> > tables. In HUDI, however, history is created at every ingestion and I
> > want to leverage that, so I have a question about incremental queries.
> > Does an incremental query run on the latest parquet file or on all the
> > parquet files in the partition? I can see it runs only on the latest
> > parquet file.
> >
> > Let me illustrate more what we need. For eg we have data with 2 columns -
> > (id | name) where id is the primary key.
> >
> > Batch 1 -
> > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > A new parquet file is created, say 1.parquet, with these 2 entries
> >
> > Batch 2 -
> > Inserted 2 records --> 1 | Mickey ; 3 | Donald. So here the record with
> > primary key 1 is updated from Tom to Mickey.
> > A new parquet file is created, say 2.parquet, with the following entries -
> > 1 | Mickey (Record Updated)
> > 2 | Jerry (Record Not changed and retained)
> > 3 | Donald (New Record)
> >
> > Now, when I query, I get (1 | Mickey) but I never get (1 | Tom), as it is
> > in the old parquet file. So doesn't the incremental query run on old
> > parquet files?
> >
> > I can use plain vanilla Spark to achieve this, but is there any better way
> > to get the audit history of updated rows using HUDI?
> > 1) Using Spark I can read all parquet files (without Hoodie) -
> > spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName() +
> > "//*//*//*.parquet");
> >
> >
> >
> >
>


Re: Rollback to previous version for COW

2020-05-21 Thread tanu dua
Thanks, it worked.

On Thu, 21 May 2020 at 2:08 AM, Vinoth Chandar  wrote:

> Hi Tanu,
>
> You should be able to use the CLI
> https://hudi.apache.org/docs/deployment.html#cli
> and perform the rollback/restore. Have you given this a shot?
>
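For anyone following along, the CLI session for this looks roughly like the following (command names from memory, the paths and instant time are placeholders; see the deployment page linked above for the exact syntax):

hudi-cli.sh
connect --path s3://my-bucket/path/to/table      # hypothetical base path
commits show                                     # identify the bad commit instant
commit rollback --commit 20200520101530          # placeholder instant time to roll back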
> On Wed, May 20, 2020 at 6:43 AM tanujdua  wrote:
>
> > I provided the wrong schema while appending data, and that has corrupted
> > my Hudi table.
> > So is there a way I can roll back to the previous version, which has good
> > data, and delete my current version?
> >
> > Thanks,
> > Tanu
> >
>