Re: HUDI Table Primary Key - UUID or Custom For Better Performance

2020-10-21 Thread tanu dua
Thanks, got it. Unfortunately it’s not very straightforward for me to
provide ordered keys. So far I am getting decent write performance, so I
will revisit this if required.

On Wed, 21 Oct 2020 at 7:45 AM, Vinoth Chandar <
mail.vinoth.chan...@gmail.com> wrote:

> For now, bloom filters are not actually leveraged in the read/query path,
> but only by the writer performing the index lookup for upserting. Hudi is
> write-optimized like an OLTP store and read-optimized like an OLAP store,
> if that makes sense.
>
> As for bloom index performance, our tuning guide and FAQ talk about this.
> If you eventually want to support de-duplication, say, it might be good to
> pick a key that is ordered. With something like a _hoodie_seq_no that keeps
> increasing with new commits, the bloom indexing mechanism will also be able
> to do range pruning effectively, improving performance significantly. Pure
> UUID keys are not very conducive to range pruning, i.e., files written
> during each commit will overlap in key range with almost every other file.
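> To sketch the idea (this is not a Hudi API - the sequence number would
> come from your own pipeline): a zero-padded, monotonically increasing
> prefix makes keys from later commits sort after earlier ones, so each
> file covers a narrow key range:
>
>   // Hypothetical ordered key: zero-padded sequence number + domain key.
>   // Later commits produce strictly larger prefixes, so files rarely
>   // overlap in key range and bloom range pruning can skip most of them.
>   def orderedKey(seqNo: Long, domainKey: String): String =
>     f"$seqNo%019d" + "_" + domainKey
>
>   orderedKey(42L, "ABC12") // => "0000000000000000042_ABC12"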
>
> Thanks
> Vinoth
>
> On Fri, Oct 16, 2020 at 8:42 PM Tanuj  wrote:
>
> > Thanks Prashant. To answer your questions -
> > 1) Yes, each key is around 5-8 alphanumeric characters, but since it is
> > a composite key of 3 domain keys, I believe its total size will be
> > almost equal to a UUID.
> > 4) That's the business need. We need to keep a track/audit of every
> > insertion of a new record. We had 2 options: update the existing record
> > and keep an audit table for old records, or keep pushing into the same
> > table with a timestamp so that it always works in append mode. We chose
> > Option 2.
> > 5) That's exactly what I want to understand: how will bloom filters be
> > useful here? And in general, is the bloom filter used in Hudi for reads?
> > I understand the write process where it is used, but is it used for
> > reads as well? I believe that after picking the correct parquet file,
> > Hudi delegates the read to Spark. Please correct me if I am wrong here.
> > 6) We will only query on the domain object keys, excluding create_date.
> >
> > On 2020/10/16 18:53:21, Prashant Wason  wrote:
> > > Hi Tanu,
> > >
> > > Some points to consider:
> > > 1. A UUID is fixed size, compared to domain_object_keys (I don't know
> > > their size). Smaller keys will reduce the storage requirements.
> > > 2. UUIDs don't compress. Your domain object keys may compress better.
> > > 3. From the bloom filter perspective, I don't think there is any
> > > difference unless the size difference of the keys is very large.
> > > 4. If the domain object keys are already unique, what is the use of
> > > suffixing the create_date?
> > > 5. If you query by "primary key minus timestamp", the entire record key
> > > column will have to be read to match it. So bloom filters won't be
> > > useful here.
> > > 6. What do the domain object keys look like? Are they going to be
> > > included in any other field in the record? Would you ever want to
> > > query on domain object keys?
> > >
> > > Thanks
> > > Prashant
> > >
> > >
> > > On Thu, Oct 15, 2020 at 8:21 PM tanu dua  wrote:
> > >
> > > > The read query pattern will be (partition key + primary key minus
> > > > timestamp), where my primary key is domain keys + timestamp.
> > > >
> > > > The read/write mix varies per dataset, but mostly all the tables are
> > > > read from and written to frequently and equally.
> > > >
> > > > Reads will mostly be done by providing the partitions and not by a
> > > > blanket query.
> > > >
> > > > If we have to choose between read and write, I will choose write,
> > > > but I want to stick only with the COW table.
> > > >
> > > > Please let me know if you need more information.
> > > >
> > > >
> > > > On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan  wrote:
> > > >
> > > > > Can you give us a sense of what your read workload looks like?
> > > > > Depending on that, read perf could vary.
> > > > >
> > > > > On Thu, Oct 15, 2020 at 4:06 AM Tanuj  wrote:
> > > > >
> > > > > > Hi all,
> > > > > > We don't have an "UPDATE" use case and all ingested rows will be
> > > > > > "INSERT"s, so what is the best way to define the primary key? As
> > > > > > of now, we have designed the primary key as per the domain object
> > > > > > with create_date, which is - ,,
> > > > > >
> > > > > > Since it's always an INSERT for us, I can potentially use a UUID
> > > > > > as well.
> > > > > >
> > > > > > We use keys for the bloom index in Hudi, so I just wanted to know
> > > > > > whether I would get better write performance with a UUID vs. the
> > > > > > composite domain keys.
> > > > > >
> > > > > > I believe reads are not impacted by the choice of primary key, as
> > > > > > it is not being considered there?
> > > > > >
> > > > > > Please suggest.
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> >
>


Re: Bucketing in Hudi

2020-10-21 Thread Balaji Varadarajan
Hudi supports pluggable indexing (HoodieIndex), and the phases of index lookup
are nicely abstracted out. We have a Jira for supporting bucket indexing:
https://issues.apache.org/jira/browse/HUDI-55
You can get bucket indexing done by implementing that interface, along with
additional changes for handling initial writes to the partition and for
storing the bucketing information, which IMO is not significant work. If you
are interested in contributing, we would be happy to help guide and land the
change.
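For intuition, the core of a bucket index is a deterministic hash of the
record key to a fixed bucket (and hence file group), in the spirit of Hive
bucketing. A minimal sketch of the idea - not the actual HoodieIndex
interface:

  // Sketch of the bucket-index idea (not the HoodieIndex API): a record key
  // always hashes to the same bucket, so index lookups and queries on "id"
  // only need to touch one file group per partition.
  def bucketFor(recordKey: String, numBuckets: Int): Int =
    (recordKey.hashCode & Integer.MAX_VALUE) % numBuckets

  val bucket = bucketFor("id-12345", 256) // deterministic file-group choice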
Thanks,
Balaji.V



On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy  wrote:

Hello Hudi team,

We have a requirement to compact data on S3, but we need bucketing on top of
compaction so that at query time, only the files relevant to the "id" in the
query would be scanned. We are told that bucketing is not currently supported
in Hudi. Is it possible to extend Hudi to support it? What does it take to
extend the framework in order to do this?

We are trying to analyze, from a timeline perspective, whether this is an
option to consider, and need your help in analyzing and planning for it.

Thanks,
Roopa



  

Re: Deleting Hudi Partitions

2020-10-21 Thread Satish Kotha
Yes, that would work. You would typically add the below option on the
dataframe to use insert overwrite (InsertOverwrite is a new API; I haven't
updated the documentation yet).

   - hoodie.datasource.write.operation: insert_overwrite
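
For reference, a minimal Spark datasource write using this option might look
like the following sketch (the table name, path, and key/partition fields are
illustrative placeholders, not from this thread):

  // Sketch: overwrite the partitions touched by this batch with its contents.
  // my_table, id, date, and the s3 path below are placeholders.
  df.write.format("hudi").
    option("hoodie.datasource.write.operation", "insert_overwrite").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "date").
    option("hoodie.table.name", "my_table").
    mode("append").
    save("s3://bucket/path/to/table")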


Let me know if you have any questions.

@Balaji Thanks for creating the follow-up ticket. I agree this can be
supported in a much simpler way using the insert_overwrite primitive.

On Wed, Oct 21, 2020 at 6:19 PM Balaji Varadarajan  wrote:

> cc Satish, who implemented Insert Overwrite support.
> We have recently landed Insert Overwrite support in Hudi. Partition-level
> deletion is a logical extension of this feature but is not yet available.
> I have added a jira to track this:
> https://issues.apache.org/jira/browse/HUDI-1350
> Meanwhile, using the master branch, you can do this in 2 steps. You can
> generate a record for each partition you want to delete and commit the
> batch. This would essentially truncate the partition to 1 record. You can
> then issue a hard delete on that record. By keeping the cleaner retention
> at 1, you can essentially clean up the files in the directory. Satish -
> can you chime in and see if this makes sense, and whether you see any
> issues with this?
> Thanks,
> Balaji.V
> On Tuesday, October 20, 2020, 11:31:45 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
> Team,
>
> I have a COW table which has sub-partition columns
> Date/Hour. For some of the use cases, I need to totally remove a few
> partitions (removing a few hours alone). Hudi maintains metadata info.
> Manually removing folders, as well as entries in the Hive metastore, may
> mess up Hudi metadata. What is the best way to do this?
>
>
> Thanks,
> Selva
>


Re: Deleting Hudi Partitions

2020-10-21 Thread Balaji Varadarajan
 
Fixing Satish's incorrect email address.

On Wednesday, October 21, 2020, 06:19:43 PM PDT, Balaji Varadarajan  wrote:

cc Satish, who implemented Insert Overwrite support.
We have recently landed Insert Overwrite support in Hudi. Partition-level
deletion is a logical extension of this feature but is not yet available.
I have added a jira to track this:
https://issues.apache.org/jira/browse/HUDI-1350
Meanwhile, using the master branch, you can do this in 2 steps. You can
generate a record for each partition you want to delete and commit the batch.
This would essentially truncate the partition to 1 record. You can then issue
a hard delete on that record. By keeping the cleaner retention at 1, you can
essentially clean up the files in the directory. Satish - can you chime in
and see if this makes sense, and whether you see any issues with this?
Thanks,
Balaji.V
On Tuesday, October 20, 2020, 11:31:45 PM PDT, selvaraj periyasamy  wrote:
Team,

I have a COW table which has sub-partition columns
Date/Hour. For some of the use cases, I need to totally remove a few
partitions (removing a few hours alone). Hudi maintains metadata info.
Manually removing folders, as well as entries in the Hive metastore, may mess
up Hudi metadata. What is the best way to do this?


Thanks,
Selva

Bucketing in Hudi

2020-10-21 Thread Roopa Murthy
Hello Hudi team,

We have a requirement to compact data on S3, but we need bucketing on top of
compaction so that at query time, only the files relevant to the "id" in the
query would be scanned. We are told that bucketing is not currently supported
in Hudi. Is it possible to extend Hudi to support it? What does it take to
extend the framework in order to do this?

We are trying to analyze, from a timeline perspective, whether this is an
option to consider, and need your help in analyzing and planning for it.

Thanks,
Roopa





Re: Deleting Hudi Partitions

2020-10-21 Thread Balaji Varadarajan
cc Satish, who implemented Insert Overwrite support.
We have recently landed Insert Overwrite support in Hudi. Partition-level
deletion is a logical extension of this feature but is not yet available.
I have added a jira to track this:
https://issues.apache.org/jira/browse/HUDI-1350
Meanwhile, using the master branch, you can do this in 2 steps. You can
generate a record for each partition you want to delete and commit the batch.
This would essentially truncate the partition to 1 record. You can then issue
a hard delete on that record. By keeping the cleaner retention at 1, you can
essentially clean up the files in the directory. Satish - can you chime in
and see if this makes sense, and whether you see any issues with this?
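To make the 2 steps concrete, here is a minimal sketch with the Spark
datasource, assuming the insert_overwrite support on master (the table name,
path, and key/partition fields are illustrative placeholders):

  // Step 1: insert_overwrite each target partition with a single dummy
  // record, which truncates the partition down to that one record.
  singleRecordDf.write.format("hudi").
    option("hoodie.datasource.write.operation", "insert_overwrite").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "date_hour").
    option("hoodie.table.name", "my_table").
    mode("append").
    save("s3://bucket/path/to/table")

  // Step 2: hard-delete that dummy record; with cleaner retention kept at 1,
  // the cleaner then removes the remaining files under the partition.
  singleRecordDf.write.format("hudi").
    option("hoodie.datasource.write.operation", "delete").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "date_hour").
    option("hoodie.table.name", "my_table").
    mode("append").
    save("s3://bucket/path/to/table")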
Thanks,
Balaji.V
On Tuesday, October 20, 2020, 11:31:45 PM PDT, selvaraj periyasamy  wrote:
Team,

I have a COW table which has sub-partition columns
Date/Hour. For some of the use cases, I need to totally remove a few
partitions (removing a few hours alone). Hudi maintains metadata info.
Manually removing folders, as well as entries in the Hive metastore, may mess
up Hudi metadata. What is the best way to do this?


Thanks,
Selva

Deleting Hudi Partitions

2020-10-21 Thread selvaraj periyasamy
Team,

I have a COW table which has sub-partition columns
Date/Hour. For some of the use cases, I need to totally remove a few
partitions (removing a few hours alone). Hudi maintains metadata info.
Manually removing folders, as well as entries in the Hive metastore, may mess
up Hudi metadata. What is the best way to do this?


Thanks,
Selva