Re: [EXT] Re: Bucketing in Hudi

2020-10-22 Thread Balaji Varadarajan
 Hi Roopa,
Bucketing is a more general concept. I think what you are referring to is how 
to integrate with Spark SQL's bucketing syntax. I was proposing a Hudi-native 
solution in which we implement bucket indexing. It gives the same end result: 
compacted (Parquet) files are written with keys hashed to a bucket id. You can 
then use Hudi's Spark data source integration to write to this table and 
get a bucketized layout.
Let me know if this makes sense. 
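
A minimal sketch of the idea described above, assuming nothing about Hudi's 
actual implementation: each record key is hashed to a stable bucket id, so an 
incoming update can be routed to its bucket without looking anything up. The 
names NUM_BUCKETS, bucket_id and upsert are hypothetical, and CRC32 is only a 
stand-in for whatever hash a real index would use.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, fixed when the table is created


def bucket_id(record_key, num_buckets=NUM_BUCKETS):
    """Map a record key to a stable bucket id in [0, num_buckets)."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def upsert(table, records, num_buckets=NUM_BUCKETS):
    """Insert or update records in `table`, a dict of bucket id -> {key: record}."""
    for rec in records:
        b = bucket_id(rec["id"], num_buckets)
        # The same key always lands in the same bucket, so an update
        # overwrites the earlier version in place.
        table.setdefault(b, {})[rec["id"]] = rec
    return table


table = upsert({}, [{"id": "k1", "v": 1}, {"id": "k2", "v": 2}])
upsert(table, [{"id": "k1", "v": 9}])  # later write for an existing key
```

After the second call the table still holds two records, and the record for 
"k1" in its bucket carries the updated value.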

Thanks,
Balaji.V
On Thursday, October 22, 2020, 05:23:11 PM PDT, Roopa Murthy 
 wrote:  
 
 Hi Balaji,


Thanks for your response. I went through HoodieIndex in the source code, but I 
am not sure how indexing alone could help with bucketing.

Spark bucketing would involve writing the compacted files in a bucketed/clustered 
fashion, so that when a Spark SQL query filters on a certain id, only the 
bucket (file) that the id hashes to is scanned for matching records. 
This means the data has to be written during compaction using Spark’s saveAsTable 
API with bucketBy set to the desired number of buckets. See: 
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html
This creates a Spark bucketed table whose metadata differs from that of Hive 
bucketed tables, because Spark’s bucket hashing is not compatible with Hive’s.

Is this something that Hudi might support?

Thanks,
Roopa

From: Balaji Varadarajan 
Date: Wednesday, October 21, 2020 at 9:01 PM
To: "dev@hudi.apache.org" 
Cc: DL-AIE 
Subject: [EXT] Re: Bucketing in Hudi

Hudi supports pluggable indexing (HoodieIndex), and the phases of index lookup 
are nicely abstracted out. We have a Jira for supporting bucket indexing: 
https://issues.apache.org/jira/browse/HUDI-55

You can get bucket indexing done by implementing that interface, along with 
additional changes for handling initial writes to the partition and for 
bucketing information, which IMO are not significant. If you are interested in 
contributing, we would be happy to guide you and help land the change.

Thanks,
Balaji.V




On Wednesday, October 21, 2020, 07:51:07 PM PDT, Roopa Murthy 
 wrote:


Hello Hudi team,

We have a requirement to compact data on S3, but we need bucketing on top of 
compaction so that at query time only the files relevant to the "id" in the 
query are scanned. We were told that bucketing is not currently supported 
in Hudi. Is it possible to extend Hudi to support it? What would it take to 
extend the framework to do this?

We are trying to assess, from a timeline perspective, whether this is an option 
to consider, and we need your help in analyzing and planning for it.

Thanks,
Roopa



  
