JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53
-Nishith

On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <nagar...@apache.org> wrote:

> All,
>
> Currently, Hudi supports partitioned and non-partitioned datasets. A
> partitioned dataset is one which bucketizes groups of files (data) into
> buckets called partitions. A hudi dataset may be composed of N number of
> partitions with M number of files. This structure helps canonical
> hive/presto/spark queries to limit the amount of data read by using the
> partition as a filter. The value of the partition/bucket in most cases is
> derived from the incoming data itself. The requirement is that once a
> record is mapped to a partition/bucket, this mapping should be a) known to
> hudi b) should remain constant for the lifecycle of the dataset for hudi to
> perform upserts on them. Consequently, in a non-partitioned dataset one can
> think of this problem as a record key <-> file id mapping that is required
> for hudi to be able to perform upserts on a record.
>
> Current solution is either a) for the client/user to provide the correct
> partition value as part of the payload or b) use a GlobalBloomIndex
> implementation to scan all the files under a given path (say
> non-partitioned table). In both these cases, we are limited either by the
> capability of the user to provide this information or by the performance
> overhead of scanning all files' bloom index.
>
> I'm proposing a new design, naming it global index, that is a mapping of
> (recordKey <-> fileId). This mapping will be stored and maintained by Hudi
> as another implementation of HoodieIndex and will address the 2 limitations
> mentioned above. I'd like to see if there are other community members
> interested in this project. I will send out a HIP shortly describing more
> details around the need and architecture of this.
>
> Thanks,
> Nishith
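
To make the idea in the quoted proposal a bit more concrete, here is a minimal sketch of what a (recordKey -> fileId) mapping might look like. This is not Hudi code and not the planned design: the class name InMemoryGlobalIndex and its methods tag and commit are hypothetical, the mapping lives in a plain in-memory dict rather than in storage maintained by Hudi, and only the recordKey -> fileId direction is modeled. It only illustrates the two requirements from the proposal: the mapping is known to the writer, and it stays constant once assigned.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class InMemoryGlobalIndex:
    """Hypothetical stand-in for a global (recordKey -> fileId) index."""

    def __init__(self) -> None:
        # record key -> id of the file that holds the record
        self._key_to_file: Dict[str, str] = {}

    def lookup(self, record_key: str) -> Optional[str]:
        """Return the file id for a key, or None if the key is unknown."""
        return self._key_to_file.get(record_key)

    def tag(self, record_keys: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
        """Split incoming keys into updates (known keys, with their file id)
        and inserts (keys the index has not seen yet)."""
        updates: Dict[str, str] = {}
        inserts: List[str] = []
        for key in record_keys:
            file_id = self._key_to_file.get(key)
            if file_id is None:
                inserts.append(key)
            else:
                updates[key] = file_id
        return updates, inserts

    def commit(self, record_key: str, file_id: str) -> None:
        """Record where a key was written. The mapping must stay constant,
        so remapping an existing key to a different file is rejected."""
        existing = self._key_to_file.setdefault(record_key, file_id)
        if existing != file_id:
            raise ValueError(
                f"{record_key!r} is already mapped to {existing!r}, "
                f"cannot remap to {file_id!r}"
            )


index = InMemoryGlobalIndex()
index.commit("user-1", "file-a")
updates, inserts = index.tag(["user-1", "user-2"])
```

With this sketch, an upsert for "user-1" is routed to "file-a" by a single lookup, with no partition value supplied by the client and no scan over every file's bloom filter, while "user-2" is treated as a new insert.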