JIRA: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53

-Nishith



On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal <nagar...@apache.org> wrote:

> All,
>
> Currently, Hudi supports partitioned and non-partitioned datasets. A
> partitioned dataset is one that groups files (data) into buckets called
> partitions. A Hudi dataset may be composed of N partitions with M files.
> This structure helps canonical Hive/Presto/Spark queries limit the amount
> of data read by using the partition as a filter. The value of the
> partition/bucket in most cases is derived from the incoming data itself.
> The requirement is that once a record is mapped to a partition/bucket,
> this mapping should a) be known to Hudi and b) remain constant for the
> lifecycle of the dataset, so that Hudi can perform upserts on it.
> Consequently, in a non-partitioned dataset one can think of this problem
> as a record key <-> file id mapping that is required for Hudi to be able
> to perform upserts on a record.
>
> The current solution is either a) for the client/user to provide the
> correct partition value as part of the payload or b) to use a
> GlobalBloomIndex implementation to scan all the files under a given path
> (say, for a non-partitioned table). In both cases, we are limited either
> by the user's ability to provide this information or by the performance
> overhead of scanning the bloom index of every file.
>
> I'm proposing a new design, called the global index, that maintains a
> mapping of (recordKey <-> fileId). This mapping will be stored and
> maintained by Hudi as another implementation of HoodieIndex and will
> address the two limitations mentioned above. I'd like to see if there are
> other community members interested in this project. I will send out a HIP
> shortly describing the need for and architecture of this in more detail.
>
> Thanks,
> Nishith
>
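
To make the proposed (recordKey <-> fileId) mapping concrete, here is a
minimal in-memory sketch. It is an illustration only, not the Hudi API: the
names RecordLocation, InMemoryGlobalIndex, lookup, tag and update are
hypothetical, and a real HoodieIndex implementation would need persistent,
scalable storage, which the proposal above does not yet specify.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RecordLocation:
    """Hypothetical location of a record: its partition and file id."""
    partition_path: str
    file_id: str


class InMemoryGlobalIndex:
    """Toy global index: one record key maps to exactly one location.

    Assumption: the mapping lives in a plain dict. This only illustrates
    the lookup semantics, not how Hudi would store the index.
    """

    def __init__(self) -> None:
        self._locations: Dict[str, RecordLocation] = {}

    def lookup(self, record_key: str) -> Optional[RecordLocation]:
        # A direct key lookup replaces scanning every file's bloom index.
        return self._locations.get(record_key)

    def tag(self, record_keys: Iterable[str]) -> List[Tuple[str, Optional[RecordLocation]]]:
        # Known keys come back with a location (treat as updates);
        # unknown keys come back with None (treat as inserts).
        return [(key, self.lookup(key)) for key in record_keys]

    def update(self, record_key: str, location: RecordLocation) -> RecordLocation:
        # The mapping must stay constant for the dataset's lifecycle,
        # so the first location recorded for a key always wins.
        return self._locations.setdefault(record_key, location)


index = InMemoryGlobalIndex()
index.update("uuid-1", RecordLocation("2019/03/26", "file-a"))

# "uuid-1" is already indexed; "uuid-2" has never been seen.
tagged = index.tag(["uuid-1", "uuid-2"])

# A later attempt to move "uuid-1" leaves the original mapping in place.
kept = index.update("uuid-1", RecordLocation("2019/03/27", "file-b"))
```

With this shape the caller no longer has to supply the partition value in
the payload: an incoming record is routed by its key alone, and a key that
is not yet indexed is simply treated as an insert.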
