[GitHub] [hudi] yihua commented on a diff in pull request #6730: [HUDI-4884] Fixing faq for indexes with hudi and fixing docker demo for hive sync

GitBox Fri, 23 Sep 2022 18:12:14 -0700


yihua commented on code in PR #6730:
URL: https://github.com/apache/hudi/pull/6730#discussion_r979134260



##########
website/docs/faq.md:
##########
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing 
and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information 
placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of 
a key inside a single partition i.e the user is expected to know the partition 
under which a given record key is stored. This helps the indexing scale very 
well for even [very large 
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some 
cases, it might be necessary instead to do the de-duping/enforce uniqueness 
across all partitions and the global bloom index does exactly that. If this is 
used, incoming records are compared to files across the entire dataset and 
ensure a recordKey is only present in one partition.
+ - *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in 
the footer of parquet/base files (and soon log files as well)
+ - *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness 
of a key inside a single partition i.e the user is expected to know the 
partition under which a given record key is stored. This helps the indexing 
scale very well for even [very large 
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some 
cases, it might be necessary instead to do the de-duping/enforce uniqueness 
across all partitions and the global bloom index does exactly that. If this is 
used, incoming records are compared to files across the entire dataset and 
ensure a recordKey is only present in one partition.
  - *HBaseIndex* : Apache HBase is a key value store, typically found in close 
proximity to HDFS. You can also store the index inside HBase, which could be 
handy if you are already operating HBase.
+ - *HoodieSimpleIndex (default)* : A simple index which reads interested 
fields(record key and partition path) from base files and joins with incoming 
records to find the tagged location.

Review Comment:
   nit: `interested fields(record key` -> `interested fields (record key` 
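
   For context, the lookup this bullet describes can be sketched roughly as below. This is only an illustration of the idea (join incoming records with the record key and partition path read from base files), not Hudi's implementation; `tag_locations`, the tuple layout, and the file IDs are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical shapes, for illustration only (not Hudi classes):
#   existing rows: (partition_path, record_key, file_id) read from base files
#   incoming rows: (partition_path, record_key)
ExistingRow = Tuple[str, str, str]
IncomingRow = Tuple[str, str]
TaggedRow = Tuple[str, str, Optional[str]]


def tag_locations(existing: Iterable[ExistingRow],
                  incoming: Iterable[IncomingRow]) -> List[TaggedRow]:
    """Join incoming records with the fields read from base files.

    A record found in the lookup is tagged with its file ID (an update);
    a record that is not found is tagged with None (an insert).
    """
    # Build the lookup keyed on (partition path, record key), which is
    # what makes this the non-global variant: uniqueness per partition.
    lookup: Dict[Tuple[str, str], str] = {
        (partition, key): file_id for partition, key, file_id in existing
    }
    return [
        (partition, key, lookup.get((partition, key)))
        for partition, key in incoming
    ]
```

   Keying the lookup on the record key alone, instead of the (partition path, record key) pair, would give the global behaviour described for the global variants.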



##########
website/docs/faq.md:
##########
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing 
and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information 
placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of 
a key inside a single partition i.e the user is expected to know the partition 
under which a given record key is stored. This helps the indexing scale very 
well for even [very large 
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some 
cases, it might be necessary instead to do the de-duping/enforce uniqueness 
across all partitions and the global bloom index does exactly that. If this is 
used, incoming records are compared to files across the entire dataset and 
ensure a recordKey is only present in one partition.
+ - *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in 
the footer of parquet/base files (and soon log files as well)
+ - *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness 
of a key inside a single partition i.e the user is expected to know the 
partition under which a given record key is stored. This helps the indexing 
scale very well for even [very large 
datasets](https://eng.uber.com/uber-big-data-platform/). However, in some 
cases, it might be necessary instead to do the de-duping/enforce uniqueness 
across all partitions and the global bloom index does exactly that. If this is 
used, incoming records are compared to files across the entire dataset and 
ensure a recordKey is only present in one partition.
  - *HBaseIndex* : Apache HBase is a key value store, typically found in close 
proximity to HDFS. You can also store the index inside HBase, which could be 
handy if you are already operating HBase.
+ - *HoodieSimpleIndex (default)* : A simple index which reads interested 
fields(record key and partition path) from base files and joins with incoming 
records to find the tagged location.
+ - *HoodieGlobalSimpleIndex* : Global version of Simple Index, where in 
uniqueness is on record key across entire table. 
+ - *HoodieBucketIndex* : Each partition has statically defined buckets to 
which records are tagged with. Since locations are tagged via hashing 
mechanism, this index look up will be very efficient. 

Review Comment:
   nit: `look up` -> `lookup`
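
   As a rough illustration of why the hashing-based lookup is cheap: the bucket is computed directly from the record key, so no file needs to be read to locate a record. The sketch below assumes a CRC32 hash and a hypothetical `bucket_for` helper; the hash function Hudi actually uses may differ.

```python
import zlib


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to one of a fixed number of buckets.

    CRC32 is an assumption made for this sketch; it is used here only
    because it is deterministic across processes (unlike Python's
    built-in hash() on strings).
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# The same key always lands in the same bucket of its partition,
# so tagging a record is a constant-time computation.
bucket = bucket_for("uuid-1234", 8)
```

   Because the number of buckets per partition is statically defined, changing it would change where existing keys map to, which is the usual trade-off of this kind of scheme.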



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

