garyli1019 commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r528295097
##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hudi employs an index to find and update the location of incoming records during write operations. The index is a critical piece of Hudi, as it gives record-level lookup support for efficient write operations. This blog talks about the different index types and when to use each of them.
+
+In general, a Hudi dataset can be either partitioned or non-partitioned. So, most indices have two implementations: one for partitioned datasets, and another for non-partitioned datasets, called the global index.
+
+These are the index types supported by Hudi as of now:
+
+- InMemory
+- Bloom
+- Simple
+- Hbase
+
+You can use the `hoodie.index.type` config to choose any of these indices.
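For illustration, here is how the index type might be passed among the writer options (a sketch; the table name is hypothetical and the dict layout mirrors how Spark datasource options are commonly collected, only `hoodie.index.type` and its values come from this post):

```python
# A sketch of choosing the index type via writer options. Only
# "hoodie.index.type" is taken from the post; the rest is illustrative.
hudi_options = {
    "hoodie.table.name": "my_table",   # hypothetical table name
    "hoodie.index.type": "BLOOM",      # one of INMEMORY, BLOOM, SIMPLE, HBASE
}

# With a Spark DataFrame `df`, these options would typically be applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```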
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different indexing schemes to cater to the needs of different workloads, so the indexing scheme can be chosen based on one's use-case.
+
+For eg: …….
+To Be filled
+
+Let's take a brief look at each of these indices.
+
+## 2. InMemory
+Stores an in-memory hashmap mapping record keys to locations. Intended to be used for local testing.
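Conceptually, this can be sketched as a plain hashmap from record key to location (a simplified stand-in, not Hudi's actual implementation; all names here are hypothetical):

```python
# A simplified sketch of an in-memory index: a hashmap from record key to a
# (partition path, file id) location. Not Hudi's actual implementation.
class InMemoryIndexSketch:
    def __init__(self):
        self._locations = {}  # record key -> (partition_path, file_id)

    def tag_location(self, record_key):
        """Return the known location for a key, or None for a new insert."""
        return self._locations.get(record_key)

    def update_location(self, record_key, partition_path, file_id):
        self._locations[record_key] = (partition_path, file_id)

idx = InMemoryIndexSketch()
idx.update_location("key_t100", "2020/11/11", "f1")
print(idx.tag_location("key_t100"))  # existing record -> its location
print(idx.tag_location("key_t999"))  # None -> treated as a new insert
```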
+
+## 3. Bloom
+Leverages bloom index stored with data files to find the location for the
incoming records. This is the most commonly used Index in Hudi and is the
default one. On a high level, this does a range pruning followed by bloom look
up. So, if the record keys are laid out such that it follows some type of
ordering like timestamps, then this will essentially cut down a lot of files to
be looked up as bloom would have filtered out most of the files. But Range
pruning is optional depending on your use-case. If your write batch is such
that the records have no ordering in them (e.g uuid), but the pattern is such
that mostly the recent partitions are updated with a long tail of
updates/deletes to the older partitions, then still bloom index will be faster.
But better to turn off range pruning as it just incurs the cost of checking w/o
much benefit.
+
+For instance, consider a list of file slices in a partition
+
+F1 : key_t0 to key_t10000
+F2 : key_t10001 to key_t20000
+F3 : key_t20001 to key_t30000
+F4 : key_t30001 to key_t40000
+F5 : key_t40001 to key_t50000
+
+So, when looking up records ranging from key_t25000 to key_t28000, range pruning will filter out every file slice except F3.
+
+Here is high-level pseudocode for the bloom index:
+
+- Fetch the interested partitions from the incoming records.
+- Load all file info (range info) for every partition, giving a Map of <partition -> List<FileInfo>>.
+- Find all <file, hoodie key> pairs to be looked up. For every <partition, record key> pair, use the index file filter to trim down the files to be looked up, leveraging the file range info. Hudi uses a tree-map-like structure for efficient index file filtering.
+- Sort the <file, hoodie key> pairs.
+- Load each file and look up the mapped keys to find the exact location for the record keys.
+- Tag the locations back onto the incoming records. This step also identifies the newly inserted records in the incoming batch, which end up with no location.
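The steps above can be sketched end-to-end with a toy example (not Hudi code; a Python set stands in for each file's bloom filter, so there are no false positives here, and all names and locations are made up):

```python
# Toy sketch of the bloom index flow. Each "file" carries its key range, a
# bloom-filter stand-in (a set), and its actual key -> location mapping.
files = {
    "F3": {"range": ("key_t20001", "key_t30000"),
           "bloom": {"key_t25000", "key_t26000"},
           "keys":  {"key_t25000": "loc_a", "key_t26000": "loc_b"}},
    "F4": {"range": ("key_t30001", "key_t40000"),
           "bloom": {"key_t35000"},
           "keys":  {"key_t35000": "loc_c"}},
}

def tag_locations(incoming_keys, files):
    tagged = {}
    for key in incoming_keys:
        for name, f in files.items():
            lo, hi = f["range"]
            if not (lo <= key <= hi):     # step 1: range pruning
                continue
            if key not in f["bloom"]:     # step 2: bloom filter check
                continue
            if key in f["keys"]:          # step 3: actual file look-up
                tagged[key] = (name, f["keys"][key])
    return tagged  # keys absent from the result are new inserts

print(tag_locations(["key_t25000", "key_t99999"], files))
# key_t25000 tags to ('F3', 'loc_a'); key_t99999 is a new insert
```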
+
+As you can see, range pruning is done first to cut down the files to be looked up, following which the actual bloom filter look-up is done. This is the index type chosen by default.
+
+## 4. Simple Index
+For a decent sized dataset, Simple index comes in handy. In the bloom index
discussed above, hoodie reads the file twice. Once to load the file range info
and again to load the bloom filter. So, this simple index simplifies if the
data is within reasonable size.
+
+- From the incoming records, find Pair<record key, partition path> for every record.
+- Load only the interested fields (record key, partition path and location) from all files, to find Pair<record key, partition path, location> for all entries in storage.
+- Join the above two outputs to find the location for all incoming records.
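These three steps amount to a join, which can be sketched as follows (a toy stand-in, not Hudi code; keys and locations are hypothetical):

```python
# Sketch of the simple index as a join: incoming (record key, partition path)
# pairs are joined against (record key, partition path, location) triples
# loaded from storage. A missing match means the record is a new insert.
incoming = [("key_1", "2020/11/11"), ("key_2", "2020/11/11")]
on_storage = [("key_1", "2020/11/11", ("f1", "slice_0"))]

storage_map = {(k, p): loc for k, p, loc in on_storage}
joined = [(k, p, storage_map.get((k, p))) for k, p in incoming]
print(joined)
# key_1 gets its existing location; key_2 joins to None -> new insert
```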
+
+Since we load only the interested fields from the files and join directly with the incoming records, this works pretty well for small-scale data, even compared to the bloom index. But at larger scale, performance may deteriorate, since all files are touched without any upfront trimming.
+
+## 5. HBase
+Both the bloom and simple indices are implicit indices; in other words, no explicit or external index files are created or stored. The HBase index, on the other hand, is an external index where record locations are stored and retrieved. It is straightforward: fetching a location does a get on the HBase table, and updating a location updates the corresponding records in HBase.
+
+// talk about hbase configs?
+
+## 6. UserDefinedIndex
+Hoodie also support user defined index. All you need to do is to implement
“org.apache.hudi.index.SparkHoodieIndex”. You can use this config to set the
user defined class name. If this value is set, this will take precedence over
“hoodie.index.type”.
Review comment:
Should we use `HoodieIndex` instead of `SparkHoodieIndex`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]