[
https://issues.apache.org/jira/browse/HUDI-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406147#comment-17406147
]
ASF GitHub Bot commented on HUDI-2369:
--------------------------------------
pratyakshsharma commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r697831983
##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation, in addition to "insert" and "upsert", to ingest data into a Hudi table.
+There are different sort modes that one could employ while using bulk_insert. This blog will talk about
+different sort modes available out of the box, and how each compares with others.
+<!--truncate-->
+
+Apache Hudi supports “bulk_insert” to assist in the initial loading of data into a Hudi table. This is expected
+to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
+aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like
+small file handling are not performed with bulk_insert.
+
+Bulk insert offers 3 different sort modes to cater to different needs of users, based on the following principles.
+
+- Sorting will give us good compression and upsert performance, if data is laid out well. Especially if your record keys
+ have some sort of ordering (timestamp, etc) characteristics, sorting will assist in trimming down a lot of files
+ during upsert. If data is sorted by frequently queried columns, queries will leverage parquet predicate pushdown
+ to trim down the data to ensure lower latency as well.
+
+- Additionally, parquet writing is quite a memory intensive operation. When writing large volumes of data into a table
+ that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of
+ parquet writers open simultaneously incurring unsustainable memory pressure and eventually leading to crashes.
+
+- It's also desirable to start with the smallest number of files possible when bulk importing data, so as to avoid
+ metadata overhead later on for writers and queries.
+
+The 3 sort modes supported out of the box are: `PARTITION_SORT`, `GLOBAL_SORT` and `NONE`.
+
+## Configurations
+One can set the config [“hoodie.bulkinsert.sort.mode”](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to any
+of the three values, namely NONE, GLOBAL_SORT and PARTITION_SORT. Default sort mode is `GLOBAL_SORT`.
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files
+pruned using key ranges, during index lookups for subsequent upserts. This is because each file has non-overlapping
+min, max values for keys, which really helps, when the key has some ordering characteristics such as a time based prefix.
+Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also due to global sorting, each small table
+partition path will be written from at most two Spark partitions and thus contain just 2 files.
Review comment:
I guess adding some visual representation might help here.
Also, can you explain how a partition path will be written from at most 2 Spark partitions? It depends on the file size and the amount of data present in a particular Spark partition, right?
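
To make the question concrete, here is a rough Python sketch (plain Python, not Hudi or Spark code). It assumes the global sort key leads with the partition path, and that the sorted records are cut into equal-sized contiguous ranges, one per Spark partition. Real range partitioning is sample-based, so actual boundaries would only be approximately equal. The function name `spark_partitions_per_path` and the record counts are made up for illustration.

```python
from collections import defaultdict


def spark_partitions_per_path(records_per_path, num_spark_partitions):
    """Count how many Spark partitions hold records of each table partition path.

    Assumption (not taken from Hudi): records are globally sorted with the
    partition path as the leading sort key, then split into equal-sized
    contiguous ranges, one range per Spark partition.
    """
    # Global sort: all records of one partition path form one contiguous run.
    sorted_paths = []
    for path in sorted(records_per_path):
        sorted_paths.extend([path] * records_per_path[path])

    total = len(sorted_paths)
    # Records per Spark partition, rounded up so every record gets a range.
    chunk = -(-total // num_spark_partitions)

    touched = defaultdict(set)
    for position, path in enumerate(sorted_paths):
        touched[path].add(position // chunk)
    return {path: len(ids) for path, ids in touched.items()}


# Hypothetical input: 600 records over 3 partition paths, 6 Spark partitions,
# so each Spark partition covers a range of 100 records.
counts = spark_partitions_per_path(
    {"2021/08/25": 60, "2021/08/26": 80, "2021/08/27": 460},
    num_spark_partitions=6,
)
print(counts)
```

With these made-up counts, the 80-record path straddles one range boundary and lands in 2 Spark partitions, while the 460-record path lands in 5. Under these assumptions the "at most two" bound holds only when a partition path has no more records than one Spark partition's range, so it would help if the post spelled out what "small" means here.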
> Blog on bulk insert sort modes
> ------------------------------
>
> Key: HUDI-2369
> URL: https://issues.apache.org/jira/browse/HUDI-2369
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Blog on bulk insert sort modes