[GitHub] [hudi] yihua commented on a change in pull request #3588: [MINOR] Fix wording and table in the marker blog

GitBox Sun, 09 Jan 2022 21:58:22 -0800


yihua commented on a change in pull request #3588:
URL: https://github.com/apache/hudi/pull/3588#discussion_r780906284




##########
File path: website/blog/2021-08-18-improving-marker-mechanism.md
##########
@@ -47,26 +47,26 @@ Note that the worker thread always checks whether the marker has already been cr
 
 ## Marker-related write options
 
-We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism.
+We introduce the following new marker-related write options in `0.9.0` release, to configure the marker mechanism.  Note that the timeline-server-based marker mechanism is not yet supported for HDFS in `0.9.0` release, and we plan to support the timeline-server-based marker mechanism for HDFS in the future.
 
 | Property Name |   Default   |     Meaning    |        
 | ------------- | ----------- | :-------------:| 
-| `hoodie.write.markers.type`     | direct | Marker type to use.  Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the writer; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
+| `hoodie.write.markers.type`     | direct | Marker type to use.  Two modes are supported: (1) `direct`: individual marker file corresponding to each data file is directly created by the executor; (2) `timeline_server_based`: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency. |
 | `hoodie.markers.timeline_server_based.batch.num_threads`     | 20 | Number of threads to use for batch processing marker creation requests at the timeline server. |
 | `hoodie.markers.timeline_server_based.batch.interval_ms` | 50 | The batch interval in milliseconds for marker creation batch processing. |
 
 ## Performance
 
-We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB.  We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.
+We evaluate the write performance over both direct and timeline-server-based marker mechanisms by bulk-inserting a large dataset using Amazon EMR with Spark and S3. The input data is around 100GB.  We configure the write operation to generate a large number of data files concurrently by setting the max parquet file size to be 1MB and parallelism to be 240.  Note that it is unlikely to set max parquet file size to 1MB in production and such a setup is only to evaluate the performance regarding the marker mechanisms. As we noted before, while the latency of direct marker mechanism is acceptable for incremental writes with smaller number of data files written, it increases dramatically for large bulk inserts/writes which produce much more data files.
 
-As shown below, the timeline-server-based marker mechanism generates much fewer files storing markers because of the batch processing, leading to much less time on marker-related I/O operations, thus achieving 31% lower write completion time compared to the direct marker file mechanism.
+As shown below, direct marker mechanism works really well, when a part of the table is written, e.g., 1K out of 165K data files.  However, the time of direct marker operations is non-trivial when we need to write significant number of data files. Compared to the direct marker mechanism, the timeline-server-based marker mechanism generates much fewer files storing markers because of the batch processing, leading to much less time on marker-related I/O operations, thus achieving 31% lower write completion time compared to the direct marker file mechanism.
 
-| Marker Type |   Total Files   |  Num data files written | Files created for markers | Marker deletion time | Bulk Insert Time (including marker deletion) |
+| Marker Type |   Input data size   |  Num data files written | Files created for markers | Marker deletion time | Bulk Insert Time (including marker deletion) |
 | ----------- | --------- | :---------: | :---------: | :---------: | :---------: |
-| Direct | 165K | 1k | 165k | 5.4secs | - |
-| Direct | 165K | 165k | 165k | 15min | 55min |
-| Timeline-server-based | 165K | 165k | 20 | ~3s | 38min |
+| Direct | 600MB | 1k | 1k | 5.4secs | - |

Review comment:
       I somehow missed this comment.  I have put up a PR to fix it: #4547




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

