This is an automated email from the ASF dual-hosted git repository.

rdhabalia pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/pulsar.wiki.git


The following commit(s) were added to refs/heads/master by this push:
     new ad78d32  Created PIP 37: Large message size handling in Pulsar (markdown)
ad78d32 is described below

commit ad78d32f02848b0d3f7da257df3dcf5373e33f90
Author: Rajan Dhabalia <[email protected]>
AuthorDate: Thu May 16 17:28:15 2019 -0700

    Created PIP 37: Large message size handling in Pulsar (markdown)
---
 PIP-37:-Large-message-size-handling-in-Pulsar.md | 60 ++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/PIP-37:-Large-message-size-handling-in-Pulsar.md b/PIP-37:-Large-message-size-handling-in-Pulsar.md
new file mode 100644
index 0000000..3cf5a93
--- /dev/null
+++ b/PIP-37:-Large-message-size-handling-in-Pulsar.md
@@ -0,0 +1,60 @@
+# PIP 37: Large message size handling in Pulsar: Chunking Vs Txn
+
+- Status: Proposed
+- Author: Rajan Dhabalia
+- Discussion Thread:
+- Issue: 
+
+## Motivation
+
+We have received multiple asks from users who want to publish large messages into their data pipelines. Most of these use cases are streaming scenarios that require message ordering: for example, sending large database records that must be processed in order, or a streaming pipeline on a grid that consumes raw input data from a topic, aggregates and transforms it, and writes to new topics for further processing. TCP already handles a similar problem by splitting large packets [...]
+
+## Approach
+
+Large message payloads can be split into multiple smaller chunks that the broker can accept. The chunks are stored at the broker in the same way as ordinary messages are stored in the managed-ledger. The only difference is on the consumer side: the consumer buffers the chunks and combines them into the real message once all chunks have been received.
+The chunks of a message can be interwoven with ordinary messages in the managed-ledger.
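As a rough illustration of the split step, a producer could slice a payload into chunk records like the sketch below. This is not the Pulsar client API: the chunk-size constant and the metadata field names (`uuid`, `chunk_id`, `num_chunks`) are assumptions made for the example.

```python
import uuid

# Hypothetical ceiling on what the broker accepts per message.
CHUNK_SIZE = 5 * 1024 * 1024


def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a large payload into chunk records that an ordinary
    producer could publish as regular messages.

    Every chunk carries the same uuid plus its index and the total
    chunk count, so a consumer can reassemble the original payload.
    """
    msg_uuid = str(uuid.uuid4())  # shared id tying all chunks of one message together
    total = max(1, (len(payload) + chunk_size - 1) // chunk_size)
    for chunk_id in range(total):
        yield {
            "uuid": msg_uuid,
            "chunk_id": chunk_id,
            "num_chunks": total,
            "data": payload[chunk_id * chunk_size:(chunk_id + 1) * chunk_size],
        }
```

Each record is small enough to publish as a normal message, and concatenating the `data` fields in `chunk_id` order restores the original payload.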
+
+For example,
+### Usecase 1: Single producer with ordered consumer
+The topic has one producer which publishes large message payloads in chunks along with regular non-chunked messages. The producer first publishes message M1 in three chunks: M1-C1, M1-C2 and M1-C3. The broker stores all three chunked messages in the managed-ledger and dispatches them to the ordered (exclusive/failover) consumer in the same order. The consumer buffers all the chunked messages in memory until it has received all the chunks, combines them into one real message, and hands the original large message M1 over to the [...]
+
+![image](https://user-images.githubusercontent.com/2898254/57895169-230e0d00-77ff-11e9-808d-a04c3ef14679.png)
+
+                              [Fig 1: One producer with ordered (Exclusive/Failover) consumer]
+
+### Usecase 2: Multiple producers with ordered consumer
+Sometimes a data pipeline can have multiple publishers which publish chunked messages into a single topic. In this case, the broker stores all the chunked messages coming from the different publishers in the same ledger. So, all chunks of a specific message will still be in order, but they might not be consecutive in the ledger.
+This use case can still be served by an ordered consumer, but it creates a little memory pressure at the consumer, because the consumer now has to keep a separate buffer for each large message in order to aggregate all the chunks of that message and combine them into one real message.
+So, one of the drawbacks is that the consumer has to maintain multiple buffers in memory, but the number of buffers is the same as the number of publishers.
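A minimal sketch of this consumer-side buffering, assuming each chunk arrives as a small record carrying a shared uuid, its chunk index, and the total chunk count (hypothetical field names, not the actual Pulsar wire format): the consumer keeps one buffer per in-flight large message and emits the combined payload once the last chunk arrives.

```python
class ChunkAssembler:
    """Buffers incoming chunks per message uuid and returns the
    reassembled payload when a message is complete, else None."""

    def __init__(self):
        # One buffer per in-flight large message. With at most one
        # chunked message in flight per producer, the number of
        # buffers is bounded by the number of publishers on the topic.
        self._buffers = {}

    def receive(self, chunk: dict):
        key = chunk["uuid"]
        buf = self._buffers.setdefault(key, [None] * chunk["num_chunks"])
        buf[chunk["chunk_id"]] = chunk["data"]
        if all(part is not None for part in buf):
            del self._buffers[key]
            return b"".join(buf)  # message complete: hand over the real payload
        return None  # still waiting for more chunks of this message
```

Because buffers are keyed by the message uuid, interleaved chunks from different producers reassemble independently without affecting each other's ordering.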
+
+If we compare chunked messages with transactions in Pulsar, a chunked message has a limited life-cycle: at any time only one such transaction can exist per producer. So, the consumer has to deal with only one transaction from each producer at any time, and if the consumer handles transaction aggregation instead of the broker (as per PIP-31), the consumer does not have to worry much about memory pressure, because there are not many concurrent transactions happening on the topic.
+
+The main difference between this approach and PIP-31 (Txn) is that the assembling of messages happens at the consumer side instead of at the broker, without much concern about memory pressure at the consumer.
+
+
+![image](https://user-images.githubusercontent.com/2898254/57895200-4cc73400-77ff-11e9-9edf-a4e1c202cddf.png)
+
+                            [Fig 2: Multiple producers with ordered (Exclusive/Failover) consumer]
+
+
+### Usecase 3: Multiple producers with shared consumers
+
+We discussed how message chunking works without any broker changes when a single ordered consumer consumes messages published by single or multiple publishers. In this section we will discuss how it works with shared consumers.
+Message chunking (split and join) requires that all chunks belonging to one message be delivered to the same consumer. So, in the case of shared consumers, we need a small broker change in message dispatching: (1) the broker keeps a sorted list of shared consumers based on consumer connect time; (2) while dispatching a chunked message, the broker reads the message-id from its metadata (a unique message-id which is attached to all chunks of that message), and based on the message-id hash, the broker s [...]
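The hash-based selection in step (2) can be sketched as a simple stable-hash pick over the sorted consumer list. The hashing scheme below (`crc32`) is illustrative only, a stand-in for whatever hash the broker would actually use:

```python
import zlib


def pick_consumer(sorted_consumers: list, msg_uuid: str):
    """Select a shared consumer for a chunked message.

    Because every chunk of a message carries the same unique
    message-id, hashing that id against the sorted consumer list
    sends all chunks of one message to the same consumer.
    """
    index = zlib.crc32(msg_uuid.encode()) % len(sorted_consumers)
    return sorted_consumers[index]
```

As long as the consumer list stays stable, the selection is deterministic per message, which is exactly the property the split-and-join requires.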
+
+![image](https://user-images.githubusercontent.com/2898254/57895228-741e0100-77ff-11e9-8feb-334ebec83f4a.png)
+
+                                   [Fig 3: Multiple producers with shared consumer]
+
+
+## Chunking on large-message Vs Txn with large-message
+
+1. The Txn approach requires a lot of new enhancements at the broker: an extra service for the txn-coordinator, extra CPU to read over the txn-buffer for each txn message, extra memory to maintain the txn-buffer, and extra metadata for the new txn partition. It might also not be convenient to deploy a txn service for a specialized system which serves 1M topics with large traffic.
+
+2. Chunking only requires minimal changes at the client side and does not create any cost at the broker side. The consumer is the only module which has to pay a cost, in terms of memory while building the buffers, but that is also limited by the number of publishers on the topic.
+
+3. Is chunking an alternative to transactions?
+- No. Chunking is one use case of a transaction: a chunked message is a short-lived transaction, where the number of concurrent chunking-txns is limited by the number of publishers. But chunking cannot replace all txn use cases with long session times and a large number of concurrent txns on the topic, because the client cannot afford the memory to handle such use cases.
+
+
+
