mcvsubbu opened a new issue #5359:
URL: https://github.com/apache/incubator-pinot/issues/5359


   Pinot code stopped making reference to Kafka back in 0.1.0 days. HLC (the 
high-level consumer) can support pretty much any stream. LLC (the low-level 
consumer), however, still uses one property of streams in its raw form in the 
code -- the offset of a stream message within a partition. 
   
   This is assumed to be a long (8 bytes). It appears as such in segment ZK 
metadata, is maintained as a long in the stream consumers, and is expected to 
be a `long` (primitive) in all the consuming interfaces.  
   
   This works fine with Kafka, Eventhub and such, but does not hold for some 
of the other streams, whose offsets do not fit in a `long`.
   
   We need to extend the code to support more generic offsets. This has to be 
done carefully, since it can break backward compatibility and cause production 
outages. It is better to do it in smaller steps, making sure at each step that 
we are not breaking anything. 
   
   Offsets are NOT stored in on-disk segment metadata (good!)
   
   Broadly, the usage of offset is in these areas:
   
   1. The controller queries the stream's metadata to get the offset in each 
partition of the stream. The controller writes this offset into segment 
metadata as the starting offset of each realtime segment. Further, the 
controller also writes the offset into zk segment metadata when the segment 
completes.
   2. The server uses the offset to request the stream partition to return 
messages starting with that offset.
   3. The server and controller exchange the offset value (as a long) in the 
segment completion protocol.
   
   The broad set of steps is as follows (but the devil is in the details, and 
we will know better as we move along):
   
   1. Change `long` into a class (`StreamPartitionMsgOffset`? -- must be 
`Comparable` and `Serializable`) in all places except Kafka-specific areas. For 
now, use `LongOffset` as the implementation of this type. Don't change any 
code that deals with persisted data as yet. 
   2. Change stream consumer interface to support the 
`StreamPartitionMsgOffset` class instead of a long (both metadata fetcher and 
data consumer interfaces).
   3. Change the Segment Completion Protocol to add an additional serialized 
element into the protocol. Both controller and server will pay attention to the 
new element if it is present. Since we will be serializing LongOffset class, it 
should work well. The sender should include both raw form and serialized form 
in the protocol. The receiver chooses serialized if available, and falls back 
to raw if not.
   4. Change the segment metadata in ZK to include the serialized offset (in a 
new field). The deserializer will pick the serialized form if available, and 
otherwise fall back to the long offset.
   5. Over time, remove the use of `long` in persistent data.
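
   A minimal sketch of steps 1, 3 and 4, to make the proposal concrete. Only 
the names `StreamPartitionMsgOffset` and `LongOffset` come from the text above. 
The methods `serialize()` and `parse()`, the helper `OffsetResolver.resolve()`, 
and the choice of a decimal string as the serialized form are all assumptions 
made for illustration, not the actual Pinot API.

```java
import java.io.Serializable;

/**
 * Hypothetical generic offset type (step 1). The name is from the proposal;
 * the serialize() method is an assumption of this sketch.
 */
interface StreamPartitionMsgOffset extends Comparable<StreamPartitionMsgOffset>, Serializable {
    /** Returns a string form that could travel in the completion protocol or ZK metadata. */
    String serialize();
}

/** Offset backed by a primitive long, covering today's Kafka-style streams. */
final class LongOffset implements StreamPartitionMsgOffset {
    private static final long serialVersionUID = 1L;
    private final long offset;

    LongOffset(long offset) {
        this.offset = offset;
    }

    /** Assumed serialized form: the decimal string of the long value. */
    static LongOffset parse(String serialized) {
        return new LongOffset(Long.parseLong(serialized));
    }

    long getOffset() {
        return offset;
    }

    @Override
    public String serialize() {
        return Long.toString(offset);
    }

    @Override
    public int compareTo(StreamPartitionMsgOffset other) {
        // In this sketch only offsets of the same concrete type are comparable;
        // any other type fails with a ClassCastException.
        return Long.compare(offset, ((LongOffset) other).offset);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LongOffset && ((LongOffset) o).offset == offset;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(offset);
    }

    @Override
    public String toString() {
        return serialize();
    }
}

/** Hypothetical helper showing the receiver-side fallback of steps 3 and 4. */
final class OffsetResolver {
    private OffsetResolver() {
    }

    /**
     * Prefers the serialized element when the sender supplied one, and falls
     * back to the raw long when it is absent (for example, an older sender).
     */
    static StreamPartitionMsgOffset resolve(String serializedOffset, long rawOffset) {
        if (serializedOffset != null && !serializedOffset.isEmpty()) {
            return LongOffset.parse(serializedOffset);
        }
        return new LongOffset(rawOffset);
    }
}
```

   With this shape, a receiver would call something like 
`OffsetResolver.resolve(serializedField, rawField)` both when reading a segment 
completion message and when reading ZK segment metadata, so old and new 
senders can coexist during a rolling upgrade.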




