a2l007 opened a new issue, #12506:
URL: https://github.com/apache/druid/issues/12506

   ### Motivation
   
   Druid users today can use the Kafka/Kinesis indexing service for a variety of 
realtime use cases, and the way pipelines are set up may vary accordingly. 
Users may have their applications produce events directly to Kafka, which is 
then consumed by Druid. For such use cases, once the data is ingested into Druid 
there is no way to recover the original unaggregated events without setting 
up a separate exporting system, such as Kafka Connect, that writes to a storage 
sink such as HDFS or S3. 
   
   Following are a couple of use cases where it would be useful to have the 
original unaggregated data available:
   
   - Misconfigured ingestion specs on Druid: The supervisor specs may not have 
the complete set of dimensions/aggregators defined, or they may have incorrect 
transformSpecs or an incorrect inputFormat configuration. In such cases, 
the events get ingested into Druid and users may only realize the 
issue later on. By then, the Kafka retention period for those events 
has probably expired, and if reindexing from existing segments doesn't help, there 
is no way to fix the issue without access to the raw events. 
   
   - In certain use cases, the upstream events are published into Kafka but 
Druid might be used for non-realtime workloads. Running batch indexing would 
be sufficient here, and MiddleManagers wouldn't have to be configured for query 
performance as with real-time ingestion. Having access to the raw data provides 
this flexibility.
   
   To support such use cases that require access to the raw events, my 
proposal is to add journaling support within Druid.
   
   ### Proposed changes
   
   The journaling system will essentially: read events from Kafka -> write them to 
local files on the MiddleManager -> periodically push the journaled files to 
the configured deep storage. It will be implemented as a Druid extension 
consisting of the following:
   
   - `Journaling Supervisor`: A supervisor implementation that will read 
offsets from the configured Kafka topic and spawn journaling tasks with start 
and end offsets that the task needs to read from.
   - `Journal tasks`: Tasks that read events and persist them locally; the written 
files will be pushed to the configured store.
   - The journaling tasks should tap into the offset management mechanism 
provided by `SeekableStreamIndexTaskRunner` wherever possible.
   - The journal supervisor and its associated tasks run independently of the Kafka 
supervisor and Kafka indexing tasks. 
   - Configurable retention periods that dictate the lifetime of the journal 
files on deep storage.
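
   For illustration, a journaling supervisor spec submitted to the Overlord might 
look roughly like the sketch below. Every field name here is hypothetical, loosely 
modeled on the existing Kafka supervisor spec; the actual schema would be defined 
by the extension:

   ```json
   {
     "type": "journal-kafka",
     "spec": {
       "ioConfig": {
         "topic": "events",
         "consumerProperties": {
           "bootstrap.servers": "kafka-broker:9092"
         },
         "taskCount": 2,
         "taskDuration": "PT1H"
       },
       "tuningConfig": {
         "maxRowsPerJournalFile": 5000000,
         "retentionPeriod": "P30D"
       }
     }
   }
   ```

   In this sketch, `retentionPeriod` would control how long journal files are kept 
on deep storage, and `taskDuration` how often journal files are rolled and pushed.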
   
   ### Rationale
   
   With the journaling extension, users will have the option to store their raw 
events in the configured store. They won't need to set up an external 
exporting sink to journal their data, and they won't have to increase their 
Kafka retention periods to preserve the raw events.
   
   ### Operational impact
   
   No impact to current behavior. The journaling supervisor will be independent 
of the Kafka supervisors, and once configured, it will require task slots on 
the MiddleManager to run the journaling tasks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
