gianm opened a new issue #6136: Compaction and ingestion running simultaneously
URL: https://github.com/apache/incubator-druid/issues/6136
 
 
   We'd like to be able to run compaction and append-oriented ingestion at the 
same time, for the same time chunks. "Compaction" here means an indexing task 
that reads Druid segments and writes back equivalent, optimized segment(s).
   
   Right now (0.12.x) we can run these at the same time for the same 
datasource, as long as the time chunks are different (different days, for 
example, if segment granularity is day). This is good, but it doesn't help in 
two cases:
   
   1. Backfills (historical data loads) done through Kafka. These cause 
problems, especially if the historical data arrives in no particular order, 
because the Kafka tasks end up publishing wave after wave of small segments. 
Imagine loading a month of data with segment granularity "hour": that's 720 
time chunks, and each one may accumulate many small segments as they all fill 
up simultaneously.
   2. Ingestion pipelines with a long trickle of late data. Consider a 
situation where most data for a particular day arrives in real time, but small 
amounts of late data trickle in over the next 30 days. If this late data 
arrives regularly enough, it becomes impossible to run a compaction task for 
that day until the late data stops. We have to wait 30 days, and during that 
time queries can slow down significantly due to the potentially large number 
of tiny segments.
   
   Both of these are challenging to address via tuning: faced with such data 
delivery patterns, we can only do so much to create optimal segments upfront. 
But both could be addressed by the ability to compact segments even while 
other segments are being written to the same interval. This has another 
benefit: it suggests that we can compact partial time chunks, which means that 
compaction doesn't necessarily need to be distributed, even for large amounts 
of data.
   
   I am not sure what this should look like, but I think some things are true:
   
   - It will need to involve some changes to how the VersionedIntervalTimeline 
works, since it currently has no way to view some segments within an 
interval/version pair as obsolete without considering them _all_ obsolete (see 
the sketch after this list).
   - It would be nice to maintain the property that VersionedIntervalTimelines 
can be constructed from a collection of DataSegments, which suggests that 
we'll be modifying the DataSegment class somehow.
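   
   To illustrate the first point, here's a minimal sketch of today's 
all-or-nothing overshadowing rule. `Segment` and `lookup` are simplified 
stand-ins I made up for this sketch, not the real DataSegment or 
VersionedIntervalTimeline:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class TimelineSketch
{
  // Simplified stand-in for a DataSegment within a single time chunk.
  static class Segment
  {
    final String version;
    final int partitionNum;

    Segment(String version, int partitionNum)
    {
      this.version = version;
      this.partitionNum = partitionNum;
    }
  }

  // Today's rule, roughly: for one time chunk, every partition of the highest
  // version is visible and every lower-version segment is obsolete. There is
  // no way to mark individual partitions within a version obsolete.
  static List<Segment> lookup(Collection<Segment> segments)
  {
    String maxVersion = null;
    for (Segment s : segments) {
      if (maxVersion == null || s.version.compareTo(maxVersion) > 0) {
        maxVersion = s.version;
      }
    }
    List<Segment> visible = new ArrayList<>();
    for (Segment s : segments) {
      if (s.version.equals(maxVersion)) {
        visible.add(s);
      }
    }
    return visible;
  }

  public static void main(String[] args)
  {
    List<Segment> chunk = Arrays.asList(
        new Segment("2018-08-01T00:00:00.000Z", 0),
        new Segment("2018-08-01T00:00:00.000Z", 1),
        new Segment("2018-08-02T00:00:00.000Z", 0)  // e.g. a reindex at a higher version
    );
    // Prints only the 2018-08-02 partition: both 2018-08-01 partitions become
    // obsolete together, never individually.
    for (Segment s : lookup(chunk)) {
      System.out.println(s.version + " partition " + s.partitionNum);
    }
  }
}
```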
   
   Maybe something like this (I'm not sure it's the best design, but it might 
work): add a "replaces" list to DataSegment that looks like 
`"replaces" : [0, 1, 2]`. It means that DataSegment replaces the segments with 
partition numbers 0, 1, and 2 for the same interval/version pair. Say the new 
segment is partitionNum 3: the VersionedIntervalTimeline should return either 
3 _or_ 0, 1, and 2, but never mix them. It's self-describing in the sense that 
once you see 3, you know to stop looking at 0, 1, and 2. It would be nice to 
be able to do an N -> M compaction (rather than N -> 1), but I don't think 
this particular design generalizes to that. Maybe that's ok.
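   
   To make the idea concrete, here's a minimal sketch of how the visibility 
rule could work for one interval/version pair. Again, `Segment`, the 
`replaces` field, and `resolve` are hypothetical stand-ins for this sketch, 
not existing Druid classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplacesSketch
{
  // Simplified stand-in for a DataSegment within one interval/version pair.
  static class Segment
  {
    final int partitionNum;
    final List<Integer> replaces;  // partition numbers this segment supersedes

    Segment(int partitionNum, List<Integer> replaces)
    {
      this.partitionNum = partitionNum;
      this.replaces = replaces;
    }
  }

  // Visibility rule for the proposed design: any partition named in some
  // segment's "replaces" list is obsolete, so a lookup returns either the
  // compacted segment or the originals it replaced, never both.
  static List<Segment> resolve(List<Segment> segments)
  {
    Set<Integer> replaced = new HashSet<>();
    for (Segment s : segments) {
      replaced.addAll(s.replaces);
    }
    List<Segment> visible = new ArrayList<>();
    for (Segment s : segments) {
      if (!replaced.contains(s.partitionNum)) {
        visible.add(s);
      }
    }
    return visible;
  }

  public static void main(String[] args)
  {
    // Partitions 0-2 came from append ingestion; partition 3 is their
    // compaction; partition 4 was appended while compaction ran.
    List<Segment> segments = Arrays.asList(
        new Segment(0, Collections.emptyList()),
        new Segment(1, Collections.emptyList()),
        new Segment(2, Collections.emptyList()),
        new Segment(3, Arrays.asList(0, 1, 2)),
        new Segment(4, Collections.emptyList())
    );
    // Prints partitions 3 and 4: the compacted segment hides 0, 1, and 2,
    // while the late append stays visible.
    for (Segment s : resolve(segments)) {
      System.out.println("visible partition: " + s.partitionNum);
    }
  }
}
```

   One side effect of collecting "replaces" sets from every segment, obsolete 
or not, is that chained compactions in this sketch resolve transitively: if a 
later partition 5 replaced 3, partitions 0 through 2 would stay hidden via 3's 
list.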
