himanshug edited a comment on issue #8249: ability to let user configure 
segment version in indexing task
URL: 
https://github.com/apache/incubator-druid/issues/8249#issuecomment-518816885
 
 
   > I guess you're using a sort of workflow scheduler tool and, ideally, this 
issue should be addressed in the tool. Do we need this because it's too hard or 
complex to guarantee the proper job execution order in the tool?
   
   Let me elaborate on the scenario a bit more. There are many ETL jobs running 
(outside of my control) that can produce data for any arbitrary interval. If 
the data intervals from two different ETL jobs overlap, then, for the 
overlapping interval, the data produced by the later ETL job is "correct".
   
   The "tool" could address this by being intelligent, i.e. never submitting a 
Druid task for a dataset while a Druid task for another dataset with an 
overlapping interval is already running, waiting for that task to finish first, 
and retrying it if it fails. But if a task keeps failing because of some 
unexpected data corruption, that would block the whole pipeline, and tasks also 
fail for unrelated reasons, such as Druid k8s pods occasionally getting 
rescheduled. Overall, aside from needing more intelligence in the tool, this 
limits our ability to parallelize indexing of different datasets, since in most 
cases each one overlaps with the previous one. This is not a major concern as 
of now but would become a limitation as load increases.
   In this use case, we never "append" and never "rollup", as the data is 
always already grouped by the ETL jobs upfront.
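   
   A minimal sketch of the gating check such a tool would need, assuming 
half-open [start, end) intervals. The names `overlaps` and `can_submit` are 
hypothetical and not part of Druid or of any particular scheduler:

```python
from datetime import datetime
from typing import List, Tuple

# Half-open interval [start, end); an assumption made for this sketch.
Interval = Tuple[datetime, datetime]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two half-open intervals share any instant."""
    return a[0] < b[1] and b[0] < a[1]


def can_submit(candidate: Interval, running: List[Interval]) -> bool:
    """A task may be submitted only if no running task overlaps its interval."""
    return not any(overlaps(candidate, r) for r in running)


running = [(datetime(2019, 8, 1), datetime(2019, 8, 3))]

# Overlaps the running task on 2019-08-02, so it has to wait.
blocked = can_submit((datetime(2019, 8, 2), datetime(2019, 8, 4)), running)

# Starts exactly where the running task ends, so it can run in parallel.
allowed = can_submit((datetime(2019, 8, 3), datetime(2019, 8, 5)), running)
```

   Because most datasets overlap with the previous one, a check like this 
ends up serializing nearly every task behind its predecessor.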
   
   
   OTOH, if I could let the segment version be a timestamp token coming from 
the ETL job, then the tool could run Druid tasks for any of the uploaded 
datasets in any order, in parallel or whatever. That would make the tool's (and 
my, as the writer of that tool infra) life so much easier.
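   
   To illustrate why the order stops mattering, here is a toy model of 
version-based overshadowing. It is only a sketch: it assumes one segment per 
day per ETL job and versions that are ISO-8601 strings in a single format, so 
that plain string comparison orders them by time. The `visible` function and 
the sample data are hypothetical and do not reflect Druid's actual 
implementation:

```python
from typing import Dict, Iterable, List, Tuple

# (day, version, source): one hypothetical segment per day per ETL job.
Segment = Tuple[str, str, str]


def visible(published: Iterable[Segment]) -> Dict[str, str]:
    """For each day, return the source of the segment with the highest version."""
    best: Dict[str, Tuple[str, str]] = {}
    for day, version, source in published:
        if day not in best or version > best[day][0]:
            best[day] = (version, source)
    return {day: source for day, (_version, source) in best.items()}


# Two ETL jobs that overlap on 2019-08-02; the version is the timestamp
# token of the ETL job, not the time at which the indexing task ran.
etl_a: List[Segment] = [
    ("2019-08-01", "2019-08-05T10:00:00Z", "etl_a"),
    ("2019-08-02", "2019-08-05T10:00:00Z", "etl_a"),
]
etl_b: List[Segment] = [
    ("2019-08-02", "2019-08-06T10:00:00Z", "etl_b"),
    ("2019-08-03", "2019-08-06T10:00:00Z", "etl_b"),
]

# Publishing in either order yields the same visible data.
a_then_b = visible(etl_a + etl_b)
b_then_a = visible(etl_b + etl_a)
```

   In both orders the overlapping day resolves to the later ETL job, which is 
the "correct" data in the scenario above.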
   
   
   >  I feel like it's a hacky way to avoid the segment versioning system of 
Druid which could be hard to use and even dangerous if something happens (like 
they might see some stale data unexpectedly). Also, it's very weird to me if 
indexing tasks could generate segments overshadowed by the existing segments. 
It could be just waste of time and resources I guess.
   
   All the catches you mentioned are acceptable. As a euphemism, I would call 
it a "power user" option instead of a "hack" :), where the user understands the 
consequences. Unless there is a more ideal solution to the problem.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

