himanshug edited a comment on issue #8249: ability to let user configure segment version in indexing task
URL: https://github.com/apache/incubator-druid/issues/8249#issuecomment-518816885

> I guess you're using a sort of workflow scheduler tool and, ideally, this issue should be addressed in the tool. Do we need this because it's too hard or complex to guarantee the proper job execution order in the tool?

Let me elaborate on the scenario a bit more. There are many ETL jobs running (outside of my control) that produce data for arbitrary intervals. If the data intervals from two different ETL jobs overlap, then, for the overlapping interval, the data produced by the later ETL job is "correct".

The "tool" could address this by being intelligent, i.e. never submit a Druid task for a dataset while a Druid task for another dataset with an overlapping interval is already running. It would wait for that task to finish first and retry it if it failed. That has costs: if a task keeps failing because of some unexpected data corruption, it blocks the whole pipeline, and tasks also fail for unrelated reasons, such as Druid k8s pods occasionally getting rescheduled.

Overall, aside from needing more intelligence in the tool, this limits our ability to parallelize indexing of different datasets, since in most cases each dataset overlaps with the previous one. This is not a major concern right now, but it would become a limitation as load increases.

In this use case we never "append" and never "rollup", as the data is always grouped by the ETL jobs upfront.

OTOH, if I could let the segment version be a timestamp token coming from the ETL job, then the tool could run Druid tasks for any of the uploaded datasets in any order, in parallel or whatever. That makes the tool's life (and mine, as the writer of that tool infra) so much easier.

> I feel like it's a hacky way to avoid the segment versioning system of Druid which could be hard to use and even dangerous if something happens (like they might see some stale data unexpectedly). Also, it's very weird to me if indexing tasks could generate segments overshadowed by the existing segments. It could be just waste of time and resources I guess.

All the catches you mentioned are acceptable. As a euphemism, I would call it a "power user" option rather than a "hack" :), where the user understands the consequences. Unless there is a more ideal solution to the problem.
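
To make the "any order" point concrete, here is a minimal sketch of the idea. It is a simplified toy model, not Druid's actual timeline code. It assumes the segment version is a timestamp string supplied by the ETL job and that, for any overlapping interval, the segment with the higher version wins. The names `Segment`, `visible_timeline`, `etl_a` and `etl_b`, and the integer intervals and timestamps, are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: int    # interval start (inclusive), e.g. an hour index
    end: int      # interval end (exclusive)
    version: str  # assumption: timestamp token supplied by the ETL job
    source: str   # which ETL job produced the data


def visible_timeline(segments: List[Segment]) -> List[Tuple[int, int, str]]:
    """For each elementary interval, pick the covering segment with the
    highest version (compared as a string)."""
    # Cut the time axis at every segment boundary.
    points = sorted({p for s in segments for p in (s.start, s.end)})
    timeline = []
    for lo, hi in zip(points, points[1:]):
        covering = [s for s in segments if s.start <= lo and hi <= s.end]
        if covering:
            winner = max(covering, key=lambda s: s.version)
            timeline.append((lo, hi, winner.source))
    return timeline


# Two ETL outputs with overlapping intervals; etl_b is the later job.
etl_a = Segment(0, 4, "2019-08-06T10:00:00Z", "etl_a")
etl_b = Segment(2, 6, "2019-08-06T11:00:00Z", "etl_b")

# The order in which the indexing tasks finish does not change the result.
in_order = visible_timeline([etl_a, etl_b])
out_of_order = visible_timeline([etl_b, etl_a])
print(in_order == out_of_order)
```

In this model the overlapping interval always resolves to the later ETL job's data, whichever indexing task finishes first, which is why the tool would no longer need to serialize tasks.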
