glasser edited a comment on issue #6989: Behavior of index_parallel with appendToExisting=false and no bucketIntervals in GranularitySpec is surprising
URL: https://github.com/apache/incubator-druid/issues/6989#issuecomment-460859550

What I was missing here is that native batch parallel ingestion effectively acts as if appendToExisting is true unless you specify explicit intervals in the GranularitySpec.

This seems to differ from both Hadoop batch ingestion and the Local Index Task (including `index_parallel` with a non-splittable FirehoseFactory): all of these, if I understand correctly, run an additional phase to calculate the intervals when they are not provided.

I think this is confusing and I'd like to help fix it.

My honest instinct is that we should consider the current behavior a bug, and that the top-level parallel index task should fail with an error if all of the following are true:

- Running `index_parallel`
- `FirehoseFactory.isSplittable()` (or possibly leave this one out)
- `appendToExisting == false`
- granularitySpec does not specify intervals

While this would be a backwards-incompatible change in 0.14, native batch ingestion is still a very new feature and this behavior is very surprising. There is also a trivial workaround for anyone who likes the current behavior: set appendToExisting to true.

If that's not the right change, we could fix the docs instead. I'd update the description of appendToExisting in native_tasks.md to mention that it is effectively true if intervals aren't specified, and the description of `intervals` in ingestion_spec should mention that native parallel tasks depend on them more than other task types do.

(I suppose one could also make parallel indexing do two scans in this case, but a two-scan approach is more complex, and in my case I certainly would have been happier being asked to add one line to my spec than having my experience take twice as long.)

I'm happy to implement either the new error or the docs update, whichever is best.

Thoughts (@jihoonson ?)
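To make the proposed error condition concrete, here is a rough sketch of the check over a parsed task spec. It is illustration only: the real check would live in Druid's Java task code, and `check_parallel_spec` and `firehose_is_splittable` are hypothetical names. The field paths (`spec.ioConfig.appendToExisting`, `spec.dataSchema.granularitySpec.intervals`) and the default of `false` for appendToExisting are my assumptions about the spec layout. Since `isSplittable()` is a property of the FirehoseFactory and not of the JSON, it is passed in as a flag.

```python
from typing import Optional


def check_parallel_spec(task: dict, firehose_is_splittable: bool = True) -> Optional[str]:
    """Return an error message for the surprising combination, else None.

    Hypothetical helper: field paths below are assumed, not taken from Druid's code.
    """
    # Condition 1: only the parallel task type is affected.
    if task.get("type") != "index_parallel":
        return None

    spec = task.get("spec", {})
    io_config = spec.get("ioConfig", {})
    granularity_spec = spec.get("dataSchema", {}).get("granularitySpec", {})

    # Assumed default: appendToExisting is false when omitted.
    append_to_existing = bool(io_config.get("appendToExisting", False))
    # A missing, null, or empty list of intervals all count as "not specified".
    has_intervals = bool(granularity_spec.get("intervals"))

    # Conditions 2-4: splittable firehose, not appending, no explicit intervals.
    if firehose_is_splittable and not append_to_existing and not has_intervals:
        return (
            "index_parallel with appendToExisting=false requires explicit "
            "intervals in granularitySpec; add intervals or set "
            "appendToExisting=true"
        )
    return None
```

With this shape, a spec that omits intervals is rejected until the user either adds them or opts in to appending, which is the one-line fix described above.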
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
