glasser edited a comment on issue #6989: Behavior of index_parallel with 
appendToExisting=false and no bucketIntervals in GranularitySpec is surprising
URL: 
https://github.com/apache/incubator-druid/issues/6989#issuecomment-460859550
 
 
   What I was missing here is that native batch parallel ingestion effectively 
acts as if appendToExisting is true unless you specify explicit intervals in 
the GranularitySpec.
   
   This seems to be different from both Hadoop batch ingestion and the Local 
Index Task (including `index_parallel` with a non-splittable FirehoseFactory) — 
all of these (if I understand correctly) will run an additional phase to 
calculate the intervals if they are not provided.
   
   This confused me. I'd like to help fix it!
   
   I think we should consider the current behavior a bug, and the top-level 
parallel index task should error if all of the following are true:
   - Running `index_parallel`
   - `FirehoseFactory.isSplittable()` (or possibly leave this one out)
   - `appendToExisting == false`
   - granularitySpec does not specify intervals
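
   To make the proposed check concrete, here is a minimal sketch of the four conditions above. The class name, method names, and parameters are hypothetical and are not Druid's actual API; a real fix would read these values from the task's ingestion spec and firehose factory.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Druid.
final class ParallelIndexSpecCheck {
    private ParallelIndexSpecCheck() {}

    // True when all four conditions from the list above hold.
    static boolean shouldReject(
            String taskType,
            boolean splittableFirehose,
            boolean appendToExisting,
            List<String> intervals) {
        boolean noIntervals = intervals == null || intervals.isEmpty();
        return "index_parallel".equals(taskType)
                && splittableFirehose
                && !appendToExisting
                && noIntervals;
    }

    // Fails fast instead of silently behaving as if appendToExisting were true.
    static void validate(
            String taskType,
            boolean splittableFirehose,
            boolean appendToExisting,
            List<String> intervals) {
        if (shouldReject(taskType, splittableFirehose, appendToExisting, intervals)) {
            throw new IllegalArgumentException(
                    "index_parallel with appendToExisting=false requires explicit "
                            + "intervals in the granularitySpec");
        }
    }
}
```

   With this shape, a spec that sets appendToExisting to true, or that lists intervals, passes through unchanged; only the surprising combination is rejected.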
   
   While this would be a backwards-incompatible change in 0.14, native batch 
ingestion is still a very new feature and this behavior is very surprising — 
and there's a trivial workaround of setting appendToExisting to true if you 
like the current behavior.
   
   If that's not the right change, we could fix the docs instead. I'd update 
the doc of appendToExisting in native_tasks.md to mention that it is 
effectively true if intervals aren't specified, and the docs of `intervals` in 
ingestion_spec should mention that native parallel tasks depend on them: 
without explicit intervals, the task behaves as if appendToExisting were true.
   
   (I suppose one could also make parallel indexing do two scans in this case, 
but in my case I certainly would have been happier being asked to add one line 
to my spec rather than have my experience take twice as long, and it's more 
complex.)
   
   I'm happy to implement either the new error or the docs update, whichever 
is best.
   Thoughts (@jihoonson)?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

