glasser edited a comment on issue #6989: Behavior of index_parallel with 
appendToExisting=false and no bucketIntervals in GranularitySpec is surprising
URL: 
https://github.com/apache/incubator-druid/issues/6989#issuecomment-460859550
 
 
   What I was missing here is that native batch parallel ingestion effectively 
acts as if appendToExisting is true unless you specify explicit intervals in the 
GranularitySpec.  This seems to be different from both Hadoop batch ingestion 
and the Local Index Task (including `index_parallel` with a non-splittable 
FirehoseFactory) — all of these (if I understand correctly) will run an 
additional phase to calculate the intervals if they are not provided.
   
   I think this is confusing and I'd like to help fix it.
   
   My honest instinct is that we should consider the current behavior a bug and 
we should make the following combination into an error in the top-level 
parallel index task:
   - Running `index_parallel`
   - `FirehoseFactory.isSplittable()`
   - `appendToExisting == false`
   - granularitySpec does not specify intervals
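   
   As a rough illustration of the proposed check, here is a minimal sketch in 
plain Java. It is not Druid's actual code: the class name, method names, and 
parameters are all hypothetical stand-ins for whatever the parallel index task 
really exposes, and it assumes the rejected case is the one from the issue 
title (appendToExisting is false and no intervals are given).

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only; none of these names come from the Druid codebase.
class ParallelIndexSpecCheck {

    // True when the spec hits the combination proposed to become an error:
    // index_parallel, a splittable firehose, appendToExisting == false,
    // and no explicit intervals in the granularitySpec.
    static boolean isRejectedCombination(
            String taskType,
            boolean firehoseSplittable,
            boolean appendToExisting,
            List<String> intervals) {
        boolean noIntervals = intervals == null || intervals.isEmpty();
        return "index_parallel".equals(taskType)
                && firehoseSplittable
                && !appendToExisting
                && noIntervals;
    }

    // Fails fast with a message that points at both possible fixes.
    static void validate(
            String taskType,
            boolean firehoseSplittable,
            boolean appendToExisting,
            List<String> intervals) {
        if (isRejectedCombination(taskType, firehoseSplittable, appendToExisting, intervals)) {
            throw new IllegalArgumentException(
                    "index_parallel with a splittable firehose and appendToExisting=false "
                            + "requires explicit intervals in the granularitySpec; "
                            + "add intervals or set appendToExisting=true");
        }
    }

    public static void main(String[] args) {
        // Explicit intervals: accepted.
        validate("index_parallel", true, false,
                Collections.singletonList("2019-01-01/2019-02-01"));
        // No intervals, but appending was requested explicitly: accepted.
        validate("index_parallel", true, true, Collections.emptyList());
        // No intervals and appendToExisting=false: rejected.
        try {
            validate("index_parallel", true, false, Collections.emptyList());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

   Naming both fixes in the error message keeps the workaround to a one-line 
spec change either way.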
   
   While this would be a backwards-incompatible change in 0.14, native batch 
ingestion is still a very new feature and this behavior is very surprising — 
and there's a trivial workaround of setting appendToExisting to true if you 
like the current behavior.
   
   If that's not the right change, we could fix the docs instead. I'd update 
the doc of appendToExisting in native_tasks.md to mention that it is 
effectively true if intervals aren't specified, and the docs of `intervals` in 
ingestion_spec should mention that native parallel tasks care about them more.
   
   (I suppose one could also make parallel indexing do two scans in this case, 
but in my case I certainly would have been happier being asked to add one line 
to my spec rather than have my experience take twice as long, and it's more 
complex.)
   
   I'm happy to implement either the new error or the docs update based on 
what is best.
   Thoughts (@jihoonson ?)
