gianm edited a comment on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-844772014


   @loquisgon thank you for the well-written proposal.
   
   I think it makes sense to think about improving batch behavior by leveraging 
differences in batch and realtime requirements, so I like the big picture idea.
   
   About structuring the code: there isn't really any perfect way to do it, I 
think. Introducing a flag is best for minimizing code duplication, but if there 
are a lot of differences between the paths, they become tough to track since 
they're mixed together. So separating the classes seems like a good idea. I'd 
avoid a common superclass, since in cases where we have done it (IndexMerger, 
IncrementalIndex) I find the logic really hard to follow. There isn't a clear 
direction of control: sometimes the subclass calls into the superclass, and 
sometimes the superclass calls into the subclass. IMO the best approach is a 
shared "helper" class instead of a shared _superclass_, where control only 
flows in one direction (the main class calls the helper class; not the other 
way around).
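   
   To make the helper-class idea concrete, here is a minimal sketch. Every 
class name in it is hypothetical (none of these are actual Druid classes), and 
the difference shown between the two paths is only an illustrative assumption. 
The point is the direction of control: each main class calls the helper, and 
the helper never calls back.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; none are actual Druid classes.

/** Shared helper: holds common logic and never calls back into its callers. */
final class PersistHelper {
    private PersistHelper() {}

    /** True once the number of buffered rows reaches the limit. */
    static boolean shouldPersist(int bufferedRows, int maxRowsInMemory) {
        return bufferedRows >= maxRowsInMemory;
    }

    /** Stand-in for a real persist step: just describes what was flushed. */
    static String persist(List<String> rows) {
        return "persisted " + rows.size() + " rows";
    }
}

/** Batch path. Illustrative assumption: persisted rows can leave memory. */
final class BatchAppenderatorSketch {
    private final int maxRowsInMemory;
    private final List<String> buffer = new ArrayList<>();
    private int persistCount = 0;

    BatchAppenderatorSketch(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
    }

    void add(String row) {
        buffer.add(row);
        if (PersistHelper.shouldPersist(buffer.size(), maxRowsInMemory)) {
            PersistHelper.persist(buffer);
            buffer.clear(); // nothing needs to stay queryable in this sketch
            persistCount++;
        }
    }

    int rowsInMemory() { return buffer.size(); }
    int persistCount() { return persistCount; }
}

/** Realtime path. Illustrative assumption: persisted rows stay queryable. */
final class RealtimeAppenderatorSketch {
    private final int maxRowsInMemory;
    private final List<String> buffer = new ArrayList<>();
    private final List<String> queryable = new ArrayList<>();

    RealtimeAppenderatorSketch(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
    }

    void add(String row) {
        buffer.add(row);
        if (PersistHelper.shouldPersist(buffer.size(), maxRowsInMemory)) {
            PersistHelper.persist(buffer);
            queryable.addAll(buffer); // keep persisted rows visible to queries
            buffer.clear();
        }
    }

    int queryableRows() { return queryable.size() + buffer.size(); }
}
```

   Neither main class extends anything, so each path can be read top to bottom 
on its own, and the shared logic lives in one place that has no knowledge of 
its callers.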
   
   About performance: how big in bytes was your 1M row test file? It looks like 
it took 60–90 mins to ingest, which seems like a really long time for just 1M 
rows. I'd expect being able to do it orders of magnitude faster than that. Did 
it take a long time because each row is really big, or is it related to the 
fact that there are a lot of segments? (Another way of asking: how long does it 
take to ingest the same 1M rows if the timestamps are adjusted to all be the 
same?) For datasets that worked without error prior to your changes, do your 
changes have a measurable effect on ingestion speed?
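   
   For the same-timestamp experiment, something like the following could 
rewrite the test file before ingesting it. This is only a rough sketch, and it 
assumes things I don't know about your data: that the file is CSV, that the 
timestamp is the first column, and that the first field contains no quoted 
commas. `TimestampFlattener` is a made-up name.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical utility: forces every row of a CSV to share one timestamp. */
final class TimestampFlattener {
    private TimestampFlattener() {}

    /** Replaces the first comma-separated field of a line with fixedTimestamp. */
    static String flattenLine(String line, String fixedTimestamp) {
        int comma = line.indexOf(',');
        // A line with no comma is treated as a lone timestamp field.
        return comma < 0 ? fixedTimestamp : fixedTimestamp + line.substring(comma);
    }

    /** Applies flattenLine to every line, preserving row order. */
    static List<String> flatten(List<String> lines, String fixedTimestamp) {
        return lines.stream()
                .map(line -> flattenLine(line, fixedTimestamp))
                .collect(Collectors.toList());
    }
}
```

   With all rows in one time chunk, the row count and row size stay the same 
and only the number of segments changes, so comparing the two ingestion times 
should separate the two possible causes.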
   
   About future work: would you expect these changes to help with non-dynamic 
partitioning modes? For example, would these changes affect the pre-shuffle 
partial segment generation phase? Would you expect them to help? It would be 
interesting to hear your thoughts about future work in this area.

