jon-wei commented on issue #7900: Develop a new BigIndexer process for running ingestion tasks

URL: https://github.com/apache/incubator-druid/issues/7900#issuecomment-505676072

@dclim

> Do we get query prioritization support? Since the query-serving resources are a shared pool among all the tasks, we'd want to make sure that important queries can take precedence over heavier ones.

Yes, this should be supported.

> What does the query endpoint look like? Will it be shared among all tasks similar to the chat handler (rather than using a different port for each task)? Something like POST http://{bigIndexerHost}:{bigIndexerPort}/druid/v2/{taskId}

I'm thinking it should be shared across all tasks, but the endpoint would just be `http://{bigIndexerHost}:{bigIndexerPort}/druid/v2` without a taskId.

> How do shared lookups work? Are all lookups loaded into memory when the BigIndexer starts, regardless of whether it is running tasks and if those tasks are query-serving? Will a BigIndexer release lookups that haven't been used in a while?

For simplicity in the initial implementation, I'm thinking all lookups should be loaded into memory when the indexer starts. (Re: query-serving, I believe lookups can also be used during ingestion, via transform specs -> lookup expressions.)

> Can the connection/thread pool for chat handlers be separate from the one used for serving queries? For supervised tasks, the supervisor tends to flip out if the task stops responding to control/status requests, and if the pool is shared, I could see some annoying issues where heavy query load to one set of tasks causes the supervisor to kill completely unrelated tasks for being unresponsive.

That sounds like a good idea, I'll look into that.

> To me this feels like a throwback to the current model where every worker is allocated the same memory, whether it needs it or not, and large tasks suffer when run together with a bunch of small tasks. The main advantage I can see for this is that the global spill will not trigger in this case, which prevents unnecessary spills when you have a bunch of small tasks running with some big tasks that would otherwise keep triggering the global spill.

Hm, it was intended as a throwback mode; I'll reconsider whether that's useful to include.

> As another option to consider, what do you think about having a configuration parameter that controls the minimum bytes in memory that needs to be reached before a sink is eligible to be spilled during a global spill? The default value could be something like globalIngestionHeapLimitBytes / druid.worker.capacity + some accounting for merging memory requirements. This would give the flexibility of heavy tasks being able to utilize more than their 'equal share' of memory, while preventing unnecessary fragmentation for lightweight tasks.

This sounds useful, I will look into this as well.

@fjy

> I think we should just call it an indexer process.

That sounds good.
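To make the suggested default concrete, here is a minimal sketch of the spill-eligibility threshold described in the quote above. This is purely illustrative: the function and parameter names (`default_min_spill_bytes`, `merge_overhead_bytes`, etc.) are hypothetical and not part of any actual Druid implementation; only the formula `globalIngestionHeapLimitBytes / druid.worker.capacity` plus a merging allowance comes from the discussion.

```python
# Hypothetical sketch of the proposed spill-eligibility threshold.
# All names here are illustrative, not real Druid configuration keys.

def default_min_spill_bytes(global_heap_limit_bytes: int,
                            worker_capacity: int,
                            merge_overhead_bytes: int = 0) -> int:
    """Default per-sink minimum: an 'equal share' of the global heap limit
    (limit / capacity), plus some accounting for merging memory needs."""
    return global_heap_limit_bytes // worker_capacity + merge_overhead_bytes

def sinks_eligible_for_global_spill(sink_sizes_bytes, min_spill_bytes):
    """During a global spill, only sinks holding at least min_spill_bytes
    are spilled; lightweight sinks are skipped to avoid fragmentation."""
    return [size for size in sink_sizes_bytes if size >= min_spill_bytes]

# Example: a 1 GiB global limit shared by 8 worker slots gives each sink a
# 128 MiB "equal share"; a heavy 500 MiB sink spills, a 16 MiB one does not.
threshold = default_min_spill_bytes(1 << 30, 8)
eligible = sinks_eligible_for_global_spill(
    [16 << 20, 200 << 20, 500 << 20], threshold)
```

The point of the threshold is exactly what the quote describes: a heavy task can grow past its equal share of memory, while small tasks are never force-spilled into many tiny segments during a global spill.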