jon-wei commented on issue #7900: Develop a new BigIndexer process for running ingestion tasks

URL: https://github.com/apache/incubator-druid/issues/7900#issuecomment-505676072

@dclim

> Do we get query prioritization support? Since the query-serving resources are a shared pool among all the tasks, we'd want to make sure that important queries can take precedence over heavier ones.

Yes, this should be supported.

> What does the query endpoint look like? Will it be shared among all tasks similar to the chat handler (rather than using a different port for each task)? Something like POST http://{bigIndexerHost}:{bigIndexerPort}/druid/v2/{taskId}

I'm thinking it should be shared across all tasks, but the endpoint would just be `http://{bigIndexerHost}:{bigIndexerPort}/druid/v2` without a taskId.

> How do shared lookups work? Are all lookups loaded into memory when the BigIndexer starts, regardless of whether it is running tasks and if those tasks are query-serving? Will a BigIndexer release lookups that haven't been used in a while?

For simplicity in the initial implementation, I'm thinking all lookups should be loaded into memory when the indexer starts. (Re: query-serving, I believe lookups can also be used during ingestion, via transform specs -> lookup expressions.)

> Can the connection/thread pool for chat handlers be separate from the one used for serving queries? For supervised tasks, the supervisor tends to flip out if the task stops responding to control/status requests, and if the pool is shared, I could see some annoying issues where heavy query load to one set of tasks causes the supervisor to kill completely unrelated tasks for being unresponsive.

That sounds like a good idea, I'll look into that.

> To me this feels like a throwback to the current model where every worker is allocated the same memory, whether it needs it or not, and large tasks suffer when run together with a bunch of small tasks. The main advantage I can see for this is that the global spill will not trigger in this case, which prevents unnecessary spills when you have a bunch of small tasks running with some big tasks that would otherwise keep triggering the global spill.

Hm, it was intended as a throwback mode; I'll reconsider whether that's useful to include.

> As another option to consider, what do you think about having a configuration parameter that controls the minimum bytes in memory that needs to be reached before a sink is eligible to be spilled during a global spill? The default value could be something like globalIngestionHeapLimitBytes / druid.worker.capacity + some accounting for merging memory requirements. This would give the flexibility of heavy tasks being able to utilize more than their 'equal share' of memory, while preventing unnecessary fragmentation for lightweight tasks.

This sounds useful, I will look into this as well.

@fjy

> I think we should just call it an indexer process.

That sounds good.
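To make the suggested default concrete, here is a minimal sketch of the spill-eligibility threshold described in the quote above. This is purely illustrative: the function and parameter names (`default_min_spill_bytes`, `merge_overhead_bytes`, etc.) are hypothetical and not part of any actual Druid implementation; only the formula `globalIngestionHeapLimitBytes / druid.worker.capacity` plus a merging allowance comes from the discussion.

```python
# Hypothetical sketch of the proposed spill-eligibility threshold.
# All names here are illustrative, not real Druid configuration keys.

def default_min_spill_bytes(global_heap_limit_bytes: int,
                            worker_capacity: int,
                            merge_overhead_bytes: int = 0) -> int:
    """Default per-sink minimum: an 'equal share' of the global heap limit
    (limit / capacity), plus some accounting for merging memory needs."""
    return global_heap_limit_bytes // worker_capacity + merge_overhead_bytes

def sinks_eligible_for_global_spill(sink_sizes_bytes, min_spill_bytes):
    """During a global spill, only sinks holding at least min_spill_bytes
    are spilled; lightweight sinks are skipped to avoid fragmentation."""
    return [size for size in sink_sizes_bytes if size >= min_spill_bytes]

# Example: a 1 GiB global limit shared by 8 worker slots gives each sink a
# 128 MiB "equal share"; a heavy 500 MiB sink spills, a 16 MiB one does not.
threshold = default_min_spill_bytes(1 << 30, 8)
eligible = sinks_eligible_for_global_spill(
    [16 << 20, 200 << 20, 500 << 20], threshold)
```

The point of the threshold is exactly what the quote describes: a heavy task can grow past its equal share of memory, while small tasks are never force-spilled into many tiny segments during a global spill.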