[ https://issues.apache.org/jira/browse/GOBBLIN-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Yang reopened GOBBLIN-336:
------------------------------
More changes are needed.
> Gobblin Cluster Job Isolation
> -----------------------------
>
> Key: GOBBLIN-336
> URL: https://issues.apache.org/jira/browse/GOBBLIN-336
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: Ray Yang
>
> Gobblin cluster runs Gobblin jobs. Each cluster worker host runs jobs in a
> thread pool in a single JVM. The thread pool is reused for subsequent jobs
> after earlier jobs finish.
> Gobblin cluster recently ran into issues with resource leakage. The cluster
> would fail all job executions when certain resources, such as threads, were
> exhausted. To recover, the whole cluster had to be restarted and the jobs
> retried. With the expected increase in the number of jobs executed, such
> errors will happen more frequently. We have identified the causes and
> verified the fixes. However, there are concerns that similar, as yet unknown
> bugs may show up later and bring the whole cluster down.
> In general, any bug in one job’s code may affect the execution of another
> job, since they run in the same JVM. It’s also possible that a bug is only
> triggered by certain input data that is specific to a subset of jobs.
> The cluster will be more robust if each job execution is better isolated
> from other jobs.
> We expect jobs to become more diverse as more use cases are onboarded, so
> the need for job isolation will grow over time. Job isolation may
> eventually be required for security reasons too.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)