Nevena Veljkovic created GRIFFIN-293:
----------------------------------------
Summary: [Service] livy.need.queue=true
Key: GRIFFIN-293
URL: https://issues.apache.org/jira/browse/GRIFFIN-293
Project: Griffin
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Nevena Veljkovic
Fix For: 0.6.0
While using Griffin in several production environments, with on the order of
10 jobs scheduled to start at the same hour, minute, and second, we found that
when 2 (or more) Griffin jobs start concurrently, they are not all submitted
and executed to completion: the last job was submitted multiple times and the
others were never submitted.
Example:
Two jobs, "beta_node_metrics_fact" and "beta_node_master_dimension_device",
were triggered 1 millisecond apart:
{code:java}
2019-09-28 14:00:37.090 INFO 2732 --- [ryBean_Worker-4]
o.a.g.c.j.SparkSubmitJob [203] : {
"measure.type" : "griffin",
"id" : 60560,
"name" : "beta_node_metrics_fact",
2019-09-28 14:00:37.091 INFO 2732 --- [ryBean_Worker-5]
o.a.g.c.j.SparkSubmitJob [203] : {
"measure.type" : "griffin",
"id" : 63751,
"name" : "beta_node_master_dimension_device",
{code}
Livy submitted 2 jobs/tasks, and both contained "beta_node_master_dimension_device".
That is why we decided to use the setting "livy.need.queue=true".
During testing we found that queueing did not work at all, because
LivyTaskSubmitHelper's member sparkSubmitJob was never instantiated:
[https://github.com/apache/griffin/blob/master/service/src/main/java/org/apache/griffin/core/job/LivyTaskSubmitHelper.java#L64]
We fixed this and continued testing.
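As a rough illustration of how this kind of "member was never instantiated" bug can be ruled out, here is a minimal sketch in plain Java. It is not Griffin's code: the class QueueHelperSketch and its Consumer-based submitter are hypothetical stand-ins for LivyTaskSubmitHelper and sparkSubmitJob. The idea is that a collaborator passed through the constructor and null-checked there fails at construction time rather than later, when the first queued task is processed.

```java
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical sketch, not Griffin's LivyTaskSubmitHelper.
// The submitter stands in for the sparkSubmitJob member.
class QueueHelperSketch {
    private final Consumer<String> submitter;

    // Requiring the collaborator in the constructor means the helper
    // cannot exist in a state where the submitter is missing.
    QueueHelperSketch(Consumer<String> submitter) {
        this.submitter = Objects.requireNonNull(submitter, "submitter must be set");
    }

    // Hands one queued job description to the submitter.
    void submit(String jobJson) {
        submitter.accept(jobJson);
    }
}
```

With this shape, constructing the helper without a submitter throws a NullPointerException immediately, so the problem surfaces at startup instead of silently disabling the queue.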
During further testing we found that curConcurrentTaskNum is not decremented
when tasks finish (state SUCCESS or DEAD):
[https://github.com/apache/griffin/blob/master/service/src/main/java/org/apache/griffin/core/job/JobServiceImpl.java#L632-L633]
We fixed this as well.
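The counter behaviour described above can be sketched as follows. This is a simplified, hypothetical model and not the actual Griffin implementation: the class BoundedSubmitQueue, its method names, and the in-memory "submitted" list are assumptions made for illustration. Only the counter name curConcurrentTaskNum and the terminal states SUCCESS and DEAD come from this report. The point is that the counter must go down when a task reaches a terminal state, otherwise waiting tasks are never released.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a bounded submit queue; not Griffin's code.
class BoundedSubmitQueue {
    private final int maxConcurrent;
    private final Queue<String> waiting = new ArrayDeque<>();
    private final Set<String> running = new HashSet<>();
    private final List<String> submitted = new ArrayList<>();
    private int curConcurrentTaskNum = 0;

    BoundedSubmitQueue(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Queues a job and submits it right away if a slot is free.
    synchronized void enqueue(String jobName) {
        waiting.add(jobName);
        drain();
    }

    // Called when a task's state is polled. Terminal states free a slot.
    synchronized void onStateChange(String jobName, String state) {
        boolean terminal = "SUCCESS".equals(state) || "DEAD".equals(state);
        // running.remove() guards against decrementing twice for one task.
        if (terminal && running.remove(jobName)) {
            curConcurrentTaskNum--;
            drain();
        }
    }

    // Moves waiting jobs to "submitted" while slots are available.
    private void drain() {
        while (curConcurrentTaskNum < maxConcurrent && !waiting.isEmpty()) {
            String next = waiting.poll();
            running.add(next);
            submitted.add(next); // a real helper would call Livy here
            curConcurrentTaskNum++;
        }
    }

    synchronized int runningCount() {
        return curConcurrentTaskNum;
    }

    synchronized List<String> submittedJobs() {
        return new ArrayList<>(submitted);
    }
}
```

With a limit of 1, enqueueing both jobs from the example submits only "beta_node_metrics_fact" at first; the second job is released once the first is reported as SUCCESS or DEAD. Without the decrement, the second job would wait forever.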