Nevena Veljkovic created GRIFFIN-293:
----------------------------------------

             Summary: [Service] livy.need.queue=true
                 Key: GRIFFIN-293
                 URL: https://issues.apache.org/jira/browse/GRIFFIN-293
             Project: Griffin
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Nevena Veljkovic
             Fix For: 0.6.0


While running Griffin in several production environments, with tens of jobs 
scheduled to start at the same hour, minute, and second, we found that when 2 
(or more) Griffin jobs start concurrently, not all of them are submitted and 
run to completion: the last job was submitted multiple times and the rest were 
never submitted.

Example: two jobs, "beta_node_metrics_fact" and 
"beta_node_master_dimension_device", were triggered 1 millisecond apart:
{code:java}
2019-09-28 14:00:37.090 INFO 2732 --- [ryBean_Worker-4] o.a.g.c.j.SparkSubmitJob [203] : {
 "measure.type" : "griffin",
 "id" : 60560,
 "name" : "beta_node_metrics_fact",

2019-09-28 14:00:37.091 INFO 2732 --- [ryBean_Worker-5] o.a.g.c.j.SparkSubmitJob [203] : {
 "measure.type" : "griffin",
 "id" : 63751,
 "name" : "beta_node_master_dimension_device",
{code}
Livy submitted 2 jobs/tasks, but both contained "beta_node_master_dimension_device".

That is why we decided to use the setting "livy.need.queue=true".
During testing we found that queueing does not work at all, because 
LivyTaskSubmitHelper's member sparkSubmitJob was never instantiated:
 
[https://github.com/apache/griffin/blob/master/service/src/main/java/org/apache/griffin/core/job/LivyTaskSubmitHelper.java#L64]

We fixed this and continued testing.
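
The report does not show how the member was fixed. As a minimal sketch of one 
way to rule out this class of bug, the collaborator can be passed through the 
constructor and null-checked, so a missing dependency fails at startup instead 
of silently disabling the queue. The names Submitter and QueueingSubmitHelper 
are hypothetical and are not Griffin classes.

```java
import java.util.Objects;

// Hypothetical stand-in for the job that actually posts to Livy.
interface Submitter {
    String submit(String jobName);
}

// Hypothetical helper, not Griffin's LivyTaskSubmitHelper.
// The collaborator is a final field assigned in the constructor,
// so it cannot be left null the way an unassigned member can.
final class QueueingSubmitHelper {
    private final Submitter submitter;

    QueueingSubmitHelper(Submitter submitter) {
        // Fail fast if the dependency was never provided.
        this.submitter = Objects.requireNonNull(submitter, "submitter must not be null");
    }

    // Delegates to the collaborator; a real helper would enqueue first.
    String submitQueued(String jobName) {
        return submitter.submit(jobName);
    }
}
```

With this shape, constructing the helper without a submitter throws a 
NullPointerException immediately, rather than failing later when the first 
queued task is taken.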

During testing we found that curConcurrentTaskNum is not decremented when 
tasks finish (state SUCCESS or DEAD):
 
[https://github.com/apache/griffin/blob/master/service/src/main/java/org/apache/griffin/core/job/JobServiceImpl.java#L632-L633]

We fixed this also.
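
The intended counter behaviour can be sketched as follows. This is an 
illustration under assumptions, not the actual patch: ConcurrentTaskLimiter, 
TaskState, and the method names are hypothetical, and only the field name 
curConcurrentTaskNum and the terminal states SUCCESS and DEAD come from this 
report. A slot is reserved before a task is submitted and released when the 
task reaches a terminal state, so queued tasks can eventually run.

```java
import java.util.concurrent.atomic.AtomicInteger;

// SUCCESS and DEAD are the terminal states named in the report;
// the other values are illustrative.
enum TaskState { STARTING, RUNNING, SUCCESS, DEAD }

// Hypothetical limiter, not Griffin's actual class.
final class ConcurrentTaskLimiter {
    private final int maxConcurrent;
    private final AtomicInteger curConcurrentTaskNum = new AtomicInteger(0);

    ConcurrentTaskLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Reserve a slot if one is free; returns false when the limit is reached.
    boolean tryAcquire() {
        while (true) {
            int cur = curConcurrentTaskNum.get();
            if (cur >= maxConcurrent) {
                return false;
            }
            if (curConcurrentTaskNum.compareAndSet(cur, cur + 1)) {
                return true;
            }
        }
    }

    // Release the slot once a task reaches a terminal state.
    // The caller must report each task's terminal state only once.
    void onStateChange(TaskState state) {
        if (state == TaskState.SUCCESS || state == TaskState.DEAD) {
            // Never drop below zero, even on a stray extra notification.
            curConcurrentTaskNum.updateAndGet(n -> n > 0 ? n - 1 : 0);
        }
    }

    int current() {
        return curConcurrentTaskNum.get();
    }
}
```

Without the release step in onStateChange, the counter only grows, and once it 
reaches the limit every later task stays queued indefinitely.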

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
