Andras Piros created OOZIE-3336:
-----------------------------------

             Summary: [persistence] Refactor entity classes to feature PK, FK, 
and UQ constraints
                 Key: OOZIE-3336
                 URL: https://issues.apache.org/jira/browse/OOZIE-3336
             Project: Oozie
          Issue Type: Improvement
          Components: core
    Affects Versions: 5.0.0
            Reporter: Andras Piros
             Fix For: 5.2.0


When an Oozie database grows substantially, say beyond a few hundred 
thousand {{WorkflowActionBean}} and {{CoordinatorActionBean}} instances, we 
face a couple of performance issues. Here is an analysis of why.

The current Oozie JPA {{@Entity}} usage, and the resulting database DDL, suffer 
from a couple of drawbacks from a performance point of view:
* {{@Id}} fields are {{String}}:
** leaving little room for database primary key indices to work effectively
** those values are calculated for {{WorkflowActionBean}}, 
{{CoordinatorActionBean}}, and {{BundleActionBean}} instances
* no foreign key constraint is set from {{WorkflowActionBean}} to 
{{WorkflowJobBean}}, from {{CoordinatorActionBean}} to {{CoordinatorJobBean}}, 
or from {{BundleActionBean}} to {{BundleJobBean}} instances:
** JPA queries that discover parent-child relationships have to be assessed by 
hand
** no database indices are created for these relationships, and hence queries 
that contain a {{JOIN}} are slower
* no use of unique constraints whatsoever
* JPA queries are created by hand instead of relying on OpenJPA
* JPA entities are filled by hand instead of relying on OpenJPA

The following enhancements are necessary:
# keeping the existing {{String compositeId}} fields, let's break down their 
contents into the following new fields:
## {{@Id long id}} - an auto-increment value that is unique across the Oozie 
database
## {{long currentSequence}} - the sequence number of the current run since the 
last Oozie server restart. The first part of the {{compositeId}}
## {{Timestamp serverStartupTimestamp}} - the timestamp when the Oozie server 
was last started. The second part of the {{compositeId}}
## {{String serverName}} - the third part of the {{compositeId}}
## {{String name}} - the fourth and last part of the {{compositeId}}
## {{compositeId}} might be calculated when an entity is loaded / persisted, 
and then stored
# FK constraints:
## {{@OneToMany}} fields where we have a list of child references inside parent
## {{@ManyToOne}} fields where we have a parent reference inside child
## pay attention to {{FetchType}}, most of the time {{LAZY}} will be needed
## the containment fields should not be {{@Transient}} anymore
# UQ constraints:
## on {{currentSequence}} and {{serverStartupTimestamp}}
## on {{currentSequence}} and {{name}}
# new JPQL queries:
## to cover changed parent-child relationships
## to make use of each disassembled part of {{originalId}}, e.g. when 
filtering
# let JPA fill entities instead of performing this by hand
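
As an illustration of the {{compositeId}} breakdown described above, here is a 
minimal plain-Java sketch. It is not Oozie code: the class name 
{{CompositeIdParts}} is hypothetical, and the string layout 
({{<sequence>-<startupTimestamp>-<serverName>-<name>}} with a 7-digit 
zero-padded sequence) is an assumption made only for this example. The 
timestamp is kept as raw text, and converting it to a {{Timestamp}} is left 
out. JPA annotations are omitted so that the sketch compiles without extra 
dependencies.

```java
import java.util.Objects;

/** Hypothetical value object holding the disassembled parts; not an Oozie class. */
final class CompositeIdParts {
    final long currentSequence;
    final String serverStartupTimestamp; // raw digits; conversion to Timestamp omitted
    final String serverName;
    final String name;

    CompositeIdParts(long currentSequence, String serverStartupTimestamp,
                     String serverName, String name) {
        this.currentSequence = currentSequence;
        this.serverStartupTimestamp = Objects.requireNonNull(serverStartupTimestamp);
        this.serverName = Objects.requireNonNull(serverName);
        this.name = Objects.requireNonNull(name);
    }

    /** Rebuilds the composite string under the assumed layout (7-digit padded sequence). */
    String toCompositeId() {
        return String.format("%07d-%s-%s-%s",
                currentSequence, serverStartupTimestamp, serverName, name);
    }

    /**
     * Splits on the first two hyphens and on the last one,
     * so that serverName may itself contain hyphens.
     */
    static CompositeIdParts parse(String compositeId) {
        int first = compositeId.indexOf('-');
        int second = compositeId.indexOf('-', first + 1);
        int last = compositeId.lastIndexOf('-');
        if (first < 0 || second < 0 || last <= second) {
            throw new IllegalArgumentException("Unexpected composite id: " + compositeId);
        }
        return new CompositeIdParts(
                Long.parseLong(compositeId.substring(0, first)),
                compositeId.substring(first + 1, second),
                compositeId.substring(second + 1, last),
                compositeId.substring(last + 1));
    }
}
```

With the parts stored as separate columns, filters and unique constraints can 
target {{currentSequence}} or {{serverStartupTimestamp}} directly, while 
{{toCompositeId()}} shows how the stored {{compositeId}} could be recalculated 
when an entity is loaded or persisted.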

The following enhancements can be considered nice-to-have:
* upgrade to an OpenJPA version that features JPA 2.1's composite indexing 
capability
* see whether an optimistic locking field using {{@Version}}, instead of 
ZooKeeper-based pessimistic locking, would improve High Availability 
characteristics
* also refactor SLA-related entity classes
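
To make the optimistic locking idea concrete, the following plain-Java sketch 
simulates the version check that a {{@Version}} field implies: an update only 
succeeds if the version read earlier is still the current one. This is an 
in-memory illustration only, not JPA and not Oozie code. The class 
{{VersionedStatus}} and its methods are hypothetical names chosen for this 
example.

```java
/** Hypothetical in-memory stand-in for a row guarded by a version column. */
final class VersionedStatus {
    private long version = 0L;
    private String status;

    VersionedStatus(String initialStatus) {
        this.status = initialStatus;
    }

    synchronized long version() {
        return version;
    }

    synchronized String status() {
        return status;
    }

    /**
     * Applies the update only if nobody else has modified the value since
     * the caller read expectedVersion; otherwise reports a conflict.
     */
    synchronized boolean update(long expectedVersion, String newStatus) {
        if (expectedVersion != version) {
            return false; // stale read: caller should reload and retry
        }
        status = newStatus;
        version++;
        return true;
    }
}
```

Two writers that both read version 0 would conflict on the second update: the 
first call succeeds and bumps the version to 1, while the second is rejected 
and has to reload and retry. No lock is held between the read and the write.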

It's necessary to have performance benchmarks with some database types like 
MySQL/MariaDB and PostgreSQL, before and after the changes, for the following 
use cases:
* {{CoordinatorJobBean}} and {{WorkflowJobBean}} instances up to millions
* {{CoordinatorActionBean}} and {{WorkflowActionBean}} instances up to tens of 
millions
* performance for JPQLs that get a list of entities
* performance of persisting a new entity
* performance of querying lists of entities based on popular / possible filters 
like the ones used by {{VxJobsServlet}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
