[
https://issues.apache.org/jira/browse/USERGRID-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Nine updated USERGRID-324:
-------------------------------
Description:
We need a system that allows us to build distributed/parallel data processing
flows. Ultimately, this system must meet the following requirements at a
component level.
# Work well within Reactive/Streams logic
# Allow easy operationalization. No additional external systems should be
required.
# Allow failover during node failure, and automatic recovery. Processing should
not stop and restart because of a failure; it should resume where it left off
(see the sketch after this list).
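As a rough illustration of the resume requirement, here is a minimal sketch of
a checkpointed flow, assuming RxJava (in the spirit of the Reactive/Streams
requirement) and a hypothetical {{CheckpointStore}}; none of these names are
existing Usergrid APIs:
{code:java}
import rx.Observable;

import java.util.List;

public class ResumableFlow {

    /** Hypothetical durable checkpoint store (Cassandra, for example). */
    interface CheckpointStore {
        String lastProcessed(String flowId);          // null when starting fresh
        void markProcessed(String flowId, String id);
    }

    /**
     * Processes ids in order, skipping everything up to and including the
     * last checkpoint, so a node that takes over after a failure resumes
     * the flow instead of restarting it.
     */
    static void run(String flowId, List<String> ids, CheckpointStore checkpoints) {
        final String resumeFrom = checkpoints.lastProcessed(flowId);

        Observable.from(ids)
            // skip ids handled before the failure
            .skipWhile(id -> resumeFrom != null && !id.equals(resumeFrom))
            .filter(id -> !id.equals(resumeFrom))
            .subscribe(id -> {
                process(id);                            // the real unit of work
                checkpoints.markProcessed(flowId, id);  // checkpoint each unit
            });
    }

    static void process(String id) {
        // migrate / index / delete the entity here
    }
}
{code}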
Some examples are the following.
# Migrations
# Import/Export
# Distributed Indexing on heavily connected entities
# Post processing deletes
# Collection deletes
# Application deletes
I have the following implementation requirements.
# You can define a deployment topology and limit the number of sub-processes in
the workflow
# Ability to reject requests when there is no capacity (see the sketch after
this list)
# Preferably, do not introduce another dependency (such as ZooKeeper); deploy
it within the stack WAR file
# An easy, intuitive interface for programming flows which will work in a
single-node or clustered environment
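For the capacity requirement, one way to reject work without introducing any
external system is a bounded JDK executor; a minimal sketch (the pool and queue
sizes are illustrative and would come from the deployment topology):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkPool {

    // 4 workers and at most 100 queued tasks; AbortPolicy makes execute()
    // throw instead of blocking when the queue is full
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 4, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(100),
        new ThreadPoolExecutor.AbortPolicy());

    /**
     * Returns false when this node has no capacity, so the caller can fail
     * fast or route the work to another node.
     */
    public boolean trySubmit(Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }
}
{code}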
h2. Examples
h3. Reindex all entities in the system.
# Launch a root process. This process emits all application ids within the
system.
# Child processes receive the application id. For each app, create the index
in Elasticsearch, then emit all collections.
# Child processes receive the collections and app ids. For each collection,
emit the entity ids.
# Child processes receive the app, collection, and id. For each entity, get
its edges and re-index the documents within Elasticsearch (see the sketch
below).
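A minimal sketch of that fan-out as a single reactive pipeline, assuming
RxJava; the data-access helpers ({{loadApplicationIds}}, {{loadCollections}},
{{loadEntityIds}}, {{indexEntity}}) are hypothetical stand-ins, not existing
Usergrid APIs, and per-app index creation is omitted for brevity. In a
clustered deployment each flatMap level would run as a child process, possibly
on another node:
{code:java}
import rx.Observable;
import rx.schedulers.Schedulers;

public class ReindexFlow {

    void reindexAll() {
        loadApplicationIds()                             // root: emit all app ids
            .flatMap(appId -> loadCollections(appId)     // per app: emit collections
                .flatMap(collection ->
                    loadEntityIds(appId, collection)     // per collection: emit entity ids
                        .observeOn(Schedulers.io())      // index on IO threads
                        .doOnNext(entityId ->
                            indexEntity(appId, collection, entityId))))
            .toBlocking()
            .lastOrDefault(null);                        // drain the whole flow
    }

    // Hypothetical data-access helpers; each would page through Cassandra.
    Observable<String> loadApplicationIds() { return Observable.empty(); }
    Observable<String> loadCollections(String appId) { return Observable.empty(); }
    Observable<String> loadEntityIds(String appId, String collection) { return Observable.empty(); }

    void indexEntity(String appId, String collection, String entityId) {
        // load the entity's edges and write its documents to Elasticsearch
    }
}
{code}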
h3. Delete a collection
Realtime -> Update the collection alias to point to a new internal collection
name, then fire the delete-collection task.
Job process:
# Launch root process. Load the previous collection name and emit it to 2 child
tasks (see the sketch below).
# Child task 1: Remove every entity of the previous type from Elasticsearch
using bulk deletes until empty.
# Child task 2: Iterate every entity and remove it from Cassandra, as well as
its graph edges.
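A minimal sketch of both child tasks running in parallel, with the job
completing only when both finish; again this assumes RxJava, and
{{bulkDeleteFromIndex}} / {{deleteEntitiesAndEdges}} are hypothetical helpers
for the Elasticsearch and Cassandra sides:
{code:java}
import rx.Observable;
import rx.schedulers.Schedulers;

public class DeleteCollectionJob {

    void run(String previousCollectionName) {
        // Child task 1: repeat bulk deletes against Elasticsearch until a
        // pass removes nothing.
        Observable<Long> esDelete = Observable
            .fromCallable(() -> bulkDeleteFromIndex(previousCollectionName))
            .repeat()
            .takeWhile(deleted -> deleted > 0)
            .subscribeOn(Schedulers.io());

        // Child task 2: walk every entity, removing it and its graph edges
        // from Cassandra.
        Observable<Long> cassandraDelete = Observable
            .fromCallable(() -> deleteEntitiesAndEdges(previousCollectionName))
            .subscribeOn(Schedulers.io());

        // The job is done only when both children complete.
        Observable.merge(esDelete, cassandraDelete)
            .toBlocking()
            .lastOrDefault(null);
    }

    // Hypothetical helpers returning how many items a pass removed.
    long bulkDeleteFromIndex(String collection) { return 0L; }
    long deleteEntitiesAndEdges(String collection) { return 0L; }
}
{code}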
> [SPIKE] Prototype a few distributed realtime parallel processing systems
> ------------------------------------------------------------------------
>
> Key: USERGRID-324
> URL: https://issues.apache.org/jira/browse/USERGRID-324
> Project: Usergrid
> Issue Type: Story
> Reporter: Todd Nine
> Assignee: Todd Nine
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)