[ 
https://issues.apache.org/jira/browse/BLUR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000470#comment-15000470
 ] 

Aaron McCurry commented on BLUR-445:
------------------------------------

First, I would add a timestamp to the Document/Row and SubDocument/Record 
objects so that the most recent data wins when multiple mutates to the same 
data occur in rapid succession.  Next, the index manager daemon would read 
properly formed data mutations from data sources (files, directories, queues, 
etc.).  Then the index manager would run the necessary MR job (or whatever 
other processing technology is used) to create index deltas.  Those index 
deltas would then be merged into the indexes (this is a change; currently the 
shard servers do this) and committed by creating an HDFS snapshot.  Once the 
snapshot is committed, the shard servers would move to the newly committed 
snapshot of the indexes for the given table.  After all the shard servers 
have moved to serving the indexes in the new snapshot, old HDFS snapshots 
could be removed.
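
As a rough illustration of the commit-and-cleanup rule described above, here 
is a minimal in-memory sketch.  It is not Blur code and does not call HDFS; 
the class name, method names, and snapshot names are all hypothetical.  It 
only models the rule that an old snapshot becomes removable once every shard 
server has moved off it.

```python
class SnapshotCoordinator:
    """Hypothetical model of the index manager's snapshot bookkeeping."""

    def __init__(self, shard_servers):
        self.snapshots = []  # committed snapshot names, oldest first
        # Snapshot each shard server is currently serving (None = nothing yet).
        self.serving = {server: None for server in shard_servers}

    def commit(self, name):
        """Record a newly committed snapshot (stands in for an HDFS snapshot)."""
        self.snapshots.append(name)

    def move(self, server):
        """Switch one shard server to the newest committed snapshot."""
        if not self.snapshots:
            raise RuntimeError("no committed snapshot to move to")
        self.serving[server] = self.snapshots[-1]

    def cleanup(self):
        """Drop old snapshots that no shard server serves; return their names."""
        if not self.snapshots:
            return []
        latest = self.snapshots[-1]
        in_use = set(self.serving.values())
        removable = [s for s in self.snapshots
                     if s != latest and s not in in_use]
        self.snapshots = [s for s in self.snapshots if s not in removable]
        return removable


coordinator = SnapshotCoordinator(["shard-a", "shard-b"])
coordinator.commit("snap-1")
coordinator.move("shard-a")
coordinator.move("shard-b")

coordinator.commit("snap-2")
coordinator.move("shard-a")
still_needed = coordinator.cleanup()   # shard-b is still serving snap-1

coordinator.move("shard-b")
removed = coordinator.cleanup()        # now every server is on snap-2
```

In this sketch the newest snapshot is never removed, even before any server 
has moved to it, so a slow shard server always has a committed target.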

Mutates could still be achieved in much the same way they are today, via 
Kafka (or something similar), but the data would not be readable until the 
index manager brought it online.  This is why the timestamp is needed for 
updates.
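
To make the "most recent data wins" idea concrete, here is a small sketch of 
how queued mutations might be collapsed by timestamp before a bulk update.  
The RecordMutation type and its field names are assumptions made for 
illustration, not the Blur thrift types, and the tie-breaking rule (the first 
mutation seen wins on equal timestamps) is likewise an arbitrary choice.

```python
from dataclasses import dataclass, field


@dataclass
class RecordMutation:
    """Hypothetical queued mutation for one Record within a Row."""
    row_id: str
    record_id: str
    timestamp: int               # e.g. epoch millis assigned by the producer
    fields: dict = field(default_factory=dict)


def latest_wins(mutations):
    """Keep only the newest mutation per (row_id, record_id).

    Arrival order does not matter: a late-arriving mutation with an older
    timestamp never overwrites a newer one.
    """
    latest = {}
    for mutation in mutations:
        key = (mutation.row_id, mutation.record_id)
        current = latest.get(key)
        if current is None or mutation.timestamp > current.timestamp:
            latest[key] = mutation
    return latest


queued = [
    RecordMutation("row1", "rec1", 200, {"name": "new"}),
    RecordMutation("row1", "rec1", 100, {"name": "old"}),  # arrives late
    RecordMutation("row1", "rec2", 150, {"name": "other"}),
]
resolved = latest_wins(queued)
```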

Also, if the amount of data being ingested is very small, the index manager 
could update the indexes directly, skipping the bulk update.  This would 
allow more timely updates.
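
One way to picture that decision is a simple size threshold.  The function 
name and the 64 MB default below are purely hypothetical; nothing in this 
issue specifies what "very small" means or how the index manager would 
measure it.

```python
def choose_update_path(pending_bytes, direct_limit_bytes=64 * 1024 * 1024):
    """Pick "direct" for small batches of mutations and "bulk" otherwise.

    direct_limit_bytes is an assumed tuning knob, not a Blur setting.
    """
    if pending_bytes < 0:
        raise ValueError("pending_bytes must be non-negative")
    return "direct" if pending_bytes <= direct_limit_bytes else "bulk"


small_batch = choose_update_path(1024)            # a few records
large_batch = choose_update_path(10 * 1024 ** 3)  # roughly 10 GB of mutations
```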

> Remove online mutates from the Blur thrift api
> ----------------------------------------------
>
>                 Key: BLUR-445
>                 URL: https://issues.apache.org/jira/browse/BLUR-445
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>
> The primary use case for Blur is for massive ingestion of information to be 
> indexed and searched.  Currently I believe the system has been made overly 
> complex due to the atomic operations in the online index mutation system.  It 
> forces the shard servers to have writers open to each of the indexes in the 
> given table, which requires a lot of memory, CPU, and file resources per shard.
> Currently the system only allows for mutates to be atomic when mutating a 
> single row.  Batch mutates are not atomic.
> I propose that we move all index mutations to the bulk indexing approach and 
> utilize HDFS snapshots for committing index information within a given table.  
> This will allow the controller and shard servers to become read-only with 
> respect to the indexes.
> Assuming we move forward with this approach, a new daemon will need to be 
> created: an index manager.  This daemon will coordinate indexing (MR, Spark, 
> Tez, Flink, etc.) and merging globally for the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
