[jira] [Commented] (BLUR-132) Create Index Snapshots

Aaron McCurry (JIRA) Tue, 13 Aug 2013 03:56:55 -0700

    [ https://issues.apache.org/jira/browse/BLUR-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738082#comment-13738082 ]


Aaron McCurry commented on BLUR-132:
------------------------------------

Thanks for the writeup!

-We can provide the backup in multiple ways:
-     1. Back up all shards (all tables) on a shard server onto the local
-        filesystem (cluster/table/shard/files). While restoring from backup,
-        every shard server reads from its local filesystem and copies the
-        shards onto HDFS.
-     2. Back up all shards on all shard servers onto a common HDFS location.
-        While restoring we would partition the shards onto shard servers.

I will offer a third option.
3. Snapshot the index and don't make a copy of the data.  We could simply leave 
the files in the shard directory that are referenced in the snapshot and not 
allow the IndexDeletionPolicy to remove the files.
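
As a rough illustration of option 3, the sketch below models a deletion policy 
that keeps every file referenced by a snapshotted commit. It is a plain-Java 
toy model, not Blur or Lucene code: the class and method names 
(SnapshotRetentionModel, commit, snapshot, release, prune) are hypothetical. 
In Lucene itself, the SnapshotDeletionPolicy class mentioned later in this 
thread fills this role.

```java
import java.util.*;

// Toy model (NOT the Lucene API): each commit lists the files it references.
// A file becomes deletable only when no retained commit references it.
class SnapshotRetentionModel {
    private final TreeMap<Long, Set<String>> commits = new TreeMap<>();
    private final Set<Long> snapshotted = new HashSet<>();
    private long nextGeneration = 1;

    // Record a new commit point and return its generation number.
    long commit(Collection<String> files) {
        long gen = nextGeneration++;
        commits.put(gen, new TreeSet<>(files));
        return gen;
    }

    // Pin the most recent commit so that its files survive later commits.
    long snapshot() {
        if (commits.isEmpty()) {
            throw new IllegalStateException("no commit to snapshot");
        }
        long gen = commits.lastKey();
        snapshotted.add(gen);
        return gen;
    }

    // Unpin a commit; its files may be removed by the next prune().
    void release(long gen) {
        snapshotted.remove(gen);
    }

    // Drop every commit that is neither the latest nor snapshotted, and
    // return the files that are no longer referenced by any retained commit.
    Set<String> prune() {
        Set<String> before = liveFiles();
        long latest = commits.isEmpty() ? -1 : commits.lastKey();
        commits.keySet().removeIf(g -> g != latest && !snapshotted.contains(g));
        Set<String> deletable = new TreeSet<>(before);
        deletable.removeAll(liveFiles());
        return deletable;
    }

    // All files referenced by the commits that are currently retained.
    Set<String> liveFiles() {
        Set<String> live = new TreeSet<>();
        commits.values().forEach(live::addAll);
        return live;
    }
}
```

In this model no data is copied: a snapshot is only a pinned generation 
number, and the files stay in the shard directory until the snapshot is 
released.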

In general I didn't really consider snapshots a way of creating a backup but 
rather a known state of the index that was light and fast to create.  Although 
I think that you would have to have snapshots to allow a backup to work 
correctly.  So perhaps we should create another task to actually do the copying 
of a snapshot for backup.

-We should also have some mechanism to restore from a backup. For us to restore 
the index from a backup, we might as well need a point-in-time copy of all the 
table descriptors. 

Agreed, we should be able to restore to a previous snapshot.  A possible 
prerequisite for this will be the ability to change a table to read-only 
while it's running, so that we can close and reopen the IndexWriter on each 
shard.  This could also be broken out into another task after snapshots are 
created.

-How are we planning to expose this snapshot functionality (Shell, API, BOTH)? 

Both; the shell just uses the API.

-Where are we even using LocalIndexServer?

We don't use it anywhere; it's legacy code that probably should be removed.

- I was able to take a backup by wrapping IndexDeletionPolicy with 
SnapshotDeletionPolicy and then take a snapshot and copy all the files to a 
local file system. This technically works even if the index is being actively 
updated, but the way in which the code is structured 
(DistributeIndexServer.openShard), we would only get a BlurIndexReader when the 
shard is being updated. But the sample code I have below is using the writer to 
take the snapshot. Maybe there is a different way? 

There were three reasons I wanted to create snapshots.
1. Primary - Create a static view of the index so that MapReduce jobs (or other 
external systems) could open and use the indexes (from a snapshot) and they 
would not be changing while they were being used.
2. Create the ability to snapshot commit points through time so that if I 
needed to roll back to a certain point and drop all the data after that 
point, I could.
3. Low priority - Run a shard off a certain commit point and allow the snapshot 
commit point to be changed to any other snapshot as well as the head of the 
index.
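
The first two goals in this list can be sketched with a small plain-Java 
model. This is an illustration only, not Blur or Lucene code: 
CommitHistoryModel and its methods are hypothetical names, and each commit 
simply stores the full list of documents visible at that point.

```java
import java.util.*;

// Toy model (NOT the Lucene API) of named snapshots over a commit history.
class CommitHistoryModel {
    // commits.get(i) holds the documents visible at generation i + 1.
    private final List<List<String>> commits = new ArrayList<>();
    private final Map<String, Integer> snapshots = new HashMap<>();

    // Record a commit point and return its generation (1-based).
    int commit(List<String> visibleDocs) {
        commits.add(Collections.unmodifiableList(new ArrayList<>(visibleDocs)));
        return commits.size();
    }

    // Give the latest commit point a name.
    void snapshot(String name) {
        if (commits.isEmpty()) {
            throw new IllegalStateException("nothing committed yet");
        }
        snapshots.put(name, commits.size());
    }

    // Goal 1: a static view that later commits cannot change.
    List<String> openSnapshot(String name) {
        return commits.get(generationOf(name) - 1);
    }

    // Goal 2: go back to a snapshot and drop everything committed after it.
    void rollbackTo(String name) {
        int gen = generationOf(name);
        commits.subList(gen, commits.size()).clear();
        snapshots.values().removeIf(g -> g > gen);
    }

    // The view at the head of the index (the latest commit).
    List<String> head() {
        return commits.get(commits.size() - 1);
    }

    private int generationOf(String name) {
        Integer gen = snapshots.get(name);
        if (gen == null) {
            throw new IllegalArgumentException("unknown snapshot: " + name);
        }
        return gen;
    }
}
```

An external job would read from openSnapshot(name) while writers keep 
committing; rollbackTo(name) discards the later commit points and any 
snapshots that pointed at them.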

- Also What happens when multiple sources try to add documents to the same 
shard simultaneously(using the same IndexWriter)?

If you are asking what happens to the index commit points, or how we would 
deal with the multiple sources: we don't.  The snapshots will only operate on 
committed data, so before we create a snapshot we will need to call commit on 
the index.
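
A minimal sketch of the commit-then-snapshot ordering, as a plain-Java toy 
model rather than Blur or Lucene code (CommitBeforeSnapshotModel and its 
methods are hypothetical names). Several sources may add documents through 
the same object; a snapshot only ever contains committed documents.

```java
import java.util.*;

// Toy model (NOT the Lucene API): documents are buffered until commit(),
// and snapshot() commits first so that it only captures committed data.
class CommitBeforeSnapshotModel {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // Called by any number of sources; buffered until the next commit.
    synchronized void add(String doc) {
        pending.add(doc);
    }

    // Make all buffered documents part of the committed state.
    synchronized void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    // Commit, then return an immutable copy of the committed state.
    synchronized List<String> snapshot() {
        commit();
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }
}
```

Documents added after snapshot() returns do not appear in the returned view; 
they show up only in a later snapshot.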

- Would really love to know your thoughts and appreciate it if someone can fill 
in gaps in my understanding. Thank You.

If you want to pick a starting point I think that #1 is a good place to start.
    1. Primary - Create a static view of the index so that MapReduce jobs (or 
other external systems) could open and use the indexes (from a snapshot) and 
they would not be changing while they were being used.

I can help you create or modify an IndexDeletionPolicy to behave the way we 
need or help with any other questions.  This is really going to be a feature 
that will likely grow and change over time but I think we can start pretty 
basic for now.

Aaron




                
> Create Index Snapshots
> ----------------------
>
>                 Key: BLUR-132
>                 URL: https://issues.apache.org/jira/browse/BLUR-132
>             Project: Apache Blur
>          Issue Type: New Feature
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
