[
https://issues.apache.org/jira/browse/BLUR-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738082#comment-13738082
]
Aaron McCurry commented on BLUR-132:
------------------------------------
Thanks for the writeup!
-We can provide the backup in multiple ways:
- 1. Back up all shards (all tables) on a shard server onto the local
filesystem (cluster/table/shard/files). While restoring from backup, every
shard server reads from its local filesystem and copies the shards onto HDFS.
- 2. Back up all shards on all shard servers onto a common HDFS location.
While restoring, we would partition the shards onto shard servers.
I will offer a third option.
3. Snapshot the index and don't make a copy of the data. We could simply leave
the files in the shard directory that are referenced in the snapshot and not
allow the IndexDeletionPolicy to remove the files.
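A rough sketch of option 3 follows, written as a plain-Java toy model rather than real Lucene or Blur code. The class name SnapshotRetentionSketch and all of its methods are hypothetical. It only illustrates the retention rule: a snapshot pins a commit point by name, nothing is copied, and pruning skips any commit that a snapshot still references.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of one shard directory. NOT Lucene's API, only the retention rule. */
final class SnapshotRetentionSketch {
    // Commit generation -> files referenced by that commit point.
    private final Map<Long, Set<String>> commits = new LinkedHashMap<>();
    // Snapshot name -> commit generation it pins.
    private final Map<String, Long> snapshots = new LinkedHashMap<>();
    private long latestGeneration = 0;

    /** Records a new commit point, then prunes older commits nobody references. */
    long commit(Set<String> files) {
        latestGeneration++;
        commits.put(latestGeneration, new TreeSet<>(files));
        prune();
        return latestGeneration;
    }

    /** Pins the latest commit under a name. No files are copied. */
    long snapshot(String name) {
        if (latestGeneration == 0) {
            throw new IllegalStateException("nothing committed yet");
        }
        if (snapshots.containsKey(name)) {
            throw new IllegalArgumentException("snapshot already exists: " + name);
        }
        snapshots.put(name, latestGeneration);
        return latestGeneration;
    }

    /** Drops a snapshot so its commit becomes eligible for pruning. */
    void release(String name) {
        if (snapshots.remove(name) == null) {
            throw new IllegalArgumentException("no such snapshot: " + name);
        }
        prune();
    }

    /** Keeps the newest commit plus every commit pinned by a snapshot. */
    private void prune() {
        Set<Long> pinned = new HashSet<>(snapshots.values());
        commits.keySet().removeIf(gen -> gen != latestGeneration && !pinned.contains(gen));
    }

    /** Files that must stay in the shard directory right now. */
    Set<String> liveFiles() {
        Set<String> live = new TreeSet<>();
        for (Set<String> files : commits.values()) {
            live.addAll(files);
        }
        return live;
    }
}
```

In this model, the files of a snapshotted commit stay in liveFiles() after later commits and disappear only once the snapshot is released. A real implementation would apply the same rule inside the deletion policy instead of tracking file sets by hand.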
In general I didn't really consider snapshots a way of creating a backup but
rather a known state of the index that was light and fast to create. Although
I think that you would have to have snapshots to allow a backup to work
correctly. So perhaps we should create another task to actually do the copying
of a snapshot for backup.
-We should also have some mechanism to restore from a backup. For us to restore
the index from a backup, we might as well need a point-in-time copy of all the
table descriptors.
Agreed, we should be able to restore to a previous snapshot. A possible
prerequisite for this will be the ability to change a table to read-only while
it's running, so that we can close and reopen the IndexWriter on each shard.
Also, this could be broken out into another task after snapshots are created.
-How are we planning to expose this snapshot functionality (Shell, API, BOTH)?
Both, the shell just uses the api.
-Where are we even using LocalIndexServer?
We don't use it anywhere; it's legacy code that should probably be removed.
- I was able to take a backup by wrapping IndexDeletionPolicy with
SnapshotDeletionPolicy and then take a snapshot and copy all the files to a
local file system. This technically works even if the index is being actively
updated, but the way in which the code is structured
(DistributeIndexServer.openShard), we would only get a BlurIndexReader when the
shard is being updated. But the sample code I have below is using the writer to
take the snapshot. Maybe there is a different way?
There were three reasons I wanted to create snapshots.
1. Primary - Create a static view of the index so that MapReduce jobs (or other
external systems) could open and use the indexes (from a snapshot) and they
would not be changing while they were being used.
2. Create the ability to snapshot commit points through time so that, if I
needed to roll back to a certain point and drop all the data after that point,
I could.
3. Low priority - Run a shard off a certain commit point and allow the snapshot
commit point to be changed to any other snapshot as well as the head of the
index.
- Also What happens when multiple sources try to add documents to the same
shard simultaneously(using the same IndexWriter)?
If you are asking what happens to the index commit points, or how we would
deal with the multiple sources: we don't. The snapshots will only operate on
committed data, so before we create a snapshot we will need to call commit on
the index.
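The commit-then-snapshot ordering can be sketched with a small plain-Java model. This is an illustration under assumptions, not Blur or Lucene code: the class CommitThenSnapshotSketch and its methods are hypothetical, and documents are just strings. Several sources may add through the same writer, but a snapshot only ever sees what has been committed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of one shard writer. Hypothetical names, not the Lucene API. */
final class CommitThenSnapshotSketch {
    // Documents added since the last commit; invisible to snapshots.
    private final List<String> pending = new ArrayList<>();
    // Immutable view of everything committed so far.
    private List<String> committed = Collections.emptyList();

    /** Any number of sources may call this; access is serialized. */
    synchronized void addDocument(String doc) {
        pending.add(doc);
    }

    /** Publishes pending documents as a new immutable committed view. */
    synchronized void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        pending.clear();
        committed = Collections.unmodifiableList(next);
    }

    /** Returns the committed view only; uncommitted documents are not included. */
    synchronized List<String> snapshot() {
        return committed;
    }

    /** The ordering described above: commit first, then take the snapshot. */
    synchronized List<String> commitAndSnapshot() {
        commit();
        return snapshot();
    }
}
```

Because each commit publishes a fresh immutable list, a snapshot taken earlier does not change when more documents are added and committed later.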
- Would really love to know your thoughts and appreciate it if someone can fill
in gaps in my understanding. Thank You.
If you want to pick a starting point I think that #1 is a good place to start.
1. Primary - Create a static view of the index so that MapReduce jobs (or
other external systems) could open and use the indexes (from a snapshot) and
they would not be changing while they were being used.
I can help you create or modify an IndexDeletionPolicy to behave the way we
need, or help with any other questions. This is really going to be a feature
that will likely grow and change over time, but I think we can start pretty
basic for now.
Aaron
> Create Index Snapshots
> ----------------------
>
> Key: BLUR-132
> URL: https://issues.apache.org/jira/browse/BLUR-132
> Project: Apache Blur
> Issue Type: New Feature
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
>