[
https://issues.apache.org/jira/browse/BLUR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805539#comment-13805539
]
Aaron McCurry commented on BLUR-234:
------------------------------------
Ok, so here are the basics of what IndexImporter is doing and where the problem
lies.
The IndexImporter is basically calling addIndex (or addDirectory) on the
IndexWriter that is the main index for the shard. The normal operation of
Lucene is to copy all the files from the /XXXXX.commit directory to the main
index. That means it reads all the index files from the XXXXX.commit directory
and writes them into the main index. This can be very slow and use a lot of
resources, so the current process is basically to intercept the copy call and
actually move the HDFS file from the XXXXX.commit index into the main index,
renaming the file as needed. Because of the move, this is a dangerous
operation: if the files are moved and the shard process dies, then the files
that were moved in are lost, because Lucene deletes unknown files on writer
open.
So my solution to this problem is to create a SoftLinkDirectory so that,
instead of moving a file out of the XXXXX.commit index, it creates a softlink
to the file in the XXXXX.commit index. That way, in a failure, no data is lost.
Let me know what you think about this approach.
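To make the idea concrete, here is a very rough sketch of the link behavior. It
is only an illustration under some assumptions: it uses the local filesystem
through java.nio instead of HDFS, a "link" is just a small file holding the
path of the real file, and the class name, method names and the ".lnk" suffix
are made up for this example (none of them are existing Blur or Lucene APIs).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: a "soft link" is a small file that stores the path of the real file. */
public class SoftLinkSketch {

    // Assumed naming convention for link files; not taken from Blur.
    static final String LINK_SUFFIX = ".lnk";

    /** Creates a link named `name` in mainDir that points at `target` (which stays where it is). */
    public static Path createLink(Path mainDir, String name, Path target) throws IOException {
        Path link = mainDir.resolve(name + LINK_SUFFIX);
        byte[] content = target.toAbsolutePath().toString().getBytes(StandardCharsets.UTF_8);
        Files.write(link, content);
        return link;
    }

    /** Resolves `name` in mainDir to the real file, following a link if one exists. */
    public static Path resolve(Path mainDir, String name) throws IOException {
        Path link = mainDir.resolve(name + LINK_SUFFIX);
        if (Files.exists(link)) {
            String target = new String(Files.readAllBytes(link), StandardCharsets.UTF_8);
            return Paths.get(target);
        }
        return mainDir.resolve(name);
    }

    /** Simulates cleanup of an uncommitted file: only the link is deleted, never its target. */
    public static void removeLink(Path mainDir, String name) throws IOException {
        Files.deleteIfExists(mainDir.resolve(name + LINK_SUFFIX));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("softlink-sketch");
        Path commitDir = Files.createDirectories(root.resolve("XXXXX.commit"));
        Path mainDir = Files.createDirectories(root.resolve("main"));

        // A stand-in for an index file produced by the external job.
        Path segment = commitDir.resolve("_0.cfs");
        Files.write(segment, "index data".getBytes(StandardCharsets.UTF_8));

        // "Import" the file into the main index under a new name, without moving it.
        createLink(mainDir, "_1.cfs", segment);
        System.out.println("main/_1.cfs resolves to " + resolve(mainDir, "_1.cfs"));

        // Simulate a failure before commit: the unknown entry in main is removed.
        removeLink(mainDir, "_1.cfs");
        System.out.println("original still present: " + Files.exists(segment));
    }
}
```

The point of the sketch is only that deleting the entry in the main index
removes the link and leaves the file in the XXXXX.commit directory untouched,
which is the opposite of what happens today with the move.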
Thanks!
Aaron
> Create a softlink like capability in the HDFSDirectory
> ------------------------------------------------------
>
> Key: BLUR-234
> URL: https://issues.apache.org/jira/browse/BLUR-234
> Project: Apache Blur
> Issue Type: Sub-task
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
>
> The problem we are trying to solve here is minimizing file copying. During a
> merge of an external index produced by MR into a shard index, the index
> files are normally copied. In a lot of cases the new external index(es) are
> very large. This can cause some serious performance problems because all the
> new data would be copied into the shard index. Normally this happens across
> the cluster at the same time, so it will likely turn into an IO storm.
> The current implementation in the IndexImporter that deals with this problem
> does so by overriding a method in the HDFSDirectory so that it moves the
> files in HDFS instead of copying them. This makes those merges very fast,
> but it's risky because if the shard index writer doesn't commit the changes,
> the files are not moved back to their original location. Instead they are
> deleted, which means loss of data.
> So I'm proposing that we create a softlink system that allows links to be
> created instead of files being moved. That way, if the commit fails, the
> links are removed and the original data files are still in their original
> location.
--
This message was sent by Atlassian JIRA
(v6.1#6144)