[ 
https://issues.apache.org/jira/browse/BLUR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805539#comment-13805539
 ] 

Aaron McCurry commented on BLUR-234:
------------------------------------

Ok, so here are the basics of what IndexImporter is doing and where the problem 
lies.  The IndexImporter basically calls addIndex (or addDirectory) on the 
IndexWriter that is the main index for the shard.  The normal operation of 
Lucene is to copy all the files from the /XXXXX.commit directory into the main 
index, which means reading every index file from the XXXXX.commit directory and 
writing it into the main index.  This can be very slow and use a lot of 
resources, so the current process basically intercepts the copy call and 
actually moves the HDFS file from the XXXXX.commit index into the main index, 
renaming the file as needed.  Because the files are moved rather than copied, 
this is a dangerous operation: if the files are moved and the shard process 
dies before the writer commits, the files that were moved in are lost, because 
Lucene deletes unknown files on writer open.

So my solution to this problem is to create a SoftLinkDirectory so that, 
instead of moving a file out of the XXXXX.commit index, it creates a softlink 
to the file in the XXXXX.commit index.  That way, in a failure, no data is 
lost.  Let me know what you think about this approach.
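
To make the idea concrete, here is a minimal sketch of the softlink approach. 
It is not the Blur implementation: the class name SoftLinkSketch, the ".lnk" 
suffix, and the createLink/read helpers are all hypothetical, and the local 
filesystem (java.nio.file) stands in for HDFS.  A "link" here is just a small 
file in the main index directory that records the path of the real file in the 
XXXXX.commit directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SoftLinkSketch {
    // Hypothetical suffix marking a link file; not taken from Blur.
    static final String LINK_SUFFIX = ".lnk";

    /** Writes a small link file in mainDir that records where the real file lives. */
    static Path createLink(Path mainDir, String name, Path target) throws IOException {
        Path link = mainDir.resolve(name + LINK_SUFFIX);
        byte[] body = target.toAbsolutePath().toString().getBytes(StandardCharsets.UTF_8);
        Files.write(link, body);
        return link;
    }

    /** Reads the file called name from mainDir, following a link file if one exists. */
    static byte[] read(Path mainDir, String name) throws IOException {
        Path link = mainDir.resolve(name + LINK_SUFFIX);
        if (Files.exists(link)) {
            String target = new String(Files.readAllBytes(link), StandardCharsets.UTF_8);
            return Files.readAllBytes(Paths.get(target));
        }
        return Files.readAllBytes(mainDir.resolve(name));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("softlink-sketch");
        Path commitDir = Files.createDirectories(root.resolve("XXXXX.commit"));
        Path mainDir = Files.createDirectories(root.resolve("main"));

        // An index file produced externally, sitting in the commit directory.
        Path data = commitDir.resolve("_0.cfs");
        Files.write(data, new byte[] {1, 2, 3});

        // Instead of moving the file into the main index, link to it.
        Path link = createLink(mainDir, "_0.cfs", data);
        System.out.println("bytes read through link: " + read(mainDir, "_0.cfs").length);

        // Simulate the failure case: the uncommitted link is cleaned out of the
        // main index, but the original file is never touched.
        Files.delete(link);
        System.out.println("original still present: " + Files.exists(data));
    }
}
```

The point of the sketch is only the failure behavior: removing an uncommitted 
link deletes the small link file, not the data it points to, so the original 
file is still in the XXXXX.commit directory afterwards.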

Thanks!

Aaron

> Create a softlink like capability in the HDFSDirectory
> ------------------------------------------------------
>
>                 Key: BLUR-234
>                 URL: https://issues.apache.org/jira/browse/BLUR-234
>             Project: Apache Blur
>          Issue Type: Sub-task
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>
> The problem we are trying to solve here is minimizing file copying.  During a 
> merge of an external index produced by MR into a shard index, the index files 
> are normally copied.  In a lot of cases the new external index(es) are very 
> large.  This can cause some serious performance problems because all the new 
> data would be copied into the shard index.  Normally this happens across the 
> cluster at the same time, so it will likely turn into an IO storm.
> The current implementation in the IndexImporter that deals with this problem 
> does so by overriding a method in the HDFSDirectory so that it moves the files 
> in HDFS instead of copying them.  This makes those merges very fast, but it's 
> risky because if the shard index writer doesn't commit the changes, the files 
> are not moved back to their original location.  Instead they are deleted, 
> resulting in data loss.
> So I'm proposing that we create a softlink system that allows links to be 
> created instead of the files being moved.  That way, if the commit fails, the 
> links are removed and the original data files are still in their original 
> location.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
