[ 
https://issues.apache.org/jira/browse/BLUR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805288#comment-13805288
 ] 

Vikrant Navalgund commented on BLUR-234:
----------------------------------------

Hi Aaron,
Could you please help me understand the context? I have gone through the 
IndexImporter timer task, and I have a few questions.
  
    "To provide for a mechanism to not copy around the index files in HDFS"
   
    Right now the index files from /table/shard-0000x/xxxx.commit are added by 
the index writer at the destination.
    After indexing, /table/shard-0000x/xxxx.commit is deleted. 
    I also understand that we need some kind of transaction mechanism to make 
sure failed operations are retried.
    But how does an HDFSSoftLinkDirectory help us? 
    What are the semantics of an HDFSSoftLinkDirectory?
        Does it provide the mechanism for the transaction-like operation, by 
deleting the links once the operation is done?
        Does the linkpath hold flat files like filename0001.tmp that get 
deleted once the indexing is over?   
        Do we have a linkpath for each corresponding ../xxxx.commit path?
    
   The main ticket and this sub-task are somewhat confusing to me. What 
am I missing here?

   Regards,
   Vikrant

> Create a softlink like capability in the HDFSDirectory
> ------------------------------------------------------
>
>                 Key: BLUR-234
>                 URL: https://issues.apache.org/jira/browse/BLUR-234
>             Project: Apache Blur
>          Issue Type: Sub-task
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>
> The problem we are trying to solve here is minimizing file copying.  During a 
> merge of an external index produced by MR into a shard index, the index files 
> are normally copied.  In a lot of cases the new external index(es) are very 
> large.  This can cause serious performance problems because all the new data 
> is copied into the shard index.  Normally this happens across the cluster at 
> the same time, so it will likely turn into an IO storm.
> The current implementation in the IndexImporter deals with this problem by 
> overriding a method in the HDFSDirectory so that it moves the files in HDFS 
> instead of copying them.  This makes those merges very fast, but it's risky: 
> if the shard index writer doesn't commit the changes, the files are not moved 
> back to their original location.  Instead they are deleted, which means data 
> loss.
> So I'm proposing that we create a softlink system that allows links to be 
> created instead of the files being moved.  That way, if the commit fails, the 
> links are removed and the original data files remain in their original 
> location.
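
For illustration only, the link-instead-of-move idea in the description above 
could be sketched roughly as follows. This is not Blur's HDFSDirectory API: the 
class name SoftLinkSketch, the ".lnk" suffix, and the use of the local 
filesystem through java.nio.file in place of HDFS are all assumptions made to 
keep the example self-contained. A link here is a small file in the shard 
directory whose content is the path of the real index file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not Blur code: local files stand in for HDFS files.
class SoftLinkSketch {
    // Assumed naming convention for link files.
    static final String LINK_SUFFIX = ".lnk";

    // Create a link in shardDir pointing at target; the data is not moved or copied.
    static Path createLink(Path shardDir, Path target) throws IOException {
        Path link = shardDir.resolve(target.getFileName() + LINK_SUFFIX);
        byte[] content = target.toAbsolutePath().toString().getBytes(StandardCharsets.UTF_8);
        return Files.write(link, content);
    }

    // Resolve a link back to the real file it refers to.
    static Path resolve(Path link) throws IOException {
        return Paths.get(new String(Files.readAllBytes(link), StandardCharsets.UTF_8));
    }

    // Roll back a failed commit: only the link is removed, the target is untouched.
    static void rollback(Path link) throws IOException {
        Files.deleteIfExists(link);
    }

    public static void main(String[] args) throws IOException {
        Path external = Files.createTempDirectory("external-index");
        Path shard = Files.createTempDirectory("shard");
        Path data = Files.write(external.resolve("_0.cfs"), new byte[] {1, 2, 3});

        Path link = createLink(shard, data);
        System.out.println("link resolves to data: " + resolve(link).equals(data.toAbsolutePath()));

        rollback(link); // simulate a failed commit
        System.out.println("data still present: " + Files.exists(data));
    }
}
```

In this sketch a failed commit only deletes the link, so the external index 
files stay where the MR job wrote them. What should happen to the link and the 
original files after a successful commit is left open here, as the description 
does not specify it.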



--
This message was sent by Atlassian JIRA
(v6.1#6144)
