On Tue, Oct 15, 2013 at 1:48 PM, Ravikumar Govindarajan <[email protected]> wrote:
> Sorry, some truncated content
>
> 1. Specifying hints to the namenode about the favored set of datanodes
> [HDFS-2576] when writing a file in Hadoop.
>
> The downside of such a feature is a datanode failure. The HDFS Balancer
> and HBase Balancer will contend over the re-distribution of blocks.
>
> Facebook's HBase port on GitHub has this patch in production and adjusts
> for conflicts by periodically running a cron process.
>
> Even running the HDFS Balancer when all shards are online can screw up
> the carefully planned layout. So there is HDFS-4420 for this, to exclude
> certain sub-trees from balancing.
>
> 2. There is another interesting JIRA, HDFS-4606, yet to begin, which will
> aim to move a block to the local datanode when it detects that it has
> performed a remote read.
>
> This will be the best option when completed, without requiring any
> plumbing code whatsoever from Blur, HBase, etc...

Yes, I like this one. This is basically what I was thinking about
implementing (logically). Since it hasn't even started development yet, it
will likely be a year or more before we see it in a stable release. But
maybe I will be surprised.

> One big downside is the effect it has on the local datanode, which will
> already be serving data, in addition to performing disk writes as a
> result of remote reads.
>
> If you are not already aware of it or have alternate ideas for Blur,
> please let me know your thoughts.

I think that #2 is likely what will be most beneficial long term. In the
near term, I haven't really seen a huge need for this type of feature.
That may be because I haven't tested Blur where the reads were mostly
local and a short-circuit read could be made. Or it might be because the
systems that I am using have enough hardware that the remote reads are not
much of an issue. Either way, I think that this would benefit us and I'm
all for trying to come up with a solution.
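For anyone skimming the thread, the HDFS-4606 behavior described above can be sketched in a few lines. This is purely illustrative (HDFS-4606 had not started development at the time): the class, method names, and threshold are all hypothetical, and a real implementation would live inside the datanode/client read path rather than in standalone code like this.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the HDFS-4606 idea: count remote reads per block and flag a
 * block for relocation to the reading node once the count crosses a
 * threshold. Hypothetical names; not part of any real Hadoop API.
 */
public class RemoteReadTracker {

  private final Map<String, Integer> remoteReadCounts = new HashMap<>();
  private final int threshold;

  public RemoteReadTracker(int threshold) {
    this.threshold = threshold;
  }

  /**
   * Record a read of blockId served from sourceNode to readerNode.
   * Returns true once the block should be copied to readerNode.
   */
  public boolean recordRead(String blockId, String sourceNode, String readerNode) {
    if (sourceNode.equals(readerNode)) {
      return false; // local read, nothing to do
    }
    int count = remoteReadCounts.merge(blockId, 1, Integer::sum);
    return count >= threshold;
  }
}
```

The threshold guards against Ravi's downside above: without it, every stray remote read would add disk writes to a datanode that is already serving data.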
:-)

Aaron

> --
> Ravi
>
>
> On Tue, Oct 15, 2013 at 11:07 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Actually, what I unearthed after long-time fishing is this:
> >
> > 1. Specifying hints to the namenode about the favored set of datanodes
> > [HDFS-2576] when writing a file in Hadoop.
> >
> > The downside of such a feature is a datanode failure. HDFS Balancer and
> >
> > Facebook's HBase port on GitHub has this patch in production
> >
> > 2.
> >
> > On Tue, Oct 15, 2013 at 7:15 PM, Aaron McCurry <[email protected]>
> > wrote:
> >
> >> On Tue, Oct 15, 2013 at 9:25 AM, Ravikumar Govindarajan <
> >> [email protected]> wrote:
> >>
> >> > Your idea looks great.
> >> >
> >> > But I am afraid I don't seem to grasp it fully. I am attempting to
> >> > understand your suggestions, so please bear with me for a moment.
> >> >
> >> > When co-locating shard-servers and datanodes:
> >> >
> >> > 1. Every shard-server will attempt to write a local copy
> >> > 2. Proceed with the default settings of Hadoop [remote rack etc...]
> >> > for the next copies
> >> >
> >> > Ideally, there are 2 problems when writing a file:
> >> >
> >> > 1. Unable to write a local copy [lack of sufficient storage etc...].
> >> > Hadoop deflects the
> >>
> >> True, this might happen if the cluster is low on storage and there are
> >> local disk failures. If the cluster is in this kind of condition, and
> >> running the normal Hadoop balancer won't help (reduce the local
> >> storage by migrating to another machine), then it's likely we can't do
> >> anything to help the situation.
> >>
> >> > write to some other destination, internally.
> >> > 2. Shard-server failure. Since this is stateless, it will most
> >> > likely be a hardware failure/planned maintenance.
> >>
> >> Yes, this is true.
> >>
> >> > Instead of asynchronously re-balancing on every write [moving
> >> > blocks-to-shard], is it possible for us to trap into the above cases
> >> > alone?
> >> Not sure what you mean here.
> >>
> >> > BTW, how do we re-arrange the shard layout when shards are
> >> > added/removed?
> >> >
> >> > I looked at the 0.20 code and it seems to move shards too much. That
> >> > could be detrimental for what we are discussing now, right?
> >>
> >> I have fixed this issue, or at least improved it:
> >>
> >> https://issues.apache.org/jira/browse/BLUR-260
> >>
> >> This implementation will only move the shards that are down, and will
> >> never move any more shards than is necessary to maintain a properly
> >> balanced shard cluster.
> >>
> >> Basically, it recalculates an optimal layout when the number of
> >> servers changes. It will likely need to be improved by optimizing the
> >> location of each shard (which server) by figuring out which server
> >> that can serve the shard has the most blocks from the shard's indexes.
> >> Doing this should minimize block movement.
> >>
> >> Also had another thought: if we follow something similar to what HBase
> >> does and perform a full optimization on each shard once a day, that
> >> would locate the data on the local server without us having to do any
> >> other work. Of course, this assumes that there is enough space locally
> >> and that there has been some change in the index in the last 24 hours
> >> to warrant doing the merge.
> >>
> >> Aaron
> >>
> >> > --
> >> > Ravi
> >> >
> >> > On Sun, Oct 13, 2013 at 12:51 AM, Aaron McCurry <[email protected]>
> >> > wrote:
> >> >
> >> > > On Fri, Oct 11, 2013 at 11:12 AM, Ravikumar Govindarajan <
> >> > > [email protected]> wrote:
> >> > >
> >> > > > I came across this interesting JIRA in Hadoop:
> >> > > > https://issues.apache.org/jira/browse/HDFS-385
> >> > > >
> >> > > > In essence, it allows us more control over where blocks are
> >> > > > placed.
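The minimal-movement relayout described for BLUR-260 above could be sketched roughly like this. This is an illustrative sketch only, not the actual Blur code: shards whose server is still online stay put, and only orphaned shards are reassigned to the least-loaded online servers.

```java
import java.util.*;

/**
 * Sketch of the BLUR-260 idea: when the set of servers changes, keep every
 * shard whose server is still online where it is, and reassign only the
 * orphaned shards. Hypothetical names; not the Blur implementation.
 */
public class MinimalMovementLayout {

  /** Returns a new shard-to-server assignment, moving as few shards as possible. */
  public static Map<String, String> recalc(Map<String, String> current,
                                           Set<String> onlineServers) {
    Map<String, String> next = new HashMap<>();
    Map<String, Integer> load = new HashMap<>();
    for (String server : onlineServers) {
      load.put(server, 0);
    }
    List<String> orphaned = new ArrayList<>();
    for (Map.Entry<String, String> e : current.entrySet()) {
      if (onlineServers.contains(e.getValue())) {
        next.put(e.getKey(), e.getValue());       // shard stays put
        load.merge(e.getValue(), 1, Integer::sum);
      } else {
        orphaned.add(e.getKey());                 // hosting server is down
      }
    }
    for (String shard : orphaned) {
      String target = Collections.min(load.entrySet(),
          Map.Entry.comparingByValue()).getKey(); // least-loaded server
      next.put(shard, target);
      load.merge(target, 1, Integer::sum);
    }
    return next;
  }
}
```

The improvement Aaron mentions (preferring the server that already holds the most blocks of the shard's indexes) would replace the least-loaded tiebreak with a block-count lookup.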
> >> > > > A BlockPlacementPolicy to optimistically write all blocks of a
> >> > > > given index-file into the same set of datanodes could be highly
> >> > > > helpful.
> >> > > >
> >> > > > Co-locating shard-servers and datanodes, along with short-circuit
> >> > > > reads, should improve things greatly. We can always utilize the
> >> > > > file-system cache for local files. In case a given file is not
> >> > > > served locally, the shard-servers can use the block-cache.
> >> > > >
> >> > > > Do you see some positives in such an approach?
> >> > >
> >> > > I'm sure that there will be some performance improvements by using
> >> > > local reads for accessing HDFS. The areas where I would expect to
> >> > > see the biggest increases in performance are merging and fetching
> >> > > data for retrieval. Even with access to local drives, the search
> >> > > time would likely not be improved, assuming the index is hot. One
> >> > > of the reasons for doing short-circuit reads is to make use of the
> >> > > OS file system cache, and since Blur already uses a filesystem
> >> > > cache it might just be duplicating functionality. Another big
> >> > > reason for short-circuit reads is to reduce network traffic.
> >> > >
> >> > > Given the scenario where another server has gone down: if the shard
> >> > > server is in an opening state, we would have to change the Blur
> >> > > layout system to only fail over to the server that contains the
> >> > > data for the index. This might have some problems, because the
> >> > > layout of the shards is based on how many shards are online on the
> >> > > existing servers, not on where they are located.
> >> > >
> >> > > So in a perfect world, if the shard fails over to the correct
> >> > > server, it would reduce the amount of network traffic (to near 0 if
> >> > > it was perfect). So in short it could work, but it might be very
> >> > > hard.
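As an aside, the short-circuit read setup discussed above is mostly configuration in Hadoop 2.x. A minimal hdfs-site.xml fragment might look like the following; the socket path is only an example value, and the native libhadoop library must be available on each datanode for this to take effect:

```xml
<!-- hdfs-site.xml (client and datanode) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- example path; the datanode must be able to create this socket -->
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```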
> >> > > Assuming for a moment that the system is not dealing with a
> >> > > failure, and assuming that the shard server is also running a
> >> > > datanode: the first replica in HDFS goes to the local datanode, so
> >> > > if we just configure Hadoop for short-circuit reads we would
> >> > > already get the benefits for data fetching and merging.
> >> > >
> >> > > I had another idea that I wanted to run by you.
> >> > >
> >> > > What if, instead of actively placing blocks during writes with a
> >> > > Hadoop block layout policy, we write something like the Hadoop
> >> > > balancer? We get Hadoop to move the blocks to the shard server
> >> > > (datanode) that is hosting the shard. That way it would be
> >> > > asynchronous, and after a failure / shard movement it would
> >> > > relocate the blocks and still use a short-circuit read. Kind of
> >> > > like the blocks following the shard around, instead of deciding up
> >> > > front where the data for the shard should be located. If this is
> >> > > what you were thinking, I'm sorry for not understanding your
> >> > > suggestion.
> >> > >
> >> > > Aaron
> >> > >
> >> > > > --
> >> > > > Ravi
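The "blocks follow the shard" balancer Aaron proposes could be planned roughly as below. This is a sketch under assumed data structures (the class and parameter names are hypothetical): for each hosted shard, any block without a replica on the shard's datanode gets a scheduled copy. A real tool would drive the HDFS replication machinery rather than return strings.

```java
import java.util.*;

/**
 * Sketch of a balancer-like tool that moves blocks to the datanode hosting
 * the shard that owns them. Hypothetical structures; illustrative only.
 */
public class ShardBlockMover {

  /**
   * Plan the block copies needed so that every block of every shard has a
   * replica on the shard's hosting datanode. Returns moves as "block->node".
   */
  public static List<String> planMoves(Map<String, String> shardToServer,
                                       Map<String, List<String>> shardToBlocks,
                                       Map<String, Set<String>> blockLocations) {
    List<String> moves = new ArrayList<>();
    for (Map.Entry<String, String> e : shardToServer.entrySet()) {
      String server = e.getValue();
      List<String> blocks =
          shardToBlocks.getOrDefault(e.getKey(), Collections.emptyList());
      for (String block : blocks) {
        Set<String> locations =
            blockLocations.getOrDefault(block, Collections.emptySet());
        if (!locations.contains(server)) {
          moves.add(block + "->" + server); // no local replica; schedule a copy
        }
      }
    }
    return moves;
  }
}
```

Because the planning runs asynchronously, it naturally re-converges after a failure or shard movement, which is exactly the property that makes this approach attractive compared to up-front placement.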
