I had to think about this problem a lot for a product I worked on at one point, and I think much of the same reasoning applies here.
To Corey's point, running the rebalancer is most definitely an issue, but simply turning it off is not a good answer in a lot of situations. It exists for a reason! You can run into problems on highly utilized clusters where individual data nodes run out of disk space, and all kinds of bad things start to happen then. And if you are also using the cluster for MapReduce, you can see performance gains by rebalancing a highly utilized cluster.

In general, the placement of blocks is the NameNode's responsibility, so even if it's nice to assume that blocks get written to the local data node, that's not an assumption you can always make. There has been talk about custom block placement strategies for HDFS in the NameNode. I just checked up on it, and it does look like it is on the horizon: https://issues.apache.org/jira/browse/HDFS-2576. In theory, you could have Accumulo "hint" to the NameNode which blocks it wants colocated.

There is another interesting problem with the results of minor compactions. Let's say you've been minor compacting all day and have a dozen or so of these files written. The replication policy is pretty random. Now say the DataNode that the tablet server is running on has a fatal problem and never comes back. There is no way to "collect" the replicas together onto one DataNode; they are scattered all over the other data nodes. Eventually a major compaction happens and all is good again. There were some ideas of telling the NameNode that certain blocks have an affinity for one another, so it would keep them together.

I think this can be tested scientifically pretty easily on a live production cluster:

Step 1: Measure performance of your current application and note if it does lots of single fetches, full table scans, etc.
Step 2: Run the rebalancer.
Step 3: Measure performance again.
Step 4: Force a major compaction to move everything back (optional).

Rough sketches for the measurement piece and the compaction piece are below.
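For the measurement piece, one rough way to check locality is just to ask HDFS where the replicas of a table's file blocks actually live. Here's an untested sketch of what I mean; the table directory path and hostname handling are placeholders you'd adapt to your own cluster, and a real version would recurse since tablet directories nest one level deeper than this flat listing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Counts how many block replicas under a table directory live on a given
    // host. Point it at something like /accumulo/tables/<tableId> and at the
    // hostname of the tablet server hosting those tablets.
    public class LocalityCheck {
      public static void main(String[] args) throws Exception {
        String tserverHost = args[0];
        Path tableDir = new Path(args[1]);
        FileSystem fs = FileSystem.get(new Configuration());

        long local = 0, total = 0;
        for (FileStatus file : fs.listStatus(tableDir)) {
          if (file.isDirectory())
            continue; // a real version would recurse into tablet directories
          for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
            total++;
            for (String host : block.getHosts()) {
              if (host.equals(tserverHost)) {
                local++;
                break;
              }
            }
          }
        }
        System.out.println(local + " of " + total + " blocks have a replica on " + tserverHost);
      }
    }

Run that before and after the rebalancer; if the local fraction drops a lot after Step 2, the rebalancer really is undoing your locality.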
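And for Step 4, forcing the major compaction is one call through the client API (or just "compact -t <table> -w" in the shell). Another untested sketch; the instance name, ZooKeeper host, and credentials are made up:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    // Compacts the whole table (null start/end rows), flushing first and
    // blocking until the compaction completes. Since majors are written
    // through the local DataNodes, this pulls the data back to the tservers.
    public class ForceCompaction {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));
        conn.tableOperations().compact("mytable", null, null, true, true);
      }
    }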
Unfortunately, I don't have any systems right now that I could do this on that would provide any sort of real results.

Overall, on the question of whether it matters: it absolutely does. Slurping off disk locally is always going to be faster than slurping off disk AND going over the network. The real question is whether it's worth our time. 10GigE is a beautiful thing. In some cases it may be, in others it may not. For example, if you are just doing small fetches of data here and there, you might not notice. I imagine that if you were doing multiple large scans, you might start seeing your network get saturated. I think this also becomes a problem at larger scales, where your network infrastructure is a bit more ridiculous. Let's say for the sake of argument you have a 25,000 node Accumulo cluster... you might have some sort of tiered network where you are constrained from a throughput perspective somewhere. Locality would matter then.

My 8 cents,
-d

On Thu, Jun 19, 2014 at 12:56 PM, Josh Elser <[email protected]> wrote:

> I may also be getting this conflated with how reads work. Time for me to
> read some HDFS code.
>
> On 6/19/14, 8:52 AM, Josh Elser wrote:
>
>> I believe this happens via the DfsClient, but you can only expect the
>> first block of a file to actually be on the local datanode (assuming
>> there is one). Everything else is possible to be remote. Assuming you
>> have a proper rack script set up, you would imagine that you'll still
>> get at least one rack-local replica (so you'd have a block nearby).
>>
>> Interestingly (at least to me), I believe HBase does a bit of work in
>> region (tablet) assignments to try to maximize the locality of regions
>> WRT the datanode that is hosting the blocks that make up that file. I
>> need to dig into their code some day though.
>>
>> In general, Accumulo and HBase tend to be relatively comparable to one
>> another with performance when properly configured, which makes me apt to
>> think that data locality can help, but it's not some holy grail (of
>> course you won't ever hear me claim anything to be in that position). I
>> will say that I haven't done any real quantitative analysis either though.
>>
>> tl;dr HDFS block locality should not be affecting the functionality of
>> Accumulo.
>>
>> On 6/19/14, 7:25 AM, Corey Nolet wrote:
>>
>>> AFAIK, the locality may not be guaranteed right away unless the data
>>> for a tablet was first ingested on the tablet server that is responsible
>>> for that tablet; otherwise you'll need to wait for a major compaction to
>>> rewrite the RFiles locally on the tablet server. I would assume if the
>>> tablet server is not on the same node as the datanode, those files will
>>> probably be spread across the cluster as if you were ingesting data from
>>> outside the cloud.
>>>
>>> A recent discussion with Bill Slacum also brought to light a possible
>>> problem of the HDFS balancer [1] re-balancing blocks after the fact,
>>> which could eventually pull blocks onto datanodes that are not local to
>>> the tablets. I believe the remedy for this was to turn off the balancer
>>> or not have it run.
>>>
>>> [1] http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html
>>>
>>> On Thu, Jun 19, 2014 at 10:07 AM, David Medinets
>>> <[email protected]> wrote:
>>>
>>>> At the Accumulo Summit and on a recent client site, there have been
>>>> conversations about Data Locality and Accumulo.
>>>>
>>>> I ran an experiment to see that Accumulo can scan tables when the
>>>> tserver process is run on a server without a datanode process. I
>>>> followed these steps:
>>>>
>>>> 1. Start three node cluster
>>>> 2. Load data
>>>> 3. Kill datanode on slave1
>>>> 4. Wait until Hadoop notices dead node.
>>>> 5. Kill tserver on slave2
>>>> 6. Wait until Accumulo notices dead node.
>>>> 7. Run the accumulo shell on master and slave1 to verify entries can
>>>> be scanned.
>>>>
>>>> Accumulo handled this situation just fine. As I expected.
>>>>
>>>> How important (or not) is it to run tserver and datanode on the same
>>>> server?
>>>> Does the Data Locality implied by running them together exist?
>>>> Can the benefit be quantified?

-- 
Donald Miner
Chief Technology Officer
ClearEdge IT Solutions, LLC
Cell: 443 799 7807
www.clearedgeit.com
