I may also be getting this conflated with how reads work. Time for me to read some HDFS code.

On 6/19/14, 8:52 AM, Josh Elser wrote:
I believe this happens via the DfsClient, but you can only expect the
first block of a file to actually be on the local datanode (assuming
there is one). Everything else may be remote. Assuming you have a
proper rack script set up, you would still expect at least one
rack-local replica (so you'd have a block nearby).
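If you want to see where a file's blocks actually landed, HDFS's fsck
can list the replica locations per block. The RFile path below is a
made-up example, not one from this thread:

```shell
# Print each block of a file and the datanodes holding its replicas.
# The path is hypothetical; substitute one of your table's RFiles.
hdfs fsck /accumulo/tables/1/default_tablet/F0000000.rf \
    -files -blocks -locations
```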

Interestingly (at least to me), I believe HBase does a bit of work
during region (tablet) assignment to try to maximize the locality of
regions WRT the datanodes hosting the blocks that make up each
region's files. I need to dig into their code some day, though.

In general, Accumulo and HBase tend to be relatively comparable to one
another in performance when properly configured, which makes me apt to
think that data locality can help, but it's not some holy grail (of
course you won't ever hear me claim anything to be in that position).
I will say that I haven't done any real quantitative analysis either,
though.

tl;dr HDFS block locality should not be affecting the functionality of
Accumulo.

On 6/19/14, 7:25 AM, Corey Nolet wrote:
AFAIK, the locality may not be guaranteed right away unless the data
for a tablet was first ingested on the tablet server responsible for
that tablet; otherwise you'll need to wait for a major compaction to
rewrite the RFiles locally on the tablet server. I would assume that
if the tablet server is not on the same node as a datanode, those
files will probably be spread across the cluster as if you were
ingesting data from outside the cloud.
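Rather than waiting for a major compaction to happen on its own, one
way to force the rewrite is a manual compaction from the Accumulo
shell. Table name and credentials here are placeholders:

```shell
# Force a major compaction of "mytable"; -w waits for it to finish.
# The rewritten RFiles are created by the tserver hosting each tablet,
# so their first replica lands on that tserver's local datanode.
accumulo shell -u root -p secret -e "compact -t mytable -w"
```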

A recent discussion with Bill Slacum also brought to light a possible
problem of the HDFS balancer [1] re-balancing blocks after the fact,
which could eventually pull blocks onto datanodes that are not local
to the tablets. I believe the remedy for this was to turn off the
balancer or not have it run.

[1]
http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html
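For reference, the balancer is something an operator invokes (or
schedules) explicitly, so "not have it run" usually just means leaving
this command out of cron; the threshold value below is only
illustrative:

```shell
# Move blocks until every datanode's utilization is within 10% of the
# cluster average. Skipping this entirely preserves tablet locality.
hdfs balancer -threshold 10
```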

On Thu, Jun 19, 2014 at 10:07 AM, David Medinets
<[email protected]>
wrote:

At the Accumulo Summit and on a recent client site, there have been
conversations about Data Locality and Accumulo.

I ran an experiment to see that Accumulo can scan tables when the
tserver process is run on a server without a datanode process. I
followed these steps:

1. Start a three-node cluster.
2. Load data.
3. Kill the datanode on slave1.
4. Wait until Hadoop notices the dead node.
5. Kill the tserver on slave2.
6. Wait until Accumulo notices the dead node.
7. Run the accumulo shell on master and slave1 to verify entries can be
scanned.
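The kill steps might look something like the following; the hostnames
come from the steps above, but the process patterns and credentials
are guesses that depend on the install:

```shell
# Step 3: stop the HDFS datanode on slave1 (pattern matches the
# DataNode JVM; adjust for your environment).
ssh slave1 'kill $(pgrep -f DataNode)'
# Step 5: stop the Accumulo tablet server on slave2.
ssh slave2 'kill $(pgrep -f tserver)'
# Step 7: verify entries can still be scanned from the master.
accumulo shell -u root -p secret -e 'scan -t testtable'
```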

Accumulo handled this situation just fine. As I expected.

How important (or not) is it to run tserver and datanode on the same
server?
Does the Data Locality implied by running them together exist?
Can the benefit be quantified?

