On Tue, Oct 15, 2013 at 1:48 PM, Ravikumar Govindarajan <[email protected]> wrote:
> Sorry, some truncated content
>
> 1. Specifying hints to the namenode about the favored set of datanodes
> [HDFS-2576] when writing a file in Hadoop.
>
> The downside of such a feature is a datanode failure. The HDFS Balancer
> and HBase Balancer will contend over the re-distribution of blocks.
>
> Facebook's HBase port on GitHub has this patch in production and adjusts
> for conflicts by periodically running a cron process.
>
> Even running the HDFS Balancer when all shards are online can screw up
> the carefully planned layout. So there is HDFS-4420 for this, to exclude
> certain sub-trees from balancing.
>
> 2. There is another interesting JIRA, HDFS-4606, yet to begin, which will
> aim to move a block to the local datanode when it detects that it has
> performed a remote read.
>
> This will be the best option when completed, without requiring any
> plumbing code whatsoever from Blur, HBase, etc...

Yes, I like this one. This is basically what I was thinking about
implementing (logically). Since it hasn't even started development yet, it
will likely be a year or more before we see it in a stable release. But
maybe I will be surprised.

> One big downside is the effect it has on the local datanode, which will
> already be serving data, in addition to performing disk writes as a
> result of remote reads.
>
> If you are not already aware of it or have alternate ideas for Blur,
> please let me know your thoughts.

I think that #2 is likely what will be most beneficial long term. In the
near term, I haven't really seen a huge need for this type of feature.
That may be because I haven't tested Blur where the reads were mostly
local and a short-circuit read could be made. Or it might be because the
systems that I am using have enough hardware that the remote reads are not
much of an issue. Either way, I think that this would benefit us and I'm
all for trying to come up with a solution.
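For anyone skimming the thread, the HDFS-4606 behavior described above can be sketched in a few lines. This is purely illustrative (HDFS-4606 had not started development at the time): the class, method names, and threshold are all hypothetical, and a real implementation would live inside the datanode/client read path rather than in standalone code like this.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the HDFS-4606 idea: count remote reads per block and flag a
 * block for relocation to the reading node once the count crosses a
 * threshold. Hypothetical names; not part of any real Hadoop API.
 */
public class RemoteReadTracker {

  private final Map<String, Integer> remoteReadCounts = new HashMap<>();
  private final int threshold;

  public RemoteReadTracker(int threshold) {
    this.threshold = threshold;
  }

  /**
   * Record a read of blockId served from sourceNode to readerNode.
   * Returns true once the block should be copied to readerNode.
   */
  public boolean recordRead(String blockId, String sourceNode, String readerNode) {
    if (sourceNode.equals(readerNode)) {
      return false; // local read, nothing to do
    }
    int count = remoteReadCounts.merge(blockId, 1, Integer::sum);
    return count >= threshold;
  }
}
```

The threshold guards against Ravi's downside above: without it, every stray remote read would add disk writes to a datanode that is already serving data.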
:-)

Aaron

> --
> Ravi
>
>
> On Tue, Oct 15, 2013 at 11:07 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Actually, what I unearthed after long-time fishing is this:
> >
> > 1. Specifying hints to the namenode about the favored set of datanodes
> > [HDFS-2576] when writing a file in Hadoop.
> >
> > The downside of such a feature is a datanode failure. HDFS Balancer and
> >
> > Facebook's HBase port on GitHub has this patch in production
> >
> > 2.
> >
> > On Tue, Oct 15, 2013 at 7:15 PM, Aaron McCurry <[email protected]>
> > wrote:
> >
> >> On Tue, Oct 15, 2013 at 9:25 AM, Ravikumar Govindarajan <
> >> [email protected]> wrote:
> >>
> >> > Your idea looks great.
> >> >
> >> > But I am afraid I don't seem to grasp it fully. I am attempting to
> >> > understand your suggestions, so please bear with me for a moment.
> >> >
> >> > When co-locating shard-servers and datanodes:
> >> >
> >> > 1. Every shard-server will attempt to write a local copy
> >> > 2. Proceed with the default settings of Hadoop [remote rack etc...]
> >> > for the next copies
> >> >
> >> > Ideally, there are 2 problems when writing a file:
> >> >
> >> > 1. Unable to write a local copy [lack of sufficient storage etc...].
> >> > Hadoop deflects the
> >>
> >> True, this might happen if the cluster is low on storage and there are
> >> local disk failures. If the cluster is in this kind of condition, and
> >> running the normal Hadoop balancer won't help (reduce the local
> >> storage by migrating to another machine), then it's likely we can't do
> >> anything to help the situation.
> >>
> >> > write to some other destination, internally.
> >> > 2. Shard-server failure. Since this is stateless, it will most
> >> > likely be a hardware failure/planned maintenance.
> >>
> >> Yes, this is true.
> >>
> >> > Instead of asynchronously re-balancing on every write [moving
> >> > blocks-to-shard], is it possible for us to trap into the above cases
> >> > alone?
> >> Not sure what you mean here.
> >>
> >> > BTW, how do we re-arrange the shard layout when shards are
> >> > added/removed?
> >> >
> >> > I looked at the 0.20 code and it seems to move shards too much. That
> >> > could be detrimental for what we are discussing now, right?
> >>
> >> I have fixed this issue, or at least improved it:
> >>
> >> https://issues.apache.org/jira/browse/BLUR-260
> >>
> >> This implementation will only move the shards that are down, and will
> >> never move any more shards than is necessary to maintain a properly
> >> balanced shard cluster.
> >>
> >> Basically, it recalculates an optimal layout when the number of
> >> servers changes. It will likely need to be improved by optimizing the
> >> location of each shard (which server) by figuring out which server
> >> that can serve the shard has the most blocks from the shard's indexes.
> >> Doing this should minimize block movement.
> >>
> >> Also had another thought: if we follow something similar to what HBase
> >> does and perform a full optimization on each shard once a day, that
> >> would locate the data on the local server without us having to do any
> >> other work. Of course, this assumes that there is enough space locally
> >> and that there has been some change in the index in the last 24 hours
> >> to warrant doing the merge.
> >>
> >> Aaron
> >>
> >> > --
> >> > Ravi
> >> >
> >> > On Sun, Oct 13, 2013 at 12:51 AM, Aaron McCurry <[email protected]>
> >> > wrote:
> >> >
> >> > > On Fri, Oct 11, 2013 at 11:12 AM, Ravikumar Govindarajan <
> >> > > [email protected]> wrote:
> >> > >
> >> > > > I came across this interesting JIRA in Hadoop:
> >> > > > https://issues.apache.org/jira/browse/HDFS-385
> >> > > >
> >> > > > In essence, it allows us more control over where blocks are
> >> > > > placed.
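The minimal-movement relayout described for BLUR-260 above could be sketched roughly like this. This is an illustrative sketch only, not the actual Blur code: shards whose server is still online stay put, and only orphaned shards are reassigned to the least-loaded online servers.

```java
import java.util.*;

/**
 * Sketch of the BLUR-260 idea: when the set of servers changes, keep every
 * shard whose server is still online where it is, and reassign only the
 * orphaned shards. Hypothetical names; not the Blur implementation.
 */
public class MinimalMovementLayout {

  /** Returns a new shard-to-server assignment, moving as few shards as possible. */
  public static Map<String, String> recalc(Map<String, String> current,
                                           Set<String> onlineServers) {
    Map<String, String> next = new HashMap<>();
    Map<String, Integer> load = new HashMap<>();
    for (String server : onlineServers) {
      load.put(server, 0);
    }
    List<String> orphaned = new ArrayList<>();
    for (Map.Entry<String, String> e : current.entrySet()) {
      if (onlineServers.contains(e.getValue())) {
        next.put(e.getKey(), e.getValue());       // shard stays put
        load.merge(e.getValue(), 1, Integer::sum);
      } else {
        orphaned.add(e.getKey());                 // hosting server is down
      }
    }
    for (String shard : orphaned) {
      String target = Collections.min(load.entrySet(),
          Map.Entry.comparingByValue()).getKey(); // least-loaded server
      next.put(shard, target);
      load.merge(target, 1, Integer::sum);
    }
    return next;
  }
}
```

The improvement Aaron mentions (preferring the server that already holds the most blocks of the shard's indexes) would replace the least-loaded tiebreak with a block-count lookup.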
> >> > > > A BlockPlacementPolicy to optimistically write all blocks of a
> >> > > > given index-file into the same set of datanodes could be highly
> >> > > > helpful.
> >> > > >
> >> > > > Co-locating shard-servers and datanodes, along with short-circuit
> >> > > > reads, should improve things greatly. We can always utilize the
> >> > > > file-system cache for local files. In case a given file is not
> >> > > > served locally, the shard-servers can use the block-cache.
> >> > > >
> >> > > > Do you see some positives in such an approach?
> >> > >
> >> > > I'm sure that there will be some performance improvements by using
> >> > > local reads for accessing HDFS. The areas where I would expect to
> >> > > see the biggest increases in performance are merging and fetching
> >> > > data for retrieval. Even with access to local drives, the search
> >> > > time would likely not be improved, assuming the index is hot. One
> >> > > of the reasons for doing short-circuit reads is to make use of the
> >> > > OS file system cache, and since Blur already uses a filesystem
> >> > > cache it might just be duplicating functionality. Another big
> >> > > reason for short-circuit reads is to reduce network traffic.
> >> > >
> >> > > Given the scenario where another server has gone down: if the shard
> >> > > server is in an opening state, we would have to change the Blur
> >> > > layout system to only fail over to the server that contains the
> >> > > data for the index. This might have some problems, because the
> >> > > layout of the shards is based on how many shards are online on the
> >> > > existing servers, not on where they are located.
> >> > >
> >> > > So in a perfect world, if the shard fails over to the correct
> >> > > server, it would reduce the amount of network traffic (to near 0 if
> >> > > it was perfect). So in short it could work, but it might be very
> >> > > hard.
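As an aside, the short-circuit read setup discussed above is mostly configuration in Hadoop 2.x. A minimal hdfs-site.xml fragment might look like the following; the socket path is only an example value, and the native libhadoop library must be available on each datanode for this to take effect:

```xml
<!-- hdfs-site.xml (client and datanode) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- example path; the datanode must be able to create this socket -->
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```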
> >> > > Assuming for a moment that the system is not dealing with a
> >> > > failure, and assuming that the shard server is also running a
> >> > > datanode: the first replica in HDFS goes to the local datanode, so
> >> > > if we just configure Hadoop for short-circuit reads we would
> >> > > already get the benefits for data fetching and merging.
> >> > >
> >> > > I had another idea that I wanted to run by you.
> >> > >
> >> > > What if, instead of actively placing blocks during writes with a
> >> > > Hadoop block layout policy, we write something like the Hadoop
> >> > > balancer? We get Hadoop to move the blocks to the shard server
> >> > > (datanode) that is hosting the shard. That way it would be
> >> > > asynchronous, and after a failure / shard movement it would
> >> > > relocate the blocks and still use a short-circuit read. Kind of
> >> > > like the blocks following the shard around, instead of deciding up
> >> > > front where the data for the shard should be located. If this is
> >> > > what you were thinking, I'm sorry for not understanding your
> >> > > suggestion.
> >> > >
> >> > > Aaron
> >> > >
> >> > > > --
> >> > > > Ravi
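The "blocks follow the shard" balancer Aaron proposes could be planned roughly as below. This is a sketch under assumed data structures (the class and parameter names are hypothetical): for each hosted shard, any block without a replica on the shard's datanode gets a scheduled copy. A real tool would drive the HDFS replication machinery rather than return strings.

```java
import java.util.*;

/**
 * Sketch of a balancer-like tool that moves blocks to the datanode hosting
 * the shard that owns them. Hypothetical structures; illustrative only.
 */
public class ShardBlockMover {

  /**
   * Plan the block copies needed so that every block of every shard has a
   * replica on the shard's hosting datanode. Returns moves as "block->node".
   */
  public static List<String> planMoves(Map<String, String> shardToServer,
                                       Map<String, List<String>> shardToBlocks,
                                       Map<String, Set<String>> blockLocations) {
    List<String> moves = new ArrayList<>();
    for (Map.Entry<String, String> e : shardToServer.entrySet()) {
      String server = e.getValue();
      List<String> blocks =
          shardToBlocks.getOrDefault(e.getKey(), Collections.emptyList());
      for (String block : blocks) {
        Set<String> locations =
            blockLocations.getOrDefault(block, Collections.emptySet());
        if (!locations.contains(server)) {
          moves.add(block + "->" + server); // no local replica; schedule a copy
        }
      }
    }
    return moves;
  }
}
```

Because the planning runs asynchronously, it naturally re-converges after a failure or shard movement, which is exactly the property that makes this approach attractive compared to up-front placement.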
