True, I am listing possible questions in this thread. It is good to have this analysis done in the Apex community, at a minimum for us all to understand the possible cases (in no particular order other than as listed so far in this thread).
1. Container not available on the node asked
2. Desire/ability to not use the same node for two block readers
3. If not node local, rack local can chip in
4. Limit the number of ask-iterations with the Resource Manager
5. Data nodes and compute nodes are not the same
6. Mainly works when the app is started; while the app is running, upon a node
outage the algorithm gets complicated, as other partitions are already locked
in and restarting them may not be wanted

Thks
Amol

On Tue, May 10, 2016 at 7:52 AM, Thomas Weise <[email protected]> wrote:

> The underlying assumption that each compute node is also a data node isn't
> correct either. I would like to see a better analysis of pros and cons in
> this discussion.
>
> Thomas
>
> On Tue, May 10, 2016 at 12:00 AM, Thomas Weise <[email protected]> wrote:
>
> > The problem is that you cannot reallocate partitions without resetting
> > the downstream dependencies. We are discussing a long-running app, not a
> > MapReduce job.
> >
> > What use case has sparked this discussion? Is there a specific request
> > for such a feature?
> >
> > Thanks,
> > Thomas
> >
> > On Mon, May 9, 2016 at 5:41 PM, Amol Kekre <[email protected]> wrote:
> >
> > > Spawning block readers on all data nodes will cause scale issues. For
> > > example, on a 1000-data-node cluster we cannot ask for 1000 containers
> > > for a file that has, say, 8 blocks. This problem has been solved by
> > > MapReduce; ideally we should use that part of MapReduce, but I am not
> > > sure it can be reused. Assuming we cannot reuse it, I am covering
> > > possible cases to consider.
> > >
> > > At a high level, the operator wants to spawn containers where the data
> > > is. Given that, it becomes a resource-ask call. We do have
> > > LOCALITY_HOST to start with in the pre-launch phase, so that is a
> > > great start. As the RM is asked for resources, we need to consider the
> > > implications. I am listing some here (there could be more):
> > > 1. That node does not have a container available: the RM rejects or
> > > allocates another node
> > > 2. It may not be desirable to put two partitions on the same node
> > > // Apex already has a strong RM resource-ask protocol, so Apex is
> > > again on good ground
> > >
> > > We can improve a bit more, as HDFS blocks would (usually) have three
> > > copies. We can improve the probability of success by listing all three
> > > in LOCALITY_HOST. Here it gets slightly complicated. Note that
> > > requesting all three nodes and returning two (in the worst-case
> > > scenario) per partition taxes the RM, so that should be avoided.
> > >
> > > The solution could be something along the following lines:
> > > - Get node affinity from each partition, with 3 choices
> > > - Create a first list of nodes that satisfies #1 and #2 above (or more
> > > constraints)
> > > - Then iterate till a solution is found:
> > >   -> Ask the RM for node selections
> > >   -> Check whether the containers returned fit the solution
> > >   -> Repeat till a good fit is found, or for at most N iterations
> > > - End iteration
> > >
> > > There are some more optimizations during iterations: if node local
> > > does not work, try rack local. For now, attempting node local (say N
> > > times, where N may be as low as 2) would be a great start.
> > >
> > > Thks,
> > > Amol
> > >
> > > On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]>
> > > wrote:
> > >
> > > > > It is already possible to request a specific host for a partition.
> > > >
> > > > That's true. I just saw that a Partition contains a Map of
> > > > attributes and that can contain LOCALITY_HOST.
> > > >
> > > > > But you may want to evaluate the cost of container allocation and
> > > > > the need to reset the entire DAG against the benefits that you
> > > > > get from data locality.
> > > >
> > > > I see. So instead of spawning Block Reader on all the nodes
> > > > (Pramod's proposal), we can spawn Block Reader on all the data
> > > > nodes.
> > > > We can then have an HDFS-specific module which finds all the data
> > > > nodes by talking to the NameNode and creates BlockReader partitions
> > > > using that.
> > > >
> > > > Chandni
> > > >
> > > > On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]>
> > > > wrote:
> > > >
> > > > > It is already possible to request a specific host for a partition.
> > > > >
> > > > > But you may want to evaluate the cost of container allocation and
> > > > > the need to reset the entire DAG against the benefits that you get
> > > > > from data locality.
> > > > >
> > > > > --
> > > > > sent from mobile
> > > > > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Pramod,
> > > > > >
> > > > > > I thought about this, and IMO one way to achieve it a little
> > > > > > more efficiently is by providing some support from the platform
> > > > > > and intelligent partitioning in BlockReader.
> > > > > >
> > > > > > 1. Platform support: a partition should be able to express on
> > > > > > which node it should be created. The application master then
> > > > > > requests the RM to deploy the partition on that node.
> > > > > >
> > > > > > 2. Initially just one instance of Block Reader is created. When
> > > > > > it receives BlockMetadata, it can derive where the new HDFS
> > > > > > blocks are, so it can create more partitions if a BlockReader
> > > > > > isn't already running on that node.
> > > > > >
> > > > > > I would like to take this up if there is some consensus to
> > > > > > support it.
> > > > > >
> > > > > > Chandni
> > > > > >
> > > > > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > So the requirement is to mix runtime and deployment decisions.
> > > > > > > How about allowing the operators to request re-deployment
> > > > > > > based on the runtime condition?
> > > > > > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > The file splitter / block reader combination allows for
> > > > > > > > parallel reading of files by multiple partitions by dividing
> > > > > > > > the files into blocks. Does anyone have any ideas on how to
> > > > > > > > have the block readers be data-local to the blocks they are
> > > > > > > > reading?
> > > > > > > >
> > > > > > > > I think we will need to spawn block readers on all nodes
> > > > > > > > where the blocks are present, and if the readers are reading
> > > > > > > > multiple files this could mean all the nodes in the cluster,
> > > > > > > > and route the block meta information to the appropriate
> > > > > > > > block reader.
> > > > > > > >
> > > > > > > > Thanks
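Earlier in the thread Amol suggests falling back from node local to rack local when node-local asks fail. One way to classify a granted container is sketched below; `LocalityFallback` and the plain-Map rack lookup are invented for illustration (on a real cluster the rack would come from the topology mapping).

```java
import java.util.*;

// Invented sketch of the node-local -> rack-local -> anywhere fallback
// mentioned in the thread.
public class LocalityFallback {

    public enum Level { NODE_LOCAL, RACK_LOCAL, ANY }

    /** Classify a granted host against a block's replica hosts. */
    public static Level classify(String grantedHost,
                                 List<String> replicaHosts,
                                 Map<String, String> rackOf) {
        if (replicaHosts.contains(grantedHost)) {
            return Level.NODE_LOCAL;       // granted host holds a replica
        }
        String grantedRack = rackOf.get(grantedHost);
        for (String h : replicaHosts) {
            if (Objects.equals(rackOf.get(h), grantedRack)) {
                return Level.RACK_LOCAL;   // same rack as some replica
            }
        }
        return Level.ANY;                  // no locality at all
    }
}
```

The allocation loop could accept `RACK_LOCAL` grants once the node-local attempts (say N, with N as low as 2, per the thread) are exhausted.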
