Hi Andrew, I wanted to reply to get the conversation started, although I'm not as knowledgeable as others on this topic.
How are erasure-coded blocks handled by the block locations APIs? I believe our scheduler just falls back to round-robin if the blocks aren't local to a particular daemon (we already do this for S3 and for filesystems like DSSD and Isilon). We handle remote reads differently from local reads: there is a separate I/O queue for each local disk, plus a separate remote read queue. It looks like we do up to 8 concurrent remote reads by default. It might just work out of the box, although I don't know whether the current parameters are optimal.

On Thu, Nov 17, 2016 at 1:43 PM, Andrew Wang <[email protected]> wrote:
> Hi Impala folks,
>
> I was wondering if there was any Impala work required to integrate with
> HDFS erasure coding (planned for release in Hadoop 3, already available in
> alpha form in 3.0.0-alpha1). I know that Impala tries to localize to nodes
> and disks. With EC though, most reads will be remote, so locality isn't
> important.
>
> Is Impala scheduling going to work out-of-the-box?
>
> Another idea is to implement a stride-aware data format, which re-enables
> locality even for striped blocks. It's not clear if this is important
> though, since EC is meant for cold data that isn't queried often.
>
> Thanks,
> Andrew
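For anyone following along, the round-robin fallback I described can be sketched roughly like this. This is a minimal illustration only; the function name, the block/replica data shapes, and the daemon list are all hypothetical and do not reflect Impala's actual scheduler code.

```python
# Hypothetical sketch: prefer a daemon holding a local replica of a
# block; if none exists (e.g. an erasure-coded/striped block with no
# local copy), fall back to round-robin across all daemons.
from itertools import cycle

def assign_scan_ranges(blocks, daemons):
    """Map each block id to a daemon. Local replicas win; otherwise
    daemons are assigned in rotation."""
    assignments = {}
    rr = cycle(daemons)  # round-robin iterator over all daemons
    for block in blocks:
        local = [d for d in daemons if d in block["replicas"]]
        assignments[block["id"]] = local[0] if local else next(rr)
    return assignments

blocks = [
    {"id": "b1", "replicas": {"node1"}},  # has a local replica
    {"id": "b2", "replicas": set()},      # no local replica: round-robin
    {"id": "b3", "replicas": set()},      # no local replica: round-robin
]
print(assign_scan_ranges(blocks, ["node1", "node2", "node3"]))
```

The point being: if the block locations API simply reports no (or remote-only) locations for EC blocks, this fallback path already handles them, which is why it might work out of the box.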
