Responses/comments inline.

Before those, a note about process:
It looks like the work on this proposal is already underway. However, the
doc you sent out still has many unanswered questions, one of which is
"to what extent will these changes degrade existing use cases that are
currently well supported".
I think it is very important to spell out the goals and non-goals of this
proposal explicitly, preferably in a separate section, so that there can be
a meaningful discussion of those in the Impala community.

The proposal lists 6 contributors, all of whom are/were Cloudera employees
at the time of writing. It appears that there was a fair amount of
discussion and experimentation that went into this proposal, but this
wasn't done publicly. This can create the impression that community
involvement is seen as an afterthought rather than an integral part of the
process (and the warning in red at the top that 'this document is public'
doesn't really do anything to counter that impression :)).

On Fri, Jun 8, 2018 at 9:54 AM, Todd Lipcon <[email protected]> wrote:

> On Thu, Jun 7, 2018 at 9:27 AM, Marcel Kornacker <[email protected]>
> wrote:
>
>> Hey Todd,
>>
>> thanks for putting that together, that area certainly needs some work.
>> I think there are a number of good ideas in the proposal, and I also
>> think there are a number of ways to augment it to make it even better:
>> 1. The size of the cache is obviously a result of the caching
>> granularity (or lack thereof), and switching to a bounded size with
>> more granularity seems like a good choice.
>> 2. However, one of the biggest usability problems stems from the need
>> for manual metadata discovery (via Invalidate/Refresh). Introducing a
>> TTL for cached objects seems like a bad approach, because it creates a
>> conflict between metadata freshness and caching
>> effectiveness/performance. This isn't really something that users want
>> to have to trade off.
>>
>
> Agreed. In the "future directions" part of the document I included one
> item about removing the need for invalidate/refresh, perhaps by subscribing
> to the push notification mechanisms provided by HMS/NN/S3/etc. With those
> notification mechanisms, the cache TTL could be dialed way up (perhaps to
> infinity), and we could rely on cache capacity alone for eviction.
>
> In some cases the notification mechanisms provide enough information to
> update the cache in place (i.e., the notification has the new information),
> whereas in others they contain only enough information to perform an
> invalidation. But from the user's perspective it achieves near-real-time
> freshness either way.
>
> Note, though, that, absent transactional coordination between Impala and
> source systems, we'll always be in the realm of "soft guarantees". That is
> to say, the notification mechanisms are all asynchronous and could be
> backlogged due to a number of issues out of our control (e.g., momentary
> network congestion between HMS and Impala, or unknown issues on the S3/SNS
> backend). As you pointed out in the doc, "soft guarantees" are somewhat
> weak when you come at it from the perspective of an application developer.
> So, even with those mechanisms, we may need some explicit command like
> "sync metadata;" or somesuch that acts like a memory barrier.
>
> I think trying to bite all of that off in one go will increase the scope
> of the project quite a bit, though. Did you have some easier mechanism in
> mind to improve usability here that I'm missing?
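
To make the notification + "sync metadata" idea above concrete (and to check
that I'm reading it the way you intend), here is a rough sketch of the
semantics. All class and method names below are made up for illustration;
none of this is existing Impala or HMS code:

// Hypothetical sketch only; none of these classes or methods exist in Impala.
// It illustrates (a) applying a push notification either as an in-place
// update or as an invalidation, and (b) a "sync metadata"-style barrier that
// waits until the local cache has caught up to a given source event id.
import java.util.concurrent.ConcurrentHashMap;

class NotificationDrivenCache {
  private final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
  private long lastAppliedEventId = 0;

  // Called for every notification received from HMS/NN/S3/etc. If the
  // notification carries the new metadata, update the entry in place;
  // otherwise drop it and let the next access re-fetch from the source.
  synchronized void onNotification(long eventId, String objectKey, Object newValueOrNull) {
    if (newValueOrNull != null) {
      cache.put(objectKey, newValueOrNull);
    } else {
      cache.remove(objectKey);
    }
    lastAppliedEventId = eventId;
    notifyAll();
  }

  // "sync metadata;" semantics: block until all source events up to
  // targetEventId (e.g. the latest HMS event id at the time the command was
  // issued) have been applied locally; in effect a memory barrier for
  // metadata freshness.
  synchronized void waitUntilSynced(long targetEventId, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (lastAppliedEventId < targetEventId) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) throw new IllegalStateException("metadata sync timed out");
      wait(remaining);
    }
  }
}

The important property is that the TTL then only matters as a fallback:
freshness comes from the event stream, and the explicit barrier gives
application developers a hard synchronization point when they need one.
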
>
>> 3. Removing the role of a central "metadata curator" effectively
>> closes the door on doing automated metadata discovery in the future
>> (which in turn seems like a prerequisite for services like automated
>> format conversion, i.e., anything that needs to react to the presence of
>> new data).
>>
>
> That's fair. Something like catalogd may still have its place to
> centralize the subscription to the notification services and re-broadcast
> them to the cluster as necessary. My thinking was that it looks
> sufficiently different from the current catalogd, though, that it would be
> a new component rather than an evolution of the existing
> catalogd-statestore-impalad metadata propagation protocol.
>
>
>> To address those concerns, I am proposing the following changes:
>> 1. Keep the role of a central metadata curator (let's call it metad
>> just for the sake of the argument, so as not to confuse it with the
>> existing catalogd).
>>
>
> Per above, I think this could be useful.
>
>
>> 2. Partition the namespace in some way (for instance, into a fixed
>> number of topics, with each database assigned to a topic) and allow
>> coordinators to join and drop topics as needed in order to stay within
>> their cache size bounds.
>>
>
> I don't think partitioning to the granularity of a database is sufficient.
> We've found that even individual tables can cause metadata updates in
> excess of 1GB, even though those tables are infrequently accessed, and,
> when they are accessed, often only a very small subset of partitions is
> touched. Often these mega tables live in the same database as the rest of
> the tables accessed by a workload, and asking users to manually partition
> them out lets physical implementation concerns bleed into logical schema
> design decisions.
> That's quite a hassle when considering things like security policies being
> enforced on a per-database basis, right?
>

So then partition into a fixed number of topics and assign each *table* to
one? Or are you saying that caching granularity should be below the table
level? (Note that the actual granularity isn't spelled out in the
proposal.)
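
If it's the former, the kind of assignment I would expect is roughly the
following (purely illustrative, with "topic" in the statestore sense; this
class doesn't exist anywhere):

// Illustrative sketch: deterministically assign each table to one of a fixed
// number of metadata topics, so that every coordinator computes the same
// assignment and can subscribe only to the topics its workload touches,
// dropping topics again under memory pressure.
class TableTopicAssignment {
  private final int numTopics;

  TableTopicAssignment(int numTopics) { this.numTopics = numTopics; }

  int topicForTable(String dbName, String tableName) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod((dbName + "." + tableName).hashCode(), numTopics);
  }
}

Note that under such a scheme a coordinator would still join and drop whole
topics (i.e., groups of tables) rather than individual entries to stay within
its cache bounds, which is why I'm asking whether the intended granularity is
actually below the table level.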


>
>> 3. The improvements you suggest for streamlining the code, avoiding
>> in-place updates, and making metadata retrieval more efficient still
>> apply here, they're just executed by a different process.
>>
>
> I think a critical piece of the design, though, is that the incremental
> and fine-grained metadata retrieval extends all the way down to the
> impalad, and not just to the centralized coordinator. If we are pushing
> metadata at database (or even table) granularity from the catalog to the
> impalads, we won't reap the major benefits described here.
>

Why would caching at the table level not work? Is the assumption that there
are lots of very large and very infrequently accessed tables around, so
that they would destroy the cache hit rate?

It would be good to get clarity on these scenarios. It feels like the
proposal is being targeted at some very specific use cases, but these
haven't been spelled out clearly. If these are extreme use cases,
characterized by catalogs that are several orders of magnitude larger than
what an average catalog looks like, it would be good to recognize that and
be cognizant of the extent to which more middle-of-the-road use cases will
be impacted.


>
> Here's another approach we could consider long term:
> - keep a central daemon responsible for interacting with source systems
> - impalads don't directly access source systems, but instead send metadata
> fetch requests to the central coordinator
> - impalads maintain a small cache of their own, and the central
> coordinator maintains a larger cache. If the impalad requests some info
> that the central daemon doesn't have, it fetches on demand from the source.
> - central coordinator propagates invalidations to impalads
>
> In essence, it's like the current catalog design, but based on a
> fine-grained "pull" design instead of a course-grained "push/broadcast"
> design.
>

The problem with 'pull' designs is that they don't work well with
fine-grained metadata discovery. For example, when a new file arrives, your
proposal sounds like it would require an invalidation at either the
partition level or, in the case of an unpartitioned table, the full table
level. Given that Refresh/Invalidate are among the most problematic aspects
of running Impala in a production system, it would be good to evolve the
handling of metadata in a direction that doesn't preclude fine-grained and
automated metadata propagation in the future.
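
To illustrate the distinction I'm worried about, here is a sketch of the two
kinds of propagation (again purely illustrative; these types don't exist in
Impala or in the proposal):

// Coarse-grained: "something under this table changed". A pull design can
// only drop the cache entry and re-fetch the table (or partition) on the
// next access.
class InvalidateTable {
  final String dbName;
  final String tableName;
  InvalidateTable(String dbName, String tableName) {
    this.dbName = dbName;
    this.tableName = tableName;
  }
}

// Fine-grained: "this file was added to this partition". This can be applied
// to the cached file listing in place, with no re-fetch and no loss of the
// rest of the cached table metadata, and it is the kind of event automated
// metadata discovery would want to emit.
class FileAdded {
  final String dbName;
  final String tableName;
  final String partitionName;
  final String filePath;
  final long fileLengthBytes;
  FileAdded(String dbName, String tableName, String partitionName,
            String filePath, long fileLengthBytes) {
    this.dbName = dbName;
    this.tableName = tableName;
    this.partitionName = partitionName;
    this.filePath = filePath;
    this.fileLengthBytes = fileLengthBytes;
  }
}

Whatever the exact design ends up being, it would be good if the propagation
path from the central component to the coordinators could carry events of
the second kind, not just the first.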


>
> This allows impalads to scale out without linearly increasing the amount
> of load on source systems. The downsides, though, are:
> - extra hop between impalad and source system on a cache miss
> - extra complexity in failure cases (the central daemon becomes a SPOF, and
> if we replicate it we have more complex consistency to worry about)
> - scalability benefits may only be really important on large clusters.
> Small clusters can often be served sufficiently by just one or two
> coordinator impalads.
>
> -Todd
>
>
>> On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon <[email protected]> wrote:
>> > Hey Impala devs,
>> >
>> > Over the past 3 weeks I have been investigating various issues with
>> > Impala's treatment of metadata. Based on data from a number of user
>> > deployments, and after discussing the issues with a number of Impala
>> > contributors and committers, I've come up with a proposal for a new design.
>> > I've also developed a prototype to show that the approach is workable and
>> > is likely to achieve its goals.
>> >
>> > Rather than describe the design in duplicate, I've written up a proposal
>> > document here:
>> > https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8#
>> >
>> > Please take a look and provide any input, questions, or concerns.
>> >
>> > Additionally, if any users on this list have experienced metadata-related
>> > problems in the past and would be willing to assist in testing or
>> > contribute workloads, please feel free to respond to me either on or off
>> > list.
>> >
>> > Thanks
>> > Todd
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
