Thanks for the update, Todd! -- Philip
On Tue, Aug 7, 2018 at 12:34 AM Todd Lipcon <[email protected]> wrote:

> Hi folks,
>
> It's been a few weeks since the last update on this topic, so I wanted to
> check in and share the progress as well as some plans for the coming weeks.
>
> As far as progress is concerned, most of the work committed (or
> almost-committed) so far has been extensive refactoring. You may notice
> that the majority of the Frontend now refers to catalog objects through
> interfaces such as FeTable, FeDb, and FeFsTable rather than the concrete
> classes Table, Db, and HdfsTable. In addition, the catalog itself is
> accessed through an interface instead of a specific implementation. A new
> impalad flag, --use_local_catalog, can be flipped on to switch in the
> "new" impalad catalog implementation.
>
> One notable exception to the above refactoring is DDL statements such as
> CREATE/DROP/ALTER. As we've worked on the project, it has become clear
> that the DDL implementation is tightly interwoven with the concrete
> catalog implementations, since most of the operations attempt to
> incrementally modify catalog objects in place to reflect changes made in
> external source systems. Additionally, the treatment of the catalog
> versioning and metadata propagation protocol is quite delicate. Given
> that, we decided a few weeks back to continue delegating DDL operations
> to the catalogd through the existing mechanisms.
>
> This turned up an interesting new set of problems. Specifically, if the
> impalad fetches metadata directly from source systems, but the catalogd
> continues to operate with its "snapshotted" table-granularity caches of
> objects, it is possible (and even likely) that an impalad will have a
> *newer* view of metadata than the catalogd.
> This can result in very confusing scenarios, for example:
>
> 1) A user creates a table through some external system such as Hive or
> Spark.
> 2) The user is able to see and query the table on the impalad, since it
> fetched the table metadata directly from the HMS.
> 3) The user then tries to drop the table from Impala. When the request
> reaches the catalogd, the catalogd responds with a "table does not exist"
> error.
> 4) The user, confused, may run "show tables" again, and will continue to
> see the table.
>
> At this point the only way to resolve the inconsistency is some set of
> INVALIDATE METADATA statements. There are many other similar scenarios in
> which a user may get unexpected results or errors from DDLs because the
> catalogd holds an older cached copy than the impalad is showing.
>
> One way to fix these issues would be to fully re-implement all of the DDL
> statements without the tight interleaving with catalogd code. However,
> that would take considerable time and bring more risk due to the
> complexity of those code paths.
>
> Additionally, during the design phase various concerns were raised about
> the "fetch directly from source systems" approach, including:
>
> 1) increased load on source systems;
> 2) potential for small semantic changes, such as for users who rely on
> REFRESH "freezing" the view of a table;
> 3) loss of various fringe features, such as non-persistent functions,
> which currently are saved _only_ in catalogd memory;
> 4) more difficulty later implementing "subscribe to source system"
> functionality, where new files or objects are discovered automatically.
>
> Given the above, plus the risk and effort of re-implementing DDL
> operations decoupled from the catalogd, I'm currently proposing that we
> shift the design of the project a bit. Namely, instead of "fetch granular
> metadata on-demand from source systems with no catalogd", it will become
> "fetch granular metadata on-demand from catalogd".
> The vast majority of the refactoring work detailed at the top of this
> email is still relevant: the planner still needs to operate on
> lighter-weight objects with different properties than the catalog's
> concrete classes, and the machinery to switch the behavior around still
> makes sense. It's just that the actual _source_ of metadata will be a new
> RPC on the catalogd that allows on-demand granular metadata retrieval.
>
> A few other key points of this design are:
>
> - When the impalads fetch metadata, they can note the version number of
> the catalog object that they fetched. This ensures they never see "read
> skew" in their cache -- all pieces of metadata for a given table must
> have been read at the same version. Invalidation is also much easier with
> these consistent version numbers.
> - Instead of broadcasting full catalog objects through the statestore,
> the catalogd now just needs to send out notices of object version number
> changes, for example "table:functional.alltypes -> version 123". This
> message causes the impalads to invalidate their cache for that table for
> any version less than 123, so all future queries are sure to re-fetch the
> latest data on demand.
> - Ideally we can make partition objects immutable, so that a given
> partition ID will never be reused with different data. On a REFRESH or
> other partition metadata change, a new ID is assigned to the partition.
> This allows us to safely cache partitions (the largest piece of metadata)
> with long TTLs, and to do very cheap updates when a change affects only a
> subset of partitions.
>
> In terms of the original "fetch from source systems" design, I continue
> to think it has promise longer-term, as it simplifies the overall
> architecture and can help with cross-system metadata consistency.
> But we can treat fetch-from-catalogd as a nice interim step that should
> bring most of the performance and scalability benefits to users sooner
> and with less risk.
>
> I'll plan to update the original design document to reflect this in the
> coming days.
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
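The version-number invalidation protocol quoted above (impalads tag cached metadata with the catalog version it was fetched at; the statestore relays notices like "table:functional.alltypes -> version 123") can be sketched roughly as follows. This is a minimal illustration with hypothetical class and method names, not Impala's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of an impalad-side metadata cache keyed by table name, where each
 * entry remembers the catalog version it was fetched at. A statestore notice
 * announcing a newer version drops stale entries, so the next query re-fetches
 * granular metadata from the catalogd on demand. (Hypothetical names.)
 */
class VersionedMetaCache {
  /** A cached piece of table metadata plus the catalog version it came from. */
  static final class Entry {
    final long version;
    final Object metadata;
    Entry(long version, Object metadata) {
      this.version = version;
      this.metadata = metadata;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();

  /** Store metadata fetched from the catalogd at a given catalog version. */
  public void put(String tableName, long version, Object metadata) {
    cache.put(tableName, new Entry(version, metadata));
  }

  /** Return cached metadata, or null if absent (caller re-fetches on demand). */
  public Object get(String tableName) {
    Entry e = cache.get(tableName);
    return e == null ? null : e.metadata;
  }

  /**
   * Handle a statestore notice "tableName -> newVersion": invalidate the
   * cached entry only if it predates the announced version. An entry already
   * at (or past) that version is kept.
   */
  public void onVersionNotice(String tableName, long newVersion) {
    cache.computeIfPresent(tableName,
        (k, e) -> e.version < newVersion ? null : e);
  }
}
```

Note how the read-skew guarantee would fall out of the same version tags: a query can check that every piece of a table's metadata carries the same version before using it, and re-fetch otherwise.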

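The immutable-partition idea from the third bullet can likewise be sketched: a partition object is never mutated in place, and any metadata change produces a new object under a freshly assigned ID, so a copy cached under the old ID can safely be kept with a long TTL and only changed partitions need re-fetching. Again, all names here are hypothetical illustrations, not Impala's real classes:

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of an immutable partition-metadata object. Each instance gets a
 * unique ID that is never reused for different data; a REFRESH or other
 * change yields a brand-new object (and ID) instead of mutating this one.
 * (Hypothetical names; fileCount stands in for real file/block metadata.)
 */
final class ImmutablePartition {
  private static final AtomicLong NEXT_ID = new AtomicLong();

  final long id;         // unique, never reused with different data
  final String name;     // e.g. "year=2018/month=8"
  final long fileCount;  // stand-in for the partition's file metadata

  private ImmutablePartition(String name, long fileCount) {
    this.id = NEXT_ID.incrementAndGet();
    this.name = name;
    this.fileCount = fileCount;
  }

  static ImmutablePartition create(String name, long fileCount) {
    return new ImmutablePartition(name, fileCount);
  }

  /**
   * A metadata change returns a new object with a new ID; the old ID remains
   * a valid cache key for the old data, which is what makes long-TTL caching
   * of partitions safe.
   */
  ImmutablePartition withFileCount(long newFileCount) {
    return new ImmutablePartition(name, newFileCount);
  }
}
```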