Re: "Death of Schema-on-Read"

Ted Dunning Fri, 06 Apr 2018 11:47:15 -0700

On Thu, Apr 5, 2018 at 10:22 PM, Hanumath Rao Maduri <[email protected]>
wrote:


> ...
>
> Thank you Ted for your valuable suggestions, as regards to your comment on
> "metastore is good but centralized is bad" can you please share your view
> point on what all design issues it can cause. I know that it can be
> bottleneck but just want to know about other issues.

> Put in other terms if centralized metastore engineered in a good way to
> avoid most of the bottleneck, then do you think it can be good to use for
> metadata?
>

Centralized metadata stores have caused the following problems in my
experience:

1) they lock versions and make it extremely hard to upgrade applications
incrementally. It is a common fiction that one can upgrade all applications
using the same data at the same moment. It isn't acceptable to require an
outage and force an upgrade on users. It also isn't acceptable to force the
metadata store to never be updated.

2) they go down and take everything else with them.

3) they require elaborate caching. The error message "updating metadata
cache" was the most common string on the Impala mailing list for a long
time because of the 30-minute delays that customers were seeing due to this
kind of problem.

4) they limit expressivity. Because it is hard to update a metadata store
safely, they move slowly and typically don't describe new data well. Thus,
the Hive metadata store doesn't deal with variably typed data or structured
data worth a darn. The same thing will happen with any new centralized
metadata store.

5) they inhibit multi-tenancy. Ideally, data describes itself so that
different users can see the same data even if they are nominally not part
of the same org or sub-org.

6) they inhibit data fabrics that extend beyond a single cluster.
Centralized metadata stores are inherently anti-global. Self-describing
data, on the other hand, is inherently global since wherever the data
goes, so goes the metadata. Note that self-describing data does not have to
be intrinsically self-descriptive in a single file. I view a JSON file with a
schema file alongside as a self-describing pair.
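
To make the "self-describing pair" idea concrete, here is a minimal sketch in
Python. The naming convention (events.json paired with events.schema.json),
the schema layout, and the helper names are all assumptions made up for
illustration; they are not taken from Drill, Hive, or any other tool.

```python
import json
import tempfile
from pathlib import Path


def sidecar_schema_path(data_path: Path) -> Path:
    # Hypothetical convention: events.json -> events.schema.json
    return data_path.with_name(data_path.stem + ".schema.json")


def load_self_describing(data_path: Path):
    """Return (schema, records), reading the schema from next to the data."""
    schema = json.loads(sidecar_schema_path(data_path).read_text())
    records = json.loads(data_path.read_text())
    return schema, records


def demo():
    # Write a data file and its schema side by side, then read both back
    # without consulting any central service.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "events.json"
        data.write_text(json.dumps([{"user": "a", "amount": 1.5}]))
        sidecar_schema_path(data).write_text(
            json.dumps({"fields": {"user": "string", "amount": "double"}})
        )
        return load_self_describing(data)
```

Because the reader finds the schema by looking next to the data, copying the
two files to another cluster carries the metadata along with them.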

As an example, imagine that file extensions were tied to applications by a
central authority (a metadata store). This would mean that you couldn't
change web browsers (.html) or spreadsheets. Or compilers. And frankly,
the fact that my computer has a single idea about how a file is interpreted
is limiting. I would prefer to use Photoshop on images in certain
directories and Preview for other images elsewhere. A single repository
linking file type to application is too limiting even on my laptop.

That is the same issue, ultimately, as a centralized metadata store except that
my issues with images are tiny compared to the problems that occur when you
have 5000 analysts working on data that all get screwed by a single broken
piece of software.
