Re: [DISCUSS] PIP 14: Pulsar Schema Registry

David Rusek Mon, 12 Feb 2018 09:06:34 -0800

Thanks for the input!

I'm sorry for the delay in responding. I sympathize with the concerns
surrounding ZooKeeper, which matter most for large-scale installations. I
have updated the PR associated with this proposal and separated out the
storage used by the schema registry. The proposal still uses
ZooKeeper/BookKeeper to store schemas, but only as a default
implementation that is completely replaceable through a configuration
setting. The storage is now pluggable and could be built on any number of
external systems available in a deployment. I would eventually like to
incorporate an implementation based on a BookKeeper key/value contrib
module once it becomes available[1]. I have taken your suggestions
regarding the message fields and security concerns and updated the PR; the
schema definition is now pretty sparse and allows for easy extension by
the user of the system.
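
As a rough illustration of what pluggable schema storage selected by a
configuration setting could look like (a sketch only: the interface, the
class names, and the "memory" setting value are hypothetical and are not
taken from the PR):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical storage abstraction; the names are not taken from the PR. */
interface SchemaStorage {
    /** Appends a schema for the topic and returns its version, starting at 1. */
    long put(String topic, byte[] schema);

    /** Returns the latest schema for the topic, or null if none is stored. */
    byte[] getLatest(String topic);
}

/** Stand-in backend: an append-only list of schema versions per topic, in memory. */
final class InMemorySchemaStorage implements SchemaStorage {
    private final Map<String, List<byte[]>> versions = new ConcurrentHashMap<>();

    @Override
    public long put(String topic, byte[] schema) {
        List<byte[]> list = versions.computeIfAbsent(topic, t -> new ArrayList<>());
        synchronized (list) {
            list.add(schema.clone());
            return list.size(); // the new entry's position doubles as its version
        }
    }

    @Override
    public byte[] getLatest(String topic) {
        List<byte[]> list = versions.get(topic);
        if (list == null) {
            return null;
        }
        synchronized (list) {
            return list.get(list.size() - 1).clone();
        }
    }
}

/** Picks a backend from the value of a (hypothetical) configuration setting. */
final class SchemaStorageFactory {
    private static final Map<String, Supplier<SchemaStorage>> BACKENDS =
            new ConcurrentHashMap<>();

    static {
        BACKENDS.put("memory", InMemorySchemaStorage::new);
    }

    /** Lets a deployment plug in its own backend under a new setting value. */
    static void register(String name, Supplier<SchemaStorage> backend) {
        BACKENDS.put(name, backend);
    }

    static SchemaStorage create(String configuredName) {
        Supplier<SchemaStorage> backend = BACKENDS.get(configuredName);
        if (backend == null) {
            throw new IllegalArgumentException(
                    "Unknown schema storage: " + configuredName);
        }
        return backend.get();
    }
}
```

In this sketch a ZooKeeper/BookKeeper-backed default, or a backend built on
some external system, would be another SchemaStorage implementation
registered under its own setting value, so the broker code that reads and
writes schemas would not need to change.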


[1]
http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201802.mbox/browser

-Dave

On Tue, Feb 6, 2018 at 10:53 AM, Sahaya Andrews <andr...@apache.org> wrote:

> It's not clear from the proposal whether a schema is enforced for a
> namespace or whether each individual topic has its own schema.
>
> Regarding the use of ZooKeeper, I also agree with Joe. We should avoid
> using ZooKeeper to store such information since it limits our ability
> to scale beyond a certain point.
>
> We could create a namespace-specific ledger to store such data.
>
> Andrews.
>
> On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:
> > The concept of a schema registry is good. I have some questions,
> > concerns, and comments.
> >
> > 1) Access control
> > There may be a need for some topics to remain private from users who
> > don't have permissions on that topic. The existence of such topics (or
> > their schemas) should not be disclosed by discovery. That is,
> > unauthorized requests (probes) should return 404 (does not exist) and
> > not 401 (unauthorized) or 403 (forbidden). Please ensure this in the
> > implementation.
> >
> >
> > 2) Isn't the Schema message definition (the meta Schema) best left to
> > each particular installation? Shouldn't Pulsar define just a Schema as
> > base fields and key value pairs? I mean, I can think of many different
> > fields in addition to name, version, format, state and mods. Every time
> > someone needs to add something to the meta Schema, it will require a
> > protocol change. The list of optional fields in the current definition
> > is arbitrary. There is nothing particular about those fields that
> > requires that they be enumerated. And the optional nature of all those
> > fields indicates this very same issue: that this list of fields is very
> > subjective and will be subject to all sorts of additions and deletions.
> > Isn't that list better implemented as key value properties? What is the
> > rationale for this specific set of fields?
> >
> > 3) Use of ZooKeeper as a repo.
> >
> > I speak from experience, as I run some of the largest Pulsar clusters in
> > existence. I have a significant disagreement with using ZooKeeper as the
> > meta repo for the Schema. Even if the actual Schema is stored in
> > BookKeeper ledgers, using a ZK node for Schema is an increase in ZK
> > load, and reduces the scalability of Pulsar. ZK nodes have a significant
> > impact on Pulsar scalability limits. And I mean the impact from the very
> > existence of a ZK node, not the reads/writes on that node.
> >
> > I understand this feature is optional. But that does not solve the
> > underlying issue. Pulsar should be moving towards reducing ZK usage, not
> > increasing it. We should be working to reduce even the existing usage of
> > ZK, wherever it is possible.
> >
> > We should not be building a feature which requires a tradeoff between
> > using that feature and scalability. I would like to use this feature,
> > but as it is, it is going to reduce the working limit of my clusters by
> > 15-20%. That is definitely not a good thing.
> >
> > Joe
> >
> >
> > On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
> >
> >> +1 great to see this proposal coming out.
> >>
> >> - Sijie
> >>
> >> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
> >>
> >> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
> >> >
> >> > -------
> >> >
> >> >  * **Status**: Proposal
> >> >  * **Author**: Dave Rusek - Streamlio
> >> >  * **Pull Request**: See Below
> >> >  * **Mailing List discussion**:
> >> >
> >> >
> >> > ## Motivation
> >> >
> >> > Data flowing through a messaging system is typically untyped. Data
> >> > flows from end to end as bytes, and only the producers and consumers
> >> > are aware of the type and structure of the data. This requires systems
> >> > to coordinate out-of-band and makes it hard for other systems to
> >> > discover useful data on which they can operate. Schema registries help
> >> > to alleviate these problems by providing a centralized storage area
> >> > for structural definitions of system data. With a centralized storage
> >> > repository, systems producing data can communicate the structure of
> >> > that data to downstream systems.
> >> >
> >> > This document is a proposal to build a schema registry service tightly
> >> > integrated with Pulsar's topic hierarchy. This schema integration is
> >> > an opt-in feature and will not affect existing or future properties,
> >> > clusters, namespaces, or topics that do not choose to take advantage
> >> > of it. If, however, an administrator chooses to use this
> >> > functionality, then it will serve as a self-describing integrity check
> >> > for data in the system as well as allow integrations between Pulsar
> >> > and other systems that are able to discover and take advantage of this
> >> > type information.
> >> >
> >> > ## Design
> >> >
> >> > ### Data Model
> >> >
> >> > ```protobuf
> >> > message Schema {
> >> >     enum Format {
> >> >         AVRO = 0;
> >> >         JSON = 1;
> >> >         PROTOBUF = 2;
> >> >         THRIFT = 3;
> >> >     }
> >> >
> >> >     enum State {
> >> >         STAGED = 1;
> >> >         ACTIVE = 2;
> >> >     }
> >> >
> >> >     optional string name = 1;
> >> >     optional int32 version = 2;
> >> >     optional Format format = 3;
> >> >     optional State state = 4;
> >> >     optional string modified_user = 5;
> >> >     optional string modified_time = 6;
> >> > }
> >> > ```
> >> >
> >> > ### Storing Schema Data
> >> >
> >> > Schema data will be stored alongside message data in BookKeeper. Much
> >> > like a managed ledger, schema entries will be stored as an
> >> > append-only, ordered list of entries. Schema entries occupy a
> >> > BookKeeper ledger, and a topic with an associated schema will require
> >> > a ZooKeeper node. Topics without any associated schema data will incur
> >> > no overhead.
> >> >
> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> >
> >> > ### Serving Schema Data
> >> >
> >> > Serving schemas from the Pulsar brokers would allow us to take
> >> > advantage of the topic ownership routing logic to co-locate a schema
> >> > with its topic as well as ensure a single owner per schema ledger in
> >> > the case of the Streamlio schema registry. Such an arrangement would
> >> > serve both reads and writes through the same broker. This will require
> >> > a new admin API to expose the schema data model as a collection of
> >> > REST resources.
> >> >
> >> > ```java
> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> >> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > ```
> >> >
> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> >
> >> > # Changes
> >> >
> >> > * Implement a Schema Repository in Pulsar brokers
> >> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> > * Add Schema resources to broker admin API
> >> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> > * Extend client/server binary protocol to expose schema to client
> >> >   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
> >> >
> >>
>
