Re: [DISCUSS] PIP 14: Pulsar Schema Registry

David Rusek Wed, 14 Feb 2018 08:16:14 -0800

That was the original proposal, but we deemed it too expensive for the very
reason you state. With the current PR it is still possible, though: you would
only need to implement a new SchemaStorage backend.
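
A minimal sketch of what such a backend could look like. The interface name
SchemaStorage comes from the PR, but the three method signatures here are
assumptions made for illustration, and an in-memory list stands in for the
per-topic schema topic (latest entry is the current schema, the full list is
the history):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical shape of the pluggable storage interface; the real interface
// in the PR may differ.
interface SchemaStorage {
    /** Appends a new schema version for the topic and returns its 1-based version. */
    int put(String topic, byte[] schema);

    /** Returns the most recently appended schema for the topic, if any. */
    Optional<byte[]> getLatest(String topic);

    /** Returns the schema stored at the given 1-based version, if any. */
    Optional<byte[]> get(String topic, int version);
}

// In-memory stand-in for a "schema topic": an append-only, ordered log per topic.
class LogBackedSchemaStorage implements SchemaStorage {
    private final Map<String, List<byte[]>> logs = new HashMap<>();

    @Override
    public synchronized int put(String topic, byte[] schema) {
        List<byte[]> log = logs.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(schema.clone());   // entries are only ever appended
        return log.size();         // position in the log doubles as the version
    }

    @Override
    public synchronized Optional<byte[]> getLatest(String topic) {
        List<byte[]> log = logs.get(topic);
        if (log == null || log.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(log.get(log.size() - 1).clone());
    }

    @Override
    public synchronized Optional<byte[]> get(String topic, int version) {
        List<byte[]> log = logs.get(topic);
        if (log == null || version < 1 || version > log.size()) {
            return Optional.empty();
        }
        return Optional.of(log.get(version - 1).clone());
    }
}
```

A topic-backed implementation would replace the list with reads and writes
against the schema topic while keeping the same three operations.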


-Dave

On Wed, Feb 14, 2018 at 8:04 AM, Ivan Kelly <iv...@apache.org> wrote:

> Couldn't the schema for a topic be stored in another topic with
> infinite retention?
>
> So if we have a topic /my-prop/my-namespace/topic-foo, we could have a
> topic /my-prop/my-namespace/topic-foo/schema which contains the
> different versions of the schema for the original? To get the schema,
> you just read the latest message from the topic. For older versions of
> the schema, you can read the whole topic.
>
> This could sidestep the storage concerns, though if you have a schema
> for every topic, you end up doubling the load on zookeeper.
> Alternatively, you don't even need the schema to be 1-1 with the
> topic.
>
> -Ivan
>
> On Mon, Feb 12, 2018 at 6:06 PM, David Rusek <d...@streaml.io> wrote:
> > Thanks for the input!
> >
> > I'm sorry for the delay in responding. I sympathize with the concerns
> > surrounding Zookeeper, which matter especially for large-scale
> > installations. I have updated the PR associated with this proposal and
> > separated out the storage used by the schema registry. The proposal still
> > uses zookeeper/bookkeeper to store schemas, but only as a default
> > implementation that is completely replaceable through a configuration
> > setting. Again, this is now all pluggable and could very well be built
> > using any number of external systems available in a deployment. I would
> > eventually like to incorporate an implementation based on a bookkeeper
> > key/value contrib module once it becomes available[1]. I have taken your
> > suggestions regarding the message fields and security concerns and updated
> > the PR; the schema definition is now pretty sparse and allows for easy
> > extension by the user of the system.
> >
> > [1]
> > http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201802.mbox/browser
> >
> > -Dave
> >
> > On Tue, Feb 6, 2018 at 10:53 AM, Sahaya Andrews <andr...@apache.org> wrote:
> >
> >> It's not clear from the proposal whether a schema is enforced against a
> >> namespace or each individual topic has its own schema.
> >>
> >> Regarding the use of Zookeeper, I also agree with Joe. We should avoid
> >> using zookeeper to store such information since it limits our ability
> >> to scale beyond a point.
> >>
> >> We could create a namespace specific ledger to store such data.
> >>
> >> Andrews.
> >>
> >> On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:
> >> > The concept of a schema registry is good. I have some questions,
> >> > concerns, and comments.
> >> >
> >> > 1) Access control
> >> > There may be a need for some topics to remain private from users that
> >> > don't have permissions on that topic. The existence of such topics (or
> >> > their schemas) should not be disclosed through discovery. That is,
> >> > unauthorized requests (probes) should return 404 (does not exist) and
> >> > not 401 (unauthorized) or 403 (forbidden). Please ensure this in the
> >> > implementation.
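
The rule in point 1 can be pictured as a lookup that collapses "no such
topic" and "caller not permitted" into the same answer, so a probe cannot tell
them apart. This is only an illustration: the class name, the in-memory
permission map, and the status constants are assumptions and are not part of
the proposal or of Pulsar's admin API.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical lookup: maps each topic that has a schema to the set of
// principals allowed to read it.
class SchemaLookup {
    static final int OK = 200;
    static final int NOT_FOUND = 404;

    private final Map<String, Set<String>> readersByTopic;

    SchemaLookup(Map<String, Set<String>> readersByTopic) {
        this.readersByTopic = readersByTopic;
    }

    /**
     * Returns the HTTP status for a schema request. A missing topic and an
     * unauthorized caller both yield 404, so the response does not reveal
     * whether the topic exists.
     */
    int statusFor(String caller, String topic) {
        Set<String> readers = readersByTopic.get(topic);
        if (readers == null || !readers.contains(caller)) {
            return NOT_FOUND;
        }
        return OK;
    }
}
```

A real broker would also need the two cases to be indistinguishable in
response bodies and timing, which this sketch does not address.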
> >> >
> >> >
> >> > 2) Isn't the Schema message definition (the meta Schema) best left to
> >> > each particular installation? Shouldn't Pulsar define a Schema as just
> >> > base fields and key/value pairs? I mean, I can think of many different
> >> > fields in addition to name, version, format, state and mods. Every time
> >> > someone needs to add something to the meta Schema, it will require a
> >> > protocol change. The list of optional fields in the current definition
> >> > is arbitrary. There is nothing particular about those fields that
> >> > requires that they be enumerated. And the optional nature of all those
> >> > fields indicates this very same issue: this list of fields is very
> >> > subjective and will be subject to all sorts of additions and deletions.
> >> > Isn't that list better implemented as key/value properties? What is the
> >> > rationale for this specific set of fields?
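
One way to picture the suggestion in point 2 is a schema record with a small
fixed core plus free-form properties, so new metadata does not need a protocol
change. The class and field names below are hypothetical and are not the
definition used in the PR.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record: a few base fields plus open-ended key/value properties.
final class SchemaInfo {
    private final String name;
    private final int version;
    private final byte[] definition;
    private final Map<String, String> properties;

    SchemaInfo(String name, int version, byte[] definition, Map<String, String> properties) {
        this.name = name;
        this.version = version;
        this.definition = definition.clone();
        // Defensive copy so later changes to the caller's map do not leak in.
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }

    String name() { return name; }

    int version() { return version; }

    byte[] definition() { return definition.clone(); }

    /** Looks up an installation-specific property such as "format" or "owner". */
    String property(String key, String defaultValue) {
        return properties.getOrDefault(key, defaultValue);
    }
}
```

Fields such as format or state would then live in the properties map, and each
installation could add its own keys.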
> >> >
> >> > 3) Use of Zookeeper as a repo
> >> >
> >> > I speak from experience, as I run some of the largest Pulsar clusters in
> >> > existence. I have a significant disagreement with using Zookeeper as the
> >> > meta repo for the Schema. Even if the actual Schema is stored in
> >> > Bookkeeper ledgers, using a ZK node for Schema is an increase in ZK load,
> >> > and reduces the scalability of Pulsar. ZK nodes have a significant impact
> >> > on Pulsar scalability limits. And I mean the impact from the very
> >> > existence of a ZK node, not the reads/writes on that node.
> >> >
> >> > I understand this feature is optional. But that does not solve the
> >> > underlying issue. Pulsar should be moving towards reducing ZK usage, not
> >> > increasing it. We should be working to reduce even the existing usage of
> >> > ZK, wherever it is possible.
> >> >
> >> > We should not be building a feature which requires a tradeoff between
> >> > using that feature and scalability. I would like to use this feature,
> >> > but as it is, it is going to reduce the working limit of my clusters by
> >> > 15-20%. That is definitely not a good thing.
> >> >
> >> > Joe
> >> >
> >> >
> >> > On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
> >> >
> >> >> +1 great to see this proposal coming out.
> >> >>
> >> >> - Sijie
> >> >>
> >> >> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
> >> >>
> >> >> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
> >> >> >
> >> >> > -------
> >> >> >
> >> >> >  * **Status**: Proposal
> >> >> >  * **Author**: Dave Rusek - Streamlio
> >> >> >  * **Pull Request**: See Below
> >> >> >  * **Mailing List discussion**:
> >> >> >
> >> >> >
> >> >> > ## Motivation
> >> >> >
> >> >> > Data flowing through a messaging system is typically untyped. Data
> >> >> > flows from end to end as bytes, and only the producers and consumers
> >> >> > are aware of the type and structure of the data. This requires
> >> >> > systems to coordinate out-of-band and makes it hard for other systems
> >> >> > to discover useful data on which they can operate. Schema registries
> >> >> > help to alleviate these problems by providing a centralized storage
> >> >> > area for structural definitions of system data. With a centralized
> >> >> > storage repository, systems producing data can communicate to
> >> >> > downstream systems the structure of the data being produced.
> >> >> >
> >> >> > This document is a proposal to build a schema registry service tightly
> >> >> > integrated with Pulsar's topic hierarchy. This schema integration is
> >> >> > an opt-in feature and will not affect existing or future properties,
> >> >> > clusters, namespaces, or topics that do not choose to take advantage
> >> >> > of it. If, however, an administrator chooses to use this
> >> >> > functionality, then it will serve as a self-describing integrity
> >> >> > check for data in the system as well as allow integrations between
> >> >> > Pulsar and other systems that are able to discover and take advantage
> >> >> > of this type information.
> >> >> >
> >> >> > ## Design
> >> >> >
> >> >> > ### Data Model
> >> >> >
> >> >> > ```protobuf
> >> >> > message Schema {
> >> >> >     enum Format {
> >> >> >         AVRO = 0;
> >> >> >         JSON = 1;
> >> >> >         PROTOBUF = 2;
> >> >> >         THRIFT = 3;
> >> >> >     }
> >> >> >
> >> >> >     enum State {
> >> >> >         STAGED = 1;
> >> >> >         ACTIVE = 2;
> >> >> >     }
> >> >> >
> >> >> >     optional string name = 1;
> >> >> >     optional int32 version = 2;
> >> >> >     optional Format format = 3;
> >> >> >     optional State state = 4;
> >> >> >     optional string modified_user = 5;
> >> >> >     optional string modified_time = 6;
> >> >> > }
> >> >> > ```
> >> >> >
> >> >> > ### Storing Schema Data
> >> >> >
> >> >> > Schema data will be stored alongside message data in BookKeeper. Much
> >> >> > like a managed ledger, schema entries will be stored as an
> >> >> > append-only, ordered list of entries. Schema entries occupy a
> >> >> > BookKeeper ledger, and a topic with an associated schema will require
> >> >> > a zookeeper node. Topics without any associated schema data will
> >> >> > incur no overhead.
> >> >> >
> >> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> >> >
> >> >> > ### Serving Schema Data
> >> >> >
> >> >> > Serving schemas from the Pulsar brokers would allow us to take
> >> >> > advantage of the topic ownership routing logic to co-locate a schema
> >> >> > with its topic as well as ensure a single owner per schema ledger in
> >> >> > the case of the streamlio schema registry. Such an arrangement would
> >> >> > serve both reads and writes through the same broker. This will
> >> >> > require a new admin API to expose the schema data model as a
> >> >> > collection of REST resources.
> >> >> >
> >> >> > ```java
> >> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> >> >> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> >> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> >> > ```
> >> >> >
> >> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> >> >
> >> >> > ## Changes
> >> >> >
> >> >> > * Implement a Schema Repository in Pulsar brokers [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> >> > * Add Schema resources to broker admin API [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> >> > * Extend client/server binary protocol to expose schema to client [PR](https://github.com/apache/incubator-pulsar/pull/1112)
> >> >> >
> >> >>
> >>
>
