Thanks for the input, and sorry for the delay in responding. I sympathize with the concerns surrounding ZooKeeper, which matter most for large-scale installations. I have updated the PR associated with this proposal and separated out the storage used by the schema registry. The proposal still uses ZooKeeper/BookKeeper to store schemas, but only as the default implementation, which can be replaced entirely through a configuration setting. In other words, schema storage is now pluggable and could be built on any number of external systems available in a deployment. I would eventually like to incorporate an implementation based on a BookKeeper key/value contrib module once it becomes available [1].

I have also taken your suggestions regarding the message fields and security concerns and updated the PR; the schema definition is now pretty sparse and allows for easy extension by the user of the system.
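As a rough illustration of what "pluggable through a configuration setting" could look like, here is a minimal Java sketch. Every name in it (`SchemaStorage`, `InMemorySchemaStorage`, `SchemaStorageLoader`, and the idea that the setting holds a class name) is an assumption made for illustration and is not taken from the PR; the in-memory class merely stands in for a real ZooKeeper/BookKeeper-backed default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage abstraction; the real interface in the PR may differ.
interface SchemaStorage {
    // Appends a schema for the topic and returns its version (0-based).
    long put(String topic, byte[] schema);

    // Returns the schema stored at that version, or null if none exists.
    byte[] get(String topic, long version);
}

// Stand-in implementation that keeps an append-only list per topic in memory.
class InMemorySchemaStorage implements SchemaStorage {
    private final Map<String, List<byte[]>> versions = new ConcurrentHashMap<>();

    @Override
    public synchronized long put(String topic, byte[] schema) {
        List<byte[]> list = versions.computeIfAbsent(topic, t -> new ArrayList<>());
        list.add(schema.clone());
        return list.size() - 1;
    }

    @Override
    public synchronized byte[] get(String topic, long version) {
        List<byte[]> list = versions.get(topic);
        if (list == null || version < 0 || version >= list.size()) {
            return null;
        }
        return list.get((int) version).clone();
    }
}

// Instantiates whichever implementation the (assumed) configuration value names.
final class SchemaStorageLoader {
    private SchemaStorageLoader() {}

    static SchemaStorage load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(SchemaStorage.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load schema storage: " + className, e);
        }
    }
}
```

With something like this, a deployment that wants to avoid extra ZooKeeper nodes would point the setting at its own `SchemaStorage` implementation, and the broker code would only ever see the interface.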
[1] http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201802.mbox/browser

-Dave

On Tue, Feb 6, 2018 at 10:53 AM, Sahaya Andrews <andr...@apache.org> wrote:

> It's not clear from the proposal if schema is enforced against a
> namespace or individual topic has it's schema.
>
> Regarding the use of Zookeeper, I also agree with Joe. We should avoid
> using zookeeper to store such information since it limits our ability
> to scale beyond a point.
>
> We could create a namespace specific ledger to store such data.
>
> Andrews.
>
> On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:
> > The concept of a schema registry is good. I have some questions,
> > concerns and comments
> >
> > 1) Access control
> > There may be a need for some topics to remain private to uses that don't
> > have permissions on that topic. The existence of such topics, (or its
> > schema) should not be disclosed by discoveries. That is, unauthorized
> > requests (probes) should return 404 (does not exist ) and not 401.
> > (forbidden). Please ensure this in the implementation.
> >
> > 2)Isn't the Schema message definition (the meta Schema) best left to each
> > particular installation? Shouldn't Pulsar define just a Schema as base
> > fields and key value pairs? I mean, I can think of many different fields
> > in addition to name, version, format, state and mods. Every time someone
> > needs to add something to the meta Schema, it will require a protocol
> > change. The list of optional fields in the current definition is
> > arbitrary. There is nothing particular about those fields that require
> > that they be enumerated. And the optional nature of all those fields
> > indicates this very same issue - that this list of fields is very
> > subjective and will be subject to all sort of additions and deletions.
> > Isn't that list better implemented as key value properties? What is the
> > rationale for this specific set of fields?
> >
> > 3)Use of Zookeeper as a repo.
> >
> > I speak from experience, as I run some of the largest Pulsar clusters in
> > existence. I have a significant disagreement with using Zookeeper as the
> > meta repo for the Schema. Even if the actual Schema is stored in
> > Bookkeeper ledgers, using a ZK node for Schema is an increase in ZK load,
> > and reduces the scalability of Pulsar. ZK nodes have a significant impact
> > on Pulsar scalability limits. And I mean the impact from the very
> > existence of a ZK node, not the read/writes on that node.
> >
> > I understand this feature is optional. But that does not solve the
> > underlying issue. Pulsar should be moving towards reducing ZK usage, not
> > increasing it. We should be working to reduce even the existing usage of
> > ZK, wherever it is possible.
> >
> > We should not be building a feature which requires a tradeoff between
> > using that feature and scalability. I would like to use this feature,
> > but as it is, it is going to reduce the working limit of my clusters
> > 15-20%. That is definitely not a good thing
> >
> > Joe
> >
> > On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
> >
> >> +1 great to see this proposal coming out.
> >>
> >> - Sijie
> >>
> >> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
> >>
> >> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
> >> >
> >> > -------
> >> >
> >> > * **Status**: Proposal
> >> > * **Author**: Dave Rusek - Streamlio
> >> > * **Pull Request**: See Below
> >> > * **Mailing List discussion**:
> >> >
> >> > ## Motivation
> >> >
> >> > Data flowing through a messaging system is typically untyped. Data
> >> > flows from end-to-end as bytes and only the producers and consumers
> >> > are aware of the type and structure of the data. This requires systems
> >> > to coordinate out-of-band and makes it hard for other systems to
> >> > discover useful data on which they can operate.
> >> > Schema registries help to alleviate these problems by providing a
> >> > centralized storage area for structural definitions of system data. By
> >> > having a centralized storage repository systems producing data to the
> >> > system can communicate to downstream systems the structure of the data
> >> > being produced.
> >> >
> >> > This document is a proposal to build a schema registry service tightly
> >> > integrated with Pulsar's topic hierarchy. This schema integration is
> >> > an opt-in feature and will not affect existing or future properties,
> >> > clusters, namespaces, or topics that do not choose to take advantage.
> >> > If however, an administrator chooses to use this functionality then it
> >> > will serve as a self-describing integrity check for data in the system
> >> > as well as allow integrations between Pulsar and other systems that
> >> > are able to discover and take advantage of this type information
> >> >
> >> > ## Design
> >> >
> >> > ### Data Model
> >> >
> >> > ```protobuf
> >> > message Schema {
> >> >   enum Format {
> >> >     AVRO = 0;
> >> >     JSON = 1;
> >> >     PROTOBUF = 2;
> >> >     THRIFT = 3;
> >> >   }
> >> >
> >> >   enum State {
> >> >     STAGED = 1;
> >> >     ACTIVE = 2;
> >> >   }
> >> >
> >> >   optional string name = 1;
> >> >   optional int32 version = 2;
> >> >   optional Format format = 3;
> >> >   optional State state = 4;
> >> >   optional string modified_user = 5;
> >> >   optional string modified_time = 6;
> >> > }
> >> > ```
> >> >
> >> > ### Storing Schema Data
> >> >
> >> > Schema data will be stored alongside message data in BookKeeper. Much
> >> > like a managed ledger schema entries will be stored as an append only,
> >> > ordered, list of entries. Schema entries occupy a BookKeeper Ledger
> >> > and a topic with an associated schema will require a zookeeper node.
> >> > Topics without any associated schema data will incur no overhead.
> >> >
> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> >
> >> > ### Serving Schema Data
> >> >
> >> > Serving schemas from the pulsar brokers would allow us to take
> >> > advantage of the topic ownership routing logic to co-locate a schema
> >> > with it’s topic as well as ensure a single owner per schema ledger in
> >> > the case of the streamlio schema registry. Such an arrangement would
> >> > serve both read and writes through the same broker. This will require
> >> > a new admin API to expose the schema data model as a collection of
> >> > REST resources.
> >> >
> >> > ```java
> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
> >> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
> >> > ```
> >> >
> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> >
> >> > # Changes
> >> >
> >> > * Implement a Schema Repository in Pulsar brokers [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
> >> > * Add Schema resouces to broker admin API [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
> >> > * Extend client/server binary protocol to expose schema to client [PR](https://github.com/apache/incubator-pulsar/pull/1112)