Couldn't the schema for a topic be stored in another topic with
infinite retention?

So if we have a topic /my-prop/my-namespace/topic-foo, we could have a
topic /my-prop/my-namespace/topic-foo/schema which contains the
different versions of the schema for the original? To get the schema,
you just read the latest message from the topic. For older versions of
the schema, you can read the whole topic.
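
Roughly, I'm picturing something like the sketch below. It is only an
illustration: the class and method names are made up (this is not a Pulsar
API), and an in-memory list stands in for the schema topic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for a schema topic such as
// /my-prop/my-namespace/topic-foo/schema. Not a Pulsar API.
public class SchemaTopic {
    // Each element models one message on the schema topic, oldest first.
    private final List<String> versions = new ArrayList<>();

    // Publishing a new schema version appends a message to the topic.
    // Returns the 1-based version number of the new schema.
    public int publish(String schemaDefinition) {
        versions.add(schemaDefinition);
        return versions.size();
    }

    // The current schema is simply the latest message on the topic.
    public Optional<String> latest() {
        return versions.isEmpty()
                ? Optional.empty()
                : Optional.of(versions.get(versions.size() - 1));
    }

    // Older versions are found by reading the whole topic from the start.
    public List<String> history() {
        return Collections.unmodifiableList(new ArrayList<>(versions));
    }

    public static void main(String[] args) {
        SchemaTopic topic = new SchemaTopic();
        topic.publish("schema-v1");
        topic.publish("schema-v2");
        System.out.println("current: " + topic.latest().orElse("none"));
        System.out.println("versions stored: " + topic.history().size());
    }
}
```

With infinite retention nothing is ever trimmed from the list, so every
version stays readable.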

This could sidestep the storage concerns, though if you have a schema
for every topic, you end up doubling the load on zookeeper.
Alternatively, you don't even need the schema to be 1-1 with the
topic.

-Ivan

On Mon, Feb 12, 2018 at 6:06 PM, David Rusek <d...@streaml.io> wrote:
> Thanks for the input!
>
> I'm sorry for the delay in responding. I sympathize with the concerns
> surrounding Zookeeper. This is especially a concern for large scale
> installations. I have updated the PR associated with this proposal and
> separated out the storage used by the schema registry. The proposal still
> uses zookeeper/bookkeeper to store schemas but only as a default
> implementation and is completely replaceable through a configuration
> setting. Again, this is now all pluggable and could very well be built
> using any number of external systems available in a deployment. I would
> eventually like to incorporate an implementation based on a bookkeeper
> key/value contrib module once it becomes available[1]. I have taken your
> suggestions regarding the message fields and security concerns and updated
> the PR; the schema definition is now pretty sparse and allows for easy
> extension by the user of the system.
>
> [1]
> http://mail-archives.apache.org/mod_mbox/bookkeeper-dev/201802.mbox/browser
>
> -Dave
>
> On Tue, Feb 6, 2018 at 10:53 AM, Sahaya Andrews <andr...@apache.org> wrote:
>
>> It's not clear from the proposal whether a schema is enforced against a
>> whole namespace or whether each individual topic has its own schema.
>>
>> Regarding the use of Zookeeper, I also agree with Joe. We should avoid
>> using zookeeper to store such information since it limits our ability
>> to scale beyond a point.
>>
>> We could create a namespace specific ledger to store such data.
>>
>> Andrews.
>>
>> On Tue, Feb 6, 2018 at 5:29 AM, Joe F <j...@apache.org> wrote:
>> > The concept of a schema registry is good. I have some questions, concerns,
>> > and comments.
>> >
>> > 1) Access control
>> > There may be a need for some topics to remain private to users that don't
>> > have permissions on that topic. The existence of such topics (or their
>> > schemas) should not be disclosed by discovery. That is, unauthorized
>> > requests (probes) should return 404 (does not exist) and not 401
>> > (unauthorized) or 403 (forbidden). Please ensure this in the implementation.
>> >
>> >
>> > 2) Isn't the Schema message definition (the meta Schema) best left to each
>> > particular installation? Shouldn't Pulsar define a Schema as just base
>> > fields and key-value pairs? I mean, I can think of many different fields in
>> > addition to name, version, format, state and mods. Every time someone needs
>> > to add something to the meta Schema, it will require a protocol change.
>> > The list of optional fields in the current definition is arbitrary. There
>> > is nothing particular about those fields that requires that they be
>> > enumerated. And the optional nature of all those fields indicates this very
>> > same issue: this list of fields is very subjective and will be
>> > subject to all sorts of additions and deletions. Isn't that list better
>> > implemented as key-value properties? What is the rationale for this
>> > specific set of fields?
>> >
>> > 3) Use of Zookeeper as a repo
>> >
>> > I speak from experience, as I run some of the largest Pulsar clusters in
>> > existence. I have a significant disagreement with using Zookeeper as the
>> > meta repo for the Schema. Even if the actual Schema is stored in Bookkeeper
>> > ledgers, using a ZK node for Schema is an increase in ZK load, and reduces
>> > the scalability of Pulsar. ZK nodes have a significant impact on Pulsar
>> > scalability limits. And I mean the impact from the very existence of a ZK
>> > node, not the reads/writes on that node.
>> >
>> > I understand this feature is optional. But that does not solve the
>> > underlying issue. Pulsar should be moving towards reducing ZK usage, not
>> > increasing it. We should be working to reduce even the existing usage of
>> > ZK, wherever it is possible.
>> >
>> > We should not be building a feature which requires a tradeoff between
>> > using that feature and scalability. I would like to use this feature,
>> > but as it is, it is going to reduce the working limit of my clusters by
>> > 15-20%. That is definitely not a good thing.
>> >
>> > Joe
>> >
>> >
>> > On Mon, Feb 5, 2018 at 11:04 AM, Sijie Guo <guosi...@gmail.com> wrote:
>> >
>> >> +1 great to see this proposal coming out.
>> >>
>> >> - Sijie
>> >>
>> >> On Fri, Jan 26, 2018 at 12:57 PM, David Rusek <d...@streaml.io> wrote:
>> >>
>> >> > https://gist.github.com/mgodave/b265250a685f3574166ae617462ea4f9
>> >> >
>> >> > -------
>> >> >
>> >> >  * **Status**: Proposal
>> >> >  * **Author**: Dave Rusek - Streamlio
>> >> >  * **Pull Request**: See Below
>> >> >  * **Mailing List discussion**:
>> >> >
>> >> >
>> >> > ## Motivation
>> >> >
>> >> > Data flowing through a messaging system is typically untyped. Data flows
>> >> > end to end as bytes, and only the producers and consumers are aware of
>> >> > the type and structure of the data. This requires systems to coordinate
>> >> > out-of-band and makes it hard for other systems to discover useful data on
>> >> > which they can operate. Schema registries help to alleviate these problems
>> >> > by providing a centralized storage area for structural definitions of
>> >> > system data. By having a centralized storage repository, systems producing
>> >> > data to the system can communicate to downstream systems the structure of
>> >> > the data being produced.
>> >> >
>> >> > This document is a proposal to build a schema registry service tightly
>> >> > integrated with Pulsar's topic hierarchy. This schema integration is an
>> >> > opt-in feature and will not affect existing or future properties, clusters,
>> >> > namespaces, or topics that do not choose to take advantage. If, however, an
>> >> > administrator chooses to use this functionality, then it will serve as a
>> >> > self-describing integrity check for data in the system as well as allow
>> >> > integrations between Pulsar and other systems that are able to discover and
>> >> > take advantage of this type information.
>> >> > ## Design
>> >> >
>> >> > ### Data Model
>> >> >
>> >> > ```protobuf
>> >> > message Schema {
>> >> >     enum Format {
>> >> >         AVRO = 0;
>> >> >         JSON = 1;
>> >> >         PROTOBUF = 2;
>> >> >         THRIFT = 3;
>> >> >     }
>> >> >
>> >> >     enum State {
>> >> >         STAGED = 1;
>> >> >         ACTIVE = 2;
>> >> >     }
>> >> >
>> >> >     optional string name = 1;
>> >> >     optional int32 version = 2;
>> >> >     optional Format format = 3;
>> >> >     optional State state = 4;
>> >> >     optional string modified_user = 5;
>> >> >     optional string modified_time = 6;
>> >> > }
>> >> > ```
>> >> >
>> >> > ### Storing Schema Data
>> >> >
>> >> > Schema data will be stored alongside message data in BookKeeper. Much like
>> >> > a managed ledger, schema entries will be stored as an append-only, ordered
>> >> > list of entries. Schema entries occupy a BookKeeper ledger, and a topic with
>> >> > an associated schema will require a zookeeper node. Topics without any
>> >> > associated schema data will incur no overhead.
>> >> >
>> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> >> >
>> >> > ### Serving Schema Data
>> >> >
>> >> > Serving schemas from the Pulsar brokers would allow us to take advantage of
>> >> > the topic ownership routing logic to co-locate a schema with its topic as
>> >> > well as ensure a single owner per schema ledger in the case of the streamlio
>> >> > schema registry. Such an arrangement would serve both reads and writes
>> >> > through the same broker. This will require a new admin API to expose the
>> >> > schema data model as a collection of REST resources.
>> >> >
>> >> > ```java
>> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> >> > @GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
>> >> > @DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> >> > @POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
>> >> > ```
>> >> >
>> >> > [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> >> >
>> >> > ## Changes
>> >> >
>> >> > * Implement a Schema Repository in Pulsar brokers
>> >> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/1)
>> >> > * Add Schema resources to broker admin API
>> >> >   [Staged PR](https://github.com/mgodave/incubator-pulsar/pull/2)
>> >> > * Extend client/server binary protocol to expose schema to client
>> >> >   [PR](https://github.com/apache/incubator-pulsar/pull/1112)
>> >> >
>> >>
>>
