I think we are on the same page

On May 23, 2018 at 14:00:25, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

There is certainly a lot of value in the idea of tagging the data with a
config version of some sort for traceability. This would probably be tracked
per topology the data passes through, which gives us detailed lineage. Maybe
something like the NiFi provenance approach plus a link to a lineage store
like Atlas would make sense (in our case that’s simpler than the NiFi use
case, of course, since we have a fixed set of topologies).

My other use for schema versions is preserving backward compatibility in
stores that need to think harder about schema evolution, such as columnar
formats in HDFS (ORC or Parquet, for example), so I think we need some means
of storing and retrieving schema versions.

I’m proposing that the versions be created on the basis of config changes.
So the process would be: a config change triggers schema inference, which
triggers a diff against the old schema, which optionally triggers a net-new
version.
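
As a very rough sketch of that loop (all of the class and field names here
are made up for illustration, not existing Metron code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // Hypothetical sketch: only cut a net-new schema version when a config
    // change actually alters the inferred field -> type map for a sensor.
    public class SchemaVersioner {

        // latest inferred schema per sensor (field name -> type name)
        private final Map<String, Map<String, String>> latest = new HashMap<>();
        // monotonically increasing schema version per sensor
        private final Map<String, Integer> versions = new HashMap<>();

        /**
         * Called on a config change; 'inferred' is the schema produced by
         * running inference over the new configuration. Returns the version
         * that data produced under this config should be tagged with.
         */
        public int onConfigChange(String sensor, Map<String, String> inferred) {
            Map<String, String> previous = latest.get(sensor);
            if (!Objects.equals(previous, inferred)) {
                latest.put(sensor, new HashMap<>(inferred));
                versions.merge(sensor, 1, Integer::sum);  // net-new version
            }
            return versions.getOrDefault(sensor, 0);
        }
    }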

Does that make sense?

Simon

On 22 May 2018, at 19:33, Otto Fowler <ottobackwa...@gmail.com> wrote:

I’ve also talked with J. Zeolla about conceptually storing data in HDFS
relative to the version of the schema that produced it, but that may not
matter….

So Simon, do you mean that as part of taking a configuration change (either
at startup or live while running) we ‘update’ the metadata/schema, or
re-evaluate and then save/version it?
Maybe the data should have a field for the config/schema version that it was
generated with….
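
Something like this, purely as illustration (the field name and helper are
made up, not an agreed convention):

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: stamp each message with the schema/config version
    // that produced it, so stores (HDFS, ES, Solr) can tie the data back to
    // a specific schema version later.
    public class VersionTagger {
        public static Map<String, Object> tag(Map<String, Object> message,
                                              String sensor, int schemaVersion) {
            Map<String, Object> tagged = new HashMap<>(message);
            tagged.put("source.type", sensor);
            tagged.put("schema.version", schemaVersion);  // hypothetical field name
            return tagged;
        }
    }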

On May 22, 2018 at 13:56:23, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

Absolutely. I would agree with that as an approach.

I would also suggest we discuss where schemas and versions should be
stored. Atlas? The NiFi schema repo abstraction (which limits us to Avro for
expressing schemas)?

What I would like to see is a change to the parser interfaces so that they
emit field types (and likewise for the enrichment stages), and then detect
changes from that.
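
For instance, something along these lines (purely a sketch; no such method
exists on the current interfaces):

    import java.util.Map;

    // Sketch only: a parser or enrichment declares the fields it can emit and
    // their types, so the platform can diff and compose schemas from the parts.
    public interface SchemaContributor {

        enum FieldType { STRING, LONG, DOUBLE, BOOLEAN, TIMESTAMP, IP }

        /** Field name -> type for every field this component may contribute. */
        Map<String, FieldType> contributedFields();
    }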

The other issue to consider is forward and backward compatibility across
versions. For example, if we want to output an ORC schema (I really think we
should, because the current JSON-on-HDFS format is huge and slow), we need
to consider the schema output history, since ORC allows schema evolution to
an extent (adding fields) but not beyond that (removing or reordering
fields). This can be resolved by sensible versioning and history-aware
schema generation.
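
To make the ORC constraint concrete, a small sketch using the ORC
TypeDescription API (the field names are just examples):

    import org.apache.orc.TypeDescription;

    // ORC tolerates appending new columns to a struct, but removing or
    // reordering existing columns breaks older files, so schema generation
    // has to know the previous versions and only ever add at the end.
    public class OrcSchemaEvolution {
        public static void main(String[] args) {
            TypeDescription v1 = TypeDescription.createStruct()
                .addField("ip_src_addr", TypeDescription.createString())
                .addField("timestamp", TypeDescription.createLong());

            // v2 stays readable against v1 files: it only appends a field.
            TypeDescription v2 = TypeDescription.createStruct()
                .addField("ip_src_addr", TypeDescription.createString())
                .addField("timestamp", TypeDescription.createLong())
                .addField("geo_country", TypeDescription.createString());

            System.out.println(v1);  // struct<ip_src_addr:string,timestamp:bigint>
            System.out.println(v2);
        }
    }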

Simon


On 22 May 2018 at 15:23, Otto Fowler <ottobackwa...@gmail.com> wrote:

> Yes Simon, when I say ‘whatever we would call the complete parse/enrich
> path’, that is what I was referring to.
>
> I would think the flow would be:
>
> Save or deploy sensor configurations
> -> check if there is a difference in the configurations from last to new
> version
> -> if there is a difference that affects the ‘schema’ in any configuration
> -> build master schema from configurations
> -> version, store, deploy
>
> or something along those lines. I’m sure there are considerations around a
> clean-slate deploy vs. a new-version deploy.
>
> On May 22, 2018 at 09:59:06, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> What I would really like to see is not a full end-to-end schema, but units
> that contribute schema. I don't want to see a parser, enrichment, and
> indexing config as one package, because in any given deployment for any
> given sensor I may have a different set of enrichments, and so need a
> different output template.
>
> What I would propose is that parsers and enrichments contribute partial
> schemas (potentially expressed as Avro, but the important thing is just a
> map of fields to types), which can then be composed, with the Metron
> platform handling creation of ES templates / Solr schemas / Hive HCatalog
> schemas / any other index's schema metadata as the composite of those
> pieces. So, a parser would contribute a set of fields, the
> fieldTransformations on the sensor would contribute some fields, and each
> enrichment block would contribute some fields, at which point we have
> enough schema definition to generate all the required artefacts for
> whatever storage it ends up in.
>
> Essentially, composable partial schema units from each component, which add
> up at the end.
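>
> As a rough illustration of that composition step (class and method names
> are made up):
>
>     import java.util.LinkedHashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     // Sketch: merge the partial field -> type maps from the parser, the
>     // fieldTransformations and each enrichment into one composite schema,
>     // failing loudly if two components disagree on a field's type.
>     public class SchemaComposer {
>         public static Map<String, String> compose(List<Map<String, String>> parts) {
>             Map<String, String> composite = new LinkedHashMap<>();
>             for (Map<String, String> part : parts) {
>                 for (Map.Entry<String, String> e : part.entrySet()) {
>                     String existing = composite.putIfAbsent(e.getKey(), e.getValue());
>                     if (existing != null && !existing.equals(e.getValue())) {
>                         throw new IllegalStateException("Type conflict on " + e.getKey());
>                     }
>                 }
>             }
>             // feed this composite to the ES template / Solr / HCatalog generators
>             return composite;
>         }
>     }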
>
> Does that make sense?
>
> Simon
>
>
> On 22 May 2018 at 14:10, Otto Fowler <ottobackwa...@gmail.com> wrote:
>
> > We have discussed in the past, as part of 777 (moment of silence….), the
> > idea that parsers/sensors (or whatever we would call the complete
> > parse/enrich path) could define their ES or Solr schemas so that they can
> > be ‘installed’ as part of Metron, removing the requirement for a separate
> > install, by the system or by the user, of a specific index template or
> > equivalent.
> >
> > NiFi has settled on Avro schemas to describe their ‘record’-based data,
> > and it makes me wonder whether we might want to think about using Avro as
> > a universal schema, or the base for one, such that we can define a schema
> > and apply it to either ES or Solr.
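> >
> > For example, something like this (just an illustration using Avro's
> > SchemaBuilder; the sensor and field names are made up):
> >
> >     import org.apache.avro.Schema;
> >     import org.apache.avro.SchemaBuilder;
> >
> >     // Illustration: one Avro record per sensor, which the platform could
> >     // then translate into an ES index template or a Solr managed schema.
> >     public class SensorAvroSchema {
> >         public static void main(String[] args) {
> >             Schema schema = SchemaBuilder.record("bro")
> >                 .fields()
> >                 .requiredString("ip_src_addr")
> >                 .requiredLong("timestamp")
> >                 .optionalString("geo_country")
> >                 .endRecord();
> >             System.out.println(schema.toString(true));  // pretty-printed JSON
> >         }
> >     }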
> >
> > Thoughts?
> >
>
>
>
> --
> --
> simon elliston ball
> @sireb
>
>


--
--
simon elliston ball
@sireb
