Hi Emanuel,

> This may look like a simple use case, but it is very hard to implement.
> Please do surprise me with the sequence of processors needed to implement
> what I think *is a great real world example of data quality*

I think this is the root of the problem. Personally, I wouldn't
characterize anything that relies on schema inference over a schema-first
design as a "great real world example of data quality," because getting
real data quality takes a lot of hard data engineering work in an
enterprise environment. As the saying goes, there ain't no such thing as a
free lunch.

Now, if you want to generate schemas in a robust way, here's one way I know
tends to yield good results:

https://github.com/FasterXML/jackson-dataformats-binary/tree/master/avro

It will take a POJO and generate an Avro schema from it, and since Java is
a fairly strongly typed language, you just need to massage things a little
with some annotations to handle certain nuances. For example, I think
javax.validation.Nullable will automatically make a field nullable.
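To make that concrete, here's roughly what the generation step looks like
(the Customer POJO is just a made-up example):

import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import com.fasterxml.jackson.dataformat.avro.schema.AvroSchemaGenerator;

public class SchemaFromPojo {

    // Hypothetical POJO standing in for whatever record type you define.
    public static class Customer {
        public String firstName;
        public String lastName;
        public long accountId;
    }

    public static void main(String[] args) throws Exception {
        AvroMapper mapper = new AvroMapper();
        AvroSchemaGenerator gen = new AvroSchemaGenerator();
        // Walk the POJO's structure and build the matching Avro schema.
        mapper.acceptJsonFormatVisitor(Customer.class, gen);
        AvroSchema generated = gen.getGeneratedSchema();
        // Print the schema as pretty JSON, ready to check into source control.
        System.out.println(generated.getAvroSchema().toString(true));
    }
}

Run that and you get a record schema with firstName/lastName as strings
and accountId as a long, with no inference guesswork involved.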

On Mon, Feb 3, 2020 at 1:59 PM Emanuel Oliveira <[email protected]> wrote:

> Hi Mike,
>
> Let me summarize, as I see my long post is not conveying the clean, easy
> message I intended:
> *processor InferAvroSchema*:
> - should retrieve types by analysing the data in the CSV. The property
> "Input Content Type" lists CSV and JSON, but in reality the property
> "Number Of Records To Analyze" only works with JSON; with CSV all types
> come out as strings. It is not hard to detect whether a field only
> contains digits or alphanumerics; only timestamps might need an extra
> property to help with the format (or, out of the box, just detect
> timestamps as well - not hard; see the sketch just below).
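For what it's worth, that per-column check really is simple. A minimal
sketch of the idea (standalone Java, not NiFi code; the record sampling is
left out):

import java.util.List;

public class ColumnTypeSniffer {

    // Decide a column's type from sampled values: call it "long" only if
    // every non-empty sample is purely digits, otherwise fall back to
    // "string".
    static String inferType(List<String> samples) {
        boolean allLong = samples.stream()
            .filter(s -> s != null && !s.isEmpty())
            .allMatch(s -> s.matches("-?\\d+"));
        return allLong ? "long" : "string";
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "")));     // long
        System.out.println(inferType(List.of("1", "true", "2"))); // string
    }
}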
>
> *Mandatory subset of fields verification:*
> ValidateRecord allows 3 optional schema properties (outside the reader
> and writer) to supply an Avro schema to validate a mandatory subset of
> fields - but - ConvertRecord doesn't allow this.
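For anyone following along, the kind of reduced schema you would hand to
those extra properties is just an ordinary Avro schema listing only the
required fields - something like this (field names invented, parsed here
with the plain Avro library only to show it is a normal schema):

import org.apache.avro.Schema;

public class MandatorySubsetSchema {

    public static void main(String[] args) {
        // Only the fields that must be present; names are hypothetical.
        String mandatory =
            "{ \"type\": \"record\", \"name\": \"MandatoryFields\","
          + "  \"fields\": ["
          + "    { \"name\": \"customer_id\", \"type\": \"string\" },"
          + "    { \"name\": \"created_at\",  \"type\": \"string\" }"
          + "  ] }";
        Schema schema = new Schema.Parser().parse(mandatory);
        System.out.println(schema.toString(true));
    }
}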
>
>
> Finally, I would like to request your suggestion for the following use
> case (the same one we struggled with):
> - given 1 CSV with a header line listing 100 fields, we want to:
> --- validate mandatory fields (just 1 or 2 fields).
> --- automatically create an Avro schema based on the data lines.
> --- export Avro like this:
> ------ some fields obfuscated + the remaining fields not obfuscated (or
> the other way around: some fields not obfuscated + the remaining fields
> obfuscated). And of course the header line stays in line with the final
> field order.
>
> This may look like a simple use case, but it is very hard to implement.
> Please do surprise me with the sequence of processors needed to implement
> what I think is a great real world example of data quality (mandatory
> fields + partial obfuscation + export in a different format with just a
> subset of the fields, some obfuscated and others not).
>
> Thanks, and I hope this is more clear; I'm sure this will help more dev
> teams.
>
> Cheers,
> Emanuel
>
>
>
>
> On Mon 3 Feb 2020, 13:50 Mike Thomsen, <[email protected]> wrote:
>
> > One thing I should mention is that schema inference is simply not
> > capable of exploiting Avro's field aliasing. That's an incredibly
> > powerful feature that allows you to reconcile data sets without writing
> > a single line of code. For example, I wrote a schema last year that
> > uses aliases to reconcile 9 different CSV data sets into a common model
> > without writing one line of code. This is all it takes:
> >
> > {
> >   "name": "first_name",
> >   "type": "string",
> >   "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > }
> >
> > That one line of aliases just reconciled 5 differently named source
> > fields into a single field in the common model.
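If it helps, this is roughly how that resolution plays out when reading
Avro data with the plain Avro library (file names here are invented):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AliasResolution {

    public static void main(String[] args) throws IOException {
        // The canonical model whose fields carry the aliases.
        Schema readerSchema =
            new Schema.Parser().parse(new File("canonical.avsc"));
        // Passing a reader schema makes Avro resolve the file's writer
        // schema against it, applying aliases along the way: a field the
        // writer called "FirstName" or "fname" lands in "first_name".
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(null, readerSchema);
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("incoming.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record.get("first_name"));
            }
        }
    }
}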
> >
> > On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <[email protected]>
> > wrote:
> >
> > > Hi Emanuel,
> > >
> > > I think you raise some potentially valid issues that are worth
> > > looking at in more detail. I can say our experience with NiFi is the
> > > exact opposite, but part of that is that we are a 100% "schema first"
> > > shop. Avro is insanely easy to learn, and we've gotten junior data
> > > engineers up to speed in a matter of days producing beta quality data
> > > contracts that way.
> > >
> > > On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Based on recent experience, I found it very hard to implement logic
> > >> which I think should exist out of the box; instead it was a slow
> > >> process of repeatedly discovering that a property on a processor
> > >> only works for one type of data even when the processor supports
> > >> multiple types, etc.
> > >>
> > >> I would like you all to take a "keep it simple" attitude and imagine
> > >> how you would implement a basic scenario such as:
> > >>
> > >> *basic scenario 1 - shall be easy to implement out of the box, with
> > >> the following 3 needs:*
> > >> CSV (*get schema automatically via header line*) --> *validate
> > >> mandatory subset of fields (presence) and (data types)* --> *export
> > >> subset of fields* or all (but with some of them obfuscated)
> > >> problems/workarounds found in 1.9 RC3:
> > >>
> > >> *1. processor ValidateRecord*
> > >> [1.1] *OK* - allows *getting the schema automatically via the header
> > >> line* and a *mandatory subset of fields* (presence) via the 3 schema
> > >> properties --> suggest renaming the properties to make clear that the
> > >> ones at processor level are the "mandatory check" vs the schema on
> > >> the reader, which is the data read schema.
> > >> [1.2] *NOK* - does not allow *type validation*. *One could think of
> > >> using InferAvroSchema, right? The problem is it only supports JSON.*
> > >> [1.3] *NOK* - ignores the writer schema, where one could supply a
> > >> *subset of the original fields* (it always exports all original
> > >> fields) --> add a property to control exporting all fields (default)
> > >> or using the writer schema (with the subset).
> > >>
> > >> *2. processor ConvertRecord*
> > >> [2.1] *OK* - the csvreader is able to *get the schema from the
> > >> header* --> maybe improve/add a property to clean up field names
> > >> (regex search/replace, so we can strip whitespace and anything else
> > >> that breaks nifi processors and/or that doesn't interest us; see the
> > >> sketch after this list)
> > >> [2.2] *NOK* - missing the *mandatory subset of fields* check.
> > >> [2.3] *OK* - does a good job converting between formats, and/or
> > >> *exporting all or a subset of fields via the writer schema*.
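On the field-name cleanup point, the transformation being asked for is
tiny; a rough sketch (the regex and replacement are assumptions, tune to
taste):

public class HeaderSanitizer {

    // Strip surrounding whitespace and squash anything outside
    // [A-Za-z0-9_] so the header yields legal Avro field names.
    static String sanitize(String rawHeader) {
        return rawHeader.trim().replaceAll("[^A-Za-z0-9_]", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize(" First Name "));  // First_Name
        System.out.println(sanitize("account-id"));    // account_id
    }
}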
> > >>
> > >> *3. processor InferAvroSchema*
> > >> [3.1] *NOK* - despite the property "Input Content Type" listing CSV
> > >> and JSON as inbound data, in reality the property "Number Of Records
> > >> To Analyze" only supports JSON. It took us 2 days of debugging to
> > >> understand the problem: 1 CSV with 4k lines, mostly nulls, "1"s or
> > >> "2"s, but a few records would be "true" or "false", meaning the Avro
> > >> data type should have been [null, string] - but no: as we found out,
> > >> the type kept being [null, long], with the processor always using the
> > >> 1st data line in the CSV to determine the field type. This was VERY
> > >> scary to find out; how can it be that this was considered fully
> > >> working as expected? We ended up needing to add 1 more processor to
> > >> convert the CSV into JSON so we could get a proper schema, and even
> > >> now we are still testing, as it seems all fields got [string] when
> > >> some columns should be long.
> > >>
> > >> I'm not sure of the best way to expose this, but I'm working at
> > >> enterprise level, and believe me, these small but critical nuances
> > >> are starting to sour the mood around NiFi.
> > >> I fell in love with NiFi and I like the idea of graphical design of
> > >> flows etc., but we really must fix these critical little devils; they
> > >> are being called out as NiFi problems at management level.
> > >> I know NiFi is open source, and it's upon us developers to improve
> > >> it; I just want to call attention to the fact that, in the middle of
> > >> PRs and JIRA enhancements, we must be sure we are not forgetting the
> > >> basic threshold: it doesn't make sense to release a processor with
> > >> only 50% of its main goal developed when the remaining work would be
> > >> easy and fast to do (aka InferAvroSchema).
> > >>
> > >> As I keep experimenting more and more with NiFi, I keep finding that
> > >> the level of basic quality features is below what I think it should
> > >> be. Better not to release incomplete processors, at least with regard
> > >> to the core function of the processor.
> > >>
> > >> I know developers can contribute new code, fixes and enhancements,
> > >> but is there any gatekeeper team double-checking the deliverables?
> > >> Like, at a basic level, a developer should provide enough unit
> > >> tests. Again, with InferAvroSchema being a processor that exports an
> > >> Avro schema based on either a CSV or JSON, there should obviously be
> > >> a couple of unit-test CSVs and JSONs with different data, so we can
> > >> be sure we have the proper types in the exported Avro schema, right?
> > >>
> > >> Above I share some ideas, and I have many more from my day-to-day
> > >> experience; I have been working with NiFi at enterprise level for
> > >> more than 1 year now.
> > >> Let me know what the way shall be to create JIRAs to fix several
> > >> processors, in order to allow an inexperienced NiFi client developer
> > >> to accomplish the basic flow of:
> > >>
> > >> CSV (*get schema automatically via header line*) --> *validate
> > >> mandatory subset of fields (presence) and (data types)* --> *export
> > >> subset of fields* or all (but with some of them obfuscated)
> > >>
> > >> I challenge anyone to come up with flows implementing this basic
> > >> flow, then test them and see what I mean; you will see how incomplete
> > >> and hard things are, which should not be the case at all. NiFi should
> > >> be true Lego: add processors that say they do XPTO and trust that
> > >> they will. But we keep finding a lot of nuances.
> > >>
> > >> I don't mind taking 1 day off my work to have a meeting with some of
> > >> you - I don't know if there's such a thing as a tech lead on the nifi
> > >> project? - and I think it would be urgent to fix the foundations of
> > >> some processors. Let me know.
> > >>
> > >>
> > >>
> > >> Best Regards,
> > >> *Emanuel Oliveira*
> > >>
> > >
> >
> >
>
