Hi Emanuel,

Just wanted to answer your questions regarding JIRA. You can create an
account on the Apache JIRA [1] and open JIRAs on the NiFi project [2].
Once you have created an account and logged in to JIRA, you can share
your username with us, and we can grant you the "contributor" role,
which gives you the right to assign a JIRA to yourself if you want. But
there are no specific requirements to create JIRAs and/or comment on
existing JIRAs.
[1] https://issues.apache.org/jira/secure/Signup!default.jspa
[2] https://issues.apache.org/jira/projects/NIFI

On Mon, Feb 3, 2020 at 13:59, Emanuel Oliveira <[email protected]> wrote:

> Hi Mike,
>
> Let me summarize, as I see my long post is not getting across the
> clear, simple message I intended:
>
> *Processor InferAvroSchema:*
> - should infer types by analysing the data in a CSV. The property
> "Input Content Type" lists CSV and JSON, but in reality the property
> "Number Of Records To Analyze" only works with JSON. With CSV, all
> types come out as strings. It is not hard to detect whether a field
> only contains digits or alphanumerics; only timestamps might need an
> extra property to help with the format (or, out of the box, just
> detect timestamps as well, which is also not hard).
>
> *Mandatory subset of fields verification:*
> ValidateRecord allows 3 optional schema properties (besides the reader
> and writer) to supply an Avro schema that validates a mandatory subset
> of fields, but ConvertRecord doesn't allow this.
>
> Finally, I would like to ask your suggestion for the following use
> case (the same one we struggled with):
> - given 1 CSV with a header line listing 100 fields, we want to:
> --- validate mandatory fields (just 1 or 2 fields);
> --- automatically create an Avro schema based on the data lines;
> --- export Avro like this:
> ------ some fields obfuscated + the remaining fields not obfuscated
> (or the other way around: some fields not obfuscated + the remaining
> fields obfuscated), with the header line staying in line with the
> final field order.
>
> This may look like a simple use case, but it is very hard to
> implement. Please do surprise me with the sequence of processors
> needed to implement what I think is a great real-world example of data
> quality (mandatory fields + partial obfuscation + export as a
> different format with just a subset of the fields, some obfuscated and
> others not).
>
> Thanks, and I hope this is clearer. I'm sure this will help more dev
> teams.
>
> Cheers,
> Emanuel
>
> On Mon, Feb 3, 2020, 13:50 Mike Thomsen <[email protected]> wrote:
>
> > One thing I should mention is that schema inference is simply not
> > capable of exploiting Avro's field aliasing. That's an incredibly
> > powerful feature that allows you to reconcile data sets without
> > writing a single line of code. For example, I wrote a schema last
> > year that uses aliases to reconcile 9 different CSV data sets into a
> > common model without writing one line of code. This is all it takes:
> >
> > {
> >   "name": "first_name",
> >   "type": "string",
> >   "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > }
> >
> > That one manual line just reconciled 5 fields into a common model.
> >
> > On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <[email protected]>
> > wrote:
> >
> > > Hi Emanuel,
> > >
> > > I think you raise some potentially valid issues that are worth
> > > looking at in more detail. I can say our experience with NiFi is
> > > the exact opposite, but part of that is that we are a 100% "schema
> > > first" shop. Avro is insanely easy to learn, and we've gotten
> > > junior data engineers up to speed in a matter of days producing
> > > beta-quality data contracts that way.
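Expanding Mike's one-line alias example above into a complete record
schema may help anyone trying this at home. This is a minimal sketch;
the record name, namespace, and second field are illustrative
assumptions, not taken from the thread:

    {
      "type": "record",
      "name": "person",
      "namespace": "com.example",
      "fields": [
        {
          "name": "first_name",
          "type": "string",
          "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
        },
        {
          "name": "last_name",
          "type": "string",
          "aliases": [ "LastName", "LAST_NAME", "lname" ]
        }
      ]
    }

Per Mike's description, a single schema like this can absorb several
differently-headed CSV data sets into one common model, since each
aliased header name resolves to the same canonical field.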
> > > On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Based on recent experience, I found it very hard to implement
> > >> logic which I think should exist out of the box; instead it was a
> > >> slow process of repeatedly discovering that a property on a
> > >> processor only works for one type of data even when the processor
> > >> supports multiple types, etc.
> > >>
> > >> I would like you all to take a keep-it-simple attitude and
> > >> imagine how you would implement a basic scenario such as:
> > >>
> > >> *basic scenario 1 - the following 3 needs shall be easy to
> > >> implement out of the box:*
> > >> CSV (*get schema automatically via header line*) --> *validate a
> > >> mandatory subset of fields (presence) and (data types)* -->
> > >> *export a subset of fields* or all of them (but with some of them
> > >> obfuscated)
> > >>
> > >> problems/workarounds found on 1.9 RC3:
> > >>
> > >> *1. processor ValidateRecord*
> > >> [1.1] *OK* - allows *getting the schema automatically via the
> > >> header line* and checking a *mandatory subset of fields*
> > >> (presence) via the 3 schema properties --> suggest renaming the
> > >> properties to make clear that the ones at processor level are the
> > >> "mandatory check" vs the schema on the reader, which is the data
> > >> read schema.
> > >> [1.2] *NOK* - does not allow *type validation.* *One could think
> > >> of using InferAvroSchema, right? The problem is it only supports
> > >> JSON.*
> > >> [1.3] *NOK* - ignores the writer schema, where one could supply a
> > >> *subset of the original fields* (it always exports all original
> > >> fields) --> add a property to control exporting all fields
> > >> (default) or using the writer schema (with the subset).
> > >>
> > >> *2. processor ConvertRecord*
> > >> [2.1] *OK* - CSVReader is able to *get the schema from the
> > >> header* --> maybe improve/add a property to clean up field names
> > >> (regex search/replace, so we can strip whitespace and anything
> > >> else that breaks NiFi processors and/or that doesn't interest us)
> > >> [2.2] *NOK* - missing the *mandatory subset of fields* check.
> > >> [2.3] *OK* - does a good job converting between formats, and/or
> > >> *exporting all or a subset of fields via the writer schema*.
> > >>
> > >> *3. processor InferAvroSchema*
> > >> [3.1] *NOK* - despite the property "Input Content Type" listing
> > >> CSV and JSON as inbound data, in reality the property "Number Of
> > >> Records To Analyze" only supports JSON. It took us 2 days of
> > >> debugging to understand the problem: 1 CSV with 4k lines, mostly
> > >> nulls, "1"s or "2"s, but a few records with "true" or "false",
> > >> meaning the Avro data type should have been [null, string]. But
> > >> no: as we found out, the type kept being [null, long], always
> > >> using the 1st data line in the CSV to determine the field type.
> > >> This was VERY scary to find out. How can it be that this was
> > >> considered fully working as expected? We ended up needing to add
> > >> one more processor to convert the CSV into JSON so we could get a
> > >> proper schema, and even now we are still testing, as it seems all
> > >> fields got [string] when some columns should be long.
> > >>
> > >> I'm not sure of the best way to raise this, but I'm working at
> > >> enterprise level, and believe me, these small but critical
> > >> nuances are starting to sour the mood around NiFi.
> > >> I fell in love with NiFi and I like the idea of graphical design
> > >> of flows etc., but we really must fix these critical little
> > >> devils; they are being called out as NiFi problems at management
> > >> level.
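To make [3.1] concrete: for a CSV column whose first rows hold 1 or 2
but whose later rows hold true or false, the behaviour Emanuel
describes locks the type to whatever the first data line looks like.
The field name below is illustrative, not from the thread.

What InferAvroSchema produced (type taken from the 1st data line):

    { "name": "status", "type": [ "null", "long" ] }

What the mixed 4k-line sample actually requires:

    { "name": "status", "type": [ "null", "string" ] }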
> > >> I know NiFi is open source, and it's upon us developers to
> > >> improve it. I just would like to call attention to the fact that
> > >> we must be sure that in the middle of PRs and JIRA enhancements
> > >> we are not forgetting the basic threshold: it doesn't make sense
> > >> to release a processor with only 50% of its main goal developed
> > >> when the remaining work would be easy and fast to do (aka
> > >> InferAvroSchema).
> > >>
> > >> As I keep experimenting more and more with NiFi, I am starting to
> > >> find that the level of basic quality features is below what I
> > >> think it should be. Better not to release incomplete processors,
> > >> at least regarding the core function of the processor.
> > >>
> > >> I know developers can contribute new code, fixes and
> > >> enhancements, but is there any gatekeeper team double-checking
> > >> the deliverables? At a basic level, a developer should provide
> > >> enough unit tests. Again, InferAvroSchema being a processor that
> > >> exports an Avro schema based on either a CSV or JSON, obviously
> > >> there should be a couple of unit tests with CSVs and JSONs with
> > >> different data, so we can be sure we have the proper types in the
> > >> exported Avro schema, right?
> > >>
> > >> Above I share some ideas, and I have many more from my day-by-day
> > >> experience; I have been working with NiFi at enterprise level for
> > >> more than 1 year by now.
> > >> Let me know what the way shall be to create JIRAs to fix several
> > >> processors in order to allow an inexperienced NiFi client
> > >> developer to accomplish the basic flow of:
> > >>
> > >> CSV (*get schema automatically via header line*) --> *validate a
> > >> mandatory subset of fields (presence) and (data types)* -->
> > >> *export a subset of fields* or all of them (but with some of them
> > >> obfuscated)
> > >>
> > >> I challenge anyone to come up with flows that implement this
> > >> basic flow, then test them and see what I mean. You will see how
> > >> incomplete and hard things are, which should not be the case at
> > >> all. NiFi shall be true Lego: add processors that say they do
> > >> XPTO and trust that they will. But we keep finding a lot of
> > >> nuances.
> > >>
> > >> I don't mind taking 1 day off my work and having a meeting with
> > >> some of you (I don't know if there's such a thing as a tech lead
> > >> on the NiFi project?), and I think it would be urgent to fix the
> > >> foundations of some processors. Let me know.
> > >>
> > >> Best Regards,
> > >> *Emanuel Oliveira*
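For anyone taking up Emanuel's challenge, here is a minimal sketch of
the kind of Avro schema one could supply through the processor-level
schema properties he mentions on ValidateRecord, so that only the
mandatory subset is checked while the reader still derives the full
100-field schema from the header line. The field names are
hypothetical:

    {
      "type": "record",
      "name": "mandatory_subset",
      "fields": [
        { "name": "account_id",   "type": "string" },
        { "name": "created_date", "type": "string" }
      ]
    }

As the thread notes, this covers the presence check; per-field type
validation on CSV input and partial obfuscation on export remain the
gaps Emanuel is asking to close.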
