Hi Emanuel,

> This may look like a simple use case, but it is very hard to implement.. but please
> do surprise me with the sequence of processors needed to implement what I think
> *is a great real world example of data quality*
I think this is the root of the problem. Personally, I wouldn't characterize anything that relies on schema inference over a schema-first design as a "great real world example of data quality", because getting real data quality takes a lot of hard data engineering work in an enterprise environment. As the saying goes, there ain't no such thing as a free lunch.

Now, if you want to generate schemas in a robust way, here's one way I know tends to yield good results:

https://github.com/FasterXML/jackson-dataformats-binary/tree/master/avro

It will take a POJO and generate an Avro schema from it, and since Java is a fairly strongly typed language you just need to massage things a little with some annotations to get certain nuances; I think javax.validation.Nullable will automatically make a field nullable, for example. (A minimal code sketch follows below.)

On Mon, Feb 3, 2020 at 1:59 PM Emanuel Oliveira <[email protected]> wrote:

> Hi Mike,
>
> Let me summarize, as I see my long post is not passing the clean, easy message I intended:
>
> *processor InferAvroSchema*:
> - should retrieve types from analysing the data from the csv. The property "Input Content Type" lists CSV and JSON, but in reality the property "Number Of Records To Analyze" only works with JSON. With CSV all types are strings.. It is not hard to detect if a field only contains digits or alphanumerics; only timestamps might need an extra property to help with the format (or, out of the box, just detect timestamps as well.. not hard).
>
> *Mandatory subset of fields verification:*
> ValidateRecord allows 3 optional schema properties (outside reader and writer) to supply an avro schema to validate a mandatory subset of fields - but ConvertRecord doesn't allow this.
>
> Finally I would like to request your suggestion for the following use case (the same one we struggled with):
> - given 1 csv with a header line listing 100 fields, we want to:
> --- validate mandatory fields (just 1 or 2 fields).
> --- automatically create an avro schema based on the data lines.
> --- export avro like this:
> ------ some fields obfuscated + remaining fields not obfuscated (or the other way around: some fields not obfuscated + remaining fields obfuscated). And of course the header line stays in line with the final field order.
>
> This may look like a simple use case, but it is very hard to implement.. but please do surprise me with the sequence of processors needed to implement what I think is a great real world example of data quality (mandatory fields + partial obfuscation + export as a different format with just a subset of the fields, where some are obfuscated and others not).
>
> Thanks, and I hope it's more clear now; I'm sure this will help more dev teams.
>
> Cheers,
> Emanuel
>
>
> On Mon 3 Feb 2020, 13:50 Mike Thomsen, <[email protected]> wrote:
>
> > One thing I should mention is that schema inference is simply not capable of exploiting Avro's field aliasing. That's an incredibly powerful feature that allows you to reconcile data sets without writing a single line of code. For example, I wrote a schema last year that uses aliases to reconcile 9 different CSV data sets into a common model without writing one line of code. This is all it takes:
> >
> > {
> >   "name": "first_name",
> >   "type": "string",
> >   "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > }
> >
> > That one manual line just reconciled 5 fields into a common model.
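To illustrate the POJO-to-Avro-schema approach mentioned at the top of this message, here is a minimal sketch assuming the jackson-dataformats-avro module's AvroMapper and AvroSchemaGenerator; the Person POJO and its fields are made up purely for illustration:

    import com.fasterxml.jackson.dataformat.avro.AvroMapper;
    import com.fasterxml.jackson.dataformat.avro.AvroSchema;
    import com.fasterxml.jackson.dataformat.avro.schema.AvroSchemaGenerator;

    public class PojoToAvroSchema {

        // Hypothetical POJO used only for illustration.
        static class Person {
            public String firstName;
            public String lastName;
            public int age;
        }

        public static void main(String[] args) throws Exception {
            AvroMapper mapper = new AvroMapper();
            AvroSchemaGenerator gen = new AvroSchemaGenerator();

            // Walk the POJO's type structure and build an Avro schema from it.
            mapper.acceptJsonFormatVisitor(Person.class, gen);
            AvroSchema schema = gen.getGeneratedSchema();

            // Print the generated Avro schema as pretty-printed JSON.
            System.out.println(schema.getAvroSchema().toString(true));
        }
    }

Nullability and similar nuances would then be tuned with annotations on the POJO, as noted above.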
> > On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <[email protected]> wrote:
> >
> > > Hi Emanuel,
> > >
> > > I think you raise some potentially valid issues that are worth looking at in more detail. I can say our experience with NiFi is the exact opposite, but part of that is that we are a 100% "schema first" shop. Avro is insanely easy to learn, and we've gotten junior data engineers up to speed in a matter of days producing beta quality data contracts that way.
> > >
> > > On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Based on recent experience, I found it very hard to implement logic which I think should exist out of the box; instead it was a slow process of repeatedly discovering that a property on a processor only works for one type of data even when the processor supports multiple types, etc.
> > > >
> > > > I would like you all to take a "keep it simple" attitude and imagine how you would implement a basic scenario such as:
> > > >
> > > > *basic scenario 1 - shall be easy to implement out of the box, with the following 3 needs:*
> > > > CSV (*get schema automatically via header line*) --> *validate mandatory subset of fields (presence) and (data types)* --> *export subset of fields* or all (but with some of them obfuscated)
> > > > problems/workarounds found in 1.9 RC3:
> > > >
> > > > *1. processor ValidateRecord*
> > > > [1.1] *OK* - allows *getting the schema automatically via the header line* and a *mandatory subset of fields* (presence) via the 3 schema properties --> suggest renaming the properties to make clear that those at processor level are the "mandatory check", vs the schema on the reader which is the data read schema.
> > > > [1.2] *NOK* - does not allow *type validation*. *One could think of using InferAvroSchema, right? The problem is it only supports JSON.*
> > > > [1.3] *NOK* - ignores the writer schema, where one could supply a *subset of the original fields* (it always exports all original fields) --> add a property to control exporting all fields (default) or using the writer schema (with a subset).
> > > >
> > > > *2. processor ConvertRecord*
> > > > [2.1] *OK* - CSVReader is able to *get the schema from the header* --> maybe improve/add a property to clean up fields (regex search/replace, so we can strip whitespace and anything else that breaks nifi processors and/or that doesn't interest us).
> > > > [2.2] *NOK* - missing *mandatory subset of fields.*
> > > > [2.3] *OK* - does a good job converting between formats, and/or *exporting all or a subset of fields via the writer schema*.
> > > >
> > > > *3. processor InferAvroSchema*
> > > > [3.1] *NOK* - despite the property "Input Content Type" listing CSV and JSON as inbound data, in reality the property "Number Of Records To Analyze" only supports JSON. It took us 2 days of debugging to understand the problem.. 1 CSV with 4k lines, mostly nulls, "1"s or "2"s, but a few records would be "true" or "false".. meaning the avro data type should have been [null, string], but no.. as we found out, the type kept being [null, long], with the doc always using the 1st data line in the CSV to determine the field type. This was VERY scary to find out.. how can it be that this was considered fully working as expected? We ended up needing to add one more processor to convert the CSV into JSON so we could get a proper schema..
> > > > And even now we are still testing, as it seems all fields got [string] when some columns should be long.
> > > >
> > > > I'm not sure of the best way to raise this, but I'm working at enterprise level and, believe me, these small but critical nuances are starting to push the mood on NiFi.
> > > > I fell in love with NiFi and I like the idea of graphical design of flows etc., but we really must fix these critical little devils.. they are being screamed about as nifi problems at management level.
> > > > I know nifi is open source, and it's upon us developers to improve it; I just would like to call attention to the fact that, in the middle of PRs and JIRA enhancements, we must be sure we are not forgetting the basic threshold.. it doesn't make sense to release a processor with only 50% of its main goal developed when the remaining work would be easy and fast to do (aka InferAvroSchema).
> > > >
> > > > As I keep experimenting more and more with NiFi, I am starting to detect that the level of basic quality features is below what I think it should be. Better not to release incomplete processors, at least regarding the core function of the processor.
> > > >
> > > > I know developers can contribute new code, fixes and enhancements.. but is there any gatekeeper team double checking the deliverables? Like, at a basic level, a developer should provide enough unit tests.. again, InferAvroSchema being a processor to export an avro schema based on either CSV or JSON, obviously there should be a couple of unit test CSVs and JSONs with different data so we can be sure we have the proper types in the exported avro schema, right?
> > > >
> > > > Above I share some ideas, and I have many more from my day by day experience working with NiFi at enterprise level for more than 1 year now.
> > > > Let me know what the way should be to create JIRAs to fix several processors, in order to allow an inexperienced nifi client developer to accomplish the basic flow of:
> > > >
> > > > CSV (*get schema automatically via header line*) --> *validate mandatory subset of fields (presence) and (data types)* --> *export subset of fields* or all (but with some of them obfuscated)
> > > >
> > > > I challenge anyone to come up with flows to implement this basic flow.. test it and see what I mean.. you will see how incomplete and hard things are.. which should not be the case at all. NiFi should be true Lego: add processors that say they do XPTO and trust that they will.. but we keep finding a lot of nuances..
> > > >
> > > > I don't mind taking 1 day off my work and having a meeting with some of you - I don't know if there's such a thing as a tech lead on the nifi project? - and I think it would be urgent to fix the foundations of some processors. Let me know..
> > > >
> > > > Best Regards,
> > > > *Emanuel Oliveira*
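To make the mandatory-subset and type points in the quoted thread concrete, here is a rough sketch of the kind of hand-written avro schema one might supply via ValidateRecord's schema properties for the flow restated above. The record and field names are invented for illustration: the first two fields stand in for the 1 or 2 mandatory fields (declared non-null so missing values fail validation), and the last field shows the [null, string] union that inference should have produced for the mixed "1"/"true" column described above:

    {
      "type": "record",
      "name": "CsvMandatorySubset",
      "fields": [
        { "name": "customer_id", "type": "string" },
        { "name": "created_at",  "type": "string" },
        { "name": "flag_field",  "type": [ "null", "string" ], "default": null }
      ]
    }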
