Hi Emanuel, I think you raise some potentially valid issues that are worth looking at in more detail. I can say our experience with NiFi is the exact opposite, but part of that is that we are a 100% "schema first" shop. Avro is insanely easy to learn, and we've gotten junior data engineers up to speed in a matter of days, producing beta-quality data contracts that way.
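For anyone curious what "schema first" looks like in practice: the contract your scenario calls for (mandatory fields with explicit types, optional fields as nullable unions) is only a few lines of Avro. Here is a minimal sketch in Java using Avro's SchemaBuilder - the record and field names are made up for illustration:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecordBuilder;

    public class ContractSketch {
        public static void main(String[] args) {
            // Hypothetical "Customer" contract: two mandatory fields with
            // explicit types, one optional field as a [null, boolean] union.
            Schema contract = SchemaBuilder.record("Customer")
                    .fields()
                    .requiredString("id")
                    .requiredLong("created_at")
                    .optionalBoolean("active")
                    .endRecord();

            // The JSON form of this schema is what you would paste into a
            // record reader/writer "Schema Text" property.
            System.out.println(contract.toString(true));

            // Quick sanity check of a datum against the contract, outside
            // NiFi: validate() returns false if a mandatory field is
            // missing or a type does not match.
            GenericData.Record rec = new GenericRecordBuilder(contract)
                    .set("id", "abc-123")
                    .set("created_at", 1580000000L)
                    .set("active", true)
                    .build();
            System.out.println(GenericData.get().validate(contract, rec));
        }
    }

A junior engineer can read and write a schema like that on day one, and once the contract exists, the record reader/writer pattern becomes much more predictable than inferring types from the data.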
On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]> wrote:
> Hi,
>
> Based on recent experience, I found it very hard to implement logic that I
> think should exist out of the box. Instead, it was a slow process of
> discovering, one property at a time, that a property on a processor only
> works for one type of data even when the processor supports multiple
> types, etc.
>
> I would like you all to keep a "keep it simple" attitude and imagine how
> you would implement a basic scenario such as:
>
> *basic scenario 1 - should be easy to implement out of the box, following
> 3 needs:*
> CSV (*get schema automatically via header line*) --> *validate a mandatory
> subset of fields (presence) and (data types)* --> *export a subset of
> fields* or all (but with some of them obfuscated)
>
> problems/workarounds found in 1.9 RC3:
>
> *1. processor ValidateRecord*
> [1.1] *OK* - allows *getting the schema automatically via the header line*
> and checking a *mandatory subset of fields* (presence) via the 3 schema
> properties --> I suggest renaming the properties to make clear that the
> ones at the processor level are the "mandatory check", while the schema on
> the reader is the data-read schema.
> [1.2] *NOK* - does not allow *type validation*. *One could think of using
> InferSchema, right? The problem is that it only supports JSON.*
> [1.3] *NOK* - ignores the writer schema, where one could supply a *subset
> of the original fields* (it always exports all original fields) --> add a
> property to control exporting all fields (default) or using the writer
> schema (with a subset).
>
> *2. processor ConvertRecord*
> [2.1] *OK* - the CSVReader is able to *get the schema from the header* -->
> maybe improve/add a property to clean up field names (regex
> search/replace, so we can strip whitespace and anything else that breaks
> NiFi processors and/or that doesn't interest us)
> [2.2] *NOK* - missing the *mandatory subset of fields* check.
> [2.3] *OK* - it does a good job converting between formats and/or
> *exporting all or a subset of fields via the writer schema*.
>
> *3. processor InferAvroSchema*
> [3.1] *NOK* - although the property "Input Content Type" lists CSV and
> JSON as inbound data, in reality the property "Number Of Records To
> Analyze" only supports JSON. It took us 2 days of debugging to understand
> the problem: a CSV with 4k lines, mostly nulls, "1"s or "2"s, but with a
> few records containing "true" or "false", meaning the Avro data type
> should have been [null, string]. Instead, the type kept being [null,
> long], with the processor always using the first data line in the CSV to
> determine the field type. This was VERY scary to find out. How can it be
> that this was considered fully working as expected? We ended up needing to
> add one more processor to convert the CSV into JSON so we could get a
> proper schema, and even now we are still testing, as it seems all fields
> got [string] when some columns should be long.
>
> I'm not sure of the best way to expose this, but I'm working at the
> enterprise level, and believe me, these small but critical nuances are
> starting to sour the mood on NiFi.
> I fell in love with NiFi and I like the idea of graphical design of flows,
> but we really must fix these critical little devils; they are being called
> out as NiFi problems at the management level.
> I know NiFi is open source, and it's upon us developers to improve it. I
> just want to call attention to the fact that, in the middle of PRs and
> JIRA enhancements, we must be sure we are not forgetting the basic
> threshold: it doesn't make sense to release a processor with only 50% of
> its main goal developed when the remaining work would be easy and fast to
> do (e.g. InferAvroSchema).
>
> As I keep experimenting more and more with NiFi, I keep finding that the
> level of basic quality features is below what I think it should be. Better
> not to release incomplete processors, at least with regard to the core
> function of the processor.
>
> I know developers can contribute new code, fixes, and enhancements, but is
> there any gatekeeper team double-checking the deliverables? At a minimum,
> a developer should provide enough unit tests. Again, since InferAvroSchema
> is a processor that exports an Avro schema based on either a CSV or a JSON
> file, there should obviously be a couple of unit-test CSVs and JSONs with
> different data, so we can be sure we have the proper types in the exported
> Avro schema, right?
>
> Above I share some ideas, and I have many more from my day-to-day
> experience; I have been working with NiFi at the enterprise level for more
> than a year now.
> Let me know what the best way is to create JIRAs to fix several processors
> in order to allow an inexperienced NiFi client developer to accomplish the
> basic flow of:
>
> CSV (*get schema automatically via header line*) --> *validate a mandatory
> subset of fields (presence) and (data types)* --> *export a subset of
> fields* or all (but with some of them obfuscated)
>
> I challenge anyone to come up with flows that implement this basic flow,
> then test them and see what I mean; you will see how incomplete and hard
> things are, which should not be the case at all. NiFi should be true Lego:
> add a processor that says it does XPTO and trust that it will. But we keep
> finding a lot of nuances.
>
> I don't mind taking a day off work to have a meeting with some of you - I
> don't know if there is such a thing as a tech lead on the NiFi project? -
> and I think it would be urgent to fix the foundations of some processors.
> Let me know.
>
>
>
> Best Regards,
> *Emanuel Oliveira*
>
