Hi Emanuel, I think you raise some potentially valid issues that are worth looking at in more detail. I can say our experience with NiFi is the exact opposite, but part of that is that we are a 100% "schema first" shop. Avro is insanely easy to learn, and we've gotten junior data engineers up to speed in a matter of days, producing beta-quality data contracts that way.
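For anyone curious what "schema first" looks like in practice: the contract your scenario calls for (mandatory fields with explicit types, optional fields as nullable unions) is only a few lines of Avro. Here is a minimal sketch in Java using Avro's SchemaBuilder - the record and field names are made up for illustration:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecordBuilder;

    public class ContractSketch {
        public static void main(String[] args) {
            // Hypothetical "Customer" contract: two mandatory fields with
            // explicit types, one optional field as a [null, boolean] union.
            Schema contract = SchemaBuilder.record("Customer")
                    .fields()
                    .requiredString("id")
                    .requiredLong("created_at")
                    .optionalBoolean("active")
                    .endRecord();

            // The JSON form of this schema is what you would paste into a
            // record reader/writer "Schema Text" property.
            System.out.println(contract.toString(true));

            // Quick sanity check of a datum against the contract, outside
            // NiFi: validate() returns false if a mandatory field is
            // missing or a type does not match.
            GenericData.Record rec = new GenericRecordBuilder(contract)
                    .set("id", "abc-123")
                    .set("created_at", 1580000000L)
                    .set("active", true)
                    .build();
            System.out.println(GenericData.get().validate(contract, rec));
        }
    }

A junior engineer can read and write a schema like that on day one, and once the contract exists, the record reader/writer pattern becomes much more predictable than inferring types from the data.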
On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]> wrote:
> Hi,
>
> Based on recent experience, I found it very hard to implement logic that I
> think should exist out of the box. Instead, it was a slow process of
> discovering, one property at a time, that a property on a processor only
> works for one type of data even when the processor supports multiple
> types, etc.
>
> I would like you all to keep a "keep it simple" attitude and imagine how
> you would implement a basic scenario such as:
>
> *basic scenario 1 - should be easy to implement out of the box, following
> 3 needs:*
> CSV (*get schema automatically via header line*) --> *validate a mandatory
> subset of fields (presence) and (data types)* --> *export a subset of
> fields* or all (but with some of them obfuscated)
>
> problems/workarounds found in 1.9 RC3:
>
> *1. processor ValidateRecord*
> [1.1] *OK* - allows *getting the schema automatically via the header line*
> and checking a *mandatory subset of fields* (presence) via the 3 schema
> properties --> I suggest renaming the properties to make clear that the
> ones at the processor level are the "mandatory check", while the schema on
> the reader is the data-read schema.
> [1.2] *NOK* - does not allow *type validation*. *One could think of using
> InferSchema, right? The problem is that it only supports JSON.*
> [1.3] *NOK* - ignores the writer schema, where one could supply a *subset
> of the original fields* (it always exports all original fields) --> add a
> property to control exporting all fields (default) or using the writer
> schema (with a subset).
>
> *2. processor ConvertRecord*
> [2.1] *OK* - the CSVReader is able to *get the schema from the header* -->
> maybe improve/add a property to clean up field names (regex
> search/replace, so we can strip whitespace and anything else that breaks
> NiFi processors and/or that doesn't interest us)
> [2.2] *NOK* - missing the *mandatory subset of fields* check.
> [2.3] *OK* - it does a good job converting between formats and/or
> *exporting all or a subset of fields via the writer schema*.
>
> *3. processor InferAvroSchema*
> [3.1] *NOK* - although the property "Input Content Type" lists CSV and
> JSON as inbound data, in reality the property "Number Of Records To
> Analyze" only supports JSON. It took us 2 days of debugging to understand
> the problem: a CSV with 4k lines, mostly nulls, "1"s or "2"s, but with a
> few records containing "true" or "false", meaning the Avro data type
> should have been [null, string]. Instead, the type kept being [null,
> long], with the processor always using the first data line in the CSV to
> determine the field type. This was VERY scary to find out. How can it be
> that this was considered fully working as expected? We ended up needing to
> add one more processor to convert the CSV into JSON so we could get a
> proper schema, and even now we are still testing, as it seems all fields
> got [string] when some columns should be long.
>
> I'm not sure of the best way to expose this, but I'm working at the
> enterprise level, and believe me, these small but critical nuances are
> starting to sour the mood on NiFi.
> I fell in love with NiFi and I like the idea of graphical design of flows,
> but we really must fix these critical little devils; they are being called
> out as NiFi problems at the management level.
> I know NiFi is open source, and it's upon us developers to improve it. I
> just want to call attention to the fact that, in the middle of PRs and
> JIRA enhancements, we must be sure we are not forgetting the basic
> threshold: it doesn't make sense to release a processor with only 50% of
> its main goal developed when the remaining work would be easy and fast to
> do (e.g. InferAvroSchema).
>
> As I keep experimenting more and more with NiFi, I keep finding that the
> level of basic quality features is below what I think it should be. Better
> not to release incomplete processors, at least with regard to the core
> function of the processor.
>
> I know developers can contribute new code, fixes, and enhancements, but is
> there any gatekeeper team double-checking the deliverables? At a minimum,
> a developer should provide enough unit tests. Again, since InferAvroSchema
> is a processor that exports an Avro schema based on either a CSV or a JSON
> file, there should obviously be a couple of unit-test CSVs and JSONs with
> different data, so we can be sure we have the proper types in the exported
> Avro schema, right?
>
> Above I share some ideas, and I have many more from my day-to-day
> experience; I have been working with NiFi at the enterprise level for more
> than a year now.
> Let me know what the best way is to create JIRAs to fix several processors
> in order to allow an inexperienced NiFi client developer to accomplish the
> basic flow of:
>
> CSV (*get schema automatically via header line*) --> *validate a mandatory
> subset of fields (presence) and (data types)* --> *export a subset of
> fields* or all (but with some of them obfuscated)
>
> I challenge anyone to come up with flows that implement this basic flow,
> then test them and see what I mean; you will see how incomplete and hard
> things are, which should not be the case at all. NiFi should be true Lego:
> add a processor that says it does XPTO and trust that it will. But we keep
> finding a lot of nuances.
>
> I don't mind taking a day off work to have a meeting with some of you - I
> don't know if there is such a thing as a tech lead on the NiFi project? -
> and I think it would be urgent to fix the foundations of some processors.
> Let me know.
>
>
>
> Best Regards,
> *Emanuel Oliveira*
>
