Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Otto Fowler Sun, 02 Feb 2020 06:01:46 -0800

I hope you entered Jira issues with your great feedback!




On February 1, 2020 at 12:33:44, Emanuel Oliveira ([email protected])
wrote:

Hi,

Based on recent experience, I found it very hard to implement logic which I
think should exist out of the box. Instead it was a slow process of
repeatedly discovering that a property on a processor only works for one
type of data even though the processor supports multiple types, etc.

I would like you all to take a keep-it-simple attitude and imagine how you
would implement a basic scenario such as:

*basic scenario 1 - the following 3 needs should be easy to implement out of
the box:*
CSV (*get schema automatically via header line*) --> *validate mandatory
subset of fields (presence) and (data types)* --> *export subset of fields*
or all (but want some of them obfuscated)
Problems/workarounds found in 1.9 rc3:

*1. processor ValidateRecord*
[1.1] *OK* - allows *getting schema automatically via header line* and
checking a *mandatory subset of fields* (presence) via the 3 schema
properties --> suggest renaming the properties to make clear that those at
processor level are the "mandatory check" schema, vs the schema on the
reader, which is the schema used to read the data.
[1.2] *NOK* - does not allow *types validation*. *One could think of using
InferSchema, right? The problem is it only supports JSON.*
[1.3] *NOK* - ignores the writer schema, where one could supply a *subset of
the original fields* (it always exports all original fields) --> add a
property to control whether to export all fields (default) or use the writer
schema (with the subset).

*2. processor ConvertRecord*
[2.1] *OK* - the CSVReader is able to *get the schema from the header* -->
maybe improve/add a property to clean up fields (regex search/replace, so we
can strip whitespace and anything else that breaks NiFi processors and/or
that doesn't interest us).
[2.2] *NOK* - missing the *mandatory subset of fields* check.
[2.3] *OK* - does a good job converting between formats, and/or *exporting
all or a subset of fields via the writer schema*.

*3. processor InferAvroSchema*
[3.1] *NOK* - although the property "Input Content Type" lists CSV and JSON
as inbound data, in reality the property "Number Of Records To Analyze" only
works for JSON. It took us 2 days of debugging to understand the problem: 1
CSV with 4k lines, mostly nulls, "1"s or "2"s, but a few records were
"true" or "false", meaning the Avro data type should have been [null,
string]. But no: as we found out, the type kept being [null, long], with
the doc always using the 1st data line in the CSV to determine the field
type. This was VERY scary to find out. How can this be considered fully
working as expected? We ended up needing to add 1 more processor to convert
the CSV into JSON so we could get a proper schema, and even now we are still
testing, as it seems all fields got [string] when some columns should be
long.
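
To make the expected behaviour in [3.1] concrete, here is a minimal sketch
in plain Python (standard library only, not NiFi code and not how
InferAvroSchema is implemented). The names infer_type, merge_types and
infer_schema are hypothetical, and the type-widening rules are an
assumption chosen for illustration. It shows how a column that looks like
[null, long] in the first records becomes [null, string] once every record
is analyzed:

```python
import csv
import io


def infer_type(value):
    """Classify one CSV cell as 'null', 'boolean', 'long', 'double' or 'string'."""
    text = value.strip()
    if text == "":
        return "null"
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        int(text)
        return "long"
    except ValueError:
        pass
    try:
        float(text)
        return "double"
    except ValueError:
        return "string"


def merge_types(seen):
    """Collapse the set of observed cell types into one Avro-style union."""
    concrete = seen - {"null"}
    if len(concrete) == 1:
        merged = next(iter(concrete))
    elif concrete == {"long", "double"}:
        merged = "double"  # widen integers to floating point
    else:
        merged = "string"  # mixed or unknown types fall back to string
    return ["null", merged] if "null" in seen else [merged]


def infer_schema(csv_text, max_records=None):
    """Infer one union per column from the header line plus the data rows.

    max_records limits how many data rows are analyzed (None means all).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames}
    for index, row in enumerate(reader):
        if max_records is not None and index >= max_records:
            break
        for name in reader.fieldnames:
            seen[name].add(infer_type(row[name] or ""))
    return {name: merge_types(types) for name, types in seen.items()}


# Hypothetical sample: "flag" is mostly empty or numeric, with one "true".
SAMPLE = "id,flag\n1,\n2,1\n3,2\n4,true\n"
first_two = infer_schema(SAMPLE, max_records=2)  # flag -> ['null', 'long']
all_rows = infer_schema(SAMPLE)                  # flag -> ['null', 'string']
```

Analyzing only the first records gives the misleading [null, long] result
described above, while scanning all records yields [null, string].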

I'm not sure of the best way to raise this, but I'm working at enterprise
level and, believe me, these small but critical nuances are starting to
hurt the mood on NiFi.
I fell in love with NiFi and I like the idea of graphical design of flows
etc., but we really must fix these critical little devils. They are being
screamed about as NiFi problems at management level.
I know NiFi is open source, and it's up to us developers to improve it. I
just would like to call attention to the fact that, in the middle of PRs and
JIRA enhancements, we must be sure we are not forgetting the basic
threshold. It doesn't make sense to release a processor with only 50% of its
main goal developed when the remaining work would be easy and fast to do
(aka InferAvroSchema).

As I keep experimenting more and more with NiFi, I am starting to find that
the level of basic feature quality is below what I think it should be.
Better not to release incomplete processors, at least regarding the core
function of the processor.

I know developers can contribute new code, fixes and enhancements, but is
there any gatekeeper team double-checking the deliverables? At a basic
level, the developer should provide enough unit tests. Again, InferAvroSchema
being a processor that exports an Avro schema based on either a CSV or JSON,
there should obviously be a couple of unit tests with CSVs and JSON holding
different data, so we can be sure we have the proper type on the exported
Avro schema, right?

Above I share some ideas, and I have many more from my day-to-day
experience, as I have been working with NiFi at enterprise level for more
than 1 year now.
Let me know what the right way is to create JIRAs to fix several processors,
in order to allow an inexperienced NiFi client developer to accomplish the
basic flow of:

CSV (*get schema automatically via header line*) --> *validate mandatory
subset of fields (presence) and (data types)* --> *export subset of fields*
or all (but want some of them obfuscated)
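
For reference, the three steps of this flow can be sketched outside NiFi in
plain Python (standard library only). This is only an illustration of the
intended behaviour under stated assumptions, not NiFi code: the function
name process_csv, its parameters, the sample data and the choice of a
truncated SHA-256 digest for obfuscation are all hypothetical.

```python
import csv
import hashlib
import io


def process_csv(csv_text, required, export, obfuscate=()):
    """Validate required typed fields, then export a subset of columns.

    required:  dict of field name -> converter (e.g. int) used as a type check
    export:    list of field names to write, in order
    obfuscate: field names whose values are replaced by a SHA-256 prefix
    Returns (output_csv_text, list_of_error_strings).
    """
    reader = csv.DictReader(io.StringIO(csv_text))  # schema from header line
    header = reader.fieldnames or []
    missing = {n for n in list(required) + list(export) if n not in header}
    if missing:
        return "", ["missing column(s): " + ", ".join(sorted(missing))]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=export, lineterminator="\n")
    writer.writeheader()
    errors = []
    # Header is line 1; assumes one record per line (no multi-line fields).
    for line_no, row in enumerate(reader, start=2):
        bad = []
        for name, convert in required.items():
            value = (row[name] or "").strip()
            try:
                if value == "":
                    raise ValueError("empty value")
                convert(value)  # type check: converter must accept the text
            except ValueError:
                bad.append(name)
        if bad:
            errors.append("line %d: invalid %s" % (line_no, ", ".join(bad)))
            continue  # invalid records are reported, not exported
        record = {}
        for name in export:
            value = row[name] or ""
            if name in obfuscate:
                value = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
            record[name] = value
        writer.writerow(record)
    return out.getvalue(), errors


# Hypothetical sample: line 3 has a non-numeric age and is rejected.
SAMPLE = (
    "id,name,email,age\n"
    "1,Ann,ann@example.com,34\n"
    "2,Bob,bob@example.com,abc\n"
    "3,,cy@example.com,20\n"
)
output, errors = process_csv(
    SAMPLE,
    required={"id": int, "age": int},
    export=["id", "email"],
    obfuscate=("email",),
)
```

Here "id" and "age" are mandatory and must parse as integers, only "id" and
"email" are exported, and "email" is obfuscated in the output.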

I challenge anyone to come up with flows that implement this basic flow,
and to test them and see what I mean. You will see how incomplete and hard
things are, which should not be the case at all. NiFi should be true Lego:
add processors that say they do XPTO and trust that they will. But we keep
finding a lot of nuances.

I don't mind taking 1 day off work to have a meeting with some of you (I
don't know if there's such a thing as a tech lead on the NiFi project?), and
I think it would be urgent to fix the foundations of some processors. Let me
know.



Best Regards,
*Emanuel Oliveira*
