One thing I should mention is that schema inference simply cannot take
advantage of Avro's field aliasing. That's an incredibly powerful feature
that lets you reconcile data sets declaratively, in the schema itself. For
example, I wrote a schema last year that uses aliases to reconcile 9
different CSV data sets into a common model without writing one line of
code. This is all it takes:

{
  "name": "first_name",
  "type": "string",
  "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
}

That one hand-written field definition just reconciled 5 differently named
fields into a common model.
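
To make the effect of the aliases concrete, here is a rough sketch in plain Python. It only mimics alias resolution for CSV headers; it is not Avro's or NiFi's implementation, and the record name "Person" and the helper names build_alias_map and read_csv_with_aliases are made up for illustration.

```python
import csv
import io
import json

# The field definition from above, wrapped in a hypothetical "Person" record.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "first_name",
      "type": "string",
      "aliases": ["FirstName", "First Name", "FIRST_NAME", "fname", "fName"]
    }
  ]
}
"""


def build_alias_map(schema):
    """Map every field name and every alias to the canonical field name."""
    alias_map = {}
    for field in schema["fields"]:
        alias_map[field["name"]] = field["name"]
        for alias in field.get("aliases", []):
            alias_map[alias] = field["name"]
    return alias_map


def read_csv_with_aliases(text, schema):
    """Read CSV text and rename known columns to their canonical names.

    Columns that match no field name or alias are dropped.
    """
    alias_map = build_alias_map(schema)
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append(
            {alias_map[key]: value for key, value in row.items() if key in alias_map}
        )
    return records


schema = json.loads(SCHEMA_JSON)

# Two source files that name the same column differently.
from_source_a = read_csv_with_aliases("FirstName,age\nAda,36\n", schema)
from_source_b = read_csv_with_aliases("fname\nGrace\n", schema)
```

Both inputs end up with a single first_name key, which is the reconciliation the aliases give you once the reader honours them.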

On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <[email protected]> wrote:

> Hi Emanuel,
>
> I think you raise some potentially valid issues that are worth looking at
> in more detail. I can say our experience with NiFi is the exact opposite,
> but part of that is that we are a 100% "schema first" shop. Avro is insanely
> easy to learn, and we've gotten junior data engineers up to speed in a
> matter of days, producing beta-quality data contracts that way.
>
> On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <[email protected]>
> wrote:
>
>> Hi,
>>
>> Based on recent experience, I found it very hard to implement logic that I
>> think should exist out of the box. Instead, it was a slow process of
>> repeatedly discovering that a property on a processor only works for one
>> type of data, even when the processor supports multiple types.
>>
>> I would like you all to take a keep-it-simple attitude and imagine how you
>> would implement a basic scenario such as:
>>
>> *Basic scenario 1 - the following 3 needs should be easy to implement out
>> of the box:*
>> CSV (*get schema automatically via header line*) --> *validate mandatory
>> subset of fields (presence) and (data types)* --> *export subset of fields*
>> or all (but with some of them obfuscated)
>>
>> Problems/workarounds found in 1.9 rc3:
>>
>> *1. processor ValidateRecord*
>> [1.1] *OK* - allows *getting the schema automatically via the header line*
>> and checking a *mandatory subset of fields* (presence) via the 3 schema
>> properties --> suggest renaming the properties to make clear that those at
>> processor level are the "mandatory check", as opposed to the schema on the
>> reader, which is the schema used to read the data.
>> [1.2] *NOK* - does not allow *type validation*. *One could think of using
>> InferSchema, right? The problem is that it only supports JSON.*
>> [1.3] *NOK* - ignores the writer schema, where one could supply a *subset
>> of the original fields* (it always exports all original fields) --> add a
>> property to control whether to export all fields (default) or use the
>> writer schema (with a subset).
>>
>> *2. processor ConvertRecord*
>> [2.1] *OK* - the CSV reader is able to *get the schema from the header* -->
>> maybe improve/add a property to clean up field names (regex
>> search/replace, so we can strip whitespace and anything else that breaks
>> NiFi processors and/or that doesn't interest us).
>> [2.2] *NOK* - missing the *mandatory subset of fields* check.
>> [2.3] *OK* - does a good job converting between formats, and/or *exporting
>> all or a subset of fields via the writer schema*.
>>
>> *3. processor InferAvroSchema*
>> [3.1] NOK - although the property "Input Content Type" lists CSV and JSON
>> as inbound data, in reality the property "Number Of Records To Analyze"
>> only supports JSON. It took us 2 days of debugging to understand the
>> problem. We had 1 CSV with 4k lines, mostly nulls, "1"s or "2"s, but a few
>> records were "true" or "false", meaning the Avro data type should have
>> been [null, string]. But no: as we found out, the type kept being
>> [null, long], with the doc always using the 1st data line in the CSV to
>> determine the field type. This was VERY scary to find out. How can this
>> be considered working as expected? We ended up needing to add 1 more
>> processor to convert CSV into JSON so we could get a proper schema, and
>> even now we are still testing, as it seems all fields got [string] when
>> some columns should be long.
>>
>> I'm not sure of the best way to raise this, but I'm working at enterprise
>> level and, believe me, these small but critical nuances are starting to
>> sour the mood on NiFi.
>> I fell in love with NiFi and I like the idea of graphical design of flows
>> etc., but we really must fix these critical little devils. They are being
>> raised loudly as NiFi problems at management level.
>> I know NiFi is open source, and it's up to us developers to improve it. I
>> just would like to call attention to the fact that, in the middle of PRs
>> and JIRA enhancements, we must be sure we are not forgetting the basic
>> threshold. It doesn't make sense to release a processor with only 50% of
>> its main goal developed when the remaining work would be easy and fast to
>> do (aka InferAvroSchema).
>>
>> As I keep experimenting more and more with NiFi, I am starting to find
>> that the basic quality of its features is below what I think it should
>> be. Better not to release incomplete processors, at least regarding the
>> core function of the processor.
>>
>> I know developers can contribute new code, fixes and enhancements, but is
>> there any gatekeeper team double-checking the deliverables? At a basic
>> level, a developer should provide enough unit tests. Again, with
>> InferAvroSchema being a processor that exports an Avro schema based on
>> either a CSV or JSON, there should obviously be a couple of unit tests
>> with CSVs and JSON containing different data, so we can be sure we get the
>> proper types in the exported Avro schema, right?
>>
>> Above I share some ideas, and I have many more from my day-to-day
>> experience, as I have been working with NiFi at enterprise level for more
>> than 1 year now.
>> Let me know what the right way is to create JIRAs to fix several
>> processors, in order to allow an inexperienced NiFi client developer to
>> accomplish the basic flow of:
>>
>> CSV (*get schema automatically via header line*) --> *validate mandatory
>> subset of fields (presence) and (data types)* --> *export subset of fields*
>> or all (but with some of them obfuscated)
>>
>> I challenge anyone to come up with flows that implement this basic flow,
>> test them, and see what I mean. You will see how incomplete and hard
>> things are, which should not be the case at all. NiFi should be true
>> Lego: add a processor that says it does X and trust that it will. But we
>> keep finding a lot of nuances.
>>
>> I don't mind taking 1 day off my work to have a meeting with some of you.
>> I don't know if there's such a thing as a tech lead on the NiFi project,
>> but I think it would be urgent to fix the foundations of some processors.
>> Let me know.
>>
>>
>>
>> Best Regards,
>> *Emanuel Oliveira*
>>
>
