Ryan,

Thanks for the feedback and suggestions! We will definitely factor all of
this into the design, and when I get a chance I will update the Wiki page
accordingly.
Thanks,
Bryan

On Sat, Aug 15, 2015 at 5:45 PM, Ryan Blue <[email protected]> wrote:

> On 08/12/2015 06:09 PM, Bryan Bende wrote:
>
>> All,
>>
>> Given how popular Avro has become, I'm very interested in making progress
>> on providing first-class support within NiFi. I took a stab at filling in
>> some of the requirements on the Feature Proposal Wiki page [1] and wanted
>> to get feedback from everyone to see if these ideas are headed in the
>> right direction.
>>
>> Are there any major features missing from that list? Any other
>> recommendations?
>>
>> I'm also proposing that we create a new Avro bundle to capture the
>> functionality that is decided upon, and we can consider whether any of
>> the existing Avro-specific functionality in the Kite bundle could
>> eventually move to the Avro bundle. If anyone feels strongly about this,
>> or has an alternative recommendation, let us know.
>>
>> [1] https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support
>>
>> Thanks,
>>
>> Bryan
>
> Thanks for putting this together, Bryan!
>
> I have a few thoughts and observations about the proposal:
>
> * Conversion to Avro is an easier problem than conversion from Avro. Item
> #2 is to convert from Avro to other formats like CSV, but that isn't
> possible for some Avro schemas. For example, Avro supports nested lists
> and maps that have no good representation in CSV, so we'll have to be
> careful about that conversion. It is possible for a lot of data and is
> definitely valuable, though.
>
> * For #3, converting Avro records, I'd also like to see the addition of
> transformation expressions. For example, I might have a timestamp in
> seconds that I need to convert to the Avro timestamp-millis type by
> multiplying the value by 1000.
>
> * There are a few systems, like Flume, that use Avro serialization for
> individual records without the Avro file container. This complicates
> behavior a bit.
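[Editorial aside: the seconds-to-millis transformation Ryan describes can be sketched as a record-level callback. This is a minimal illustration, not NiFi code; the plain dict record and the field name `ts` are assumptions for the example.]

```python
def seconds_to_millis(seconds):
    """Convert an epoch timestamp in seconds to the long value
    expected by Avro's timestamp-millis logical type."""
    return int(round(seconds * 1000))

def transform_record(record, field="ts"):
    """Apply the conversion to one record. Here `record` is a plain
    dict standing in for a deserialized Avro record."""
    out = dict(record)
    out[field] = seconds_to_millis(out[field])
    return out

# A record with an epoch-seconds timestamp becomes timestamp-millis.
print(transform_record({"id": 1, "ts": 1439595600}))
# → {'id': 1, 'ts': 1439595600000}
```

A record-level callback like this is what would let the same transformation expression run unchanged over either a bare record or every record in an Avro file.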
> Your suggestion to have merge/split is great, but we should plan on
> having a couple of scenarios for it:
> - Merge/split between files and bare records with a schema header
> - Merge/split Avro files to produce different-sized files
>
> * The "extract fingerprint" processor could be more general and populate
> a few fields from the Avro header:
> - Schema definition (full, not fingerprint)
> - Schema fingerprint
> - Schema root record name (if the schema is a record)
> - Key/value metadata, like the compression codec
>
> * It looks like #7, evaluate paths, and #8, update records, are intended
> for the case where the content is a bare Avro record. I'm not sure that
> evaluating paths would work for Avro files.
>
> * The update records processor is really similar to the processor to
> convert between Avro schemas, #3. I suggest merging the two and making
> it easy to work with either a file or a record via a record-level
> callback. This would be useful elsewhere as well. Maybe tell the
> difference between file and record by checking for the filename
> attribute?
>
> On the subject of where these processors go, I'm not attached to them
> being in the Kite bundle. It would probably be better to separate that
> out. However, there are some specific features in the Kite bundle that I
> think are really valuable:
> - Use a schema file from an HDFS path (requires Hadoop config)
> - Use the current schema of a dataset/table
>
> Those make it possible to update a table schema and have that change
> propagate to the conversion in NiFi. So if I start receiving a new field
> in my JSON data, I just update a table definition and then the processor
> picks up the change, either automatically or with a restart.
>
> The other complication is that the libraries for reading JSON and CSV
> (and from an InputFormat, if you are interested) are in Kite, so you'll
> have a Kite dependency either way.
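[Editorial aside: the schema fingerprint mentioned above is typically Avro's 64-bit Rabin fingerprint (CRC-64-AVRO), which the Avro specification defines with a small table-driven algorithm. A sketch of that algorithm follows; it fingerprints raw bytes and assumes the caller has already rendered the schema to its Parsing Canonical Form, which is a separate step not shown here.]

```python
# 64-bit Rabin fingerprint (CRC-64-AVRO), per the Avro specification.
EMPTY = 0xC15D213AA4D7A795  # fingerprint of the empty byte string

def _build_table():
    """Precompute the 256-entry lookup table used per input byte."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table

_TABLE = _build_table()

def fingerprint64(data: bytes) -> int:
    """Fingerprint a byte string. For schema fingerprints, `data`
    should be the UTF-8 bytes of the schema's Parsing Canonical Form."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ b) & 0xFF]
    return fp

# Two different (illustrative) canonical-form strings get distinct
# fingerprints; identical input always yields the same value.
print(hex(fingerprint64(b'"int"')))
print(hex(fingerprint64(b'"string"')))
```

Keeping the full schema, the fingerprint, and the root record name as separate attributes, as suggested, lets downstream processors route on the cheap fingerprint while still having the full definition available.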
> We can look at separating the support into stand-alone Kite modules or
> moving it into the upstream Avro project.
>
> Overall, this looks like a great addition!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
