Ryan,

Thanks for the feedback and suggestions! We will definitely factor all of
this into the design, and when I get a chance I will update the Wiki page
accordingly.
Thanks,
Bryan

On Sat, Aug 15, 2015 at 5:45 PM, Ryan Blue <[email protected]> wrote:

> On 08/12/2015 06:09 PM, Bryan Bende wrote:
>
>> All,
>>
>> Given how popular Avro has become, I'm very interested in making progress
>> on providing first-class support within NiFi. I took a stab at filling in
>> some of the requirements on the Feature Proposal Wiki page [1] and wanted
>> to get feedback from everyone to see if these ideas are headed in the
>> right direction.
>>
>> Are there any major features missing from that list? Any other
>> recommendations?
>>
>> I'm also proposing that we create a new Avro bundle to capture the
>> functionality that is decided upon, and we can consider whether any of
>> the existing Avro-specific functionality in the Kite bundle could
>> eventually move to the Avro bundle. If anyone feels strongly about this,
>> or has an alternative recommendation, let us know.
>>
>> [1] https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support
>>
>> Thanks,
>>
>> Bryan
>
> Thanks for putting this together, Bryan!
>
> I have a few thoughts and observations about the proposal:
>
> * Conversion to Avro is an easier problem than conversion from Avro. Item
> #2 is to convert from Avro to other formats like CSV, but that isn't
> possible for some Avro schemas. For example, Avro supports nested lists
> and maps that have no good representation in CSV, so we'll have to be
> careful about that conversion. It is possible for a lot of data and is
> definitely valuable, though.
>
> * For #3, converting Avro records, I'd also like to see the addition of
> transformation expressions. For example, I might have a timestamp in
> seconds that I need to convert to the Avro timestamp-millis type by
> multiplying the value by 1000.
>
> * There are a few systems, like Flume, that use Avro serialization for
> individual records without the Avro file container. This complicates
> behavior a bit.
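[Editorial aside: the seconds-to-millis transformation Ryan describes can be sketched as a record-level callback. This is a minimal illustration, not NiFi code; the plain dict record and the field name `ts` are assumptions for the example.]

```python
def seconds_to_millis(seconds):
    """Convert an epoch timestamp in seconds to the long value
    expected by Avro's timestamp-millis logical type."""
    return int(round(seconds * 1000))

def transform_record(record, field="ts"):
    """Apply the conversion to one record. Here `record` is a plain
    dict standing in for a deserialized Avro record."""
    out = dict(record)
    out[field] = seconds_to_millis(out[field])
    return out

# A record with an epoch-seconds timestamp becomes timestamp-millis.
print(transform_record({"id": 1, "ts": 1439595600}))
# → {'id': 1, 'ts': 1439595600000}
```

A record-level callback like this is what would let the same transformation expression run unchanged over either a bare record or every record in an Avro file.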
> Your suggestion to have merge/split is great, but we should plan on
> having a couple of scenarios for it:
> - Merge/split between files and bare records with a schema header
> - Merge/split Avro files to produce different-sized files
>
> * The "extract fingerprint" processor could be more general and populate
> a few fields from the Avro header:
> - Schema definition (full, not fingerprint)
> - Schema fingerprint
> - Schema root record name (if the schema is a record)
> - Key/value metadata, like the compression codec
>
> * It looks like #7, evaluate paths, and #8, update records, are intended
> for the case where the content is a bare Avro record. I'm not sure that
> evaluating paths would work for Avro files.
>
> * The update records processor is really similar to the processor to
> convert between Avro schemas, #3. I suggest merging the two and making
> it easy to work with either a file or a record via a record-level
> callback. This would be useful elsewhere as well. Maybe tell the
> difference between file and record by checking for the filename
> attribute?
>
> On the subject of where these processors go, I'm not attached to them
> being in the Kite bundle. It would probably be better to separate that
> out. However, there are some specific features in the Kite bundle that I
> think are really valuable:
> - Use a schema file from an HDFS path (requires Hadoop config)
> - Use the current schema of a dataset/table
>
> Those make it possible to update a table schema and have that change
> propagate to the conversion in NiFi. So if I start receiving a new field
> in my JSON data, I just update a table definition and then the processor
> picks up the change, either automatically or with a restart.
>
> The other complication is that the libraries for reading JSON and CSV
> (and from an InputFormat, if you are interested) are in Kite, so you'll
> have a Kite dependency either way.
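[Editorial aside: the schema fingerprint mentioned above is typically Avro's 64-bit Rabin fingerprint (CRC-64-AVRO), which the Avro specification defines with a small table-driven algorithm. A sketch of that algorithm follows; it fingerprints raw bytes and assumes the caller has already rendered the schema to its Parsing Canonical Form, which is a separate step not shown here.]

```python
# 64-bit Rabin fingerprint (CRC-64-AVRO), per the Avro specification.
EMPTY = 0xC15D213AA4D7A795  # fingerprint of the empty byte string

def _build_table():
    """Precompute the 256-entry lookup table used per input byte."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table

_TABLE = _build_table()

def fingerprint64(data: bytes) -> int:
    """Fingerprint a byte string. For schema fingerprints, `data`
    should be the UTF-8 bytes of the schema's Parsing Canonical Form."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ b) & 0xFF]
    return fp

# Two different (illustrative) canonical-form strings get distinct
# fingerprints; identical input always yields the same value.
print(hex(fingerprint64(b'"int"')))
print(hex(fingerprint64(b'"string"')))
```

Keeping the full schema, the fingerprint, and the root record name as separate attributes, as suggested, lets downstream processors route on the cheap fingerprint while still having the full definition available.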
> We can look at separating the support into stand-alone Kite modules or
> moving it into the upstream Avro project.
>
> Overall, this looks like a great addition!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
