Hey Peter,

Finally getting a chance to take a look at some of this stuff. Very cool that you're jumping in on this! I think this has the potential to enable a lot of great new possibilities very quickly and easily. Matt B. and I discussed this concept a while back on the mailing list, but we didn't get into great detail; he may have some great thoughts to contribute as well. I jotted down some thoughts on the design as I looked it over:

* I think we should not think of these as record conversions, but rather as two distinct concepts: record parsing and record serialization. This opens up a lot more possibilities, as we could use a Record Parser service for a more generic PutHBase processor, for instance. Likewise, we could use the Record Serialization service with a source processor like ExecuteSQL. A processor that wants to convert from one format to another would then use both a parser and a serializer. I don't think the RecordConversionService is really needed; it makes more sense to have a single processor do conversions, with these service APIs used by that processor. (Rough sketches of what I mean follow this list.)

* Rather than having a RecordSet that contains a Map<String, Field> getSchema() and a lot of "getSupportsXYZ()" types of methods, I think we should instead have a "Schema getSchema()" method and move the description of what is supported into the Schema object.

* I have written record-parsing libraries before, and while the natural tendency is to create a structure such as Record that holds a bunch of Field objects, this can actually cripple performance when running on large volumes of records. For instance, at just 50 MB/sec of incoming data, with fields averaging 100 bytes each, you're talking half a million Field objects per second that get created and have to be garbage collected. And those are fairly conservative numbers; you could easily reach many millions of Field objects per second. So instead of having Record return a bunch of Field objects, I would recommend accessor methods on Record such as:

    getString(int fieldIndex);
    getString(String fieldName);
    getInt(int fieldIndex);
    getInt(String fieldName);
    getType(int fieldIndex);
    getName(int fieldIndex);

* I would also expect that calling getString(1) would convert the first field to a String using String.valueOf or something like that. I would expect some sort of inferred conversion, but the impl you have here just does a type-cast, which leads to a lot of redundant code in the clients of the library.

* If Record is going to have a 'getUnderlyingObject' method that returns an Object, it may make more sense to make Record generic and return an object of type T.
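To make the Schema/Record ideas concrete, here is a very rough sketch of the shape I have in mind. Every name in it (Schema, Record, the capability flag, etc.) is hypothetical, just for illustration, and not existing NiFi code:

    // Rough sketch only; all names are hypothetical, not an existing NiFi API.
    // (Each public top-level type would live in its own file.)

    /** Field names, types, and "what's supported" flags, built once per record set. */
    public interface Schema {
        int getFieldCount();
        String getFieldName(int fieldIndex);
        Class<?> getFieldType(int fieldIndex); // or a NiFi-defined type enum
        int getFieldIndex(String fieldName);   // e.g. -1 if no such field
        boolean supportsNestedFields();        // example "supports XYZ" flag, moved out of RecordSet
    }

    /** T is the underlying native record type, e.g. an Avro GenericRecord or a JDBC row. */
    public interface Record<T> {
        Schema getSchema();

        // Typed accessors: no Field object is allocated per value. These would
        // do inferred conversion rather than a bare type-cast, so getString on
        // an int field returns String.valueOf(theInt).
        String getString(int fieldIndex);
        String getString(String fieldName);
        int getInt(int fieldIndex);
        int getInt(String fieldName);

        T getUnderlyingObject();
    }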
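And the parser/serializer split might look something like this (again, hypothetical names; RecordSet here is just a stream of Records that share one Schema):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Iterator;

    import org.apache.nifi.controller.ControllerService;

    /** A stream of records that all share one Schema. */
    public interface RecordSet extends Iterator<Record<?>> {
        Schema getSchema();
    }

    /** Parsing side: a generic PutHBase-style processor would need only this. */
    public interface RecordParser extends ControllerService {
        RecordSet parse(InputStream in) throws IOException;
    }

    /** Serialization side: a source processor like ExecuteSQL would need only this. */
    public interface RecordSerializer extends ControllerService {
        void serialize(RecordSet records, OutputStream out) throws IOException;
    }

A format-conversion processor would then just be configured with one RecordParser and one RecordSerializer, and a ResultSet could be wrapped directly as a Record's underlying object, convert-on-read style, per Matt's note below.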
I know this is quite a bit of feedback. Hopefully not too overwhelming :) There's a whole lot of good stuff here, and I think this is going to turn out to be an awesome addition to NiFi. Very excited about it!

-Mark

> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <[email protected]> wrote:
>
> Matt,
>
> I've re-architected it a bit. I'll admit that I haven't tested this new design, but it makes sense; I wanted to get your feedback. Let me know if this somewhat lines up with your vision.
>
> https://github.com/patricker/nifi/tree/RowProcessorService
>
> Then we can have cool processors like `RecordSetConverter` that can convert between arbitrary types, and `RecordSetToSQL`, etc. Could be really cool.
>
> --Peter
>
> -----Original Message-----
> From: Peter Wicks (pwicks)
> Sent: Monday, August 22, 2016 8:57 PM
> To: '[email protected]' <[email protected]>
> Subject: RE: Looking for feedback on my WIP Design
>
> Matt,
>
> When you put it that way it sounds like we'll be taking over the world by breakfast...
>
> It sounds like this has a lot of other potential uses, such as a generic conversion processor instead of all these one-off a-to-b conversion processors that are floating around.
> I hadn't really thought that far, but I can see you went there straight away.
>
> I'll put some more work into it along the lines you outlined and check back in.
>
> Thanks,
> Peter
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Monday, August 22, 2016 7:58 PM
> To: [email protected]
> Subject: Re: Looking for feedback on my WIP Design
>
> Peter,
>
> First of all, great work! I couldn't find an Apache Jira for this, but I remember seeing something on the dev email list about perhaps having a ControllerService for arbitrary conversions.
>
> I took a look at the commit; first things first, it looks good for the use case, thanks much! A handful of notes:
>
> - As you know, the critical part is your "RowConverter". To be most useful, we should make sure that (as Java has done over time with its supported SQL types) we can refer to structured types in some common language. So "startNewFile()" might be better as "newRecord()", and addRow() might be better expressed as a NiFi-defined interface. ResultSets could have a simple wrapper implementing the generic interface that would let the code consume them the same as any other object.
>
> - For a common language, perhaps we could refer to structured data as "records" and individual fields (also perhaps nested) as "fields".
>
> - With a common NiFi-defined API for getting records and fields, we could implement all the converters you mention without any explicit dependency on an implementation NAR/JAR.
>
> - We should avoid explicit conversion of the input format to any intermediate format; in other words, the interface should follow the Adapter or Facade pattern and convert-on-read.
>
> - I'm sure I've forgotten some of the format details, but I'll list the formats for discussion with respect to conversions:
>
> 1) XML
> 2) JSON
> 3) CSV (and TSV and other text-based delimited files)
> 4) SQL ResultSet
> 5) CQL ResultSet (Cassandra)
> 6) Elasticsearch result set
> 7) Avro
> 8) Parquet
>
> If we can cover these with a single interface and 8(-ish) implementations, then I think we're in good shape for world domination :) Not being sarcastic, I'm saying let's make it happen!
>
> Regards,
> Matt
>
> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <[email protected]> wrote:
>> I'm working on a change to QueryDatabaseTable (which would eventually apply to ExecuteSQL as well) to allow users to choose the output format, so something besides just Avro. I'm planning to put in a ticket soon for my work.
>>
>> Goals:
>>
>> - If this update goes out, no user should be affected, as the defaults will work the way they did before.
>>
>> - Don't want to muck up the dependencies of standard processors with lots of other libraries to support multiple formats. As such, I have implemented it using Controller Services to make the converters pluggable.
>>
>> - Would like to support these output formats to start:
>>
>> o Avro (This one is already in my commit, as I copy/pasted the code over; but the logic has now been moved to an appropriate library, which I like)
>>
>> o ORC (Would be implemented in the Hive library)
>>
>> o JSON (No idea where I would put this processor, unless it's in a new module)
>>
>> o Delimited (No idea where I would put this processor, unless it's in a new module)
>>
>> Here is my commit:
>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712cd9118fb4f7f
>>
>> I'd appreciate feedback on design/implementation.
>> I have the guts in there of how I was planning to implement this.
>>
>> Thanks,
>> Peter
