Mark,
Thanks for the overwhelming feedback :)
I mostly agree with what you've written. My plan has been to allow ad hoc use
of the parsers/serializers. I haven't pushed to GitHub in a bit so I don't
remember exactly what's out there, but right now my ControllerService is set
up so that you can use it for either or both:
public interface RecordConversionService extends ControllerService {
    // Serialization side: obtain a converter that writes records out
    RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;

    // Parsing side: read records from an incoming FlowFile
    RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
}
Maybe it just needs a rename? Maybe just call it `RecordService`. If you think
it makes more sense from a usability/API standpoint, I have no problem with
splitting this up into two ControllerServices (parse/serialize); see the
sketch below.
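If we did split it, a minimal sketch might look like this (the service names
here are just illustrative, not anything I've committed):

public interface RecordParserService extends ControllerService {
    // Parse records out of an incoming FlowFile
    RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
}

public interface RecordSerializerService extends ControllerService {
    // Obtain a converter that serializes records to the target format
    RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;
}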
On Schemas and Fields, I completely agree. It felt a little off to me; doing
it your way makes more sense and also aligns with how other implementations
do it (ResultSet, Avro, etc.).
You did slightly confuse me when you brought up `String.valueOf`:
String.valueOf just calls Object.toString internally, and calling toString on
an Object that is already a String returns the same object reference (no
conversion of any kind), so at that point String.valueOf(o) and (String) o
appear equivalent.
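For example (assuming a non-null value):

Object o = "already a string";
String viaValueOf = String.valueOf(o); // toString() on a String returns this
String viaCast = (String) o;           // direct cast, no conversion
assert viaValueOf == viaCast;          // literally the same reference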
Also, while value types like Integer do have a `valueOf` method, those
methods generally accept Strings, so if you take that route and convert the
value to a String and then back to the actual object type, you risk losing
data to Java's default String formatting (hypothetically). Whereas if we know
it's already of that object type under the hood, we can cast directly. Can
you be a bit more specific about how you'd like to see data type conversions
handled and what the issue is with direct casts? I think all of the automated
interfaces will just use getObject anyway; the other methods are more for
custom processor development.
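To make the round-trip concern concrete, here's a hypothetical example where
the default String formatting drops information (java.util.Date.toString()
omits the milliseconds):

java.util.Date original = new java.util.Date(1472943060123L);
String s = String.valueOf(original);    // "Sat Sep 03 ..." -- no milliseconds
// There is no way to parse s back into a Date carrying the original 123 ms.
Object boxed = original;
java.util.Date direct = (java.util.Date) boxed; // direct cast: nothing lost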
On getUnderlyingObject, I don't think making it generic really helps anyone.
I wasn't going to include this getter at all, but my original code was built
around QueryDatabaseTable, which needs to pass the ResultSet object to the
max value collector, so I needed an easy way to access it. If you need the
underlying object internally (within an implementation-specific
parser/serializer), you will probably just reference the typed field copy or
an implementation-specific getter. If you are accessing it externally (from a
custom processor), you'll be going through the interfaces, so you'd have to
know exactly what type the underlying object was to begin with for a generic
to help you, or put in custom logic based on the name of the converter.
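To put that in code (the generic variant here is hypothetical, just to show
the point):

// Even if Record were Record<T>, an external processor only sees the
// interface, so the type parameter is unknown at the call site anyway:
Record<?> record = recordSet.next();               // hypothetical accessor
Object underlying = record.getUnderlyingObject();  // still just an Object
if (underlying instanceof java.sql.ResultSet) {    // caller must already know the type
    // e.g. hand the ResultSet to QueryDatabaseTable's max value collector
}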
Thanks,
Peter
-----Original Message-----
From: Mark Payne [mailto:[email protected]]
Sent: Saturday, September 03, 2016 6:51 PM
To: [email protected]
Subject: Re: Looking for feedback on my WIP Design
Hey Peter,
Finally getting a chance to take a look at some of this stuff. Very cool that
you're jumping in on this! I think this has the potential to really enable a
lot of great new possibilities very quickly and easily. Matt B. and I discussed
this concept a while back on the mailing list a bit but didn't get into great
detail. He may have some great thoughts that he can contribute, as well. I
jotted down some thoughts that I had on the design as I looked over it...
* I think we should not really think of these as record conversions, but
rather as two distinct concepts - record parsing and record serialization.
This opens up a lot more possibilities, as we could use a Record Parser
service for a more generic PutHBase processor, for instance. We could likely
use the Record Serialization service with a source processor like ExecuteSQL.
And then a processor that wants to convert from one format to another could
use both a parser and a serializer. I don't think the RecordConversionService
is really needed, as I think it makes more sense to just have a single
processor do the conversion, and these service APIs would be used by that
processor.
* Rather than having a RecordSet that exposes a Map<String, Field>
getSchema() method and a lot of "getSupportsXYZ()"-style methods, I think we
should instead have a "Schema getSchema()" method and move the description of
what is supported into the Schema object.
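Something like this, perhaps (all names are just illustrative):

public interface Schema {
    java.util.List<String> getFieldNames();
    DataType getDataType(String fieldName);  // DataType is hypothetical here
    boolean isTypeSupported(DataType type);  // the "getSupportsXYZ" checks move here
}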
* I have written record parsing libraries before, and while the natural
tendency is to create a structure such as Record that holds a bunch of Field
objects, this can actually cripple performance when running on large volumes
of records. For instance, if you were running just 50 MB/sec of incoming
data, and let's say your fields average 100 bytes each, you're talking half a
million Field objects per second that get created and have to be garbage
collected. And these are fairly conservative numbers - you could easily reach
many millions of Field objects per second. So the approach that I would
recommend is, instead of having Record return a bunch of Field objects,
having accessor methods on Record such as:
    getString(int fieldIndex);
    getString(String fieldName);
    getInt(int fieldIndex);
    getInt(String fieldName);
    getType(int fieldIndex);
    getName(int fieldIndex);
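Pulled together as an interface, that might look something like this (the
return types are my guesses):

public interface Record {
    String getString(int fieldIndex);
    String getString(String fieldName);
    int getInt(int fieldIndex);
    int getInt(String fieldName);
    DataType getType(int fieldIndex);  // DataType as in the Schema sketch (hypothetical)
    String getName(int fieldIndex);
}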
* I would also expect that calling getString(1) would convert the first field
to a String by using String.valueOf or something like that. I would expect
some sort of inferred conversion, but the impl you have here just does a
type-cast, which can lead to a lot of redundant code in the clients of the
library.
* If Record is going to have a 'getUnderlyingObject' method that returns an
Object, it may make more sense to instead make Record generic and return an
object of type T?
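I.e., something like:

public interface Record<T> {
    T getUnderlyingObject();
    // ... field accessors as above ...
}

A ResultSet-backed implementation would then be a Record<ResultSet>.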
I know this is quite a bit of feedback. Hopefully not too overwhelming :)
There's a whole lot of good stuff here, and I think this is going to turn out
to be an awesome addition to NiFi. Very excited about it!
-Mark
> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <[email protected]> wrote:
>
> Matt,
>
> I've re-architected it a bit. I'll admit that I haven't tested this new
> design, but it makes sense, and I wanted to get your feedback. Let me know
> if this somewhat lines up with your vision.
>
> https://github.com/patricker/nifi/tree/RowProcessorService
>
> Then we can have cool processors like `RecordSetConverter` that can convert
> between arbitrary types, and `RecordSetToSQL` etc... Could be really cool.
>
> --Peter
>
> -----Original Message-----
> From: Peter Wicks (pwicks)
> Sent: Monday, August 22, 2016 8:57 PM
> To: '[email protected]' <[email protected]>
> Subject: RE: Looking for feedback on my WIP Design
>
> Matt,
>
> When you put it that way it sounds like we'll be taking over the world by
> breakfast...
>
> It sounds like this has a lot of other potential uses, such as a generic
> conversion processor instead of all these one-off a-to-b conversion
> processors that are floating around. I hadn't really thought that far, but I
> can see you went there straight away.
>
> I'll put some more work into it along the lines you outlined and check back
> in.
>
> Thanks,
> Peter
>
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]]
> Sent: Monday, August 22, 2016 7:58 PM
> To: [email protected]
> Subject: Re: Looking for feedback on my WIP Design
>
> Peter,
>
> First of all, great work! I couldn't find an Apache Jira for this, but I
> remember seeing something in the dev email list about perhaps having a
> ControllerService for arbitrary conversions.
>
> I took a look at the commit; first things first, looks good for the use case,
> thanks much! A handful of notes:
>
> - As you know, the critical part is your "RowConverter". To be most
> useful, we should make sure (as Java has done, over time, with its
> supported SQL types) that we can refer to structured types in some common
> language. So "startNewFile()" might be better as "newRecord()", and
> addRow() might be better expressed as a NiFi-defined interface.
> ResultSets could have a simple wrapper implementing the generic interface
> that would let the code consume them the same as any other object.
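>
> For example, the ResultSet wrapper might look roughly like this (all names
> are hypothetical, including the Record interface it implements):
>
> public class ResultSetRecord implements Record {
>     private final java.sql.ResultSet rs;
>
>     public ResultSetRecord(java.sql.ResultSet rs) {
>         this.rs = rs;
>     }
>
>     public String getString(String fieldName) {
>         try {
>             return rs.getString(fieldName); // delegate straight to the ResultSet
>         } catch (java.sql.SQLException e) {
>             throw new RuntimeException(e);
>         }
>     }
>     // ... other field accessors delegate similarly ...
> }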
>
> - For a common language, perhaps we could refer to structured data as
> "records" and individual fields (also perhaps nested) as "fields".
>
> - With a common NiFi-defined API for getting records and fields, we could
> implement all the converters you mention without any explicit dependency on
> an implementation NAR/JAR.
>
> - We should avoid the explicit conversion of input format to any intermediate
> format; in other words, the interface should follow the Adapter or Facade
> pattern, and should convert-on-read.
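>
> As a sketch of convert-on-read (hypothetical names again): a JSON-backed
> record would hold the parsed tree and only convert a field when asked,
> never materializing an intermediate format:
>
> public class JsonRecord implements Record {
>     private final com.fasterxml.jackson.databind.JsonNode node;
>
>     public JsonRecord(com.fasterxml.jackson.databind.JsonNode node) {
>         this.node = node;
>     }
>
>     public String getString(String fieldName) {
>         // Converted at read time; null handling omitted for brevity
>         return node.get(fieldName).asText();
>     }
> }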
>
> - I'm sure I've forgotten some of the format details, but I'll list them
> for discussion with respect to conversions:
>
> 1) XML
> 2) JSON
> 3) CSV (and TSV and other text-based delimited files)
> 4) SQL ResultSet
> 5) CQL ResultSet (Cassandra)
> 6) Elasticsearch result set
> 7) Avro
> 8) Parquet
>
> If we can cover these with a single interface and 8(-ish)
> implementations, then I think we're in good shape for world domination
> :) Not being sarcastic, I'm saying let's make it happen!
>
> Regards,
> Matt
>
> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <[email protected]>
> wrote:
>> I'm working on a change to QueryDatabaseTable (and eventually would apply to
>> ExecuteSQL) to allow users to choose the output format, so something besides
>> just Avro. I'm planning to put in a ticket soon for my work.
>>
>> Goals:
>>
>> - If this update goes out no user should be affected, as defaults
>> will work the way they have before.
>>
>> - Don't want to muck up dependencies of standard processors with
>> lots of other libraries to support multiple formats. As such I have
>> implemented it using Controller Services to make the converters pluggable.
>>
>> - Would like to support these output formats to start:
>>
>> o Avro (This one is already in my commit, as I copy/pasted the code over;
>> but the logic has now been moved to an appropriate library, which I like)
>>
>> o ORC (Would be implemented in the HIVE library)
>>
>> o JSON (No idea where I would put this processor, unless it's in a new
>> module)
>>
>> o Delimited (No idea where I would put this processor, unless it's in a
>> new module)
>>
>> Here is my commit:
>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712cd9118fb4f7f
>>
>> I'd appreciate feedback on design/implementation. I have the guts in there
>> of how I was planning to implement this.
>>
>> Thanks,
>> Peter