Hey Peter,

Thanks for the quick reply!

So the way you're outlining the RecordConversionService here, I believe the
idea is that this service is more of a "factory" for creating a parser or a
serializer? Is that accurate? If so, I'm okay with that. The RecordConverter
name sounds to me like something you would hand a JSON blob and it would
spit out an XML blob, or something like that... it may just be the
terminology that I'm not understanding. Maybe some JavaDocs will clear that
all up once we better understand the direction we're heading.

But if that is the idea, to have this work as a factory, I would recommend
we also have isRecordConverterSupported() and isRecordSetSupported() types
of methods, as I can definitely see someone implementing some sort of
parsing functionality but not providing the ability to write data out in a
specific format (for example, reading metadata from a PDF file but having
no ability to write arbitrary data to a PDF, or something like that).
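
Just to make sure we're picturing the same shape, here's a rough sketch
building on your interface below (the two is...Supported() names are just
placeholders):

public interface RecordConversionService extends ControllerService {
    // Lets an implementation advertise parse-only or serialize-only support
    boolean isRecordConverterSupported();
    boolean isRecordSetSupported();

    RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;
    RecordSet getRecordSet(FlowFile flowFile, ProcessSession session);
}

That way the PDF-style implementation could return true for parsing but
false for serializing, and a processor could check support up front (e.g.
during validation) instead of blowing up at runtime.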

My suggestion of using String.valueOf was to account for the case when the
value you're obtaining is not actually a String. If, for instance, I call
getString() and the value is actually a Long, then String.valueOf() would
return a String representation, whereas attempting to cast would throw a
ClassCastException. Similarly, if I call getLong() and the value is actually
an Integer, we should certainly be able to widen it to a long, but casting
the boxed Integer to Long will throw a ClassCastException, and this can lead
to some really odd problems when debugging.
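
To make the failure mode concrete, a contrived snippet (not from your
branch):

final Object value = Long.valueOf(42L);
String s = String.valueOf(value);   // "42" - conversion works
String t = (String) value;          // throws ClassCastException at runtime

final Object count = Integer.valueOf(7);
long l = ((Number) count).longValue();  // 7 - widening via Number works
Long m = (Long) count;                  // throws ClassCastException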

Now, if the intent is to always use getObject(), then I would recommend we
get rid of getString(), getInteger(), etc. altogether and let the user
explicitly cast the value themselves if they need it cast. But I think it
would be fairly common to want to convert these value types, so it seems
reasonable to me to include the conversion in the lib itself.

I do like the getUnderlyingObject() idea, actually. Imagine the case where
we want to run some filtering logic: we may want to take in a record, check
whether it passes the filter, and if so, pass the original object on to the
serializer.
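
Totally hypothetical processor code, just to show the shape of it:

final Record record = recordSet.next();
if (passesFilter(record)) {
    // Pass the original object straight through to the serializer rather
    // than rebuilding the record field-by-field
    serializer.write(record.getUnderlyingObject());
}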

Thanks
-Mark

> On Sep 6, 2016, at 1:38 PM, Peter Wicks (pwicks) <[email protected]> wrote:
> 
> Mark,
> 
> Thanks for the overwhelming feedback :)
> 
> I mostly agree with what you've written.  My plan has been to allow adhoc use 
> of the parsers/serializers.  I haven't pushed to Github in a bit so I don't 
> 100% remember what's out there, but right now my ControllerService is setup 
> so that you can use it for either/both:
> 
> public interface RecordConversionService extends ControllerService {
>     RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;
>     RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
> }
> 
> Maybe it just needs a rename? Maybe just call it `RecordService`.  If you 
> think it makes more sense from a usability/API standpoint I have no problem 
> with splitting this up into two ControllerServices (parse/serialize).
> 
> On Schemas and Fields, I completely agree.  It felt a little off to me, and 
> doing it your way makes more sense, and also aligns with how other 
> implementations do it (ResultSet/Avro/etc...).
> 
> You did slightly confuse me when you brought up `String.valueOf`, since 
> String.valueOf just calls Object.toString internally, and calling toString on 
> an Object that is already a String returns the same object reference (no 
> conversion of any kind); so at that point String.valueOf(o) and (String)o 
> appear equivalent.
> 
> Also, while value types like Integer do have a `valueOf` method, they 
> generally accept Strings, so if you take that route and convert the type to 
> String and then back to the actual object type you risk losing data due to 
> Java default String formatting (hypothetically). Whereas if we know it's 
> already of that Object type under the hood we can cast directly.  Can you 
> be a bit more specific about how you'd like to see data type conversions 
> handled and what the issue is with direct casts?  I think all of the 
> automated interfaces will just use getObject anyway; the other methods are 
> more for custom processor development.
> 
> On getUnderlyingObject, I don't think making it generic really helps 
> anyone.  I wasn't going to include this getter at all, but my original code 
> was built around QueryDatabaseTable, which needs to pass the ResultSet 
> object to the Max Value collector, so I needed an easy way to access it.  
> If you need the underlying object internally (for an implementation-specific 
> parser/serializer), you will probably just reference the typed field copy 
> or an implementation-specific getter. If you are accessing it externally 
> (from a custom processor), you'll be working through the interfaces, so 
> you'd have to know exactly what type the underlying object was to begin 
> with for it to help you, or put in custom logic based on the name of the 
> converter.
> 
> Thanks,
>  Peter
> 
> -----Original Message-----
> From: Mark Payne [mailto:[email protected]] 
> Sent: Saturday, September 03, 2016 6:51 PM
> To: [email protected]
> Subject: Re: Looking for feedback on my WIP Design
> 
> Hey Peter,
> 
> Finally getting a chance to take a look at some of this stuff. Very cool that 
> you're jumping in on this! I think this has the potential to really enable a 
> lot of great new possibilities very quickly and easily. Matt B. and I 
> discussed this concept a bit on the mailing list a while back but didn't 
> get into great detail. He may have some great thoughts to contribute as 
> well. I jotted down some thoughts on the design as I looked it over...
> 
> * I think we should not really think of these as record conversions, but 
> rather as two distinct concepts - record parsing and record serialization. 
> This opens up a lot more possibilities, as we could use a Record Parser 
> service for a more generic PutHBase processor, for instance. Likewise, we 
> could use the Record Serialization service with a source processor like 
> ExecuteSQL. And then a processor that wants to convert from one format to 
> another could use both a parser and a serializer. I don't think the 
> RecordConversionService is really needed, as I think it makes more sense 
> to just have a single processor do the conversion, and these service APIs 
> would be used by that processor.
> 
> * Rather than having a RecordSet that contains a Map<String, Field> 
> getSchema() and a lot of "getSupportsXYZ()" types of methods, I think we 
> should instead have a "Schema getSchema()" method and move the description of 
> what is supported into the Schema object.
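> 
> A rough sketch of the shape I'm picturing (all names hypothetical):
> 
> public interface Schema {
>     int getFieldCount();
>     String getFieldName(int fieldIndex);
>     int getFieldType(int fieldIndex);  // e.g. a java.sql.Types-style constant
> }
> 
> public interface RecordSet {
>     Schema getSchema();
>     Record next();
> }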
> 
> * I have written record parsing libraries before, and while the natural 
> tendency is to create a structure such as Record that has a bunch of Field 
> objects, this can actually cripple performance when running on large volumes 
> of records. For instance, if you were running just 50 MB/sec. of incoming 
> data, and let's say that you have fields that average 100 bytes each, you're 
> talking half a million Field objects per second that get created and have to 
> be garbage collected. And these are fairly conservative numbers - you could 
> easily reach many million Field objects per second. So the approach that I 
> would recommend is instead of having Record return a bunch of Field objects, 
> instead have accessor methods on Record such as:
> 
> getString(int fieldIndex);
> getString(String fieldName);
> getInt(int fieldIndex);
> getInt(String fieldName);
> getType(int fieldIndex);
> getName(int fieldIndex);
> 
> * I would also expect that calling getString(1) would convert the first 
> field to a String by using String.valueOf or something like that. I would 
> expect some sort of inferred conversion, but the impl you have here just 
> does a type-cast, which can lead to a lot of redundant code in the clients 
> of the library.
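> 
> For instance, a minimal sketch of the conversion I'd expect (hypothetical, 
> not from your branch; assumes a getObject(int) accessor):
> 
> public String getString(final int fieldIndex) {
>     final Object value = getObject(fieldIndex);
>     // Convert rather than cast, so a Long value still yields "123"
>     return value == null ? null : String.valueOf(value);
> }
> 
> public Long getLong(final int fieldIndex) {
>     final Object value = getObject(fieldIndex);
>     // Widen any Number (Integer, Short, etc.) instead of casting to Long;
>     // non-numeric types are not handled in this sketch
>     return value == null ? null : ((Number) value).longValue();
> }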
> 
> * If Record is going to have a 'getUnderlyingObject' method that returns an 
> Object, it may make more sense to instead make Record generic and return an 
> object of type T?
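> 
> i.e., something like (minimal sketch):
> 
> public interface Record<T> {
>     // The source object this record was parsed from (e.g. a ResultSet row)
>     T getUnderlyingObject();
> }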
> 
> I know this is quite a bit of feedback. Hopefully not too overwhelming :) 
> There's a whole lot of good stuff here, and I think this is going to turn out 
> to be an awesome addition to NiFi. Very excited about it!
> 
> 
> -Mark
> 
> 
> 
>> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <[email protected]> wrote:
>> 
>> Matt,
>> 
>> I've re-architected it a bit.  I'll admit that I haven't tested this new 
>> design, but it makes sense; I wanted to get your feedback.  Let me know if 
>> this somewhat lines up with your vision.
>> 
>> https://github.com/patricker/nifi/tree/RowProcessorService
>> 
>> Then we can have cool processors like `RecordSetConverter` that can convert 
>> between arbitrary types, and `RecordSetToSQL` etc...  Could be really cool.
>> 
>> --Peter
>> 
>> -----Original Message-----
>> From: Peter Wicks (pwicks)
>> Sent: Monday, August 22, 2016 8:57 PM
>> To: '[email protected]' <[email protected]>
>> Subject: RE: Looking for feedback on my WIP Design
>> 
>> Matt,
>> 
>> When you put it that way it sounds like we'll be taking over the world by 
>> breakfast...
>> 
>> It sounds like this has a lot of other potential uses, such as a generic 
>> conversion processor instead of all these one-off a-to-b conversion 
>> processors that are floating around. I hadn't really thought that far, but I 
>> can see you went there straight away.
>> 
>> I'll put some more work into it along the lines you outlined and check back 
>> in.
>> 
>> Thanks,
>> Peter
>> 
>> -----Original Message-----
>> From: Matt Burgess [mailto:[email protected]]
>> Sent: Monday, August 22, 2016 7:58 PM
>> To: [email protected]
>> Subject: Re: Looking for feedback on my WIP Design
>> 
>> Peter,
>> 
>> First of all, great work! I couldn't find an Apache Jira for this, but I 
>> remember seeing something in the dev email list about perhaps having a 
>> ControllerService for arbitrary conversions.
>> 
>> I took a look at the commit; first things first, looks good for the use 
>> case, thanks much!  A handful of notes:
>> 
>> - As you know, the critical part is your "RowConverter".  To be most 
>> useful, we should make sure we can refer to structured types in some 
>> common language (much as Java has done, over time, with its supported SQL 
>> types).  So "startNewFile()" might be better as "newRecord()", and 
>> addRow() might be better expressed as a NiFi-defined interface. 
>> ResultSets could have a simple wrapper implementing the generic interface 
>> that would let the code consume them the same as any other object.
>> 
>> - For a common language, perhaps we could refer to structured data as 
>> "records" and individual fields (also perhaps nested) as "fields".
>> 
>> - With a common NiFi-defined API for getting records and fields, we could 
>> implement all the converters you mention without any explicit dependency on 
>> an implementation NAR/JAR.
>> 
>> - We should avoid the explicit conversion of input format to any 
>> intermediate format; in other words, the interface should follow the Adapter 
>> or Facade pattern, and should convert-on-read.
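>> 
>> As a rough illustration of that wrapper idea (the Record interface and 
>> getField method here are placeholders, not settled API):
>> 
>> import java.io.IOException;
>> import java.sql.ResultSet;
>> import java.sql.SQLException;
>> 
>> public class ResultSetRecord implements Record {
>>     private final ResultSet resultSet;
>> 
>>     public ResultSetRecord(final ResultSet resultSet) {
>>         this.resultSet = resultSet;
>>     }
>> 
>>     @Override
>>     public Object getField(final String fieldName) throws IOException {
>>         try {
>>             // Read straight from the ResultSet on demand: nothing is
>>             // copied into an intermediate format
>>             return resultSet.getObject(fieldName);
>>         } catch (final SQLException e) {
>>             throw new IOException(e);
>>         }
>>     }
>> }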
>> 
>> - I'm sure I've forgotten some of the formats, but I'll list the ones I 
>> can think of for discussion with respect to conversions:
>> 
>> 1) XML
>> 2) JSON
>> 3) CSV (and TSV and other text-based delimited files)
>> 4) SQL ResultSet
>> 5) CQL ResultSet (Cassandra)
>> 6) Elasticsearch result set
>> 7) Avro
>> 8) Parquet
>> 
>> If we can cover these with a single interface and 8(-ish) 
>> implementations, then I think we're in good shape for world domination
>> :) Not being sarcastic, I'm saying let's make it happen!
>> 
>> Regards,
>> Matt
>> 
>> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <[email protected]> 
>> wrote:
>>> I'm working on a change to QueryDatabaseTable (and eventually would apply 
>>> to ExecuteSQL) to allow users to choose the output format, so something 
>>> besides just Avro.  I'm planning to put in a ticket soon for my work.
>>> 
>>> Goals:
>>> 
>>> - If this update goes out no user should be affected, as defaults will 
>>> work the way they have before.
>>> 
>>> - Don't want to muck up dependencies of standard processors with lots of 
>>> other libraries to support multiple formats. As such I have implemented 
>>> it using Controller Services to make the converters pluggable.
>>> 
>>> - Would like to support these output formats to start:
>>> 
>>>   o Avro (this one is already in my commit, as I copy/pasted the code 
>>>     over; but the logic has now been moved to an appropriate library, 
>>>     which I like)
>>>   o ORC (would be implemented in the Hive library)
>>>   o JSON (no idea where I would put this, unless it's in a new module)
>>>   o Delimited (no idea where I would put this, unless it's in a new 
>>>     module)
>>> 
>>> Here is my commit: 
>>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712cd9118fb4f7f
>>> 
>>> I'd appreciate feedback on design/implementation.  I have the guts in there 
>>> of how I was planning to implement this.
>>> 
>>> Thanks,
>>> Peter
> 
