Looks good.  I definitely agree with shrinking the message size.  We can keep 
this to a notification and let the client go to the metastore to get the 
information it cares about.  

One comment I would make is we should consider that in time we would like to 
move this away from just sending messages via JMS to sending them via other 
messaging protocols as well (HTTP, Kafka, etc.).  So we don't want to do 
anything that binds this more tightly to JMS or ActiveMQ.  I don't see anything 
in these changes that does, but I think it's good to call that out as a design 
goal.
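
One way to keep that goal visible in code is to have the notification logic 
depend only on a small transport interface.  What follows is a minimal Python 
sketch, not a proposal for the actual implementation: Transport, 
InMemoryTransport, notify_add_partition and the topic naming are all 
hypothetical, nothing from HCatalog today.

```python
import json
from typing import Dict, List, Protocol, Tuple


class Transport(Protocol):
    """Hypothetical interface: anything that can deliver a text payload to a topic."""

    def send(self, topic: str, payload: str) -> None: ...


class InMemoryTransport:
    """Stand-in for a JMS, HTTP, or Kafka sender; it only records what was sent."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, topic: str, payload: str) -> None:
        self.sent.append((topic, payload))


def notify_add_partition(transport: Transport, db: str, table: str,
                         partitions: List[Dict[str, str]]) -> None:
    """Build the notification once; the transport decides how it travels."""
    body = {"add_partition": {"db": db, "table": table, "partitions": partitions}}
    # Topic naming ("db.table") is an assumption made for this sketch.
    transport.send(f"{db}.{table}", json.dumps(body))
```

Swapping JMS for another protocol would then mean supplying a different 
Transport, with the message-building code left untouched.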

Alan.

On Oct 30, 2012, at 2:26 PM, Mithun Radhakrishnan wrote:

> Hello, HCat-Dev.
> 
> I'm working on modifying the HCat messages (sent over JMS/ActiveMQ, for 
> partition-add/delete) so that clients (such as
> Oozie) would have an easier time with consumption.
> Here are some limitations of what's available currently:
> 1. The present implementation in HCatalog (branch-0.4/) seems to send the 
> entire Partition (Java) instance in serialized fashion. Since the 
> partition-parameters, hdfs-location etc. are all serialized, the messages are 
> rather, emm, garrulous.
> 2. There doesn't seem to be any support for versioning either. So when new 
> fields are added, older clients won't work at all without update.
> 
> Could we consider transmitting only that info which identifies the partitions 
> that pertain to the operation (e.g. partition keys), and drop any information 
> that might be gathered from querying the metadata (e.g. storage location, 
> partition-parameters, etc.)?
> 
> We're also considering that the initial implementation encode the ActiveMQ 
> payload in JSON.  Here's an example of the proposed message format for an 
> "add_partition" operation:
> 
> "add_partition": {
>   "hcat_server" : "thrift://my.hcat.server:9080",
>   "hcat_service_principal" : "hcat/[email protected]",
>   "db": "default",
>   "table": "starling_jobs",
>   "partitions":
>     [
>       {"grid": "AxoniteBlue", "dt": "2012_10_25"},// Sets of partition-keys.
>       {"grid": "AxoniteBlue", "dt": "2012_10_26"},
>       {"grid": "AxoniteBlue", "dt": "2012_10_27"},
>       {"grid": "AxoniteBlue", "dt": "2012_10_28"}
>     ],
>   "timestamp": "1351534729" // In this case, interpreted as creation-time.
> }
> 
> If we continue to use JMS MapMessages, we could consider having 3 keys in the 
> map:
> 1. version = "1" (for the first implementation. Increment as we go.)
> 2. format = "json" (We could consider adding different formats if we choose.)
> 3. message = <the json message body, as above.>
> 
> The version and format help a factory choose the right implementation to 
> deserialize the message. (A client-side library we supply to Oozie should 
> hide this and provide POJOs.)
> 
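
For illustration, the three-key envelope and the factory lookup described 
above might look roughly like this.  It is a Python sketch under assumptions: 
a plain dict stands in for the JMS MapMessage, and DESERIALIZERS, wrap and 
unwrap are hypothetical names, not part of any existing client library.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Registry keyed by (version, format); each entry turns the raw body into a dict.
DESERIALIZERS: Dict[Tuple[str, str], Callable[[str], Dict[str, Any]]] = {
    ("1", "json"): json.loads,
}


def wrap(body: Dict[str, Any]) -> Dict[str, str]:
    """Build the three-key map (what would go into a JMS MapMessage)."""
    return {"version": "1", "format": "json", "message": json.dumps(body)}


def unwrap(envelope: Dict[str, str]) -> Tuple[str, Dict[str, Any]]:
    """Pick a deserializer from version and format, then split operation from record."""
    key = (envelope["version"], envelope["format"])
    if key not in DESERIALIZERS:
        raise ValueError(f"unsupported version/format: {key}")
    parsed = DESERIALIZERS[key](envelope["message"])
    # The single top-level key names the operation; its value is the record body.
    (operation, record), = parsed.items()
    return operation, record
```

A client that receives an unknown version or format fails with a clear error 
here, rather than misreading the payload.
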
> Since the "partitions" field is an array, and since the values corresponding 
> to partition-keys are all strings, we'd be able to accommodate partial 
> partition-specs, or even wild-cards. This might help us add support for 
> "mark-set-done" later on.
> 
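
To make the partial-spec idea concrete, here is one possible client-side 
reading of it, sketched in Python.  The semantics are assumptions, since the 
proposal does not define them: a key missing from the spec is treated as 
unconstrained, and "*" is treated as a shell-style wildcard.  spec_matches 
is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Dict


def spec_matches(spec: Dict[str, str], partition: Dict[str, str]) -> bool:
    """True if every key in spec is present in partition and its pattern matches.

    Keys absent from spec are unconstrained, which is what makes a spec "partial".
    """
    return all(
        key in partition and fnmatchcase(partition[key], pattern)
        for key, pattern in spec.items()
    )
```

Under these assumptions, {"grid": "AxoniteBlue"} and {"dt": "2012_10_*"} 
would both match the partition {"grid": "AxoniteBlue", "dt": "2012_10_25"}.
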
> The first key ("add_partition", "drop_partition" or "alter_partition") 
> indicates the operation, and the value indicates the record-body. (At first 
> glance, the record-body doesn't change for these operations. But that might 
> change, so we'll keep them distinct.)
> 
> Also note that HiveMetaStore::add_partitions_core() currently doesn't send 1 
> message for the entire set of partitions being added. Instead we get one 
> message per partition. This could be verbose and sub-optimal. We'll tackle 
> this sort of thing after we've nailed the format down.
> 
> I'm toying with the idea of adding an "other" property, an array of 
> key-values to accommodate stuff we hadn't considered, at "run-time" (like if 
> we want to introduce a hack). The need for such a property is contingent on 
> the behaviour of Jackson w.r.t. newly added properties in the record-body. 
> (I'll run experiments and keep you posted.)
> 
> What do you think?
> 
> Mithun
> 
