Re: [infinispan-dev] data interoperability and remote querying

Manik Surtani Thu, 11 Apr 2013 03:24:52 -0700

On 10 Apr 2013, at 18:29, Mircea Markus <[email protected]> wrote:

> 
> On 10 Apr 2013, at 17:45, Manik Surtani wrote:
> 
>> Yes.  We haven't quite designed how remote querying will work, but we have a 
>> few ideas.
> Thanks for sharing :-)
>> First, let me explain  how in-VM indexing works.  An object's fields are 
>> appropriately annotated so that when it is stored in Infinispan with a 
>> put(), Hibernate Search can extract the fields and values, flatten it into a 
>> Lucene-friendly "document", and associate it with the entry's key for 
>> searching later.
>> 
>> Now one approach to doing this when storing objects remotely is to pick a 
>> serialisation format that can be parsed on the server side for easy 
>> indexing.  An example of this could be JSON (an appropriate 
>> transformation will need to exist on the server side to strip out irrelevant 
>> fields before indexing).  This would be completely platform-independent, and 
>> also support the interop you described below.  The drawback?  Slow JSON 
>> serialisation and deserialization, and a very verbose data stream.
> What about using our own object definition, based on a fixed number of 
> supported types: e.g. int, long, BigDecimal, String, Date and some more. 
> Each client object would need to implement the logic to serialize and 
> deserialize itself into this format, using some StreamWriters, a bit like our 
> serializers today. 
> The StreamWriters would be provided by us, for every supported 
> programming language, and would have methods like writeInt, writeLong etc.
> Another nice thing we can add to this object scheme is versioning, which is 
> useful for rolling upgrades.
> The server side would then index the known types using Lucene. The client 
> should be able to define queries based on these objects and supported types 
> (the query semantics to be defined).
> Disclaimer: not an original idea, there is already a similar approach used in 
> other data grid providers.


Sounds a LOT like ProtoBufs.  Or - yuck - CORBA.  But generally, 
wheel-reinvention?  Why can't we use an existing library that provides this?
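
To make the wheel-reinvention concern concrete, here is a minimal sketch of 
what a hand-rolled, versioned writer scheme of the kind described above might 
look like for a Person. Everything in it (PortablePersonSketch, 
SCHEMA_VERSION, the field layout, the sample values) is hypothetical and only 
illustrates the idea; none of it is an existing Infinispan API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a home-grown, versioned wire format.
final class PortablePersonSketch {

    // Assumed version header, written first to allow rolling upgrades.
    static final int SCHEMA_VERSION = 1;

    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Writes the version, then each field in a fixed order using typed writes.
    static byte[] write(Person p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(SCHEMA_VERSION);
            out.writeUTF(p.name);
            out.writeInt(p.age);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the same order, rejecting unknown versions.
    static Person read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != SCHEMA_VERSION) {
                throw new IOException("Unsupported schema version: " + version);
            }
            String name = in.readUTF();
            int age = in.readInt();
            return new Person(name, age);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample values are made up for the round trip.
        Person copy = read(write(new Person("Mircea", 30)));
        System.out.println(copy.name + " " + copy.age);
    }
}
```

Even this toy version hits a portability wrinkle: DataOutputStream.writeUTF 
emits Java's modified UTF-8, so a C# or Python reader would have to replicate 
that encoding by hand. Existing serialisation libraries already deal with 
details like this.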

>> 
>> Another approach may be to perform the field extraction on the client side, 
>> so that the data sent to the server would be key=XXX (binary), value=YYY 
>> (binary), indexing_metadata=ZZZ (JSON).  This way the server does not need 
>> to be able to parse the value for indexing, since the field data it needs is 
>> already provided in a platform-independent manner (JSON).  The benefit here 
>> is that keys and values can still be binary, and can use an efficient 
>> marshaller.  The drawback is that field extraction needs to happen on the 
>> client.  Not hard for the Java client (bits of Hibernate Search could be 
>> reused), but for non-Java clients this may increase complexity of those 
>> clients quite a bit (much easier for dynamic language clients - python/ruby).
> The client would need to build a Lucene index itself and send it to the 
> server. I guess Sanne/Emmanuel can comment more on the complexity involved 
> here.
> Here are some limitations I see to this approach:
> - cannot define an index at runtime. If we want to do that, the client would 
> need to storm all the data in the system and re-index it. 
> - cannot run a query for data that is not indexed. I think this is a pretty 
> common requirement as well. 
>> This approach does *not* solve your problem below, because for interop you 
>> will still need a platform-independent serialisation mechanism like Avro or 
>> ProtoBufs for the object <--> blob <--> object conversion.
> Indeed. I think we should decide what approach we take and if we go for the 
> former, not even suggest Apache Avro but implement our own scheme.

See above.  Why implement our own?  Portable and efficient object serialisation 
is an entire sub-field of computer science in itself; do we _really_ want to 
commit to building and maintaining our own?

>> Personally, I prefer the second approach since it separates concerns 
>> (portable indexes vs. portable values) plus would lead to (IMO) a 
>> better-performing implementation.  I'd love to hear others' thoughts though.
> I don't like the first approach because of the marshalling overhead. The 
> former

You mean the latter?

> seems complex, doesn't scale (requires the implementation of indexing for 
> every programming language) and is limiting (indexes need to be defined a 
> priori, cannot query for non-indexed data). 
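
As a rough illustration of the client-side field extraction discussed above 
(key and value stay binary, while indexing metadata travels as JSON), the 
sketch below pulls annotated fields out of an object by reflection and renders 
them as a flat JSON object. The @IndexedField annotation, the 
IndexMetadataSketch class, the Person fields and the JSON layout are all 
assumptions made for the example; this is not the Hibernate Search API, and 
the JSON escaping is deliberately simplistic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical sketch of client-side extraction of indexing metadata.
final class IndexMetadataSketch {

    // Assumed marker for fields that should be sent as indexing metadata.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface IndexedField {}

    static final class Person {
        @IndexedField final String name;
        @IndexedField final String city;
        final byte[] photo; // not annotated, so never sent as metadata

        Person(String name, String city, byte[] photo) {
            this.name = name;
            this.city = city;
            this.photo = photo;
        }
    }

    // Collects annotated fields; a TreeMap keeps the output order stable.
    static Map<String, String> extract(Object value) {
        Map<String, String> fields = new TreeMap<>();
        for (Field f : value.getClass().getDeclaredFields()) {
            if (f.isAnnotationPresent(IndexedField.class)) {
                f.setAccessible(true);
                try {
                    fields.put(f.getName(), String.valueOf(f.get(value)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return fields;
    }

    // Renders the fields as a flat JSON object of string values.
    static String toJson(Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return json.toString();
    }

    // Escapes only backslashes and double quotes (a simplification).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Sample values are made up.
        Person p = new Person("Mircea", "London", new byte[0]);
        System.out.println(toJson(extract(p)));
        // prints {"city":"London","name":"Mircea"}
    }
}
```

A dynamic-language client could do the equivalent with a dictionary walk, 
which is why the extra client complexity would mostly fall on statically 
typed, non-Java clients.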

>> 
>> Cheers
>> Manik
>> 
>> On 10 Apr 2013, at 17:11, Mircea Markus <[email protected]> wrote:
>> 
>>> That is, write the Person object in Java and read a Person object in C#, 
>>> assume a hotrod client for simplicity.
>>> Now at some point we'll have to run a query over the same hotrod, something 
>>> like "give me all the Persons named Mircea".
>>> At this stage, the server side needs to be aware of the Person object in 
>>> order to be able to run the query and select the relevant Persons. It needs 
>>> a schema. Instead of suggesting Avro as a data interoperability protocol, 
>>> we might want to define and use this schema instead: we'd need it anyway 
>>> for remote querying and we won't have two ways of doing the same thing.
>>> Thoughts? 
>>> 
>>> Cheers,
>>> -- 
>>> Mircea Markus
>>> Infinispan lead (www.infinispan.org)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [email protected]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> --
>> Manik Surtani
>> [email protected]
>> twitter.com/maniksurtani
>> 
>> Platform Architect, JBoss Data Grid
>> http://red.ht/data-grid
>> 

--
Manik Surtani
[email protected]
twitter.com/maniksurtani

Platform Architect, JBoss Data Grid
http://red.ht/data-grid