Re: Document storage

2012-05-28 Thread Brian O'Neill
Just following up on this age-old thread, because we've recently done some
development.

Ben, we recently had the exact need you outline.  We are storing JSON
documents in Cassandra and needed to index based on a field in the JSON.
We ended up extending our cassandra-indexing code to accommodate this.
https://github.com/hmsonline/cassandra-indexing

You can now configure the indexing to accommodate a field within the JSON
document.
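For anyone curious before the wiki catches up: the general shape of indexing on a field inside a JSON document looks like the sketch below. This is illustrative only, not the actual cassandra-indexing API; every name is made up, and plain dicts stand in for the data and index column families.

```python
import json

data_cf = {}   # row key -> JSON document text (stand-in for a column family)
index_cf = {}  # indexed field value -> set of row keys (the inverted index)

def put_document(row_key, doc, indexed_field):
    """Write the document and keep the field index in sync."""
    old = data_cf.get(row_key)
    if old is not None:
        # Remove the stale index entry for the previous field value.
        old_value = json.loads(old).get(indexed_field)
        index_cf.get(old_value, set()).discard(row_key)
    data_cf[row_key] = json.dumps(doc)
    index_cf.setdefault(doc.get(indexed_field), set()).add(row_key)

def find_by_field(value):
    """Look up documents through the index instead of scanning every row."""
    return [json.loads(data_cf[key]) for key in index_cf.get(value, set())]
```

Note that every update has to touch both column families; that manual bookkeeping is exactly what the rest of this thread debates pushing into the core.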

We're going to update the wiki to make this more usable, but the work
triggered the same kind of debate/thought process found in this thread.  In
the coming weeks/months, we'll probably consider a switch to protobuf, with
an update to our indexing code so that it understands the internal structure
of documents stored in Cassandra.

just an update for now,
brian

On Fri, Mar 30, 2012 at 1:33 PM, Ben McCann  wrote:

> >
> > If you don't need selected updates and having something as compact as
> > possible on disk make a important difference for you, sure, do use blobs.
> > The only argument is that you can already do that without any change to
> > the core.
>
>
> The thing that we can't do today without changes to the core is index on
> subparts of some document format like Protobuf/JSON/etc.  If cassandra were
> to understand one of these formats, it could remove the need for manual
> management of an index.
>
>
> On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne  >wrote:
>
> > On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
> >  wrote:
> > > But decomposing into columns will lead to more of that:
> > >
> > > - Total amount of serialized data is (in most cases a lot) larger than
> > protobuffed / compressed version
> >
> > At least with sstable compression, I would expect the difference to
> > not be too big in practice.
> >
> > > - If you do selective updates the document will be scattered over
> > multiple ssts plus if you do sliced reads you can't optimize reads as
> > opposed to the single column version that when updated is automatically
> > superseding older versions so most reads will hit only one sst
> >
> > But if you need to do selective updates, then a blob just doesn't work
> > so that comparison is moot.
> >
> > Now I don't think anyone pretended that you should never use blobs
> > (whether that's protobuffed, jsoned, ...). If you don't need selected
> > updates and having something as compact as possible on disk make a
> > important difference for you, sure, do use blobs. The only argument is
> > that you can already do that without any change to the core. What we
> > are saying is that for the case where you care more about schema
> > flexibility (being able to do selective updates, to index on some
> > subpart, etc...) then we think that something like the map and list
> > idea of CASSANDRA-3647 will probably be a more natural fit to the
> > current CQL API.
> >
> > --
> > Sylvain
> >
> > >
> > > All these reads make the hot dataset. If it fits the page cache your
> > fine. If it doesn't you need to buy more iron.
> > >
> > > Really could not resist because your statement seems to be contrary to
> > all our tests / learnings.
> > >
> > > Cheers,
> > > Daniel
> > >
> > > From dev list:
> > >
> > > Re: Document storage
> > > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian 
> > wrote:
> > >>> I think this is a much better approach because that gives you the
> > >>> ability to update or retrieve just parts of objects efficiently,
> > >>> rather than making column values just blobs with a bunch of special
> > >>> case logic to introspect them.  Which feels like a big step backwards
> > >>> to me.
> > >>
> > >> Unless your access pattern involves reading/writing the whole document
> > each time. In
> > > that case you're better off serializing the whole document and storing
> > it in a column as a
> > > byte[] without incurring the overhead of column indexes. Right?
> > >
> > > Hmm, not sure what you're thinking of there.
> > >
> > > If you mean the "index" that's part of the row header for random
> > > access within a row, then no, serializing to byte[] doesn't save you
> > > anything.
> > >
> > > If you mean secondary indexes, don't declare any if you don't want any.
> > :)
> > >
> > > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > > than giving it named columns, but we're talking negligible compared to
> > > the overhead of actually moving the data on or off disk in the first
> > > place.  Not even close to being worth giving up being able to deal
> > > with your data from standard tools like cqlsh, IMO.
> > >
> > > --
> > > Jonathan Ellis
> > > Project Chair, Apache Cassandra
> > > co-founder of DataStax, the source for professional Cassandra support
> > > http://www.datastax.com
> > >
> >
>



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Document storage

2012-03-30 Thread Ben McCann
>
> If you don't need selected updates and having something as compact as
> possible on disk make a important difference for you, sure, do use blobs.
> The only argument is that you can already do that without any change to
> the core.


The thing that we can't do today without changes to the core is index on
subparts of some document format like Protobuf/JSON/etc.  If cassandra were
to understand one of these formats, it could remove the need for manual
management of an index.


On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne wrote:

> On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
>  wrote:
> > But decomposing into columns will lead to more of that:
> >
> > - Total amount of serialized data is (in most cases a lot) larger than
> protobuffed / compressed version
>
> At least with sstable compression, I would expect the difference to
> not be too big in practice.
>
> > - If you do selective updates the document will be scattered over
> multiple ssts plus if you do sliced reads you can't optimize reads as
> opposed to the single column version that when updated is automatically
> superseding older versions so most reads will hit only one sst
>
> But if you need to do selective updates, then a blob just doesn't work
> so that comparison is moot.
>
> Now I don't think anyone pretended that you should never use blobs
> (whether that's protobuffed, jsoned, ...). If you don't need selected
> updates and having something as compact as possible on disk make a
> important difference for you, sure, do use blobs. The only argument is
> that you can already do that without any change to the core. What we
> are saying is that for the case where you care more about schema
> flexibility (being able to do selective updates, to index on some
> subpart, etc...) then we think that something like the map and list
> idea of CASSANDRA-3647 will probably be a more natural fit to the
> current CQL API.
>
> --
> Sylvain
>
> >
> > All these reads make the hot dataset. If it fits the page cache your
> fine. If it doesn't you need to buy more iron.
> >
> > Really could not resist because your statement seems to be contrary to
> all our tests / learnings.
> >
> > Cheers,
> > Daniel
> >
> > From dev list:
> >
> > Re: Document storage
> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian 
> wrote:
> >>> I think this is a much better approach because that gives you the
> >>> ability to update or retrieve just parts of objects efficiently,
> >>> rather than making column values just blobs with a bunch of special
> >>> case logic to introspect them.  Which feels like a big step backwards
> >>> to me.
> >>
> >> Unless your access pattern involves reading/writing the whole document
> each time. In
> > that case you're better off serializing the whole document and storing
> it in a column as a
> > byte[] without incurring the overhead of column indexes. Right?
> >
> > Hmm, not sure what you're thinking of there.
> >
> > If you mean the "index" that's part of the row header for random
> > access within a row, then no, serializing to byte[] doesn't save you
> > anything.
> >
> > If you mean secondary indexes, don't declare any if you don't want any.
> :)
> >
> > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > than giving it named columns, but we're talking negligible compared to
> > the overhead of actually moving the data on or off disk in the first
> > place.  Not even close to being worth giving up being able to deal
> > with your data from standard tools like cqlsh, IMO.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
> >
>


Re: Document storage

2012-03-30 Thread Sylvain Lebresne
On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday
 wrote:
> But decomposing into columns will lead to more of that:
>
> - Total amount of serialized data is (in most cases a lot) larger than 
> protobuffed / compressed version

At least with sstable compression, I would expect the difference to
not be too big in practice.

> - If you do selective updates the document will be scattered over multiple 
> ssts plus if you do sliced reads you can't optimize reads as opposed to the 
> single column version that when updated is automatically superseding older 
> versions so most reads will hit only one sst

But if you need to do selective updates, then a blob just doesn't work
so that comparison is moot.

Now I don't think anyone claimed that you should never use blobs
(whether that's protobuffed, jsoned, ...). If you don't need selective
updates and having something as compact as possible on disk makes an
important difference for you, sure, do use blobs. The only argument is
that you can already do that without any change to the core. What we
are saying is that for the case where you care more about schema
flexibility (being able to do selective updates, to index on some
subpart, etc.), we think that something like the map and list
idea of CASSANDRA-3647 will probably be a more natural fit to the
current CQL API.
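To make the trade-off concrete, here is a toy model of the map-as-columns idea (Python tuples stand in for composite column names; the actual CASSANDRA-3647 design may differ). The point it illustrates: each map entry is its own column, so a selective update rewrites one small column rather than the whole serialized document.

```python
row = {}  # one Cassandra row: composite column name (field, key) -> value

def map_put(field, key, value):
    # Updating one entry touches only that one column,
    # not the whole serialized document.
    row[(field, key)] = value

def map_get(field):
    # Slice out every column whose composite name starts with the field.
    return {key: val for (f, key), val in row.items() if f == field}

map_put("attrs", "color", "red")
map_put("attrs", "size", "L")
map_put("attrs", "color", "blue")  # selective update: one column overwritten
assert map_get("attrs") == {"color": "blue", "size": "L"}
```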

--
Sylvain

>
> All these reads make the hot dataset. If it fits the page cache your fine. If 
> it doesn't you need to buy more iron.
>
> Really could not resist because your statement seems to be contrary to all 
> our tests / learnings.
>
> Cheers,
> Daniel
>
> From dev list:
>
> Re: Document storage
> On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian  wrote:
>>> I think this is a much better approach because that gives you the
>>> ability to update or retrieve just parts of objects efficiently,
>>> rather than making column values just blobs with a bunch of special
>>> case logic to introspect them.  Which feels like a big step backwards
>>> to me.
>>
>> Unless your access pattern involves reading/writing the whole document each 
>> time. In
> that case you're better off serializing the whole document and storing it in 
> a column as a
> byte[] without incurring the overhead of column indexes. Right?
>
> Hmm, not sure what you're thinking of there.
>
> If you mean the "index" that's part of the row header for random
> access within a row, then no, serializing to byte[] doesn't save you
> anything.
>
> If you mean secondary indexes, don't declare any if you don't want any. :)
>
> Just telling C* to store a byte[] *will* be slightly lighter-weight
> than giving it named columns, but we're talking negligible compared to
> the overhead of actually moving the data on or off disk in the first
> place.  Not even close to being worth giving up being able to deal
> with your data from standard tools like cqlsh, IMO.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: Document storage

2012-03-30 Thread Brian O'Neill

Do we also need to consider the client API?
If we don't adjust thrift, the client just gets bytes back, right?
The client is then on their own to marshal those bytes back into a
structure.  In that case, it seems like we would want to choose a standard
that is efficient and for which there are common libraries.  Protobuf seems
to fit the bill here.

Or do we pass back some other structure?  (Native lists/maps? JSON
strings?)

Do we ignore sorting/comparators?
(As with Solr, I'm not sure people have defined a good sort for
multi-valued items.)

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024  blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



On 3/30/12 12:01 PM, "Daniel Doubleday"  wrote:

>> Just telling C* to store a byte[] *will* be slightly lighter-weight
>> than giving it named columns, but we're talking negligible compared to
>> the overhead of actually moving the data on or off disk in the first
>> place. 
>Hm - but isn't this exactly the point? You don't want to move data off
>disk.
>But decomposing into columns will lead to more of that:
>
>- Total amount of serialized data is (in most cases a lot) larger than
>protobuffed / compressed version
>- If you do selective updates the document will be scattered over
>multiple ssts plus if you do sliced reads you can't optimize reads as
>opposed to the single column version that when updated is automatically
>superseding older versions so most reads will hit only one sst
>
>All these reads make the hot dataset. If it fits the page cache your
>fine. If it doesn't you need to buy more iron.
>
>Really could not resist because your statement seems to be contrary to
>all our tests / learnings.
>
>Cheers,
>Daniel
>
>From dev list:
>
>Re: Document storage
>On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian  wrote:
>>> I think this is a much better approach because that gives you the
>>> ability to update or retrieve just parts of objects efficiently,
>>> rather than making column values just blobs with a bunch of special
>>> case logic to introspect them.  Which feels like a big step backwards
>>> to me.
>>
>> Unless your access pattern involves reading/writing the whole document
>>each time. In
>that case you're better off serializing the whole document and storing it
>in a column as a
>byte[] without incurring the overhead of column indexes. Right?
>
>Hmm, not sure what you're thinking of there.
>
>If you mean the "index" that's part of the row header for random
>access within a row, then no, serializing to byte[] doesn't save you
>anything.
>
>If you mean secondary indexes, don't declare any if you don't want any. :)
>
>Just telling C* to store a byte[] *will* be slightly lighter-weight
>than giving it named columns, but we're talking negligible compared to
>the overhead of actually moving the data on or off disk in the first
>place.  Not even close to being worth giving up being able to deal
>with your data from standard tools like cqlsh, IMO.
>
>-- 
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of DataStax, the source for professional Cassandra support
>http://www.datastax.com
>




Re: Document storage

2012-03-30 Thread Daniel Doubleday
> Just telling C* to store a byte[] *will* be slightly lighter-weight
> than giving it named columns, but we're talking negligible compared to
> the overhead of actually moving the data on or off disk in the first
> place. 
Hm - but isn't this exactly the point? You don't want to move data off disk.
But decomposing into columns will lead to more of that:

- Total amount of serialized data is (in most cases a lot) larger than the 
protobuffed / compressed version
- If you do selective updates, the document will be scattered over multiple 
sstables, and with sliced reads you can't optimize the read path. The 
single-column version, by contrast, automatically supersedes older versions 
when updated, so most reads hit only one sstable

All these reads make up the hot dataset. If it fits the page cache, you're 
fine. If it doesn't, you need to buy more iron.
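The scattering effect is easy to model. In this sketch (illustrative only: each flushed write batch becomes one immutable "sstable", represented as a dict), selective updates spread the current document across several sstables, while a rewritten blob lives in just one:

```python
sstables = []  # each flush produces one immutable "sstable"

def flush(columns):
    sstables.append(dict(columns))

def sstables_touched(column_names):
    """How many sstables a read of these columns must consult."""
    return sum(1 for sst in sstables if any(n in sst for n in column_names))

# Decomposed document: three selective updates leave the live data in
# three different sstables, so a full-document read touches all of them.
flush({"name": "x", "city": "KoP"})
flush({"city": "Philly"})
flush({"zip": "19406"})
assert sstables_touched(["name", "city", "zip"]) == 3

# Whole-document blob: an update wholly supersedes the old version, so
# after compaction a read needs only the single newest sstable.
sstables.clear()
flush({"doc": "serialized-v2"})
assert sstables_touched(["doc"]) == 1
```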

Really could not resist because your statement seems to be contrary to all our 
tests / learnings.

Cheers,
Daniel

From dev list:

Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian  wrote:
>> I think this is a much better approach because that gives you the
>> ability to update or retrieve just parts of objects efficiently,
>> rather than making column values just blobs with a bunch of special
>> case logic to introspect them.  Which feels like a big step backwards
>> to me.
>
> Unless your access pattern involves reading/writing the whole document each 
> time. In
that case you're better off serializing the whole document and storing it in a 
column as a
byte[] without incurring the overhead of column indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the "index" that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com



Re: Document storage

2012-03-29 Thread Ben McCann
Cool.  How were you thinking we should store the data?  As a standardized
composite column (e.g. potentially a list as ["fieldName", <index>]:
"fieldValue" and a set as ["fieldName", "fieldValue"]: "")?  Or as a new
column type?
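As a concrete reading of that encoding (tuples standing in for composite column names; purely a sketch of one possible convention): a list element keeps its position in the column name and its element in the column value, while a set packs the member into the name and leaves the value empty.

```python
row = {}  # one row: composite column name -> column value

def list_append(field, index, value):
    row[(field, index)] = value   # ["fieldName", index]: "fieldValue"

def set_add(field, member):
    row[(field, member)] = ""     # ["fieldName", "member"]: ""

list_append("tags", 0, "json")
list_append("tags", 1, "docs")
set_add("flags", "archived")

# Columns sort by composite name, so a slice on the field prefix
# reassembles the collection in order.
assert [v for (f, _), v in sorted(row.items()) if f == "tags"] == ["json", "docs"]
assert {m for (f, m), v in row.items() if f == "flags"} == {"archived"}
```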


On Thu, Mar 29, 2012 at 12:35 PM, Jonathan Ellis  wrote:

> I kind of hijacked
> https://issues.apache.org/jira/browse/CASSANDRA-3647 ("Sylvain
> suggests we start with (non-nested) lists, maps, and sets. I agree
> that this is a great 80/20 approach to the problem") but we could
> split it out to another ticket.
>
> On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann  wrote:
> > Thanks Jonathan.  The only reason I suggested JSON was because it already
> > has support for lists.  Native support for lists in Cassandra would more
> > than satisfy me.  Are there any existing proposals or a bug I can follow?
> >  I'm not familiar with the Cassandra codebase, so I'm not entirely sure
> how
> > helpful I can be, but I'd certainly be interested in taking a look to see
> > what's required.
> >
> > -Ben
> >
> >
> > On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill  >wrote:
> >
> >> Jonathan,
> >>
> >> I was actually going to take this up with Nate McCall a few weeks back.
>  I
> >> think it might make sense to get the client development community
> together
> >> (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)
> >>
> >> I agree whole-heartedly that it shouldn't go into the database for all
> the
> >> reasons you point out.
> >>
> >> If we can all decide on some standards for data storage (e.g. composite
> >> types), indexing strategies, etc.  We can provide higher-level functions
> >> through the client libraries and also provide interoperability between
> >> them.  (without bloating Cassandra)
> >>
> >> CCing Nate.  Nate, thoughts?
> >> I wouldn't mind coordinating/facilitating the conversation.  If we know
> >> who should be involved.
> >>
> >> -brian
> >>
> >> 
> >> Brian O'Neill
> >> Lead Architect, Software Development
> >> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
> >> p: 215.588.6024blog: http://weblogs.java.net/blog/boneill42/
> >> blog: http://brianoneill.blogspot.com/
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 3/29/12 3:06 PM, "Ben McCann"  wrote:
> >>
> >> >Jonathan, I asked Brian about his REST
> >> >API<
> >> https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas
> >> >9C8Us>and
> >> >he said he does not take the json objects and split them because the
> >> >client libraries do not agree on implementations.  This was exactly my
> >> >concern as well with this solution.  I would be perfectly happy to do
> it
> >> >this way instead of using JSON if it were standardized.  The reason I
> >> >suggested JSON is that it is standardized.  As far as I can tell,
> >> >Cassandra
> >> >doesn't support maps and lists in a standardized way today, which is
> the
> >> >root of my problem.
> >> >
> >> >-Ben
> >> >
> >> >
> >> >On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian 
> >> wrote:
> >> >
> >> >> Yes, I meant the "row header index". What I have done is that I'm
> >> >>storing
> >> >> an object (i.e. UserProfile) where you read or write it as a whole (a
> >> >>user
> >> >> updates their user details in a single page in the UI). So I
> serialize
> >> >>that
> >> >> object into a binary JSON using SMILE format. I then compress it
> using
> >> >> Snappy on the client side. So as far as Cassandra cares it's storing
> a
> >> >> byte[].
> >> >>
> >> >> Now on the client side, I'm using cassandra-cli with a custom type
> that
> >> >> knows how to turn a byte[] into a JSON text and back. The only issue
> was
> >> >> CASSANDRA-4081 where "assume" doesn't work with custom types. If
> >> >> CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
> >> >>
> >> >> Also advantages of this vs. the thrift based Super Column families
> are:
> >> >>
> >> >> 1. Saving extra CPU usage on the Cassandra nodes. Since
> >> >> serialize/deserialize and compression/decompression happens on the
> >> >>client
> >> >> nodes where there is plenty idle CPU time
> >> >>
> >> >> 2. Saving network bandwidth since I'm sending over a compressed
> byte[]
> >> >>
> >> >>
> >> >> -- Drew
> >> >>
> >> >>
> >> >>
> >> >> On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
> >> >>
> >> >> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian 
> >> >> wrote:
> >> >> >>> I think this is a much better approach because that gives you the
> >> >> >>> ability to update or retrieve just parts of objects efficiently,
> >> >> >>> rather than making column values just blobs with a bunch of
> special
> >> >> >>> case logic to introspect them.  Which feels like a big step
> >> >>backwards
> >> >> >>> to me.
> >> >> >>
> >> >> >> Unless your access pattern involves reading/writing the whole
> >> >>document
> >> >> each time. In that case you're better off serializing the whole
> document
> >> >> and storing it in a column as a byte[] without incurring the
> overhead of
> >> >> column indexes. Right?
> >
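Drew's client-side pipeline above can be sketched with standard-library stand-ins (json in place of SMILE, zlib in place of Snappy; the real formats are more compact, but the shape is the same): serialize and compress on the client, so Cassandra stores and ships only an opaque byte[].

```python
import json
import zlib

def to_column_value(obj):
    # Serialize, then compress, on the client; the server never inspects it.
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def from_column_value(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

profile = {"user": "drew", "details": {"city": "KoP", "tags": ["a", "b"]}}
blob = to_column_value(profile)   # this byte[] becomes the column value
assert from_column_value(blob) == profile
```

The CPU cost of serialization and compression lands on the client, which is the first of the two advantages Drew lists.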

Re: Document storage

2012-03-29 Thread Jonathan Ellis
I kind of hijacked
https://issues.apache.org/jira/browse/CASSANDRA-3647 ("Sylvain
suggests we start with (non-nested) lists, maps, and sets. I agree
that this is a great 80/20 approach to the problem") but we could
split it out to another ticket.

On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann  wrote:
> Thanks Jonathan.  The only reason I suggested JSON was because it already
> has support for lists.  Native support for lists in Cassandra would more
> than satisfy me.  Are there any existing proposals or a bug I can follow?
>  I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
> helpful I can be, but I'd certainly be interested in taking a look to see
> what's required.
>
> -Ben
>
>
> On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill wrote:
>
>> Jonathan,
>>
>> I was actually going to take this up with Nate McCall a few weeks back.  I
>> think it might make sense to get the client development community together
>> (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)
>>
>> I agree whole-heartedly that it shouldn't go into the database for all the
>> reasons you point out.
>>
>> If we can all decide on some standards for data storage (e.g. composite
>> types), indexing strategies, etc.  We can provide higher-level functions
>> through the client libraries and also provide interoperability between
>> them.  (without bloating Cassandra)
>>
>> CCing Nate.  Nate, thoughts?
>> I wouldn't mind coordinating/facilitating the conversation.  If we know
>> who should be involved.
>>
>> -brian
>>
>> 
>> Brian O'Neill
>> Lead Architect, Software Development
>> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
>> p: 215.588.6024blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/
>>
>>
>>
>>
>>
>>
>>
>> On 3/29/12 3:06 PM, "Ben McCann"  wrote:
>>
>> >Jonathan, I asked Brian about his REST
>> >API<
>> https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas
>> >9C8Us>and
>> >he said he does not take the json objects and split them because the
>> >client libraries do not agree on implementations.  This was exactly my
>> >concern as well with this solution.  I would be perfectly happy to do it
>> >this way instead of using JSON if it were standardized.  The reason I
>> >suggested JSON is that it is standardized.  As far as I can tell,
>> >Cassandra
>> >doesn't support maps and lists in a standardized way today, which is the
>> >root of my problem.
>> >
>> >-Ben
>> >
>> >
>> >On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian 
>> wrote:
>> >
>> >> Yes, I meant the "row header index". What I have done is that I'm
>> >>storing
>> >> an object (i.e. UserProfile) where you read or write it as a whole (a
>> >>user
>> >> updates their user details in a single page in the UI). So I serialize
>> >>that
>> >> object into a binary JSON using SMILE format. I then compress it using
>> >> Snappy on the client side. So as far as Cassandra cares it's storing a
>> >> byte[].
>> >>
>> >> Now on the client side, I'm using cassandra-cli with a custom type that
>> >> knows how to turn a byte[] into a JSON text and back. The only issue was
>> >> CASSANDRA-4081 where "assume" doesn't work with custom types. If
>> >> CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
>> >>
>> >> Also advantages of this vs. the thrift based Super Column families are:
>> >>
>> >> 1. Saving extra CPU usage on the Cassandra nodes. Since
>> >> serialize/deserialize and compression/decompression happens on the
>> >>client
>> >> nodes where there is plenty idle CPU time
>> >>
>> >> 2. Saving network bandwidth since I'm sending over a compressed byte[]
>> >>
>> >>
>> >> -- Drew
>> >>
>> >>
>> >>
>> >> On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
>> >>
>> >> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian 
>> >> wrote:
>> >> >>> I think this is a much better approach because that gives you the
>> >> >>> ability to update or retrieve just parts of objects efficiently,
>> >> >>> rather than making column values just blobs with a bunch of special
>> >> >>> case logic to introspect them.  Which feels like a big step
>> >>backwards
>> >> >>> to me.
>> >> >>
>> >> >> Unless your access pattern involves reading/writing the whole
>> >>document
>> >> each time. In that case you're better off serializing the whole document
>> >> and storing it in a column as a byte[] without incurring the overhead of
>> >> column indexes. Right?
>> >> >
>> >> > Hmm, not sure what you're thinking of there.
>> >> >
>> >> > If you mean the "index" that's part of the row header for random
>> >> > access within a row, then no, serializing to byte[] doesn't save you
>> >> > anything.
>> >> >
>> >> > If you mean secondary indexes, don't declare any if you don't want
>> >>any.
>> >> :)
>> >> >
>> >> > Just telling C* to store a byte[] *will* be slightly lighter-weight
>> >> > than giving it named columns, but we're talking negligible compared to
>> >> > the overhead of actually moving the data on o

Re: Document storage

2012-03-29 Thread Brian O'Neill

Jonathan,

We store JSON as our column values.  I'd love to see support for maps and
lists.  If I get some time this weekend, I'll take a look to see what is
required.  It doesn't seem like it would be that hard.

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024  blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/







On 3/29/12 3:18 PM, "Jonathan Ellis"  wrote:

>On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann  wrote:
>> As far as I can tell, Cassandra
>> doesn't support maps and lists in a standardized way today, which is the
>> root of my problem.
>
>I'm pretty serious about adding those for 1.2, for what that's worth.
>(If you want to jump in and help code that up, so much the better.)
>
>-- 
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of DataStax, the source for professional Cassandra support
>http://www.datastax.com




Re: Document storage

2012-03-29 Thread Ben McCann
Thanks Jonathan.  The only reason I suggested JSON was because it already
has support for lists.  Native support for lists in Cassandra would more
than satisfy me.  Are there any existing proposals or a bug I can follow?
 I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
helpful I can be, but I'd certainly be interested in taking a look to see
what's required.

-Ben


On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill wrote:

> Jonathan,
>
> I was actually going to take this up with Nate McCall a few weeks back.  I
> think it might make sense to get the client development community together
> (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)
>
> I agree whole-heartedly that it shouldn't go into the database for all the
> reasons you point out.
>
> If we can all decide on some standards for data storage (e.g. composite
> types), indexing strategies, etc.  We can provide higher-level functions
> through the client libraries and also provide interoperability between
> them.  (without bloating Cassandra)
>
> CCing Nate.  Nate, thoughts?
> I wouldn't mind coordinating/facilitating the conversation.  If we know
> who should be involved.
>
> -brian
>
> 
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
Re: Document storage

2012-03-29 Thread Brian O'Neill
Jonathan, 

I was actually going to take this up with Nate McCall a few weeks back.  I
think it might make sense to get the client development community together
(Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)

I agree whole-heartedly that it shouldn't go into the database for all the
reasons you point out.

If we can all decide on some standards for data storage (e.g. composite
types), indexing strategies, etc., we can provide higher-level functions
through the client libraries and also provide interoperability between
them, without bloating Cassandra.

CCing Nate.  Nate, thoughts?
I wouldn't mind coordinating/facilitating the conversation if we know
who should be involved.

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/










Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 2:06 PM, Ben McCann  wrote:
> As far as I can tell, Cassandra
> doesn't support maps and lists in a standardized way today, which is the
> root of my problem.

I'm pretty serious about adding those for 1.2, for what that's worth.
(If you want to jump in and help code that up, so much the better.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Ben McCann
Jonathan, I asked Brian about his REST
API <https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us> and
he said he does not take the json objects and split them because the
client libraries do not agree on implementations.  This was exactly my
concern as well with this solution.  I would be perfectly happy to do it
this way instead of using JSON if it were standardized.  The reason I
suggested JSON is that it is standardized.  As far as I can tell, Cassandra
doesn't support maps and lists in a standardized way today, which is the
root of my problem.

-Ben




Re: Document storage

2012-03-29 Thread Drew Kutcharian
Yes, I meant the "row header index". What I have done is that I'm storing an 
object (i.e. UserProfile) where you read or write it as a whole (a user updates 
their user details in a single page in the UI). So I serialize that object into 
a binary JSON using SMILE format. I then compress it using Snappy on the client 
side. So as far as Cassandra cares it's storing a byte[].

Now on the client side, I'm using cassandra-cli with a custom type that knows 
how to turn a byte[] into a JSON text and back. The only issue was 
CASSANDRA-4081 where "assume" doesn't work with custom types. If CASSANDRA-4081 
gets fixed, I'll get the best of both worlds.

Also advantages of this vs. the thrift based Super Column families are:

1. Saving extra CPU usage on the Cassandra nodes. Since serialize/deserialize 
and compression/decompression happens on the client nodes where there is plenty 
idle CPU time

2. Saving network bandwidth since I'm sending over a compressed byte[]
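Drew's pipeline is easy to picture in a few lines. The following is a sketch, not his actual code: the stdlib json and zlib modules stand in for Jackson's SMILE and Snappy, since the point is only that serialization and compression happen on the client and Cassandra stores an opaque byte[].

```python
import json
import zlib

def encode_document(doc: dict) -> bytes:
    # Serialize then compress on the client; the server only ever sees bytes.
    return zlib.compress(json.dumps(doc, sort_keys=True).encode("utf-8"))

def decode_document(blob: bytes) -> dict:
    # Reverse the pipeline on read.
    return json.loads(zlib.decompress(blob).decode("utf-8"))

profile = {"user": "drew", "emails": ["d@example.com"], "version": 2}
blob = encode_document(profile)           # what would be stored in the column
assert decode_document(blob) == profile   # whole-document reads round-trip
```

The trade-off is exactly the one debated in this thread: the blob round-trips cheaply as a whole, but Cassandra can no longer see, validate, or index individual fields.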


-- Drew






Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian  wrote:
>> I think this is a much better approach because that gives you the
>> ability to update or retrieve just parts of objects efficiently,
>> rather than making column values just blobs with a bunch of special
>> case logic to introspect them.  Which feels like a big step backwards
>> to me.
>
> Unless your access pattern involves reading/writing the whole document each 
> time. In that case you're better off serializing the whole document and 
> storing it in a column as a byte[] without incurring the overhead of column 
> indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the "index" that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Drew Kutcharian
> I think this is a much better approach because that gives you the
> ability to update or retrieve just parts of objects efficiently,
> rather than making column values just blobs with a bunch of special
> case logic to introspect them.  Which feels like a big step backwards
> to me.

Unless your access pattern involves reading/writing the whole document each 
time. In that case you're better off serializing the whole document and storing 
it in a column as a byte[] without incurring the overhead of column indexes. 
Right?





Re: Document storage

2012-03-29 Thread Drew Kutcharian
I agree with Edward here, the simpler we keep the core the better. I think all 
the ser/deser and conversions should happen on the client side.

-- Drew





Re: Document storage

2012-03-29 Thread Drew Kutcharian
Hi Ben,

Sure, there's nothing really to it, but I'll email it to you. As for why I'm
using Snappy on the type instead of sstable_compression: when you set
sstable_compression, the compression happens on the Cassandra nodes, and I see
two advantages with my approach:

1. Saving extra CPU usage on the Cassandra nodes. Since 
compression/decompression can easily be done on the client nodes where there is 
plenty idle CPU time

2. Saving network bandwidth since you're sending over a compressed byte[]

One thing to note about my approach is that when I define the schema in 
Cassandra, I define the columns as byte[] and not my custom type and I do all 
the conversion on the client side.

-- Drew




On Mar 29, 2012, at 12:04 AM, Ben McCann wrote:

> Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
> JSON type and did the validation the same way you did, but I don't have any
> SMILE support yet.  It seems that if your type were committed to the
> Cassandra codebase then the issue you ran into of the CLI only supporting
> built-in types would no longer be a problem for you (though fixing the
> issue anyway would be good and I voted for it).  Btw, any reason you
> compress it with Snappy yourself instead of just setting sstable_compression
> to SnappyCompressor and letting Cassandra do that part?
> 
> -Ben
> 
> 
> On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian  wrote:
> 
>> I'm actually doing something almost the same. I serialize my objects into
>> byte[] using Jackson's SMILE format, then compress it using Snappy then
>> store the byte[] in Cassandra. I actually created a simple Cassandra Type
>> for this but I hit a wall with cassandra-cli:
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-4081
>> 
>> Please vote on the JIRA if you are interested.
>> 
>> Validation is pretty simple, you just need to read the value and parse it
>> using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)
>> 
>> -- Drew
>> 
>> 
>> 
>> On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:
>> 
>>> I don't imagine sort is a meaningful operation on JSON data.  As long as
>>> the sorting is consistent I would think that should be sufficient.
>>> 
>>> 
>>> On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:
>>>
>>>> Some work I did stores JSON blobs in columns. The question on JSON
>>>> type is how to sort it.
>>>>
>>>> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna wrote:
>>>>> I don't speak for the project, but you might give it a day or two for
>>>>> people to respond and/or perhaps create a jira ticket.  Seems like
>>>>> that's a reasonable data type that would get some traction - a json
>>>>> type.  However, what would validation look like?  That's one of the
>>>>> main reasons there are the data types and validators, in order to
>>>>> validate on insert.
>>>>>
>>>>> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>>>>>
>>>>>> Any thoughts?  I'd like to submit a patch, but only if it will be
>>>>>> accepted.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering if it would be interesting to add some type of
>>>>>>> document-oriented data type.
>>>>>>>
>>>>>>> I've found it somewhat awkward to store document-oriented data in
>>>>>>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it,
>>>>>>> and store it, but Cassandra cannot differentiate it from any other
>>>>>>> string or byte array.  However, if my column validation_class could
>>>>>>> be a JsonType that would allow tools to potentially do more
>>>>>>> interesting introspection on the column value.  E.g. bug 3647
>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
>>>>>>> supporting arbitrarily nested "documents" in CQL.  Running a query
>>>>>>> against the JSON column in Pig is possible as well, but again in
>>>>>>> this use case it would be helpful to be able to encode in column
>>>>>>> metadata that the column is stored as JSON.  For debugging, running
>>>>>>> nightly reports, etc. it would be quite useful compared to the
>>>>>>> opaque string and byte array types we have today.  JSON is appealing
>>>>>>> because it would be easy to implement.  Something like Thrift or
>>>>>>> Protocol Buffers would actually be interesting since they would be
>>>>>>> more space efficient.  However, they would also be a bit more
>>>>>>> difficult to implement because of the extra typing information they
>>>>>>> provide.  I'm hoping with Cassandra 1.0's addition of compression
>>>>>>> that storing JSON is not too inefficient.
>>>>>>>
>>>>>>> Would there be interest in adding a JsonType?  I could look at
>>>>>>> putting a patch together.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben



RE: Document storage

2012-03-29 Thread Jeremiah Jordan
But it isn't special case logic.  The current AbstractType and Indexing of 
Abstract types for the most part would already support this.  Someone just has 
to write the code for JSONType or ProtoBuffType.

The problem isn't writing the code to break objects up; the problem is
encode/decode time.  Encode/decode to thrift is already a significant portion
of the write timeline, and adding object-to-column encode/decode on top of that
makes it even longer.  For a read-heavy load that serves the JSON/Proto
directly to clients, an increase in the write timeline to parse/index the blob
is probably acceptable, so that you don't have to pay the re-assembly penalty
every time you hit the database for that object.

But once we get multi-range slicing, I think the break-it-up-into-multiple-
columns approach will be best for most people in the average case.  That is the
other problem I have with breaking into columns right now: I either use Super
Columns and can't index (so why did I break them up?), or I can't get multiple
objects at once without pulling a huge slice from o1 start to o5 end and then
throwing away the majority of the data I pulled back that doesn't belong to o1
or o5.
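Jeremiah's slicing complaint can be made concrete with a toy model (hypothetical column names, a plain Python dict standing in for a wide row): fetching objects o1 and o5 with a single contiguous slice drags along everything between them.

```python
# Row holding five objects, each decomposed into three columns: o1.f0 .. o5.f2.
row = {f"o{i}.f{j}": i * 10 + j for i in range(1, 6) for j in range(3)}

def slice_columns(row, start, end):
    # One contiguous column slice, like a single thrift get_slice range.
    return {name: value for name, value in row.items() if start <= name <= end}

# Without multi-range slicing, reading o1 and o5 means one slice spanning
# o1 start to o5 end ('~' sorts after '.'), then discarding the middle.
fetched = slice_columns(row, "o1", "o5~")
wanted = {n: v for n, v in fetched.items() if n.startswith(("o1.", "o5."))}
discarded = len(fetched) - len(wanted)  # columns for o2..o4, read for nothing
```

Here 15 columns come back to serve 6, which is the overhead multi-range reads would remove.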

-Jeremiah




Re: Document storage

2012-03-29 Thread Jonathan Ellis
On Thu, Mar 29, 2012 at 9:57 AM, Jeremiah Jordan
 wrote:
> Its not clear what 3647 actually is, there is no code attached, and no real 
> example in it.
>
> Aside from that, the reason this would be useful to me (if we could get 
> indexing of attributes working), is that I already have my data in 
> JSON/Thrift/ProtoBuff, depending how large the data is, it isn't trivial to 
> break it up into columns to insert, and re-assemble into columns to read.

I don't understand the problem.  Assuming Cassandra support for maps
and lists, I could write a Python module that takes json (or thrift,
or protobuf) objects and splits them into Cassandra rows by fields in
a couple hours.  I'm pretty sure this is essentially what Brian's REST
api for Cassandra does now.

I think this is a much better approach because that gives you the
ability to update or retrieve just parts of objects efficiently,
rather than making column values just blobs with a bunch of special
case logic to introspect them.  Which feels like a big step backwards
to me.
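Jonathan's "couple hours" Python module is roughly a recursive flatten. A sketch of the idea, with the caveat that dotted string paths are illustrative only; a real client would use composite column names:

```python
import json

def flatten(doc, prefix=()):
    """Decompose a JSON-style object into (column name, value) pairs."""
    items = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            items.extend(flatten(value, prefix + (str(key),)))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            items.extend(flatten(value, prefix + (str(index),)))
    else:
        items.append((".".join(prefix), json.dumps(doc)))
    return items

doc = {"name": "ben", "address": {"city": "Palo Alto"}, "tags": ["a", "b"]}
columns = dict(flatten(doc))
# Each leaf becomes its own column, so single fields can be read or
# updated without deserializing the rest of the document.
```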

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Document storage

2012-03-29 Thread Edward Capriolo
The issue with these super complex types is that to do anything useful with
them you would need either scanners or coprocessors. As it stands right now,
complex data like JSON is fairly opaque to Cassandra. Getting Cassandra to
natively speak protobufs, or whatever flavor-of-the-week serialization
framework is hip right now, would make the codebase very large. How is that
field sorted? How is it indexed? This is starting to go very far against the
schema-less NoSQL grain. Where does this end? With users wanting to store
binary XML, index it, and feed Cassandra XPath queries?
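Edward's "how is that field sorted?" is the crux. One consistent (if semantically arbitrary) answer, sketched here as an assumption rather than anything Cassandra does, is to compare a canonical serialization, which at least makes equal documents compare equal:

```python
import json

def canonical(doc) -> bytes:
    # Sorted keys + fixed separators yield one byte string per document value.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Key order no longer affects comparison...
assert canonical({"b": 1, "a": 2}) == canonical({"a": 2, "b": 1})
# ...but the resulting order is lexical on bytes, not meaningful to the data.
assert canonical({"a": 1}) < canonical({"a": 2})
```

This matches Ben's point elsewhere in the thread: for an opaque JSON column, consistency is all a comparator can reasonably promise.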




Re: Document storage

2012-03-29 Thread Ben McCann
Creating materialized paths may well be a possible solution.  If that were
the solution the community were to agree upon then I would like it to be a
standardized and well-documented best practice.  I asked how to store a
list of values on the user list, and
no one suggested ["fieldName", ]: "fieldValue".  It would be a
huge pain right now to create materialized paths like this for each of my
objects, so client library support would definitely be needed.  And the
client libraries should agree.  If Astyanax and lazyboy both add support
for materialized path and I write an object to Cassandra with Astyanax,
then I should be able to read it back with lazyboy.  The benefit of using
JSON/SMILE is that it's very clear that there's exactly one way to
serialize and deserialize the data and it's very easy.  It's not clear to
me that this is true using materialized paths.




Re: Document storage

2012-03-29 Thread Tyler Patterson
>
>
> Would there be interest in adding a JsonType?


What about checking that data inserted into a JsonType is valid JSON? How
would you do it, and would the overhead be something we are concerned
about, especially if the JSON string is large?
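Tyler's question has a simple baseline answer: the parse-or-reject approach Drew describes elsewhere in the thread. A sketch of what a JsonType validator might do (names are hypothetical, not Cassandra's AbstractType API); the cost is one full parse per insert, which is exactly the overhead that grows with blob size:

```python
import json

def validate_json(value: bytes) -> None:
    """Reject any value that doesn't decode and parse as JSON."""
    try:
        json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as exc:
        raise ValueError(f"invalid JSON value: {exc!r}") from exc

validate_json(b'{"ok": [1, 2, 3]}')   # valid input passes silently
try:
    validate_json(b'{"broken": ')
except ValueError:
    pass                              # malformed input is rejected
```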



RE: Document storage

2012-03-29 Thread Jeremiah Jordan
It's not clear what 3647 actually is; there is no code attached and no real 
example in it.

Aside from that, the reason this would be useful to me (if we could get 
indexing of attributes working) is that I already have my data in 
JSON/Thrift/ProtoBuf. Depending on how large the data is, it isn't trivial to 
break it up into columns on insert and reassemble it on read.  
Also, until we get multiple slice range reads, I can't read two different 
structures out of one row without getting all the other stuff between them, 
unless there are only two columns and I read them using column names rather than slices.

As it is right now I have to maintain custom indexes on all my attributes to be 
able to put ProtoBufs into columns and get some searching on them.  It would 
be nice if I could drop all my custom indexing code and just tell Cassandra: 
hey, index column.attr1.subattr2.
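[Editor's note: a dotted-path extractor of the kind described here is easy to sketch. This is illustrative Python only, not Cassandra code; the names `extract` and the sample row key are hypothetical.]

```python
import json

def extract(doc_bytes, path):
    """Walk a dotted path such as 'attr1.subattr2' into a decoded JSON
    document; return the leaf value, or None if any segment is missing."""
    node = json.loads(doc_bytes)
    for segment in path.split("."):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node

# A hand-rolled secondary index is then just a reverse map:
# extracted value -> set of row keys containing it.
index = {}
row_key, doc = "user42", b'{"attr1": {"subattr2": "java"}}'
value = extract(doc, "attr1.subattr2")
index.setdefault(value, set()).add(row_key)
```

Built-in support would amount to Cassandra running something like `extract` server-side on write and maintaining that reverse map for you.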

-Jeremiah

From: Jake Luciani [jak...@gmail.com]
Sent: Thursday, March 29, 2012 7:44 AM
To: dev@cassandra.apache.org
Subject: Re: Document storage

Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
would seem the only thing a JSON type offers you is validation.  3647 takes
it much further by deconstructing a JSON document using composite columns
to flatten the document out, with the ability to access and update portions
of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann  wrote:

> Hi,
>
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
>
> I've found it somewhat awkward to store document-oriented data in Cassandra
> today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
> Cassandra cannot differentiate it from any other string or byte array.
>  However, if my column validation_class could be a JsonType that would
> allow tools to potentially do more interesting introspection on the column
> value.  E.g. bug 3647
> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
> supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
>  Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
>
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
>
> Thanks,
> Ben
>



--
http://twitter.com/tjake


Re: Document storage

2012-03-29 Thread Rick Branson
Ben,

You can create a "materialized path" for each field in the document:

{
  ["user", "firstName"]: "ben",
  ["user", "skills", 0]: "java",
  ["user", "skills", 1]: "javascript",
  ["user", "skills", 2]: "html",
  ["user", "education", "school"]: "cmu",
  ["user", "education", "major"]: "computer science"
}

This way each field could be independently updated, and you can take 
sub-document slices with queries such as "give me everything under 
user/skills." 
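[Editor's note: the flatten/reconstruct round trip for this layout can be sketched in a few lines of Python. Illustrative only, not Cassandra code; positional indexes stand in for whatever array-position marker the archive dropped from the example above.]

```python
def flatten(doc, prefix=()):
    """Yield (path, leaf) pairs -- one per would-be column -- from nested
    dicts and lists; list elements get their position as a path segment."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, prefix + (key,))
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield from flatten(value, prefix + (i,))
    else:
        yield prefix, doc

def reconstruct(pairs):
    """Rebuild a nested document from (path, leaf) pairs.  Lists come back
    as dicts keyed by position, which is enough to show the round trip."""
    root = {}
    for path, value in pairs:
        node = root
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return root

user = {"firstName": "ben",
        "skills": ["java", "javascript", "html"],
        "education": {"school": "cmu", "major": "computer science"}}
columns = dict(flatten({"user": user}))
# A sub-document slice is then a prefix scan over the sorted column paths:
skills = {p: v for p, v in columns.items() if p[:2] == ("user", "skills")}
```

With composite column names sorting the paths for you, the "give me everything under user/skills" query is exactly that prefix scan.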

Rick


On Thursday, March 29, 2012 at 7:27 AM, Ben McCann wrote:

> Could you explain further how I would use CASSANDRA-3647? There's still
> very little documentation on composite columns and it was not clear to me
> whether they could be used to store document oriented data. Say for
> example that I had a document like:
> 
> user: {
> firstName: 'ben',
> skills: ['java', 'javascript', 'html'],
> education: {
> school: 'cmu',
> major: 'computer science'
> }
> }
> 
> How would I flatten this to be stored and then reconstruct the document?
> 
> 
> On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani  wrote:
> 
> > Is there a reason you would prefer a JSONType over CASSANDRA-3647? It
> > would seem the only thing a JSON type offers you is validation. 3647 takes
> > it much further by deconstructing a JSON document using composite columns
> > to flatten the document out, with the ability to access and update portions
> > of the document (as well as reconstruct it).
> > 
> > On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann  wrote:
> > 
> > > Hi,
> > > 
> > > I was wondering if it would be interesting to add some type of
> > > document-oriented data type.
> > > 
> > > I've found it somewhat awkward to store document-oriented data in
> > Cassandra
> > > today. I can make a JSON/Protobuf/Thrift, serialize it, and store it,
> > 
> > 
> > but
> > > Cassandra cannot differentiate it from any other string or byte array.
> > > However, if my column validation_class could be a JsonType that would
> > > allow tools to potentially do more interesting introspection on the
> > 
> > 
> > column
> > > value. E.g. bug 3647
> > > calls for
> > > supporting arbitrarily nested "documents" in CQL. Running a
> > > query against the JSON column in Pig is possible as well, but again in
> > 
> > 
> > this
> > > use case it would be helpful to be able to encode in column metadata that
> > > the column is stored as JSON. For debugging, running nightly reports,
> > 
> > 
> > etc.
> > > it would be quite useful compared to the opaque string and byte array
> > 
> > 
> > types
> > > we have today. JSON is appealing because it would be easy to implement.
> > > Something like Thrift or Protocol Buffers would actually be interesting
> > > since they would be more space efficient. However, they would also be a
> > > bit more difficult to implement because of the extra typing information
> > > they provide. I'm hoping with Cassandra 1.0's addition of compression
> > 
> > 
> > that
> > > storing JSON is not too inefficient.
> > > 
> > > Would there be interest in adding a JsonType? I could look at putting a
> > > patch together.
> > > 
> > > Thanks,
> > > Ben
> > 
> > 
> > 
> > 
> > 
> > --
> > http://twitter.com/tjake
> 





Re: Document storage

2012-03-29 Thread Ben McCann
Could you explain further how I would use CASSANDRA-3647?  There's still
very little documentation on composite columns and it was not clear to me
whether they could be used to store document oriented data.  Say for
example that I had a document like:

user: {
  firstName: 'ben',
  skills: ['java', 'javascript', 'html'],
  education: {
school: 'cmu',
major: 'computer science'
  }
}

How would I flatten this to be stored and then reconstruct the document?


On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani  wrote:

> Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
> would seem the only thing a JSON type offers you is validation.  3647 takes
> it much further by deconstructing a JSON document using composite columns
> to flatten the document out, with the ability to access and update portions
> of the document (as well as reconstruct it).
>
> On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann  wrote:
>
> > Hi,
> >
> > I was wondering if it would be interesting to add some type of
> > document-oriented data type.
> >
> > I've found it somewhat awkward to store document-oriented data in
> Cassandra
> > today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it,
> but
> > Cassandra cannot differentiate it from any other string or byte array.
> >  However, if my column validation_class could be a JsonType that would
> > allow tools to potentially do more interesting introspection on the
> column
> > value.  E.g. bug 3647
> > calls for
> > supporting arbitrarily nested "documents" in CQL.  Running a
> > query against the JSON column in Pig is possible as well, but again in
> this
> > use case it would be helpful to be able to encode in column metadata that
> > the column is stored as JSON.  For debugging, running nightly reports,
> etc.
> > it would be quite useful compared to the opaque string and byte array
> types
> > we have today.  JSON is appealing because it would be easy to implement.
> >  Something like Thrift or Protocol Buffers would actually be interesting
> > since they would be more space efficient.  However, they would also be a
> > bit more difficult to implement because of the extra typing information
> > they provide.  I'm hoping with Cassandra 1.0's addition of compression
> that
> > storing JSON is not too inefficient.
> >
> > Would there be interest in adding a JsonType?  I could look at putting a
> > patch together.
> >
> > Thanks,
> > Ben
> >
>
>
>
> --
> http://twitter.com/tjake
>


Re: Document storage

2012-03-29 Thread Jake Luciani
Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
would seem the only thing a JSON type offers you is validation.  3647 takes
it much further by deconstructing a JSON document using composite columns
to flatten the document out, with the ability to access and update portions
of the document (as well as reconstruct it).

On Wed, Mar 28, 2012 at 11:58 AM, Ben McCann  wrote:

> Hi,
>
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
>
> I've found it somewhat awkward to store document-oriented data in Cassandra
> today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
> Cassandra cannot differentiate it from any other string or byte array.
>  However, if my column validation_class could be a JsonType that would
> allow tools to potentially do more interesting introspection on the column
> value.  E.g. bug 3647
> calls for
> supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
>  Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
>
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
>
> Thanks,
> Ben
>



-- 
http://twitter.com/tjake


Re: Document storage

2012-03-29 Thread Ben McCann
Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
JSON type and did the validation the same way you did, but I don't have any
SMILE support yet.  It seems that if your type were committed to the
Cassandra codebase then the issue you ran into of the CLI only supporting
built-in types would no longer be a problem for you (though fixing the
issue anyway would be good and I voted for it).  Btw, any reason you
compress it with Snappy yourself instead of just setting sstable_compression
to SnappyCompressor and letting Cassandra do that part?

-Ben


On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian  wrote:

> I'm actually doing something almost the same. I serialize my objects into
> byte[] using Jackson's SMILE format, then compress it using Snappy then
> store the byte[] in Cassandra. I actually created a simple Cassandra Type
> for this but I hit a wall with cassandra-cli:
>
> https://issues.apache.org/jira/browse/CASSANDRA-4081
>
> Please vote on the JIRA if you are interested.
>
> Validation is pretty simple: you just need to read the value and parse it
> using Jackson; if you don't get any exceptions, your JSON/Smile is valid ;)
>
> -- Drew
>
>
>
> On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:
>
> > I don't imagine sort is a meaningful operation on JSON data.  As long as
> > the sorting is consistent I would think that should be sufficient.
> >
> >
> > On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo  wrote:
> >
> >> Some work I did stores JSON blobs in columns. The question on JSON
> >> type is how to sort it.
> >>
> >> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
> >>  wrote:
> >>> I don't speak for the project, but you might give it a day or two for
> >> people to respond and/or perhaps create a jira ticket.  Seems like
> that's a
> >> reasonable data type that would get some traction - a json type.
>  However,
> >> what would validation look like?  That's one of the main reasons there
> are
> >> the data types and validators, in order to validate on insert.
> >>>
> >>> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
> >>>
>  Any thoughts?  I'd like to submit a patch, but only if it will be
> >> accepted.
> 
>  Thanks,
>  Ben
> 
> 
>  On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann 
> wrote:
> 
> > Hi,
> >
> > I was wondering if it would be interesting to add some type of
> > document-oriented data type.
> >
> > I've found it somewhat awkward to store document-oriented data in
> > Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it,
> and
> > store it, but Cassandra cannot differentiate it from any other string
> >> or
> > byte array.  However, if my column validation_class could be a
> JsonType
> > that would allow tools to potentially do more interesting
> >> introspection on
> > the column value.  E.g. bug 3647<
> >> https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
> supporting
> >> arbitrarily nested "documents" in CQL.  Running a
> > query against the JSON column in Pig is possible as well, but again
> in
> >> this
> > use case it would be helpful to be able to encode in column metadata
> >> that
> > the column is stored as JSON.  For debugging, running nightly
> reports,
> >> etc.
> > it would be quite useful compared to the opaque string and byte array
> >> types
> > we have today.  JSON is appealing because it would be easy to
> >> implement.
> > Something like Thrift or Protocol Buffers would actually be
> interesting
> > since they would be more space efficient.  However, they would also
> be
> >> a
> > bit more difficult to implement because of the extra typing
> information
> > they provide.  I'm hoping with Cassandra 1.0's addition of
> compression
> >> that
> > storing JSON is not too inefficient.
> >
> > Would there be interest in adding a JsonType?  I could look at
> putting
> >> a
> > patch together.
> >
> > Thanks,
> > Ben
> >
> >
> >>>
> >>
>
>


Re: Document storage

2012-03-28 Thread Drew Kutcharian
I'm actually doing something almost the same. I serialize my objects into 
byte[] using Jackson's SMILE format, then compress it using Snappy then store 
the byte[] in Cassandra. I actually created a simple Cassandra Type for this 
but I hit a wall with cassandra-cli:

https://issues.apache.org/jira/browse/CASSANDRA-4081

Please vote on the JIRA if you are interested.

Validation is pretty simple: you just need to read the value and parse it using 
Jackson; if you don't get any exceptions, your JSON/Smile is valid ;)
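[Editor's note: the round trip described here looks roughly like the sketch below, with zlib standing in for Snappy (which needs a third-party package) and plain JSON standing in for Jackson's SMILE encoding. Names are hypothetical.]

```python
import json
import zlib

def pack(obj):
    """Serialize then compress -- the byte[] you would hand to Cassandra."""
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def is_valid(blob):
    """Validation by round trip: decompress and parse; any exception means
    the stored value is not a valid (compressed) document."""
    try:
        json.loads(zlib.decompress(blob).decode("utf-8"))
        return True
    except (zlib.error, UnicodeDecodeError, ValueError):
        return False

blob = pack({"firstName": "ben", "skills": ["java"]})
```

A validator class for a custom type would do exactly what `is_valid` does, except raising a marshal exception instead of returning False.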

-- Drew



On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

> I don't imagine sort is a meaningful operation on JSON data.  As long as
> the sorting is consistent I would think that should be sufficient.
> 
> 
> On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:
> 
>> Some work I did stores JSON blobs in columns. The question on JSON
>> type is how to sort it.
>> 
>> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
>>  wrote:
>>> I don't speak for the project, but you might give it a day or two for
>> people to respond and/or perhaps create a jira ticket.  Seems like that's a
>> reasonable data type that would get some traction - a json type.  However,
>> what would validation look like?  That's one of the main reasons there are
>> the data types and validators, in order to validate on insert.
>>> 
>>> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>>> 
 Any thoughts?  I'd like to submit a patch, but only if it will be
>> accepted.
 
 Thanks,
 Ben
 
 
 On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
 
> Hi,
> 
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
> 
> I've found it somewhat awkward to store document-oriented data in
> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
> store it, but Cassandra cannot differentiate it from any other string
>> or
> byte array.  However, if my column validation_class could be a JsonType
> that would allow tools to potentially do more interesting
>> introspection on
> the column value.  E.g. bug 3647<
>> https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for supporting
>> arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in
>> this
> use case it would be helpful to be able to encode in column metadata
>> that
> the column is stored as JSON.  For debugging, running nightly reports,
>> etc.
> it would be quite useful compared to the opaque string and byte array
>> types
> we have today.  JSON is appealing because it would be easy to
>> implement.
> Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be
>> a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression
>> that
> storing JSON is not too inefficient.
> 
> Would there be interest in adding a JsonType?  I could look at putting
>> a
> patch together.
> 
> Thanks,
> Ben
> 
> 
>>> 
>> 



Re: Document storage

2012-03-28 Thread Ben McCann
I don't imagine sort is a meaningful operation on JSON data.  As long as
the sorting is consistent I would think that should be sufficient.
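[Editor's note: "consistent but not meaningful" boils down to comparing the serialized bytes lexicographically, the way an opaque bytes comparator would. A quick sanity sketch:]

```python
import json

docs = [{"b": 2}, {"a": 1}, {"a": 1, "c": 3}]
blobs = [json.dumps(d, sort_keys=True).encode("utf-8") for d in docs]

# Byte-wise ordering is semantically arbitrary for JSON documents, but it
# is total and stable, which is all a column comparator actually needs.
ordered = sorted(blobs)
```

The order never changes across runs or input permutations, even though it says nothing useful about the documents themselves.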


On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:

> Some work I did stores JSON blobs in columns. The question on JSON
> type is how to sort it.
>
> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
>  wrote:
> > I don't speak for the project, but you might give it a day or two for
> people to respond and/or perhaps create a jira ticket.  Seems like that's a
> reasonable data type that would get some traction - a json type.  However,
> what would validation look like?  That's one of the main reasons there are
> the data types and validators, in order to validate on insert.
> >
> > On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
> >
> >> Any thoughts?  I'd like to submit a patch, but only if it will be
> accepted.
> >>
> >> Thanks,
> >> Ben
> >>
> >>
> >> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
> >>
> >>> Hi,
> >>>
> >>> I was wondering if it would be interesting to add some type of
> >>> document-oriented data type.
> >>>
> >>> I've found it somewhat awkward to store document-oriented data in
> >>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
> >>> store it, but Cassandra cannot differentiate it from any other string
> or
> >>> byte array.  However, if my column validation_class could be a JsonType
> >>> that would allow tools to potentially do more interesting
> introspection on
> >>> the column value.  E.g. bug 3647<
> https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for supporting
> arbitrarily nested "documents" in CQL.  Running a
> >>> query against the JSON column in Pig is possible as well, but again in
> this
> >>> use case it would be helpful to be able to encode in column metadata
> that
> >>> the column is stored as JSON.  For debugging, running nightly reports,
> etc.
> >>> it would be quite useful compared to the opaque string and byte array
> types
> >>> we have today.  JSON is appealing because it would be easy to
> implement.
> >>> Something like Thrift or Protocol Buffers would actually be interesting
> >>> since they would be more space efficient.  However, they would also be
> a
> >>> bit more difficult to implement because of the extra typing information
> >>> they provide.  I'm hoping with Cassandra 1.0's addition of compression
> that
> >>> storing JSON is not too inefficient.
> >>>
> >>> Would there be interest in adding a JsonType?  I could look at putting
> a
> >>> patch together.
> >>>
> >>> Thanks,
> >>> Ben
> >>>
> >>>
> >
>


Re: Document storage

2012-03-28 Thread Edward Capriolo
Some work I did stores JSON blobs in columns. The question on JSON
type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 wrote:
> I don't speak for the project, but you might give it a day or two for people 
> to respond and/or perhaps create a jira ticket.  Seems like that's a 
> reasonable data type that would get some traction - a json type.  However, 
> what would validation look like?  That's one of the main reasons there are 
> the data types and validators, in order to validate on insert.
>
> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>
>> Any thoughts?  I'd like to submit a patch, but only if it will be accepted.
>>
>> Thanks,
>> Ben
>>
>>
>> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
>>
>>> Hi,
>>>
>>> I was wondering if it would be interesting to add some type of
>>> document-oriented data type.
>>>
>>> I've found it somewhat awkward to store document-oriented data in
>>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>>> store it, but Cassandra cannot differentiate it from any other string or
>>> byte array.  However, if my column validation_class could be a JsonType
>>> that would allow tools to potentially do more interesting introspection on
>>> the column value.  E.g. bug 
>>> 3647 calls for 
>>> supporting arbitrarily nested "documents" in CQL.  Running a
>>> query against the JSON column in Pig is possible as well, but again in this
>>> use case it would be helpful to be able to encode in column metadata that
>>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>>> it would be quite useful compared to the opaque string and byte array types
>>> we have today.  JSON is appealing because it would be easy to implement.
>>> Something like Thrift or Protocol Buffers would actually be interesting
>>> since they would be more space efficient.  However, they would also be a
>>> bit more difficult to implement because of the extra typing information
>>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>>> storing JSON is not too inefficient.
>>>
>>> Would there be interest in adding a JsonType?  I could look at putting a
>>> patch together.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>


Re: Document storage

2012-03-28 Thread Tatu Saloranta
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan
 wrote:
> Sounds interesting to me.  I looked into adding protocol buffer support at 
> one point, and it didn't look like it would be too much work.  The tricky 
> part was I also wanted to add indexing support for attributes of the inserted 
> protocol buffers.  That looked a little trickier, but still not impossible.  
> Though other stuff came up and I never got around to actually writing any 
> code.
> JSON support would be nice, especially if you figured out how to get built in 
> indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for
Smile (binary JSON serialization):
http://wiki.fasterxml.com/SmileFormatSpec
since its logical data structure is pure JSON, with no extensions or
subsetting. The main Java implementation is by the Jackson project, but there
is also a C codec (https://github.com/pierre/libsmile) and prototypes
for PHP and Ruby bindings as well.
For all data it's a bit faster and a bit more compact: about 30% smaller for
individual items, and more (40-70%) for data sequences (thanks to
optional back-referencing).

JSON and Smile can be auto-detected from the first 4 bytes or so, reliably
and efficiently, so one should be able to add this either
transparently or explicitly.
One could even transcode things on the fly -- store as Smile, expose
filtered results as JSON (and accept JSON or both). This could reduce
storage cost while keeping the benefits of a flexible data format.
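[Editor's note: detection is cheap because Smile has a fixed header -- the bytes ':)\n' followed by a version byte -- whereas JSON text starts with a printable character like '{' or '['. A sketch of the auto-detection described above; `detect_format` is a hypothetical helper, not a Cassandra or Jackson API.]

```python
SMILE_HEADER = b":)\n"  # 0x3A 0x29 0x0A, then one version/feature byte

def detect_format(blob):
    """Classify a stored value as 'smile', 'json', or 'unknown' from its
    leading bytes, without parsing the whole document."""
    if blob.startswith(SMILE_HEADER):
        return "smile"
    head = blob.lstrip()[:1]
    # JSON values start with an object, array, string, literal, or number.
    if head in (b"{", b"[", b'"', b"t", b"f", b"n", b"-") or head.isdigit():
        return "json"
    return "unknown"
```

A transcoding layer would branch on this result: decode Smile with a Smile parser, pass JSON through, and reject anything else at validation time.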

-+ Tatu +-


Re: Document storage

2012-03-28 Thread Jeremiah Jordan
Sounds interesting to me.  I looked into adding protocol buffer support at one 
point, and it didn't look like it would be too much work.  The tricky part was 
I also wanted to add indexing support for attributes of the inserted protocol 
buffers.  That looked a little trickier, but still not impossible.  Though 
other stuff came up and I never got around to actually writing any code.
JSON support would be nice, especially if you figured out how to get built in 
indexing of the attributes inside the JSON to work =).

-Jeremiah

On Mar 28, 2012, at 10:58 AM, Ben McCann wrote:

> Hi,
> 
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
> 
> I've found it somewhat awkward to store document-oriented data in Cassandra
> today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
> Cassandra cannot differentiate it from any other string or byte array.
> However, if my column validation_class could be a JsonType that would
> allow tools to potentially do more interesting introspection on the column
> value.  E.g. bug 3647
> calls for
> supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
> Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
> 
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
> 
> Thanks,
> Ben



Re: Document storage

2012-03-28 Thread Jeremy Hanna
I don't speak for the project, but you might give it a day or two for people to 
respond and/or perhaps create a jira ticket.  Seems like that's a reasonable 
data type that would get some traction - a json type.  However, what would 
validation look like?  That's one of the main reasons there are the data types 
and validators, in order to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

> Any thoughts?  I'd like to submit a patch, but only if it will be accepted.
> 
> Thanks,
> Ben
> 
> 
> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
> 
>> Hi,
>> 
>> I was wondering if it would be interesting to add some type of
>> document-oriented data type.
>> 
>> I've found it somewhat awkward to store document-oriented data in
>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>> store it, but Cassandra cannot differentiate it from any other string or
>> byte array.  However, if my column validation_class could be a JsonType
>> that would allow tools to potentially do more interesting introspection on
>> the column value.  E.g. bug 
>> 3647 calls for 
>> supporting arbitrarily nested "documents" in CQL.  Running a
>> query against the JSON column in Pig is possible as well, but again in this
>> use case it would be helpful to be able to encode in column metadata that
>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>> it would be quite useful compared to the opaque string and byte array types
>> we have today.  JSON is appealing because it would be easy to implement.
>> Something like Thrift or Protocol Buffers would actually be interesting
>> since they would be more space efficient.  However, they would also be a
>> bit more difficult to implement because of the extra typing information
>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>> storing JSON is not too inefficient.
>> 
>> Would there be interest in adding a JsonType?  I could look at putting a
>> patch together.
>> 
>> Thanks,
>> Ben
>> 
>> 



Re: Document storage

2012-03-28 Thread Ben McCann
Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben


On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:

> Hi,
>
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
>
> I've found it somewhat awkward to store document-oriented data in
> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
> store it, but Cassandra cannot differentiate it from any other string or
> byte array.  However, if my column validation_class could be a JsonType
> that would allow tools to potentially do more interesting introspection on
> the column value.  E.g. bug 
> 3647 calls for 
> supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
>  Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
>
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
>
> Thanks,
> Ben
>
>