[ 
https://issues.apache.org/jira/browse/CASSANDRA-265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725654#action_12725654
 ] 

Jonathan Ellis commented on CASSANDRA-265:
------------------------------------------

After some more thought I came up with a straightforward (if clunky) way to 
support this in Thrift.

(In my defense I note that it's already de rigueur to wrap the Thrift "client" 
in something more idiomatic, and that [b]lob APIs for traditional databases 
bear more than a little resemblance.)

You would add these methods (actual names subject to bikeshedding):

begin_lob(key, columnPath, size, ts) returns thrift_lob_id
repeat until sum(byte.length) == size: 
    stream_lob(thrift_lob_id, byte[])
commit_lob(thrift_lob_id) throws Bad Stuff

These would map fairly directly to the StreamingMessage/LargeObjectCommand 
structures described above.
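
To make the call sequence concrete, here is a rough Python sketch of how a 
client wrapper might drive these three methods. Everything in it is an 
assumption for illustration: InMemoryLobServer is a toy stand-in (not 
Cassandra or Thrift code), LobError stands in for "Bad Stuff", and 
write_lob and its chunk_size parameter are hypothetical names.

```python
import itertools


class LobError(Exception):
    """Stands in for the "Bad Stuff" that commit_lob may throw."""


class InMemoryLobServer:
    """Toy stand-in for the server side; an assumption, not Cassandra code."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}   # lob id -> (key, column_path, size, ts, buffer)
        self.committed = {}  # (key, column_path) -> (ts, bytes)

    def begin_lob(self, key, column_path, size, ts):
        # Declare the total size up front and hand back an id for the stream.
        lob_id = next(self._ids)
        self._pending[lob_id] = (key, column_path, size, ts, bytearray())
        return lob_id

    def stream_lob(self, lob_id, chunk):
        # Append one chunk; refuse to go past the declared size.
        key, column_path, size, ts, buf = self._pending[lob_id]
        if len(buf) + len(chunk) > size:
            raise LobError("more bytes streamed than declared")
        buf.extend(chunk)

    def commit_lob(self, lob_id):
        # Explicit commit: only succeeds once exactly `size` bytes arrived.
        key, column_path, size, ts, buf = self._pending.pop(lob_id)
        if len(buf) != size:
            raise LobError("expected %d bytes, got %d" % (size, len(buf)))
        self.committed[(key, column_path)] = (ts, bytes(buf))


def write_lob(server, key, column_path, data, ts, chunk_size=4):
    """Hypothetical idiomatic wrapper: begin, stream in chunks, commit."""
    lob_id = server.begin_lob(key, column_path, len(data), ts)
    for start in range(0, len(data), chunk_size):
        server.stream_lob(lob_id, data[start:start + chunk_size])
    server.commit_lob(lob_id)
    return lob_id
```

A caller would then only see something like write_lob(server, "key1", 
"cf:col", payload, ts), with the three-step protocol hidden in the wrapper.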

Some opinions that I am not married to:

 - we don't need block_for since streaming adds enough latency already that we 
want to just assume block_for=N
 - having a separate commit_lob is less magic than having the final stream_lob 
behave a little differently from the others

> Large object support
> --------------------
>
>                 Key: CASSANDRA-265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-265
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>
> The standard answer since forever has been "cassandra is a bad fit for large 
> objects."
> But I think it doesn't have to be that way.  With a few simplifying 
> assumptions we can make this doable.
> First, screw Thrift.  There is no way to specify a stream of bytes 
> cross-platform.  You can't mix raw sockets into Thrift very easily (?) so 
> screw it.  Make it an internal-only API to start with, like the much-vaunted 
> and much-feared BinaryVerbHandler.
> Second, forget about writing multiple lobs at once.  You insert one lob at a 
> time, to a specific column.
> With Thrift out of the equation we are not out of the woods.  
> MessagingService also assumes that Messages will be memory resident and not 
> streamed.  One approach to fix this would be to have a StreamingMessage class 
> that consists of a message id (that would be paired w/ origination endpoint 
> to make it unique) and a size.  The VerbHandler would keep a Map of 
> incomplete StreamingMessages around until the full size was read.  Then they 
> could be disposed of.
> So a LargeObjectCommand would be basically just the command id and the 
> payload, the streamed lob.  And we would handle it by streaming it directly 
> to a file.  When the stream was complete, we would do a write to the standard 
> commitlog/memtable with a pointer to that lob file.  That would then be 
> flushed normally to the sstable.  (This would require adding another boolean 
> to Column serialization, whether the value is really a lob pointer.  We could 
> combine this with the existing bool into a single byte and have room for a 
> couple more flags, without taking extra space.)
> So lobs would never appear directly in the commitlog, and we would never have 
> to rewrite them multiple times during compaction; just the pointers would get 
> merged, but the lob files themselves would not have to be touched.  (Except 
> to remove them when a compaction shows that an older version is no longer 
> needed.)
> Then of course we'd need a corresponding ReadLargeObject command.  So the 
> basics are straightforward.
> Read Repair and Hinted Handoff would add a few more wrinkles but nothing 
> fundamentally challenging.
> Thoughts?
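
As a rough illustration of the flag-byte idea in the quoted description 
(folding the existing boolean and a new "value is a lob pointer" boolean into 
one byte), here is a minimal Python sketch. The flag names and bit positions 
are assumptions made up for the example, not the actual Column serialization 
format.

```python
# Hypothetical bit assignments; the real layout is not specified in the issue.
FLAG_EXISTING = 0x01     # placeholder for the boolean Column already stores
FLAG_LOB_POINTER = 0x02  # set when the value is a pointer to a lob file
# Bits 0x04 and up stay free for "a couple more flags".


def pack_flags(existing, is_lob_pointer):
    """Combine both booleans into a single serialized byte."""
    flags = 0
    if existing:
        flags |= FLAG_EXISTING
    if is_lob_pointer:
        flags |= FLAG_LOB_POINTER
    return bytes([flags])


def unpack_flags(raw):
    """Recover (existing, is_lob_pointer) from the serialized byte."""
    flags = raw[0]
    return bool(flags & FLAG_EXISTING), bool(flags & FLAG_LOB_POINTER)
```

The serialized size is the same one byte a lone boolean would take, which is 
the "without taking extra space" point made in the description.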

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
