[ https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028547#comment-13028547 ]

Matthew F. Dennis commented on CASSANDRA-1278:
----------------------------------------------

{quote}
bq. It was intentional as previously only the streaming was buffered (at 4k)
It was the other way around IIRC: (non-encrypted) streaming used 
channel.transferTo, which bypassed the buffering entirely. The buffering was 
for internode messaging: see CASSANDRA-1943.
{quote}

Yes, I see now; that was unintentional and has been corrected.

{quote}
bq. We could construct something that buffers up X amount of data and then 
frames the data being sent and change the inner loop to decompose that but it's 
extra complexity, code and overhead.
You're already buffering rows in StreamingProxyFlusher.bufferRow: the change 
would simply be to continue to buffer rows until a threshold was reached. The 
benefit here is that the code on the receiving side doesn't need to change when 
the proxy starts sending it a different SSTable version/format. I've never 
heard of somebody regretting having framing in a protocol: it's always the 
other way around.
{quote}

Yes, I understood what you were suggesting; that was precisely the extra 
buffering I was talking about.  Buffering more than one row on the client side 
means we keep larger buffers around and increase GC pressure, which is already 
a problem on the proxy because of thrift.  That said, I've changed the protocol 
to be framed, but the proxy still sends just one row at a time (each row in its 
own frame) to avoid the problems mentioned.  If we later want to change the 
proxy to buffer more, or to implement a different client, the server won't care.
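
For what it's worth, the framing is nothing fancy; a minimal sketch of the idea 
(the names here are illustrative, not code from the patch):

{code}
// sender: one serialized row per length-prefixed frame
private void sendRow(DataOutputStream out, byte[] serializedRow) throws IOException
{
    out.writeInt(serializedRow.length); // frame length prefix
    out.write(serializedRow);           // frame payload: exactly one row
}

// receiver: reads frames until the stream ends; it never needs to know
// how many rows the sender chose to buffer per frame
private byte[] readFrame(DataInputStream in) throws IOException
{
    int length = in.readInt();
    byte[] frame = new byte[length];
    in.readFully(frame);
    return frame;
}
{code}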

{quote}Also, an SSTable version (as usually held by Descriptor) should be added 
to the header of your protocol so that clients don't break by sending 
unversioned blobs: not having versioning is my primary complaint vis-a-vis 
BinaryMemtables.{quote}

Added, along with a protocol version, to the header.
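
Roughly, the header is written once per connection and looks something like 
this (a sketch only; the field names are illustrative and PROTOCOL_VERSION is 
an assumed constant):

{code}
// illustrative header layout: protocol version first, then the sstable
// format version so the receiver can reject blobs it doesn't understand
private void writeHeader(DataOutputStream out) throws IOException
{
    out.writeInt(PROTOCOL_VERSION);            // bulk-load protocol version
    out.writeUTF(Descriptor.CURRENT_VERSION);  // sstable format version
}
{code}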

{quote}
bq. If we buffer it on the other side we consume more memory for a longer 
period of time
I was talking about buffering on the client side: the server side can do one 
system call to flush to disk, such that it never enters userspace.
{quote}

I was too; it's primarily the buffering on the proxy side that is the problem.  
The goal is to get the data off the proxy as quickly as possible, and in 
practice that means one row at a time because of the serialization format (the 
size must be known before the entire row can be written).

{quote}
bq. If you're comparing to the streams we use for repair and similar, they 
require table names and byte ranges be known up front

We've had enough trouble debugging streaming when people use it all the time 
for repair. I shudder to think of the bugs we'll introduce to a second-class 
protocol that gets used slightly more often than BMT.
{quote}

That's because the streaming used for repair is complex and fragile: 
independent streams are tightly coupled in a session, sizes must be known up 
front, retries are complex and require out-of-band messaging between nodes, 
everything is "buffered" on disk before building of any indexes/filters starts, 
et cetera.  In comparison, the protocol used for loading is extremely simple; 
if it makes you feel better, we could add a CRC/MD5 to the stream.
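
Adding a checksum would only be a handful of lines; something like this sketch, 
using a CRC32 trailer over the framed stream:

{code}
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

// sketch: checksum everything we frame, then append the CRC as a trailer
// that the receiver can recompute and verify
private void sendWithChecksum(OutputStream socketOut, List<byte[]> frames) throws IOException
{
    CheckedOutputStream checked = new CheckedOutputStream(socketOut, new CRC32());
    DataOutputStream out = new DataOutputStream(checked);
    for (byte[] frame : frames)
    {
        out.writeInt(frame.length);
        out.write(frame);
    }
    out.flush();
    new DataOutputStream(socketOut).writeLong(checked.getChecksum().getValue());
}
{code}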

{quote}
Maybe we've been too clever here: why not just write out the full sstable on 
the client, and stream it over (indexes and all) so that

    * we move the [primary] index build off the server, which should give a 
nice performance boost
    * we have filenames and sizes ready to go so streaming will be happy

We're still talking about a minor change to streaming of recognizing that we're 
getting all the components and not just data, but that's something we can deal 
with at the StreamInSession level, I don't think we'll need to change the 
protocol itself.
{quote}

One of the main goals of the bulk loading was that no local/temp storage would 
be required on the client; that has been the plan from the beginning.  If you 
have something that generates full tables, indexes and filters, it makes more 
sense to generate them locally by using the SSTableWriter directly, push them 
to the box, and then use CASSANDRA-2438 to "add" them to the node.  Maybe we 
could add this as an option to the proxy to make it a bit easier to do, but it 
certainly isn't suitable as the only option.  If we want this, it should be a 
separate ticket, as it's separate functionality.  Overall though, I'm not 
really a fan of requiring temp space on the proxy.
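
To be concrete about the alternative I mean (an outline only; the SSTableWriter 
signatures here are approximations, not code from the patch):

{code}
// hypothetical "generate locally, then attach" workflow
SSTableWriter writer = new SSTableWriter(localPath, estimatedKeyCount);
for (Row row : rowsInTokenOrder)
    writer.append(row.key, row.cf);   // rows must be fed in token order
writer.closeAndOpenReader();          // finalizes data, index and filter
// then copy the -Data/-Index/-Filter components to the node and use
// CASSANDRA-2438 to attach them without a restart
{code}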

The problem I can see at the moment is that for large clusters this means 
either a lot of seeking on the proxy, since you need to generate one table for 
every replica set (with RF=3 on an N-node ring that's roughly N distinct 
replica sets), or a lot of repeated passes over the same data.  Even if you do 
this, or make it "very fast" (tm), it doesn't much matter: as you transfer 
small tables to nodes they will almost immediately be compacted, meaning the 
work saved by generating the indexes and filters is wasted, and it was only a 
small percentage of the overall work moved off the cluster anyway.  Compacting 
the tables on the clients before sending them would just make a questionable 
idea worse...

{quote}
bq. Maybe we've been too clever here: why not just write out the full sstable 
on the client, and stream it over (indexes and all) so that

As much as I want to merge the protocols, I'm not sure I like the limitations 
this puts on clients: being able to send a stream without needing local 
tempspace is very, very beneficial, IMO (for example, needing tempspace was by 
far the most annoying limitation of a Hadoop LuceneOutputFormat I worked on).
{quote}

Exactly; requiring temp space seems like an anti-feature to me.

{quote}
bq. If you're comparing to the streams we use for repair and similar, they 
require table names and byte ranges be known up front

bq. something we can deal with at the StreamInSession level, I don't think 
we'll need to change the protocol itself

With versioned messaging, changing the protocol is at least possible, if 
painful... my dream would be:

   1. Deprecate the file ranges in Streaming session objects, to be replaced 
with framing in the stream
   2. Move the Streaming session object to a header of the streaming connection 
(almost identical to LoaderStream)
   3. Deprecate the Messaging based setup and teardown for streaming sessions: 
a sender initiates a stream by opening a streaming connection, and tears it 
down with success codes after each file (again, like this protocol)
{quote}

The protocol is now versioned (as well as the table format) so this is possible 
(though certainly on a different ticket).  If we change the existing streaming 
to use this protocol I think we end up with something a lot less fragile and a 
lot less complex.

Essentially the sender is in control and keeps retrying until the receiver has 
the data; deprecate sessions altogether.  When node A wants to send things to 
node B, it records that fact in the system table.  For each entry it sends the 
file using the bulk loading protocol and keeps retrying until the file is 
accepted.  For each range it wants to send, it frames the entire range.  The 
only complex part is preventing removal of the SSTable on the source (node A) 
until it has been successfully streamed to the destination (node B).
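
In pseudocode, the sender side would look something like this (a sketch; every 
name here is made up):

{code}
// hypothetical sender loop: record durable intent first, then retry until acked
for (PendingTransfer transfer : SystemTable.getPendingTransfers())
{
    while (!transfer.isComplete())
    {
        try
        {
            sendFileFramed(transfer.destination, transfer.sstable);
            transfer.markComplete();   // receiver acked; safe to forget
        }
        catch (IOException e)
        {
            sleepBeforeRetry();
        }
    }
}
// the source sstable stays pinned until every transfer referencing it completes
{code}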

{quote}tl;dr: I'd prefer some slight adjustments to Matt's protocol (mentioned 
above) over requiring tempspace on the client.{quote}

ditto



> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.8.1
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 
> 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685.
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> It is just that this is something that many that are new to Cassandra need to 
> do - bulk load their data into Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
