Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "Streaming" page has been changed by NirmalRanganathan. The comment on this change is: revamped the streaming documentation to reflect the refactoring done in #1189. http://wiki.apache.org/cassandra/Streaming?action=diff&rev1=4&rev2=5 -------------------------------------------------- - When data needs to be moved from one node (the source) to another (the destination), the following steps occur: + There are two main instances of streaming: + * Transfer - Occurs when a Source pushes SSTables for certain ranges to a Destination. Initiated and controlled by the Source. + * Request - Occurs when a Destination requests a set of ranges from a Source. Initiated and controlled by the Destination. - 1. The destination sends a request to the source with the data ranges it desires + === Transfer === + The following steps occur for Stream Transfers. + 1. Source has a list of ranges it must transfer to another node. - 1. The source copies the data in those ranges to sstable files in preparation for streaming. This is called anti-compaction (because compaction merges multiple sstable files into one, and this does the opposite). + 1. Source copies the data in those ranges to sstable files in preparation for streaming. This is called anti-compaction (because compaction merges multiple sstable files into one, and this does the opposite). - 1. The source sends the list of files to be streamed to the destination, followed by the data + 1. Source builds a list of Pending``File's which contains information on each sstable to be transfered. + 1. Source starts streaming the first file from the list, followed by the log "Waiting for transfer to $some_node to complete". The header for the stream contains information on the streamed file for the Destination to interpret what to do with the incoming stream. + 1. Destination receives the file writes it to disk and sends a File``Status. + 1. On successful transfer the Source streams the next file until its done, on error it re-streams the same file. - Monitoring the status of streaming on both source and destination nodes can be found (in 0.6) under the `org.apache.cassandra.streaming.StreamingService` MBean. The `Status` attribute gives an easy indication of what a node is doing with respect to streaming. + === Request === + 1. Destination compiles a list of ranges it needs from another node. + 1. Destination sends a Stream``Request``Message to the Source node with the list of ranges. + 1. Source prepares the SSTables for those ranges and creates the Pending``File's. + 1. Source starts streaming the first file in the list. The header for the first stream contains info of the current stream and a list of the remaining Pending``File's that fall in the requested ranges. + 1. Destination receives the stream and writes it to disk, followed by the log message "Streaming added org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard1-e-1-Data.db')". + 1. Destination then takes the lead and requests the remaining files one at a time. If an error occurs it re-requests the same file, if not continues with the next file until done. + 1. Source streams each of the requested files. The files are already anti-compacted, so it just streams them to the Destination. - Step 2 is what takes the most time on most systems. The destination will be idle during this stage; to monitor anti-compaction progress, you should check the `org.apache.cassandra.db.CompactionManager` mbean on the source. + == Streaming Invocations == + Streaming in Cassandra between nodes is invoked in the following contexts: + 1. '''Bootstrapping''' - During bootstrap the node requests ranges from other nodes. It invokes ''Stream Request''. + 1. '''Repair''' - The Anti-Entropy service performs repairs by comparing Merkle trees and the final step in the process is to transfer conflicting ranges to other nodes and request conflicting changes from other nodes in that order. So both ''Request'' and ''Transfer'' are invoked here. + 1. '''Restore Replica''' - Storage``Service performs a ''Stream Request''. + 1. '''Un-bootstrap''' - During node decommission or move, un-bootstrap is invoked to transfer the nodes ranges to other nodes. ''Transfer'' is invoked in this case. - Once step 3 begins actual data transfer, the sending node will report a status of `"Waiting for transfer to $some_node to complete."` The receiving node will report `"Receiving stream"` while receiving stream data. The `StreamDestinations` and `StreamSources` attributes each contain a list of hosts that the current node is either sending stream data to or receiving it from. + == Monitoring == + Anti-compaction, both during request and transfer takes the most amount of time. It can be monitored using the `org.apache.cassandra.db.CompactionManager` mbean on the Source. - The operations `getOutgoingFiles(host)` and `getIncomingFiles(host)` each return a list of strings describing the status of individual files being streamed to and from a given host. Each string follows this format: `[path to file] [bytes sent/received]/[file size]` If you think that streaming is taking too long on your cluster, the first thing you should do is check `StreamSources` or `StreamDestinations` to figure out which hosts are streaming files. Use those hosts as inputs to `getOutgoingFiles()` or `getIncomingFiles()` to check on the status of individual files from the problematic source and destination nodes. Streaming is conducted in 32MB chunks, so you should refresh the file status after a few seconds to see if the sent/received values change. If they do not change, or change more slowly than you'd like, something is wrong. Keep in mind that a source node can only stream a single file at a time, but a destination node can simultaneously receive several files. + Monitoring the status of streaming on both Source and Destination nodes can be found under the `org.apache.cassandra.streaming.StreamingService` MBean. The `Status` attribute gives an easy indication of what a node is doing with respect to streaming. The operations `getOutgoingFiles(host)` and `getIncomingFiles(host)` each return a list of strings describing the status of individual files being streamed to and from a given host. Each string follows this format: `[path to file] [bytes sent/received]/[file size]` If you think that streaming is taking too long on your cluster, the first thing you should do is check `StreamSources` or `StreamDestinations` to figure out which hosts are streaming files. Use those hosts as inputs to `getOutgoingFiles()` or `getIncomingFiles()` to check on the status of individual files from the problematic source and destination nodes. Streaming is conducted in 32MB chunks, so you should refresh the file status after a few seconds to see if the sent/received values change. If they do not change, or change more slowly than you'd like, something is wrong. + The streaming status can also be monitored using {{{nodetool -h <hostname/IP> streams}}} +
