Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "ReadPathForUsers" page has been changed by MichaelEdge:
https://wiki.apache.org/cassandra/ReadPathForUsers?action=diff&rev1=1&rev2=2

== The Local Coordinator ==
The local coordinator receives the read request from the client and performs the following:
- 1. The local coordinator determines which node is responsible for storing the data; it does this using the Partitioner to hash the partition key
- 1. The local coordinator sends the read request to the replica node storing the data
+ 1. The local coordinator determines which nodes are responsible for storing the data:
+   * The first replica is chosen based on the Partitioner hashing the primary key
+   * Other replicas are chosen based on the replication strategy defined for the keyspace
+ 1. The local coordinator sends a read request to the fastest replica. All Cassandra snitches utilise the dynamic snitch to monitor read latency between nodes and maintain a list of the fastest responding replicas (or more accurately, the snitch calculates and maintains a ‘badness score’ per node, and read requests are routed to nodes with the lowest ‘badness’ score).
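A minimal sketch of that ‘fastest replica’ choice, assuming a hypothetical per-node badness score map rather than Cassandra’s actual dynamic snitch API:

{{{#!java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: route the full data read to the replica with the
// lowest dynamic-snitch "badness" score. Addresses and scores are made up.
public class FastestReplicaSketch {

    static String fastestReplica(Map<String, Double> badnessByReplica) {
        return badnessByReplica.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
    }

    public static void main(String[] args) {
        Map<String, Double> badness = new HashMap<>();
        badness.put("10.0.0.1", 0.7);   // slower responder
        badness.put("10.0.0.2", 0.1);   // currently the fastest responder
        badness.put("10.0.0.3", 0.4);
        System.out.println("Send full read to " + fastestReplica(badness));
    }
}
}}}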
== Replica Node ==
- The replica node receives the read request from the local coordinator and performs the following:
+ The replica node receives the read request from the local coordinator and checks whether the requested row exists in the row cache. The row cache is a read-through cache and will only contain the requested data if it was previously read.

=== Requested data in row cache ===
- 1. Check if the row cache contains the requested row. The row cache is a read-through cache and will only contain the requested data if it was previously read.
 1. If the row is in the row cache, return the data to the local coordinator. Since the row cache already contains fully merged data there is no need to check anywhere else for the data and the read request can now be considered complete.

=== Requested data not in row cache ===
The data must now be read from the SSTables and MemTable:

@@ -34, +35 @@

It might seem an unnecessary overhead to read data from both MemTable and SSTables if the data for a partition key exists in the MemTable. After all, doesn’t the MemTable contain the latest copy of the data for a partition key? It depends on the type of operation applied to the data. Consider the following examples:
 1. If a new row is inserted, this new row will exist in the MemTable and will only exist in an SSTable when the MemTable is flushed. Prior to the flush the bloom filters will indicate that the row does not exist in any SSTables, and the row will therefore be read from the MemTable only (assuming it has not previously been read and exists in the row cache).
- 1. After the MemTable containing the new row is flushed to disk, the data for this partition key exists in an SSTable only, and no longer exists in the MemTable. The flushed MemTable has been replaced by a new, empty MemTable.
+ 1. After the MemTable containing the new row is flushed to disk, the data for this partition key exists in an SSTable only, and no longer exists in the MemTable. The flushed MemTable has been replaced by a new, empty MemTable.
- 1. If the new row is now updated, the updated cells will exist in the MemTable. However, not every cell will exist in the MemTable, only those that were updated (the MemTable is never populated with the current values from the SSTables). To obtain a complete view of the data for this partition key both the MemTable and SSTable must be read.
+ 1. If the new row is now updated, the updated cells will exist in the MemTable. However, not every cell will exist in the MemTable, only those that were updated (the MemTable is never populated with the current values from the SSTables). To obtain a complete view of the data for this partition key both the MemTable and SSTable must be read.

A MemTable is therefore a write-back cache that temporarily stores a copy of data by partition key, prior to that data being flushed to durable storage in the form of SSTables on disk. Unlike a true write-back cache (where changed data exists in the cache only), the changed data is also made durable in a Commit Log so the MemTable can be rebuilt in the event of a failure.
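The cell-level merge described above can be pictured with a short sketch; the Cell type, column names and timestamps below are hypothetical stand-ins, not Cassandra internals. For each column, the value with the newest write timestamp wins, whether it came from the MemTable or from an SSTable:

{{{#!java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of merging a row read from several sources
// (SSTables plus the MemTable) cell by cell, newest timestamp wins.
public class CellMergeSketch {

    record Cell(Object value, long writeTimestampMicros) {}

    static Map<String, Cell> merge(List<Map<String, Cell>> sources) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> source : sources) {
            source.forEach((column, cell) ->
                merged.merge(column, cell, (oldCell, newCell) ->
                    newCell.writeTimestampMicros() > oldCell.writeTimestampMicros()
                        ? newCell : oldCell));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> fromSSTable = Map.of(
            "name",  new Cell("Alice", 1_000L),
            "email", new Cell("alice@old.example", 1_000L));
        Map<String, Cell> fromMemTable = Map.of(
            "email", new Cell("alice@new.example", 2_000L)); // only the updated cell
        // The complete row is the SSTable cells overridden by any newer MemTable cells.
        System.out.println(merge(List.of(fromSSTable, fromMemTable)));
    }
}
}}}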
== Consistency Level and the Read Path ==
Cassandra’s tunable consistency applies to read requests as well as writes. The consistency level determines the number of replica nodes that must respond before the results of a read request can be sent back to the client; by tuning the consistency level a user can determine whether a read request should return fully consistent data, or whether stale, eventually consistent data is acceptable. The diagram and explanation below describe how Cassandra responds to read requests where the consistency level is greater than ONE. For clarity this section does not consider multi data-centre read requests.

{{attachment:CassandraReadConsistencyLevel.png|Cassandra Read Consistency Level|width=800}}

- 1. The local coordinator determines which nodes are responsible for storing the data:
+ 1. The local coordinator determines which nodes are responsible for storing the data:
   * The first replica is chosen based on the Partitioner hashing the primary key
   * Other replicas are chosen based on the replication strategy defined for the keyspace
- 1. The local coordinator sends a read request to the fastest replica. All Cassandra snitches utilise the dynamic snitch to monitor read latency between nodes and maintain a list of the fastest responding replicas (or more accurately, the snitch calculates and maintains a ‘badness score’ per node, and read requests are routed to nodes with the lowest ‘badness’ score).
+ 1. The local coordinator sends a read request to the fastest replica.
- 1. The fastest replica performs a read according to the ‘read path’ described above.
+ 1. The fastest replica performs a read according to the ‘read path’ described above.
- 1. The local coordinator sends a ‘digest’ read request to the other replica nodes; these nodes calculate a hash of the data being requested and return this to the local coordinator.
+ 1. The local coordinator sends a ‘digest’ read request to the other replica nodes; these nodes calculate a hash of the data being requested and return this to the local coordinator.
- 1. The local coordinator compares the hashes from all replica nodes. If they match it indicates that all nodes contain exactly the same version of the data; in this case the data from the fastest replica is returned to the client.
+ 1. The local coordinator compares the hashes from all replica nodes. If they match it indicates that all nodes contain exactly the same version of the data; in this case the data from the fastest replica is returned to the client.
- 1. If the digests do not match then a conflict resolution process is necessary:
+ 1. If the digests do not match then a conflict resolution process is necessary:
   * Read data from all replica nodes (with the exception of the fastest replica, as this has already responded to a full read request) according to the ‘read path’ described above.
   * Merge the data cell by cell based on timestamp. For each cell, the data from all replicas is checked and the most recent timestamp wins.
   * Return the merged data to the client.
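A rough sketch of the digest comparison that drives this decision; the row serialisation is invented and MD5 merely stands in for whatever hash the replicas compute:

{{{#!java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Illustrative only: the coordinator compares per-replica digests and falls
// back to a full read plus timestamp-based merge when they disagree.
public class DigestReadSketch {

    static byte[] digestOf(String serialisedRow) throws Exception {
        return MessageDigest.getInstance("MD5")
                            .digest(serialisedRow.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String fromFastestReplica = "name=Alice@1000,email=alice@new.example@2000";
        String fromOtherReplica   = "name=Alice@1000,email=alice@old.example@1000";

        if (Arrays.equals(digestOf(fromFastestReplica), digestOf(fromOtherReplica))) {
            System.out.println("Digests match: return the fastest replica's data");
        } else {
            System.out.println("Digest mismatch: read all replicas and merge cell by cell, newest timestamp wins");
        }
    }
}
}}}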
