Re: question on maximum disk seeks

2017-03-21 Thread preetika tyagi
Oh I see. I understand it now. Thank you for the clarification!

Preetika

On Tue, Mar 21, 2017 at 11:07 AM, Jonathan Haddad  wrote:

> Each sstable has its own partition index, therefore it's never updated.


Re: question on maximum disk seeks

2017-03-21 Thread Jonathan Haddad
Each sstable has its own partition index, therefore it's never updated.
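
To illustrate the point above, here is a minimal Python sketch of a read that touches several SSTables. It is a toy model, not Cassandra's actual code: the names `ToySSTable`, `Record`, and `read` are hypothetical, and real Cassandra reconciles individual cells rather than whole records. The assumption it shows is that every SSTable carries its own key-to-offset index, so the same partition key can appear once in each index, and the read keeps the version with the newest timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Record:
    value: str
    timestamp: int  # write timestamp; the newest one wins on read


class ToySSTable:
    """Hypothetical immutable SSTable: a data area plus its own index."""

    def __init__(self, records: Dict[str, Record]) -> None:
        keys = sorted(records)
        # Data and index are built once and never modified afterwards.
        self._data = tuple(records[k] for k in keys)
        self._index = {k: offset for offset, k in enumerate(keys)}

    def lookup(self, key: str) -> Optional[Record]:
        # The index maps a partition key to an offset in THIS table only.
        offset = self._index.get(key)
        return None if offset is None else self._data[offset]


def read(key: str, sstables: Iterable[ToySSTable]) -> Optional[Record]:
    # Consult each table's own index, then keep the newest version found.
    found = [r for r in (t.lookup(key) for t in sstables) if r is not None]
    return max(found, key=lambda r: r.timestamp, default=None)


older = ToySSTable({"user:1": Record("alice@old.example", 100)})
newer = ToySSTable({
    "user:1": Record("alice@new.example", 200),
    "user:2": Record("bob@example", 150),
})
latest = read("user:1", [older, newer])
```

In this model no index ever holds the same key twice; the two versions of `user:1` live in two separate tables, each found through that table's own index.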

On Tue, Mar 21, 2017 at 11:04 AM preetika tyagi 
wrote:

> Yes, I understand that. However, what I'm trying to understand is the
> internal structure of the partition index. When a record associated with
> the same partition key is updated, we have two different records with
> different timestamps. These two records may end up in two different
> SSTables (at least until compaction merges them into one SSTable). What
> does the partition index look like in such a case? For the same key, we
> have two different records in different SSTables. How does the partition
> index store such information? Can it have repeated partition keys with
> different disk offsets pointing to different SSTables?


Re: question on maximum disk seeks

2017-03-21 Thread preetika tyagi
Yes, I understand that. However, what I'm trying to understand is the
internal structure of the partition index. When a record associated with
the same partition key is updated, we have two different records with
different timestamps. These two records may end up in two different
SSTables (at least until compaction merges them into one SSTable). What
does the partition index look like in such a case? For the same key, we
have two different records in different SSTables. How does the partition
index store such information? Can it have repeated partition keys with
different disk offsets pointing to different SSTables?

On Tue, Mar 21, 2017 at 10:09 AM, Jonathan Haddad  wrote:

> The partition index is never updated, as sstables are immutable.


Re: question on maximum disk seeks

2017-03-21 Thread Jonathan Haddad
The partition index is never updated, as sstables are immutable.
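
A small Python sketch of why an update never rewrites an existing index, under simplifying assumptions: writes land in an in-memory table, and each flush produces a brand-new read-only table with its own index. `ToyNode`, `write`, and `flush` are hypothetical names for illustration, not Cassandra APIs, and commit logs and compaction are left out.

```python
from types import MappingProxyType


class ToyNode:
    """Hypothetical node: one mutable memtable, many immutable SSTables."""

    def __init__(self) -> None:
        self.memtable = {}
        self.sstables = []  # list of (index, data) pairs, oldest first

    def write(self, key, value, timestamp) -> None:
        # Updates only ever touch the in-memory table.
        self.memtable[key] = (value, timestamp)

    def flush(self) -> None:
        # Build a new table and a new index; earlier ones are left untouched.
        keys = sorted(self.memtable)
        data = tuple(self.memtable[k] for k in keys)
        index = MappingProxyType({k: i for i, k in enumerate(keys)})
        self.sstables.append((index, data))
        self.memtable = {}


node = ToyNode()
node.write("k", "v1", 1)
node.flush()
node.write("k", "v2", 2)  # an update to the same partition key
node.flush()
```

After the second flush there are two tables, each with a single index entry for `k`; the first table and its index are exactly as they were written.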

On Tue, Mar 21, 2017 at 9:40 AM preetika tyagi 
wrote:

> Thank you Jan & Jeff for the responses. That was really useful.
>
> Jan - I have one follow-up question. When the data is spread over more
> than one SSTable because of updates, as you mentioned, we will need two
> seeks per SSTable (one for the partition index and another for the SSTable
> itself). I'm curious to know how the partition index is structured
> internally. I was assuming it to be a table with  pairs. If the same key
> is updated several times, how is that recorded in the partition index?
>
> Thanks,
> Preetika


Re: question on maximum disk seeks

2017-03-21 Thread preetika tyagi
Thank you Jan & Jeff for the responses. That was really useful.

Jan - I have one follow-up question. When the data is spread over more than
one SSTable because of updates, as you mentioned, we will need two seeks per
SSTable (one for the partition index and another for the SSTable itself).
I'm curious to know how the partition index is structured internally. I was
assuming it to be a table with  pairs. If the same key is updated several
times, how is that recorded in the partition index?

Thanks,
Preetika

On Mon, Mar 20, 2017 at 10:37 PM,  wrote:

> Hi,
>
> You're right: one seek with a hit in the partition key cache and two if not.
>
> That's the theory, but two things to mention:
>
> First, you need two seeks per sstable, not per entire read. So if your data
> is spread over multiple sstables on disk, you obviously need more than two
> reads. Think of frequently updated partition keys: in combination with
> memory pressure you can easily end up with many sstables (OK, they will be
> compacted at some point in the future).
>
> Second, there could be fragmentation on disk, which leads to seeks during
> sequential reads.
>
> Jan
>
> Sent from my Windows 10 Phone
>
> *From: *preetika tyagi
> *Sent: *Monday, 20 March 2017 21:18
> *To: *user@cassandra.apache.org
> *Subject: *question on maximum disk seeks
>
> I'm trying to understand the maximum number of disk seeks required in a
> read operation in Cassandra. I looked at several online articles including
> this one:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
>
> As per my understanding, two disk seeks are required in the worst case.
> One is for reading the partition index and another is to read the actual
> data from the compressed partition. The index of the data in compressed
> partitions is obtained from the compression offset tables (which is stored
> in memory). Am I on the right track here? Will there ever be a case when
> more than 1 disk seek is required to read the data?
>
> Thanks,
>
> Preetika


Re: question on maximum disk seeks

2017-03-20 Thread Jeff Jirsa


On 2017-03-20 13:17 (-0700), preetika tyagi  wrote: 
> I'm trying to understand the maximum number of disk seeks required in a
> read operation in Cassandra. I looked at several online articles including
> this one:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
> 
> As per my understanding, two disk seeks are required in the worst case. One
> is for reading the partition index and another is to read the actual data
> from the compressed partition. The index of the data in compressed
> partitions is obtained from the compression offset tables (which is stored
> in memory). Am I on the right track here? Will there ever be a case when
> more than 1 disk seek is required to read the data?
> 

That sounds right, but do note that it's PER SSTABLE in which the data is 
stored (or in which there's a bloom filter false positive).
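
The per-SSTable rule discussed in this thread can be written down as a rough back-of-the-envelope estimate. The Python sketch below is only a model under stated assumptions: the function name `estimate_seeks` and its parameters are hypothetical, a key cache hit is assumed to save the index seek for that SSTable, a bloom filter false positive is assumed to cost one index seek and no data seek, and disk fragmentation is ignored.

```python
def estimate_seeks(sstables_with_key: int,
                   key_cache_hits: int = 0,
                   bloom_false_positives: int = 0) -> int:
    """Rough seek estimate for one partition read (toy model)."""
    if min(sstables_with_key, key_cache_hits, bloom_false_positives) < 0:
        raise ValueError("counts must be non-negative")
    if key_cache_hits > sstables_with_key:
        raise ValueError("cannot have more cache hits than SSTables with the key")

    uncached = sstables_with_key - key_cache_hits
    return (
        2 * uncached              # index seek + data seek
        + key_cache_hits          # data seek only
        + bloom_false_positives   # assumed: index seek, then nothing found
    )


single_cold = estimate_seeks(1)                    # one SSTable, no cache hit
single_warm = estimate_seeks(1, key_cache_hits=1)  # one SSTable, cache hit
spread_out = estimate_seeks(3, key_cache_hits=1, bloom_false_positives=2)
```

With one SSTable this gives the two-seek (or one-seek, with a key cache hit) figure from the original question; the total grows with every additional SSTable that holds, or appears to hold, the partition.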