[jira] [Issue Comment Edited] (CASSANDRA-3150) ColumnFormatRecordReader loops forever

Mck SembWever (JIRA) Sat, 10 Sep 2011 11:05:33 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102062#comment-13102062 ]


Mck SembWever edited comment on CASSANDRA-3150 at 9/10/11 6:04 PM:
-------------------------------------------------------------------

Debug output from a task that was still running at 1200%.

The initial split for this CFRR is 
30303030303031333131313739353337303038d4e7f72db2ed11e09d7c68b59973a5d8 : 
303030303030313331323631393735313231381778518cc00711e0acb968b59973a5d8
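
For readers who want to see what these two hex tokens contain, here is a minimal sketch that decodes them. It assumes (this is inferred from the bytes, not stated in the comment) that each key is an ASCII prefix followed by a 16-byte UUID. The name split_key is hypothetical, and reading the prefix as a millisecond timestamp is only a guess.

```python
import uuid

# The two split tokens exactly as quoted above.
START = "30303030303031333131313739353337303038d4e7f72db2ed11e09d7c68b59973a5d8"
END = "303030303030313331323631393735313231381778518cc00711e0acb968b59973a5d8"


def split_key(hex_key, suffix_len=16):
    """Split a hex-encoded key into (ASCII prefix, UUID suffix).

    Assumption: the last 16 bytes are a UUID and everything before
    them is printable ASCII. This layout is inferred, not documented.
    """
    raw = bytes.fromhex(hex_key)
    prefix = raw[:-suffix_len].decode("ascii")
    suffix = uuid.UUID(bytes=raw[-suffix_len:])
    return prefix, suffix


start_prefix, start_uuid = split_key(START)
end_prefix, end_uuid = split_key(END)

print(start_prefix, start_uuid)
print(end_prefix, end_uuid)

# Only meaningful if the prefixes really are millisecond timestamps (a guess).
print(int(end_prefix) - int(start_prefix))
```

Under that guess, the split would cover a bit more than two weeks of keys, which would fit a split holding far more rows than the configured split size.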

This job was run with 
 cassandra.input.split.size=196608
 cassandra.range.batch.size=16000

Therefore there shouldn't be more than 13 calls to get_range_slices(..) in this 
task (196608 / 16000 is about 12.3, rounded up). There were already 166 calls 
in this log.
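
The expected call count can be checked with a short sketch. It assumes each get_range_slices(..) call returns at most one full batch and that the split really holds about split-size rows. The function name expected_batches is hypothetical.

```python
import math


def expected_batches(split_size, batch_size):
    """Upper bound on paging calls if the split holds split_size rows."""
    return math.ceil(split_size / batch_size)


SPLIT_SIZE = 196608    # cassandra.input.split.size
BATCH_SIZE = 16000     # cassandra.range.batch.size
OBSERVED_CALLS = 166   # calls counted in the attached log

expected = expected_batches(SPLIT_SIZE, BATCH_SIZE)
max_rows_seen = OBSERVED_CALLS * BATCH_SIZE

print(expected)                    # expected number of calls
print(max_rows_seen)               # rows read so far, if every batch was full
print(max_rows_seen / SPLIT_SIZE)  # how far past the nominal split size
```

If every batch was full, the task had already read more than thirteen times the nominal split size when the log was captured.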


What I can see here is that the original split for this task is just way too 
big, and this comes from {{describe_splits(..)}}, 
which in turn depends on "index_interval". Reading 
{{StorageService.getSplits(..)}}, I would guess that the split can in fact 
contain many more keys with the default sampling of 128. The question is how 
low can/should I bring index_interval (this CF can have up to 8 billion rows)?
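
To get a feel for the trade-off, here is a rough back-of-the-envelope sketch. It assumes exactly one sampled key per index_interval rows, and it ignores how rows are spread across nodes and SSTables, so the numbers are only orders of magnitude. The function names are hypothetical and this is not the getSplits(..) implementation.

```python
def sampled_keys(total_rows, index_interval):
    """Approximate number of in-memory index samples for total_rows rows."""
    return total_rows // index_interval


def samples_per_split(split_size, index_interval):
    """Approximate number of samples that make up one split of split_size rows."""
    return split_size // index_interval


TOTAL_ROWS = 8_000_000_000  # upper bound mentioned for this column family
SPLIT_SIZE = 196608         # cassandra.input.split.size

for interval in (128, 64, 32, 16):
    print(interval,
          sampled_keys(TOTAL_ROWS, interval),
          samples_per_split(SPLIT_SIZE, interval))
```

Halving index_interval doubles the number of samples to hold, so finer split boundaries are paid for in memory.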

      was (Author: michaelsembwever):
    Debug from a task that was still running at 1200%

The initial split for this CFRR is 
30303030303031333131313739353337303038d4e7f72db2ed11e09d7c68b59973a5d8 : 
303030303030313331323631393735313231381778518cc00711e0acb968b59973a5d8

This job was run with 
 cassandra.input.split.size=196608
 cassandra.range.batch.size=16000

therefore there shouldn't be more than 13 calls to get_range_slices(..) in this 
task. There was already 166 calls in this log.


What i can see here is that the original split for this task is just way too 
big and this comes from {{describe_splits(..)}}
which in turn depends on "index_interval". Reading 
{{StorageService.getSplits(..)}} i would guess that the split can in fact 
contain many more keys with the default sampling of 128. Question is how low 
can/should i bring index_interval ?
  
> ColumnFormatRecordReader loops forever
> --------------------------------------
>
>                 Key: CASSANDRA-3150
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3150
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.8.4
>            Reporter: Mck SembWever
>            Assignee: Mck SembWever
>            Priority: Critical
>         Attachments: CASSANDRA-3150.patch, 
> attempt_201109071357_0044_m_003040_0.grep-get_range_slices.log
>
>
> From http://thread.gmane.org/gmane.comp.db.cassandra.user/20039
> {quote}
> bq. Cassandra-0.8.4 w/ ByteOrderedPartitioner
> bq. CFIF's inputSplitSize=196608
> bq. 3 map tasks (from 4013) is still running after read 25 million rows.
> bq. Can this be a bug in StorageService.getSplits(..) ?
> getSplits looks pretty foolproof to me but I guess we'd need to add
> more debug logging to rule out a bug there for sure.
> I guess the main alternative would be a bug in the recordreader paging.
> {quote}
