[ https://issues.apache.org/jira/browse/CASSANDRA-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930645#comment-13930645 ]
Jonathan Ellis edited comment on CASSANDRA-6793 at 3/11/14 5:50 PM:
--------------------------------------------------------------------
I confess that I'm mystified by the schema introduced in CASSANDRA-4421:
{noformat}
/**
* This counts the occurrences of words in ColumnFamily
* cql3_worldcount ( user_id text,
* category_id text,
* sub_category_id text,
* title text,
* body text,
* PRIMARY KEY (user_id, category_id, sub_category_id))
*
* For each word, we output the total number of occurrences across all body texts.
*
* When outputting to Cassandra, we write the word counts to column family
* output_words ( row_id1 text,
* row_id2 text,
* word text,
* count_num text,
* PRIMARY KEY ((row_id1, row_id2), word))
* as a {word, count} to columns: word, count_num with a row key of "word sum"
*/
{noformat}
Both the input and output tables look far more complex than necessary.
My preferred solution would be to just strip the output down to
{{(word text primary key, count int)}}, and make a similar simplification for
the input.
Can you shed any light [~alexliu68]?
was (Author: jbellis):
I confess that I'm mystified by the schema introduced in CASSANDRA-4421:
{noformat}
/**
* This counts the occurrences of words in ColumnFamily
* cql3_worldcount ( user_id text,
* category_id text,
* sub_category_id text,
* title text,
* body text,
* PRIMARY KEY (user_id, category_id, sub_category_id))
*
* For each word, we output the total number of occurrences across all body texts.
*
* When outputting to Cassandra, we write the word counts to column family
* output_words ( row_id1 text,
* row_id2 text,
* word text,
* count_num text,
* PRIMARY KEY ((row_id1, row_id2), word))
* as a {word, count} to columns: word, count_num with a row key of "word sum"
*/
{noformat}
Both the input and output tables look far more complex than necessary.
My preferred solution would be to just strip the output down to {(word text
primary key, count int)}, and make a similar simplification for the input.
Can you shed any light [~alexliu68]?
> NPE in Hadoop Word count example
> --------------------------------
>
> Key: CASSANDRA-6793
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6793
> Project: Cassandra
> Issue Type: Bug
> Components: Examples
> Reporter: Chander S Pechetty
> Assignee: Chander S Pechetty
> Priority: Minor
> Labels: hadoop
> Attachments: trunk-6793.txt
>
>
> The partition keys requested in WordCount.java do not match the primary key
> set up in the table output_words. It looks like this patch was not merged
> properly from
> [CASSANDRA-5622|https://issues.apache.org/jira/browse/CASSANDRA-5622]. The
> attached patch addresses the NPE and uses the correct keys defined in #5622.
> I am assuming there is no need to handle the NPE itself, for example by
> throwing an InvalidRequestException back to the user so they can fix the
> partition keys, as it would be trivial to get them from the TableMetadata
> using the driver API.
> java.lang.NullPointerException
> at org.apache.cassandra.dht.Murmur3Partitioner.getToken(Murmur3Partitioner.java:92)
> at org.apache.cassandra.dht.Murmur3Partitioner.getToken(Murmur3Partitioner.java:40)
> at org.apache.cassandra.client.RingCache.getRange(RingCache.java:117)
> at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.write(CqlRecordWriter.java:163)
> at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.write(CqlRecordWriter.java:63)
> at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587)
> at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at WordCount$ReducerToCassandra.reduce(Unknown Source)
> at WordCount$ReducerToCassandra.reduce(Unknown Source)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
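
The key mismatch described in the issue can be illustrated with a small, self-contained sketch. It is not the Cassandra CqlRecordWriter code: the class PartitionKeyCheck and its methods are hypothetical, and the only detail taken from this thread is the output_words primary key ((row_id1, row_id2), word). It shows how a lookup by partition key column name yields null when the job supplies keys under other names, and what an explicit check (which the reporter chose not to add) might look like.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the Cassandra code base.
final class PartitionKeyCheck
{
    // Partition key columns of output_words, per PRIMARY KEY ((row_id1, row_id2), word).
    static final List<String> PARTITION_KEYS = Arrays.asList("row_id1", "row_id2");

    // Returns the partition key values in table order. A missing column is
    // reported by name instead of surfacing later as a NullPointerException.
    static List<ByteBuffer> partitionKeyValues(Map<String, ByteBuffer> supplied)
    {
        List<ByteBuffer> values = new ArrayList<>();
        for (String column : PARTITION_KEYS)
        {
            ByteBuffer value = supplied.get(column); // null when the name does not match
            if (value == null)
                throw new IllegalArgumentException("missing partition key column: " + column);
            values.add(value);
        }
        return values;
    }

    // Helper: encode a string as UTF-8 bytes.
    static ByteBuffer text(String s)
    {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args)
    {
        Map<String, ByteBuffer> matching = new LinkedHashMap<>();
        matching.put("row_id1", text("word"));
        matching.put("row_id2", text("sum"));
        System.out.println(partitionKeyValues(matching).size());

        Map<String, ByteBuffer> mismatched = new LinkedHashMap<>();
        mismatched.put("word", text("hello")); // not a partition key column
        try
        {
            partitionKeyValues(mismatched);
        }
        catch (IllegalArgumentException e)
        {
            System.out.println(e.getMessage());
        }
    }
}
```

With the simplified output table suggested in the comment above, the list of partition key columns would presumably shrink to the single column word.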
--
This message was sent by Atlassian JIRA
(v6.2#6252)