[jira] [Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Michael Kjellman (JIRA) Fri, 21 Sep 2012 12:45:10 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770 ]


Michael Kjellman edited comment on CASSANDRA-4208 at 9/22/12 6:45 AM:
----------------------------------------------------------------------

This affects both ColumnFamilyOutputFormat and BulkOutputFormat: 
addNamedOutput never seems to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1,
    ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2,
    ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with 
job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat 
throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If I do specify that at the job level, the job name never seems to set the 
column family name on that job.

Additionally, using the job name as the column family name is slightly 
inconvenient: we use '_' in our column family names, which is not a valid 
character in MultipleOutputs, as it looks like _# is how it internally 
keeps track of counters when they are enabled.
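
One possible way around the '_' restriction, sketched in plain Java with no 
Hadoop or Cassandra dependency. It assumes named outputs may contain only 
letters and digits. NamedOutputNames and all of its methods are hypothetical 
names made up for illustration; the output format would still have to 
translate the alias back to the real column family, which is the part that 
does not seem to happen today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Hadoop or Cassandra. */
final class NamedOutputNames {
    // Alias (safe to use as a named output) -> real column family name.
    private final Map<String, String> aliasToColumnFamily = new LinkedHashMap<>();

    /** Assumed rule: a named output must be non-empty and contain only letters and digits. */
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char ch : name.toCharArray()) {
            boolean letter = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
            boolean digit = ch >= '0' && ch <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    /** Strips disallowed characters and remembers which column family the alias stands for. */
    String register(String columnFamily) {
        String alias = columnFamily.replaceAll("[^A-Za-z0-9]", "");
        if (alias.isEmpty()) {
            throw new IllegalArgumentException("No usable characters in: " + columnFamily);
        }
        String existing = aliasToColumnFamily.putIfAbsent(alias, columnFamily);
        if (existing != null && !existing.equals(columnFamily)) {
            // e.g. "user_index" and "userindex" would both map to "userindex"
            throw new IllegalArgumentException(
                "Alias " + alias + " already stands for " + existing);
        }
        return alias;
    }

    /** Looks up the real column family name for a registered alias. */
    String columnFamilyFor(String alias) {
        String columnFamily = aliasToColumnFamily.get(alias);
        if (columnFamily == null) {
            throw new IllegalArgumentException("Unknown named output: " + alias);
        }
        return columnFamily;
    }

    public static void main(String[] args) {
        NamedOutputNames names = new NamedOutputNames();
        String alias = names.register("user_index");
        System.out.println(alias + " -> " + names.columnFamilyFor(alias));
    }
}
```

The alias would be what gets passed to addNamedOutput, and the lookup would 
run wherever the column family is finally needed. Collisions are rejected 
rather than silently merged, since two column families sharing one alias 
would end up writing to the same place.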

I would love to see the patch you are proposing to fix the issue for 
BulkOutputFormat :)
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, 
> cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (e.g. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.
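
The idea in the issue description, choosing the column family per write() 
rather than once per job, can be illustrated with a toy writer. This is only 
a sketch under loud assumptions: PerWriteColumnFamilyWriter is a hypothetical 
stand-in that buffers strings in memory, not the real ColumnFamilyRecordWriter 
and not the attached patch, and the column family names are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a record writer; hypothetical, in-memory only. */
final class PerWriteColumnFamilyWriter {
    // One buffer of pending "rowKey=value" entries per column family.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    /** The column family is an argument of each write, not a job-wide setting. */
    void write(String columnFamily, String rowKey, String value) {
        pending.computeIfAbsent(columnFamily, cf -> new ArrayList<>())
               .add(rowKey + "=" + value);
    }

    /** Returns what has been buffered for one column family (empty if none). */
    List<String> pendingFor(String columnFamily) {
        return pending.getOrDefault(columnFamily, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        PerWriteColumnFamilyWriter writer = new PerWriteColumnFamilyWriter();
        // A single "reducer" inserts a value and updates its index together.
        writer.write("users", "alice", "alice@example.com");
        writer.write("users_by_email", "alice@example.com", "alice");
        System.out.println(writer.pendingFor("users"));
        System.out.println(writer.pendingFor("users_by_email"));
    }
}
```

Because the target is named on every call, one reducer can update a value 
and its index in the same pass, which is the case the description calls out.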

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
