[
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770
]
Michael Kjellman edited comment on CASSANDRA-4208 at 9/22/12 6:45 AM:
----------------------------------------------------------------------
Both ColumnFamilyOutputFormat and BulkOutputFormat. In either case,
addNamedOutput never seems to set the column family.
I would assume that
    ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1,
        ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2,
        ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
is all that is needed. If I don't set up the job with
job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat
throws an exception:
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)
If I do specify that at the job level, the job name never seems to set the
column family name on that job.
Additionally, using the job name as the column family name is slightly
inconvenient: we use '_' in our column family names, which is not a valid
character in MultipleOutputs, as it looks like _# is how it internally
keeps track of counters when they are enabled.
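One possible workaround for the '_' restriction, sketched here in plain Java
with no Hadoop or Cassandra dependencies: escape each disallowed character into
letters and digits before passing the name to addNamedOutput, and reverse the
escaping wherever the column family name is needed. The class NamedOutputNames,
its methods, and the 'Z' escape scheme are hypothetical and not part of any
patch on this ticket. The sketch assumes MultipleOutputs accepts only ASCII
letters and digits in a named output.

```java
// Hypothetical helper: maps column family names to names made only of
// ASCII letters and digits, and back again.
final class NamedOutputNames {
    // 'Z' introduces an escape, so a literal 'Z' must be escaped too.
    private static final char ESCAPE = 'Z';

    private NamedOutputNames() {}

    private static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /** True if the name is non-empty and contains only ASCII letters and digits. */
    public static boolean isValid(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            if (!isAsciiAlnum(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Replaces '_' (and any other disallowed char) with 'Z' plus four hex digits. */
    public static String encode(String columnFamily) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columnFamily.length(); i++) {
            char c = columnFamily.charAt(i);
            if (isAsciiAlnum(c) && c != ESCAPE) {
                sb.append(c);
            } else {
                sb.append(ESCAPE).append(String.format("%04X", (int) c));
            }
        }
        return sb.toString();
    }

    /** Reverses encode(): every 'Z' is followed by exactly four hex digits. */
    public static String decode(String namedOutput) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < namedOutput.length()) {
            char c = namedOutput.charAt(i);
            if (c == ESCAPE) {
                sb.append((char) Integer.parseInt(namedOutput.substring(i + 1, i + 5), 16));
                i += 5;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = encode("user_index");
        System.out.println(name + " valid=" + isValid(name) + " decoded=" + decode(name));
    }
}
```

Running main prints "userZ005Findex valid=true decoded=user_index". The
escaping is reversible, so two different column family names never collide on
the same named output, which simply stripping the '_' would not guarantee.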
I would love to see the patch you are proposing to fix the issue for
BulkOutputFormat :)
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
> Issue Type: Improvement
> Components: Hadoop
> Affects Versions: 1.1.0
> Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt,
> cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family
> in a single reducer. Considering that writing values to Cassandra often
> involves multiple column families (e.g. updating your index when you insert a
> new value), this seems overly restrictive. I am submitting a patch that
> moves the specification of column family from the job configuration to the
> write() call in ColumnFamilyRecordWriter.