[ 
https://issues.apache.org/jira/browse/CASSANDRA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932094#action_12932094
 ] 

Jacob Perkins commented on CASSANDRA-1737:
------------------------------------------

I wouldn't argue that it is better. Instead, I've examined the use cases that we 
at Infochimps have (lots of data in different shapes and sizes) and rewritten the 
example to make a couple of generic use cases as painless as possible. Those are:

a. Inserting a flat table with column names (a huge number of datasets look 
like this)
b. Inserting records where the column names are the fields themselves (helps 
with 'graph shaped' datasets)
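
As a rough illustration of these two shapes, here is a minimal Java sketch. It 
is not the code in the attachment: the class name, the method names, the 
tab-separated input, and the choice of an empty value for case b are all 
assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the attached loader.
class RowShapes {

    // Case a: a flat record whose fields line up with a fixed list of column
    // names. keyIndex marks the field used as the row key (an assumption of
    // this sketch); every other field becomes columnName -> value.
    static Map<String, String> flatTable(String line, String[] columnNames, int keyIndex) {
        String[] fields = line.split("\t", -1);
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.length && i < fields.length; i++) {
            if (i != keyIndex) {
                columns.put(columnNames[i], fields[i]);
            }
        }
        return columns;
    }

    // Case b: the first field is the row key and every remaining field is
    // itself a column name. The empty value is an assumption of this sketch.
    // Example: a graph node followed by the ids of its neighbours.
    static Map<String, String> fieldsAsColumnNames(String line) {
        String[] fields = line.split("\t", -1);
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            columns.put(fields[i], "");
        }
        return columns;
    }

    // Row key for either shape: the field at keyIndex.
    static String rowKey(String line, int keyIndex) {
        return line.split("\t", -1)[keyIndex];
    }
}
```

In this sketch, a line such as "42<TAB>jane<TAB>TX" with column names id, name, 
state and key index 0 yields row key "42" with columns name and state, while 
"nodeA<TAB>nodeB<TAB>nodeC" under case b yields row key "nodeA" with columns 
named nodeB and nodeC.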

In my experience, if you're already using Hadoop, rearranging your data to fit 
one of the two generic structures is more or less trivial. If your data is more 
complex, or cannot be fully expressed in one of these two structures, then 
you'll be forced to write custom code (as you would have had to do anyway).

As for why this is different:

0. In general, data rearrangement and other preprocessing should be decoupled 
from the database loading itself.

1. This does not require a reduce step. That means, if your data is already 
arranged as it needs to be for insertion (a reasonable requirement, I think), 
you can skip the costly overhead of a partition, copy, and sort on the Hadoop 
side of things. Fewer moving parts, fewer things to fail.

2. Implements the Hadoop tool runner, allowing you to pass in generic '-D' 
options. These include the path to cassandra.yaml, the type of insertion, the 
row key field, the super column name field (if any), and the column names, as 
well as Hadoop options such as the min split size.

3. Uses code from AbstractCassandraDaemon.java to initialize the internal node.

4. Two types of bulk loading are supported (the two cases listed above).

5. Simple Ruby runner for a clean interface.
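
To make the '-D' options from item 2 concrete, here is a small Java sketch of 
how key=value pairs given on the command line can be separated from the 
remaining arguments. It is a stand-in written for illustration, not Hadoop's 
own option parsing and not the code in the attachment. The class name and the 
"example." property names are hypothetical; mapred.min.split.size is used only 
as an example of a Hadoop setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for generic option handling; illustration only.
class DashDOptions {
    // Properties collected from "-D key=value" and "-Dkey=value" arguments.
    final Map<String, String> properties = new LinkedHashMap<>();
    // Everything else, in order (for example input paths).
    final List<String> remaining = new ArrayList<>();

    DashDOptions(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);          // not a -D option
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                properties.put(pair, "");    // no '=': keep key with empty value
            } else {
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }
}
```

With arguments such as -D example.rowkey.field=0 -Dmapred.min.split.size=1024 
/data/in, the sketch collects the two properties and leaves /data/in as the 
only remaining argument.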


I'll submit the changes as a patch as soon as I am able.

> Simplify bulk loading using the bmt_example
> -------------------------------------------
>
>                 Key: CASSANDRA-1737
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1737
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.7 beta 2
>            Reporter: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.7.0
>
>         Attachments: cassandra_bulk_loader.tar.bz2
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Current bmt_example does not work as given with 0.7. Make it work and 
> possibly easier to use. Also, it should not require a reduce, especially to 
> insert a flat table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
