Github user haosdent commented on the pull request:

    https://github.com/apache/spark/pull/194#issuecomment-39661707
  
    @marmbrus @pwendell I picked this issue up again today. After reading the 
sources related to `SchemaRDD`, I think the better approach is to provide both 
`saveAsHBaseTable(rdd: RDD[Text], ...)` and `saveAsHBaseTable(rdd: SchemaRDD, 
...)`. HBase is quite different from an RDBMS. `SchemaRDD` assumes every cell 
in a `Row` has a `name` and a `dataType`. That assumption is fine for Hive or 
Parquet, but for HBase it loses some important parts. In HBase, all data is 
stored as `Array[Byte]` and carries no `dataType`. Moreover, every cell in 
HBase has a row key (like an index in an RDBMS), a qualifier (like the `name` 
above), and a column family. The column family cannot be represented in a 
`SchemaRDD`.
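    To make the mismatch concrete, here is a minimal, self-contained Scala sketch (the `HBaseCell` type is hypothetical, for illustration only, not part of this PR): an HBase cell is addressed by four byte-array coordinates, while a `SchemaRDD` column only carries a `name` and a `dataType`, so the family has nowhere to go.

```scala
// Hypothetical model of an HBase cell, for illustration only.
// Every coordinate is raw bytes; there is no dataType anywhere.
case class HBaseCell(
    rowKey: Array[Byte],    // like an index/primary key in an RDBMS
    family: Array[Byte],    // no counterpart in SchemaRDD's (name, dataType)
    qualifier: Array[Byte], // roughly the column `name`
    value: Array[Byte])     // untyped bytes

object CellDemo {
  def main(args: Array[String]): Unit = {
    val cell = HBaseCell(
      "row1".getBytes("UTF-8"),
      "cf".getBytes("UTF-8"),
      "age".getBytes("UTF-8"),
      "30".getBytes("UTF-8"))
    // The family must be supplied out of band when writing from a SchemaRDD.
    println(new String(cell.family, "UTF-8")) // prints "cf"
  }
}
```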
    
    So for users with specific requirements around column families, we could 
provide `saveAsHBaseTable(rdd: RDD[Text], ...)` and document how to use it. 
That gives users maximum flexibility with HBase. On the other hand, 
`saveAsHBaseTable(rdd: SchemaRDD, ...)` is also necessary for users who have 
only one column family. We could set a fixed column family in the 
initialization of `SparkHBaseWriter` to work around the problem above.
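    A sketch of the fixed-family workaround (the helper name `toCells` and the string-typed signature are assumptions for illustration, not the PR's actual API): once a single column family is fixed at writer initialization, every `SchemaRDD` column name maps cleanly to a `(family, qualifier)` pair.

```scala
// Sketch only: map one logical SchemaRDD-style row onto HBase coordinates
// under a single column family fixed at writer initialization.
object FixedFamilyDemo {
  // Returns (rowKey, family, qualifier, value) for each column.
  def toCells(
      rowKey: String,
      family: String, // fixed once, e.g. in SparkHBaseWriter's constructor
      columns: Seq[(String, String)]): Seq[(String, String, String, String)] =
    columns.map { case (name, value) => (rowKey, family, name, value) }

  def main(args: Array[String]): Unit = {
    val cells = toCells("row1", "cf", Seq("age" -> "30", "city" -> "SF"))
    cells.foreach(println)
  }
}
```

    The `RDD[Text]` overload would skip this mapping entirely and let the user emit fully qualified cells themselves.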
