[
https://issues.apache.org/jira/browse/HADOOP-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jiang hehui updated HADOOP-13304:
---------------------------------
Description:
In Hadoop, HDFS is responsible for storage and MapReduce is responsible for
computation.
My idea is to store the data in a distributed database and to compute over it
in a MapReduce-like way.
h2. how does it work?
* insert:
using two-phase commit, route the row according to the split policy and
execute the insert on the owning nodes
* delete:
using two-phase commit, route by the split policy and execute the delete on
the owning nodes
* update:
using two-phase commit, according to the split policy: if the record's node
does not change, execute the update in place; if the node changes, first
delete the old value on the source node, then insert the new value on the
destination node
* select:
** a simple select (data residing on a single node, no fusion across nodes
needed) runs just as on a standalone database server;
** a complex select (distinct, group by, order by, subquery, or join across
multiple nodes) is what we call a job
{panel}
{color:red}A job is parsed into stages; stages have lineage, and all the
stages of a job form a DAG (Directed Acyclic Graph). Every stage consists of
a mapsql, a shuffle, and a reducesql.
When a SQL request arrives, the system uses the metadata to generate an
execution plan containing the DAG, with the mapsql, shuffle, and reducesql of
each stage; it then executes the plan and returns the result to the client.
The analogy with Spark: an RDD corresponds to a table, a job to a job. The
analogy with MapReduce in Hadoop: mapsql corresponds to map, shuffle to
shuffle, and reducesql to reduce.{color}
{panel}
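A minimal sketch of the two-phase-commit write path described above, assuming a hypothetical hash split policy; the `Node` class and node names are illustrative, not part of the proposal.

```python
# Minimal 2PC sketch: the coordinator routes a row by a split policy,
# asks the owning node to prepare, then commits or aborts.

class Node:
    """A store node that stages writes, then commits or aborts them."""
    def __init__(self, name):
        self.name = name
        self.data = {}     # committed rows
        self.staged = {}   # rows staged by prepare()

    def prepare(self, key, row):
        self.staged[key] = row
        return True        # vote yes; a real node could vote no

    def commit(self):
        self.data.update(self.staged)
        self.staged.clear()

    def abort(self):
        self.staged.clear()

def split_policy(key, nodes):
    """Hypothetical hash split policy: pick the node owning this key."""
    return nodes[hash(key) % len(nodes)]

def insert(nodes, key, row):
    """Phase 1: prepare on the owning node; phase 2: commit or abort."""
    node = split_policy(key, nodes)
    if node.prepare(key, row):
        node.commit()
        return True
    node.abort()
    return False

nodes = [Node("n0"), Node("n1"), Node("n2")]
insert(nodes, 42, {"u_id": 42, "u_name": "alice"})
```

Delete and update would follow the same prepare/commit shape; a cross-node update is a delete on the source node plus an insert on the destination node inside one transaction.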
h2. architecture:
!http://images2015.cnblogs.com/blog/439702/201606/439702-20160621124133334-32823985.png!
* client : the user interface
* master : coordinates the system, like the namenode in HDFS
* meta database : holds the base information about the system, the nodes, the
tables, and so on
* store node : database node where the data is stored; insert, delete, and
update always execute here
* calculate node : where selects execute; the source data lives on the store
nodes and the remaining tasks run on the calculate nodes. In practice a
calculate node may be the same machine as a store node.
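A rough sketch of the request path through the components above. All class and method names here are hypothetical; they only illustrate the roles of the master and the meta database.

```python
# The client sends SQL to the master; the master consults the meta
# database to find where the table lives and produces a plan.

class MetaDatabase:
    """Holds the base information: which store nodes hold which tables."""
    def __init__(self):
        self.tables = {"tab_user_info": ["store0", "store1"]}

    def locate(self, table):
        return self.tables[table]

class Master:
    """Like the namenode in HDFS: plans the work, never touches the data."""
    def __init__(self, meta):
        self.meta = meta

    def plan(self, table, sql):
        # one (node, sql) pair per store node that holds the table
        return [(node, sql) for node in self.meta.locate(table)]

master = Master(MetaDatabase())
plan = master.plan("tab_user_info", "select age from tab_user_info")
```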
h2. example:
{panel}
select age,count(u_id) v from tab_user_info t where u_reg_dt>=? and u_reg_dt<=?
group by age
execution plan may be:
stage0:
    mapsql:
        select age,count(u_id) v from tab_user_info t where u_reg_dt>=?
        and u_reg_dt<=? group by age
    shuffle:
        shuffle by age with a range policy:
        if the number of reduce nodes is N, each node receives roughly
        (max(age)-min(age))/N of the age range; reduce nodes are numbered,
        and nodes with smaller ids hold the smaller ranges of age, so each
        node can run the group by locally
    reducesql:
        select age,sum(v) v from t group by age
note:
the group by must be executed again on the reduce nodes, because partial
counts coming from different mapsql outputs have to be aggregated
{panel}
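The range shuffle in the plan above can be sketched as follows; the age boundaries (18 to 60), N=2, and the sample map outputs are invented for illustration.

```python
from collections import defaultdict

def range_partition(key, lo, hi, n_reducers):
    """Map a key in [lo, hi] to a reduce node id in [0, n_reducers)."""
    width = (hi - lo) / n_reducers
    return min(int((key - lo) / width), n_reducers - 1)

# partial (age, count) pairs produced by two mapsql executions
map_out_a = [(20, 3), (35, 1), (50, 2)]
map_out_b = [(20, 1), (35, 4)]

# shuffle: identical ages always land in the same bucket,
# and smaller node ids receive the smaller age ranges
N = 2
buckets = {i: [] for i in range(N)}
for age, v in map_out_a + map_out_b:
    buckets[range_partition(age, 18, 60, N)].append((age, v))

# reducesql per node: select age,sum(v) v from t group by age
result = {}
for rows in buckets.values():
    agg = defaultdict(int)
    for age, v in rows:
        agg[age] += v
    result.update(agg)
```

Because each age value can only reach one reduce node, summing the partial counts per node yields the correct global group by.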
{panel}
select t1.u_id,t1.u_name,t2.login_product
from tab_user_info t1 join tab_login_info t2
on (t1.u_id=t2.u_id and t1.u_reg_dt>=? and t1.u_reg_dt<=?)
execution plan may be:
stage0:
    mapsql:
        select u_id,u_name from tab_user_info t where u_reg_dt>=? and
        u_reg_dt<=? ;
        select u_id, login_product from tab_login_info t ;
    shuffle:
        shuffle by u_id with a range policy:
        if the number of reduce nodes is N, each node receives roughly
        (max(u_id)-min(u_id))/N of the u_id range; reduce nodes are numbered,
        and nodes with smaller ids hold the smaller ranges of u_id, so each
        node can run the join locally
    reducesql:
        select t1.u_id,t1.u_name,t2.login_product
        from tab_user_info t1 join tab_login_info t2
        on (t1.u_id=t2.u_id)
note:
because of the join, each record must be tagged with its source table so
that the reduce side can tell which table a record came from
{panel}
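The tagging mentioned in the note can be sketched like this: every shuffled record carries the name of its source table, so the reduce side can separate the two inputs before joining them. The sample rows are invented; field names follow the example query.

```python
from collections import defaultdict

# mapsql outputs, each record tagged with its source table
tagged = [
    ("t1", {"u_id": 1, "u_name": "alice"}),
    ("t1", {"u_id": 2, "u_name": "bob"}),
    ("t2", {"u_id": 1, "login_product": "web"}),
    ("t2", {"u_id": 1, "login_product": "mobile"}),
]

# shuffle by u_id: all records with the same u_id reach the same reducer
by_uid = defaultdict(lambda: {"t1": [], "t2": []})
for tag, row in tagged:
    by_uid[row["u_id"]][tag].append(row)

# reducesql: join the two sides within each u_id group
joined = []
for uid, sides in by_uid.items():
    for left in sides["t1"]:
        for right in sides["t2"]:
            joined.append((uid, left["u_name"], right["login_product"]))
```

Note that u_id 2 has no tab_login_info rows, so the inner join correctly drops it.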
> distributed database for store , mapreduce for compute
> ------------------------------------------------------
>
> Key: HADOOP-13304
> URL: https://issues.apache.org/jira/browse/HADOOP-13304
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs
> Affects Versions: 2.6.4
> Reporter: jiang hehui
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)