[
https://issues.apache.org/jira/browse/HADOOP-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jiang hehui updated HADOOP-13304:
---------------------------------
Description:
In Hadoop, HDFS is responsible for storage and MapReduce is responsible for
computation.
My idea is to store the data in a distributed database and to compute over it
in a MapReduce-like way.
h2. how does it work?
* insert:
using two-phase commit, route the row according to the split policy and
execute the insert on the owning nodes
* delete:
using two-phase commit, route by the split policy and execute the delete on
the owning nodes
* update:
using two-phase commit, according to the split policy: if the record's node
does not change, execute the update in place; if the node changes, first
delete the old value on the source node, then insert the new value on the
destination node
* select:
** a simple select (data residing on a single node, no fusion across nodes
needed) runs just as on a standalone database server;
** a complex select (distinct, group by, order by, subquery, or join across
multiple nodes) is what we call a job
{panel}
{color:red}A job is parsed into stages; stages have lineage, and all the
stages of a job form a DAG (Directed Acyclic Graph). Every stage consists of
a mapsql, a shuffle, and a reducesql.
When a SQL request arrives, the system uses the metadata to generate an
execution plan containing the DAG, with the mapsql, shuffle, and reducesql of
each stage; it then executes the plan and returns the result to the client.
The analogy with Spark: an RDD corresponds to a table, a job to a job. The
analogy with MapReduce in Hadoop: mapsql corresponds to map, shuffle to
shuffle, and reducesql to reduce.{color}
{panel}
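A minimal sketch of the two-phase-commit write path described above, assuming a hypothetical hash split policy; the `Node` class and node names are illustrative, not part of the proposal.

```python
# Minimal 2PC sketch: the coordinator routes a row by a split policy,
# asks the owning node to prepare, then commits or aborts.

class Node:
    """A store node that stages writes, then commits or aborts them."""
    def __init__(self, name):
        self.name = name
        self.data = {}     # committed rows
        self.staged = {}   # rows staged by prepare()

    def prepare(self, key, row):
        self.staged[key] = row
        return True        # vote yes; a real node could vote no

    def commit(self):
        self.data.update(self.staged)
        self.staged.clear()

    def abort(self):
        self.staged.clear()

def split_policy(key, nodes):
    """Hypothetical hash split policy: pick the node owning this key."""
    return nodes[hash(key) % len(nodes)]

def insert(nodes, key, row):
    """Phase 1: prepare on the owning node; phase 2: commit or abort."""
    node = split_policy(key, nodes)
    if node.prepare(key, row):
        node.commit()
        return True
    node.abort()
    return False

nodes = [Node("n0"), Node("n1"), Node("n2")]
insert(nodes, 42, {"u_id": 42, "u_name": "alice"})
```

Delete and update would follow the same prepare/commit shape; a cross-node update is a delete on the source node plus an insert on the destination node inside one transaction.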
h2. architecture:
!http://images2015.cnblogs.com/blog/439702/201606/439702-20160621124133334-32823985.png!
* client : the user interface
* master : coordinates the system, like the namenode in HDFS
* meta database : holds the base information about the system, the nodes, the
tables, and so on
* store node : database node where the data is stored; insert, delete, and
update always execute here
* calculate node : where selects execute; the source data lives on the store
nodes and the remaining tasks run on the calculate nodes. In practice a
calculate node may be the same machine as a store node.
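A rough sketch of the request path through the components above. All class and method names here are hypothetical; they only illustrate the roles of the master and the meta database.

```python
# The client sends SQL to the master; the master consults the meta
# database to find where the table lives and produces a plan.

class MetaDatabase:
    """Holds the base information: which store nodes hold which tables."""
    def __init__(self):
        self.tables = {"tab_user_info": ["store0", "store1"]}

    def locate(self, table):
        return self.tables[table]

class Master:
    """Like the namenode in HDFS: plans the work, never touches the data."""
    def __init__(self, meta):
        self.meta = meta

    def plan(self, table, sql):
        # one (node, sql) pair per store node that holds the table
        return [(node, sql) for node in self.meta.locate(table)]

master = Master(MetaDatabase())
plan = master.plan("tab_user_info", "select age from tab_user_info")
```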
h2. example:
{panel}
select age,count(u_id) v from tab_user_info t where u_reg_dt>=? and u_reg_dt<=?
group by age
execution plan may be:
stage0:
    mapsql:
        select age,count(u_id) v from tab_user_info t where u_reg_dt>=?
        and u_reg_dt<=? group by age
    shuffle:
        shuffle by age with a range policy:
        if the number of reduce nodes is N, each node receives roughly
        (max(age)-min(age))/N of the age range; reduce nodes are numbered,
        and nodes with smaller ids hold the smaller ranges of age, so each
        node can run the group by locally
    reducesql:
        select age,sum(v) v from t group by age
note:
the group by must be executed again on the reduce nodes, because partial
counts coming from different mapsql outputs have to be aggregated
{panel}
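The range shuffle in the plan above can be sketched as follows; the age boundaries (18 to 60), N=2, and the sample map outputs are invented for illustration.

```python
from collections import defaultdict

def range_partition(key, lo, hi, n_reducers):
    """Map a key in [lo, hi] to a reduce node id in [0, n_reducers)."""
    width = (hi - lo) / n_reducers
    return min(int((key - lo) / width), n_reducers - 1)

# partial (age, count) pairs produced by two mapsql executions
map_out_a = [(20, 3), (35, 1), (50, 2)]
map_out_b = [(20, 1), (35, 4)]

# shuffle: identical ages always land in the same bucket,
# and smaller node ids receive the smaller age ranges
N = 2
buckets = {i: [] for i in range(N)}
for age, v in map_out_a + map_out_b:
    buckets[range_partition(age, 18, 60, N)].append((age, v))

# reducesql per node: select age,sum(v) v from t group by age
result = {}
for rows in buckets.values():
    agg = defaultdict(int)
    for age, v in rows:
        agg[age] += v
    result.update(agg)
```

Because each age value can only reach one reduce node, summing the partial counts per node yields the correct global group by.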
{panel}
select t1.u_id,t1.u_name,t2.login_product
from tab_user_info t1 join tab_login_info t2
on (t1.u_id=t2.u_id and t1.u_reg_dt>=? and t1.u_reg_dt<=?)
execution plan may be:
stage0:
    mapsql:
        select u_id,u_name from tab_user_info t where u_reg_dt>=? and
        u_reg_dt<=? ;
        select u_id, login_product from tab_login_info t ;
    shuffle:
        shuffle by u_id with a range policy:
        if the number of reduce nodes is N, each node receives roughly
        (max(u_id)-min(u_id))/N of the u_id range; reduce nodes are numbered,
        and nodes with smaller ids hold the smaller ranges of u_id, so each
        node can run the join locally
    reducesql:
        select t1.u_id,t1.u_name,t2.login_product
        from tab_user_info t1 join tab_login_info t2
        on (t1.u_id=t2.u_id)
note:
because of the join, each record must be tagged with its source table so
that the reduce side can tell which table a record came from
{panel}
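The tagging mentioned in the note can be sketched like this: every shuffled record carries the name of its source table, so the reduce side can separate the two inputs before joining them. The sample rows are invented; field names follow the example query.

```python
from collections import defaultdict

# mapsql outputs, each record tagged with its source table
tagged = [
    ("t1", {"u_id": 1, "u_name": "alice"}),
    ("t1", {"u_id": 2, "u_name": "bob"}),
    ("t2", {"u_id": 1, "login_product": "web"}),
    ("t2", {"u_id": 1, "login_product": "mobile"}),
]

# shuffle by u_id: all records with the same u_id reach the same reducer
by_uid = defaultdict(lambda: {"t1": [], "t2": []})
for tag, row in tagged:
    by_uid[row["u_id"]][tag].append(row)

# reducesql: join the two sides within each u_id group
joined = []
for uid, sides in by_uid.items():
    for left in sides["t1"]:
        for right in sides["t2"]:
            joined.append((uid, left["u_name"], right["login_product"]))
```

Note that u_id 2 has no tab_login_info rows, so the inner join correctly drops it.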
> distributed database for store , mapreduce for compute
> ------------------------------------------------------
>
> Key: HADOOP-13304
> URL: https://issues.apache.org/jira/browse/HADOOP-13304
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs
> Affects Versions: 2.6.4
> Reporter: jiang hehui
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)