[ https://issues.apache.org/jira/browse/HBASE-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Khandelwal updated HBASE-18872:
--------------------------------------
    Description: 
I did a simple experiment: I loaded ~200 million rows into table 1 and nothing 
into table 2. The test was done on a local cluster, with approximately 3-4 
containers running in parallel. The focus of the test was not on how much time 
the backup takes overall, but on the time spent on the table where no data has 
changed.

*Table without data (table 2) -->*
Elapsed:        44mins, 52sec
Average Map Time        3sec
Average Shuffle Time    2mins, 35sec
Average Merge Time      0sec
Average Reduce Time     0sec
Map : 2052
Reduce : 1

*Table with data (table 1) -->*
Elapsed:        1hrs, 44mins, 10sec
Average Map Time        4sec
Average Shuffle Time    37sec
Average Merge Time      3sec
Average Reduce Time     47sec
Map : 2052
Reduce : 64

All of the above numbers are from a single-node cluster, so not many mappers 
run in parallel. But let's extrapolate to a 20-node cluster with ~100 tables 
and roughly 2000 WALs of data to be backed up. If each of the 20 nodes can run 
3 containers, 60 WALs are scanned in parallel. Assuming ~3 sec is spent on each 
WAL, that is 2000 x 3 = 6000 sec of WAL scanning per table, or 6000 / 60 = 100 
sec per table with the 60 containers --> 10,000 sec for all 100 tables --> 
~166 mins (~2.7 hrs) spent only on filtering. This does not seem to scale. 
(These are just rough numbers from a basic test.) All parsing is 
O(m (WALs) * n (tables)), because every table's incremental backup re-reads 
every WAL.
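
To make the extrapolation easy to check, here is a tiny back-of-envelope model 
(illustrative only; the inputs are the rough assumptions above, not measured 
values):

{code:java}
// Back-of-envelope model of why the current per-table filtering is
// O(WALs x tables): every table's incremental backup re-scans every WAL.
public class BackupFilteringEstimate {

  static long filteringSeconds(long tables, long wals,
                               long parallelContainers, long secondsPerWal) {
    long perTableSeconds = wals * secondsPerWal / parallelContainers;
    return perTableSeconds * tables; // the full WAL scan repeats for every table
  }

  public static void main(String[] args) {
    // Assumptions from the paragraph above: 100 tables, 2000 WALs,
    // 20 nodes x 3 containers = 60 parallel scanners, ~3 sec per WAL.
    long total = filteringSeconds(100, 2000, 60, 3);
    System.out.println(total + " sec ~= " + (total / 60) + " min");
    // Prints: 10000 sec ~= 166 min, i.e. roughly 2.7 hours just for filtering.
  }
}
{code}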

The main intent of this test is to show that even the backup of a table with 
very little churn can take a significant amount of time just for filtering the 
data. As the number of tables or the amount of data increases, this does not 
seem scalable.

Even on our current cluster I can see numbers easily close to 100 tables, 200 
million rows, and 200-300 GB.

I would suggest that the filtering parse the WALs only once, segregating the 
entries into multiple per-table WALs, and then build HFiles from those 
per-table WALs (just a rough idea; see the sketch below).
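
To make the idea concrete, here is a minimal sketch of such a single pass (not 
the actual backup code; the input abstraction is an assumption -- in a real 
implementation the job would read WALs via WALInputFormat/WALKey/WALEdit, as 
WALPlayer does). Each WAL is read once and its edits are routed into a 
per-table output directory, so the scanning cost stops multiplying with the 
number of tables:

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch only: one pass over the WALs, splitting edits by table. The WAL
// record is abstracted here as a (tableName, serializedEdit) pair of Text.
public class WalSplitByTableMapper
    extends Mapper<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(Text tableName, Text serializedEdit, Context context)
      throws IOException, InterruptedException {
    // Route every edit to <outputDir>/<tableName>/part-*, so each table's
    // edits end up together and can later be turned into HFiles per table,
    // instead of re-scanning all WALs once per table.
    mos.write(NullWritable.get(), serializedEdit, tableName.toString() + "/part");
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}
{code}

The per-table directories produced by one such pass could then feed the 
existing per-table HFile conversion / bulk load, which is the "HFiles from 
per-table WALs" part of the idea.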



> Backup scaling for multiple table and millions of row
> -----------------------------------------------------
>
>                 Key: HBASE-18872
>                 URL: https://issues.apache.org/jira/browse/HBASE-18872
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Vishal Khandelwal
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
