[
https://issues.apache.org/jira/browse/HBASE-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Rodionov reassigned HBASE-18872:
-----------------------------------------
Assignee: Vladimir Rodionov
> Backup scaling for multiple table and millions of row
> -----------------------------------------------------
>
> Key: HBASE-18872
> URL: https://issues.apache.org/jira/browse/HBASE-18872
> Project: HBase
> Issue Type: Improvement
> Reporter: Vishal Khandelwal
> Assignee: Vladimir Rodionov
>
> I ran a simple experiment: I loaded ~200 million rows into table 1 and
> nothing into table 2. The test was done on a local cluster with approx. 3-4
> containers running in parallel. The focus of the test was not on how
> long the backup takes overall but on the time spent on the table where no
> data had changed.
> *Table without Data -->*
> Elapsed: 44mins, 52sec
> Average Map Time 3sec
> Average Shuffle Time 2mins, 35sec
> Average Merge Time 0sec
> Average Reduce Time 0sec
> Map : 2052
> Reduce : 1
> *Table with Data -->*
> Elapsed: 1hrs, 44mins, 10sec
> Average Map Time 4sec
> Average Shuffle Time 37sec
> Average Merge Time 3sec
> Average Reduce Time 47sec
> Map : 2052
> Reduce : 64
> All of the above numbers are from a single-node cluster, so not many mappers
> run in parallel. But let's extrapolate this to a 20-node cluster with ~100
> tables and varying data sizes to back up, for approx. 2000 WALs. Let us say
> each of the 20 nodes can run 3 containers, i.e. 60 WALs are processed in
> parallel. Assume 3 sec are spent on each WAL, i.e. 2000 * 3 = 6000 sec of
> work / 60 --> 100 sec per table --> 10000 sec for all 100 tables.
> ~166 mins --> ~2.7 hrs only for filtering. This does not seem to scale.
> (These are just rough numbers from a basic test.) All parsing is O(m
> (WALs) * n (tables)).
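
The extrapolation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic under the assumptions stated in the description (2000 WALs, 100 tables, 3 sec per WAL, 60 WALs in parallel); the function name `filtering_time_seconds` is made up for illustration and is not part of HBase.

```python
def filtering_time_seconds(num_wals, num_tables, secs_per_wal, parallel_containers):
    """Rough wall-clock estimate when every table's job scans every WAL.

    The work is O(num_wals * num_tables); only the per-table pass over the
    WALs is spread across the parallel containers.
    """
    # One full pass over all WALs for a single table.
    per_table = num_wals * secs_per_wal / parallel_containers
    # The pass is repeated once per table.
    return per_table * num_tables


per_table = filtering_time_seconds(2000, 1, 3, 60)
total = filtering_time_seconds(2000, 100, 3, 60)
print(f"per table:  {per_table:.0f} s")
print(f"all tables: {total:.0f} s = {total / 60:.1f} min = {total / 3600:.2f} h")
```

With these inputs the per-table pass is 100 sec and the total is 10000 sec, i.e. roughly 2.7 to 2.8 hours spent on filtering alone.
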
> The main intent of this test is to show that even the backup of a table with
> very little churn may take a good amount of time just for filtering the data.
> As the number of tables or the amount of data increases, this does not seem
> scalable.
> Even on our current cluster I can see numbers easily close to 100 tables,
> 200 million rows, 200-300 GB.
> I would suggest that the filtering parse the WALs once and segregate the
> entries into multiple WALs, one per table --> HFiles from the per-table WALs.
> (Just a rough idea.)
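
To make the difference between the two approaches concrete, here is a toy sketch in Python. It does not use any HBase API: a "WAL" is modelled as a plain list of hypothetical `(table, row)` tuples, and the functions `per_table_scan` and `single_pass_split` are invented for illustration. It only counts how many WAL entries each strategy has to read.

```python
from collections import defaultdict


def per_table_scan(wals, tables):
    """Each table's job scans every WAL and keeps only its own entries."""
    entries_read = 0
    out = {t: [] for t in tables}
    for t in tables:
        for wal in wals:
            for table, row in wal:
                entries_read += 1  # read even if the entry is then discarded
                if table == t:
                    out[t].append(row)
    return out, entries_read


def single_pass_split(wals):
    """Read every WAL once and route each entry to its table's bucket."""
    entries_read = 0
    out = defaultdict(list)
    for wal in wals:
        for table, row in wal:
            entries_read += 1
            out[table].append(row)
    return dict(out), entries_read


# Toy input: two WALs, two tables with edits, one idle table with none.
wals = [
    [("t1", "r1"), ("t1", "r2")],
    [("t1", "r3"), ("t2", "r4")],
]
tables = ["t1", "t2", "idle"]

scanned, reads_per_table = per_table_scan(wals, tables)
split, reads_single = single_pass_split(wals)
print(reads_per_table, reads_single)
```

In this toy input the per-table strategy reads 12 entries (4 entries times 3 tables), including a full scan for the idle table that yields nothing, while the single pass reads 4. A real implementation would also have to deal with output file handling and ordering, which this sketch ignores.
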
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)