[jira] [Created] (HBASE-20943) Add offline/online region count into metrics
Tianying Chang created HBASE-20943:
-----------------------------------

Summary: Add offline/online region count into metrics
Key: HBASE-20943
URL: https://issues.apache.org/jira/browse/HBASE-20943
Project: HBase
Issue Type: Improvement
Components: metrics
Affects Versions: 1.2.6.1, 2.0.0
Reporter: Tianying Chang

We use metrics intensively to monitor the health of our HBase production clusters. We have seen some regions of a table get stuck and fail to come back online because an AWS issue corrupted some log files. It would be good to catch this early. Although the web UI has this information, it is not usable for automated monitoring. By adding this metric, we can easily watch it from our monitoring system.
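A minimal sketch of the shape such a metric could take, assuming region state transitions are observable from the master; RegionCountGauge and RegionState are illustrative names, not actual HBase API:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RegionCountGauge {
    enum RegionState { ONLINE, OFFLINE }

    private final Map<String, RegionState> regionStates = new ConcurrentHashMap<>();

    // Called whenever a region transitions state.
    void onRegionTransition(String regionName, RegionState newState) {
        regionStates.put(regionName, newState);
    }

    // Values a metrics reporter would poll and emit, e.g. to OpenTSDB.
    long onlineRegionCount() {
        return regionStates.values().stream().filter(s -> s == RegionState.ONLINE).count();
    }

    long offlineRegionCount() {
        return regionStates.values().stream().filter(s -> s == RegionState.OFFLINE).count();
    }
}
{code}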
[jira] [Created] (HBASE-17453) add Ping into HBase server for deprecated GetProtocolVersion
Tianying Chang created HBASE-17453:
-----------------------------------

Summary: add Ping into HBase server for deprecated GetProtocolVersion
Key: HBASE-17453
URL: https://issues.apache.org/jira/browse/HBASE-17453
Project: HBase
Issue Type: Improvement
Components: regionserver
Affects Versions: 1.2.2
Reporter: Tianying Chang
Assignee: Tianying Chang
Priority: Minor

Our HBase service is hosted in AWS. We have seen cases where the connection between the client (Asynchbase in our case) and the server stops working without throwing any exception, so traffic gets stuck. We therefore added a "Ping" feature in AsyncHBase 1.5 that uses the GetProtocolVersion() API provided on the RS side: if there is no traffic for a given time, we send a "Ping"; if no response comes back, we assume the connection is bad and reconnect.

Now we are upgrading our clusters from 94 to 1.2, but GetProtocolVersion() is deprecated. To keep the same detect/reconnect feature, we added Ping() in our internal HBase 1.2 branch and patched Asynchbase 1.7 accordingly. We would like to open source this feature since it is useful in AWS environments, where a connection can silently enter a state in which it no longer works.
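A hedged sketch of the client-side detect/reconnect loop described above; Connection, ping(), and the timeout values are placeholder abstractions, not the Asynchbase or HBase RPC API:

{code:java}
import java.util.concurrent.*;

public class ConnectionKeepalive {
    interface Connection {
        long lastActivityMillis();
        CompletableFuture<Void> ping();   // lightweight no-op RPC
        void reconnect();
    }

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    void watch(Connection conn, long idleTimeoutMs, long pingTimeoutMs) {
        timer.scheduleWithFixedDelay(() -> {
            if (System.currentTimeMillis() - conn.lastActivityMillis() < idleTimeoutMs) {
                return; // traffic is flowing; nothing to check
            }
            try {
                conn.ping().get(pingTimeoutMs, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                conn.reconnect(); // no reply: assume the connection silently died
            }
        }, idleTimeoutMs, idleTimeoutMs, TimeUnit.MILLISECONDS);
    }
}
{code}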
[jira] [Created] (HBASE-16128) add support for p999 histogram metrics
Tianying Chang created HBASE-16128:
-----------------------------------

Summary: add support for p999 histogram metrics
Key: HBASE-16128
URL: https://issues.apache.org/jira/browse/HBASE-16128
Project: HBase
Issue Type: Improvement
Components: metrics
Affects Versions: 1.2.1
Reporter: Tianying Chang
Assignee: Tianying Chang
Priority: Minor

Currently there is support for p75, p90, and p99, but not for p999. We need p999 metrics on the server side to reflect p99 latency at the client level, especially when a client call fans out to many servers.
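For illustration, a minimal nearest-rank percentile snapshot with p999 added. This is the standard quantile estimate, not HBase's actual histogram implementation, and note that p999 only becomes meaningful with on the order of a thousand samples:

{code:java}
import java.util.Arrays;

public class PercentileSnapshot {
    private final long[] sorted;

    PercentileSnapshot(long[] samples) {
        sorted = samples.clone();
        Arrays.sort(sorted);
    }

    // Nearest-rank estimate; quantile is e.g. 0.75, 0.90, 0.99, 0.999.
    long valueAt(double quantile) {
        if (sorted.length == 0) return 0;
        int idx = (int) Math.ceil(quantile * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    long p999() { return valueAt(0.999); }
}
{code}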
[jira] [Resolved] (HBASE-16029) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
[ https://issues.apache.org/jira/browse/HBASE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianying Chang resolved HBASE-16029.
------------------------------------
Resolution: Duplicate

> All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is
> on, causing flush spike
> --------------------------------------------------------------------------
>
> Key: HBASE-16029
> URL: https://issues.apache.org/jira/browse/HBASE-16029
> Project: HBase
> Issue Type: Improvement
> Components: hbase, Performance
> Affects Versions: 1.2.1
> Reporter: Tianying Chang
> Assignee: Tianying Chang
[jira] [Resolved] (HBASE-16028) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
[ https://issues.apache.org/jira/browse/HBASE-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianying Chang resolved HBASE-16028.
------------------------------------
Resolution: Duplicate

> All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is
> on, causing flush spike
> --------------------------------------------------------------------------
>
> Key: HBASE-16028
> URL: https://issues.apache.org/jira/browse/HBASE-16028
> Project: HBase
> Issue Type: Improvement
> Components: hbase, Performance
> Affects Versions: 1.2.1
> Reporter: Tianying Chang
> Assignee: Tianying Chang
>
> In our production cluster, we observed a memstore flush spike every hour
> across all regions/RSs (we use the default memstore periodic flush time of
> 1 hour). This happens when two conditions are met:
> 1. the memstore does not have enough data to be flushed before the 1-hour
>    limit is reached;
> 2. all regions are opened around the same time (e.g., all RSs are started
>    at the same time when starting a cluster).
> Under these two conditions, all the regions are flushed around the same
> time, at startTime + 1 hour - delay, again and again.
> We added a flush jittering time to randomize the flush time of each region
> so that they don't all get flushed at around the same time. We have had
> this feature running in our 94.7 and 94.26 clusters. We recently upgraded
> to 1.2 and found the issue is still there, so we are porting the fix into
> the 1.2 branch.
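A sketch of the jitter idea the reporter describes: rather than flushing exactly every interval, each periodic-flush check subtracts a small random delay so regions opened together drift apart over time. Class and field names are illustrative, not the actual periodic flusher code:

{code:java}
import java.util.concurrent.ThreadLocalRandom;

public class JitteredFlushPolicy {
    private final long flushIntervalMs;   // e.g. the 1-hour default
    private final long maxJitterMs;       // must be > 0; e.g. 5 minutes to spread the spike

    JitteredFlushPolicy(long flushIntervalMs, long maxJitterMs) {
        this.flushIntervalMs = flushIntervalMs;
        this.maxJitterMs = maxJitterMs;
    }

    // Called by the periodic flush checker for each region.
    boolean shouldFlush(long lastFlushTimeMs, long nowMs) {
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMs);
        return nowMs - lastFlushTimeMs > flushIntervalMs - jitter;
    }
}
{code}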
[jira] [Resolved] (HBASE-16027) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
[ https://issues.apache.org/jira/browse/HBASE-16027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianying Chang resolved HBASE-16027.
------------------------------------
Resolution: Duplicate

> All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is
> on, causing flush spike
> --------------------------------------------------------------------------
>
> Key: HBASE-16027
> URL: https://issues.apache.org/jira/browse/HBASE-16027
> Project: HBase
> Issue Type: Bug
> Components: hbase, Performance
> Affects Versions: 1.2.1
> Reporter: Tianying Chang
> Assignee: Tianying Chang
>
> In our production cluster, we observed a memstore flush spike every hour
> across all regions/RSs (we use the default memstore periodic flush time of
> 1 hour). This happens when two conditions are met:
> 1. the memstore does not have enough data to be flushed before the 1-hour
>    limit is reached;
> 2. all regions are opened around the same time (e.g., all RSs are started
>    at the same time when starting a cluster).
> Under these two conditions, all the regions are flushed around the same
> time, at startTime + 1 hour - delay, again and again.
> We added a flush jittering time to randomize the flush time of each region
> so that they don't all get flushed at around the same time. We have had
> this feature running in our 94.7 and 94.26 clusters. We recently upgraded
> to 1.2 and found the issue is still there, so we are porting the fix into
> the 1.2 branch.
[jira] [Created] (HBASE-16030) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Tianying Chang created HBASE-16030:
-----------------------------------

Summary: All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Key: HBASE-16030
URL: https://issues.apache.org/jira/browse/HBASE-16030
Project: HBase
Issue Type: Improvement
Affects Versions: 1.2.1
Reporter: Tianying Chang
Assignee: Tianying Chang

In our production cluster, we observed a memstore flush spike every hour across all regions/RSs (we use the default memstore periodic flush time of 1 hour). This happens when two conditions are met:
1. the memstore does not have enough data to be flushed before the 1-hour limit is reached;
2. all regions are opened around the same time (e.g., all RSs are started at the same time when starting a cluster).
Under these two conditions, all the regions are flushed around the same time, at startTime + 1 hour - delay, again and again.
We added a flush jittering time to randomize the flush time of each region so that they don't all get flushed at around the same time. We have had this feature running in our 94.7 and 94.26 clusters. We recently upgraded to 1.2 and found the issue is still there, so we are porting the fix into the 1.2 branch.
[jira] [Created] (HBASE-16029) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Tianying Chang created HBASE-16029:
-----------------------------------

Summary: All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Key: HBASE-16029
URL: https://issues.apache.org/jira/browse/HBASE-16029
Project: HBase
Issue Type: Improvement
Components: hbase, Performance
Affects Versions: 1.2.1
Reporter: Tianying Chang
Assignee: Tianying Chang
[jira] [Created] (HBASE-16028) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Tianying Chang created HBASE-16028:
-----------------------------------

Summary: All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Key: HBASE-16028
URL: https://issues.apache.org/jira/browse/HBASE-16028
Project: HBase
Issue Type: Improvement
Components: hbase, Performance
Affects Versions: 1.2.1
Reporter: Tianying Chang
Assignee: Tianying Chang

In our production cluster, we observed a memstore flush spike every hour across all regions/RSs (we use the default memstore periodic flush time of 1 hour). This happens when two conditions are met:
1. the memstore does not have enough data to be flushed before the 1-hour limit is reached;
2. all regions are opened around the same time (e.g., all RSs are started at the same time when starting a cluster).
Under these two conditions, all the regions are flushed around the same time, at startTime + 1 hour - delay, again and again.
We added a flush jittering time to randomize the flush time of each region so that they don't all get flushed at around the same time. We have had this feature running in our 94.7 and 94.26 clusters. We recently upgraded to 1.2 and found the issue is still there, so we are porting the fix into the 1.2 branch.
[jira] [Created] (HBASE-16027) All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Tianying Chang created HBASE-16027:
-----------------------------------

Summary: All Regions are flushed at about same time when MEMSTORE_PERIODIC_FLUSH is on, causing flush spike
Key: HBASE-16027
URL: https://issues.apache.org/jira/browse/HBASE-16027
Project: HBase
Issue Type: Bug
Components: hbase, Performance
Affects Versions: 1.2.1
Reporter: Tianying Chang
Assignee: Tianying Chang

In our production cluster, we observed a memstore flush spike every hour across all regions/RSs (we use the default memstore periodic flush time of 1 hour). This happens when two conditions are met:
1. the memstore does not have enough data to be flushed before the 1-hour limit is reached;
2. all regions are opened around the same time (e.g., all RSs are started at the same time when starting a cluster).
Under these two conditions, all the regions are flushed around the same time, at startTime + 1 hour - delay, again and again.
We added a flush jittering time to randomize the flush time of each region so that they don't all get flushed at around the same time. We have had this feature running in our 94.7 and 94.26 clusters. We recently upgraded to 1.2 and found the issue is still there, so we are porting the fix into the 1.2 branch.
[jira] [Created] (HBASE-15155) Show All RPC handler tasks stop working after cluster is under heavy load for a while
Tianying Chang created HBASE-15155:
-----------------------------------

Summary: Show All RPC handler tasks stop working after cluster is under heavy load for a while
Key: HBASE-15155
URL: https://issues.apache.org/jira/browse/HBASE-15155
Project: HBase
Issue Type: Bug
Components: monitoring
Affects Versions: 0.94.19, 1.0.0, 0.98.0
Reporter: Tianying Chang
Assignee: Tianying Chang

After we upgraded from 94.7 to 94.26 and 1.0, we found that the "Show All RPC handler status" link on the RS web UI stops working after the cluster has run under relatively high load in production for several days. It turns out to be a bug introduced by https://issues.apache.org/jira/browse/HBASE-10312: the BoundedFIFOBuffer causes the RPC handler status entries to be overwritten/removed permanently whenever a spike of non-RPC task statuses exceeds MAX_SIZE (1000). So once the RS has experienced high load, RPC status monitoring is gone for good until the RS is restarted. We added a unit test that reproduces this, and the fix passes the test.
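The failure mode is easy to reproduce in miniature: with a bounded FIFO of task-status entries, a burst of short-lived non-RPC task statuses evicts the long-lived RPC handler entries, and nothing ever re-inserts them. The sketch below is a simplified model, not the actual TaskMonitor code:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class StatusBufferDemo {
    static final int MAX_SIZE = 1000;

    public static void main(String[] args) {
        Deque<String> statuses = new ArrayDeque<>();
        statuses.add("RPC handler 0");       // created once at startup
        for (int i = 0; i <= MAX_SIZE; i++) { // spike of non-RPC task statuses
            if (statuses.size() >= MAX_SIZE) statuses.removeFirst(); // FIFO eviction
            statuses.add("compaction task " + i);
        }
        // The RPC handler entry is gone for good; the fix is to keep RPC handler
        // statuses out of the bounded buffer (or in a separate list).
        System.out.println(statuses.contains("RPC handler 0")); // prints false
    }
}
{code}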
[jira] [Created] (HBASE-11765) ReplicationSink should merge the Put/Delete of the same row into one Action even if they are from different hlog entry.
Tianying Chang created HBASE-11765:
-----------------------------------

Summary: ReplicationSink should merge the Put/Delete of the same row into one Action even if they are from different hlog entry.
Key: HBASE-11765
URL: https://issues.apache.org/jira/browse/HBASE-11765
Project: HBase
Issue Type: Improvement
Components: Performance, Replication
Affects Versions: 0.94.7
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.7

The current ReplicationSink code makes sure only one Put/Delete action is created for KVs of the same row when they come from the same hlog entry. However, when Puts/Deletes for the same row sit in different hlog entries, multiple Put/Delete actions are created, and this causes synchronization cost during the multi batch operation. In one of our applications, whose traffic pattern deletes the same row twice for many rows, we saw doMiniBatchMutation() invoked many times because of the row lock on the same row; the ReplicationSink side was very slow and the replication queue built up. We should merge the Puts/Deletes for the same row into one action even if they come from different hlog entries.
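A sketch of the proposed merge, grouping incoming cells by (table, row) across all WAL entries in a shipment so the sink builds one mutation per row and takes each row lock once; RowKey and Cell are simplified stand-ins for the real HBase types:

{code:java}
import java.util.*;

public class RowMergingSink {
    record RowKey(String table, String row) {}
    record Cell(String table, String row, String qualifier, byte[] value) {}

    // One merged "mutation" per row, regardless of which WAL entry each cell came from.
    static Map<RowKey, List<Cell>> mergeByRow(List<List<Cell>> walEntries) {
        Map<RowKey, List<Cell>> merged = new LinkedHashMap<>();
        for (List<Cell> entry : walEntries) {
            for (Cell c : entry) {
                merged.computeIfAbsent(new RowKey(c.table(), c.row()), k -> new ArrayList<>())
                      .add(c);
            }
        }
        return merged;
    }
}
{code}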
[jira] [Created] (HBASE-11684) HBase replicationSource should support multithread to ship the log entry
Tianying Chang created HBASE-11684:
-----------------------------------

Summary: HBase replicationSource should support multithread to ship the log entry
Key: HBASE-11684
URL: https://issues.apache.org/jira/browse/HBASE-11684
Project: HBase
Issue Type: Improvement
Components: Performance, regionserver, Replication
Reporter: Tianying Chang
Assignee: Tianying Chang

We found that the replication rate cannot keep up with the write rate when the master cluster is write-heavy, and we got a huge log queue buildup because of it. But when we did a rolling restart of the master cluster, we found that appliedOpsRate doubled, thanks to the extra thread created to help recover the log of the restarted RS. ReplicateLogEntries is a synchronous blocking call, so it becomes the bottleneck when it runs on only one thread. I think we should support multiple threads in the replication source to ship the data. I don't see any consistency problem. Any other concerns?
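One possible shape for the multithreaded shipper, assuming the batch can be split into independently shippable chunks (ordering across chunks is exactly the consistency question raised above); Shipper is a placeholder for the blocking ReplicateLogEntries-style RPC, not the actual replication source API:

{code:java}
import java.util.List;
import java.util.concurrent.*;

public class ParallelShipper<E> {
    interface Shipper<T> { void ship(List<T> chunk) throws Exception; } // blocking RPC

    private final ExecutorService pool;

    ParallelShipper(int threads) { pool = Executors.newFixedThreadPool(threads); }

    // Ships all chunks concurrently; invokeAll blocks until every chunk is done.
    void shipAll(List<List<E>> chunks, Shipper<E> shipper) throws InterruptedException {
        pool.invokeAll(chunks.stream()
                .map(chunk -> (Callable<Void>) () -> { shipper.ship(chunk); return null; })
                .toList());
    }
}
{code}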
[jira] [Created] (HBASE-10935) support snapshot policy where flush memstore can be skipped to prevent production cluster freeze
Tianying Chang created HBASE-10935:
-----------------------------------

Summary: support snapshot policy where flush memstore can be skipped to prevent production cluster freeze
Key: HBASE-10935
URL: https://issues.apache.org/jira/browse/HBASE-10935
Project: HBase
Issue Type: New Feature
Components: shell, snapshots
Affects Versions: 0.94.18, 0.94.7
Reporter: Tianying Chang
Assignee: Tianying Chang
Priority: Minor
Fix For: 0.94.19

We are using the snapshot feature for HBase disaster recovery, taking snapshots in our production cluster periodically. The current flush snapshot policy requires all regions of the table to coordinate, blocking writes while they all flush at the same time. Since we use WALPlayer to fill in the data that is not in the snapshot HFiles, we don't need the snapshot to do a coordinated flush; the snapshot just records all the HFiles that are already there. I added a parameter to the HBase shell so people can choose the no-flush snapshot when they need it, as below; otherwise the default flush snapshot behavior is not affected.

snapshot 'TestTable', 'TestSnapshot', 'skipFlush'
[jira] [Created] (HBASE-8836) Separate reader and writer thread pool in RegionServer, so that write throughput will not be impacted when the read load is very high
Tianying Chang created HBASE-8836:
----------------------------------

Summary: Separate reader and writer thread pool in RegionServer, so that write throughput will not be impacted when the read load is very high
Key: HBASE-8836
URL: https://issues.apache.org/jira/browse/HBASE-8836
Project: HBase
Issue Type: New Feature
Components: Performance, regionserver
Affects Versions: 0.94.8
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.8

We found that when the read load on a specific RS is high, the write throughput is also impacted dramatically, sometimes even causing write data loss. We want to prioritize writes by putting them in a separate queue from read requests, so that slow reads do not make fast writes wait unnecessarily long.
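A sketch of the proposed separation: two fixed pools keyed by request type, so a flood of slow reads cannot queue ahead of writes. The types and pool sizes are illustrative, standing in for a split of the RS RPC handler pool:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitRpcScheduler {
    // Each fixed pool has its own queue, so reads and writes never contend for handlers.
    private final ExecutorService readPool  = Executors.newFixedThreadPool(20);
    private final ExecutorService writePool = Executors.newFixedThreadPool(10);

    void dispatch(Runnable request, boolean isWrite) {
        (isWrite ? writePool : readPool).execute(request);
    }
}
{code}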
[jira] [Resolved] (HBASE-7882) move region level metrics readRequestCount and writeRequestCount to Metric 2
[ https://issues.apache.org/jira/browse/HBASE-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianying Chang resolved HBASE-7882.
-----------------------------------
Resolution: Duplicate
Release Note: The metrics are already in metrics2. There is another JIRA for putting these two metrics into 94, which has been committed. This JIRA can be closed now.

> move region level metrics readRequestCount and writeRequestCount to Metric 2
> -----------------------------------------------------------------------------
>
> Key: HBASE-7882
> URL: https://issues.apache.org/jira/browse/HBASE-7882
> Project: HBase
> Issue Type: Bug
> Components: metrics
> Affects Versions: 0.96.0
> Reporter: Tianying Chang
> Assignee: Tianying Chang
> Priority: Minor
>
> HBASE-7818 is for 94. Following the refactor in HBASE-6410, I need to
> rework the 94 patch of HBASE-7818 against metrics2. The patch for 96 will
> be very different from the 94 one.
[jira] [Resolved] (HBASE-8044) split/flush/compact/major_compact from hbase shell does not work for region key with \x format
[ https://issues.apache.org/jira/browse/HBASE-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tianying Chang resolved HBASE-8044.
-----------------------------------
Resolution: Duplicate

This bug has been fixed by HBASE-6643 in 0.94, which changed the shell input for split/flush/compact to take the encoded region name instead of the full region name. That avoids the confusion over format conversion of the full region name, so this fix is not needed anymore.

> split/flush/compact/major_compact from hbase shell does not work for region key with \x format
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-8044
> URL: https://issues.apache.org/jira/browse/HBASE-8044
> Project: HBase
> Issue Type: Bug
> Components: Admin
> Affects Versions: 0.94.5
> Reporter: Tianying Chang
> Assignee: Tianying Chang
> Fix For: 0.95.0, 0.98.0, 0.94.7
> Attachments: 8044.patch, 8044-trunk.txt, 8044-trunk-v2.txt, 8044-v2.patch
>
> The conversion between bytes and string is incorrect.
[jira] [Created] (HBASE-8085) Backport the fix for Bytes.toStringBinary() into 94
Tianying Chang created HBASE-8085:
----------------------------------

Summary: Backport the fix for Bytes.toStringBinary() into 94
Key: HBASE-8085
URL: https://issues.apache.org/jira/browse/HBASE-8085
Project: HBase
Issue Type: Bug
Components: util
Affects Versions: 0.94.5
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.7

There is a bug in Bytes.toStringBinary() that makes it return the same string for 1) byte[] a = {'\\', 'x', 'D', 'A'} and 2) \xDA. This bug has already been fixed in trunk by HBASE-6991; we should backport the fix to 94.
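The ambiguity is easy to demonstrate; toStringBinary below is a simplified model of the buggy behavior (the backslash passes through unescaped), not the actual HBase implementation:

{code:java}
public class ToStringBinaryDemo {
    static String toStringBinary(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) {
            int v = x & 0xFF;
            if (v >= 32 && v < 127) sb.append((char) v); // bug: '\\' passes through unescaped
            else sb.append(String.format("\\x%02X", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] literal = {'\\', 'x', 'D', 'A'};  // four bytes: \ x D A
        byte[] single  = {(byte) 0xDA};          // one byte
        // Both print \xDA, so the encoding is not reversible.
        System.out.println(toStringBinary(literal));
        System.out.println(toStringBinary(single));
    }
}
{code}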
[jira] [Created] (HBASE-8044) split/flush/compact/major_compact from hbase shell does not work for region key with \x format
Tianying Chang created HBASE-8044:
----------------------------------

Summary: split/flush/compact/major_compact from hbase shell does not work for region key with \x format
Key: HBASE-8044
URL: https://issues.apache.org/jira/browse/HBASE-8044
Project: HBase
Issue Type: Bug
Components: Admin
Affects Versions: 0.94.5
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.6
[jira] [Created] (HBASE-7896) make rename_table working in 92/94
Tianying Chang created HBASE-7896:
----------------------------------

Summary: make rename_table working in 92/94
Key: HBASE-7896
URL: https://issues.apache.org/jira/browse/HBASE-7896
Project: HBase
Issue Type: Bug
Components: scripts
Affects Versions: 0.94.5, 0.92.2
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.5, 0.92.2

The rename_table function is very useful for our customers. However, rename_table.rb does not work in 92/94; it has several bugs. It would be useful to fix them so that users can solve their problems.
[jira] [Created] (HBASE-7882) move region level metrics readRequestCount and writeRequestCount to Metric 2
Tianying Chang created HBASE-7882:
----------------------------------

Summary: move region level metrics readRequestCount and writeRequestCount to Metric 2
Key: HBASE-7882
URL: https://issues.apache.org/jira/browse/HBASE-7882
Project: HBase
Issue Type: Bug
Components: metrics
Affects Versions: 0.96.0
Reporter: Tianying Chang
Assignee: Tianying Chang
Priority: Minor
Fix For: 0.96.0

HBASE-7818 is for 94. Following the refactor in HBASE-6410, I need to rework the 94 patch of HBASE-7818 against metrics2. The patch for 96 will be very different from the 94 one.
[jira] [Created] (HBASE-7816) numericPersistentMetrics should not be cleared for regions that are not being closed.
Tianying Chang created HBASE-7816:
----------------------------------

Summary: numericPersistentMetrics should not be cleared for regions that are not being closed.
Key: HBASE-7816
URL: https://issues.apache.org/jira/browse/HBASE-7816
Project: HBase
Issue Type: Bug
Components: metrics
Affects Versions: 0.94.4
Reporter: Tianying Chang
Assignee: Tianying Chang
Fix For: 0.94.4

When a region is closed, the region-level dynamic metrics in numericPersistentMetrics are cleared for all regions on the same region server. That is OK for numericMetrics and timeVaryingMetrics, but not for numericPersistentMetrics, because those values are accumulated and are not reset at poll time. To keep the values correct, only the metrics of the closed region should be cleared; the numericPersistentMetrics of the other regions should be kept.
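A sketch of the intended behavior: key the persistent metrics by region and, on region close, remove only that region's entries instead of clearing the whole map. Names are illustrative, not the 0.94 metrics code:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class PersistentRegionMetrics {
    // metric key format: "<regionName>.<metricName>"
    private final ConcurrentMap<String, AtomicLong> numericPersistentMetrics =
            new ConcurrentHashMap<>();

    void increment(String regionName, String metric, long delta) {
        numericPersistentMetrics
                .computeIfAbsent(regionName + "." + metric, k -> new AtomicLong())
                .addAndGet(delta);
    }

    // Correct behavior: drop only the closed region's entries, never the whole map.
    void onRegionClose(String regionName) {
        numericPersistentMetrics.keySet().removeIf(k -> k.startsWith(regionName + "."));
    }
}
{code}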
[jira] [Created] (HBASE-7818) add region level metrics readRequestCount and writeRequestCount
Tianying Chang created HBASE-7818:
----------------------------------

Summary: add region level metrics readRequestCount and writeRequestCount
Key: HBASE-7818
URL: https://issues.apache.org/jira/browse/HBASE-7818
Project: HBase
Issue Type: Improvement
Components: metrics
Affects Versions: 0.94.4
Reporter: Tianying Chang
Assignee: Tianying Chang
Priority: Minor
Fix For: 0.94.6

The request rate at the region server level can help identify a hot region server, but it would also be good to identify the hot regions on that server; that makes it easy to find unbalanced-region problems. Currently, readRequestCount and writeRequestCount per region are exposed on the web UI. It would be more useful to expose them through the Hadoop metrics framework and/or JMX, so that people can see the history of when a region was hot. I am exposing the existing readRequestCount/writeRequestCount through the dynamic region-level metrics framework. I am not converting them to rates because our OpenTSDB already takes the raw read/write counts and applies a rate function to display the rate.
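A sketch of per-region counters exposed as raw, monotonically increasing values (OpenTSDB applies the rate function downstream); illustrative names only, not the dynamic region metrics API:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class RegionRequestCounters {
    private final ConcurrentMap<String, LongAdder> readRequestCount  = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, LongAdder> writeRequestCount = new ConcurrentHashMap<>();

    void onRead(String region)  { readRequestCount.computeIfAbsent(region, r -> new LongAdder()).increment(); }
    void onWrite(String region) { writeRequestCount.computeIfAbsent(region, r -> new LongAdder()).increment(); }

    // Polled by the metrics reporter / JMX; raw counts, not rates.
    long reads(String region)  { return readRequestCount.getOrDefault(region, new LongAdder()).sum(); }
    long writes(String region) { return writeRequestCount.getOrDefault(region, new LongAdder()).sum(); }
}
{code}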