[jira] [Reopened] (HBASE-19483) Add proper privilege check for rsgroup commands

2018-01-10 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-19483:


Removal of methods from a LimitedPrivate interface is not allowed in a patch 
release. Please provide an addendum for branch-1.4 which restores these methods:

Removed Methods  4 
hbase-server-1.4.0.jar, AccessController.class
package org.apache.hadoop.hbase.security.access
AccessController.isAuthorizationSupported ( Configuration conf ) [static]  :  
boolean 
AccessController.requireNamespacePermission ( String request, String namespace, 
Permission.Action... permissions )  :  void 
AccessController.requireNamespacePermission ( String request, String namespace, 
TableName tableName, Map<byte[], ? extends Collection<byte[]>> 
familyMap, Permission.Action... permissions )  :  void 

hbase-server-1.4.0.jar, VisibilityController.class
package org.apache.hadoop.hbase.security.visibility
VisibilityController.isAuthorizationSupported ( Configuration conf ) [static]  
:  boolean 
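
For reference, a branch-1.4 addendum would typically restore the removed 
signatures as deprecated shims that delegate to whatever replaced them. A 
minimal sketch, assuming hypothetical delegation targets (the actual branch-1.4 
internals may differ):

{code}
// Minimal sketch of a compatibility addendum: restore the removed
// LimitedPrivate methods as deprecated shims. The bodies shown here are
// assumptions, not the actual branch-1.4 implementation.
@Deprecated
public static boolean isAuthorizationSupported(Configuration conf) {
  // Historically this just read hbase.security.authorization (assumption).
  return conf.getBoolean(User.HBASE_SECURITY_AUTHORIZATION_CONF_KEY, true);
}

@Deprecated
public void requireNamespacePermission(String request, String namespace,
    Permission.Action... permissions) throws IOException {
  // Delegate to whatever helper the rsgroup patch introduced (hypothetical).
  requireNamespacePermissionInternal(request, namespace, permissions);
}
{code}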


> Add proper privilege check for rsgroup commands
> ---
>
> Key: HBASE-19483
> URL: https://issues.apache.org/jira/browse/HBASE-19483
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Guangxu Cheng
> Fix For: 1.4.1, 1.5.0, 2.0.0-beta-1
>
> Attachments: 19483.master.011.patch, 19483.v11.patch, 
> 19483.v11.patch, HBASE-19483.addendum-1.patch, HBASE-19483.addendum.patch, 
> HBASE-19483.branch-1.001.patch, HBASE-19483.branch-2.001.patch, 
> HBASE-19483.branch-2.002.patch, HBASE-19483.branch-2.003.patch, 
> HBASE-19483.master.001.patch, HBASE-19483.master.002.patch, 
> HBASE-19483.master.003.patch, HBASE-19483.master.004.patch, 
> HBASE-19483.master.005.patch, HBASE-19483.master.006.patch, 
> HBASE-19483.master.007.patch, HBASE-19483.master.008.patch, 
> HBASE-19483.master.009.patch, HBASE-19483.master.010.patch, 
> HBASE-19483.master.011.patch, HBASE-19483.master.011.patch, 
> HBASE-19483.master.012.patch, HBASE-19483.master.013.patch, 
> HBASE-19483.master.014.patch
>
>
> Currently the list_rsgroups command can be executed by any user.
> This is inconsistent with other list commands such as list_peers and 
> list_peer_configs.
> We should add a proper privilege check for the list_rsgroups command.
> Privilege checks should also be added for the get_table_rsgroup / 
> get_server_rsgroup / get_rsgroup commands.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (HBASE-19752) RSGroupBasedLoadBalancer#getMisplacedRegions() should handle the case where rs group cannot be determined

2018-01-12 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-19752:


Reverted from branch-1.4 due to compilation failure. Please fix and reapply.


> RSGroupBasedLoadBalancer#getMisplacedRegions() should handle the case where 
> rs group cannot be determined
> -
>
> Key: HBASE-19752
> URL: https://issues.apache.org/jira/browse/HBASE-19752
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 1.4.1, 1.5.0, 2.0.0-beta-1
>
> Attachments: 19752.v1.txt, 19752.v2.txt, 19752.v3.txt, 19752.v4.txt, 
> 19752.v5.txt, 19752.v6.txt, 19752.v7.branch-1.txt, 19752.v7.txt
>
>
> Observed the following in rs group test output:
> {code}
> 2018-01-10 14:17:23,006 DEBUG [AssignmentThread] 
> rsgroup.RSGroupBasedLoadBalancer(316): Found misplaced region: 
> hbase:acl,,1515593841277.ecf47ecb7522d7fab40db0a237f973fd. on server: 
> localhost,1,1 found in group: null outside of group: UNKNOWN
> {code}
> Here is corresponding code:
> {code}
>   if (assignedServer != null &&
>   (info == null || 
> !info.containsServer(assignedServer.getAddress()))) {
> RSGroupInfo otherInfo = null;
> otherInfo = 
> rsGroupInfoManager.getRSGroupOfServer(assignedServer.getAddress());
> LOG.debug("Found misplaced region: " + 
> regionInfo.getRegionNameAsString() +
> {code}
> As you can see, both info and otherInfo were null.
> In this case, the region should not be placed in misplacedRegions.
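
A minimal sketch of the guard this suggests, reusing the names from the quoted 
snippet (the surrounding loop is abbreviated):

{code}
// Sketch only: skip the region when neither its expected group (info) nor the
// hosting server's group can be determined.
if (assignedServer != null &&
    (info == null || !info.containsServer(assignedServer.getAddress()))) {
  RSGroupInfo otherInfo =
      rsGroupInfoManager.getRSGroupOfServer(assignedServer.getAddress());
  if (info == null && otherInfo == null) {
    LOG.warn("Couldn't obtain rs group information for " +
        regionInfo.getRegionNameAsString() + "; skipping");
    continue; // do not add the region to misplacedRegions
  }
  LOG.debug("Found misplaced region: " + regionInfo.getRegionNameAsString() +
      " on server: " + assignedServer + " found in group: " + otherInfo +
      " outside of group: " + (info == null ? null : info.getName()));
  misplacedRegions.add(regionInfo);
}
{code}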



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-19790) Fix compatibility break in 1.3.2-SNAPSHOT

2018-01-12 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-19790:
--

 Summary: Fix compatibility break in 1.3.2-SNAPSHOT
 Key: HBASE-19790
 URL: https://issues.apache.org/jira/browse/HBASE-19790
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.3.2
Reporter: Andrew Purtell
Assignee: Andrew Purtell
Priority: Blocker
 Fix For: 1.3.2


This change is disallowed in a patch release:

{code}
package org.apache.hadoop.hbase.regionserver
interface Region 

Abstract method closeRegionOperation ( Region.Operation ) has been added to 
this interface.

Recompilation of a client program may be terminated with the message: a client 
class C is not abstract and does not override abstract method 
closeRegionOperation ( Region.Operation ) in Region.
{code}

Region is a LimitedPrivate interface.

See https://hbase.apache.org/book.html#hbase.versioning
{quote}
New APIs introduced in a patch version will only be added in a source 
compatible way [1]: i.e. code that implements public APIs will continue to 
compile.
{quote}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-19842) Cell ACLs v2

2018-01-22 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-19842:
--

 Summary: Cell ACLs v2
 Key: HBASE-19842
 URL: https://issues.apache.org/jira/browse/HBASE-19842
 Project: HBase
  Issue Type: New Feature
  Components: security
Reporter: Andrew Purtell


Per-cell ACLs as currently implemented (HBASE-7662) embed the serialized ACL in 
a tag stored with each cell. This was done for performance. It has some 
drawbacks, most significantly unnecessary duplication, and granting or revoking 
access requires a rewrite of every affected cell. We could implement them in a 
space-efficient (and management-efficient) way, at the cost of some 
performance, like so:

First, allow storage of cell-level ACLs in the ACL table. The rowkey could be a 
hash of the serialized ACL. We just have to avoid rowkeys that associate the 
ACL with a cf, table, or namespace... and handle entries in the ACL table which 
don't conform to today's keying strategy. 

Then provide the option for storing the rowkey of an entry in the ACL table in 
the cell ACL tag instead of the complete serialization. 

The advantages would be a reduction of unnecessary duplication and, like ACLs 
at other granularities, a GRANT or REVOKE which updates the ACL table will 
update access control rules for all affected cells. The disadvantage is that, 
in order to process the reference for each cell carrying an ACL reference in a 
tag, we will need to look up the ACL in the ACL table.
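
A rough sketch of the storage side of this proposal; the helper name and the 
hbase:acl column layout used here are assumptions for discussion, not existing 
HBase APIs:

{code}
import java.security.MessageDigest;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch only: store a cell-level ACL once in the ACL table, keyed by a hash
 * of its serialized form, and return the key for use as a reference tag.
 */
public static byte[] storeSharedCellAcl(Table aclTable, byte[] serializedAcl)
    throws Exception {
  // Hash key: identical ACLs dedupe, and the key cannot collide with the
  // table/cf/namespace keys used by the ACL table today.
  byte[] aclRowKey = MessageDigest.getInstance("SHA-256").digest(serializedAcl);
  // "l" is the ACL table's column family; the "cell" qualifier is hypothetical.
  aclTable.put(new Put(aclRowKey)
      .addColumn(Bytes.toBytes("l"), Bytes.toBytes("cell"), serializedAcl));
  // The caller would place aclRowKey in the cell's ACL tag instead of the full
  // serialization; GRANT/REVOKE then rewrites only this one ACL table row.
  return aclRowKey;
}
{code}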



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19858) Backport HBASE-14061 (Support CF-level Storage Policy) to branch-1

2018-01-24 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-19858:
--

 Summary: Backport HBASE-14061 (Support CF-level Storage Policy) to 
branch-1
 Key: HBASE-19858
 URL: https://issues.apache.org/jira/browse/HBASE-19858
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0


Backport the following commits to branch-1:
 * HBASE-14061 Support CF-level Storage Policy
 * HBASE-14061 Support CF-level Storage Policy (addendum)
 * HBASE-14061 Support CF-level Storage Policy (addendum2)
 * HBASE-15172 Support setting storage policy in bulkload
 * HBASE-17538 HDFS.setStoragePolicy() logs errors on local fs
 * HBASE-18015 Storage class aware block placement for procedure v2 WALs
 * HBASE-18017 Reduce frequency of setStoragePolicy failure warnings
 * HBASE-19016 Coordinate storage policy property name for table schema and 
bulkload

 

Also fix:
 * The default storage policy, if not configured, cannot be "NONE"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19466) Rare failure in TestScannerCursor

2018-01-24 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-19466.

Resolution: Cannot Reproduce

> Rare failure in TestScannerCursor
> -
>
> Key: HBASE-19466
> URL: https://issues.apache.org/jira/browse/HBASE-19466
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Andrew Purtell
>Priority: Minor
>
> I think we just need to increase the timeout interval to deal with occasional 
> slowdowns on test executors. 1998 ms is a pretty short timeout.
> By the way "rpcTimetout" in the exception message is a misspelling.
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 37.412 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.regionserver.TestScannerCursor
> [ERROR] 
> testHeartbeatWithSparseFilter(org.apache.hadoop.hbase.regionserver.TestScannerCursor)
>   Time elapsed: 35.604 s  <<< ERROR!
> org.apache.hadoop.hbase.client.RetriesExhaustedException: 
> Failed after attempts=36, exceptions:
> Thu Dec 07 22:27:16 UTC 2017, null, java.net.SocketTimeoutException: 
> callTimeout=4000, callDuration=4108: Call to 
> ip-172-31-47-35.us-west-2.compute.internal/172.31.47.35:35690 failed on local 
> exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=52, 
> waitTime=2002, rpcTimetout=1998 row '' on table 'TestScannerCursor' at 
> region=TestScannerCursor,,1512685598567.1d4e59215a881d6ccbd0b5b5bdec5587., 
> hostname=ip-172-31-47-35.us-west-2.compute.internal,35690,1512685593244, 
> seqNum=2
> at 
> org.apache.hadoop.hbase.regionserver.TestScannerCursor.testHeartbeatWithSparseFilter(TestScannerCursor.java:154)
> Caused by: java.net.SocketTimeoutException: callTimeout=4000, 
> callDuration=4108: Call to 
> ip-172-31-47-35.us-west-2.compute.internal/172.31.47.35:35690 failed on local 
> exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=52, 
> waitTime=2002, rpcTimetout=1998 row '' on table 'TestScannerCursor' at 
> region=TestScannerCursor,,1512685598567.1d4e59215a881d6ccbd0b5b5bdec5587., 
> hostname=ip-172-31-47-35.us-west-2.compute.internal,35690,1512685593244, 
> seqNum=2
> Caused by: java.io.IOException: Call to 
> ip-172-31-47-35.us-west-2.compute.internal/172.31.47.35:35690 failed on local 
> exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=52, 
> waitTime=2002, rpcTimetout=1998
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=52, 
> waitTime=2002, rpcTimetout=1998



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-17883) release 1.4.0

2018-01-24 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-17883.

Resolution: Fixed

> release 1.4.0
> -
>
> Key: HBASE-17883
> URL: https://issues.apache.org/jira/browse/HBASE-17883
> Project: HBase
>  Issue Type: Task
>  Components: community
>Affects Versions: 1.4.0
>Reporter: Sean Busbey
>Assignee: Andrew Purtell
>Priority: Critical
>
> Let's start working through doing the needful; it's been almost 3 months 
> since 1.3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19967) Add Major Compaction Tool options for off-peak / on-peak hours

2018-02-09 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-19967:
--

 Summary: Add Major Compaction Tool options for off-peak / on-peak 
hours
 Key: HBASE-19967
 URL: https://issues.apache.org/jira/browse/HBASE-19967
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell


After HBASE-19528 an operator can disable automatic major compaction and more 
intelligently manage major compaction impact on cluster operations with an 
external tool that drives the compaction activity. This tool can be invoked on 
whatever schedule is desirable, and can restrict activity by table and column 
family, with a given number of regionservers compacting concurrently at any 
time. 

Add Major Compaction Tool options for off-peak / on-peak hours. Allow for 
definition of "off-peak" as a time range bounded by two points on a 24 hour 
clock, and for two concurrency targets for global compaction activity, one for 
the off-peak interval and the other for the remainder.
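
For the off-peak window itself, a minimal sketch of the wrap-around-safe hour 
check the tool would need (the option names and wiring are not specified here):

{code}
/**
 * Sketch: decide whether an hour on a 24 hour clock falls inside the
 * configured off-peak window, including windows that wrap past midnight
 * (e.g. start=22, end=6). The option plumbing is left out.
 */
public static boolean isOffPeakHour(int currentHour, int startHour, int endHour) {
  if (startHour == endHour) {
    return false; // no off-peak window configured
  }
  if (startHour < endHour) {
    return currentHour >= startHour && currentHour < endHour;
  }
  return currentHour >= startHour || currentHour < endHour; // wraps midnight
}

// The tool would then select one of the two concurrency targets, e.g.:
// int servers = isOffPeakHour(hour, start, end) ? offPeakConcurrency : onPeakConcurrency;
{code}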



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-14610) IntegrationTestRpcClient from HBASE-14535 is failing with Async RPC client

2018-02-14 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-14610.

   Resolution: Incomplete
Fix Version/s: (was: 1.4.2)
   (was: 1.2.8)
   (was: 1.5.0)
   (was: 1.3.2)
   (was: 3.0.0)
   (was: 2.0.0)

> IntegrationTestRpcClient from HBASE-14535 is failing with Async RPC client
> --
>
> Key: HBASE-14610
> URL: https://issues.apache.org/jira/browse/HBASE-14610
> Project: HBase
>  Issue Type: Bug
>  Components: IPC/RPC
>Reporter: Enis Soztutar
>Priority: Major
> Attachments: output
>
>
> HBASE-14535 introduces an IT to simulate a running cluster with RPC servers 
> and RPC clients doing requests against the servers. 
> It passes with the sync client, but fails with async client. Probably we need 
> to take a look. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20012) Backport filesystem quotas (HBASE-16961) to branch-1

2018-02-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20012:
--

 Summary: Backport filesystem quotas (HBASE-16961) to branch-1
 Key: HBASE-20012
 URL: https://issues.apache.org/jira/browse/HBASE-20012
 Project: HBase
  Issue Type: New Feature
Reporter: Andrew Purtell
 Fix For: 1.5.0


Filesystem quotas (HBASE-16961) is an experimental feature committed to 
branch-2 and up. We are thinking about chargeback and share-back models at work 
and this begins to look compelling. I wish this meant we'd then give HBase 2 a 
spin, but that's unfortunately not realistic. It is very likely we will want to 
make use of this before we are up on HBase 2. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20018) Safe online META repair

2018-02-18 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20018:
--

 Summary: Safe online META repair
 Key: HBASE-20018
 URL: https://issues.apache.org/jira/browse/HBASE-20018
 Project: HBase
  Issue Type: New Feature
  Components: hbck
Reporter: Andrew Purtell


HBCK is a tank, or a giant shotgun, or choose the battlefield metaphor you feel 
is most appropriate. It rolls onto the field and leaves problems crushed in its 
wake, but if you point it in the wrong direction, it will crush your production 
data too. As such it is a means of last resort to fix an ailing cluster. It is 
also imperative that user request traffic, writes in particular, is stopped 
before attempting a number of the fixes. It is unlikely the default "-repair" 
option is what you want - this turns on too many fixes to risk at one time. 
There are a large number of command line switches for individual checks and 
fixes which are very useful but also error prone when cobbling together a 
command line for a cluster fix under pressure. An operations team might 
hesitate to employ hbck to fix some accumulating bad state because of the 
disruption its use requires and the risk of compounding the problem if it is 
not done carefully. That of course would be bad, because the accumulating bad 
state will eventually have an availability impact. 

It should be safer to use hbck, but changing hbck also carries risk. We can 
leave it be as the useful (but dangerous) tool it is and focus on making a 
subset of its functionality safer.

There is a class of META corruptions of mild to moderate severity which could 
in theory be handled more safely in an online manner, without requiring a 
suspension of user traffic. Some things hbck does are safe enough to use 
directly for this. Others need tweaks to do more preflight checks (like 
checking region states) first. Develop these as a separate tool, maybe even a 
new HMaster or Admin component.

Look for opportunities to share code with existing hbck, via refactor into a 
shared library. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20027) Port assignments in site configuration are ignored

2018-02-20 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20027:
--

 Summary: Port assignments in site configuration are ignored
 Key: HBASE-20027
 URL: https://issues.apache.org/jira/browse/HBASE-20027
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.3
Reporter: Andrew Purtell


Port assignments for master and regionserver RPC and info ports in site 
configuration appear to be ignored.

We are not catching this in tests because there appears to be no positive test 
for port assignment and the only fixed information we require is the zookeeper 
quorum and client port. 
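
A positive test could look roughly like the sketch below, assuming a mini 
cluster that honors site configuration (exact test-utility details vary by 
branch, and fixed ports would need to be chosen carefully to avoid collisions):

{code}
// Sketch of a positive test for site-configured ports; abbreviated, and the
// mini-cluster wiring may differ by branch.
Configuration conf = HBaseConfiguration.create();
conf.setInt(HConstants.MASTER_PORT, 16000);
conf.setInt(HConstants.REGIONSERVER_PORT, 16020);

HBaseTestingUtility util = new HBaseTestingUtility(conf);
util.startMiniCluster();
try {
  int masterRpcPort = util.getHBaseCluster().getMaster().getServerName().getPort();
  assertEquals("Master should bind the configured RPC port", 16000, masterRpcPort);
} finally {
  util.shutdownMiniCluster();
}
{code}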



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20063) Port HBASE-19799 (Add web UI to rsgroup) to branch-1

2018-02-23 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20063:
--

 Summary: Port HBASE-19799 (Add web UI to rsgroup) to branch-1
 Key: HBASE-20063
 URL: https://issues.apache.org/jira/browse/HBASE-20063
 Project: HBase
  Issue Type: Task
  Components: rsgroup, UI
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.4.3






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20087) Periodically attempt redeploy of regions in FAILED_OPEN state

2018-02-26 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20087:
--

 Summary: Periodically attempt redeploy of regions in FAILED_OPEN 
state
 Key: HBASE-20087
 URL: https://issues.apache.org/jira/browse/HBASE-20087
 Project: HBase
  Issue Type: Improvement
  Components: master, Region Assignment
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 2.0.0, 1.5.0


Because RSGroups can cause permanent RIT with regions in FAILED_OPEN state, we 
added logic to the master portion of the RSGroups extension to enumerate RITs 
and retry assignment of regions in FAILED_OPEN state.

However, this strategy can be applied generally to reduce the need for operator 
involvement in cluster operations. Today an operator has to manually resolve 
FAILED_OPEN assignments, but there is little risk in automatically retrying 
them after a while. If the reason the assignment failed has not cleared, the 
assignment will just fail again. Should the reason the assignment failed be 
resolved, then operators don't have to do anything more for the cluster to 
fully heal. 
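
A sketch of the general mechanism as a master chore; the AssignmentManager 
accessors used here approximate branch-1 and should be treated as assumptions:

{code}
/** Sketch only: periodically retry assignment of regions stuck in FAILED_OPEN. */
public class FailedOpenRegionRetryChore extends ScheduledChore {
  private final HMaster master;

  public FailedOpenRegionRetryChore(HMaster master, int periodMillis) {
    super("FailedOpenRegionRetryChore", master, periodMillis);
    this.master = master;
  }

  @Override
  protected void chore() {
    AssignmentManager am = master.getAssignmentManager();
    for (RegionState state : am.getRegionStates().getRegionsInTransition().values()) {
      if (state.getState() == RegionState.State.FAILED_OPEN) {
        // Low risk: if the original cause persists, the assignment simply
        // fails again and the region stays in FAILED_OPEN.
        am.assign(state.getRegion());
      }
    }
  }
}
{code}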

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20088) Update copyright notices to year 2018

2018-02-26 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20088:
--

 Summary: Update copyright notices to year 2018
 Key: HBASE-20088
 URL: https://issues.apache.org/jira/browse/HBASE-20088
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell


NOTICE file, UIs, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20089) make_rc.sh should name SHA-512 checksum files with the extension .sha512

2018-02-26 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20089:
--

 Summary: make_rc.sh should name SHA-512 checksum files with the 
extension .sha512
 Key: HBASE-20089
 URL: https://issues.apache.org/jira/browse/HBASE-20089
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell


From [~elserj]
{quote}
we need to update the checksum naming convention for SHA*. Per [1], .sha 
filenames should only contain SHA1, and .sha512 file names should be used for 
SHA512 xsum. I believe this means we just need to modify make_rc.sh to put the 
xsum into .sha512 instead of .sha. We do not need to distribute SHA1 xsums and, 
afaik, there is little cryptographic value to this.

[1] http://www.apache.org/dev/release-distribution.html#sigs-and-sums
{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20096) Missing version warning for exec-maven-plugin in hbase-shaded-check-invariants

2018-02-26 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20096:
--

 Summary: Missing version warning for exec-maven-plugin in 
hbase-shaded-check-invariants
 Key: HBASE-20096
 URL: https://issues.apache.org/jira/browse/HBASE-20096
 Project: HBase
  Issue Type: Bug
  Components: build
Reporter: Andrew Purtell
 Fix For: 1.5.0, 1.4.3


Reported by [~dbist13]:

Affects branch-1 and branch-1.4

{noformat}
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hbase:hbase-shaded-check-invariants:pom:1.5.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for 
org.codehaus.mojo:exec-maven-plugin is missing. @ 
org.apache.hbase:hbase-shaded-check-invariants:[unknown-version], 
/Users/apurtell/src/hbase/hbase-shaded/hbase-shaded-check-invariants/pom.xml, 
line 161, column 15
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20096) Missing version warning for exec-maven-plugin in hbase-shaded-check-invariants

2018-02-26 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20096.

   Resolution: Duplicate
Fix Version/s: (was: 1.4.3)
   (was: 1.5.0)

Dup of HBASE-20091

> Missing version warning for exec-maven-plugin in hbase-shaded-check-invariants
> --
>
> Key: HBASE-20096
> URL: https://issues.apache.org/jira/browse/HBASE-20096
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Reporter: Andrew Purtell
>Priority: Minor
>
> Reported by [~dbist13]:
> Affects branch-1 and branch-1.4
> {noformat}
> [WARNING] Some problems were encountered while building the effective model 
> for org.apache.hbase:hbase-shaded-check-invariants:pom:1.5.0-SNAPSHOT
> [WARNING] 'build.plugins.plugin.version' for 
> org.codehaus.mojo:exec-maven-plugin is missing. @ 
> org.apache.hbase:hbase-shaded-check-invariants:[unknown-version], 
> /Users/apurtell/src/hbase/hbase-shaded/hbase-shaded-check-invariants/pom.xml, 
> line 161, column 15
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20102) AssignmentManager#shutdown doesn't shut down scheduled executor

2018-02-27 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20102:
--

 Summary: AssignmentManager#shutdown doesn't shut down scheduled 
executor
 Key: HBASE-20102
 URL: https://issues.apache.org/jira/browse/HBASE-20102
 Project: HBase
  Issue Type: Bug
  Components: master, Region Assignment
Affects Versions: 1.4.2
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.4.3






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-19989) READY_TO_MERGE and READY_TO_SPLIT do not update region state correctly

2018-02-27 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-19989:


The new tests TestZKLessMergeOnCluster and TestZKLessSplitOnCluster 
consistently fail for me on branch-1.3 and branch-1.4. 

{code}
java.lang.RuntimeException: 
org.apache.hadoop.hbase.exceptions.DeserializationException: Missing pb magic 
PBUF prefix
{code}



> READY_TO_MERGE and READY_TO_SPLIT do not update region state correctly
> --
>
> Key: HBASE-19989
> URL: https://issues.apache.org/jira/browse/HBASE-19989
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Ben Lau
>Assignee: Ben Lau
>Priority: Major
> Fix For: 1.3.2, 1.5.0, 1.4.3
>
> Attachments: HBASE-19989-branch-1.patch
>
>
> Region state transitions do not work correctly for READY_TO_MERGE/SPLIT.  
> [~thiruvel] and I noticed this is due to break statements being in the wrong 
> place in AssignmentManager.  This allows a race condition, for example one in 
> which one of the regions being merged could be moved concurrently, resulting 
> in the merge transaction failing and then double assignment and/or data loss.  
> This bug appears to only affect branch-1 (for example 1.3 and 1.4) and not 
> branch-2 as the relevant code in AM has since been rewritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19989) READY_TO_MERGE and READY_TO_SPLIT do not update region state correctly

2018-02-27 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-19989.

Resolution: Fixed

Pushed addendum to branch-1.3, branch-1.4, and branch-1

> READY_TO_MERGE and READY_TO_SPLIT do not update region state correctly
> --
>
> Key: HBASE-19989
> URL: https://issues.apache.org/jira/browse/HBASE-19989
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Ben Lau
>Assignee: Ben Lau
>Priority: Major
> Fix For: 1.3.2, 1.5.0, 1.4.3
>
> Attachments: HBASE-19989-ADDENDUM-branch-1.patch, 
> HBASE-19989-branch-1.patch
>
>
> Region state transitions do not work correctly for READY_TO_MERGE/SPLIT.  
> [~thiruvel] and I noticed this is due to break statements being in the wrong 
> place in AssignmentManager.  This allows a race condition, for example one in 
> which one of the regions being merged could be moved concurrently, resulting 
> in the merge transaction failing and then double assignment and/or data loss.  
> This bug appears to only affect branch-1 (for example 1.3 and 1.4) and not 
> branch-2 as the relevant code in AM has since been rewritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-17448) Export metrics from RecoverableZooKeeper

2018-02-28 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-17448:

  Assignee: Andrew Purtell  (was: Chinmay Kulkarni)

> Export metrics from RecoverableZooKeeper
> 
>
> Key: HBASE-17448
> URL: https://issues.apache.org/jira/browse/HBASE-17448
> Project: HBase
>  Issue Type: Improvement
>  Components: Zookeeper
>Affects Versions: 1.3.1
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: patch
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-17448-branch-1.patch, HBASE-17448.patch, 
> HBASE-17448.patch
>
>
> Consider adding instrumentation to RecoverableZooKeeper that exposes metrics 
> on the performance and health of the embedded ZooKeeper client: latency 
> histograms for each op type, number of reconnections, number of ops where a 
> reconnection was necessary to proceed, number of failed ops due to 
> CONNECTIONLOSS, number of failed ops due to SESSIONEXPIRED, number of failed 
> ops due to OPERATIONTIMEOUT. 
> RecoverableZooKeeper is a class in hbase-client so we can hook up the new 
> metrics to both client- and server-side metrics reporters. Probably this 
> metrics source should be a process singleton. 
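
A sketch of the instrumentation shape, wrapping each op; the metrics sink and 
its method names are assumptions for illustration:

{code}
// Sketch only: a per-op-type timing/failure wrapper RecoverableZooKeeper could
// apply around each ZooKeeper call. The "metrics" sink and its methods are
// hypothetical, not an existing HBase API.
private <T> T instrument(String opType, Callable<T> op)
    throws KeeperException, InterruptedException {
  long start = System.nanoTime();
  try {
    T result = op.call();
    metrics.updateLatency(opType, System.nanoTime() - start); // per-op histogram
    return result;
  } catch (KeeperException.ConnectionLossException e) {
    metrics.incrementFailedOps("CONNECTIONLOSS");
    throw e;
  } catch (KeeperException.SessionExpiredException e) {
    metrics.incrementFailedOps("SESSIONEXPIRED");
    throw e;
  } catch (KeeperException.OperationTimeoutException e) {
    metrics.incrementFailedOps("OPERATIONTIMEOUT");
    throw e;
  } catch (KeeperException | InterruptedException e) {
    throw e;
  } catch (Exception e) {
    throw new RuntimeException(e); // Callable declares Exception
  }
}
{code}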



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20109) Add Admin#getMasterAddress API for lightweight discovery of the active master location

2018-02-28 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20109:
--

 Summary: Add Admin#getMasterAddress API for lightweight discovery 
of the active master location
 Key: HBASE-20109
 URL: https://issues.apache.org/jira/browse/HBASE-20109
 Project: HBase
  Issue Type: Improvement
  Components: Client
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 2.0.0, 1.5.0


Right now the only public API available to the client to learn the server name 
of the active master is Admin#getClusterStatus#getMaster, returning ServerName. 
On a cluster of any size getClusterStatus is expensive, especially if used only 
to retrieve the active master name. 

Let's add a simple API 

{code}
ServerName Admin#getMasterAddress()
{code}

for lightweight discovery of the active master location. This makes sense 
because, weirdly, Admin already has a method getMasterInfoPort(), returning 
int. 

Internally the client has a notion of the active master because there is a 
connection open to it, or one that can be reopened; if for some reason it isn't 
easy to build a ServerName from that state, the ServerName can be deserialized 
out of the znode tracking the active master location.
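
For comparison, what the client does today versus the proposed call (the new 
method is a proposal, not an existing API):

{code}
// Today: materializes the full ClusterStatus just to read one field.
ServerName active = admin.getClusterStatus().getMaster();

// Proposed (does not exist yet): lightweight discovery of the active master.
ServerName active2 = admin.getMasterAddress();

// Internally this could simply read the active-master znode, the way the
// client-side MasterAddressTracker already does.
{code}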



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-17448) Export metrics from RecoverableZooKeeper

2018-03-01 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-17448.

Resolution: Fixed

> Export metrics from RecoverableZooKeeper
> 
>
> Key: HBASE-17448
> URL: https://issues.apache.org/jira/browse/HBASE-17448
> Project: HBase
>  Issue Type: Improvement
>  Components: Zookeeper
>Affects Versions: 1.3.1
>Reporter: Andrew Purtell
>Assignee: Chinmay Kulkarni
>Priority: Major
>  Labels: patch
> Fix For: 1.4.2, 1.4.1, 1.4.0
>
> Attachments: HBASE-17448-branch-1.patch, HBASE-17448.patch, 
> HBASE-17448.patch
>
>
> Consider adding instrumentation to RecoverableZooKeeper that exposes metrics 
> on the performance and health of the embedded ZooKeeper client: latency 
> histograms for each op type, number of reconnections, number of ops where a 
> reconnection was necessary to proceed, number of failed ops due to 
> CONNECTIONLOSS, number of failed ops due to SESSIONEXPIRED, number of failed 
> ops due to OPERATIONTIMEOUT. 
> RecoverableZooKeeper is a class in hbase-client so we can hook up the new 
> metrics to both client- and server-side metrics reporters. Probably this 
> metrics source should be a process singleton. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20146) Regions are stuck while opening when WAL is disabled

2018-03-14 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-20146:


Reopened because there is an addendum in progress and some discussion about it. 
Please commit the addendum as soon as the discussion is settled, or revert the 
original commit. Thanks!

> Regions are stuck while opening when WAL is disabled
> 
>
> Key: HBASE-20146
> URL: https://issues.apache.org/jira/browse/HBASE-20146
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 1.3.1
>Reporter: Ashish Singhi
>Assignee: Ashish Singhi
>Priority: Critical
> Fix For: 2.0.0, 3.0.0, 1.3.2, 1.5.0, 1.2.7, 1.4.3
>
> Attachments: HBASE-20146-addendum.patch, HBASE-20146.patch, 
> HBASE-20146.v1.patch
>
>
> On a running cluster we had set {{hbase.regionserver.hlog.enabled}} to false 
> to disable the WAL for the complete cluster. After restarting the HBase 
> service, regions are not getting opened, leading to an HMaster abort as the 
> namespace table regions are not getting assigned. 
> jstack for region open:
> {noformat}
> "RS_OPEN_PRIORITY_REGION-BLR106595:16045-1" #159 prio=5 os_prio=0 
> tid=0x7fdfa4341000 nid=0x419d waiting on condition [0x7fdfa0467000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x87554448> (a 
> java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> at org.apache.hadoop.hbase.wal.WALKey.getWriteEntry(WALKey.java:98)
> at 
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:131)
> at 
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:88)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.writeRegionOpenMarker(HRegion.java:1026)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6849)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6803)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6774)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6730)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6681)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:363)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This used to work with HBase 1.0.2 version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20063) Port HBASE-19799 (Add web UI to rsgroup) to branch-1

2018-03-19 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20063.

   Resolution: Later
 Assignee: (was: Andrew Purtell)
Fix Version/s: (was: 1.4.3)
   (was: 1.5.0)

There are some Java 8-isms and a difficult problem with how to handle JSP pages 
meant for hbase-server that must be separated out into hbase-rsgroup on 
branch-1 and dynamically loaded. Not worth it at this time, I think, so 
resolving as Later (probably Never).

> Port HBASE-19799 (Add web UI to rsgroup) to branch-1
> 
>
> Key: HBASE-20063
> URL: https://issues.apache.org/jira/browse/HBASE-20063
> Project: HBase
>  Issue Type: Task
>  Components: rsgroup, UI
>Reporter: Andrew Purtell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20315) Document post release process steps for RM

2018-03-29 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20315:
--

 Summary: Document post release process steps for RM
 Key: HBASE-20315
 URL: https://issues.apache.org/jira/browse/HBASE-20315
 Project: HBase
  Issue Type: Task
  Components: build, documentation
Reporter: Andrew Purtell


We should document post release steps that RMs have to take and add it to the 
'How To Release' section of the refguide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20318) Lower "Set storagePolicy=XXX for path=YYY" INFO level logging to DEBUG

2018-03-29 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20318:
--

 Summary: Lower "Set storagePolicy=XXX for path=YYY" INFO level 
logging to DEBUG
 Key: HBASE-20318
 URL: https://issues.apache.org/jira/browse/HBASE-20318
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0, 2.0.1


Set storagePolicy=XXX for path=YYY INFO level logging is too chatty, drop to 
DEBUG.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20318) Lower "Set storagePolicy=XXX for path=YYY" INFO level logging to DEBUG

2018-03-29 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20318.

   Resolution: Invalid
Fix Version/s: (was: 2.0.1)
   (was: 1.5.0)
   (was: 2.1.0)
   (was: 3.0.0)

Turns out this is just an issue on an internal backport.

> Lower "Set storagePolicy=XXX for path=YYY" INFO level logging to DEBUG
> --
>
> Key: HBASE-20318
> URL: https://issues.apache.org/jira/browse/HBASE-20318
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
>
> Set storagePolicy=XXX for path=YYY INFO level logging is too chatty, drop to 
> DEBUG.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-14729) SplitLogManager does not clean files from WALs folder in case of master failover

2018-04-06 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-14729:


> SplitLogManager does not clean files from WALs folder in case of master 
> failover
> 
>
> Key: HBASE-14729
> URL: https://issues.apache.org/jira/browse/HBASE-14729
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.0.0
>Reporter: Samir Ahmic
>Assignee: Samir Ahmic
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-14729.patch
>
>
> While I was testing the master failover process on the master branch 
> (distributed cluster setup) I noticed the following:
> 1. The list of dead regionservers was increasing every time the active master 
> was restarted.
> 2. The number of folders in the /hbase/WALs folder was increasing every time 
> the active master was restarted.
> Here is the exception from the master logs showing why this is happening:
> {code}
> 2015-10-30 09:41:49,238 INFO  [ProcedureExecutor-3] master.SplitLogManager: 
> finished splitting (more than or equal to) 0 bytes in 0 log files in 
> [hdfs://P3cluster/hbase/WALs/hnode1,16000,1446043659224-splitting] in 21ms
> 2015-10-30 09:41:49,235 WARN  [ProcedureExecutor-2] master.SplitLogManager: 
> Returning success without actually splitting and deleting all the log files 
> in path hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting: 
> [FileStatus{path=hdfs://P3cluster/hbase/WALs/hnode1,16000,1446046595488-splitting/hnode1%2C16000%2C1446046595488.meta.1446046691314.meta;
>  isDirectory=false; length=39944; replication=3; blocksize=268435456; 
> modification_time=1446050348104; access_time=1446046691317; owner=hbase; 
> group=supergroup; permission=rw-r--r--; isSymlink=false}]
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
>  `/hbase/WALs/hnode1,16000,1446046595488-splitting is non empty': Directory 
> is not empty
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3524)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3479)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3463)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:751)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:562)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1411)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy15.delete(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:490)
>   at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy16.delete(Unknown Source)
>   at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>   at com.sun.proxy.$Proxy17.delete(Unknown Source)
>   at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(

[jira] [Created] (HBASE-20429) Support for mixed or write-heavy workloads on non-HDFS filesystems

2018-04-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20429:
--

 Summary: Support for mixed or write-heavy workloads on non-HDFS 
filesystems
 Key: HBASE-20429
 URL: https://issues.apache.org/jira/browse/HBASE-20429
 Project: HBase
  Issue Type: Umbrella
Reporter: Andrew Purtell


We can reasonably well support use cases on non-HDFS filesystems, like S3, 
where an external writer has loaded (and continues to load) HFiles via the bulk 
load mechanism, and we then serve out a read-only workload at the HBase API.

Mixed workloads or write-heavy workloads won't fare as well. In fact, data loss 
seems certain. It will depend on the specific filesystem, but all of the S3 
backed Hadoop filesystems suffer from a couple of obvious problems, notably a 
lack of atomic rename. 

This umbrella will serve to collect some related ideas for consideration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20430) Improve store file management for non-HDFS filesystems

2018-04-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20430:
--

 Summary: Improve store file management for non-HDFS filesystems
 Key: HBASE-20430
 URL: https://issues.apache.org/jira/browse/HBASE-20430
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell


HBase keeps a file open for every active store file so no additional round 
trips to the NameNode are needed after the initial open. HDFS internally 
multiplexes open files, but the Hadoop S3 filesystem implementations do not, 
or, at least, not as well. As the bulk of data under management increases we 
observe the required number of concurrently open connections will rise, and 
expect it will eventually exhaust a limit somewhere (the client, the OS file 
descriptor table or open file limits, or the S3 service).

Initially we can simply introduce an option to close every store file after the 
reader has finished, and determine the performance impact. Use cases backed by 
non-HDFS filesystems will already have to cope with a different read 
performance profile. Based on experiments with the S3 backed Hadoop 
filesystems, notably S3A, even with aggressively tuned options simple reads can 
be very slow when there are blockcache misses, 15-20 seconds observed for Get 
of a single small row, for example. We expect extensive use of the BucketCache 
to mitigate in this application already. Could be backed by offheap storage, 
but more likely a large number of cache files managed by the file engine on 
local SSD storage. If misses are already going to be super expensive, then the 
motivation to do more than simply open store files on demand is largely absent.

Still, we could employ a predictive cache. Where frequent access to a given 
store file (or, at least, its store) is predicted, keep a reference to the 
store file open. We can keep statistics about read frequency, write them out to 
HFiles during compaction, and note these stats when opening the region, perhaps 
by reading all meta blocks of region HFiles when opening. Otherwise, close the 
file after reading and open again on demand. Need to be careful not to use ARC 
or equivalent as the cache replacement strategy as it is encumbered. The size 
of the cache can be determined at startup after detecting the underlying 
filesystem. E.g. setCacheSize(VERY_LARGE_CONSTANT) if (fs instanceof 
DistributedFileSystem), so we don't lose much when on HDFS still.
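
A sketch of the filesystem-aware sizing mentioned above; the constant, the 
configuration key, and the reader cache itself are placeholders:

{code}
// Sketch only: size an open-store-file-reader cache based on the underlying
// filesystem. VERY_LARGE_CONSTANT and the configuration key are placeholders.
int chooseOpenReaderCacheSize(Configuration conf, FileSystem fs) {
  if (fs instanceof DistributedFileSystem) {
    // On HDFS keep today's behavior: effectively every store file stays open.
    return VERY_LARGE_CONSTANT;
  }
  // On S3-backed filesystems cap the number of open readers and reopen on
  // demand, optionally pinning "hot" store files based on read-frequency stats.
  return conf.getInt("hbase.storefile.reader.cache.size", 1024); // hypothetical key
}
{code}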



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20431:
--

 Summary: Store commit transaction for filesystems that do not 
support an atomic rename
 Key: HBASE-20431
 URL: https://issues.apache.org/jira/browse/HBASE-20431
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell


HBase expects the Hadoop filesystem implementation to support an atomic 
rename() operation. HDFS does. The S3 backed filesystems do not. The 
fundamental issue is the non-atomic and eventually consistent nature of the S3 
service. An S3 bucket is not a filesystem. S3 is not always immediately 
read-your-writes. Object metadata can be temporarily inconsistent just after 
new objects are stored. There can be a settling period to ride over. 
Renaming/moving objects from one path to another are copy operations with 
O(file) complexity and O(data) time followed by a series of deletes with 
O(file) complexity. Failures at any point prior to completion will leave the 
operation in an inconsistent state. The missing atomic rename semantic opens 
opportunities for corruption and data loss, which may or may not be repairable 
with HBCK.

Handling this at the HBase level could be done with a new multi-step filesystem 
transaction framework. Call it StoreCommitTransaction. SplitTransaction and 
MergeTransaction are well established cases where even on HDFS we have 
non-atomic filesystem changes and are our implementation template for the new 
work. In this new StoreCommitTransaction we'd be moving flush and compaction 
temporaries out of the temporary directory into the region store directory. On 
HDFS the implementation would be easy. We can rely on the filesystem's atomic 
rename semantics. On S3 it would be work: First we would build the list of 
objects to move, then copy each object into the destination, and then finally 
delete all objects at the original path. We must handle transient errors with 
retry strategies appropriate for the action at hand. We must handle serious or 
permanent errors where the RS doesn't need to be aborted with a rollback that 
cleans it all up. Finally, we must handle permanent errors where the RS must be 
aborted with a rollback during region open/recovery. Note that after all 
objects have been copied and we are deleting obsolete source objects we must 
roll forward, not back. To support recovery after an abort we must utilize the 
WAL to track transaction progress. Put markers in for StoreCommitTransaction 
start and completion state, with details of the store file(s) involved, so it 
can be rolled back during region recovery at open. This will be significant 
work in HFile, HStore, flusher, compactor, and HRegion. Wherever we use HDFS's 
rename now we would substitute the running of this new multi-step filesystem 
transaction.
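
A schematic of the phases such a StoreCommitTransaction would run through (all 
names below are hypothetical; the WAL markers and retry policies are as 
described above):

{code}
// Schematic only; class, marker, and helper names are hypothetical.
List<Path> sources = listTemporaryStoreFiles(tmpDir);       // 1. enumerate
wal.append(storeCommitStartMarker(region, sources));        // 2. log intent

for (Path src : sources) {                                  // 3. copy phase
  Path dst = new Path(storeDir, src.getName());
  copyWithRetries(fs, src, dst);  // use a server-side copy (PUT COPY) if supported
}

// Point of no return: every destination object exists, so failures from here
// on are handled by rolling forward (finishing the deletes), never back.
for (Path src : sources) {                                  // 4. delete phase
  deleteWithRetries(fs, src);
}
wal.append(storeCommitCompleteMarker(region, sources));     // 5. log completion

// Recovery at region open: a start marker without a matching completion marker
// tells us whether to roll back (remove partial copies) or roll forward.
{code}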

We need to determine this for certain, but I believe the PUT or multipart 
upload of an object must complete before the object is visible, so we don't 
have to worry about the case where an object is visible before fully uploaded 
as part of normal operations. So an individual object copy will either happen 
entirely and the target will then become visible, or it won't and the target 
won't exist.

S3 has an optimization, PUT COPY 
(https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which 
the AmazonClient embedded in S3A utilizes for moves. When designing the 
StoreCommitTransaction be sure to allow for filesystem implementations that 
leverage a server side copy operation. Doing a get-then-put should be optional. 
(Not sure Hadoop has an interface that advertises this capability yet; we can 
add one if not.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20445) Defer work when a row lock is busy

2018-04-17 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20445:
--

 Summary: Defer work when a row lock is busy
 Key: HBASE-20445
 URL: https://issues.apache.org/jira/browse/HBASE-20445
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell


Instead of blocking on row locks, defer the call and make the call runner 
available so it can service other activity. Have runners pick up deferred calls 
in the background after servicing the other request. 

Spin briefly on tryLock() wherever we are now using lock() to acquire a row 
lock. Introduce two new configuration parameters: one for the amount of time to 
wait between lock acquisition attempts, and another for the total number of 
times we wait before deferring the work. If the lock cannot be acquired, put 
the call back into the call queue. Call queues therefore should be priority 
queues sorted by deadline. Currently they are implemented with 
LinkedBlockingQueue (which isn't), or AdaptiveLifoCoDelCallQueue (which is) if 
the CoDel scheduler is enabled. Perhaps we could just require use of 
AdaptiveLifoCoDelCallQueue. Runners will be picking up work from the head of 
the queues as long as they are not empty, so deferred calls will be serviced 
again, or dropped if the deadline has passed.
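
A sketch of the tryLock() spin with deferral; the two configuration keys and 
the requeue hook are hypothetical:

{code}
// Sketch only: spin briefly on tryLock(), then defer instead of blocking the
// call runner. Configuration keys and the deferral hook are hypothetical.
long waitIntervalMs = conf.getLong("hbase.rowlock.trylock.wait.interval.ms", 10);
int maxAttempts = conf.getInt("hbase.rowlock.trylock.max.attempts", 3);

boolean locked = false;
for (int attempt = 0; attempt < maxAttempts && !locked; attempt++) {
  locked = rowLock.tryLock(waitIntervalMs, TimeUnit.MILLISECONDS);
}
if (!locked) {
  // Requeue the call (plus any partial completion state) and free the runner;
  // the deadline-sorted call queue services it again or drops it if expired.
  scheduler.dispatch(deferredCallTask);
  return;
}
try {
  // ... perform the operation under the row lock ...
} finally {
  rowLock.unlock();
}
{code}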

Implementing continuations for simple operations should be straightforward. 

Batch mutations try to acquire as many rowlocks as they can, then do the 
partial batch over the successfully locked rows, then loop back to attempt the 
remaining work. This is a partial implementation of what we need so we can 
build on it. Rather than loop around, save the partial batch completion state 
and put a pointer to it along with the call back into the RPC queue.

For scans where allowPartialResults has been set to true we can simply complete 
the call at the point we fail to acquire a row lock. The client will handle the 
rest. For scans where allowPartialResults is false we have to save the scanner 
state and partial results, and put a pointer to this state along with the call 
back into the queue. 

We could approach this in phases:

Phase 0 - Sort out the call queuing details. Do we require 
AdaptiveLifoCoDelCallQueue? Certainly we can make use of it. Can we also have 
RWQueueRpcExecutor create queues as PriorityBlockingQueue instead of 
LinkedBlockingQueue? There must be a reason why not already.

Phase 1 - Implement deferral of simple ops only. (Batch mutations and scans 
will still block on rowlocks.)

Phase 2 - Implement deferral of batch mutations. (Scans will still block on 
rowlocks.)

Phase 3 - Implement deferral of scans where allowPartialResults is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20453) Shell fails to start with SyntaxError

2018-04-18 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20453:
--

 Summary: Shell fails to start with SyntaxError
 Key: HBASE-20453
 URL: https://issues.apache.org/jira/browse/HBASE-20453
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.4
Reporter: Andrew Purtell


SyntaxError: hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:724: syntax 
error, unexpected tDOT

 .map { |i| Bytes.toStringBinary(i.getRegionInfo().getStartKey) }
 ^
  require at org/jruby/RubyKernel.java:1062
   (root) at 
/Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:25
  require at org/jruby/RubyKernel.java:1062
   (root) at 
/Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/hbase.rb:102
  require at org/jruby/RubyKernel.java:1062
   (root) at /Users/apurtell/src/hbase/bin/../bin/hirb.rb:107




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20276) [shell] Revert shell REPL change and document

2018-04-18 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-20276:


At least in branch-1 and branch-1.4 this broke the shell, please fix

SyntaxError: hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:724: syntax 
error, unexpected tDOT

.map { |i| Bytes.toStringBinary(i.getRegionInfo().getStartKey) }
^
require at org/jruby/RubyKernel.java:1062
(root) at 
/Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:25
require at org/jruby/RubyKernel.java:1062
(root) at 
/Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/hbase.rb:102
require at org/jruby/RubyKernel.java:1062
(root) at /Users/apurtell/src/hbase/bin/../bin/hirb.rb:107



> [shell] Revert shell REPL change and document
> -
>
> Key: HBASE-20276
> URL: https://issues.apache.org/jira/browse/HBASE-20276
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, shell
>Affects Versions: 1.4.0, 2.0.0
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.4.4, 2.0.0
>
> Attachments: HBASE-20276.0.patch, HBASE-20276.1.patch, 
> HBASE-20276.2.patch, HBASE-20276.3.patch
>
>
> Feedback from [~mdrob] on HBASE-19158:
> {quote}
> Shell:
> HBASE-19770. There was another issue opened where this was identified as a 
> problem so maybe the shape will change further, but I can't find it now.
> {quote}
> New commentary from [~busbey]:
> This was a follow on to HBASE-15965. That change effectively makes it so none 
> of our ruby wrappers can be used to build expressions in an interactive REPL. 
> This is a pretty severe change (most of my tips on HBASE-15611 will break, I 
> think).
> I think we should
> a) Have a DISCUSS thread, spanning dev@ and user@
> b) based on the outcome of that thread, either default to the new behavior or 
> the old behavior
> c) if we keep the HBASE-15965 behavior as  the default, flag it as 
> incompatible, call it out in the hbase 2.0 upgrade section, and update docs 
> (two examples: the output in the shell_exercises sections would be wrong, and 
> the _table_variables section won't work)
> d) In either case document the new flag in the ref guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20453) Shell fails to start with SyntaxError

2018-04-18 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20453.

Resolution: Duplicate

Reopenend HBASE-20276 instead, duping this

> Shell fails to start with SyntaxError
> -
>
> Key: HBASE-20453
> URL: https://issues.apache.org/jira/browse/HBASE-20453
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Major
>
> SyntaxError: hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:724: 
> syntax error, unexpected tDOT
>  .map { |i| Bytes.toStringBinary(i.getRegionInfo().getStartKey) }
>  ^
>   require at org/jruby/RubyKernel.java:1062
>(root) at 
> /Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/table.rb:25
>   require at org/jruby/RubyKernel.java:1062
>(root) at 
> /Users/apurtell/src/hbase/bin/../hbase-shell/src/main/ruby/hbase/hbase.rb:102
>   require at org/jruby/RubyKernel.java:1062
>(root) at /Users/apurtell/src/hbase/bin/../bin/hirb.rb:107



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20486) Change default compaction throughput controller to PressureAwareThroughputController in branch-1

2018-04-24 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20486:
--

 Summary: Change default compaction throughput controller to 
PressureAwareThroughputController in branch-1
 Key: HBASE-20486
 URL: https://issues.apache.org/jira/browse/HBASE-20486
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
 Fix For: 1.5.0


Switch the default compaction throughput controller from 
NoLimitThroughputController to PressureAwareThroughputController in branch-1.
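
For operators who want to try this ahead of the default change, a minimal sketch of selecting the controller explicitly; the configuration key and the controller's package are branch-1 assumptions, not quoted from this issue:

{code:java}
// Illustrative only; assumes the branch-1 key "hbase.regionserver.throughput.controller"
// and the branch-1 package of PressureAwareCompactionThroughputController.
// (Configuration / HBaseConfiguration from org.apache.hadoop.conf / org.apache.hadoop.hbase.)
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.regionserver.throughput.controller",
    "org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController");
{code}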



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-9465) Push entries to peer clusters serially

2018-04-25 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-9465:
---

> Push entries to peer clusters serially
> --
>
> Key: HBASE-9465
> URL: https://issues.apache.org/jira/browse/HBASE-9465
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver, Replication
>Affects Versions: 1.4.0, 2.0.0
>Reporter: Honghua Feng
>Assignee: Phil Yang
>Priority: Critical
> Attachments: HBASE-9465-branch-1-v1.patch, 
> HBASE-9465-branch-1-v1.patch, HBASE-9465-branch-1-v2.patch, 
> HBASE-9465-branch-1-v3.patch, HBASE-9465-branch-1-v4.patch, 
> HBASE-9465-branch-1-v4.patch, HBASE-9465-branch-1.v4.revert.patch, 
> HBASE-9465-v1.patch, HBASE-9465-v2.patch, HBASE-9465-v2.patch, 
> HBASE-9465-v3.patch, HBASE-9465-v4.patch, HBASE-9465-v5.patch, 
> HBASE-9465-v6.patch, HBASE-9465-v6.patch, HBASE-9465-v7.patch, 
> HBASE-9465-v7.patch, HBASE-9465.pdf
>
>
> When region-move or RS failure occurs in master cluster, the hlog entries 
> that are not pushed before region-move or RS-failure will be pushed by 
> original RS(for region move) or another RS which takes over the remained hlog 
> of dead RS(for RS failure), and the new entries for the same region(s) will 
> be pushed by the RS which now serves the region(s), but they push the hlog 
> entries of a same region concurrently without coordination.
> This treatment can possibly lead to data inconsistency between master and 
> peer clusters:
> 1. there are put and then delete written to master cluster
> 2. due to region-move / RS-failure, they are pushed by different 
> replication-source threads to peer cluster
> 3. if delete is pushed to peer cluster before put, and flush and 
> major-compact occurs in peer cluster before put is pushed to peer cluster, 
> the delete is collected and the put remains in peer cluster
> In this scenario, the put remains in peer cluster, but in master cluster the 
> put is masked by the delete, hence data inconsistency between master and peer 
> clusters



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20493) Port HBASE-19994 (Create a new class for RPC throttling exception, make it retryable) to branch-1

2018-04-25 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20493:
--

 Summary: Port HBASE-19994 (Create a new class for RPC throttling 
exception, make it retryable) to branch-1
 Key: HBASE-20493
 URL: https://issues.apache.org/jira/browse/HBASE-20493
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0


Port HBASE-19994 (Create a new class for RPC throttling exception, make it 
retryable). Need to preserve the current behavior where the client gets a 
non-retryable ThrottlingException and only optionally throw back the retryable 
RpcThrottlingException if explicitly allowed by configuration.
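
A sketch of the shape of that compatibility switch; the key name below is hypothetical (whatever the patch settles on), only the two exception class names come from this description:

{code:java}
// Illustrative only. "hbase.quota.retryable.throttling.exception" is a hypothetical
// key: false (default) keeps the old non-retryable ThrottlingException, true opts in
// to the retryable RpcThrottlingException behavior.
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("hbase.quota.retryable.throttling.exception", true); // hypothetical key
{code}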



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20496) TestGlobalThrottler failing on branch-1 since revert of HBASE-9465

2018-04-26 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20496:
--

 Summary: TestGlobalThrottler failing on branch-1 since revert of 
HBASE-9465
 Key: HBASE-20496
 URL: https://issues.apache.org/jira/browse/HBASE-20496
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell


Not sure why we didn't catch it earlier, but with my latest dev setup including 
8u JVM, TestGlobalThrottler fails reliably, and a git bisect finds the problem 
at this revert:

{noformat}
commit ba7a936f74985eb9d974fdc87b0d06cb8cd8473d
Author: Sean Busbey 
Date: Tue Nov 7 23:50:35 2017 -0600

Revert "HBASE-9465 Push entries to peer clusters serially"

This reverts commit 441bc050b991c14c048617bc443b97f46e21b76f.

Conflicts:
hbase-client/src/main/java/org/apache/hadoop/hbase/MetaTableAccessor.java
hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java
hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/WALProtos.java
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

Signed-off-by: Andrew Purtell 
{noformat}

For now I'm going to disable the test. Leaving this open for debugging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-9465) Push entries to peer clusters serially

2018-04-26 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-9465.
---
Resolution: Fixed

> Push entries to peer clusters serially
> --
>
> Key: HBASE-9465
> URL: https://issues.apache.org/jira/browse/HBASE-9465
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver, Replication
>Affects Versions: 1.4.0, 2.0.0
>Reporter: Honghua Feng
>Assignee: Phil Yang
>Priority: Critical
> Attachments: HBASE-9465-branch-1-v1.patch, 
> HBASE-9465-branch-1-v1.patch, HBASE-9465-branch-1-v2.patch, 
> HBASE-9465-branch-1-v3.patch, HBASE-9465-branch-1-v4.patch, 
> HBASE-9465-branch-1-v4.patch, HBASE-9465-branch-1.v4.revert.patch, 
> HBASE-9465-v1.patch, HBASE-9465-v2.patch, HBASE-9465-v2.patch, 
> HBASE-9465-v3.patch, HBASE-9465-v4.patch, HBASE-9465-v5.patch, 
> HBASE-9465-v6.patch, HBASE-9465-v6.patch, HBASE-9465-v7.patch, 
> HBASE-9465-v7.patch, HBASE-9465.pdf
>
>
> When region-move or RS failure occurs in master cluster, the hlog entries 
> that are not pushed before region-move or RS-failure will be pushed by 
> original RS(for region move) or another RS which takes over the remained hlog 
> of dead RS(for RS failure), and the new entries for the same region(s) will 
> be pushed by the RS which now serves the region(s), but they push the hlog 
> entries of a same region concurrently without coordination.
> This treatment can possibly lead to data inconsistency between master and 
> peer clusters:
> 1. there are put and then delete written to master cluster
> 2. due to region-move / RS-failure, they are pushed by different 
> replication-source threads to peer cluster
> 3. if delete is pushed to peer cluster before put, and flush and 
> major-compact occurs in peer cluster before put is pushed to peer cluster, 
> the delete is collected and the put remains in peer cluster
> In this scenario, the put remains in peer cluster, but in master cluster the 
> put is masked by the delete, hence data inconsistency between master and peer 
> clusters



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20493) Port HBASE-19994 (Create a new class for RPC throttling exception, make it retryable) to branch-1

2018-04-26 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20493.

  Resolution: Fixed
Hadoop Flags: Reviewed

> Port HBASE-19994 (Create a new class for RPC throttling exception, make it 
> retryable) to branch-1
> -
>
> Key: HBASE-20493
> URL: https://issues.apache.org/jira/browse/HBASE-20493
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: HBASE-20493-branch-1.patch
>
>
> Port HBASE-19994 (Create a new class for RPC throttling exception, make it 
> retryable). Need to preserve the current behavior where the client gets a 
> non-retryable ThrottlingException and only optionally throw back the 
> retryable RpcThrottlingException if explicitly allowed by configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20501) Update Hadoop minimum version to 2.7

2018-04-27 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20501:
--

 Summary: Update Hadoop minimum version to 2.7
 Key: HBASE-20501
 URL: https://issues.apache.org/jira/browse/HBASE-20501
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0


See discussion thread on dev@ "[DISCUSS] Branching for HBase 1.5 and Hadoop 
minimum version update (to 2.7)"

Consensus
* This is a needed change due to the practicalities of having Hadoop as a 
dependency
* Let's move up the minimum supported version of Hadoop to 2.7.1.
* Update documentation (support matrix, compatibility discussion) to call this 
out.
* Be sure to call out this change in the release notes.
* Take the opportunity to remind users about our callout "Replace the Hadoop 
Bundled With HBase!" recommending users upgrade their Hadoop if < 2.7.1.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20505) PE should support multi column family read and write cases

2018-04-27 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20505:
--

 Summary: PE should support multi column family read and write cases
 Key: HBASE-20505
 URL: https://issues.apache.org/jira/browse/HBASE-20505
 Project: HBase
  Issue Type: Test
Reporter: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0


PerformanceEvaluation has a --columns parameter but this adjusts the number of 
distinct column qualifiers to write (and, with --addColumns, to add to the 
scan), not the number of column families. 

We need something like a new --families parameter that will increase the number 
of column families defined in the test table schema, written to, and included 
in gets and scans. Default is 1, current behavior.
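
A rough sketch of what such a --families option would mean for the PE table schema; the family naming and table name below are illustrative, not the committed change:

{code:java}
// Sketch: expand the PE test table schema to N column families, as a hypothetical
// --families=N option would. PE would then write to, get from, and scan every family.
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class MultiFamilySchemaSketch {
  public static HTableDescriptor build(int families) {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("TestTable"));
    for (int i = 0; i < families; i++) {
      desc.addFamily(new HColumnDescriptor("info" + i)); // illustrative family names
    }
    return desc;
  }
}
{code}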



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20513) Collect and emit ScanMetrics in PerformanceEvaluation

2018-05-01 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20513:
--

 Summary: Collect and emit ScanMetrics in PerformanceEvaluation
 Key: HBASE-20513
 URL: https://issues.apache.org/jira/browse/HBASE-20513
 Project: HBase
  Issue Type: Improvement
  Components: test
Reporter: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0


To better understand changes in scanning behavior between versions, enable 
ScanMetrics collection in PerformanceEvaluation and collect and roll up the 
results into a report at termination.
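
For context, a minimal client-side sketch of how scan metrics are switched on and read, assuming the branch-1 era Scan#setScanMetricsEnabled / Scan#getScanMetrics API; PE would roll these counters up across client threads:

{code:java}
// Sketch only: enable ScanMetrics on a scan and dump the collected counters.
import java.util.Map;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.metrics.ScanMetrics;

public class ScanMetricsSketch {
  static void scanWithMetrics(Table table) throws Exception {
    Scan scan = new Scan();
    scan.setScanMetricsEnabled(true);
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        // consume rows
      }
    }
    ScanMetrics metrics = scan.getScanMetrics();
    if (metrics != null) {
      for (Map.Entry<String, Long> e : metrics.getMetricsMap().entrySet()) {
        System.out.println(e.getKey() + "=" + e.getValue());
      }
    }
  }
}
{code}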



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20517) Fix PerformanceEvaluation 'column' parameter

2018-05-01 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20517:
--

 Summary: Fix PerformanceEvaluation 'column' parameter
 Key: HBASE-20517
 URL: https://issues.apache.org/jira/browse/HBASE-20517
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0, 1.2.7, 1.3.3, 2.0.1, 1.4.5


PerformanceEvaluation's 'column' parameter looks broken to me. 

To test:
1. Write some data with 20 columns.
2. Do a scan test selecting one column.
3. Do a scan test selecting ten columns.

You'd expect the amount of data returned to vary but no, because the read side 
isn't selecting the same qualifiers that are written. Bytes returned in case 3 
should be 10x those in case 2.

I'm in branch-1 code at the moment. Probably affects trunk too.
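
One possible reproduction, sketched with the existing PE flags mentioned in HBASE-20505 (exact command shapes may vary by branch):

{noformat}
hbase pe --nomapred --columns=20 sequentialWrite 1
hbase pe --nomapred --columns=1 --addColumns=true scan 1
hbase pe --nomapred --columns=10 --addColumns=true scan 1
{noformat}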



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20517) Fix PerformanceEvaluation 'column' parameter

2018-05-04 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20517.

Resolution: Fixed

Pushed to 1.2 and up

> Fix PerformanceEvaluation 'column' parameter
> 
>
> Key: HBASE-20517
> URL: https://issues.apache.org/jira/browse/HBASE-20517
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 3.0.0, 2.1.0, 1.5.0, 1.2.7, 1.3.3, 2.0.1, 1.4.5
>
> Attachments: HBASE-20517-branch-1.patch, HBASE-20517.patch
>
>
> PerformanceEvaluation's 'column' parameter looks broken to me.
> To test:
> 1. Write some data with 20 columns.
> 2. Do a scan test selecting one column.
> 3. Do a scan test selecting ten columns.
> You'd expect the amount of data returned to vary but no, because the read 
> side isn't selecting the same qualifiers that are written. Bytes returned in 
> case 3 should be 10x those in case 2.
> I'm in branch-1 code at the moment. Probably affects trunk too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20513) Collect and emit ScanMetrics in PerformanceEvaluation

2018-05-04 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20513.

Resolution: Fixed

Pushed to 1.3 and up

> Collect and emit ScanMetrics in PerformanceEvaluation
> -
>
> Key: HBASE-20513
> URL: https://issues.apache.org/jira/browse/HBASE-20513
> Project: HBase
>  Issue Type: Test
>  Components: test
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 3.0.0, 2.1.0, 1.5.0, 1.3.3, 2.0.1, 1.4.5
>
> Attachments: HBASE-20513-branch-1.patch, HBASE-20513.patch
>
>
> To better understand changes in scanning behavior between versions, enable 
> ScanMetrics collection in PerformanceEvaluation and collect and roll up the 
> results into a report at termination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20505) PE should support multi column family read and write cases

2018-05-07 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20505.

Resolution: Fixed

Pushed to 1.2 and up

> PE should support multi column family read and write cases
> --
>
> Key: HBASE-20505
> URL: https://issues.apache.org/jira/browse/HBASE-20505
> Project: HBase
>  Issue Type: Test
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 3.0.0, 2.1.0, 1.5.0, 1.3.3, 2.0.1, 1.4.5
>
> Attachments: HBASE-20505-branch-1.patch, HBASE-20505.patch
>
>
> PerformanceEvaluation has a --columns parameter but this adjusts the number 
> of distinct column qualifiers to write (and, with --addColumns, to add to the 
> scan), not the number of column families. 
> We need something like a new --families parameter that will increase the 
> number of column families defined in the test table schema, written to, and 
> included in gets and scans. Default is 1, current behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20554) "WALs outstanding" message from CleanerChore is noisy

2018-05-09 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20554:
--

 Summary: "WALs outstanding" message from CleanerChore is noisy
 Key: HBASE-20554
 URL: https://issues.apache.org/jira/browse/HBASE-20554
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell


WARN level "WALs outstanding" from CleanerChore should be DEBUG and are not 
always correct. 

I left a cluster configured for ITBLL (retaining all WALs for post hoc 
analysis) and in the morning found the master log full of "WALs outstanding" 
warnings from CleanerChore. 

Should this really be a warning? Perhaps better logged at DEBUG level.

{quote}
2018-05-09 16:42:03,893 WARN  [node-1.cluster,16000,1525851521469_ChoreService_2] cleaner.CleanerChore: WALs outstanding under hdfs://node-1.cluster/hbase/oldWALs
{quote}

If someone has configured really long WAL retention then having WALs in oldWALs 
will be normal. 

Also, it seems the warning is sometimes incorrect.

{quote}
2018-05-09 16:42:24,751 WARN  [node-1.cluster,16000,1525851521469_ChoreService_1] cleaner.CleanerChore: WALs outstanding under hdfs://node-1.cluster/hbase/archive
{quote}

There are no WALs under archive/. 

Even at DEBUG level, if it is not correct, then it can lead an operator to be 
concerned about nothing, so better to just remove it.
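
Should removal be considered too aggressive, a minimal sketch of the milder option (downgrade to a guarded DEBUG), illustrative only:

{code:java}
// Sketch: log at DEBUG, and only when DEBUG is enabled, instead of WARN.
// 'dir' stands in for the directory the chore is cleaning.
if (LOG.isDebugEnabled()) {
  LOG.debug("WALs outstanding under " + dir);
}
{code}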



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20595) Remove the concept of 'special tables' from rsgroups

2018-05-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20595:
--

 Summary: Remove the concept of 'special tables' from rsgroups
 Key: HBASE-20595
 URL: https://issues.apache.org/jira/browse/HBASE-20595
 Project: HBase
  Issue Type: Task
  Components: Region Assignment, rsgroup
Reporter: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0


The regionserver groups feature needs to specially handle what it calls "special tables", 
tables upon which core or other modular functionality depends. They need to be 
excluded from normal rsgroup processing during bootstrap to avoid circular 
dependencies or errors due to insufficiently initialized state. I think we also 
want to ensure that such tables are always given a rsgroup assignment with 
nonzero servers. (IIRC another issue already raises that point, we can link it 
later.)

Special tables include:
* The system tables in the 'hbase:' namespace
* The ACL table if the AccessController coprocessor is installed
* The Labels table if the VisibilityController coprocessor is installed
* The Quotas table if the FS quotas feature is active

Either we need a facility where "special tables" can be registered, which 
should be in core. Or, we institute a blanket rule that core and all extensions 
that need a "special table" must put them into the 'hbase:' namespace, so the 
TableName#isSystemTable() test will return TRUE for all, and then rsgroups 
simply needs to test for that.
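
A minimal sketch of the second option, the blanket 'hbase:' namespace rule, assuming only the existing TableName#isSystemTable() accessor:

{code:java}
// Sketch: under the blanket rule, rsgroups treats any table in the hbase:
// namespace as special and skips it during bootstrap processing.
import org.apache.hadoop.hbase.TableName;

public class SpecialTableRule {
  static boolean isSpecialTable(TableName tableName) {
    return tableName.isSystemTable(); // true for every table in the hbase: namespace
  }
}
{code}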



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20597) Use a lock to serialize access to a shared reference to ZooKeeperWatcher in HBaseReplicationEndpoint

2018-05-17 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20597:
--

 Summary: Use a lock to serialize access to a shared reference to 
ZooKeeperWatcher in HBaseReplicationEndpoint
 Key: HBASE-20597
 URL: https://issues.apache.org/jira/browse/HBASE-20597
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.4, 1.3.2
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0, 1.3.3, 2.0.1, 1.4.5


The code that closes down a ZKW that fails to initialize when attempting to 
connect to the remote cluster is not MT safe and can in theory leak 
ZooKeeperWatcher instances. The allocation of a new ZKW and store to the 
reference is not atomic. Might have concurrent allocations with only one 
winning store, leading to leaked ZKW instances. If the connection problem is 
persistent, like loss of shared trust between the clusters, we may accumulate 
unclosed ZKW instances over time, with a ZK send thread and event thread each, 
and eventually have enough leaked threads to cause OOME (cannot allocate native 
thread). 
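
A minimal sketch of the serialization idea in the summary, not the committed patch; createNewWatcher() is a stand-in for the real peer connection setup in HBaseReplicationEndpoint:

{code:java}
// Sketch: publish the shared watcher under a lock so concurrent callers cannot
// each allocate an instance and leak the losers.
import java.io.IOException;
import org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher;

public abstract class SerializedWatcherHolder {
  private final Object zkwLock = new Object();
  private ZooKeeperWatcher zkw; // shared reference, only touched under zkwLock

  ZooKeeperWatcher getOrCreateWatcher() throws IOException {
    synchronized (zkwLock) {
      if (zkw == null) {
        zkw = createNewWatcher(); // may throw; nothing is published on failure
      }
      return zkw;
    }
  }

  // Stand-in for the code that connects to the remote cluster's ZooKeeper.
  protected abstract ZooKeeperWatcher createNewWatcher() throws IOException;
}
{code}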



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20603) Histogram metrics should reset min and max in snapshotAndReset

2018-05-18 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20603:
--

 Summary: Histogram metrics should reset min and max in 
snapshotAndReset
 Key: HBASE-20603
 URL: https://issues.apache.org/jira/browse/HBASE-20603
 Project: HBase
  Issue Type: Bug
  Components: metrics
Reporter: Andrew Purtell
Assignee: Andrew Purtell


It's weird that the bins are reset at every monitoring interval but min and max 
are tracked over the lifetime of the process. Makes it impossible to set alarms 
on max value as they'll never shut off unless the process is restarted. 
Histogram metrics should reset min and max in snapshotAndReset.

For discussion.
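
A minimal sketch of the behavior being proposed; the class and field names are illustrative, not the actual HBase metrics internals:

{code:java}
// Sketch: snapshotAndReset() hands back the interval's min/max and then resets
// them, so the next monitoring interval starts fresh.
public class IntervalMinMax {
  private long min = Long.MAX_VALUE;
  private long max = Long.MIN_VALUE;

  public synchronized void update(long value) {
    if (value < min) min = value;
    if (value > max) max = value;
  }

  public synchronized long[] snapshotAndReset() {
    long[] snapshot = new long[] { min, max };
    min = Long.MAX_VALUE;
    max = Long.MIN_VALUE;
    return snapshot;
  }
}
{code}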



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20619) TestWeakObjectPool occasionally times out

2018-05-22 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20619:
--

 Summary: TestWeakObjectPool occasionally times out
 Key: HBASE-20619
 URL: https://issues.apache.org/jira/browse/HBASE-20619
 Project: HBase
  Issue Type: Test
  Components: test
Affects Versions: 1.4.4, 1.5.0
Reporter: Andrew Purtell


TestWeakObjectPool occasionally times out. Failure is rare and executor is an 
EC2 instance, so I think it's just a question of the timeout being too small.

[ERROR] testCongestion(org.apache.hadoop.hbase.util.TestWeakObjectPool)  Time 
elapsed: 1.049 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 1000 
milliseconds
at 
org.apache.hadoop.hbase.util.TestWeakObjectPool.testCongestion(TestWeakObjectPool.java:102)





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20486) Change default throughput controller to PressureAwareThroughputController in branch-1

2018-05-22 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20486.

  Resolution: Fixed
Hadoop Flags: Reviewed

Pushed to branch-1. Thanks for the patch [~xucang]

> Change default throughput controller to PressureAwareThroughputController in 
> branch-1
> -
>
> Key: HBASE-20486
> URL: https://issues.apache.org/jira/browse/HBASE-20486
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Xu Cang
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-20486.branch-1.001.patch
>
>
> Switch the default throughput controller from NoLimitThroughputController to 
> PressureAwareThroughputController in branch-1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20608) Remove build option of error prone profile for branch-1 after HBASE-12350

2018-05-22 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20608.

   Resolution: Fixed
 Assignee: Andrew Purtell  (was: Mike Drob)
Fix Version/s: 1.5.0

Committed my hack. We can open another issue for a more nuanced fix

> Remove build option of error prone profile for branch-1 after HBASE-12350
> -
>
> Key: HBASE-20608
> URL: https://issues.apache.org/jira/browse/HBASE-20608
> Project: HBase
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.4.4, 1.4.5
>Reporter: Tak Lon (Stephen) Wu
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
>
> After HBASE-12350, error prone profile was introduced/backported to branch-1 
> and branch-2. However, branch-1 is still building with JDK 7 and is 
> incompatible with this error prone profile such that `mvn test-compile` 
> failed since then. 
> Open this issue to track the removal of `-PerrorProne` in the build command 
> (in Jenkins)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20606) hbase:acl table is listed in list_rsgroups output even when acl is not enabled

2018-05-23 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20606.

Resolution: Duplicate

> hbase:acl table is listed in list_rsgroups output even when acl is not enabled
> --
>
> Key: HBASE-20606
> URL: https://issues.apache.org/jira/browse/HBASE-20606
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Reporter: Biju Nair
>Priority: Major
>
> Steps to reproduce
>  - 
> {noformat}
> add_rsgroup 'test_rsgroup'{noformat}
>  - Add a server to the new {{rsgroup}}
>  - 
> {noformat}
> hbase(main):002:0> list_rsgroups
> NAME          SERVER / TABLE
>  test_rsgroup
>  default      server dob2-bach-r3n13:16020
>               server dob2-bach-r3n13:16022
>               server dob2-bach-r3n13:16023
>               server dob2-bach-r3n13:16024
>               server dob2-bach-r3n13:16025
>               server dob2-bach-r3n13:16026
>               table hbase:meta
>               table hbase:acl
>               table hbase:namespace
>               table hbase:rsgroup
> move_servers_rsgroup 'test_rsgroup',['dob2-bach-r3n13:16020']{noformat}
>  - Move {{hbase}} namespace to the new {{rsgroup}}
>  - 
> {noformat}
> hbase(main):005:0> move_namespaces_rsgroup 'test_rsgroup',['hbase']{noformat}
>  - List {{Rsgroups}} to verify all the {{hbase}} tables are moved 
>  - 
> {noformat}
> hbase(main):006:0> list_rsgroups
> NAME          SERVER / TABLE
>  test_rsgroup server dob2-bach-r3n13:16020
>               table hbase:meta
>               table hbase:namespace
>               table hbase:rsgroup
>  default      server dob2-bach-r3n13:16022
>               server dob2-bach-r3n13:16023
>               server dob2-bach-r3n13:16024
>               server dob2-bach-r3n13:16025
>               server dob2-bach-r3n13:16026
>               table hbase:acl {noformat}
>  - {{hbase:acl}} table is not moved to the new {{rsgroup}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20646) TestWALProcedureStoreOnHDFS failing on branch-1

2018-05-24 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20646:
--

 Summary: TestWALProcedureStoreOnHDFS failing on branch-1
 Key: HBASE-20646
 URL: https://issues.apache.org/jira/browse/HBASE-20646
 Project: HBase
  Issue Type: Test
Affects Versions: 1.4.4
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.4.5


TestWALProcedureStoreOnHDFS fails sometimes on branch-1 depending on junit 
particulars. An @After decoration was improperly added. Remove to fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20646) TestWALProcedureStoreOnHDFS failing on branch-1

2018-05-24 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20646.

Resolution: Fixed

> TestWALProcedureStoreOnHDFS failing on branch-1
> ---
>
> Key: HBASE-20646
> URL: https://issues.apache.org/jira/browse/HBASE-20646
> Project: HBase
>  Issue Type: Test
>Affects Versions: 1.4.4
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 1.5.0, 1.4.5
>
> Attachments: HBASE-20646-branch-1.patch
>
>
> TestWALProcedureStoreOnHDFS fails sometimes on branch-1 depending on junit 
> particulars. An @After decoration was improperly added. Remove to fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20646) TestWALProcedureStoreOnHDFS failing on branch-1

2018-05-30 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20646.

   Resolution: Fixed
Fix Version/s: 2.1.0
   3.0.0

Committed an addendum to branch-1.4 and branch-1 that suppresses the warning. 
Synced this change to branch-2 and master since the issue is there too even if 
we are not tripping over it today.

> TestWALProcedureStoreOnHDFS failing on branch-1
> ---
>
> Key: HBASE-20646
> URL: https://issues.apache.org/jira/browse/HBASE-20646
> Project: HBase
>  Issue Type: Test
>Affects Versions: 1.4.4
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 3.0.0, 2.1.0, 1.5.0, 1.4.5
>
> Attachments: HBASE-20646-branch-1.patch
>
>
> TestWALProcedureStoreOnHDFS fails sometimes on branch-1 depending on junit 
> particulars. An @After decoration was improperly added. Remove to fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-18116) Replication source in-memory accounting should not include bulk transfer hfiles

2018-05-31 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-18116:


I ran all TestReplication\*\* tests before commit, but forgot about 
TestGlobalThrottler (should really be renamed to 
TestReplicationGlobalThrottler). 

Reverted my commits for now. Can reapply once all tests are passing.
{noformat}
[ERROR] Failures:
[ERROR]   TestGlobalThrottler.testQuota:180{noformat}

> Replication source in-memory accounting should not include bulk transfer 
> hfiles
> ---
>
> Key: HBASE-18116
> URL: https://issues.apache.org/jira/browse/HBASE-18116
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Andrew Purtell
>Assignee: Xu Cang
>Priority: Major
> Fix For: 3.0.0, 2.1.0, 1.5.0
>
> Attachments: HBASE-18116.master.001.patch, 
> HBASE-18116.master.002.patch
>
>
> In ReplicationSourceWALReaderThread we maintain a global quota on enqueued 
> replication work for preventing OOM by queuing up too many edits into queues 
> on heap. When calculating the size of a given replication queue entry, if it 
> has associated hfiles (is a bulk load to be replicated as a batch of hfiles), 
> we get the file sizes and include the sum. We then apply that result to the 
> quota. This isn't quite right. Those hfiles will be pulled by the sink as a 
> file copy, not pushed by the source. The cells in those files are not queued 
> in memory at the source and therefore shouldn't be counted against the quota.
> Related, the sum of the hfile sizes are also included when checking if queued 
> work exceeds the configured replication queue capacity, which is by default 
> 64 MB. HFiles are commonly much larger than this. 
> So what happens is when we encounter a bulk load replication entry typically 
> both the quota and capacity limits are exceeded, we break out of loops, and 
> send right away. What is transferred on the wire via HBase RPC though has 
> only a partial relationship to the calculation. 
> Depending how you look at it, it makes sense to factor hfile file sizes 
> against replication queue capacity limits. The sink will be occupied 
> transferring those files at the HDFS level. Anyway, this is how we have been 
> doing it and it is too late to change now. I do not however think it is 
> correct to apply hfile file sizes against a quota for in memory state on the 
> source. The source doesn't queue or even transfer those bytes. 
> Something I noticed while working on HBASE-18027.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20667) Rename TestGlobalThrottler to TestReplicationGlobalThrottler

2018-05-31 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20667:
--

 Summary: Rename TestGlobalThrottler to 
TestReplicationGlobalThrottler
 Key: HBASE-20667
 URL: https://issues.apache.org/jira/browse/HBASE-20667
 Project: HBase
  Issue Type: Test
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 2.1.0, 1.5.0


If running replication unit tests, perhaps like {{mvn test 
-Dtest=TestReplication\*,Test\*Replication\*}}, then you will miss 
TestGlobalThrottler. This should be renamed to TestReplicationGlobalThrottler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20670) NPE in HMaster#isInMaintenanceMode

2018-06-01 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20670:
--

 Summary: NPE in HMaster#isInMaintenanceMode
 Key: HBASE-20670
 URL: https://issues.apache.org/jira/browse/HBASE-20670
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.3.2
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.3.3, 1.4.5


{noformat}
Problem accessing /master-status. Reason: INTERNAL_SERVER_ERROR
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.master.HMaster.isInMaintenanceMode(HMaster.java:2559)
{noformat}

The ZK trackers, including the maintenance mode tracker, are initialized only 
after we try to bring up the filesystem. If HDFS is in safe mode then an access 
to the master status page trips over this problem. There might be other issues 
after we fix this, but an NPE is always a bug, so let's address it. One option is 
to connect the ZK based components with ZK before attempting to bring up the 
filesystem. Let me try that first. If that doesn't work we could at least throw 
an IOE.
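
If connecting the ZK components earlier doesn't pan out, a defensive sketch of the fallback; the tracker field name here is an assumption:

{code:java}
// Sketch: fail with an IOE instead of an NPE when the ZK trackers have not been
// initialized yet (e.g. HDFS still in safe mode during master startup).
public boolean isInMaintenanceMode() throws IOException {
  if (maintenanceModeTracker == null) { // assumed field name for the ZK tracker
    throw new IOException("Master is initializing; maintenance mode state not yet available");
  }
  return maintenanceModeTracker.isInMaintenanceMode();
}
{code}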



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20496) TestGlobalThrottler failing on branch-1 since revert of HBASE-9465

2018-06-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20496.

Resolution: Fixed

Resolved as part of HBASE-18116

> TestGlobalThrottler failing on branch-1 since revert of HBASE-9465
> --
>
> Key: HBASE-20496
> URL: https://issues.apache.org/jira/browse/HBASE-20496
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Minor
>
> Not sure why we didn't catch it earlier, but with my latest dev setup 
> including 8u JVM, TestGlobalThrottler fails reliably, and a git bisect finds 
> the problem at this revert:
> {noformat}
> commit ba7a936f74985eb9d974fdc87b0d06cb8cd8473d
> Author: Sean Busbey 
> Date: Tue Nov 7 23:50:35 2017 -0600
> Revert "HBASE-9465 Push entries to peer clusters serially"
> This reverts commit 441bc050b991c14c048617bc443b97f46e21b76f.
> Conflicts:
> hbase-client/src/main/java/org/apache/hadoop/hbase/MetaTableAccessor.java
> hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java
> hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/WALProtos.java
> hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
> hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java
> hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
> Signed-off-by: Andrew Purtell 
> {noformat}
> For now I'm going to disable the test. Leaving this open for debugging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-18116) Replication source in-memory accounting should not include bulk transfer hfiles

2018-06-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-18116.

Resolution: Fixed

> Replication source in-memory accounting should not include bulk transfer 
> hfiles
> ---
>
> Key: HBASE-18116
> URL: https://issues.apache.org/jira/browse/HBASE-18116
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Andrew Purtell
>Assignee: Xu Cang
>Priority: Major
> Fix For: 3.0.0, 2.1.0, 1.5.0
>
> Attachments: HBASE-18116.master.001.patch, 
> HBASE-18116.master.002.patch, HBASE-18116.master.003.patch
>
>
> In ReplicationSourceWALReaderThread we maintain a global quota on enqueued 
> replication work for preventing OOM by queuing up too many edits into queues 
> on heap. When calculating the size of a given replication queue entry, if it 
> has associated hfiles (is a bulk load to be replicated as a batch of hfiles), 
> we get the file sizes and include the sum. We then apply that result to the 
> quota. This isn't quite right. Those hfiles will be pulled by the sink as a 
> file copy, not pushed by the source. The cells in those files are not queued 
> in memory at the source and therefore shouldn't be counted against the quota.
> Related, the sum of the hfile sizes are also included when checking if queued 
> work exceeds the configured replication queue capacity, which is by default 
> 64 MB. HFiles are commonly much larger than this. 
> So what happens is when we encounter a bulk load replication entry typically 
> both the quota and capacity limits are exceeded, we break out of loops, and 
> send right away. What is transferred on the wire via HBase RPC though has 
> only a partial relationship to the calculation. 
> Depending how you look at it, it makes sense to factor hfile file sizes 
> against replication queue capacity limits. The sink will be occupied 
> transferring those files at the HDFS level. Anyway, this is how we have been 
> doing it and it is too late to change now. I do not however think it is 
> correct to apply hfile file sizes against a quota for in memory state on the 
> source. The source doesn't queue or even transfer those bytes. 
> Something I noticed while working on HBASE-18027.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20799) TestBucketCache#testCacheBlockNextBlockMetadataMissing is flaky

2018-06-27 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20799:
--

 Summary: TestBucketCache#testCacheBlockNextBlockMetadataMissing is 
flaky
 Key: HBASE-20799
 URL: https://issues.apache.org/jira/browse/HBASE-20799
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Andrew Purtell


{noformat}
[ERROR] testCacheBlockNextBlockMetadataMissing[1: blockSize=16,384, 
bucketSizes=[I@29ee9faa](org.apache.hadoop.hbase.io.hfile.bucket.TestBucketCache)
  Time elapsed: 0.066 s  <<< FAILURE!
java.lang.AssertionError: expected: 
java.nio.HeapByteBuffer but 
was: java.nio.HeapByteBuffer
at 
org.apache.hadoop.hbase.io.hfile.bucket.TestBucketCache.testCacheBlockNextBlockMetadataMissing(TestBucketCache.java:424)
{noformat}

[~zyork] any idea what is going on here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20799) TestBucketCache#testCacheBlockNextBlockMetadataMissing is flaky

2018-06-27 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20799.

Resolution: Duplicate

> TestBucketCache#testCacheBlockNextBlockMetadataMissing is flaky
> ---
>
> Key: HBASE-20799
> URL: https://issues.apache.org/jira/browse/HBASE-20799
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.4.5
>Reporter: Andrew Purtell
>Priority: Major
>
> {noformat}
> [ERROR] testCacheBlockNextBlockMetadataMissing[1: blockSize=16,384, 
> bucketSizes=[I@29ee9faa](org.apache.hadoop.hbase.io.hfile.bucket.TestBucketCache)
>   Time elapsed: 0.066 s  <<< FAILURE!
> java.lang.AssertionError: expected: 
> java.nio.HeapByteBuffer but 
> was: java.nio.HeapByteBuffer
> at 
> org.apache.hadoop.hbase.io.hfile.bucket.TestBucketCache.testCacheBlockNextBlockMetadataMissing(TestBucketCache.java:424)
> {noformat}
> [~zyork] any idea what is going on here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20450) Provide metrics for number of total active, priority and replication rpc handlers

2018-06-29 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-20450:


This was only committed to trunk but I think it would be useful to bring back 
to branch-1 (for 1.5). I have a branch-1 patch ready so no need for anyone else 
to backport. Curious if you'd mind this in branch-2 [~stack] (not branch-2.0!) 
Going to assume ok if I don't hear anything for a few days. 

> Provide metrics for number of total active, priority and replication rpc 
> handlers
> -
>
> Key: HBASE-20450
> URL: https://issues.apache.org/jira/browse/HBASE-20450
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-20450.master.001.patch, 
> HBASE-20450.master.002.patch
>
>
> Currently hbase provides a metric for [number of total active rpc 
> handlers|https://github.com/apache/hbase/blob/f4f2b68238a094d7b1931dc8b7939742ccbb2b57/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java#L187]
>  which is a sum of the following:
>  * number of active general rpc handlers
>  * number of active priority rpc handlers
>  * number of active replication rpc handlers
> I think we can have 3 different metrics corresponding to the above mentioned 
> handlers which will allow us to see detailed information about number of 
> active handlers running for a particular type of handler.
> We can have following new metrics:
>  * numActiveGeneralHandler
>  * numActivePriorityHandler
>  * numActiveReplicationHandler
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20450) Provide metrics for number of total active, priority and replication rpc handlers

2018-07-02 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20450.

Resolution: Fixed

> Provide metrics for number of total active, priority and replication rpc 
> handlers
> -
>
> Key: HBASE-20450
> URL: https://issues.apache.org/jira/browse/HBASE-20450
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.1.0, 1.5.0
>
> Attachments: HBASE-20450.master.001.patch, 
> HBASE-20450.master.002.patch
>
>
> Currently hbase provides a metric for [number of total active rpc 
> handlers|https://github.com/apache/hbase/blob/f4f2b68238a094d7b1931dc8b7939742ccbb2b57/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleRpcScheduler.java#L187]
>  which is a sum of the following:
>  * number of active general rpc handlers
>  * number of active priority rpc handlers
>  * number of active replication rpc handlers
> I think we can have 3 different metrics corresponding to the above mentioned 
> handlers which will allow us to see detailed information about number of 
> active handlers running for a particular type of handler.
> We can have following new metrics:
>  * numActiveGeneralHandler
>  * numActivePriorityHandler
>  * numActiveReplicationHandler
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20895) NPE in RpcServer#readAndProcess

2018-07-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20895:
--

 Summary: NPE in RpcServer#readAndProcess
 Key: HBASE-20895
 URL: https://issues.apache.org/jira/browse/HBASE-20895
 Project: HBase
  Issue Type: Bug
  Components: rpc
Affects Versions: 1.3.2
Reporter: Andrew Purtell
Assignee: Monani Mihir
 Fix For: 1.5.0, 1.3.3, 1.4.6


{noformat}
2018-07-10 16:25:55,005 DEBUG [.sfdc.net,port=60020] ipc.RpcServer - 
RpcServer.listener,port=60020: Caught exception while reading:
java.lang.NullPointerException
at 
org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1761)
at 
org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:949)
at 
org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:730)
at 
org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:706)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

This looks like it could be a use after close problem if there is concurrent 
access to a Connection.

In process() we might store a null back to the 'data' field.

Meanwhile in readAndProcess() we have a case where we might be blocked on a 
channel read and then after coming back from the read we go to use 'data' after 
a null has been written back, leading to a NPE.

{quote} 
count = channelRead(channel, data);
1761 --->   if (count >= 0 && *data.remaining()* == 0) { // count==0 if 
dataLength == 0
process();
   }
{quote} 

Whether an NPE happens or not is going to depend on the timing of the store back 
to 'data' in another thread and use of 'data' in this thread and whether or not 
the JVM has optimized away a reload of 'data' (it's not declared volatile)

We should do a null check here just to be defensive. We should also look at 
whether the concurrent access to the Connection is intended.
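
A defensive variant of the quoted snippet, sketched from the description above rather than taken from a patch:

{code:java}
// Sketch: re-check 'data' after returning from the blocking read, since a
// concurrent process() may have stored null back to the field in the meantime.
count = channelRead(channel, data);
if (data != null && count >= 0 && data.remaining() == 0) { // count==0 if dataLength == 0
  process();
}
{code}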



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20897) Port HBASE-20866 to branch-2 and up

2018-07-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20897:
--

 Summary: Port HBASE-20866 to branch-2 and up
 Key: HBASE-20897
 URL: https://issues.apache.org/jira/browse/HBASE-20897
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
Assignee: Vikas Vishwakarma






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20896) Port HBASE-20866 to branch-1 and branch-1.4

2018-07-16 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20896:
--

 Summary: Port HBASE-20866 to branch-1 and branch-1.4 
 Key: HBASE-20896
 URL: https://issues.apache.org/jira/browse/HBASE-20896
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
Assignee: Vikas Vishwakarma






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20931) [branch-1] Add -Dhttps.protocols=TLSv1.2 to Maven command line in make_rc.sh

2018-07-24 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20931:
--

 Summary: [branch-1] Add -Dhttps.protocols=TLSv1.2 to Maven command 
line in make_rc.sh
 Key: HBASE-20931
 URL: https://issues.apache.org/jira/browse/HBASE-20931
 Project: HBase
  Issue Type: Task
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.2.7, 1.3.3, 1.4.6


As of June 2018 the insecure TLS 1.0 and 1.1 protocols are no longer supported 
for SSL connections to Maven Central and perhaps other public Maven 
repositories. The branch-1 builds which require Java 7, of which the latest 
public release was 7u80, need to add {{-Dhttps.protocols=TLSv1.2}} to the Maven 
command line in order to avoid artifact retrieval problems during builds.

We especially need this in make_rc.sh which starts up with an empty local Maven 
cache. 
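
For example (illustrative invocation only; the actual make_rc.sh change may pass additional goals and options):

{noformat}
mvn -Dhttps.protocols=TLSv1.2 clean install
{noformat}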



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20931) [branch-1] Add -Dhttps.protocols=TLSv1.2 to Maven command line in make_rc.sh

2018-07-24 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20931.

Resolution: Fixed

> [branch-1] Add -Dhttps.protocols=TLSv1.2 to Maven command line in make_rc.sh
> 
>
> Key: HBASE-20931
> URL: https://issues.apache.org/jira/browse/HBASE-20931
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 1.5.0, 1.2.7, 1.3.3, 1.4.6
>
> Attachments: HBASE-20931-branch-1.patch
>
>
> As of June 2018 the insecure TLS 1.0 and 1.1 protocols are no longer 
> supported for SSL connections to Maven Central and perhaps other public Maven 
> repositories. The branch-1 builds which require Java 7, of which the latest 
> public release was 7u80, need to add {{-Dhttps.protocols=TLSv1.2}} to the 
> Maven command line in order to avoid artifact retrieval problems during 
> builds.
> We especially need this in make_rc.sh which starts up with an empty local 
> Maven cache. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20982) [branch-1] TestExportSnapshot is flaky

2018-07-30 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-20982:
--

 Summary:  [branch-1] TestExportSnapshot is flaky
 Key: HBASE-20982
 URL: https://issues.apache.org/jira/browse/HBASE-20982
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.4.6
Reporter: Andrew Purtell


Passes for me
{noformat}
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 390.02 s 
- in org.apache.hadoop.hbase.snapshot.TestExportSnapshot
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
{noformat}

but fails or times out for others. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21000) Default limits for PressureAwareCompactionThroughputController are too low

2018-08-02 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21000:
--

 Summary: Default limits for 
PressureAwareCompactionThroughputController are too low
 Key: HBASE-21000
 URL: https://issues.apache.org/jira/browse/HBASE-21000
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 1.5.0, 2.2.0


In PressureAwareCompactionThroughputController:
{code:java}
/**
 * A throughput controller which uses the follow schema to limit throughput
 *
 * If compaction pressure is greater than 1.0, no limitation.
 * In off peak hours, use a fixed throughput limitation
 * {@value #HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_OFFPEAK}
 * In normal hours, the max throughput is tuned between
 * {@value #HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_LOWER_BOUND} and
 * {@value #HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_HIGHER_BOUND}, using the formula
 * "lower + (higher - lower) * compactionPressure", where compactionPressure is in range [0.0, 1.0]
 */
{code}
The lower and upper bounds are 10MB/sec and 20MB/sec, respectively:
{code:java}
  public static final String HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_HIGHER_BOUND =
      "hbase.hstore.compaction.throughput.higher.bound";

  private static final long DEFAULT_HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_HIGHER_BOUND =
      20L * 1024 * 1024;

  public static final String HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_LOWER_BOUND =
      "hbase.hstore.compaction.throughput.lower.bound";

  private static final long DEFAULT_HBASE_HSTORE_COMPACTION_MAX_THROUGHPUT_LOWER_BOUND =
      10L * 1024 * 1024;
{code}
(In contrast, in PressureAwareFlushThroughputController the lower and upper 
bounds are 10x of those limits, at 100MB/sec and 200MB/sec, respectively.)

In fairly light load scenarios we see compaction quickly fall behind and write 
clients backed off or failing due to RegionTooBusy exceptions. Although 
compaction throughput becomes unbounded after the store reaches the blocking 
file count, in the lead-up to this the default settings do not provide enough 
bandwidth to stave off blocking. The defaults should be increased. 

I'm not sure what new defaults would be good. We could start by doubling them, 
to 20MB/sec and 40MB/sec respectively. They might need to be doubled again.
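
A sketch of the operator-side override using the keys quoted above; the values are just the "double them" example from this description, not a recommendation:

{code:java}
// Illustrative only: raise the normal-hours compaction throughput bounds to
// 20 MB/s and 40 MB/s.
Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hstore.compaction.throughput.lower.bound", 20L * 1024 * 1024);
conf.setLong("hbase.hstore.compaction.throughput.higher.bound", 40L * 1024 * 1024);
{code}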



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20407) Retry HBase admin API if master failover is in progress

2018-08-15 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20407.

Resolution: Duplicate

Duping this out in favor of HBASE-20408

> Retry HBase admin API if master failover is in progress
> ---
>
> Key: HBASE-20407
> URL: https://issues.apache.org/jira/browse/HBASE-20407
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin
>Reporter: Divesh Jain
>Assignee: Divesh Jain
>Priority: Minor
>
> When a master switch over is in progress and an admin API is called, perform 
> a retry operation before throwing an exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21099) NPE in TestTableResource.setUpBeforeClass (TestTableResource.java:134)

2018-08-22 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21099:
--

 Summary: NPE in TestTableResource.setUpBeforeClass 
(TestTableResource.java:134)
 Key: HBASE-21099
 URL: https://issues.apache.org/jira/browse/HBASE-21099
 Project: HBase
  Issue Type: Bug
  Components: REST, test
Reporter: Andrew Purtell
 Fix For: 2.0.2, 2.2.0, 2.1.1


TestTableResource fails consistently with NPE, only in the branch-2s. Both 
master and branch-1 are fine. 

{noformat}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.397 
s <<< FAILURE! - in org.apache.hadoop.hbase.rest.TestTableResource
[ERROR] org.apache.hadoop.hbase.rest.TestTableResource  Time elapsed: 54.395 s  
<<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.hbase.rest.TestTableResource.setUpBeforeClass(TestTableResource.java:134)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20940) HStore.cansplit should not allow split to happen if it has references

2018-08-23 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-20940:


We didn't catch some test issues, see HBASE-21105 . Reopening. Needs an addendum

> HStore.cansplit should not allow split to happen if it has references
> -
>
> Key: HBASE-20940
> URL: https://issues.apache.org/jira/browse/HBASE-20940
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.2
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.1.1, 2.0.2, 1.4.7
>
> Attachments: HBASE-20940.branch-1.3.v1.patch, 
> HBASE-20940.branch-1.3.v2.patch, HBASE-20940.branch-1.v1.patch, 
> HBASE-20940.branch-1.v2.patch, HBASE-20940.branch-1.v3.patch, 
> HBASE-20940.v1.patch, HBASE-20940.v2.patch, HBASE-20940.v3.patch, 
> HBASE-20940.v4.patch, result_HBASE-20940.branch-1.v2.log
>
>
> When split happens and immediately another split happens, it may result into 
> a split of a region who still has references to its parent. More details 
> about scenario can be found here HBASE-20933
> HStore.hasReferences should check from fs.storefile rather than in memory 
> objects.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20890) PE filterScan seems to be stuck forever

2018-08-23 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-20890:


> PE filterScan seems to be stuck forever
> ---
>
> Key: HBASE-20890
> URL: https://issues.apache.org/jira/browse/HBASE-20890
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.3
>Reporter: Vikas Vishwakarma
>Assignee: Abhishek Goyal
>Priority: Minor
>
> Command Used
> {code:java}
> ~/current/bigdata-hbase/hbase/hbase/bin/hbase pe --nomapred randomWrite 1 > 
> write 2>&1
> ~/current/bigdata-hbase/hbase/hbase/bin/hbase pe --nomapred filterScan 1 > 
> filterScan 2>&1
> {code}
>  
> Output
> This kept running for several hours, just printing the messages below in the logs.
>  
> {code:java}
> -bash-4.1$ grep "Advancing internal scanner to startKey" filterScan.1 | head
> 2018-07-13 10:44:45,188 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> 2018-07-13 10:44:45,976 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> 2018-07-13 10:44:46,695 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> .
> -bash-4.1$ grep "Advancing internal scanner to startKey" filterScan.1 | tail
> 2018-07-15 06:20:22,353 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> 2018-07-15 06:20:23,044 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> 2018-07-15 06:20:23,768 DEBUG [TestClient-0] client.ClientScanner - Advancing 
> internal scanner to startKey at '52359'
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21162) Revert suspicious change to BoundedByteBufferPool and disable use of direct buffers for IPC reservoir by default

2018-09-06 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21162:
--

 Summary: Revert suspicious change to BoundedByteBufferPool and 
disable use of direct buffers for IPC reservoir by default
 Key: HBASE-21162
 URL: https://issues.apache.org/jira/browse/HBASE-21162
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.7
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.4.8


We had a production incident where we traced the issue to a direct buffer leak. 
On a hunch we tried setting hbase.ipc.server.reservoir.enabled = false and 
after that no native memory leak could be observed in any regionserver process 
under the triggering load. 

On HBASE-19239 (Fix findbugs and error-prone issues) I made a change to 
BoundedByteBufferPool that is suspicious given this finding. It was committed 
to branch-1.4 and branch-1. I'm going to revert this change. 

In addition, the allocation of direct memory for the server RPC reservoir is 
problematic in that tracing native memory or direct buffer leaks to a 
particular class or compilation unit is difficult, so I also propose allocating 
the reservoir on the heap by default instead. Should there be a leak, it is much 
easier to analyze a heap dump with familiar tools and find it. 
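
For operators hitting the same symptom before a release carrying this change, a minimal
hbase-site.xml override is sketched below. It only uses the
hbase.ipc.server.reservoir.enabled property already mentioned above; any other tuning
would be an assumption.

{code:xml}
<!-- Sketch only: disable the server-side IPC reservoir so request buffers are
     allocated per call on the heap, making leaks visible in ordinary heap dumps.
     Place in hbase-site.xml on the region servers and restart them. -->
<property>
  <name>hbase.ipc.server.reservoir.enabled</name>
  <value>false</value>
</property>
{code}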



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20307) LoadTestTool prints too much zookeeper logging

2018-09-07 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-20307.

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.0.3
   2.1.1
   1.4.8
   2.2.0
   1.2.8
   1.3.3
   1.5.0
   3.0.0

> LoadTestTool prints too much zookeeper logging
> --
>
> Key: HBASE-20307
> URL: https://issues.apache.org/jira/browse/HBASE-20307
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Reporter: Mike Drob
>Assignee: Colin Garcia
>Priority: Major
>  Labels: beginner
> Fix For: 3.0.0, 1.5.0, 1.3.3, 1.2.8, 2.2.0, 1.4.8, 2.1.1, 2.0.3
>
> Attachments: HBASE-20307.000.patch, HBASE-20307.001.patch
>
>
> When running the LoadTestTool (ltt) there is a ton of ZK related cruft that I 
> probably don't care about. Hide it behind a -verbose flag or point people at 
> the log4j configuration, but don't print it by default.
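
For anyone taking the log4j route in the meantime, a one-line override of the kind
hinted at above might look like this; the exact logger name to silence is an assumption.

{noformat}
# Sketch: quiet ZooKeeper client chatter when running the LoadTestTool.
log4j.logger.org.apache.zookeeper=WARN
{noformat}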



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21203) TestZKMainServer#testCommandLineWorks won't pass with default 4lw whitelist

2018-09-17 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21203:
--

 Summary: TestZKMainServer#testCommandLineWorks won't pass with 
default 4lw whitelist
 Key: HBASE-21203
 URL: https://issues.apache.org/jira/browse/HBASE-21203
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell
Assignee: Andrew Purtell


Recent versions of ZooKeeper whitelist the so-called 4-letter word admin 
commands, and 'stat' is not in the default whitelist, so 
TestZKMainServer#testCommandLineWorks cannot get off the ground. Set the system 
property zookeeper.4lw.commands.whitelist=* in 
MiniZooKeeperCluster#setupTestEnv, as there is no need to restrict 4-letter 
commands for unit tests.
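
A minimal sketch of the proposed change is below; the exact placement inside
MiniZooKeeperCluster#setupTestEnv is assumed, but the property name and value are the
ones given above.

{code:java}
// Sketch only: relax the ZooKeeper 4lw command whitelist for test mini clusters
// so commands like 'stat' keep working under recent ZooKeeper versions.
private static void setupTestEnv() {
  // Assumed location; the real method also sets other test-only properties.
  System.setProperty("zookeeper.4lw.commands.whitelist", "*");
}
{code}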



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-10342) RowKey Prefix Bloom Filter

2018-09-20 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-10342.

Resolution: Duplicate

Duped by HBASE-20636

> RowKey Prefix Bloom Filter
> --
>
> Key: HBASE-10342
> URL: https://issues.apache.org/jira/browse/HBASE-10342
> Project: HBase
>  Issue Type: New Feature
>Reporter: Liyin Tang
>Priority: Major
>
> When designing an HBase schema for some use cases, it is quite common to combine 
> multiple pieces of information within the row key. For instance, assume that the 
> rowkey is constructed as md5(id1) + id1 + id2, and a user wants to scan all the 
> rowkeys that start with a given id1. In such a case, a rowkey prefix bloom filter 
> is able to cut out more unnecessary seeks during the scan.
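
As an illustration of the use case only (not part of the proposal), a client-side
prefix scan over such a composite key might look like the sketch below; the md5 and
concatenation helpers are assumptions about the schema described above.

{code:java}
import java.security.MessageDigest;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public final class PrefixScanExample {
  // Row keys are assumed to be built as md5(id1) + id1 + id2.
  static byte[] prefixFor(String id1) throws Exception {
    byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(id1));
    return Bytes.add(hash, Bytes.toBytes(id1));
  }

  // Scan everything sharing the md5(id1) + id1 prefix; a rowkey prefix bloom
  // filter would let store files without that prefix be skipped entirely.
  static Scan scanFor(String id1) throws Exception {
    return new Scan().setRowPrefixFilter(prefixFor(id1));
  }
}
{code}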



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21220) Port HBASE-20636 (Introduce two bloom filter type : ROWPREFIX and ROWPREFIX_DELIMITED) to branch-1

2018-09-21 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21220:
--

 Summary: Port HBASE-20636 (Introduce two bloom filter type : 
ROWPREFIX and ROWPREFIX_DELIMITED) to branch-1
 Key: HBASE-21220
 URL: https://issues.apache.org/jira/browse/HBASE-21220
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19418) RANGE_OF_DELAY in PeriodicMemstoreFlusher should be configurable.

2018-09-28 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-19418.

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.1.1
   1.4.8
   2.2.0
   1.3.3
   1.5.0
   3.0.0

Pushed up, thanks for the contribution [~ramatronics]

> RANGE_OF_DELAY in PeriodicMemstoreFlusher should be configurable.
> -
>
> Key: HBASE-19418
> URL: https://issues.apache.org/jira/browse/HBASE-19418
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha-4
>Reporter: Jean-Marc Spaggiari
>Assignee: Ramie Raufdeen
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 1.4.8, 2.1.1
>
> Attachments: HBASE-19418.master.000.patch
>
>
> When RSs have a LOT of regions and CFs, flushing everything within 5 minutes 
> is not always doable. It would be useful to be able to increase the 
> RANGE_OF_DELAY. 
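
If the delay range does become configurable, usage would presumably be a single
hbase-site.xml entry. The property name and unit below are only guesses for
illustration, not necessarily what the patch introduces.

{code:xml}
<!-- Sketch only: hypothetical property name for a configurable periodic-flush
     delay range; value assumed to be in milliseconds. -->
<property>
  <name>hbase.regionserver.periodicflusher.rangeofdelay</name>
  <value>600000</value>
</property>
{code}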



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21258) Add resetting of flags for RS Group pre/post hooks in TestRSGroups

2018-10-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-21258:


Pardon me, there has been a review error. Reopening because I'm reverting what 
was committed to branch-1. 

> Add resetting of flags for RS Group pre/post hooks in TestRSGroups
> --
>
> Key: HBASE-21258
> URL: https://issues.apache.org/jira/browse/HBASE-21258
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: 21258.branch-1.04.txt, 21258.branch-1.05.txt, 
> 21258.branch-2.v1.patch, 21258.v1.txt
>
>
> Over HBASE-20627, [~xucang] reminded me that the resetting of flags for RS 
> Group pre/post hooks in TestRSGroups was absent.
> This issue is to add the resetting of these flags before each subtest starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21258) Add resetting of flags for RS Group pre/post hooks in TestRSGroups

2018-10-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-21258.

   Resolution: Fixed
Fix Version/s: 1.4.8
   1.5.0

The branch-2 patch applies without any changes needed. Resolving this as fixed. 
If additional changes are needed, let's open a new issue rather than do 
something radical with a branch-1 patch.

> Add resetting of flags for RS Group pre/post hooks in TestRSGroups
> --
>
> Key: HBASE-21258
> URL: https://issues.apache.org/jira/browse/HBASE-21258
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.2.0, 1.4.8
>
> Attachments: 21258.branch-1.04.txt, 21258.branch-1.05.txt, 
> 21258.branch-2.v1.patch, 21258.v1.txt
>
>
> Over HBASE-20627, [~xucang] reminded me that the resetting of flags for RS 
> Group pre/post hooks in TestRSGroups was absent.
> This issue is to add the resetting of these flags before each subtest starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21117) Backport HBASE-18350 (fix RSGroups) to branch-1 (Only port the part fixing table locking issue.)

2018-10-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-21117.

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.4.8
   1.5.0

> Backport HBASE-18350  (fix RSGroups)  to branch-1 (Only port the part fixing 
> table locking issue.)
> --
>
> Key: HBASE-21117
> URL: https://issues.apache.org/jira/browse/HBASE-21117
> Project: HBase
>  Issue Type: Bug
>  Components: backport, rsgroup, shell
>Affects Versions: 1.3.2
>Reporter: Xu Cang
>Assignee: Xu Cang
>Priority: Major
>  Labels: backport
> Fix For: 1.5.0, 1.4.8
>
> Attachments: HBASE-21117-branch-1.001.patch, 
> HBASE-21117-branch-1.002.patch
>
>
> When working on HBASE-20666, I found out that HBASE-18350 did not get ported to 
> branch-1, which sometimes causes the procedure to hang when #moveTables is called. 
> After looking into the HBASE-18350 patch, it seems important since it fixes 4 
> issues. This Jira is an attempt to backport it to branch-1.
>  
>  
> Edited: Aug 26.
> After reviewing the HBASE-18350 patch, I decided to only port part 2 of the 
> patch, because parts 1 and 3 are AMv2 related. I won't touch them since AMv2 is 
> only used by branch-2.
>  
> {quote} 
> Subject: [PATCH] HBASE-18350 RSGroups are broken under AMv2
> - Table moving to RSG was buggy, because it left the table unassigned.
>   Now it is fixed we immediately assign to an appropriate RS
>   (MoveRegionProcedure).
> *- Table was locked while moving, but unassign operation hung, because*
>   *locked table queues are not scheduled while locked. Fixed.    port 
> this one.*
> - ProcedureSyncWait was buggy, because it searched the procId in
>   executor, but executor does not store the return values of internal
>   operations (they are stored, but immediately removed by the cleaner).
> - list_rsgroups in the shell show also the assigned tables and servers.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21261) Add log4j.properties for hbase-rsgroup tests

2018-10-01 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-21261.

   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1
   1.4.8
   2.2.0
   1.5.0
   3.0.0

> Add log4j.properties for hbase-rsgroup tests
> 
>
> Key: HBASE-21261
> URL: https://issues.apache.org/jira/browse/HBASE-21261
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 3.0.0, 1.5.0, 2.2.0, 1.4.8, 2.1.1, 2.0.3
>
>
> When I tried to debug TestRSGroups, at first I couldn't find any DEBUG log.
> It turns out that under hbase-rsgroup/src/test/resources there is no 
> log4j.properties.
> This issue adds a log4j.properties for the hbase-rsgroup tests.
> This would be useful when finding the root cause of hbase-rsgroup test 
> failure(s).
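
A minimal test-scope log4j.properties of the sort being added might look like the
sketch below; the appender and layout details are assumptions, not the committed file.

{noformat}
# Sketch of a minimal src/test/resources/log4j.properties for the module.
log4j.rootLogger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
# Turn up the packages under test when debugging.
log4j.logger.org.apache.hadoop.hbase.rsgroup=DEBUG
{noformat}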



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21263) Mention compression algorithm along with other storefile details

2018-10-02 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21263:
--

 Summary: Mention compression algorithm along with other storefile 
details
 Key: HBASE-21263
 URL: https://issues.apache.org/jira/browse/HBASE-21263
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell


Where we log storefile details we should also log the compression algorithm 
used to compress blocks on disk, if any. 

For example, here's a log line out of compaction:

2018-10-02 21:59:47,594 DEBUG 
[regionserver/host/1.1.1.1:8120-longCompactions-1538517461152] 
compactions.Compactor: Compacting 
hdfs://namenode:8020/hbase/data/default/TestTable/86037c19117a46b5b8148439ea55753b/tiny/3d04a7c28d6343ceb773737dbb192533,
 keycount=3335873, bloomtype=ROW, size=107.5 M, encoding=ROW_INDEX_V1, 
seqNum=154199, earliestPutTs=1538516084915

Aside from bloom type, block encoding, and filename, it would be good to know 
the compression type in this kind of DEBUG or INFO level logging. It is a minor 
omission of information that could be helpful during debugging. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19444) RSGroups test units cannot be concurrently executed

2018-10-03 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-19444.

   Resolution: Duplicate
Fix Version/s: (was: 1.4.9)
   (was: 2.2.0)
   (was: 1.5.0)
   (was: 3.0.0)

Duping this out. Replacing with a task issue to break up TestRSGroups into 
smaller units. Current run time is ~240 seconds / 4 minutes and the test is 
only stable when run by itself.

> RSGroups test units cannot be concurrently executed
> ---
>
> Key: HBASE-19444
> URL: https://issues.apache.org/jira/browse/HBASE-19444
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Andrew Purtell
>Priority: Minor
>
> TestRSGroups and friends cannot be concurrently executed or they are very 
> likely to flake, failing with constraint exceptions. If executed serially all 
> units pass. Fix for concurrent execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21265) Split up TestRSGroups

2018-10-03 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21265:
--

 Summary: Split up TestRSGroups
 Key: HBASE-21265
 URL: https://issues.apache.org/jira/browse/HBASE-21265
 Project: HBase
  Issue Type: Task
  Components: rsgroup, test
Affects Versions: 1.4.8
Reporter: Andrew Purtell
 Fix For: 3.0.0, 1.5.0, 2.2.0


TestRSGroups is flaky. It is stable when run in isolation but when run as part 
of the suite with concurrent executors it can fail. The current running time of 
this unit on my dev box is ~240 seconds (4 minutes), which is far too much 
time. This unit should be broken up 5 to 8 ways, grouped by functionality under 
test. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21266) Not running balancer because processing dead regionservers, but empty rs list, and state does not recover

2018-10-03 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21266:
--

 Summary: Not running balancer because processing dead 
regionservers, but empty rs list, and state does not recover
 Key: HBASE-21266
 URL: https://issues.apache.org/jira/browse/HBASE-21266
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.8
Reporter: Andrew Purtell
 Fix For: 1.5.0, 1.4.9


Found during ITBLL testing. AM in master gets into a state where manual 
attempts from the shell to run the balancer always return false and this is 
printed in the master log:

2018-10-03 19:17:14,892 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=21,queue=0,port=8100] master.HMaster: Not 
running balancer because processing dead regionserver(s): 

Note the empty list. 

This errant state did not recover without intervention (a master restart), but 
the test environment was chaotic, so this needs investigation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21283) Add new shell command 'rit' for listing regions in transition

2018-10-09 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21283:
--

 Summary: Add new shell command 'rit' for listing regions in 
transition
 Key: HBASE-21283
 URL: https://issues.apache.org/jira/browse/HBASE-21283
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 3.0.0, 1.5.0, 2.2.0


The 'status' shell command shows regions in transition but sometimes an 
operator may want to retrieve a simple list of regions in transition. Here's a 
patch that adds a new 'rit' command to the TOOLS group that does just that. 

There is no test, because it seems hard to mock RITs from the ruby test code, but I 
have run TestShell and it passes, so the command is verified to meet minimum 
requirements, like help text. It was manually verified with branch-1 (the shell in 
branch-2 and up doesn't return until TransitRegionProcedure has completed, so by 
that time there is no RIT):

{noformat}
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.5.0-SNAPSHOT, r9bb6d2fa8b760f16cd046657240ebd4ad91cb6de, Mon Oct  8 
21:05:50 UTC 2018

hbase(main):001:0> help 'rit'
List all regions in transition.
Examples:
  hbase> rit

hbase(main):002:0> create ...
0 row(s) in 2.5150 seconds
=> Hbase::Table - IntegrationTestBigLinkedList

hbase(main):003:0> rit
0 row(s) in 0.0340 seconds

hbase(main):004:0> unassign '56f0c38c81ae453d19906ce156a2d6a1'
0 row(s) in 0.0540 seconds

hbase(main):005:0> rit 
IntegrationTestBigLinkedList,L\xCC\xCC\xCC\xCC\xCC\xCC\xCB,1539117183224.56f0c38c81ae453d19906ce156a2d6a1.
 state=PENDING_CLOSE, ts=Tue Oct 09 20:33:34 UTC 2018 (0s ago), server=null 


  
1 row(s) in 0.0170 seconds
{noformat}
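
For completeness, the same information is reachable from the Java client on branch-1.
Below is a rough sketch (not the shell implementation itself), assuming
ClusterStatus#getRegionsInTransition is available as it is in 1.x.

{code:java}
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.master.RegionState;

public final class ListRegionsInTransition {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Keyed by encoded region name on branch-1.
      Map<String, RegionState> rits = admin.getClusterStatus().getRegionsInTransition();
      for (RegionState state : rits.values()) {
        System.out.println(state);
      }
    }
  }
}
{code}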




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21284) Forward port HBASE-21000 to branch-2

2018-10-09 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21284:
--

 Summary: Forward port HBASE-21000 to branch-2
 Key: HBASE-21284
 URL: https://issues.apache.org/jira/browse/HBASE-21284
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell


See parent for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21346) Update release procedure and website publishing docs in the book

2018-10-19 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21346:
--

 Summary: Update release procedure and website publishing docs in 
the book
 Key: HBASE-21346
 URL: https://issues.apache.org/jira/browse/HBASE-21346
 Project: HBase
  Issue Type: Task
  Components: documentation, website
Reporter: Andrew Purtell


Now as part of the release process the RM must manually update the download 
page (hbase.apache.org/downloads/). To accomplish that [~mdrob] says

{quote}

To update the download links, on master branch edit
 src/site/xdoc/downloads.xml
 After you commit and push, jenkins will build the site and publish it for you.

{quote}

 

New code lines also need a fork of the API documentation. To accomplish that:

{quote}

To update the API Docs and version specific reference guide, update 
src/site/site.xml with a new section to link to the docs in the drop down list. 
(only necessary the first time, but it hasn't been done yet for 1.4.x) Then git 
clone [https://git-wip-us.apache.org/repos/asf/hbase-site.git] and make a 1.4 
directory there. Copy contents of the docs/ directory from the release tarball 
to the version directory. Copy target/site/devapidocs and testapidocs from a 
local build of the tag, since those don't get published in the release tarball. 
Commit your changes, then do an empty commit with message "INFRA-10751 Empty 
commit". Push your changes

{quote}

 

Try this out. Take notes. Update the release instructions and website publish 
instructions in the book accordingly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21358) Snapshot procedure fails but SnapshotManager thinks it is still running

2018-10-22 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21358:
--

 Summary: Snapshot procedure fails but SnapshotManager thinks it is 
still running
 Key: HBASE-21358
 URL: https://issues.apache.org/jira/browse/HBASE-21358
 Project: HBase
  Issue Type: Bug
  Components: snapshots
Affects Versions: 1.3.2
Reporter: Andrew Purtell


A snapshot procedure fails due to chaotic test action but the snapshot manager 
still thinks it is running. The test client spins needlessly checking for 
something that will never actually complete. We give up eventually but we could 
be failing this a lot faster. 

On the integration client we are checking and re-checking: 

2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting 
current status of snapshot from master... 
2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40) 
Sleeping: 8571ms while waiting for snapshot completion. 

This is what it looks like on the master side each time the client checks in: 

2018-10-20 01:04:54,565 DEBUG 
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] 
master.MasterRpcServices: Checking to see if snapshot from request:{ 
ss=IntegrationTestBigLinkedList-it-1539997289258 
table=IntegrationTestBigLinkedList type=FLUSH } is done 
2018-10-20 01:04:54,565 DEBUG 
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] 
snapshot.SnapshotManager: Snapshoting '{ 
ss=IntegrationTestBigLinkedList-it-1539997289258 
table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress! 

There is no running procedure for the snapshot. The procedure has failed. The 
snapshot manager does not take any useful action afterward but believes the 
snapshot to still be in progress.

I see related complaints from the hfile archiver task afterward: empty 
directories, failure to parse protobuf in descriptor files... It seems there 
was junk in the filesystem left over from the failed snapshot. The master was 
soon restarted by chaos action, and now I don't see these complaints, so that 
partially complete snapshot may have been cleaned up.

This is with 1.3.2, but patched to include the multithreaded hfile archiving 
improvements from later versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21359) Fix build problem against Hadoop 2.8.5

2018-10-22 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-21359:
--

 Summary: Fix build problem against Hadoop 2.8.5
 Key: HBASE-21359
 URL: https://issues.apache.org/jira/browse/HBASE-21359
 Project: HBase
  Issue Type: Bug
  Components: build
Affects Versions: 1.4.8
Reporter: Andrew Purtell
Assignee: Andrew Purtell
 Fix For: 1.5.0, 1.4.9


1.4.8 build fails against Hadoop 2.8.5. The fix is an easy change to 
supplemental-models.xml. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

