Getting back to the failures with OOM / "unable to create new native thread": those files each contain around 100 tests that run on top of Phoenix. In total they generate over 2500 scans (SYSTEM.CATALOG, sequences, and regular scans over tables). The problem is that on the HBase side all scans go through the ThreadPoolExecutor created in HTable, which uses a SynchronousQueue as its work queue. From the javadoc for ThreadPoolExecutor:
*Direct handoffs. A good default choice for a work queue is a SynchronousQueue that hands off tasks to threads without otherwise holding them. Here, an attempt to queue a task will fail if no threads are immediately available to run it, so a new thread will be constructed. This policy avoids lockups when handling sets of requests that might have internal dependencies. Direct handoffs generally require unbounded maximumPoolSizes to avoid rejection of new submitted tasks. This in turn admits the possibility of unbounded thread growth when commands continue to arrive on average faster than they can be processed.*
And that last case is exactly what we hit. But one question remains: since all those tests pass and the scans complete during execution (I checked that), it's not clear why all those threads are still alive. If anyone has a suggestion as to why this could happen, I'd be interested to hear it; otherwise I will dig deeper a bit later. It may also be worth changing the queue in HBase to something less aggressive in terms of thread creation.
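
For anyone who wants to see the growth in isolation, here is a minimal standalone sketch (not HTable's actual construction, just a pool configured the way the javadoc describes): with a SynchronousQueue and an effectively unbounded maximumPoolSize, every submit that finds no idle thread spawns a new one.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DirectHandoffDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, Integer.MAX_VALUE,              // effectively unbounded max pool size
                60L, TimeUnit.SECONDS,             // idle threads are reaped after 60s
                new SynchronousQueue<Runnable>()); // direct handoff: no queueing at all
        final CountDownLatch blocker = new CountDownLatch(1);
        for (int i = 0; i < 2500; i++) {           // roughly one thread per in-flight scan
            pool.submit(new Runnable() {
                public void run() {
                    try { blocker.await(); } catch (InterruptedException ignored) { }
                }
            });
        }
        System.out.println("live threads: " + Thread.activeCount()); // ~2500
        blocker.countDown();
        pool.shutdown();
    }
}

A bounded LinkedBlockingQueue with a fixed maximumPoolSize would queue work instead of spawning, which is the "less aggressive" direction I mean above.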
Thanks,
Sergey
On Thu, May 5, 2016 at 8:24 AM, James Taylor <[email protected]> wrote:
Looks like all Jenkins builds are failing, but it seems environmental? Do we need to exclude some particular kind of host(s)?
On Wed, May 4, 2016 at 5:25 PM, James Taylor <[email protected]> wrote:
Thanks, Sergey!
On Wed, May 4, 2016 at 5:22 PM, Sergey Soldatov <[email protected]> wrote:
James,
Ah, didn't notice that timeouts are not shown in the final report as failures. It seems that the build is using JDK 1.7 and the tests OOM with PermGen space. Fixed in PHOENIX-2879.
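
For illustration only (not necessarily what PHOENIX-2879 changed): on JDK 1.7 the usual remedy is to raise the forked test JVM's PermGen cap via the surefire argLine, something like:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- hypothetical values; -XX:MaxPermSize is a no-op on JDK 8+ -->
    <argLine>-Xmx2g -XX:MaxPermSize=256m</argLine>
  </configuration>
</plugin>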
Thanks,
Sergey
On Wed, May 4, 2016 at 1:48 PM, James Taylor <[email protected]> wrote:
Sergey, on master branch (which is HBase 1.2):
https://builds.apache.org/job/Phoenix-master/1214/console
On Wed, May 4, 2016 at 1:31 PM, Sergey Soldatov <[email protected]> wrote:
James,
Regarding HivePhoenixStoreIT: are you talking about the Phoenix-4.x-HBase-1.0 job? The last build passed it successfully.
On Wed, May 4, 2016 at 10:15 AM, James Taylor <[email protected]> wrote:
Our Jenkins builds have improved, but we're seeing some issues:
- timeouts with the new org.apache.phoenix.hive.HivePhoenixStoreIT test.
- consistent failure of the 4.x-HBase-1.1 build. I suspect that Jenkins build is out-of-date, as we haven't had a 4.x-HBase-1.1 branch for quite a while. There are likely some changes that were made to the other Jenkins build scripts that weren't made to this one.
- flapping of the org.apache.phoenix.end2end.index.ReadOnlyIndexFailureIT.testWriteFailureReadOnlyIndex test in 0.98 and 1.0.
- no email sent for the 0.98 build (as far as I can tell).
If folks have time to look into these, that'd be much appreciated.
James
On Sat, Apr 30, 2016 at 11:55 AM, James Taylor <[email protected]> wrote:
The defaults when tests are running are much lower than the standard Phoenix defaults (see QueryServicesTestImpl and BaseTest.setUpConfigForMiniCluster()). It's unclear to me why the HashJoinIT and SortMergeJoinIT tests (I think these are the culprits) do not seem to adhere to these (or maybe they override them?). They fail for me on my Mac, but they do pass on a Linux box. It would be awesome if someone could investigate and submit a patch to fix these.
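
If it helps whoever picks this up, a first step might be to dump the effective settings from inside a failing test. A hypothetical probe (the property names are real Phoenix/HBase keys; the helper class itself is not part of Phoenix):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class EffectiveConfigProbe {
    public static void main(String[] args) {
        // Load the same configuration the mini cluster would see on the classpath.
        Configuration conf = HBaseConfiguration.create();
        String[] keys = {
            "phoenix.query.threadPoolSize",     // Phoenix client executor size
            "hbase.regionserver.handler.count", // RS RPC handler threads
            "hbase.htable.threads.max"          // HTable client pool cap
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key)); // null means unset/default
        }
    }
}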
Thanks,
James
On Sat, Apr 30, 2016 at 11:47 AM, Nick Dimiduk <[email protected]> wrote:
The default thread pool sizes for HDFS, HBase, ZK, and the Phoenix client are all contributing to this huge thread count. A good starting point would be to take a jstack of the IT process and count, grouping threads with similar names. Then reconfigure to reduce all those groups to something like 10 each and see if the test still runs reliably on local hardware.
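
A minimal sketch of that census done in-JVM instead of over a jstack dump (hypothetical helper, not part of Phoenix; call it from a test hook):

import java.util.Map;
import java.util.TreeMap;

public class ThreadCensus {
    public static void print() {
        Map<String, Integer> groups = new TreeMap<String, Integer>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Normalize numeric suffixes so pool-1-thread-37 and pool-1-thread-38 group together.
            String key = t.getName().replaceAll("[0-9]+", "N");
            Integer n = groups.get(key);
            groups.put(key, n == null ? 1 : n + 1);
        }
        for (Map.Entry<String, Integer> e : groups.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}

The same grouping over an actual jstack dump would tell you which of the HDFS/HBase/ZK/Phoenix pools is the worst offender.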
On Friday, April 29, 2016, Sergey Soldatov <[email protected]> wrote:
By the way, we need to do something with those OOMs and "unable to create new native thread" failures in the ITs. It's quite strange to see that kind of failure in a 10-line test, especially when queries against a table with fewer than 10 rows generate over 2500 threads. Does anybody know whether it's a ZK-related issue?
On Fri, Apr 29, 2016 at 7:51 AM, James Taylor <[email protected]> wrote:
A patch would be much appreciated, Sergey.
On Fri, Apr 29, 2016 at 3:26 AM, Sergey Soldatov <[email protected]> wrote:
As for the flume module: flume-ng comes with commons-io 2.1, while hadoop and hbase require org.apache.commons.io.Charsets, which was introduced in 2.3. The easy fix is to move the flume-ng dependency after the hbase/hadoop dependencies.
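
A sketch of that ordering in the module's pom.xml (artifact names are illustrative): for transitive conflicts at equal depth, Maven's mediation takes the first declaration it encounters, so listing hbase/hadoop before flume-ng lets their newer commons-io win.

<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
  </dependency>
  <!-- flume-ng last, so its commons-io 2.1 loses the mediation -->
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
  </dependency>
</dependencies>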
The last thing, about ConcurrentHashMap: it definitely means that the code was compiled with 1.8, since on 1.7 keySet() returns a plain Set while on 1.8 it returns a KeySetView.
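
A sketch of the trap (hypothetical class, but the mechanics match the stack traces quoted below): javac on JDK 8 bakes the KeySetView return type into the call site's method descriptor, and -target 1.7 alone doesn't help; you need a JDK 7 bootclasspath or an actual JDK 7 compile.

import java.util.concurrent.ConcurrentHashMap;

public class KeySetViewTrap {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<String, String>();
        map.put("a", "1");
        // Compiled on JDK 8, this call is recorded as
        // keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
        // which does not exist in the JDK 7 runtime, so running the class
        // file on JDK 7 throws NoSuchMethodError at link time.
        System.out.println(map.keySet());
    }
}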
On Thu, Apr 28, 2016 at 4:08 PM, Josh Elser <[email protected]> wrote:
*tl;dr*
* I'm removing ubuntu-us1 from all pools
* Phoenix-Flume ITs look busted
* UpsertValuesIT looks busted
* Something is weirdly wrong with Phoenix-4.x-HBase-1.1 in its entirety.
Details below...
It looks like we have a bunch of different reasons for the failures. Starting with Phoenix-master:
org.apache.phoenix.schema.NewerTableAlreadyExistsException: ERROR 1013 (42M04): Table already exists. tableName=T
    at org.apache.phoenix.end2end.UpsertValuesIT.testBatchedUpsert(UpsertValuesIT.java:476)
<<<
I've seen this coming out of a few different tests (I think I've also run into it on my own, but that's another thing).

Some of them look like the Jenkins build host is just over-taxed:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e7600000, 331350016, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 331350016 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/Phoenix-master/phoenix-core/hs_err_pid26454.log
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007ea600000, 273678336, 0) failed; error='Cannot allocate memory' (errno=12)
#
<<<
and

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Build step 'Invoke top-level Maven targets' marked build as failure
<<<
Both of these issues are limited to the host "ubuntu-us1". Let me just remove him from the pool (on Phoenix-master) and see if that helps at all.
I also see some sporadic failures of some Flume tests:

Running org.apache.phoenix.flume.PhoenixSinkIT
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.004 sec <<< FAILURE! - in org.apache.phoenix.flume.PhoenixSinkIT
org.apache.phoenix.flume.PhoenixSinkIT  Time elapsed: 0.004 sec  <<< ERROR!
java.lang.RuntimeException: java.io.IOException: Failed to save in any storage directories while saving namespace.
Caused by: java.io.IOException: Failed to save in any storage directories while saving namespace.

Running org.apache.phoenix.flume.RegexEventSerializerIT
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE! - in org.apache.phoenix.flume.RegexEventSerializerIT
org.apache.phoenix.flume.RegexEventSerializerIT  Time elapsed: 0.004 sec  <<< ERROR!
java.lang.RuntimeException: java.io.IOException: Failed to save in any storage directories while saving namespace.
Caused by: java.io.IOException: Failed to save in any storage directories while saving namespace.
<<<

I'm not sure what the error message means at a glance.
For Phoenix-HBase-1.1:
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more

2016-04-28 22:54:35,497 WARN [RS:0;hemera:41302] org.apache.hadoop.hbase.regionserver.HRegionServer(2279): error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2269)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:893)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:156)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:108)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:140)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:356)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:307)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:138)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1235)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:217)
    ... 13 more
<<<
We have hit-or-miss failures with this error message, which keeps hbase:namespace from being assigned (as the RS's can never report in to the hmaster). This is happening across a couple of the nodes (ubuntu-[3,4,6]). I had tried to look into this one over the weekend (and was led to a JDK8-built jar running on JDK7), but if I look at META-INF/MANIFEST.mf in the hbase-server-1.1.3.jar from central, I see it was built with 1.7.0_80 (which I think means the JDK8 thought is a red herring). I'm really confused by this one, actually. Something must be amiss here.
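
Since the manifest apparently can't settle this, a more direct check (hypothetical helper, offered as a sketch): the telltale is the return-type descriptor baked into the failing call site, and constant-pool strings are stored as modified UTF-8, so a raw byte scan of ServerManager.class for the KeySetView descriptor would confirm or rule out a JDK8 compile.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.jar.JarFile;

public class DescriptorScan {
    public static void main(String[] args) throws Exception {
        String needle = "()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;";
        JarFile jar = new JarFile(args[0]); // path to hbase-server-1.1.3.jar
        InputStream in = jar.getInputStream(
                jar.getJarEntry("org/apache/hadoop/hbase/master/ServerManager.class"));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            bos.write(buf, 0, n);
        }
        jar.close();
        // ISO-8859-1 maps bytes 1:1 to chars, so an ASCII substring search is safe.
        String classBytes = bos.toString("ISO-8859-1");
        System.out.println(classBytes.contains(needle)
                ? "KeySetView descriptor present: compiled against JDK 8"
                : "no KeySetView descriptor found");
    }
}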
For Phoenix-HBase-1.0:
We see the same Phoenix-Flume failures, UpsertValuesIT failure, and timeouts on ubuntu-us1. There is one crash on H10, but that might just be bad luck.
For Phoenix-HBase-0.98:
Same UpsertValuesIT failure and failures on ubuntu-us1.
James Taylor wrote:
Anyone know why our Jenkins builds keep failing? Is it environmental and is there anything we can do about it?
Thanks,
James