Getting back to the failures with OOM / "unable to create new native thread": those files each contain around 100 tests that run on top of Phoenix. In total they generate over 2500 scans (SYSTEM.CATALOG, sequences, and regular scans over tables). The problem is that on the HBase side all scans go through the ThreadPoolExecutor created in HTable, which uses a SynchronousQueue as its work queue. From the javadoc for ThreadPoolExecutor:
*Direct handoffs. A good default choice for a work queue is a SynchronousQueue that hands off tasks to threads without otherwise holding them. Here, an attempt to queue a task will fail if no threads are immediately available to run it, so a new thread will be constructed. This policy avoids lockups when handling sets of requests that might have internal dependencies. Direct handoffs generally require unbounded maximumPoolSizes to avoid rejection of new submitted tasks. This in turn admits the possibility of unbounded thread growth when commands continue to arrive on average faster than they can be processed.*
And that last case is exactly what we hit. But one question remains: since all those tests pass and the scans complete during execution (I checked that), it's not clear why all those threads are still alive. If anyone has a suggestion as to why this could happen, I'd be interested to hear it; otherwise I will dig deeper a bit later. It may also be worth changing the queue in HBase to something less aggressive in terms of thread creation.
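
For anyone who wants to see the growth in isolation, here is a minimal standalone sketch (not HTable's actual construction, just a pool configured the way the javadoc describes): with a SynchronousQueue and an effectively unbounded maximumPoolSize, every submit that finds no idle thread spawns a new one.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DirectHandoffDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, Integer.MAX_VALUE,              // effectively unbounded max pool size
                60L, TimeUnit.SECONDS,             // idle threads are reaped after 60s
                new SynchronousQueue<Runnable>()); // direct handoff: no queueing at all
        final CountDownLatch blocker = new CountDownLatch(1);
        for (int i = 0; i < 2500; i++) {           // roughly one thread per in-flight scan
            pool.submit(new Runnable() {
                public void run() {
                    try { blocker.await(); } catch (InterruptedException ignored) { }
                }
            });
        }
        System.out.println("live threads: " + Thread.activeCount()); // ~2500
        blocker.countDown();
        pool.shutdown();
    }
}

A bounded LinkedBlockingQueue with a fixed maximumPoolSize would queue work instead of spawning, which is the "less aggressive" direction I mean above.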
Thanks,
Sergey
On Thu, May 5, 2016 at 8:24 AM, James Taylor <[email protected]> wrote:
Looks like all Jenkins builds are failing, but it seems environmental? Do we need to exclude some particular kind of host(s)?
On Wed, May 4, 2016 at 5:25 PM, James Taylor <[email protected]> wrote:
Thanks, Sergey!
On Wed, May 4, 2016 at 5:22 PM, Sergey Soldatov <[email protected]> wrote:
James,
Ah, didn't notice that timeouts are not shown in the final report as failures. It seems that the build is using JDK 1.7 and the tests OOM with PermGen space. Fixed in PHOENIX-2879.
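
For illustration only (not necessarily what PHOENIX-2879 changed): on JDK 1.7 the usual remedy is to raise the forked test JVM's PermGen cap via the surefire argLine, something like:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- hypothetical values; -XX:MaxPermSize is a no-op on JDK 8+ -->
    <argLine>-Xmx2g -XX:MaxPermSize=256m</argLine>
  </configuration>
</plugin>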
Thanks,
Sergey
On Wed, May 4, 2016 at 1:48 PM, James Taylor <[email protected]> wrote:
Sergey, on master branch (which is HBase 1.2):
https://builds.apache.org/job/Phoenix-master/1214/console
On Wed, May 4, 2016 at 1:31 PM, Sergey Soldatov <[email protected]> wrote:
James,
Regarding HivePhoenixStoreIT: are you talking about the Phoenix-4.x-HBase-1.0 job? The last build passed it successfully.
On Wed, May 4, 2016 at 10:15 AM, James Taylor <[email protected]> wrote:
Our Jenkins builds have improved, but we're seeing some issues:
- timeouts with the new org.apache.phoenix.hive.HivePhoenixStoreIT test.
- consistent failure of the 4.x-HBase-1.1 build. I suspect that Jenkins build is out-of-date, as we haven't had a 4.x-HBase-1.1 branch for quite a while. There are likely some changes that were made to the other Jenkins build scripts that weren't made to this one.
- flapping of the org.apache.phoenix.end2end.index.ReadOnlyIndexFailureIT.testWriteFailureReadOnlyIndex test in 0.98 and 1.0.
- no email sent for the 0.98 build (as far as I can tell).
If folks have time to look into these, that'd be much appreciated.
James
On Sat, Apr 30, 2016 at 11:55 AM, James Taylor <[email protected]> wrote:
The defaults when tests are running are much lower than the standard Phoenix defaults (see QueryServicesTestImpl and BaseTest.setUpConfigForMiniCluster()). It's unclear to me why the HashJoinIT and SortMergeJoinIT tests (I think these are the culprits) do not seem to adhere to these (or maybe they override them?). They fail for me on my Mac, but they do pass on a Linux box. It would be awesome if someone could investigate and submit a patch to fix these.
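
If it helps whoever picks this up, a first step might be to dump the effective settings from inside a failing test. A hypothetical probe (the property names are real Phoenix/HBase keys; the helper class itself is not part of Phoenix):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class EffectiveConfigProbe {
    public static void main(String[] args) {
        // Load the same configuration the mini cluster would see on the classpath.
        Configuration conf = HBaseConfiguration.create();
        String[] keys = {
            "phoenix.query.threadPoolSize",     // Phoenix client executor size
            "hbase.regionserver.handler.count", // RS RPC handler threads
            "hbase.htable.threads.max"          // HTable client pool cap
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key)); // null means unset/default
        }
    }
}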
Thanks,
James
On Sat, Apr 30, 2016 at 11:47 AM, Nick Dimiduk <[email protected]> wrote:
The default thread pool sizes for HDFS, HBase, ZK, and the Phoenix client are all contributing to this huge thread count. A good starting point would be to take a jstack of the IT process and count, grouping threads with similar names. Then reconfigure to reduce all those groups to something like 10 each and see if the test still runs reliably on local hardware.
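
A minimal sketch of that census done in-JVM instead of over a jstack dump (hypothetical helper, not part of Phoenix; call it from a test hook):

import java.util.Map;
import java.util.TreeMap;

public class ThreadCensus {
    public static void print() {
        Map<String, Integer> groups = new TreeMap<String, Integer>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Normalize numeric suffixes so pool-1-thread-37 and pool-1-thread-38 group together.
            String key = t.getName().replaceAll("[0-9]+", "N");
            Integer n = groups.get(key);
            groups.put(key, n == null ? 1 : n + 1);
        }
        for (Map.Entry<String, Integer> e : groups.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}

The same grouping over an actual jstack dump would tell you which of the HDFS/HBase/ZK/Phoenix pools is the worst offender.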
On Friday, April 29, 2016, Sergey Soldatov <[email protected]> wrote:
By the way, we need to do something with those OOMs and "unable to create new native thread" failures in the ITs. It's quite strange to see that kind of failure in a 10-line test, especially when queries against a table with fewer than 10 rows generate over 2500 threads. Does anybody know whether it's a ZK-related issue?
On Fri, Apr 29, 2016 at 7:51 AM, James Taylor <[email protected]> wrote:
A patch would be much appreciated, Sergey.
On Fri, Apr 29, 2016 at 3:26 AM, Sergey Soldatov <[email protected]> wrote:
As for the flume module: flume-ng comes with commons-io 2.1, while hadoop and hbase require org.apache.commons.io.Charsets, which was introduced in 2.3. The easy fix is to move the flume-ng dependency after the hbase/hadoop dependencies.
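
A sketch of that ordering in the module's pom.xml (artifact names are illustrative): for transitive conflicts at equal depth, Maven's mediation takes the first declaration it encounters, so listing hbase/hadoop before flume-ng lets their newer commons-io win.

<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
  </dependency>
  <!-- flume-ng last, so its commons-io 2.1 loses the mediation -->
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
  </dependency>
</dependencies>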
The last thing, about ConcurrentHashMap: it definitely means that the code was compiled with 1.8, since on 1.7 keySet() returns a plain Set while on 1.8 it returns a KeySetView.
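
A sketch of the trap (hypothetical class, but the mechanics match the stack traces quoted below): javac on JDK 8 bakes the KeySetView return type into the call site's method descriptor, and -target 1.7 alone doesn't help; you need a JDK 7 bootclasspath or an actual JDK 7 compile.

import java.util.concurrent.ConcurrentHashMap;

public class KeySetViewTrap {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<String, String>();
        map.put("a", "1");
        // Compiled on JDK 8, this call is recorded as
        // keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
        // which does not exist in the JDK 7 runtime, so running the class
        // file on JDK 7 throws NoSuchMethodError at link time.
        System.out.println(map.keySet());
    }
}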
On Thu, Apr 28, 2016 at 4:08 PM, Josh Elser <[email protected]> wrote:
*tl;dr*
* I'm removing ubuntu-us1 from all pools
* Phoenix-Flume ITs look busted
* UpsertValuesIT looks busted
* Something is weirdly wrong with Phoenix-4.x-HBase-1.1 in its entirety.
Details below...
It looks like we have a bunch of different reasons for the failures. Starting with Phoenix-master:
org.apache.phoenix.schema.NewerTableAlreadyExistsException: ERROR 1013 (42M04): Table already exists. tableName=T
    at org.apache.phoenix.end2end.UpsertValuesIT.testBatchedUpsert(UpsertValuesIT.java:476)
<<<
I've seen this coming out of a few different tests (I think I've also run into it on my own, but that's another thing).

Some of them look like the Jenkins build host is just over-taxed:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e7600000, 331350016, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 331350016 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/Phoenix-master/phoenix-core/hs_err_pid26454.log
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007ea600000, 273678336, 0) failed; error='Cannot allocate memory' (errno=12)
#
<<<
and

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Build step 'Invoke top-level Maven targets' marked build as failure
<<<
Both of these issues are limited to the host "ubuntu-us1". Let me just remove him from the pool (on Phoenix-master) and see if that helps at all.
I also see some sporadic failures of some Flume tests:

Running org.apache.phoenix.flume.PhoenixSinkIT
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.004 sec <<< FAILURE! - in org.apache.phoenix.flume.PhoenixSinkIT
org.apache.phoenix.flume.PhoenixSinkIT  Time elapsed: 0.004 sec  <<< ERROR!
java.lang.RuntimeException: java.io.IOException: Failed to save in any storage directories while saving namespace.
Caused by: java.io.IOException: Failed to save in any storage directories while saving namespace.

Running org.apache.phoenix.flume.RegexEventSerializerIT
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE! - in org.apache.phoenix.flume.RegexEventSerializerIT
org.apache.phoenix.flume.RegexEventSerializerIT  Time elapsed: 0.004 sec  <<< ERROR!
java.lang.RuntimeException: java.io.IOException: Failed to save in any storage directories while saving namespace.
Caused by: java.io.IOException: Failed to save in any storage directories while saving namespace.
<<<

I'm not sure what the error message means at a glance.
For Phoenix-HBase-1.1:
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more

2016-04-28 22:54:35,497 WARN [RS:0;hemera:41302] org.apache.hadoop.hbase.regionserver.HRegionServer(2279): error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2269)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:893)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:156)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:108)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:140)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:356)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:307)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:138)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:432)
    at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:346)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:264)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:318)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
    ... 4 more
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1235)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:217)
    ... 13 more
<<<
We have hit-or-miss failures with this error message, which keeps hbase:namespace from being assigned (as the RS's can never report in to the hmaster). This is happening across a couple of the nodes (ubuntu-[3,4,6]). I had tried to look into this one over the weekend (and was led to a JDK8-built jar running on JDK7), but if I look at META-INF/MANIFEST.mf in the hbase-server-1.1.3.jar from central, I see it was built with 1.7.0_80 (which I think means the JDK8 thought is a red herring). I'm really confused by this one, actually. Something must be amiss here.
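
Since the manifest apparently can't settle this, a more direct check (hypothetical helper, offered as a sketch): the telltale is the return-type descriptor baked into the failing call site, and constant-pool strings are stored as modified UTF-8, so a raw byte scan of ServerManager.class for the KeySetView descriptor would confirm or rule out a JDK8 compile.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.jar.JarFile;

public class DescriptorScan {
    public static void main(String[] args) throws Exception {
        String needle = "()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;";
        JarFile jar = new JarFile(args[0]); // path to hbase-server-1.1.3.jar
        InputStream in = jar.getInputStream(
                jar.getJarEntry("org/apache/hadoop/hbase/master/ServerManager.class"));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            bos.write(buf, 0, n);
        }
        jar.close();
        // ISO-8859-1 maps bytes 1:1 to chars, so an ASCII substring search is safe.
        String classBytes = bos.toString("ISO-8859-1");
        System.out.println(classBytes.contains(needle)
                ? "KeySetView descriptor present: compiled against JDK 8"
                : "no KeySetView descriptor found");
    }
}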
For Phoenix-HBase-1.0:
We see the same Phoenix-Flume failures, UpsertValuesIT failure, and timeouts on ubuntu-us1. There is one crash on H10, but that might just be bad luck.
For Phoenix-HBase-0.98:
Same UpsertValuesIT failure and failures on ubuntu-us1.
James Taylor wrote:
Anyone know why our Jenkins builds keep failing? Is it environmental and is there anything we can do about it?
Thanks,
James