[
https://issues.apache.org/jira/browse/HAWQ-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086723#comment-15086723
]
zharui commented on HAWQ-323:
-----------------------------
Thank you for your reply.
Now another issue has come up. I cleaned the cluster and reinitialized it with
just 2 nodes. After cluster initialization finished, I found that the
resourcemanager crashes and then restarts, again and again. I read the log and
found that the resourcemanager hits a segmentation fault. The log is as follows:
2016-01-07 10:57:30.158627
CST,,,p120379,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Segment resource
manager discovers localhost configuration the first
time.",,,,,,,0,,"requesthandler_RMSEG.c",74,
2016-01-07 10:57:30.158633
CST,,,p120379,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Segment resource
manager Build/update local host
information.",,,,,,,0,,"requesthandler_RMSEG.c",123,
2016-01-07 10:57:30.158640
CST,,,p120379,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Built localhost
information. NODE:ID=-1,HAWQ AVAIL, GRM UNAVAIL, HAWQ CAP (65536 MB, 16.000000
CORE), GRM CAP(0 MB, 0.000000
CORE),NODE:HOST=ws01.mzhen.cn:40000,Master:0,Standby:0,Alive:1.Addresses:127.0.0.1,192.168.3.2,123.103.19.2,192.168.122.1",,,,,,,0,,"requesthandler_RMSEG.c",201,
2016-01-07 10:57:33.164831
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","resourcemanager
process (PID 120379) was terminated by signal 11: Segmentation
fault",,,,,,,0,,"postmaster.c",4714,
2016-01-07 10:57:33.164876
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","server process
(PID 120379) was terminated by signal 11: Segmentation
fault",,,,,,,0,,"postmaster.c",4714,
2016-01-07 10:57:33.164885
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","terminating any
other active server processes",,,,,,,0,,"postmaster.c",4452,
2016-01-07 10:57:33.165428
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown:
counter 259",,,,,,,0,,"postmaster.c",1868,
2016-01-07 10:57:33.165437
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","gp_session_id
high-water mark is 1",,,,,,,0,,"postmaster.c",1892,
2016-01-07 10:57:33.203150
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","resetting shared
memory",,,,,,,0,,"postmaster.c",3300,
2016-01-07 10:57:33.203173
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentFilespace_ShmemSize:
69192 = gp_max_filespaces: 8 * sizeof(FilespaceDirEntryData): 1048 +
PersistentFilespace_SharedDataSize():
80",,,,,,,0,,"cdbpersistentfilespace.c",1144,
2016-01-07 10:57:33.203182
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentTablespace_ShmemSize:
6304 = gp_max_tablespaces: 16 * sizeof(TablespaceDirEntryData): 32 +
PersistentTablespace_SharedDataSize():
80",,,,,,,0,,"cdbpersistenttablespace.c",1192,
2016-01-07 10:57:33.203190
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentDatabase_ShmemSize:
15984 = PersistentDatabase_SharedDataSize(): 15984 =
PersistentDatabaseSharedData: 80 + MaxPersistentDatabaseDirectories: 256 (db:
16 * ts: 16) * sizeof(DatabaseDirEntryData):
56",,,,,,,0,,"cdbpersistentdatabase.c",1477,
2016-01-07 10:57:33.203199
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentRelation_ShmemSize:
3673504 = gp_max_relations: 65536 * sizeof(RelationDirEntryData): 32 +
PersistentRelation_SharedDataSize(): 80",,,,,,,0,,"cdbpersistentrelation.c",454,
2016-01-07 10:57:33.203207
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Metadata Cache
Share Memory Size : 155720180",,,,,,,0,,"ipci.c",184,
2016-01-07 10:57:33.203482
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","temporary files
using default location",,,,,,,0,,"primary_mirror_mode.c",282,
2016-01-07 10:57:33.203490
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","transaction
files using default pg_system filespace",,,,,,,0,,"primary_mirror_mode.c",1133,
2016-01-07 10:57:33.346770
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","[MetadataCache]
Metadata cache initialize successfully.
block_capacity:2097152",,,,,,,0,,"cdbmetadatacache.c",248,
2016-01-07 10:57:33.351796
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PrimaryMirrorMode:
Processing postmaster reset with recent mode of
3",,,,,,,0,,"primary_mirror_mode.c",877,
2016-01-07 10:57:33.351836
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PrimaryMirrorMode:
Processing postmaster reset to non-fault
state",,,,,,,0,,"primary_mirror_mode.c",885,
2016-01-07 10:57:33.351844
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","removing all
temporary files",,,,,,,0,,"fd.c",2123,
2016-01-07 10:57:33.351852
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","all server
processes terminated; reinitializing",,,,,,,0,,"postmaster.c",1909,
2016-01-07 10:57:33.353265
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","database system
was interrupted at 2016-01-07 10:57:29 CST",,,,,,,0,,"xlog.c",6221,
2016-01-07 10:57:33.363626
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","checkpoint
record is at 0/92D6D0",,,,,,,0,,"xlog.c",6298,
2016-01-07 10:57:33.363678
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","redo record is
at 0/92D6D0; undo record is at 0/0; shutdown TRUE",,,,,,,0,,"xlog.c",6332,
2016-01-07 10:57:33.363711
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","next transaction
ID: 0/972; next OID: 10996",,,,,,,0,,"xlog.c",6336,
2016-01-07 10:57:33.363743
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","next
MultiXactId: 1; next MultiXactOffset: 0",,,,,,,0,,"xlog.c",6339,
2016-01-07 10:57:33.363772
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","database system
was not properly shut down; automatic recovery in
progress",,,,,,,0,,"xlog.c",6428,
2016-01-07 10:57:33.422160
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","record with zero
length at 0/92D720",,,,,,,0,,"xlog.c",4108,
2016-01-07 10:57:33.422184
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","no record for
redo after checkpoint, skip redo and proceed for recovery
pass",,,,,,,0,,"xlog.c",6492,
2016-01-07 10:57:33.422207
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","end of
transaction log location is 0/92D720",,,,,,,0,,"xlog.c",6576,
2016-01-07 10:57:33.422743
CST,,,p120467,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup
pass 1. Proceeding to startup crash recovery passes 2 and
3.",,,,,,,0,,"xlog.c",6810,
2016-01-07 10:57:33.427194
CST,,,p120468,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup
crash recovery pass 2",,,,,,,0,,"xlog.c",6981,
2016-01-07 10:57:33.505649
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","recovery restart
point at 0/92D6D0",,,,,"xlog redo checkpoint: redo 0/92D6D0; undo 0/0; tli 1;
xid 0/972; oid 10996; multi 1; offset 0; shutdown
REDO PASS 3 @ 0/92D6D0; LSN 0/92D720: prev 0/92D680; xid 0: XLOG - checkpoint:
redo 0/92D6D0; undo 0/0; tli 1; xid 0/972; oid 10996; multi 1; offset 0;
shutdown",,0,,"xlog.c",8317,
2016-01-07 10:57:33.505711
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","record with zero
length at 0/92D720",,,,,,,0,,"xlog.c",4108,
2016-01-07 10:57:33.505720
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Oldest active
transaction from prepared transactions 972",,,,,,,0,,"xlog.c",5990,
2016-01-07 10:57:33.820605
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","database system
is ready",,,,,,,0,,"xlog.c",6016,
2016-01-07 10:57:33.820642
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PostgreSQL
8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 2.0.0.0_beta build dev) on
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.7 20120313 (Red Hat
4.4.7-16) compiled on Jan 6 2016 15:41:35",,,,,,,0,,"xlog.c",6026,
2016-01-07 10:57:33.916443
CST,,,p120469,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup
crash recovery pass 3",,,,,,,0,,"xlog.c",7125,
2016-01-07 10:57:33.922753
CST,,,p120495,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup
integrity checking",,,,,,,0,,"xlog.c",7153,
2016-01-07 10:57:33.926302
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","HAWQ Master RM
:: Temporary directory /tmp",,,,,,,0,,"resourcemanager.c",1060,
2016-01-07 10:57:33.926318
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","HAWQ Segment RM
:: Temporary directory /tmp",,,,,,,0,,"resourcemanager.c",1067,
2016-01-07 10:57:33.933475
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","HAWQ RM SEG
process works now.",,,,,,,0,,"resourcemanager_RMSEG.c",71,
2016-01-07 10:57:33.933534
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Wait for HAWQ RM
-1",,,,,,,0,,"resourcemanager.c",419,
2016-01-07 10:57:33.933569
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","HAWQ :: Received
signal notification that HAWQ RM works now.",,,,,,,0,,"resourcemanager.c",427,
2016-01-07 10:57:33.933579
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","PostgreSQL
8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 2.0.0.0_beta build dev) on
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.7 20120313 (Red Hat
4.4.7-16) compiled on Jan 6 2016 18:31:33",,,,,,,0,,"postmaster.c",3660,
2016-01-07 10:57:33.933589
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","database system
is ready to accept connections","PostgreSQL 8.2.15 (Greenplum Database 4.2.0
build 1) (HAWQ 2.0.0.0_beta build dev) on x86_64-unknown-linux-gnu, compiled by
GCC gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16) compiled on Jan 6 2016
18:31:33",,,,,,0,,"postmaster.c",3667,
2016-01-07 10:57:34.034510
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager
discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",245,
2016-01-07 10:57:34.034531
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager
discovered local host IPv4 address 192.168.3.2",,,,,,,0,,"network_utils.c",245,
2016-01-07 10:57:34.034538
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager
discovered local host IPv4 address 123.103.19.2",,,,,,,0,,"network_utils.c",245,
2016-01-07 10:57:34.034550
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager
discovered local host IPv4 address
192.168.122.1",,,,,,,0,,"network_utils.c",245,
2016-01-07 10:57:34.034558
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Segment resource
manager discovers localhost configuration the first
time.",,,,,,,0,,"requesthandler_RMSEG.c",74,
2016-01-07 10:57:34.034564
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Segment resource
manager Build/update local host
information.",,,,,,,0,,"requesthandler_RMSEG.c",123,
2016-01-07 10:57:34.034571
CST,,,p120499,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","Built localhost
information. NODE:ID=-1,HAWQ AVAIL, GRM UNAVAIL, HAWQ CAP (65536 MB, 16.000000
CORE), GRM CAP(0 MB, 0.000000
CORE),NODE:HOST=ws01.mzhen.cn:40000,Master:0,Standby:0,Alive:1.Addresses:127.0.0.1,192.168.3.2,123.103.19.2,192.168.122.1",,,,,,,0,,"requesthandler_RMSEG.c",201,
2016-01-07 10:57:34.040028
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","resourcemanager
process (PID 120499) was terminated by signal 11: Segmentation
fault",,,,,,,0,,"postmaster.c",4714,
2016-01-07 10:57:34.040069
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","server process
(PID 120499) was terminated by signal 11: Segmentation
fault",,,,,,,0,,"postmaster.c",4714,
2016-01-07 10:57:34.040101
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","terminating any
other active server processes",,,,,,,0,,"postmaster.c",4452,
2016-01-07 10:57:34.040608
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown:
counter 260",,,,,,,0,,"postmaster.c",1868,
2016-01-07 10:57:34.040627
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","gp_session_id
high-water mark is 1",,,,,,,0,,"postmaster.c",1892,
2016-01-07 10:57:34.069838
CST,,,p113549,th-336778880,,,,0,,,seg-10000,,,,,"LOG","00000","resetting shared
memory",,,,,,,0,,"postmaster.c",3300,
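The crash loop in the log above can be confirmed by pulling out the "terminated by signal" messages. Below is a minimal sketch in Python, assuming the log text has been saved or pasted with the same line wrapping as shown here. The names `find_crashes` and `crashed_pids` are made up for illustration and are not part of HAWQ.

```python
import re

# Matches messages such as:
#   resourcemanager process (PID 120379) was terminated by signal 11
CRASH_RE = re.compile(
    r"(\w+) process \(PID (\d+)\) was terminated by signal (\d+)"
)


def find_crashes(log_text):
    """Return (process_name, pid, signal) for every crash message found.

    The log records are wrapped across several lines, so all whitespace
    is collapsed to single spaces before matching.
    """
    flat = " ".join(log_text.split())
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in CRASH_RE.finditer(flat)
    ]


def crashed_pids(log_text, process_name="resourcemanager"):
    """Return the sorted, distinct PIDs of crashed processes with this name."""
    return sorted(
        {pid for name, pid, _ in find_crashes(log_text) if name == process_name}
    )
```

Calling `crashed_pids` on the text of a segment log file lists each resourcemanager PID that died. More than one distinct PID within a few seconds, as with 120379 and 120499 above, indicates that the postmaster keeps restarting a process that crashes again at the same point.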
> Cannot query when cluster include more than 1 segment
> -----------------------------------------------------
>
> Key: HAWQ-323
> URL: https://issues.apache.org/jira/browse/HAWQ-323
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Core, Resource Manager
> Affects Versions: 2.0.0-beta-incubating
> Reporter: zharui
> Assignee: Lei Chang
>
> The version I use is 2.0.0-beta-RC2. I can query data normally when the
> cluster has just 1 segment. Once the cluster has more than 1 segment online, I
> cannot finish any query and am informed that "ERROR: failed to acquire
> resource from resource manager, 7 of 8 segments are unavailable
> (pquery.c:788)".
> I have read the segment logs and the source code of the resource manager. I
> guess this issue is caused by a communication failure between the segment
> instances and the resource manager server. I can find log entries showing a
> segment connecting to the resource manager successfully, such as "AsyncComm
> framework receives message 518 from FD5" and "Resource enforcer increases
> memory quota to: total memory quota=65536 MB, delta memory quota = 65536 MB",
> but the other online segments have no such log entries.