[
https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077649#comment-16077649
]
Harald Bögeholz commented on HAWQ-1498:
---------------------------------------
I had to rerun the experiment so the process IDs are different.
Relevant lines from lsof on the segment node are now
{noformat}
postgres 99948 gpadmin 36r REG 253,32 212272 0 9438395
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201
(deleted)
postgres 99948 gpadmin 37r REG 253,32 1667 0 9438396
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta
(deleted)
postgres 99976 gpadmin 15r REG 253,32 121425488 0 9437196
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199
(deleted)
postgres 99976 gpadmin 16r REG 253,32 948647 0 9438394
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199_214317.meta
(deleted)
postgres 99976 gpadmin 18r REG 253,32 212272 0 9438395
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201
(deleted)
postgres 99976 gpadmin 19r REG 253,32 1667 0 9438396
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta
(deleted)
{noformat}
Relevant lines from ps -ef are
{noformat}
gpadmin 99948 761943 0 05:40 ? 00:00:03 postgres: port 40000, gpadmin
graphs 118.138.237.116(12911) con3231 seg1 idle
gpadmin 99976 761943 0 05:40 ? 00:00:01 postgres: port 40000, gpadmin
graphs 118.138.237.116(12929) con3231 seg1 idle
{noformat}
pstack says:
{noformat}
[root@mds-hdp-04 ~]# pstack 99948
Thread 2 (Thread 0x7f8c059b6700 (LWP 99949)):
#0 0x00007f8c0189ee2d in poll () from /lib64/libc.so.6
#1 0x00000000009766c0 in rxThreadFunc (arg=<optimized out>) at ic_udp.c:6278
#2 0x00007f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99948)):
#0 0x00007f8c0233f82b in recv () from /lib64/libpthread.so.0
#1 0x00000000006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620
<PqRecvBuffer>, len=8192) at be-secure.c:307
#2 0x00000000006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3 0x00000000006c06b5 in pq_getbyte () at pqcomm.c:929
#4 0x00000000007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5 ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6 0x00000000007e9546 in PostgresMain (argc=<optimized out>, argv=<optimized
out>, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7 0x000000000079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8 BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9 ServerLoop () at postmaster.c:2163
#10 0x000000000079fd49 in PostmasterMain (argc=9, argv=<optimized out>) at
postmaster.c:1454
#11 0x00000000004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}
and
{noformat}
[root@mds-hdp-04 ~]# pstack 99976
Thread 2 (Thread 0x7f8c059b6700 (LWP 99979)):
#0 0x00007f8c0189ee2d in poll () from /lib64/libc.so.6
#1 0x00000000009766c0 in rxThreadFunc (arg=<optimized out>) at ic_udp.c:6278
#2 0x00007f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99976)):
#0 0x00007f8c0233f82b in recv () from /lib64/libpthread.so.0
#1 0x00000000006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620
<PqRecvBuffer>, len=8192) at be-secure.c:307
#2 0x00000000006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3 0x00000000006c06b5 in pq_getbyte () at pqcomm.c:929
#4 0x00000000007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5 ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6 0x00000000007e9546 in PostgresMain (argc=<optimized out>, argv=<optimized
out>, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7 0x000000000079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8 BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9 ServerLoop () at postmaster.c:2163
#10 0x000000000079fd49 in PostmasterMain (argc=9, argv=<optimized out>) at
postmaster.c:1454
#11 0x00000000004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}
Hope this helps
> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
> Key: HAWQ-1498
> URL: https://issues.apache.org/jira/browse/HAWQ-1498
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Harald Bögeholz
> Assignee: Radar Lei
> Fix For: 2.2.0.0-incubating
>
>
> I have been running some large computations in HAWQ using psql on the master.
> These computations created temporary tables and dropped them again.
> Nevertheless free disk space in HDFS decreased by much more than it should.
> While the psql session on the master was still open I investigated on one of
> the slave machines.
> HDFS is stored on /mds:
> {noformat}
> [root@mds-hdp-04 ~]# ls -l /mds
> total 36
> drwxr-xr-x. 3 root root 4096 Jun 14 04:23 falcon
> drwxr-xr-x. 3 root root 4096 Jun 14 04:42 hdfs
> drwx------. 2 root root 16384 Jun 8 02:48 lost+found
> drwxr-xr-x. 5 storm hadoop 4096 Jun 14 04:45 storm
> drwxr-xr-x. 4 root root 4096 Jun 14 04:43 yarn
> drwxr-xr-x. 2 zookeeper hadoop 4096 Jun 14 04:39 zookeeper
> [root@mds-hdp-04 ~]# df /mds
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/vdc 515928320 314560220 175137316 65% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> {noformat}
> Note that there is a more than 200 GB difference between the disk space used
> according to df and the sum of all files on that file system according to du.
> I have found the culprit to be several postgres processes running as gpadmin
> and holding open file descriptors to deleted files. Here are the first few:
> {noformat}
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> postgres 665334 gpadmin 18r REG 253,32 134217728 0 9438234
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482
> (deleted)
> postgres 665334 gpadmin 34r REG 253,32 24488 0 9438114
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398
> (deleted)
> postgres 665334 gpadmin 35r REG 253,32 199 0 9438115
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398_187044.meta
> (deleted)
> postgres 665334 gpadmin 37r REG 253,32 134217728 0 9438208
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446
> (deleted)
> postgres 665334 gpadmin 38r REG 253,32 1048583 0 9438209
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446_187092.meta
> (deleted)
> postgres 665334 gpadmin 39r REG 253,32 1048583 0 9438235
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482_187128.meta
> (deleted)
> postgres 665334 gpadmin 40r REG 253,32 134217728 0 9438262
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555
> (deleted)
> postgres 665334 gpadmin 41r REG 253,32 1048583 0 9438263
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555_187201.meta
> (deleted)
> postgres 665334 gpadmin 42r REG 253,32 134217728 0 9438285
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602
> (deleted)
> postgres 665334 gpadmin 43r REG 253,32 1048583 0 9438286
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602_187248.meta
> (deleted)
> {noformat}
> As soon I close the psql session on the master the disk space is freed on the
> slaves:
> {noformat}
> [root@mds-hdp-04 ~]# df /mds
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/vdc 515928320 89992720 399704816 19% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> {noformat}
> I believe this to be a bug. At least for me it looks like a very undesirable
> behavior.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)