[ 
https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077649#comment-16077649
 ] 

Harald Bögeholz commented on HAWQ-1498:
---------------------------------------

I had to rerun the experiment so the process IDs are different.
Relevant lines from lsof on the segment node are now

{noformat}
postgres  99948 gpadmin   36r   REG 253,32    212272     0  9438395 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201
 (deleted)
postgres  99948 gpadmin   37r   REG 253,32      1667     0  9438396 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta
 (deleted)
postgres  99976 gpadmin   15r   REG 253,32 121425488     0  9437196 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199
 (deleted)
postgres  99976 gpadmin   16r   REG 253,32    948647     0  9438394 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199_214317.meta
 (deleted)
postgres  99976 gpadmin   18r   REG 253,32    212272     0  9438395 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201
 (deleted)
postgres  99976 gpadmin   19r   REG 253,32      1667     0  9438396 
/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta
 (deleted)
{noformat}

Relevant lines from ps -ef are

{noformat}
gpadmin   99948 761943  0 05:40 ?        00:00:03 postgres: port 40000, gpadmin 
graphs 118.138.237.116(12911) con3231 seg1 idle
gpadmin   99976 761943  0 05:40 ?        00:00:01 postgres: port 40000, gpadmin 
graphs 118.138.237.116(12929) con3231 seg1 idle
{noformat}

pstack says:


{noformat}
[root@mds-hdp-04 ~]# pstack 99948
Thread 2 (Thread 0x7f8c059b6700 (LWP 99949)):
#0  0x00007f8c0189ee2d in poll () from /lib64/libc.so.6
#1  0x00000000009766c0 in rxThreadFunc (arg=<optimized out>) at ic_udp.c:6278
#2  0x00007f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99948)):
#0  0x00007f8c0233f82b in recv () from /lib64/libpthread.so.0
#1  0x00000000006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620 
<PqRecvBuffer>, len=8192) at be-secure.c:307
#2  0x00000000006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3  0x00000000006c06b5 in pq_getbyte () at pqcomm.c:929
#4  0x00000000007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5  ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6  0x00000000007e9546 in PostgresMain (argc=<optimized out>, argv=<optimized 
out>, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7  0x000000000079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8  BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9  ServerLoop () at postmaster.c:2163
#10 0x000000000079fd49 in PostmasterMain (argc=9, argv=<optimized out>) at 
postmaster.c:1454
#11 0x00000000004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}

and

{noformat}
[root@mds-hdp-04 ~]# pstack 99976
Thread 2 (Thread 0x7f8c059b6700 (LWP 99979)):
#0  0x00007f8c0189ee2d in poll () from /lib64/libc.so.6
#1  0x00000000009766c0 in rxThreadFunc (arg=<optimized out>) at ic_udp.c:6278
#2  0x00007f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99976)):
#0  0x00007f8c0233f82b in recv () from /lib64/libpthread.so.0
#1  0x00000000006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620 
<PqRecvBuffer>, len=8192) at be-secure.c:307
#2  0x00000000006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3  0x00000000006c06b5 in pq_getbyte () at pqcomm.c:929
#4  0x00000000007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5  ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6  0x00000000007e9546 in PostgresMain (argc=<optimized out>, argv=<optimized 
out>, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7  0x000000000079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8  BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9  ServerLoop () at postmaster.c:2163
#10 0x000000000079fd49 in PostmasterMain (argc=9, argv=<optimized out>) at 
postmaster.c:1454
#11 0x00000000004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}

Hope this helps

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
>
>
> I have been running some large computations in HAWQ using psql on the master. 
> These computations created temporary tables and dropped them again. 
> Nevertheless free disk space in HDFS decreased by much more than it should. 
> While the psql session on the master was still open I investigated on one of 
> the slave machines.
> HDFS is stored on /mds:
> {noformat}
> [root@mds-hdp-04 ~]# ls -l /mds
> total 36
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:23 falcon
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:42 hdfs
> drwx------. 2 root      root   16384 Jun  8 02:48 lost+found
> drwxr-xr-x. 5 storm     hadoop  4096 Jun 14 04:45 storm
> drwxr-xr-x. 4 root      root    4096 Jun 14 04:43 yarn
> drwxr-xr-x. 2 zookeeper hadoop  4096 Jun 14 04:39 zookeeper
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks      Used Available Use% Mounted on
> /dev/vdc       515928320 314560220 175137316  65% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952      /mds
> {noformat}
> Note that there is a more than 200 GB difference between the disk space used 
> according to df and the sum of all files on that file system according to du.
> I have found the culprit to be several postgres processes running as gpadmin 
> and holding open file descriptors to deleted files. Here are the first few:
> {noformat}
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> postgres 665334 gpadmin   18r   REG 253,32 134217728     0  9438234 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482
>  (deleted)
> postgres 665334 gpadmin   34r   REG 253,32     24488     0  9438114 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398
>  (deleted)
> postgres 665334 gpadmin   35r   REG 253,32       199     0  9438115 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398_187044.meta
>  (deleted)
> postgres 665334 gpadmin   37r   REG 253,32 134217728     0  9438208 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446
>  (deleted)
> postgres 665334 gpadmin   38r   REG 253,32   1048583     0  9438209 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446_187092.meta
>  (deleted)
> postgres 665334 gpadmin   39r   REG 253,32   1048583     0  9438235 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482_187128.meta
>  (deleted)
> postgres 665334 gpadmin   40r   REG 253,32 134217728     0  9438262 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555
>  (deleted)
> postgres 665334 gpadmin   41r   REG 253,32   1048583     0  9438263 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555_187201.meta
>  (deleted)
> postgres 665334 gpadmin   42r   REG 253,32 134217728     0  9438285 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602
>  (deleted)
> postgres 665334 gpadmin   43r   REG 253,32   1048583     0  9438286 
> /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602_187248.meta
>  (deleted)
> {noformat}
> As soon I close the psql session on the master the disk space is freed on the 
> slaves:
> {noformat}
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/vdc       515928320 89992720 399704816  19% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952      /mds
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> {noformat}
> I believe this to be a bug. At least for me it looks like a very undesirable 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to