Re: [lustre-discuss] lustre write wrong data under postgresql benchmark tool test(concurrent access same data file with primary node write and standy node read)

Andreas Dilger Fri, 05 Oct 2018 21:58:23 -0700

On Oct 5, 2018, at 18:30, Michael Nishimoto <[email protected]> wrote:
> 
> Hi Andreas,
> 
> I didn't see an answer back, but I have a followup question.  Also, my 
> Postgres knowledge is pretty much zero.
> 
> I understand your concern about using direct I/O and how application buffers 
> can cause problems. My question is about a statement in the original email, 
> a reference to shutting down Postgres on the standby node and running 
> pg_xlogdump, which I assume is a standalone command.
> 
> Shouldn't a standalone command running on the standby node see a consistent 
> copy of data?
> 
> I assume that active and standby nodes are Lustre clients in the same cluster.


Yes, Lustre will keep the file data consistent between different clients, 
regardless of the IO pattern used.  That said, large aligned reads/writes are 
much more efficient than small ones.

The important point is that Lustre can only keep the data consistent on the 
kernel side of the filesystem. It can't do anything once the data is in 
userspace buffers, which is what I think the original poster's problem 
relates to.
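
To make the distinction concrete, here is a minimal sketch in Python. It is 
not PostgreSQL code and does not involve Lustre at all: it runs on one 
machine against an ordinary temporary file. The CachingReader class and the 
demo() function are hypothetical names invented for this illustration, and 
the 8192-byte page size is only chosen to resemble a database block. The 
point is that a page held in an application-level cache is never refreshed 
by the filesystem, while a new read through the kernel sees the current data.

```python
import os
import tempfile

PAGE_SIZE = 8192  # illustrative page size, an assumption for this sketch


class CachingReader:
    """Hypothetical reader that keeps pages in a userspace cache."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.cache = {}  # page number -> bytes; invisible to the filesystem

    def read_page(self, page_no):
        # A cache hit never reaches the kernel, so the filesystem has no
        # chance to refresh or invalidate this copy.
        if page_no not in self.cache:
            self.cache[page_no] = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)
        return self.cache[page_no]

    def read_page_uncached(self, page_no):
        # Going back to the kernel returns whatever the filesystem holds now.
        return os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)

    def close(self):
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "datafile")
        with open(path, "wb") as f:
            f.write(b"A" * PAGE_SIZE)

        reader = CachingReader(path)      # plays the role of the standby
        first = reader.read_page(0)       # page is now cached in userspace

        wfd = os.open(path, os.O_WRONLY)  # plays the role of the primary
        os.pwrite(wfd, b"B" * PAGE_SIZE, 0)
        os.fsync(wfd)
        os.close(wfd)

        cached = reader.read_page(0)          # stale userspace copy
        fresh = reader.read_page_uncached(0)  # current file contents
        reader.close()
        return first[:1], cached[:1], fresh[:1]


if __name__ == "__main__":
    print(demo())
```

In this sketch the cached read keeps returning the old page after the 
writer has updated the file, and only the uncached read picks up the new 
contents. An application that shares files between nodes would need its own 
way to invalidate such cached pages.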

Cheers, Andreas

> On Sat, Sep 29, 2018 at 2:54 PM Andreas Dilger <[email protected]> wrote:
>> Is PG using O_DIRECT or buffered read/write?  Is it caching the pages in 
>> userspace?
>> 
>> Lustre will definitely keep pages consistent between clients, but if the 
>> application is caching the pages in userspace, and does not have any 
>> protocol between the nodes to invalidate cached pages when they are modified 
>> on disk, then the data will become inconsistent when one node modifies it. 
>> 
>> That is the same reason it isn't possible to mount a single ext4 filesystem 
>> r/w on one node and r/o on another node with shared storage, because the 
>> filesystem doesn't expect data to be changing underneath it, and will cache 
>> pages in RAM and not re-read them if they are modified on the other node. 
>> 
>> Cheers, Andreas
>> 
>> On Sep 29, 2018, at 07:57, 王亮 <[email protected]> wrote:
>> 
>>> Hello, lustre development team
>>> 
>>> Background: we have two PostgreSQL instances running as a primary and a 
>>> standby. They share the same xlog file and data files (we changed the PG 
>>> code to achieve this), which are located on the mounted Lustre file system. 
>>> We used gfs2 before and want to try Lustre, which we expect to show much 
>>> better performance, but ...
>>>  
>>> We have hit a concurrent read/write access problem related to Lustre. Could 
>>> you give us some suggestions? Any advice is appreciated, and thank you in 
>>> advance : )
>>> Note: we are very sure the standby instance does not write any data to disk. 
>>> (To be sure of this, we also shut down the standby and used the pg_xlogdump 
>>> tool to read the xlog file. The problem still happened, and pg_xlogdump 
>>> only reads the file, without any write operation.) 
>>> 
>>> 
>>> Scenario Description:
>>> There are 4 nodes (CentOS Linux release 7.4) connected by an InfiniBand 
>>> network (driven by MLNX_OFED_LINUX-4.4):
>>> 10.0.0.106 acts as the MDS, with a local PCI-E 800GB SSD used as the MDT.
>>> 10.0.0.101 acts as the OSS, with an identical local PCI-E 800GB SSD used as 
>>> the OST.
>>> 10.0.0.104 and 10.0.0.105 act as Lustre clients and mount the Lustre file 
>>> system at the directory “/lustre”.
>>> The Lustre packages are compiled from the official 
>>> lustre-2.10.5-1.src.rpm.
>>>  
>>> The simplest verification (i.e. the dd command) passed without errors.
>>>  
>>> Error:
>>> We then start our customized PostgreSQL service on 104 and 105. 104 runs as 
>>> the primary PostgreSQL server, and 105 runs as the secondary PostgreSQL 
>>> server. Both PostgreSQL nodes read/write the shared directory 
>>> “/lustre” provided by Lustre. The primary PostgreSQL server opens files 
>>> in *RW* mode and writes to them; in the *meantime* the 
>>> secondary PostgreSQL server opens the same files in *R* mode and reads 
>>> the data written by the primary PostgreSQL server, and it gets the *wrong* 
>>> data (the data flushed to disk by the primary is wrong, i.e. wrong data is 
>>> written to disk). This happens when we run a PostgreSQL benchmark tool.
>>>  
>>>  
>>> PS 1.
>>> We tried different options to mount the Lustre:
>>> mount -t lustre -o flock 10.0.0.106@o2ib0:/birdfs /lustre
>>> mount -t lustre -o flock -o ro 10.0.0.106@o2ib0:/birdfs /lustre
>>> but the error always occurs.
>>>  
>>> PS 2.
>>> The initial setup information is attached below; it may be helpful.
>>> [root@106 ~]# mkfs.lustre --fsname=birdfs --mgs --mdt --index=0 --reformat 
>>> /dev/nvme0n1
>>>  
>>>    Permanent disk data:
>>> Target:     birdfs:MDT0000
>>> Index:      0
>>> Lustre FS:  birdfs
>>> Mount type: ldiskfs
>>> Flags:      0x65
>>>               (MDT MGS first_time update )
>>> Persistent mount opts: user_xattr,errors=remount-ro
>>> Parameters:
>>>  
>>> device size = 763097MB
>>> formatting backing filesystem ldiskfs on /dev/nvme0n1
>>>        target name   birdfs:MDT0000
>>>        4k blocks     195353046
>>>        options        -J size=4096 -I 1024 -i 2560 -q -O 
>>> dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E 
>>> lazy_journal_init -F
>>> mkfs_cmd = mke2fs -j -b 4096 -L birdfs:MDT0000  -J size=4096 -I 1024 -i 
>>> 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E 
>>> lazy_journal_init -F /dev/nvme0n1 195353046
>>>  
>>> Writing CONFIGS/mountdata
>>>  
>>> [root@101 ~]# mkfs.lustre --fsname=birdfs --ost --reformat --index=0 
>>> --mgsnode=10.0.0.106@o2ib0 /dev/nvme0n1
>>>  
>>>    Permanent disk data:
>>> Target:     birdfs:OST0000
>>> Index:      0
>>> Lustre FS:  birdfs
>>> Mount type: ldiskfs
>>> Flags:      0x62
>>>               (OST first_time update )
>>> Persistent mount opts: ,errors=remount-ro
>>> Parameters: mgsnode=10.0.0.106@o2ib
>>>  
>>> device size = 763097MB
>>> formatting backing filesystem ldiskfs on /dev/nvme0n1
>>>        target name   birdfs:OST0000
>>>        4k blocks     195353046
>>>        options        -J size=400 -I 512 -i 69905 -q -O 
>>> extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E 
>>> resize="4290772992",lazy_journal_init -F
>>> mkfs_cmd = mke2fs -j -b 4096 -L birdfs:OST0000  -J size=400 -I 512 -i 69905 
>>> -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E 
>>> resize="4290772992",lazy_journal_init -F /dev/nvme0n1 195353046
>>> Writing CONFIGS/mountdata
>>>  
>>>  
>>> Looking forward to any replies.
>>>  
>>> Regards,
>>> Bird
>>> 
>>> 
>>> -- 
>>> regards
>>> denny
>> 

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud




_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
