Hi Andreas,

I didn't see an answer back, but I have a follow-up question. Also, my Postgres knowledge is pretty much zero.

I understand your concern about using direct I/O and how application buffers can cause problems. My question is about a statement in the original email: a reference to shutting down Postgres on the standby node and running pg_xlogdump, which I assume is a standalone command. Shouldn't a standalone command running on the standby node see a consistent copy of the data? I assume that the active and standby nodes are Lustre clients in the same cluster.

Thanks,
Michael

On Sat, Sep 29, 2018 at 2:54 PM Andreas Dilger <[email protected]> wrote:

> Is PG using O_DIRECT or buffered read/write? Is it caching the pages in
> userspace?
>
> Lustre will definitely keep pages consistent between clients, but if the
> application is caching the pages in userspace, and does not have any
> protocol between the nodes to invalidate cached pages when they are
> modified on disk, then the data will become inconsistent when one node
> modifies it.
>
> That is the same reason it isn't possible to mount a single ext4
> filesystem r/w on one node and r/o on another node with shared storage,
> because the filesystem doesn't expect data to be changing underneath it,
> and will cache pages in RAM and not re-read them if they are modified on
> the other node.
>
> Cheers, Andreas
>
> On Sep 29, 2018, at 07:57, 王亮 <[email protected]> wrote:
> > Hello, lustre development team
> >
> > Background: we have two PostgreSQL instances running as a primary and
> > a standby, and they share the same xlog file and data files (we changed
> > the PG code to achieve this), which are located in the mounted Lustre
> > file system. We used gfs2 before and want to try Lustre, which we
> > expect to show much better performance, but ...
> >
> > We have run into a concurrent read/write access problem related to
> > Lustre. Could you give us some suggestions? Any advice is appreciated,
> > and thank you in advance : )
> >
> > Note: we are very sure the standby instance does not write any data to
> > disk. (To be sure of this, we also shut down the standby and used the
> > pg_xlogdump tool to read the xlog file. The problem still happened, and
> > pg_xlogdump is just a query tool without any write operations.)
> >
> > Scenario Description:
> > There are 4 nodes (CentOS Linux release 7.4) connected with an
> > InfiniBand network (driven by MLNX_OFED_LINUX-4.4):
> > 10.0.0.106 acts as MDS with a local PCI-E 800GB SSD that is used as MDT.
> > 10.0.0.101 acts as OSS with an identical local PCI-E 800GB SSD that is
> > used as OST.
> > 10.0.0.104 and 10.0.0.105 act as Lustre clients and mount the Lustre
> > file system at the directory “/lustre”.
> > The Lustre-related packages are compiled from the official
> > lustre-2.10.5-1.src.rpm.
> >
> > The simplest verification (i.e. the dd command) passed without errors.
> >
> > Error:
> > We then start our customized PostgreSQL service on 104 and 105. 104
> > runs as the primary PostgreSQL server, and 105 runs as the secondary
> > PostgreSQL server. Both PostgreSQL nodes read/write the shared
> > directory “/lustre” provided by Lustre. The primary PostgreSQL server
> > opens files in **RW** mode and writes to them; in the **meantime** the
> > secondary PostgreSQL server opens the same files in **R** mode and
> > reads the data written by the primary PostgreSQL server, and it gets
> > the **wrong** data (the data flushed to disk by the primary is in
> > error, i.e. wrong data is written to disk). This happens when we run a
> > PostgreSQL benchmark tool.
> >
> > PS 1.
> > We tried different options to mount Lustre:
> > mount -t lustre -o flock 10.0.0.106@o2ib0:/birdfs /lustre
> > mount -t lustre -o flock -o ro 10.0.0.106@o2ib0:/birdfs /lustre
> > but the error always occurs.
> >
> > PS 2.
> > The initialization output is attached in case it is helpful.
> >
> > [root@106 ~]# mkfs.lustre --fsname=birdfs --mgs --mdt --index=0 --reformat /dev/nvme0n1
> >
> >    Permanent disk data:
> > Target:     birdfs:MDT0000
> > Index:      0
> > Lustre FS:  birdfs
> > Mount type: ldiskfs
> > Flags:      0x65
> >               (MDT MGS first_time update )
> > Persistent mount opts: user_xattr,errors=remount-ro
> > Parameters:
> >
> > device size = 763097MB
> > formatting backing filesystem ldiskfs on /dev/nvme0n1
> >         target name   birdfs:MDT0000
> >         4k blocks     195353046
> >         options       -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
> > mkfs_cmd = mke2fs -j -b 4096 -L birdfs:MDT0000 -J size=4096 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/nvme0n1 195353046
> >
> > Writing CONFIGS/mountdata
> >
> > [root@101 ~]# mkfs.lustre --fsname=birdfs --ost --reformat --index=0 --mgsnode=10.0.0.106@o2ib0 /dev/nvme0n1
> >
> >    Permanent disk data:
> > Target:     birdfs:OST0000
> > Index:      0
> > Lustre FS:  birdfs
> > Mount type: ldiskfs
> > Flags:      0x62
> >               (OST first_time update )
> > Persistent mount opts: ,errors=remount-ro
> > Parameters: mgsnode=10.0.0.106@o2ib
> >
> > device size = 763097MB
> > formatting backing filesystem ldiskfs on /dev/nvme0n1
> >         target name   birdfs:OST0000
> >         4k blocks     195353046
> >         options       -J size=400 -I 512 -i 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F
> > mkfs_cmd = mke2fs -j -b 4096 -L birdfs:OST0000 -J size=400 -I 512 -i 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E resize="4290772992",lazy_journal_init -F /dev/nvme0n1 195353046
> > Writing CONFIGS/mountdata
> >
> > Looking forward to any replies.
> >
> > Regards,
> > Bird
> >
> --
> regards
> denny
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
----------------
Michael Nishimoto
cell: 408-410-9277
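
The userspace-caching issue described in the quoted reply can be illustrated with a minimal sketch. It runs against a local temporary file in a single process, so it does not exercise Lustre or PostgreSQL at all; it only shows how a reader that keeps its own copy of a page keeps returning old data after the file changes. The names `CachingReader`, `read_page_fresh`, `write_page`, and `demo` are hypothetical, and the 8192-byte page size is chosen only because it matches PostgreSQL's default block size.

```python
import os
import tempfile

PAGE_SIZE = 8192  # illustrative; matches PostgreSQL's default block size


class CachingReader:
    """Hypothetical reader that caches pages in userspace and never invalidates them."""

    def __init__(self, path):
        self.path = path
        self.cache = {}  # page number -> bytes

    def read_page(self, page_no):
        # Only go to the filesystem on a cache miss.
        if page_no not in self.cache:
            with open(self.path, "rb") as f:
                f.seek(page_no * PAGE_SIZE)
                self.cache[page_no] = f.read(PAGE_SIZE)
        return self.cache[page_no]


def read_page_fresh(path, page_no):
    """Read a page through the filesystem every time, with no userspace cache."""
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE)


def write_page(path, page_no, data):
    """Stand-in for the writer node: overwrite one page and flush it."""
    with open(path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment")
        with open(path, "wb") as f:
            f.write(b"A" * PAGE_SIZE)

        reader = CachingReader(path)
        before = reader.read_page(0)           # fills the userspace cache

        write_page(path, 0, b"B" * PAGE_SIZE)  # the file changes underneath

        stale = reader.read_page(0)            # answered from the old cached copy
        fresh = read_page_fresh(path, 0)       # re-read from the filesystem
        return before[:1], stale[:1], fresh[:1]


if __name__ == "__main__":
    print(demo())
```

A freshly started standalone reader begins with an empty userspace cache, which is the case `read_page_fresh` models; the caching reader corresponds to a long-running process with no invalidation protocol between nodes.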
