On Fri, Aug 12, 2016 at 11:44 AM, Ioana Danes <ioanada...@gmail.com> wrote:
>
> On Fri, Aug 12, 2016 at 11:34 AM, Adrian Klaver <adrian.kla...@aklaver.com> wrote:
>
>> On 08/12/2016 08:30 AM, Ioana Danes wrote:
>>
>>> On Fri, Aug 12, 2016 at 11:26 AM, Adrian Klaver
>>> <adrian.kla...@aklaver.com> wrote:
>>>
>>>     On 08/12/2016 08:10 AM, Ioana Danes wrote:
>>>
>>>         On Fri, Aug 12, 2016 at 10:47 AM, Francisco Olarte
>>>         <fola...@peoplecall.com> wrote:
>>>
>>>             CCing to the list...
>>>
>>>         Thanks
>>>
>>>             On Fri, Aug 12, 2016 at 4:10 PM, Ioana Danes
>>>             <ioanada...@gmail.com> wrote:
>>>             >> given 318220 and 318216 are just a bit away ( 4db08/4db0c ), and it
>>>             >> repeats sporadically, have you ruled out ( by having page checksums or
>>>             >> other mechanism ) a potential disk read/write error ?
>>>             >>
>>>             >> > Also the index is correct on db3 as the record in case (with drawid =
>>>             >> > 318216) is retrieved if I filter by drawid = 318220
>>>             >>
>>>             >> Specially if this happens, you may have some slightly bad disks/ram/
>>>             >> leading to this kind of problems.
>>>             >>
>>>             >
>>>             > Could be. I also had some issues with an rsync between db3 and drdb a week
>>>             > ago that did not complete for bigger files (> 200MB) and gave me some
>>>             > corruption messages. Then the system was rebooted and everything seemed
>>>             > fine but apparently it is not.
>>>             > I am planning to drop & create the table from a good backup and if that does
>>>             > not fix the issue then I will rebuild the server.
>>>
>>>             I would check whatever logs you can ( syslog or eventlog, smart log,
>>>             etc.. ) hunting for disk errors ( sometimes they are reported ). This
>>>             kind of problems, with programs as tested as postgres and rsync, tend
>>>             to indicate controller/RAM/disk going bad ( in your case it could be
>>>             caused by a single bit getting flipped in a sector for the data
>>>             portion of the table, and not being propagated either because it
>>>             happened after your sync of drdb or because it was synced from the WAL
>>>             and not the table, or because it was read from the disk cache ).
>>>
>>>         I agree, unfortunately I did not find any clues about corruption or any
>>>         anomalies in the logs.
>>>         I will work tonight to rebuild that table and see where I go from there.
>>>
>>>     The db3 database is on a different machine from all the other
>>>     databases you set up, correct?
>>>
>>> Yes, they are all different vms first 3 dbs are on the same cluster but
>>> drdb is a remote machine,
>>>
>>
>> Aah, another player in the mix.
>>
>> What virtualization technology are you using?
>>
>
> kvm
>
Sorry, I should add more info: kernel 4.7, and the filesystem is xfs vs ext3/ext4

>>> Thank you
>>>
>>>         Thanks,
>>>         ioana
>>>
>>>             Francisco Olarte.
>>>
>>>     --
>>>     Adrian Klaver
>>>     adrian.kla...@aklaver.com
>>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>>
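
For reference, the two drawid values quoted in the thread (318216 and 318220, i.e. 4db08 and 4db0c in hex) differ in exactly one bit, which is consistent with the single-bit-flip possibility Francisco raises. A minimal Python sketch of that check follows; the helper name `differing_bits` is illustrative only and does not come from the thread, and the result says nothing about where such a flip might have occurred.

```python
def differing_bits(a: int, b: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits where a and b differ."""
    x = a ^ b  # XOR leaves a 1 in every position where the two values disagree
    return [i for i in range(x.bit_length()) if (x >> i) & 1]


# The two drawid values mentioned in the thread.
a, b = 318216, 318220
print(hex(a), hex(b))        # 0x4db08 0x4db0c
print(differing_bits(a, b))  # [2]
```

A single entry in the returned list means the two values are one bit apart; two unrelated ids would usually differ in several positions.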