[CentOS] filesystem corruption?

2015-04-06 Thread m . roth
Got an older server here, running CentOS 6.6 (64-bit). Suddenly, at
0-dark-30 yesterday morning, we had failures to connect.

After several tries to reboot and get it working, I tried yum update, and
that failed, complaining of a Python krb5 error. With more investigation,
I discovered that logins were failing because of a problem with pam: it
couldn't open /lib64/security/pam_permit.so. The reason was that it was a
broken symlink, pointing to a file in the same directory that actually
existed in /lib64. Checking other systems, I found it should, in fact, be
a file, not a symlink.

At this point, the system was considered suspect. I brought the system
down, replaced the root drive, and rebuilt. I was not able to build it as
CentOS 7, as something in the older hardware broke the install. CentOS 6
built successfully, and the server was returned to service.

I then loaded the drive in another server and examined it. fsck reported
both / and /boot were clean, but when I redid this with fsck -c, to check
for bad blocks, it found many multiply-claimed blocks.

First question: anyone have an idea why it showed as clean, until I
checked for bad blocks? Would that just be because I'd gracefully shut
down the original server, and it mounted ok on the other server?

Mounting it on /mnt, I found no driver errors reported in the logs, nor
anything happening, including logins, before an automated contact from
another server, which failed. I also checked our loghost, and nothing odd
shows there, in either messages or secure.

At this point, I *think* it's filesystem corruption, rather than a
compromised system, but I'd really like to hear anyone's thoughts on this.

  mark



___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] filesystem corruption?

2015-04-06 Thread Valeri Galtsev

On Mon, April 6, 2015 4:37 pm, m.r...@5-cent.us wrote:
> Got an older server here, running CentOS 6.6 (64-bit). Suddenly, at
> 0-dark-30 yesterday morning, we had failures to connect.
>
> After several tries to reboot and get it working, I tried yum update, and
> that failed, complaining of a Python krb5 error. With more investigation,
> I discovered that logins were failing because of a problem with pam: it
> couldn't open /lib64/security/pam_permit.so. The reason was that it was a
> broken symlink, pointing to a file in the same directory that actually
> existed in /lib64. Checking other systems, I found it should, in fact, be
> a file, not a symlink.
>
> At this point, the system was considered suspect. I brought the system
> down, replaced the root drive, and rebuilt. I was not able to build it as
> CentOS 7, as something in the older hardware broke the install. CentOS 6
> built successfully, and the server was returned to service.
>
> I then loaded the drive in another server and examined it. fsck reported
> both / and /boot were clean, but when I redid this with fsck -c, to check
> for bad blocks, it found many multiply-claimed blocks.
>
> First question: anyone have an idea why it showed as clean, until I
> checked for bad blocks? Would that just be because I'd gracefully shut
> down the original server, and it mounted ok on the other server?
>
> Mounting it on /mnt, I found no driver errors reported in the logs, nor
> anything happening, including logins, before an automated contact from
> another server, which failed. I also checked our loghost, and nothing odd
> shows there, in either messages or secure.
>
> At this point, I *think* it's filesystem corruption, rather than a
> compromised system, but I'd really like to hear anyone's thoughts on this.
>
>   mark


Someone has suggested reformatting the disk. Before doing that, you may
want to make an image of the whole drive as it is now: dd the whole device
into a file (somewhere on a large enough filesystem). I would definitely
have done that before even running fsck or badblocks (BTW, badblocks has a
non-destructive mode), though it is too late to mention that now. You may
need this image for future forensics.
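
A minimal sketch of that imaging step, under assumptions: image_drive is a
made-up helper name, and /dev/sdX and /bigfs/suspect.img are placeholders
for the real device and a destination with enough free space. It uses plain
dd plus a checksum so that later copies of the image can be verified.

```shell
# Hypothetical helper (not an existing tool): copy a whole device into an
# image file and record the image's SHA-256 next to it.
image_drive() {
    src=$1    # whole device, e.g. /dev/sdX (placeholder), not a partition
    img=$2    # destination file, e.g. /bigfs/suspect.img (placeholder)

    # Plain sequential copy. dd stops at the first read error, so a drive
    # that is physically failing may need a different approach.
    dd if="$src" of="$img" bs=1M || return 1

    # Store only the hash, so it can be compared against copies later.
    sha256sum "$img" | awk '{ print $1 }' > "$img.sha256"

    # Remove write permission to make accidental changes less likely.
    chmod a-w "$img"
}

# Usage (as root, with the drive's filesystems unmounted):
#   image_drive /dev/sdX /bigfs/suspect.img
```

Working copies can then be made from the image file, and each one checked
against the recorded hash before use.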

The best would be to have had a system integrity suite installed before the
bad event; then you would be able to tell what exactly changed (and
approximately when). Alas, you don't seem to have that option. You should
be able to use a backup as a sort of replacement for that (hopefully you
back up the system area as well). I would restore everything from the
closest date before the event and compare it with what you see on the
mounted image(s) of your drive (I would definitely work with a copy of a
copy of the image, leaving the original intact). I would definitely mount
them read-only, without replaying the journal. Look through the logs to see
what kinds of events you find there. Check that the logs were not tampered
with (chkrootkit may be your friend). Look at who logged in when, for how
long (and from where!), and see whether there is any correlation with
segfaults or kernel oopses, or whether any kernel modules were loaded
(should they have been loaded all of a sudden?). In any case, if you don't
do forensics often, take a forensics guide and follow it. It may take a
couple of weeks, depending on how busy you are in general. Good luck with
that.
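
The backup-versus-image comparison could be sketched roughly as follows.
The names manifest and compare_trees are invented for this example, and the
paths in the usage comments are placeholders. The noload mount option
(ext3/ext4) is what skips journal replay on a read-only mount.

```shell
# Hypothetical helper: print "checksum  ./relative/path" for every regular
# file under a directory, sorted by path so two manifests line up.
manifest() {
    ( cd "$1" && find . -xdev -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# Hypothetical helper: diff the manifests of two trees. Prints nothing and
# returns 0 when every regular file matches; otherwise prints the
# differences. A file that became a symlink shows up as missing on that
# side, because only regular files are listed.
compare_trees() {
    a=$(mktemp)
    b=$(mktemp)
    manifest "$1" > "$a"
    manifest "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Usage (placeholders; work on a copy of the image, as root):
#   mount -o ro,noload /dev/sdX1 /mnt/suspect
#   compare_trees /restore/lib64 /mnt/suspect/lib64
#   last -f /mnt/suspect/var/log/wtmp     # who logged in, when, from where
```

Files that legitimately changed between the backup date and the event will
also show up, so the output is a starting list to review, not a verdict.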

Hardware (drive) hypothesis. It is very attractive, so I would kick myself
to keep wishful thinking from taking over. But if bad blocks were indeed
detected, this is quite likely your case. Again, the logs should have
records, as the drive will report its hardware events. I would also check
the SMART status of the drive. Try to get some information from the drive
(hdparm comes to mind; careful, you don't want to change anything, which is
what hdparm is mostly used for, just collect info). After everything else
has been tried, I would run a hard drive fitness test (vendors have
downloadable utilities). BTW, what is the model/manufacturer of the drive?
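
One possible way to collect that information without changing anything,
as a sketch: smart_sector_warnings is a made-up helper name and /dev/sdX is
a placeholder. It assumes the usual `smartctl -A` attribute table, where
the attribute name is the second column and the raw value is the tenth.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# sector-related attributes whose raw value is non-zero.
smart_sector_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage (read-only queries, as root):
#   smartctl -H /dev/sdX                          # overall health verdict
#   smartctl -A /dev/sdX | smart_sector_warnings  # non-zero sector counters
#   hdparm -I /dev/sdX                            # -I only prints drive identification
```

Attribute names vary between vendors, so an empty result from the filter
does not by itself mean the drive is healthy.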

[There is one more possibility, which is unlikely to be your case: bad
memory, or just a small memory error in a really bad place that caused big
consequences. A reboot would resolve the trouble, so it is unlikely to be
your case. But if this hits a specific place in RAM, it can cause
filesystem corruption as well...]

Good luck! Let us know what you find out.

Valeri


Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247



Re: [CentOS] filesystem corruption?

2015-04-06 Thread Keith Keller
On 2015-04-07, Valeri Galtsev galt...@kicp.uchicago.edu wrote:

> before even running fsck or badblocks (BTW, badblocks has
> non-destructive mode) - too late to mention now. You may need this image
> for future forensics.

e2fsck -c will run badblocks in read-only mode, so it may not be too
late.

--keith

-- 
kkel...@wombat.san-francisco.ca.us




Re: [CentOS] filesystem corruption?

2015-04-06 Thread Nataraj
On 04/06/2015 02:37 PM, m.r...@5-cent.us wrote:

> I then loaded the drive in another server and examined it. fsck reported
> both / and /boot were clean, but when I redid this with fsck -c, to check
> for bad blocks, it found many multiply-claimed blocks.

Just running fsck with no arguments will not do anything unless the
filesystem is unclean or the time interval between checks has expired. 
I suspect that fsck -f would have found problems as well.
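
As a rough illustration of the above, the fields that decide whether a
check is due can be pulled from `tune2fs -l`, and a forced check can be run
without writing to the disk. fsck_due_fields is a made-up helper name and
/dev/sdX1 is a placeholder.

```shell
# Hypothetical helper: filter `tune2fs -l` output down to the state and
# check-scheduling fields of an ext2/3/4 filesystem.
fsck_due_fields() {
    grep -E '^(Filesystem state|Mount count|Maximum mount count|Last checked|Check interval):'
}

# Usage (as root, filesystem unmounted):
#   tune2fs -l /dev/sdX1 | fsck_due_fields
#   e2fsck -fn /dev/sdX1   # -f: check even if marked clean
#                          # -n: open read-only and answer "no" to every question
```

With -n nothing is repaired, so the same command can be repeated on a copy
of the image before deciding whether to let e2fsck fix anything.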

Time will tell if there is a hardware problem with the system, but I
would probably run some hardware diagnostics on the server including
memory and IO tests if you wanted to be on the safe side.  You could
also reformat the disk and run some write/readback diagnostics if you
wanted to find out if the disk is bad.

Nataraj
