I had a similar experience recently, with a totally different cause.
Someone had repartitioned the hard disk. The Data.fs was at /home/somewhere/Data.fs on the root partition. The hard disk then got a separate /home partition, but the zeoserver was not restarted. All seemed to keep working: Zope kept writing to the file on the root partition. After a zeoserver restart it found the file on the new /home partition, which only had the data from the time the repartitioning happened.

I am not sure if they managed to recover the data by temporarily undoing the /home partition: unmounting it should bring back the root partition file that had the recent changes.

Your problem seems different.  I just wanted to add an extra data point.

I have not seen your exact problem with zeopack before.

Come to think of it, we rotate the logs weekly and have an extra line in the logrotate.conf that sends a USR2 kill signal to the zeoserver so that it reopens the log files. I wonder if it reopens the Data.fs too.
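
One way to test that (a sketch, run from the buildout directory right after a zeopack; the pid file path matches the logrotate part below, and var/filestorage is assumed to be the default storage location):

    # inode the zeoserver currently has open versus the inode of the file on disk
    lsof -p $(cat var/zeoserver.pid) | grep Data.fs
    ls -i var/filestorage/Data.fs
    # send the same signal logrotate sends, then compare again; if the open inode
    # now matches the file on disk, USR2 made it reopen the storage file as well
    kill -USR2 $(cat var/zeoserver.pid)
    lsof -p $(cat var/zeoserver.pid) | grep Data.fs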

But if this were the problem, I would expect you to have encountered it a lot more often than five times in the past two years. I wonder if there is a race condition in the zeopack script that triggers this.

Just some thoughts.

Maurits

P.S. The buildout part that generates the logrotate.conf file:

[logrotate.conf]
recipe = zc.recipe.deployment:configuration
text =
    rotate 4
    weekly
    create
    compress
    delaycompress

    ${buildout:directory}/var/log/zeoclient*.log {
        sharedscripts
        postrotate
            /bin/kill -USR2 $(cat ${buildout:directory}/var/zeoclient.pid)
        endscript
    }

    ${buildout:directory}/var/log/zeoserver.log {
        postrotate
            /bin/kill -USR2 $(cat ${buildout:directory}/var/zeoserver.pid)
        endscript
    }

Then link to this file in /etc/logrotate.d/
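
For example (a sketch; where zc.recipe.deployment writes the generated file depends on your setup, so the source path here is an assumption):

    # symlink the generated file into logrotate's include directory
    ln -s /path/to/buildout/etc/logrotate.conf /etc/logrotate.d/plone-zeo
    # optional: dry run to check that logrotate parses it
    logrotate -d /etc/logrotate.d/plone-zeo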

Tom Cameron wrote on 27-05-15 at 06:03:
I've sent this to the Zope list too, but figured many people use Plone so
I may get some feedback here as well.

We’ve been running Zope servers for about 15 years. In the last few
years we have had a number of Zope servers instantly lose data upon
restart. This has happened to us about 5 times over the past 2 years on
4 different servers.

Upon restarting Zope we notice a rollback of data, sometimes months old,
and a Data.fs with a much older timestamp than expected. At that point
the data is irretrievable: it is not in any backups or anywhere on the
file system.

We have finally worked out what is happening but not exactly why. Below
is a detailed explanation of what we are seeing.

We are keen to know if anybody else is having similar issues, or if
perhaps we have set up something incorrectly and need to change how we
pack the database. This may also serve as a warning to others.

*Explanation*
The issue is that Zope ends up holding a stale file descriptor: a Data.fs
still remains in the folder, but Zope keeps writing to a (deleted) file,
which can be seen with lsof. Every server affected by this was running ZEO
and had zeopack in the default configuration. We believe that, for some
reason, when zeopack packs the ZODB and the old Data.fs is renamed to
Data.fs.old and eventually deleted, Zope does not recognize the new file.

This is what we believe happens:

1. Zope is connected to a file Data.fs.
2. zeopack packs this file into a new file Data.fs.tmp
3. zeopack deletes any file named Data.fs.old
4. zeopack then renames Data.fs to Data.fs.old
5. zeopack then renames Data.fs.tmp to Data.fs

The problem is that Zope is connected directly to the Data.fs file via
its open file handle, so even though the file is renamed, Zope continues
to write to what is now Data.fs.old.

Later on, we do a subsequent zeopack and the Data.fs.old file is deleted
in the process, but Zope keeps writing to this deleted file.

We now have a situation where Zope is connected to a deleted file and is
writing all its data to it. If Zope is ever stopped, the connection is
lost and so effectively is the file.
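
The underlying behaviour is easy to reproduce in isolation with a plain
shell, nothing Zope-specific (a standalone sketch; run it in a scratch
directory):

    echo packed > Data.fs.tmp   # stand-in for the freshly packed file
    exec 3> Data.fs             # keep a write handle open, as the zeoserver does
    mv Data.fs Data.fs.old      # zeopack renames the old storage aside...
    mv Data.fs.tmp Data.fs      # ...and swaps in the packed file
    rm Data.fs.old              # a later pack run deletes the .old file
    echo "new transaction" >&3  # the held handle still writes, into the unlinked inode
    ls -l /proc/$$/fd/3         # shows "... Data.fs.old (deleted)"
    exec 3>&-                   # closing the handle loses that data for good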

It is true that expert forensics could retrieve the file, but in the
first few instances this happened we were totally baffled as to where
the data had gone. In one case we had not restarted Zope for 4 months,
and thus we lost 4 months of data.

*Environment*
We are running standard Plone servers on Debian 5.0.10 in this case,
with the following:

  * Plone 4.3.4.1 (4309)
  * CMF 2.2.8
  * Zope 2.13.22
  * Python 2.7.8 (default, Jan 27 2015, 14:19:28) [GCC 4.3.2]
  * PIL 2.3.0 (Pillow)
  * ZEO: plone.recipe.zeoserver 1.2.6

The effective-user is set correctly for this instance (“landcarevic”) and
the Data.fs mounts properly under normal circumstances.

*Detailed Symptoms*
This is a real case that we discovered and where we managed to retrieve the lost file.

Zope will continue to write to a Data.fs that has been technically
deleted. We are able to notice this because the Data.fs.tmp will have a
much more recent timestamp than the Data.fs in the var/filestorage
directory.

-rw-r--r-- 1 landcarevic psacln  6735952811 2015-04-12 00:11 Data.fs
-rw-r--r-- 1 landcarevic psacln    18594749 2015-04-12 00:11 Data.fs.index
-rw-r--r-- 1 landcarevic psacln           6 2015-03-30 11:20 Data.fs.lock
-rw-r--r-- 1 landcarevic psaserv   14600344 2015-05-26 10:27 Data.fs.tmp


An “lsof | grep Data.fs” can show if the file descriptor is pointing to
a deleted file on the fs:

python2.7  9307  landcarevic    7uW  REG  0,31           6  4986845  /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.lock

python2.7  9307  landcarevic   11u   REG  0,31    14600344  4986235  /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.tmp

python2.7  9307  landcarevic   12u   REG  0,31  7058741434  4986349  (deleted) /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.old


The process id for the instance is 9307 and it is still holding open a
Data.fs.old that has been deleted, most likely due to an earlier ZEO pack
from cron in this case: the pack writes a new Data.fs and moves the old
one to Data.fs.old, which eventually gets deleted, but Zope never
properly recognises the new Data.fs for some reason.
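
One way to catch this condition before a restart makes it fatal is a
periodic check for open-but-deleted files held by the ZEO server process
(a sketch; the pid file path is an assumption, adjust to your layout):

    # +L1 lists open files whose link count is zero, i.e. deleted but still held
    # open; any hit on Data.fs means the server is writing to a storage file
    # that is gone from disk
    lsof +L1 -p $(cat /path/to/buildout/var/zeoserver.pid) | grep Data.fs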

*Recovery*
As long as the Zope server doesn’t get shut down or restarted, it will
hold this open and continue to write to the disk. It can be recovered by
copying from /proc where /proc/{process ID}/fd/{file descriptor ID} can
be found from running lsof as previously:

# cp /proc/9307/fd/12 /home/alyssa/Data.fs-current

# ls -al /home/alyssa/

...

-rw-r--r--  1 root root 7058741434 2015-05-26 11:11 Data.fs-current


This copy has the correct timestamp and file size.
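
If you then want to put the recovered copy back (a sketch; it assumes the
zeoserver is stopped only after the copy above has been made, and that
rebuilding the index on the next start is acceptable):

    # stop the zeoserver, then swap in the recovered file
    cp /home/alyssa/Data.fs-current /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs
    chown landcarevic:psacln /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs
    # remove the stale index so FileStorage rebuilds it on the next start
    rm /var/www/vhosts/landcarevic.net.au/plone43/var/filestorage/Data.fs.index
    # restart the zeoserver and the clients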




--
*Tom Cameron*

--
Maurits van Rees: http://maurits.vanrees.org/
Zest Software: http://zestsoftware.nl
