Hello everyone,

We are running four Dell R430 servers for firewalling, NAT, accounting, etc.
for a student network (approx. 5,200 users). We use pf and authpf. Servers 1
and 2 form one CARP cluster (cluster A), servers 3 and 4 another (cluster B).
All boxes have identical hardware and software configurations. The only
difference is that cluster A runs OpenBSD 6.7 and cluster B runs OpenBSD 7.0.

Every user (i.e. student) on the network has their own individual login (by
ssh directly to one of the boxes) to open a connection to the internet. The
user database on servers 1 and 2 holds approx. 2,600 users; the user database
on cluster B holds the other half.

The creation and updating of user information is scripted. Most of the time we 
just need to update authpf.message to show traffic consumption to the students 
on login:

echo "* UPD (183883)"
echo "---\n\nWelcome to studNET!\n\nYou have a maximum of 600 GB traffic \
available per month.\nYou have already used 9.231 GB in the current month \
(calculated at 2022-08-08 21:02:07) [.....] .\n\n---" \
  >/etc/authpf/users/183883/authpf.message || error_handler
echo "... authpf-file /etc/authpf/users/183883/authpf.message generated"
if [ $USER_ERROR -eq 0 ]
then
  echo "* UPD (183883|dummyuser, dummyuser) ... success"
else
  echo "* UPD (183883| dummyuser, dummyuser) ... failed"
fi

This chunk of code is repeated roughly 2,000 times. The script file is
generated twice a day and run by cron.
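
For anyone who wants to reproduce the workload, the repeated chunk can be
sketched as a single shell function. This is only an illustration, not our
real generator: the names AUTHPF_BASE and update_message and the argument
order are made up for this example, the message text is abbreviated, and
printf is used instead of echo so the sketch behaves the same in other shells.

```shell
#!/bin/sh
# Sketch only. AUTHPF_BASE, update_message and the argument order are
# hypothetical names chosen for this example.
AUTHPF_BASE="${AUTHPF_BASE:-/etc/authpf/users}"
USER_ERROR=0

# Same role as in the generated script: remember that a write failed.
error_handler() {
    USER_ERROR=1
}

# update_message USER_ID USED_GB TIMESTAMP
# Rewrites authpf.message for one user and reports success or failure.
update_message() {
    user_id=$1
    used_gb=$2
    stamp=$3
    USER_ERROR=0

    echo "* UPD ($user_id)"
    # One printf argument per output line; empty arguments give blank lines.
    printf '%s\n' \
        '---' \
        '' \
        'Welcome to studNET!' \
        '' \
        'You have a maximum of 600 GB traffic available per month.' \
        "You have already used $used_gb GB in the current month (calculated at $stamp)." \
        '' \
        '---' \
        >"$AUTHPF_BASE/$user_id/authpf.message" || error_handler

    if [ "$USER_ERROR" -eq 0 ]; then
        echo "* UPD ($user_id) ... success"
    else
        echo "* UPD ($user_id) ... failed"
    fi
}
```

A call such as update_message 183883 9.231 "2022-08-08 21:02:07" then
corresponds to one of the roughly 2,000 chunks in the generated script.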

*Problem*
About once a month server 3 or 4 crashes: it simply freezes. Sometimes a
reboot helps, but often the freeze additionally comes with a corrupt user
database (the system won't start, user root not found). When this happens we
have to manually restore a working master.passwd and run pwd_mkdb. Because
the systems freeze, there are no helpful log entries or anything similar. The
only thing we know for sure is that *when* it happens, it is always *after*
the script has run, and so far it has never happened on server 1 or 2 (6.7).

*Question*
As the problem seems to be caused by the execution of the script, the
question is why this happens. Heavy I/O, or some bug in the hard disk driver?
Does anyone have a clue why the system crashes and why even the user database
gets corrupted in our setup?

Best regards,
Martin Miethe
