Re: [Gluster-users] Geo-Replication - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 78: surrogates not allowed

2021-02-26 Thread Dietmar Putz

Hi Andreas,

I recently ran into the same fault. I'm fairly sure you speak German, 
so a translation should not be necessary.


I found the cause by tracing one of the processes that has gsyncd.log 
open and then looking backward from the error until I found some 
lgetxattr function calls. In the corresponding directory I found some 
filenames with 'special' characters. Renaming them fixed the problem.
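
One way to locate such names ahead of time is to walk the directory tree with byte paths and report every entry whose name is not valid UTF-8. This is only a sketch: it assumes the 'special' characters are bytes in a legacy encoding such as Latin-1, and the function name `find_non_utf8_names` and the path you pass to it are illustrative, not part of GlusterFS.

```python
import os
import sys


def find_non_utf8_names(root):
    """Yield (directory, name) pairs, both as bytes, for names that are not valid UTF-8."""
    # A bytes root makes os.walk return raw bytes names, so nothing is
    # decoded (or surrogate-escaped) before we get to look at it.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                yield dirpath, name


if __name__ == "__main__" and len(sys.argv) > 1:
    # The directory to scan is passed on the command line (hypothetical usage).
    for dirpath, name in find_non_utf8_names(sys.argv[1]):
        # repr() keeps the offending bytes visible and printable, e.g. b'gr\xfcn.txt'
        print(repr(dirpath), repr(name))
```

Printing the `repr()` of the bytes avoids hitting the same encoding error while reporting the names.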


Below is 'my' history and solution for UnicodeEncodeError and 
UnicodeDecodeError. Hope it helps. By the way, we are running 
GlusterFS 7.9 on Ubuntu 18.04.



best regards

Dietmar



Script for tracing geo-replication:





[ 07:35:09 ] - root@gl-master-05  ~/tmp/geo-rep $cat trace_gf.sh
#!/bin/bash
#
# Script for tracing geo-replication activity.
# The script needs a PID as its first argument.
# Intended for tracing the parent PID of the master process on gsyncd.log,
# in this example PID 13620.
#
#
#[ 16:19:24 ] - root@gl-master-05 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1 $lsof gsyncd.log
#COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
#python3 13021 root    3w   REG    8,2  2905607 9572924 gsyncd.log
#python3 13619 root    3w   REG    8,2  2905607 9572924 gsyncd.log
#python3 13620 root    3w   REG    8,2  2905607 9572924 gsyncd.log
#[ 16:19:27 ] - root@gl-master-05 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1 $
#
#gf_log="/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log"


tr_out="/root/tmp/geo-rep/trace-$(date +"%H_%M_%S_%d_%m_%Y").out"

echo "tr_out : $tr_out"
#pid=$(lsof "$gf_log" | grep -v COMMAND | head -1 | awk '{print $2}')
PID=$1
echo "pid : $PID"

# Abort if no PID was given or the process is not running
if [ -z "$PID" ] || ! ps -p "$PID" > /dev/null 2>&1
then
    echo "PID $PID not running"
    exit 1
fi

nohup strace -t -f -s 256 -o "$tr_out" -p "$PID" &
# nohup execs strace, so $! is the PID of strace itself
PID_STRACE=$!
echo "PID of strace : $PID_STRACE"

while true
do
    # Size of the trace file in bytes (0 if it does not exist yet)
    filesize=$(stat -c %s "$tr_out" 2>/dev/null || echo 0)
    if [ "$filesize" -gt 10 ]
    then
        if ps -p "$PID" > /dev/null 2>&1
        then
            # Target still alive: throw the trace away and start a fresh one
            kill -9 "$PID_STRACE"
            sleep 1
            rm -f "$tr_out"
            nohup strace -t -f -s 256 -o "$tr_out" -p "$PID" &
            PID_STRACE=$!
            echo "PID of strace : $PID_STRACE"
        else
            echo "PID $PID is no longer running"
            exit
        fi
    fi
    if ! ps -p "$PID" > /dev/null 2>&1
    then
        echo "PID $PID is no longer running..."
        exit
    fi
    sleep 120
    echo "$(date) : $(ls -lh "$tr_out")"
done

-- 



Regarding approach 2 (see below):

For this error it is enough to trace the 'last' process. Here that is 
1236, not 13021. 13021 is the 'mother' process; after an error the other 
two are killed and restarted with new PIDs. This is what I observed:


[ 13:00:04 ] - root@gl-master-05  ~/tmp/geo-rep/15 $lsof /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
python3  1235 root    3w   REG    8,2  2857996 9572924 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
python3  1236 root    3w   REG    8,2  2857996 9572924 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
python3 13021 root    3w   REG    8,2  2857996 9572924 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
[ 13:00:18 ] - root@gl-master-05  ~/tmp/geo-rep/15 $

[ 13:00:10 ] - root@gl-master-05  ~/tmp/geo-rep $strace -t -f -s 256 -o /root/tmp/geo-rep/gsyncd1.out -p1236


To keep the file from growing too large, you can repeatedly kill strace, 
delete the file, and restart strace. Bad luck, of course, if the error 
occurs at exactly that moment. The file quickly reaches 1 GB or more 
(in about 10 minutes, depending on activity) and many millions of 
lines...


Watch the geo-replication log; killing the above-mentioned PID is not 
necessary, though. The process ends on error, and the trace ends with it.
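
Once the trace has ended, looking backward from the error for lgetxattr calls can be done without loading the whole file. The following is a rough sketch, not a polished tool: the function name `last_lgetxattr_calls` and the `keep` parameter are made up for illustration, and it simply assumes that every relevant line of the strace output contains the text `lgetxattr(`.

```python
from collections import deque


def last_lgetxattr_calls(trace_path, keep=20):
    """Return the last `keep` lines of an strace output file that mention lgetxattr."""
    # A bounded deque drops older matches automatically,
    # so only the calls closest to the end of the trace survive.
    recent = deque(maxlen=keep)
    # Stream line by line: the trace can be a gigabyte or more.
    with open(trace_path, "r", encoding="ascii", errors="replace") as trace:
        for line in trace:
            if "lgetxattr(" in line:
                recent.append(line.rstrip("\n"))
    return list(recent)
```

Calling it on the trace file written by strace (for example the gsyncd1.out from the command above) returns the last few lgetxattr lines, whose path arguments point to the directory worth inspecting. In my understanding strace usually prints non-ASCII bytes in paths as octal escapes, so a Latin-1 'ü' would typically appear as `\374`.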


[ 12:32:04 ] - root@gl-master-05 /var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1 $tail -f gsyncd.log


...

[2021-02-11 12:53:59.530649] I [master(worker /brick1/mvol1):1441:process] _GMaster: Batch Completed mode=xsync    duration=178.4717    changelog_start=1613041474 changelog_end=1613041474    num_changelogs=1    stime=None entry_stime=None
[2021-02-11 12:53:59.639853] I [master(worker /brick1/mvol1):1681:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync/XSYNC-CHANGELOG.1613041477

###
[2021-02-11 13:00:57.149347] E [syncdutils(worker /brick1/mvol1):339:log_raise_exception] : FAIL:

Traceback (most recent call last):

Re: [Gluster-users] Geo-Replication - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 78: surrogates not allowed

2021-02-26 Thread Andreas Kirbach

Hi Dietmar,

thank you for your reply.

I've also started to trace this down and you are correct, the directory 
does contain filenames with 'special' characters (umlauts), but renaming 
them as a workaround unfortunately is not an option.


So the real question is why it fails on those characters, and how to 
fix it so that it does not error out even when such filenames exist.
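
As background on the error message itself, here is a small Python sketch of how a character like '\udcfc' comes about. It assumes the umlaut is stored on disk as the single Latin-1 byte 0xFC; the filename is invented, and the sketch does not show where inside gsyncd the exception is raised.

```python
# A filename as it might be stored on disk in Latin-1: 0xFC is 'ü' there,
# but a lone 0xFC byte is not valid UTF-8. The name itself is made up.
raw_name = b"gr\xfcn.txt"

# Python 3 decodes OS filenames with the 'surrogateescape' error handler,
# which maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
as_str = raw_name.decode("utf-8", errors="surrogateescape")
assert as_str == "gr\udcfcn.txt"

# Encoding that string with the default strict handler fails,
# because lone surrogates are not allowed in UTF-8.
try:
    as_str.encode("utf-8")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
    # 'utf-8' codec can't encode character '\udcfc' in position 2: surrogates not allowed

# Using the same handler on the way out restores the original bytes.
assert as_str.encode("utf-8", errors="surrogateescape") == raw_name
```

So the surrogate is Python's placeholder for a byte it could not decode, and the failure appears when such a string is encoded as strict UTF-8 somewhere along the way.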


Kind regards,
Andreas

On 26.02.2021 at 14:16, Dietmar Putz wrote:
