Hi,

The first thing that comes to mind with data unavailability or inconsistencies
after a power outage is that some dirty data may have been lost along the IO
path before reaching persistent storage. This can happen, for example, with
non-enterprise-grade SSDs that use a non-persistent (volatile) write cache, or
with HDDs whose disk write buffer was left enabled.

With that said, have you tried to deep-scrub the PG from which you can't 
retrieve data? What's the status of this PG now? Did it recover?
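
For example (PG id 1.2ab is hypothetical; substitute the PG holding the
missing object):

    ceph pg deep-scrub 1.2ab
    # once the deep-scrub completes, check whether anything was flagged
    ceph health detail | grep 1.2ab
    rados list-inconsistent-obj 1.2ab --format=json-pretty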

Regards,
Frédéric.


________________________________
From: [email protected]
Sent: Wednesday, July 31, 2024 05:49
To: ceph-users
Subject: [ceph-users] Please guide us in identifying the cause of the data miss
in EC pool

Dear Ceph team,

On July 13th at 4:55 AM, our data center experienced a significant power
outage, causing a large number of OSDs to power off and restart (821 of 1172
OSDs went down). Approximately two hours later, all OSDs had successfully
started and the cluster resumed service. However, around 6 PM the business
department reported that some files which had been successfully written (via
the RGW service) were failing to download, and that the number of such files
was quite significant. Consequently, we began a series of investigations:


1. The incident occurred at 04:55. At 05:01, we set the noout, nobackfill, and
norecover flags. At 06:22, we executed `ceph osd pause`. By 07:23, all OSDs
were up and in, and we then executed `ceph osd unpause`.
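
For reference, the commands were along these lines:

    ceph osd set noout        # 05:01 - keep OSDs from being marked out
    ceph osd set nobackfill   # 05:01 - suspend backfill
    ceph osd set norecover    # 05:01 - suspend recovery
    ceph osd pause            # 06:22 - pause all client IO
    ceph osd unpause          # 07:23 - resume IO once all OSDs were up and in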


2. We randomly selected a problematic file and attempted to download it via the 
S3 API. The RGW returned "No such key". 
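
For example, via awscli (the endpoint, bucket, and key shown here are
placeholders):

    aws s3api get-object --endpoint-url http://rgw.example.com \
        --bucket <bucket> --key <object-key> /tmp/out
    # -> An error occurred (NoSuchKey) when calling the GetObject operation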


3. The RGW download logs showed op status=-2 (ENOENT), http_status=200. We also
checked the upload logs, which showed op status=0, http_status=200 at
2024-07-13 04:19:20.052.


4. We set debug_rgw=20 and attempted to download the file again. We found that
one 4 MB chunk (the file is 64 MB in total) failed to be read.
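
The debug level was raised at runtime, for example (the daemon socket name is
a placeholder):

    ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_rgw 20
    # or via the central config store
    ceph config set client.rgw debug_rgw 20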


5. Running `rados get` on this chunk returned "No such file or directory".
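
Roughly as follows (the default RGW data pool name and the object name are
illustrative):

    rados -p default.rgw.buckets.data get <chunk-object-name> /tmp/chunk
    # -> error getting default.rgw.buckets.data/<chunk-object-name>:
    #    (2) No such file or directory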


6. Setting debug_osd=20, we observed get_object_context: obc NOT found in 
cache. 
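
For example (the OSD id is a placeholder; injectargs applies the change at
runtime):

    ceph tell osd.<id> injectargs '--debug_osd 20'
    ceph tell osd.<id> injectargs '--debug_bluestore 20'   # for step 7 below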


7. Setting debug_bluestore=20, we saw get_onode oid xxx, key xxx != 
'0xfffffffffffffffeffffffffffffffff'o'. 


8. We stopped the primary OSD and tried to get the file again, but the result 
was the same. The object’s corresponding PG state was 
active+recovery_wait+degraded. 
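
We checked the mapping and state along these lines (the PG id is a
placeholder):

    ceph pg map <pgid>                     # up/acting sets after stopping the primary
    ceph pg <pgid> query | grep '"state"'  # -> active+recovery_wait+degraded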


9. Using `ceph-objectstore-tool` with `--op list` and `--op log`, we could not
find the object's information. Inspecting the RocksDB with `ceph-kvstore-tool`
also did not reveal anything new.
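
The offline inspection was done against stopped OSDs, roughly as follows
(paths, ids, and names are placeholders):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --pgid <pgid> --op list | grep <object-name>
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --pgid <pgid> --op log
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> list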


10. If an OSD had lost data, we would have expected the PG to report unfound
objects or to become inconsistent.
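
That is, a loss like this would normally be flagged by checks such as (the PG
id is a placeholder):

    ceph health detail | egrep 'unfound|inconsistent'
    ceph pg dump_stuck
    rados list-inconsistent-obj <pgid>   # requires a prior (deep-)scrub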


11. We started reanalyzing the startup logs of the OSDs belonging to the PG.
The pool uses an erasure-code profile with k=6, m=3, so each PG spans 9 OSDs.
Six of these OSDs had restarted, and after peering the PG state became active.
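
The profile and peering history can be checked with commands like (pool,
profile, and PG names are placeholders):

    ceph osd pool get <pool> erasure_code_profile
    ceph osd erasure-code-profile get <profile>   # expect k=6 m=3
    ceph pg <pgid> query   # the recovery_state section records the peering history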


12. We sorted the lost files by upload time: all of them had been uploaded
before the failure occurred. The earliest upload time was around 1 AM, and the
successful upload records could be found in the RGW logs.


13. We have submitted an issue on the Ceph issue
tracker: https://tracker.ceph.com/issues/66942; it includes the original logs
needed for troubleshooting. However, four days have passed without any
response. In desperation, we are sending this email, hoping that someone from
the Ceph team can guide us as soon as possible.


We are currently in a difficult situation and hope you can provide guidance. 
Thank you. 



Best regards. 





[email protected] 
[email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]