Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata

Hi

I monitor dmesg on each of the 3 nodes and no hardware issue is reported. And 
the problem happens with various different OSDs on different nodes, so to me 
it is clear it's not a hardware problem.


Thanks for the reply



On 05/03/2018 21:45, Vladimir Prokofev wrote:

> always solved by ceph pg repair 
That doesn't necessarily mean that there's no hardware issue. In my 
case repair also worked fine and returned the cluster to an OK state every 
time, but over time the faulty disk failed another scrub operation, and this 
repeated multiple times before we replaced that disk.
One last thing to look into is dmesg on your OSD nodes. If there's a 
hardware read error it will be logged in dmesg.
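
A minimal shell sketch of that kind of per-node check (the device names below
are placeholders; adjust them to the disks backing your OSDs):

    # look for recent read/medium errors reported by the kernel
    dmesg -T | egrep -i 'I/O error|medium error|ata[0-9]+.*error' | tail -n 20

    # and the SMART attributes that usually accompany scrub read errors
    for dev in /dev/sdb /dev/sdc /dev/sdd; do
        echo "== $dev =="
        smartctl -A "$dev" | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect'
    done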


2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata:


Hi and thanks for the reply

The OSDs are all healthy; in fact, after a ceph pg repair the
ceph health is back to OK and in the OSD log I see: repair
ok, 0 fixed

The SMART data of the 3 OSDs seems fine

*OSD.5*

# ceph-disk list | grep osd.5
  /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2

# smartctl -a /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org 


=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST1000DM003-1SB10C
Serial Number:Z9A1MA1V
LU WWN Device Id: 5 000c50 090c7028b
Firmware Version: CC43
User Capacity:1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate:7200 rpm
Form Factor:  3.5 inches
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:Mon Mar  5 16:17:22 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:  (   0) The previous self-test routine 
completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:(0) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:(   1) minutes.
Extended self-test routine
recommended polling time:( 109) minutes.
Conveyance self-test routine
recommended polling time:(   2) minutes.
SCT capabilities:  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       193297722
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1451132477
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13283
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61

Re: [ceph-users] When all Mons are down, does existing RBD volume continue to work

2018-03-05 Thread Gregory Farnum
On Sun, Mar 4, 2018 at 12:02 AM Mayank Kumar  wrote:

> Ceph Users,
>
> My question is: if all mons are down (I know it's a terrible situation to
> be in), does an existing rbd volume which is mapped to a host and being
> used (read/written to) continue to work?
>
> I understand that it won't get notifications about osdmap changes, etc., but
> assuming nothing fails, do the read/write IOs on the existing rbd volume
> continue to work, or would they start failing?
>

Clients will continue to function if there are transient monitor issues,
but you can't rely on them continuing in a long-term failure scenario.
Eventually *something* will hit a timeout, whether that's an OSD on its
pings, or some kind of key rotation for cephx, or


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-05 Thread Brad Hubbard
On Fri, Mar 2, 2018 at 3:54 PM, Alex Gorbachev  wrote:
> On Thu, Mar 1, 2018 at 10:57 PM, David Turner  wrote:
>> Blocked requests and slow requests are synonyms in ceph. They are 2 names
>> for the exact same thing.
>>
>>
>> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev  
>> wrote:
>>>
>>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner 
>>> wrote:
>>> > `ceph health detail` should show you more information about the slow
>>> > requests.  If the output is too much stuff, you can grep out for blocked
>>> > or
>>> > something.  It should tell you which OSDs are involved, how long they've
>>> > been slow, etc.  The default is for them to show '> 32 sec' but that may
>>> > very well be much longer and `ceph health detail` will show that.
>>>
>>> Hi David,
>>>
>>> Thank you for the reply.  Unfortunately, the health detail only shows
>>> blocked requests.  This seems to be related to a compression setting
>>> on the pool, nothing in OSD logs.
>>>
>>> I replied to another compression thread.  This makes sense since
>>> compression is new, and in the past all such issues were reflected in
>>> OSD logs and related to either network or OSD hardware.
>>>
>>> Regards,
>>> Alex
>>>
>>> >
>>> > On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev 
>>> > wrote:
>>> >>
>>> >> Is there a switch to turn on the display of specific OSD issues?  Or
>>> >> does the below indicate a generic problem, e.g. network and no any
>>> >> specific OSD?
>>> >>
>>> >> 2018-02-28 18:09:36.438300 7f6dead56700  0
>>> >> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
>>> >> total 15997 MB, used 6154 MB, avail 9008 MB
>>> >> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
>>> >> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
>>> >> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >> 2018-02-28 18:09:53.794882 7f6de8551700  0
>>> >> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
>>> >> "status", "format": "json"} v 0) v1
>>> >>
>>> >> --
>
> I was wrong about the pool compression: it does not matter, since an
> uncompressed pool also generates these slow messages.
>
> The question is why there is no subsequent message relating to specific OSDs
> (as in Jewel and prior; for example, this from RH:
>
> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
> 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>
> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
> old, received at {date-time}: osd_op(client.4240.0:8
> benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
> currently waiting for subops from [610]
>
> In comparison, my Luminous cluster only shows the general slow/blocked 
> message:
>
> 2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check failed: 116 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check update: 66 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check update: 5 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
> [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
> blocked > 32 sec)
> 2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
> [INF] : Cluster is now healthy
>
> So where are the details?

Working on this, thanks.

See https://tracker.ceph.com/issues/23236
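
In the meantime, a rough sketch for digging the details out of the OSDs
themselves via the admin socket (run on the OSD host; osd.12 is a placeholder
for whichever OSD you suspect):

    ceph daemon osd.12 dump_ops_in_flight   # ops currently in flight, with their age
    ceph daemon osd.12 dump_blocked_ops     # only the ops counted as blocked/slow
    ceph daemon osd.12 dump_historic_ops    # recently completed slow ops, with per-step timestamps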

>
> Thanks,
> Alex

-- 
Cheers,
Brad


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-05 Thread Alex Gorbachev
On Mon, Mar 5, 2018 at 2:17 PM, Gregory Farnum  wrote:
> On Thu, Mar 1, 2018 at 9:21 AM Max Cuttins  wrote:
>>
>> I think this is a good question for everybody: how hard should it be to
>> delete a pool?
>>
>> We ask to type the pool name twice.
>> We ask to add "--yes-i-really-really-mean-it".
>> We ask to give the mons the ability to delete the pool (and to remove this
>> ability ASAP afterwards).
>>
>> ... and then somebody of course asks us to restore the pool.
>>
>> I think that all this stuff is not looking in the right direction.
>> It's not the administrator that needs to be warned about deleting data.
>> It's the data owner that should be warned (who most of the time gives
>> their approval by phone and is gone).
>>
>>
>> So, all this stuff just makes the life of the administrator harder, while not
>> improving in any way the life of the data owner.
>> Probably the best solution is to ...not delete at all and instead apply
>> a "deletion policy".
>> Something like:
>>
>> ceph osd pool rm POOL_NAME -yes
>> -> POOL_NAME is set to be deleted, removal is scheduled within 30 days.
>>
>>
>> This allows us to do 2 things:
>>
>> allow administrators not to waste their time in the CLI with truly strange
>> commands
>> allow the data owner to have a grace period to verify that, after deletion,
>> everything works as expected and that the data that disappeared wasn't useful
>> in some way.
>>
>> After 30 days the data will be removed automatically. This is a safe policy
>> for ADMIN and DATA OWNER.
>> Of course the ADMIN should be allowed to remove a POOL scheduled for deletion
>> in order to save disk space if needed (but only if needed).
>>
>> What do you think?
>>
>>
>
> You're not wrong, and indeed that's why I pushed back on the latest attempt
> to make deleting pools even more cumbersome.
>
> But having a "trash" concept is also pretty weird. If admins can override it
> to just immediately delete the data (if they need the space), how is that
> different from just being another hoop to jump through? If we want to give
> the data owners a chance to undo, how do we identify and notify *them*
> rather than the admin running the command? But if admins can't override the
> trash and delete immediately, what do we do for things like testing and
> proofs of concept where large-scale data creates and deletes are to be
> expected?

What about using the at command:

ceph osd pool rm   --yes-i-really-really-mean-it | at now + 30 days
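
(A small note on the mechanics: at reads the command to run from stdin, so a
sketch of that idea would look roughly like this, with the pool name as a
placeholder and assuming pool deletion is still permitted when the job fires:)

    echo "ceph osd pool rm mypool mypool --yes-i-really-really-mean-it" | at now + 30 days
    atq      # list the pending deletion during the grace period
    atrm 1   # cancel it (job number taken from atq)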

Regards,
Alex

> -Greg
>


[ceph-users] XFS Metadata corruption while activating OSD

2018-03-05 Thread 赵赵贺东
Hello ceph-users,

It is a really, really, REALLY tough problem for our team. We have investigated
the problem for a long time and tried a lot of things, but we can't solve it;
even the exact cause of the problem is still unclear to us! So any
solution/suggestion/opinion whatsoever will be highly, highly appreciated!!!

Problem summary: when we activate an OSD, there will be metadata corruption on
the disk being activated; the probability is 100%!

Admin node:
Platform: X86
OS:       Ubuntu 16.04
Kernel:   4.12.0
Ceph:     Luminous 12.2.2

OSD nodes:
Platform: armv7
OS:       Ubuntu 14.04
Kernel:   4.4.39
Ceph:     Luminous 12.2.2
Disk:     10T+10T
Memory:   2GB

Deploy log:
root@mnc000:/home/mnvadmin/ceph# ceph-deploy disk zap arms001-01:sda
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.39): /usr/bin/ceph-deploy disk zap 
arms001-01:sda
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : zap
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : 
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : 
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.cli][INFO ] disk : [('arms001-01', '/dev/sda', None)]
[ceph_deploy.osd][DEBUG ] zapping /dev/sda on arms001-01
[arms001-01][DEBUG ] connection detected need for sudo
[arms001-01][DEBUG ] connected to host: arms001-01
[arms001-01][DEBUG ] detect platform information from remote host
[arms001-01][DEBUG ] detect machine type
[arms001-01][DEBUG ] find the location of an executable
[arms001-01][INFO ] Running command: sudo /sbin/initctl version
[arms001-01][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 14.04 trusty
[arms001-01][DEBUG ] zeroing last few blocks of device
[arms001-01][DEBUG ] find the location of an executable
[arms001-01][INFO ] Running command: sudo /usr/local/bin/ceph-disk zap /dev/sda
[arms001-01][WARNIN] 
/usr/local/lib/python2.7/dist-packages/ceph_disk-1.0.0-py2.7.egg/ceph_disk/main.py:5653:
 UserWarning:
[arms001-01][WARNIN] 
***
[arms001-01][WARNIN] This tool is now deprecated in favor of ceph-volume.
[arms001-01][WARNIN] It is recommended to use ceph-volume for OSD deployments. 
For details see:
[arms001-01][WARNIN]
[arms001-01][WARNIN] http://docs.ceph.com/docs/master/ceph-volume/#migrating
[arms001-01][WARNIN]
[arms001-01][WARNIN] 
***
[arms001-01][WARNIN]
[arms001-01][DEBUG ] 4 bytes were erased at offset 0x0 (xfs)
[arms001-01][DEBUG ] they were: 58 46 53 42
[arms001-01][WARNIN] 10+0 records in
[arms001-01][WARNIN] 10+0 records out
[arms001-01][WARNIN] 10485760 bytes (10 MB) copied, 0.0610462 s, 172 MB/s
[arms001-01][WARNIN] 10+0 records in
[arms001-01][WARNIN] 10+0 records out
[arms001-01][WARNIN] 10485760 bytes (10 MB) copied, 0.129642 s, 80.9 MB/s
[arms001-01][WARNIN] Caution: invalid backup GPT header, but valid main header; 
regenerating
[arms001-01][WARNIN] backup header from main header.
[arms001-01][WARNIN]
[arms001-01][WARNIN] Warning! Main and backup partition tables differ! Use the 
'c' and 'e' options
[arms001-01][WARNIN] on the recovery & transformation menu to examine the two 
tables.
[arms001-01][WARNIN]
[arms001-01][WARNIN] Warning! One or more CRCs don't match. You should repair 
the disk!
[arms001-01][WARNIN]
[arms001-01][DEBUG ] 

[arms001-01][DEBUG ] Caution: Found protective or hybrid MBR and corrupt GPT. 
Using GPT, but disk
[arms001-01][DEBUG ] verification and recovery are STRONGLY recommended.
[arms001-01][DEBUG ] 

[arms001-01][DEBUG ] GPT data structures destroyed! You may now partition the 
disk using fdisk or
[arms001-01][DEBUG ] other utilities.
[arms001-01][DEBUG ] Creating new GPT entries.
[arms001-01][DEBUG ] The operation has completed successfully.
[arms001-01][WARNIN] 
/usr/local/lib/python2.7/dist-packages/ceph_disk-1.0.0-py2.7.egg/ceph_disk/main.py:5685:
 UserWarning:
[arms001-01][WARNIN] 
***
[arms001-01][WARNIN] This tool is now deprecated in favor of ceph-volume.
[arms001-01][WARNIN] It is recommended to use ceph-volume for OSD deployments. 
For details see:
[arms001-01][WARNIN]
[arms001-01][WARNIN] http://docs.ceph.com/docs/master/ceph-volume/#migrating
[arms001-01][WARNIN]
[arms001-01][WARNIN] 
***
[arms001-01][WARNIN]


root@mnc000:/home/mnvadmin/ceph# ceph-deploy osd prepare --filestore 
arms001-01:sda

Re: [ceph-users] Ceph SNMP hooks?

2018-03-05 Thread Andre Goree

On 2018/02/28 3:32 pm, David Turner wrote:

You could probably write an SNMP module for the new ceph-mgr daemon. 
What do you want to use to monitor Ceph that requires SNMP?


On Wed, Feb 28, 2018 at 1:13 PM Andre Goree  wrote:

I've looked and haven't found much information besides custom 3rd-party
plugins, so I figured I'd ask here:

Is there a way to monitor a cluster's health via SNMP?




Thanks. I was looking just to monitor Ceph's health status via PRTG.
I've actually found a project or two on GitHub specifically for Ceph &
PRTG but haven't tried anything yet.
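
In case it helps, a minimal sketch of the kind of check such a sensor could
run (assumes jq is installed and that, as on Luminous, the JSON from ceph
health carries a status field; the numeric mapping is just a convention):

    #!/bin/bash
    # Map ceph health to a number a monitoring system can graph/alert on.
    status=$(ceph health --format json | jq -r '.status')
    case "$status" in
        HEALTH_OK)   echo 0 ;;
        HEALTH_WARN) echo 1 ;;
        HEALTH_ERR)  echo 2 ;;
        *)           echo 3 ;;
    esac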




--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-


Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Vladimir Prokofev
> always solved by ceph pg repair 
That doesn't necessarily mean that there's no hardware issue. In my case
repair also worked fine and returned the cluster to an OK state every time,
but over time the faulty disk failed another scrub operation, and this
repeated multiple times before we replaced that disk.
One last thing to look into is dmesg on your OSD nodes. If there's a
hardware read error it will be logged in dmesg.

2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata :

> Hi and thanks for the reply
>
> The OSDs are all healthy; in fact, after a ceph pg repair the ceph
> health is back to OK and in the OSD log I see: repair ok, 0 fixed
>
> The SMART data of the 3 OSDs seems fine
>
> *OSD.5*
>
> # ceph-disk list | grep osd.5
>  /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2
>
> # smartctl -a /dev/sdd
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Seagate Barracuda 7200.14 (AF)
> Device Model: ST1000DM003-1SB10C
> Serial Number:Z9A1MA1V
> LU WWN Device Id: 5 000c50 090c7028b
> Firmware Version: CC43
> User Capacity:1,000,204,886,016 bytes [1.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate:7200 rpm
> Form Factor:  3.5 inches
> Device is:In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:Mon Mar  5 16:17:22 2018 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82)   Offline data collection activity
>   was completed without error.
>   Auto Offline Data Collection: Enabled.
> Self-test execution status:  (   0)   The previous self-test routine 
> completed
>   without error or no self-test has ever
>   been run.
> Total time to complete Offline
> data collection:  (0) seconds.
> Offline data collection
> capabilities:  (0x7b) SMART execute Offline immediate.
>   Auto Offline data collection on/off 
> support.
>   Suspend Offline collection upon new
>   command.
>   Offline surface scan supported.
>   Self-test supported.
>   Conveyance Self-test supported.
>   Selective Self-test supported.
> SMART capabilities:(0x0003)   Saves SMART data before entering
>   power-saving mode.
>   Supports SMART auto save timer.
> Error logging capability:(0x01)   Error logging supported.
>   General Purpose Logging supported.
> Short self-test routine
> recommended polling time:  (   1) minutes.
> Extended self-test routine
> recommended polling time:  ( 109) minutes.
> Conveyance self-test routine
> recommended polling time:  (   2) minutes.
> SCT capabilities:(0x1085) SCT Status supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate 0x000f   082   063   006Pre-fail  Always  
>  -   193297722
>   3 Spin_Up_Time0x0003   097   097   000Pre-fail  Always  
>  -   0
>   4 Start_Stop_Count0x0032   100   100   020Old_age   Always  
>  -   60
>   5 Reallocated_Sector_Ct   0x0033   100   100   010Pre-fail  Always  
>  -   0
>   7 Seek_Error_Rate 0x000f   091   060   045Pre-fail  Always  
>  -   1451132477
>   9 Power_On_Hours  0x0032   085   085   000Old_age   Always  
>  -   13283
>  10 Spin_Retry_Count0x0013   100   100   097Pre-fail  Always  
>  -   0
>  12 Power_Cycle_Count   0x0032   100   100   020Old_age   Always  
>  -   61
> 183 Runtime_Bad_Block   0x0032   100   100   000Old_age   Always  
>  -   0
> 184 End-to-End_Error0x0032   100   100   099Old_age   Always  
>  -   0
> 187 Reported_Uncorrect  0x0032   100   100   000Old_age   Always  
>  -   0
> 188 Command_Timeout 0x0032   100   100   000Old_age   Always  
>  -   0 0 0
> 189 

Re: [ceph-users] rbd mirror mechanics

2018-03-05 Thread Jason Dillaman
On Mon, Mar 5, 2018 at 2:07 PM, Brady Deetz  wrote:
> While preparing a risk assessment for a DR solution involving RBD, I'm
> increasingly unsure of a few things.
>
> 1) Does the failover from primary to secondary cluster occur automatically
> in the case that the primary backing rados pool becomes inaccessible?

There is no automatic failover since any clients to the RBD images
would be at a higher level of integration that librbd would have no
way to know about / interact with (i.e. how would RBD know how to kill
a VM on DC1 and somehow restart the VM in DC2?).

> 1.a) If the primary backing rados pool is unintentionally deleted, can the
> client still failover to the secondary?

Yes, the storage admin can force promote non-primary images on the DR
cluster to primary.
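
For reference, a sketch of what that force promotion looks like from the
CLI (the cluster, pool and image names are placeholders for your DR site):

    # on the DR cluster, once the primary site is gone
    rbd --cluster site-b mirror image promote --force mypool/myimage
    # or promote everything in the pool at once
    rbd --cluster site-b mirror pool promote --force mypool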

> 2) When an RBD image that is mirrored is deleted from the primary cluster,
> is it automatically deleted from the secondary cluster?

Yes, the deletion is replicated to the DR cluster without delay. The
Mimic release offers a new "rbd mirroring delete delay" configuration
setting to keep a deleted image within the RBD trash bucket until the
delay expires.

> 2.a) If the primary RBD image is unintentionally deleted, can the client
> still failover to the secondary?

Yes, assuming deferred deletion is enabled and you discover the
unintentional deletion prior to the expiration of the deletion delay.
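
A sketch of how that recovery would look on the DR side, assuming the delay
has not yet expired (pool name and image id are placeholders):

    rbd trash ls mypool                      # list images still held in the RBD trash
    rbd trash restore --pool mypool <image-id-from-the-listing>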




-- 
Jason


[ceph-users] Cache tier

2018-03-05 Thread Budai Laszlo

Dear all,

I have some questions about cache tiering in Ceph:

1. Can someone share experiences with cache tiering? What are the sensitive
things to pay attention to regarding the cache tier? Can one use the same SSD
for both cache and
2. Is cache tiering supported with Bluestore? Any advice for using a cache tier
with Bluestore?


Kind regards,
Laszlo


Re: [ceph-users] Delete a Pool - how hard should be?

2018-03-05 Thread Gregory Farnum
On Thu, Mar 1, 2018 at 9:21 AM Max Cuttins  wrote:

> I think this is a good question for everybody: how hard should it be to
> delete a pool?
>
> We ask to type the pool name twice.
> We ask to add "--yes-i-really-really-mean-it".
> We ask to give the mons the ability to delete the pool (and to remove this
> ability ASAP afterwards).
>
> ... and then somebody of course asks us to restore the pool.
>
> I think that all this stuff is not looking in the right direction.
> It's not the administrator that needs to be warned about deleting data.
> It's the data owner that should be warned (who most of the time gives
> their approval by phone and is gone).
>

> So, all this stuff just makes the life of the administrator harder, while not
> improving in any way the life of the data owner.
> Probably the best solution is to ...not delete at all and instead apply
> a "deletion policy".
> Something like:
>
> ceph osd pool rm POOL_NAME -yes
> -> POOL_NAME is set to be deleted, removal is scheduled within 30 days.
>
>
> This allows us to do 2 things:
>
>- allow administrators not to waste their time in the CLI with truly
>strange commands
>- allow the data owner to have a grace period to verify that, after
>deletion, everything works as expected and that the data that disappeared
>wasn't useful in some way.
>
> After 30 days the data will be removed automatically. This is a safe policy
> for ADMIN and DATA OWNER.
> Of course the ADMIN should be allowed to remove a POOL scheduled for deletion
> in order to save disk space if needed (but only if needed).
>
> What do you think?
>
>
You're not wrong, and indeed that's why I pushed back on the latest attempt
to make deleting pools even more cumbersome.

But having a "trash" concept is also pretty weird. If admins can override
it to just immediately delete the data (if they need the space), how is
that different from just being another hoop to jump through? If we want to
give the data owners a chance to undo, how do we identify and notify *them*
rather than the admin running the command? But if admins can't override the
trash and delete immediately, what do we do for things like testing and
proofs of concept where large-scale data creates and deletes are to be
expected?
-Greg


[ceph-users] rbd mirror mechanics

2018-03-05 Thread Brady Deetz
While preparing a risk assessment for a DR solution involving RBD, I'm
increasingly unsure of a few things.

1) Does the failover from primary to secondary cluster occur automatically
in the case that the primary backing rados pool becomes inaccessible?

1.a) If the primary backing rados pool is unintentionally deleted, can the
client still failover to the secondary?


2) When an RBD image that is mirrored is deleted from the primary cluster,
is it automatically deleted from the secondary cluster?

2.a) If the primary RBD image is unintentionally deleted, can the client
still failover to the secondary?


Re: [ceph-users] Deep Scrub distribution

2018-03-05 Thread Gregory Farnum
On Mon, Mar 5, 2018 at 9:56 AM Jonathan D. Proulx  wrote:

> Hi All,
>
> I've recently noticed my deep scrubs are EXTREMELY poorly
> distributed.  They are starting within the 18->06 local time start/stop
> window, but are not distributed over enough days, or well distributed
> over the range of days they do have.
>
> root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print $20}'`;
> do date +%D -d $date; done | sort | uniq -c
> dumped all
>   1 03/01/18
>   6 03/03/18
>8358 03/04/18
>1875 03/05/18
>
> So very nearly all 10240 pgs scrubbed last night/this morning.  I've
> been kicking this around for a while since I noticed poor distribution
> over a 7 day range when I was really pretty sure I'd changed that from
> the 7d default to 28d.
>
> Tried kicking it out to 42 days about a week ago with:
>
> ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'
>
>
> There were many errors suggesting it could not reread the change and I'd
> need to restart the OSDs, but 'ceph daemon osd.0 config show |grep
> osd_deep_scrub_interval' showed the right value, so I let it roll for a
> week, but the scrubs did not spread out.
>
> So Friday I set that value in ceph.conf and did rolling restarts of
> all OSDs.  Then double-checked the running value on all daemons.
> Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB
> voodoo above) showed near enough 1/42nd of the PGs had been scrubbed
> Saturday night that I thought this was working.
>
> This morning I checked again and got the results above.
>
> I would expect that after changing to a 42d scrub cycle I'd see approx 1/42
> of the PGs deep scrub each night until there was a roughly even
> distribution over the past 42 days.
>
> So which thing is broken my config or my expectations?
>

Sadly, changing the interval settings does not directly change the
scheduling of deep scrubs. Instead, it merely influences whether a PG will
get queued for scrub when it is examined as a candidate, based on how
out-of-date its scrub is. (That is, nothing holistically goes "I need to
scrub 1/n of these PGs every night"; there's a simple task that says "is
this PG's last scrub more than n days old?")

Users have shared various scripts on the list for setting up a more even
scrub distribution by fiddling with the settings and poking at specific PGs
to try and smear them out over the whole time period; I'd check archives or
google for those. :)
-Greg
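
For the archives, a rough sketch of that kind of smearing approach, run
nightly from a mon node (assumptions: the ceph pg dump column layout matches
the awk one-liner quoted above, i.e. pgid in field 1 and the deep-scrub stamp
in fields 20/21, and that manually deep-scrubbing roughly 1/42 of the PGs per
night is acceptable for your load):

    #!/bin/bash
    # Kick the PGs with the oldest deep-scrub stamps, about 1/42 of them per run.
    total=$(ceph pg dump 2>/dev/null | awk '/active/' | wc -l)
    batch=$(( total / 42 + 1 ))
    ceph pg dump 2>/dev/null | awk '/active/{print $20" "$21" "$1}' | sort | head -n "$batch" |
    while read -r day time pgid; do
        ceph pg deep-scrub "$pgid"
    done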


[ceph-users] Deep Scrub distribution

2018-03-05 Thread Jonathan D. Proulx
Hi All,

I've recently noticed my deep scrubs are EXTREMELY poorly
distributed.  They are starting within the 18->06 local time start/stop
window, but are not distributed over enough days, or well distributed
over the range of days they do have.

root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print $20}'`; do 
date +%D -d $date; done | sort | uniq -c
dumped all
  1 03/01/18
  6 03/03/18
   8358 03/04/18
   1875 03/05/18

So very nearly all 10240 pgs scrubbed last night/this morning.  I've
been kicking this around for a while since I noticed poor distribution
over a 7 day range when I was really pretty sure I'd changed that from
the 7d default to 28d.

Tried kicking it out to 42 days about a week ago with:

ceph tell osd.* injectargs '--osd_deep_scrub_interval 3628800'


There were many errors suggesting it could not reread the change and I'd
need to restart the OSDs, but 'ceph daemon osd.0 config show |grep
osd_deep_scrub_interval' showed the right value, so I let it roll for a
week, but the scrubs did not spread out.

So Friday I set that value in ceph.conf and did rolling restarts of
all OSDs.  Then double-checked the running value on all daemons.
Checking Sunday, the nightly deep scrubs (based on the LAST_DEEP_SCRUB
voodoo above) showed near enough 1/42nd of the PGs had been scrubbed
Saturday night that I thought this was working.

This morning I checked again and got the results above.

I would expect that after changing to a 42d scrub cycle I'd see approx 1/42
of the PGs deep scrub each night until there was a roughly even
distribution over the past 42 days.

So which thing is broken my config or my expectations?

-Jon



Re: [ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Daniel K
I had a similar problem with some relatively underpowered servers (2x
E5-2603, 6 cores at 1.7 GHz, no HT, 12-14 2TB OSDs per server, 32 GB RAM).

There was a process on a couple of the servers that would hang and chew up
all available CPU. When that happened, I started getting scrub errors on
those servers.



On Mon, Mar 5, 2018 at 8:45 AM, Jan Marquardt  wrote:

> On 05.03.18 at 13:13, Ronny Aasen wrote:
> > i had some similar issues when i started my proof of concept. especialy
> > the snapshot deletion i remember well.
> >
> > the rule of thumb for filestore that i assume you are running is 1GB ram
> > per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for
> > osd's + some  GB's for the mon service, + some GB's  for the os itself.
> >
> > i suspect if you inspect your dmesg log and memory graphs you will find
> > that the out of memory killer ends your osd's when the snap deletion (or
> > any other high load task) runs.
> >
> > I ended up reducing the number of osd's per node, since the old
> > mainboard i used was maxed for memory.
>
> Well, thanks for the broad hint. Somehow I assumed we fulfill the
> recommendations, but of course you are right. We'll check if our boards
> support 48 GB RAM. Unfortunately, there are currently no corresponding
> messages. But I can't rule out that there haven't been any.
>
> > corruptions occured for me as well. and they was normaly associated with
> > disks dying or giving read errors. ceph often managed to fix them but
> > sometimes i had to just remove the hurting OSD disk.
> >
> > hage some graph's  to look at. personaly i used munin/munin-node since
> > it was just an apt-get away from functioning graphs
> >
> > also i used smartmontools to send me emails about hurting disks.
> > and smartctl to check all disks for errors.
>
> I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are
> always caused by disk problems or if they also could be triggered
> by flapping OSDs or other circumstances.
>
> > good luck with ceph !
>
> Thank you!


Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata

Hi and thanks for the reply

The OSDs are all healthy; in fact, after a ceph pg repair the ceph
health is back to OK and in the OSD log I see: repair ok, 0 fixed


The SMART data of the 3 OSDs seems fine

*OSD.5*

# ceph-disk list | grep osd.5
 /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2

# smartctl -a /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST1000DM003-1SB10C
Serial Number:Z9A1MA1V
LU WWN Device Id: 5 000c50 090c7028b
Firmware Version: CC43
User Capacity:1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate:7200 rpm
Form Factor:  3.5 inches
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:Mon Mar  5 16:17:22 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:  (   0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:(0) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:(   1) minutes.
Extended self-test routine
recommended polling time:( 109) minutes.
Conveyance self-test routine
recommended polling time:(   2) minutes.
SCT capabilities:  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       193297722
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1451132477
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13283
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   086   086   000    Old_age   Always       -       14
190 Airflow_Temperature_Cel 0x0022   071   055   040    Old_age   Always       -       29 (Min/Max 23/32)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       607
194 Temperature_Celsius     0x0022   029   014   000    Old_age   Always       -       29 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       193297722
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 

Re: [ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Ronny Aasen

On 5 March 2018 at 14:45, Jan Marquardt wrote:

On 05.03.18 at 13:13, Ronny Aasen wrote:

i had some similar issues when i started my proof of concept. especialy
the snapshot deletion i remember well.

the rule of thumb for filestore that i assume you are running is 1GB ram
per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for
osd's + some  GB's for the mon service, + some GB's  for the os itself.

i suspect if you inspect your dmesg log and memory graphs you will find
that the out of memory killer ends your osd's when the snap deletion (or
any other high load task) runs.

I ended up reducing the number of osd's per node, since the old
mainboard i used was maxed for memory.


Well, thanks for the broad hint. Somehow I assumed we fulfill the
recommendations, but of course you are right. We'll check if our boards
support 48 GB RAM. Unfortunately, there are currently no corresponding
messages. But I can't rule out that there haven't been any.


corruptions occured for me as well. and they was normaly associated with
disks dying or giving read errors. ceph often managed to fix them but
sometimes i had to just remove the hurting OSD disk.

hage some graph's  to look at. personaly i used munin/munin-node since
it was just an apt-get away from functioning graphs

also i used smartmontools to send me emails about hurting disks.
and smartctl to check all disks for errors.


I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are
always caused by disk problems or if they also could be triggered
by flapping OSDs or other circumstances.


good luck with ceph !


Thank you!



In my not-that-extensive experience, scrub errors come mainly from 2
issues: either disks giving read errors (should be visible both in the
log and in dmesg), or having pools with size=2/min_size=1 instead of the
default and recommended size=3/min_size=2.
I can not say that they do not come from crashing OSDs, but in my case
the OSDs kept crashing due to a bad disk and/or low memory.
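
A one-line sketch to check which pools run with the risky setting (the output
format differs slightly between releases):

    ceph osd pool ls detail | grep min_size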





If you have scrub errors you can not get rid of on filestore (not
bluestore!) you can read the two following URLs:

http://ceph.com/geen-categorie/ceph-manually-repair-object/ and
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/


Basically the steps are (a rough shell sketch follows below):

- find the pg  ::  rados list-inconsistent-pg [pool]
- find the problem  ::  rados list-inconsistent-obj 0.6
--format=json-pretty ; gives you the object name, look for hints as to
which object is bad
- find the object on disk  ::  manually check the objects on each osd
for the given pg, check the object metadata (size/date/etc), run md5sum
on them all and compare. Check objects on the non-running osds and
compare there as well; anything to try to determine which object is ok
and which is bad.
- fix the problem  ::  assuming you find the bad object, stop the
affected osd that holds the bad object, remove the object manually,
restart the osd, and issue the repair command.
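
The sketch (pool name, pg id, osd id and object name are placeholders; paths
assume a default filestore layout, so double-check before removing anything):

    # 1. find the inconsistent pg(s) in the pool
    rados list-inconsistent-pg rbd
    # 2. get the details / the object name for one pg
    rados list-inconsistent-obj 0.6 --format=json-pretty
    # 3. compare the object's replicas on every osd hosting pg 0.6
    find /var/lib/ceph/osd/ceph-*/current/0.6_head/ -name '*OBJECTNAME*' -exec md5sum {} \;
    # 4. stop the osd with the bad copy, move the object aside, restart, repair
    systemctl stop ceph-osd@11
    mv <path-to-the-bad-copy-found-above> /root/bad-object.bak
    systemctl start ceph-osd@11
    ceph pg repair 0.6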



Once I fixed my min_size=1 misconfiguration, pulled the dying (but still
functional) disks from my cluster, and reduced the osd count to prevent
dying osds, all of those scrub errors went away. I have not seen one in 6
months now.



kinds regards
Ronny Aasen


Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata

Hi

I just posted in the ceph tracker with my logs and my issue

Let's hope this will be fixed

Thanks


On 05/03/2018 13:36, Paul Emmerich wrote:

Hi,

yeah, the cluster that I'm seeing this on also has only one host that 
reports that specific checksum. Two other hosts only report the same 
error that you are seeing.


Could you post to the tracker issue that you are also seeing this?

Paul

2018-03-05 12:21 GMT+01:00 Marco Baldini - H.S. Amiata:


Hi

After some days with debug_osd 5/5 I found [ERR] in different
days, different PGs, different OSDs, different hosts. This is what
I get in the OSD logs:

*OSD.5 (host 3)*
2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 
16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 
ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] 
r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 
active+clean+scrubbing+deep] 9.1c shard 6: soid 
9:3b157c56:::rbd_data.1526386b8b4567.1761:head candidate had a read 
error
2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 
9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.1761:head 
candidate had a read error

*OSD.4 (host 3)*
2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 
13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.f8eb:head 
candidate had a read error

*OSD.8 (host 2)*
2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 
14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.81a1:head 
candidate had a read error

I don't know what this error means, and as always a ceph pg
repair fixes it. I don't think this is normal.

Ideas?

Thanks


On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:


Hi

I read the bugtracker issue and it seems a lot like my problem,
even if I can't check the reported checksum because I don't have
it in my logs, perhaps it's because of debug osd = 0/0 in ceph.conf

I just raised the OSD log level

ceph tell osd.* injectargs --debug-osd 5/5

I'll check OSD logs in the next days...

Thanks



On 28/02/2018 11:59, Paul Emmerich wrote:

Hi,

might be http://tracker.ceph.com/issues/22464


Can you check the OSD log file to see if the reported checksum
is 0x6706be76?


Paul


On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata wrote:

Hello

I have a little ceph cluster with 3 nodes, each with 3x1TB HDD
and 1x240GB SSD. I created this cluster after Luminous release,
so all OSDs are Bluestore. In my crush map I have two rules,
one targeting the SSDs and one targeting the HDDs. I have 4
pools, one using the SSD rule and the others using the HDD
rule, three pools are size=3 min_size=2, one is size=2
min_size=1 (this one have content that it's ok to lose)

In the last 3 month I'm having a strange random problem. I
planned my osd scrubs during the night (osd scrub begin hour =
20, osd scrub end hour = 7) when office is closed so there is
low impact on the users. Some mornings, when I check the cluster
health, I find:

HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
OSD_SCRUB_ERRORS X scrub errors
PG_DAMAGED Possible data damage: Y pg inconsistent

X and Y sometimes are 1, sometimes 2.

I issue a ceph health detail, check the damaged PGs, and run a
ceph pg repair for the damaged PGs, I get

instructing pg PG on osd.N to repair

PG are different, OSD that have to repair PG is different, even
the node hosting the OSD is different, I made a list of all PGs
and OSDs. This morning is the most recent case:

> ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 13.65 is active+clean+inconsistent, acting [4,2,6]
pg 14.31 is active+clean+inconsistent, acting [8,3,1]
> ceph pg repair 13.65
instructing pg 13.65 on osd.4 to repair

(node-2)> tail /var/log/ceph/ceph-osd.4.log
2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 
13.65 repair starts
2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 
13.65 repair ok, 0 fixed
> ceph pg repair 14.31
instructing pg 14.31 on osd.8 to repair

(node-3)> tail /var/log/ceph/ceph-osd.8.log
2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
14.31 repair starts
2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
14.31 repair ok, 0 fixed


I made a list of when I got 

Re: [ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Jan Marquardt
On 05.03.18 at 13:13, Ronny Aasen wrote:
> i had some similar issues when i started my proof of concept. especialy
> the snapshot deletion i remember well.
> 
> the rule of thumb for filestore that i assume you are running is 1GB ram
> per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for
> osd's + some  GB's for the mon service, + some GB's  for the os itself.
> 
> i suspect if you inspect your dmesg log and memory graphs you will find
> that the out of memory killer ends your osd's when the snap deletion (or
> any other high load task) runs.
> 
> I ended up reducing the number of osd's per node, since the old
> mainboard i used was maxed for memory.

Well, thanks for the broad hint. Somehow I assumed we fulfill the
recommendations, but of course you are right. We'll check if our boards
support 48 GB RAM. Unfortunately, there are currently no corresponding
messages. But I can't rule out that there haven't been any.

> corruptions occured for me as well. and they was normaly associated with
> disks dying or giving read errors. ceph often managed to fix them but
> sometimes i had to just remove the hurting OSD disk.
> 
> hage some graph's  to look at. personaly i used munin/munin-node since
> it was just an apt-get away from functioning graphs
> 
> also i used smartmontools to send me emails about hurting disks.
> and smartctl to check all disks for errors.

I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are
always caused by disk problems or if they also could be triggered
by flapping OSDs or other circumstances.

> good luck with ceph !

Thank you!


Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Vladimir Prokofev
> candidate had a read error
speaks for itself - while scrubbing it couldn't read the data.
I had a similar issue, and it was just an OSD dying - errors and reallocated
sectors in SMART, so I just replaced the disk. But in your case it seems that
the errors are on different OSDs? Are your OSDs all healthy?
You can use this command to see some details:
rados list-inconsistent-obj <pg.id> --format=json-pretty
pg.id is the PG that's reporting as inconsistent. My guess is that you'll
see read errors in this output, with the OSD number that encountered the error.
After that you have to check that OSD's health - SMART details, etc.
It's not always the disk itself that's causing problems - for example, we had
read errors because of a faulty backplane interface in a server; changing
the chassis resolved the issue.


2018-03-05 14:21 GMT+03:00 Marco Baldini - H.S. Amiata :

> Hi
>
> After some days with debug_osd 5/5 I found [ERR] in different days,
> different PGs, different OSDs, different hosts. This is what I get in the
> OSD logs:
>
> *OSD.5 (host 3)*
> 2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 
> 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 
> ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] 
> r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 
> active+clean+scrubbing+deep] 9.1c shard 6: soid 
> 9:3b157c56:::rbd_data.1526386b8b4567.1761:head candidate had a 
> read error
> 2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 
> 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.1761:head 
> candidate had a read error
>
> *
> OSD.4 (host 3)*
> 2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 
> 13.65 shard 2: soid 
> 13:a719ecdf:::rbd_data.5f65056b8b4567.f8eb:head candidate had a 
> read error
>
> *OSD.8 (host 2)*
> 2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 
> 14.31 shard 1: soid 
> 14:8cc6cd37:::rbd_data.30b15b6b8b4567.81a1:head candidate had a 
> read error
>
> I don't know what this error means, and as always a ceph pg repair
> fixes it. I don't think this is normal.
>
> Ideas?
>
> Thanks
>
> On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:
>
> Hi
>
> I read the bugtracker issue and it seems a lot like my problem, even if I
> can't check the reported checksum because I don't have it in my logs,
> perhaps it's because of debug osd = 0/0 in ceph.conf
>
> I just raised the OSD log level
>
> ceph tell osd.* injectargs --debug-osd 5/5
>
> I'll check OSD logs in the next days...
>
> Thanks
>
>
>
> On 28/02/2018 11:59, Paul Emmerich wrote:
>
> Hi,
>
> might be http://tracker.ceph.com/issues/22464
>
> Can you check the OSD log file to see if the reported checksum
> is 0x6706be76?
>
>
> Paul
>
> On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>
> Hello
>
> I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and 1x240GB
> SSD. I created this cluster after Luminous release, so all OSDs are
> Bluestore. In my crush map I have two rules, one targeting the SSDs and one
> targeting the HDDs. I have 4 pools, one using the SSD rule and the others
> using the HDD rule, three pools are size=3 min_size=2, one is size=2
> min_size=1 (this one have content that it's ok to lose)
>
> In the last 3 month I'm having a strange random problem. I planned my osd
> scrubs during the night (osd scrub begin hour = 20, osd scrub end hour = 7)
> when the office is closed so there is low impact on the users. Some mornings,
> when I check the cluster health, I find:
>
> HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
> OSD_SCRUB_ERRORS X scrub errors
> PG_DAMAGED Possible data damage: Y pg inconsistent
>
> X and Y sometimes are 1, sometimes 2.
>
> I issue a ceph health detail, check the damaged PGs, and run a ceph pg
> repair for the damaged PGs, I get
>
> instructing pg PG on osd.N to repair
>
> PG are different, OSD that have to repair PG is different, even the node
> hosting the OSD is different, I made a list of all PGs and OSDs. This
> morning is the most recent case:
>
> > ceph health detail
> HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
> OSD_SCRUB_ERRORS 2 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 13.65 is active+clean+inconsistent, acting [4,2,6]
> pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>
> > ceph pg repair 13.65
> instructing pg 13.65 on osd.4 to repair
>
> (node-2)> tail /var/log/ceph/ceph-osd.4.log
> 2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 
> 13.65 repair starts
> 2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 
> 13.65 repair ok, 0 fixed
>
> > ceph pg repair 14.31
> instructing pg 14.31 on osd.8 to repair
>
> (node-3)> tail /var/log/ceph/ceph-osd.8.log
> 2018-02-28 08:52:37.297490 7f4dd0816700  

Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Paul Emmerich
Hi,

yeah, the cluster that I'm seeing this on also has only one host that
reports that specific checksum. Two other hosts only report the same error
that you are seeing.

Could you post to the tracker issue that you are also seeing this?

Paul

2018-03-05 12:21 GMT+01:00 Marco Baldini - H.S. Amiata :

> Hi
>
> After some days with debug_osd 5/5 I found [ERR] in different days,
> different PGs, different OSDs, different hosts. This is what I get in the
> OSD logs:
>
> *OSD.5 (host 3)*
> 2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 
> 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 
> ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] 
> r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 
> active+clean+scrubbing+deep] 9.1c shard 6: soid 
> 9:3b157c56:::rbd_data.1526386b8b4567.1761:head candidate had a 
> read error
> 2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 
> 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.1761:head 
> candidate had a read error
>
> *
> OSD.4 (host 3)*
> 2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 
> 13.65 shard 2: soid 
> 13:a719ecdf:::rbd_data.5f65056b8b4567.f8eb:head candidate had a 
> read error
>
> *OSD.8 (host 2)*
> 2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 
> 14.31 shard 1: soid 
> 14:8cc6cd37:::rbd_data.30b15b6b8b4567.81a1:head candidate had a 
> read error
>
> I don't know what this error means, and as always a ceph pg repair
> fixes it. I don't think this is normal.
>
> Ideas?
>
> Thanks
>
> On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:
>
> Hi
>
> I read the bugtracker issue and it seems a lot like my problem, even if I
> can't check the reported checksum because I don't have it in my logs,
> perhaps it's because of debug osd = 0/0 in ceph.conf
>
> I just raised the OSD log level
>
> ceph tell osd.* injectargs --debug-osd 5/5
>
> I'll check OSD logs in the next days...
>
> Thanks
>
>
>
> On 28/02/2018 11:59, Paul Emmerich wrote:
>
> Hi,
>
> might be http://tracker.ceph.com/issues/22464
>
> Can you check the OSD log file to see if the reported checksum
> is 0x6706be76?
>
>
> Paul
>
> On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>
> Hello
>
> I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and 1x240GB
> SSD. I created this cluster after Luminous release, so all OSDs are
> Bluestore. In my crush map I have two rules, one targeting the SSDs and one
> targeting the HDDs. I have 4 pools, one using the SSD rule and the others
> using the HDD rule, three pools are size=3 min_size=2, one is size=2
> min_size=1 (this one have content that it's ok to lose)
>
> In the last 3 month I'm having a strange random problem. I planned my osd
> scrubs during the night (osd scrub begin hour = 20, osd scrub end hour = 7)
> when the office is closed so there is low impact on the users. Some mornings,
> when I check the cluster health, I find:
>
> HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
> OSD_SCRUB_ERRORS X scrub errors
> PG_DAMAGED Possible data damage: Y pg inconsistent
>
> X and Y sometimes are 1, sometimes 2.
>
> I issue a ceph health detail, check the damaged PGs, and run a ceph pg
> repair for the damaged PGs, I get
>
> instructing pg PG on osd.N to repair
>
> PG are different, OSD that have to repair PG is different, even the node
> hosting the OSD is different, I made a list of all PGs and OSDs. This
> morning is the most recent case:
>
> > ceph health detail
> HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
> OSD_SCRUB_ERRORS 2 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 13.65 is active+clean+inconsistent, acting [4,2,6]
> pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>
> > ceph pg repair 13.65
> instructing pg 13.65 on osd.4 to repair
>
> (node-2)> tail /var/log/ceph/ceph-osd.4.log
> 2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 
> 13.65 repair starts
> 2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 
> 13.65 repair ok, 0 fixed
>
> > ceph pg repair 14.31
> instructing pg 14.31 on osd.8 to repair
>
> (node-3)> tail /var/log/ceph/ceph-osd.8.log
> 2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
> 14.31 repair starts
> 2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
> 14.31 repair ok, 0 fixed
>
>
>
> I made a list of when I got OSD_SCRUB_ERRORS, what PG and what OSD had to
> repair the PG. Dates are dd/mm/yyyy:
>
> 21/12/2017   --  pg 14.29 is active+clean+inconsistent, acting [6,2,4]
>
> 18/01/2018   --  pg 14.5a is active+clean+inconsistent, acting [6,4,1]
>
> 22/01/2018   --  pg 9.3a is active+clean+inconsistent, acting [2,7]
>
> 29/01/2018   --  pg 13.3e is active+clean+inconsistent, acting [4,6,1]

Re: [ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Ronny Aasen

On 05 March 2018 11:21, Jan Marquardt wrote:

Hi,

we are relatively new to Ceph and are observing some issues, where
I'd like to know how likely they are to happen when operating a
Ceph cluster.

Currently our setup consists of three servers which are acting as
OSDs and MONs. Each server has two Intel Xeon L5420 (yes, I know,
it's not state of the art, but we thought it would be sufficient
for a Proof of Concept. Maybe we were wrong?) and 24 GB RAM and is
running 8 OSDs with 4 TB harddisks. 4 OSDs are sharing one SSD for
journaling. We started on Kraken and upgraded lately to Luminous.
The next two OSD servers and three separate MONs are ready for
deployment. Please find attached our ceph.conf. Current usage looks
like this:

data:
   pools:   1 pools, 768 pgs
   objects: 5240k objects, 18357 GB
   usage:   59825 GB used, 29538 GB / 89364 GB avail

We have only one pool which is exclusively used for rbd. We started
filling it with data and creating snapshots in January and continued until
mid-February. Everything was working like a charm until we then started
removing old snapshots.

While we were removing snapshots for the first time, OSDs started
flapping. Besides this there was no other load on the cluster.
For idle times we solved it by adding

osd snap trim priority = 1
osd snap trim sleep = 0.1

to ceph.conf. When there is load from other operations and we
remove big snapshots, OSD flapping still occurs.

Last week our first scrub errors appeared. Repairing the first
one was no big deal. The second one, however, was a problem, because the
instructed OSD started crashing: first osd.17 on Friday, and
today osd.11.

ceph1:~# ceph pg repair 0.1b2
instructing pg 0.1b2 on osd.17 to repair

ceph1:~# ceph pg repair 0.1b2
instructing pg 0.1b2 on osd.11 to repair

I am still researching on the crashes, but already would be
thankful for any input.

Any opinions, hints and advices would really be appreciated.



I had some similar issues when I started my proof of concept; I remember
the snapshot deletion especially well.


The rule of thumb for filestore, which I assume you are running, is 1 GB of RAM
per TB of OSD. So with 8 x 4 TB OSDs you are looking at 32 GB of RAM for the
OSDs, plus some GB for the mon service, plus some GB for the OS itself.


I suspect that if you inspect your dmesg log and memory graphs you will find
that the out-of-memory killer ends your OSDs when the snap deletion (or
any other high-load task) runs.
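A quick way to check for that (a sketch; dmesg and free are standard tools, and
the dump_mempools admin socket command should be there on Luminous, but verify
on your version):

# look for OOM kills of ceph-osd processes
dmesg | grep -iE 'out of memory|oom-killer|killed process'

# overall memory headroom on the node
free -h

# per-OSD memory pools as seen by the daemon itself (run on the OSD node)
ceph daemon osd.0 dump_mempools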


I ended up reducing the number of OSDs per node, since the old
mainboard I used was already maxed out on memory.



Corruptions occurred for me as well, and they were normally associated with
disks dying or giving read errors. Ceph often managed to fix them, but
sometimes I had to just remove the failing OSD disk.


Have some graphs to look at. Personally I used munin/munin-node since
it was just an apt-get away from functioning graphs.


I also used smartmontools to send me emails about failing disks,
and smartctl to check all disks for errors.
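For example, a one-line smartd.conf sketch (the mail address is a placeholder;
the -s schedule runs a short self-test daily at 02:00 and a long one on
Saturdays at 03:00):

# /etc/smartd.conf -- scan all disks, run scheduled self-tests, mail on problems
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m ceph-admin@example.com -M test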


Good luck with ceph!

Kind regards
Ronny Aasen


Re: [ceph-users] All pools full after one OSD got OSD_FULL state

2018-03-05 Thread Vladimir Prokofev
I'll pitch in my personal experience.

When a single OSD in a pool becomes full (95% used), all client IO writes
to this pool must stop, even if other OSDs are almost empty. This is done
for the purpose of data integrity. [1]
To avoid this you need to balance your failure domains.
For example, assuming replicated pool size = 2, if one of your failure
domains has a weight of 10 and the other has a weight of 3 - you're
screwed. Ceph has to have a copy in both failure domains, and when the second
failure domain nears its capacity, the first will still have more than 70% free
storage.
It's easy to calculate and predict cluster storage capacity when all your
failure domains have the same weight and their number is a multiple of your
replication size, for example if your size = 3 and you have 3, 6, 9, 12, etc.
failure domains of the same weight. It becomes much less straightforward when
your failure domains have different weights and their number is not a multiple
of your replicated pool size. This may be even more complicated with EC pools,
but I don't use them, so no experience there.
So what I learned is that you should build your cluster evenly, without
heavy imbalance in weights (and IOPS, for that matter, if you don't want to
get slow requests), or you will regularly end up in a situation where a
single OSD is in near_full status while the cluster reports terabytes of free
storage.
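A quick way to check is to look at weights and utilization per failure domain,
roughly like this (a sketch; adjust to your own CRUSH tree):

# per-bucket (host/rack) weight and per-OSD utilization in one view
ceph osd df tree

# worked example for the weights above: with size = 2 and failure domains
# of weight 10 and 3, every object needs one copy in each domain, so the
# cluster can place only about 2 x 3 = 6 weight units of raw data before
# the small domain hits the full ratio, leaving ~7 units idle in the big one.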

[1]
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#no-free-drive-space

2018-03-05 11:15 GMT+03:00 Jakub Jaszewski :

> One full OSD has caused all pools to be reported as full. Can anyone help me
> understand this?
>
> During ongoing PG backfilling I see that MAX AVAIL values are changing
> while USED values stay constant.
>
>
> GLOBAL:
>     SIZE  AVAIL  RAW USED  %RAW USED
>     425T  145T   279T      65.70
> POOLS:
>     NAME                      ID  USED    %USED  MAX AVAIL  OBJECTS
>     volumes                   3   41011G  91.14  3987G      10520026
>     default.rgw.buckets.data  20  105T    93.11  7974G      28484000
>
>
>
>
> GLOBAL:
>     SIZE  AVAIL  RAW USED  %RAW USED
>     425T  146T   279T      65.66
> POOLS:
>     NAME                      ID  USED    %USED  MAX AVAIL  OBJECTS
>     volumes                   3   41013G  88.66  5246G      10520539
>     default.rgw.buckets.data  20  105T    91.13  10492G     28484000
>
>
> From what I can read in the docs, the MAX AVAIL value is a complicated function
> of the replication or erasure code used, the CRUSH rule that maps storage
> to devices, the utilization of those devices, and the configured
> mon_osd_full_ratio.
>
> Any clue what more I can do to make better use of the available raw storage?
> Increase the number of PGs for better-balanced OSD utilization?
>
> Thanks
> Jakub
>
>
>


Re: [ceph-users] RocksDB configuration

2018-03-05 Thread Oliver Freyermuth
After going through:
https://de.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
I can already answer some of my own questions - notably, compaction should
happen slowly, and there is high write amplification for SSDs, which could
explain why the SSDs backing our MDS reached their limit.
Likely, NVMes will perform better.
I'm unsure how much of the write amplification hits the WAL and how much hits
the DB, though; this would be interesting to learn.

Still, the questions concerning the usefulness of RocksDB compression (and
whether that's configurable at all via Ceph tunables), and potential gains from
offline or overnight compaction, remain open.
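If I read the OSD boot log right, those "set rocksdb option" lines come from the
bluestore_rocksdb_options string, so compression should in principle be tunable
there. A sketch (untested; it just mirrors the logged defaults with the
compression value changed, and whether the bundled RocksDB was built with
LZ4/Snappy support is an assumption to verify):

[osd]
# BlueStore hands this string to RocksDB as comma-separated key=value pairs;
# it only takes effect when the OSD is restarted
bluestore rocksdb options = compression=kLZ4Compression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152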

Also, after seeing this, I wonder how bad it really is if RocksDB spills over 
to a slow device, 
since the "hot" parts should stay on faster devices. 
Does somebody have long-term experience with that? 

Currently, our SSD space is too small to keep RocksDB once our cluster becomes
full (we only have about 7 GB of SSD per 4 TB of HDD-OSD),
so the question is whether we should buy larger SSDs / NVMes or whether this
might actually be a non-issue in the long run.
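To watch whether and how much the DB actually spills over to the slow device,
the bluefs perf counters look like the right place (a sketch; I take the
counter names as an assumption to verify on your release):

# on the OSD node, via the admin socket
ceph daemon osd.0 perf dump | grep -E 'db_(total|used)_bytes|slow_(total|used)_bytes'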

Cheers,
Oliver


On 05.03.2018 at 11:42, Oliver Freyermuth wrote:
> Dear Cephalopodians,
> 
> in the benchmarks done with many files, I noted that our bottleneck was 
> mainly given by the MDS-SSD performance,
> and notably, after deletion of the many files in CephFS, the RocksDB stayed 
> large and did not shrink. 
> Recreating an OSD from scratch and backfilling it, however, resulted in a 
> smaller RocksDB. 
> 
> I noticed some interesting messages in the logs of starting OSDs:
>  set rocksdb option compaction_readahead_size = 2097152
>  set rocksdb option compression = kNoCompression
>  set rocksdb option max_write_buffer_number = 4
>  set rocksdb option min_write_buffer_number_to_merge = 1
>  set rocksdb option recycle_log_file_num = 4
>  set rocksdb option writable_file_max_buffer_size = 0
>  set rocksdb option write_buffer_size = 268435456
> 
> Now I wonder: Can these be configured via Ceph parameters? 
> Can / should one trigger compaction with ceph-kvstore-tool - is this safe when
> the corresponding OSD is down, and has anybody tested it?
> Is there a fixed time slot when compaction starts (e.g. low load average)? 
> 
> I'm especially curious if compression would help to reduce write load on the 
> metadata servers - maybe not, since the synchronization of I/O has to happen 
> in any case,
> and this is more likely to be the actual limit than the bulk I/O. 
> 
> Just being curious! 
> 
> Cheers,
>   Oliver
> 
> 
> 







Re: [ceph-users] All pools full after one OSD got OSD_FULL state

2018-03-05 Thread Jakub Jaszewski
One full OSD has caused all pools to be reported as full. Can anyone help me
understand this?

During ongoing PG backfilling I see that MAX AVAIL values are changing
while USED values stay constant.


GLOBAL:
    SIZE  AVAIL  RAW USED  %RAW USED
    425T  145T   279T      65.70
POOLS:
    NAME                      ID  USED    %USED  MAX AVAIL  OBJECTS
    volumes                   3   41011G  91.14  3987G      10520026
    default.rgw.buckets.data  20  105T    93.11  7974G      28484000




GLOBAL:
    SIZE  AVAIL  RAW USED  %RAW USED
    425T  146T   279T      65.66
POOLS:
    NAME                      ID  USED    %USED  MAX AVAIL  OBJECTS
    volumes                   3   41013G  88.66  5246G      10520539
    default.rgw.buckets.data  20  105T    91.13  10492G     28484000


From what I can read in the docs, the MAX AVAIL value is a complicated function
of the replication or erasure code used, the CRUSH rule that maps storage
to devices, the utilization of those devices, and the configured
mon_osd_full_ratio.

Any clue what more I can do to make better use of the available raw storage?
Increase the number of PGs for better-balanced OSD utilization?
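A first thing to check (a sketch; the test- variant is a dry run and should not
change anything, while the real reweight needs more thought before running):

# per-OSD utilization and variance -- look for OSDs far above the average
ceph osd df tree

# dry run of automatic reweighting; the real command drops the "test-" prefix
ceph osd test-reweight-by-utilization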

Thanks
Jakub


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-05 Thread Robert Sander
On 05.03.2018 00:26, Adrian Saul wrote:
>  
> 
> We are using Ceph+RBD+NFS under pacemaker for VMware.  We are doing
> iSCSI using SCST but have not used it against VMware, just Solaris and
> Hyper-V.
> 
> 
> It generally works and performs well enough – the biggest issues are the
> clustering for iSCSI ALUA support and NFS failover, most of which we
> have developed in house – we still have not quite got that right yet.

You should look at setting up a Samba CTDB cluster with CephFS as
backend. This can also be used with NFS including NFS failover.
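A minimal smb.conf sketch for the CephFS backend (assumptions on my side: Samba
built with the vfs_ceph module, a cephx user named samba, and placeholder paths;
CTDB itself still needs its usual recovery lock and nodes setup on top of this):

[share]
    # serve a CephFS path through Samba's vfs_ceph module
    path = /export
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    # vfs_ceph uses libcephfs, so kernel share modes cannot be used
    kernel share modes = no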

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein -- Registered office: Berlin





Re: [ceph-users] Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

2018-03-05 Thread Marco Baldini - H.S. Amiata

Hi

After some days with debug_osd 5/5 I found [ERR] on different days, in
different PGs, on different OSDs, on different hosts. This is what I get in
the OSD logs:


*OSD.5 (host 3)*
2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 
16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 
ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] 
r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 
active+clean+scrubbing+deep] 9.1c shard 6: soid 
9:3b157c56:::rbd_data.1526386b8b4567.1761:head candidate had a read 
error
2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 
9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.1761:head 
candidate had a read error

*OSD.4 (host 3)*
2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 
13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.f8eb:head 
candidate had a read error

*OSD.8 (host 2)*
2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 
14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.81a1:head 
candidate had a read error

I don't know what this error means, and as always a ceph pg repair
fixes it. I don't think this is normal.
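One thing that might help pin it down before repairing (a sketch; as far as I
know these commands have existed since Jewel, and the JSON shows which shard
reported the read error):

# list PGs with scrub inconsistencies in a pool
rados list-inconsistent-pg <pool-name>

# details for one PG, including the object and the shard with the read error
rados list-inconsistent-obj 9.1c --format=json-pretty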


Ideas?

Thanks


On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:


Hi

I read the bugtracker issue and it seems a lot like my problem, even 
if I can't check the reported checksum because I don't have it in my 
logs, perhaps it's because of debug osd = 0/0 in ceph.conf


I just raised the OSD log level

ceph tell osd.* injectargs --debug-osd 5/5

I'll check OSD logs in the next days...

Thanks



On 28/02/2018 11:59, Paul Emmerich wrote:

Hi,

might be http://tracker.ceph.com/issues/22464

Can you check the OSD log file to see if the reported checksum 
is 0x6706be76?



Paul

On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata wrote:


Hello

I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and
1x240GB SSD. I created this cluster after the Luminous release, so all
OSDs are Bluestore. In my crush map I have two rules, one targeting
the SSDs and one targeting the HDDs. I have 4 pools, one using the
SSD rule and the others using the HDD rule; three pools are size=3
min_size=2, one is size=2 min_size=1 (this one has content that
is OK to lose).


In the last 3 months I have been having a strange random problem. I planned
my osd scrubs during the night (osd scrub begin hour = 20, osd scrub
end hour = 7) when the office is closed so there is low impact on the
users. Some mornings, when I check the cluster health, I find:


HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
OSD_SCRUB_ERRORS X scrub errors
PG_DAMAGED Possible data damage: Y pg inconsistent

X and Y sometimes are 1, sometimes 2.

I issue a ceph health detail, check the damaged PGs, and run a ceph
pg repair for the damaged PGs; I get:


instructing pg PG on osd.N to repair

The PG is different, the OSD that has to repair the PG is different, even
the node hosting the OSD is different; I made a list of all PGs and
OSDs. This morning is the most recent case:


> ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 13.65 is active+clean+inconsistent, acting [4,2,6]
pg 14.31 is active+clean+inconsistent, acting [8,3,1]
> ceph pg repair 13.65
instructing pg 13.65 on osd.4 to repair

(node-2)> tail /var/log/ceph/ceph-osd.4.log
2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 
13.65 repair starts
2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 
13.65 repair ok, 0 fixed
> ceph pg repair 14.31
instructing pg 14.31 on osd.8 to repair

(node-3)> tail /var/log/ceph/ceph-osd.8.log
2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
14.31 repair starts
2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 
14.31 repair ok, 0 fixed


I made a list of when I got OSD_SCRUB_ERRORS, what PG and what OSD
had to repair the PG. Dates are dd/mm/yyyy:


21/12/2017   --  pg 14.29 is active+clean+inconsistent, acting [6,2,4]

18/01/2018   --  pg 14.5a is active+clean+inconsistent, acting [6,4,1]

22/01/2018   --  pg 9.3a is active+clean+inconsistent, acting [2,7]

29/01/2018   --  pg 13.3e is active+clean+inconsistent, acting [4,6,1]
  instructing pg 13.3e on osd.4 to repair

07/02/2018   --  pg 13.7e is active+clean+inconsistent, acting [8,2,5]
  instructing pg 13.7e on osd.8 to repair

09/02/2018   --  pg 13.30 is active+clean+inconsistent, acting [7,3,2]
  instructing pg 13.30 on osd.7 to repair

15/02/2018   --  pg 9.35 is active+clean+inconsistent, acting [1,8]
  instructing pg 9.35 on osd.1 to repair

 

[ceph-users] RocksDB configuration

2018-03-05 Thread Oliver Freyermuth
Dear Cephalopodians,

in the benchmarks done with many files, I noted that our bottleneck was mainly 
given by the MDS-SSD performance,
and notably, after deletion of the many files in CephFS, the RocksDB stayed 
large and did not shrink. 
Recreating an OSD from scratch and backfilling it, however, resulted in a 
smaller RocksDB. 

I noticed some interesting messages in the logs of starting OSDs:
 set rocksdb option compaction_readahead_size = 2097152
 set rocksdb option compression = kNoCompression
 set rocksdb option max_write_buffer_number = 4
 set rocksdb option min_write_buffer_number_to_merge = 1
 set rocksdb option recycle_log_file_num = 4
 set rocksdb option writable_file_max_buffer_size = 0
 set rocksdb option write_buffer_size = 268435456

Now I wonder: Can these be configured via Ceph parameters? 
Can / should one trigger compaction with ceph-kvstore-tool - is this safe when
the corresponding OSD is down, and has anybody tested it?
Is there a fixed time slot when compaction starts (e.g. low load average)? 
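For reference, the offline compaction I have in mind would look roughly like
this - very much a sketch, since I have not tested it: the bluestore-kv backend
of ceph-kvstore-tool may not be present on every Luminous build, and the OSD
must be stopped first:

# stop the OSD so nothing else has its RocksDB open
systemctl stop ceph-osd@0

# offline compaction of the BlueStore-embedded RocksDB
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact

systemctl start ceph-osd@0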

I'm especially curious if compression would help to reduce write load on the 
metadata servers - maybe not, since the synchronization of I/O has to happen in 
any case,
and this is more likely to be the actual limit than the bulk I/O. 

Just being curious! 

Cheers,
Oliver





Re: [ceph-users] BlueStore questions

2018-03-05 Thread Gustavo Varela
There is a presentation by Sage, slide 16:

https://es.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in

You can probably try that as an initial guide, hope it helps.
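For a concrete starting point, something along these lines (a sketch only:
device names are placeholders, and the ceph-disk flag and the DB sizing option
should be checked against your release):

# ceph.conf, [osd] section -- size of the DB partition ceph-disk will carve
# out of the shared NVMe for each OSD (here ~30 GiB; adjust to taste)
bluestore block db size = 32212254720

# create a BlueStore OSD with data on the HDD and its DB on the shared NVMe
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1

With 12 OSDs sharing one NVMe that would mean roughly 12 x 30 GiB of DB space
plus WAL, so the device has to be sized (and fast enough) accordingly.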

gus


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Frank 
Ritchie
Sent: Sunday, 04 March 2018 1:41
To: ceph-us...@ceph.com
Subject: [ceph-users] BlueStore questions

Hi all,

I have a few questions on using BlueStore.

With FileStore it is not uncommon to see 1 nvme device being used as the 
journal device for up to 12 OSDs.

Can an adequately sized nvme device also be used as the wal/db device for up to 
12 OSDs?

Are there any rules of thumb for sizing wal/db?

Would love to hear some actual numbers from users.

thx
Frank


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Jan Marquardt
Hi,

we are relatively new to Ceph and are observing some issues, where
I'd like to know how likely they are to happen when operating a
Ceph cluster.

Currently our setup consists of three servers which are acting as
OSDs and MONs. Each server has two Intel Xeon L5420 (yes, I know,
it's not state of the art, but we thought it would be sufficient
for a Proof of Concept. Maybe we were wrong?) and 24 GB RAM and is
running 8 OSDs with 4 TB harddisks. 4 OSDs are sharing one SSD for
journaling. We started on Kraken and upgraded lately to Luminous.
The next two OSD servers and three separate MONs are ready for
deployment. Please find attached our ceph.conf. Current usage looks
like this:

data:
  pools:   1 pools, 768 pgs
  objects: 5240k objects, 18357 GB
  usage:   59825 GB used, 29538 GB / 89364 GB avail

We have only one pool which is exclusively used for rbd. We started
filling it with data and creating snapshots in January and continued until
mid-February. Everything was working like a charm until we then started
removing old snapshots.

While we were removing snapshots for the first time, OSDs started
flapping. Besides this there was no other load on the cluster.
For idle times we solved it by adding

osd snap trim priority = 1
osd snap trim sleep = 0.1

to ceph.conf. When there is load from other operations and we
remove big snapshots, OSD flapping still occurs.
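The values can also be changed at runtime without restarting the OSDs (a
sketch; injectargs may report a change as unobserved and requiring a restart
for some options):

# push the snap trim throttles to all running OSDs on the fly
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_snap_trim_priority 1'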

Last week our first scrub errors appeared. Repairing the first
one was no big deal. The second one, however, was a problem, because the
instructed OSD started crashing: first osd.17 on Friday, and
today osd.11.

ceph1:~# ceph pg repair 0.1b2
instructing pg 0.1b2 on osd.17 to repair

ceph1:~# ceph pg repair 0.1b2
instructing pg 0.1b2 on osd.11 to repair

I am still researching on the crashes, but already would be
thankful for any input.

Any opinions, hints and advices would really be appreciated.

Best Regards

Jan
[global]
fsid = c59e56df-2043-4c92-9492-25f05f268d9f
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 10.10.100.21,10.10.100.22,10.10.100.23
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public network = 10.10.100.0/24

[osd]

osd journal size = 0
osd snap trim priority = 1
osd snap trim sleep = 0.1

[client]

rbd default features = 3