Re: [ceph-users] RGW hammer/master woes
Is there anyone who is hitting this? Any help on this is much appreciated.

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pavan Rallabhandi
Sent: Saturday, February 28, 2015 11:42 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] RGW hammer/master woes

Am struggling to get through a basic PUT via swift client with RGW and CEPH binaries built out of Hammer/Master codebase, whereas the same (command on the same setup) is going through with RGW and CEPH binaries built out of Giant. Find below RGW log snippet and the command that was run. Am I missing anything obvious here?

The user info looks like this:

{ "user_id": "johndoe",
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "7B39L2TUQ448LZW4RI3M",
          "secret_key": "lshKCoacSlbyVc7mBLLr4cJ26fEEM22Tcmp29hT3"},
        { "user": "johndoe:swift",
          "access_key": "SHZ64EF7CIB4V42I14AH",
          "secret_key": ""}],
  "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "asdf"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "user_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "temp_url_keys": []}

The command that was run and the logs:

<snip>
swift -A http://localhost:8989/auth -U johndoe:swift -K asdf upload mycontainer ceph

2015-02-28 23:28:39.272897 7fb610ff9700  1 ====== starting new request req=0x7fb5f0009990 =====
2015-02-28 23:28:39.272913 7fb610ff9700  2 req 0:0.16::PUT /swift/v1/mycontainer/ceph::initializing
2015-02-28 23:28:39.272918 7fb610ff9700 10 host=localhost:8989
2015-02-28 23:28:39.272921 7fb610ff9700 20 subdomain= domain= in_hosted_domain=0
2015-02-28 23:28:39.272938 7fb610ff9700 10 meta>> HTTP_X_OBJECT_META_MTIME
2015-02-28 23:28:39.272945 7fb610ff9700 10 x>> x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.272964 7fb610ff9700 10 ver=v1 first=mycontainer req=ceph
2015-02-28 23:28:39.272971 7fb610ff9700 10 s->object=ceph s->bucket=mycontainer
2015-02-28 23:28:39.272976 7fb610ff9700  2 req 0:0.79:swift:PUT /swift/v1/mycontainer/ceph::getting op
2015-02-28 23:28:39.272982 7fb610ff9700  2 req 0:0.85:swift:PUT /swift/v1/mycontainer/ceph:put_obj:authorizing
2015-02-28 23:28:39.273008 7fb610ff9700 10 swift_user=johndoe:swift
2015-02-28 23:28:39.273026 7fb610ff9700 20 build_token token=0d006a6f686e646f653a73776966744436beb90402b13c4f53f35472c2cf0f
2015-02-28 23:28:39.273057 7fb610ff9700  2 req 0:0.000160:swift:PUT /swift/v1/mycontainer/ceph:put_obj:reading permissions
2015-02-28 23:28:39.273100 7fb610ff9700 15 Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
2015-02-28 23:28:39.273114 7fb610ff9700  2 req 0:0.000216:swift:PUT /swift/v1/mycontainer/ceph:put_obj:init op
2015-02-28 23:28:39.273120 7fb610ff9700  2 req 0:0.000223:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op mask
2015-02-28 23:28:39.273123 7fb610ff9700 20 required_mask= 2 user.op_mask=7
2015-02-28 23:28:39.273125 7fb610ff9700  2 req 0:0.000228:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op permissions
2015-02-28 23:28:39.273129 7fb610ff9700  5 Searching permissions for uid=johndoe mask=50
2015-02-28 23:28:39.273131 7fb610ff9700  5 Found permission: 15
2015-02-28 23:28:39.273133 7fb610ff9700  5 Searching permissions for group=1 mask=50
2015-02-28 23:28:39.273135 7fb610ff9700  5 Permissions for group not found
2015-02-28 23:28:39.273136 7fb610ff9700  5 Searching permissions for group=2 mask=50
2015-02-28 23:28:39.273137 7fb610ff9700  5 Permissions for group not found
2015-02-28 23:28:39.273138 7fb610ff9700  5 Getting permissions id=johndoe owner=johndoe perm=2
2015-02-28 23:28:39.273140 7fb610ff9700 10 uid=johndoe requested perm (type)=2, policy perm=2, user_perm_mask=2, acl perm=2
2015-02-28 23:28:39.273143 7fb610ff9700  2 req 0:0.000246:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op params
2015-02-28 23:28:39.273146 7fb610ff9700  2 req 0:0.000249:swift:PUT /swift/v1/mycontainer/ceph:put_obj:executing
2015-02-28 23:28:39.273279 7fb610ff9700 10 x>> x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.273313 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 obj=mycontainer:ceph state=0x7fb5f0016940 s->prefetch_data=0
2015-02-28 23:28:39.274354 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 obj=mycontainer:ceph state=0x7fb5f0016940 s->prefetch_data=0
2015-02-28 23:28:39.274394
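One thing the log does show is that swift authentication reached the right subuser: the build_token value is plain hex and embeds the subuser name. A quick decode (the length-prefix framing below is inferred from the bytes themselves, not taken from the rgw source):

```python
# Decode the token from the "build_token" log line above.
token = "0d006a6f686e646f653a73776966744436beb90402b13c4f53f35472c2cf0f"
raw = bytes.fromhex(token)

# The first two bytes look like a little-endian length prefix (0x000d == 13),
# which matches the length of "johndoe:swift" - an inference, not rgw gospel.
n = raw[0] | (raw[1] << 8)
print(raw[2:2 + n].decode("ascii"))  # -> johndoe:swift
```

The remaining bytes are presumably a timestamp plus HMAC; the point is only that the subuser identity made it into the token, so the failure is after the authorizing step.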
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi Robert, it seems I have not listened well to your advice - I set the osd to out, instead of stopping it - and now instead of some ~3% of degraded objects, there is 0.000% degraded and around 6% misplaced - and rebalancing is happening again, but this is a small percentage...

Do you know whether later, when I remove this OSD from the crush map, no more data will be rebalanced (as per the CEPH official documentation) - since the already misplaced objects are getting distributed away to all other nodes?

(after service ceph stop osd.0 there was 2.45% degraded data - but no backfilling was happening for some reason... it just stayed degraded... so this is the reason why I started the OSD back up, and then set it to out...)

Thanks

On 4 March 2015 at 17:54, Andrija Panic andrija.pa...@gmail.com wrote:
Hi Robert, I already have this stuff set. Ceph is 0.87.0 now... Thanks, will schedule this for the weekend; 10G network and 36 OSDs - should move data in less than 8h per my last experience (that was around 8h, but some 1G OSDs were included...). Thx!

On 4 March 2015 at 17:49, Robert LeBlanc rob...@leblancnet.us wrote:
You will most likely have a very high relocation percentage. Backfills are always more impactful on smaller clusters, but osd max backfills should be what you need to help reduce the impact. The default is 10; you will want to use 1. I didn't catch which version of Ceph you are running, but I think there was some priority work done in firefly to help make backfills lower priority. I think it has gotten better in later versions.

On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic andrija.pa...@gmail.com wrote:
Thank you Robert - I'm wondering, when I remove a total of 7 OSDs from the crush map, whether that will cause more than 37% of data to be moved (80% or whatever). I'm also wondering if the throttling that I applied is fine or not - I will introduce the osd_recovery_delay_start 10sec as Irek said.

I'm just wondering how much the performance impact will be, because:
- when stopping an OSD, the impact while backfilling was fine, more or less - I can live with this
- when I removed an OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less, but still noticeable...

Thanks for the tip of course! Andrija

On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote:
I would be inclined to shut down both OSDs in a node and let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything into its final place. If you are still adding new nodes, while nobackfill and norecover are set, you can add them in so that the one big relocation fills the new drives too.

On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote:
Thx Irek. The number of replicas is 3. I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers with 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network... So you suggest removing the whole node with 2 OSDs manually from the crush map? Per my knowledge, ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So anyway, would it be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...? Thx again for your time

On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote:
Right now you have only three nodes in the cluster. I recommend you add the new nodes to the cluster first, and then delete the old ones.

2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com:
What is your replication count?

2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:
Hi Irek, yes, stopping the OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. It was when I afterwards removed it from the crush map (ceph osd crush rm id) that the thing with 37% happened. And thanks, Irek, for the help - could you kindly just let me know the preferred steps when removing a whole node? Do you mean I first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data being misplaced and moved around? Sorry for bugging you, I really appreciate your help. Thanks

On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote:
A large percentage of the rebuild of the cluster map (But low
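The throttling discussed in this thread (osd max backfills = 1 plus osd_recovery_delay_start = 10s) can be pushed to running OSDs with injectargs. A sketch only - option names should be verified against your Ceph release, and the run wrapper below merely prints the commands so the sequence can be reviewed before executing it for real:

```shell
PLAN=""
run() { PLAN="$PLAN$*; "; echo "$@"; }

# Throttle backfill/recovery impact before touching the crush map
run ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
run ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
run ceph tell 'osd.*' injectargs '--osd-recovery-delay-start 10'
```

injectargs changes are runtime-only; to make them survive a restart, the same values belong in the [osd] section of ceph.conf.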
Re: [ceph-users] Ceph repo - RSYNC?
On 03/05/2015 07:14 PM, Brian Rak wrote:
Do any of the Ceph repositories run rsync? We generally mirror the repository locally so we don't encounter any unexpected upgrades. eu.ceph.com used to run this, but it seems to be down now.

# rsync rsync://eu.ceph.com
rsync: failed to connect to eu.ceph.com: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(124) [receiver=3.0.6]

Argh! That rsync daemon somehow sometimes dies and I don't notice. I'll see if I can fix this.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
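Once the daemon is back up, a local mirror is a one-liner. A sketch, with the caveat that the rsync module name ("ceph" below) is an assumption - list what the daemon actually exports first with `rsync rsync://eu.ceph.com/`:

```shell
# Hypothetical mirror job; review the command, then run it (e.g. daily from cron).
MIRROR=/var/mirror/ceph
CMD="rsync -az --partial --delete rsync://eu.ceph.com/ceph/ $MIRROR/"
echo "$CMD"
# --partial keeps interrupted transfers for resuming;
# --delete drops packages removed upstream so the mirror stays exact.
```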
Re: [ceph-users] rgw admin api - users
The metadata api can do it:

GET /admin/metadata/user

Yehuda

----- Original Message -----
From: Joshua Weaver joshua.wea...@ctl.io
To: ceph-us...@ceph.com
Sent: Thursday, March 5, 2015 1:43:33 PM
Subject: [ceph-users] rgw admin api - users

According to the docs at http://docs.ceph.com/docs/master/radosgw/adminops/#get-user-info I should be able to invoke /admin/user without a uid specified, and get a list of users. No matter what I try, I get a 403. After looking at the source on github (ceph/ceph), it appears that there isn't any code path that would result in a collection of users being generated from that resource. Am I missing something?

TIA,
_josh
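The admin ops endpoints authenticate with S3-style (AWS v2) signatures, which is also the usual reason for a 403 on an otherwise valid URI. A minimal sketch of building the signed request for GET /admin/metadata/user - the keys are placeholders, and the admin user is assumed to have metadata read caps (e.g. `radosgw-admin caps add --uid=admin --caps="metadata=read"`):

```python
import base64, hmac
from hashlib import sha1
from email.utils import formatdate

def admin_auth_header(access_key, secret_key, method, resource, date):
    # AWS v2 string-to-sign: method, content-md5, content-type, date, resource
    string_to_sign = "\n".join([method, "", "", date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), sha1).digest()
    return "AWS %s:%s" % (access_key, base64.b64encode(digest).decode())

date = formatdate(usegmt=True)
resource = "/admin/metadata/user"
print("GET %s HTTP/1.1" % resource)
print("Date: %s" % date)
print("Authorization: %s" % admin_auth_header(
    "ACCESS_KEY", "SECRET_KEY", "GET", resource, date))
```

On a gateway host, `radosgw-admin metadata list user` gives the same listing without any HTTP signing.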
[ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]
Hi, I am a newbie to ceph and the ceph-users group. Recently I have been working on a ceph client. It worked in all environments, but when I tested it on production it was not able to connect to ceph. The operating system details and the error follow. If someone has seen this problem before, any help is really appreciated.

OS - lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.2 LTS
Release: 12.04
Codename: precise

2015-03-05 13:37:16.816322 7f5191deb700 -- 10.8.25.112:0/2487 >> 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol feature mismatch, my 1ffa < peer 42041ffa missing 4204
2015-03-05 13:37:17.635776 7f5191deb700 -- 10.8.25.112:0/2487 >> 10.138.23.241:6789/0 pipe(0x12489f0 sd=3 pgs=0 cs=0 l=0).connect protocol feature mismatch, my 1ffa < peer 42041ffa missing 4204
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Setting an OSD out will start the rebalance with a degraded object count. The OSD is still alive and can participate in the relocation of the objects. This is preferable so that you don't happen to drop below min_size because a disk fails during the rebalance, at which point I/O stops on the cluster.

Because CRUSH is an algorithm, anything that changes the algorithm will cause a change in the output (location). When you set out/fail an OSD, it changes the CRUSH computation, but the host and the weight of the host are still in effect. When you remove the host or change the weight of the host (by removing a single OSD), it makes a change to the algorithm which will also cause some changes in how it computes the locations.

Disclaimer - I have not tried this. It may be possible to minimize the data movement by doing the following:

1. Set norecover and nobackfill on the cluster.
2. Set the OSDs to be removed to out.
3. Adjust the weight of the hosts in the CRUSH map (if removing all OSDs for a host, set it to zero).
4. If you have new OSDs to add, add them into the cluster now.
5. Once all OSD changes have been entered, unset norecover and nobackfill.
6. This will migrate the data off the old OSDs and onto the new OSDs in one swoop.
7. Once the data migration is complete, set norecover and nobackfill on the cluster again.
8. Remove the old OSDs.
9. Unset norecover and nobackfill.

The theory is that by setting the host weights to 0, removing the OSDs/hosts later should minimize the data movement afterwards, because the algorithm should have already dropped them out as candidates for placement. If this works right, then you basically queue up a bunch of small changes, do one data movement, always keep all copies of your objects online, and minimize the impact of the data movement by leveraging both your old and new hardware at the same time. If you try this, please report back on your experience. I might try it in my lab, but I'm really busy at the moment, so I don't know if I'll get to it real soon.
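The nine steps above translate almost one-to-one into CLI calls. A sketch, assuming two OSDs (osd.0, osd.1) on the host being drained - the ids are hypothetical, and the run wrapper only prints the commands so the plan can be reviewed first:

```shell
PLAN=""
run() { PLAN="$PLAN$*; "; echo "$@"; }

# 1. freeze data movement
run ceph osd set norecover
run ceph osd set nobackfill
for id in 0 1; do                         # hypothetical OSD ids on the old host
    run ceph osd out $id                  # 2. mark out
    run ceph osd crush reweight osd.$id 0 # 3. zero its (and thus the host's) weight
done
# 4. add any new OSDs here, then:
run ceph osd unset nobackfill             # 5./6. one big migration starts now
run ceph osd unset norecover
# 7.-9. after the cluster is healthy again: freeze, remove, unfreeze
run ceph osd set norecover
run ceph osd set nobackfill
for id in 0 1; do
    run ceph osd crush remove osd.$id
    run ceph auth del osd.$id
    run ceph osd rm $id
done
run ceph osd unset nobackfill
run ceph osd unset norecover
```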
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Thanks a lot, Robert. I have actually already tried the following:

a) Set one OSD to out (6% of data misplaced; CEPH recovered fine), stopped the OSD, removed the OSD from the crush map (again 36% of data misplaced!!!) - then inserted the OSD back into the crush map, and those 36% misplaced objects disappeared, of course - I've undone the crush remove... so damage undone - the OSD is just out and the cluster healthy again.

b) Set norecover and nobackfill, and then:
- Removed one OSD from the crush map (a running OSD, not the one from point a) - only 18% of data misplaced!!! (no recovery was happening though, because of norecover, nobackfill)
- Removed another OSD from the same node - a total of only 20% of objects misplaced (with 2 OSDs on the same node removed from the crush map)
- So these 2 OSDs were still running, UP and IN, and I just removed them from the crush map, per the advice to avoid calculating the crush map twice - from: http://image.slidesharecdn.com/scalingcephatcern-140311134847-phpapp01/95/scaling-ceph-at-cern-ceph-day-frankfurt-19-638.jpg?cb=1394564547
- And I added these 2 OSDs back to the crush map; this was just a test...

So the algorithm is very funny in some aspects... but it's all pseudo-random stuff, so I kind of understand... I will share my findings during the rest of the OSD demotion, after I demote them... Thanks for your detailed inputs!

Andrija

On 5 March 2015 at 22:51, Robert LeBlanc rob...@leblancnet.us wrote:
Setting an OSD out will start the rebalance with a degraded object count. The OSD is still alive and can participate in the relocation of the objects. This is preferable so that you don't happen to drop below min_size because a disk fails during the rebalance, at which point I/O stops on the cluster. Because CRUSH is an algorithm, anything that changes the algorithm will cause a change in the output (location). When you set out/fail an OSD, it changes the CRUSH computation, but the host and weight of the host are still in effect.
Re: [ceph-users] pool distribution quality report script
Hi David,

Mind sending me the output of ceph pg dump -f json? Thanks!

Mark

On 03/05/2015 12:52 PM, David Burley wrote:
Mark,

It worked for me earlier this morning, but the new rev is throwing a traceback:

$ ceph pg dump -f json | python ./readpgdump.py > pgdump_analysis.txt
dumped all in format json
Traceback (most recent call last):
  File "./readpgdump.py", line 294, in <module>
    parse_json(data)
  File "./readpgdump.py", line 263, in parse_json
    print_report(pool_counts, total_counts, JSON)
  File "./readpgdump.py", line 119, in print_report
    print_data(data, pool_weights, total_weights)
  File "./readpgdump.py", line 161, in print_data
    print format_line("Efficiency score using optimal weights for pool %s: %.1f%%" % (pool, efficiency_score(data[name], weights['acting_totals'])))
  File "./readpgdump.py", line 71, in efficiency_score
    if weights and weights[osd]:
KeyError: 0

On Thu, Mar 5, 2015 at 1:46 PM, Mark Nelson mnel...@redhat.com wrote:
Hi Blair,

I've updated the script and it now (theoretically) computes optimal crush weights based on both primary and secondary acting set OSDs. It also attempts to show you the efficiency of equal weights vs. weights optimized for different pools (or all pools). This is done by looking at the way weights would be assigned and how they would affect the current pool distribution, then looking at the skew for the heaviest-weighted OSD vs. the average. Unfortunately the output has now become beastly and complex (granted, this is a large cluster with many pools!). I think the trick now is how to make the interface for this more manageable. For instance, perhaps it's not interesting to know how one pool's weights affect the efficiency of the acting primary OSDs for another pool. I've included sample output, but it's huge (15K lines long!)

Mark

On 03/05/2015 01:52 AM, Blair Bethwaite wrote:
Hi Mark,

Cool, that looks handy.
Though it'd be even better if it could go a step further and recommend re-weighting values to balance things out (or increased PG counts where needed).

Cheers,

On 5 March 2015 at 15:11, Mark Nelson mnel...@redhat.com wrote:
Hi All,

Recently some folks showed interest in gathering pool distribution statistics, and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added calculation of the expected max and min PGs per OSD and std deviation. The script is available here:

https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py

Some general comments:

1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem a la Raab & Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf

2) You can invoke it either by passing it a file or via stdin, ie:

ceph pg dump -f json | ./readpgdump.py
or
./readpgdump.py ~/pgdump.out

3) Here's a snippet of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful?

[nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out
+--------------------------------------------------------------------+
| Detected input as plain                                            |
+--------------------------------------------------------------------+
| Pool ID: 681                                                       |
+--------------------------------------------------------------------+
| Participating OSDs: 210                                            |
| Participating PGs: 4096                                            |
+--------------------------------------------------------------------+
| OSDs in Primary Role (Acting)                                      |
| Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2    |
| Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5      |
| 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) |
| 5 Least
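The "Expected" mean in that report follows directly from uniform placement. A minimal sketch of the arithmetic using the numbers from the sample output above (the report's min/max come from the Raab & Steger occupancy bounds, which are tighter than the plain binomial spread computed here):

```python
import math

num_pgs, num_osds = 4096, 210   # from the sample report above
p = 1.0 / num_osds              # chance a given PG picks a given OSD as primary
mean = num_pgs * p              # expected primary PGs per OSD -> 19.5, as reported
std = math.sqrt(num_pgs * p * (1 - p))  # binomial std dev, a rough spread estimate
print("mean %.1f, std %.1f" % (mean, std))
```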
[ceph-users] RadosGW - Create bucket via admin API
Hello guys,

In the adminops documentation I saw how to remove a bucket, but I can't find the URI to create one. I'd like to know if this is possible?

Regards.

Italo Santos
http://italosantos.com.br/
Re: [ceph-users] Firefly, cephfs issues: different unix rights depending on the client and ls are slow
Hi,

I'm sorry to revive my post, but I can't solve my problems and I see nothing in the logs. I have tried with the Hammer version and I found the same phenomena. In fact, first I tried the same installation (ie the same conf via puppet) as my cluster, but in a virtualbox environment, and I have the same phenomena. Second, I reinstalled my virtualbox environment but with the Hammer version of Ceph (ie the testing version 0.93-1trusty), and I have the same issues too.

Le 04/03/2015 14:15, Francois Lafont wrote:
[...]
~# mkdir /cephfs
~# mount -t ceph 10.0.2.150,10.0.2.151,10.0.2.152:/ /cephfs/ -o name=cephfs,secretfile=/etc/ceph/ceph.client.cephfs.secret

Then in test-cephfs, I do:

root@test-cephfs:~# mkdir /cephfs/d1
root@test-cephfs:~# ll /cephfs/
total 4
drwxr-xr-x  1 root root    0 Mar  4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar  4 11:42 ../
drwxr-xr-x  1 root root    0 Mar  4 11:45 d1/

After, in test-cephfs2, I do:

root@test-cephfs2:~# ll /cephfs/
total 4
drwxr-xr-x  1 root root    0 Mar  4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar  4 11:42 ../
drwxrwxrwx  1 root root    0 Mar  4 11:45 d1/

1) Why are the unix rights of d1/ different when I'm in test-cephfs and when I'm in test-cephfs2? They should be the same, shouldn't they? This problem is not random and I can reproduce it indefinitely.

2) If I create 100 files in /cephfs/d1/ with test-cephfs:

for i in $(seq 100)
do
    echo $(date +%s.%N) > /cephfs/d1/f_$i
done

sometimes, in test-cephfs2, when I do a simple:

root@test-cephfs2:~# time \ls -la /cephfs

Sorry, copy-and-paste error; of course it was:

root@test-cephfs2:~# time \ls -la /cephfs/d1/

the command can take 2 or 3 seconds, which seems to me very long for a directory with just 100 files. Generally, if I repeat the command on test-cephfs2 just after, it's immediate, but not always. I cannot reproduce the problem in a deterministic way. Sometimes, to reproduce the problem, I must remove all the files in /cephfs/ on test-cephfs and recreate them. It's very strange.
Sometimes, and randomly, something seems to be stalled, but I don't know what. I suspect a problem of mds tuning but, in fact, I don't know what to do. I have the same problem with hammer too. Can someone confirm that 3s (not always) for ls -la on a cephfs directory which contains 100 files is pathological? After all, maybe it is normal? I don't have much experience with cephfs.

Thanks for your help.

--
François Lafont
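One way to narrow this down is to separate the two things `ls -la` does: the readdir and the per-entry stat (on cephfs, the stat phase is where capability round-trips to the MDS show up). A small timing sketch - it uses a local temp directory as a stand-in, so point `d` at the cephfs mount (e.g. /cephfs/d1) to measure the real thing:

```python
import os, tempfile, time

d = tempfile.mkdtemp()              # stand-in for /cephfs/d1
for i in range(100):
    open(os.path.join(d, "f_%d" % i), "w").close()

t0 = time.time()
names = os.listdir(d)               # readdir only (plain `ls`)
t1 = time.time()
for name in names:
    os.lstat(os.path.join(d, name)) # per-entry stat, what `-la` adds
t2 = time.time()
print("readdir %.3fs, stat %.3fs" % (t1 - t0, t2 - t1))
```

If the stat loop dominates on the second client, the delay is likely metadata/caps traffic rather than the directory listing itself.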
Re: [ceph-users] client-ceph [can not connect from client][connect protocol feature mismatch]
What did you mean when you say "ceph client"? The log piece that you posted suggests that the kernel you are using does not support some features of ceph. Try to update your kernel if your 'client' is a Rados Block Device client.

06.03.2015 00:48, Sonal Dubey wrote:
Hi, I am newbie for ceph, and ceph-user group. [...]
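The feature bits the client lacks are peer & ~mine; what each bit means is defined in Ceph's src/include/ceph_features.h. A quick check of the values from the log (the usual culprit for this class of mismatch on old kernels is newer CRUSH tunables on the cluster, but verify the bit names against the header for your release):

```python
mine, peer = 0x1ffa, 0x42041ffa          # from the log line above
missing = peer & ~mine                   # features the peer has and we lack
print(hex(missing))                      # 0x42040000
bits = [i for i in range(64) if missing >> i & 1]
print(bits)                              # [18, 25, 30]
```

Besides upgrading the kernel, an alternative workaround is relaxing the cluster's tunables (e.g. `ceph osd crush tunables legacy`), at the cost of the newer placement behavior.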
[ceph-users] rgw admin api - users
According to the docs at http://docs.ceph.com/docs/master/radosgw/adminops/#get-user-info I should be able to invoke /admin/user without a uid specified, and get a list of users. No matter what I try, I get a 403. After looking at the source on github (ceph/ceph), it appears that there isn't any code path that would result in a collection of users being generated from that resource. Am I missing something?

TIA,
_josh
Re: [ceph-users] Understand RadosGW logs
----- Original Message -----
From: Daniel Schneller daniel.schnel...@centerdevice.com
To: ceph-users@lists.ceph.com
Sent: Tuesday, March 3, 2015 2:54:13 AM
Subject: [ceph-users] Understand RadosGW logs

Hi!

After realizing the problem with log rotation (see http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708) and fixing it, I now for the first time have some meaningful (and recent) logs to look at. While from an application perspective there seem to be no issues, I would like to understand some messages I find with relatively high frequency in the logs:

Exhibit 1
---------

2015-03-03 11:14:53.685361 7fcf4bfef700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:15:57.476059 7fcf39ff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:17:43.570986 7fcf25fcb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:22:00.881640 7fcf39ff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:22:48.147011 7fcf35feb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:27:40.572723 7fcf50ff9700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:29:40.082954 7fcf36fed700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:30:32.204492 7fcf4dff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1

It means that an error occurred while returning data to the client; usually it means that the client disconnected before the transfer completed.

I cannot find anything relevant by Googling for that, apart from the actual line of code that produces this line. What does that mean? Is it an indication of data corruption, or are there more benign reasons for this line?
Exhibit 2 -- Several of these blocks

2015-03-03 07:06:17.805772 7fcf36fed700 1 == starting new request req=0x7fcf5800f3b0 =
2015-03-03 07:06:17.836671 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-03-03 07:06:17.836758 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-03-03 07:06:17.836918 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243 part_ofs=0 rule->part_size=0
2015-03-03 07:06:18.263126 7fcf36fed700 1 == req done req=0x7fcf5800f3b0 http_status=200 ==
...
2015-03-03 09:27:29.855001 7fcf28fd1700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:27:29.866718 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866778 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866852 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866917 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.875466 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.884434 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.906155 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.914364 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.940653 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024 part_ofs=0 rule->part_size=0
2015-03-03 09:27:30.272816 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.125773 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.192661 7fcf28fd1700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 09:27:31.194481 7fcf28fd1700 1 == req done req=0x7fcf580102a0 http_status=200 ==
...
2015-03-03 09:28:43.008517 7fcf2a7d4700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:28:43.016414 7fcf2a7d4700 0 RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579 part_ofs=0 rule->part_size=0
2015-03-03 09:28:43.022387 7fcf2a7d4700 1 == req done req=0x7fcf580102a0 http_status=200 ==

First, what is the req= line? Is that a thread id? I am asking because the same id is used over and over in the same file over time.

It's the request id (within the current radosgw instance). More
[ceph-users] tgt and krbd
Hi All, Just a heads up after a day's experimentation. I believe tgt with its default settings has a small write cache when exporting a kernel mapped RBD. Doing some write tests I saw 4 times the write throughput when using tgt aio + krbd compared to tgt with the built-in librbd. After running the following command against the LUN, which apparently disables the write cache, performance dropped back to what I am seeing using tgt+librbd and also the same as fio.

tgtadm --op update --mode logicalunit --tid 2 --lun 3 -P mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0

From that I can only deduce that using tgt + krbd in its default state is not 100% safe to use, especially in an HA environment. Nick
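To sanity-check what a mode_page string like the one above actually sets, the SCSI caching mode page (0x08) can be decoded offline. A sketch under the assumption that the first data byte in the tgtadm string corresponds to byte 2 of the caching mode page, where bit 2 is Write Cache Enable (WCE):

```python
def wce_enabled(mode_page):
    """Decode a tgtadm-style mode_page string 'page:subpage:len:b0:b1:...'
    for the SCSI caching mode page (0x08) and report whether the Write
    Cache Enable (WCE) bit is set. Assumption: b0 corresponds to byte 2
    of the mode page, whose bit 2 is WCE (bit 0 is RCD)."""
    fields = mode_page.split(":")
    if int(fields[0], 0) != 0x08:
        raise ValueError("not the caching mode page")
    data = [int(f, 0) for f in fields[3:]]  # skip page, subpage, length
    return bool(data[0] & 0x04)

# The parameter from Nick's command (whitespace from line wrapping removed):
cache_off = "8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0"
print(wce_enabled(cache_off))  # -> False, i.e. write cache disabled
```

The first data byte 0x10 has WCE clear, which matches the observation that this command disables the write cache.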
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/05/2015 03:40 AM, Josh Durgin wrote: It looks like your libvirt rados user doesn't have access to whatever pool the parent image is in:

librbd::AioRequest: write 0x7f1ec6ad6960 rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r = -1

-1 is EPERM, for operation not permitted. Check the libvirt user capabilities shown in ceph auth list - it should have at least r and class-read access to the pool storing the parent image. You can update it via the 'ceph auth caps' command.

Josh, All images - parent, snapshot and clone - reside in the same pool (libvirt-pool *) and the user (libvirt) seems to have the proper capabilities. See:

client.libvirt
key:
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rw class-read pool=rbd

This same pool contains other (flat) images used to back my production VMs. They are all accessed with this same user and there have been no problems so far. I just don't seem able to use cloned images. -K.

* In my original email describing the problem I used 'rbd' instead of 'libvirt-pool' as the pool name for simplicity. As more and more configuration items are requested, it makes more sense to use the real pool name to avoid causing any misconceptions.
Re: [ceph-users] The project of ceph client file system porting from Linux to AIX
Hello Ketor, About a year or more ago I needed a free DFS that could be used in an AIX environment as a tiered storage solution for a bank DC; that is why I started the project. This project just ports the CephFS client in the Linux kernel to the AIX kernel (maybe RBD in the future), so it is a kernel-mode AIX cephfs. But I have multiple projects at hand now, so the AIX cephfs project is in pending status... It is open; anyone can make changes to this project if they want. On Thu, Mar 5, 2015 at 3:11 PM, Ketor D d.ke...@gmail.com wrote: Hi Dennis, I am interested in your project. I wrote a Win32 cephfs client, https://github.com/ceph/ceph-dokan. But ceph-dokan runs in user mode. I see you port code from the kernel cephfs; are you planning to write a kernel-mode AIX cephfs? Thanks! 2015-03-04 17:59 GMT+08:00 Dennis Chen kernel.org@gmail.com: Hello, The ceph cluster can currently only be used from Linux systems AFAICT, so I planned to port the ceph client file system from Linux to AIX as a tiered storage solution on that platform. Below is the source code repository, which is still in progress. 3 important modules: 1. aixker: maintains a uniform kernel API between Linux and AIX 2. net: the data transfer layer between the client and the cluster 3. fs: an adaptor so that AIX can recognize the Linux file system. https://github.com/Dennis-Chen1977/aix-cephfs Welcome any comments or anything... -- Den -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Den
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
Thank you all for all the good advice and much-needed documentation. I have a lot to digest :) Adrian On 03/04/2015 08:17 PM, Stephen Mercier wrote: To expand upon this, the very nature and existence of Ceph is to replace RAID. The FS itself replicates data and handles the HA functionality that you're looking for. If you're going to build a single server with all those disks, backed by a ZFS RAID setup, you're going to be much better suited with an iSCSI setup. The idea of ceph is that it takes the place of all the ZFS bells and whistles. A CEPH cluster that only has one OSD backed by that huge ZFS setup becomes just a wire-protocol to speak to the server. The magic in ceph comes from the replication and distribution of the data across many OSDs, hopefully living in many hosts. My own setup for instance uses 96 OSDs that are spread across 4 hosts (I know I know guys - CPU is a big deal with SSDs so 24 per host is a tall order - didn't know that when we built it - been working ok so far) that is then distributed between 2 cabinets on 2 separate cooling/power/data zones in our datacenter. My CRUSH map is currently setup for 3 copies of all data, and laid out so that at least one copy is located in each cabinet, and then the cab that gets the 2 copies also makes sure that each copy is on a different host. No RAID needed because ceph makes sure that I have a safe amount of copies of the data, in a distribution layout that allows us to sleep at night. In my opinion, ceph is much more pleasant, powerful, and versatile to deal with than both hardware RAID and ZFS (both of which we have instances of deployed as well from previous iterations of infrastructure deployments). Now, you could always create small little zRAID clusters using ZFS, and then give an OSD to each of those, if you wanted even an additional layer of safety. Heck, you could even have hardware RAID behind the zRAID, for even another layer.
Where YOU need to make the decision is the trade-off between HA functionality/peace of mind, performance, and usability/maintainability. Would be happy to answer any questions you still have... Cheers, -- Stephen Mercier Senior Systems Architect Attainia, Inc. Phone: 866-288-2464 ext. 727 Email: stephen.merc...@attainia.com Web: www.attainia.com Capital equipment lifecycle planning and budgeting solutions for healthcare On Mar 4, 2015, at 10:42 AM, Alexandre DERUMIER wrote: Hi, for hardware, Inktank has good guides here: http://www.inktank.com/resource/inktank-hardware-selection-guide/ http://www.inktank.com/resource/inktank-hardware-configuration-guide/ Ceph works well with multiple osd daemons (1 osd per disk), so you should not use RAID. (xfs is the recommended fs for osd daemons.) You don't need spare disks either, just enough disk space to handle a disk failure. (Data is replicated/rebalanced onto other disks/osds in case of a disk failure.) - Original Message - From: Adrian Sevcenco adrian.sevce...@cern.ch To: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, 4 March 2015 18:30:31 Subject: [ceph-users] CEPH hardware recommendations and cluster design questions Hi! I've seen the documentation at http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what I've seen, for mon and mds any cheap 6-core, 16+ GB RAM AMD would do ... What puzzles me is that per-daemon construct ... Why would I need/require multiple daemons? With separate servers (3 mon + 1 mds - I understood that this is the requirement) I imagine that each will run a single type of daemon.. Did I miss something? (Besides, maybe there is a relation between daemons and block devices, and for each block device there should be a daemon?) For mon and mds: would it help the clients if these are on 10 GbE?
For osd: I plan to use a 36-disk server as the osd server (ZFS RAIDZ3 across all disks + 2 mirrored SSDs for ZIL and L2ARC) - that would give me ~132 TB. How much RAM would I really need? (128 GB would be way too much, I think.) (That RAIDZ3 across 36 disks is just a thought - I also have choices like: 2 x 18 RAIDZ2; 34 disks RAIDZ3 + 2 hot spares.) Regarding journal and scrubbing: by using ZFS I would think that I can safely not use the Ceph ones ... is this OK? Do you have any other advice and recommendations for me? (The read:write ratio will be 10:1.) Thank you!! Adrian
Re: [ceph-users] Ceph repo - RSYNC?
I use reposync to keep mine updated when needed. Something like:

cd ~/ceph/repos
reposync -r Ceph -c /etc/yum.repos.d/ceph.repo
reposync -r Ceph-noarch -c /etc/yum.repos.d/ceph.repo
reposync -r elrepo-kernel -c /etc/yum.repos.d/elrepo.repo

Michael Kuriger Sr. Unix Systems Engineer mk7...@yp.com | 818-649-7235

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brian Rak Sent: Thursday, March 05, 2015 10:14 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] Ceph repo - RSYNC? Do any of the Ceph repositories run rsync? We generally mirror the repository locally so we don't encounter any unexpected upgrades. eu.ceph.com used to run this, but it seems to be down now.

# rsync rsync://eu.ceph.com
rsync: failed to connect to eu.ceph.com: Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(124) [receiver=3.0.6]
Re: [ceph-users] pool distribution quality report script
Hi Blair, I've updated the script and it now (theoretically) computes optimal crush weights based on both primary and secondary acting set OSDs. It also attempts to show you the efficiency of equal weights vs using weights optimized for different pools (or all pools). This is done by looking at the way weights would be assigned and how they would affect the current pool distribution, then looking at the skew for the heaviest weighted OSD vs the average. Unfortunately the output has now become beastly and complex (granted this is a large cluster with many pools!). I think the trick now is how to make the interface for this more manageable. For instance perhaps it's not interesting to know how one pool's weights affect the efficiency of the acting primary OSDs for another pool. I've included sample output, but it's huge (15K lines long!) Mark On 03/05/2015 01:52 AM, Blair Bethwaite wrote: Hi Mark, Cool, that looks handy. Though it'd be even better if it could go a step further and recommend re-weighting values to balance things out (or increased PG counts where needed). Cheers, On 5 March 2015 at 15:11, Mark Nelson mnel...@redhat.com wrote: Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. 
The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py

Some general comments:

1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab and Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf
2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out
3) Here's a snippet of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful?

[nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out
+--------------------------------------------------------------------+
| Detected input as plain |
+--------------------------------------------------------------------+
| Pool ID: 681 |
+--------------------------------------------------------------------+
| Participating OSDs: 210 |
| Participating PGs: 4096 |
+--------------------------------------------------------------------+
| OSDs in Primary Role (Acting) |
| Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 |
| Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 |
| 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) |
| 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) |
| Avg Deviation from Most Subscribed OSD: 54.6% |
+--------------------------------------------------------------------+
| OSDs in Secondary Role (Acting) |
| Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 |
| Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 |
| 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) |
| 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) |
| Avg Deviation from Most Subscribed OSD: 36.0% |
+--------------------------------------------------------------------+
| OSDs in All Roles (Acting) |
| Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 |
| Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 |
| 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) |
| 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) |
| Avg Deviation from Most Subscribed OSD: 37.1% |
+--------------------------------------------------------------------+
| OSDs in Primary Role (Up) |
| Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 |
| Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 |
| 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) |
| 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) |
| Avg Deviation from Most Subscribed OSD: 54.6% |
+--------------------------------------------------------------------+
| OSDs in Secondary Role (Up) |
| Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 |
| Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 |
| 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) |
| 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) |
| Avg Deviation from Most Subscribed OSD: 36.0% |
+--------------------------------------------------------------------+
| OSDs in All Roles (Up) |
| Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 |
| Actual PGs
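For readers who only want the raw PGs-per-OSD tallies without the full script, the core counting is small. A minimal sketch (hypothetical helper names; it assumes the `ceph pg dump -f json` output carries a top-level "pg_stats" list with "acting"/"up" arrays, which varies between releases):

```python
import json
from collections import Counter

def pg_per_osd(pg_dump_json, role="acting"):
    """Tally PGs per OSD from `ceph pg dump -f json` output. Assumes a
    top-level "pg_stats" list whose entries carry "acting"/"up" OSD id
    lists; the exact dump format differs between Ceph releases."""
    counts = Counter()
    for pg in json.loads(pg_dump_json)["pg_stats"]:
        for osd in pg[role]:
            counts[osd] += 1
    return counts

def deviation_from_most_subscribed(counts):
    """Percent by which the average OSD falls short of the most
    subscribed one - roughly the 'Avg Deviation' figure above."""
    worst = max(counts.values())
    mean = sum(counts.values()) / float(len(counts))
    return 100.0 * (worst - mean) / worst

sample = json.dumps({"pg_stats": [
    {"pgid": "1.0", "acting": [0, 1]},
    {"pgid": "1.1", "acting": [0, 2]},
]})
print(pg_per_osd(sample))  # -> Counter({0: 2, 1: 1, 2: 1})
```

As a cross-check, (43 - 19.5) / 43 reproduces the 54.6% printed for the primary acting role in the sample output above.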
Re: [ceph-users] pool distribution quality report script
Mark, It worked for me earlier this morning but the new rev is throwing a traceback:

$ ceph pg dump -f json | python ./readpgdump.py > pgdump_analysis.txt
dumped all in format json
Traceback (most recent call last):
  File "./readpgdump.py", line 294, in <module>
    parse_json(data)
  File "./readpgdump.py", line 263, in parse_json
    print_report(pool_counts, total_counts, "JSON")
  File "./readpgdump.py", line 119, in print_report
    print_data(data, pool_weights, total_weights)
  File "./readpgdump.py", line 161, in print_data
    print format_line("Efficiency score using optimal weights for pool %s: %.1f%%" % (pool, efficiency_score(data[name], weights['acting_totals'])))
  File "./readpgdump.py", line 71, in efficiency_score
    if weights and weights[osd]:
KeyError: 0

On Thu, Mar 5, 2015 at 1:46 PM, Mark Nelson mnel...@redhat.com wrote: Hi Blair, I've updated the script and it now (theoretically) computes optimal crush weights based on both primary and secondary acting set OSDs. It also attempts to show you the efficiency of equal weights vs using weights optimized for different pools (or all pools). This is done by looking at the way weights would be assigned and how they would affect the current pool distribution, then looking at the skew for the heaviest weighted OSD vs the average. Unfortunately the output has now become beastly and complex (granted this is a large cluster with many pools!). I think the trick now is how to make the interface for this more manageable. For instance perhaps it's not interesting to know how one pool's weights affect the efficiency of the acting primary OSDs for another pool. I've included sample output, but it's huge (15K lines long!) Mark On 03/05/2015 01:52 AM, Blair Bethwaite wrote: Hi Mark, Cool, that looks handy. Though it'd be even better if it could go a step further and recommend re-weighting values to balance things out (or increased PG counts where needed).
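The KeyError comes from `efficiency_score` indexing `weights[osd]` for an OSD id that never appears in the weights dict. A defensive variant (a sketch with simplified scoring, not the script's actual formula) would use dict.get:

```python
def efficiency_score(counts, weights):
    """KeyError-safe sketch: OSDs absent from `weights` are treated as
    weight 0 and skipped, instead of raising as in the traceback above."""
    score = 0.0
    for osd, pgs in counts.items():
        w = weights.get(osd, 0)  # the fix: .get() instead of weights[osd]
        if w:
            score += pgs / float(w)
    return score

print(efficiency_score({0: 10, 1: 5}, {0: 2}))  # osd 1 has no weight -> 5.0
```

The .get() call also covers the empty-weights case that the original `if weights and weights[osd]:` guard was checking for.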
Cheers, On 5 March 2015 at 15:11, Mark Nelson mnel...@redhat.com wrote: Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py Some general comments: 1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf 2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out 3) Here's a snippet of some of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful? 
[nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out +--- -+ | Detected input as plain | +--- -+ +--- -+ | Pool ID: 681 | +--- -+ | Participating OSDs: 210 | | Participating PGs: 4096 | +--- -+ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 | | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | +--- -+ | OSDs in Secondary Role (Acting) | | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) | | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | +--- -+ | OSDs in All Roles (Acting) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 | | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | +--- -+ | OSDs in Primary Role (Up)
[ceph-users] Ceph repo - RSYNC?
Do any of the Ceph repositories run rsync? We generally mirror the repository locally so we don't encounter any unexpected upgrades. eu.ceph.com used to run this, but it seems to be down now. # rsync rsync://eu.ceph.com rsync: failed to connect to eu.ceph.com: Connection refused (111) rsync error: error in socket IO (code 10) at clientserver.c(124) [receiver=3.0.6] ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Understand RadosGW logs
Bump... On 2015-03-03 10:54:13 +, Daniel Schneller said:

Hi! After realizing the problem with log rotation (see http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708) and fixing it, I now for the first time have some meaningful (and recent) logs to look at. While from an application perspective there seem to be no issues, I would like to understand some messages I find with relatively high frequency in the logs:

Exhibit 1 -

2015-03-03 11:14:53.685361 7fcf4bfef700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:15:57.476059 7fcf39ff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:17:43.570986 7fcf25fcb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:22:00.881640 7fcf39ff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:22:48.147011 7fcf35feb700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:27:40.572723 7fcf50ff9700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:29:40.082954 7fcf36fed700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 11:30:32.204492 7fcf4dff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1

I cannot find anything relevant by Googling for that, apart from the actual line of code that produces this line. What does that mean? Is it an indication of data corruption or are there more benign reasons for this line?
Exhibit 2 -- Several of these blocks

2015-03-03 07:06:17.805772 7fcf36fed700 1 == starting new request req=0x7fcf5800f3b0 =
2015-03-03 07:06:17.836671 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-03-03 07:06:17.836758 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-03-03 07:06:17.836918 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243 part_ofs=0 rule->part_size=0
2015-03-03 07:06:18.263126 7fcf36fed700 1 == req done req=0x7fcf5800f3b0 http_status=200 ==
...
2015-03-03 09:27:29.855001 7fcf28fd1700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:27:29.866718 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866778 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866852 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.866917 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.875466 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.884434 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.906155 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.914364 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule->part_size=0
2015-03-03 09:27:29.940653 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024 part_ofs=0 rule->part_size=0
2015-03-03 09:27:30.272816 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.125773 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632 part_ofs=0 rule->part_size=0
2015-03-03 09:27:31.192661 7fcf28fd1700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1
2015-03-03 09:27:31.194481 7fcf28fd1700 1 == req done req=0x7fcf580102a0 http_status=200 ==
...
2015-03-03 09:28:43.008517 7fcf2a7d4700 1 == starting new request req=0x7fcf580102a0 =
2015-03-03 09:28:43.016414 7fcf2a7d4700 0 RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579 part_ofs=0 rule->part_size=0
2015-03-03 09:28:43.022387 7fcf2a7d4700 1 == req done req=0x7fcf580102a0 http_status=200 ==

First, what is the req= line? Is that a thread id? I am asking, because the same id is used over and over in the same file over time. More importantly, what do the RGWObjManifest::operator++(): ... lines mean? In the middle case above the block even ends with one of the ERROR lines mentioned before, but the HTTP status is still 200, suggesting a successful operation. Thanks in advance for shedding some light, because I would like to know if I need to take some action or at least keep an eye on these via monitoring? Cheers, Daniel
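If these errors mostly reflect clients aborting downloads, it can still be worth watching their rate rather than each occurrence. A small standard-library sketch (hypothetical, not an official tool) that tallies flush_read_list errors per hour bucket from a radosgw log:

```python
import re
from collections import Counter

ERR_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):.*ERROR: flush_read_list")

def errors_per_hour(lines):
    """Count flush_read_list errors per hour from radosgw log lines,
    e.g. to alert only when aborted client downloads spike."""
    hours = Counter()
    for line in lines:
        m = ERR_RE.match(line)
        if m:
            hours[m.group(1)] += 1
    return hours

log = [
    "2015-03-03 11:14:53.685361 7fcf4bfef700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1",
    "2015-03-03 11:15:57.476059 7fcf39ff3700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -1",
    "2015-03-03 11:16:00.000000 7fcf39ff3700 1 == req done req=0x1 http_status=200 ==",
]
print(errors_per_hour(log))  # -> Counter({'2015-03-03 11': 2})
```

Feeding it a whole log file is just `errors_per_hour(open("radosgw.log"))`.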
[ceph-users] Question about notification of OSD down in client side
Hello, Is there some way for a client (via the RADOS API or something like that) to get a notification of an event (for example, an OSD going down) happening in the cluster? -- Den
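There is no public push notification for cluster events in librados (watch/notify operates on objects, not on the osdmap), so a client typically polls. A sketch using the python-rados bindings (assumed installed; `parse_down_osds` is a hypothetical helper, and the osdmap JSON layout is assumed from recent releases):

```python
import json

try:
    import rados  # python-rados bindings; only needed for the live query
except ImportError:
    rados = None

def down_osds(cluster):
    """Ask the monitors for the current osdmap and return the ids of
    OSDs that are not marked up. `cluster` is a connected rados.Rados
    handle; poll this periodically, there is no callback for it."""
    cmd = json.dumps({"prefix": "osd dump", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b"")
    if ret != 0:
        raise RuntimeError(errs)
    return parse_down_osds(out)

def parse_down_osds(raw):
    # "osds" entries look like {"osd": 0, "up": 1, "in": 1, ...}
    osdmap = json.loads(raw)
    return [o["osd"] for o in osdmap["osds"] if not o["up"]]

sample = json.dumps({"osds": [{"osd": 0, "up": 1}, {"osd": 1, "up": 0}]})
print(parse_down_osds(sample))  # -> [1]
```

An alternative on the CLI side is to follow the cluster log (ceph -w style) and react to the OSD down messages there.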
Re: [ceph-users] Ceph User Teething Problems
Thank you all for such wonderful feedback. Thank you to John Spray for putting me on the right track. I now see that the cephfs aspect of the project is being de-emphasised, so that the manual deployment instructions tell how to set up the object store, and then the cephfs is a separate issue that needs to be explicitly set up and configured in its own right. So that explains why the cephfs pools are not created by default, and why the required cephfs pools are now referred to, not as 'data' and 'metadata', but 'cephfs_data' and 'cephfs_metadata'. I have created these pools, and created a new cephfs filesystem, and I can mount it without problem. This confirms my suspicion that the manual deployment pages are in need of review and revision. They still refer to three default pools. I am happy that this section should deal with the object store setup only, but I still think that the osd part is a bit confused and confusing, particularly with respect to what is done on which machine. It would then be useful to say something like "this completes the configuration of the basic store. If you wish to use cephfs, you must set up a metadata server, appropriate pools, and a cephfs filesystem (see http://...)." I was not trying to be smart or obscure when I made a brief and apparently dismissive reference to ceph-deploy. I railed against it and the demise of mkcephfs on this list at the point that mkcephfs was discontinued in the releases. That caused a few supportive responses at the time, so I know that I'm not alone. I did not wish to trawl over those arguments again unnecessarily. There is a principle that is being missed. The 'ceph' code contains everything required to set up and operate a ceph cluster. There should be documentation detailing how this is done. 'Ceph-deploy' is a separate thing. It is one of several tools that promise to make setting things up easy. However, my resistance is based on two factors.
If I recall correctly, it is one of those projects in which the configuration needs to know what 'distribution' is being used. (Presumably, this is to try to deduce where various things are located). So if one is not using one of these 'distributions', one is stuffed right from the start. Secondly, the challenge that we are trying to overcome is learning what the various ceph components need, and how they need to be set up and configured. I don't think that the don't worry your pretty little head about that, we have a natty tool to do it for you approach is particularly useful. So I am not knocking ceph-deploy, Travis, it is just that I do not believe that it is relevant or useful to me at this point in time. I see that Lionel Bouton seems to share my views here. In general, the ceph documentation (in my humble opinion) needs to be draughted with a keen eye on the required scope. Deal with ceph; don't let it get contaminated with 'ceph-deploy', 'upstart', 'systemd', or anything else that is not actually part of ceph. As an example, once you have configured your osd, you start it with: ceph-osd -i {osd-number} It is as simple as that! If it is required to start the osd automatically, then that will be done using sysvinit, upstart, systemd, or whatever else is being used to bring the system up in the first place. It is unnecessary and confusing to try to second-guess the environment in which ceph may be being used, and contaminate the documentation with such details. (Having said that, I see no problem with adding separate, helpful, sections such as Suggestions for starting using 'upstart', or Suggestions for starting using 'systemd'). So I would reiterate the point that the really important documentation is probably quite simple for an expert to produce. Just spell out what each component needs in terms of keys, access to keys, files, and so on. Spell out how to set everything up. 
Also how to change things after the event, so that 'trial and error' does not have to contain really expensive errors. Once we understand the fundamentals, getting fancy and efficient is a completely separate further goal, and is not really a responsibility of core ceph development. I have an inexplicable emotional desire to see ceph working well with btrfs, which I like very much and have been using since the very early days. Despite all the 'not ready for production' warnings, I adopted it with enthusiasm, and have never had cause to regret it, and only once or twice experienced a failure that was painful to me. However, as I have experimented with ceph over the years, it has been very clear that ceph seems to be the most ruthless stress test for it, and it has always broken quite quickly (I also used xfs for comparison). I have seen evidence of much work going into btrfs in the kernel development now that the lead developer has moved from Oracle to, I think, Facebook. I now share the view that I think Robert LeBlanc has, that maybe btrfs will now stand the ceph test. Thanks, Lincoln Bryant, for confirming
Re: [ceph-users] Ceph User Teething Problems
David, you will need to raise the limit on open files in the Linux system; check /etc/security/limits.conf. It is explained somewhere in the docs, and the autostart scripts 'fix' the issue for most people. When I did a manual deploy for the same reasons you are, I ran into this too.

Robert LeBlanc (sent from a mobile device, please excuse any typos)

On Mar 5, 2015 3:14 AM, Datatone Lists li...@datatone.co.uk wrote: Thank you all for such wonderful feedback. Thank you to John Spray for putting me on the right track. I now see that the cephfs aspect of the project is being de-emphasised, so that the manual deployment instructions describe how to set up the object store, and cephfs is a separate thing that needs to be explicitly set up and configured in its own right. That explains why the cephfs pools are not created by default, and why the required cephfs pools are now referred to not as 'data' and 'metadata' but as 'cephfs_data' and 'cephfs_metadata'. I have created these pools, created a new cephfs filesystem, and I can mount it without problem.

This confirms my suspicion that the manual deployment pages are in need of review and revision. They still refer to three default pools. I am happy that this section should deal with the object store setup only, but I still think that the osd part is a bit confused and confusing, particularly with respect to what is done on which machine. It would then be useful to say something like: this completes the configuration of the basic store; if you wish to use cephfs, you must set up a metadata server, appropriate pools, and a cephfs filesystem (see http://...).

I was not trying to be smart or obscure when I made a brief and apparently dismissive reference to ceph-deploy. I railed against it and the demise of mkcephfs on this list at the point that mkcephfs was discontinued in the releases. That caused a few supportive responses at the time, so I know that I'm not alone.
I did not wish to trawl over those arguments again unnecessarily. There is a principle that is being missed. The ceph code contains everything required to set up and operate a ceph cluster, and there should be documentation detailing how this is done. Ceph-deploy is a separate thing: it is one of several tools that promise to make setting things up easy. However, my resistance is based on two factors.
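The two fixes discussed in this thread can be sketched as follows. The pool names match the post; the nofile values, pg counts, mount point, monitor address, and secret file path are assumptions, and Hammer-era commands are assumed.

```shell
# Open-file limit: lines to add to /etc/security/limits.conf (values assumed),
# since each OSD process holds many descriptors; re-login for it to take effect.
#   *   soft   nofile   65536
#   *   hard   nofile   65536

# Explicit CephFS setup: the pools are no longer created by default, so create
# them and declare the filesystem. The pg count of 64 is an assumed example.
ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 64
ceph fs new cephfs cephfs_metadata cephfs_data   # metadata pool first, then data

# Mount it (monitor address and secret file are placeholders):
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
```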
[ceph-users] Journal size when using ceph-deploy to add a new osd
Hello everyone. Recently I have had a question about the ceph osd journal. I use ceph-deploy (version 1.4.0) to add a new osd, and my ceph version is 0.80.5. /dev/sdb is a sata disk and /dev/sdk is an ssd disk; the sdk1 partition size is 50G.

ceph-deploy osd prepare host1:/dev/sdb1:/dev/sdk1
ceph-deploy osd activate host1:/dev/sdb1:/dev/sdk1

After I ran these two commands, the new osd began to work. In my ceph.conf I do not set the osd journal path or the osd journal size; they are left at the defaults. Then I use the command

ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config show | grep osd_journal_size

to check via the osd admin socket, and I get the result: osd_journal_size: 5120. That means the journal size is 5GB, i.e. ceph-deploy did not make the journal fill my ssd partition. OK! Then I restart this osd and get the osd log:

... ...
2015-03-06 11:51:30.451245 7fd6c39df7a0 0 filestore(/var/lib/ceph/osd/ceph-11) mount: WRITEAHEAD journal mode explicitly enabled in conf
2015-03-06 11:51:30.454400 7fd6c39df7a0 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-03-06 11:51:30.454551 7fd6c39df7a0 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-03-06 11:51:30.505709 7fd6c39df7a0 0 cls cls/hello/cls_hello.cc:271: loading cls_hello
...

This means the journal size is 50GB, i.e. ceph-deploy did make the journal fill my ssd partition. So... which is correct? And if I set osd_journal_size after I have used ceph-deploy to add a new osd, will that setting take effect? Thanks very much!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
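A likely resolution of the question, as far as I understand FileStore journals (hedged, not confirmed in the thread): osd_journal_size is expressed in MB and applies when the journal is a plain file, whereas the OSD log reports the journal size in bytes. When the journal is a raw block-device partition, as ceph-deploy sets up here, the whole partition is used and osd_journal_size is ignored, so the 53687091200 bytes in the log (the full 50G of sdk1) is what is actually in use. A quick unit check in plain shell:

```shell
# The OSD log reports the journal in bytes; osd_journal_size is in MB.
journal_bytes=53687091200
echo "$(( journal_bytes / 1024 / 1024 )) MB"          # the journal as the log reports it, in MB
echo "$(( journal_bytes / 1024 / 1024 / 1024 )) GiB"  # the whole 50G sdk1 partition
echo "$(( 5120 / 1024 )) GiB"                         # the (unused) 5 GiB config default
```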