Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread George Mihaiescu
Terminate the connections on haproxy, which is great for SSL as well, and use 
these instructions to set QoS per connection and per data transferred:
http://blog.serverfault.com/2010/08/26/1016491873/
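
For illustration, a minimal haproxy sketch of the SSL-termination side is below; the certificate path, backend addresses and the radosgw port 7480 are assumptions, and the actual per-connection bandwidth shaping is what the linked post covers (this snippet only terminates SSL and tracks per-client connections with a stick table):

frontend rgw_ssl
    bind *:443 ssl crt /etc/haproxy/rgw.pem
    # track each client IP: current connections and outgoing byte rate
    stick-table type ip size 200k expire 60s store conn_cur,bytes_out_rate(10s)
    tcp-request connection track-sc0 src
    # example cap on concurrent connections per client IP
    tcp-request connection reject if { sc0_conn_cur gt 20 }
    default_backend rgw

backend rgw
    balance roundrobin
    server rgw1 192.168.0.11:7480 check
    server rgw2 192.168.0.12:7480 check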


 

> On May 4, 2017, at 04:35, hrchu  wrote:
> 
> Thanks for reply.
> 
> tc can only limit traffic on interfaces or for given IPs, but what I am talking about 
> is "per connection", e.g., each put object could be 5MB/s and each get object could 
> be 1MB/s.
> 
> Correct me if anything wrong.
> 
> 
> Regards,
> 
> Chu, Hua-Rong (曲華榮), +886-3-4227151 #57968
> Networklab, Computer Science & Information Engineering,
> National Central University, Jhongli, Taiwan R.O.C.
> 
>> On Thu, May 4, 2017 at 4:01 PM, Marc Roos  wrote:
>> 
>> 
>> 
>> No experience with it. But why not use Linux for it? Maybe this solution
>> on every RGW is sufficient; I cannot imagine you need a 3rd party tool for
>> this.
>> 
>> https://unix.stackexchange.com/questions/28198/how-to-limit-network-bandwidth
>> https://wiki.archlinux.org/index.php/Advanced_traffic_control
>> 
>> 
>> 
>> -Original Message-
>> From: hrchu [mailto:petertc@gmail.com]
>> Sent: donderdag 4 mei 2017 9:24
>> To: Ceph Users
>> Subject: [ceph-users] Limit bandwidth on RadosGW?
>> 
>> Hi all,
>> I want to limit RadosGW per-connection upload/download speed for QoS.
>> There is no built-in option for this, so maybe a 3rd party reverse proxy
>> in front of Radosgw is needed. Does anyone have experience with this?
>> 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-05-02 Thread George Mihaiescu
One problem that I can see with this setup is that you will fill up the SSDs 
holding the primary replica before the HDDs, if they are very different in 
size.

Other than that, it's a very inventive solution to increase read speeds without 
using a possibly buggy cache configuration.
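
As a quick way to watch for that imbalance, per-OSD and per-root utilization can be compared with (assuming separate ssd and hdd CRUSH roots):

ceph osd df tree
# compare the %USE column under the ssd root against the hdd root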



> On Apr 20, 2017, at 05:25, Richard Hesketh  
> wrote:
> 
>> On 19/04/17 21:08, Reed Dier wrote:
>> Hi Maxime,
>> 
>> This is a very interesting concept. Instead of the primary affinity being 
>> used to choose an SSD for the primary copy, you set the crush rule to first choose an 
>> osd in the ‘ssd-root’, then the ‘hdd-root’ for the second set.
>> 
>> And with 'step chooseleaf first {num}’
>>> If {num} > 0 && < pool-num-replicas, choose that many buckets. 
>> So 1 chooses that bucket
>>> If {num} < 0, it means pool-num-replicas - {num}
>> And -1 means it will fill remaining replicas on this bucket.
>> 
>> This is a very interesting concept, one I had not considered.
>> Really appreciate this feedback.
>> 
>> Thanks,
>> 
>> Reed
>> 
>>> On Apr 19, 2017, at 12:15 PM, Maxime Guyot  wrote:
>>> 
>>> Hi,
>>> 
> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
 1:4-5 is common but depends on your needs and the devices in question, ie. 
 assuming LFF drives and that you aren’t using crummy journals.
>>> 
>>> You might be speaking about different ratios here. I think that Anthony is 
>>> speaking about the journal:OSD ratio and Reed about the capacity ratio between 
>>> the HDD and SSD tiers/roots. 
>>> 
>>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
>>> HDD), like Richard says you’ll get much better random read performance with 
>>> primary OSD on SSD but write performance won’t be amazing since you still 
>>> have 2 HDD copies to write before ACK. 
>>> 
>>> I know the doc suggests using primary affinity, but since it’s an OSD-level 
>>> setting it does not play well with other storage tiers, so I searched for 
>>> other options. From what I have tested, a rule that selects the 
>>> first/primary OSD from the ssd-root then the rest of the copies from the 
>>> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD 
>>> selected will be primary.
>>> 
>>> “rule hybrid {
>>> ruleset 2
>>> type replicated
>>> min_size 1
>>> max_size 10
>>> step take ssd-root
>>> step chooseleaf firstn 1 type host
>>> step emit
>>> step take hdd-root
>>> step chooseleaf firstn -1 type host
>>> step emit
>>> }”
>>> 
>>> Cheers,
>>> Maxime
> 
> FWIW splitting my HDDs and SSDs into two separate roots and using a crush 
> rule to first choose a host from the SSD root and take remaining replicas on 
> the HDD root was the way I did it, too. By inspection, it did seem that all 
> PGs in the pool had an SSD for a primary, so I think this is a reliable way 
> of doing it. You would of course end up with an acting primary on one of the 
> slow spinners for a brief period if you lost an SSD for whatever reason and 
> it needed to rebalance.
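
A rough way to double-check this (the rule id and replica count below are assumptions matching the example rule earlier in the thread) is to test the compiled CRUSH map offline and compare it with what the PGs report:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-mappings | head
# the first OSD in each mapping should belong to ssd-root
ceph pg dump pgs_brief      # check the UP_PRIMARY / ACTING_PRIMARY columns per PG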
> 
> The only downside is that if you have your SSD and HDD OSDs on the same 
> physical hosts I'm not sure how you set up your failure domains and rules to 
> make sure that you don't take an SSD primary and HDD replica on the same 
> host. In my case, SSDs and HDDs are on different hosts, so it didn't matter 
> to me.
> -- 
> Richard Hesketh
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maintaining write performance under a steady intake of small objects

2017-05-02 Thread George Mihaiescu
Hi Patrick,

You could add more RAM to the servers, which would probably not increase the 
cost too much.

You could change the swappiness value or use something like 
https://hoytech.com/vmtouch/ to pre-cache inode entries.
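
A minimal sketch of those two suggestions (the sysctl values and paths below are illustrative assumptions, not tested recommendations):

sysctl vm.swappiness=10
sysctl vm.vfs_cache_pressure=1
# walk the filestore trees so dentries/inodes are pulled into cache without reading data
find /var/lib/ceph/osd -xdev -printf ''
# or pre-load file pages themselves with vmtouch
vmtouch -t /var/lib/ceph/osd/ceph-*/current/meta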

You could tarball the smaller files before loading them into Ceph maybe.

How are the ten clients accessing Ceph by the way?

> On May 1, 2017, at 14:23, Patrick Dinnen  wrote:
> 
> One additional detail, we also did filestore testing using Jewel and saw 
> substantially similar results to those on Kraken.
> 
>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen  wrote:
>> Hello Ceph-users,
>> 
>> Florian has been helping with some issues on our proof-of-concept cluster, 
>> where we've been experiencing these issues. Thanks for the replies so far. I 
>> wanted to jump in with some extra details.
>> 
>> All of our testing has been with scrubbing turned off, to remove that as a 
>> factor.
>> 
>> Our use case requires a Ceph cluster to indefinitely store ~10 billion files 
>> 20-60KB in size. We’ll begin with 4 billion files migrated from a legacy 
>> storage system. Ongoing writes will be handled by ~10 client machines and 
>> come in at a fairly steady 10-20 million files/day. Every file (excluding 
>> the legacy 4 billion) will be read once by a single client within hours of 
>> its initial write to the cluster. Future file read requests will come from 
>> a single server and with a long-tail distribution, with popular files read 
>> thousands of times a year but most read never or virtually never.
>> 
>> 
>> Our “production” design has 6 nodes with 24 OSDs (expandable to 48 OSDs) and SSD 
>> journals at a 1:4 ratio to HDDs. Each node looks like this:
>> 2 x E5-2660 8-core Xeons
>> 64GB RAM DDR-3 PC1600
>> 10Gb ceph-internal network (SFP+) 
>> LSI 9210-8i controller (IT mode)
>> 4 x OSD 8TB HDDs, mix of two types
>> Seagate ST8000DM002
>> HGST HDN728080ALE604
>> Mount options = xfs (rw,noatime,attr2,inode64,noquota) 
>> 1 x SSD journal Intel 200GB DC S3700
>> 
>> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a 
>> replication level 2. We’re using rados bench to shotgun a lot of files into 
>> our test pools. Specifically following these two steps: 
>> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 
>> 5
>> rados -p poolofhopes bench -t 32 -b 2 3000 write --no-cleanup
>> 
>> We leave the bench running for days at a time and watch the objects in 
>> cluster count. We see performance that starts off decent and degrades over 
>> time. There’s a very brief initial surge in write performance after which 
>> things settle into the downward trending pattern.
>> 
>> 1st hour - 2 million objects/hour
>> 20th hour - 1.9 million objects/hour 
>> 40th hour - 1.7 million objects/hour
>> 
>> This performance is not encouraging for us. We need to be writing 40 million 
>> objects per day (20 million files, each stored twice). The rates we’re seeing 
>> at the 40th hour of our bench would be sufficient to achieve that. Those 
>> write rates are still falling, though, and we’re only at a fraction of the 
>> number of objects in cluster that we need to handle. So, the trend in 
>> performance suggests we shouldn’t count on having the write performance we 
>> need for too long.
>> 
>> If we repeat the process of creating a new pool and running the bench the 
>> same pattern holds, good initial performance that gradually degrades.
>> 
>> https://postimg.org/image/ovymk7n2d/
>> [caption:90 million objects written to a brand new, pre-split pool 
>> (poolofhopes). There are already 330 million objects on the cluster in other 
>> pools.]
>> 
>> Our working theory is that the degradation over time may be related to inode 
>> or dentry lookups that miss cache and lead to additional disk reads and seek 
>> activity. There’s a suggestion that filestore directory splitting may 
>> exacerbate that problem as additional/longer disk seeks occur related to 
>> what’s in which XFS allocation group. We have found pre-split pools useful 
>> in one major way, they avoid periods of near-zero write performance that we 
>> have put down to the active splitting of directories (the "thundering herd" 
>> effect). The overall downward curve seems to remain the same whether we 
>> pre-split or not.
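
For reference, the filestore options usually involved here (the values below are illustrative assumptions, not the poster's settings) can be raised in ceph.conf so that splitting happens later and less often:

[osd]
filestore merge threshold = 40
filestore split multiple = 8
# a directory splits at roughly: split multiple * abs(merge threshold) * 16 objects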
>> 
>> The thundering herd seems to be kept in check by an appropriate pre-split. 
>> Bluestore may or may not be a solution, but uncertainty and stability within 
>> our fairly tight timeline don't recommend it to us. Right now our big 
>> question is "how can we avoid the gradual degradation in write performance 
>> over time?". 
>> 
>> Thank you, Patrick
>> 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] osds down after upgrade hammer to jewel

2017-03-27 Thread George Mihaiescu
Make sure the OSD processes on the Jewel node are running. If you didn't change 
the ownership to user ceph, they won't start.
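
A quick sketch of checking and fixing that for one OSD (the OSD id and the systemd unit name are assumptions; upstart-based installs use a different start command):

ls -ln /var/lib/ceph/osd/ceph-3          # files should be owned by ceph:ceph on Jewel
chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
systemctl start ceph-osd@3
systemctl status ceph-osd@3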


> On Mar 27, 2017, at 11:53, Jaime Ibar  wrote:
> 
> Hi all,
> 
> I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6.
> 
> The ceph cluster has 3 servers (one mon and one mds each) and another 6 
> servers with
> 12 osds each.
> The monitors and mds have been successfully upgraded to the latest jewel 
> release, however
> after upgrading the first osd server (12 osds), ceph is not aware of them and
> they are marked as down
> 
> ceph -s
> 
> cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
> health HEALTH_WARN
> [...]
>12/72 in osds are down
>noout flag(s) set
> osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
>flags noout
> [...]
> 
> ceph osd tree
> 
> 3   3.64000 osd.3  down  1.0 1.0
> 8   3.64000 osd.8  down  1.0 1.0
> 14   3.64000 osd.14 down  1.0 1.0
> 18   3.64000 osd.18 down  1.0  1.0
> 21   3.64000 osd.21 down  1.0  1.0
> 28   3.64000 osd.28 down  1.0  1.0
> 31   3.64000 osd.31 down  1.0  1.0
> 37   3.64000 osd.37 down  1.0  1.0
> 42   3.64000 osd.42 down  1.0  1.0
> 47   3.64000 osd.47 down  1.0  1.0
> 51   3.64000 osd.51 down  1.0  1.0
> 56   3.64000 osd.56 down  1.0  1.0
> 
> If I run this command on one of the down osds
> ceph osd in 14
> osd.14 is already in.
> However, ceph doesn't mark it as up and the cluster health remains
> in a degraded state.
> 
> Do I have to upgrade all the osds to jewel first?
> Any help as I'm running out of ideas?
> 
> Thanks
> Jaime
> 
> -- 
> 
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-14 Thread George Mihaiescu
Hi,

We initially upgraded from Hammer to Jewel while keeping the ownership
unchanged, by adding "setuser match path =
/var/lib/ceph/$type/$cluster-$id" to ceph.conf.
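
For reference, that option looks like this in ceph.conf (placed in [global] here as an assumption; it applies to any daemon type via $type):

[global]
setuser match path = /var/lib/ceph/$type/$cluster-$id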


Later, we used the following steps to change from running as root to
running as ceph.

On the storage nodes, we ran the following command that doesn't change
permissions, but caches the filesystem (based on
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-November/006013.html
)

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1
chown -R root:root

Set noout:
ceph osd set noout

On Storage node:
Edited "/etc/ceph/ceph.conf" and commented out #setuser match path =
/var/lib/ceph/$type/$cluster-$id
stop ceph-osd-all
find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1
chown -R ceph:ceph
chown -R ceph:ceph /var/lib/ceph/
start ceph-osd-all

Check that all the Ceph OSD processes are running:
ps aux | grep ceph | egrep -v grep

Unset "noout":
ceph osd unset noout

Wait till ceph is healthy again and continue with the next storage node.

The OSDs were down for about 2 min because we ran the find command
beforehand and used xargs with 12 parallel processes, so recovery time was quick
as well.

We have more than 850 OSDs and the entire process went pretty smooth by
doing one storage server at a time.



On Tue, Mar 14, 2017 at 3:27 AM, Richard Arends 
wrote:

> On 03/13/2017 02:02 PM, Christoph Adomeit wrote:
>
> Christoph,
>
> Thanks for the detailed upgrade report.
>>
>> We have another scenario: We have allready upgraded to jewel 10.2.6 but
>> we are still running all our monitors and osd daemons as root using the
>> setuser match path directive.
>>
>> What would be the recommended way to have all daemons running as
>> ceph:ceph user ?
>>
>> Could we chown -R the monitor and osd data directories under
>> /var/lib/ceph one by one while keeping up service ?
>>
>
> Yes. To minimize the down time, you can do the chown twice. Once before
> restarting the daemons, while they are running with root user permissions.
> Then stop the daemons, do the chown again, but then only on the changed
> files (find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph)
> and start the Ceph daemons with setuser and setgroup set to ceph
>
>
>
> --
> With regards,
>
> Richard Arends.
> Snow BV / http://snow.nl
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw leaking data, orphan search loop

2017-02-28 Thread George Mihaiescu
Hi Yehuda,

I've run the "radosgw-admin orphans find" command again, but captured its
output this time.

There are both "shadow" files and "multipart" files detected as leaked.

leaked:
default.34461213.1__multipart_data/d2a14aeb-a384-51b1-8704-fe76a9a6f5f5.-j0vqDrC0wr44bii2ytrtpcrlnspSyE.44
leaked:
default.34461213.1__multipart_data/dda8a18a-1d99-50b7-a397-b876811bdf94.Mpjqa2RwKirM9Ae_HTsttGiJEHAvdxc.113
leaked:
default.34461213.1__multipart_data/dda8a18a-1d99-50b7-a397-b876811bdf94.bRtExqhbdw_J0gcT-xADwUBQJFOjKmG.111
leaked:
default.34461213.1__multipart_data/f080cdc1-0826-5ac9-a9f7-21700ceeebf3.BqR5aBMpJDmO1U5xKSrEe3EvPmEHNq8.96
leaked:
default.34461213.1__multipart_data/f793a1ec-5c7d-5b59-845a-9d280325bb25.LcCH4ia_LwWV4MyVzwhTv_PAxrpSPpM.52
leaked:
default.34461213.1__multipart_data/f793a1ec-5c7d-5b59-845a-9d280325bb25.gbrMfo0bWww2nEe2x4LL146BwtLMkA6.37
leaked:
default.34461213.1__multipart_data/f793a1ec-5c7d-5b59-845a-9d280325bb25.rIZlATigZEwP6FVW66m-YhcmgIiJihM.48
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_1
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_2
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_3
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_4
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_5
leaked:
default.34461213.1__shadow_data/0181bbd5-4202-57a0-a1f3-07007d043660.2~gy8NGkx7YmMzHwv8_ordh7u_TNk7_4c.1_6
leaked:
default.34461213.1__shadow_data/02aca392-6d6b-536c-ae17-fdffe164e05a.2~lXP-3WDlbF5MSPYuE7JHLNK1z1hr1Y4.100_1
leaked:
default.34461213.1__shadow_data/02aca392-6d6b-536c-ae17-fdffe164e05a.2~lXP-3WDlbF5MSPYuE7JHLNK1z1hr1Y4.100_10
leaked:
default.34461213.1__shadow_data/02aca392-6d6b-536c-ae17-fdffe164e05a.2~lXP-3WDlbF5MSPYuE7JHLNK1z1hr1Y4.100_11

I deleted both the multipart and shadow leaked files for one of the S3
objects, and then the object couldn't be retrieved anymore.

I deleted just the shadow leaked files for another S3 object, and then that
object couldn't be retrieved anymore either.

I think the "radosgw-admin orphans find" command still doesn't work as
expected; is there anything else I can do?
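
Before removing anything else, a safer per-object workflow (following Yehuda's earlier suggestion; the backup pool name and the object/bucket names are placeholders) would be to copy each suspected orphan aside, remove it, and then verify the S3 object still downloads:

ceph osd pool create .rgw.buckets.backup 64
rados -p .rgw.buckets --target-pool=.rgw.buckets.backup cp <leaked-rados-object> <leaked-rados-object>
rados -p .rgw.buckets rm <leaked-rados-object>
s3cmd get s3://<bucket>/<key> /dev/null      # if this now fails, copy the object back from the backup pool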

Thank you,
George



On Fri, Feb 24, 2017 at 1:22 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:

> oid is object id. The orphan find command generates a list of objects
> that needs to be removed at the end of the run (if finishes
> successfully). If you didn't catch that, you should be able to still
> run the same scan (using the same scan id) and retrieve that info
> again.
>
> Yehuda
>
> On Fri, Feb 24, 2017 at 9:48 AM, George Mihaiescu <lmihaie...@gmail.com>
> wrote:
> > Hi Yehuda,
> >
> > Thank you for the quick reply.
> >
> > What is the oid you're referring to that I should back up and then
> > delete?
> > I extracted the files from the ".log" pool where the "orphan find" tool
> > stored the results, but they are zero bytes files.
> >
> >
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.52
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.58
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.000122
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.57
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.53
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.20
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.25
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.0
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.2
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.linked.19
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.38
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.18
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.92
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.000108
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.13
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.linked.20
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.18
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.11
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.50
> > -rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.33
> >
> >
> > George
> >
> >
> >
> > On Fri, Feb 24, 2017 at 12:12 PM, Yehuda Sadeh-Weinraub 

Re: [ceph-users] rgw leaking data, orphan search loop

2017-02-24 Thread George Mihaiescu
Hi Yehuda,

Thank you for the quick reply.

What is the oid you're referring to that I should back up and then delete?
I extracted the files from the ".log" pool where the "orphan find" tool
stored the results, but they are zero bytes files.


-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.52
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.58
-rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.000122
-rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.57
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.53
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.20
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.25
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.0
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.2
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.linked.19
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.38
-rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.18
-rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.92
-rw-r--r-- 1 root root 0 Feb 24 12:45 obj_delete_at_hint.000108
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.13
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.linked.20
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.18
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.bck1.rados.11
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.rados.50
-rw-r--r-- 1 root root 0 Feb 24 12:45 orphan.scan.orphans.buckets.33


George



On Fri, Feb 24, 2017 at 12:12 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:

> Hi,
>
> we wanted to have more confidence in the orphans search tool before
> providing functionality that actually removes the objects. One thing
> that you can do is create a new pool, copy these objects to the new
> pool (as a backup, rados -p <pool> --target-pool=<backup-pool>
> cp <obj> <obj>), and remove these objects (rados -p <pool> rm <obj>).
> Then when you're confident enough that this didn't break existing
> objects, you can remove the backup pool.
>
> Yehuda
>
> On Fri, Feb 24, 2017 at 8:23 AM, George Mihaiescu <lmihaie...@gmail.com>
> wrote:
> > Hi,
> >
> > I updated http://tracker.ceph.com/issues/18331 with my own issue, and I
> am
> > hoping Orit or Yehuda could give their opinion on what to do next.
> > What was the purpose of the "orphan find" tool and how to actually clean
> up
> > these files?
> >
> > Thank you,
> > George
> >
> >
> > On Fri, Jan 13, 2017 at 2:22 PM, Wido den Hollander <w...@42on.com>
> wrote:
> >>
> >>
> >> > Op 24 december 2016 om 13:47 schreef Wido den Hollander <
> w...@42on.com>:
> >> >
> >> >
> >> >
> >> > > Op 23 december 2016 om 16:05 schreef Wido den Hollander
> >> > > <w...@42on.com>:
> >> > >
> >> > >
> >> > >
> >> > > > Op 22 december 2016 om 19:00 schreef Orit Wasserman
> >> > > > <owass...@redhat.com>:
> >> > > >
> >> > > >
> >> > > > HI Maruis,
> >> > > >
> >> > > > On Thu, Dec 22, 2016 at 12:00 PM, Marius Vaitiekunas
> >> > > > <mariusvaitieku...@gmail.com> wrote:
> >> > > > > On Thu, Dec 22, 2016 at 11:58 AM, Marius Vaitiekunas
> >> > > > > <mariusvaitieku...@gmail.com> wrote:
> >> > > > >>
> >> > > > >> Hi,
> >> > > > >>
> >> > > > >> 1) I've written before into mailing list, but one more time. We
> >> > > > >> have big
> >> > > > >> issues recently with rgw on jewel. because of leaked data - the
> >> > > > >> rate is
> >> > > > >> about 50GB/hour.
> >> > > > >>
> >> > > > >> We've hitted these bugs:
> >> > > > >> rgw: fix put_acls for objects starting and ending with
> underscore
> >> > > > >> (issue#17625, pr#11669, Orit Wasserman)
> >> > > > >>
> >> > > > >> Upgraded to jewel 10.2.5 - no luck.
> >> > > > >>
> >> > > > >> Also we've hitted this one:
> >> > > > >> rgw: RGW loses realm/period/zonegroup/zone data: period
> >> > > > >> overwritten if
> >> > > > >> somewhere in the cluster is still running Hammer (issue#17371,
> >> > > > >> pr#115

Re: [ceph-users] rgw leaking data, orphan search loop

2017-02-24 Thread George Mihaiescu
Hi,

I updated http://tracker.ceph.com/issues/18331 with my own issue, and I am
hoping Orit or Yehuda could give their opinion on what to do next.
What was the purpose of the "orphan find" tool and how to actually clean up
these files?
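
For context, the scan itself is typically run along these lines (the job id and shard count are arbitrary choices; --pool must point at the RGW data pool):

radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphans1 --num-shards=64
radosgw-admin orphans list-jobs
# the scan state can be cleaned up afterwards with:
radosgw-admin orphans finish --job-id=orphans1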

Thank you,
George


On Fri, Jan 13, 2017 at 2:22 PM, Wido den Hollander  wrote:

>
> > Op 24 december 2016 om 13:47 schreef Wido den Hollander :
> >
> >
> >
> > > Op 23 december 2016 om 16:05 schreef Wido den Hollander  >:
> > >
> > >
> > >
> > > > Op 22 december 2016 om 19:00 schreef Orit Wasserman <
> owass...@redhat.com>:
> > > >
> > > >
> > > > HI Maruis,
> > > >
> > > > On Thu, Dec 22, 2016 at 12:00 PM, Marius Vaitiekunas
> > > >  wrote:
> > > > > On Thu, Dec 22, 2016 at 11:58 AM, Marius Vaitiekunas
> > > > >  wrote:
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >> 1) I've written before into mailing list, but one more time. We
> have big
> > > > >> issues recently with rgw on jewel. because of leaked data - the
> rate is
> > > > >> about 50GB/hour.
> > > > >>
> > > > >> We've hit these bugs:
> > > > >> rgw: fix put_acls for objects starting and ending with underscore
> > > > >> (issue#17625, pr#11669, Orit Wasserman)
> > > > >>
> > > > >> Upgraded to jewel 10.2.5 - no luck.
> > > > >>
> > > > >> Also we've hit this one:
> > > > >> rgw: RGW loses realm/period/zonegroup/zone data: period
> overwritten if
> > > > >> somewhere in the cluster is still running Hammer (issue#17371,
> pr#11519,
> > > > >> Orit Wasserman)
> > > > >>
> > > > >> Fixed zonemaps - also no luck.
> > > > >>
> > > > >> We do not use multisite - only default realm, zonegroup, zone.
> > > > >>
> > > > >> We have no more ideas, how these data leak could happen. gc is
> working -
> > > > >> we can see it in rgw logs.
> > > > >>
> > > > >> Maybe, someone could give any hint about this? Where should we
> look?
> > > > >>
> > > > >>
> > > > >> 2) Another story is about removing all the leaked/orphan objects.
> > > > >> radosgw-admin orphans find enters the loop state on stage when it
> starts
> > > > >> linking objects.
> > > > >>
> > > > >> We've tried to change the number of shards to 16, 64 (default),
> 512. At
> > > > >> the moment it's running with shards number 1.
> > > > >>
> > > > >> Again, any ideas how to make orphan search happen?
> > > > >>
> > > > >>
> > > > >> I could provide any logs, configs, etc. if someone is ready to
> help on
> > > > >> this case.
> > > > >>
> > > > >>
> > > >
> > > > How many buckets do you have? How many objects are in each?
> > > > Can you provide the output of rados ls -p .rgw.buckets ?
> > >
> > > Marius asked me to look into this for him, so I did.
> > >
> > > What I found is that at *least* three buckets have way more RADOS
> objects than they should.
> > >
> > > The .rgw.buckets pool has 35.651.590 objects totaling 76880G.
> > >
> > > I listed all objects in the .rgw.buckets pool and summed them per
> bucket, the top 5:
> > >
> > >  783844 default.25918901.102486
> > >  876013 default.25918901.3
> > > 3325825 default.24201682.7
> > > 6324217 default.84795862.29891
> > > 7805208 default.25933378.233873
> > >
> > > So I started to rados_stat() (using Python) all the objects in the
> last three buckets. While these stat() calls are still running, I statted
> about 30% of the objects and their total size is already 17511GB/17TB.
> > >
> > > size_kb_actual summed up for bucket default.24201682.7,
> default.84795862.29891 and default.25933378.233873 sums up to 12TB.
> > >
> > > So I'm currently at 30% of statting the objects and I'm already 5TB
> over the total size of these buckets.
> > >
> >
> > The stat calls have finished. The grand total is 65TB.
> >
> > So while the buckets should consume only 12TB, they seem to occupy 65TB
> of storage.
> >
> > > What I noticed is that it's mainly *shadow* objects which are all 4MB
> in size.
> > >
> > > I know that 'radosgw-admin orphans find --pool=.rgw.buckets
> --job-id=xyz' should also do this for me, but as mentioned, this keeps
> looping and hangs.
> > >
> >
> > I started this tool about 20 hours ago:
> >
> > # radosgw-admin orphans find --pool=.rgw.buckets --job-id=wido1
> --debug-rados=10 2>&1|gzip > orphans.find.wido1.log.gz
> >
> > It now shows me this in the logs while it is still running:
> >
> > 2016-12-24 13:41:00.989876 7ff6844d29c0 10 librados: omap-set-vals
> oid=orphan.scan.wido1.linked.27 nspace=
> > 2016-12-24 13:41:00.993271 7ff6844d29c0 10 librados: Objecter returned
> from omap-set-vals r=0
> > storing 2 entries at orphan.scan.wido1.linked.28
> > 2016-12-24 13:41:00.993311 7ff6844d29c0 10 librados: omap-set-vals
> oid=orphan.scan.wido1.linked.28 nspace=
> > storing 1 entries at orphan.scan.wido1.linked.31
> > 2016-12-24 13:41:00.995698 7ff6844d29c0 10 librados: Objecter returned
> from omap-set-vals r=0
> > 2016-12-24 13:41:00.995787 7ff6844d29c0 10 librados: omap-set-vals
> 

Re: [ceph-users] Help with the Hammer to Jewel upgrade procedure without losing write access to the buckets

2017-01-26 Thread George Mihaiescu
Hi Mohammed,

Thanks for the hint. I think I remember seeing this when Jewel came out, but I 
assumed it must be a mistake, or a mere recommendation rather than a mandatory 
requirement, because I have always upgraded the OSDs last.

Today I upgraded my OSD nodes in the test environment to Jewel and regained 
write access to the buckets.
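
As a sanity check once the OSDs are done (generic commands, not specific to this cluster), the running daemon versions can be compared before pointing the Jewel radosgw at the cluster:

ceph tell osd.* version
ceph daemon mon.$(hostname -s) version     # run on each monitor host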

In production we have multiple RGW nodes behind load balancers, so we can 
upgrade them one at a time.

If we have to upgrade all OSD nodes first (which takes much longer considering 
there are many more of them) while the old Hammer RGW cannot talk to a Jewel cluster, 
then it means one cannot perform a live upgrade of Ceph, which I think breaks 
the promise of a large, distributed, always-on storage system...

Now I'll have to test what happens with the cinder volumes attached to a Hammer 
cluster that's being upgraded to Jewel, and if upgrading the Ceph packages on 
the compute nodes to Jewel will require a restart of the VMs or reboot of the 
servers.

Thank you again for your help,
George


> On Jan 25, 2017, at 19:10, Mohammed Naser <mna...@vexxhost.com> wrote:
> 
> George,
> 
> I believe the supported upgrade order is monitors, OSDs, metadata servers and 
> finally object gateways.
> 
> I would suggest trying the supported path; if you’re still having issues *with* 
> the correct upgrade sequence, I would look further into it.
> 
> Thanks
> Mohammed
> 
>> On Jan 25, 2017, at 6:24 PM, George Mihaiescu <lmihaie...@gmail.com> wrote:
>> 
>> 
>> Hi,
>> 
>> I need your help with upgrading our cluster from Hammer (last version) to 
>> Jewel 10.2.5 without losing write access to Radosgw.
>> 
>> We have a fairly large cluster (4.3 PB raw) mostly used to store large S3 
>> objects, and we currently have more than 500 TB of data in the 
>> ".rgw.buckets" pool, so I'm very cautious about upgrading it to Jewel. 
>> The plan is to upgrade Ceph-mon and Radosgw to 10.2.5, while keeping the OSD 
>> nodes on Hammer, then slowly update them as well.
>> 
>> 
>> I am currently testing the upgrade procedure in a lab environment, but once 
>> I update ceph-mon and radosgw to Jewel, I cannot upload files into new or 
>> existing buckets anymore, but I can still create new buckets.
>> 
>> 
>> I read [1], [2], [3] and [4] and even ran the script in [4] as it can be 
>> seen below, but still cannot upload new objects.
>> 
>> I was hoping that if I wait long enough to update from Hammer to Jewel, most 
>> of the big issues will be solved by point releases, but it seems that I'm 
>> doing something wrong, probably because of lack of up to date documentation.
>> 
>> 
>> 
>> After the update to Jewel, this is how things look in my test environment.
>> 
>> root@ceph-mon1:~# radosgw zonegroup get
>> 
>> root@ceph-mon1:~# radosgw-admin period get
>> period init failed: (2) No such file or directory
>> 2017-01-25 10:13:06.941018 7f98f0d13900  0 RGWPeriod::init failed to init 
>> realm  id  : (2) No such file or directory
>> 
>> root@ceph-mon1:~# radosgw-admin zonegroup get
>> failed to init zonegroup: (2) No such file or directory
>> 
>> root@ceph-mon1:~# ceph --version
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> 
>> root@ceph-mon1:~# radosgw-admin realm list
>> {
>> "default_info": "",
>> "realms": []
>> }
>> 
>> root@ceph-mon1:~# radosgw-admin period list
>> {
>> "periods": []
>> }
>> 
>> root@ceph-mon1:~# radosgw-admin period get
>> period init failed: (2) No such file or directory
>> 2017-01-25 12:26:07.217986 7f97ca82e900  0 RGWPeriod::init failed to init 
>> realm  id  : (2) No such file or directory
>> 
>> root@ceph-mon1:~# radosgw-admin zonegroup get --rgw-zonegroup=default
>> {
>> "id": "default",
>> "name": "default",
>> "api_name": "",
>> "is_master": "true",
>> "endpoints": [],
>> "hostnames": [],
>> "hostnames_s3website": [],
>> "master_zone": "default",
>> "zones": [
>> {
>> "id": "default",
>> "name": "default",
>> "endpoints": [],
>> "log_meta": "false",
>> "log_data": "false",
>> "bucket_index_max_shards": 0,
>>   

[ceph-users] Help with the Hammer to Jewel upgrade procedure without losing write access to the buckets

2017-01-25 Thread George Mihaiescu
Hi,

I need your help with upgrading our cluster from Hammer (last version) to
Jewel 10.2.5 without losing write access to Radosgw.

We have a fairly large cluster (4.3 PB raw) mostly used to store large S3
objects, and we currently have more than 500 TB of data in the
".rgw.buckets" pool, so I'm very cautious about upgrading it to Jewel.
The plan is to upgrade Ceph-mon and Radosgw to 10.2.5, while keeping the
OSD nodes on Hammer, then slowly update them as well.


I am currently testing the upgrade procedure in a lab environment, but once
I update ceph-mon and radosgw to Jewel, I cannot upload files into new or
existing buckets anymore, but I can still create new buckets.


I read [1], [2], [3] and [4] and even ran the script in [4] as it can be
seen below, but still cannot upload new objects.

I was hoping that if I waited long enough to update from Hammer to Jewel,
most of the big issues would be solved by point releases, but it seems that
I'm doing something wrong, probably because of a lack of up-to-date
documentation.



After the update to Jewel, this is how things look in my test environment.

root@ceph-mon1:~# radosgw zonegroup get

root@ceph-mon1:~# radosgw-admin period get
period init failed: (2) No such file or directory
2017-01-25 10:13:06.941018 7f98f0d13900  0 RGWPeriod::init failed to init
realm  id  : (2) No such file or directory

root@ceph-mon1:~# radosgw-admin zonegroup get
failed to init zonegroup: (2) No such file or directory

root@ceph-mon1:~# ceph --version
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

root@ceph-mon1:~# radosgw-admin realm list
{
"default_info": "",
"realms": []
}

root@ceph-mon1:~# radosgw-admin period list
{
"periods": []
}

root@ceph-mon1:~# radosgw-admin period get
period init failed: (2) No such file or directory
2017-01-25 12:26:07.217986 7f97ca82e900  0 RGWPeriod::init failed to init
realm  id  : (2) No such file or directory

root@ceph-mon1:~# radosgw-admin zonegroup get --rgw-zonegroup=default
{
"id": "default",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "default",
"zones": [
{
"id": "default",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": ""
}

root@ceph-mon1:~# radosgw-admin zone get --zone-id=default
{
"id": "default",
"name": "default",
"domain_root": ".rgw",
"control_pool": ".rgw.control",
"gc_pool": ".rgw.gc",
"log_pool": ".log",
"intent_log_pool": ".intent-log",
"usage_log_pool": ".usage",
"user_keys_pool": ".users",
"user_email_pool": ".users.email",
"user_swift_pool": ".users.swift",
"user_uid_pool": ".users.uid",
"system_key": {
"access_key": "",
"secret_key": ""
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": ".rgw.buckets.index",
"data_pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra",
"index_type": 0
}
}
],
"metadata_heap": ".rgw.meta",
"realm_id": ""
}

root@ceph-mon1:~# rados df
pool name KB  objects   clones degraded
unfound   rdrd KB   wrwr KB
.log   0  1270
004140241275414020
.rgw   4   140
00  147  117   35   14
.rgw.buckets   1163540
004 4969   3811637
.rgw.buckets.index0   560
00 1871 1815  1190
.rgw.control   080
000000
.rgw.gc0   320
00 5214 5182 35190
.rgw.meta  280
0000   208
.rgw.root  240
00   72   48   128
.usage 020
00   87   87  1740
.users.uid 140
00  104   96   442
rbd000
000   

Re: [ceph-users] RGW: Delete orphan period for non-existent realm

2016-10-27 Thread George Mihaiescu
Are these problems fixed in the latest version of the Debian packages?

I'm a fairly large user with a lot of existing data stored in .rgw.buckets 
pool, and I'm running Hammer.

I just hope that upgrading to Jewel so long after its release will not cause 
loss of access to this data for my users, and I won't have to fix a broken 
radosgw by running manual commands like this one.

Can you please confirm that this issue was fixed already?

Thank you,
George

> On Oct 27, 2016, at 14:38, Orit Wasserman  wrote:
> 
> On Thu, Oct 27, 2016 at 12:30 PM, Richard Chan
>  wrote:
>> Hi Cephers,
>> 
>> In my period list I am seeing an orphan period
>> 
>> {
>>"periods": [
>>"24dca961-5761-4bd1-972b-685a57e2fcf7:staging",
>>"a5632c6c4001615e57e587c129c1ad93:staging",
>>"fac3496d-156f-4c09-9654-179ad44091b9"
>>]
>> }
>> 
>> 
>> The realm a5632c6c4001615e57e587c129c1ad93 no longer exists.
>> 
>> How do I clean this up?
>> 
>> radosgw-admin --cluster flash period delete --realm-id
>> a5632c6c4001615e57e587c129c1ad93
>> missing period id
>> 
> 
> you need to provide period id not realm id for this command.
> 
> try:
> radosgw-admin --cluster flash period delete --period <period-id>
> 
>> doesn't work as it wants a period id, instead of deleting by realm-id
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Richard Chan
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-08 Thread George Mihaiescu
Look in the cinder DB, in the volumes table, to find the UUID of the deleted 
volume. 

If you go through your OSDs and look at the PG directories for pool index 20, you 
might find some fragments from the deleted volume, but it's a long shot...
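
A sketch of that first step (assuming the usual cinder MySQL database and an Icehouse-era schema):

mysql -u cinder -p cinder -e "SELECT id, display_name, size, deleted_at FROM volumes WHERE deleted = 1;"
# the deleted RBD image would have been named volume-<id> in the volumes pool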

> On Aug 8, 2016, at 4:39 PM, Georgios Dimitrakakis  
> wrote:
> 
> Dear David (and all),
> 
> the data are considered very critical therefore all this attempt to recover 
> them.
> 
> Although the cluster hasn't been fully stopped, all user actions have. I mean 
> services are running but users are not able to read/write/delete.
> 
> The deleted image was the exact same size of the example (500GB) but it 
> wasn't the only one deleted today. Our user was trying to do a "massive" 
> cleanup by deleting 11 volumes and unfortunately one of them was very 
> important.
> 
> Let's assume that I "dd" all the drives; what further actions should I take to 
> recover the files? Could you please elaborate a bit more on the phrase "If 
> you've never deleted any other rbd images and assuming you can recover data 
> with names, you may be able to find the rbd objects"??
> 
> Do you mean that if I know the file names I can go through and check for 
> them? How?
> Do I have to know *all* the file names, or can I find all the existing data 
> by searching for just a few of them?
> 
> Thanks a lot for taking the time to answer my questions!
> 
> All the best,
> 
> G.
> 
>> I don't think there's a way of getting the prefix from the cluster at
>> this point.
>> 
>> If the deleted image was a similar size to the example you've given,
>> you will likely have had objects on every OSD. If this data is
>> absolutely critical you need to stop your cluster immediately or make
>> copies of all the drives with something like dd. If you've never
>> deleted any other rbd images and assuming you can recover data with
>> names, you may be able to find the rbd objects.
>> 
>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>> 
> Hi,
> 
> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
> 
>>> Hi,
>>> 
 On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
 
 Dear all,
 
 I would like your help with an emergency issue but first
 let me describe our environment.
 
 Our environment consists of 2OSD nodes with 10x 2TB HDDs
 each and 3MON nodes (2 of them are the OSD nodes as well)
 all with ceph version 0.80.9
 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
 
 This environment provides RBD volumes to an OpenStack
 Icehouse installation.
 
 Although not a state of the art environment is working
 well and within our expectations.
 
 The issue now is that one of our users accidentally
 deleted one of the volumes without keeping its data first!
 
 Is there any way (since the data are considered critical
 and very important) to recover them from CEPH?
>>> 
>>> Short answer: no
>>> 
>>> Long answer: no, but
>>> 
>>> Consider the way Ceph stores data... each RBD is striped
>>> into chunks
>>> (RADOS objects with 4MB size by default); the chunks are
>>> distributed
>>> among the OSDs with the configured number of replicates
>>> (probably two
>>> in your case since you use 2 OSD hosts). RBD uses thin
>>> provisioning,
>>> so chunks are allocated upon first write access.
>>> If an RBD is deleted all of its chunks are deleted on the
>>> corresponding OSDs. If you want to recover a deleted RBD,
>>> you need to
>>> recover all individual chunks. Whether this is possible
>>> depends on
>>> your filesystem and whether the space of a former chunk is
>>> already
>>> assigned to other RADOS objects. The RADOS object names are
>>> composed
>>> of the RBD name and the offset position of the chunk, so if
>>> an
>>> undelete mechanism exists for the OSD's filesystem, you have
>>> to be
>>> able to recover files by their filename, otherwise you might
>>> end up
>>> mixing the content of various deleted RBDs. Due to the thin
>>> provisioning there might be some chunks missing (e.g. never
>>> allocated
>>> before).
>>> 
>>> Given the fact that
>>> - you probably use XFS on the OSDs since it is the
>>> preferred
>>> filesystem for OSDs (there is RDR-XFS, but I've never had to
>>> use it)
>>> - you would need to stop the complete ceph cluster
>>> (recovery tools do
>>> not work on mounted filesystems)
>>> - your cluster has been in use after the RBD was deleted
>>> and thus
>>> parts of its former space might already have been
>>> overwritten
>>> (replication might help you here, since there are two OSDs
>>> to try)
>>> - XFS undelete does not work well on fragmented files (and
>>> OSDs tend
>>> to introduce 

Re: [ceph-users] 2 networks vs 2 NICs

2016-06-04 Thread George Mihaiescu
One benefit of separate networks is that you can graph the client vs 
replication traffic.
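
For reference, the two networks are just ceph.conf settings, so a single NIC with a VLAN interface works; a minimal sketch (the subnets are placeholders):

[global]
public network  = 192.168.10.0/24
cluster network = 192.168.20.0/24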

> On Jun 4, 2016, at 12:12 PM, Nick Fisk  wrote:
> 
> Yes, this is fine. I currently use 2 bonded 10G nics which have the untagged 
> vlan as the public network and a tagged vlan as the cluster network.
> 
> However, when I build my next cluster I will probably forgo the separate 
> cluster network and just run them over the same IP, as after running the 
> cluster, I don't see any benefit from separate networks when taking into 
> account the extra complexity. Something to consider.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Adrian Sevcenco
>> Sent: 04 June 2016 16:11
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] 2 networks vs 2 NICs
>> 
>> Hi! I have seen in discussions and in the documentation that "networks" is used
>> interchangeably with "NIC" (which is also a different thing than an interface) 
>> ..
>> So, my question is: for an OSD server with 24 OSDs and a single 40 Gb NIC,
>> would it be ok to have the public network on the main interface and a vlan
>> (virtual) interface for the cluster network?
>> 
>> Thank you!
>> Adrian
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best CLI or GUI client for Ceph and S3 protocol

2016-05-25 Thread George Mihaiescu
We use the AWS CLI.
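
A minimal sketch of pointing it at radosgw (the endpoint URL, keys and bucket are placeholders):

aws configure set aws_access_key_id <ACCESS_KEY>
aws configure set aws_secret_access_key <SECRET_KEY>
aws --endpoint-url https://rgw.example.com s3 ls
aws --endpoint-url https://rgw.example.com s3 cp ./file.txt s3://mybucket/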



> On May 25, 2016, at 5:11 PM, Andrey Ptashnik  wrote:
> 
> Team,
> 
> I wanted to ask if some of you are using CLI or GUI based S3 browsers/clients 
> with Ceph and what are the best ones?
> 
> Regards,
> 
> Andrey Ptashnik
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-18 Thread George Mihaiescu
Hi Blair,

We use 36 OSDs nodes with journals on HDD running in a 90% object storage
cluster.
The servers have 128 GB RAM and 40 cores (HT) for the storage nodes with 4
TB SAS drives, and 256 GB and 48 cores for the storage nodes with 6 TB SAS
drives.
We use 2x10 Gb bonded for the client network, and 2x10 Gb bonded for the
replication traffic. The drives are 7.2K RPM, 12 Gb/s SAS drives connected to
LSI 9300-8i 12 Gb/s HBAs.

We increased read-ahead on the drives to 8192, and we are using 64 MB for
rgw_obj_stripe_size because of the specific workload needs.
We have MTU 9000 set on the interfaces and are using "bond-xmit_hash_policy
layer3+4" to better distribute the traffic among the physical links.

The cluster has now 3.2 PB raw (50% used), and performs really well, with
no resources being strained.

Cheers,
George



On Wed, May 18, 2016 at 1:54 AM, Blair Bethwaite 
wrote:

> Hi all,
>
> What are the densest node configs out there, and what are your
> experiences with them and tuning required to make them work? If we can
> gather enough info here then I'll volunteer to propose some upstream
> docs covering this.
>
> At Monash we currently have some 32-OSD nodes (running RHEL7), though
> 8 of those OSDs are not storing or doing much yet (in a quiet EC'd RGW
> pool), the other 24 OSDs are serving RBD and at perhaps 65% full on
> average - these are 4TB drives.
>
> Aside from the already documented pid_max increases that are typically
> necessary just to start all OSDs, we've also had to up
> nf_conntrack_max. We've hit issues (twice now) that seem (have not
> figured out exactly how to confirm this yet) to be related to kernel
> dentry slab cache exhaustion - symptoms were a major slow down in
> performance and slow requests all over the place on writes, watching
> OSD iostat would show a single drive hitting 90+% util for ~15s with a
> bunch of small reads and no writes. These issues were worked around by
> tuning up filestore split and merge thresholds, though if we'd known
> about this earlier we'd probably have just bumped up the default
> object size so that we simply had fewer objects (and/or rounded up the
> PG count to the next power of 2). We also set vfs_cache_pressure to 1,
> though this didn't really seem to do much at the time. I've also seen
> recommendations about setting min_free_kbytes to something higher
> (currently 90112 on our hardware) but have not verified this.
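
For reference, the knobs mentioned above are plain sysctls; a sketch with example values (the numbers are illustrative assumptions only):

sysctl kernel.pid_max=4194303
sysctl net.netfilter.nf_conntrack_max=1048576
sysctl vm.vfs_cache_pressure=1
sysctl vm.min_free_kbytes=262144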
>
> --
> Cheers,
> ~Blairo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Web based S3 client

2016-05-02 Thread George Mihaiescu
Hi Can,

I gave it a try and I can see my buckets, but I get an error (see attached)
when trying to see the contents of any bucket.

The application is pretty simplistic now, and it would be great if support
for object and size counts were added. The bucket access type (private/public)
should be displayed, and bucket renaming supported.


Thanks,
George

On Thu, Apr 28, 2016 at 10:29 PM, Can Zhang(张灿)  wrote:

> Hi,
>
> In order to help new users to get hands on S3, we developed a web based S3
> client called “Sree”, and hope to see if it could become part of Ceph.
> Currently we host the project at:
>
> https://github.com/cannium/Sree
>
> Users could use Sree to manage their files in browser, through Ceph’s S3
> interface. I think it’s more friendly for new users than s3cmd, and would
> help Ceph to hit more users.
>
> Any suggestions are welcomed. Hope to see your replies.
>
>
> Cheers,
> Can ZHANG
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rack weight imbalance

2016-02-23 Thread George Mihaiescu
Thank you Greg, much appreciated.

I'll test with the crush tool to see if it complains about this new layout.

George
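
For reference, such an offline check could look like this (the rule number and replica count are assumptions for the rgw pool):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-utilization
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings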

On Mon, Feb 22, 2016 at 3:19 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Mon, Feb 22, 2016 at 9:29 AM, George Mihaiescu <lmihaie...@gmail.com>
> wrote:
> > Hi,
> >
> > We have a fairly large Ceph cluster (3.2 PB) that we want to expand and
> we
> > would like to get your input on this.
> >
> > The current cluster has around 700 OSDs (4 TB and 6 TB) in three racks
> with
> > the largest pool being rgw and using a replica 3.
> > For non-technical reasons (budgetary, etc) we are considering getting
> three
> > more racks, but initially adding only two storage nodes with 36 x 8 TB
> > drives in each, which will basically cause the rack weights to be
> imbalanced
> > (three racks with weight around a 1000 and 288 OSDs, and three racks with
> > weight around 500 but only 72 OSDs)
> >
> > The one replica per rack CRUSH rule will cause existing data to be
> > re-balanced among all six racks, with OSDs in the new racks getting only
> a
> > proportionate amount of replicas.
> >
> > Do you see any possible problems with this approach? Should Ceph be able
> to
> > properly rebalance the existing data among racks with imbalanced weights?
> >
> > Thank you for your input and please let me know if you need additional
> info.
>
> This should be okay; you have multiple racks in each size and aren't
> trying to replicate a full copy to each rack individually. You can
> test it ahead of time with the crush tool, though:
> http://docs.ceph.com/docs/master/man/8/crushtool/
> It may turn out you're using old tunables and want to update them
> first or something.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rack weight imbalance

2016-02-22 Thread George Mihaiescu
Hi,

We have a fairly large Ceph cluster (3.2 PB) that we want to expand and we
would like to get your input on this.

The current cluster has around 700 OSDs (4 TB and 6 TB) in three racks with
the largest pool being rgw and using a replica 3.
For non-technical reasons (budgetary, etc) we are considering getting three
more racks, but initially adding only two storage nodes with 36 x 8 TB
drives in each, which will basically cause the rack weights to be
imbalanced (three racks with weight around a 1000 and 288 OSDs, and three
racks with weight around 500 but only 72 OSDs)

The one replica per rack CRUSH rule will cause existing data to be
re-balanced among all six racks, with OSDs in the new racks getting only a
proportionate amount of replicas.

Do you see any possible problems with this approach? Should Ceph be able to
properly rebalance the existing data among racks with imbalanced weights?

Thank you for your input and please let me know if you need additional info.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg repair behavior? (Was: Re: getting rid of misplaced objects)

2016-02-17 Thread George Mihaiescu
We have three replicas, so we just performed md5sum on all of them in order
to find the correct ones, then we deleted the bad file and ran pg repair.
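
For reference, a sketch of that manual procedure on filestore OSDs (the PG id, object name and paths are placeholders):

ceph health detail | grep inconsistent        # find the inconsistent PG, e.g. 1.2f
# on each host holding a replica, checksum the object's file:
find /var/lib/ceph/osd/ceph-*/current/1.2f_head -name '*<object-name>*' -exec md5sum {} \;
# stop the OSD with the bad copy, remove (or move aside) that file, restart the OSD, then:
ceph pg repair 1.2f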
On 15 Feb 2016 10:42 a.m., "Zoltan Arnold Nagy" 
wrote:

> Hi Bryan,
>
> You were right: we’ve modified our PG weights a little (from 1 to around
> 0.85 on some OSDs) and once I’ve changed them back to 1, the remapped PGs
> and misplaced objects were gone.
> So thank you for the tip.
>
> For the inconsistent ones and scrub errors, I’m a little wary to use pg
> repair as that - if I understand correctly - only copies the primary PG’s
> data to the other PGs thus can easily corrupt the whole object if the
> primary is corrupted.
>
> I haven’t seen an update on this since last May where this was brought up
> as a concern from several people and there were mentions of adding
> checksumming to the metadata and doing a checksum-comparison on repair.
>
> Can anybody update on the current status on how exactly pg repair works in
> Hammer or will work in Jewel?
>
> > On 11 Feb 2016, at 22:17, Stillwell, Bryan 
> wrote:
> >
> > What does 'ceph osd tree' look like for this cluster?  Also have you done
> > anything special to your CRUSH rules?
> >
> > I've usually found this to be caused by modifying OSD weights a little
> too
> > much.
> >
> > As for the inconsistent PG, you should be able to run 'ceph pg repair' on
> > it:
> >
> >
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#
> > pgs-inconsistent
> >
> >
> > Bryan
> >
> > On 2/11/16, 11:21 AM, "ceph-users on behalf of Zoltan Arnold Nagy"
> >  zol...@linux.vnet.ibm.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Are there any tips and tricks around getting rid of misplaced objects? I
> >> did check the archive but didn¹t find anything.
> >>
> >> Right now my cluster looks like this:
> >>
> >> pgmap v43288593: 16384 pgs, 4 pools, 45439 GB data, 10383 kobjects
> >>   109 TB used, 349 TB / 458 TB avail
> >>   330/25160461 objects degraded (0.001%)
> >>   31280/25160461 objects misplaced (0.124%)
> >>  16343 active+clean
> >> 40 active+remapped
> >>  1 active+clean+inconsistent
> >>
> >> This is how it has been for a while and I thought for sure that the
> >> misplaced would converge down to 0, but nevertheless, it didn¹t.
> >>
> >> Any pointers on how I could get it back to all active+clean?
> >>
> >> Cheers,
> >> Zoltan
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > 
> >
> > This E-mail and any of its attachments may contain Time Warner Cable
> proprietary information, which is privileged, confidential, or subject to
> copyright belonging to Time Warner Cable. This E-mail is intended solely
> for the use of the individual or entity to which it is addressed. If you
> are not the intended recipient of this E-mail, you are hereby notified that
> any dissemination, distribution, copying, or action taken in relation to
> the contents of and attachments to this E-mail is strictly prohibited and
> may be unlawful. If you have received this E-mail in error, please notify
> the sender immediately and permanently delete the original and any copy of
> this E-mail and any printout.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of OSD map versions

2015-12-01 Thread George Mihaiescu
Thanks Dan,

I'll use these ones from Infernalis:


[global]
osd map message max = 100

[osd]
osd map cache size = 200
osd map max advance = 150
osd map share max epochs = 100
osd pg epoch persisted max stale = 150


George

On Mon, Nov 30, 2015 at 4:20 PM, Dan van der Ster <d...@vanderster.com>
wrote:

> I wouldn't run with those settings in production. That was a test to
> squeeze too many OSDs into too little RAM.
>
> Check the values from infernalis/master. Those should be safe.
>
> --
> Dan
> On 30 Nov 2015 21:45, "George Mihaiescu" <lmihaie...@gmail.com> wrote:
>
>> Hi,
>>
>> I've read the recommendation from CERN about the number of OSD maps (
>> https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf,
>> page 3) and I would like to know if there is any negative impact from these
>> changes:
>>
>> [global]
>> osd map message max = 10
>>
>> [osd]
>> osd map cache size = 20
>> osd map max advance = 10
>> osd map share max epochs = 10
>> osd pg epoch persisted max stale = 10
>>
>>
>> We are running Hammer with nowhere close to 7000 OSDs, but I don't want
>> to waste memory on OSD maps which are not needed.
>>
>> Are there any large production deployments running with these or similar
>> settings?
>>
>> Thank you,
>> George
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Number of OSD map versions

2015-11-30 Thread George Mihaiescu
Hi,

I've read the recommendation from CERN about the number of OSD maps (
https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf, page
3) and I would like to know if there is any negative impact from these
changes:

[global]
osd map message max = 10

[osd]
osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10


We are running Hammer with nowhere close to 7000 OSDs, but I don't want to
waste memory on OSD maps which are not needed.

Are there any large production deployments running with these or similar
settings?

Thank you,
George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw broken files

2015-11-11 Thread George Mihaiescu
Hi,

I have created a bug report for an issue affecting our Ceph Hammer
environment, and I was wondering if anybody has some input on what we can
do to troubleshoot/fix it:

http://tracker.ceph.com/issues/13764

Thank you,
George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com