Re: [ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
Yes, that too. The main reason, though, is a temporarily missing connection from
the Ceph nodes to the package repo - it will take some days or weeks to reconnect.
The client nodes can connect and update.

Thanks,

Lukáš

On Tue, Feb 14, 2017 at 6:56 PM, Shinobu Kinjo  wrote:

> On Wed, Feb 15, 2017 at 2:18 AM, Lukáš Kubín 
> wrote:
> > Hi,
> > I'm most probably hitting bug http://tracker.ceph.com/issues/13755 -
> when
> > libvirt mounted RBD disks suspend I/O during snapshot creation until hard
> > reboot.
> >
> > My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
> > (OpenStack/KVM computes) run v0.94.5. Can I still update the client
> packages
> > (librbd1 and dependencies) to a patched release 0.94.7, while keeping the
> > cluster on v0.94.3?
>
> The latest hammer is v0.94.9, and hammer will be EOL this spring.
> Why do you want to keep v0.94.3? Is it because you just want to avoid
> any risks regarding upgrading packages?
>
> >
> > I realize it's not ideal but does it present any risk? Can I assume that
> > patching the client is sufficient to resolve the mentioned bug?
> >
> > Ceph cluster nodes can't receive updates currently and this will stay so
> for
> > some time still, but I need to resolve the snapshot bug urgently.
> >
> > Greetings,
> >
> > Lukáš
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-14 Thread Benjeman Meekhof
Hi all,

We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
(11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
mons to Kraken.

After updating the ceph packages I restarted the 60 OSDs on the box with
'systemctl restart ceph-osd.target'.  Very soon after, the system CPU
load flat-lines at 100%, with top showing all of that being system load
from ceph-osd processes.  Not long after, we get OSD flapping due to
the load on the system (noout was set before starting this, but perhaps
unset too quickly after the restart).

This was causing problems in the cluster, so we rebooted the box.  The
OSDs don't start up/mount automatically - not a new problem on this
setup.  We run 'ceph-disk activate $disk' on a list of all the
/dev/dm-X devices as output by 'ceph-disk list'.  Everything activates
and the CPU gradually climbs back to a solid 100%.  No OSDs
have joined the cluster yet, so it isn't causing issues.

I leave the box overnight... by the time I leave I see that 1-2 OSDs on
this box are marked up/in.  By morning all are in, the CPU is fine, and the
cluster is still fine.

This is not a show-stopping issue now that I know what happens, though
it means upgrades are a several-hour or overnight affair.  On the next box I
will just mark all the OSDs out before updating and restarting them, or
try leaving them up but make sure noout is set to avoid flapping
while they churn.
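Roughly the sequence I have in mind for the next attempt (untested as a
whole on this cluster, so treat it as a sketch rather than a recipe):

  ceph osd set noout                      # keep the cluster from marking OSDs out
  yum update ceph ceph-osd                # or however your repo delivers 11.2.0
  systemctl restart ceph-osd.target       # or restart a few OSDs at a time
  # ...wait for the OSDs to come back up/in and the PG upgrade churn to finish...
  ceph osd unset noout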

Here's a log snippet from one currently spinning in the startup
process since 11am.  This is the second box we did, the first
experience being as detailed above.  Could this have anything to do
with the 'PGs are upgrading' message?

2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load lua
2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map
has features 288514119978713088, adjusting msgr requires for clients
2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map
has features 288514394856620032 was 8705, adjusting msgr requires for
mons
2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map
has features 288514394856620032, adjusting msgr requires for osds
2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs
opened 148 pgs
2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op
queue with priority op cut off at 64.
2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493
log_to_monitors {default=true}
2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with
init, starting boot process
(logs stop here, cpu spinning)


regards,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy and debian stretch 9

2017-02-14 Thread Zorg

Hello,

Debian stretch is almost stable, so I wanted to deploy Ceph Jewel on it.

but with

ceph-deploy new mynode

I have this error

[ceph_deploy][ERROR ] UnsupportedPlatform: Platform is not supported: 
debian  9.0



I know I can cheat by changing /etc/debian_version to 8.0, but I'm sure
there is a better way to do it.


Thanks for your help


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD client newer than cluster

2017-02-14 Thread Shinobu Kinjo
On Wed, Feb 15, 2017 at 2:18 AM, Lukáš Kubín  wrote:
> Hi,
> I'm most probably hitting bug http://tracker.ceph.com/issues/13755 - when
> libvirt mounted RBD disks suspend I/O during snapshot creation until hard
> reboot.
>
> My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
> (OpenStack/KVM computes) run v0.94.5. Can I still update the client packages
> (librbd1 and dependencies) to a patched release 0.94.7, while keeping the
> cluster on v0.94.3?

The latest hammer is v0.94.9, and hammer will be EOL this spring.
Why do you want to keep v0.94.3? Is it because you just want to avoid
any risks regarding upgrading packages?

>
> I realize it's not ideal but does it present any risk? Can I assume that
> patching the client is sufficient to resolve the mentioned bug?
>
> Ceph cluster nodes can't receive updates currently and this will stay so for
> some time still, but I need to resolve the snapshot bug urgently.
>
> Greetings,
>
> Lukáš
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] admin_socket: exception getting command descriptions

2017-02-14 Thread Vince

Hi Liuchang0812,

Thank you for replying to the thread.

I have corrected this issue. It was due to incorrect ownership of
/var/lib/ceph. It was owned by root, and I changed it to ceph ownership
to resolve this.
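For the record, the fix was along these lines (assuming a default package
layout on the monitor node):

  chown -R ceph:ceph /var/lib/ceph
  # then re-run the deploy step, e.g.:
  ceph-deploy mon create-initial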


However, I am seeing a new error while preparing the OSDs. Any idea
about this?


===

[osd-ceph2][INFO  ] Running command: sudo /bin/ceph --cluster=ceph osd 
stat --format=json


[osd-ceph2][WARNIN] No data was received after 300 seconds, disconnecting...
[ceph_deploy.osd][DEBUG ] Host osd-ceph2 is now ready for osd use.

===

Full log

===

[root@admin-ceph mycluster]# ceph-deploy osd prepare 
osd-ceph2:/var/local/osd0
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.37): /usr/bin/ceph-deploy osd 
prepare osd-ceph2:/var/local/osd0

[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  disk  : 
[('osd-ceph2', '/var/local/osd0', None)]

[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  bluestore : None
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   : 
/etc/ceph/dmcrypt-keys

[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 


[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : xfs
[ceph_deploy.cli][INFO  ]  func  : at 0x25d2b90>

[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
osd-ceph2:/var/local/osd0:

[osd-ceph2][DEBUG ] connection detected need for sudo
[osd-ceph2][DEBUG ] connected to host: osd-ceph2
[osd-ceph2][DEBUG ] detect platform information from remote host
[osd-ceph2][DEBUG ] detect machine type
[osd-ceph2][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to osd-ceph2
[osd-ceph2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host osd-ceph2 disk /var/local/osd0 
journal None activate False

[osd-ceph2][DEBUG ] find the location of an executable
[osd-ceph2][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v prepare 
--cluster ceph --fs-type xfs -- /var/local/osd0
[osd-ceph2][WARNIN] command: Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[osd-ceph2][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-allows-journal -i 0 --cluster ceph
[osd-ceph2][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-wants-journal -i 0 --cluster ceph
[osd-ceph2][WARNIN] command: Running command: /usr/bin/ceph-osd 
--check-needs-journal -i 0 --cluster ceph
[osd-ceph2][WARNIN] command: Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[osd-ceph2][WARNIN] populate_data_path: Preparing osd data dir 
/var/local/osd0
[osd-ceph2][WARNIN] command: Running command: /sbin/restorecon -R 
/var/local/osd0/ceph_fsid.16899.tmp
[osd-ceph2][WARNIN] command: Running command: /usr/bin/chown -R 
ceph:ceph /var/local/osd0/ceph_fsid.16899.tmp
[osd-ceph2][WARNIN] command: Running command: /sbin/restorecon -R 
/var/local/osd0/fsid.16899.tmp
[osd-ceph2][WARNIN] command: Running command: /usr/bin/chown -R 
ceph:ceph /var/local/osd0/fsid.16899.tmp
[osd-ceph2][WARNIN] command: Running command: /sbin/restorecon -R 
/var/local/osd0/magic.16899.tmp
[osd-ceph2][WARNIN] command: Running command: /usr/bin/chown -R 
ceph:ceph /var/local/osd0/magic.16899.tmp

[osd-ceph2][INFO  ] checking OSD status...
[osd-ceph2][DEBUG ] find the location of an executable
[osd-ceph2][INFO  ] Running command: sudo /bin/ceph --cluster=ceph osd 
stat --format=json


[osd-ceph2][WARNIN] No data was received after 300 seconds, disconnecting...
[ceph_deploy.osd][DEBUG ] Host osd-ceph2 is now ready for osd use.
===



On 02/12/2017 10:23 AM, liuchang0812 wrote:

Hi, Vince

What's your Ceph version?

` ceph --version`

2017-02-11 19:10 GMT+08:00 Vince :

Hi,

We are getting the below error while trying to create monitor from the admin
node.

ceph-deploy mon create-initial


[osd-ceph1][INFO  ] Running command: ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.osd-ceph1.asok mon_status
[osd-ceph1][ERROR ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory


I couldn't see /var/run/ceph/ceph-mon.osd-ceph1.aso in the monitor 

Re: [ceph-users] CephFS : minimum stripe_unit ?

2017-02-14 Thread John Spray
On Tue, Feb 14, 2017 at 11:38 AM, Florent B  wrote:
> Hi everyone,
>
> I use Ceph-fuse on a Jewel cluster.
>
> I would like to set stripe_unit to 8192 on a directory, but it seems not to
> be possible:
>
> # setfattr -n ceph.dir.layout.stripe_unit -v "8192" maildata1
> setfattr: maildata1: Invalid argument
>
> I know object_size must be a multiple of stripe_unit. I didn't change
> object_size, and the default value seems to be 4194304.
>
> And 4194304 is a multiple of 8192, isn't it?
>
> Is there a minimum value ?

Yes -- this is a constant called CEPH_MIN_STRIPE_UNIT, which is equal to 65536.
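So the smallest value that should be accepted is 65536 - as a quick sketch
using the same directory from above:

  setfattr -n ceph.dir.layout.stripe_unit -v 65536 maildata1

(object_size still has to be a multiple of whatever stripe_unit you choose.)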

John

>
> Thank you.
>
> Florent
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mail from guimark

2017-02-14 Thread guimark
unsubscribe ceph-users
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Dongsheng Yang

Hi Sage and all,
We are going to use SSDs for cache in ceph. But I am not sure which 
one is the best solution, bcache? flashcache? or cache tier?


I found there are some CAUTION notes on ceph.com about cache tiering. Is cache
tiering already production ready, especially for rbd?


thanx in advance.
Yang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To backup or not to backup the classic way - How to backup hundreds of TB?

2017-02-14 Thread Nick Fisk
Hardware failures are just one possible cause. If you value your data you will
have a backup, preferably going to some sort of removable media that can be
taken offsite, like those things that everybody keeps saying are dead… what
are they called… oh yeah, tapes. :) An online copy of your data on some sort of
large JBOD or 2nd Ceph cluster is a good idea if you need faster access, but I
wouldn't rely on it for my only backup.

 

There are many things that can cause data loss; failing hardware is just one.
As can be seen from many posts on this list, bugs in Ceph or user error are a
much more common cause of data loss, and triple replication won't protect you
from them. Thought should also be given to malicious actions by internal staff
with grievances or external attackers (e.g. ransomware). In these cases even online
backups like rsync etc. might not protect you, as that data can be accessed and
deleted at the same time as the live data. I predict these sorts of incidents
will become more common in the near future.

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
, 
Sent: 14 February 2017 09:56
To: Götz Reinicke 
Cc: ceph new 
Subject: Re: [ceph-users] To backup or not to backup the classic way - How to 
backup hundreds of TB?

 

Hello!

 

  The answer pretty much depends on your fears. If you are afraid of hardware
failures, you could have more than the standard 3 copies, configure your failure
domain properly and so on. If you are afraid of some big disaster that could hurt all
of your hardware, you could consider making an async replica to a cluster in
another datacenter on another continent. If you are afraid of some kind of cluster
software issue, then you can build another cluster and use third-party
tools to back up data there, but as you correctly noticed it will not be too
convenient.

 

As a common solution I would suggest using the same cluster for backups as
well (maybe just a different pool/OSD tree with less expensive drives) - in
most cases it's enough.


Best regards,

Vladimir

 

2017-02-14 14:15 GMT+05:00 Götz Reinicke  >:

Hi,

I guess that's a question that pops up in different places, but I could not 
find any which fits to my thoughts.

Currently we are starting to use Ceph for file shares of the films produced by our
students and some xen/vmware VMs. The VM data is already backed up; the films'
original footage is stored in other places.

We are starting with some 100TB of RBD and mount smb/NFS shares from the clients.
Maybe we will look into CephFS soon.

The question is: how would someone handle a backup of 100 TB of data? Rsyncing
that to another system or having a commercial backup solution doesn't look that
good, e.g. regarding the price.

One thought is: is there some sort of best practice in the Ceph world, e.g.
replicating to another physically independent cluster? Or using more replicas,
OSDs and nodes and doing snapshots in one cluster?

Having production data and backups on the same hardware currently doesn't make me
feel that good either… But the world changes :)

Long story short: How do you do backup hundreds of TB?

Curious for suggestions and thoughts .. Thanks and Regards . Götz


___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to repair MDS damage?

2017-02-14 Thread John Spray
On Tue, Feb 14, 2017 at 9:33 AM, Oliver Schulz  wrote:
> Dear Ceph Experts,
>
> after upgrading our Ceph cluster from Hammer to Jewel,
> the MDS (after a few days) found some metadata damage:
>
># ceph status
>[...]
>health HEALTH_ERR
>  mds0: Metadata damage detected
>[...]
>
> The output of
>
># ceph tell mds.0 damage ls
>
> is:
>
>[
>   {
>  "ino" : [...],
>  "id" : [...],
>  "damage_type" : "backtrace"
>   },
>   [...]
>]
>
> There are 5 such "damage_type" : "backtrace" entries in total.
>
> I'm not really surprised, there were a very few instances in
> the past where one or two (mostly empty) directories and
> symlinks acted strangely and couldn't be deleted
> ("rm" results in "Invalid argument"). Back then, I moved them
> all in a "quarantine" directory, but wasn't able to do anything
> about it.
>
> Now that CephFS does more rigorous checks and has spotted
> the trouble - how do I go about repairing this?

From Kraken onwards, backtraces can be repaired using "ceph daemon
mds.  recursive repair" on the path containing the primary
dentry for the file (i.e. for hardlinks, the place the file was
originally created).

To identify the path that contains the inode, you can either do an
exhaustive search using `find` (there is an argument that lets you
search by inode number), or try searching your mds logs to see the
point where it found the damage, where it may have printed the path.
However, the path in use when it detected the damage may have been a
remote link (i.e. a hardlink), which wouldn't be the path you want.

You can work around this for pre-Kraken MDS versions by changing the
file's name or immediate parentage (i.e. rename or move the file),
then use a "ceph daemon mds. flush journal" to force it to flush
out the new backtrace immediately.

Once you believe a new backtrace is written, use the "damage rm"
command to clear the damage entry, and try accessing the file via a
hardlink again to see if it's working now.
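As a rough sketch of that pre-Kraken workaround (the mount point and
placeholders here are illustrative; the damage id and inode number come from
the "damage ls" output):

  # find the file by inode number, then rename it to change its primary dentry
  find /mnt/cephfs -inum <ino>
  mv /mnt/cephfs/quarantine/badfile /mnt/cephfs/quarantine/badfile.renamed
  # force the MDS to write out the new backtrace
  ceph daemon mds.<name> flush journal
  # clear the damage table entry, then re-test access via the hardlink
  ceph tell mds.0 damage rm <damage-id>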

John

>
>
> Cheers and thanks for any help,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shrink cache target_max_bytes

2017-02-14 Thread Kees Meijs
Hi Cephers,

Although I might be stating an obvious fact: altering the parameter
works as advertised.

The only issue I encountered was that lowering the parameter too much at once
results in some slow requests because the cache pool is "full".

So in short: it works when lowering the parameter bit by bit.
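For example, stepping it down in stages instead of one big jump (the pool name
and sizes here are made up):

  ceph osd pool set cache-pool target_max_bytes 800000000000
  # wait for flushing/eviction to catch up, then continue
  ceph osd pool set cache-pool target_max_bytes 600000000000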

Regards,
Kees

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to repair MDS damage?

2017-02-14 Thread Oliver Schulz

Dear Ceph Experts,

after upgrading our Ceph cluster from Hammer to Jewel,
the MDS (after a few days) found some metadata damage:

   # ceph status
   [...]
   health HEALTH_ERR
 mds0: Metadata damage detected
   [...]

The output of

   # ceph tell mds.0 damage ls

is:

   [
  {
 "ino" : [...],
 "id" : [...],
 "damage_type" : "backtrace"
  },
  [...]
   ]

There are 5 such "damage_type" : "backtrace" entries in total.

I'm not really surprised, there were a very few instances in
the past where one or two (mostly empty) directories and
symlinks acted strangely and couldn't be deleted
("rm" results in "Invalid argument"). Back then, I moved them
all in a "quarantine" directory, but wasn't able to do anything
about it.

Now that CephFS does more rigorous checks and has spotted
the trouble - how do I go about repairing this?


Cheers and thanks for any help,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-14 Thread george.vasilakakos
Hi Brad,

I'll be doing so later in the day.

Thanks,

George

From: Brad Hubbard [bhubb...@redhat.com]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the
primary so we can look at this in more detail.

On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just 
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
> > Hi Corentin,
> >
> > I've tried that, the primary hangs when trying to injectargs so I set 
> the option in the config file and restarted all OSDs in the PG, it came up 
> with:
> >
> > pg 1.323 is remapped+peering, acting 
> [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > Still can't query the PG, no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case since it
> can no longer show "peering_blocked_by_history_les_bound"?
>
> >
> > Regards,
> >
> > George
> > 
> > From: Corentin Bonneton [l...@titin.fr]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > Hello,
> >
> > I already had the case, I applied the parameter 
> (osd_find_best_info_ignore_history_les) to all the osd that have reported the 
> queries blocked.
> >
> > --
> > Cordialement,
> > CEO FEELB | Corentin BONNETON
> > cont...@feelb.io
> >
> > Le 8 févr. 2017 à 17:17, 
> george.vasilaka...@stfc.ac.uk a écrit :
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix EC and replicated 
> pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
> last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, or sometimes results in 
> things like this:
> >
> > pg 1.323 is remapped+peering, acting 
> [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> >
> > The host that was rebooted is home to osd.7 (8). If I go onto it to 
> look at the logs for osd.7 this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
> XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 
> sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating 
> reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates 
> the direction of communication. I've traced these to osd.7 (rank 8 in the 
> stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> >
> > Meanwhile, looking at the logs of osd.595 I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700  0 -- 
> XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 
> sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs 
> existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 
> != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7 and from what I could 
> gather the CRC problem is about messaging.
> >
> > Google searching has yielded nothing particularly useful on how to get 
> this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever but it completed once last 
> night and I noticed this:
> >
> >"peering_blocked_by_detail": [
> >{
> >"detail": "peering_blocked_by_history_les_bound"
> >}
> >
> > We have seen this before and it was cleared by setting 
> osd_find_best_info_ignore_history_les to true for the first two OSDs on the 
> stuck PGs (this was on a 3 replica pool). This hasn't worked in this case and 
> I suspect the option needs to be set on either a majority of OSDs or enough k 
> number of OSDs to be able to use their data and ignore history.
> >
> > We would really appreciate any guidance and/or help the community can 
> offer!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] To backup or not to backup the classic way - How to backup hundreds of TB?

2017-02-14 Thread Irek Fasikhov
Hi.

We use Ceph RADOS GW S3, and we are very happy :).
Each administrator is responsible for their own service.

We use the following S3 clients:
Linux - s3cmd, duply;
Windows - cloudberry.

P.S 500 TB data, 3x replication, 3 datacenter.

Best regards, Irek Fasikhov (Фасихов Ирек Нургаязович)
Mobile: +79229045757

2017-02-14 12:15 GMT+03:00 Götz Reinicke :

> Hi,
>
> I guess that's a question that pops up in different places, but I could
> not find any which fits to my thoughts.
>
> Currently we are starting to use Ceph for file shares of the films produced by
> our students and some xen/vmware VMs. The VM data is already backed up; the
> films' original footage is stored in other places.
>
> We are starting with some 100TB of RBD and mount smb/NFS shares from the clients.
> Maybe we will look into CephFS soon.
>
> The question is: how would someone handle a backup of 100 TB of data?
> Rsyncing that to another system or having a commercial backup solution
> doesn't look that good, e.g. regarding the price.
>
> One thought is: is there some sort of best practice in the Ceph world, e.g.
> replicating to another physically independent cluster? Or using more replicas,
> OSDs and nodes and doing snapshots in one cluster?
>
> Having production data and backups on the same hardware currently doesn't make
> me feel that good either… But the world changes :)
>
> Long story short: How do you do backup hundreds of TB?
>
> Curious for suggestions and thoughts .. Thanks and Regards . Götz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Dongsheng Yang
> Sent: 14 February 2017 09:01
> To: Sage Weil 
> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: [ceph-users] bcache vs flashcache vs cache tiering
> 
> Hi Sage and all,
>  We are going to use SSDs for cache in ceph. But I am not sure which one 
> is the best solution, bcache? flashcache? or cache
tier?

I would vote for cache tier. Being able to manage it from within Ceph, instead 
of having to manage X number of bcache/flashcache
instances, appeals to me more. Also last time I looked Flashcache seems 
unmaintained and bcache might be going that way with talk of
this new bcachefs. Another point to consider is that Ceph has had a lot of work 
done on it to ensure data consistency; I don't ever
want to be in a position where I'm trying to diagnose problems that might be 
being caused by another layer sitting in-between Ceph
and the Disk.

However, I know several people on here are using bcache and potentially getting 
better performance than with cache tiering, so
hopefully someone will give their views.

> 
> I found there are some CAUTION in ceph.com about cache tiering. Is cache 
> tiering is already production ready? especially for rbd.

Several people have been using it in production and with Jewel I would say it's 
stable. There were a few gotcha's in previous
releases, but they all appear to be fixed in Jewel. The main reasons for the 
warnings now are that unless you have a cacheable
workload, performance can actually be degraded. If you can predict that say 10% 
of your data will be hot and provision enough SSD
capacity for this hot data, then it should work really well. If your data will
be uniformly random or sequential in nature, then I
would steer clear, but this applies to most caching solutions, albeit with maybe
more graceful degradation.

> 
> thanx in advance.
> Yang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Where did monitors keep their keys?

2017-02-14 Thread George Shuklin

Hello.

Where do monitors keep their keys? I can't see them in 'ceph auth
list'. Are they in that list but I don't have permission to see them (as
admin), or are they stored somewhere else? How can I see that list?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-14 Thread Tyanko Aleksiev
Hi Cephers,

At the University of Zurich we are using Ceph as a storage back-end for our
OpenStack installation. Since we recently reached 70% occupancy
(mostly caused by the cinder pool, served by 16384 PGs), we are in the
phase of extending the cluster with additional storage nodes of the same
type (except for a slightly more powerful CPU).

We decided to opt for a gradual OSD deployment: we created a temporary "root"
bucket called "fresh-install" containing the newly installed nodes and then we
moved OSDs from this bucket to the current production root via:

ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}

Everything seemed nicely planned, but when we started adding a few new
OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
already at 84% disk use, passed the 85% threshold. This in turn
triggered the "near full osd(s)" warning, and more than 20 PGs previously
in the "wait_backfill" state were marked as "wait_backfill+backfill_toofull".
Since the OSD kept growing until it reached 90% disk use, we decided to reduce
its relative weight from 1 to 0.95.
That action recalculated the crushmap and remapped a few PGs, but did
not appear to move any data off the almost-full OSD. Only when, by steps
of 0.05, we reached 0.50 of relative weight was data moved and some
"backfill_toofull" requests released. However, we had to go down
almost to 0.10 of relative weight in order to trigger some additional
data movement and have the backfilling process finally finish.
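A sketch of that reweight step, assuming the standard command for the relative
(in/out) weight and a placeholder OSD id:

  ceph osd reweight osd.<id> 0.95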

We are now adding new OSDs, but the problem is constantly triggered since
we have multiple OSDs > 83% that start growing during the rebalance.

My questions are:

- Is there something wrong in our process of adding new OSDs (some additional
details below)?
- We also noticed that the problem has a tendency to cluster around the newly
added OSDs, so could those two things be correlated?
- Why does reweighting not trigger instant data movement? What's the logic
behind remapped PGs? Is there some sort of flat queue of tasks, or does
it have some priorities defined?
- Has anybody experienced this situation, and if so, how was it
solved/bypassed?

Cluster details are as follows:

- version: 0.94.9
- 5 monitors,
- 40 storage hosts with an overall of 24 X 4TB disks: 1 OSD/disk (960 OSDs
in total),
- osd pool default size = 3,
- journaling is on SSDs.

We have "hosts" failure domain. Relevant crushmap details:

# rules
rule sas {
ruleset 1
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}

root sas {
id -41 # do not change unnecessarily
# weight 3283.279
alg straw
hash 0 # rjenkins1
item osd-l2-16 weight 87.360
item osd-l4-06 weight 87.360
...
item osd-k7-41 weight 14.560
item osd-l4-36 weight 14.560
item osd-k5-36 weight 14.560
}

host osd-k7-21 {
id -46 # do not change unnecessarily
# weight 87.360
alg straw
hash 0 # rjenkins1
item osd.281 weight 3.640
item osd.282 weight 3.640
item osd.285 weight 3.640
...
}

host osd-k7-41 {
id -50 # do not change unnecessarily
# weight 14.560
alg straw
hash 0 # rjenkins1
item osd.900 weight 3.640
item osd.901 weight 3.640
item osd.902 weight 3.640
item osd.903 weight 3.640
}


As mentioned before we created a temporary bucket called "fresh-install"
containing the newly installed nodes (i.e.):

root fresh-install {
id -34 # do not change unnecessarily
# weight 218.400
alg straw
hash 0 # rjenkins1
item osd-k5-36-fresh weight 72.800
item osd-k7-41-fresh weight 72.800
item osd-l4-36-fresh weight 72.800
}

Then, by steps of 6 OSDs (2 OSDs from each new host), we move OSDs from
the "fresh-install" to the "sas" bucket.

Thank you in advance for all the suggestions.

Cheers,
Tyanko
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performances on our Ceph Cluster

2017-02-14 Thread Jason Dillaman
On Tue, Feb 14, 2017 at 3:48 AM, David Ramahefason  wrote:
> Any idea on how we could increase performances ? as this really impact our
> openstack MOS9.0 Mitaka infrastructure, VM spawning can take up to 15
> minutes...

Have you configured the Glance RBD store properly? The Mitaka release
changed the configuration options needed to thinly provision VMs from
backing Glance images [1].

[1] http://docs.ceph.com/docs/master/rbd/rbd-openstack/#for-mitaka-only
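For reference, a minimal glance-api.conf sketch along those lines (the pool and
user names are only examples - see the link above for the authoritative Mitaka
settings):

  [DEFAULT]
  show_image_direct_url = True

  [glance_store]
  stores = rbd
  default_store = rbd
  rbd_store_pool = images
  rbd_store_user = glance
  rbd_store_ceph_conf = /etc/ceph/ceph.conf
  rbd_store_chunk_size = 8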

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performances on our Ceph Cluster

2017-02-14 Thread Beard Lionel (BOSTON-STORAGE)
Hi,

> On Tue, Feb 14, 2017 at 3:48 AM, David Ramahefason 
> wrote:
> > Any idea on how we could increase performances ? as this really impact
> > our openstack MOS9.0 Mitaka infrastructure, VM spawning can take up to
> > 15 minutes...
>
> Have you configured Glance RBD store properly? The Mikata release changed
> the configuration options needed to thinly provision VMs from backing
> Glance images [1]
>
> [1] http://docs.ceph.com/docs/master/rbd/rbd-openstack/#for-mitaka-only

And also be sure to have your images in Glance in RAW format, as copy-on-write 
is not supported with Ceph when using QCOW2 format.
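If your images are currently QCOW2, converting them before upload is along these
lines (an illustration only; image names are examples):

  qemu-img convert -f qcow2 -O raw myimage.qcow2 myimage.raw
  openstack image create --disk-format raw --container-format bare \
    --file myimage.raw myimage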

Lionel


This message and any attachments (the "message") is intended solely for the 
intended recipient(s) and is confidential. If you receive this message in 
error, or are not the intended recipient(s), please delete it and any copies 
from your systems and immediately notify the sender. Any unauthorized view, use 
that does not comply with its purpose, dissemination or disclosure, either 
whole or partial, is prohibited. Since the internet cannot guarantee the 
integrity of this message which may not be reliable, the sender (and its 
subsidiaries) shall not be liable for the message if modified or falsified.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to change the owner of a bucket

2017-02-14 Thread Yoann Moulin
Dear list,

I was looking into how to change the owner of a bucket. There is a lack of
documentation on that point (even the man page is not clear); I found out
how with the help of Orit.

> radosgw-admin metadata get bucket:
> radosgw-admin bucket link --uid= --bucket= 
> --bucket-id=

this issue helped me : http://tracker.ceph.com/issues/14949
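Spelled out with placeholder values (my reading of the full form; the bucket
id comes from the metadata get output):

  radosgw-admin metadata get bucket:<bucket-name>
  radosgw-admin bucket link --uid=<new-owner-uid> --bucket=<bucket-name> --bucket-id=<bucket-id>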

Also, in the radosgw-admin man page, unlink is described as "Remove a bucket" -
what does "remove" mean in that case? Delete?

> Remove a bucket:
> $ radosgw-admin bucket unlink --bucket=foo

http://docs.ceph.com/docs/master/man/8/radosgw-admin/

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Wido den Hollander

> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Dongsheng Yang
> > Sent: 14 February 2017 09:01
> > To: Sage Weil 
> > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> > Subject: [ceph-users] bcache vs flashcache vs cache tiering
> > 
> > Hi Sage and all,
> >  We are going to use SSDs for cache in ceph. But I am not sure which 
> > one is the best solution, bcache? flashcache? or cache
> tier?
> 
> I would vote for cache tier. Being able to manage it from within Ceph, 
> instead of having to manage X number of bcache/flashcache
> instances, appeals to me more. Also last time I looked Flashcache seems 
> unmaintained and bcache might be going that way with talk of
> this new bcachefs. Another point to consider is that Ceph has had a lot of 
> work done on it to ensure data consistency; I don't ever
> want to be in a position where I'm trying to diagnose problems that might be 
> being caused by another layer sitting in-between Ceph
> and the Disk.
> 
> However, I know several people on here are using bcache and potentially 
> getting better performance than with cache tiering, so
> hopefully someone will give their views.

I am using Bcache on various systems and it performs really well. The caching 
layer in Ceph is slow. Promoting Objects is slow and it also involves 
additional RADOS lookups.

The benefit with bcache is that it's handled by the OS locally; see it as being an
extension of the page cache.

A fast NVM-e device of 1 to 2TB can vastly improve the performance of a bunch
of spinning disks. What I've seen is that overall the I/O pattern on the disks
stabilizes and has fewer spikes.

Frequent reads will be cached in the page cache and less frequent by bcache.

Running this with a few clients now for over 18 months and no issues so far.

Starting from kernel 4.11 you can also create partitions on bcache devices 
which makes it very easy to use bcache with ceph-disk and thus FileStore and 
BlueStore.
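For anyone wanting to try it, a rough sketch of the setup I am describing
(device names are examples; double-check against the bcache documentation):

  make-bcache -B /dev/sdb                 # backing device on the spinner
  make-bcache -C /dev/nvme0n1p1           # cache device on the NVM-e drive
  # attach the backing device to the cache set (UUID from bcache-super-show)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  # with kernel >= 4.11 the resulting /dev/bcache0 can be partitioned, so
  # ceph-disk can consume it directly, e.g.:
  ceph-disk prepare /dev/bcache0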

Wido

> 
> > 
> > I found there are some CAUTION in ceph.com about cache tiering. Is cache 
> > tiering is already production ready? especially for rbd.
> 
> Several people have been using it in production and with Jewel I would say 
> it's stable. There were a few gotcha's in previous
> releases, but they all appear to be fixed in Jewel. The main reasons for the 
> warnings now are that unless you have a cacheable
> workload, performance can actually be degraded. If you can predict that say 
> 10% of your data will be hot and provision enough SSD
> capacity for this hot data, then it should work really well. If you data will 
> be uniformly random or sequential in nature, then I
> would steer clear, but this applies to most caching solutions albeit with 
> maybe more graceful degradation
> 
> > 
> > thanx in advance.
> > Yang
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw scaling recommendation?

2017-02-14 Thread Benjeman Meekhof
Thanks everyone for the suggestions; playing with all three of the
tuning knobs mentioned has greatly increased the number of client
connections an instance can deal with.  We're still experimenting to
find the max values to saturate our hardware.

With values as below we'd see something around 50 reqs/s and at higher
rates start to see some 403 responses or TCP peer resets.  However
we're still only hitting 3-5% utilization of the hardware CPU and
plenty of headroom with other resources so there's room to go higher I
think.  There wasn't a lot of thought put into those numbers or the
relation between them, just 'bigger'.

My analysis is that connection resets are likely due to too few
civetweb threads to handle requests, and 403 responses from too few
threads/handles to handle connections that do get through.

rgw_thread_pool_size = 800
civetweb num_threads = 400
rgw_num_rados_handles = 8
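In ceph.conf terms that is roughly the following (the section name depends on
how your RGW instance is named; the frontends line follows the same form Ben
Hines quotes below):

  [client.rgw.<hostname>]
  rgw thread pool size = 800
  rgw num rados handles = 8
  rgw frontends = civetweb port=80 num_threads=400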

regards,
Ben

On Thu, Feb 9, 2017 at 4:48 PM, Ben Hines  wrote:
> I'm curious how does the num_threads option to civetweb relate to the 'rgw
> thread pool size'?  Should i make them equal?
>
> ie:
>
> rgw frontends = civetweb enable_keep_alive=yes port=80 num_threads=125
> error_log_file=/var/log/ceph/civetweb.error.log
> access_log_file=/var/log/ceph/civetweb.access.log
>
>
> -Ben
>
> On Thu, Feb 9, 2017 at 12:30 PM, Wido den Hollander  wrote:
>>
>>
>> > Op 9 februari 2017 om 19:34 schreef Mark Nelson :
>> >
>> >
>> > I'm not really an RGW expert, but I'd suggest increasing the
>> > "rgw_thread_pool_size" option to something much higher than the default
>> > 100 threads if you haven't already.  RGW requires at least 1 thread per
>> > client connection, so with many concurrent connections some of them
>> > might end up timing out.  You can scale the number of threads and even
>> > the number of RGW instances on a single server, but at some point you'll
>> > run out of threads at the OS level.  Probably before that actually
>> > happens though, you'll want to think about multiple RGW gateway nodes
>> > behind a load balancer.  Afaik that's how the big sites do it.
>> >
>>
>> In addition, have you tried to use more RADOS handles?
>>
>> rgw_num_rados_handles = 8
>>
>> That with more RGW threads as Mark mentioned.
>>
>> Wido
>>
>> > I believe some folks are considering trying to migrate rgw to a
>> > threadpool/event processing model but it sounds like it would be quite a
>> > bit of work.
>> >
>> > Mark
>> >
>> > On 02/09/2017 12:25 PM, Benjeman Meekhof wrote:
>> > > Hi all,
>> > >
>> > > We're doing some stress testing with clients hitting our rados gw
>> > > nodes with simultaneous connections.  When the number of client
>> > > connections exceeds about 5400 we start seeing 403 forbidden errors
>> > > and log messages like the following:
>> > >
>> > > 2017-02-09 08:53:16.915536 7f8c667bc700 0 NOTICE: request time skew
>> > > too big now=2017-02-09 08:53:16.00 req_time=2017-02-09
>> > > 08:37:18.00
>> > >
>> > > This is version 10.2.5 using embedded civetweb.  There's just one
>> > > instance per node, and they all start generating 403 errors and the
>> > > above log messages when enough clients start hitting them.  The
>> > > hardware is not being taxed at all, negligible load and network
>> > > throughput.   OSD don't show any appreciable increase in CPU load or
>> > > io wait on journal/data devices.  Unless I'm missing something it
>> > > looks like the RGW is just not scaling to fill out the hardware it is
>> > > on.
>> > >
>> > > Does anyone have advice on scaling RGW to fully utilize a host?
>> > >
>> > > thanks,
>> > > Ben
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Nick Fisk
> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com]
> Sent: 14 February 2017 16:25
> To: Dongsheng Yang ; n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bcache vs flashcache vs cache tiering
> 
> 
> > Op 14 februari 2017 om 11:14 schreef Nick Fisk :
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Dongsheng Yang
> > > Sent: 14 February 2017 09:01
> > > To: Sage Weil 
> > > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> > > Subject: [ceph-users] bcache vs flashcache vs cache tiering
> > >
> > > Hi Sage and all,
> > >  We are going to use SSDs for cache in ceph. But I am not sure
> > > which one is the best solution, bcache? flashcache? or cache
> > tier?
> >
> > I would vote for cache tier. Being able to manage it from within Ceph,
> > instead of having to manage X number of bcache/flashcache instances,
> > appeals to me more. Also last time I looked Flashcache seems
> > unmaintained and bcache might be going that way with talk of this new
> > bcachefs. Another point to consider is that Ceph has had a lot of work done 
> > on it to ensure data consistency; I don't ever want to be
> in a position where I'm trying to diagnose problems that might be being 
> caused by another layer sitting in-between Ceph and the Disk.
> >
> > However, I know several people on here are using bcache and
> > potentially getting better performance than with cache tiering, so 
> > hopefully someone will give their views.
> 
> I am using Bcache on various systems and it performs really well. The caching 
> layer in Ceph is slow. Promoting Objects is slow and it
> also involves additional RADOS lookups.
> 
> The benefit with bcache is that it's handled by the OS locally, see it being 
> a extension of the page cache.
> 
> A Fast NVM-e device of 1 to 2TB can vastly improve the performance of a bunch 
> of spinning disks. What I've seen is that overall the
> I/O pattern on the disks stabilizes and has less spikes.
> 
> Frequent reads will be cached in the page cache and less frequent by bcache.
> 
> Running this with a few clients now for over 18 months and no issues so far.
> 
> Starting from kernel 4.11 you can also create partitions on bcache devices 
> which makes it very easy to use bcache with ceph-disk and
> thus FileStore and BlueStore.

Thanks for the input Wido. 

So I assume you currently run with the Journals on separate raw SSD partitions, 
but post 4.11 you will allow ceph-disk to partition a single bcache device for 
both data and journal?

Have you seen any quirks with bcache over the time you have been using it? I 
know when I 1st looked at it for non-ceph use a few years back it had a few 
gremlins hidden in it.

Nick

> 
> Wido
> 
> >
> > >
> > > I found there are some CAUTION in ceph.com about cache tiering. Is cache 
> > > tiering is already production ready? especially for rbd.
> >
> > Several people have been using it in production and with Jewel I would
> > say it's stable. There were a few gotcha's in previous releases, but
> > they all appear to be fixed in Jewel. The main reasons for the
> > warnings now are that unless you have a cacheable workload,
> > performance can actually be degraded. If you can predict that say 10%
> > of your data will be hot and provision enough SSD capacity for this
> > hot data, then it should work really well. If you data will be
> > uniformly random or sequential in nature, then I would steer clear,
> > but this applies to most caching solutions albeit with maybe more
> > graceful degradation
> >
> > >
> > > thanx in advance.
> > > Yang
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Tomasz Kuzemko
We have been running flashcache in production for RBD behind OSDs for over two
years now. We had a few issues with it:
• one rare kernel livelock between XFS and flashcache that took some effort to
track down and fix (we could release a patched flashcache if there is interest)
• careful tuning of the sequential-skip threshold is needed so the cache is not
tainted by deep scrub and backfilling
• flashcache does not support FUA, so the HDD write cache has to be disabled to
prevent data loss in case of power failure (results in ~5% performance drop in
our case; commands sketched below)
• XFS should be formatted with a forced 4 kB sector size
• it requires patching for newer kernels because it's no longer maintained

Overall it allows us to get ~40% more IOPS from HDDs than if we had not used
it. Whether you should use it or not really depends on the use case. If your
hot working set fits in the cache then it could give you a large performance
increase.
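For the write-cache and sector-size points above, the commands meant are
roughly (device names are examples):

  hdparm -W 0 /dev/sdb                            # disable the on-disk write cache (no FUA with flashcache)
  mkfs.xfs -f -s size=4096 /dev/mapper/cachedev   # force 4 kB sectors on the XFS filesystem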

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com

Dnia 14.02.2017 o godz. 17:25 Wido den Hollander  napisał(a):

> 
>> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
>> 
>> 
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Dongsheng Yang
>>> Sent: 14 February 2017 09:01
>>> To: Sage Weil 
>>> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
>>> Subject: [ceph-users] bcache vs flashcache vs cache tiering
>>> 
>>> Hi Sage and all,
>>> We are going to use SSDs for cache in ceph. But I am not sure which one 
>>> is the best solution, bcache? flashcache? or cache
>> tier?
>> 
>> I would vote for cache tier. Being able to manage it from within Ceph, 
>> instead of having to manage X number of bcache/flashcache
>> instances, appeals to me more. Also last time I looked Flashcache seems 
>> unmaintained and bcache might be going that way with talk of
>> this new bcachefs. Another point to consider is that Ceph has had a lot of 
>> work done on it to ensure data consistency; I don't ever
>> want to be in a position where I'm trying to diagnose problems that might be 
>> being caused by another layer sitting in-between Ceph
>> and the Disk.
>> 
>> However, I know several people on here are using bcache and potentially 
>> getting better performance than with cache tiering, so
>> hopefully someone will give their views.
> 
> I am using Bcache on various systems and it performs really well. The caching 
> layer in Ceph is slow. Promoting Objects is slow and it also involves 
> additional RADOS lookups.
> 
> The benefit with bcache is that it's handled by the OS locally, see it being 
> a extension of the page cache.
> 
> A Fast NVM-e device of 1 to 2TB can vastly improve the performance of a bunch 
> of spinning disks. What I've seen is that overall the I/O pattern on the 
> disks stabilizes and has less spikes.
> 
> Frequent reads will be cached in the page cache and less frequent by bcache.
> 
> Running this with a few clients now for over 18 months and no issues so far.
> 
> Starting from kernel 4.11 you can also create partitions on bcache devices 
> which makes it very easy to use bcache with ceph-disk and thus FileStore and 
> BlueStore.
> 
> Wido
> 
>> 
>>> 
>>> I found there are some CAUTION in ceph.com about cache tiering. Is cache 
>>> tiering is already production ready? especially for rbd.
>> 
>> Several people have been using it in production and with Jewel I would say 
>> it's stable. There were a few gotcha's in previous
>> releases, but they all appear to be fixed in Jewel. The main reasons for the 
>> warnings now are that unless you have a cacheable
>> workload, performance can actually be degraded. If you can predict that say 
>> 10% of your data will be hot and provision enough SSD
>> capacity for this hot data, then it should work really well. If you data 
>> will be uniformly random or sequential in nature, then I
>> would steer clear, but this applies to most caching solutions albeit with 
>> maybe more graceful degradation
>> 
>>> 
>>> thanx in advance.
>>> Yang
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
Hi,
I'm most probably hitting bug http://tracker.ceph.com/issues/13755 - when
libvirt mounted RBD disks suspend I/O during snapshot creation until hard
reboot.

My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
(OpenStack/KVM computes) run v0.94.5. Can I still update the client
packages (librbd1 and dependencies) to a patched release 0.94.7, while
keeping the cluster on v0.94.3?
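On the client side I am thinking of something along these lines - the exact
package names and version strings depend on the distro and repo in use, so
this is only a sketch:

  yum install librbd1-0.94.7 python-rbd-0.94.7
  # running VMs keep the old library mapped until their qemu processes are
  # restarted or live-migrated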

I realize it's not ideal but does it present any risk? Can I assume that
patching the client is sufficient to resolve the mentioned bug?

Ceph cluster nodes can't receive updates currently and this will stay so
for some time still, but I need to resolve the snapshot bug urgently.

Greetings,

Lukáš
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] async-ms with RDMA or DPDK?

2017-02-14 Thread Bastian Rosner

Hi,

According to the Kraken release notes and documentation, AsyncMessenger now
also supports RDMA and DPDK.


Is anyone already using async-ms with RDMA or DPDK and might be able to 
tell us something about real-world performance gains and stability?
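From the documentation my understanding is that it is enabled with something
like the following in ceph.conf - I have not tried it myself, so the option
names below are an assumption taken from the Kraken docs:

  [global]
  ms_type = async+rdma
  ms_async_rdma_device_name = mlx5_0     # the RDMA-capable NIC to use
  # (for DPDK the type would be async+dpdk instead)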


Best, Bastian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-14 Thread Brian Andrus
On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev 
wrote:

> Hi Cephers,
>
> At University of Zurich we are using Ceph as a storage back-end for our
> OpenStack installation. Since we recently reached 70% of occupancy
> (mostly caused by the cinder pool served by 16384PGs) we are in the
> phase of extending the cluster with additional storage nodes of the same
> type (except for a slight more powerful CPU).
>
> We decided to opt for a gradual OSD deployment: we created a temporary
> "root"
> bucket called "fresh-install" containing the newly installed nodes and
> then we
> moved OSDs from this bucket to the current production root via:
>
> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>
> Everything seemed nicely planned but when we started adding a few new
> OSDs to the cluster, and thus triggering a rebalancing, one of the OSDs,
> already at 84% disk use, passed the 85% threshold. This in turn
> triggered the "near full osd(s)" warning and more than 20PGs previously
> in "wait_backfill" state were marked as: "wait_backfill+backfill_toofull".
> Since the OSD kept growing until, reached 90% disk use, we decided to
> reduce
> its relative weight from 1 to 0.95.
> The last action recalculated the crushmap and remapped a few PGs but did
> not appear to move any data off the almost full OSD. Only when, by steps
> of 0.05, we reached 0.50 of relative weight data was moved and some
> "backfill_toofull" requests were released. However, he had do go down
> almost to 0.10% of relative weight in order to trigger some additional
> data movement and have the backfilling process finally finished.
>
> We are now adding new OSDs but the problem is constantly triggered since
> we have multiple OSDs > 83% that starts growing during the rebalance.
>
> My questions are:
>
> - Is there something wrong in our process of adding new OSDs (some
> additional
> details below)?
>
>
It could work, but it could also be more disruptive than it needs to be. We have a
similar situation/configuration, and what we do is start OSDs with `osd
crush initial weight = 0` as well as "crush_osd_location" set properly.
This registers the OSDs with a CRUSH weight of 0 and lets us bring them in in a
controlled fashion. We bring them in to a reweight of 1 (no disruption), then
increase the CRUSH weight gradually.
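As a sketch of what that looks like in practice (the OSD id and weights are
examples):

  # ceph.conf on the new OSD hosts, so new OSDs register with zero CRUSH weight
  [osd]
  osd crush initial weight = 0

  # bring the OSD in (relative weight 1), then raise the CRUSH weight in steps
  ceph osd reweight osd.123 1.0
  ceph osd crush reweight osd.123 0.5
  ceph osd crush reweight osd.123 1.5
  ceph osd crush reweight osd.123 3.640   # final weight matching the disk size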


> - We also noticed that the problem has the tendency to cluster around the 
> newly
> added OSDs, so could those two things be correlated?
>
I'm not sure which problem you are referring to - these OSDs filling up?
Possibly due to temporary files or some other mechanism I'm not familiar
with adding a little extra data on top.

> - Why reweighting does not trigger instant data moving? What's the logic
> behind remapped PGs? Is there some sort of flat queue of tasks or does
> it have some priorities defined?
>
>
It should; perhaps you aren't choosing large enough increments, or perhaps
you have some cluster settings set that are holding back data movement.


> - Did somebody experience this situation and eventually how was it 
> solved/bypassed?
>
>
FWIW, we also run a rebalance cronjob every hour with the following:

`ceph osd reweight-by-utilization 103 .010 10`

it was detailed in another recent thread on [ceph-users]
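
If I read the arguments right (the Jewel-era syntax is utilization threshold
as a percentage of the cluster average, maximum weight change per OSD, and
maximum number of OSDs adjusted per run), the hourly cron entry would look
roughly like this:

    # /etc/cron.d/ceph-rebalance - nudge the most overfull OSDs down a little every hour
    0 * * * *  root  ceph osd reweight-by-utilization 103 .010 10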


> Cluster details are as follows:
>
> - version: 0.94.9
> - 5 monitors,
> - 40 storage hosts with 24 x 4TB disks each: 1 OSD/disk (960 OSDs in
> total),
> - osd pool default size = 3,
> - journaling is on SSDs.
>
> We have "hosts" failure domain. Relevant crushmap details:
>
> # rules
> rule sas {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take sas
> step chooseleaf firstn 0 type host
> step emit
> }
>
> root sas {
> id -41  # do not change unnecessarily
> # weight 3283.279
> alg straw
> hash 0  # rjenkins1
> item osd-l2-16 weight 87.360
> item osd-l4-06 weight 87.360
> ...
> item osd-k7-41 weight 14.560
> item osd-l4-36 weight 14.560
> item osd-k5-36 weight 14.560
> }
>
> host osd-k7-21 {
> id -46  # do not change unnecessarily
> # weight 87.360
> alg straw
> hash 0  # rjenkins1
> item osd.281 weight 3.640
> item osd.282 weight 3.640
> item osd.285 weight 3.640
> ...
> }
>
> host osd-k7-41 {
> id -50  # do not change unnecessarily
> # weight 14.560
> alg straw
> hash 0  # rjenkins1
> item osd.900 weight 3.640
> item osd.901 weight 3.640
> item osd.902 weight 3.640
> item osd.903 weight 3.640
> }
>
>
> As mentioned before we created a temporary bucket called "fresh-install"
> containing the newly installed nodes (i.e.):
>
> root fresh-install {
> id -34  # do not change unnecessarily
> # weight 218.400
> alg straw
> hash 0  # rjenkins1
> item osd-k5-36-fresh weight 72.800
> item 

Re: [ceph-users] Jewel to Kraken OSD upgrade issues

2017-02-14 Thread Gregory Farnum
On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof  wrote:
> Hi all,
>
> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
> (11.2.0).  OS was RHEL derivative.  Prior to this we updated all the
> mons to Kraken.
>
> After updating ceph packages I restarted the 60 OSD on the box with
> 'systemctl restart ceph-osd.target'.  Very soon after the system cpu
> load flat-lines at 100% with top showing all of that being system load
> from ceph-osd processes.  Not long after we get OSD flapping due to
> the load on the system (noout was set to start this, but perhaps
> too-quickly unset post restart).
>
> This is causing problems in the cluster, and we reboot the box.  The
> OSD don't start up/mount automatically - not a new problem on this
> setup.  We run 'ceph-disk activate $disk' on a list of all the
> /dev/dm-X devices as output by ceph-disk list.  Everything activates
> and the CPU gradually climbs to once again be a solid 100%.  No OSD
> have joined cluster so it isn't causing issues.
>
> I leave the box overnight...by the time I leave I see that 1-2 OSD on
> this box are marked up/in.   By morning all are in, CPU is fine,
> cluster is still fine.
>
> This is not a show-stopping issue now that I know what happens though
> it means upgrades are a several hour or overnight affair.  Next box I
> will just mark all the OSD out before updating and restarting them or
> try leaving them up but being sure to set noout to avoid flapping
> while they churn.
>
> Here's a log snippet from one currently spinning in the startup
> process since 11am.  This is the second box we did, the first
> experience being as detailed above.  Could this have anything to do
> with the 'PGs are upgrading' message?

It doesn't seem likely — there's a fixed per-PG overhead that doesn't
scale with the object count. I could be missing something but I don't
see anything in the upgrade notes that should be doing this either.
Try running an upgrade with "debug osd = 20" and "debug filestore =
20" set and see what the log spits out.
-Greg
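
For reference, those levels can be set either in ceph.conf before the restart
or injected into the running daemons, roughly like this (a sketch assuming
the standard admin commands are reachable):

    # ceph.conf on the OSD host, picked up at the next daemon start
    [osd]
        debug osd = 20
        debug filestore = 20

    # or at runtime, for all OSDs at once
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20'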

>
> 2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load 
> lua
> 2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map
> has features 288514119978713088, adjusting msgr requires for clients
> 2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map
> has features 288514394856620032 was 8705, adjusting msgr requires for
> mons
> 2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map
> has features 288514394856620032, adjusting msgr requires for osds
> 2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
> 2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs
> opened 148 pgs
> 2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op
> queue with priority op cut off at 64.
> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493
> log_to_monitors {default=true}
> 2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with
> init, starting boot process
> (logs stop here, cpu spinning)
>
>
> regards,
> Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Gregory Farnum
On Tue, Feb 14, 2017 at 8:25 AM, Wido den Hollander  wrote:
>
>> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
>>
>>
>> > -Original Message-
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> > Dongsheng Yang
>> > Sent: 14 February 2017 09:01
>> > To: Sage Weil 
>> > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
>> > Subject: [ceph-users] bcache vs flashcache vs cache tiering
>> >
>> > Hi Sage and all,
>> >  We are going to use SSDs for cache in ceph. But I am not sure which 
>> > one is the best solution, bcache? flashcache? or cache
>> tier?
>>
>> I would vote for cache tier. Being able to manage it from within Ceph, 
>> instead of having to manage X number of bcache/flashcache
>> instances, appeals to me more. Also last time I looked Flashcache seems 
>> unmaintained and bcache might be going that way with talk of
>> this new bcachefs. Another point to consider is that Ceph has had a lot of 
>> work done on it to ensure data consistency; I don't ever
>> want to be in a position where I'm trying to diagnose problems that might be 
>> being caused by another layer sitting in-between Ceph
>> and the Disk.
>>
>> However, I know several people on here are using bcache and potentially 
>> getting better performance than with cache tiering, so
>> hopefully someone will give their views.
>
> I am using Bcache on various systems and it performs really well. The caching 
> layer in Ceph is slow. Promoting Objects is slow and it also involves 
> additional RADOS lookups.

Yeah. Cache tiers have gotten a lot more usable in Ceph, but the use
cases where they're effective are still pretty limited and I think
in-node caching has a brighter future. We just don't like to maintain
the global state that makes separate caching locations viable and
unless you're doing something analogous to the supercomputing "burst
buffers" (which some people are!), it's going to be hard to beat
something that doesn't have to pay the cost of extra network
hops/bandwidth.
Cache tiers are also not a feature that all the vendors support in
their downstream products, so it will probably see less ongoing
investment than you'd expect from such a system.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Nick Fisk

> -Original Message-
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: 14 February 2017 21:05
> To: Wido den Hollander 
> Cc: Dongsheng Yang ; Nick Fisk
> ; Ceph Users 
> Subject: Re: [ceph-users] bcache vs flashcache vs cache tiering
> 
> On Tue, Feb 14, 2017 at 8:25 AM, Wido den Hollander 
> wrote:
> >
> >> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
> >>
> >>
> >> > -Original Message-
> >> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> > Behalf Of Dongsheng Yang
> >> > Sent: 14 February 2017 09:01
> >> > To: Sage Weil 
> >> > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> >> > Subject: [ceph-users] bcache vs flashcache vs cache tiering
> >> >
> >> > Hi Sage and all,
> >> >  We are going to use SSDs for cache in ceph. But I am not sure
> >> > which one is the best solution, bcache? flashcache? or cache
> >> tier?
> >>
> >> I would vote for cache tier. Being able to manage it from within
> >> Ceph, instead of having to manage X number of bcache/flashcache
> >> instances, appeals to me more. Also last time I looked Flashcache
> >> seems unmaintained and bcache might be going that way with talk of
> >> this new bcachefs. Another point to consider is that Ceph has had a lot of
> work done on it to ensure data consistency; I don't ever want to be in a
> position where I'm trying to diagnose problems that might be being caused
> by another layer sitting in-between Ceph and the Disk.
> >>
> >> However, I know several people on here are using bcache and
> >> potentially getting better performance than with cache tiering, so
> hopefully someone will give their views.
> >
> > I am using Bcache on various systems and it performs really well. The
> caching layer in Ceph is slow. Promoting Objects is slow and it also involves
> additional RADOS lookups.
> 
> Yeah. Cache tiers have gotten a lot more usable in Ceph, but the use cases
> where they're effective are still pretty limited and I think in-node caching 
> has
> a brighter future. We just don't like to maintain the global state that makes
> separate caching locations viable and unless you're doing something
> analogous to the supercomputing "burst buffers" (which some people are!),
> it's going to be hard to beat something that doesn't have to pay the cost of
> extra network hops/bandwidth.
> Cache tiers are also not a feature that all the vendors support in their
> downstream products, so it will probably see less ongoing investment than
> you'd expect from such a system.

Should that be taken as an unofficial sign that the tiering support is likely 
to fade away?

I think both approaches have different strengths and probably the difference 
between a tiering system and a caching one is what causes some of the problems.

If something like bcache is going to be the preferred approach, then I think 
more work needs to be done around certifying it for use with Ceph and allowing 
its behavior to be more controlled by Ceph as well. I assume there are issues 
around backfilling and scrubbing polluting the cache? Maybe you would want to 
be able to pass hints down from Ceph, which could also allow per pool cache 
behavior??

> -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD client newer than cluster

2017-02-14 Thread Christian Balzer
On Wed, 15 Feb 2017 02:56:22 +0900 Shinobu Kinjo wrote:

> On Wed, Feb 15, 2017 at 2:18 AM, Lukáš Kubín  wrote:
> > Hi,
> > I'm most probably hitting bug http://tracker.ceph.com/issues/13755 - when
> > libvirt mounted RBD disks suspend I/O during snapshot creation until hard
> > reboot.
> >
> > My Ceph cluster (monitors and OSDs) is running v0.94.3, while clients
> > (OpenStack/KVM computes) run v0.94.5. Can I still update the client packages
> > (librbd1 and dependencies) to a patched release 0.94.7, while keeping the
> > cluster on v0.94.3?  
> 
> The latest hammer is v0.94.9 and hammer will be EOL in this spring.

As long as the fix for http://tracker.ceph.com/issues/17386 hasn't been
rolled out in a .10 release, telling people to upgrade to .9 when they have
no pressing need otherwise is a bit iffy.
Might work out w/o problems in a small enough cluster, but still.

And Jewel has seriously failed to impress me so far considering the stream
of releases to deal with bugs and regressions. 

While I'm going to start our next cluster with Jewel, another one is
likely to remain frozen into Hammer until its natural death (HW
retirement or renewal).

Christian

> Why do you want to keep v0.94.3? Is it because you just want to avoid
> any risks regarding to upgrading packages?
> 
> >
> > I realize it's not ideal but does it present any risk? Can I assume that
> > patching the client is sufficient to resolve the mentioned bug?
> >
> > Ceph cluster nodes can't receive updates currently and this will stay so for
> > some time still, but I need to resolve the snapshot bug urgently.
> >
> > Greetings,
> >
> > Lukáš
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs advice

2017-02-14 Thread Sam Huracan
Hi Khang,

What file system do you use on the OSD nodes?
XFS always uses memory for caching data before writing to disk.

So don't worry, it will always hold as much memory in your system as
possible.
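
A quick way to separate reclaimable page cache from the memory the ceph-osd
daemons themselves hold (just standard Linux tooling):

    # buffers/cache are reclaimable; the "available" column is what matters
    free -hw
    # total resident memory pinned by the OSD daemons, in GiB
    ps --no-headers -o rss -C ceph-osd | awk '{s+=$1} END {printf "%.1f GiB\n", s/1048576}'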



2017-02-15 10:35 GMT+07:00 Khang Nguyễn Nhật 
:

> Hi all,
> My ceph OSDs is running on Fedora-server24 with config are:
> 128GB RAM DDR3, CPU Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 72 OSDs
> (8TB per OSD). My cluster was use ceph object gateway with S3 API. Now, it
> had contained 500GB data but it was used > 50GB RAM. I'm worry my OSD will
> dead if i continue put file to it. I had read "OSDs do not require as
> much RAM for regular operations (e.g., 500MB of RAM per daemon instance);
> however, during recovery they need significantly more RAM (e.g., ~1GB per
> 1TB of storage per daemon)." in Ceph Hardware Recommendations. Someone
> can give me advice on this issue? Thank
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] async-ms with RDMA or DPDK?

2017-02-14 Thread Haomai Wang
On Tue, Feb 14, 2017 at 11:44 PM, Bastian Rosner
 wrote:
>
> Hi,
>
> according to kraken release-notes and documentation, AsyncMessenger now also 
> supports RDMA and DPDK.
>
> Is anyone already using async-ms with RDMA or DPDK and might be able to tell 
> us something about real-world performance gains and stability?
>

Making RDMA mature is an ongoing plan, but stability is the first thing,
then performance.
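
For anyone who wants to experiment anyway, the Kraken-era knobs look roughly
like this (the device name is a placeholder and has to match the local RDMA
hardware; all daemons and clients must agree on the messenger type):

    [global]
        ms_type = async+rdma
        ms_async_rdma_device_name = mlx5_0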

>
> Best, Bastian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs advice

2017-02-14 Thread Khang Nguyễn Nhật
Hi Sam,
Thanks for your reply. I use the BTRFS file system on the OSDs.
Here is the result of "free -hw":

              total        used        free      shared     buffers       cache   available
Mem:           125G         58G         31G        1.2M        3.7M         36G         60G

and "*ceph df*":

GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    523T     522T      1539G        0.29
POOLS:
    NAME                         ID     USED     %USED     MAX AVAIL     OBJECTS
    default.rgw.buckets.data     92     597G     0.15      391T          84392


I received this a few minutes ago.

2017-02-15 10:50 GMT+07:00 Sam Huracan :

> Hi Khang,
>
> What file system do you use in OSD node?
> XFS always use Memory for caching data before writing to disk.
>
> So, don't worry, it always holds memory in your system as much as possible.
>
>
>
> 2017-02-15 10:35 GMT+07:00 Khang Nguyễn Nhật <
> nguyennhatkhang2...@gmail.com>:
>
>> Hi all,
>> My ceph OSDs is running on Fedora-server24 with config are:
>> 128GB RAM DDR3, CPU Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 72 OSDs
>> (8TB per OSD). My cluster was use ceph object gateway with S3 API. Now, it
>> had contained 500GB data but it was used > 50GB RAM. I'm worry my OSD will
>> dead if i continue put file to it. I had read "OSDs do not require as
>> much RAM for regular operations (e.g., 500MB of RAM per daemon instance);
>> however, during recovery they need significantly more RAM (e.g., ~1GB per
>> 1TB of storage per daemon)." in Ceph Hardware Recommendations. Someone
>> can give me advice on this issue? Thank
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Christian Balzer
On Tue, 14 Feb 2017 22:42:21 - Nick Fisk wrote:

> > -Original Message-
> > From: Gregory Farnum [mailto:gfar...@redhat.com]
> > Sent: 14 February 2017 21:05
> > To: Wido den Hollander 
> > Cc: Dongsheng Yang ; Nick Fisk
> > ; Ceph Users 
> > Subject: Re: [ceph-users] bcache vs flashcache vs cache tiering
> > 
> > On Tue, Feb 14, 2017 at 8:25 AM, Wido den Hollander 
> > wrote:  
> > >  
> > >> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
> > >>
> > >>  
> > >> > -Original Message-
> > >> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > >> > Behalf Of Dongsheng Yang
> > >> > Sent: 14 February 2017 09:01
> > >> > To: Sage Weil 
> > >> > Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> > >> > Subject: [ceph-users] bcache vs flashcache vs cache tiering
> > >> >
> > >> > Hi Sage and all,
> > >> >  We are going to use SSDs for cache in ceph. But I am not sure
> > >> > which one is the best solution, bcache? flashcache? or cache  
> > >> tier?
> > >>
> > >> I would vote for cache tier. Being able to manage it from within
> > >> Ceph, instead of having to manage X number of bcache/flashcache
> > >> instances, appeals to me more. Also last time I looked Flashcache
> > >> seems unmaintained and bcache might be going that way with talk of
> > >> this new bcachefs. Another point to consider is that Ceph has had a lot 
> > >> of  
> > work done on it to ensure data consistency; I don't ever want to be in a
> > position where I'm trying to diagnose problems that might be being caused
> > by another layer sitting in-between Ceph and the Disk.  
> > >>
> > >> However, I know several people on here are using bcache and
> > >> potentially getting better performance than with cache tiering, so  
> > hopefully someone will give their views.  
> > >
> > > I am using Bcache on various systems and it performs really well. The  
> > caching layer in Ceph is slow. Promoting Objects is slow and it also 
> > involves
> > additional RADOS lookups.
> > 
> > Yeah. Cache tiers have gotten a lot more usable in Ceph, but the use cases
> > where they're effective are still pretty limited and I think in-node 
> > caching has
> > a brighter future. We just don't like to maintain the global state that 
> > makes
> > separate caching locations viable and unless you're doing something
> > analogous to the supercomputing "burst buffers" (which some people are!),
> > it's going to be hard to beat something that doesn't have to pay the cost of
> > extra network hops/bandwidth.
> > Cache tiers are also not a feature that all the vendors support in their
> > downstream products, so it will probably see less ongoing investment than
> > you'd expect from such a system.  
> 
> Should that be taken as an unofficial sign that the tiering support is likely 
> to fade away?
> 
Nick, you also posted back in October in the 
"cache tiering deprecated in RHCS 2.0" thread and should remember the
deafening silence when I asked that question.

I'm actually surprised that Greg said as much as he did now,
unfortunately that doesn't really cover all the questions I had back then,
in particular long term support and bug fixes, not necessarily more
features.

We're literally about to order our next cluster and cache-tiering works
like a charm for us, even in Hammer.
With the (still undocumented) knobs in Jewel and read-forward it will be
even more effective.

So given the lack of any statements, that next cluster will still use the
same design as the previous one: BlueStore isn't ready, bcache and others
haven't been tested here to my satisfaction, and we know very well what
works and what doesn't.

So 3 regular (HDD OSD, journal SSD) nodes and 3 cache-tier ones. Dedicated
cache-tier nodes allow for deployment of high end CPUs only in those nodes.

Another point in favor of cache-tiering is that it can be added at a
later stage, while in-node caching requires an initial design with large
local SSDs/NVMes or at least the space for them. 
Because the journal SSDs most people will deploy initially don't tend to
be large enough to be effective when used with bcache or similar. 
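
For the record, bolting a cache tier onto an existing pool later really is
only a handful of commands, roughly like this (pool names, PG counts and the
size limit are placeholders):

    ceph osd pool create cache-pool 2048 2048
    ceph osd tier add rbd-pool cache-pool
    ceph osd tier cache-mode cache-pool writeback
    ceph osd tier set-overlay rbd-pool cache-pool
    ceph osd pool set cache-pool hit_set_type bloom
    ceph osd pool set cache-pool target_max_bytes 1099511627776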

> I think both approaches have different strengths and probably the difference 
> between a tiering system and a caching one is what causes some of the 
> problems.
> 
> If something like bcache is going to be the preferred approach, then I think 
> more work needs to be done around certifying it for use with Ceph and 
> allowing its behavior to be more controlled by Ceph as well. I assume there 
> are issues around backfilling and scrubbing polluting the cache? Maybe you 
> would want to be able to pass hints down from Ceph, which could also allow 
> per pool cache behavior??
> 
According to the RHCS release notes back then their idea to achieve
rainbows and pink ponies was using dm-cache.


[ceph-users] Ceph OSDs advice

2017-02-14 Thread Khang Nguyễn Nhật
Hi all,
My Ceph OSDs are running on Fedora Server 24 with the following config:
128GB DDR3 RAM, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 72 OSDs (8TB
per OSD). My cluster is used as a Ceph object gateway with the S3 API. It
currently contains 500GB of data but is already using > 50GB of RAM. I'm
worried my OSDs will die if I continue putting files into it. I had read
"OSDs do not require as much RAM for regular operations (e.g., 500MB of RAM
per daemon instance); however, during recovery they need significantly more
RAM (e.g., ~1GB per 1TB of storage per daemon)." in the Ceph Hardware
Recommendations. Can someone give me advice on this issue? Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bcache vs flashcache vs cache tiering

2017-02-14 Thread Stefan Priebe - Profihost AG
I've been testing flashcache, bcache, dm-cache and even dm-writeboost in 
production ceph clusters.

The only one that is working fine and gives the speed we need is bcache. All 
others failed with slow speeds or high latencies.

Stefan

Excuse my typo sent from my mobile phone.
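
For anyone who wants to try the same, the basic bcache setup under an OSD is
roughly the following (device names are placeholders; udev usually registers
the devices automatically after a reboot):

    # format the NVMe/SSD as a cache device and the HDD as a backing device
    make-bcache -C /dev/nvme0n1p1
    make-bcache -B /dev/sdb
    # attach the backing device to the cache set (UUID from bcache-super-show)
    bcache-super-show /dev/nvme0n1p1 | grep cset.uuid
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach
    # the OSD is then created on the bcache device as usual
    ceph-disk prepare /dev/bcache0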

> Am 15.02.2017 um 02:42 schrieb Christian Balzer :
> 
> On Tue, 14 Feb 2017 22:42:21 - Nick Fisk wrote:
> 
>>> -Original Message-
>>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>>> Sent: 14 February 2017 21:05
>>> To: Wido den Hollander 
>>> Cc: Dongsheng Yang ; Nick Fisk
>>> ; Ceph Users 
>>> Subject: Re: [ceph-users] bcache vs flashcache vs cache tiering
>>> 
>>> On Tue, Feb 14, 2017 at 8:25 AM, Wido den Hollander 
>>> wrote:  
 
> Op 14 februari 2017 om 11:14 schreef Nick Fisk :
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> Behalf Of Dongsheng Yang
>> Sent: 14 February 2017 09:01
>> To: Sage Weil 
>> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
>> Subject: [ceph-users] bcache vs flashcache vs cache tiering
>> 
>> Hi Sage and all,
>> We are going to use SSDs for cache in ceph. But I am not sure
>> which one is the best solution, bcache? flashcache? or cache  
> tier?
> 
> I would vote for cache tier. Being able to manage it from within
> Ceph, instead of having to manage X number of bcache/flashcache
> instances, appeals to me more. Also last time I looked Flashcache
> seems unmaintained and bcache might be going that way with talk of
> this new bcachefs. Another point to consider is that Ceph has had a lot 
> of  
>>> work done on it to ensure data consistency; I don't ever want to be in a
>>> position where I'm trying to diagnose problems that might be being caused
>>> by another layer sitting in-between Ceph and the Disk.  
> 
> However, I know several people on here are using bcache and
> potentially getting better performance than with cache tiering, so  
>>> hopefully someone will give their views.  
 
 I am using Bcache on various systems and it performs really well. The  
>>> caching layer in Ceph is slow. Promoting Objects is slow and it also 
>>> involves
>>> additional RADOS lookups.
>>> 
>>> Yeah. Cache tiers have gotten a lot more usable in Ceph, but the use cases
>>> where they're effective are still pretty limited and I think in-node 
>>> caching has
>>> a brighter future. We just don't like to maintain the global state that 
>>> makes
>>> separate caching locations viable and unless you're doing something
>>> analogous to the supercomputing "burst buffers" (which some people are!),
>>> it's going to be hard to beat something that doesn't have to pay the cost of
>>> extra network hops/bandwidth.
>>> Cache tiers are also not a feature that all the vendors support in their
>>> downstream products, so it will probably see less ongoing investment than
>>> you'd expect from such a system.  
>> 
>> Should that be taken as an unofficial sign that the tiering support is 
>> likely to fade away?
>> 
> Nick, you also posted back in October in the 
> "cache tiering deprecated in RHCS 2.0" thread and should remember the
> deafening silence when I asked that question.
> 
> I'm actually surprised that Greg said as much as he did now,
> unfortunately that doesn't really cover all the questions I had back then,
> in particular long term support and bug fixes, not necessarily more
> features.
> 
> We're literally about to order our next cluster and cache-tiering works
> like a charm for us, even in Hammer.
> With the (still undocumented) knobs in Jewel and read-forward it will be
> even more effective.
> 
> So given the lack of any statements that next cluster will still use the
> same design as the previous one, since BlueStore isn't ready, bcache and
> others haven't been tested here to my satisfaction and we know very well
> what works and what not.
> 
> So 3 regular (HDD OSD, journal SSD) nodes and 3 cache-tier ones. Dedicated
> cache-tier nodes allow for deployment of high end CPUs only in those nodes.
> 
> Another point in favor of cache-tiering is that it can be added at a
> later stage, while in-node caching requires an initial design with large
> local SSDs/NVMes or at least the space for them. 
> Because the journal SSDs most people will deploy initially don't tend to
> be large enough to be effective when used with bcache or similar. 
> 
>> I think both approaches have different strengths and probably the difference 
>> between a tiering system and a caching one is what causes some of the 
>> problems.
>> 
>> If something like bcache is going to be the preferred approach, then I think 
>> more work needs to be done around certifying it for use with Ceph and