[ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-02 Thread Christian Balzer

Hello,

not a Ceph specific issue, but this is probably the largest sample size of
SSD users I'm familiar with. ^o^

This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
religious experience.

It turns out that the SMART check plugin I run (mostly to get an early
wear-out warning) detected a "Power_Loss_Cap_Test" failure in one of the
200GB DC S3700s used for journals.

While SMART is of the opinion that this drive is failing and will explode
spectacularly any moment, that particular failure is of little worry to
me, never mind that I'll eventually replace this unit.

What brings me here is that this is the first time in over 3 years that an
Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
this particular failure has been seen by others.

That of course entails people actually monitoring for these things. ^o^
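
(For anyone who isn't: a minimal manual check with smartmontools would be
something like the following - the device name is just an example:

# smartctl -A /dev/sdX | grep -i Power_Loss_Cap_Test

The same attribute can then be fed into whatever SMART check plugin your
monitoring system uses.)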

Thanks,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ric Wheeler

On 08/02/2016 07:26 PM, Ilya Dryomov wrote:

>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304


Note that this is the standard way Linux exports discard alignment for *any*
kind of storage, so it is worth using :)
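
For example, on an RBD device both discard attributes can be read straight
from sysfs (rbd0 here is just the device from the example above; the values
reflect the default 4M RADOS object size):

# cat /sys/block/rbd0/queue/discard_granularity
4194304
# cat /sys/block/rbd0/discard_alignment
4194304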


Ric


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph RGW issue.

2016-08-02 Thread Khang Nguyễn Nhật
Hi,
I have seen an error when using Ceph RGW v10.2.2 with the S3 API; it is as
follows:
I have three S3 users: A, B, and C. Each of them has some buckets and
objects. When I use A or C to PUT or GET an object via RGW, I see
"decode_policy Read
AccessControlPolicy

Re: [ceph-users] Read Stalls with Multiple OSD Servers

2016-08-02 Thread Helander, Thomas
Hi David,

There’s a good amount of backstory to our configuration, but I’m happy to 
report I found the source of my problem.

We were applying some “optimizations” for our 10GbE via sysctl, including 
disabling net.ipv4.tcp_sack. Re-enabling net.ipv4.tcp_sack resolved the issue.

Thanks,
Tom

From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Monday, August 01, 2016 12:06 PM
To: Helander, Thomas ; ceph-users@lists.ceph.com
Subject: RE: Read Stalls with Multiple OSD Servers

Why are you running RAID 6 osds?  Ceph's strength is a lot of osds that can
fail and be replaced.  With your processors/ram, you should be running these as
individual osds.  That would utilize your dual-processor setup much better.  Ceph
is sized for roughly 1 core per osd, so extra cores are more or less wasted in
the storage node.  You only have 2 storage nodes, so you can't utilize a lot of
the benefits of Ceph.  Your setup looks like it's much better suited for a
Gluster cluster instead of a Ceph cluster.  I don't know what your needs are,
but that's what it looks like from here.


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: Helander, Thomas [thomas.helan...@kla-tencor.com]
Sent: Monday, August 01, 2016 11:10 AM
To: David Turner; ceph-users@lists.ceph.com
Subject: RE: Read Stalls with Multiple OSD Servers
Hi David,

Thanks for the quick response and suggestion. I do have just a basic network 
config (one network, no VLANs) and am able to ping between the storage servers 
using hostnames and IPs.

Thanks,
Tom

From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: Monday, August 01, 2016 9:14 AM
To: Helander, Thomas ; ceph-users@lists.ceph.com
Subject: RE: Read Stalls with Multiple OSD Servers

This could be explained by your osds not being able to communicate with each 
other.  We have 2 vlans between our storage nodes, the public and private 
networks for ceph to use.  We added 2 new nodes in a new rack on new switches 
and as soon as we added a single osd for one of them to the cluster, the 
peering never finished and we had a lot of blocked requests that never went 
away.

In testing we found that the rest of the cluster could not communicate with 
these nodes on the private vlan and after fixing the network switch config, 
everything worked perfectly for adding in the 2 new nodes.

If you are using a basic network configuration with only one network and/or 
vlan, then this is likely not to be your issue.  But to check and make sure, 
you should test pinging between your nodes on all of the IPs they have.
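
A quick way to do that is a small loop over every address (the IPs below are
just placeholders for your nodes' public and cluster addresses):

for ip in 10.0.0.1 10.0.0.2 10.0.0.3 192.168.0.1 192.168.0.2 192.168.0.3; do
    ping -c 3 -W 2 "$ip" > /dev/null && echo "$ip ok" || echo "$ip FAILED"
done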


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Helander, 
Thomas [thomas.helan...@kla-tencor.com]
Sent: Monday, August 01, 2016 10:06 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Read Stalls with Multiple OSD Servers
Hi,

I’m running a three server cluster (one monitor, two OSD) and am having a 
problem where after adding the second OSD server, my read rate drops 
significantly and eventually the reads stall (writes are improved as expected). 
Attached is a log of the rados benchmarks for the two configurations and below 
is my hardware configuration. I’m not using replicas (capacity is more 
important than uptime for our use case) and am using a single 10GbE network. 
The pool (rbd) is configured with 128 placement groups.

I’ve checked the CPU utilization of the ceph-osd processes and they all hover 
around 10% until the stall. After the stall, the CPU usage is 0% and the disks 
all show zero operations via iostat. Iperf reports 9.9Gb/s between the monitor 
and OSD 

Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-02 Thread Gaurav Goyal
Hello David,

Thanks a lot for detailed information!

This is going to help me.


Regards
Gaurav Goyal

On Tue, Aug 2, 2016 at 11:46 AM, David Turner  wrote:

> I'm going to assume you know how to add and remove storage
> http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/.  The
> only other part of this process is reweighting the crush map for the old
> osds to a new weight of 0.0
> http://docs.ceph.com/docs/master/rados/operations/crush-map/.
>
> I would recommend setting the nobackfill and norecover flags.
>
> ceph osd set nobackfill
> ceph osd set norecover
>
> Next you would add all of the new osds according to the ceph docs and then
> reweight the old osds to 0.0.
>
> ceph osd crush reweight osd.1 0.0
>
> Once you have all of that set, unset nobackfill and norecover.
>
> ceph osd unset nobackfill
> ceph osd unset norecover
>
> Wait until all of the backfilling finishes and then remove the old SAN
> osds as per the ceph docs.
>
>
> There is a thread from this mailing list about the benefits of weighting
> osds to 0.0 instead of just removing them.  The best thing that you gain
> from doing it this way is that you can remove multiple nodes/osds at the
> same time without having degraded objects and especially without losing
> objects.
>
> --
>
>  David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ONE pg deep-scrub blocks cluster

2016-08-02 Thread c

On 2016-08-02 13:30, c wrote:

Hello Guys,

this time without the original acting-set osd.4, 16 and 28. The issue
still exists...

[...]

For the record, this ONLY happens with this PG and no others that
share
the same OSDs, right?


Yes, right.

[...]

When doing the deep-scrub, monitor (atop, etc) all 3 nodes and
see if a
particular OSD (HDD) stands out, as I would expect it to.


Now I logged all disks via atop every 2 seconds while the deep-scrub
was running ( atop -w osdXX_atop 2 ).
As you expected, all disks were 100% busy - with a constant 150MB
(osd.4), 130MB (osd.28) and 170MB (osd.16)...

- osd.4 (/dev/sdf) http://slexy.org/view/s21emd2u6j [1]
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2]
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3]
[...]
But what is causing this? A deep-scrub on all other disks - same
model and ordered at the same time - seems to not have this issue.

[...]

Next week, I will do this

1.1 Remove osd.4 completely from Ceph - again (the actual primary
for PG 0.223)


osd.4 is now removed completely.
The primary for this PG is now osd.9.

# ceph pg map 0.223
osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]


1.2 xfs_repair -n /dev/sdf1 (osd.4): to see possible error


xfs_repair did not find/show any error


1.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.4,16,28 injectargs "--debug_osd 5/5"


Because osd.9 is now the primary for this PG, I have set debug_osd on it
too:

ceph tell osd.9 injectargs "--debug_osd 5/5"

and run the deep-scrub on 0.223 (and again nearly all of my VMs stop
working for a while)
Start @ 15:33:27
End @ 15:48:31

The "ceph.log"
- http://slexy.org/view/s2WbdApDLz

The related LogFiles (OSDs 9,16 and 28) and the LogFile via atop for 
the osds


LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw
- atop Log: http://slexy.org/view/s21wJG2qr8

LogFile - osd.16 (/dev/sdh)
- ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d
- atop Log: http://slexy.org/view/s2iMjer8rC

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7
- atop log: http://slexy.org/view/s2gJqzu3uG


2.1 Remove osd.16 completely from Ceph


osd.16 is now removed completely - now replaced with osd.17 within
the acting set.

# ceph pg map 0.223
osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]


2.2 xfs_repair -n /dev/sdh1


xfs_repair did not find/show any error


2.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,28 injectargs "--debug_osd 5/5"


and run the deep-scrub on 0.223 (and again nearly all of my VMs stop
working for a while)

Start @ 2016-08-02 10:02:44
End @ 2016-08-02 10:17:22

The "Ceph.log": http://slexy.org/view/s2ED5LvuV2

LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu
- atop Log: http://slexy.org/view/s20XjFZFEL

LogFile - osd.17 (/dev/sdi)
- ceph-osd.17.log: http://slexy.org/view/s202fpcZS9
- atop Log: http://slexy.org/view/s2TxeR1JSz

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV
- atop log: http://slexy.org/view/s21AfebBqK


3.1 Remove osd.28 completely from Ceph


Now osd.28 is also removed completely from Ceph - now replaced with 
osd.23


# ceph pg map 0.223
osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]


3.2 xfs_repair -n /dev/sdm1


As expected: xfs_repair did not find/show any error


3.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,23 injectargs "--debug_osd 5/5"


... again nearly all of my VMs stop working for a while...

Now all "original" OSDs (4,16,28) which were in the acting set when I wrote
my first e-mail to this mailing list are removed. But the issue still exists
with different OSDs (9,17,23) as the acting set, while the questionable
PG 0.223 is still the same!

Suspecting that the "tunables" could be the cause, I have now changed
them back to "default" via " ceph osd crush tunables default ".
This will take a while... then I will do " ceph pg deep-scrub 0.223 "
again (without OSDs 4,16,28)...


Really, I do not know what's going on here.

Ceph finished recovering to the "default" tunables, but the issue still
exists! :*(


The acting set has changed again

# ceph pg map 0.223
osdmap e11230 pg 0.223 (0.223) -> up [9,11,20] acting [9,11,20]

But when I start " ceph pg deep-scrub 0.223 ", again nearly all of my
VMs stop working for a while!


Does anyone have an idea where I should look to find the cause
of this?


It seems that every time it is the primary OSD of the acting set of PG 0.223
(*4*,16,28; *9*,17,23 or *9*,11,20) that ends up "currently waiting for
subops from 9,X", and the deep-scrub always takes nearly 15 minutes to
finish.
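
One thing I can still try while the next deep-scrub is running is to dump the
ops that are stuck on the primary OSD via its admin socket (osd.9 in the
current acting set), e.g.:

ceph daemon osd.9 dump_ops_in_flight
ceph daemon osd.9 dump_historic_ops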


My output from " ceph pg 0.223 query "

- http://slexy.org/view/s21d6qUqnV

Mehmet



For the record: although nearly all disks are busy, I have no
slow/blocked requests, and I have been watching the logfiles for nearly 20
minutes now...

Your help is really appreciated!
- Mehmet

___

Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-02 Thread David Turner
I'm going to assume you know how to add and remove storage 
http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/.  The only 
other part of this process is reweighting the crush map for the old osds to a 
new weight of 0.0 http://docs.ceph.com/docs/master/rados/operations/crush-map/.

I would recommend setting the nobackfill and norecover flags.

ceph osd set nobackfill
ceph osd set norecover

Next you would add all of the new osds according to the ceph docs and then 
reweight the old osds to 0.0.

ceph osd crush reweight osd.1 0.0

Once you have all of that set, unset nobackfill and norecover.

ceph osd unset nobackfill
ceph osd unset norecover

Wait until all of the backfilling finishes and then remove the old SAN osds as 
per the ceph docs.


There is a thread from this mailing list about the benefits of weighting osds 
to 0.0 instead of just removing them.  The best thing that you gain from doing 
it this way is that you can remove multiple nodes/osds at the same time without 
having degraded objects and especially without losing objects.
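
If you have a lot of old SAN osds, a small shell loop saves some typing (the
osd ids below are placeholders for your actual SAN osd ids):

for id in 1 2 3 4; do
    ceph osd crush reweight osd.$id 0.0
done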



David
Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-02 Thread David Turner
Just add the new storage and weight the old storage to 0.0 so all data will 
move off of the old storage to the new storage.  It's not unique to migrating 
from SANs to Local Disks.  You would do the same any time you wanted to migrate 
to newer servers and retire old servers.  After the backfilling is done, you 
can just remove the old osds from the cluster and no more backfilling will 
happen.



David
Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Gaurav Goyal 
[er.gauravgo...@gmail.com]
Sent: Tuesday, August 02, 2016 9:19 AM
To: ceph-users
Subject: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local 
Disks

Dear Ceph Team,

I need your guidance on this.


Regards
Gaurav Goyal

On Wed, Jul 27, 2016 at 4:03 PM, Gaurav Goyal wrote:
Dear Team,

I have Ceph installed on SAN storage which is connected to the OpenStack hosts
via iSCSI LUNs.
Now we want to get rid of the SAN storage and move Ceph over to local disks.

Can I add new local disks as new OSDs and remove the old OSDs?
or

will I have to remove Ceph from scratch and install it freshly with local
disks?


Regards
Gaurav Goyal






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-02 Thread Gaurav Goyal
Hi David,

Thanks for your comments!
Could you please share the procedure/documentation, if available?

Regards
Gaurav Goyal

On Tue, Aug 2, 2016 at 11:24 AM, David Turner  wrote:

> Just add the new storage and weight the old storage to 0.0 so all data
> will move off of the old storage to the new storage.  It's not unique to
> migrating from SANs to Local Disks.  You would do the same any time you
> wanted to migrate to newer servers and retire old servers.  After the
> backfilling is done, you can just remove the old osds from the cluster and
> no more backfilling will happen.
>
> --
>
>  David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Gaurav Goyal [er.gauravgo...@gmail.com]
> *Sent:* Tuesday, August 02, 2016 9:19 AM
> *To:* ceph-users
> *Subject:* [ceph-users] Fwd: Ceph Storage Migration from SAN storage to
> Local Disks
>
> Dear Ceph Team,
>
> I need your guidance on this.
>
>
> Regards
> Gaurav Goyal
>
> On Wed, Jul 27, 2016 at 4:03 PM, Gaurav Goyal 
> wrote:
>
>> Dear Team,
>>
>> I have ceph storage installed on SAN storage which is connected to
>> Openstack Hosts via iSCSI LUNs.
>> Now we want to get rid of SAN storage and move over ceph to LOCAL disks.
>>
>> Can i add new local disks as new OSDs and remove the old osds ?
>> or
>>
>> I will have to remove the ceph from scratch and install it freshly with
>> Local disks?
>>
>>
>> Regards
>> Gaurav Goyal
>>
>>
>>
>>
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-02 Thread Gaurav Goyal
Dear Ceph Team,

I need your guidance on this.


Regards
Gaurav Goyal

On Wed, Jul 27, 2016 at 4:03 PM, Gaurav Goyal 
wrote:

> Dear Team,
>
> I have ceph storage installed on SAN storage which is connected to
> Openstack Hosts via iSCSI LUNs.
> Now we want to get rid of SAN storage and move over ceph to LOCAL disks.
>
> Can i add new local disks as new OSDs and remove the old osds ?
> or
>
> I will have to remove the ceph from scratch and install it freshly with
> Local disks?
>
>
> Regards
> Gaurav Goyal
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-08-02 Thread Gaurav Goyal
Hello Jason/Kees,

I am trying to take a snapshot of my instance.

The image was stuck in the "Queued" state and the instance is stuck in the
"Image Pending Upload" state.

I had to manually quit the job as it had not been working for the last hour...
my instance is still in the "Image Pending Upload" state.

Is something wrong with my Ceph configuration?
Can I take snapshots with Ceph storage? How?

Regards
Gaurav Goyal

On Wed, Jul 13, 2016 at 9:44 AM, Jason Dillaman  wrote:

> The RAW file will appear to be the exact image size but the filesystem
> will know about the holes in the image and it will be sparsely
> allocated on disk.  For example:
>
> # dd if=/dev/zero of=sparse-file bs=1 count=1 seek=2GiB
> # ll sparse-file
> -rw-rw-r--. 1 jdillaman jdillaman 2147483649 Jul 13 09:20 sparse-file
> # du -sh sparse-file
> 4.0K sparse-file
>
> Now, running qemu-img to copy the image into the backing RBD pool:
>
> # qemu-img convert -f raw -O raw ~/sparse-file rbd:rbd/sparse-file
> # rbd disk-usage sparse-file
> NAMEPROVISIONED USED
> sparse-file   2048M0
>
>
> On Wed, Jul 13, 2016 at 3:31 AM, Fran Barrera 
> wrote:
> > Yes, but it is the same problem, isn't it? The image will be too large
> > because the format is raw.
> >
> > Thanks.
> >
> > 2016-07-13 9:24 GMT+02:00 Kees Meijs :
> >>
> >> Hi Fran,
> >>
> >> Fortunately, qemu-img(1) is able to directly utilise RBD (supporting
> >> sparse block devices)!
> >>
> >> Please refer to http://docs.ceph.com/docs/hammer/rbd/qemu-rbd/ for
> >> examples.
> >>
> >> Cheers,
> >> Kees
> >>
> >> On 13-07-16 09:18, Fran Barrera wrote:
> >> > Can you explain how you do this procedure? I have the same problem
> >> > with the large images and snapshots.
> >> >
> >> > This is what I do:
> >> >
> >> > # qemu-img convert -f qcow2 -O raw image.qcow2 image.img
> >> > # openstack image create image.img
> >> >
> >> > But the image.img is too large.
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Reminder: CDM tomorrow

2016-08-02 Thread Patrick McGarry
Hey cephers,

Just a reminder that our Ceph Developer Monthly discussion is
happening tomorrow at 12:30p EDT on bluejeans. Please, if you are
working on something in the Ceph code base currently, just drop a
quick note on the CDM page so that we’re able to get it on the agenda.
Thanks!

http://wiki.ceph.com/CDM_03-AUG-2016

If you need the dial-in information, you can find it at:

http://wiki.ceph.com/Planning


See you there!

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Alex Gorbachev
On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
> wrote:
>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
 Hi Ilya,

 On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
> wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
 
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
> discard range   | command
> ----------------+----------
> whole object    | delete
> object's tail   | truncate
> object's head   | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> Offset   Length   Type
> 0        4194304  data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> Offset   Length   Type
> 0        4193792  data

 Looks like ESXi is sending in each discard/unmap with the fixed
 granularity of 8192 sectors, which is passed verbatim by SCST.  There
 is a slight reduction in size via rbd diff method, but now I
 understand that actual truncate only takes effect when the discard
 happens to clip the tail of an image.

 So far looking at
 https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513

 ...the only variable we can control is the count of 8192-sector chunks
 and not their size.  Which means that most of the ESXi discard
 commands will be disregarded by Ceph.

 Vlad, is 8192 sectors coming from ESXi, as in the debug:

 Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>>> enforced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
>>> granularity and alignment (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
discard_alignment in /sys/block//queue, but for RBD it's in
/sys/block/ - could this be the source of the issue?

Here is what I get querying the iscsi-exported RBD device on Linux:

root@kio1:/sys/block/sdf#  sg_inq -p 0xB0 /dev/sdf
VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 255 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 16384 blocks
  Optimal transfer length: 1024 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192


>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ilya Dryomov
On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
 On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
 wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
>>> 
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 782336 KB

 Think about it in terms of underlying RADOS objects (4M by default).
 There are three cases:

 discard range   | command
 ----------------+----------
 whole object    | delete
 object's tail   | truncate
 object's head   | zero

 Obviously, only delete and truncate free up space.  In all of your
 examples, except the last one, you are attempting to discard the head
 of the (first) object.

 You can free up as little as a sector, as long as it's the tail:

 Offset   Length   Type
 0        4194304  data

 # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

 Offset   Length   Type
 0        4193792  data
>>>
>>> Looks like ESXi is sending in each discard/unmap with the fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via rbd diff method, but now I
>>> understand that actual truncate only takes effect when the discard
>>> happens to clip the tail of an image.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>> 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>> enforced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
>> granularity and alignment (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
> value.
>
> Can discard_alignment be specified with RBD?

It's exported as a read-only sysfs attribute, just like
discard_granularity:

# cat /sys/block/rbd0/discard_alignment
4194304

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-02 Thread Tyler Bishop
We're having the same issues. I have a 1200TB pool at 90% utilization; however,
disk utilization is only 40%.
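
A quick way to compare what RGW thinks a bucket uses against what the cluster
actually reports (the bucket name is a placeholder):

radosgw-admin bucket stats --bucket=<bucket name>
ceph df detail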







Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 




From: "Brian Felton"  
To: "ceph-users"  
Sent: Wednesday, July 27, 2016 9:24:30 AM 
Subject: [ceph-users] Cleaning Up Failed Multipart Uploads 

Greetings, 

Background: If an object storage client re-uploads parts to a multipart object, 
RadosGW does not clean up all of the parts properly when the multipart upload 
is aborted or completed. You can read all of the gory details (including 
reproduction steps) in this bug report: http://tracker.ceph.com/issues/16767 . 

My setup: Hammer 0.94.6 cluster only used for S3-compatible object storage. RGW 
stripe size is 4MiB. 

My problem: I have buckets that are reporting TB more utilization (and, in one 
case, 200k more objects) than they should report. I am trying to remove the 
detritus from the multipart uploads, but removing the leftover parts directly 
from the .rgw.buckets pool is having no effect on bucket utilization (i.e. 
neither the object count nor the space used are declining). 

To give an example, I have a client that uploaded a very large multipart object 
(8000 15MiB parts). Due to a bug in the client, it uploaded each of the 8000 
parts 6 times. After the sixth attempt, it gave up and aborted the upload, at 
which point RGW removed the 8000 parts from the sixth attempt. When I list the 
bucket's contents with radosgw-admin (radosgw-admin bucket list 
--bucket= --max-entries=), I see all of the object's 
8000 parts five separate times, each under a namespace of 'multipart'. 

Since the multipart upload was aborted, I can't remove the object by name via 
the S3 interface. Since my RGW stripe size is 4MiB, I know that each part of 
the object will be stored across 4 entries in the .rgw.buckets pool -- 4 MiB in 
a 'multipart' file, and 4, 4, and 3 MiB in three successive 'shadow' files. 
I've created a script to remove these parts (rados -p .rgw.buckets rm 
__multipart_. and rados -p .rgw.buckets rm 
__shadow_..[1-3]). The removes are completing 
successfully (in that additional attempts to remove the object result in a 
failure), but I'm not seeing any decrease in the bucket's space used, nor am I 
seeing a decrease in the bucket's object count. In fact, if I do another 
'bucket list', all of the removed parts are still included. 
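
Boiled down, the removal script is essentially something like this (it reads
the leftover rados object names from a pre-built list, since the exact
multipart/shadow names depend on the bucket and upload):

# names.txt contains one leftover 'multipart'/'shadow' object name per line
while read -r obj; do
    rados -p .rgw.buckets rm "$obj"
done < names.txt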

I've looked at the output of 'gc list --include-all', and the removed parts are 
never showing up for garbage collection. Garbage collection is otherwise 
functioning normally and will successfully remove data for any object properly 
removed via the S3 interface. 

I've also gone so far as to write a script to list the contents of bucket 
shards in the .rgw.buckets.index pool, check for the existence of the entry in 
.rgw.buckets, and remove entries that cannot be found, but that is also failing 
to decrement the size/object count counters. 

What am I missing here? Where, aside from .rgw.buckets and .rgw.buckets.index 
is RGW looking to determine object count and space used for a bucket? 

Many thanks to any and all who can assist. 

Brian Felton 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cleaning Up Failed Multipart Uploads

2016-08-02 Thread Brian Felton
I am actively working through the code and debugging everything.  I figure
the issue is with how RGW is listing the parts of a multipart upload when
it completes or aborts the upload (read: it's not getting *all* the parts,
just those that are either most recent or tagged with the upload id).  As
soon as I can figure out a patch, or, more importantly, how to manually
address the problem, I will respond with instructions.

The reported bug contains detailed instructions on reproducing the problem,
so it's trivial to reproduce and test on a small and/or new cluster.

Brian

On Tue, Aug 2, 2016 at 8:53 AM, Tyler Bishop  wrote:

> We're having the same issues.   I have a 1200TB pool at 90% utilization
> however disk utilization is only 40%
>
>
>
>
>
> *Tyler Bishop *Chief Technical Officer
> 513-299-7108 x10
>
> tyler.bis...@beyondhosting.net
>
> If you are not the intended recipient of this transmission you are
> notified that disclosing, copying, distributing or taking any action in
> reliance on the contents of this information is strictly prohibited.
>
>
>
> --
> *From: *"Brian Felton" 
> *To: *"ceph-users" 
> *Sent: *Wednesday, July 27, 2016 9:24:30 AM
> *Subject: *[ceph-users] Cleaning Up Failed Multipart Uploads
>
> Greetings,
>
> Background: If an object storage client re-uploads parts to a multipart
> object, RadosGW does not clean up all of the parts properly when the
> multipart upload is aborted or completed.  You can read all of the gory
> details (including reproduction steps) in this bug report:
> http://tracker.ceph.com/issues/16767.
>
> My setup: Hammer 0.94.6 cluster only used for S3-compatible object
> storage.  RGW stripe size is 4MiB.
>
> My problem: I have buckets that are reporting TB more utilization (and, in
> one case, 200k more objects) than they should report.  I am trying to
> remove the detritus from the multipart uploads, but removing the leftover
> parts directly from the .rgw.buckets pool is having no effect on bucket
> utilization (i.e. neither the object count nor the space used are
> declining).
>
> To give an example, I have a client that uploaded a very large multipart
> object (8000 15MiB parts).  Due to a bug in the client, it uploaded each of
> the 8000 parts 6 times.  After the sixth attempt, it gave up and aborted
> the upload, at which point RGW removed the 8000 parts from the sixth
> attempt.  When I list the bucket's contents with radosgw-admin
> (radosgw-admin bucket list --bucket= --max-entries= bucket>), I see all of the object's 8000 parts five separate times, each
> under a namespace of 'multipart'.
>
> Since the multipart upload was aborted, I can't remove the object by name
> via the S3 interface.  Since my RGW stripe size is 4MiB, I know that each
> part of the object will be stored across 4 entries in the .rgw.buckets pool
> -- 4 MiB in a 'multipart' file, and 4, 4, and 3 MiB in three successive
> 'shadow' files.  I've created a script to remove these parts (rados -p
> .rgw.buckets rm __multipart_. and rados -p
> .rgw.buckets rm __shadow_..[1-3]).  The
> removes are completing successfully (in that additional attempts to remove
> the object result in a failure), but I'm not seeing any decrease in the
> bucket's space used, nor am I seeing a decrease in the bucket's object
> count.  In fact, if I do another 'bucket list', all of the removed parts
> are still included.
>
> I've looked at the output of 'gc list --include-all', and the removed
> parts are never showing up for garbage collection.  Garbage collection is
> otherwise functioning normally and will successfully remove data for any
> object properly removed via the S3 interface.
>
> I've also gone so far as to write a script to list the contents of bucket
> shards in the .rgw.buckets.index pool, check for the existence of the entry
> in .rgw.buckets, and remove entries that cannot be found, but that is also
> failing to decrement the size/object count counters.
>
> What am I missing here?  Where, aside from .rgw.buckets and
> .rgw.buckets.index is RGW looking to determine object count and space used
> for a bucket?
>
> Many thanks to any and all who can assist.
>
> Brian Felton
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Alex Gorbachev
On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
>>> wrote:
 RBD illustration showing RBD ignoring discard until a certain
 threshold - why is that?  This behavior is unfortunately incompatible
 with ESXi discard (UNMAP) behavior.

 Is there a way to lower the discard sensitivity on RBD devices?

>> 

 root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 819200 KB

 root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>> discard range   | command
>>> ----------------+----------
>>> whole object    | delete
>>> object's tail   | truncate
>>> object's head   | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> Offset   Length   Type
>>> 0        4194304  data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> Offset   Length   Type
>>> 0        4193792  data
>>
>> Looks like ESXi is sending in each discard/unmap with the fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via rbd diff method, but now I
>> understand that actual truncate only takes effect when the discard
>> happens to clip the tail of an image.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>
>> ...the only variable we can control is the count of 8192-sector chunks
>> and not their size.  Which means that most of the ESXi discard
>> commands will be disregarded by Ceph.
>>
>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>> 1342099456, nr_sects 8192)
>
> Yes, correct. However, to make sure that VMware is not (erroneously) enforced 
> to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
> granularity and alignment (4M, I guess?)

This seems to reflect the granularity (4194304), which matches the
8192 pages (8192 x 512 = 4194304).  However, there is no alignment
value.

Can discard_alignment be specified with RBD?

>
> 2. Connect to the this iSCSI device from a Linux box and run sg_inq -p 0xB0 
> /dev/
>
> SCST should correctly report those values for unmap parameters (in blocks).
>
> If in both cases you see correct the same values, then this is VMware issue, 
> because it is ignoring what it is told to do (generate appropriately sized 
> and aligned UNMAP requests). If either Ceph, or SCST doesn't show correct 
> numbers, then the broken party should be fixed.
>
> Vlad
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ONE pg deep-scrub blocks cluster

2016-08-02 Thread c

Hello Guys,

this time without the original acting-set osd.4, 16 and 28. The issue 
still exists...


[...]

For the record, this ONLY happens with this PG and no others that
share
the same OSDs, right?


Yes, right.

[...]

When doing the deep-scrub, monitor (atop, etc) all 3 nodes and
see if a
particular OSD (HDD) stands out, as I would expect it to.


Now I logged all disks via atop every 2 seconds while the deep-scrub
was running ( atop -w osdXX_atop 2 ).
As you expected, all disks were 100% busy - with a constant 150MB
(osd.4), 130MB (osd.28) and 170MB (osd.16)...

- osd.4 (/dev/sdf) http://slexy.org/view/s21emd2u6j [1]
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2]
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3]
[...]
But what is causing this? A deep-scrub on all other disks - same
model and ordered at the same time - seems to not have this issue.

[...]

Next week, I will do this

1.1 Remove osd.4 completely from Ceph - again (the actual primary
for PG 0.223)


osd.4 is now removed completely.
The primary for this PG is now osd.9.

# ceph pg map 0.223
osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]


1.2 xfs_repair -n /dev/sdf1 (osd.4): to see possible error


xfs_repair did not find/show any error


1.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.4,16,28 injectargs "--debug_osd 5/5"


Because osd.9 is now the primary for this PG, I have set debug_osd on it
too:

ceph tell osd.9 injectargs "--debug_osd 5/5"

and run the deep-scrub on 0.223 (and again nearly all of my VMs stop
working for a while)

Start @ 15:33:27
End @ 15:48:31

The "ceph.log"
- http://slexy.org/view/s2WbdApDLz

The related LogFiles (OSDs 9,16 and 28) and the LogFile via atop for the 
osds


LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw
- atop Log: http://slexy.org/view/s21wJG2qr8

LogFile - osd.16 (/dev/sdh)
- ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d
- atop Log: http://slexy.org/view/s2iMjer8rC

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7
- atop log: http://slexy.org/view/s2gJqzu3uG


2.1 Remove osd.16 completely from Ceph


osd.16 is now removed completely - now replaced with osd.17 within the
acting set.


# ceph pg map 0.223
osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]


2.2 xfs_repair -n /dev/sdh1


xfs_repair did not find/show any error


2.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,28 injectargs "--debug_osd 5/5"


and run the deep-scrub on 0.223 (and again nearly all of my VMs stop
working for a while)


Start @ 2016-08-02 10:02:44
End @ 2016-08-02 10:17:22

The "Ceph.log": http://slexy.org/view/s2ED5LvuV2

LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu
- atop Log: http://slexy.org/view/s20XjFZFEL

LogFile - osd.17 (/dev/sdi)
- ceph-osd.17.log: http://slexy.org/view/s202fpcZS9
- atop Log: http://slexy.org/view/s2TxeR1JSz

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV
- atop log: http://slexy.org/view/s21AfebBqK


3.1 Remove osd.28 completely from Ceph


Now osd.28 is also removed completely from Ceph - now replaced with 
osd.23


# ceph pg map 0.223
osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]


3.2 xfs_repair -n /dev/sdm1


As expected: xfs_repair did not find/show any error


3.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,23 injectargs "--debug_osd 5/5"


... again nearly all of my VMs stop working for a while...

Now all "original" OSDs (4,16,28) which were in the acting set when I wrote my
first e-mail to this mailing list are removed. But the issue still exists with
different OSDs (9,17,23) as the acting set while the questionable PG 0.223 is
still the same!


Suspecting that the "tunables" could be the cause, I have now changed this
back to "default" via " ceph osd crush tunables default ".
This will take a while... then I will do " ceph pg deep-scrub 0.223 " again
(without OSDs 4,16,28)...


For the record: although nearly all disks are busy, I have no slow/blocked
requests, and I have been watching the logfiles for nearly 20 minutes now...


Your help is really appreciated!
- Mehmet

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ilya Dryomov
On Tue, Aug 2, 2016 at 1:05 AM, Alex Gorbachev  wrote:
> Hi Ilya,
>
> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
>> wrote:
>>> RBD illustration showing RBD ignoring discard until a certain
>>> threshold - why is that?  This behavior is unfortunately incompatible
>>> with ESXi discard (UNMAP) behavior.
>>>
>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>
> 
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>> print SUM/1024 " KB" }'
>>> 819200 KB
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>> print SUM/1024 " KB" }'
>>> 782336 KB
>>
>> Think about it in terms of underlying RADOS objects (4M by default).
>> There are three cases:
>>
>> discard range   | command
>> ----------------+----------
>> whole object    | delete
>> object's tail   | truncate
>> object's head   | zero
>>
>> Obviously, only delete and truncate free up space.  In all of your
>> examples, except the last one, you are attempting to discard the head
>> of the (first) object.
>>
>> You can free up as little as a sector, as long as it's the tail:
>>
>> Offset   Length   Type
>> 0        4194304  data
>>
>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>
>> Offset   Length   Type
>> 0        4193792  data
>
> Looks like ESXi is sending in each discard/unmap with the fixed
> granularity of 8192 sectors, which is passed verbatim by SCST.  There
> is a slight reduction in size via rbd diff method, but now I
> understand that actual truncate only takes effect when the discard
> happens to clip the tail of an image.

... the tail of the *object*.  And again, with "filestore punch hole
= true", page-sized discards anywhere within the image would free up
space, but "rbd diff" won't reflect that.
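
(For reference, that option would go into the [osd] section of ceph.conf on a
filestore-based cluster like the ones discussed here, e.g.:

[osd]
    filestore punch hole = true
)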

>
> So far looking at
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>
> ...the only variable we can control is the count of 8192-sector chunks
> and not their size.  Which means that most of the ESXi discard
> commands will be disregarded by Ceph.
>
> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>
> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
> 1342099456, nr_sects 8192)

They won't be disregarded, but it would definitely work better if they
were aligned.  1342099456 isn't 4M-aligned.
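
To illustrate with the numbers from the debug line above (512-byte sectors,
so one 4M object is 8192 sectors):

$ echo $((1342099456 % 8192))
4096

i.e. that discard starts 2 MiB into an object, so an 8192-sector (4 MiB)
discard there truncates the tail of that object (freeing ~2 MiB) and only
zeroes the head of the next one.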

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to configure OSD heart beat to happen on public network

2016-08-02 Thread Shinobu Kinjo
osd_heartbeat_addr must be in the [osd] section.
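
A minimal sketch of what that would look like in ceph.conf (the address is a
placeholder and would normally be set per OSD host, since it must be an
address that host actually owns on the public network):

[osd]
    osd heartbeat addr = 192.0.2.11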

On Thu, Jul 28, 2016 at 4:31 AM, Venkata Manojawa Paritala
 wrote:
> Hi,
>
> I have configured the below 2 networks in Ceph.conf.
>
> 1. public network
> 2. cluster_network
>
> Now, the heartbeat for the OSDs is happening through the cluster_network. How
> can I configure the heartbeat to happen through the public network?
>
> I actually configured the property "osd heartbeat address" in the global
> section and provided the public network's subnet, but it is not working out.
>
> Am I doing something wrong? I appreciate your quick responses, as I need this
> urgently.
>
>
> Thanks & Regards,
> Manoj
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Email:
shin...@linux.com
shin...@redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com