Re: [ceph-users] the reweight value of OSD is always 1

2016-08-31 Thread Henrik Korkuc

Hey,
it is normal for the reweight value to be 1. You can lower it yourself (with
"ceph osd reweight OSDNUM newweight") or let "ceph osd reweight-by-utilization"
do it, in order to move some PGs off that OSD.


The value that usually differs, and that depends on disk size, is the "weight" (the CRUSH weight).
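
For example (the OSD id and values are only illustrative, taken from the tree quoted below):

  ceph osd reweight 4 0.85                 # temporary 0.0-1.0 override (the REWEIGHT column)
  ceph osd reweight-by-utilization         # or let Ceph pick the most-utilized OSDs itself
  ceph osd crush reweight osd.4 0.81799    # the WEIGHT column is the CRUSH weight, sized from the disk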

On 16-08-31 22:06, 한승진 wrote:

Hi Cephers!

The reweight value of an OSD is always 1 when we create and activate an
OSD daemon.


I use the ceph-deploy tool whenever I deploy a Ceph cluster.

Does ceph-deploy set a default reweight value?

Can we adjust the reweight value when we activate an OSD daemon?

ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 39.26390 root default
-2  9.81599 host ngcephnode01
 1  0.81799 osd.1   up 1.0  1.0
 3  0.81799 osd.3   up 1.0  1.0
 4  0.81799 osd.4   up 1.0  1.0
 5  0.81799 osd.5   up 1.0  1.0
 6  0.81799 osd.6   up 1.0  1.0



Thanks all!

John Haan





[ceph-users] the reweight value of OSD is always 1

2016-08-31 Thread 한승진
Hi Cephers!

The reweight value of an OSD is always 1 when we create and activate an OSD
daemon.

I use the ceph-deploy tool whenever I deploy a Ceph cluster.

Does ceph-deploy set a default reweight value?

Can we adjust the reweight value when we activate an OSD daemon?

ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 39.26390 root default
-2  9.81599 host ngcephnode01
 1  0.81799 osd.1   up  1.0  1.0
 3  0.81799 osd.3   up  1.0  1.0
 4  0.81799 osd.4   up  1.0  1.0
 5  0.81799 osd.5   up  1.0  1.0
 6  0.81799 osd.6   up  1.0  1.0



Thanks all!

John Haan


Re: [ceph-users] how to debug pg inconsistent state - no ioerrors seen

2016-08-31 Thread Goncalo Borges

Hi Kenneth, All

Just an update for completeness on this topic.

We have been hit again by this issue.

I have been discussing it with Brad (RH staff) in another ML thread, and 
I have opened a tracker issue: http://tracker.ceph.com/issues/17177


I believe this is a bug since there are other people in the ticket 
saying they see the same thing.


Kenneth: Probably you should also add your info there.

Cheers
Goncalo


[ceph-users] HitSet - memory requirement

2016-08-31 Thread Kjetil Jørgensen
Hi,

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ states

>
> Note A larger hit_set_count results in more RAM consumed by the ceph-osd
> process.


By how much, roughly? What order of magnitude: KB, MB, or GB?

After some spelunking I found osd_hit_set_max_size. Is it fair to assume that
we are approximately upper-bounded by
(osd_hit_set_max_size + change) * number-of-hit-sets?
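
For what it's worth, these are the knobs that feed that estimate (the pool name
"cachepool" is just a placeholder); I'm not sure about the exact accounting either,
so treat this only as a way to see the inputs:

  ceph daemon osd.0 config get osd_hit_set_max_size   # run on the host carrying osd.0
  ceph osd pool get cachepool hit_set_count
  ceph osd pool get cachepool hit_set_period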

-KJ
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580


Re: [ceph-users] cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest"

2016-08-31 Thread Brad Hubbard
On Thu, Sep 1, 2016 at 1:08 AM, Sean Redmond  wrote:
> I have updated the tracker with some log extracts as I seem to be hitting
> this or a very similar issue.

I've updated the tracker asking for "rados list-inconsistent-obj" from
the relevant pgs.

That should give us a better idea of the nature of the divergent data.
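
For reference, something along these lines (the pool and PG id are placeholders; use
the PGs flagged inconsistent in "ceph health detail"):

  rados list-inconsistent-pg cephfs_metadata
  rados list-inconsistent-obj 5.3d --format=json-pretty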

>
> I was unsure of the correct syntax for the command ceph-objectstore-tool to
> try and extract that information.
>
> On Wed, Aug 31, 2016 at 5:56 AM, Brad Hubbard  wrote:
>>
>>
>> On Wed, Aug 31, 2016 at 2:30 PM, Goncalo Borges
>>  wrote:
>> > Here it goes:
>> >
>> > # xfs_info /var/lib/ceph/osd/ceph-78
>> > meta-data=/dev/sdu1  isize=2048   agcount=4,
>> > agsize=183107519 blks
>> >  =   sectsz=512   attr=2, projid32bit=1
>> >  =   crc=0finobt=0
>> > data =   bsize=4096   blocks=732430075,
>> > imaxpct=5
>> >  =   sunit=0  swidth=0 blks
>> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> > log  =internal   bsize=4096   blocks=357631, version=2
>> >  =   sectsz=512   sunit=0 blks, lazy-count=1
>> > realtime =none   extsz=4096   blocks=0, rtextents=0
>> >
>> >
>> > # xfs_info /var/lib/ceph/osd/ceph-49
>> > meta-data=/dev/sde1  isize=2048   agcount=4,
>> > agsize=183105343 blks
>> >  =   sectsz=512   attr=2, projid32bit=1
>> >  =   crc=0finobt=0
>> > data =   bsize=4096   blocks=732421371,
>> > imaxpct=5
>> >  =   sunit=0  swidth=0 blks
>> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> > log  =internal   bsize=4096   blocks=357627, version=2
>> >  =   sectsz=512   sunit=0 blks, lazy-count=1
>> > realtime =none   extsz=4096   blocks=0, rtextents=0
>> >
>> >
>> > # xfs_info /var/lib/ceph/osd/ceph-59
>> > meta-data=/dev/sdg1  isize=2048   agcount=4,
>> > agsize=183105343 blks
>> >  =   sectsz=512   attr=2, projid32bit=1
>> >  =   crc=0finobt=0
>> > data =   bsize=4096   blocks=732421371,
>> > imaxpct=5
>> >  =   sunit=0  swidth=0 blks
>> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> > log  =internal   bsize=4096   blocks=357627, version=2
>> >  =   sectsz=512   sunit=0 blks, lazy-count=1
>> > realtime =none   extsz=4096   blocks=0, rtextents=0
>>
>> OK, all look pretty similar so there goes that theory ;)
>>
>> I thought if one or more of the filesystems had a smaller isize they would
>> not
>> be able to store as many extended attributes and these would spill over
>> into
>> omap storage only on those OSDs. It's not that easy but it might be
>> something
>> similar given the ERANGE errors.
>>
>> I've assigned the tracker (thanks) to myself and will follow through on
>> it.
>> Please give me a little time to look further into the ERANGE errors and
>> the logs
>> you provided (thanks again) and I'll update here and the tracker when I
>> know
>> more.
>>
>> --
>> Cheers,
>> Brad
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Slow Request on OSD

2016-08-31 Thread Reed Dier
Multiple XFS corruptions, multiple leveldb issues. It looks to have been the result
of write cache settings, which have been adjusted now.

You'll see below that there are tons of PGs in bad states. Recovery was slowly
but surely bringing the number of bad PGs down, but it seems to have hit a
brick wall with this one slow request operation.

> ceph -s
> cluster []
>  health HEALTH_ERR
> 292 pgs are stuck inactive for more than 300 seconds
> 142 pgs backfill_wait
> 135 pgs degraded
> 63 pgs down
> 80 pgs incomplete
> 199 pgs inconsistent
> 2 pgs recovering
> 5 pgs recovery_wait
> 1 pgs repair
> 132 pgs stale
> 160 pgs stuck inactive
> 132 pgs stuck stale
> 71 pgs stuck unclean
> 128 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 5301381/46255447 objects degraded (11.461%)
> recovery 6335505/46255447 objects misplaced (13.697%)
> recovery 131/20781800 unfound (0.001%)
> 14943 scrub errors
> mds cluster is degraded
>  monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
> election epoch 262, quorum 0,1,2 core,dev,db
>   fsmap e3627: 1/1/1 up {0=core=up:replay}
>  osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
> flags sortbitwise
>   pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
> 8998 GB used, 50598 GB / 59596 GB avail
> 5301381/46255447 objects degraded (11.461%)
> 6335505/46255447 objects misplaced (13.697%)
> 131/20781800 unfound (0.001%)
>  209 active+clean
>  170 active+clean+inconsistent
>  112 stale+active+clean
>   74 undersized+degraded+remapped+wait_backfill+peered
>   63 down+incomplete
>   48 active+undersized+degraded+remapped+wait_backfill
>   19 stale+active+clean+inconsistent
>   17 incomplete
>   12 active+remapped+wait_backfill
>5 active+recovery_wait+degraded
>4 
> undersized+degraded+remapped+inconsistent+wait_backfill+peered
>4 active+remapped+inconsistent+wait_backfill
>2 active+recovering+degraded
>2 undersized+degraded+remapped+peered
>1 stale+active+clean+scrubbing+deep+inconsistent+repair
>1 active+clean+scrubbing+deep
>1 active+clean+scrubbing+inconsistent
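
A few generic commands that can help drill into PGs in these states (the PG id below
is only a placeholder, pick one from the output above):

  ceph health detail | grep -E 'incomplete|down|unfound' | head
  ceph pg dump_stuck inactive
  ceph pg 6.c5 query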


Thanks,

Reed

> On Aug 31, 2016, at 4:08 PM, Wido den Hollander  wrote:
> 
>> 
>> On 31 August 2016 at 22:56, Reed Dier wrote:
>> 
>> 
>> After a power failure left our jewel cluster crippled, I have hit a sticking 
>> point in attempted recovery.
>> 
>> Out of 8 osd’s, we likely lost 5-6, trying to salvage what we can.
>> 
> 
> That's probably too much. What do you mean by "lost"? Is XFS crippled/corrupted?
> That shouldn't happen.
> 
>> In addition to rados pools, we were also using CephFS, and the 
>> cephfs.metadata and cephfs.data pools likely lost plenty of PG’s.
>> 
> 
> What is the status of all PGs? What does 'ceph -s' show?
> 
> Are all PGs active? Since that's something which needs to be done first.
> 
>> The mds has reported this ever since returning from the power loss:
>>> # ceph mds stat
>>> e3627: 1/1/1 up {0=core=up:replay}
>> 
>> 
>> When looking at the slow request on the osd, it shows this task which I 
>> can’t quite figure out. Any help appreciated.
>> 
> 
> Are all clients (including MDS) and OSDs running the same version?
> 
> Wido
> 
>>> # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
>>> {
>>>"ops": [
>>>{
>>>"description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded) 
>>> ack+retry+read+known_if_redirected+full_force e3668)",
>>>"initiated_at": "2016-08-31 10:37:18.833644",
>>>"age": 22212.235361,
>>>"duration": 22212.235379,
>>>"type_data": [
>>>"no flag points reached",
>>>[
>>>{
>>>"time": "2016-08-31 10:37:18.833644",
>>>"event": "initiated"
>>>}
>>>]
>>>]
>>>}
>>>],
>>>"num_ops": 1
>>> }
>> 
>> Thanks,
>> 
>> Reed


Re: [ceph-users] Slow Request on OSD

2016-08-31 Thread Wido den Hollander

> On 31 August 2016 at 22:56, Reed Dier wrote:
> 
> 
> After a power failure left our jewel cluster crippled, I have hit a sticking 
> point in attempted recovery.
> 
> Out of 8 osd’s, we likely lost 5-6, trying to salvage what we can.
> 

That's probably too much. What do you mean by "lost"? Is XFS crippled/corrupted? That
shouldn't happen.

> In addition to rados pools, we were also using CephFS, and the 
> cephfs.metadata and cephfs.data pools likely lost plenty of PG’s.
> 

What is the status of all PGs? What does 'ceph -s' show?

Are all PGs active? Since that's something which needs to be done first.

> The mds has reported this ever since returning from the power loss:
> > # ceph mds stat
> > e3627: 1/1/1 up {0=core=up:replay}
> 
> 
> When looking at the slow request on the osd, it shows this task which I can’t 
> quite figure out. Any help appreciated.
> 

Are all clients (including MDS) and OSDs running the same version?
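
A quick way to check (the MDS admin socket path is an assumption, adjust to your setup):

  ceph tell osd.* version
  ceph --admin-daemon /var/run/ceph/ceph-mds.core.asok version   # on the MDS host
  ceph --version                                                 # on each client host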

Wido

> > # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
> > {
> > "ops": [
> > {
> > "description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded) 
> > ack+retry+read+known_if_redirected+full_force e3668)",
> > "initiated_at": "2016-08-31 10:37:18.833644",
> > "age": 22212.235361,
> > "duration": 22212.235379,
> > "type_data": [
> > "no flag points reached",
> > [
> > {
> > "time": "2016-08-31 10:37:18.833644",
> > "event": "initiated"
> > }
> > ]
> > ]
> > }
> > ],
> > "num_ops": 1
> > }
> 
> Thanks,
> 
> Reed


Re: [ceph-users] Jewel - frequent ceph-osd crashes

2016-08-31 Thread Wido den Hollander

> On 31 August 2016 at 22:14, Gregory Farnum wrote:
> 
> 
> On Tue, Aug 30, 2016 at 2:17 AM, Andrei Mikhailovsky  
> wrote:
> > Hello
> >
> > I've got a small cluster of 3 osd servers and 30 osds between them running
> > Jewel 10.2.2 on Ubuntu 16.04 LTS with stock kernel version 4.4.0-34-generic.
> >
> > I am experiencing rather frequent osd crashes, which tend to happen a few
> > times a month on random osds. The latest one gave me the following log
> > message:
> >
> >
> > 2016-08-30 06:26:29.861106 7f8ed54f1700 -1 journal aio to 13085011968~8192
> > wrote 18446744073709551615
> > 2016-08-30 06:26:29.862558 7f8ed54f1700 -1 os/filestore/FileJournal.cc: In
> > function 'void FileJournal::write_finish_thread_entry()' thread 7f8ed54f1700
> > time 2016-08-30 06:26:29.86112
> > 2
> > os/filestore/FileJournal.cc: 1541: FAILED assert(0 == "unexpected aio
> > error")
> 
> As it says, the OSD got back an unexpected AIO error (and so it quit
> rather than trying to continue on a possibly/probably flaky FS/disk).
> Look at dmesg et al and see if there's anything useful; check your
> disk info; etc.

I have seen this happen on Dell systems with a PERC (rebranded LSI) controller
after an upgrade to Jewel.

These systems were running Ubuntu 14.04 with the 3.13 kernel.

After upgrading the kernel to at least 3.19 (newer is better), it went away. If
that didn't work, I set the queue_depth of the device to 1 in /sys.
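
For reference, roughly what I mean (the device name is just an example):

  cat /sys/block/sdb/device/queue_depth
  echo 1 > /sys/block/sdb/device/queue_depth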

On Ubuntu 16.04 I didn't observe these crashes.

Wido

> -Greg


[ceph-users] Slow Request on OSD

2016-08-31 Thread Reed Dier
After a power failure left our Jewel cluster crippled, I have hit a sticking
point in the attempted recovery.

Out of 8 OSDs, we likely lost 5-6; we are trying to salvage what we can.

In addition to RADOS pools, we were also using CephFS, and the cephfs.metadata
and cephfs.data pools likely lost plenty of PGs.

The mds has reported this ever since returning from the power loss:
> # ceph mds stat
> e3627: 1/1/1 up {0=core=up:replay}


When looking at the slow request on the OSD, it shows this operation, which I
can't quite figure out. Any help is appreciated.

> # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
> {
> "ops": [
> {
> "description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded) 
> ack+retry+read+known_if_redirected+full_force e3668)",
> "initiated_at": "2016-08-31 10:37:18.833644",
> "age": 22212.235361,
> "duration": 22212.235379,
> "type_data": [
> "no flag points reached",
> [
> {
> "time": "2016-08-31 10:37:18.833644",
> "event": "initiated"
> }
> ]
> ]
> }
> ],
> "num_ops": 1
> }

Thanks,

Reed


Re: [ceph-users] /var/lib/mysql, CephFS vs RBD

2016-08-31 Thread RDS
xfs
> On Aug 31, 2016, at 12:14 PM, Lazuardi Nasution  
> wrote:
> 
> Hi,
> 
> Thank you for your opinion. I don't know if RBD-NBD is supported by OpenStack 
> since my environment is OpenStack. What file system do you use in your test 
> for RBD and RBD-NBD?
> 
> Best regards,
> 
> On Wed, Aug 31, 2016 at 10:11 PM, RDS mailto:rs3...@me.com>> 
> wrote:
> In my testing, using RBD-NBD is faster than using RBD or CephFS.
> For a MySQL/sysbench test using 25 threads using OLTP, using a 40G network 
> between the client and Ceph, here are some of my results:
> Using ceph-rbd:  transactions per sec:  8620
> using ceph rbd-nbd:  transaction per sec:  9359
> using cephfs:  transactions per sec:  7550
> Rick
> > On Aug 31, 2016, at 10:59 AM, Lazuardi Nasution  > > wrote:
> >
> > Hi,
> >
> > I'm looking for pros and cons of mounting /var/lib/mysql with CephFS or RBD 
> > for getting best performance. MySQL save data as files on mostly 
> > configuration but the I/O is block access because the file is opened until 
> > MySQL down. This case give us both options for storing the data files. For 
> > RBD pros, please suggest the file system should be formatted on the mounted 
> > volume.
> >
> > Actually this case can happen on any database which stores the data as 
> > files.
> >
> > Best regards,
> 

Rick Stehno




Re: [ceph-users] Jewel - frequent ceph-osd crashes

2016-08-31 Thread Gregory Farnum
On Tue, Aug 30, 2016 at 2:17 AM, Andrei Mikhailovsky  wrote:
> Hello
>
> I've got a small cluster of 3 osd servers and 30 osds between them running
> Jewel 10.2.2 on Ubuntu 16.04 LTS with stock kernel version 4.4.0-34-generic.
>
> I am experiencing rather frequent osd crashes, which tend to happen a few
> times a month on random osds. The latest one gave me the following log
> message:
>
>
> 2016-08-30 06:26:29.861106 7f8ed54f1700 -1 journal aio to 13085011968~8192
> wrote 18446744073709551615
> 2016-08-30 06:26:29.862558 7f8ed54f1700 -1 os/filestore/FileJournal.cc: In
> function 'void FileJournal::write_finish_thread_entry()' thread 7f8ed54f1700
> time 2016-08-30 06:26:29.86112
> 2
> os/filestore/FileJournal.cc: 1541: FAILED assert(0 == "unexpected aio
> error")

As it says, the OSD got back an unexpected AIO error (and so it quit
rather than trying to continue on a possibly/probably flaky FS/disk).
Look at dmesg et al and see if there's anything useful; check your
disk info; etc.
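
For example, something like (the device name is a placeholder):

  dmesg -T | grep -iE 'error|fail|timeout'
  smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'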
-Greg


Re: [ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread Daniel Gryniewicz

On 08/31/2016 02:15 PM, Wido den Hollander wrote:



On 31 August 2016 at 15:28, Daniel Gryniewicz wrote:


I believe this is a Ganesha bug, as discussed on the Ganesha list.



Ah, thanks. Do you maybe have a link or subject so I can chime in?

Wido


https://sourceforge.net/p/nfs-ganesha/mailman/message/35321818/

Daniel




Daniel

On 08/31/2016 06:55 AM, Wido den Hollander wrote:



On 31 August 2016 at 12:42, John Spray wrote:


On Wed, Aug 31, 2016 at 11:23 AM, Wido den Hollander  wrote:

Hi,

I have a CephFS filesystem which is re-exported through NFS Ganesha (v2.3.0) 
with Ceph 10.2.2

The export works fine, but when calling a chgrp on a file the UID is set to 
root.

Example list of commands:

$ chown www-data:www-data myfile

That works, file is now owned by www-data/www-data

$ chgrp nogroup myfile


Does the nogroup group have GID -1 by any chance?  There is a bug in
the userspace client where it doesn't handle negative gid/uids
properly at all, which Greg is working on at the moment
(http://tracker.ceph.com/issues/16367)



No, it doesn't. nogroup is just an example here. This happens with any group.

The UID is always set back to 0.

Wido


John


That fails: the UID is now set to 0 (root) and the group hasn't changed.

I tracked this down to being a Ganesha problem in combination with CephFS. 
Running these commands on either a kernel mounted CephFS or via FUSE I don't 
see this problem.

The Ganesha configuration:

NFSv4
{
IdmapConf = /etc/idmapd.conf;
}

EXPORT
{
Export_ID = 1;
Path = "/";
Pseudo = "/";
Access_Type = RW;
Protocols = "4";
Squash = no_root_squash;
Transports = TCP;
SecType = sys;

FSAL {
Name = CEPH;
}
}

Has anybody seen this before?

Wido


Re: [ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread Wido den Hollander

> On 31 August 2016 at 15:28, Daniel Gryniewicz wrote:
> 
> 
> I believe this is a Ganesha bug, as discussed on the Ganesha list.
> 

Ah, thanks. Do you maybe have a link or subject so I can chime in?

Wido

> Daniel
> 
> On 08/31/2016 06:55 AM, Wido den Hollander wrote:
> >
> >> On 31 August 2016 at 12:42, John Spray wrote:
> >>
> >>
> >> On Wed, Aug 31, 2016 at 11:23 AM, Wido den Hollander  wrote:
> >>> Hi,
> >>>
> >>> I have a CephFS filesystem which is re-exported through NFS Ganesha 
> >>> (v2.3.0) with Ceph 10.2.2
> >>>
> >>> The export works fine, but when calling a chgrp on a file the UID is set 
> >>> to root.
> >>>
> >>> Example list of commands:
> >>>
> >>> $ chown www-data:www-data myfile
> >>>
> >>> That works, file is now owned by www-data/www-data
> >>>
> >>> $ chgrp nogroup myfile
> >>
> >> Does the nogroup group have GID -1 by any chance?  There is a bug in
> >> the userspace client where it doesn't handle negative gid/uids
> >> properly at all, which Greg is working on at the moment
> >> (http://tracker.ceph.com/issues/16367)
> >>
> >
> > No, it doesn't. nogroup is just an example here. This happens with any 
> > group.
> >
> > The UID is always set back to 0.
> >
> > Wido
> >
> >> John
> >>
> >>> That fails: the UID is now set to 0 (root) and the group hasn't changed.
> >>>
> >>> I tracked this down to being a Ganesha problem in combination with 
> >>> CephFS. Running these commands on either a kernel mounted CephFS or via 
> >>> FUSE I don't see this problem.
> >>>
> >>> The Ganesha configuration:
> >>>
> >>> NFSv4
> >>> {
> >>> IdmapConf = /etc/idmapd.conf;
> >>> }
> >>>
> >>> EXPORT
> >>> {
> >>> Export_ID = 1;
> >>> Path = "/";
> >>> Pseudo = "/";
> >>> Access_Type = RW;
> >>> Protocols = "4";
> >>> Squash = no_root_squash;
> >>> Transports = TCP;
> >>> SecType = sys;
> >>>
> >>> FSAL {
> >>> Name = CEPH;
> >>> }
> >>> }
> >>>
> >>> Has anybody seen this before?
> >>>
> >>> Wido


Re: [ceph-users] cephfs page cache

2016-08-31 Thread Sean Redmond
I am not sure how to tell.

Server1 and server2 mount the Ceph file system using kernel client 4.7.2,
and I can replicate the problem using '/usr/bin/sum' to read the file or an
HTTP GET request via a web server (Apache).

On Wed, Aug 31, 2016 at 2:38 PM, Yan, Zheng  wrote:

> On Wed, Aug 31, 2016 at 12:49 AM, Sean Redmond 
> wrote:
> > Hi,
> >
> > I have been able to pick through the process a little further and
> replicate
> > it via the command line. The flow looks like this:
> >
> > 1) The user uploads an image to webserver server 'uploader01' it gets
> > written to a path such as '/cephfs/webdata/static/456/
> JHL/66448H-755h.jpg'
> > on cephfs
> >
> > 2) The MDS makes the file meta data available for this new file
> immediately
> > to all clients.
> >
> > 3) The 'uploader01' server asynchronously commits the file contents to
> disk
> > as sync is not explicitly called during the upload.
> >
> > 4) Before step 3 is done the visitor requests the file via one of two web
> > servers server1 or server2 - the MDS provides the meta data but the
> contents
> > of the file is not committed to disk yet so the data read returns 0's -
> This
> > is then cached by the file system page cache until it expires or is
> flushed
> > manually.
>
> do server1 or server2 use memory-mapped IO to read the file?
>
> Regards
> Yan, Zheng
>
> >
> > 5) As step 4 typically only happens on one of the two web servers before
> > step 3 is complete we get the mismatch between server1 and server2 file
> > system page cache.
> >
> > The below demonstrates how to reproduce this issue
> >
> > http://pastebin.com/QK8AemAb
> >
> > As we can see the checksum of the file returned by the web server is 0 as
> > the file contents has not been flushed to disk from server uploader01
> >
> > If however we call ‘sync’ as shown below the checksum is correct:
> >
> > http://pastebin.com/p4CfhEFt
> >
> > If we also wait for 10 seconds for the kernel to flush the dirty pages,
> we
> > can also see the checksum is valid:
> >
> > http://pastebin.com/1w6UZzNQ
> >
> > It looks it maybe a race between the time it takes the uploader01 server
> to
> > commit the file to the file system and the fast incoming read request
> from
> > the visiting user to server1 or server2.
> >
> > Thanks
> >
> >
> > On Tue, Aug 30, 2016 at 10:21 AM, Sean Redmond 
> > wrote:
> >>
> >> You are correct it only seems to impact recently modified files.
> >>
> >> On Tue, Aug 30, 2016 at 3:36 AM, Yan, Zheng  wrote:
> >>>
> >>> On Tue, Aug 30, 2016 at 2:11 AM, Gregory Farnum 
> >>> wrote:
> >>> > On Mon, Aug 29, 2016 at 7:14 AM, Sean Redmond <
> sean.redmo...@gmail.com>
> >>> > wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I am running cephfs (10.2.2) with kernel 4.7.0-1. I have noticed
> that
> >>> >> frequently static files are showing empty when serviced via a web
> >>> >> server
> >>> >> (apache). I have tracked this down further and can see when running
> a
> >>> >> checksum against the file on the cephfs file system on the node
> >>> >> serving the
> >>> >> empty http response the checksum is '0'
> >>> >>
> >>> >> The below shows the checksum on a defective node.
> >>> >>
> >>> >> [root@server2]# ls -al /cephfs/webdata/static/456/
> JHL/66448H-755h.jpg
> >>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
> >>>
> >>> It seems this file was modified recently. Maybe the web server
> >>> silently modifies the files. Please check if this issue happens on
> >>> older files.
> >>>
> >>> Regards
> >>> Yan, Zheng
> >>>
> >>> >>
> >>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
> >>> >> 053
> >>> >
> >>> > So can we presume there are no file contents, and it's just 53 blocks
> >>> > of zeros?
> >>> >
> >>> > This doesn't sound familiar to me; Zheng, do you have any ideas?
> >>> > Anyway, ceph-fuse shouldn't be susceptible to this bug even with the
> >>> > page cache enabled; if you're just serving stuff via the web it's
> >>> > probably a better idea anyway (harder to break, easier to update,
> >>> > etc).
> >>> > -Greg
> >>> >
> >>> >>
> >>> >> The below shows the checksum on a working node.
> >>> >>
> >>> >> [root@server1]# ls -al /cephfs/webdata/static/456/
> JHL/66448H-755h.jpg
> >>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
> >>> >>
> >>> >> [root@server1]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
> >>> >> 0362053
> >>> >> [root@server1]#
> >>> >>
> >>> >> If I flush the cache as shown below the checksum returns as expected
> >>> >> and the
> >>> >> web server serves up valid content.
> >>> >>
> >>> >> [root@server2]# echo 3 > /proc/sys/vm/drop_caches
> >>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
> >>> >> 0362053
> >>> >>
> >>> >> After some time typically less than 1hr the issue repeats, It seems
> to
> >>> >> not
> >>> >> repeat if I take any one of the servers out of the LB and only serve
> >>> >> requests from one of the servers.

Re: [ceph-users] /var/lib/mysql, CephFS vs RBD

2016-08-31 Thread Lazuardi Nasution
Hi,

Thank you for your opinion. I don't know if RBD-NBD is supported by
OpenStack since my environment is OpenStack. What file system do you use in
your test for RBD and RBD-NBD?

Best regards,

On Wed, Aug 31, 2016 at 10:11 PM, RDS  wrote:

> In my testing, using RBD-NBD is faster than using RBD or CephFS.
> For a MySQL/sysbench test using 25 threads using OLTP, using a 40G network
> between the client and Ceph, here are some of my results:
> Using ceph-rbd:  transactions per sec:  8620
> using ceph rbd-nbd:  transaction per sec:  9359
> using cephfs:  transactions per sec:  7550
> Rick
> > On Aug 31, 2016, at 10:59 AM, Lazuardi Nasution 
> wrote:
> >
> > Hi,
> >
> > I'm looking for pros and cons of mounting /var/lib/mysql with CephFS or
> RBD for getting best performance. MySQL save data as files on mostly
> configuration but the I/O is block access because the file is opened until
> MySQL down. This case give us both options for storing the data files. For
> RBD pros, please suggest the file system should be formatted on the mounted
> volume.
> >
> > Actually this case can happen on any database which stores the data as
> files.
> >
> > Best regards,
>


Re: [ceph-users] /var/lib/mysql, CephFS vs RBD

2016-08-31 Thread RDS
In my testing, using RBD-NBD is faster than using RBD or CephFS.
For a MySQL/sysbench test using 25 threads using OLTP, using a 40G network 
between the client and Ceph, here are some of my results:
Using ceph-rbd:  transactions per sec:  8620
using ceph rbd-nbd:  transactions per sec:  9359
using cephfs:  transactions per sec:  7550
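
In case it helps, this is roughly how an rbd-nbd mount for /var/lib/mysql can be
set up (the pool/image names and size are made up):

  rbd create mysql/datavol --size 102400
  rbd-nbd map mysql/datavol        # prints the nbd device, e.g. /dev/nbd0
  mkfs.xfs /dev/nbd0
  mount /dev/nbd0 /var/lib/mysql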
Rick 
> On Aug 31, 2016, at 10:59 AM, Lazuardi Nasution  
> wrote:
> 
> Hi,
> 
> I'm looking for pros and cons of mounting /var/lib/mysql with CephFS or RBD 
> for getting best performance. MySQL save data as files on mostly 
> configuration but the I/O is block access because the file is opened until 
> MySQL down. This case give us both options for storing the data files. For 
> RBD pros, please suggest the file system should be formatted on the mounted 
> volume.
> 
> Actually this case can happen on any database which stores the data as files.
> 
> Best regards,


Re: [ceph-users] cephfs metadata pool: deep-scrub error "omap_digest != best guess omap_digest"

2016-08-31 Thread Sean Redmond
I have updated the tracker with some log extracts as I seem to be hitting
this or a very similar issue.

I was unsure of the correct syntax for the command ceph-objectstore-tool to
try and extract that information.

On Wed, Aug 31, 2016 at 5:56 AM, Brad Hubbard  wrote:

>
> On Wed, Aug 31, 2016 at 2:30 PM, Goncalo Borges <
> goncalo.bor...@sydney.edu.au> wrote:
> > Here it goes:
> >
> > # xfs_info /var/lib/ceph/osd/ceph-78
> > meta-data=/dev/sdu1  isize=2048   agcount=4,
> agsize=183107519 blks
> >  =   sectsz=512   attr=2, projid32bit=1
> >  =   crc=0finobt=0
> > data =   bsize=4096   blocks=732430075, imaxpct=5
> >  =   sunit=0  swidth=0 blks
> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> > log  =internal   bsize=4096   blocks=357631, version=2
> >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none   extsz=4096   blocks=0, rtextents=0
> >
> >
> > # xfs_info /var/lib/ceph/osd/ceph-49
> > meta-data=/dev/sde1  isize=2048   agcount=4,
> agsize=183105343 blks
> >  =   sectsz=512   attr=2, projid32bit=1
> >  =   crc=0finobt=0
> > data =   bsize=4096   blocks=732421371, imaxpct=5
> >  =   sunit=0  swidth=0 blks
> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> > log  =internal   bsize=4096   blocks=357627, version=2
> >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none   extsz=4096   blocks=0, rtextents=0
> >
> >
> > # xfs_info /var/lib/ceph/osd/ceph-59
> > meta-data=/dev/sdg1  isize=2048   agcount=4,
> agsize=183105343 blks
> >  =   sectsz=512   attr=2, projid32bit=1
> >  =   crc=0finobt=0
> > data =   bsize=4096   blocks=732421371, imaxpct=5
> >  =   sunit=0  swidth=0 blks
> > naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> > log  =internal   bsize=4096   blocks=357627, version=2
> >  =   sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none   extsz=4096   blocks=0, rtextents=0
>
> OK, all look pretty similar so there goes that theory ;)
>
> I thought if one or more of the filesystems had a smaller isize they would
> not
> be able to store as many extended attributes and these would spill over
> into
> omap storage only on those OSDs. It's not that easy but it might be
> something
> similar given the ERANGE errors.
>
> I've assigned the tracker (thanks) to myself and will follow through on it.
> Please give me a little time to look further into the ERANGE errors and
> the logs
> you provided (thanks again) and I'll update here and the tracker when I
> know
> more.
>
> --
> Cheers,
> Brad


[ceph-users] /var/lib/mysql, CephFS vs RBD

2016-08-31 Thread Lazuardi Nasution
Hi,

I'm looking for the pros and cons of mounting /var/lib/mysql on CephFS versus
RBD to get the best performance. In most configurations MySQL stores its data
as files, but the I/O is effectively block access because the files stay open
for as long as MySQL is running. That gives us both options for storing the
data files. If RBD is the better choice, please also suggest which file system
should be used on the mounted volume.

Actually, this question applies to any database that stores its data as
files.

Best regards,


[ceph-users] Antw: Re: Antw: Re: rbd cache mode with qemu

2016-08-31 Thread Steffen Weißgerber



>>> Alexandre DERUMIER  wrote on Wednesday, 31 August 2016 at 16:10:
>>>Meanwhile I tried to update the viostor driver within the vm (a W2k8)
>>>but that results in a bluescreen.
> 
>>>When booting via recovery console and loading the new driver from an actual
>>>qemu driver iso the disks are all in writeback mode.
>>>So maybe the cache mode depends on the iodriver within the machine.
>>>
>>>I'll see how to upgrade the driver without having a bluescreen
>>>afterwards
>>>(by having another reason to avoid that windows crap).
> 
> Very old virtio drivers (don't remember, but it was some year ago),
didn't 
> support flush/fua correctly.
> https://bugzilla.redhat.com/show_bug.cgi?id=837324 
> 
> So, it's quite possible that rbd_cache_writethrough_until_flush force

> writethrough in this case.

That makes sense. The time reference of the bugreport matches the
driver version 61.63.103.3000
from 03.07.2012 distributed with virtio-win-0.1-30.iso from Fedora.

Thank you.

Regards

> 
> - Original message -
> From: "Steffen Weißgerber" 
> To: "ceph-users" , l...@stella-telecom.fr 
> Sent: Wednesday, 31 August 2016 15:43:17 
> Subject: [ceph-users] Antw: Re:  rbd cache mode with qemu 
> 
> >>> Loris Cuoghi  wrote on Tuesday, 30 August 2016 at 16:34: 
>> Hello, 
>> 
> 
> Hi Loris, 
> 
> thank you for your answer. 
> 
>> On 30/08/2016 at 14:08, Steffen Weißgerber wrote: 
>>> Hello, 
>>> 
>>> after correcting the configuration for different qemu vm's with rbd

> disks 
>>> (we removed the cache=writethrough option to have the default 
>>> writeback mode) we have a strange behaviour after restarting the 
> vm's. 
>>> 
>>> For most of them the cache mode is now writeback as expected. But 
> some 
>>> neverthless use the disks in writethrough mode (at least an 'info 
> block' 
>>> reports that on the qemu monitor). This does also not change when 
>>> configuring cache=writeback explicitly. 
>>> 
>>> On our 6 node KVM Cluster we have the same behaviour for the 
> problematic 
>>> vm's on all hosts which are configured equally with qemu 2.5.1,
ceph 
> 0.94.7 
>>> and kernel 4.4.6. 
>>> 
>>> The ceph cluster has version 0.94.6. 
>>> 
>>> For me it seems to be a problem specific to the rbd's. Is there a 
> way to 
>>> check the cache behaviour of a single rbd? 
>> 
>> To my knowledge, there is not such a thing as an RBD's cache mode. 
>> 
>> The librbd cache exists in the client's memory, not on the Ceph 
>> cluster's hosts. Its configuration is to be put in the ceph 
>> configuration file, on each client host. 
>> 
> 
> Yes, that's what I think also but my guess was that the client cache

> behaviour 
> is somehow controlled by the communication between librbd on the
client 
> and 
> the ceph cluster. 
> 
>> Setting QEMU's disk cache mode to "writeback" informs the guest's OS

> 
>> that it needs to explicitly flush dirty data to persistent storage 
> when 
>> needed. 
> 
> Meanwhile I tried to update the viostor driver within the vm (a W2k8)

> but that 
> results in a bluescreen. 
> 
> When booting via recovery console and loading the new driver from an

> actual 
> qemu driver iso the disks are all in writeback mode. 
> So maybe the cache mode depends on the iodriver within the machine. 
> 
> I'll see how to upgrade the driver without having a bluescreen 
> afterwards 
> (by having another reason to avoid that windows crap). 
> 
>> 
>>> 
>>> Regards 
>>> 
>>> Steffen 
>>> 
>>> 
> 
> Regards 
> 
>>> 
> 
> -- 
> Klinik-Service Neubrandenburg GmbH 
> Allendestr. 30, 17036 Neubrandenburg 
> Amtsgericht Neubrandenburg, HRB 2457 
> Geschaeftsfuehrerin: Gudrun Kappich 


-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich


Re: [ceph-users] build and Compile ceph in development mode takes an hour

2016-08-31 Thread Lenz Grimmer
On 08/18/2016 12:42 AM, Brad Hubbard wrote:

> On Thu, Aug 18, 2016 at 1:12 AM, agung Laksono
>  wrote:
>> 
>> Is there a way to make the compiling process be faster? something
>> like only compile a particular code that I change.
> 
> Sure, just use the same build directory and run "make" again after
> you make code changes and it should only re-compile the binaries that
> are affected by your code changes.
> 
> You can use "make -jX" if you aren't already where 'X' is usually 
> number of CPUs / 2 which may speed up the build.

In addition to that, ccache might come in handy, too:
https://ccache.samba.org/
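
A sketch of how that can be wired together (the CMake launcher flags need CMake >= 3.4;
adjust the install command and the -j value to your distro and CPU count):

  sudo apt-get install ccache     # or: yum install ccache
  cd build
  cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
  make -j4                        # roughly half your CPU count, as Brad suggested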

Lenz





Re: [ceph-users] build and Compile ceph in development mode takes an hour

2016-08-31 Thread agung Laksono
HI Brad,

After exploring Ceph, I found that sometimes when I run the *make* command
without removing the build folder, it ends with this error:

/home/agung/project/samc/ceph/build/bin/init-ceph: ceph conf ./ceph.conf
not found; system is not configured.
rm -f core*
ip 127.0.0.1
port 6789

NOTE: hostname resolves to loopback; remote hosts will not be able to
  connect.  either adjust /etc/hosts, or edit this script to use your
  machine's real IP.

/home/agung/project/samc/ceph/build/bin/ceph-authtool --create-keyring
--gen-key --name=mon. /home/agung/project/samc/ceph/build/keyring --cap mon
allow *
creating /home/agung/project/samc/ceph/build/keyring
/home/agung/project/samc/ceph/build/bin/ceph-authtool --gen-key
--name=client.admin --set-uid=0 --cap mon allow * --cap osd allow * --cap
mds allow * /home/agung/project/samc/ceph/build/keyring
/home/agung/project/samc/ceph/build/bin/monmaptool --create --clobber --add
a 127.0.0.1:6789 --add b 127.0.0.1:6790 --add c 127.0.0.1:6791 --print
/tmp/ceph_monmap.8374
/home/agung/project/samc/ceph/build/bin/monmaptool: monmap file
/tmp/ceph_monmap.8374
/home/agung/project/samc/ceph/build/bin/monmaptool: generated fsid
9a08986c-b051-48e1-9002-f49af0cb9efd
epoch 0
fsid 9a08986c-b051-48e1-9002-f49af0cb9efd
last_changed 2016-08-31 20:54:42.473678
created 2016-08-31 20:54:42.473678
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6791/0 mon.c
/home/agung/project/samc/ceph/build/bin/monmaptool: writing epoch 0 to
/tmp/ceph_monmap.8374 (3 monitors)
rm -rf -- /home/agung/project/samc/ceph/build/dev/mon.a
mkdir -p /home/agung/project/samc/ceph/build/dev/mon.a
/home/agung/project/samc/ceph/build/bin/ceph-mon --mkfs -c
/home/agung/project/samc/ceph/build/ceph.conf -i a
--monmap=/tmp/ceph_monmap.8374
--keyring=/home/agung/project/samc/ceph/build/keyring
Segmentation fault (core dumped)


I have tried to remove the ceph.conf and clean up the temp/ folder before
running OSD=3 MON=3 MDS=1 ../src/vstart.sh -n -x -l,
but it still did not work. Do you have any suggestions?
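
For reference, one way to fully reset the vstart environment without a full rebuild,
assuming the crash comes from stale cluster state left by a previous run:

  cd build
  ../src/stop.sh                      # stop leftover daemons from the previous vstart
  rm -rf out dev ceph.conf keyring    # wipe the old vstart cluster state
  make -j4                            # recompile only what changed
  OSD=3 MON=3 MDS=1 ../src/vstart.sh -n -x -l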


On Thu, Aug 18, 2016 at 5:42 AM, Brad Hubbard  wrote:

> On Thu, Aug 18, 2016 at 1:12 AM, agung Laksono 
> wrote:
> > Hi Ceph User,
> >
> > When I make change inside ceph codes in the development mode,
> > I found that recompiling takes around an hour because I have to remove
> > a build folder and all the contest and then reproduce it.
> >
> > Is there a way to make the compiling process be faster? something like
> only
> > compile a particular code that I change.
>
> Sure, just use the same build directory and run "make" again after you
> make code
> changes and it should only re-compile the binaries that are affected
> by your code
> changes.
>
> You can use "make -jX" if you aren't already where 'X' is usually
> number of CPUs / 2
> which may speed up the build.
>
> HTH,
> Brad
>
> >
> > Thanks before
> >
> >
> > --
> > Cheers,
> >
> > Agung Laksono
> >
> >
> >
>



-- 
Cheers,

Agung Laksono


Re: [ceph-users] Antw: Re: rbd cache mode with qemu

2016-08-31 Thread Alexandre DERUMIER
>>Meanwhile I tried to update the viostor driver within the vm (a W2k8)
>>but that
>>results in a bluescreen.

>>When booting via recovery console and loading the new driver from an
>>actual
>>qemu driver iso the disks are all in writeback mode.
>>So maybe the cache mode depends on the iodriver within the machine.
>>
>>I'll see how to upgrade the driver without having a bluescreen
>>afterwards
>>(by having another reason to avoid that windows crap).

Very old virtio drivers (I don't remember exactly, but it was some years ago) didn't
support flush/FUA correctly.
https://bugzilla.redhat.com/show_bug.cgi?id=837324

So it's quite possible that rbd_cache_writethrough_until_flush forces
writethrough in this case.
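
For reference, one way to check what the librbd client actually ended up with at
runtime; this assumes an admin socket is configured for the client in ceph.conf,
and the socket path below is only an example:

  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config get rbd_cache
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config get rbd_cache_writethrough_until_flush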

- Original message -
From: "Steffen Weißgerber" 
To: "ceph-users" , l...@stella-telecom.fr
Sent: Wednesday, 31 August 2016 15:43:17
Subject: [ceph-users] Antw: Re:  rbd cache mode with qemu

>>> Loris Cuoghi  wrote on Tuesday, 30 August 2016 at 16:34: 
> Hello, 
> 

Hi Loris, 

thank you for your answer. 

> On 30/08/2016 at 14:08, Steffen Weißgerber wrote: 
>> Hello, 
>> 
>> after correcting the configuration for different qemu vm's with rbd 
disks 
>> (we removed the cache=writethrough option to have the default 
>> writeback mode) we have a strange behaviour after restarting the 
vm's. 
>> 
>> For most of them the cache mode is now writeback as expected. But 
some 
>> neverthless use the disks in writethrough mode (at least an 'info 
block' 
>> reports that on the qemu monitor). This does also not change when 
>> configuring cache=writeback explicitly. 
>> 
>> On our 6 node KVM Cluster we have the same behaviour for the 
problematic 
>> vm's on all hosts which are configured equally with qemu 2.5.1, ceph 
0.94.7 
>> and kernel 4.4.6. 
>> 
>> The ceph cluster has version 0.94.6. 
>> 
>> For me it seems to be a problem specific to the rbd's. Is there a 
way to 
>> check the cache behaviour of a single rbd? 
> 
> To my knowledge, there is not such a thing as an RBD's cache mode. 
> 
> The librbd cache exists in the client's memory, not on the Ceph 
> cluster's hosts. Its configuration is to be put in the ceph 
> configuration file, on each client host. 
> 

Yes, that's what I think also but my guess was that the client cache 
behaviour 
is somehow controlled by the communication between librbd on the client 
and 
the ceph cluster. 

> Setting QEMU's disk cache mode to "writeback" informs the guest's OS 

> that it needs to explicitly flush dirty data to persistent storage 
when 
> needed. 

Meanwhile I tried to update the viostor driver within the vm (a W2k8) 
but that 
results in a bluescreen. 

When booting via recovery console and loading the new driver from an 
actual 
qemu driver iso the disks are all in writeback mode. 
So maybe the cache mode depends on the iodriver within the machine. 

I'll see how to upgrade the driver without having a bluescreen 
afterwards 
(by having another reason to avoid that windows crap). 

> 
>> 
>> Regards 
>> 
>> Steffen 
>> 
>> 

Regards 

>> 

-- 
Klinik-Service Neubrandenburg GmbH 
Allendestr. 30, 17036 Neubrandenburg 
Amtsgericht Neubrandenburg, HRB 2457 
Geschaeftsfuehrerin: Gudrun Kappich 


[ceph-users] Antw: Re: rbd cache mode with qemu

2016-08-31 Thread Steffen Weißgerber



>>> Loris Cuoghi  wrote on Tuesday, 30 August 2016 at 16:34:
> Hello,
> 

Hi Loris,

thank you for your answer.

> On 30/08/2016 at 14:08, Steffen Weißgerber wrote:
>> Hello,
>>
>> after correcting the configuration for different qemu vm's with rbd
disks
>> (we removed the cache=writethrough option to have the default
>> writeback mode) we have a strange behaviour after restarting the
vm's.
>>
>> For most of them the cache mode is now writeback as expected. But
some
>> neverthless use the disks in writethrough mode (at least an 'info
block'
>> reports that on the qemu monitor). This does also not change when
>> configuring cache=writeback explicitly.
>>
>> On our 6 node KVM Cluster we have the same behaviour for the
problematic
>> vm's on all hosts which are configured equally with qemu 2.5.1, ceph
0.94.7
>> and kernel 4.4.6.
>>
>> The ceph cluster has version 0.94.6.
>>
>> For me it seems to be a problem specific to the rbd's. Is there a
way to
>> check the cache behaviour of a single rbd?
> 
> To my knowledge, there is not such a thing as an RBD's cache mode.
> 
> The librbd cache exists in the client's memory, not on the Ceph 
> cluster's hosts. Its configuration is to be put in the ceph 
> configuration file, on each client host.
> 

Yes, that's what I think also but my guess was that the client cache
behaviour
is somehow controlled by the communication between librbd on the client
and
the ceph cluster.

> Setting QEMU's disk cache mode to "writeback" informs the guest's OS

> that it needs to explicitly flush dirty data to persistent storage
when 
> needed.

Meanwhile I tried to update the viostor driver within the vm (a W2k8)
but that
results in a bluescreen.

When booting via recovery console and loading the new driver from an
actual
qemu driver iso the disks are all in writeback mode.
So maybe the cache mode depends on the iodriver within the machine.

I'll see how to upgrade the driver without having a bluescreen
afterwards
(by having another reason to avoid that windows crap).

> 
>>
>> Regards
>>
>> Steffen
>>
>>

Regards

>>

-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich


Re: [ceph-users] cephfs page cache

2016-08-31 Thread Yan, Zheng
On Wed, Aug 31, 2016 at 12:49 AM, Sean Redmond  wrote:
> Hi,
>
> I have been able to pick through the process a little further and replicate
> it via the command line. The flow looks like this:
>
> 1) The user uploads an image to webserver server 'uploader01' it gets
> written to a path such as '/cephfs/webdata/static/456/JHL/66448H-755h.jpg'
> on cephfs
>
> 2) The MDS makes the file meta data available for this new file immediately
> to all clients.
>
> 3) The 'uploader01' server asynchronously commits the file contents to disk
> as sync is not explicitly called during the upload.
>
> 4) Before step 3 is done the visitor requests the file via one of two web
> servers server1 or server2 - the MDS provides the meta data but the contents
> of the file is not committed to disk yet so the data read returns 0's - This
> is then cached by the file system page cache until it expires or is flushed
> manually.

do server1 or server2 use memory-mapped IO to read the file?

Regards
Yan, Zheng

>
> 5) As step 4 typically only happens on one of the two web servers before
> step 3 is complete we get the mismatch between server1 and server2 file
> system page cache.
>
> The below demonstrates how to reproduce this issue
>
> http://pastebin.com/QK8AemAb
>
> As we can see the checksum of the file returned by the web server is 0 as
> the file contents has not been flushed to disk from server uploader01
>
> If however we call ‘sync’ as shown below the checksum is correct:
>
> http://pastebin.com/p4CfhEFt
>
> If we also wait for 10 seconds for the kernel to flush the dirty pages, we
> can also see the checksum is valid:
>
> http://pastebin.com/1w6UZzNQ
>
> It looks it maybe a race between the time it takes the uploader01 server to
> commit the file to the file system and the fast incoming read request from
> the visiting user to server1 or server2.
>
> Thanks
>
>
> On Tue, Aug 30, 2016 at 10:21 AM, Sean Redmond 
> wrote:
>>
>> You are correct it only seems to impact recently modified files.
>>
>> On Tue, Aug 30, 2016 at 3:36 AM, Yan, Zheng  wrote:
>>>
>>> On Tue, Aug 30, 2016 at 2:11 AM, Gregory Farnum 
>>> wrote:
>>> > On Mon, Aug 29, 2016 at 7:14 AM, Sean Redmond 
>>> > wrote:
>>> >> Hi,
>>> >>
>>> >> I am running cephfs (10.2.2) with kernel 4.7.0-1. I have noticed that
>>> >> frequently static files are showing empty when serviced via a web
>>> >> server
>>> >> (apache). I have tracked this down further and can see when running a
>>> >> checksum against the file on the cephfs file system on the node
>>> >> serving the
>>> >> empty http response the checksum is '0'
>>> >>
>>> >> The below shows the checksum on a defective node.
>>> >>
>>> >> [root@server2]# ls -al /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>>
>>> It seems this file was modified recently. Maybe the web server
>>> silently modifies the files. Please check if this issue happens on
>>> older files.
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>> >>
>>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 053
>>> >
>>> > So can we presume there are no file contents, and it's just 53 blocks
>>> > of zeros?
>>> >
>>> > This doesn't sound familiar to me; Zheng, do you have any ideas?
>>> > Anyway, ceph-fuse shouldn't be susceptible to this bug even with the
>>> > page cache enabled; if you're just serving stuff via the web it's
>>> > probably a better idea anyway (harder to break, easier to update,
>>> > etc).
>>> > -Greg
>>> >
>>> >>
>>> >> The below shows the checksum on a working node.
>>> >>
>>> >> [root@server1]# ls -al /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >>
>>> >> [root@server1]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 0362053
>>> >> [root@server1]#
>>> >>
>>> >> If I flush the cache as shown below the checksum returns as expected
>>> >> and the
>>> >> web server serves up valid content.
>>> >>
>>> >> [root@server2]# echo 3 > /proc/sys/vm/drop_caches
>>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 0362053
>>> >>
>>> >> After some time typically less than 1hr the issue repeats, It seems to
>>> >> not
>>> >> repeat if I take any one of the servers out of the LB and only serve
>>> >> requests from one of the servers.
>>> >>
>>> >> I may try and use the FUSE client has has a mount option direct_io
>>> >> that
>>> >> looks to disable page cache.
>>> >>
>>> >> I have been hunting in the ML and tracker but could not see anything
>>> >> really
>>> >> close to this issue, Any input or feedback on similar experiences is
>>> >> welcome.
>>> >>
>>> >> Thanks
>>> >>
>>> >>
>

Re: [ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread Daniel Gryniewicz

I believe this is a Ganesha bug, as discussed on the Ganesha list.

Daniel

On 08/31/2016 06:55 AM, Wido den Hollander wrote:



On 31 August 2016 at 12:42, John Spray wrote:


On Wed, Aug 31, 2016 at 11:23 AM, Wido den Hollander  wrote:

Hi,

I have a CephFS filesystem which is re-exported through NFS Ganesha (v2.3.0) 
with Ceph 10.2.2

The export works fine, but when calling a chgrp on a file the UID is set to 
root.

Example list of commands:

$ chown www-data:www-data myfile

That works, file is now owned by www-data/www-data

$ chgrp nogroup myfile


Does the nogroup group have GID -1 by any chance?  There is a bug in
the userspace client where it doesn't handle negative gid/uids
properly at all, which Greg is working on at the moment
(http://tracker.ceph.com/issues/16367)



No, it doesn't. nogroup is just an example here. This happens with any group.

The UID is always set back to 0.

Wido


John


That fails: the UID is now set to 0 (root) and the group hasn't changed.

I tracked this down to being a Ganesha problem in combination with CephFS. 
Running these commands on either a kernel mounted CephFS or via FUSE I don't 
see this problem.

The Ganesha configuration:

NFSv4
{
IdmapConf = /etc/idmapd.conf;
}

EXPORT
{
Export_ID = 1;
Path = "/";
Pseudo = "/";
Access_Type = RW;
Protocols = "4";
Squash = no_root_squash;
Transports = TCP;
SecType = sys;

FSAL {
Name = CEPH;
}
}

Has anybody seen this before?

Wido


Re: [ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread Wido den Hollander

> On 31 August 2016 at 12:42, John Spray wrote:
> 
> 
> On Wed, Aug 31, 2016 at 11:23 AM, Wido den Hollander  wrote:
> > Hi,
> >
> > I have a CephFS filesystem which is re-exported through NFS Ganesha 
> > (v2.3.0) with Ceph 10.2.2
> >
> > The export works fine, but when calling a chgrp on a file the UID is set to 
> > root.
> >
> > Example list of commands:
> >
> > $ chown www-data:www-data myfile
> >
> > That works, file is now owned by www-data/www-data
> >
> > $ chgrp nogroup myfile
> 
> Does the nogroup group have GID -1 by any chance?  There is a bug in
> the userspace client where it doesn't handle negative gid/uids
> properly at all, which Greg is working on at the moment
> (http://tracker.ceph.com/issues/16367)
> 

No, it doesn't. nogroup is just an example here. This happens with any group.

The UID is always set back to 0.

Wido

> John
> 
> > That fails: the UID is now set to 0 (root) and the group hasn't changed.
> >
> > I tracked this down to being a Ganesha problem in combination with CephFS. 
> > Running these commands on either a kernel mounted CephFS or via FUSE I 
> > don't see this problem.
> >
> > The Ganesha configuration:
> >
> > NFSv4
> > {
> > IdmapConf = /etc/idmapd.conf;
> > }
> >
> > EXPORT
> > {
> > Export_ID = 1;
> > Path = "/";
> > Pseudo = "/";
> > Access_Type = RW;
> > Protocols = "4";
> > Squash = no_root_squash;
> > Transports = TCP;
> > SecType = sys;
> >
> > FSAL {
> > Name = CEPH;
> > }
> > }
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread John Spray
On Wed, Aug 31, 2016 at 11:23 AM, Wido den Hollander  wrote:
> Hi,
>
> I have a CephFS filesystem which is re-exported through NFS Ganesha (v2.3.0) 
> with Ceph 10.2.2
>
> The export works fine, but when calling a chgrp on a file the UID is set to 
> root.
>
> Example list of commands:
>
> $ chown www-data:www-data myfile
>
> That works, file is now owned by www-data/www-data
>
> $ chgrp nogroup myfile

Does the nogroup group have GID -1 by any chance?  There is a bug in
the userspace client where it doesn't handle negative gid/uids
properly at all, which Greg is working on at the moment
(http://tracker.ceph.com/issues/16367)

John

> That fails: the UID is now set to 0 (root) and the group hasn't changed.
>
> I tracked this down to being a Ganesha problem in combination with CephFS. 
> Running these commands on either a kernel mounted CephFS or via FUSE I don't 
> see this problem.
>
> The Ganesha configuration:
>
> NFSv4
> {
> IdmapConf = /etc/idmapd.conf;
> }
>
> EXPORT
> {
> Export_ID = 1;
> Path = "/";
> Pseudo = "/";
> Access_Type = RW;
> Protocols = "4";
> Squash = no_root_squash;
> Transports = TCP;
> SecType = sys;
>
> FSAL {
> Name = CEPH;
> }
> }
>
> Has anybody seen this before?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] UID reset to root after chgrp on CephFS Ganesha export

2016-08-31 Thread Wido den Hollander
Hi,

I have a CephFS filesystem which is re-exported through NFS Ganesha (v2.3.0) 
with Ceph 10.2.2

The export works fine, but when calling a chgrp on a file the UID is set to 
root.

Example list of commands:

$ chown www-data:www-data myfile

That works, file is now owned by www-data/www-data

$ chgrp nogroup myfile

That fails: the UID is now set to 0 (root) and the group hasn't changed.

I tracked this down to being a Ganesha problem in combination with CephFS. 
Running these commands on either a kernel mounted CephFS or via FUSE I don't 
see this problem.
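A minimal way to compare the two paths looks like this (just a sketch; the mount
points /mnt/nfs for the Ganesha export and /mnt/cephfs for the direct CephFS mount
are placeholders):

# via the Ganesha NFS export
touch /mnt/nfs/myfile
chown www-data:www-data /mnt/nfs/myfile
chgrp nogroup /mnt/nfs/myfile
stat -c '%U:%G' /mnt/nfs/myfile      # owner comes back as root here
# via a kernel or FUSE CephFS mount of the same directory
chgrp nogroup /mnt/cephfs/myfile
stat -c '%U:%G' /mnt/cephfs/myfile   # owner stays www-data here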

The Ganesha configuration:

NFSv4
{
IdmapConf = /etc/idmapd.conf;
}

EXPORT
{
Export_ID = 1;
Path = "/";
Pseudo = "/";
Access_Type = RW;
Protocols = "4";
Squash = no_root_squash;
Transports = TCP;
SecType = sys;

FSAL {
Name = CEPH;
}
}

Has anybody seen this before?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs page cache

2016-08-31 Thread Sean Redmond
It seems using the 'sync' mount option on the server uploader01 is also a
valid workaround.

Is it a problem that the metadata is available to other cephfs clients
ahead of the file contents being flushed by the client doing the write?

I think having an invalid page cache full of zeros is a problem, but it's not
clear to me what the expected behavior is when a cephfs client tries to
read the contents of a file that is still being flushed to the file
system by the cephfs client that created it.
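For reference, two ways to force the flush on the writer side (a sketch only; the
fstab entry and the dd step below are illustrative, not taken from this setup, and
assume the kernel CephFS client):

# 1) mount CephFS with the sync option on uploader01, e.g. in /etc/fstab:
#    mon1,mon2,mon3:/ /cephfs ceph name=admin,secretfile=/etc/ceph/admin.secret,sync,_netdev 0 0
# 2) or have the upload step fsync the file before its path is handed to the web servers:
dd if=/tmp/upload.tmp of=/cephfs/webdata/static/456/JHL/66448H-755h.jpg conv=fsync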

On Tue, Aug 30, 2016 at 5:49 PM, Sean Redmond 
wrote:

> Hi,
>
> I have been able to pick through the process a little further and
> replicate it via the command line. The flow seems looks like this:
>
> 1) The user uploads an image to webserver server 'uploader01' it gets
> written to a path such as '/cephfs/webdata/static/456/JHL/66448H-755h.jpg'
> on cephfs
>
> 2) The MDS makes the file meta data available for this new file
> immediately to all clients.
>
> 3) The 'uploader01' server asynchronously commits the file contents to
> disk as sync is not explicitly called during the upload.
>
> 4) Before step 3 is done the visitor requests the file via one of two web
> servers server1 or server2 - the MDS provides the meta data but
> the contents of the file is not committed to disk yet so the data read
> returns 0's - This is then cached by the file system page cache until it
> expires or is flushed manually.
>
> 5) As step 4 typically only happens on one of the two web servers before
> step 3 is complete we get the mismatch between server1 and server2 file
> system page cache.
>
> *The below demonstrates how to reproduce this issue*
> http://pastebin.com/QK8AemAb
>
> As we can see the checksum of the file returned by the web server is 0 as
> the file contents has not been flushed to disk from server uploader01
>
> *If however we call ‘sync’ as shown below the checksum is correct:*
>
> http://pastebin.com/p4CfhEFt
>
> *If we also wait for 10 seconds for the kernel to flush the dirty pages,
> we can also see the checksum is valid:*
>
> http://pastebin.com/1w6UZzNQ
>
> It looks like it may be a race between the time it takes the uploader01 server
> to commit the file to the file system and the fast incoming read request
> from the visiting user to server1 or server2.
>
> Thanks
>
>
> On Tue, Aug 30, 2016 at 10:21 AM, Sean Redmond 
> wrote:
>
>> You are correct it only seems to impact recently modified files.
>>
>> On Tue, Aug 30, 2016 at 3:36 AM, Yan, Zheng  wrote:
>>
>>> On Tue, Aug 30, 2016 at 2:11 AM, Gregory Farnum 
>>> wrote:
>>> > On Mon, Aug 29, 2016 at 7:14 AM, Sean Redmond 
>>> wrote:
>>> >> Hi,
>>> >>
>>> >> I am running cephfs (10.2.2) with kernel 4.7.0-1. I have noticed that
>>> >> frequently static files are showing empty when serviced via a web
>>> server
>>> >> (apache). I have tracked this down further and can see when running a
>>> >> checksum against the file on the cephfs file system on the node
>>> serving the
>>> >> empty http response the checksum is '0'
>>> >>
>>> >> The below shows the checksum on a defective node.
>>> >>
>>> >> [root@server2]# ls -al /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>>
>>> It seems this file was modified recently. Maybe the web server
>>> silently modifies the files. Please check if this issue happens on
>>> older files.
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>> >>
>>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 053
>>> >
>>> > So can we presume there are no file contents, and it's just 53 blocks
>>> of zeros?
>>> >
>>> > This doesn't sound familiar to me; Zheng, do you have any ideas?
>>> > Anyway, ceph-fuse shouldn't be susceptible to this bug even with the
>>> > page cache enabled; if you're just serving stuff via the web it's
>>> > probably a better idea anyway (harder to break, easier to update,
>>> > etc).
>>> > -Greg
>>> >
>>> >>
>>> >> The below shows the checksum on a working node.
>>> >>
>>> >> [root@server1]# ls -al /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >>
>>> >> [root@server1]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 0362053
>>> >> [root@server1]#
>>> >>
>>> >> If I flush the cache as shown below the checksum returns as expected
>>> and the
>>> >> web server serves up valid content.
>>> >>
>>> >> [root@server2]# echo 3 > /proc/sys/vm/drop_caches
>>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >> 0362053
>>> >>
>>> >> After some time typically less than 1hr the issue repeats, It seems
>>> to not
>>> >> repeat if I take any one of the servers out of the LB and only serve
>>> >> requests from one of the servers.
>>> >>
>>> >> I may try and use the FUSE client, which has a mount option direct_io
>>> >> that looks to disable the page cache

[ceph-users] how to print the incremental osdmap

2016-08-31 Thread Zhongyan Gu
Hi kefu,
A quick question about the incremental osdmap.
I used ceph-objectstore-tool with the op get-inc-osdmap to get the specific
incremental osdmap, but how can I view the content of the incremental osdmap?
It seems osdmaptool --print incremental-osdmap can't output the content.
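One approach that may work (a sketch; the epoch 123, the OSD data path and the
output file are placeholders) is to decode the incremental map with ceph-dencoder
instead, since osdmaptool expects a full map:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op get-inc-osdmap --epoch 123 --file /tmp/inc-osdmap.123
ceph-dencoder type OSDMap::Incremental import /tmp/inc-osdmap.123 decode dump_json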

Thanks
Zhongyan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2

2016-08-31 Thread Dennis Kramer (DBS)
Hi all,

I just want to confirm that the patch works in our environment.
Thanks!

On 08/30/2016 02:04 PM, Dennis Kramer (DBS) wrote:
> Awesome Goncalo, that is very helpful.
> 
> Cheers.
> 
> On 08/30/2016 01:21 PM, Goncalo Borges wrote:
>> Hi Dennis.
>>
>> That is the first issue we saw and has nothing to do with the amd processors 
>> (which only relates to the second issue we saw). So the fix in the patch
>>
>> https://github.com/ceph/ceph/pull/10027
>>
>> should work for you.
>>
>> In our case we went for the full compilation for our own specific reasons. 
>> But you should only need to recompile the ceph fuse client. If you want a 
>> temp solution while this is not fixed in jewel,  just deploy ceph-fuse using 
>> an infernalis client. That is how we did it during the 3 weeks we were 
>> debugging our issues. 
>>
>> Cheers
>> Goncalo
>>
>> 
>> From: Dennis Kramer (DBS) [den...@holmes.nl]
>> Sent: 30 August 2016 20:59
>> To: Goncalo Borges; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on 
>> Jewel 10.2.2
>>
>> Hi Goncalo,
>>
>> Thank you for providing below info. I'm getting the exact same errors:
>> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>  1: (()+0x2ae88e) [0x5647a76f488e]
>>  2: (()+0x113d0) [0x7f7d14c393d0]
>>  3: (Client::get_root_ino()+0x10) [0x5647a75eb730]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175)
>> [0x5647a75e9595]
>>  5: (()+0x1a3eb1) [0x5647a75e9eb1]
>>  6: (()+0x14ef5) [0x7f7d15283ef5]
>>  7: (()+0x15679) [0x7f7d15284679]
>>  8: (()+0x11e38) [0x7f7d15280e38]
>>  9: (()+0x76fa) [0x7f7d14c2f6fa]
>>  10: (clone()+0x6d) [0x7f7d1351ab5d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> After reading your thread I wasn't sure if your solution would work in
>> our environment, since we don't use the AMD procs you mentioned. Though
>> the segfaults are identical in debugging.
>>
>> Have you recompiled ceph completely for your cluster or just the MDS server?
>>
>>
>> On 08/25/2016 02:45 AM, Goncalo Borges wrote:
>>> Hi Dennis...
>>>
>>> We use ceph-fuse in 10.2.2 and we saw two main issues with it immediately 
>>> after
>>> upgrading from Infernalis to Jewel.
>>>
>>> In our case, we are enabling ceph-fuse in a heavily used Linux cluster, and 
>>> our
>>> users complained about the mount points becoming unavailable some time after
>>> their applications start up.
>>>
>>> First we saw
>>>
>>> https://github.com/ceph/ceph/pull/10027
>>>
>>> and once that was fixed, we saw
>>>
>>> http://tracker.ceph.com/issues/16610
>>>
>>>
>>> There is a long ML thread with the subject 'ceph-fuse segfaults ( jewel 
>>> 10.2.2)'
>>> on the topic. At the end, RH staff proposed some patches which we applied 
>>> (we
>>> recompile ceph ourselves) and which resolved the issues we saw.
>>>
>>> You should run ceph-fuse in debug mode to actually check what segfaults you 
>>> may
>>> have, and if it is a similar problem. You can do that by mounting ceph-fuse 
>>> with
>>> nohup and the '-d'. Something like:
>>>
>>> nohup ceph-fuse --id mount_user -k  -m :6789 -d -r
>>> /cephfs /coepp/cephfs > /path/to/some/log 2>&1 &
>>>
>>> If you want an even bigger log level, you should set 'debug client = 20' in 
>>> your
>>> /etc/ceph/ceph.conf before mounting.
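For example, a client-side debug section could look like this (a sketch; the log
file path is just an assumption):

[client]
    debug client = 20
    debug ms = 1
    log file = /var/log/ceph/ceph-fuse.$name.$pid.log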
>>>
>>>
>>> Cheers
>>> Goncalo
>>>
>>> On 08/24/2016 10:28 PM, Dennis Kramer (DT) wrote:
 Hi all,

 Running ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) on
 Ubuntu 16.04LTS.

 Currently I have the weirdest thing, I have a bunch of linux clients, 
 mostly
 debian based (Ubuntu/Mint). They all use version 10.2.2 of ceph-fuse. I'm
 running cephfs since Hammer without any issues, but upgraded last week to
 Jewel and now my clients get:
 "Transport endpoint is not connected".

 It seems the error only arises when the client is using the GUI to browse
 through the ceph-fuse mount; some use nemo, some nautilus. The error
 doesn't show up immediately, sometimes the client can browse through the
 share for some time before they are kicked out with the error.

 But when I strictly use the shell to browse the ceph-fuse mount in the CLI 
 it
 works without any issues, when I try to use the GUI browser on the same
 client, the error shows and I get kicked out of the ceph-fuse mount until I
 remount.

 Any suggestions?

 With regards,


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> --
>>> Goncalo Borges
>>> Research Computing
>>> ARC Centre of Excellence for Particle Physics at the Terascale
>>> School of Physics A28 | University of Sydney, NSW  2006
>>> T: +61 2 93511937
>>>
>>
>> --
___
ce

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 31 August 2016 08:56
To: n...@fisk.me.uk; 'Alex Gorbachev' ; 'Horace Ng' 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Nick,

what do you think about Infiniband?

I have read that with Infiniband the latency is at 1.2 µs

It’s great, but I don’t believe the Ceph support for RDMA is finished yet, so 
you are stuck using IPoIB, which has similar performance to 10G Ethernet.

For now concentrate on removing latency where you easily can (3.5+ Ghz CPU’s, 
NVME journals) and then when stuff like RDMA comes along, you will be in a 
better place to take advantage of it.
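One cheap place to look for that kind of latency is CPU power management on the
OSD nodes (a sketch; the sysfs paths assume the cpufreq driver is loaded):

# check the current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# pin all cores to the performance governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done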

 

Kind Regards!

 

Am 31.08.16 um 09:51 schrieb Nick Fisk:

 

 

From: w...@globe.de   [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk  ; 'Alex Gorbachev'  
 
Cc: 'Horace Ng'   
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

Am 30.08.16 um 19:05 schrieb Nick Fisk:

 

 

From: w...@globe.de   [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk  ; 'Alex Gorbachev'  
 
Cc: 'Horace Ng'   
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I don't paste the results here...

 

Another general point: I've built Samsung SM863 enterprise SSDs into the
Ceph cluster.

If I do a 4k test on the SSD directly, without a filesystem, I get

(see Sébastien Han's tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 

 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 52.7139 s, 77.7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with xfs I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21.1856 s, 19.3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0.00     0.00    0.00 9625.00     0.00    25.85     5.50     0.60    0.06    0.00    0.06   0.06  59.60




So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.







If I use the SSD in the Ceph cluster and I do the test again with rados bench
(-b 4K and -t 1, i.e. one thread) I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can it be that the raw device performance is so much higher
than the xfs and the Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test set replication to 1x.
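For example, with a throwaway pool (a sketch; the pool name and PG count are
placeholders, and a size=1 pool is for testing only):

ceph osd pool create benchtest 128 128
ceph osd pool set benchtest size 1
ceph osd pool set benchtest min_size 1
rados bench -p benchtest 60 write -b 4096 -t 1
ceph osd pool delete benchtest benchtest --yes-i-really-really-mean-it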


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   401    1.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.00199578  0.00281537
5   1  1731  1730   1.35132   1.21094  0.00219136  0.00288843
6   1  2044  2043   1.32985   1.22266   0.00

Re: [ceph-users] linuxcon north america, ceph bluestore slides

2016-08-31 Thread Wido den Hollander

> Op 31 augustus 2016 om 9:51 schreef "Brian ::" :
> 
> 
> Amazing improvements to performance in the preview now... I wonder

Indeed, great work!

> will there be a filestore --> bluestore upgrade path...
> 

Yes and no. Since the OSD API doesn't change, you can 'simply':

1. Shut down OSD
2. Wipe local disk of OSD
3. Re-format OSD with BlueStore
4. Have Ceph's recovery do the work

So if you have a large cluster this will be a lot of work, but you will not 
suffer downtime or face manual data migration.
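Per OSD that would look roughly like this (a sketch only; OSD id 12 and /dev/sdX
are placeholders, and it assumes a Jewel-era ceph-disk that supports the still
experimental --bluestore option):

ceph osd out 12                          # drain it, wait for the cluster to settle
systemctl stop ceph-osd@12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
ceph-disk zap /dev/sdX                   # wipe the old filestore disk
ceph-disk prepare --bluestore /dev/sdX   # re-create the OSD with a BlueStore backend
# backfill/recovery then repopulates the new OSD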

Wido

> On Wed, Aug 31, 2016 at 6:32 AM, Alexandre DERUMIER  
> wrote:
> > Hi,
> >
> > Here the slides of the ceph bluestore prensentation
> >
> > http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon%20NA%20BlueStore.pdf
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' 
Cc: 'Horace Ng' 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

Am 30.08.16 um 19:05 schrieb Nick Fisk:

 

 

From: w...@globe.de   [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk  ; 'Alex Gorbachev'  
 
Cc: 'Horace Ng'   
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I don't paste the results here...

 

Another general point: I've built Samsung SM863 enterprise SSDs into the
Ceph cluster.

If I do a 4k test on the SSD directly, without a filesystem, I get

(see Sébastien Han's tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 

 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 52.7139 s, 77.7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with xfs I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21.1856 s, 19.3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0.00     0.00    0.00 9625.00     0.00    25.85     5.50     0.60    0.06    0.00    0.06   0.06  59.60



So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.






If I use the SSD in the Ceph cluster and I do the test again with rados bench
(-b 4K and -t 1, i.e. one thread) I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can it be that the raw device performance is so much higher
than the xfs and the Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test set replication to 1x.


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   401    1.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.00199578  0.00281537
5   1  1731  1730   1.35132   1.21094  0.00219136  0.00288843
6   1  2044  2043   1.32985   1.22266   0.0023981  0.00293468
7   1  2351  2350   1.31116   1.19922  0.00258856  0.00296963
8   1  2703  2702   1.31911 1.375   0.0224678  0.00295862
9   1  2955  2954   1.28191  0.984375  0.00841621  0.00304526
   10   1  3228  3227   1.26034   1.06641  0.00261023  0.00309665
   11   1  3501  35001.2427   1.06641  0.00659853  0.00313985
   12   1  3791  3790   1.23353   1.13281   0.0027244  0.00316168
   13   1  4150  4149   1.24649   1.40234  0.00262242  0.00313177
   14   1  4460  4459   1.24394   1.21094  0.00262075  0.00313735
   15   1  4721  4720   1.22897   1.01953  0.00239961  0.00317357
   16   1  4983  4982   1.21611   1.02344  0.00290526  0.00321005
   17   1  5279  5278   1.21258   1.15625  0.00252002   0.0032196
   18   1  5605  5604   1.21595   1.27344  0.00281887  0.00320714

Replication 1x:
rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes

Re: [ceph-users] linuxcon north america, ceph bluestore slides

2016-08-31 Thread Brian ::
Amazing improvements to performance in the preview now... I wonder
will there be a filestore --> bluestore upgrade path...

On Wed, Aug 31, 2016 at 6:32 AM, Alexandre DERUMIER  wrote:
> Hi,
>
> Here the slides of the ceph bluestore prensentation
>
> http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon%20NA%20BlueStore.pdf
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com