Re: [ceph-users] pg 17.36 is active+clean+inconsistent head expected clone 1 missing?

2018-11-16 Thread Steve Anthony
Looks similar to a problem I had after several OSDs crashed while
trimming snapshots. In my case, the primary OSD thought the snapshot was
gone, but some of the replicas still had it, so scrubbing flagged it.

First I purged all snapshots and then ran ceph pg repair on the
problematic placement groups. The first time I encountered this, that
action was sufficient to repair the problem. The second time, however, I
ended up having to manually remove the snapshot objects.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027431.html

Once I had done that, repairing the placement group fixed the issue.
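
For reference, the basic sequence was something like this (the pool/image
names are placeholders; the PG IDs come from the scrub errors reported in
ceph health detail or the OSD logs):

# rbd snap purge <pool>/<image>     # repeat for every image with snapshots
# ceph pg repair <pgid>             # repeat for each inconsistent PG

If that isn't enough, the manual snapshot object removal is described in
the thread linked above.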

-Steve

On 11/16/2018 04:00 AM, Marc Roos wrote:
>  
>
> I am not sure that is going to work, because I have had this error for quite 
> some time, from before I added the 4th node. On the 3-node cluster 
> it was:
>  
> osdmap e18970 pg 17.36 (17.36) -> up [9,0,12] acting [9,0,12]
>
> If I understand correctly what you intend to do (moving the data around), 
> this was sort of accomplished by adding the 4th node.
>
>
>
> -Original Message-
> From: Frank Yu [mailto:flyxia...@gmail.com] 
> Sent: Friday, 16 November 2018 3:51 
> To: Marc Roos
> Cc: ceph-users
> Subject: Re: [ceph-users] pg 17.36 is active+clean+inconsistent head 
> expected clone 1 missing?
>
> Try to restart osd.29, then use pg repair. If this doesn't work, or the error 
> appears again after a while, scan the HDD used for osd.29; there may be 
> bad sectors on the disk. If so, just replace the disk with a new one.
>
>
>
> On Thu, Nov 15, 2018 at 5:00 PM Marc Roos  
> wrote:
>
>
>
>   Forgot, these are bluestore osds
>   
>   
>   
>   -Original Message-
>   From: Marc Roos 
>   Sent: Thursday, 15 November 2018 9:59
>   To: ceph-users
>   Subject: [ceph-users] pg 17.36 is active+clean+inconsistent head 
>   expected clone 1 missing?
>   
>   
>   
>   I thought I would give it another try, asking again here since there is 
>   another current thread. I have had this error for a year or so.
>   
>   This I of course already tried:
>   ceph pg deep-scrub 17.36
>   ceph pg repair 17.36
>   
>   
>   [@c01 ~]# rados list-inconsistent-obj 17.36 
>   {"epoch":24363,"inconsistents":[]}
>   
>   
>   [@c01 ~]# ceph pg map 17.36
>   osdmap e24380 pg 17.36 (17.36) -> up [29,12,6] acting [29,12,6]
>   
>   
>   [@c04 ceph]# zgrep ERR ceph-osd.29.log*gz
>   ceph-osd.29.log-20181114.gz:2018-11-13 14:19:55.766604 7f25a05b1700 
> -1
>   log_channel(cluster) log [ERR] : deep-scrub 17.36 
>   17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:head 
> expected 
>   clone 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4 1 
> missing
>   ceph-osd.29.log-20181114.gz:2018-11-13 14:24:55.943454 7f25a05b1700 
> -1
>   log_channel(cluster) log [ERR] : 17.36 deep-scrub 1 errors
>   
>   
>   ___
>   ceph-users mailing list
>   ceph-users@lists.ceph.com
>   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>   
>
>
>

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unable to remove phantom snapshot for object, snapset_inconsistency

2018-07-06 Thread Steve Anthony
In case anyone else runs into this, I resolved it by using removeall on both 
bad OSDs and then running ceph pg repair, which copied the good object back.
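
For reference, a rough sketch of that sequence, using the OSD, PG, and object
IDs from this thread (the OSD should be stopped while ceph-objectstore-tool
runs; repeat for each OSD holding a bad copy, here 305 and 313):

# systemctl stop ceph-osd@305.service
# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-305/ \
    --pgid 2.13e \
    '{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}' \
    removeall
# systemctl start ceph-osd@305.service
# ceph pg repair 2.13e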


-Steve


On 06/27/2018 06:17 PM, Steve Anthony wrote:

In the process of trying to repair snapshot inconsistencies associated
with the issues in this thread,
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027125.html
("FAILED assert(p != recovery_info.ss.clone_snaps.end())​"), I have one
PG I still can't get to repair.

Two of the three replicas appear to have (or think they have) a
snapshot. However, neither the ceph-objectstore-tool list operation nor
running find on the fuse-mounted OSD reports the snaps.

# ceph-objectstore-tool --type bluestore --data-path
/var/lib/ceph/osd/ceph-313/ --pgid 2.13e --op list
rb.0.2479b45.238e1f29.00125cbb
["2.13e",{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}]

The ceph-objectstore-tool remove-clone-metadata operation also reports
that the snapshot does not exist.

# ceph-objectstore-tool --dry-run --type bluestore --data-path
/var/lib/ceph/osd/ceph-313/ --pgid 2.13e
'{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}'
remove-clone-metadata 4896
Clone 1320 not present
dry-run: Nothing changed

However, the remove operation sees the snapshot and refuses to delete
the object.

# ceph-objectstore-tool --dry-run --type bluestore --data-path
/var/lib/ceph/osd/ceph-313/ --pgid 2.13e
'{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}'
remove
Snapshots are present, use removeall to delete everything
dry-run: Nothing changed

Listing the inconsistencies with rados, it appears that the phantom
snapshot is present on 2/3 replicas. Other PGs had this issue on only
1/3 replicas; there, using removeall on the bad copy and then repairing
the PG fixed the issue. Running removeall on the primary replica resulted in
the repair replicating the other bad object. Should I just issue
removeall on both OSDs and then run repair to fix the missing objects,
or is there some other way to purge snaps on an object? (I've already
purged all snapshots on all images in the cluster with rbd snap purge)

Thoughts?

# rados list-inconsistent-obj 2.13e

{
   "epoch": 1008264,
   "inconsistents": [
     {
   "object": {
     "name": "rb.0.2479b45.238e1f29.00125cbb",
     "nspace": "",
     "locator": "",
     "snap": "head",
     "version": 2024222
   },
   "errors": [
     "object_info_inconsistency",
     "snapset_inconsistency"
   ],
   "union_shard_errors": [

   ],

   "selected_object_info": {
     "oid": {
   "oid": "rb.0.2479b45.238e1f29.00125cbb",
   "key": "",
   "snapid": -2,
   "hash": 2016338238,
   "max": 0,
   "pool": 2,
   "namespace": ""
     },
     "version": "946857'2041225",
     "prior_version": "943431'2032262",
     "last_reqid": "osd.36.0:48196",
     "user_version": 2024222,
     "size": 4194304,
     "mtime": "2018-05-13 08:58:21.359912",
     "local_mtime": "2018-05-13 08:58:21.537637",
     "lost": 0,
     "flags": [
   "dirty",
   "data_digest",
   "omap_digest"
     ],
     "legacy_snaps": [
  
     ],

     "truncate_seq": 0,
     "truncate_size": 0,
     "data_digest": "0x0d99bd77",
     "omap_digest": "0x",
     "expected_object_size": 4194304,
     "expected_write_size": 4194304,
     "alloc_hint_flags": 0,
     "manifest": {
   "type": 0,
   "redirect_target": {
     "oid": "",
     "key": "",
     "snapid": 0,
     "hash": 0,
     "max": 0,
     "pool": -9.2233720368548e

[ceph-users] unable to remove phantom snapshot for object, snapset_inconsistency

2018-06-27 Thread Steve Anthony
rors": [
   
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x0d99bd77",
  "object_info": {
    "oid": {
  "oid": "rb.0.2479b45.238e1f29.00125cbb",
  "key": "",
  "snapid": -2,
  "hash": 2016338238,
  "max": 0,
  "pool": 2,
  "namespace": ""
    },
    "version": "946857'2041225",
    "prior_version": "943431'2032262",
    "last_reqid": "osd.36.0:48196",
    "user_version": 2024222,
    "size": 4194304,
    "mtime": "2018-05-13 08:58:21.359912",
    "local_mtime": "2018-05-13 08:58:21.537637",
    "lost": 0,
    "flags": [
  "dirty",
  "data_digest",
  "omap_digest"
    ],
    "legacy_snaps": [
 
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x0d99bd77",
    "omap_digest": "0x",
    "expected_object_size": 4194304,
    "expected_write_size": 4194304,
    "alloc_hint_flags": 0,
    "manifest": {
  "type": 0,
  "redirect_target": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9.2233720368548e+18,
    "namespace": ""
  }
    },
    "watchers": {
 
    }
  },
  "snapset": {
    "snap_context": {
  "seq": 4896,
  "snaps": [
   
  ]
    },
    "head_exists": 1,
    "clones": [
 
    ]
  }
    },
    {
  "osd": 305,
  "primary": false,
  "errors": [
   
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x0d99bd77",
  "object_info": {
    "oid": {
  "oid": "rb.0.2479b45.238e1f29.00125cbb",
  "key": "",
  "snapid": -2,
  "hash": 2016338238,
  "max": 0,
  "pool": 2,
  "namespace": ""
    },
    "version": "943431'2032262",
    "prior_version": "942275'2030618",
    "last_reqid": "osd.36.0:48196",
    "user_version": 2024222,
    "size": 4194304,
    "mtime": "2018-05-13 08:58:21.359912",
    "local_mtime": "2018-05-13 08:58:21.537637",
    "lost": 0,
    "flags": [
  "dirty",
  "data_digest",
  "omap_digest"
    ],
    "legacy_snaps": [
 
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x0d99bd77",
    "omap_digest": "0x",
    "expected_object_size": 4194304,
    "expected_write_size": 4194304,
    "alloc_hint_flags": 0,
    "manifest": {
  "type": 0,
  "redirect_target": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9.2233720368548e+18,
    "namespace": ""
          }
    },
    "watchers": {
 
    }
  },
  "snapset": {
    "snap_context": {
  "seq": 4896,
  "snaps": [
    4896
  ]
    },
    "head_exists": 1,
    "clones": [
 
    ]
  }
    },
    {
  "osd": 313,
  "primary": true,
  "errors": [
   
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x0d99bd77",
  "object_info": {
    "oid": {
  "oid": "rb.0.2479b45.238e1f29.00125cbb",
  "key": "",
  "snapid": -2,
  "hash": 2016338238,
  "max": 0,
  "pool": 2,
  "namespace": ""
    },
    "version": "943431'2032262",
    "prior_version": "942275'2030618",
    "last_reqid": "osd.36.0:48196",
    "user_version": 2024222,
    "size": 4194304,
    "mtime": "2018-05-13 08:58:21.359912",
    "local_mtime": "2018-05-13 08:58:21.537637",
    "lost": 0,
    "flags": [
  "dirty",
  "data_digest",
  "omap_digest"
    ],
    "legacy_snaps": [
 
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x0d99bd77",
    "omap_digest": "0x",
    "expected_object_size": 4194304,
    "expected_write_size": 4194304,
    "alloc_hint_flags": 0,
    "manifest": {
  "type": 0,
  "redirect_target": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9.2233720368548e+18,
    "namespace": ""
  }
    },
    "watchers": {
 
    }
  },
  "snapset": {
    "snap_context": {
  "seq": 4896,
  "snaps": [
    4896
  ]
    },
    "head_exists": 1,
    "clones": [
 
    ]
  }
    }
  ]
    }
  ]
}

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-27 Thread Steve Anthony
One addendum for the sake of completeness. A few PGs still refused to
repair even after the clone object was gone. To resolve this I needed to
remove the clone metadata from the HEAD using ceph-objectstore-tool.
First, I found the problematic clone ID in the log on the primary replica:

ceph2:~# grep ERR /var/log/ceph/ceph-osd.229.log
2018-06-25 10:59:37.554924 7fbdd80d2700 -1 log_channel(cluster) log
[ERR] : repair 2.9a6
2:65942a51:::rb.0.2479b45.238e1f29.002d338d:head expected clone
2:65942a51:::rb.0.2479b45.238e1f29.002d338d:1320 1 missing

In this case the clone ID is 1320. Note that this is the hex value and
ceph-objectstore-tool will expect the decimal equivalent (4896 in this
case). Then, on each host, stop the OSD and remove the metadata. For
Bluestore this looks like:

ceph2:~# ceph-objectstore-tool --type bluestore --data-path
/var/lib/ceph/osd/ceph-229/ --pgid 2.9a6
'{"oid":"rb.0.2479b45.238e1f29","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}'
remove-clone-metadata 4896
Removal of clone 1320 complete
Use pg repair after OSD restarted to correct stat information

And if it's a Filestore OSD:

ceph15:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-122/
--pgid 2.9a6
'{"oid":"rb.0.2479b45.238e1f29.002d338d","key":"","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}'
remove-clone-metadata 4896
Removal of clone 1320 complete
Use pg repair after OSD restarted to correct stat information

Once that's done, starting the OSD and repairing the PG finally marked
it as clean.
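
That is, roughly (using the OSD and PG from the example above; the systemd
unit name assumes a systemd-managed OSD, as elsewhere in this thread):

ceph2:~# systemctl start ceph-osd@229.service
ceph2:~# ceph pg repair 2.9a6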

-Steve

On 06/14/2018 05:07 PM, Steve Anthony wrote:
>
> For reference, building luminous with the changes in the pull request
> also fixed this issue for me. Some of my unexpected snapshots were on
> Bluestore devices; here's how I used the objectstore tool to remove
> them. In the example, the problematic placement group is 2.1c3f, and
> the unexpected clone is identified in the OSD's log as
> rb.0.2479b45.238e1f29.0df:1356.
>
> ceph14:~# systemctl stop ceph-osd@34.service
> ceph14:~# ceph-objectstore-tool --type bluestore --data-path 
> /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f --op list 
> rb.0.2479b45.238e1f29.0ddf
> Error getting attr on : 2.1c3f_head,#-4:fc38:::scrub_2.1c3f:head#, (61) 
> No data available
> ["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":-2,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
> ceph14:~# ceph-objectstore-tool --dry-run --type bluestore --data-path 
> /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f 
> '{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}'
>  remove
> remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356#
> dry-run: Nothing changed
> ceph14:~# ceph-objectstore-tool --type bluestore --data-path 
> /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f 
> '{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}'
>  remove
> remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356#
> ceph14:~# systemctl start ceph-osd@34.service
>
> -Steve
>
> On 06/14/2018 04:59 PM, Nick Fisk wrote:
>> For completeness in case anyone has this issue in the future and stumbles 
>> across this thread
>>
>> If your OSD is crashing and you are still running on a Luminous build that 
>> does not have the fix in the pull request below, you will
>> need to compile the ceph-osd binary and replace it on the affected OSD node. 
>> This will get your OSD's/cluster back up and running.
>>
>> In regards to the stray object/clone, I was unable to remove it using the 
>> objectstore tool, I'm guessing this is because as far as
>> the OSD is concerned it believes that clone should have already been 
>> deleted. I am still running Filestore on this cluster and
>> simply removing the clone object from the OSD PG folder (Note: the object 
>> won't have _head in its name) and then running a deep
>> scrub

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Steve Anthony
479:1c, version: 2195927'1249660, data_included: [], 
> data_size: 0, omap_header_size: 0, omap_entries_size: 0,
> attrset_size: 1, recovery_info: 
> ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14
> 238e1f29.000bf479:1c@2195927'1249660, size: 4194304, copy_subset: [], 
> clone_subset: {}, snapset: 1c=[]:{}), after_progress:
> ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
> omap_re covered_to:, omap_complete:true, error:false),
> before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
> data_complete:false, omap_recovered_to:, omap_complete:false,
> error:false))]) v3  909+0+0 (7
> 22394556 0 0) 0x5574480d0d80 con 0x557447510800
> -2> 2018-06-05 16:28:59.560183 7fcd7b655700  5 -- 
> [2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
> conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
> pgs=13524 cs
> =1 l=0). rx osd.46 seq 7 0x55744783f900 pg_backfill(progress 1.2ca e 
> 2196813/2196813 lb
> 1:534b0b88:::rbd_data.f870ac238e1f29.000ff145:head) v3
> -1> 2018-06-05 16:28:59.560189 7fcd7b655700  1 -- 
> [2a03:25e0:254:5::113]:6829/525383 <== osd.46
> [2a03:25e0:254:5::12]:6809/5784710 7  pg_backfill(progress 1.2ca e 
> 2196813/2196813 lb 1:534b0b88:::rbd_data
> .f870ac238e1f29.000ff145:head) v3  946+0+0 (3865576583 0 0) 
> 0x55744783f900 con 0x557447510800
>  0> 2018-06-05 16:28:59.564054 7fcd5f3ba700 -1 
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
> ObjectContextR ef, bool, ObjectStore::Transaction*)'
> thread 7fcd5f3ba700 time 2018-06-05 16:28:59.561060
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
> recovery_info.ss.clone_snaps.end())
>
>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x102) [0x557424971a02]
>  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo 
> const&, std::shared_ptr, bool,
> ObjectStore::Transaction*)+0xd63) [0x5574244df873]
>  3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, 
> ObjectStore::Transaction*)+0x2da) [0x5574246715ca]
>  4: (ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x12e) 
> [0x5574246717fe]
>  5: 
> (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2c1) 
> [0x557424680d71]
>  6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) 
> [0x55742458c440]
>  7: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
> ThreadPool::TPHandle&)+0x543) [0x5574244f0853]
>  8: (OSD::dequeue_op(boost::intrusive_ptr, 
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9) 
> [0x557424367539]
>  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
> const&)+0x57) [0x557424610f37]
>  10: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x1047) [0x557424395847]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) 
> [0x5574249767f4]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557424979830]
>  13: (()+0x76ba) [0x7fcd7f1cb6ba]
>  14: (clone()+0x6d) [0x7fcd7e24241d]
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-06-02 Thread Steve Anthony
I'm seeing this again on two OSDs after adding another 20 disks to my
cluster. Is there some way I can determine which snapshots the
recovery process is looking for? Or maybe find and remove the objects
it's trying to recover, since there's apparently a problem with them?
Thanks!

-Steve

On 05/18/2017 01:06 PM, Steve Anthony wrote:
>
> Hmmm, after crashing for a few days every 30 seconds it's apparently
> running normally again. Weird. I was thinking since it's looking for a
> snapshot object, maybe re-enabling snaptrimming and removing all the
> snapshots in the pool would remove that object (and the problem)?
> Never got to that point this time, but I'm going to need to cycle more
> OSDs in and out of the cluster, so if it happens again I might try
> that and update.
>
> Thanks!
>
> -Steve
>
>
> On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>>
>>
>> On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu> wrote:
>>
>> Hello,
>>
>> After starting a backup (create snap, export and import into a second
>> cluster - one RBD image still exporting/importing as of this message)
>> the other day while recovery operations on the primary cluster were
>> ongoing I noticed an OSD (osd.126) start to crash; I reweighted
>> it to 0
>> to prepare to remove it. Shortly thereafter I noticed the problem
>> seemed
>> to move to another OSD (osd.223). After looking at the logs, I
>> noticed
>> they appeared to have the same problem. I'm running Ceph version
>> 9.2.1
>> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>>
>> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>>
>> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>>
>>
>> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
>> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
>> {default=true}
>> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
>> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
>> ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
>> 10:39:55.322306
>> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192:
>> FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
>> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In
>> function
>> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
>> 16:45:30.799839
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192:
>> FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>>
>> I did some searching and thought it might be related to
>> http://tracker.ceph.com/issues/13837 aka
>> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
>> scrubbing and deep-scrubbing, and set
>> osd_pg_max_concurrent_snap_trims
>> to 0 for all OSDs. No luck. I had changed the systemd service file to
>> automatically restart osd.223 while recovery was happening, but it
>> appears to have stalled; I suppose it's needed up for the
>> remaining objects.
>>
>>
>> Yeah, these aren't really related that I can see — though I haven't
>> spent much time in this code that I can recall. The OSD is receiving
>> a "push" as part of log recovery and finds that the object it's
>> receiving is a snapshot object without having any information about
>> the snap IDs that exist, which is weird. I don't know of any way a
>> client could break it either, but maybe David or Jason know something
>> more.
>> -Greg
>>  
>>
>>
>> I didn't see anything else online, so I thought I see if anyone
>> has seen
>> this before or has any other ideas. Thanks for taking the time.
>>
>> -Steve
>>
>>
>> --
>> Steve Anthony
>>     LTS HPC Senior Analyst
>> Lehigh University
>> sma...@lehigh.edu

Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-18 Thread Steve Anthony
Hmmm, after crashing for a few days every 30 seconds it's apparently
running normally again. Weird. I was thinking since it's looking for a
snapshot object, maybe re-enabling snaptrimming and removing all the
snapshots in the pool would remove that object (and the problem)? Never
got to that point this time, but I'm going to need to cycle more OSDs in
and out of the cluster, so if it happens again I might try that and update.

Thanks!

-Steve


On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>
>
> On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu> wrote:
>
> Hello,
>
> After starting a backup (create snap, export and import into a second
> cluster - one RBD image still exporting/importing as of this message)
> the other day while recovery operations on the primary cluster were
> ongoing I noticed an OSD (osd.126) start to crash; I reweighted it
> to 0
> to prepare to remove it. Shortly thereafter I noticed the problem
> seemed
> to move to another OSD (osd.223). After looking at the logs, I noticed
> they appeared to have the same problem. I'm running Ceph version 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>
> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>
> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>
>
> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
> {default=true}
> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
> ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
> 10:39:55.322306
> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192:
> FAILED
> assert(recovery_info.oi.snaps.size())
>
> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In
> function
> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
> 16:45:30.799839
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192:
> FAILED
> assert(recovery_info.oi.snaps.size())
>
>
> I did some searching and thought it might be related to
> http://tracker.ceph.com/issues/13837 aka
> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
> scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
> to 0 for all OSDs. No luck. I had changed the systemd service file to
> automatically restart osd.223 while recovery was happening, but it
> appears to have stalled; I suppose it's needed up for the
> remaining objects.
>
>
> Yeah, these aren't really related that I can see — though I haven't
> spent much time in this code that I can recall. The OSD is receiving a
> "push" as part of log recovery and finds that the object it's
> receiving is a snapshot object without having any information about
> the snap IDs that exist, which is weird. I don't know of any way a
> client could break it either, but maybe David or Jason know something
> more.
> -Greg
>  
>
>
> I didn't see anything else online, so I thought I see if anyone
> has seen
> this before or has any other ideas. Thanks for taking the time.
>
> -Steve
>
>
> --
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-17 Thread Steve Anthony
Hello,

After starting a backup (create snap, export and import into a second
cluster - one RBD image still exporting/importing as of this message)
the other day while recovery operations on the primary cluster were
ongoing I noticed an OSD (osd.126) start to crash; I reweighted it to 0
to prepare to remove it. Shortly thereafter I noticed the problem seemed
to move to another OSD (osd.223). After looking at the logs, I noticed
they appeared to have the same problem. I'm running Ceph version 9.2.1
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.

Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe

Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA


May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
{default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())

May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())


I did some searching and thought it might be related to
http://tracker.ceph.com/issues/13837 aka
https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
to 0 for all OSDs. No luck. I had changed the systemd service file to
automatically restart osd.223 while recovery was happening, but it
appears to have stalled; I suppose it's needed up for the remaining objects.

I didn't see anything else online, so I thought I see if anyone has seen
this before or has any other ideas. Thanks for taking the time.

-Steve


-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] download.ceph.com metadata problem?

2016-01-21 Thread Steve Anthony
It looks like there might be an issue with the repo metadata. I'm not
seeing ceph, ceph-common, librbd1, etc. in the debian-giant wheezy
branch. I ended up just downloading the debs and installing them
manually in the interim. FYI.

-Steve

cat /etc/apt/sources.list.d/ceph.list
deb http://download.ceph.com/debian-giant/ wheezy main

grep Package
/var/lib/apt/lists/download.ceph.com_debian-giant_dists_wheezy_main_binary-amd64_Packages

Package: ceph-dbg
Package: ceph-deploy
Package: ceph-fs-common
Package: ceph-fuse
Package: ceph-fuse-dbg
Package: ceph-test
Package: librados2-dbg
Package: radosgw-agent

apt-cache policy ceph
ceph:
  Installed: 0.87.2-1~bpo70+1
  Candidate: 0.87.2-1~bpo70+1
  Version table:
 *** 0.87.2-1~bpo70+1 0
100 /var/lib/dpkg/status
 0.80.7-1~bpo70+1 0
100 http://debian.cc.lehigh.edu/debian/ wheezy-backports/main
amd64 Packages


-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] nfs over rbd problem

2015-12-24 Thread Steve Anthony
cib: info: cib_perform_op:
> +  /cib:  @num_updates=162
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> +  /cib/status/node_state[@id='node2']: 
> @crm-debug-origin=do_update_resource
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> + 
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']:
>  
> @operation_key=p_rbd_map_1_start_0, @operation=start,
> @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539,
> @last-rc-change=1450430539, @exec-time=20002
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> ++
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']:
>  
>  operation_key="p_rbd_map_1_start_0" operation="start"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> call-id="48" rc-code="1" op-status="2" interval="0" l
> Dec 18 17:22:39 [2690] node2cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/99, version=0.69.162)
> Dec 18 17:22:39 [2695] node2   crmd:  warning: status_from_rc:
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1):
> Error
> Dec 18 17:22:39 [2695] node2   crmd:  warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2   crmd:   notice:
> abort_transition_graph: Transition aborted by p_rbd_map_1_start_0
> 'modify' on node2: Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2   crmd: info: match_graph_event:
> Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2693] node2  attrd:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-p_rbd_map_1 (INFINITY)
> Dec 18 17:22:39 [2695] node2   crmd:  warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2   crmd: info:
> process_graph_event: Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2695] node2   crmd:  warning: status_from_rc:
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1):
> Error
> Dec 18 17:22:39 [2695] node2   crmd:  warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2   crmd: info:
> abort_transition_graph: Transition aborted by p_rbd_map_1_start_0
> 'create' on (null): Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2   crmd: info: match_graph_event:
> Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2695] node2   crmd:  warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2   crmd: info:
> process_graph_event: Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2693] node2  attrd:   notice:
> attrd_perform_update: Sent update 28: fail-count-p_rbd_map_1=INFINITY
> Dec 18 17:22:39 [2690] node2cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/attrd/28)
> Dec 18 17:22:39 [2695] node2   crmd:   notice: run_graph:
> Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped
> Dec 18 17:22:39 [2695] node2   crmd: info:
> do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Dec 18 17:22:39 [2693] node2  attrd:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-p_rbd_map_1 (1450430559)
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> Diff: --- 0.69.162 2
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> Diff: +++ 0.69.163 (null)
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> +  /cib:  @num_updates=163
> Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op:
> ++
> /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']:
>  
>  name="fail-count-p_rbd_map_1" value="INFINITY"/>
> .
>
> thanks
>
>
>
>
>  
>
>
>
>  
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing OSD - double rebalance?

2015-11-30 Thread Steve Anthony
It's probably worth noting that if you're planning on removing multiple
OSDs in this manner, you should make sure they are not in the same
failure domain, per your CRUSH rules. For example, if you keep one
replica per node and three copies (as in the default) and remove OSDs
from multiple nodes without marking them as out first, you risk losing
data if they hold copies of the same placement groups, depending on the
number of replicas you have and the number of OSDs you simultaneously remove.

That said, it would be safe in the above scenario to remove multiple
OSDs from a single node simultaneously, since the CRUSH rules aren't
placing multiple replicas on the same host.
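
A quick sanity check before removing anything is to confirm which host (or
other failure domain) each candidate OSD sits under in the CRUSH hierarchy;
the OSD ID below is just an example:

# ceph osd tree        # shows every OSD under its host/rack bucket
# ceph osd find 12     # reports the CRUSH location of a single OSD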

-Steve  

On 11/30/2015 04:33 AM, Wido den Hollander wrote:
>
> On 30-11-15 10:08, Carsten Schmitt wrote:
>> Hi all,
>>
>> I'm running ceph version 0.94.5 and I need to downsize my servers
>> because of insufficient RAM.
>>
>> So I want to remove OSDs from the cluster and according to the manual
>> it's a pretty straightforward process:
>> I'm beginning with "ceph osd out {osd-num}" and the cluster starts
>> rebalancing immediately as expected. After the process is finished, the
>> rest should be quick:
>> Stop the daemon "/etc/init.d/ceph stop osd.{osd-num}" and remove the OSD
>> from the crush map: "ceph osd crush remove {name}"
>>
>> But after entering the last command, the cluster starts rebalancing again.
>>
>> And that I don't understand: Shouldn't be one rebalancing process enough
>> or am I missing something?
>>
> Well, for CRUSH this are two different things. First, the weight of the
> node goes to 0 (zero), but it's still a part of the CRUSH map.
>
> Say, there are still 5 OSDS on that host, 4 with a weight of X and one
> with a weight of zero.
>
> When you remove the OSD, there are only 4 OSDs left, that's a change for
> CRUSH.
>
> What you should do in this case. Only remove the OSD from CRUSH and
> don't mark it as out.
>
> When the cluster is done you can mark it out, but that won't cause a
> rebalance since it's already out of the CRUSH map.
>
> It will still work with the other OSDs to migrate the data since the
> cluster knows it had that PG information.
>
>> My config is pretty vanilla, except for:
>> [osd]
>> osd recovery max active = 4
>> osd max backfills = 4
>>
>> Thanks in advance,
>> Carsten
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't activate osd in infernalis

2015-11-20 Thread Steve Anthony
> command_check_call
>>>
>>> > [ceph01][WARNIN] return subprocess.check_call(arguments)
>>>
>>> > [ceph01][WARNIN]   File
>>> "/usr/lib64/python2.7/subprocess.py", line
>>>
>>> > 542, in check_call
>>>
>>> > [ceph01][WARNIN] raise CalledProcessError(retcode, cmd)
>>>
>>> > [ceph01][WARNIN] subprocess.CalledProcessError: Command
>>>
>>> > '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs',
>>> '--mkkey', '-i',
>>>
>>> > '0', '--monmap',
>>> '/var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap',
>>>
>>> > '--osd-data', '/var/lib/ceph/tmp/mnt.pmHRuu',
>>> '--osd-journal',
>>>
>>> > '/var/lib/ceph/tmp/mnt.pmHRuu/journal', '--osd-uuid',
>>>
>>> > 'de162e24-16b6-4796-b6b9-774fdb8ec234', '--keyring',
>>>
>>> > '/var/lib/ceph/tmp/mnt.pmHRuu/keyring', '--setuser', 'ceph',
>>>
>>> > '--setgroup', 'ceph']' returned non-zero exit status 1
>>>
>>> > [ceph01][ERROR ] RuntimeError: command returned non-zero
>>> exit status: 1
>>>
>>> > [ceph_deploy][ERROR ] RuntimeError: Failed to execute
>>> command:
>>>
>>> > ceph-disk -v activate --mark-init systemd --mount /dev/sda1
>>>
>>> > 
>>>
>>> > The output of ls -lahn in /var/lib/ceph/ is
>>>
>>> > 
>>>
>>> > drwxr-x---.  9 167 167 4,0K 19. Nov 10:32 .
>>>
>>> > drwxr-xr-x. 28   0   0 4,0K 19. Nov 11:14 ..
>>>
>>> > drwxr-x---.  2 167 1676 10. Nov 13:06 bootstrap-mds
>>>
>>> > drwxr-x---.  2 167 167   25 19. Nov 10:48 bootstrap-osd
>>>
>>> > drwxr-x---.  2 167 1676 10. Nov 13:06 bootstrap-rgw
>>>
>>> > drwxr-x---.  2 167 1676 10. Nov 13:06 mds
>>>
>>> > drwxr-x---.  2 167 1676 10. Nov 13:06 mon
>>>
>>> > drwxr-x---.  2 167 1676 10. Nov 13:06 osd
>>>
>>> > drwxr-x---.  2 167 167   65 19. Nov 11:22 tmp
>>>
>>> > 
>>>
>>> > 
>>>
>>> > I hope someone can help me, I am really lost right now.
>>>
>>> > 
>>>
>>>  
>>>
>>> -- 
>>>
>>> Kind regards
>>>
>>>  
>>>
>>> David Riedl
>>>
>>>  
>>>
>>>  
>>>
>>>  
>>>
>>> WINGcon GmbH Wireless New Generation - Consulting & Solutions
>>>
>>>  
>>>
>>> Phone: +49 (0) 7543 9661 - 26
>>>
>>> E-Mail: david.ri...@wingcon.com
>>>
>>> Web: http://www.wingcon.com
>>>
>>>  
>>>
>>> Registered office: Langenargen
>>>
>>> Register court: ULM, HRB 632019
>>>
>>> USt-Id.: DE232931635, WEEE-Id.: DE74015979
>>>
>>> Managing directors: Norbert Schäfer, Fritz R. Paul
>>>
>>>  
>>>
>>> ___
>>>
>>> ceph-users mailing list
>>>
>>> ceph-users@lists.ceph.com
>>>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>  
>>>
>>>
>>>
>>>
>>> -- 
>>>  Mykola* *
>>
>> -- 
>> Kind regards
>>
>> David Riedl
>>
>>
>>
>> WINGcon GmbH Wireless New Generation - Consulting & Solutions
>>
>> Phone: +49 (0) 7543 9661 - 26
>> E-Mail: david.ri...@wingcon.com
>> Web: http://www.wingcon.com
>>
>> Registered office: Langenargen
>> Register court: ULM, HRB 632019
>> VAT ID: DE232931635, WEEE ID: DE74015979
>> Managing directors: Norbert Schäfer, Fritz R. Paul 
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> -- 
>>  Mykola* *
>
> -- 
> Kind regards
>
> David Riedl
>
>
>
> WINGcon GmbH Wireless New Generation - Consulting & Solutions
>
> Phone: +49 (0) 7543 9661 - 26
> E-Mail: david.ri...@wingcon.com
> Web: http://www.wingcon.com
>
> Registered office: Langenargen
> Register court: ULM, HRB 632019
> VAT ID: DE232931635, WEEE ID: DE74015979
> Managing directors: Norbert Schäfer, Fritz R. Paul 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] upgrading 0.94.5 to 9.2.0 notes

2015-11-20 Thread Steve Anthony
On journal device permissions see my reply in "Can't activate osd in
infernalis". Basically, if you set the partition type GUID to
45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (the Ceph journal type GUID), the
existing Ceph udev rules will set permissions on the partitions
correctly at boot.

Changing the ownership on the journal partitions manually will not
persist across reboots. Easy reference:
http://www.spinics.net/lists/ceph-users/msg23685.html
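
For example, something like this (a sketch; the device and partition number
are placeholders, and it assumes sgdisk is installed):

# sgdisk --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdX  # partition 1 holds the journal
# udevadm trigger   # or reboot; the Ceph udev rules should then fix ownership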

-Steve

On 11/20/2015 10:14 AM, Kenneth Waegeman wrote:
> Hi,
>
> I recently started a test to upgrade ceph from 0.94.5 to 9.2.0 on
> Centos7. I had some issues not mentioned in the release notes. Hereby
> some notes:
>
> * Upgrading instructions are only in the release notes, not updated on
> the upgrade page in the docs:
> http://docs.ceph.com/docs/master/install/upgrading-ceph/
>
> * Once you've updated the packages, `service ceph stop` or `service
> ceph stop `  won't actually work anymore, is pointing to a
> non-existing target. This is a step in the upgrade procedure I
> couldn't do, I manually killed the processes.
> [root@ceph001 ~]# service ceph stop osd
> Redirecting to /bin/systemctl stop  osd ceph.service
> Failed to issue method call: Unit osd.service not loaded
>
> * You also need to chown the journal partitions used for the osds.
> only chowning /var/lib/ceph is not enough
>
> * Permissions on log files are not completely ok. The /var/log/ceph
> folder is owned by ceph, but existing files are still owned by root,
> so I had to manually chown these, otherwise I got messages like this:
> 2015-11-13 11:32:26.641870 7f55a4ffd700  1 mon.ceph003@2(peon).log
> v4672 unable to write to '/var/log/ceph/ceph.log' for channel
> 'cluster': (13) Permission denied
>
> .* I still get messages like these in the log files, not sure if they
> are harmless or not:
>
> 2015-11-13 11:52:53.840414 7f610f376700 -1 lsb_release_parse - pclose
> failed: (13) Permission denied
>
> * systemctl start ceph.target does not start my osds.., I have to
> start them all with systemctl start ceph-osd@...
> * systemctl restart ceph.target restart the running osds, but not the
> osds that are not yet running.
> * systemctl stop ceph.target stops everything, as expected :)
>
> I didn't tested everything thoroughly yet, but does someone has seen
> the same issues?
>
> Thanks!
>
> Kenneth
> _______
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Interesting postmortem on SSDs from Algolia

2015-06-17 Thread Steve Anthony
There's often a great deal of discussion about which SSDs to use for
journals, and why some of the cheaper SSDs end up being more expensive
in the long run. The recent blog post at Algolia, though not Ceph
specific, provides a good illustration of exactly how insidious
kernel/SSD interactions can be. Thought the list might find it
interesting.   

https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to backup hundreds or thousands of TB

2015-05-06 Thread Steve Anthony
 Wissenschaft, Forschung und Kunst Baden-Württemberg

 Managing Director: Prof. Thomas Schadt


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 -- 
 ==
 Jean-Philippe Méthot
 Administrateur système / System administrator
 GloboTech Communications
 Phone: 1-514-907-0050
 Toll Free: 1-(888)-GTCOMM1
 Fax: 1-(514)-907-0750
 jpmet...@gtcomm.net
 http://www.gtcomm.net


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Steve Anthony
, but
 feel that moving the management of our clusters to standard tools
 would
 provide a little more consistency and help prevent some mistakes that
 have happened while using ceph-deploy.

 We're looking at using the same tools we use in our OpenStack
 environment (puppet/ansible), but I'm interested in hearing from
 people
 using chef/salt/juju as well.

 Some of the cluster operation tasks that I can think of along with
 ideas/concerns I have are:

 Keyring management
   Seems like hiera-eyaml is a natural fit for storing the keyrings.

 ceph.conf
   I believe the puppet ceph module can be used to manage this
 file, but
   I'm wondering if using a template (erb?) might be better method to
   keeping it organized and properly documented.

 Pool configuration
   The puppet module seems to be able to handle managing replicas
 and the
   number of placement groups, but I don't see support for erasure
 coded
   pools yet.  This is probably something we would want the initial
   configuration to be set up by puppet, but not something we would
 want
   puppet changing on a production cluster.

 CRUSH maps
   Describing the infrastructure in yaml makes sense.  Things like
 which
   servers are in which rows/racks/chassis.  Also describing the
 type of
   server (model, number of HDDs, number of SSDs) makes sense.

 CRUSH rules
   I could see puppet managing the various rules based on the backend
   storage (HDD, SSD, primary affinity, erasure coding, etc).

 Replacing a failed HDD disk
   Do you automatically identify the new drive and start using it right
   away?  I've seen people talk about using a combination of udev and
   special GPT partition IDs to automate this.  If you have a cluster
   with thousands of drives I think automating the replacement makes
   sense.  How do you handle the journal partition on the SSD?  Does
   removing the old journal partition and creating a new one create a
   hole in the partition map (because the old partition is removed and
   the new one is created at the end of the drive)?

 Replacing a failed SSD journal
   Has anyone automated recreating the journal drive using Sebastien
   Han's instructions, or do you have to rebuild all the OSDs as well?


 
 http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/

 Adding new OSD servers
   How are you adding multiple new OSD servers to the cluster?  I could
   see an ansible playbook which disables nobackfill, noscrub, and
   nodeep-scrub followed by adding all the OSDs to the cluster being
   useful.

 Upgrading releases
   I've found an ansible playbook for doing a rolling upgrade which
 looks
   like it would work well, but are there other methods people are
 using?


 
 http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/

 Decommissioning hardware
   Seems like another ansible playbook for reducing the OSDs weights to
   zero, marking the OSDs out, stopping the service, removing the
 OSD ID,
   removing the CRUSH entry, unmounting the drives, and finally
 removing
   the server would be the best method here.  Any other ideas on how to
   approach this?


 That's all I can think of right now.  Is there any other tasks that
 people have run into that are missing from this list?

 Thanks,
 Bryan


 This E-mail and any of its attachments may contain Time Warner
 Cable proprietary information, which is privileged, confidential,
 or subject to copyright belonging to Time Warner Cable. This
 E-mail is intended solely for the use of the individual or entity
 to which it is addressed. If you are not the intended recipient of
 this E-mail, you are hereby notified that any dissemination,
 distribution, copying, or action taken in relation to the contents
 of and attachments to this E-mail is strictly prohibited and may
 be unlawful. If you have received this E-mail in error, please
 notify the sender immediately and permanently delete the original
 and any copy of this E-mail and any printout.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



signature.asc

Re: [ceph-users] Replication question

2015-03-12 Thread Steve Anthony
Actually, it's more like 41TB. It's a bad idea to run at near full
capacity (by default past 85%) because you need some space where Ceph
can replicate data as part of its healing process in the event of disk
or node failure. You'll get a health warning when you exceed this ratio.

You can use erasure coding to increase the amount of data you can store
beyond 41TB, but you'll still need some replicated disk as a caching
layer in front of the erasure coded pool if you're using RBD. See:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/036430.html

As to how much space you can save with erasure coding, that will depend
on if you're using RBD and need a cache layer and the values you set for
k and m (number of data chunks and coding chunks). There's been some
discussion on the list with regards to choosing those values.

-Steve

On 03/12/2015 10:07 AM, Thomas Foster wrote:
 I am looking into how I can maximize my space with replication, and I
 am trying to understand how I can do that.

 I have 145TB of space and a replication of 3 for the pool and was
 thinking that the max data I can have in the cluster is ~47TB in my
 cluster at one time..is that correct?  Or is there a way to get more
 data into the cluster with less space using erasure coding?  

 Any help would be greatly appreciated.




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] import-diff requires snapshot exists?

2015-03-03 Thread Steve Anthony
Hello,

I've been playing with backing up images from my production site
(running 0.87) to my backup site (running 0.87.1) using export/import
and export-diff/import-diff. After initially exporting and importing the
image (rbd/small to backup/small) I took a snapshot (called test1) on
the production cluster, ran export-diff from that snapshot, and then
attempted to import-diff the diff file on the backup cluster.

# rbd import-diff ./foo.diff backup/small
start snapshot 'test1' does not exist in the image, aborting
Importing image diff: 0% complete...failed.
rbd: import-diff failed: (22) Invalid argument

This works fine if I create a test1 snapshot on the backup cluster
before running import-diff. However, it appears that the changes get
written into backup/small not backup/small@test1. So unless I'm not
understanding something, it seems like the content of the snapshot on
the backup cluster is of no importance, which makes me wonder why it
must exist at all.

Any thoughts? Thanks!

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] import-diff requires snapshot exists?

2015-03-03 Thread Steve Anthony
Jason,

Ah, ok that makes sense. I was forgetting snapshots are read-only. Thanks!

My plan was to do something like this. First, create a sync snapshot and
seed the backup:

rbd snap create rbd/small@sync
rbd export rbd/small@sync ./foo

rbd import ./foo backup/small
rbd snap create backup/small@sync

Then each day, create a daily snap on the backup cluster:

rbd snap create backup/small@2015-02-03

Then send that day's changes:

rbd export-diff --from-snap sync rbd/small ./foo.diff
rbd import-diff ./foo.diff backup/small

Then remove and recreate the sync snapshot markers to prepare for the next sync.

rbd snap rm rbd/small@sync
rbd snap rm backup/small@sync

rbd snap create rbd/small@sync
rbd snap create backup/small@sync

Finally remove any dated snapshots on the remote cluster outside the
retention window.

-Steve

On 03/03/2015 04:37 PM, Jason Dillaman wrote:
 Snapshots are read-only, so all changes to the image can only be applied to 
 the HEAD revision.

 In general, you should take a snapshot prior to export / export-diff to 
 ensure consistent images:

   rbd snap create rbd/small@snap1
   rbd export rbd/small@snap1 ./foo

   rbd import ./foo backup/small
   rbd snap create backup/small@snap1

   ** rbd/small and backup/small are now consistent through snap1 -- rbd/small 
 might have been modified post snapshot

   rbd snap create rbd/small@snap2
   rbd export-diff --from-snap snap1 rbd/small@snap2 ./foo.diff
   rbd import-diff ./foo.diff backup/small

   ** rbd/small and backup/small are now consistent through snap2.  
 import-diff automatically created backup/small@snap2 after importing all 
 changes. 

 -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com
 - Original Message - From: Steve Anthony sma...@lehigh.edu
 To: ceph-users@lists.ceph.com Sent: Tuesday, March 3, 2015 2:06:44 PM
 Subject: [ceph-users] import-diff requires snapshot exists? Hello,
 I've been playing with backing up images from my production site
 (running 0.87) to my backup site (running 0.87.1) using export/import
 and export-diff/import-diff. After initially exporting and importing
 the image (rbd/small to backup/small) I took a snapshot (called test1)
 on the production cluster, ran export-diff from that snapshot, and
 then attempted to import-diff the diff file on the backup cluster. #
 rbd import-diff ./foo.diff backup/small start snapshot 'test1' does
 not exist in the image, aborting Importing image diff: 0%
 complete...failed. rbd: import-diff failed: (22) Invalid argument This
 works fine if I create a test1 snapshot on the backup cluster before
 running import-diff. However, it appears that the changes get written
 into backup/small not backup/small@test1. So unless I'm not
 understanding something, it seems like the content of the snapshot on
 the backup cluster is of no importance, which makes me wonder why it
 must exist at all. Any thoughts? Thanks! -Steve
 -- Steve Anthony LTS HPC Support Specialist Lehigh University
 sma...@lehigh.edu ___
 ceph-users mailing list ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 85% of the cluster won't start, or how I learned why to use disk UUIDs

2015-01-27 Thread Steve Anthony
 someone from making the same mistakes!

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph as a primary storage for owncloud

2015-01-27 Thread Steve Anthony
I tried this a while back. In my setup, I exposed a block device with
rbd on the owncloud host and tried sharing an image to the owncloud host
via NFS. If I recall correctly, both worked fine (I didn't try S3). The
problem I had at the time (maybe 6-12 months ago) was that owncloud
didn't support enough automated management of LDAP group permissions for
me to easily deploy and manage it for 1000+ users. It is on my list of
things to revisit however, so I'd be curious to hear how things go for
you. If it doesn't work out, I'd also recommend checking out Pydio. It
didn't make it into production in my environment (I didn't have time to
focus on it), but I liked its user management better than owncloud's at
the time.
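
For what it's worth, a rough sketch of the rbd-over-NFS setup I described
above (image name, size, filesystem, mount point, and the NFS client are
all placeholder values):

rbd create owncloud-data --size 1048576                 # 1TB image; size is in MB
rbd map owncloud-data                                   # appears as /dev/rbd/rbd/owncloud-data
mkfs.xfs /dev/rbd/rbd/owncloud-data
mount /dev/rbd/rbd/owncloud-data /srv/owncloud-data
# to share it to the owncloud host instead, export the mount over NFS:
echo "/srv/owncloud-data owncloud-host(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra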

-Steve

On 01/27/2015 05:05 AM, Simone Spinelli wrote:
 Dear all,

 we would like to use ceph as a primary (object) storage for owncloud.
 Has anyone already done this? I mean: is it actually possible, or am I
 wrong?
 As I understand it, I have to use radosGW in its swift flavor, but what
 about the s3 flavor?
 I cannot find anything official, hence my question.
 Do you have any advice or can you indicate me some kind of
 documentation/how-to?

 I know that maybe this is not the right place for this questions but I
 also asked owncloud's community... in the meantime...

 Every answer is appreciated!

 Thanks

 Simone


-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd troubleshooting

2014-11-04 Thread Steve Anthony
Shiva,

You need to connect to the host where the OSD is located and stop it by
invoking:

service ceph stop osd.1

I don't think there's a way to stop and start OSDs from an admin node,
unless I missed a change that provides this functionality.
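
For example, the whole sequence looks something like this (the OSD id and
hostname are placeholders); run the service commands on the OSD's host,
or via ssh from wherever you manage the cluster:

ceph osd set noout
ssh osdhost01 service ceph stop osd.1
# ...perform maintenance on the OSD or its disk...
ssh osdhost01 service ceph start osd.1
ceph osd unset noout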

-Steve

On 11/04/2014 10:59 PM, shiva rkreddy wrote:
 Hi,
 I'm trying to run osd troubleshooting commands.

 *Use case: Stopping osd without re-balancing.*

 .#ceph osd noout  // this command works.
 But, neither of the following work:
 #stop ceph-osd id=1
 (Error message: /*no valid command found; 10 closest matches:*/ ...)
  or
 # ceph osd stop osd.1
 ( Error message: /*stop: Unknown job: ceph-osd*/ )

 Environment:
 ceph: 0.80.7
 OS: RHEL6.5
 upstart-0.6.5-13.el6_5.3.x86_64
 ceph-0.80.7-0.el6.x86_64
 ceph-common-0.80.7-0.el6.x86_64

 Thanks,
 shiva



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Steve Anthony
Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
/dev/disk/by-id for future OSDs.

I tried stopping an existing OSD on another node (which is working -
osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to
point to the same partition using /dev/disk/by-id, and starting the OSD
again, but it fails to start with:

2014-10-27 11:03:31.607060 7fa65018e780 -1
filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
/var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-33: (2) No such file or directory

The journal symlink exists and points to the same partition as before
when it was /dev/sde1. Can I not change these existing symlinks manually
to point to the same partition using /dev/disk/by-id?
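
In case it helps, the manual re-pointing I'm attempting looks roughly
like this (the by-id name below is a placeholder for the journal
partition):

service ceph stop osd.33
ln -sf /dev/disk/by-id/scsi-EXAMPLEDISK-part1 /var/lib/ceph/osd/ceph-33/journal
ls -lL /var/lib/ceph/osd/ceph-33/journal    # should resolve to the block device
service ceph start osd.33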

-Steve

On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
 * /dev/disk/by-id

 by-path will change if you connect it to different controller, or
 replace your controller with other model, or put it in different pci
 slot

 On Sat, 25 Oct 2014 17:20:58 +, Scott Laird sc...@sigkill.org
 wrote:

 You'd be best off using /dev/disk/by-path/ or similar links; that way
they
 follow the disks if they're renamed again.

 On Fri, Oct 24, 2014, 9:40 PM Steve Anthony sma...@lehigh.edu wrote:

 Hello,

 I was having problems with a node in my cluster (Ceph v0.80.7/Debian
 Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when
 it came back up. Now all the symlinks to the journals are broken. The
 SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:

 root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
 lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal -> /dev/sde1
 lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal -> /dev/sdd1
 lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal -> /dev/sdc1
 lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal -> /dev/sde2
 lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal -> /dev/sdc2
 lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal -> /dev/sdd2
 lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal -> /dev/sde3
 lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal -> /dev/sdc3
 lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal -> /dev/sdd3
 lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal -> /dev/sde4
 lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal -> /dev/sdd4

 Any way to fix this without just removing all the OSDs and re-adding
 them? I thought about recreating the symlinks to point at the new SSD
 labels, but I figured I'd check here first. Thanks!

 -Steve

 --
 Steve Anthony
 LTS HPC Support Specialist
 Lehigh University
 sma...@lehigh.edu

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] journals relabeled by OS, symlinks broken

2014-10-27 Thread Steve Anthony
Oh, hey look at that. I must have screwed something up before. I thought
it was strange that it didn't work.

Works now, thanks!

-Steve

On 10/27/2014 03:20 PM, Scott Laird wrote:
 Double-check that you did it right.  Does 'ls -lL
 /var/lib/ceph/osd/ceph-33/journal' resolve to a block-special device?

  On Mon Oct 27 2014 at 12:12:20 PM Steve Anthony sma...@lehigh.edu wrote:

 Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using
  /dev/disk/by-id for future OSDs.

 I tried stopping an existing OSD on another node (which is working
 - osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal
 to point to the same partition using /dev/disk/by-id, and starting
 the OSD again, but it fails to start with:

 2014-10-27 11:03:31.607060 7fa65018e780 -1
 filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal
 /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory
 2014-10-27 11:03:31.617262 7fa65018e780 -1  ** ERROR: error
 converting store /var/lib/ceph/osd/ceph-33: (2) No such file or
 directory

 The journal symlink exists and points to the same partition as
 before when it was /dev/sde1. Can I not change these existing
 symlinks manually to point to the same partition using
 /dev/disk/by-id?


 -Steve


 On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote:
  * /dev/disk/by-id
 
  by-path will change if you connect it to different controller, or
  replace your controller with other model, or put it in different pci
  slot
 
   On Sat, 25 Oct 2014 17:20:58 +, Scott Laird sc...@sigkill.org wrote:
 
  You'd be best off using /dev/disk/by-path/ or similar links;
 that way they
  follow the disks if they're renamed again.
 
   On Fri, Oct 24, 2014, 9:40 PM Steve Anthony sma...@lehigh.edu wrote:
 
  Hello,
 
  I was having problems with a node in my cluster (Ceph
 v0.80.7/Debian
  Wheezy/Kernel 3.12), so I rebooted it and the disks were
 relabled when
  it came back up. Now all the symlinks to the journals are
 broken. The
  SSDs are now sda, sdb, and sdc but the journals were sdc, sdd,
 and sde:
 
  root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
   lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal -> /dev/sde1
   lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal -> /dev/sdd1
   lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal -> /dev/sdc1
   lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal -> /dev/sde2
   lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal -> /dev/sdc2
   lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal -> /dev/sdd2
   lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal -> /dev/sde3
   lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal -> /dev/sdc3
   lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal -> /dev/sdd3
   lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal -> /dev/sde4
   lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal -> /dev/sdd4
 
  Any way to fix this without just removing all the OSDs and
 re-adding
  them? I thought about recreating the symlinks to point at the
 new SSD
  labels, but I figured I'd check here first. Thanks!
 
  -Steve
 
  --
  Steve Anthony
  LTS HPC Support Specialist
  Lehigh University
   sma...@lehigh.edu
 
  ___
  ceph-users mailing list
   ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 

 -- 
 Steve Anthony
 LTS HPC Support Specialist
 Lehigh University
  sma...@lehigh.edu


-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] journals relabeled by OS, symlinks broken

2014-10-24 Thread Steve Anthony
Hello,

I was having problems with a node in my cluster (Ceph v0.80.7/Debian
Wheezy/Kernel 3.12), so I rebooted it and the disks were relabeled when
it came back up. Now all the symlinks to the journals are broken. The
SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde:

root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal
lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal -> /dev/sde1
lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal -> /dev/sdd1
lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal -> /dev/sdc1
lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal -> /dev/sde2
lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal -> /dev/sdc2
lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal -> /dev/sdd2
lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal -> /dev/sde3
lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal -> /dev/sdc3
lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal -> /dev/sdd3
lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal -> /dev/sde4
lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal -> /dev/sdd4

Any way to fix this without just removing all the OSDs and re-adding
them? I thought about recreating the symlinks to point at the new SSD
labels, but I figured I'd check here first. Thanks!

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] get amount of space used by snapshots

2014-09-22 Thread Steve Anthony
Hello,

If I have an rbd image and a series of snapshots of that image, is there
a fast way to determine how much space the objects composing the
original image and all the snapshots are using in the cluster, or even
just the space used by the snaps?

The only way I've been able to find so far is to get the
block_name_prefix for the image with rbd info and then grep for that
prefix in the output of rados ls, eg. rados ls|grep
rb.0.396de.238e1f29|wc -l. This is relatively slow, printing ~250
objects/s, which means hours to count through 10s of TB of objects.
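
In script form, that approach looks something like this (the pool and
image names are hypothetical):

PREFIX=$(rbd info rbd/myimage | awk '/block_name_prefix/ {print $2}')
rados -p rbd ls | grep -c "$PREFIX"    # object count; each object is at most 4MB for a default-order image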

Basically, if I'm keeping daily snapshots for a set of images, I'd like
to be able to tell how much space those snapshots are using so I can
determine how frequently I need to prune old snaps. Thanks!

-Steve

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-08-26 Thread Steve Anthony
Ok, after some delays and the move to new network hardware I have an
update. I'm still seeing the same low bandwidth and high retransmissions
from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've
narrowed it down to transmissions from a 10Gb connected host to a 1Gb
connected host. Taking a more targeted tcpdump, I discovered that there
are multiple duplicate ACKs, triggering fast retransmissions between the
two test hosts.

There are several websites/articles which suggest that mixing 10Gb and
1Gb hosts causes performance issues, but no concrete explanation of why
that's the case, and if it can be avoided without moving everything to
10Gb, eg.

http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx
[PDF]
http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/

I verified that it's not a flow control storm (the pause frame counters
along the network path are zero), so assuming it might be bandwidth
related I installed trickle and used it to limit the bandwidth of iperf
to 1Gb; no change. I further restricted it down to 100Kbps, and was
*still* seeing high retransmission. This seems to imply it's not purely
bandwidth related.

After further research, I noticed a difference of about 0.1ms in the RTT
between two 10Gb hosts (intra-switch) and the 10Gb and 1Gb host
(inter-switch). I theorized this may be affecting the retransmission
timeout counter calculations, per:

http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html

so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb;
this immediately fixed the issue. After this change the difference in
RTTs moved to about .025ms. Plugging another host into the old 10Gb FEX,
I have 10Gb to 10Gb RTTs within .001ms of 6001 to 2960 RTTs, and don't
see the high retransmissions with iperf between those 10Gb hosts.
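
For reference, forcing the port speed down from the host side looks
something like this (the interface name is a placeholder; the same
change could instead be made on the switch port):

ethtool eth2                             # check current Speed/Duplex
ethtool -s eth2 speed 1000 duplex full   # force the link to 1Gb
ethtool eth2                             # confirm Speed: 1000Mb/s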


 tldr 

So, right now I don't see retransmissions between hosts on the same
switch (even if speeds are mixed), but I do across switches when the
hosts are mixed 10Gb/1Gb. I also wonder why limiting bandwidth at the
process level and negotiating the link at 1Gb produce such different
results. I checked the link per Mark's suggestion
below, but all the values they increase in that old post are already
lower than the defaults set on my hosts.

If anyone has any ideas or explanations, I'd appreciate it. Otherwise,
I'll keep the list posted if I uncover a solution or make more progress.
Thanks.

-Steve

On 07/28/2014 01:21 PM, Mark Nelson wrote:
 On 07/28/2014 11:28 AM, Steve Anthony wrote:
 While searching for more information I happened across the following
 post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
 I've been experiencing. I ran tcpdump and noticed what appeared to be a
 high number of retransmissions on the host where the images are mounted
 during a read from a Ceph rbd, so I ran iperf3 to get some concrete
 numbers:

 Very interesting that you are seeing retransmissions.


 Server: nas4 (where rbd images are mapped)
 Client: ceph2 (currently not in the cluster, but configured
 identically to the other nodes)

 Start server on nas4:
 iperf3 -s

 On ceph2, connect to server nas4, send 4096MB of data, report on 1
 second intervals. Add -R to reverse the client/server roles.
 iperf3 -c nas4 -n 4096M -i 1

 Summary of traffic going out the 1Gb interface to a switch

 [ ID] Interval   Transfer Bandwidth   Retr
 [  5]   0.00-36.53  sec  4.00 GBytes   941 Mbits/sec   15
 sender
 [  5]   0.00-36.53  sec  4.00 GBytes   940 Mbits/sec
 receiver

 Reversed, summary of traffic going over the fabric extender

 [ ID] Interval   Transfer Bandwidth   Retr
 [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec  30756
 sender
 [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec
 receiver

 Definitely looks suspect!



 It appears that the issue is related to the network topology employed.
 The private cluster network and nas4's public interface are both
 connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
 Nexus 7000. This was meant as a temporary solution until our network
 team could finalize their design and bring up the Nexus 6001 for the
 cluster. From what our network guys have said, the FEX has been much
 more limited than they anticipated and they haven't been pleased with it
 as a solution in general. The 6001 is supposed be ready this week, so
 once it's online I'll move the cluster to that switch and re-test to see
 if this fixes the issues I've been experiencing.

 If it's not the hardware, one other thing you might want to test is to
 make sure it's not something similar to the autotuning issues we used
 to see.  I don't think this should be an issue at this point given the
 code changes we made to address it, but it would be easy to test. 
 Doesn't seem like

Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-24 Thread Steve Anthony
Thanks for the information!

Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I
was under the impression that rbd cache options wouldn't apply, since
presumably the kernel is handling the caching. I'll have to toggle some
of those values and see if they make a difference in my setup.

I did some additional testing today. If I limit the write benchmark to 1
concurrent operation I see a lower bandwidth number, as I expected.
However, when writing to the XFS filesystem on an rbd image I see
transfer rates closer to 400MB/s.

# rados -p bench bench 300 write --no-cleanup -t 1

Total time run: 300.105945
Total writes made:  1992
Write size: 4194304
Bandwidth (MB/sec): 26.551

Stddev Bandwidth:   5.69114
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency:0.15065
Stddev Latency: 0.0732024
Max latency:0.617945
Min latency:0.097339

# time cp -a /mnt/local/climate /mnt/ceph_test1

real2m11.083s
user0m0.440s
sys1m11.632s

# du -h --max-depth=1 /mnt/local
53G/mnt/local/climate

This seems to imply that there is more than one concurrent operation
when writing into the filesystem on top of the rbd image. However, given
that the filesystem read speeds and the rados benchmark read speeds are
much closer in reported bandwidth, it's as if reads are occurring as a
single operation.

# time cp -a /mnt/ceph_test2/isos /mnt/local/

real36m2.129s
user0m1.572s
sys3m23.404s

# du -h --max-depth=1 /mnt/ceph_test2/
68G/mnt/ceph_test2/isos

Is this apparent single-thread read and multi-thread write with the rbd
kernel module the expected mode of operation? If so, could someone
explain the reason for this limitation?

Based on the information on data striping in
http://ceph.com/docs/next/architecture/#data-striping I would assume
that a format 1 image would stripe a file larger than the 4MB object
size over multiple objects and that those objects would be distributed
over multiple OSDs. This would seem to indicate that reading a file back
would be much faster since even though Ceph is only reading the primary
replica, the read is still distributed over multiple OSDs. At worst I
would expect something near the read bandwidth of a single OSD, which
would still be much higher than 30-40MB/s.
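
A quick way to see that layout in practice (the image name and object
name below are hypothetical) is to pull the block_name_prefix from rbd
info and ask Ceph where one of the image's objects lives:

rbd info rbd/small                                  # order 22 => 4MB objects; note block_name_prefix
ceph osd map rbd rb.0.1234.5678abcd.000000000000    # shows the PG and acting OSDs for that object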

-Steve

On 07/24/2014 04:07 PM, Udo Lembke wrote:
 Hi Steve,
 I'm also looking for improvements of single-thread-reads.

 A little bit higher values (twice?) should be possible with your config.
 I have 5 nodes with 60 4-TB hdds and got following:
 rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
 Total time run:60.066934
 Total reads made: 863
 Read size:4194304
 Bandwidth (MB/sec):57.469
 Average Latency:   0.0695964
 Max latency:   0.434677
 Min latency:   0.016444

 In my case I had some osds (xfs) with an high fragmentation (20%).
 Changing the mount options and defragmentation help slightly.
 Performance changes:
 [client]
 rbd cache = true
 rbd cache writethrough until flush = true

 [osd]
 osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M
 osd_op_threads = 4
 osd_disk_threads = 4


 But I expect much more speed for an single thread...

 Udo

 On 23.07.2014 22:13, Steve Anthony wrote:
 Ah, ok. That makes sense. With one concurrent operation I see numbers
 more in line with the read speeds I'm seeing from the filesystems on the
 rbd images.

 # rados -p bench bench 300 seq --no-cleanup -t 1
 Total time run:300.114589
 Total reads made: 2795
 Read size:4194304
 Bandwidth (MB/sec):37.252

 Average Latency:   0.10737
 Max latency:   0.968115
 Min latency:   0.039754

 # rados -p bench bench 300 rand --no-cleanup -t 1
 Total time run:300.164208
 Total reads made: 2996
 Read size:4194304
 Bandwidth (MB/sec):39.925

 Average Latency:   0.100183
 Max latency:   1.04772
 Min latency:   0.039584

 I really wish I could find my data on read speeds from a couple weeks
 ago. It's possible that they've always been in this range, but I
 remember one of my test users saturating his 1GbE link over NFS reading
 copying from the rbd client to his workstation. Of course, it's also
 possible that the data set he was using was cached in RAM when he was
 testing, masking the lower rbd speeds.

 It just seems counterintuitive to me that read speeds would be so much
 slower that writes at the filesystem layer in practice. With images in
 the 10-100TB range, reading data at 20-60MB/s isn't going to be
 pleasant. Can you suggest any tunables or other approaches