Apologies if this has been addressed already; I am not sure how to search
old ceph-users mailing list posts. I used to use gmane.org, but that seems
to be down.
My setup:
I have a moderate ceph cluster (ceph hammer 0.94.9
- fe6d859066244b97b24f09d46552afc2071e6f90). The cluster is running Ubuntu,
but the gateways are running CentOS 7 due to an odd memory issue we had
across all of our gateways.
Outside of that the cluster is pretty standard and healthy:
[root@kh11-9 ~]# ceph -s
    cluster XXX-XXX-XXX-XXX
     health HEALTH_OK
     monmap e4: 3 mons at {kh11-8=X.X.X.X:6789/0,kh12-8=X.X.X.X:6789/0,kh13-8=X.X.X.X:6789/0}
            election epoch 150, quorum 0,1,2 kh11-8,kh12-8,kh13-8
     osdmap e69678: 627 osds: 627 up, 627 in
Here is my radosgw config in ceph.conf:
[client.rgw.kh09-10]
log_file = /var/log/radosgw/client.radosgw.log
rgw_frontends = "civetweb port=80 access_log_file=/var/log/radosgw/rgw.access error_log_file=/var/log/radosgw/rgw.error"
rgw_enable_ops_log = true
rgw_ops_log_rados = true
rgw_thread_pool_size = 1000
rgw_override_bucket_index_max_shards = 23
error_log_file = /var/log/radosgw/civetweb.error.log
access_log_file = /var/log/radosgw/civetweb.access.log
objecter_inflight_op_bytes = 1073741824
objecter_inflight_ops = 20480
ms_dispatch_throttle_bytes = 209715200
The gateways sit behind haproxy for SSL termination. Here is my
haproxy config:
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /var/lib/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private
    tune.ssl.default-dh-param 2048
    tune.ssl.maxrecord 2048
    ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
    ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    option forwardfor
    option http-server-close

frontend fourfourthree
    bind :443 ssl crt /etc/ssl/STAR.opensciencedatacloud.org.pem
    reqadd X-Forwarded-Proto:\ https
    default_backend radosgw

backend radosgw
    cookie RADOSGWLB insert indirect nocache
    server primary 127.0.0.1:80 check cookie primary
--------------------
I am seeing sporadic 500 errors in my access logs on all of my radosgws:
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.635645 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12607029248 stripe_ofs=12607029248 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.637559 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12611223552 stripe_ofs=12611223552 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.642630 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12614369280 stripe_ofs=12614369280 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.644368 7feadf6e6700 1 ====== req done req=0x7fed00053a50 http_status=500 ======
/var/log/radosgw/client.radosgw.log:2017-01-13 11:30:41.644475 7feadf6e6700 1 civetweb: 0x7fed00009340: 10.64.0.124 - - [13/Jan/2017:11:28:24 -0600] "GET /BUCKET/306d4fe1-1515-44e0-b527-eee0e83412bf/306d4fe1-1515-44e0-b527-eee0e83412bf_gdc_realn_rehead.bam HTTP/1.1" 500 0 - Boto/2.36.0 Python/2.7.6 Linux/3.13.0-95-generic
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.645611 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12618563584 stripe_ofs=12618563584 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.647998 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12622757888 stripe_ofs=12622757888 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.650262 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12626952192 stripe_ofs=12626952192 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.656394 7feacf6c6700 0 RGWObjManifest::operator++(): result: ofs=12630097920 stripe_ofs=12630097920 part_ofs=12630097920 rule->part_size=15728640
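In case it helps, here is roughly how I am tallying the 500s per hour out of the log (a minimal sketch in python; the sample lines and the count_500s_per_hour helper are mine, standing in for grepping the real /var/log/radosgw/client.radosgw.log):

```python
from collections import Counter

# Sample "req done" lines in the format shown above
# (date, time, thread id, level, message).
sample = """\
2017-01-13 11:30:41.644368 7feadf6e6700 1 ====== req done req=0x7fed00053a50 http_status=500 ======
2017-01-13 11:45:02.123456 7feadf6e6700 1 ====== req done req=0x7fed00053b60 http_status=500 ======
2017-01-13 12:02:13.654321 7feadf6e6700 1 ====== req done req=0x7fed00053c70 http_status=200 ======
"""

def count_500s_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of 500s."""
    hours = Counter()
    for line in lines:
        if "req done" in line and "http_status=500" in line:
            date, time = line.split()[:2]
            hours[date + " " + time[:2]] += 1
    return hours

print(count_500s_per_hour(sample.splitlines()))
# e.g. Counter({'2017-01-13 11': 2})
```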
I am able to download that file just fine locally using boto, but I have
heard from some users that the download occasionally hangs indefinitely. The
cluster has been healthy as far as I can tell (graphite shows health_ok) for
the entire period, so I am not sure why this is happening or how to
troubleshoot it further. rgw is clearly returning a 500, which to me suggests
an underlying issue with ceph or the rgw server itself, yet all of my own
boto downloads complete. Is there anything I can do to figure out where the
500 is coming from and troubleshoot further?
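The next thing I am planning to try is temporarily cranking up rgw logging on one gateway to catch a failing request in the verbose log, roughly this in its ceph.conf section (this is the standard debug bump as I understand it; the log volume gets large, so only for a short window):

[client.rgw.kh09-10]
debug rgw = 20
debug ms = 1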
--
- Sean: I wrote this. -
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com