New question #660458 on Graphite: https://answers.launchpad.net/graphite/+question/660458
When we query graphite-web, some values that should come from the cache are returned as null, while other values are retrieved from the cache and displayed correctly. Once everything has been flushed to disk, all values appear. Both members of the cluster exhibit the same behavior.

We have a new Graphite cluster of two servers. Each server runs carbon-relay, one carbon-cache instance, and graphite-web, and the servers sit behind a load balancer. We are using consistent-hashing with a replication factor of 2 so that both servers hold the same data.

Carbon-cache settles between 100k and 300k metrics; the fewer metrics there are in the cache, the fewer values are missing. Restarting carbon-cache masks the issue for a few hours while the cache grows. We are writing ~275k metrics per minute.

The server referenced below uploads metrics every 5 seconds, matching what is set in storage-schemes.conf. We have other servers that upload metrics every 30 seconds, and the behavior is similar.

Below are some of the configs and logs illustrating the issue. If any other info is needed please let me know. Any assistance is greatly appreciated.

tail -f query.log

07/11/2017 12:40:43 :: [127.0.0.1:36494] cache query for "web.servers.web.HOST1.perf.processor.pct_processor_time" returned 7 values

In the render output below, the oldest ~16 datapoints have been persisted to disk and all show up correctly in graphite-web. They are followed by ~21 null datapoints that are not being pulled from the cache, although I presume they are there, since they get written to disk a few moments later. Finally, the newest 7 datapoints are pulled from the cache and shown in graphite-web correctly.
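
For anyone trying to reproduce this, here is a minimal sketch (not part of the setup described above) that splits a render response into consecutive runs of null and non-null datapoints, so the size and position of the gap can be read off directly. The function name null_runs is hypothetical, and the datapoints are rebuilt from the excerpt quoted in this question, which is only part of the full response.

```python
from itertools import groupby


def null_runs(datapoints):
    """Group [value, timestamp] pairs into consecutive runs.

    Returns a list of (is_null, count, first_timestamp, last_timestamp).
    """
    runs = []
    for is_null, group in groupby(datapoints, key=lambda p: p[0] is None):
        points = list(group)
        runs.append((is_null, len(points), points[0][1], points[-1][1]))
    return runs


# Values copied from the quoted render excerpt; timestamps are 5 seconds apart,
# starting at 1510079840. JSON null becomes None once parsed in Python.
values = (
    [7.0, 6.0, 4.0, 5.0, 5.0, 8.0, 5.0, 6.0, 9.0, 5.0, 4.0, 3.0, 4.0, 3.0]
    + [None] * 20
    + [4.0, 8.0, 6.0, 6.0, 4.0, 5.0, 4.0]
)
datapoints = [[v, 1510079840 + 5 * i] for i, v in enumerate(values)]

for is_null, count, first_ts, last_ts in null_runs(datapoints):
    label = "null" if is_null else "data"
    print(f"{label}: {count} points, {first_ts} to {last_ts}")
```

With a live server, the same function could be fed the "datapoints" list of a series parsed from the format=json render response instead of the hard-coded excerpt.
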

curl "http://SERVER1/render/?target=web.servers.web.HOST1.perf.processor.pct_processor_time&format=json"

[7.0, 1510079840], [6.0, 1510079845], [4.0, 1510079850], [5.0, 1510079855],
[5.0, 1510079860], [8.0, 1510079865], [5.0, 1510079870], [6.0, 1510079875],
[9.0, 1510079880], [5.0, 1510079885], [4.0, 1510079890], [3.0, 1510079895],
[4.0, 1510079900], [3.0, 1510079905], [null, 1510079910], [null, 1510079915],
[null, 1510079920], [null, 1510079925], [null, 1510079930], [null, 1510079935],
[null, 1510079940], [null, 1510079945], [null, 1510079950], [null, 1510079955],
[null, 1510079960], [null, 1510079965], [null, 1510079970], [null, 1510079975],
[null, 1510079980], [null, 1510079985], [null, 1510079990], [null, 1510079995],
[null, 1510080000], [null, 1510080005], [4.0, 1510080010], [8.0, 1510080015],
[6.0, 1510080020], [6.0, 1510080025], [4.0, 1510080030], [5.0, 1510080035],
[4.0, 1510080040]],

python ~/whisper-info.py /opt/graphite/storage/whisper/web/servers/web/HOST1/perf/processor/pct_processor_time.wsp

maxRetention: 63072000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 10172212

Archive 0
retention: 2592000
secondsPerPoint: 5
points: 518400
size: 6220800
offset: 52

Archive 1
retention: 15552000
secondsPerPoint: 60
points: 259200
size: 3110400
offset: 6220852

Archive 2
retention: 63072000
secondsPerPoint: 900
points: 70080
size: 840960
offset: 9331252

carbon.conf

[cache]
DATABASE = whisper
ENABLE_LOGROTATION = True
USER =
MAX_CACHE_SIZE = 5000000
MAX_UPDATES_PER_SECOND = 750
MAX_CREATES_PER_MINUTE = 500
MIN_TIMESTAMP_RESOLUTION = 1
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2004
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2004
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2005
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CREATES = False
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60
GRAPHITE_URL = http://127.0.0.1:80

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = 127.0.0.1:2005, 10.1.1.12:2005
MAX_QUEUE_SIZE = 100000
MAX_DATAPOINTS_PER_MESSAGE = 500
QUEUE_LOW_WATERMARK_PCT = 0.8
TIME_TO_DEFER_SENDING = 0.0001
USE_FLOW_CONTROL = True
USE_RATIO_RESET=False
MIN_RESET_STAT_FLOW=1000
MIN_RESET_RATIO=0.9
MIN_RESET_INTERVAL=121

[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

local_settings.py

SECRET_KEY = 'Edited'
CLUSTER_SERVERS = ["10.1.1.12:80"]
REMOTE_FIND_TIMEOUT = 3.0   # Timeout for metric find requests
REMOTE_FETCH_TIMEOUT = 3.0  # Timeout to fetch series data
REMOTE_RETRY_DELAY = 60.0   # Time before retrying a failed remote webapp

--
You received this question notification because your team graphite-dev
is an answer contact for Graphite.

_______________________________________________
Mailing list: https://launchpad.net/~graphite-dev
Post to     : graphite-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~graphite-dev
More help   : https://help.launchpad.net/ListHelp