[GitHub] [druid] tanisdlj opened a new issue #11840: Coordinator reports full or missing tier nodes to assign segments incorrectly, failing to load segments

GitBox Mon, 25 Oct 2021 02:27:54 -0700


tanisdlj opened a new issue #11840:
URL: https://github.com/apache/druid/issues/11840



   ### Affected Version
   
   0.22.2
   
   ### Description
   
   - Cluster size: 2 brokers, 2 routers, 2 coordinators, 37 historicals (15 hot, 21 cold, 1 frozen), 2 overlords, 43 middlemanagers
   - Configurations in use: Every datasource that stores data in the Frozen tier has the following retention rules:
   ```
   [
     {"type":"loadByPeriod","period":"P2M","tieredReplicants":{"stde2-hot":2}},
     {"type":"loadByPeriod","period":"P14M","tieredReplicants":{"stde2-cold":1}},
     {"type":"loadForever","tieredReplicants":{"stde2-frozen":1}}
   ]
   ```
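
For reference, Druid applies the first retention rule in the list that matches a segment, so segments from May 2020 should fall through both `loadByPeriod` rules and land on `loadForever` in `stde2-frozen`. The following is a minimal sketch of that evaluation, not Druid's implementation. It assumes a period rule matches when the segment interval overlaps the window ending at the current time, and it simplifies month arithmetic. The names `RULES`, `minus_months`, and `first_matching_rule` are hypothetical.

```python
from datetime import datetime

# Rules copied from the retention config above; ISO periods reduced to month counts.
RULES = [
    {"type": "loadByPeriod", "months": 2, "tieredReplicants": {"stde2-hot": 2}},
    {"type": "loadByPeriod", "months": 14, "tieredReplicants": {"stde2-cold": 1}},
    {"type": "loadForever", "tieredReplicants": {"stde2-frozen": 1}},
]


def minus_months(ts, months):
    """Subtract whole months from ts (day clamped to 28 to keep the sketch simple)."""
    total = ts.year * 12 + (ts.month - 1) - months
    return ts.replace(year=total // 12, month=total % 12 + 1, day=min(ts.day, 28))


def first_matching_rule(seg_start, seg_end, now):
    """Return the first rule that applies to the segment interval, or None."""
    for rule in RULES:
        if rule["type"] == "loadForever":
            return rule
        window_start = minus_months(now, rule["months"])
        # Assumption: a period rule matches when the segment overlaps [window_start, now].
        if seg_start < now and seg_end > window_start:
            return rule
    return None


now = datetime(2021, 10, 25, 8, 51, 44)
old = first_matching_rule(datetime(2020, 5, 7, 17), datetime(2020, 5, 7, 18), now)
print(old["tieredReplicants"])
```

With the timestamps from this report, the segment interval ends well before the start of the 14-month window, so only the `loadForever` rule applies.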
   
   Historical (Frozen) config:
   ```
   druid.service=druid/historical
   druid.plaintextPort=8083
   
   druid.server.tier=stde2-frozen
   druid.server.http.numThreads=90
   
   druid.processing.buffer.sizeBytes=1GiB
   druid.processing.numThreads=54
   druid.processing.numMergeBuffers=3
   
   druid.segmentCache.locations=[{"path":"/druid/segment-cache","maxSize":"16T","freeSpacePercent":1.0}]
   druid.segmentCache.lazyLoadOnStart=false
   druid.segmentCache.numLoadingThreads=128
   druid.segmentCache.numBootstrapThreads=128
   
   druid.historical.cache.useCache=true
   druid.historical.cache.populateCache=true
   druid.cache.type=caffeine
   druid.cache.sizeInBytes=1GiB
   
   druid.query.vectorize=true
   druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor","org.apache.druid.server.metrics.HistoricalMetricsMonitor"]
   ```
   
   Coordinator config:
   ```
   druid.service=druid/coordinator
   druid.plaintextPort=8081
   druid.coordinator.startDelay=PT300S
   druid.coordinator.period=PT60S
   druid.coordinator.kill.on=true
   druid.coordinator.kill.maxSegments=100
   druid.coordinator.kill.durationToRetain=P7D
   druid.serverview.type=http
   druid.coordinator.loadqueuepeon.type=http
   druid.coordinator.loadqueuepeon.http.batchSize=56
   druid.coordinator.loadqueuepeon.curator.numCallbackThreads=200
   druid.coordinator.balancer.strategy=cachingCost
   druid.coordinator.balancer.cachingCost.awaitInitialization=true
   maxSegmentsInNodeLoadingQueue=1000
   druid.announcer.type=http
   ```
   
   - Steps to reproduce the problem: Set up a single server in a new tier that should hold old data and expect it to load all the segments assigned to it. It never does.
   
   - The error message or stack traces encountered:
   
   Coordinator log (the same two messages go on and on forever):
   ```
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 WARN [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - No available [stde2-frozen] servers or node capacity to assign primary segment[datasource_2020-05-07T17:00:00.000Z_2020-05-07T18:00:00.000Z_2020-05-07T17:00:00.025Z_104]! Expected Replicants[1]
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 WARN [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - No available [stde2-frozen] servers or node capacity to assign primary segment[datasource_2020-05-07T17:00:00.000Z_2020-05-07T18:00:00.000Z_2020-05-07T17:00:00.025Z_103]! Expected Replicants[1]
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   Oct 25 08:51:44 druid-master-1 java[19246]: 2021-10-25T08:51:44,668 WARN [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - No available [stde2-frozen] servers or node capacity to assign primary segment[datasource_2020-05-07T17:00:00.000Z_2020-05-07T18:00:00.000Z_2020-05-07T17:00:00.025Z_102]! Expected Replicants[1]
   ```
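
Because the same warning repeats on every coordinator run, it can help to reduce the log to the distinct segments that could not be assigned. This is a minimal sketch that matches the WARN message format shown in the log above, assuming each log entry sits on a single line. The names `PATTERN` and `unassigned_segments` are hypothetical.

```python
import re
from collections import Counter

# Matches the LoadRule warning text exactly as it appears in the coordinator log above.
PATTERN = re.compile(
    r"No available \[(?P<tier>[^\]]+)\] servers or node capacity "
    r"to assign primary segment\[(?P<segment>[^\]]+)\]! "
    r"Expected Replicants\[(?P<replicants>\d+)\]"
)


def unassigned_segments(lines):
    """Count warnings per (tier, segment id) across the given log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("tier"), match.group("segment"))] += 1
    return counts
```

Feeding it the coordinator log, for example through `unassigned_segments(open(path))` with `path` pointing at the log file, gives one entry per unassigned segment together with how often the warning was emitted.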
   
   Log of the "Frozen" historical (always the same too)
   ```
   Oct 25 09:19:57 stde2-hfrozen-01.stde2 java[50891]: 2021-10-25T09:19:57,860 INFO [NamespaceExtractionCacheManager-0] org.apache.druid.server.lookup.namespace.UriCacheGenerator - Finished loading 9,041 values from 9,041 lines for [namespace [UriExtractionNamespace{uri=file:///usr/share/druid/lookups/xyz.json, uriPrefix=null, namespaceParseSpec=JSONFlatDataParser{keyFieldName='token', valueFieldName='categoryName'}, fileRegex='null', pollPeriod=PT30M}] : org.apache.druid.server.lookup.namespace.cache.CacheScheduler$EntryImpl@2b10ee1d] in 35,353,269 ns
   Oct 25 09:19:57 stde2-hfrozen-01.stde2 java[50891]: 2021-10-25T09:19:57,868 INFO [NamespaceExtractionCacheManager-1] org.apache.druid.server.lookup.namespace.UriCacheGenerator - Finished loading 10,000 values from 10,000 lines for [namespace [UriExtractionNamespace{uri=file:///usr/share/druid/lookups/abc.json, uriPrefix=null, namespaceParseSpec=JSONFlatDataParser{keyFieldName='id', valueFieldName='size'}, fileRegex='null', pollPeriod=PT30M}] : org.apache.druid.server.lookup.namespace.cache.CacheScheduler$EntryImpl@46ebec02] in 43,116,107 ns
   ```
   
   Frozen disk free:
   ```
   ~$ df -h
   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sdb2       439G  5.8G  411G   2% /
   /dev/sdb1       488M   95M  368M  21% /boot
   /dev/md0         15T  5.8T  8.9T  40% /druid/segment-cache
   ```
   Frozen disk free as reported by the Druid console:
   
   ![image](https://user-images.githubusercontent.com/1453135/138670348-83430c93-e13e-4628-a732-23ec24342cc1.png)
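
To compare the disk usage with what the Coordinator believes, per-tier capacity can also be computed from the Coordinator API instead of the console. This is a minimal sketch, assuming the response of `GET /druid/coordinator/v1/servers?simple` is a JSON list of objects that carry `tier`, `currSize`, and `maxSize` in bytes. The function name `tier_capacity` is hypothetical, and the sample values are illustrative only, not taken from this cluster.

```python
def tier_capacity(servers):
    """Sum currSize and maxSize per tier and derive the remaining free bytes."""
    totals = {}
    for server in servers:
        tier = totals.setdefault(server["tier"], {"currSize": 0, "maxSize": 0})
        tier["currSize"] += server["currSize"]
        tier["maxSize"] += server["maxSize"]
    for tier in totals.values():
        tier["free"] = tier["maxSize"] - tier["currSize"]
    return totals


# Illustrative input shaped like the assumed API response (values are made up).
sample = [
    {"host": "frozen-1:8083", "tier": "stde2-frozen", "currSize": 6_000, "maxSize": 16_000},
    {"host": "cold-1:8083", "tier": "stde2-cold", "currSize": 3_000, "maxSize": 4_000},
    {"host": "cold-2:8083", "tier": "stde2-cold", "currSize": 1_000, "maxSize": 4_000},
]
print(tier_capacity(sample))
```

A tier whose `free` value is positive while the Coordinator still logs "No available ... servers or node capacity" would point at something other than disk space.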
   
   Frozen historical hardware:
   - AMD EPYC 7502P 32 Cores "Rome"
   - 256 GB RAM DDR4 ECC
   - 2 x 16 TB HDD (RAID 1) + 1 SSD root disk (480 GB)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


