Right now, we have 6 flush writers and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling.
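For reference, the two settings in question live in cassandra.yaml. This is a sketch of the values as described above, not a copy of the actual file:

```yaml
# cassandra.yaml -- sketch of the settings described above (assumed, not our actual file)
memtable_flush_writers: 6            # number of concurrent memtable flush threads
compaction_throughput_mb_per_sec: 0  # 0 disables compaction throttling entirely
```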
Also, here is the iostat -x 5 5 output:

Device:  rrqm/s  wrqm/s    r/s     w/s      rsec/s    wsec/s    avgrq-sz  avgqu-sz  await  svctm  %util
sda      10.00   1450.35   50.79   55.92    9775.97   12030.14  204.34    1.56      14.62  1.05   11.21
dm-0     0.00    0.00      3.59    18.82    166.52    150.35    14.14     0.44      19.49  0.54   1.22
dm-1     0.00    0.00      2.32    5.37     18.56     42.98     8.00      0.76      98.82  0.43   0.33
dm-2     0.00    0.00      162.17  5836.66  32714.46  47040.87  13.30     5.57      0.90   0.06   36.00
sdb      0.40    4251.90   106.72  107.35   23123.61  35204.09  272.46    4.43      20.68  1.29   27.64

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          14.64  10.75  1.81     13.50    0.00    59.29

Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          15.51  19.77  1.97     5.02     0.00    57.73

Device:  rrqm/s  rqm/s    r/s     w/s      rsec/s    wsec/s    avgrq-sz  avgqu-sz  await   svctm  %util
sda      16.20   523.40   60.00   285.00   5220.80   5913.60   32.27     0.25      0.72    0.60   20.86
dm-0     0.00    0.00     0.80    1.40     32.00     11.20     19.64     0.01      3.18    1.55   0.34
dm-1     0.00    0.00     1.60    0.00     12.80     0.00      8.00      0.03      21.00   2.62   0.42
dm-2     0.00    0.00     339.40  5886.80  66219.20  47092.80  18.20     251.66    184.72  0.10   63.48
sdb      1.00    5025.40  264.20  209.20   60992.00  50422.40  235.35    5.98      40.92   1.23   58.28

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          16.59  16.34  2.03     9.01     0.00    56.04

Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52   0.39   7.76
dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68   0.68   0.26
dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00   0.00   0.00
dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82  0.06   76.96
sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75  0.97   75.08


On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy <mark.re...@boxever.com> wrote:

> Hi Ruchir,
>
> The large number of blocked flushes and the number of pending compactions
> would still indicate IO contention. Can you post the output of
> 'iostat -x 5 5'?
>
> If you do in fact have spare IO, there are several configuration options
> you can tune, such as increasing the number of flush writers and
> compaction_throughput_mb_per_sec.
>
> Mark
>
>
> On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>
>> Also, Mark, to your comment on my tpstats output: below is my iostat
>> output, and iowait is at 4.59%, which means no IO pressure, but we are
>> still seeing the bad flush performance. Should we try increasing the
>> flush writers?
>>
>> Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp)  08/05/2014  _x86_64_  (24 CPU)
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           5.80   10.25  0.65     4.59     0.00    78.72
>>
>> Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read     Blk_wrtn
>> sda      103.83   9630.62     11982.60    3231174328   4020290310
>> dm-0     13.57    160.17      81.12       53739546     27217432
>> dm-1     7.59     16.94       43.77       5682200      14686784
>> dm-2     5792.76  32242.66    45427.12    10817753530  15241278360
>> sdb      206.09   22789.19    33569.27    7646015080   11262843224
>>
>>
>> On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>
>>> nodetool status:
>>>
>>> Datacenter: datacenter1
>>> =======================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
>>> UN  10.10.20.27  1.89 TB  256     25.4%             76023cdd-c42d-4068-8b53-ae94584b8b04  rack1
>>> UN  10.10.20.62  1.83 TB  256     25.5%             84b47313-da75-4519-94f3-3951d554a3e5  rack1
>>> UN  10.10.20.47  1.87 TB  256     24.7%             bcd51a92-3150-41ae-9c51-104ea154f6fa  rack1
>>> UN  10.10.20.45  1.7 TB   256     22.6%             8d6bce33-8179-4660-8443-2cf822074ca4  rack1
>>> UN  10.10.20.15  1.86 TB  256     24.5%             01a01f07-4df2-4c87-98e9-8dd38b3e4aee  rack1
>>> UN  10.10.20.31  1.87 TB  256     24.9%             1435acf9-c64d-4bcd-b6a4-abcec209815e  rack1
>>> UN  10.10.20.35  1.86 TB  256     25.8%             17cb8772-2444-46ff-8525-33746514727d  rack1
>>> UN  10.10.20.51  1.89 TB  256     25.0%             0343cd58-3686-465f-8280-56fb72d161e2  rack1
>>> UN  10.10.20.19  1.91 TB  256     25.5%             30ddf003-4d59-4a3e-85fa-e94e4adba1cb  rack1
>>> UN  10.10.20.39  1.93 TB  256     26.0%             b7d44c26-4d75-4d36-a779-b7e7bdaecbc9  rack1
>>> UN  10.10.20.52  1.81 TB  256     25.4%             6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e  rack1
>>> UN  10.10.20.22  1.89 TB  256     24.8%             46af9664-8975-4c91-847f-3f7b8f8d5ce2  rack1
>>>
>>> Note: The new node is not part of the above list.
>>>
>>> nodetool compactionstats:
>>>
>>> pending tasks: 1649
>>> compaction type  keyspace  column family              completed   total         unit   progress
>>> Compaction       iprod     customerorder              1682804084  17956558077   bytes  9.37%
>>> Compaction       prodgatecustomerorder                1664239271  1693502275    bytes  98.27%
>>> Compaction       qa_config_bkupfixsessionconfig_hist  2443        27253         bytes  8.96%
>>> Compaction       prodgatecustomerorder_hist           1770577280  5026699390    bytes  35.22%
>>> Compaction       iprodgatecustomerorder_hist          2959560205  312350192622  bytes  0.95%
>>>
>>>
>>> On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>
>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes
>>>>> including the new one.
>>>>
>>>> Ok, so you have num_tokens set to 256 for all nodes with initial_token
>>>> commented out; this means you are using vnodes and the new node will
>>>> automatically grab a list of tokens to take over responsibility for.
>>>>
>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>> FlushWriter  0       0        1136       0        512
>>>>>
>>>>> Looks like about 50% of flushes are blocked.
>>>>
>>>> This is a problem, as it indicates that the IO system cannot keep up.
>>>>
>>>>> Just ran this on the new node:
>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>> 10
>>>>
>>>> This is normal, as the new node will most likely take tokens from all
>>>> nodes in the cluster.
>>>>
>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>> other existing nodes have themselves in the seeds list, but the new node
>>>>> does not have itself in the seeds list. Can that cause this issue?
>>>>
>>>> Seeds are only used when a new node is bootstrapping into the cluster
>>>> and needs a set of IPs to contact and discover the cluster, so this would
>>>> have no impact on data sizes or streaming. In general it would be
>>>> considered best practice to have a set of 2-3 seeds from each data center,
>>>> with all nodes having the same seed list.
>>>>
>>>> What is the current output of 'nodetool compactionstats'? Could you
>>>> also paste the output of 'nodetool status <keyspace>'?
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>
>>>>> Sorry for the multiple updates, but another thing I found was all the
>>>>> other existing nodes have themselves in the seeds list, but the new node
>>>>> does not have itself in the seeds list. Can that cause this issue?
>>>>>
>>>>>
>>>>> On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>
>>>>>> Just ran this on the new node:
>>>>>>
>>>>>> nodetool netstats | grep "Streaming from" | wc -l
>>>>>> 10
>>>>>>
>>>>>> Seems like the new node is receiving data from 10 other nodes. Is
>>>>>> that expected in a vnodes-enabled environment?
>>>>>>
>>>>>> Ruchir.
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>
>>>>>>> Also, not sure if this is relevant, but I just noticed the nodetool
>>>>>>> tpstats output:
>>>>>>>
>>>>>>> Pool Name    Active  Pending  Completed  Blocked  All time blocked
>>>>>>> FlushWriter  0       0        1136       0        512
>>>>>>>
>>>>>>> Looks like about 50% of flushes are blocked.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, num_tokens is set to 256. initial_token is blank on all nodes
>>>>>>>> including the new one.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>>>>>>
>>>>>>>>>> My understanding was that if initial_token is left empty on the
>>>>>>>>>> new node, it just contacts the heaviest node and bisects its token
>>>>>>>>>> range.
>>>>>>>>>
>>>>>>>>> If you are using vnodes and you have num_tokens set to 256, the new
>>>>>>>>> node will take token ranges dynamically. What is the configuration of
>>>>>>>>> your other nodes; are you setting num_tokens or initial_token on those?
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Patricia for your response!
>>>>>>>>>>
>>>>>>>>>> On the new node, I just see a lot of the following:
>>>>>>>>>>
>>>>>>>>>> INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable
>>>>>>>>>> INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to
>>>>>>>>>>
>>>>>>>>>> So basically it is just busy flushing and compacting. Would you
>>>>>>>>>> have any idea why the disk usage has blown up 2x?
>>>>>>>>>> My understanding was that if initial_token is left empty on the
>>>>>>>>>> new node, it just contacts the heaviest node and bisects its token
>>>>>>>>>> range. And the heaviest node is around 2.1 TB, while the new node is
>>>>>>>>>> already at 4 TB. Could this be because compaction is falling behind?
>>>>>>>>>>
>>>>>>>>>> Ruchir
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla <patri...@thelastpickle.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ruchir,
>>>>>>>>>>>
>>>>>>>>>>> What exactly are you seeing in the logs? Are you running major
>>>>>>>>>>> compactions on the new bootstrapping node?
>>>>>>>>>>>
>>>>>>>>>>> With respect to the seed list, it is generally advisable to use
>>>>>>>>>>> 3 seed nodes per AZ / DC.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha <ruchir....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am trying to bootstrap the thirteenth node in a 12-node
>>>>>>>>>>>> cluster where the average data size per node is about 2.1 TB. The
>>>>>>>>>>>> bootstrap streaming has been going on for 2 days now, and the disk
>>>>>>>>>>>> usage on the new node is already above 4 TB and still growing. Is
>>>>>>>>>>>> this because the new node is running major compactions while the
>>>>>>>>>>>> streaming is going on?
>>>>>>>>>>>>
>>>>>>>>>>>> One thing that I noticed that seemed off was that the seeds
>>>>>>>>>>>> property in the yaml of the 13th node comprises nodes 1 through 12,
>>>>>>>>>>>> whereas the seeds property on the existing 12 nodes consists of all
>>>>>>>>>>>> the other nodes except the thirteenth node. Is this an issue?
>>>>>>>>>>>>
>>>>>>>>>>>> Any other insight is appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Ruchir.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Patricia Gorla
>>>>>>>>>>> @patriciagorla
>>>>>>>>>>>
>>>>>>>>>>> Consultant
>>>>>>>>>>> Apache Cassandra Consulting
>>>>>>>>>>> http://www.thelastpickle.com
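[Editor's note: the two back-of-the-envelope reads in the thread above (the "about 50% of flushes are blocked" estimate from tpstats, and spotting saturated devices in the iostat -x samples) can be checked with a short script. This is an illustrative sketch, not part of the original thread; the helper names are hypothetical and the field positions assume the iostat -x layout shown in the emails.]

```python
def flush_blocked_ratio(completed, all_time_blocked):
    """Fraction of flushes blocked, relative to completed flushes
    (the rough estimate quoted in the thread)."""
    return all_time_blocked / completed

def saturated_devices(iostat_rows, util_threshold=80.0):
    """Return device names whose %util (last column of an iostat -x
    device row) exceeds the threshold."""
    hot = []
    for row in iostat_rows:
        fields = row.split()
        device, util = fields[0], float(fields[-1])
        if util > util_threshold:
            hot.append(device)
    return hot

# FlushWriter: 1136 completed, 512 all-time blocked -> ~45%, i.e. "about 50%"
print(round(flush_blocked_ratio(1136, 512), 2))   # 0.45

# Second iostat sample from the thread: dm-2 and sdb are near saturation
sample = [
    "sda 15.40 1344.60 68.80 145.60 4964.80 11790.40 78.15 0.38 1.80 0.80 17.10",
    "dm-2 0.00 0.00 197.20 17583.80 35152.00 140664.00 9.89 2847.50 109.52 0.05 93.50",
    "sdb 13.20 16552.20 159.00 742.20 32745.60 129129.60 179.62 72.88 66.01 1.04 93.42",
]
print(saturated_devices(sample))   # ['dm-2', 'sdb']
```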