[
https://issues.apache.org/jira/browse/HBASE-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-2774:
-------------------------
Attachment: sync-wait3.txt
This patch of Ryan's seems to fix the issue. No longer do I see a spike in
load.
The patch uses volatile longs in place of AtomicLong -- we don't need the
AtomicLong functionality -- and then does a wait/notify instead of spinning.
The spin counter is no longer used... should we remove it, or log extreme counts?
Let me try Todd's suggestion next.
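For illustration, here is a minimal sketch of the shape of that change -- a plain
volatile long read point guarded by wait/notify instead of a busy-spin on an
AtomicLong. Class and member names below are made up for the example; this is
not the sync-wait3.txt patch itself.
{code}
// Sketch only -- not the actual HBASE-2774 patch. Shows the general
// technique: a volatile long read point plus wait/notify in place of
// a CPU-burning spin loop on an AtomicLong.
public class ReadPointSketch {
  private volatile long readPoint = 0;      // plain volatile; no CAS needed
  private final Object readWaiters = new Object();

  // Writer side: advance the read point and wake any blocked readers.
  public void advanceTo(long newReadPoint) {
    synchronized (readWaiters) {
      if (newReadPoint > readPoint) {
        readPoint = newReadPoint;
      }
      readWaiters.notifyAll();
    }
  }

  // Reader side: instead of `while (readPoint < target) {}` eating CPU,
  // block on the monitor until the read point catches up.
  public void waitForReadPoint(long target) throws InterruptedException {
    synchronized (readWaiters) {
      while (readPoint < target) {       // loop guards against spurious wakeups
        readWaiters.wait();
      }
    }
  }
}
{code}
The key point is that a waiter sleeps on the monitor and is woken when the read
point advances, instead of burning a core re-checking the counter.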
> Spin in ReadWriteConsistencyControl eating CPU (load > 40) and no progress
> running YCSB on clean cluster startup
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-2774
> URL: https://issues.apache.org/jira/browse/HBASE-2774
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Attachments: sync-wait3.txt
>
>
> When I try to do a YCSB load, RSs will spin up massive load but make no
> progress. Seems to happen to each RS in turn until they do their first
> flush. They stay in the high-load mode for maybe 5-10 minutes or so and then
> fall out of the bad condition.
> Here is my ugly YCSB command (haven't gotten around to tidying it up yet):
> {code}
> $ java -cp
> build/ycsb.jar:/home/hadoop/current/conf/:/home/hadoop/current/hbase-0.21.0-SNAPSHOT.jar:/home/hadoop/current/lib/hadoop-core-0.20.3-append-r956776.jar:/home/hadoop/current/lib/zookeeper-3.3.1.jar:/home/hadoop/current/lib/commons-logging-1.1.1.jar:/home/hadoop/current/lib/log4j-1.2.15.jar
> com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient -P
> workloads/5050 -p columnfamily=values -s -threads 100 -p recordcount=10000000
> {code}
> Cluster is 5 regionservers NOT running hadoop-core-0.20.3-append-r956776 but
> rather an old head of branch-0.20 hadoop.
> It seems that it's easy to repro if you start fresh. It might happen later in
> loading, but it seems as though after the first flush we're OK.
> It comes on almost immediately. The server that is taking the upload sees
> its load climb gradually into the 40s and stay there. Later it
> falls when the condition clears.
> Here is the content of my YCSB workload file:
> {code}
> recordcount=100000000
> operationcount=100000000
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> readallfields=true
> readproportion=0.5
> updateproportion=0.5
> scanproportion=0
> insertproportion=0
> requestdistribution=zipfian
> {code}
> Here is my hbase-site.xml:
> {code}
> <property>
>   <name>hbase.regions.slop</name>
>   <value>0.01</value>
>   <description>Rebalance if regionserver has average + (average * slop)
>   regions. Default is 30% slop.
>   </description>
> </property>
> <property>
>   <name>hbase.zookeeper.quorum</name>
>   <value>XXXXXXXXX</value>
> </property>
> <property>
>   <name>hbase.regionserver.hlog.blocksize</name>
>   <value>67108864</value>
>   <description>Block size for HLog files. To minimize potential data loss,
>   the size should be (avg key length) * (avg value length) * flushlogentries.
>   Default 1MB.
>   </description>
> </property>
> <property>
>   <name>hbase.hstore.blockingStoreFiles</name>
>   <value>25</value>
> </property>
> <property>
>   <name>hbase.rootdir</name>
>   <value>hdfs://svXXXXXX:9000/hbase</value>
>   <description>The directory shared by region servers.</description>
> </property>
> <property>
>   <name>hbase.cluster.distributed</name>
>   <value>true</value>
> </property>
> <property>
>   <name>zookeeper.znode.parent</name>
>   <value>/stack</value>
>   <description>The path in ZooKeeper for this cluster.</description>
> </property>
> <property>
>   <name>hfile.block.cache.size</name>
>   <value>0.2</value>
>   <description>The size of the block cache used by HFile/StoreFile.
>   Set to 0 to disable.
>   </description>
> </property>
> <property>
>   <name>hbase.hregion.memstore.block.multiplier</name>
>   <value>8</value>
>   <description>Block updates if memcache has hbase.hregion.block.memcache
>   times hbase.hregion.flush.size bytes. Useful for preventing runaway
>   memcache during spikes in update traffic. Without an upper bound,
>   memcache fills such that when it flushes, the resultant flush files take
>   a long time to compact or split, or worse, we OOME.
>   </description>
> </property>
> <property>
>   <name>zookeeper.session.timeout</name>
>   <value>60000</value>
> </property>
> <property>
>   <name>hbase.regionserver.handler.count</name>
>   <value>60</value>
>   <description>Count of RPC Server instances spun up on RegionServers.
>   The same property is used by the HMaster for the count of master handlers.
>   Default is 10.
>   </description>
> </property>
> <property>
>   <name>hbase.regions.percheckin</name>
>   <value>20</value>
> </property>
> <property>
>   <name>hbase.regionserver.maxlogs</name>
>   <value>128</value>
> </property>
> <property>
>   <name>hbase.regionserver.logroll.multiplier</name>
>   <value>2.95</value>
> </property>
> {code}