> I'm sorry, but I don't understand. Of course I have disk and network
> saturation, and the flush stops flushing because it is waiting for
> compaction to finish. Since a major compaction was triggered, all the
> stores (a large number) present on the disks (7 disks per RS) will be
> grabbed for major compaction, and the I/O is affected. The network is
> also affected, since everything is major compacting and replicating
> files at the same time (1GB network).
When you have an IO problem, there are multiple pieces at play that you
can adjust:

Write: HLog, Flush, Compaction
Read: Point Query, Scan

If your writes far outnumber your reads, then you should relax one of the
write pieces.

- HLog: You can't really adjust HLog IO outside of key compression
  (HBASE-4608).
- Flush: You can adjust your compression. None->LZO == 5x compression.
  LZO->GZ == 2x compression. Both are at the expense of CPU. HBASE-4241
  minimizes flush IO significantly in the update-heavy use case (I
  discussed this in the last email).
- Compaction: You can lower the compaction ratio to minimize the amount
  of rewrites over time. That's why I suggested changing the ratio from
  1.2 -> 0.25. This gives a ~50% IO reduction (blog post on this
  forthcoming @ http://www.facebook.com/UsingHBase ).

However, you may have a lot more reads than you think. For example, let's
say your read:write ratio is 1:10, so significantly write dominated.
Without any of the optimizations I listed in the previous email, your real
read ratio is multiplied by the StoreFile count (because you naively read
all StoreFiles). So let's say that, during congestion, you have 20
StoreFiles. 1*20:10 means that you're now 2:1 read dominated. You need
features to reduce the number of StoreFiles you scan when the StoreFile
count is high.

- Point Query: bloom filters (HBASE-1200, HBASE-2794), lazy seek
  (HBASE-4465), and seek optimizations (HBASE-4433, HBASE-4434,
  HBASE-4469, HBASE-4532).
- Scan: not as many optimizations here. They mostly revolve around proper
  usage & seek-next optimization when using filters. I don't have JIRA
  numbers here, but probably a half-dozen small tweaks were added to 0.92.

> I don't have an increment workload (the workload either updates columns
> on a CF or adds columns on a CF for the same key), so how will those
> patches help?

Increment & read->update workloads end up picking up roughly the same
optimizations. Adding a column to an existing row is no different from
adding a new row as far as the optimizations are concerned, because there
is nothing to de-dupe.

> I don't say this is a bad thing, this is just an observation from our
> test: HBase will slow down the flush when too many store files are
> present, which adds pressure on GC and memory, affecting performance.
> The update workload does not send all the row content for a given key,
> so only partial data is written. In order to get the whole row, I
> presume that reading the newest store is not enough ("all" stores need
> to be read, collecting the most up-to-date fields to rebuild a full
> row), or am I missing something?

Reading all row columns is the same as doing a scan. You're not doing a
point query if you don't specify the exact key (columns) you're looking
for. Setting versions to unlimited and then getting all versions of a
particular ROW+COL would also be considered a scan rather than a point
query as far as the optimizations are concerned.
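To make the point-query vs. scan distinction concrete, here is a minimal
sketch, assuming the 0.92-era Java client API; the table "usertable",
family "cf", and qualifier "field7" are made-up names for illustration,
not anything from your schema:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointQueryVsRowRead {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "usertable");

        // Point query: the exact ROW+COL is named, so bloom filters and the
        // lazy-seek work can rule out StoreFiles that can't hold the cell.
        Get point = new Get(Bytes.toBytes("row1"));
        point.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field7"));
        Result cell = table.get(point);

        // "Give me the whole row": no qualifier is named, so the read can't
        // be narrowed to a single column and, optimization-wise, it behaves
        // like a scan over the store's files rather than a point query.
        Get wholeRow = new Get(Bytes.toBytes("row1"));
        wholeRow.addFamily(Bytes.toBytes("cf"));
        Result row = table.get(wholeRow);

        table.close();
      }
    }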
> 1. If I did not set a specific property for bloom filters (BF), does it
> mean that I'm not using them (the book only refers to BF with regards to
> CF)?

By default, bloom filters are disabled, so you need to enable them to get
the optimizations. This is by design: bloom filters trade off cache space
for low-overhead probabilistic queries. The default is 8 bytes per bloom
entry (key) & a 1% false positive rate. You can use
'bin/hbase org.apache.hadoop.hbase.io.hfile.HFile' (look at the help, then
-f to specify a StoreFile and -m for meta info) to see your StoreFile's
average KV size. If size(KV) == 100 bytes, then blooms use 8% of the space
in cache, which is better than loading the StoreFile block only to get a
miss.

Whether to use a ROW or ROWCOL bloom filter depends on your write & read
pattern. If you read the entire row at a time, use a ROW bloom. If you
point query, ROW and ROWCOL are both options. If you write all columns for
a row at the same time, definitely use a ROW bloom. If you have a small
column range and you update the columns at different rates/times, then a
ROWCOL bloom filter may be more helpful. ROWCOL is really useful when a
scan query for a ROW will normally return results, but a point query for a
ROWCOL may have a high miss rate. A perfect example is storing unique hash
values for a user on disk: you'd use 'user' as the row & the hash as the
column. In most instances the hash won't be a duplicate, so a ROWCOL bloom
would be better. (There's a rough schema sketch below.)

> 3. How can we ensure that compaction will not suck up too much I/O if we
> cannot control major compaction?

TCP congestion control will ensure that a single TCP socket won't consume
too much bandwidth, so that part of compactions is handled automatically.
The part that you need to handle is the number of simultaneous TCP sockets
(currently 1, until multi-threaded compactions) & the aggregate data
volume transferred over time. As I said, this is controlled by the
compaction ratio. If temporarily high StoreFile counts cause you to
bottleneck, the slight latency variance is an annoyance of the current
compaction algorithm, but the underlying problem you should be looking at
solving is the system's inability to filter out the unnecessary
StoreFiles.
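Since question 1 was about where the bloom filter property lives: it is a
per-column-family attribute. Here is a minimal sketch of setting it at
table creation time, assuming the 0.92-era Java admin API ("usertable" and
"cf" are again made-up names; I believe the shell's alter command can set
the same BLOOMFILTER attribute on an existing CF):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class CreateTableWithBloom {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        HColumnDescriptor cf = new HColumnDescriptor("cf");
        // ROW bloom: whole rows are written/read together.
        // ROWCOL bloom: columns land at different times and point queries on
        // a ROW+COL have a high miss rate (e.g. the user/hash example above).
        cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);

        HTableDescriptor desc = new HTableDescriptor("usertable");
        desc.addFamily(cf);
        admin.createTable(desc);
      }
    }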

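To close the loop on the write-side knobs from the top of this mail, a
rough sketch of where they live. Assumptions: the 0.92-era API, LZO
installed on the cluster, and the property name
hbase.hstore.compaction.ratio for the ratio; treat the numbers as
illustrations, not recommendations for your workload:

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class WriteSideKnobs {
      // Flush piece: compressing the CF makes every flushed StoreFile
      // smaller (None -> LZO was the ~5x figure above), at CPU cost.
      public static HColumnDescriptor compressedFamily(String name) {
        HColumnDescriptor cf = new HColumnDescriptor(name);
        cf.setCompressionType(Compression.Algorithm.LZO);
        return cf;
      }

      // Compaction piece: the rewrite volume over time is governed by the
      // region server setting hbase.hstore.compaction.ratio in
      // hbase-site.xml (e.g. 1.2 -> 0.25 as suggested above); as far as I
      // know it is a server-side config, not a per-table API call.
    }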