Wow, thank you very much for all those precious explanations, pointers and examples. It's a lot to digest... I will try them (at least what I can with 0.90.4 -- yes, I'm upgrading from 0.90.1 to 0.90.4) and keep you informed. BTW, I'm already using compression (GZ); the current data is randomized, so I don't get as much gain as you mentioned (I think I'm only around 30%). It seems that bloom filters (BF) are one of the major things I need to look at, along with the compaction ratio, and I need a different setting for each of my CFs (one CF has a small set of columns and each update changes about 50% of them --> ROWCOL; the second CF gets a new column on every update --> ROW). I'm not keeping more than one version either, and you wrote that this kind of read is not a point query.
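For reference, this is roughly what I have in mind for the two families, as a rough sketch against the 0.92-era Java client API (bloom filters need HBASE-1200/HBASE-2794, so they only become available once I get past 0.90.x); the table and family names below are just placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.io.hfile.Compression;
  import org.apache.hadoop.hbase.regionserver.StoreFile;

  public class AlterFamilies {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      // CF with a small column set where ~50% of the columns change per update:
      // point queries on row+column, so a ROWCOL bloom.
      HColumnDescriptor smallCf = new HColumnDescriptor("small_cf");
      smallCf.setCompressionType(Compression.Algorithm.GZ);
      smallCf.setBloomFilterType(StoreFile.BloomType.ROWCOL);
      smallCf.setMaxVersions(1);

      // CF that gains a brand-new column on every update: a ROW bloom is enough.
      HColumnDescriptor wideCf = new HColumnDescriptor("wide_cf");
      wideCf.setCompressionType(Compression.Algorithm.GZ);
      wideCf.setBloomFilterType(StoreFile.BloomType.ROW);
      wideCf.setMaxVersions(1);

      // Schema changes still require the table to be offline here.
      admin.disableTable("mytable");
      admin.modifyColumn("mytable", "small_cf", smallCf);
      admin.modifyColumn("mytable", "wide_cf", wideCf);
      admin.enableTable("mytable");
    }
  }

If I got it right, the new compression/bloom settings only apply to StoreFiles written after the change (flushes and subsequent compactions), not retroactively to existing files.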
A suggestion: perhaps take all those examples/explanations and add them to the book for future reference.

Regards,
Mikael.S

On Sat, Jan 14, 2012 at 4:06 AM, Nicolas Spiegelberg <[email protected]> wrote:

> >I'm sorry, but I don't understand: of course I have disk and network saturation, and the flush stops flushing because it is waiting for the compaction to finish. Since a major compaction was triggered, all the stores (a large number) present on the disks (7 disks per RS) will be grabbed for major compaction, and the I/O is affected. The network is also affected, since everything is major compacting and replicating files at the same time (1GB network).
>
> When you have an IO problem, there are multiple pieces at play that you can adjust:
>
> Write: HLog, Flush, Compaction
> Read: Point Query, Scan
>
> If your writes far outnumber your reads, then you should relax one of the write pieces.
> - HLog: You can't really adjust HLog IO outside of key compression (HBASE-4608).
> - Flush: You can adjust your compression. None->LZO == 5x compression. LZO->GZ == 2x compression. Both are at the expense of CPU. HBASE-4241 minimizes flush IO significantly in the update-heavy use case (discussed in the last email).
> - Compaction: You can lower the compaction ratio to minimize the amount of rewriting over time. That's why I suggested changing the ratio from 1.2 -> 0.25. This gives a ~50% IO reduction (blog post on this forthcoming @ http://www.facebook.com/UsingHBase ).
>
> However, you may have a lot more reads than you think. For example, let's say your read:write ratio is 1:10, so significantly write dominated. Without any of the optimizations I listed in the previous email, your real read ratio is multiplied by the StoreFile count (because you naively read all StoreFiles). So let's say that, during congestion, you have 20 StoreFiles. 1*20:10 means that you're now 2:1 read dominated. You need features to reduce the number of StoreFiles you scan when the StoreFile count is high.
>
> - Point Query: bloom filters (HBASE-1200, HBASE-2794), lazy seek (HBASE-4465), and seek optimizations (HBASE-4433, HBASE-4434, HBASE-4469, HBASE-4532)
> - Scan: not as many optimizations here. They mostly revolve around proper usage & seek-next optimization when using filters. I don't have JIRA numbers here, but probably a half-dozen small tweaks were added in 0.92.
>
> >I don't have an increment workload (the workload either updates columns on a CF or adds a column on a CF for the same key), so how will those patches help?
>
> Increment & read->update workloads end up picking up roughly the same optimizations. Adding a column to an existing row is no different from adding a new row as far as optimizations are concerned, because there's nothing to de-dupe.
>
> >I don't say this is a bad thing, it's just an observation from our test: HBase will slow down the flush if too many store files are present, and will add pressure on GC and memory, affecting performance.
> >The update workload does not send all the row content for a certain key, so only partial data is written. In order to get the whole row, I presume that reading the newest store is not enough ("all" stores need to be read, collecting the most up-to-date fields to rebuild a full row), or am I missing something?
>
> Reading all row columns is the same as doing a scan. You're not doing a point query if you don't specify the exact key (columns) you're looking for.
> Setting versions to unlimited and then getting all versions of a particular ROW+COL would also be considered a scan rather than a point query as far as optimizations are concerned.
>
> >1. If I did not set a specific property for bloom filters (BF), does it mean that I'm not using them (the book only refers to BF with regard to CFs)?
>
> By default, bloom filters are disabled, so you need to enable them to get the optimizations. This is by design. Bloom filters trade off cache space for low-overhead probabilistic queries. The default is 8 bytes per bloom entry (key) & a 1% false positive rate. You can use 'bin/hbase org.apache.hadoop.hbase.io.hfile.HFile' (look at the help, then -f to specify a StoreFile and -m for meta info) to see your StoreFile's average KV size. If size(KV) == 100 bytes, then blooms use 8% of the space in cache, which is better than loading the StoreFile block only to get a miss.
>
> Whether to use a ROW or ROWCOL bloom filter depends on your write & read pattern. If you read the entire row at a time, use a ROW bloom. If you point query, ROW or ROWCOL are both options. If you write all columns for a row at the same time, definitely use a ROW bloom. If you have a small column range and you update the columns at different rates/times, then a ROWCOL bloom filter may be more helpful. ROWCOL is really useful if a scan query for a ROW will normally return results, but a point query for a ROWCOL may have a high miss rate. A perfect example is storing unique hash values for a user on disk. You'd use 'user' as the row & the hash as the column. In most instances, the hash won't be a duplicate, so a ROWCOL bloom would be better.
>
> >3. How can we ensure that compaction will not consume too much I/O if we cannot control major compaction?
>
> TCP congestion control will ensure that a single TCP socket won't consume too much bandwidth, so that part of compactions is handled automatically. The part that you need to handle is the number of simultaneous TCP sockets (currently 1, until multi-threaded compactions) & the aggregate data volume transferred over time. As I said, this is controlled by compaction.ratio. If temporarily high StoreFile counts cause you to bottleneck, the slight latency variance is an annoyance of the current compaction algorithm, but the underlying problem you should be looking at solving is the system's inability to filter out the unnecessary StoreFiles.
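To make the point-query vs. scan distinction above concrete, here is a minimal sketch against the 0.92-era Java client API (the table, family, row and column names are all invented): only the first read names the exact row and column, so it is the only one that bloom filters and the lazy-seek/seek optimizations can use to skip StoreFiles.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PointQueryVsRowRead {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");

      // Point query: exact row AND column are specified, so StoreFiles that
      // cannot contain this cell can be skipped (ROWCOL bloom, lazy seek, ...).
      Get point = new Get(Bytes.toBytes("row-42"));
      point.addColumn(Bytes.toBytes("small_cf"), Bytes.toBytes("status"));
      Result cell = table.get(point);

      // "Give me the whole row": no columns specified, so every StoreFile of the
      // family may have to be read and merged; optimization-wise this is a scan.
      Get wholeRow = new Get(Bytes.toBytes("row-42"));
      Result fullRow = table.get(wholeRow);

      table.close();
    }
  }

Asking for all versions of a row+column (Get.setMaxVersions()) falls in the same scan-like bucket, as noted above.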
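And a small sketch of the user/hash pattern mentioned above, where a ROWCOL bloom shines (all names are made up): each hash is written as a new column under the user's row, and the existence check is a point query on row+column that will usually miss in most StoreFiles.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class UserHashLookup {
    private static final byte[] CF = Bytes.toBytes("hashes");

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "user_hashes");

      // Write: row = user, column qualifier = the hash itself (a new column per hash).
      Put put = new Put(Bytes.toBytes("user123"));
      put.add(CF, Bytes.toBytes("a94a8fe5cc"), Bytes.toBytes(System.currentTimeMillis()));
      table.put(put);

      // Read: "have we seen this hash for this user before?" is a point query on
      // row+column. Most hashes are new, so most StoreFiles won't contain this exact
      // cell; a ROWCOL bloom lets those files be skipped instead of read-and-missed.
      Get get = new Get(Bytes.toBytes("user123"));
      get.addColumn(CF, Bytes.toBytes("a94a8fe5cc"));
      boolean seenBefore = table.exists(get);

      table.close();
    }
  }

A ROW bloom would help much less here, since the user row itself usually does exist in most StoreFiles even when the particular hash column does not.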
