A follow-up...

1. My CFs were already working with BF; they used ROWCOL (I didn't pay attention to that at the time I wrote my answers).
2. I see from the logs that the BF is already at 100% - is that bad? Should I add more memory for BF?
3. HLog compression (HBASE-4608) is not scheduled yet - is that intentional?
4. Compaction.ratio is only for 0.92.x releases, so I cannot use it yet.
5. All the other patches are also for 0.92/0.94, so my situation will not improve until then, besides playing with the log rolling size and the max number of store files (see the config sketch below).
6. I have also noticed that in a pure-insert workload (no reads, empty regions, new keys) the store files on the RS can reach more than 4500 files, whereas in an update/read scenario the store files did not go past 1500 files per region (the flush throttling was active there, but not during the inserts). Is there an explanation for that?
7. I also have a fresh 0.92 install and am checking the behavior there (additional results soon, hopefully).
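Regarding point 5, here is a minimal hbase-site.xml sketch of the knobs in question. The property names and values are my assumptions for the 0.90.x line rather than something verified on this cluster, so check them against the hbase-default.xml shipped with the release:

    <!-- Sketch only: verify names and defaults against your HBase release. -->
    <property>
      <!-- Block updates to a region once any of its stores has this many
           StoreFiles, until compaction catches up. -->
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>15</value>
    </property>
    <property>
      <!-- Max time (ms) updates stay blocked on the StoreFile limit before
           being allowed through anyway. -->
      <name>hbase.hstore.blockingWaitTime</name>
      <value>90000</value>
    </property>
    <property>
      <!-- Max number of HLogs kept before memstores are force-flushed so
           older logs can be archived. -->
      <name>hbase.regionserver.maxlogs</name>
      <value>32</value>
    </property>
    <property>
      <!-- HLog block size; logs roll when they approach this size
           (defaults to the HDFS block size). -->
      <name>hbase.regionserver.hlog.blocksize</name>
      <value>134217728</value>
    </property>

Raising hbase.hstore.blockingStoreFiles trades more StoreFiles (and so more read amplification) for fewer blocked flushes, which is the same trade-off behind the store-file counts in point 6.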
Mikael.S

On Sat, Jan 14, 2012 at 11:30 PM, Mikael Sitruk <[email protected]> wrote:

> Wow, thank you very much for all those precious explanations, pointers and examples. It's a lot to ingest... I will try them (at least what I can with 0.90.4 (yes, I'm upgrading from 0.90.1 to 0.90.4)) and keep you informed.
> BTW I'm already using compression (GZ); the current data is randomized, so I don't get as much gain as you mentioned (I think I'm around 30% only).
> It seems that BF is one of the major things I need to look at, along with the compaction.ratio, and I need different settings for my different CFs (one CF has a small set of columns and each update will change 50% --> ROWCOL; the second CF always gets a new column per update --> ROW).
> I'm not keeping more than one version either, and you wrote this is not a point query.
>
> A suggestion is perhaps to take all those examples/explanations and add them to the book for future reference.
>
> Regards,
> Mikael.S
>
>
> On Sat, Jan 14, 2012 at 4:06 AM, Nicolas Spiegelberg <[email protected]> wrote:
>
>> >I'm sorry but I don't understand. Of course I have disk and network saturation, and the flush stops flushing because it is waiting for compaction to finish. Since a major compaction was triggered, all the stores (a large number) present on the disks (7 disks per RS) will be grabbed for major compaction, and the I/O is affected. The network is also affected, since all are major compacting at the same time and replicating files at the same time (1GB network).
>>
>> When you have an IO problem, there are multiple pieces at play that you can adjust:
>>
>> Write: HLog, Flush, Compaction
>> Read: Point Query, Scan
>>
>> If your writes are far more than your reads, then you should relax one of the write pieces.
>> - HLog: You can't really adjust HLog IO outside of key compression (HBASE-4608).
>> - Flush: You can adjust your compression. None->LZO == 5x compression. LZO->GZ == 2x compression. Both are at the expense of CPU. HBASE-4241 minimizes flush IO significantly in the update-heavy use case (discussed in the last email).
>> - Compaction: You can lower the compaction ratio to minimize the amount of rewrites over time. That's why I suggested changing the ratio from 1.2 -> 0.25. This gives a ~50% IO reduction (blog post on this forthcoming @ http://www.facebook.com/UsingHBase ).
>>
>> However, you may have a lot more reads than you think. For example, let's say the read:write ratio is 1:10, so significantly write dominated. Without any of the optimizations I listed in the previous email, your real read ratio is multiplied by the StoreFile count (because you naively read all StoreFiles). So let's say, during congestion, you have 20 StoreFiles. 1*20:10 means that you're now 2:1 read dominated. You need features to reduce the number of StoreFiles you scan when the StoreFile count is high.
>>
>> - Point Query: bloom filters (HBASE-1200, HBASE-2794), lazy seek (HBASE-4465), and seek optimizations (HBASE-4433, HBASE-4434, HBASE-4469, HBASE-4532).
>> - Scan: not as many optimizations here. They mostly revolve around proper usage & seek-next optimization when using filters. I don't have JIRA numbers here, but probably a half-dozen small tweaks were added to 0.92.
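As a concrete illustration of the write-side knobs quoted above (per-CF compression and the compaction ratio), a hedged sketch follows; 'mytable' and 'cf1' are placeholder names, LZO needs the native libraries installed, and hbase.hstore.compaction.ratio only exists from 0.92 on:

    # HBase shell: switch a CF's compression codec. Existing StoreFiles are
    # only rewritten with the new codec as they get compacted. On 0.90.x the
    # table has to be disabled first.
    hbase> disable 'mytable'
    hbase> alter 'mytable', {NAME => 'cf1', COMPRESSION => 'LZO'}
    hbase> enable 'mytable'

    <!-- hbase-site.xml (0.92+ only): lower the compaction ratio from the
         default 1.2 toward 0.25 to cut rewrite IO, at the cost of carrying
         more StoreFiles between compactions. -->
    <property>
      <name>hbase.hstore.compaction.ratio</name>
      <value>0.25</value>
    </property>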
>> >I don't have an increment workload (the workload either updates columns on a CF or adds a column on a CF for the same key), so how will those patches help?
>>
>> Increment & read->update workloads end up roughly picking up the same optimizations. Adding a column to an existing row is no different than adding a new row as far as optimizations are concerned, because there's nothing to de-dupe.
>>
>> >I don't say this is a bad thing, this is just an observation from our test: HBase will slow down the flush in case too many store files are present, and will add pressure on GC and memory, affecting performance.
>> >The update workload does not send all the row content for a certain key, so only partial data is written. In order to get the whole row, I presume that reading the newest store is not enough ("all" stores need to be read, collecting the most up-to-date fields to rebuild a full row), or am I missing something?
>>
>> Reading all row columns is the same as doing a scan. You're not doing a point query if you don't specify the exact key (columns) you're looking for. Setting versions to unlimited, then getting all versions of a particular ROW+COL, would also be considered a scan vs a point query as far as optimizations are concerned.
>>
>> >1. If I did not set a specific property for bloom filters (BF), does it mean that I'm not using them (the book only refers to BF with regard to CF)?
>>
>> By default, bloom filters are disabled, so you need to enable them to get the optimizations. This is by design. Bloom Filters trade off cache space for low-overhead probabilistic queries. The default is 8 bytes per bloom entry (key) & a 1% false positive rate. You can use 'bin/hbase org.apache.hadoop.hbase.io.hfile.HFile' (look at the help, then -f to specify a StoreFile and -m for meta info) to see your StoreFile's average KV size. If size(KV) == 100 bytes, then blooms use 8% of the space in cache, which is better than loading the StoreFile block only to get a miss.
>>
>> Whether to use a ROW or ROWCOL bloom filter depends on your write & read pattern. If you read the entire row at a time, use a ROW bloom. If you point query, ROW or ROWCOL are both options. If you write all columns for a row at the same time, definitely use a ROW bloom. If you have a small column range and you update the columns at different rates/times, then a ROWCOL bloom filter may be more helpful. ROWCOL is really useful if a scan query for a ROW will normally return results, but a point query for a ROWCOL may have a high miss rate. A perfect example is storing unique hash values for a user on disk. You'd use 'user' as the row & the hash as the column. In most instances, the hash won't be a duplicate, so a ROWCOL bloom would be better.
>>
>> >3. How can we ensure that compaction will not suck too much I/O if we cannot control major compaction?
>>
>> TCP congestion control will ensure that a single TCP socket won't consume too much bandwidth, so that part of compactions is automatically handled. The part that you need to handle is the number of simultaneous TCP sockets (currently 1, until multi-threaded compactions) & the aggregate data volume transferred over time. As I said, this is controlled by compaction.ratio. If temporarily high StoreFile counts cause you to bottleneck, the slight latency variance is an annoyance of the current compaction algorithm, but the underlying problem you should be looking at solving is the system's inability to filter out the unnecessary StoreFiles.
>>
>
>
>

--
Mikael.S
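Tying the bloom filter advice above to the two CFs described earlier in the thread (a small, partially updated column set --> ROWCOL; a new column per update --> ROW), here is a hedged sketch of the shell commands and the HFile inspection step; table, CF, and path names are placeholders, and the syntax should be double-checked against the 0.90.4 shell:

    # HBase shell: enable per-CF bloom filters (table disabled first on 0.90.x).
    # 'mytable', 'small_cf', and 'wide_cf' are illustrative names only.
    hbase> disable 'mytable'
    hbase> alter 'mytable', {NAME => 'small_cf', BLOOMFILTER => 'ROWCOL'}
    hbase> alter 'mytable', {NAME => 'wide_cf', BLOOMFILTER => 'ROW'}
    hbase> enable 'mytable'

    # Inspect a StoreFile's meta info (e.g. average key/value size) with the
    # HFile tool mentioned above; the HDFS path below is illustrative.
    bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f \
      /hbase/mytable/<region-name>/small_cf/<storefile>

With an average KV around 100 bytes, the 8-byte bloom entry works out to roughly 8% cache overhead per key, which is the break-even arithmetic from the reply above.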
