OK. I have activated them on 2 of my main tables, and I will re-run the job and see.
2 other questions then ;)

1) I have activated them this way:
alter 'work_proposed', NAME => '@', BLOOMFILTER => 'ROW'
How can I remove them?

2) Should I major_compact to make sure all the hashes are stored?

Thanks,

JM

2013/1/4, Adrien Mogenet <[email protected]>:
> On every Get, the BloomFilter acts as a filter (!) on top of each HFile
> and makes it possible to check whether a key is absent from the HFile.
> So yes, you will benefit from these filters.
>
>
> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
> <[email protected]> wrote:
>
>> Is KeyOnlyFilter using the BloomFilters too?
>>
>> Here is, with more details, what I'm doing.
>>
>> A few questions.
>> - Can I create one single KeyOnlyFilter and give the same filter to
>> all the gets?
>> - Will bloom filters help in such a scenario? My key is small. Let's
>> say 128 bytes on average.
>>
>> The goal here is to check about 500 entries at a time to validate
>> whether they already exist or not.
>>
>> In my MR, I'm starting when I have more than 100K lines to handle, and
>> each line can have up to 1K entries. So it can result in up to 100M
>> gets... The job initially took 500 minutes to complete. I have added a
>> few pretty good nodes and it's now taking less than 300 minutes. But I
>> would like to get under 100 minutes if I can...
>>
>> Thanks,
>>
>> JM
>>
>> Vector<Get> gets_entry_exist = new Vector<Get>();
>> for (Entry entry : entries.getEntries())
>> {
>>   Get entry_exist = new Get(entry.toKey());
>>   entry_exist.setFilter(new KeyOnlyFilter());
>>   gets_entry_exist.add(entry_exist);
>> }
>>
>> Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>
>> int index = 0;
>> for (Entry entry : entries.getEntries())
>> {
>>   boolean isEmpty = result_entry_exist[index++].isEmpty();
>>   if (isEmpty)
>>   {
>>     // Process here
>>   }
>> }
>>
>>
>> 2013/1/4, Damien Hardy <[email protected]>:
>> > Hello Jean-Marc,
>> >
>> > BloomFilters are just designed for that.
>> >
>> > But they tell you that a row doesn't exist, based on a hash of the
>> > key (not the opposite: 2 rowkeys could have the same hash result).
>> >
>> > If you want to be sure the rowkey exists you have to search for it in
>> > the HFile (the whole mechanism is transparent with the get()).
>> >
>> > There is also a KeyOnlyFilter
>> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>> > which prevents the whole columns of the existing key from being
>> > returned (which could be heavy).
>> >
>> > Cheers,
>> >
>> > --
>> > Damien
>> >
>> >
>> > 2013/1/4 Jean-Marc Spaggiari <[email protected]>
>> >
>> >> Hi,
>> >>
>> >> What's the fastest way to know if a row exists?
>> >>
>> >> Today I'm doing this:
>> >>
>> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >>
>> >> But should this be faster?
>> >> Get get_entry_exist = new Get(key);
>> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >>
>> >> There is only one CF and one C on my table.
>> >>
>> >> Or is there an even faster way?
>> >>
>> >> Also, is there a way to make that even faster? I think BloomFilters
>> >> can help, right?
>> >>
>> >> Thanks,
>> >>
>> >> JM
>> >>
>> >
>> >
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>
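
For reference, the point made in the thread (a bloom filter can say "definitely absent" but only "maybe present", because two rowkeys can hash to the same bits) can be illustrated with a toy filter. This is only a minimal sketch and not HBase's implementation: the class name BloomSketch, the bit-array size, the number of hashes and the hash functions are all assumptions chosen for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Toy bloom filter for illustration only (hypothetical, NOT HBase code).
class BloomSketch {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per key

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Simple 32-bit FNV-1a hash, used as the second hash function.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th bit position for a key, derived from two hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a rowkey by setting its bit positions.
    void add(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(k, i));
        }
    }

    // false: the key was never added (definitely absent).
    // true:  the key MAY be present; the real store must still be checked.
    boolean mightContain(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(k, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1 << 16, 3);
        for (int i = 0; i < 1000; i++) {
            filter.add("row-" + i);
        }
        // A key that was added is never reported absent (no false negatives).
        System.out.println("row-42 may exist: " + filter.mightContain("row-42"));

        // Keys that were never added can occasionally be reported as "maybe".
        int falsePositives = 0;
        for (int i = 1000; i < 2000; i++) {
            if (filter.mightContain("row-" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives among 1000 absent keys: " + falsePositives);
    }
}
```

A "false" answer lets the read skip a file entirely, which is where the gain for existence checks on mostly-absent keys comes from. A "true" answer still requires the actual lookup.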
