Why do you want to remove the bloom filter? I think you should keep the bloom filter but also use the KeyOnlyFilter to cut down on data transferred over the wire.
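
For what it's worth, here is a toy sketch in plain Java (not HBase code) of the property discussed further down the thread: a bloom filter can say for certain that a key is absent, but a "maybe present" answer still has to be confirmed by reading the HFile. The class name `RowKeyBloomSketch`, its methods, and the FNV-style hash are all made up for illustration; HBase's own implementation differs.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy row-key bloom filter. Illustrative only; not how HBase implements it. */
final class RowKeyBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowKeyBloomSketch(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // FNV-1a style hash over the key bytes; 'round' varies the starting value
    // so that each round yields a different bit position (an assumption made
    // for this sketch, chosen for brevity rather than quality).
    private int position(byte[] key, int round) {
        int h = 0x811C9DC5 ^ (round * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }

    /** Records a row key by setting one bit per hash round. */
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /**
     * Returns false only if the key was never added (no false negatives).
     * Returns true if the key was added OR if other keys happened to set
     * the same bits (a false positive), so "true" means "go check the file".
     */
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RowKeyBloomSketch filter = new RowKeyBloomSketch(1 << 16, 3);
        filter.add("row-42");
        System.out.println("row-42 might exist: " + filter.mightContain("row-42"));
        System.out.println("row-99 might exist: " + filter.mightContain("row-99"));
    }
}
```

In the existence-check job this is why the filter pays off when most keys are new: a "definitely absent" answer lets a file be skipped entirely, and KeyOnlyFilter then keeps the response small for the keys that do exist.
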
On Fri, Jan 4, 2013 at 3:28 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> Ok. I have activated them on 2 of my main tables and I will re-run the
> job and see.
>
> 2 other questions then ;)
>
> 1) I have activated them that way: alter 'work_proposed', NAME => '@',
> BLOOMFILTER => 'ROW' how can I remove them?
> 2) Should I major_compact to make sure all the hashes are stored?
>
> Thanks,
>
> JM
>
> 2013/1/4, Adrien Mogenet <[email protected]>:
> > On every Get, the BloomFilter acts as a filter (!) on top of each HFile
> > and allows checking whether a key is absent from the HFile. So yes, you
> > will benefit from these filters.
> >
> >
> > On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
> > <[email protected]> wrote:
> >
> >> Is KeyOnlyFilter using the BloomFilters too?
> >>
> >> Here is, with more details, what I'm doing.
> >>
> >> A few questions.
> >> - Can I create one single KeyOnlyFilter and give the same filter to
> >> all the gets?
> >> - Will bloom filters help in such a scenario? My key is small. Let's
> >> say 128 bytes on average.
> >>
> >> The goal here is to check about 500 entries at a time to validate if
> >> they already exist or not.
> >>
> >> In my MR, I'm starting when I have more than 100K lines to handle, and
> >> each line can have up to 1K entries. So it can result in up to 100M
> >> gets... The job initially took 500 minutes to complete. I have added a
> >> few pretty good nodes and it's now taking less than 300 minutes. But I
> >> would like to get under 100 minutes if I can...
> >>
> >> Thanks,
> >>
> >> JM
> >>
> >> Vector<Get> gets_entry_exist = new Vector<Get>();
> >> for (Entry entry : entries.getEntries())
> >> {
> >>     Get entry_exist = new Get(entry.toKey());
> >>     entry_exist.setFilter(new KeyOnlyFilter());
> >>     gets_entry_exist.add(entry_exist);
> >> }
> >>
> >> Result[] result_entry_exist = table_entry.get(gets_entry_exist);
> >>
> >> int index = 0;
> >> for (Entry entry : entries.getEntries())
> >> {
> >>     boolean isEmpty = result_entry_exist[index++].isEmpty();
> >>     if (isEmpty)
> >>     {
> >>         // Process here
> >>     }
> >> }
> >>
> >>
> >> 2013/1/4, Damien Hardy <[email protected]>:
> >> > Hello Jean-Marc,
> >> >
> >> > BloomFilters are designed for just that.
> >> >
> >> > But they tell you that a row doesn't exist, based on a hash of the key
> >> > (not the opposite: 2 rowkeys could have the same hash result).
> >> >
> >> > If you want to be sure the rowkey exists you have to search for it in
> >> > the HFile (the whole mechanism is transparent with the get()).
> >> >
> >> > There is also a KeyOnlyFilter
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> >> > which prevents the whole columns of the existing key from being
> >> > returned (which could be heavy).
> >> >
> >> > Cheers,
> >> >
> >> > --
> >> > Damien
> >> >
> >> >
> >> > 2013/1/4 Jean-Marc Spaggiari <[email protected]>
> >> >
> >> >> Hi,
> >> >>
> >> >> What's the fastest way to know if a row exists?
> >> >>
> >> >> Today I'm doing that:
> >> >>
> >> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> >> >> Result entry_exist = table_entry.get(get_entry_exist);
> >> >>
> >> >> But should this be faster?
> >> >> Get get_entry_exist = new Get(key);
> >> >> Result entry_exist = table_entry.get(get_entry_exist);
> >> >>
> >> >> There is only one CF and one C on my table.
> >> >>
> >> >> Or is there an even faster way?
> >> >>
> >> >> Also, is there a way to make that even faster? I think BloomFilters
> >> >> can help, right?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> JM
> >> >>
> >> >
> >>
> >
> >
> > --
> > Adrien Mogenet
> > 06.59.16.64.22
> > http://www.mogenet.me
> >
>
