Re: Fastest way to find is a row exist?

Jean-Marc Spaggiari Fri, 04 Jan 2013 12:54:39 -0800

I want to remove it because I set it up on the wrong column ;) I
should have used NAME => 'a' instead of NAME => '@' ;)


I have set up the KeyOnlyFilter in the code and redeployed. I have also added
the bloom filter on the right column. I will remove the wrong one later.

As soon as the compaction is done I will restart my MR and keep
fingers crossed...
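
For reference, here is a minimal sketch of the batched existence check discussed in this thread, written without HBase so it can run on its own. The `KeyStore` interface, `findMissing`, and the class name are hypothetical stand-ins, not HBase API; in the real job the batch call would be the table's multi-get with a KeyOnlyFilter, and the batch size would be around 500.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BatchExistenceSketch {
    // Hypothetical stand-in for a table client: one call answers a whole batch,
    // returning one flag per key, in the same order as the request.
    interface KeyStore {
        boolean[] exists(List<String> keys);
    }

    // Splits the keys into batches and returns the keys that were not found.
    static List<String> findMissing(KeyStore store, List<String> keys, int batchSize) {
        List<String> missing = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> batch = keys.subList(start, end);
            boolean[] found = store.exists(batch); // one round trip per batch
            for (int i = 0; i < batch.size(); i++) {
                if (!found[i]) {
                    missing.add(batch.get(i));
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In-memory set playing the role of the table (assumption for the demo).
        Set<String> stored = new HashSet<>(Arrays.asList("a", "c"));
        KeyStore store = keys -> {
            boolean[] out = new boolean[keys.size()];
            for (int i = 0; i < keys.size(); i++) {
                out[i] = stored.contains(keys.get(i));
            }
            return out;
        };
        // Batch size 2 here only to exercise more than one batch.
        System.out.println(findMissing(store, Arrays.asList("a", "b", "c", "d"), 2)); // prints [b, d]
    }
}
```

The point of the batching is simply to pay one round trip per group of keys instead of one per key; the right batch size depends on the cluster and is not something this sketch can tell you.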

2013/1/4, Bryan Beaudreault <[email protected]>:
> Why do you want to remove the bloom filter?  I think you should keep the
> bloom filter but also use the KeyOnlyFilter to cut down on data transferred
> over the wire.
>
>
> On Fri, Jan 4, 2013 at 3:28 PM, Jean-Marc Spaggiari
> <[email protected]> wrote:
>
>> Ok. I have activated them on 2 of my main tables and I will re-run the
>> job and see.
>>
>> 2 other questions then ;)
>>
>> 1) I have activated them this way: alter 'work_proposed', NAME => '@',
>> BLOOMFILTER => 'ROW'. How can I remove them?
>> 2) Should I major_compact to make sure all the hashes are stored?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/1/4, Adrien Mogenet <[email protected]>:
>> > On every Get, the BloomFilter acts as a filter (!) on top of each HFile
>> > and makes it possible to check whether a key is absent from that HFile.
>> > So yes, you will benefit from these filters.
>> >
>> >
>> > On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>> > <[email protected]> wrote:
>> >
>> >> Is KeyOnlyFilter using the BloomFilters too?
>> >>
>> >> Here is, with more details, what I'm doing.
>> >>
>> >> A few questions.
>> >> - Can I create one single KeyOnlyFilter and give the same filter to
>> >> all the gets?
>> >> - Will bloom filters help in such a scenario? My key is small. Let's
>> >> say 128 bytes on average.
>> >>
>> >> The goal here is to check about 500 entries at a time to validate if
>> >> they already exist or not.
>> >>
>> >> In my MR, I'm starting when I have more than 100K lines to handle, and
>> >> each line can have up to 1K entries. So it can result in up to 100M
>> >> gets... The job initially took 500 minutes to complete. I have added a few
>> >> pretty good nodes and it's now taking less than 300 minutes. But I
>> >> would like to get under 100 minutes if I can...
>> >>
>> >> Thanks,
>> >>
>> >> JM
>> >>
>> >>         // One Get per entry; KeyOnlyFilter returns keys without cell values.
>> >>         Vector<Get> gets_entry_exist = new Vector<Get>();
>> >>         for (Entry entry : entries.getEntries())
>> >>         {
>> >>                 Get entry_exist = new Get(entry.toKey());
>> >>                 entry_exist.setFilter(new KeyOnlyFilter());
>> >>                 gets_entry_exist.add(entry_exist);
>> >>         }
>> >>
>> >>         // Send all the gets in one batch; results come back in request order.
>> >>         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>> >>
>> >>         int index = 0;
>> >>         for (Entry entry : entries.getEntries())
>> >>         {
>> >>                 // An empty Result means the row does not exist.
>> >>                 boolean isEmpty = result_entry_exist[index++].isEmpty();
>> >>                 if (isEmpty)
>> >>                 {
>> >>                         // Process here
>> >>                 }
>> >>         }
>> >>
>> >>
>> >> 2013/1/4, Damien Hardy <[email protected]>:
>> >> > Hello Jean-Marc,
>> >> >
>> >> > BloomFilters are just designed for that.
>> >> >
>> >> > But they can only tell you that a row doesn't exist, using a hash of the
>> >> > key (not the opposite: 2 rowkeys could have the same hash result).
>> >> >
>> >> > If you want to be sure the rowkey exists you have to search for it in the
>> >> > HFile (the whole mechanism is transparent with the get()).
>> >> >
>> >> > There is also a KeyOnlyFilter
>> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>> >> > which avoids returning all the columns of the existing key
>> >> > (which could be heavy).
>> >> >
>> >> > Cheers,
>> >> >
>> >> > --
>> >> > Damien
>> >> >
>> >> >
>> >> > 2013/1/4 Jean-Marc Spaggiari <[email protected]>
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> What's the fastest way to know if a row exists?
>> >> >>
>> >> >> Today I'm doing that:
>> >> >>
>> >> >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>> >> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >> >>
>> >> >> But should this be faster?
>> >> >> Get get_entry_exist = new Get(key);
>> >> >> Result entry_exist = table_entry.get(get_entry_exist);
>> >> >>
>> >> >> There is only one CF and one C on my table.
>> >> >>
>> >> >> Or is there an even faster way?
>> >> >>
>> >> >> Also, is there a way to make that even faster? I think BloomFilters
>> >> >> can help, right?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> JM
>> >> >>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Adrien Mogenet
>> > 06.59.16.64.22
>> > http://www.mogenet.me
>> >
>>
>
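
The point made above, that a Bloom filter can rule a key out but can never confirm it, can be illustrated with a small standalone sketch. This is a toy filter in plain Java under assumed parameters (bit count, number of hashes, a double-hashing scheme built on `String.hashCode()`); it is not how HBase implements its Bloom filters, and the class and method names are hypothetical.

```java
import java.util.BitSet;

class BloomSketch {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per key

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position for a key from two base hashes.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true:  the key may have been added (other keys can set the same bits).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1024, 3);
        System.out.println(bloom.mightContain("row-1")); // false: nothing added yet
        bloom.add("row-1");
        System.out.println(bloom.mightContain("row-1")); // true: no false negatives

        // A deliberately tiny filter makes the false positive visible:
        // with a single bit, every key maps to the same position.
        BloomSketch tiny = new BloomSketch(1, 1);
        tiny.add("row-1");
        System.out.println(tiny.mightContain("never-added")); // true: false positive
    }
}
```

This is why a "maybe present" answer still requires a lookup in the HFile, while a "definitely absent" answer lets that file be skipped.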
