Re: Fastest way to find is a row exist?

Jean-Marc Spaggiari Sun, 06 Jan 2013 18:15:28 -0800

Finally, I looked at how exists(Get) is implemented and built
exists(List<Get>)... (HBASE-7503)


I will run some benchmarks to compare which is faster: batch(List<Get>) or
exists(List<Get>)... I built it for 0.94 too and will deploy the
updated build on my cluster...
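
To make the comparison concrete, here is a rough plain-Java sketch of the
batched shape. Everything in it is an assumption made for illustration:
FakeTable, existsAll and ExistenceCheck are hypothetical names, not HBase
classes, and counting one round trip per call is a simplification, since a
real client groups the keys per region server.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory stand-in for a table; counts simulated round trips. Not HBase code. */
final class FakeTable {
    private final Set<String> rows = new HashSet<>();
    int roundTrips = 0;

    FakeTable(List<String> existingRows) {
        rows.addAll(existingRows);
    }

    /** One call per key, like calling exists(Get) in a loop. */
    boolean exists(String key) {
        roundTrips++;
        return rows.contains(key);
    }

    /** One call for many keys, the shape a list-based exists would have. */
    boolean[] existsAll(List<String> keys) {
        roundTrips++;
        boolean[] found = new boolean[keys.size()];
        for (int i = 0; i < keys.size(); i++) {
            found[i] = rows.contains(keys.get(i));
        }
        return found;
    }
}

final class ExistenceCheck {
    /** Returns the keys that are missing, asking the table in chunks of batchSize. */
    static List<String> missingKeys(FakeTable table, List<String> keys, int batchSize) {
        List<String> missing = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            List<String> chunk = keys.subList(start, end);
            boolean[] found = table.existsAll(chunk);
            for (int i = 0; i < chunk.size(); i++) {
                if (!found[i]) {
                    missing.add(chunk.get(i));
                }
            }
        }
        return missing;
    }
}
```

With 500 keys per chunk, 100M lookups become about 200K calls in this model
instead of 100M. Whether that carries over to a real cluster is what the
benchmark has to show.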

2013/1/6, Asaf Mesika <[email protected]>:
> Why not write your own filter class, which you can initialize with a
> set of keys to search for?
> The HTable on the client side will split the keys based on row keys so
> they will be sent to the right regions. There your filter can use the
> SEEK_NEXT_USING_HINT return code to seek efficiently over that set of
> keys.
> This will ensure you do this search in one RPC call.
> Your filter can also transform the KeyValue so that only the row keys
> are returned.
>
> Sent from my iPad
>
> On 6 Jan 2013, at 05:46, Mohamed Ibrahim <[email protected]> wrote:
>
>> Sorry, I didn't notice your email about packing 500 operations before.
>>
>> You might actually benefit from comparing a batch of Gets against
>> individual exists calls.
>>
>> Best,
>> Mohamed
>>
>>
>> On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>>
>>> Hum, very interesting!
>>>
>>> Now, what's the best option? An array of Gets, which will retrieve more
>>> information? Or multiple HTable.exists calls, one by one?
>>>
>>> The best would have been to pass an array of Gets to exists... I will
>>> see how much work it is to add that...
>>>
>>> JM
>>>
>>> 2013/1/4, Mohamed Ibrahim <[email protected]>:
>>>> What about HTable.exists?
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
>>>>
>>>> I think that should work if the Get has only the row key.
>>>>
>>>> Mohamed
>>>>
>>>>
>>>> On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
>>>> <[email protected]> wrote:
>>>>
>>>>> On every Get, the Bloom filter acts as a filter (!) on top of each
>>>>> HFile and makes it possible to check whether a key is absent from
>>>>> that HFile. So yes, you will benefit from these filters.
>>>>>
>>>>>
>>>>> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Does KeyOnlyFilter use the Bloom filters too?
>>>>>>
>>>>>> Here is what I'm doing, in more detail.
>>>>>>
>>>>>> A few questions.
>>>>>> - Can I create one single KeyOnlyFilter and give the same filter to
>>>>>> all the Gets?
>>>>>> - Will Bloom filters help in such a scenario? My key is small. Let's
>>>>>> say 128 bytes on average.
>>>>>>
>>>>>> The goal here is to check about 500 entries at a time to validate
>>>>>> whether they already exist or not.
>>>>>>
>>>>>> In my MR, I'm starting when I have more than 100K lines to handle,
>>>>>> and each line can have up to 1K entries. So it can result in up to
>>>>>> 100M Gets... The job initially took 500 minutes to complete. I have
>>>>>> added a few pretty good nodes and it's now taking less than 300
>>>>>> minutes. But I would like to get under 100 minutes if I can...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>>        // Build one key-only Get per entry, so no cell values are shipped back.
>>>>>>        List<Get> gets_entry_exist = new ArrayList<Get>();
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                Get entry_exist = new Get(entry.toKey());
>>>>>>                entry_exist.setFilter(new KeyOnlyFilter());
>>>>>>                gets_entry_exist.add(entry_exist);
>>>>>>        }
>>>>>>
>>>>>>        // One batched call; results are matched to the Gets by position.
>>>>>>        Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>>>>>
>>>>>>        int index = 0;
>>>>>>        for (Entry entry : entries.getEntries())
>>>>>>        {
>>>>>>                boolean isEmpty = result_entry_exist[index++].isEmpty();
>>>>>>                if (isEmpty)
>>>>>>                {
>>>>>>                        // Row does not exist yet: process here
>>>>>>                }
>>>>>>        }
>>>>>>
>>>>>>
>>>>>> 2013/1/4, Damien Hardy <[email protected]>:
>>>>>>> Hello Jean-Marc,
>>>>>>>
>>>>>>> BloomFilters are just designed for that.
>>>>>>>
>>>>>>> But they only tell you that a row doesn't exist, using a hash of
>>>>>>> the key (not the opposite: 2 rowkeys could have the same hash
>>>>>>> result).
>>>>>>>
>>>>>>> If you want to be sure the rowkey exists, you have to search for
>>>>>>> it in the HFile (the whole mechanism is transparent with get()).
>>>>>>>
>>>>>>> There is also a KeyOnlyFilter
>>>>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>>>>>>> which avoids returning all the columns of the existing key
>>>>>>> (which could be heavy).
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> --
>>>>>>> Damien
>>>>>>>
>>>>>>>
>>>>>>> 2013/1/4 Jean-Marc Spaggiari <[email protected]>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> What's the fastest way to know if a row exists?
>>>>>>>>
>>>>>>>> Today I'm doing this:
>>>>>>>>
>>>>>>>> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>>>>>>>> Result entry_exist = table_entry.get(get_entry_exist);
>>>>>>>>
>>>>>>>> But should this be faster?
>>>>>>>> Get get_entry_exist = new Get(key);
>>>>>>>> Result entry_exist = table_entry.get(get_entry_exist);
>>>>>>>>
>>>>>>>> There is only one CF and one C on my table.
>>>>>>>>
>>>>>>>> Or is there an even faster way?
>>>>>>>>
>>>>>>>> Also, is there a way to make that even faster? I think BloomFilters
>>>>>>>> can help, right?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien Mogenet
>>>>> 06.59.16.64.22
>>>>> http://www.mogenet.me
>>>
>
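
The point made in the thread about Bloom filters (they can rule a key out,
but cannot confirm that it exists) can be illustrated with a toy filter.
This is only a sketch in plain Java: ToyBloomFilter, its hash functions and
its sizes are made up for illustration and are not HBase's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy Bloom filter: answers "definitely absent" or "maybe present". */
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Derives the i-th bit position from two simple hashes (double hashing). */
    private int position(byte[] key, int i) {
        int h1 = 0;
        int h2 = 0;
        for (byte b : key) {
            h1 = 31 * h1 + b;
            h2 = 131 * h2 + b + 7;
        }
        // floorMod keeps the index non-negative even if the sum overflows.
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** false means the key was never added; true means it may have been. */
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A false answer from mightContain lets a lookup skip a file entirely. A true
answer still has to be confirmed by reading the file, because two different
keys can set the same bits.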
