Finally, I looked at how exists(Get) is done and built exists(List<Get>)... (HBASE-7503)

I will run some benchmarks to compare which is faster: batch(List<Get>) or exists(List<Get>)... I built it for 0.94 too and will deploy the updated build on my cluster...

2013/1/6, Asaf Mesika <[email protected]>:
> Why not write your own filter class which you can initialize with a
> set of keys to search for.
> The HTable on the client side will split the keys based on row keys so
> they will be sent to the right regions. There your filter can utilize
> the SEEK_NEXT_USING_HINT return code to seek efficiently on that set of
> key values.
> This will ensure you do this search in one rpc call.
> Your filter can also transform the KeyValue so that only the row keys
> are returned.
>
> Sent from my iPad
>
> On 6 Jan 2013, at 05:46, Mohamed Ibrahim <[email protected]> wrote:
>
>> Sorry, I didn't notice your email about packing 500 operations before.
>>
>> You might actually benefit from checking with a batch of Gets vs
>> individual exists.
>>
>> Best,
>> Mohamed
>>
>>
>> On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>>
>>> Hum, very interesting!
>>>
>>> Now, what's the best option? An array of gets, which will retrieve more
>>> information? Or multiple HTable.exists calls, one by one?
>>>
>>> The best would have been to have an array of gets passed to the
>>> exist... I will see how big it is to add that...
>>>
>>> JM
>>>
>>> 2013/1/4, Mohamed Ibrahim <[email protected]>:
>>>> What about HTable.exists?
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
>>>>
>>>> I think that should work if the Get has only the row key.
>>>>
>>>> Mohamed
>>>>
>>>>
>>>> On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
>>>> <[email protected]> wrote:
>>>>
>>>>> On every Get, BloomFilter is acting as a filter (!) on top of each
>>>>> HFile and allows to check if a key is absent from the HFile. So yes,
>>>>> you will benefit from these filters.
>>>>>
>>>>>
>>>>> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Is KeyOnlyFilter using the BloomFilters too?
>>>>>>
>>>>>> Here is, with more details, what I'm doing.
>>>>>>
>>>>>> Few questions.
>>>>>> - Can I create one single KeyOnlyFilter and give the same filter to
>>>>>> all the gets?
>>>>>> - Will bloom filters benefit in such a scenario? My key is small. Let's
>>>>>> say average 128 bytes.
>>>>>>
>>>>>> The goal here is to check about 500 entries at a time to validate if
>>>>>> they already exist or not.
>>>>>>
>>>>>> In my MR, I'm starting when I have more than 100K lines to handle,
>>>>>> and each line can have up to 1K entries. So it can result in up to 100M
>>>>>> gets... Job took initially 500 minutes to complete. I have added a few
>>>>>> pretty good nodes and it's now taking less than 300 minutes. But I
>>>>>> would like to get under 100 minutes if I can...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> Vector<Get> gets_entry_exist = new Vector<Get>();
>>>>>> for (Entry entry : entries.getEntries())
>>>>>> {
>>>>>>     Get entry_exist = new Get(entry.toKey());
>>>>>>     entry_exist.setFilter(new KeyOnlyFilter());
>>>>>>     gets_entry_exist.add(entry_exist);
>>>>>> }
>>>>>>
>>>>>> Result[] result_entry_exist = table_entry.get(gets_entry_exist);
>>>>>>
>>>>>> int index = 0;
>>>>>> for (Entry entry : entries.getEntries())
>>>>>> {
>>>>>>     boolean isEmpty = result_entry_exist[index++].isEmpty();
>>>>>>     if (isEmpty)
>>>>>>     {
>>>>>>         // Process here
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> 2013/1/4, Damien Hardy <[email protected]>:
>>>>>>> Hello Jean-Marc,
>>>>>>>
>>>>>>> BloomFilters are just designed for that.
>>>>>>>
>>>>>>> But they say if a row doesn't exist with a hash of the key (not the
>>>>>>> opposite, 2 rowkeys could have the same hash result).
>>>>>>>
>>>>>>> If you want to be sure the rowkey exists you have to search for it
>>>>>>> in the HFile (the whole mechanism is transparent with the get()).
>>>>>>>
>>>>>>> There is also a KeyOnlyFilter
>>>>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
>>>>>>> that avoids returning all the columns of the existing key
>>>>>>> (which could be heavy).
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> --
>>>>>>> Damien
>>>>>>>
>>>>>>>
>>>>>>> 2013/1/4 Jean-Marc Spaggiari <[email protected]>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> What's the fastest way to know if a row exists?
>>>>>>>>
>>>>>>>> Today I'm doing that:
>>>>>>>>
>>>>>>>> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
>>>>>>>> Result entry_exist = table_entry.get(get_entry_exist);
>>>>>>>>
>>>>>>>> But should this be faster?
>>>>>>>> Get get_entry_exist = new Get(key);
>>>>>>>> Result entry_exist = table_entry.get(get_entry_exist);
>>>>>>>>
>>>>>>>> There is only one CF and one C on my table.
>>>>>>>>
>>>>>>>> Or is there an even faster way?
>>>>>>>>
>>>>>>>> Also, is there a way to make that even faster? I think BloomFilters
>>>>>>>> can help, right?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien Mogenet
>>>>> 06.59.16.64.22
>>>>> http://www.mogenet.me
>>>
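
A side note on the Bloom filter point quoted above (a filter can say a key is absent, but two row keys can share the same hash, so "present" is only a maybe). Below is a minimal, self-contained sketch of that behaviour. It is not HBase's implementation: the class name TinyBloom, the bit-array size and the two hash functions are made up purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom filter for illustration only; NOT the one HBase uses.
class TinyBloom {
    private final BitSet bits;
    private final int size;

    TinyBloom(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash functions over the key bytes (arbitrary choices for this sketch).
    private int h1(byte[] key) {
        int h = 17;
        for (byte b : key) h = h * 31 + b;
        return Math.floorMod(h, size);
    }

    private int h2(byte[] key) {
        int h = 7;
        for (byte b : key) h = h * 131 + (b ^ 0x5f);
        return Math.floorMod(h, size);
    }

    // Record a row key by setting the bit chosen by each hash function.
    void add(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        bits.set(h1(k));
        bits.set(h2(k));
    }

    // false => the key was definitely never added.
    // true  => the key MAY have been added (other keys can set the same bits).
    boolean mightContain(String rowKey) {
        byte[] k = rowKey.getBytes(StandardCharsets.UTF_8);
        return bits.get(h1(k)) && bits.get(h2(k));
    }

    public static void main(String[] args) {
        TinyBloom bloom = new TinyBloom(1 << 16);
        bloom.add("row-0001");
        // An added key is always reported as possibly present.
        System.out.println(bloom.mightContain("row-0001"));
        // A key that was never added is usually reported absent;
        // a "true" here would be a false positive.
        System.out.println(bloom.mightContain("row-9999"));
    }
}
```

With a deliberately tiny filter (for example new TinyBloom(1)) every key maps to the same bit, so any lookup after a single add() answers "maybe present". That is the collision case Damien describes, and it is why a real read of the HFile is still needed to confirm that a row exists.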
