Re: [Announce] 张铎 (Duo Zhang) is Apache HBase PMC chair

2019-07-20 Thread Anoop Sam John
Congrats Duo.

Thanks Misty for your great work as the PMC chair.

Anoop

On Sat, Jul 20, 2019 at 12:07 AM Xu Cang  wrote:

> Thank you Misty!
> Congratulations Duo, thanks for taking extra work!
>
> On Fri, Jul 19, 2019 at 11:23 AM Zach York 
> wrote:
>
> > Congratulations Duo! Thanks for offering to take on the additional work!
> >
> > On Fri, Jul 19, 2019 at 10:34 AM Stack  wrote:
> >
> > > Thank you Misty for your years of service (FYI, for non-PMCers, the
> > reports
> > > Misty wrote to the Apache Board on our behalf were repeatedly called
> out
> > > for their quality and thoughtfulness).
> > >
> > > Duo Zhang, thank you for taking on the mantle.
> > >
> > > S
> > >
> > > On Thu, Jul 18, 2019 at 10:46 AM Misty Linville 
> > wrote:
> > >
> > > > Each Apache project has a project management committee (PMC) that
> > > oversees
> > > > governance of the project, votes on new committers and PMC members,
> and
> > > > ensures that the software we produce adheres to the standards of the
> > > > Foundation. One of the roles on the PMC is the PMC chair. The PMC
> chair
> > > > represents the project as a Vice President of the Foundation and
> > > > communicates to the board about the project's health, once per
> quarter
> > > and
> > > > at other times as needed.
> > > >
> > > > It's been my honor to serve as your PMC chair since 2017, when I took
> > > over
> > > > from Andrew Purtell. I've decided to step back from my volunteer ASF
> > > > activities to leave room in my life for other things. The HBase PMC
> > > > nominated Duo for this role, and Duo has kindly agreed! The board
> > passed
> > > > this resolution in its meeting yesterday[1] and it is already
> > > official[2].
> > > > Congratulations, Duo, and thank you for continuing to honor the
> project
> > > > with your dedication.
> > > >
> > > > Misty
> > > >
> > > > [1] The minutes have not yet posted at the time of this email, but
> will
> > > be
> > > > available at http://www.apache.org/foundation/records/minutes/2019/.
> > > > [2] https://www.apache.org/foundation/#who-runs-the-asf
> > > >
> > >
> >
>


HBase Meetups in India

2019-05-08 Thread Anoop Sam John
Hi all,
         I have seen HBase meetups happening in the Bay Area as well as in
different cities in the PRC. I believe we have many devs and users based in
India. Some of us were discussing starting this kind of meetup. In order to
know the devs/users and which cities they are based out of, I have created a
Google group [1]. Please join this group. This group is to help us better
decide/discuss the place, time, host, etc.

Note: when we decide to have such meetups, they will anyway be announced on
dev@ and user@ too.

Anoop

[1] https://groups.google.com/forum/#!forum/hbase-india


RE: Data not loaded in table via ImportTSV

2013-04-16 Thread Anoop Sam John
Hi,
   Have you used the LoadIncrementalHFiles (completebulkload) tool after
ImportTSV? With -Dimporttsv.bulk.output the job only writes HFiles; they still
have to be loaded into the table.
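
As a rough sketch (0.94-era client API; the table name and bulk output path
below are just the ones from your command, adjust as needed), the load step
looks like this:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

  public class BulkLoadCustomers {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "CUSTOMERS");
      try {
        // Load the HFiles produced by ImportTSV (-Dimporttsv.bulk.output=...) into the table.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("hdfs://cldx-1139-1033:9000/hbase/storefileoutput"), table);
      } finally {
        table.close();
      }
    }
  }

The completebulkload tool on the command line does the same thing.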

-Anoop-

From: Omkar Joshi [omkar.jo...@lntinfotech.com]
Sent: Tuesday, April 16, 2013 12:01 PM
To: user@hbase.apache.org
Subject: Data not loaded in table via ImportTSV

Hi,

The background thread is this :

http://mail-archives.apache.org/mod_mbox/hbase-user/201304.mbox/%3ce689a42b73c5a545ad77332a4fc75d8c1efbd80...@vshinmsmbx01.vshodc.lntinfotech.com%3E

I'm referring to the HBase doc. 
http://hbase.apache.org/book/ops_mgt.html#importtsv

Accordingly, my command is :

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop 
jar ${HBASE_HOME}/hbase-0.94.6.1.jar importtsv '-Dimporttsv.separator=;' 
-Dimporttsv.columns=HBASE_ROW_KEY,CUSTOMER_INFO:NAME,CUSTOMER_INFO:EMAIL,CUSTOMER_INFO:ADDRESS,CUSTOMER_INFO:MOBILE
  -Dimporttsv.bulk.output=hdfs://cldx-1139-1033:9000/hbase/storefileoutput 
CUSTOMERS hdfs://cldx-1139-1033:9000/hbase/copiedFromLocal/customer.txt

/*classpath echoed here*/

13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:java.library.path=/home/hduser/hadoop_ecosystem/apache_hadoop/hadoop_installation/hadoop-1.0.4/libexec/../lib/native/Linux-amd64-64
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:java.io.tmpdir=/tmp
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:java.compiler=NA
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:os.version=3.2.0-23-generic
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client environment:user.name=hduser
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:user.home=/home/hduser
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Client 
environment:user.dir=/home/hduser/hadoop_ecosystem/apache_hbase/hbase_installation/hbase-0.94.6.1/bin
13/04/16 17:18:43 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=cldx-1140-1034:2181 sessionTimeout=18 watcher=hconnection
13/04/16 17:18:43 INFO zookeeper.ClientCnxn: Opening socket connection to 
server cldx-1140-1034/172.25.6.71:2181. Will not attempt to authenticate using 
SASL (unknown error)
13/04/16 17:18:43 INFO zookeeper.RecoverableZooKeeper: The identifier of this 
process is 5483@cldx-1139-1033
13/04/16 17:18:43 INFO zookeeper.ClientCnxn: Socket connection established to 
cldx-1140-1034/172.25.6.71:2181, initiating session
13/04/16 17:18:43 INFO zookeeper.ClientCnxn: Session establishment complete on 
server cldx-1140-1034/172.25.6.71:2181, sessionid = 0x13def2889530023, 
negotiated timeout = 18
13/04/16 17:18:44 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=cldx-1140-1034:2181 sessionTimeout=18 
watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@34d03009
13/04/16 17:18:44 INFO zookeeper.RecoverableZooKeeper: The identifier of this 
process is 5483@cldx-1139-1033
13/04/16 17:18:44 INFO zookeeper.ClientCnxn: Opening socket connection to 
server cldx-1140-1034/172.25.6.71:2181. Will not attempt to authenticate using 
SASL (unknown error)
13/04/16 17:18:44 INFO zookeeper.ClientCnxn: Socket connection established to 
cldx-1140-1034/172.25.6.71:2181, initiating session
13/04/16 17:18:44 INFO zookeeper.ClientCnxn: Session establishment complete on 
server cldx-1140-1034/172.25.6.71:2181, sessionid = 0x13def2889530024, 
negotiated timeout = 18
13/04/16 17:18:44 INFO zookeeper.ZooKeeper: Session: 0x13def2889530024 closed
13/04/16 17:18:44 INFO zookeeper.ClientCnxn: EventThread shut down
13/04/16 17:18:44 INFO mapreduce.HFileOutputFormat: Looking up current regions 
for table org.apache.hadoop.hbase.client.HTable@238cfdf
13/04/16 17:18:44 INFO mapreduce.HFileOutputFormat: Configuring 1 reduce 
partitions to match current region count
13/04/16 17:18:44 INFO mapreduce.HFileOutputFormat: Writing partition 
information to 
hdfs://cldx-1139-1033:9000/user/hduser/partitions_4159cd24-b8ff-4919-854b-a7d1da5069ad
13/04/16 17:18:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/16 17:18:44 INFO zlib.ZlibFactory: Successfully loaded  initialized 
native-zlib library
13/04/16 17:18:44 INFO compress.CodecPool: Got brand-new compressor
13/04/16 17:18:44 INFO mapreduce.HFileOutputFormat: Incremental table output 
configured.
13/04/16 17:18:47 INFO input.FileInputFormat: Total input paths to process : 1
13/04/16 17:18:47 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/16 17:18:47 INFO mapred.JobClient: Running job: job_201304091909_0010
13/04/16 17:18:48 INFO mapred.JobClient:  map 0% reduce 0%
13/04/16 17:19:07 INFO mapred.JobClient:  map 100% reduce 0%
13/04/16 17:19:19 INFO mapred.JobClient:  map 100% reduce 100%
13/04/16 17:19:24 INFO mapred.JobClient: Job complete: job_201304091909_0010
13/04/16 17:19:24 INFO 

RE: HBase random read performance

2013-04-15 Thread Anoop Sam John
Ankit,
   I guess you might be using the default HFile block size, which is 64KB.
For random gets a lower value will be better. Try something like 8KB and
check the latency.

Yes, of course blooms can help (if major compaction was not done at the time of
testing).
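
For example, a rough sketch of lowering the block size with the admin API
(table and family names here are only placeholders for yours):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class LowerBlockSize {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
      HTableDescriptor htd = admin.getTableDescriptor(Bytes.toBytes("mytable"));
      HColumnDescriptor hcd = htd.getFamily(Bytes.toBytes("cf"));
      hcd.setBlocksize(8 * 1024);            // 8KB data blocks instead of the default 64KB
      admin.disableTable("mytable");
      admin.modifyColumn("mytable", hcd);    // new size applies to HFiles written after this (flush/compaction)
      admin.enableTable("mytable");
      admin.close();
    }
  }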

-Anoop-

From: Ankit Jain [ankitjainc...@gmail.com]
Sent: Saturday, April 13, 2013 11:01 AM
To: user@hbase.apache.org
Subject: HBase random read performance

Hi All,

We are using HBase 0.94.5 and Hadoop 1.0.4.

We have HBase cluster of 5 nodes(5 regionservers and 1 master node). Each
regionserver has 8 GB RAM.

We have loaded 25 millions records in HBase table, regions are pre-split
into 16 regions and all the regions are equally loaded.

We are getting very low random read performance while performing multi get
from HBase.

We are passing random 1 row-keys as input, while HBase is taking around
17 secs to return 1 records.

Please suggest some tuning to increase HBase read performance.

Thanks,
Ankit Jain
iLabs



--
Thanks,
Ankit Jain

RE: coprocessor load test metrics

2013-04-15 Thread Anoop Sam John
I don't think I understood your question completely, sorry.
I cannot see any special metrics from the CP parts. If you can tell us what your
requirement is, we can check it out.

-Anoop-

From: Kumar, Deepak8  [deepak8.ku...@citi.com]
Sent: Monday, April 15, 2013 4:48 PM
To: Anoop Sam John; 'user@hbase.apache.org'
Subject: coprocessor load test metrics

Hi Anoop,
Do we have any metrics for load testing for HBase Coprocessors?

Regards,
Deepak

RE: Essential column family performance

2013-04-09 Thread Anoop Sam John
Good finding, Lars & team :)

-Anoop-

From: lars hofhansl [la...@apache.org]
Sent: Wednesday, April 10, 2013 9:46 AM
To: user@hbase.apache.org
Subject: Re: Essential column family performance

That part did not show up in the profiling session.
It was just the unnecessary seek that slowed it all down.

-- Lars




 From: Ted Yu yuzhih...@gmail.com
To: user@hbase.apache.org
Sent: Tuesday, April 9, 2013 9:03 PM
Subject: Re: Essential column family performance

Looking at populateFromJoinedHeap():

  KeyValue kv = populateResult(results, this.joinedHeap, limit,
      joinedContinuationRow.getBuffer(), joinedContinuationRow.getRowOffset(),
      joinedContinuationRow.getRowLength(), metric);

...

  Collections.sort(results, comparator);

Arrays.mergeSort() is used in the Collections.sort() call.

There seems to be some optimization we can do above: we can record the size
of results before calling populateResult(). Upon return, we can merge the
two segments without resorting to Arrays.mergeSort() which is recursive.


On Tue, Apr 9, 2013 at 6:21 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. with only 1 rows that would all fit in the memstore.

 This aspect should be enhanced in the test.

 Cheers

 On Tue, Apr 9, 2013 at 6:17 PM, Lars Hofhansl lhofha...@yahoo.com wrote:

 Also the unittest tests with only 1 rows that would all fit in the
 memstore. Seek vs reseek should make little difference for the memstore.

 We tested with 1m and 10m rows, and flushed the memstore  and compacted
 the store.

 Will do some more verification later tonight.

 -- Lars


 Lars H lhofha...@yahoo.com wrote:

 Your slow scanner performance seems to vary as well. How come? Slow is
 with the feature off.
 
 I don't how reseek can be slower than seek in any scenario.
 
 -- Lars
 
 Ted Yu yuzhih...@gmail.com schrieb:
 
 I tried using reseek() as suggested, along with my patch from
 HBASE-8306 (30%
 selection rate, random distribution and FAST_DIFF encoding on both
 column
 families).
 I got uneven results:
 
 2013-04-09 16:59:01,324 INFO  [main]
 regionserver.TestJoinedScanners(167):
 Slow scanner finished in 7.529083 seconds, got 1546 rows
 
 2013-04-09 16:59:06,760 INFO  [main]
 regionserver.TestJoinedScanners(167):
 Joined scanner finished in 5.43579 seconds, got 1546 rows
 ...
 2013-04-09 16:59:12,711 INFO  [main]
 regionserver.TestJoinedScanners(167):
 Slow scanner finished in 5.95016 seconds, got 1546 rows
 
 2013-04-09 16:59:20,240 INFO  [main]
 regionserver.TestJoinedScanners(167):
 Joined scanner finished in 7.529044 seconds, got 1546 rows
 
 FYI
 
 On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl la...@apache.org wrote:
 
  We did some tests here.
  I ran this through the profiler against a local RegionServer and
 found the
  part that causes the slowdown is a seek called here:
    boolean mayHaveData =
      (nextJoinedKv != null && nextJoinedKv.matchingRow(currentRow, offset, length))
      || (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
          && joinedHeap.peek() != null
          && joinedHeap.peek().matchingRow(currentRow, offset, length));
 
  Looking at the code, this is needed because the joinedHeap can fall
  behind, and hence we have to catch it up.
  The key observation, though, is that the joined heap can only ever be
  behind, and hence we do not need a seek, but only a reseek.
 
  Deploying a RegionServer with the seek replaced with reseek we see an
  improvement in *all* cases.
 
  I'll file a jira with a fix later.
 
  -- Lars
 
 
 
  
   From: James Taylor jtay...@salesforce.com
  To: user@hbase.apache.org
  Sent: Monday, April 8, 2013 6:53 PM
  Subject: Re: Essential column family performance
 
  Good idea, Sergey. We'll rerun with larger non essential column family
  values and see if there's a crossover point. One other difference for
 us
  is that we're using FAST_DIFF encoding. We'll try with no encoding
 too.
  Our table has 20 million rows across four regions servers.
 
  Regarding the parallelization we do, we run multiple scans in parallel
  instead of one single scan over the table. We use the region
 boundaries
  of the table to divide up the work evenly, adding a start/stop key for
  each scan that corresponds to the region boundaries. Our client then
  does a final merge/aggregation step (i.e. adding up the count it gets
  back from the scan for each region).
 
  On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
   IntegrationTestLazyCfLoading uses randomly distributed keys with the
   following condition for filtering:
    1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where
 rowKey
   is hex string of MD5 key.
   Then, there are 2 lazy CFs, each of which has a value of 4-64k.
   This test also showed significant improvement IIRC, so random
  distribution
   and high %%ge 

RE: Scanner returning subset of data

2013-04-08 Thread Anoop Sam John
Randy,
As Ted suggested, can you look closely at the client logs (and on the RS side
too)? Are there next() call retries happening from the client side because of
RPC timeouts?
In such a case this kind of issue can happen. I suspect you hit HBASE-5974.
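
If that is what is happening, one quick thing to try from the client side is a
smaller caching value so that each next() RPC returns well within the 60s
lease/RPC timeout. A rough sketch (table name and caching value are only
examples):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class NarrowCachingScan {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
      Scan scan = new Scan();
      scan.setCaching(100);   // fewer rows fetched per next() RPC; less chance of a timeout mid-scan
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process row
        }
      } finally {
        scanner.close();
        table.close();
      }
    }
  }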

-Anoop-

From: Ted Yu [yuzhih...@gmail.com]
Sent: Tuesday, April 09, 2013 2:48 AM
To: user@hbase.apache.org
Subject: Re: Scanner returning subset of data

0.92.1 is pretty old. Are you able to deploy newer release, e.g. 0.94.6.1
and see if the problem can be reproduced ?

Otherwise we have two choices:
1. write a unit / integration test that shows this bug
2. see more of the region server / client logs so that further analysis can
be performed.

Thanks

On Mon, Apr 8, 2013 at 2:07 PM, Randy Fox randy@connexity.com wrote:

 I have a needle-in-the-haystack type scan.  I have tried to read all the
 issues with ScannerTimeoutException and LeaseException, but do have not
 seen anyone report what I am seeing.

 Running 0.92.1-cdh4.1.1.  All config wrt to timeouts and periods are
 default: 60s.

 When I run a scanner that will return few results and my cache setting is
 a bit too high for results to return in 60 seconds, i sometimes get a
 subset of results (the last few returnable rows) and no exception.  it may
 take a while to get those results.  Other times I get the LeaseException,
 the ScannerTimeoutException, or the RetriesExhaustedException. I can see
 throwExceptionIfCallerDisconne**cted in RS logs.

 The incorrect return set has me very concerned.  I can easily reproduce
 this with my own code or hbase shell.

 Any help is greatly appreciated.

 Cheers,

 Randy Fox






RE: Disabling balancer permanently in HBase

2013-04-07 Thread Anoop Sam John
HBASE-6260 made the balancer state persistent in ZK, so a restart of the
Master won't reset it. But this is available only from 0.95 onwards.
Just an FYI.
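
The switch can also be flipped from client code (same as balance_switch in the
shell); just note that before 0.95 it is still not persisted across master
restarts. A rough sketch:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class BalancerOff {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
      boolean previous = admin.balanceSwitch(false);   // returns the previous state
      System.out.println("Balancer was previously " + (previous ? "on" : "off"));
      admin.close();
    }
  }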

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Monday, April 08, 2013 6:05 AM
To: user@hbase.apache.org
Subject: Re: Disabling balancer permanently in HBase

2 other options:
1) Build your own balancer which always returns null and set it with
hbase.master.loadbalancer.class; (*)
2) Give a dummy non-existing class for hbase.master.loadbalancer.class ? (**)

(*) hbase.master.loadbalancer.class is still missing in the
documentation. HBASE-7296 has been opened last year. Might be good if
someone can apply it.
(**) I have not tried that, so I have no idea if this is working or
not. However, if the class given for the load balancer doesn't  exist,
I think it will simply log an error on the logs and return. But you
will have to test that.

Finally, maybe we should have something like
hbase.master.loadbalancer.disable that we can setup to TRUE if we want
to totaly disable load balancing? (Even if this is not recommanded).

JM


2013/4/7 Stack st...@duboce.net:
 Try setting the hbase.balancer.period to a very high number in you
 hbase-site.xml:
 http://hbase.apache.org/book.html#hbase.master.dns.nameserver

 St.Ack


 On Sun, Apr 7, 2013 at 3:14 PM, Akshay Singh akshay_i...@yahoo.com wrote:

 Hi,

 I am trying to permanently switch off the balancer in HBase, as my request
 distribution is not uniform across the data.

 I understand that this can be done by, setting balance_switch to false in
 hbase shell

 hbase(main):023:0 balance_switch false

 However, value of balance_switch is reset back to true.. every time I
 restart the HBase cluster (which cannot be avoided in my deployment
 scenario).

 So my question is : Is there a way to permanently/persistently disable the
 hbase balancer ? I could not find a property for this balance_switch.

 I though of one possible solution, which is to set 'hbase.balancer.period'
 property to '-1'.. but it does not seems to work.

 Looking for suggestions.

 Thanks,
 Akshay

RE: Getting less write throughput due to more number of columns

2013-03-26 Thread Anoop Sam John
When the number of columns (qualifiers) is large, yes, it can impact
performance. In HBase the storage is everywhere in terms of KVs; the key will be
something like rowkey + cfname + columnname + TS...

So when you have 26 cells in a Put there will be a repetition of many bytes in
the keys (one KV per column), so you end up transferring more data. Within the
memstore more data (actual KV size) gets written, and so flushes become more
frequent, etc.
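
For example, a Put like this rough sketch (made-up names) becomes 26 separate
KeyValues, each carrying the full rowkey + family + qualifier + timestamp again:

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WidePutExample {
    public static Put buildWidePut() {
      Put put = new Put(Bytes.toBytes("row-0001"));
      byte[] cf = Bytes.toBytes("d");
      for (int i = 1; i <= 26; i++) {
        // each column becomes its own KeyValue, so the key bytes are repeated 26 times
        put.add(cf, Bytes.toBytes("col" + i), Bytes.toBytes("value" + i));
      }
      return put;
    }
  }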

Have a look at Intel Panthera Document Store impl.

-Anoop-

From: Ankit Jain [ankitjainc...@gmail.com]
Sent: Monday, March 25, 2013 10:19 PM
To: user@hbase.apache.org
Subject: Getting less write throughput due to more number of columns

Hi All,

I am writing a records into HBase. I ran the performance test on following
two cases:

Set1: Input record contains 26 columns and record size is 2Kb.

Set2: Input record contain 1 column and record size is 2Kb.

In second case I am getting 8MBps more performance than step.

are the large number of columns have any impact on write performance and If
yes, how we can overcome it.

--
Thanks,
Ankit Jain

RE: Compaction problem

2013-03-26 Thread Anoop Sam John
@tarang
As per the 4G max heap size, you will get by default 1.4G total memory for all
the memstores (5/6 regions); by default 35% of the heap goes to the memstores.
Is your process only write centric? If reads are rare, think of increasing this
global heap space setting. Or else, can you increase the 4G heap size? (Still,
1G for a single memstore might be too much. You are now getting flushes because
of global heap pressure before each memstore reaches 1GB.)
(hbase.regionserver.global.memstore.lowerlimit /
hbase.regionserver.global.memstore.upperlimit)

hbase.hregion.max.filesize is given as 1 GB. Try increasing this. Check whether
region splits are happening frequently in your case.

Also look at all the compaction related params, and tell us about the status of
the flush and compaction queues.

-Anoop-


From: tarang dawer [tarang.da...@gmail.com]
Sent: Friday, March 22, 2013 9:04 PM
To: user@hbase.apache.org
Subject: Re: Compaction problem

3 region servers 2 region servers having 5 regions each , 1 having 6
+2(meta and root)
1 CF
set HBASE_HEAPSIZE in hbase-env.sh as 4gb .

is the flush size okay ? or do i need to reduce/increase it ?

i'll look into the flushQ and compactionQ size and get back to you .

do these parameters seem okay to you ? if something seems odd / not in
order , please do tell

Thanks
Tarang Dawer

On Fri, Mar 22, 2013 at 8:21 PM, Anoop John anoop.hb...@gmail.com wrote:

 How many regions per  RS? And CF in table?
 What is the -Xmx for RS process? You will bget 35% of that memory for all
 the memstores in the RS.
 hbase.hregion.memstore.flush.size = 1GB!!

 Can you closely observe the flushQ size and compactionQ size?  You may be
 getting so many small file flushes(Due to global heap pressure) and
 subsequently many minor compactions.

 -Anoop-

 On Fri, Mar 22, 2013 at 8:14 PM, tarang dawer tarang.da...@gmail.com
 wrote:

  Hi
  As per my use case , I have to write around 100gb data , with a ingestion
  speed of around 200 mbps. While writing , i am getting a performance hit
 by
  compaction , which adds to the delay.
  I am using a 8 core machine with 16 gb RAM available., 2 Tb hdd 7200RPM.
  Got some idea from the archives and  tried pre splitting the regions ,
  configured HBase with following parameters(configured the parameters in a
  haste , so please guide me if anything's out of order) :-
 
 
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>1073741824</value>
  </property>

  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>5</value>
  </property>
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>0</value>
  </property>
  <property>
    <name>hbase.hstore.blockingWaitTime</name>
    <value>3</value>
  </property>
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>200</value>
  </property>

  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>300</value>
  </property>
 
 
  but still m not able to achieve the optimal rate , getting around 110
 mbps.
  Need some optimizations ,so please could you help out ?
 
  Thanks
  Tarang Dawer
 
 
 
 
 
  On Fri, Mar 22, 2013 at 6:05 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
   Hi Tarang,
  
   I will recommand you to take a look at the list archives first to see
   all the discussions related to compaction. You will found many
   interesting hints and tips.
  
  
  
 
 http://search-hadoop.com/?q=compactionsfc_project=HBasefc_type=mail+_hash_+user
  
   After that, you will need to provide more details regarding how you
   are using HBase and how the compaction is impacting you.
  
   JM
  
   2013/3/22 tarang dawer tarang.da...@gmail.com:
Hi
I am using HBase 0.94.2 currently. While using it  , its write
   performance,
due to compaction is being affeced by compaction.
Please could you suggest some quick tips in relation to how to deal
  with
   it
?
   
Thanks
Tarang Dawer
  
 


RE: Truncate hbase table based on column family

2013-03-26 Thread Anoop Sam John
varaprasad,
   Please see HBaseAdmin#deleteColumn(). You should disable the table before
making any schema changes and enable it back after that.
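
If the goal is to 'truncate' only family F1, one way (assuming it is acceptable
to drop the family and re-add it, losing any custom family settings unless you
copy them over) is roughly:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class TruncateFamily {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
      admin.disableTable("emp");
      admin.deleteColumn("emp", "F1");                      // drops the family and all of its data
      admin.addColumn("emp", new HColumnDescriptor("F1"));  // re-add it empty, here with default settings
      admin.enableTable("emp");
      admin.close();
    }
  }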

-Anoop-


From: varaprasad.bh...@polarisft.com [varaprasad.bh...@polarisft.com]
Sent: Tuesday, March 26, 2013 2:15 PM
To: user@hbase.apache.org
Subject: Re: Truncate hbase table based on column family

Yes. If there is a table having data in column families F1, F2. I want to 
truncate the data of column family F1 alone.
Is it possible?


Thanks  Regards,
Varaprasada Reddy

-Ted Yu yuzhih...@gmail.com wrote: -
To: user@hbase.apache.org
From: Ted Yu yuzhih...@gmail.com
Date: 03/20/2013 08:12PM
Subject: Re: Truncate hbase table based on column family

Can you clarify your question ?

Did you mean that you only want to drop certain column families ?

Thanks

On Wed, Mar 20, 2013 at 7:15 AM, varaprasad.bh...@polarisft.com wrote:

 Hi All,

 Can we truncate a table in hbase based on the column family.
 Please give your comments.


 Thanks  Regards,
 Varaprasada Reddy


 This e-Mail may contain proprietary and confidential information and is
 sent for the intended recipient(s) only.  If by an addressing or
 transmission error this mail has been misdirected to you, you are requested
 to delete this mail immediately. You are also hereby notified that any use,
 any form of reproduction, dissemination, copying, disclosure, modification,
 distribution and/or publication of this e-mail message, contents or its
 attachment other than by its intended recipient/s is strictly prohibited.

 Visit us at http://www.polarisFT.com



This e-Mail may contain proprietary and confidential information and is sent 
for the intended recipient(s) only.  If by an addressing or transmission error 
this mail has been misdirected to you, you are requested to delete this mail 
immediately. You are also hereby notified that any use, any form of 
reproduction, dissemination, copying, disclosure, modification, distribution 
and/or publication of this e-mail message, contents or its attachment other 
than by its intended recipient/s is strictly prohibited.

Visit us at http://www.polarisFT.com

RE: Is there a way to only scan data in memstore

2013-03-21 Thread Anoop Sam John
How can you be sure the data will be in the memstore only? What if a flush
happens in between? Which version are you using?
In 94.x (I am not sure about the exact .x version number) there is a
preStoreScannerOpen() CP hook. Its implementation can return a KeyValueScanner
for a store (in your implementation the scanner could cover only the
memstore?). Please see how this hook is being used and how the StoreScanner is
being created.

Can you think of controlling this using a TimeRange on the Scan?
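
For example, if you track the time of the last flush yourself, a TimeRange
limits the scan to cells written after it, which will normally still be in the
memstore. A rough sketch (no guarantee if a flush happens in between; table
name etc. are placeholders):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class RecentDataScan {
    public static void main(String[] args) throws Exception {
      long lastFlushTs = Long.parseLong(args[0]);       // you have to track/estimate this yourself
      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
      Scan scan = new Scan();
      scan.setTimeRange(lastFlushTs, Long.MAX_VALUE);   // only cells newer than the last flush
      ResultScanner rs = table.getScanner(scan);
      try {
        for (Result r : rs) {
          // process recently written rows
        }
      } finally {
        rs.close();
        table.close();
      }
    }
  }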

-Anoop-

From: Snake [yfw...@xingcloud.com]
Sent: Thursday, March 21, 2013 2:35 PM
To: user@hbase.apache.org
Subject: Is there a way to only scan data in memstore

Hi,

I just want to scan data in memstore, but I saw InternalScan is private class.  
Is there a way i can do it without change the hbase source code?

Thanks,
Snake

RE: NameNode of Hadoop Crash?

2013-03-18 Thread Anoop Sam John
Could you ask this question on the HDFS user group, please?

-Anoop-

From: bhushan.kandalkar [bhushan.kandal...@harbingergroup.com]
Sent: Monday, March 18, 2013 12:29 PM
To: user@hbase.apache.org
Subject: NameNode of Hadoop Crash?

Hi Following is the error log in Nemenode log file:

2013-03-18 11:11:40,910 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStateMBean and NameNodeMXBean
2013-03-18 11:11:40,928 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring
more than 10 times
2013-03-18 11:11:40,936 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 156
2013-03-18 11:11:40,958 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files under construction = 0
2013-03-18 11:11:40,958 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 23325 loaded in 0 seconds.
2013-03-18 11:11:40,962 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode:
java.lang.NullPointerException
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:629)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)

How to resolve it? I have restarted it many times but got same error.

Thanks in advance.




--
View this message in context: 
http://apache-hbase.679495.n3.nabble.com/NameNode-of-Hadoop-Crash-tp4040453.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: Regionserver goes down while endpoint execution

2013-03-15 Thread Anoop Sam John
Himanshu explained it clearly. To make it even clearer I am adding :)

When the range of rowkeys that you are looking for is spread across 5 regions,
at the client side there will be 5 exec requests created and submitted to a
thread pool [the HBase client side thread pool associated with the HTable].
Now, as per the availability of slots in this pool, the operations will be sent
to the regions on the RSs. So there can be parallel execution on different RSs,
and within one RS itself parallel execution on different regions is also
possible.

It is up to the client app to deal with the results that the Endpoint returns.
It will return a Map, right? Region name vs. the result from that region. The
client app needs to take care of the ordering.
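
A rough sketch of the client side on 0.94 (MyProtocol and its fetchRows() are
hypothetical placeholders for your own endpoint interface and its server-side
implementation):

  import java.io.IOException;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.coprocessor.Batch;
  import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical endpoint interface; the real one must be implemented and deployed on the RSs.
  interface MyProtocol extends CoprocessorProtocol {
    List<KeyValue> fetchRows(byte[] start, byte[] stop) throws IOException;
  }

  public class EndpointClient {
    public static void main(String[] args) throws Throwable {
      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
      final byte[] start = Bytes.toBytes("rowkey1");
      final byte[] stop = Bytes.toBytes("rowkey100");
      // One exec per region in [start, stop); the calls run in parallel via the HTable thread pool.
      Map<byte[], List<KeyValue>> perRegion = table.coprocessorExec(
          MyProtocol.class, start, stop,
          new Batch.Call<MyProtocol, List<KeyValue>>() {
            public List<KeyValue> call(MyProtocol endpoint) throws IOException {
              return endpoint.fetchRows(start, stop);   // hypothetical endpoint method
            }
          });
      // perRegion is keyed by region name; merging and ordering the per-region results is up to the client.
      table.close();
    }
  }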

-Anoop-


From: hv.cs...@gmail.com [hv.cs...@gmail.com] on behalf of Himanshu Vashishtha 
[hvash...@cs.ualberta.ca]
Sent: Thursday, March 14, 2013 11:15 PM
To: user@hbase.apache.org
Subject: Re: Regionserver goes down while endpoint execution

There is no ordering guarantee for the endpoint execution, other than
that calls will be executed in parallel across all the regions.
In case you have 5 regions, then there will be 5 separate calls to
these regions. Then, you get 5 results from these regions at your
client, where you use the Callback class to aggregate the results. You
can define your ordering this aggregate class for sure.


On Thu, Mar 14, 2013 at 10:15 AM, Ted Yu yuzhih...@gmail.com wrote:
 bq. provide the rowkey range as rowkey1 to rowkey100 in endpoint RPC client

 If I understand you correctly, you perform batching at the client as
 described above.
 The order would be as you expected.

 Cheers

 On Thu, Mar 14, 2013 at 10:09 AM, Kumar, Deepak8 
 deepak8.ku...@citi.comwrote:

  Hi,

 It seems due to huge data the RegionServer is getting down. Now I am
 trying to fetch the data in parts  is running fine. I need some more info
 about the Endpoint execution:

 ** **

 My use case is to fetch the data from HBase as per some rowkey range  to
 render it at UI. Since endpoints are executed in parallel so I am looking
 to use it. 

 ** **

  Suppose I provide the rowkey range as rowkey1 to rowkey100 in
 endpoint RPC client  these rowkeys are distributed at 5 regions across 4
 region servers. If I fetch 10  records at a time, do we have any way to
 guarantee that it would come in serial order like first result would of
 rowkey1 to rowkey10, next time I set the start rowkey as rowkey11  the
 fetch would be from rowkey11 to rowkey20, irrespective of the region 
 region servers?

 ** **

 Regards,

 Deepak

 ** **

 ** **

 -Original Message-
 From: hv.cs...@gmail.com [mailto:hv.cs...@gmail.com] On Behalf Of
 Himanshu Vashishtha
 Sent: Wednesday, March 13, 2013 12:09 PM
 To: user@hbase.apache.org
 Cc: Gary Helmling; yuzhih...@gmail.com; lars hofhansl
 Subject: Re: Regionserver goes down while endpoint execution

 ** **

 On Wed, Mar 13, 2013 at 8:19 AM, Kumar, Deepak8 deepak8.ku...@citi.com
 wrote:

  Thanks guys for assisting. I am getting OOM exception yet. I have one
 query about Endpoints. As endpoint executes in parallel, so if I have a
 table which is distributed at 101 regions across 5 regionserver. Would it
 be 101 threads of endpoint executing in parallel?

 ** **

 No and Yes.

 ** **

 The endpoints are not processed as separate threads, they are processed as
 just another request (via regionserver handlers). Yes, the execution will
 be in parallel in the sense that a separate client side call will be used
 for each of the regions that are in the range you specify.

 ** **

 ** **

  Regards,

  Deepak

 ** **

  From: Gary Helmling [mailto:ghelml...@gmail.com ghelml...@gmail.com]**
 **

  Sent: Tuesday, March 12, 2013 2:14 PM

  To: user@hbase.apache.org

  Cc: lars hofhansl; Kumar, Deepak8 [CCC-OT_IT NE]

  Subject: Re: Regionserver goes down while endpoint execution

 ** **

  To expand on what Himanshu said, your endpoint is doing an unbounded
 scan on the region, so with a region with a lot of rows it's taking more
 than 60 seconds to run to the region end, which is why the client side of
 the call is timing out.  In addition you're building up an in memory list
 of all the values for that qualifier in that region, which could cause you
 to bump into OOM issues, depending on how big your values are and how
 sparse the given column qualifier is.  If you trigger an OOMException, then
 the region server would abort.

 ** **

  For this usage specifically, though -- scanning through a single column
 qualifier for all rows -- you would be better off just doing a normal
 client side scan, ie. HTable.getScanner().  Then you will avoid the client
 timeout and potential server-side memory issues.

 ** **

  On Tue, Mar 12, 2013 at 9:29 AM, Ted Yu 
 yuzhih...@gmail.commailto:yuzhih...@gmail.com wrote:

  From region server log:

 ** **

  2013-03-12 03:07:22,605 

RE: region server down when scanning using mapreduce

2013-03-12 Thread Anoop Sam John
What is the GC pattern in the RSs that are going down? In the RS logs you might
be seeing YouAreDeadExceptions...
Please try tuning your RS memory and GC options.

-Anoop-

From: Lu, Wei [w...@microstrategy.com]
Sent: Tuesday, March 12, 2013 1:42 PM
To: user@hbase.apache.org
Subject: RE: region server down when scanning using mapreduce

We turned the block cache to false and tried again, regionserver still crash 
one after another.
There are a lot of scanner lease time out, and then master log info:
RegionServer ephemeral node deleted, processing expiration 
[rs21,60020,1363010589837]
Seems the problem is not caused by block cache


Thanks

-Original Message-
From: Azuryy Yu [mailto:azury...@gmail.com]
Sent: Tuesday, March 12, 2013 1:41 PM
To: user@hbase.apache.org
Subject: Re: region server down when scanning using mapreduce

please read here http://hbase.apache.org/book.html (11.8.5. Block Cache) to
get some background of block cache.


On Tue, Mar 12, 2013 at 1:31 PM, Lu, Wei w...@microstrategy.com wrote:

 No, does block cache matter? Btw, the mr dump is a mr program we
 implemented rather than the hbase tool.

 Thanks

 -Original Message-
 From: Azuryy Yu [mailto:azury...@gmail.com]
 Sent: Tuesday, March 12, 2013 1:18 PM
 To: user@hbase.apache.org
 Subject: Re: region server down when scanning using mapreduce

 did you closed block cache when you used mr dump?
 On Mar 12, 2013 1:06 PM, Lu, Wei w...@microstrategy.com wrote:

  Hi,
 
  When we use mapreduce to dump data from a pretty large table on hbase.
 One
  region server crash and then another. Mapreduce is deployed together with
  hbase.
 
  1) From log of the region server, there are both next and multi
  operations on going. Is it because there is write/read conflict that
 cause
  scanner timeout?
  2) Region server has 24 cores, and # max map tasks is 24 too; the table
  has about 30 regions (each of size 0.5G) on the region server, is it
  because cpu is all used by mapreduce and that case region server slow and
  then timeout?
  2) current hbase.regionserver.handler.count is 10 by default, should it
 be
  enlarged?
 
  Please give us some advices.
 
  Thanks,
  Wei
 
 
  Log information:
 
 
  [Regionserver rs21:]
 
  2013-03-11 18:36:28,148 INFO
  org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/
  adcbg21.machine.wisdom.com
 ,60020,1363010589837/rs21%2C60020%2C1363010589837.1363025554488,
  entries=22417, filesize=127539793.  for
 
 /hbase/.logs/rs21,60020,1363010589837/rs21%2C60020%2C1363010589837.1363026988052
  2013-03-11 18:37:39,481 WARN org.apache.hadoop.hbase.util.Sleeper: We
  slept 28183ms instead of 3000ms, this is likely due to a long garbage
  collecting pause and it's usually bad, see
  http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
  2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
  {processingtimems:29830,call:next(1656517918313948447, 1000), rpc
  version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.127.21:56058
 
 ,starttimems:1363027030280,queuetimems:4602,class:HRegionServer,responsesize:2774484,method:next}
  2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
  {processingtimems:31195,call:next(-8353194140406556404, 1000), rpc
  version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.127.21:56529
 
 ,starttimems:1363027028804,queuetimems:3634,class:HRegionServer,responsesize:2270919,method:next}
  2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
  {processingtimems:30965,call:next(2623756537510669130, 1000), rpc
  version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.127.21:56146
 
 ,starttimems:1363027028807,queuetimems:3484,class:HRegionServer,responsesize:2753299,method:next}
  2013-03-11 18:37:40,236 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
  {processingtimems:31023,call:next(5293572780165196795, 1000), rpc
  version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.127.21:56069
 
 ,starttimems:1363027029086,queuetimems:3589,class:HRegionServer,responsesize:2722543,method:next}
  2013-03-11 18:37:40,368 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
  {processingtimems:31160,call:next(-4285417329791344278, 1000), rpc
  version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.127.21:56586
 
 ,starttimems:1363027029204,queuetimems:3707,class:HRegionServer,responsesize:2938870,method:next}
  2013-03-11 18:37:43,652 WARN org.apache.hadoop.ipc.HBaseServer:
  (responseTooSlow):
 
 {processingtimems:31249,call:multi(org.apache.hadoop.hbase.client.MultiAction@2d19985a
 ),
  rpc version=1, client version=29, methodsFingerPrint=54742778,client:
  10.20.109.21:35342
 
 ,starttimems:1363027031505,queuetimems:5720,class:HRegionServer,responsesize:0,method:multi}
  2013-03-11 18:37:49,108 WARN 

RE: Welcome our newest Committer Anoop

2013-03-10 Thread Anoop Sam John
Thanks to all.. Hope to work more and more for HBase!

-Anoop-


From: Andrew Purtell [apurt...@apache.org]
Sent: Monday, March 11, 2013 7:33 AM
To: user@hbase.apache.org
Subject: Re: Welcome our newest Committer Anoop

Congratulations Anoop. Welcome!


On Mon, Mar 11, 2013 at 12:42 AM, ramkrishna vasudevan 
ramkrishna.s.vasude...@gmail.com wrote:

 Hi All

 Pls welcome Anoop, our newest committer.  Anoop's work in HBase has been
 great and he has helped lot of users in the mailing list.

 He has contributed features related to Endpoints and CPs.

 Welcome Anoop and best wishes for your future work.

 Hope to see your continuing efforts to the community.

 Regards
 Ram




--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: How HBase perform per-column scan?

2013-03-10 Thread Anoop Sam John
A ROWCOL bloom says whether, for a given row (rowkey), a given column
(qualifier) is present in an HFile or not. But here the user does not know the
rowkeys; he wants all the rows that have column 'x'.
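
What you can do from the client side is add only that column to the Scan; rows
which do not have the column simply produce no Result (the server still has to
read through the store, though). A rough sketch with placeholder names:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RowsWithColumnX {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(HBaseConfiguration.create(), "mytable");
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("x"));  // only rows that actually have cf:x are returned
      ResultScanner rs = table.getScanner(scan);
      try {
        for (Result r : rs) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      } finally {
        rs.close();
        table.close();
      }
    }
  }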

-Anoop-


From: Liu, Raymond [raymond@intel.com]
Sent: Monday, March 11, 2013 7:43 AM
To: user@hbase.apache.org
Subject: RE: How HBase perform per-column scan?

Just curious, won't ROWCOL bloom filter works for this case?

Best Regards,
Raymond Liu


 As per the above said, you will need a full table scan on that CF.
 As Ted said, consider having a look at your schema design.

 -Anoop-


 On Sun, Mar 10, 2013 at 8:10 PM, Ted Yu yuzhih...@gmail.com wrote:

  bq. physically column family should be able to perform efficiently
  (storage layer
 
  When you scan a row, data for different column families would be
  brought into memory (if you don't utilize HBASE-5416) Take a look at:
 
 
 https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=1354
  1258page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabp
  anel#comment-13541258
 
  which was based on the settings described in:
 
 
 
 https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=1354
  1191page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabp
  anel#comment-13541191
 
  This boils down to your schema design. If possible, consider
  extracting column C into its own column family.
 
  Cheers
 
  On Sun, Mar 10, 2013 at 7:14 AM, PG pengyunm...@gmail.com wrote:
 
   Hi, Ted and Anoop, thanks for your notes.
   I am talking about column rather than column family, since
   physically column family should be able to perform efficiently
   (storage layer, CF's are stored separately). But columns of the same
   column family may be
  mixed
   physically, and that makes filters column value hard... So I want to
   know if there are any mechanism in HBase worked on this...
   Regards,
   Yun
  
   On Mar 10, 2013, at 10:01 AM, Ted Yu yuzhih...@gmail.com wrote:
  
Hi, Yun:
Take a look at HBASE-5416 (Improve performance of scans with some
kind
  of
filters) which is in 0.94.5 release.
   
In your case, you can use a filter which specifies column C as the
essential family.
Here I interpret column C as column family.
   
Cheers
   
On Sat, Mar 9, 2013 at 11:11 AM, yun peng pengyunm...@gmail.com
  wrote:
   
Hi, All,
I want to find all existing values for a given column in a HBase,
and
   would
that result in a full-table scan in HBase? For example, given a
column
   C,
the table is of very large number of rows, from which few rows
(say
   only 1
row) have non-empty values for column C. Would HBase still ues a
full
   table
scan to find this row? Or HBase has any optimization work for
this
  kind
   of
query?
Thanks...
Regards
Yun
   
  
 

RE: can we use same column name for 2 different column families?

2013-03-10 Thread Anoop Sam John
can we have column name dob under column family F1 & F2?
That is just fine. Go ahead. :)
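
A small sketch, with names as in your mail:

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SameQualifierTwoFamilies {
    public static Put example() {
      Put put = new Put(Bytes.toBytes("emp-001"));
      // the same qualifier under two families is just two independent cells
      put.add(Bytes.toBytes("F1"), Bytes.toBytes("dob"), Bytes.toBytes("1980-01-01"));
      put.add(Bytes.toBytes("F2"), Bytes.toBytes("dob"), Bytes.toBytes("1985-05-05"));
      return put;
    }
  }

On reads you always qualify with the family (F1:dob vs F2:dob), so there is no
ambiguity.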

-Anoop-

From: Ramasubramanian Narayanan [ramasubramanian.naraya...@gmail.com]
Sent: Sunday, March 10, 2013 11:41 PM
To: user@hbase.apache.org
Subject: can we use same column name for 2 different column families?

Hi,

Is it fine to use same column name for 2 different column families?

For example,

In a table emp,

can we have column name dob under column family F1 & F2?

Please let me know the impact of having like this if any...

Note : I don't want to use dob1 or some other field name for the second
column... use case is like that...

regards,
Rams

RE: Odd WARN in hbase 0.94.2

2013-03-07 Thread Anoop Sam John
Hi Bryan,
   Is this change needed with any of the open source HDFS releases, or only in
CDH? Is this related to HDFS-347?
In that case, forget about my previous mail about adding it to the book :)

-Anoop-

From: Kevin O'dell [kevin.od...@cloudera.com]
Sent: Thursday, March 07, 2013 2:13 AM
To: user@hbase.apache.org
Cc: hbase-u...@hadoop.apache.org
Subject: Re: Odd WARN in hbase 0.94.2

Hi Bryan,

One of our engineers wanted me to pass this recommendation along:

The issue here is most likely a missing configuration.

We need to have this configuration on both the DataNode and RegionServer:
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>

In CDH4.1 and earlier, the DN didn't need to be configured to use SCR,
so most likely he does not have dfs.client.read.shortcircuit set in
hdfs-site.xml.  Adding that should fix it

On Tue, Mar 5, 2013 at 5:17 PM, Bryan Beaudreault
bbeaudrea...@hubspot.comwrote:

 Yep we do have that property set to that value.  The file does not seem to
 exist when I try ls'ing it myself.  I'm not sure where it comes from or how
 it should be created.


 On Tue, Mar 5, 2013 at 4:35 PM, Kevin O'dell kevin.od...@cloudera.com
 wrote:

  Bryan,
 
What permissions did you set for the SCRs?  The DNs need
 
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>755</value>
  </property>
 
 
  On Tue, Mar 5, 2013 at 4:28 PM, Bryan Beaudreault
  bbeaudrea...@hubspot.comwrote:
 
   We are running hbase 0.94.2, cdh4.2 edition.  We have the local read
   shortcut enabled.  Since updating to this version I've seen a bunch of
  WARN
   messages corresponding to exceptions trying to connect to what looks
 like
   the local DN.
  
   As far as I can tell it doesn't appear to be having much affect on the
 RS
   but it'd be nice to clear it up and know for sure that it doesn't
  indicate
   some problem. Has anyone ever seen this and know what it is or how to
 fix
   it?
  
   http://pastebin.com/PA6Y9pJN
  
   Thanks!
  
 
 
 
  --
  Kevin O'Dell
  Customer Operations Engineer, Cloudera
 




--
Kevin O'Dell
Customer Operations Engineer, Cloudera

RE: Why InternalScanner doesn't have a method that returns entire row or object of Result

2013-03-07 Thread Anoop Sam John
Asaf,
 You are correct!
You mean the RegionScanner, I think. The 'limit' is applied at that level, in
HRegion$RegionScannerImpl.
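
The usual pattern on the server side is to keep calling next() until it returns
false; each call fills the list with one row's worth of KVs (up to 'limit' if
you use that overload). A rough sketch against the 0.94 API:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.regionserver.HRegion;
  import org.apache.hadoop.hbase.regionserver.InternalScanner;

  public class RegionScanLoop {
    // 'region' would come from RegionCoprocessorEnvironment.getRegion() inside an endpoint.
    public static void scanAllRows(HRegion region, Scan scan) throws IOException {
      InternalScanner scanner = region.getScanner(scan);
      List<KeyValue> curRow = new ArrayList<KeyValue>();
      try {
        boolean hasMore;
        do {
          hasMore = scanner.next(curRow);   // one row's worth of KeyValues per call
          // ... process curRow ...
          curRow.clear();
        } while (hasMore);
      } finally {
        scanner.close();
      }
    }
  }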

-Anoop-

From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Thursday, March 07, 2013 6:04 PM
To: user@hbase.apache.org
Subject: Re: Why InternalScanner doesn't have a method that returns entire row 
or object of Result

Guys,
Just to make things clear:

if I have a row which have 12 keys values, and then another row with 5 KVs,
and I called InternelScanner(results, 10), where 10 is the limit, then I
would get:
1. 10 KV of the 1st row
2. 2 KV of the 1st row
3. 5 KV of the 2nd row

Is this correct?



On Sat, Dec 1, 2012 at 4:01 AM, anil gupta anilgupt...@gmail.com wrote:

 Hi Ted,

 I figured out that i have to use next from InternalScanner. Thanks for the
 response.
 The comment for method Grab the next row's worth of values. was a little
 confusing to me.
 Get the keyValue's for the next row would have been better. Just
 saying

 Thanks,
 Anil

 On Fri, Nov 30, 2012 at 5:20 PM, Ted Yu yuzhih...@gmail.com wrote:

  Right.
 
  Take a look at AggregateImplementation.getAvg(), you would see how the
  following method is used.
 
  On Fri, Nov 30, 2012 at 1:53 PM, anil gupta anilgupt...@gmail.com
 wrote:
 
   Does this method in InternalScanner gets KeyValue's for only 1 row in 1
   call. Am i right?
  
    boolean next(List<KeyValue> results)  -  Grab the next row's worth of values.
    See:
  http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/InternalScanner.html#next%28java.util.List%29
  http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/InternalScanner.html
  
   Thanks,
   Anil Gupta
  
   On Fri, Nov 30, 2012 at 10:54 AM, anil gupta anilgupt...@gmail.com
   wrote:
  
Hi All,
   
I am developing a Coprocessor to sort results on the basis of Cell
  Value.
Basically an equivalent of order by clause in RDBMS.
In my subclass of BaseEndpointCoprocessor i would like to do fetch of
entire rows rather than individual KeyValue using the
 InternalScanner.
   But,
surprisingly there is no method to do that. Can any one tell me why
 we
   dont
have a method for fetching rows? What is the most optimized way to
  fetch
rows through current InternalScanner methods?
--
Thanks  Regards,
Anil Gupta
   
  
  
  
   --
   Thanks  Regards,
   Anil Gupta
  
 



 --
 Thanks  Regards,
 Anil Gupta


RE: Odd WARN in hbase 0.94.2

2013-03-06 Thread Anoop Sam John
Hi Kevin,
   Thanks for the information. In the HBase book we have added a note on how to
use short circuit reads; can we update it accordingly? Which versions of HDFS
need this attribute to be present on the DN side as well? It would be great if
you could file a JIRA and provide a change to the description of the usage...

-Anoop-

From: Kevin O'dell [kevin.od...@cloudera.com]
Sent: Thursday, March 07, 2013 2:13 AM
To: user@hbase.apache.org
Cc: hbase-u...@hadoop.apache.org
Subject: Re: Odd WARN in hbase 0.94.2

Hi Bryan,

One of our engineers wanted me to pass this recommendation along:

The issue here is most likely a missing configuration.

We need to have this configuration on both the DataNode and RegionServer:
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>

In CDH4.1 and earlier, the DN didn't need to be configured to use SCR,
so most likely he does not have dfs.client.read.shortcircuit set in
hdfs-site.xml.  Adding that should fix it

On Tue, Mar 5, 2013 at 5:17 PM, Bryan Beaudreault
bbeaudrea...@hubspot.comwrote:

 Yep we do have that property set to that value.  The file does not seem to
 exist when I try ls'ing it myself.  I'm not sure where it comes from or how
 it should be created.


 On Tue, Mar 5, 2013 at 4:35 PM, Kevin O'dell kevin.od...@cloudera.com
 wrote:

  Bryan,
 
What permissions did you set for the SCRs?  The DNs need
 
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>755</value>
  </property>
 
 
  On Tue, Mar 5, 2013 at 4:28 PM, Bryan Beaudreault
  bbeaudrea...@hubspot.comwrote:
 
   We are running hbase 0.94.2, cdh4.2 edition.  We have the local read
   shortcut enabled.  Since updating to this version I've seen a bunch of
  WARN
   messages corresponding to exceptions trying to connect to what looks
 like
   the local DN.
  
   As far as I can tell it doesn't appear to be having much affect on the
 RS
   but it'd be nice to clear it up and know for sure that it doesn't
  indicate
   some problem. Has anyone ever seen this and know what it is or how to
 fix
   it?
  
   http://pastebin.com/PA6Y9pJN
  
   Thanks!
  
 
 
 
  --
  Kevin O'Dell
  Customer Operations Engineer, Cloudera
 




--
Kevin O'Dell
Customer Operations Engineer, Cloudera

RE: Miserable Performance of gets

2013-03-05 Thread Anoop Sam John
Hi Kiran,
When you say you are doing a batch get with 20 Gets, are the rowkeys for these
20 Gets in the same region? How many RSs do you have? Can you observe, out of
these 20, which Gets target which regions? Some information on this can help
explain the slowness...

-Anoop-

From: kiran [kiran.sarvabho...@gmail.com]
Sent: Wednesday, March 06, 2013 10:36 AM
To: user@hbase.apache.org
Subject: Re: Miserable Performance of gets

Version is 0.94.1

Yes, the gets are issued against the second table scanning the first table


On Wed, Mar 6, 2013 at 10:27 AM, Ted Yu yuzhih...@gmail.com wrote:

 Which HBase version are you using ?

 bq. But even for 20 gets
 These were issued against the second table ?

 Thanks

 On Tue, Mar 5, 2013 at 8:36 PM, kiran kiran.sarvabho...@gmail.com wrote:

  Dear All,
 
  I had some miserable experience with gets (batch gets) in hbase. I have
 two
  tables with different rowkeys, columns are distributed across the two
  tables.
 
  Currently what I am doing is scan over one table and get all the rowkeys
 in
  the first table matching my filter. Then issue a batch get on another
 table
  to retrieve some columns. But even for 20 gets, the performance is like
  miserable (almost a second or two for 20 gets which is not acceptable).
  But, scanning even on few thousands of rows is getting completed in
  milliseconds.
 
  My concern is for about 20 gets if it takes second or two,
  How can it scale ??
  Will the performance be the same even if I issue 1000 gets ??
  Is it advisable in hbase to avoid gets ??
 
  I can include all columns in only one table and do a scan also, but
 before
  doing that I need to really understand the issue...
 
  Is scanning a better solution for scalability and performance ???
 
  Is it advisable not to do joins or normalizations in NOSQL databases,
  include all the data in only table and not do joins with another table ??
 
 
  --
  Thank you
  Kiran Sarvabhotla
 
  -Even a correct decision is wrong when it is taken late
 




--
Thank you
Kiran Sarvabhotla

-Even a correct decision is wrong when it is taken late

RE: HBase CheckSum vs Hadoop CheckSum

2013-02-26 Thread Anoop Sam John
I was typing a reply and in the meantime Liang replied :)
Yes, I agree with him. It is only the HDFS client (at the RS) that stops doing
checksum verification against the HDFS-stored checksums; instead HBase checks
correctness by comparing against its own stored checksum values. The periodic
block scanning operation at the HDFS level will still continue. I think we can
turn that off by configuring the period with a negative value.

-Anoop-

From: 谢良 [xieli...@xiaomi.com]
Sent: Tuesday, February 26, 2013 5:54 PM
To: user@hbase.apache.org
Subject: 答复: HBase CheckSum vs Hadoop CheckSum

comments in line

Regards,
Liang

发件人: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
发送时间: 2013年2月26日 20:03
收件人: user
主题: HBase CheckSum vs Hadoop CheckSum

Hi,

Quick question.

When we are activating the short circuit read in HBase, it's
recommanded to activate the HBase checksum instead of Hadoop ones.
This is done in the HBase configuration.

I'm wondering what is the impact on the DataNode Block Scanner.

Is it going to be stopped because checksums can't be used anymore? Or
will Hadoop continue to store its own checksum and use them but it's
just that HBase will not look at them anymore and will store and use
its own checksums?
[liang xie]: yes, still store checksum in meta file in current community 
version.
btw, facebook's hadoop-fb20 branch has an inline checksum feature,IIRC

Since it's an HBase configuration (hbase.regionserver.checksum.verify)
I'm expecting this to not have any impact on the Block Scanner, but
I'm looking for a confirmation.
[liang xie]: yes, no impact on hdfs's DataBlockScanner, you can check
detail in datanode's BlockPoolSliceScanner.verifyBlock():
blockSender = new BlockSender(block, 0, -1, false, true, true,
datanode, null);


Thanks,

JM

RE: HBase CheckSum vs Hadoop CheckSum

2013-02-26 Thread Anoop Sam John
JM,
Please check dfs.datanode.scan.period.hours.

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Tuesday, February 26, 2013 7:04 PM
To: user@hbase.apache.org
Subject: Re: HBase CheckSum vs Hadoop CheckSum

Thanks for your replies. Few seconds I was feeling unsecured ;)

Seems the default period for the DataBlockScanner is 3 weeks:
static final long DEFAULT_SCAN_PERIOD_HOURS = 21*24L;

And I have not found anyway to modify that. I will continue to search
and might drop a msg on hadoop list if I still don't find.

Thanks,

JM

2013/2/26 Anoop Sam John anoo...@huawei.com:
 I was typing a reply and by the time Liang replied :)
 Ya agree with him.  It is only the HDFS client (At RS) not doing the checksum 
 verification based on the HDFS stored checksum.
 Instead HBase only check for the correctness by comparing with stored 
 checksum values. Still the periodic operation of block scanning at HDFS will 
 continue. We can turn this OFF by configuring this period with a -ve value I 
 think.

 -Anoop-
 
 From: 谢良 [xieli...@xiaomi.com]
 Sent: Tuesday, February 26, 2013 5:54 PM
 To: user@hbase.apache.org
 Subject: 答复: HBase CheckSum vs Hadoop CheckSum

 comments in line

 Regards,
 Liang
 
 发件人: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
 发送时间: 2013年2月26日 20:03
 收件人: user
 主题: HBase CheckSum vs Hadoop CheckSum

 Hi,

 Quick question.

 When we are activating the short circuit read in HBase, it's
 recommanded to activate the HBase checksum instead of Hadoop ones.
 This is done in the HBase configuration.

 I'm wondering what is the impact on the DataNode Block Scanner.

 Is it going to be stopped because checksums can't be used anymore? Or
 will Hadoop continue to store its own checksum and use them but it's
 just that HBase will not look at them anymore and will store and use
 its own checksums?
 [liang xie]: yes, still store checksum in meta file in current community 
 version.
 btw, facebook's hadoop-fb20 branch has an inline checksum feature,IIRC

 Since it's an HBase configuration (hbase.regionserver.checksum.verify)
 I'm expecting this to not have any impact on the Block Scanner, but
 I'm looking for a confirmation.
 [liang xie]: yes, no impact on hdfs's DataBlockScanner, you can check
 detail in datanode's BlockPoolSliceScanner.verifyBlock():
 blockSender = new BlockSender(block, 0, -1, false, true, true,
 datanode, null);


 Thanks,

 JM

RE: attributes - basic question

2013-02-22 Thread Anoop Sam John
We have used setAttribute() along with a Scan and read it back in the CP. Yes,
it will work fine.
Please try it with your use case and report back if you find any issue.
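
A rough sketch of the idea (the attribute name and values are only examples):

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
  import org.apache.hadoop.hbase.coprocessor.ObserverContext;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
  import org.apache.hadoop.hbase.regionserver.RegionScanner;

  public class UserAwareObserver extends BaseRegionObserver {
    @Override
    public RegionScanner preScannerOpen(ObserverContext<RegionCoprocessorEnvironment> ctx,
                                        Scan scan, RegionScanner s) throws IOException {
      byte[] user = scan.getAttribute("app.user");   // set by the client, see below
      if (user != null) {
        // apply whatever authorization / filtering logic you need based on the attribute
      }
      return s;
    }
  }

  // Client side:
  //   Scan scan = new Scan();
  //   scan.setAttribute("app.user", Bytes.toBytes("alice"));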

-Anoop-

From: Toby Lazar [tla...@gmail.com]
Sent: Saturday, February 23, 2013 4:07 AM
To: user@hbase.apache.org
Subject: Re: attributes - basic question

Your last point was exactly what I was looking for.  I am thinking about
using attributes along with coprocessors to impose some application-level
authorization constraints.  For example, in a get, I can pass username and
credential attributes and have the coprocessor filter results based on some
rules or group membership.  Of course, I'll need to make sure that that
step doesn't violate regular good practices for coprocessors.  If anyone
has used attributes for any similar purpose, I'd be interested in hearing
about those experiences.

Thanks,

Toby



On Fri, Feb 22, 2013 at 4:24 PM, Harsh J ha...@cloudera.com wrote:

 The attributes are serialized along with the base operation request.
 There's perhaps no immediate client-side usage of this, it is used by
 the Mutation class to set a cluster ID in HBase's Replication context:

 http://svn.apache.org/viewvc/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/Mutation.java?view=markup

 I suppose it could also be used to tag client requests, and used along
 with Coprocessors that way for some behavior differences.

 On Sat, Feb 23, 2013 at 12:11 AM, Toby Lazar tla...@gmail.com wrote:
  What is the purpose of the getAttribute and setAttribute methods for
  classes implementing OperationWithAttributes?  I don't see much
  documentation about them, haven't seen much discussion on this list about
  them, and am wondering what people use them for.  Or, are they mostly
 just
  used internally?  Any insights about these methods are appreciated.
 
  Thanks!
 
  Toby



 --
 Harsh J


RE: Optimizing Multi Gets in hbase

2013-02-18 Thread Anoop Sam John
It will instantiate one scan operation per Get.

-Anoop-

From: Varun Sharma [va...@pinterest.com]
Sent: Monday, February 18, 2013 3:27 PM
To: user@hbase.apache.org
Subject: Optimizing Multi Gets in hbase

Hi,

I am trying to batched get(s) on a cluster. Here is the code:

ListGet gets = ...
// Prepare my gets with the rows i need
myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the most optimal way to do this ?
ii) I have a feeling that if there are multiple gets in this case, on the
same region, then each one of those shall instantiate separate scan(s) over
the region even though a single scan is sufficient. Am I mistaken here ?

Thanks
Varun

RE: Co-Processor in scanning the HBase's Table

2013-02-17 Thread Anoop Sam John
I wanna use a custom code after scanning a large table and prefer to run
the code after scanning each region

Exactly at what point do you want to run your custom code?  We have hooks at 
points like opening a scanner on a region, closing a scanner on a region, calling 
next() (pre/post), etc.

-Anoop-

From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
Sent: Monday, February 18, 2013 12:21 AM
To: cdh-u...@cloudera.org; user@hbase.apache.org
Subject: Co-Processor in scanning the HBase's Table

Hi there
I wanna use a custom code after scanning a large table and prefer to run
the code after scanning each region.I know that I should use
co-processor,but don't know which of Observer,Endpoint or both of them I
should use ? Is there any simple example of them ?

Tnx

RE: Co-Processor in scanning the HBase's Table

2013-02-17 Thread Anoop Sam John
We don't have any hook like postScan(). In your case you can try 
postScannerClose(), which will be called once per region: when the scan on 
that region is over, the scanner opened on that region gets closed, and at 
that point this hook is executed.
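A rough sketch of such a hook (class name and the per-region work are placeholders; 
0.94 RegionObserver API assumed):

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;

public class PerRegionScanHook extends BaseRegionObserver {
  @Override
  public void postScannerClose(ObserverContext<RegionCoprocessorEnvironment> ctx,
      InternalScanner scanner) {
    String region = ctx.getEnvironment().getRegion().getRegionNameAsString();
    // Runs when the scanner opened on this region is closed, i.e. once the
    // scan over this region is finished.
    System.out.println("Finished scanning region " + region);
  }
}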

-Anoop-

From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
Sent: Monday, February 18, 2013 10:27 AM
To: user@hbase.apache.org
Cc: cdh-u...@cloudera.org
Subject: Re: Co-Processor in scanning the HBase's Table

Thanks you Amit,I will check that.
@Anoop: I wanna run that just after scanning a region or after scanning the
regions that to belong to one regionserver.

On Mon, Feb 18, 2013 at 7:45 AM, Anoop Sam John anoo...@huawei.com wrote:

 I wanna use a custom code after scanning a large table and prefer to run
 the code after scanning each region

 Exactly at what point you want to run your custom code?  We have hooks at
 points like opening a scanner at a region, closing scanner at a region,
 calling next (pre/post) etc

 -Anoop-
 
 From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
 Sent: Monday, February 18, 2013 12:21 AM
 To: cdh-u...@cloudera.org; user@hbase.apache.org
 Subject: Co-Processor in scanning the HBase's Table

 Hi there
 I wanna use a custom code after scanning a large table and prefer to run
 the code after scanning each region.I know that I should use
 co-processor,but don't know which of Observer,Endpoint or both of them I
 should use ? Is there any simple example of them ?

 Tnx


RE: Using HBase for Deduping

2013-02-15 Thread Anoop Sam John
Or maybe go with a large value for max versions and put the duplicate entries. Then, 
in the compaction, have a wrapper around the InternalScanner whose next() method 
returns only the 1st KV and removes the others. The same kind of logic would also 
be needed on the scan path. This will be good enough IMO, especially when 
there won't be many duplicate events for the same rowkey; that is why I asked 
some questions before.

I think this solution can be checked.
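A rough sketch of that wrapper idea (names are illustrative; a real implementation 
may want to dedupe per column qualifier rather than keep a single KV per row; 0.94 
coprocessor API assumed):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeCompactionObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Store store, final InternalScanner scanner) {
    return new InternalScanner() {
      private byte[] lastRow = null;

      public boolean next(List<KeyValue> results) throws IOException {
        return next(results, -1);
      }

      public boolean next(List<KeyValue> results, String metric) throws IOException {
        return next(results, -1);
      }

      public boolean next(List<KeyValue> results, int limit, String metric)
          throws IOException {
        return next(results, limit);
      }

      public boolean next(List<KeyValue> results, int limit) throws IOException {
        boolean more = scanner.next(results, limit);
        // Keep only the first KV seen for each row key; all later KVs of the
        // same row (the duplicate entries) are dropped from the compacted file.
        Iterator<KeyValue> it = results.iterator();
        while (it.hasNext()) {
          KeyValue kv = it.next();
          if (lastRow != null && Bytes.equals(lastRow, kv.getRow())) {
            it.remove();
          } else {
            lastRow = kv.getRow();
          }
        }
        return more;
      }

      public void close() throws IOException {
        scanner.close();
      }
    };
  }
}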

-Anoop-

From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Friday, February 15, 2013 3:06 PM
To: user@hbase.apache.org
Cc: Rahul Ravindran
Subject: Re: Using HBase for Deduping

Then maybe he can place an event in the same rowkey but with a column
qualifier which the time stamp of the event saved as long. Upon preCompact
in a region observer he can filter out for any row all column but the first?

On Friday, February 15, 2013, Anoop Sam John wrote:

 When max versions set as 1 and duplicate key is added, the last added will
 win removing the old.  This is what you want Rahul?  I think from his
 explanation he needs the reverse way

 -Anoop-
 
 From: Asaf Mesika [asaf.mes...@gmail.com javascript:;]
 Sent: Friday, February 15, 2013 3:56 AM
 To: user@hbase.apache.org javascript:;; Rahul Ravindran
 Subject: Re: Using HBase for Deduping

 You can load the events into an Hbase table, which has the event id as the
 unique row key. You can define max versions of 1 to the column family thus
 letting Hbase get rid of the duplicates for you during major compaction.



 On Thursday, February 14, 2013, Rahul Ravindran wrote:

  Hi,
 We have events which are delivered into our HDFS cluster which may be
  duplicated. Each event has a UUID and we were hoping to leverage HBase to
  dedupe them. We run a MapReduce job which would perform a lookup for each
  UUID on HBase and then emit the event only if the UUID was absent and
 would
  also insert into the HBase table(This is simplistic, I am missing out
  details to make this more resilient to failures). My concern is that
 doing
  a Read+Write for every event in MR would be slow (We expect around 1
  Billion events every hour). Does anyone use Hbase for a similar use case
 or
  is there a different approach to achieving the same end result. Any
  information, comments would be great.
 
  Thanks,
  ~Rahul.

RE: Using Hbase for Dedupping

2013-02-14 Thread Anoop Sam John
Hi Rahul
 When you say that some events can come with a duplicate UUID, what 
is the probability of such duplicate events?  Is it that most of the events 
will be unique and only a few are duplicates?  Also, do the same duplicated 
events come again and again (I mean, the same UUID many times)?

-Anoop-

From: Rahul Ravindran [rahu...@yahoo.com]
Sent: Friday, February 15, 2013 12:53 AM
To: user@hbase.apache.org
Subject: Using Hbase for Dedupping

Hi,
   We have events which are delivered into our HDFS cluster which may be 
duplicated. Each event has a UUID and we were hoping to leverage HBase to 
dedupe them. We run a MapReduce job which would perform a lookup for each UUID 
on HBase and then emit the event only if the UUID was absent and would also 
insert into the HBase table(This is simplistic, I am missing out details to 
make this more resilient to failures). My concern is that doing a Read+Write 
for every event in MR would be slow (We expect around 1 Billion events every 
hour). Does anyone use Hbase for a similar use case or is there a different 
approach to achieving the same end result. Any information, comments would be 
great.

Thanks,
~Rahul.

RE: Using HBase for Deduping

2013-02-14 Thread Anoop Sam John
When max versions is set to 1 and a duplicate key is added, the last one added wins, 
removing the old one.  Is this what you want, Rahul?  From his explanation, I think 
he needs it the other way around.

-Anoop-

From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Friday, February 15, 2013 3:56 AM
To: user@hbase.apache.org; Rahul Ravindran
Subject: Re: Using HBase for Deduping

You can load the events into an Hbase table, which has the event id as the
unique row key. You can define max versions of 1 to the column family thus
letting Hbase get rid of the duplicates for you during major compaction.



On Thursday, February 14, 2013, Rahul Ravindran wrote:

 Hi,
We have events which are delivered into our HDFS cluster which may be
 duplicated. Each event has a UUID and we were hoping to leverage HBase to
 dedupe them. We run a MapReduce job which would perform a lookup for each
 UUID on HBase and then emit the event only if the UUID was absent and would
 also insert into the HBase table(This is simplistic, I am missing out
 details to make this more resilient to failures). My concern is that doing
 a Read+Write for every event in MR would be slow (We expect around 1
 Billion events every hour). Does anyone use Hbase for a similar use case or
 is there a different approach to achieving the same end result. Any
 information, comments would be great.

 Thanks,
 ~Rahul.

RE: Custom preCompact RegionObserver crashes entire cluster on OOME: Heap Space

2013-02-12 Thread Anoop Sam John
The question is: is it legal to change a KV I received from the 
InternalScanner before adding it the Result - i..e returning it from my own 
InternalScanner?

You can change as per your need IMO

-Anoop-


From: Mesika, Asaf [asaf.mes...@gmail.com]
Sent: Tuesday, February 12, 2013 2:43 PM
To: user@hbase.apache.org
Subject: Re: Custom preCompact RegionObserver crashes entire cluster on OOME: 
Heap Space

I am trying to reduce the amount of KeyValue generated during the preCompact, 
but I'm getting some weird behaviors.

Let me describe what I am doing in short:

We have a counters table, with the following structure:

RowKey =  A combination of field values representing group by key.
CF = time span aggregate (Hour, Day, Month). Currently we have only for Hour.
CQ = Round-to-Hour timestamp (long).
Value = The count

We collect raw data, and updates the counters table for the matched group by 
key, hour.
We tried using Increment, but discovered its very very slow.
Instead we've decided to update the counters upon compaction. We write the 
deltas into the same row-key, but a longer column qualifier: 
RoundedToTheHourTSTypeUniqueId.
Type is: Delta or Aggregate.
Delta stands for a delta column qualifier we send from our client.

in the preCompact, I create an InternalScanner which aggregates the delta 
column qualifier values and generates a new key value with Type Aggregate: 
TSAUniqueID

The problem with this implementation that it consumes more memory.

Now, I've tried avoiding the creation of the Aggregate type KV, by simply 
re-using the 1st delta column qualifier: simply changing its value in the 
KeyValue.
But from some reason, after a couple of minor / major compactions, I see data 
loss, when I count the values and compare them to the expected.


The question is: is it legal to change a KV I received from the 
InternalScanner before adding it the Result - i..e returning it from my own 
InternalScanner?






On Feb 12, 2013, at 8:44 AM, Anoop Sam John wrote:

 Asaf,
   You have created a wrapper around the original InternalScanner 
 instance created by the compaction flow?

 Where do the KV generated during the compaction process queue up before 
 being written to the disk? Is this buffer configurable?
 When I wrote the Region Observer my assumption was the the compaction process 
 works in Streaming fashion, thus even if I decide to generate a KV per KV I 
 see, it still shouldn't be a problem memory wise.

 There is no queuing. Your assumption is correct only. It is written to the 
 writer as and when. (Just like how memstore flush doing the HFile write) As 
 Lars said a look at your code can tell if some thing is going wrong.  Do you 
 have blooms being used?

 -Anoop-
 
 From: Mesika, Asaf [asaf.mes...@gmail.com]
 Sent: Tuesday, February 12, 2013 11:16 AM
 To: user@hbase.apache.org
 Subject: Custom preCompact RegionObserver crashes entire cluster on OOME: 
 Heap Space

 Hi,

 I wrote a RegionObserver which does preCompact.
 I activated in pre-production, and then entire cluster dropped dead: One 
 RegionServer after another crashed on OutOfMemoryException: Heap Space.

 My preCompact method generates a KeyValue per each set of Column Qualifiers 
 it sees.
 When I remove the coprocessor and restart the cluster, cluster remains stable.
 I have 8 RS, each has 4 GB Heap. There about 9 regions (from a specific table 
 I'm working on) per Region Server.
 Running HBase 0.94.3

 The crash occur when the major compaction fires up, apparently cluster wide.


 My question is this: Where do the KV generated during the compaction process 
 queue up before being written to the disk? Is this buffer configurable?
 When I wrote the Region Observer my assumption was the the compaction process 
 works in Streaming fashion, thus even if I decide to generate a KV per KV I 
 see, it still shouldn't be a problem memory wise.

 Of course I'm trying to improve my code so it will generate much less new KV 
 (by simply altering the existing KVs received from the InternalScanner).

 Thank you,

 Asaf

RE: Custom preCompact RegionObserver crashes entire cluster on OOME: Heap Space

2013-02-12 Thread Anoop Sam John
Can you post the code of your new InternalScanner, i.e. the next() method 
implementation?
I would like to see how you are doing this KV change.

-Anoop-

From: Mesika, Asaf [asaf.mes...@gmail.com]
Sent: Tuesday, February 12, 2013 8:11 PM
To: user@hbase.apache.org
Subject: Re: Custom preCompact RegionObserver crashes entire cluster on OOME: 
Heap Space

I'm seeing a very strange behavior:

If I run a scan during major compaction, I can see both the modified Delta Key 
Value (which contains the aggregated values - e.g. 9) and the other two delta 
columns that were used for this aggregated column (e.g, 3, 3) - as if Scan is 
exposed to the key values produced in mid scan.
Could it be related to Cache somehow?

I am modifying the KeyValue object received from the InternalScanner in 
preCompact (modifying its value).

On Feb 12, 2013, at 11:22 AM, Anoop Sam John wrote:

 The question is: is it legal to change a KV I received from the 
 InternalScanner before adding it the Result - i..e returning it from my own 
 InternalScanner?

 You can change as per your need IMO

 -Anoop-

 
 From: Mesika, Asaf [asaf.mes...@gmail.com]
 Sent: Tuesday, February 12, 2013 2:43 PM
 To: user@hbase.apache.org
 Subject: Re: Custom preCompact RegionObserver crashes entire cluster on OOME: 
 Heap Space

 I am trying to reduce the amount of KeyValue generated during the preCompact, 
 but I'm getting some weird behaviors.

 Let me describe what I am doing in short:

 We have a counters table, with the following structure:

 RowKey =  A combination of field values representing group by key.
 CF = time span aggregate (Hour, Day, Month). Currently we have only for Hour.
 CQ = Round-to-Hour timestamp (long).
 Value = The count

 We collect raw data, and updates the counters table for the matched group by 
 key, hour.
 We tried using Increment, but discovered its very very slow.
 Instead we've decided to update the counters upon compaction. We write the 
 deltas into the same row-key, but a longer column qualifier: 
 RoundedToTheHourTSTypeUniqueId.
 Type is: Delta or Aggregate.
 Delta stands for a delta column qualifier we send from our client.

 in the preCompact, I create an InternalScanner which aggregates the delta 
 column qualifier values and generates a new key value with Type Aggregate: 
 TSAUniqueID

 The problem with this implementation that it consumes more memory.

 Now, I've tried avoiding the creation of the Aggregate type KV, by simply 
 re-using the 1st delta column qualifier: simply changing its value in the 
 KeyValue.
 But from some reason, after a couple of minor / major compactions, I see data 
 loss, when I count the values and compare them to the expected.


 The question is: is it legal to change a KV I received from the 
 InternalScanner before adding it the Result - i..e returning it from my own 
 InternalScanner?






 On Feb 12, 2013, at 8:44 AM, Anoop Sam John wrote:

 Asaf,
  You have created a wrapper around the original InternalScanner 
 instance created by the compaction flow?

 Where do the KV generated during the compaction process queue up before 
 being written to the disk? Is this buffer configurable?
 When I wrote the Region Observer my assumption was the the compaction 
 process works in Streaming fashion, thus even if I decide to generate a KV 
 per KV I see, it still shouldn't be a problem memory wise.

 There is no queuing. Your assumption is correct only. It is written to the 
 writer as and when. (Just like how memstore flush doing the HFile write) As 
 Lars said a look at your code can tell if some thing is going wrong.  Do you 
 have blooms being used?

 -Anoop-
 
 From: Mesika, Asaf [asaf.mes...@gmail.com]
 Sent: Tuesday, February 12, 2013 11:16 AM
 To: user@hbase.apache.org
 Subject: Custom preCompact RegionObserver crashes entire cluster on OOME: 
 Heap Space

 Hi,

 I wrote a RegionObserver which does preCompact.
 I activated in pre-production, and then entire cluster dropped dead: One 
 RegionServer after another crashed on OutOfMemoryException: Heap Space.

 My preCompact method generates a KeyValue per each set of Column Qualifiers 
 it sees.
 When I remove the coprocessor and restart the cluster, cluster remains 
 stable.
 I have 8 RS, each has 4 GB Heap. There about 9 regions (from a specific 
 table I'm working on) per Region Server.
 Running HBase 0.94.3

 The crash occur when the major compaction fires up, apparently cluster wide.


 My question is this: Where do the KV generated during the compaction process 
 queue up before being written to the disk? Is this buffer configurable?
 When I wrote the Region Observer my assumption was the the compaction 
 process works in Streaming fashion, thus even if I decide to generate a KV 
 per KV I see, it still shouldn't be a problem memory wise.

 Of course I'm trying to improve my code so it will generate much less new KV

RE: Get on a row with multiple columns

2013-02-11 Thread Anoop Sam John
You mean the endpoint is getting executed with high QoS?  Have you checked with 
some logs?

-Anoop-

From: Varun Sharma [va...@pinterest.com]
Sent: Monday, February 11, 2013 4:05 AM
To: user@hbase.apache.org; lars hofhansl
Subject: Re: Get on a row with multiple columns

Back to BulkDeleteEndpoint, i got it to work but why are the scanner.next()
calls executing on the Priority handler queue ?

Varun

On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl la...@apache.org wrote:

 The answer is probably :)
 It's disabled in 0.96 by default. Check out HBASE-7008 (
 https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
 there.

 Also check out the discussion in HBASE-5943 and HADOOP-8069 (
 https://issues.apache.org/jira/browse/HADOOP-8069)


 -- Lars



 
  From: Jean-Marc Spaggiari jean-m...@spaggiari.org
 To: user@hbase.apache.org
 Sent: Saturday, February 9, 2013 5:02 AM
 Subject: Re: Get on a row with multiple columns

 Lars, should we always consider disabling Nagle? What's the down side?

 JM

 2013/2/9, Varun Sharma va...@pinterest.com:
  Yeah, I meant true...
 
  On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl la...@apache.org wrote:
 
  Should be set to true. If tcpnodelay is set to true, Nagle's is
 disabled.
 
  -- Lars
 
 
 
  
   From: Varun Sharma va...@pinterest.com
  To: user@hbase.apache.org; lars hofhansl la...@apache.org
  Sent: Saturday, February 9, 2013 12:11 AM
  Subject: Re: Get on a row with multiple columns
 
 
  Okay I did my research - these need to be set to false. I agree.
 
 
  On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma va...@pinterest.com
  wrote:
 
  I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
  hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
  network latency ?
  
  
  On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl la...@apache.org
 wrote:
  
  Sorry.. I meant set these two config parameters to true (not false as I
  state below).
  
  
  
  
  - Original Message -
  From: lars hofhansl la...@apache.org
  To: user@hbase.apache.org user@hbase.apache.org
  Cc:
  Sent: Friday, February 8, 2013 11:41 PM
  Subject: Re: Get on a row with multiple columns
  
  Only somewhat related. Seeing the magic 40ms random read time there.
   Did
  you disable Nagle's?
  (set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
  hbase-site.xml).
  
  
  From: Varun Sharma va...@pinterest.com
  To: user@hbase.apache.org; lars hofhansl la...@apache.org
  Sent: Friday, February 8, 2013 10:45 PM
  Subject: Re: Get on a row with multiple columns
  
  The use case is like your twitter feed. Tweets from people u follow.
   When
  someone unfollows, you need to delete a bunch of his tweets from the
  following feed. So, its frequent, and we are essentially running into
  some
  extreme corner cases like the one above. We need high write throughput
  for
  this, since when someone tweets, we need to fanout the tweet to all
 the
  followers. We need the ability to do fast deletes (unfollow) and fast
  adds
  (follow) and also be able to do fast random gets - when a real user
   loads
  the feed. I doubt we will able to play much with the schema here since
   we
  need to support a bunch of use cases.
  
  @lars: It does not take 30 seconds to place 300 delete markers. It
   takes
  30
  seconds to first find which of those 300 pins are in the set of
 columns
  present - this invokes 300 gets and then place the appropriate delete
  markers. Note that we can have tens of thousands of columns in a
 single
  row
  so a single get is not cheap.
  
  If we were to just place delete markers, that is very fast. But when
  started doing that, our random read performance suffered because of
 too
  many delete markers. The 90th percentile on random reads shot up from
   40
  milliseconds to 150 milliseconds, which is not acceptable for our
  usecase.
  
  Thanks
  Varun
  
  On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl la...@apache.org
   wrote:
  
   Can you organize your columns and then delete by column family?
  
   deleteColumn without specifying a TS is expensive, since HBase first
  has
   to figure out what the latest TS is.
  
   Should be better in 0.94.1 or later since deletes are batched like
   Puts
   (still need to retrieve the latest version, though).
  
   In 0.94.3 or later you can also the BulkDeleteEndPoint, which
   basically
   let's specify a scan condition and then place specific delete marker
  for
   all KVs encountered.
  
  
   If you wanted to get really
   fancy, you could hook up a coprocessor to the compaction process and
   simply filter all KVs you no longer want (without ever placing any
   delete markers).
  
  
   Are you saying it takes 15 seconds to place 300 version delete
  markers?!
  
  
   -- Lars
  
  
  
   
From: 

RE: restrict clients

2013-02-11 Thread Anoop Sam John
HBase supports Kerberos-based authentication. Only client nodes with a 
valid Kerberos ticket can connect to the HBase cluster.
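For example, a client for a Kerberos-secured cluster could look roughly like this 
(principal and keytab path are placeholders; the property names are the usual ones 
for secure HBase, and the cluster itself must of course be configured for security 
as well):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.security.authentication", "kerberos");
    conf.set("hadoop.security.authentication", "kerberos");

    // Only a client that can present a valid Kerberos ticket for an allowed
    // principal is able to open a connection to the cluster.
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "appuser/client-host@EXAMPLE.COM", "/etc/security/keytabs/appuser.keytab");

    HTable table = new HTable(conf, "mytable");
    // ... GET/PUT as usual ...
    table.close();
  }
}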

-Anoop-

From: Rita [rmorgan...@gmail.com]
Sent: Monday, February 11, 2013 6:37 PM
To: user@hbase.apache.org
Subject: Re: restrict clients

Hi,

I am looking for more than an ACL. I want to control what clients can
connect to the hbase cluster. Is that possible?


On Fri, Feb 8, 2013 at 10:36 AM, Stas Maksimov maksi...@gmail.com wrote:

 Hi Rita,

 As far as I know ACL is on a user basis. Here's a link for you:
 http://hbase.apache.org/book/hbase.accesscontrol.configuration.html

 Thanks,
 Stas


 On 8 February 2013 15:20, Rita rmorgan...@gmail.com wrote:

  Hi,
 
  In an enterprise deployment, how can I restrict who can access the data?
  For example, I want only certain servers able to GET,PUT data everyone
 else
  should be denied. Is this possible?
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 




--
--- Get your facts first, then you can distort them as you please.--

RE: Custom preCompact RegionObserver crashes entire cluster on OOME: Heap Space

2013-02-11 Thread Anoop Sam John
Asaf,
   You have created a wrapper around the original InternalScanner 
instance created by the compaction flow?

Where do the KV generated during the compaction process queue up before being 
written to the disk? Is this buffer configurable?
When I wrote the Region Observer my assumption was the the compaction process 
works in Streaming fashion, thus even if I decide to generate a KV per KV I 
see, it still shouldn't be a problem memory wise.

There is no queuing; your assumption is correct. The KVs are written to the 
writer as they are produced (just like how a memstore flush writes the HFile). As 
Lars said, a look at your code can tell if something is going wrong.  Do you 
have blooms being used?

-Anoop-

From: Mesika, Asaf [asaf.mes...@gmail.com]
Sent: Tuesday, February 12, 2013 11:16 AM
To: user@hbase.apache.org
Subject: Custom preCompact RegionObserver crashes entire cluster on OOME: Heap 
Space

Hi,

I wrote a RegionObserver which does preCompact.
I activated in pre-production, and then entire cluster dropped dead: One 
RegionServer after another crashed on OutOfMemoryException: Heap Space.

My preCompact method generates a KeyValue per each set of Column Qualifiers it 
sees.
When I remove the coprocessor and restart the cluster, cluster remains stable.
I have 8 RS, each has 4 GB Heap. There about 9 regions (from a specific table 
I'm working on) per Region Server.
Running HBase 0.94.3

The crash occur when the major compaction fires up, apparently cluster wide.


My question is this: Where do the KV generated during the compaction process 
queue up before being written to the disk? Is this buffer configurable?
When I wrote the Region Observer my assumption was the the compaction process 
works in Streaming fashion, thus even if I decide to generate a KV per KV I 
see, it still shouldn't be a problem memory wise.

Of course I'm trying to improve my code so it will generate much less new KV 
(by simply altering the existing KVs received from the InternalScanner).

Thank you,

Asaf

RE: Start key and End key in HBase

2013-02-03 Thread Anoop Sam John
Can you please make the question clearer?
Do you mean its use in a Scan?

-Anoop-


From: raviprasa...@polarisft.com [raviprasa...@polarisft.com]
Sent: Monday, February 04, 2013 10:11 AM
To: user@hbase.apache.org
Subject: Start key and End key in HBase

Hi all,
  Can  anyone let me know what is the use of  Start key and End key in HBase.


Regards
Raviprasad. T
Mobile :-  91- 9894769541



RE: Start key and End key in HBase

2013-02-03 Thread Anoop Sam John
When you do a scan without specifying any start/end keys, it is a full table 
scan. The scan from the client side will go through all the regions one after the 
other.
But when you know the rowkey range that you want to scan, you can specify it 
using start/end keys. In that case the client will evaluate which regions it needs 
to contact for scanning the data (based on the start/end keys of the regions as 
stored in the META table), and only those regions will get contacted and scanned.

Also remember that the start key need not be a full rowkey that exactly maps to 
a rowkey within the table; it can also be just a prefix of the actual rowkey.
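A small example of such a range scan (table name and key prefix are made up); only 
the regions whose key range overlaps [startRow, stopRow) are contacted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("user123|"));   // may be just a prefix of the real rowkeys
    scan.setStopRow(Bytes.toBytes("user123|~"));   // stop row is exclusive

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}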

-Anoop-

From: raviprasa...@polarisft.com [raviprasa...@polarisft.com]
Sent: Monday, February 04, 2013 10:11 AM
To: user@hbase.apache.org
Subject: Start key and End key in HBase

Hi all,
  Can  anyone let me know what is the use of  Start key and End key in HBase.


Regards
Raviprasad. T
Mobile :-  91- 9894769541



RE: HBase Checksum

2013-01-31 Thread Anoop Sam John
You can check in the HDFS-level logs whether the checksum meta file is being 
read by the DFS client. With the HBase-handled checksum, this should not happen.
Have you noticed any perf gain when you configure the HBase-handled checksum 
option?

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Friday, February 01, 2013 4:16 AM
To: user
Subject: HBase Checksum

Hi,

I have activated shortcircuit and checksum and I would like to get a
confirmation that it's working fine.

So I have activated short circuit first and saw a 40% improvement of
the MR rowcount job. So I guess it's working fine.

Now, I'm configuring the checksum option, and I'm wondering how I can
do to validate that it's taken into consideration and used, or not. Is
there a way to see that?

Thanks,

JM

RE: HBase Checksum

2013-01-31 Thread Anoop Sam John
Hi Robert
  When HDFS is doing the local short-circuit read, it will use the 
BlockReaderLocal class for reading.  There should be some logs at the DFS 
client side (RS) which mention creating a new BlockReaderLocal.  If you can 
see this, then the local read is definitely happening.

Also check the DN log.  If the local read is happening, then you will not see read 
request related logs for the HFile at the DN side.
You can check against your number of HFiles and their names when looking at the logs.

Are you sure that when you tested, you had data locality? Region movements 
across RSs can break the full data locality.

-Anoop-

From: Robert Dyer [psyb...@gmail.com]
Sent: Friday, February 01, 2013 11:10 AM
To: Hbase-User
Subject: Re: HBase Checksum

Not trying to hijack your thread here...

But can you verify via logs that the shortcircuit is working?  Because I
enabled shortcircuit but I sure didn't see any performance increase.

I haven't tried enabling hbase checksum yet but I'd like to be able to
verify that works too.


On Thu, Jan 31, 2013 at 9:55 PM, Anoop Sam John anoo...@huawei.com wrote:

 You can check with HDFS level logs whether the checksum meta file is
 getting read to the DFS client? In the HBase handled checksum, this should
 not happen.
 Have you noticed any perf gain when you configure the HBase handled
 checksum option?

 -Anoop-
 
 From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
 Sent: Friday, February 01, 2013 4:16 AM
 To: user
 Subject: HBase Checksum

 Hi,

 I have activated shortcircuit and checksum and I would like to get a
 confirmation that it's working fine.

 So I have activated short circuit first and saw a 40% improvement of
 the MR rowcount job. So I guess it's working fine.

 Now, I'm configuring the checksum option, and I'm wondering how I can
 do to validate that it's taken into consideration and used, or not. Is
 there a way to see that?

 Thanks,

 JM


RE: Pagination with HBase - getting previous page of data

2013-01-30 Thread Anoop Sam John
@Anil

I could not understand that why it goes to multiple regionservers in
parallel. Why it cannot guarantee results = page size( my guess: due to
multiple RS scans)? If you have used it then maybe you can explain the
behaviour?

Scan from the client side never goes to multiple RSs in parallel. A scan through the HTable API 
will be sequential, one region after the other. For every region it will 
open a scanner in the RS and do next() calls. The filter will be instantiated 
at the server side per region ...

When you need 100 rows in the page and you create a Scan at the client side with the 
filter, and suppose there are 2 regions: first the scanner is opened for 
region1 and the scan happens. It will ensure that at most 100 rows are 
retrieved from that region.  But when the region boundary is crossed and the client 
automatically opens a scanner for region2, it will again pass the filter 
with max 100 rows, so from there also up to 100 rows can come.  So overall, 
at the client side we cannot guarantee that the scan created will only return 100 
rows as a whole from the table.

I think I am making it clear.   I have not used PageFilter at all; I am just 
explaining as per my knowledge of the scan flow and general filter usage.

This is because the filter is applied separately on different region servers. 
It does however optimize the scan of individual HRegions by making sure that 
the page size is never exceeded locally. 

I guess it should instead say: This is because the filter is applied 
separately on different regions.
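So a paging client typically has to enforce the page size itself and remember the 
last rowkey to start the next page from. A rough sketch (names invented; 0.94 
client API assumed):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PageFetcher {

  /** Returns up to pageSize rows starting at startRow (null for the first page). */
  public static List<Result> fetchPage(HTable table, byte[] startRow, int pageSize)
      throws Exception {
    Scan scan = new Scan();
    if (startRow != null) {
      scan.setStartRow(startRow);
    }
    scan.setFilter(new PageFilter(pageSize));     // limits rows per region only
    List<Result> page = new ArrayList<Result>();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        page.add(r);
        if (page.size() >= pageSize) {            // enforce the limit on the client
          break;
        }
      }
    } finally {
      scanner.close();
    }
    return page;
  }

  /** Start key for the next page: last returned rowkey plus a trailing zero byte. */
  public static byte[] nextStartRow(List<Result> page) {
    if (page.isEmpty()) {
      return null;
    }
    byte[] last = page.get(page.size() - 1).getRow();
    return Bytes.add(last, new byte[] { 0x00 });
  }
}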

-Anoop-


From: anil gupta [anilgupt...@gmail.com]
Sent: Wednesday, January 30, 2013 1:33 PM
To: user@hbase.apache.org
Subject: Re: Pagination with HBase - getting previous page of data

Hi Mohammad,

You are most welcome to join the discussion. I have never used PageFilter
so i don't really have concrete input.
I had a look at
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
I could not understand that why it goes to multiple regionservers in
parallel. Why it cannot guarantee results = page size( my guess: due to
multiple RS scans)? If you have used it then maybe you can explain the
behaviour?

Thanks,
Anil

On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq donta...@gmail.com wrote:

 I'm kinda hesitant to put my leg in between the pros ;)But, does it sound
 sane to use PageFilter for both rows and columns and having some additional
 logic to handle the 'nth' page logic?It'll help us in both kind of paging.

 On Wednesday, January 30, 2013, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org
 wrote:
  Hi Anil,
 
  I think it really depend on the way you want to use the pagination.
 
  Do you need to be able to jump to page X? Are you ok if you miss a
  line or 2? Is your data growing fastly? Or slowly? Is it ok if your
  page indexes are a day old? Do you need to paginate over 300 colums?
  Or just 1? Do you need to always have the exact same number of entries
  in each page?
 
  For my usecase I need to be able to jump to the page X and I don't
  have any content. I have hundred of millions lines. Only the rowkey
  matter for me and I'm fine if sometime I have 50 entries displayed,
  and sometime only 45. So I'm thinking about calculating which row is
  the first one for each page, and store that separatly. Then I just
  need to run the MR daily.
 
  It's not a perfect solution I agree, but this might do the job for me.
  I'm totally open to all other idea which might do the job to.
 
  JM
 
  2013/1/29, anil gupta anilgupt...@gmail.com:
  Yes, your suggested solution only works on RowKey based pagination. It
 will
  fail when you start filtering on the basis of columns.
 
  Still, i would say it's comparatively easier to maintain this at
  Application level rather than creating tables for pagination.
 
  What if you have 300 columns in your schema. Will you create 300 tables?
  What about handling of pagination when filtering is done based on
 multiple
  columns (and and or conditions)?
 
  On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  No, no killer solution here ;)
 
  But I'm still thinking about that because I might have to implement
  some pagination options soon...
 
  As you are saying, it's only working on the row-key, but if you want
  to do the same-thing on non-rowkey, you might have to create a
  secondary index table...
 
  JM
 
  2013/1/27, anil gupta anilgupt...@gmail.com:
   That's alright..I thought that you have come-up with a killer
 solution.
  So,
   got curious to hear your ideas. ;)
   It seems like your below mentioned solution will not work on
 filtering
   on
   non row-key columns since when you are deciding the page numbers you
   are
   only considering rowkey.
  
   Thanks,
   Anil
  
   On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
   Hi Anil,
  
   I don't have a solution. I never tought about that ;) But I 

RE: Pagination with HBase - getting previous page of data

2013-01-30 Thread Anoop Sam John
JM,

100 rows from the 2nd region is using extra time and resources. Why
not ask for only the number of missing lines?

These are things that need to be controlled by the scanning app. It can well 
control the pagination without using the PageFilter, I guess.  What do you say?
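For example, the app can keep one scanner open and ask only for the rows it still 
needs; a rough sketch:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScannerPaging {

  public static void example(HTable table) throws Exception {
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      // Ask for exactly one page; the scanner stops once it has 100 rows,
      // regardless of how many regions it had to cross to get them.
      Result[] firstPage = scanner.next(100);
      // The same open scanner continues where it stopped, so the next call
      // only fetches the rows that are still missing.
      Result[] secondPage = scanner.next(100);
    } finally {
      scanner.close();
    }
  }
}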


-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Wednesday, January 30, 2013 5:48 PM
To: user@hbase.apache.org
Subject: Re: Pagination with HBase - getting previous page of data

Hi Anoop,

So does it mean the scanner can send back LIMIT*2-1 lines max? Reading
100 rows from the 2nd region is using extra time and resources. Why
not ask for only the number of missing lines?

JM

2013/1/30, Anoop Sam John anoo...@huawei.com:
 @Anil

I could not understand that why it goes to multiple regionservers in
 parallel. Why it cannot guarantee results = page size( my guess: due to
 multiple RS scans)? If you have used it then maybe you can explain the
 behaviour?

 Scan from client side never go to multiple RS in parallel. Scan from HTable
 API will be sequential with one region after the other. For every region it
 will open up scanner in the RS and do next() calls. The filter will be
 instantiated at server side per region level ...

 When u need 100 rows in the page and you created a Scan at client side with
 the filter and suppose there are 2 regions, 1st the scanner is opened at for
 region1 and scan is happening. It will ensure that max 100 rows will be
 retrieved from that region.  But when the region boundary crosses and client
 automatically open up scanner for the region2, there also it will pass
 filter with max 100 rows and so from there also max 100 rows can come..  So
 over all at the client side we can not guartee that the scan created will
 only scan 100 rows as a whole from the table.

 I think I am making it clear.   I have not PageFilter at all.. I am just
 explaining as per the knowledge on scan flow and the general filter usage.

 This is because the filter is applied separately on different region
 servers. It does however optimize the scan of individual HRegions by making
 sure that the page size is never exceeded locally. 

 I guess it need to be saying that   This is because the filter is applied
 separately on different regions.

 -Anoop-

 
 From: anil gupta [anilgupt...@gmail.com]
 Sent: Wednesday, January 30, 2013 1:33 PM
 To: user@hbase.apache.org
 Subject: Re: Pagination with HBase - getting previous page of data

 Hi Mohammad,

 You are most welcome to join the discussion. I have never used PageFilter
 so i don't really have concrete input.
 I had a look at
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
 I could not understand that why it goes to multiple regionservers in
 parallel. Why it cannot guarantee results = page size( my guess: due to
 multiple RS scans)? If you have used it then maybe you can explain the
 behaviour?

 Thanks,
 Anil

 On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq donta...@gmail.com wrote:

 I'm kinda hesitant to put my leg in between the pros ;)But, does it sound
 sane to use PageFilter for both rows and columns and having some
 additional
 logic to handle the 'nth' page logic?It'll help us in both kind of
 paging.

 On Wednesday, January 30, 2013, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org
 wrote:
  Hi Anil,
 
  I think it really depend on the way you want to use the pagination.
 
  Do you need to be able to jump to page X? Are you ok if you miss a
  line or 2? Is your data growing fastly? Or slowly? Is it ok if your
  page indexes are a day old? Do you need to paginate over 300 colums?
  Or just 1? Do you need to always have the exact same number of entries
  in each page?
 
  For my usecase I need to be able to jump to the page X and I don't
  have any content. I have hundred of millions lines. Only the rowkey
  matter for me and I'm fine if sometime I have 50 entries displayed,
  and sometime only 45. So I'm thinking about calculating which row is
  the first one for each page, and store that separatly. Then I just
  need to run the MR daily.
 
  It's not a perfect solution I agree, but this might do the job for me.
  I'm totally open to all other idea which might do the job to.
 
  JM
 
  2013/1/29, anil gupta anilgupt...@gmail.com:
  Yes, your suggested solution only works on RowKey based pagination. It
 will
  fail when you start filtering on the basis of columns.
 
  Still, i would say it's comparatively easier to maintain this at
  Application level rather than creating tables for pagination.
 
  What if you have 300 columns in your schema. Will you create 300
  tables?
  What about handling of pagination when filtering is done based on
 multiple
  columns (and and or conditions)?
 
  On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  No, no killer solution here ;)
 
  But I'm still thinking about

RE: Find the tablename in Observer

2013-01-28 Thread Anoop Sam John
Will the CoprocessorEnvironment reference in the  start() method be
instanceof RegionCoprocessorEnvironment too 

No. It will be a reference to RegionEnvironment. This is not a public class, so 
you won't be able to do the casting.
As I read your need, you want to get the table name just once and store it, and 
don't want to do the operation again and again in every prePut().

Yes. You can do this by getting the table name in the pre/postOpen() method. There 
you get a RegionCoprocessorEnvironment object, and this method will be 
called once.
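A rough sketch of that approach (names are invented; 0.94 API assumed):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class TableNameObserver extends BaseRegionObserver {
  private volatile String tableName;

  @Override
  public void postOpen(ObserverContext<RegionCoprocessorEnvironment> ctx) {
    // Called once when the region is opened; cache the table name here.
    tableName = Bytes.toString(
        ctx.getEnvironment().getRegion().getRegionInfo().getTableName());
  }

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) {
    // The cached table name is available here with no extra work per Put.
    System.out.println("prePut on table " + tableName);
  }
}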

-Anoop-


From: Ted Yu [yuzhih...@gmail.com]
Sent: Tuesday, January 29, 2013 6:55 AM
To: user@hbase.apache.org
Subject: Re: Find the tablename in Observer

start() method of which class ?

If you use Eclipse, you can navigate through the classes and find out the
answer - that was what I did :-)

You can also place a breakpoint in the following method :
public void prePut(final ObserverContextRegionCoprocessorEnvironment
c,
src/test//java/org/apache/hadoop/hbase/coprocessor/TestRegionServerCoprocessorExceptionWithAbort.java

Cheers

On Mon, Jan 28, 2013 at 5:21 PM, Rajgopal Vaithiyanathan 
raja.f...@gmail.com wrote:

 Will the CoprocessorEnvironment reference in the  start() method be
 instanceof RegionCoprocessorEnvironment too ? if so i can typecast..  (
 sorry :) i would normally check these myself :P but because i dont see
 anyway to put breakpoints and debug, i have to use sysouts to get such
 info. )


 On Mon, Jan 28, 2013 at 5:16 PM, Rajgopal Vaithiyanathan 
 raja.f...@gmail.com wrote:

  Great. Thanks..
 
  is there anyway that I can get it before prePut() ??
  Like from the constructor or from the start() method too ? i followed the
  code of CoprocessorEnvironment and didn't seem to get anything out of it.
 
 
  On Mon, Jan 28, 2013 at 5:09 PM, Ted Yu yuzhih...@gmail.com wrote:
 
void prePut(final ObserverContextRegionCoprocessorEnvironment c,
 
final Put put, final WALEdit edit, final boolean writeToWAL)
 
 
 ((RegionCoprocessorEnvironment)c.getEnvironment()).getRegion().getRegionInfo().getTableName()
 
  Cheers
 
  On Mon, Jan 28, 2013 at 4:56 PM, Rajgopal Vaithiyanathan 
  raja.f...@gmail.com wrote:
 
   Hi all,
  
   inside the prePut() method, Is there anyway to know the table name for
   which the prePut() is running ?
  
   --
   Thanks and Regards,
   Rajgopal Vaithiyanathan.
  
 
 
 
 
  --
  Thanks and Regards,
  Rajgopal Vaithiyanathan.
 



 --
 Thanks and Regards,
 Rajgopal Vaithiyanathan.


RE: Find the tablename in Observer

2013-01-28 Thread Anoop Sam John
Oh sorry...
I had not checked the interface... We were doing it in postOpen()...
Thanks Gary for correcting me... :)

-Anoop-


From: Gary Helmling [ghelml...@gmail.com]
Sent: Tuesday, January 29, 2013 11:29 AM
To: user@hbase.apache.org
Subject: Re: Find the tablename in Observer

 Will the CoprocessorEnvironment reference in the  start() method be
 instanceof RegionCoprocessorEnvironment too

 No. It will be reference of RegionEnvironment . This is not a public class
 so you wont be able to do the casting.


Since RegionEnvionment implements RegionCoprocessorEnvironment, you should
be able to do:

((RegionCoprocessorEnvironment)env).getRegion().getRegionInfo().getTableName();

within your start() method without a problem.

RE: paging results filter

2013-01-24 Thread Anoop Sam John
@Toby

You mean to say that you need a mechanism for directly jumping to a page. Say 
you are on page#1 (1-20) now and you want to jump to page#4 (61-80). Yes, this 
is not there in PageFilter...
The normal way of going next page by next page will work fine, as within the server the 
next() calls on the scanner work this way...

-Anoop-

From: Toby Lazar [tla...@gmail.com]
Sent: Thursday, January 24, 2013 6:44 PM
To: user@hbase.apache.org
Subject: Re: paging results filter

I don't see a way of specifying which page of resluts I want.  For example,
if I want page 3 with page size of 20 (only results 41-60), I don't see how
PageFilter can be configued for that.  Am I missing the obvious?

Thanks,

Toby

On Thu, Jan 24, 2013 at 7:52 AM, Mohammad Tariq donta...@gmail.com wrote:

 I think you need
 PageFilter
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
 
 .

 HTH

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Thu, Jan 24, 2013 at 6:20 PM, Toby Lazar tla...@gmail.com wrote:

  Hi,
 
  I need to create a client function that allows paging of scan results
  (initially return results 1-20, then click on page to to show results
  21-40, 41-60, etc.) without needing to remember the start rowkey.  I
  beleive that a filter would be far more efficient than implementing the
  logic client-side.  I couldn't find any OOTB filter for this
 functionality
  so I wrote the class below.  It seems to work fine for me, but can anyone
  comment if this approach makes sense?  Is there another OOTB filter that
 I
  can use instead?
 
  Thank you,
 
  Toby
 
 
 
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.hbase.filter.FilterBase;
  public class PageOffsetFilter extends FilterBase {
   private long startRowCount;
   private long endRowCount;
 
   private int count = 0;
   public PageOffsetFilter() {
   }
 
   public PageOffsetFilter(long pageNumber, long pageSize) {
 
if(pageNumber1)
 pageNumber=1;
 
startRowCount = (pageNumber - 1) * pageSize;
endRowCount = (pageSize * pageNumber)-1;
   }
   @Override
   public boolean filterAllRemaining() {
return count  endRowCount;
   }
   @Override
   public boolean filterRow() {
 
count++;
if(count = startRowCount) {
 return true;
} else {
 return false;
}
 
   }
 
   @Override
   public void readFields(DataInput dataInput) throws IOException {
 
this.startRowCount = dataInput.readLong();
this.endRowCount = dataInput.readLong();
   }
   @Override
   public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(startRowCount);
dataOutput.writeLong(endRowCount);
   }
 
  }
 


RE: Region server Memory Use is double the -Xmx setting

2013-01-23 Thread Anoop Sam John
Are you using compression for HFiles?

Yes, we are using MaxDirectMemorySize, and we don't use the off-heap cache.

-Anoop-

From: Buckley,Ron [buckl...@oclc.org]
Sent: Wednesday, January 23, 2013 8:49 PM
To: user@hbase.apache.org
Subject: RE: Region server Memory Use is double the -Xmx setting

Liang,

Thanks.  I wasn’t really aware that the direct memory could get that large. 
(Full disclosure, we did switch from jdk1.6.0_25 to jdk1.6.0_31 the last time 
we restarted HBase.)

I've only seen explicit setting of -XX:MaxDirectMemorySize for regionservers 
associated with the experimental off-heap cache.

Is anyone else running their region servers with -XX:MaxDirectMemorySize (not 
using the off-heap cache)?

Ron

-Original Message-
From: 谢良 [mailto:xieli...@xiaomi.com]
Sent: Tuesday, January 22, 2013 9:20 PM
To: user@hbase.apache.org
Subject: Re: Region server Memory Use is double the -Xmx setting

Please set -XX:MaxDirectMemorySize explicitly; otherwise the default takes 
a value like -Xmx in current JDK6, at least for jdk1.6.30+

Best Regards,
Liang

From: Buckley,Ron [buckl...@oclc.org]
Sent: January 23, 2013 5:17
To: user@hbase.apache.org
Subject: Region server Memory Use is double the -Xmx setting

We have a 50 node cluster replicating to a 6 node cluster. Both clusters
are running CDH4.1.2 and HBase 0.94.2.



Today we noticed that the region servers at our replica site are using
10GB more memory than the '-Xmx12288m' we have defined in hbase-env.sh



These region servers have been up since January 9, 2013.



Does anyone have suggestions about tracking down this additional memory
use?



I'm not necessarily expecting the Region Server to stay right at the
12GB that we allocated, but having it running at 24GB is starting to
cause the servers to swap.






PIDUSER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND


28544  prodcon   18   0 24.1g  23g  18m S 20.0 74.0   9071:34 java




28544:   /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9
%p -Xmx12288m -Dcom.sun.management.jmxr

emote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.port=9021 -ea

-server -XX:+HeapDumpOnOutOfMemoryError -Xmn256m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOc

cupancyFraction=70 -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:/drive1/hadoop/2.0/isoft/../

hbase/logs/gc-hbase.log -ea -server -XX:+HeapDumpOnOutOfMemoryError
-Xmn256m -XX





--

Ron Buckley

x6365

http://intranet-wiki.oclc.org/wiki/XWC/XServe/SRU



http://intranet-wiki.oclc.org/wiki/FIND



http://intranet-wiki.oclc.org/wiki/Higgins


http://intranet-wiki.oclc.org/wiki/Firefly

RE: HBase split policy

2013-01-22 Thread Anoop Sam John
Jean, good topic.
When a region splits, it is really the HFile(s) being split. An HFile is 
logically split into n HFileBlocks, and we keep index metadata for 
these blocks at the HFile level. HBase will find the midkey from this 
block index data and take the mid block as the split point.
So it all depends on how the data is spread across the different HFileBlocks. When 
you split a region [a,e), it need not be split at point c; it all depends 
on how much data you have for each rowkey pattern.

One more thing to remember: sometimes there can be really big HFileBlocks. 
Even though the default size for a block is 64K, sometimes it can be much 
larger than this. One row cannot be split into 2 or more blocks; it needs to 
be in one block. So it can happen that when a split occurs, the bigger blocks 
go to one daughter, making that region still big !!... [when one row is 
really huge compared to the others]

Some thoughts on the topic, as per my limited knowledge of the code...
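If you want to control the split point yourself rather than rely on the midkey of 
the block index, you can also split explicitly through the admin API. A rough 
sketch (table name and split key are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ExplicitSplit {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Split the table at a chosen rowkey (e.g. "B") instead of letting HBase
    // pick the middle block of the HFile as the split point.
    admin.split("mytable", "B");
    admin.close();
  }
}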

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Tuesday, January 22, 2013 5:12 PM
To: user
Subject: HBase split policy

Hi,

I'm wondering, what is HBase split policy.

I mean, let's imagine this situation.

I have a region full of rows starting from AA to AZ. Thousands of
hundreds. I also have few rows from B to DZ. Let's say only one
hundred.

Region is just above the maxfilesize, so it's fine.

No, I add A and store a very big row into it. Almost half the size
of my maxfilesize value. That mean it's now time to split this row.

How will HBase decide where to split it? Is it going to use the
lexical order? Which mean it will split somewhere between B and C? If
it's done that way, I will have one VERY small region, and one VERY
big which will still be over the maxfilesize and will need to be split
again, and most probably many times, right?

Or will HBase take the middle of the region, look at the closest key,
and cut there?

Yesterday, for one table, I merged all my regions into a single one.
This gave me something like a 10GB region. Since I want to have at
least 100 regions for this table, I have setup the maxfilesize to
100MB. I have restarted HBase, and let it worked over night.

This morning, I have some very big regions, still over the 100MB, and
some very small. And the big regions are at least hundred times bigger
than the small one.

I just stopped the cluster again to re-merge the regions into a single
one and see if I have not done something wrong in the process, but in
the meantime, I'm looking for more information about the way HBase is
deciding where to cut, and if there is a way to customize that.

Thanks,

JM

PS: Numbers are out of my head. I don't really recall how big the last
region was yesterday. I will take more notes when the current
MassMerge will be done.

RE: ResultCode.NEXT_ROW and scans with batching enabled

2013-01-22 Thread Anoop Sam John
Hi,

In a scan, when a filter's filterKeyValue method returns
ReturnCode.NEXT_ROW - does it actually skip to the next row or just the
next batch

It will go to the next row.

In HBase 0.92
 hasFilterRow has not been overridden for certain filters which effectively
 do filter out rows (SingleColumnValueFilter for example). 

Yes this is an issue in old versions. It is fixed in trunk now.

 I spent some time looking at HRegion.java to get to grips with how
 filterRow works (or not) when batching is enabled.

See the method RegionScannerImpl#nextInternal(int limit) [in HRegion.java]. 
You can see a do-while loop. This loop takes all the KVs for a row (which thus 
can be grouped as one Result), and it only checks against the batch size (limit). 
When the filter says to go to the next row, there will be a seek to the next row 
[as Ted said, see the code in StoreScanner]. This will make peekRow() return 
the next row key, which is not the same as the currentRow [please see the code]. So 
this batch will end there, and the next batch will contain KVs from the next row only.
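A toy example of that behaviour (filter and qualifier names are invented; note that 
a custom filter in 0.94 must also implement the Writable methods and be on the RS 
classpath):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.Filter.ReturnCode;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipRowOnMarkerFilter extends FilterBase {
  private static final byte[] MARKER = Bytes.toBytes("deleted");

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    if (Bytes.equals(kv.getQualifier(), MARKER)) {
      // Skip the remainder of this row: the scanner reseeks to the next row,
      // so the next batch starts at the next row, not at this row's next KV.
      return ReturnCode.NEXT_ROW;
    }
    return ReturnCode.INCLUDE;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // no state to serialize
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // no state to restore
  }
}

// Client side (sketch):
//   Scan scan = new Scan();
//   scan.setBatch(10);                          // at most 10 KVs per Result
//   scan.setFilter(new SkipRowOnMarkerFilter());
//   ResultScanner rs = table.getScanner(scan);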

-Anoop-

From: Ted Yu [yuzhih...@gmail.com]
Sent: Wednesday, January 23, 2013 6:18 AM
To: user@hbase.apache.org
Subject: Re: ResultCode.NEXT_ROW and scans with batching enabled

Take a look at StoreScanner#next():

ScanQueryMatcher.MatchCode qcode = matcher.match(kv);

...

  case SEEK_NEXT_ROW:

// This is just a relatively simple end of scan fix, to
short-cut end

// us if there is an endKey in the scan.

if (!matcher.moreRowsMayExistAfter(kv)) {

  return false;

}

reseek(matcher.getKeyForNextRow(kv));

break;
Cheers

On Tue, Jan 22, 2013 at 4:13 PM, David Koch ogd...@googlemail.com wrote:

 Hello,

 In a scan, when a filter's filterKeyValue method returns
 ReturnCode.NEXT_ROW - does it actually skip to the next row or just the
 next batch, provided of course batching is enabled? Where in the HBase
 source code can I find out about this?

 I spent some time looking at HRegion.java to get to grips with how
 filterRow works (or not) when batching is enabled. In HBase 0.92
 hasFilterRow has not been overridden for certain filters which effectively
 do filter out rows (SingleColumnValueFilter for example). Thus, these
 filters do not generate a warning when used with a batched scan which -
 while risky - provides the needed filtering in some cases. This has been
 fixed for subsequent versions (at least 0.96) so I need to re-implement
 custom filters which use this effect.

 Thanks,

 /David


RE: HBase split policy

2013-01-22 Thread Anoop Sam John
What will trigger the split?
The things which can trigger a split
1. Explicit split call from the client side using admin API
2. A memstore flush
3. A compaction

So even though there are no write operations happening on the region (no 
flushes), a compaction performed for that region can still trigger a split. Maybe 
in your case a compaction happened for some of the regions and resulted in a 
split...
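For example (a hedged sketch, with an assumed table name), you can force items 1
and 3 from the client side without writing any data:

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.split("myTable");          // 1. explicit split request
admin.majorCompact("myTable");   // 3. a compaction will also run the split check
admin.close();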

-Anoop-

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Wednesday, January 23, 2013 8:09 AM
To: user@hbase.apache.org
Subject: Re: HBase split policy

Another related question.

What will trigger the split?

I mean, I merge all the regions into a single one, split that into 4 2.5GB
regions, alter the table to set maxsize to 300MB, and enable it. I
don't do anything else. No put, no get. What will trigger the region
splits?

I have one small table, about 1.2GB with 8M lines. I merged it into a
single region, and set the maxsize to 12MB. It almost got fully
split... All the regions got split except one.

Here is the screenshot: http://imageshack.us/photo/my-images/834/hannibalb.png/

It's not the first region, not the last. There is nothing specific
with this region, and it's not getting split.

Any idea why, and how I can trigger the split without putting any data
into the table?

Thanks,

JM

RE: Custom Filter and SEEK_NEXT_USING_HINT issue

2013-01-21 Thread Anoop Sam John
 I suppose if scanning process has started at once on
all regions, then I would find in log files at least one value per region,
but I have found one value per region only for those regions, that resides
before the particular one.

@Eugeny - FuzzyRowFilter, like any other filter, works at the server side. The 
scanning from the client side will be sequential, starting from the 1st region 
(the region with an empty startkey, or the region which contains whatever 
startkey you mentioned in your scan). From the client, a request will go to an 
RS for scanning a region. Once that region is done, the next region will be 
contacted for scanning (from the client) and so on. There is no parallel scanning of 
multiple regions from the client side. [This is when using the HTable scan APIs]

When MR is used for scanning, we will be doing parallel scans from all the 
regions; here we will have one mapper per region. But the normal scan from the 
client side will be sequential over the regions, not parallel.
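A small sketch of the MR path (MyMapper, the table name and the Job instance are
placeholders): TableInputFormat creates one input split per region, so each mapper
scans its own region in parallel, unlike the sequential client-side scan above.

Scan scan = new Scan();
scan.setCaching(500);        // bigger caching for MR scans
scan.setCacheBlocks(false);  // don't fill the block cache from a full scan
TableMapReduceUtil.initTableMapperJob("myTable", scan,
    MyMapper.class, ImmutableBytesWritable.class, Result.class, job);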

-Anoop-

From: Eugeny Morozov [emoro...@griddynamics.com]
Sent: Monday, January 21, 2013 1:46 PM
To: user@hbase.apache.org
Cc: Alex Baranau
Subject: Re: Custom Filter and SEEK_NEXT_USING_HINT issue

Finally, the mystery has been solved.

Small remark before I explain everything.

The situation with only one region is absolutely the same:
Fzzy: 1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw  -- the value I'm trying to find.
Fzzy: F7dt8QWPSIDw
Somehow FuzzyRowFilter has just omitted my value here.


So, the explanation.
In the javadoc for FuzzyRowFilter, the question mark is used as a substitution for
an unknown value. Of course it's possible to use anything, including zero,
instead of the question mark.
For quite some time we used literals to encode our keys, literals like
you've seen already: 1Q7iQ9JA or F7dt8QWPSIDw. But that's the Base64 form
of just 8 bytes, which requires 1.5 times more space. So we decided to
store the raw version - just byte[8]. But unfortunately the symbol '?' is
exactly in the middle of the byte range (according to the ascii table
http://www.asciitable.com/), which means that with FuzzyRowFilter we skip half
of the values in some cases. At the same time, the question mark comes right
before any letter that could be used in a key.

Despite the fact that we have integration tests, it's just a coincidence that we
don't have such an example in there.

So, as advice - always use zero instead of a question mark with
FuzzyRowFilter.
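To illustrate (a sketch, assuming the backported FuzzyRowFilter keeps the upstream
constructor that takes a list of key/mask pairs): for an 8-byte key whose first 4
bytes are known, put 0x00, not '?', in the unknown positions of the key and mark
those positions with 1 in the mask.

byte[] fuzzyKey = new byte[8];
System.arraycopy(fixedPrefix, 0, fuzzyKey, 0, 4);     // fixedPrefix = your 4 known bytes
// positions 4..7 stay 0x00 - the safe placeholder for "unknown"
byte[] mask = new byte[] { 0, 0, 0, 0, 1, 1, 1, 1 };  // 0 = fixed, 1 = fuzzy
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
    Collections.singletonList(new Pair<byte[], byte[]>(fuzzyKey, mask))));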

Thanks to everyone!

P.S. But the question about region scanning order is still here. I do not
understand why with FuzzyRowFilter it goes from one region to another until it
stops at the value. I supposed that if the scanning process had started on all
regions at once, then I would find in the log files at least one value per region,
but I have found one value per region only for those regions that reside
before the particular one.


On Mon, Jan 21, 2013 at 4:22 AM, Michael Segel michael_se...@hotmail.comwrote:

 If its the same class and its not a patch, then the first class loaded
 wins.

 So if you have a Class Foo and HBase has a Class Foo, your code will never
 see the light of day.

 Perhaps I'm stating the obvious but its something to think about when
 working w Hadoop.

 On Jan 19, 2013, at 3:36 AM, Eugeny Morozov emoro...@griddynamics.com
 wrote:

  Ted,
 
  that is correct.
  HBase 0.92.x and we use part of the patch 6509.
 
  I use the filter as a custom filter, it lives in separate jar file and
 goes
  to HBase's classpath. I did not patch HBase.
  Moreover I do not use protobuf's descriptions that comes with the filter
 in
  patch. Only two classes I have - FuzzyRowFilter itself and its test
 class.
 
  And it works perfectly on small dataset like 100 rows (1 region). But
 when
  my dataset is more than 10mln (260 regions), it somehow loosing rows. I'm
  not sure, but it seems to me it is not fault of the filter.
 
 
  On Sat, Jan 19, 2013 at 3:56 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  To my knowledge CDH-4.1.2 is based on HBase 0.92.x
 
  Looks like you were using patch from HBASE-6509 which was integrated to
  trunk only.
  Please confirm.
 
  Copying Alex who wrote the patch.
 
  Cheers
 
  On Fri, Jan 18, 2013 at 3:28 PM, Eugeny Morozov
  emoro...@griddynamics.comwrote:
 
  Hi, folks!
 
  HBase, Hadoop, etc version is CDH-4.1.2
 
  I'm using custom FuzzyRowFilter, which I get from
 
 
 
 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/and
  suddenly after quite a time we found that it starts loosing data.
 
  Basically the idea of FuzzyRowFilter is that it tries to find key that
  has
  been provided and if there is no such a key - but more exists in table
 -
  it
  returns SEEK_NEXT_USING_HINT. And in getNextKeyHint(...) it builds
  required
  key. As I understand, HBase in this key will fast-forward to required
  key -
  it must be similar or same as to get Scan with setStartRow.
 
  I'm trying to find key F7dt8QWPSIDw, it is definitely in HBase - I'm
 able
 

RE: Loading data, hbase slower than Hive?

2013-01-20 Thread Anoop Sam John
Austin,
Are you using HFileOutputFormat or TableOutputFormat?

-Anoop-

From: Austin Chungath [austi...@gmail.com]
Sent: Monday, January 21, 2013 11:15 AM
To: user@hbase.apache.org
Subject: Re: Loading data, hbase slower than Hive?

Thank you Tariq.
I will let you know how things went after I implement these suggestions.

Regards,
Austin

On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq donta...@gmail.com wrote:

 Hello Austin,

   I am sorry for the late response.

 Asaf has made a very valid point. Rowkwey design is very crucial.
 Specially if the data is gonna be sequential(timeseries kinda thing).
 You may end up with hotspotting problem. Use pre-splitted tables
 or hash the keys to avoid that. It'll also allow you to fetch the results
 faster.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika asaf.mes...@gmail.com
 wrote:

  Start by telling us your row key design.
  Check for pre splitting your table regions.
  I managed to get to 25mb/sec write throughput in Hbase using 1 region
  server. If your data is evenly spread you can get around 7 times that in
 a
  10 regions server environment. Should mean that 1 gig should take 4 sec.
 
 
  On Friday, January 18, 2013, praveenesh kumar wrote:
 
   Hey,
   Can someone throw some pointers on what would be the best practice for
  bulk
   imports in hbase ?
   That would be really helpful.
  
   Regards,
   Praveenesh
  
   On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq donta...@gmail.com
  javascript:;
   wrote:
  
Just to add to whatever all the heavyweights have said above, your MR
  job
may not be as efficient as the MR job corresponding to your Hive
 query.
   You
can enhance the performance by setting the mapred config parameters
   wisely
and by tuning your MR job.
   
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
   
   
On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan 
ramkrishna.s.vasude...@gmail.com javascript:; wrote:
   
 Hive is more for batch and HBase is for more of real time data.

 Regards
 Ram

 On Thu, Jan 17, 2013 at 10:30 PM, Anoop John 
 anoop.hb...@gmail.com
  javascript:;
   
 wrote:

  In case of Hive data insertion means placing the file under table
   path
in
  HDFS.  HBase need to read the data and convert it into its
 format.
 (HFiles)
  MR is doing this work..  So this makes it clear that HBase will
 be
 slower.
  :)  As Michael said the read operation...
 
 
 
  -Anoop-
 
  On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath 
   austi...@gmail.com javascript:;
  wrote:
 
 Hi,
   Problem: hive took 6 mins to load a data set, hbase took 1 hr
 14
mins.
   It's a 20 gb data set approx 230 million records. The data is
 in
hdfs,
   single text file. The cluster is 11 nodes, 8 cores.
  
   I loaded this in hive, partitioned by date and bucketed into 32
  and
  sorted.
   Time taken is 6 mins.
  
   I loaded the same data into hbase, in the same cluster by
  writing a
map
   reduce code. It took 1hr 14 mins. The cluster wasn't running
   anything
  else
   and assuming that the code that i wrote is good enough, what is
  it
that
   makes hbase slower than hive in loading the data?
  
   Thanks,
   Austin
  
 

   
  
 


RE: Hbase Mapreduce- Problem in using arrayList of pust in MapFunction

2013-01-20 Thread Anoop Sam John
And also how can I use autoflush & the client-side buffer in the Map function for
inserting data to an HBase Table?

You are using TableOutputFormat, right? There autoFlush is turned OFF ... You can 
use the config param hbase.client.write.buffer to set the client-side buffer size.
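A small sketch of that (the table name and job name are examples): set the buffer
in the job configuration that TableOutputFormat will pick up.

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024); // 8 MB; tune for your row size
conf.set(TableOutputFormat.OUTPUT_TABLE, "myTable");
Job job = new Job(conf, "puts-from-mr");
job.setOutputFormatClass(TableOutputFormat.class);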

-Anoop-

From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
Sent: Monday, January 21, 2013 11:41 AM
To: user@hbase.apache.org
Subject: Hbase Mapreduce- Problem in using arrayList of pust in MapFunction

Hi there
Is there any way to use an ArrayList of Puts in the map function to insert data to
HBase? Because the context.write method doesn't accept an ArrayList of
Puts, in every map function call I can only put one row. What can I do for
inserting several rows in each map function call?
And also how can I use autoflush & the client-side buffer in the Map function for
inserting data to an HBase Table?

Mohandes Zebeleh

RE: Loading data, hbase slower than Hive?

2013-01-20 Thread Anoop Sam John
@Mohammad 
As he is using HFileOutputFormat, there is no put call happening on HTable. In 
this case the MR job will create the HFiles directly without using the normal 
HBase write path. Then later, using the HRS API, the HFiles are loaded into the table 
regions.
In this case the number of reducers will equal the number of table regions, so 
Austin, you can check with a proper pre-split of the table.
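The path being described looks roughly like this (a sketch; the table name and
output directory are examples, and conf/job are assumed to exist):

HTable table = new HTable(conf, "myTable");              // ideally pre-split
HFileOutputFormat.configureIncrementalLoad(job, table);  // one reducer per region
FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
if (job.waitForCompletion(true)) {
  // move the generated HFiles into the region directories
  new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
}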

-Anoop-

From: Mohammad Tariq [donta...@gmail.com]
Sent: Monday, January 21, 2013 12:01 PM
To: user@hbase.apache.org
Subject: Re: Loading data, hbase slower than Hive?

Apart from this, you can make some additional tweaks to improve
put performance, like creating pre-split tables, making use of
put(List<Put> puts) instead of a normal put, etc.


Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Jan 21, 2013 at 11:46 AM, Austin Chungath austi...@gmail.comwrote:

 Anoop,

 I am using HFileOutputFormat. I am doing nothing but splitting the data
 from each row by the delimiter and sending it into their respective
 columns.
 Is there some kind of preprocessing or steps that I should do before this?
 As suggested I will look into the above solutions and let you guys know
 what the problem was. I might have to rethink the Rowkey design.

 Regards,
 Austin.

 On Mon, Jan 21, 2013 at 11:24 AM, Anoop Sam John anoo...@huawei.com
 wrote:

  Austin,
  You are using HFileOutputFormat or TableOutputFormat?
 
  -Anoop-
  
  From: Austin Chungath [austi...@gmail.com]
  Sent: Monday, January 21, 2013 11:15 AM
  To: user@hbase.apache.org
  Subject: Re: Loading data, hbase slower than Hive?
 
  Thank you Tariq.
  I will let you know how things went after I implement these suggestions.
 
  Regards,
  Austin
 
  On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq donta...@gmail.com
  wrote:
 
   Hello Austin,
  
 I am sorry for the late response.
  
   Asaf has made a very valid point. Rowkwey design is very crucial.
   Specially if the data is gonna be sequential(timeseries kinda thing).
   You may end up with hotspotting problem. Use pre-splitted tables
   or hash the keys to avoid that. It'll also allow you to fetch the
 results
   faster.
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika asaf.mes...@gmail.com
   wrote:
  
Start by telling us your row key design.
Check for pre splitting your table regions.
I managed to get to 25mb/sec write throughput in Hbase using 1 region
server. If your data is evenly spread you can get around 7 times that
  in
   a
10 regions server environment. Should mean that 1 gig should take 4
  sec.
   
   
On Friday, January 18, 2013, praveenesh kumar wrote:
   
 Hey,
 Can someone throw some pointers on what would be the best practice
  for
bulk
 imports in hbase ?
 That would be really helpful.

 Regards,
 Praveenesh

 On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq 
 donta...@gmail.com
javascript:;
 wrote:

  Just to add to whatever all the heavyweights have said above,
 your
  MR
job
  may not be as efficient as the MR job corresponding to your Hive
   query.
 You
  can enhance the performance by setting the mapred config
 parameters
 wisely
  and by tuning your MR job.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan 
  ramkrishna.s.vasude...@gmail.com javascript:; wrote:
 
   Hive is more for batch and HBase is for more of real time data.
  
   Regards
   Ram
  
   On Thu, Jan 17, 2013 at 10:30 PM, Anoop John 
   anoop.hb...@gmail.com
javascript:;
 
   wrote:
  
In case of Hive data insertion means placing the file under
  table
 path
  in
HDFS.  HBase need to read the data and convert it into its
   format.
   (HFiles)
MR is doing this work..  So this makes it clear that HBase
 will
   be
   slower.
:)  As Michael said the read operation...
   
   
   
-Anoop-
   
On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath 
 austi...@gmail.com javascript:;
wrote:
   
   Hi,
 Problem: hive took 6 mins to load a data set, hbase took 1
 hr
   14
  mins.
 It's a 20 gb data set approx 230 million records. The data
 is
   in
  hdfs,
 single text file. The cluster is 11 nodes, 8 cores.

 I loaded this in hive, partitioned by date and bucketed
 into
  32
and
sorted.
 Time taken is 6 mins.

 I loaded the same data into hbase, in the same cluster by
writing a
  map
 reduce code. It took 1hr 14 mins. The cluster wasn't
 running
 anything
else

RE: ValueFilter and VERSIONS

2013-01-17 Thread Anoop Sam John
Can you make use of SingleColumnValueFilter? With it you can specify whether 
the condition is to be checked only on the latest version or not:
SCVF#setLatestVersionOnly(true)
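For your example table, a sketch could look like this (checking F:f against the
value '1' on the latest version only):

SingleColumnValueFilter scvf = new SingleColumnValueFilter(
    Bytes.toBytes("F"), Bytes.toBytes("f"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("1"));
scvf.setLatestVersionOnly(true);   // ignore older versions while filtering
scvf.setFilterIfMissing(true);     // skip rows that have no F:f at all
Scan scan = new Scan();
scan.setFilter(scvf);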

-Anoop-

From: Li, Min [m...@microstrategy.com]
Sent: Friday, January 18, 2013 11:47 AM
To: user@hbase.apache.org
Subject: ValueFilter and VERSIONS

Hi all,

As you know, ValueFilter will filter data from all versions, so I created a 
table and indicated it has only 1 version. However, the old-version record can still 
be matched by ValueFilter. Does anyone know how to create a table that keeps only 
one version of a record?

BTW, I am using hbase 0.92.1. Following is my testing commands:


hbase(main):016:0> create 'testUser', {NAME => 'F', VERSIONS => 1}
0 row(s) in 1.0630 seconds

hbase(main):017:0> put 'testUser', '123', 'F:f', '3'
0 row(s) in 0.0120 seconds

hbase(main):018:0> put 'testUser', '123', 'F:f', '1'
0 row(s) in 0.0060 seconds

hbase(main):019:0> scan 'testUser'
ROW          COLUMN+CELL
 123         column=F:f, timestamp=1358489113213, value=1
1 row(s) in 0.0110 seconds

hbase(main):020:0> scan 'testUser', {FILTER => "(PrefixFilter ('123') AND ValueFilter (=, 'binary:1'))"}
ROW          COLUMN+CELL
 123         column=F:f, timestamp=1358489110172, value=3
1 row(s) in 0.1790 seconds


Thanks,
Min

RE: ValueFilter and VERSIONS

2013-01-17 Thread Anoop Sam John

ValueFilter works only on the KVs, not at a row level, so something like that is 
not possible.
Setting VERSIONS to 1 will make only one version (the latest) come back 
to the client. But the filtering is done prior to the versioning decision, and 
filters will see all the versions' values.

-Anoop-

From: Li, Min [m...@microstrategy.com]
Sent: Friday, January 18, 2013 12:00 PM
To: user@hbase.apache.org
Subject: RE: ValueFilter and VERSIONS

Hi Anoop,

Thanks for your reply. But I have to use a value filter here, because in some of 
my use cases, I can't identify the qualifier.

Thanks,
Min

-Original Message-
From: Anoop Sam John [mailto:anoo...@huawei.com]
Sent: Friday, January 18, 2013 2:28 PM
To: user@hbase.apache.org
Subject: RE: ValueFilter and VERSIONS

Can you make use of SingleColumnValueFilter.  In this you can specify whether 
the condition to be checked only on the latest version or not.
SCVF#setLatestVersionOnly ( true)

-Anoop-

From: Li, Min [m...@microstrategy.com]
Sent: Friday, January 18, 2013 11:47 AM
To: user@hbase.apache.org
Subject: ValueFilter and VERSIONS

Hi all,

As you know, ValueFilter will filter data from all versions, so I create a 
table and indicate it has only 1 version. However, the old version record still 
can be gotten by ValueFilter? Does anyone know how to create a table with only 
one version record?

BTW, I am using hbase 0.92.1. Following is my testing commands:


hbase(main):016:0> create 'testUser', {NAME => 'F', VERSIONS => 1}
0 row(s) in 1.0630 seconds

hbase(main):017:0> put 'testUser', '123', 'F:f', '3'
0 row(s) in 0.0120 seconds

hbase(main):018:0> put 'testUser', '123', 'F:f', '1'
0 row(s) in 0.0060 seconds

hbase(main):019:0> scan 'testUser'
ROW          COLUMN+CELL
 123         column=F:f, timestamp=1358489113213, value=1
1 row(s) in 0.0110 seconds

hbase(main):020:0> scan 'testUser', {FILTER => "(PrefixFilter ('123') AND ValueFilter (=, 'binary:1'))"}
ROW          COLUMN+CELL
 123         column=F:f, timestamp=1358489110172, value=3
1 row(s) in 0.1790 seconds


Thanks,
Min

RE: Hbase as mongodb

2013-01-16 Thread Anoop Sam John
Such as I can directly say Mongodb to get me
all the objects having timestamp value of xxx date where timestamp is a
field in Json objects stored in Mongodb

It is possible to store any data in HBase which can be converted into byte[].  
Yes, using filters one can perform the above kind of query. There is no built-in 
filter for that kind of need, but a custom one can be created. But remember that 
there is no built-in secondary indexing capability in HBase. Here I can see 
you have a need for indexing a part of a column value [timestamp is a field in 
the JSON objects].
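Just to sketch what such a custom filter could look like (the field name and the
naive string matching are only for illustration, and the Writable serialization a
real 0.92-era filter needs is left out):

public class JsonFieldFilter extends FilterBase {
  private final String needle;              // e.g. "\"timestamp\":\"2013-01-16\""

  public JsonFieldFilter(String field, String value) {
    this.needle = "\"" + field + "\":\"" + value + "\"";
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    // naive check; a real filter would parse the JSON value properly
    String json = Bytes.toString(kv.getValue());
    return json.contains(needle) ? ReturnCode.INCLUDE : ReturnCode.NEXT_ROW;
  }
}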

-Anoop-

From: Panshul Whisper [ouchwhis...@gmail.com]
Sent: Wednesday, January 16, 2013 6:36 PM
To: user@hbase.apache.org
Subject: Re: Hbase as mongodb

Hello Tariq,

Thank you for the reply.

My concern is that I have been working with MongoDB, but now I am switching
over to Hadoop and I want to use HBase for certain reasons. I was wondering
if I can store Json files in Hbase in a way that I can query the Json files
in HBase as I can in MongoDB. For example, I can directly ask MongoDB to get me
all the objects having a timestamp value of xxx date, where timestamp is a
field in the JSON objects stored in MongoDB. Can I perform similar operations
on HBase, or does it have another approach for doing similar operations?
I do not have much knowledge on Hbase yet. I am beginning to learn it, but
I just want to be sure i am investing my time in the right direction.

Thank you so much for the help,

Regards,
Panshul.


On Wed, Jan 16, 2013 at 11:45 AM, Mohammad Tariq donta...@gmail.com wrote:

 Hello Panshul,

 Hbase and MongoDB are built to serve different purposes. You can't
 replace one with the other. They have different strengths and weaknesses.
 So, if you are using Hbase for something, think well before switching to
 MongoDB or vice verca.

 Coming back to the actual question, you can store anything which can be
 converted into a sequence of bytes into Hbase and query it. Could you
 please elaborate your problem a bit?It will help us to answer your question
 in a better manner.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Wed, Jan 16, 2013 at 4:03 PM, Panshul Whisper ouchwhis...@gmail.com
 wrote:

  Hello,
 
  Is it possible to use hbase to query json documents in a same way as we
 can
  do with Mongodb
 
  Suggestions please.
  If we can then a small example as how.. not the query but the process
  flow..
  Thanku so much
  Regards,
  Panshul.
 




--
Regards,
Ouch Whisper
010101010101

RE: Hbase as mongodb

2013-01-16 Thread Anoop Sam John
Yes Mohammad, a smarter way like this is needed.. I was saying that even if the 
full JSON is stored as a column value it will be possible to achieve what 
Panshul needs. :) But a full table scan will not be acceptable, I guess.
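For what it's worth, a quick sketch of the rowkey layout being discussed (the epoch
variables and the cf/doc column names are just placeholders): a time-bounded query
then becomes a plain range scan instead of a full table scan.

byte[] startRow = Bytes.toBytes(startMillis);   // e.g. start of the day, epoch millis
byte[] stopRow  = Bytes.toBytes(stopMillis);    // e.g. start of the next day
ResultScanner scanner = table.getScanner(new Scan(startRow, stopRow));
for (Result r : scanner) {
  byte[] json = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("doc"));
  // deserialize the JSON document here
}
scanner.close();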

As Ted suggested, please check Panthera also. Panthera seems to use the Hive-HBase 
integration in a smart way. 

-Anoop-
__
From: Mohammad Tariq [donta...@gmail.com]
Sent: Wednesday, January 16, 2013 7:08 PM
To: user@hbase.apache.org
Subject: Re: Hbase as mongodb

@Anoop sir : Does it make sense to extract the timestamp of the JSON
object beforehand and use it as the rowkey? After that, serialize the
JSON object and store it in the HBase cell. Gets would be a lot faster
then???

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Jan 16, 2013 at 7:02 PM, Imran M Yousuf imyou...@gmail.com wrote:

 We have used Jackson library for converting Java Object to JSON String
 and eventually to byte[] and vice-versa; but that is not scan/query
 friendly, so we integrated Apache Solr to the stack to get that done.
 http://smart-cms.org

 Thank you,

 Imran

 On Wed, Jan 16, 2013 at 7:27 PM, Anoop Sam John anoo...@huawei.com
 wrote:
 Such as I can directly say Mongodb to get me
  all the objects having timestamp value of xxx date where timestamp is a
  field in Json objects stored in Mongodb
 
  It is possible to store any data in HBase which can be converted into
 byte[].  Yes using filters one can perform above kind of query. There is no
 built in filter for above kind of need but custom one can be created.  But
 remember that there is no built in secondary indexing capability in HBase.
  Here by I can see you have a need for indexing a part of column value.
 [timestamp is a field in Json objects ]
 
  -Anoop-
  
  From: Panshul Whisper [ouchwhis...@gmail.com]
  Sent: Wednesday, January 16, 2013 6:36 PM
  To: user@hbase.apache.org
  Subject: Re: Hbase as mongodb
 
  Hello Tariq,
 
  Thank you for the reply.
 
  My concern is that I have been working with MongoDB, but now I am
 switching
  over to Hadoop and I want to use HBase for certain reasons. I was
 wondering
  if I can store Json files in Hbase in a way that I can query the Json
 files
  in Hbase as I can in Mongodb. Such as I can directly say Mongodb to get
 me
  all the objects having timestamp value of xxx date where timestamp is a
  field in Json objects stored in Mongodb. Can I perform similar operations
  on Hbase or does it have another approach for doing similar operations.
  I do not have much knowledge on Hbase yet. I am beginning to learn it,
 but
  I just want to be sure i am investing my time in the right direction.
 
  Thank you so much for the help,
 
  Regards,
  Panshul.
 
 
  On Wed, Jan 16, 2013 at 11:45 AM, Mohammad Tariq donta...@gmail.com
 wrote:
 
  Hello Panshul,
 
  Hbase and MongoDB are built to serve different purposes. You
 can't
  replace one with the other. They have different strengths and
 weaknesses.
  So, if you are using Hbase for something, think well before switching to
  MongoDB or vice verca.
 
  Coming back to the actual question, you can store anything which can be
  converted into a sequence of bytes into Hbase and query it. Could you
  please elaborate your problem a bit?It will help us to answer your
 question
  in a better manner.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Wed, Jan 16, 2013 at 4:03 PM, Panshul Whisper ouchwhis...@gmail.com
  wrote:
 
   Hello,
  
   Is it possible to use hbase to query json documents in a same way as
 we
  can
   do with Mongodb
  
   Suggestions please.
   If we can then a small example as how.. not the query but the process
   flow..
   Thanku so much
   Regards,
   Panshul.
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101



 --
 Imran M Yousuf
 Entrepreneur  CEO
 Smart IT Engineering Ltd.
 Dhaka, Bangladesh
 Twitter: @imyousuf - http://twitter.com/imyousuf
 Blog: http://imyousuf-tech.blogs.smartitengineering.com/
 Mobile: +880-1711402557
+880-1746119494


RE: Coprocessor / threading model

2013-01-15 Thread Anoop Sam John
Thanks Andrew. A detailed and useful reply. Nothing more is needed to explain 
the anti-pattern.. :)

-Anoop-

From: Andrew Purtell [apurt...@apache.org]
Sent: Wednesday, January 16, 2013 12:50 AM
To: user@hbase.apache.org
Subject: Re: Coprocessor / threading model

HTable is a blocking interface. When a client issues a put, for example, we
do not want to return until we can confirm the store has been durably
persisted. For client convenience many additional details of remote region
invocation are hidden, for example META table lookups for relocated
regions, reconnection, retries. Just about all coprocessor upcalls for the
Observer interface happen with the RPC handler context. RPC handlers are
drawn from a fixed pool of threads. Your CP code is tying up one of a fixed
set of resources for as long as it has control. And in here you are running the
complex HTable machinery. For many reasons your method call on HTable may
block (potentially for a long time) and therefore the RPC handler your
invocation is executing within will also block. An accidental cycle can
cause a deadlock once there are no free handlers somewhere, which will
happen as part of normal operation when the cluster is loaded, and the
higher the load the more likely.

Instead you can do what Anoop has described in this thread and install a CP
into the master that ensures index regions are assigned to the same
regionserver as the primary table, and then call from a region of the
primary table into a colocated region of the index table, or vice versa,
bypassing HTable and the RPC stack. This is just making an in process
method call on one object from another.

Or, you could allocate a small executor pool for cross region RPC. When the
upcall into your CP happens, dispatch work to the executor and return
immediately to release the RPC worker thread back to the pool. This would
avoid the possibility of deadlock but this may not give you the semantics
you want because that background work could lag unpredictably.
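A rough sketch of that second option (the index table name and the index-row
construction are just placeholders, and error handling is only hinted at):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class AsyncIndexObserver extends BaseRegionObserver {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      final Put put, WALEdit edit, boolean writeToWAL) {
    final RegionCoprocessorEnvironment env = ctx.getEnvironment();
    pool.submit(new Runnable() {
      public void run() {
        try {
          HTableInterface index = env.getTable(Bytes.toBytes("index_table"));
          try {
            index.put(buildIndexPut(put));
          } finally {
            index.close();
          }
        } catch (IOException e) {
          // log it; with async dispatch the index can lag or miss this edit
        }
      }
    });
  }

  // placeholder: derive the index-table row from the main-table put
  private Put buildIndexPut(Put mainPut) {
    return new Put(mainPut.getRow());
  }
}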


On Tue, Jan 15, 2013 at 10:44 AM, Wei Tan w...@us.ibm.com wrote:

 Andrew, could you explain more, why doing cross-table operation is an
 anti-pattern of using CP?
 Durability might be an issue, as far as I understand. Thanks,


 Best Regards,
 Wei


--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: Maximizing throughput

2013-01-10 Thread Anoop Sam John
Hi
   Do you mind telling us the configs that you changed and set? BTW, which version 
of HBase are you using?

-Anoop-

From: Bryan Keller [brya...@gmail.com]
Sent: Friday, January 11, 2013 10:01 AM
To: user@hbase.apache.org
Subject: Maximizing throughput

I am attempting to configure HBase to maximize throughput, and have noticed 
some bottlenecks. In particular, with my configuration, write performance is 
well below theoretical throughput. I have a test program that inserts many rows 
into a test table. Network I/O is less than 20% of max, and disk I/O is even 
lower, maybe around 5% of max on all boxes in the cluster. CPU is well below 
50% of max on all boxes. I do not see any I/O waits or anything in particular that 
raises concerns. I am using iostat and iftop to test throughput. To determine 
theoretical max, I used dd and iperf. I have spent quite a bit of time 
optimizing the HBase config parameters, optimizing GC, etc., and am familiar 
with the HBase book online and such.

RE: HBase - Secondary Index

2013-01-08 Thread Anoop Sam John
Totally agree with Lars. The design came up as per our usage and data 
distribution style, etc.
Also, we could not compromise on put performance; that is why the 
region-collocation-based, region-level indexing design came about :)
Also, as the indexing and index usage all happen at the server side, there is no 
need for any change in the client, whatever type of client you use: Java code, 
REST APIs, or anything else. MR-based parallel scans and the like are also 
comparably easy, I feel, as there are absolutely no changes needed at the client 
side. :)

As Anil said, there will be pros and cons for every approach, and the one which 
suits your usage needs to be adopted. :)

-Anoop-

From: anil gupta [anilgupt...@gmail.com]
Sent: Wednesday, January 09, 2013 6:58 AM
To: user@hbase.apache.org; lars hofhansl
Subject: Re: HBase - Secondary Index

+1 on Lars comment.

Either the client gets the rowkey from the secondary table and then gets the
real data from the primary table, ** OR ** it sends the request to every RS (or
region) hosting a region of the primary table.

Anoop is using the latter mechanism. Both mechanisms have their pros and
cons. IMO, there is no outright winner.

~Anil Gupta

On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl la...@apache.org wrote:

 Different use cases.


 For global point queries you want exactly what you said below.
 For range scans across many rows you want Anoop's design. As usually it
 depends.


 The tradeoff is bringing a lot of unnecessary data to the client vs having
 to contact each region (or at least each region server).


 -- Lars



 
  From: Michael Segel michael_se...@hotmail.com
 To: user@hbase.apache.org
 Sent: Tuesday, January 8, 2013 6:33 AM
 Subject: Re: HBase - Secondary Index

 So if you're using an inverted table / index why on earth are you doing it
 at the region level?

 I've tried to explain this to others over 6 months ago and its not really
 a good idea.

 You're over complicating this and you will end up creating performance
 bottlenecks when your secondary index is completely orthogonal to your row
 key.

 To give you an example...

 Suppose you're CCCIS and you have a large database of auto insurance
 claims that you've acquired over the years from your Pathways product.

 Your primary key would be a combination of the Insurance Company's ID and
 their internal claim ID for the individual claim.
 Your row would be all of the data associated to that claim.

 So now lets say you want to find the average cost to repair a front end
 collision of an S80 Volvo.
 The make and model of the car would be orthogonal to the initial key. This
 means that the result set containing insurance records for Front End
 collisions of S80 Volvos would be most likely evenly distributed across the
 cluster's regions.

 If you used a series of inverted tables, you would be able to use a series
 of get()s to get the result set from each index and then find their
 intersections. (Note that you could also put them in sort order so that the
 intersections would be fairly straight forward to find.

 Doing this at the region level isn't so simple.

 So I have to again ask why go through and over complicate things?

 Just saying...

 On Jan 7, 2013, at 7:49 AM, Anoop Sam John anoo...@huawei.com wrote:

  Hi,
  It is inverted index based on column(s) value(s)
  It will be region wise indexing. Can work when some one knows the rowkey
 range or NOT.
 
  -Anoop-
  
  From: Mohit Anchlia [mohitanch...@gmail.com]
  Sent: Monday, January 07, 2013 9:47 AM
  To: user@hbase.apache.org
  Subject: Re: HBase - Secondary Index
 
  Hi Anoop,
 
  Am I correct in understanding that this indexing mechanism is only
  applicable when you know the row key? It's not an inverted index truly
  based on the column value.
 
  Mohit
  On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John anoo...@huawei.com
 wrote:
 
  Hi Adrien
  We are making the consistency btw the main table and
  index table and the roll back mentioned below etc using the CP hooks.
 The
  current hooks were not enough for those though..  I am in the process of
  trying to contribute those new hooks, core changes etc now...  Once all
 are
  done I will be able to explain in details..
 
  -Anoop-
  
  From: Adrien Mogenet [adrien.moge...@gmail.com]
  Sent: Monday, January 07, 2013 2:00 AM
  To: user@hbase.apache.org
  Subject: Re: HBase - Secondary Index
 
  Nice topic, perhaps one of the most important for 2013 :-)
  I still don't get how you're ensuring consistency between index table
 and
  main table, without an external component (such as
 bookkeeper/zookeeper).
  What's the exact write path in your situation when inserting data ?
  (WAL/RegionObserver, pre/post put/WALedit...)
 
  The underlying question is about how you're ensuring that WALEdit in
 Index
  and Main

RE: HBase - Secondary Index

2013-01-07 Thread Anoop Sam John
Hi,
It is an inverted index based on column value(s).
It will be region-wise indexing, and it can work whether or not someone knows the 
rowkey range.

-Anoop-

From: Mohit Anchlia [mohitanch...@gmail.com]
Sent: Monday, January 07, 2013 9:47 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Hi Anoop,

Am I correct in understanding that this indexing mechanism is only
applicable when you know the row key? It's not an inverted index truly
based on the column value.

Mohit
On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John anoo...@huawei.com wrote:

 Hi Adrien
  We are making the consistency btw the main table and
 index table and the roll back mentioned below etc using the CP hooks. The
 current hooks were not enough for those though..  I am in the process of
 trying to contribute those new hooks, core changes etc now...  Once all are
 done I will be able to explain in details..

 -Anoop-
 
 From: Adrien Mogenet [adrien.moge...@gmail.com]
 Sent: Monday, January 07, 2013 2:00 AM
  To: user@hbase.apache.org
 Subject: Re: HBase - Secondary Index

 Nice topic, perhaps one of the most important for 2013 :-)
 I still don't get how you're ensuring consistency between index table and
 main table, without an external component (such as bookkeeper/zookeeper).
 What's the exact write path in your situation when inserting data ?
 (WAL/RegionObserver, pre/post put/WALedit...)

 The underlying question is about how you're ensuring that WALEdit in Index
 and Main tables are perfectly sync'ed, and how you 're able to rollback in
 case of issue in both WAL ?


 On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min kelvin@gmail.com
 wrote:

  Yes as you say when the no of rows to be returned is becoming more and
  more the latency will be becoming more.  seeks within an HFile block is
  some what expensive op now. (Not much but still)  The new encoding
 prefix
  trie will be a huge bonus here. There the seeks will be flying.. [Ted
 also
  presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
  measure the scan performance with this new encoding . Trying to back
 port
  a simple patch for 94 version just for testing...   Yes when the no of
  results to be returned is more and more any index will become less
  performing as per my study  :)
 
  yes, you are right, I guess it's just a drawback of any index approach.
  Thanks for the explanation.
 
  Shengjie
 
  On 28 December 2012 04:14, Anoop Sam John anoo...@huawei.com wrote:
 
Do you have link to that presentation?
  
   http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
  
   -Anoop-
  
   
   From: Mohit Anchlia [mohitanch...@gmail.com]
   Sent: Friday, December 28, 2012 9:12 AM
   To: user@hbase.apache.org
   Subject: Re: HBase - Secondary Index
  
   On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John anoo...@huawei.com
   wrote:
  
Yes as you say when the no of rows to be returned is becoming more
 and
more the latency will be becoming more.  seeks within an HFile block
 is
some what expensive op now. (Not much but still)  The new encoding
  prefix
trie will be a huge bonus here. There the seeks will be flying.. [Ted
   also
presented this in the Hadoop China]  Thanks to Matt... :)  I am
 trying
  to
measure the scan performance with this new encoding . Trying to back
   port a
simple patch for 94 version just for testing...   Yes when the no of
results to be returned is more and more any index will become less
performing as per my study  :)
   
Do you have link to that presentation?
  
  
btw, quick question- in your presentation, the scale there is
 seconds
  or
mill-seconds:)
   
It is seconds.  Dont consider the exact values. What is the % of
  increase
in latency is important :) Those were not high end machines.
   
-Anoop-

From: Shengjie Min [kelvin@gmail.com]
Sent: Thursday, December 27, 2012 9:59 PM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index
   
 Didnt follow u completely here. There wont be any get() happening..
  As
the
exact rowkey in a region we get from the index table, we can seek to
  the
exact position and return that row.
   
Sorry, When I misused get() here, I meant seeking. Yes, if it's
 just
small number of rows returned, this works perfect. As you said you
 will
   get
the exact rowkey positions per region, and simply seek them. I was
  trying
to work out the case that when the number of result rows increases
massively. Like in Anil's case, he wants to do a scan query against
 the
2ndary index(timestamp): select all rows from timestamp1 to
  timestamp2
given no customerId provided. During that time period, he might have
 a
   big
chunk of rows from different customerIds. The index table returns a
 lot

RE: HBase - Secondary Index

2013-01-06 Thread Anoop Sam John
Hi Adrien 
 We are maintaining the consistency between the main table and the index 
table, and handling the rollback mentioned below, etc., using the CP hooks. The current 
hooks were not enough for those though.. I am in the process of trying to 
contribute those new hooks, core changes, etc. now... Once all are done I will 
be able to explain in detail..

-Anoop-

From: Adrien Mogenet [adrien.moge...@gmail.com]
Sent: Monday, January 07, 2013 2:00 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Nice topic, perhaps one of the most important for 2013 :-)
I still don't get how you're ensuring consistency between index table and
main table, without an external component (such as bookkeeper/zookeeper).
What's the exact write path in your situation when inserting data ?
(WAL/RegionObserver, pre/post put/WALedit...)

The underlying question is about how you're ensuring that WALEdit in Index
and Main tables are perfectly sync'ed, and how you 're able to rollback in
case of issue in both WAL ?


On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min kelvin@gmail.com wrote:

 Yes as you say when the no of rows to be returned is becoming more and
 more the latency will be becoming more.  seeks within an HFile block is
 some what expensive op now. (Not much but still)  The new encoding prefix
 trie will be a huge bonus here. There the seeks will be flying.. [Ted also
 presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
 measure the scan performance with this new encoding . Trying to back port
 a simple patch for 94 version just for testing...   Yes when the no of
 results to be returned is more and more any index will become less
 performing as per my study  :)

 yes, you are right, I guess it's just a drawback of any index approach.
 Thanks for the explanation.

 Shengjie

 On 28 December 2012 04:14, Anoop Sam John anoo...@huawei.com wrote:

   Do you have link to that presentation?
 
  http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
 
  -Anoop-
 
  
  From: Mohit Anchlia [mohitanch...@gmail.com]
  Sent: Friday, December 28, 2012 9:12 AM
  To: user@hbase.apache.org
  Subject: Re: HBase - Secondary Index
 
  On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John anoo...@huawei.com
  wrote:
 
   Yes as you say when the no of rows to be returned is becoming more and
   more the latency will be becoming more.  seeks within an HFile block is
   some what expensive op now. (Not much but still)  The new encoding
 prefix
   trie will be a huge bonus here. There the seeks will be flying.. [Ted
  also
   presented this in the Hadoop China]  Thanks to Matt... :)  I am trying
 to
   measure the scan performance with this new encoding . Trying to back
  port a
   simple patch for 94 version just for testing...   Yes when the no of
   results to be returned is more and more any index will become less
   performing as per my study  :)
  
   Do you have link to that presentation?
 
 
   btw, quick question- in your presentation, the scale there is seconds
 or
   mill-seconds:)
  
   It is seconds.  Dont consider the exact values. What is the % of
 increase
   in latency is important :) Those were not high end machines.
  
   -Anoop-
   
   From: Shengjie Min [kelvin@gmail.com]
   Sent: Thursday, December 27, 2012 9:59 PM
   To: user@hbase.apache.org
   Subject: Re: HBase - Secondary Index
  
Didnt follow u completely here. There wont be any get() happening..
 As
   the
   exact rowkey in a region we get from the index table, we can seek to
 the
   exact position and return that row.
  
   Sorry, When I misused get() here, I meant seeking. Yes, if it's just
   small number of rows returned, this works perfect. As you said you will
  get
   the exact rowkey positions per region, and simply seek them. I was
 trying
   to work out the case that when the number of result rows increases
   massively. Like in Anil's case, he wants to do a scan query against the
   2ndary index(timestamp): select all rows from timestamp1 to
 timestamp2
   given no customerId provided. During that time period, he might have a
  big
   chunk of rows from different customerIds. The index table returns a lot
  of
   rowkey positions for different customerIds (I believe they are
 scattered
  in
   different regions), then you end up seeking all different positions in
   different regions and return all the rows needed. According to your
   presentation page14 - Performance Test Results (Scan), without index,
  it's
   a linear increase as result rows # increases. on the other hand, with
   index, time spent climbs up way quicker than the case without index.
  
   btw, quick question- in your presentation, the scale there is seconds
 or
   mill-seconds:)
  
   - Shengjie
  
  
   On 27 December 2012 15:54, Anoop John anoop.hb...@gmail.com wrote:
  
how the massive number of get() is going to
perform

RE: responsetooslow from regionserver

2013-01-04 Thread Anoop Sam John
This log warns that the operation at the region server side is taking too much 
time... This is not an error...
Please check your cluster. Do you have hotspotting? You can also check the GC logs 
on that server...

-Anoop-

From: hua beatls [bea...@gmail.com]
Sent: Friday, January 04, 2013 4:39 PM
To: user@hbase.apache.org
Subject: responsetooslow from regionserver

Hi,
   below is the error log from my regionserver:

2013-01-04 14:12:37,970 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooSlow):
{processingtimems:12349,call:multi(org.apache.hadoop.hbase.client.MultiAction@6790e868),
rpc version=1, client version=29, methodsFingerPrint=54742778,client:
192.168.250.108:43072
,starttimems:1357279945618,queuetimems:166,class:HRegionServer,responsesize:0,method:multi}

2013-01-04 14:12:38,072 WARN org.apache.hadoop.ipc.HBaseServer: (
responseTooSlow):
{processingtimems:10204,call:multi(org.apache.hadoop.hbase.client.MultiAction@5a14ab21),
rpc version=1, client version=29, methodsFingerPrint=54742778,client:
192.168.250.107:43283
,starttimems:1357279947865,queuetimems:204,class:HRegionServer,responsesize:0,method:multi}

2013-01-04 14:12:38,778 WARN org.apache.hadoop.ipc.HBaseServer:



What is the problem?



   Thanks



beatls

RE: which API is to get table meta data in hbase

2012-12-27 Thread Anoop Sam John

 But I think there should be some metadata which records how many rows
are in a given table. Does HBase have this metadata? Which API is used to get it,
and how do I use that API?

There is no such meta data for a table.

You can check whether you can do this work on your own using coprocessors. 
There were some discussions on this in the mailing list; sorry, I am not able to 
find the links for them now. You could create another table and store the stats 
in that second table.

But see whether you can make use of the AggregationClient API. Do you have some 
column for which every row will have a non-null value?

-Anoop-

From: tgh [guanhua.t...@ia.ac.cn]
Sent: Thursday, December 27, 2012 1:45 PM
To: user@hbase.apache.org
Subject: which API is to get table meta data in hbase

Hi
I am trying to use the HBase API, and I store a big table in HBase,
but I have not found how to get the row count of this big table.
As I understand it,
AggregationClient#rowCount() counts the number of values of a given
family and column in a table, and it scans the related regions to compute
the count; that is, some columns can be null, is that right?

But I think there should be some metadata which records how many rows
are in a given table. Does HBase have this metadata? Which API is used to get it,
and how do I use that API?


Could you help me

Thank you
-
Tian Guanhua

RE: HBase - Secondary Index

2012-12-27 Thread Anoop Sam John

What happens when regions get split? Do you update the startkey on the
index table?

We have a custom HalfStoreFileReader to read the split index region data. This 
reader will change the rowkey it returns by replacing the startkey part. 
Immediately after a split, HBase will initiate a compaction, and the compaction 
uses this new reader. So the rowkey coming out will be a changed one, and thus 
the newly written HFiles will have the changed rowkey. Also, a normal read (as 
part of a scan) during this time uses this new reader, so we will always get 
the rowkey in the expected format.. :) Hope I have made it clear for you.

-Anoop-

From: Shengjie Min [kelvin@gmail.com]
Sent: Thursday, December 27, 2012 4:53 PM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Hi Anoop,

First all there will be same number of regions in both primary and index
tables. All the start/stop keys of the regions also will be same.
Suppose there are 2 regions on main table say for keys 0-10 and 10-20.
 Then we will create 2 regions in index table also with same key ranges.
At the master balancing level it is easy to collocate these regions seeing
the start and end keys.
When the selection of the rowkey that will be used in the index table is
the key here.
What we will do is all the rowkeys in the index table will be prefixed
with the start key of the region/
When an entry is added to the main table with rowkey as 5 it will go to
the 1st region (0-10)
Now there will be index region with range as 0-10.  We will select this
region to store this index data.
The row getting added into the index region for this entry will have a
rowkey 0_x_5
I am just using '_' as a seperator here just to show this. Actually we
wont be having any seperator.
So the rowkeys (in index region) will have a static begin part always.
 Will scan time also we know this part and so the startrow and endrow
creation for the scan will be possible.. Note that we will store the actual
table row key as the last part of the index rowkey itself not as a value.
This is better option in our case of handling the scan index usage also at
sever side.  There is no index data fetch to client side..

What happens when regions get split? Do you update the startkey on the
index table?

-Shengjie


On 14 December 2012 08:54, Anoop Sam John anoo...@huawei.com wrote:

 Hi Anil,

 1. In your presentation you mentioned that region of Primary Table and
 Region of Secondary Table are always located on the same region server. How
 do you achieve it? By using the Primary table rowkey as prefix of  Rowkey
 of Secondary Table? Will your implementation work if the rowkey of primary
 table cannot be used as prefix in rowkey of Secondary table( i have this
 limitation in my use case)?
 First all there will be same number of regions in both primary and index
 tables. All the start/stop keys of the regions also will be same.
 Suppose there are 2 regions on main table say for keys 0-10 and 10-20.
  Then we will create 2 regions in index table also with same key ranges.
 At the master balancing level it is easy to collocate these regions seeing
 the start and end keys.
 When the selection of the rowkey that will be used in the index table is
 the key here.
 What we will do is all the rowkeys in the index table will be prefixed
 with the start key of the region/
 When an entry is added to the main table with rowkey as 5 it will go to
 the 1st region (0-10)
 Now there will be index region with range as 0-10.  We will select this
 region to store this index data.
 The row getting added into the index region for this entry will have a
 rowkey 0_x_5
 I am just using '_' as a seperator here just to show this. Actually we
 wont be having any seperator.
 So the rowkeys (in index region) will have a static begin part always.
  Will scan time also we know this part and so the startrow and endrow
 creation for the scan will be possible.. Note that we will store the actual
 table row key as the last part of the index rowkey itself not as a value.
 This is better option in our case of handling the scan index usage also at
 sever side.  There is no index data fetch to client side..

 I feel your use case perfectly fit with our model

 2. Are you using an Endpoint or Observer for building the secondary index
 table?
 Observer

 3. Custom balancer do collocation. Is it a custom load balancer of HBase
 Master or something else?
 It is a balancer implementation which will be plugged into Master

 4. Your region split looks interesting. I dont have much info about it.
 Can
 you point to some docs on IndexHalfStoreFileReader?
 Sorry I am not able to publish any design doc or code as the company has
 not decided to open src the solution yet.
 Any particular query you come acorss pls feel free to aske me :)
 You can see the HalfStoreFileReader class 1st..

 -Anoop-
 
 From: anil gupta [anilgupt...@gmail.com]
 Sent

RE: HBase - Secondary Index

2012-12-27 Thread Anoop Sam John
Yes, as you say, when the number of rows to be returned grows, the 
latency will grow too. Seeks within an HFile block are a somewhat 
expensive op now (not much, but still). The new prefix-trie encoding will be a 
huge bonus here; there the seeks will be flying.. [Ted also presented this at 
Hadoop China] Thanks to Matt... :) I am trying to measure the scan 
performance with this new encoding, trying to back-port a simple patch to the 0.94 
version just for testing... Yes, when the number of results to be returned grows, 
any index will perform less well, as per my study :)

btw, quick question- in your presentation, the scale there is seconds or
mill-seconds:)

It is seconds. Don't focus on the exact values; what matters is the % increase in 
latency :) Those were not high-end machines.

-Anoop-

From: Shengjie Min [kelvin@gmail.com]
Sent: Thursday, December 27, 2012 9:59 PM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Didnt follow u completely here. There wont be any get() happening.. As the
exact rowkey in a region we get from the index table, we can seek to the
exact position and return that row.

Sorry, When I misused get() here, I meant seeking. Yes, if it's just
small number of rows returned, this works perfect. As you said you will get
the exact rowkey positions per region, and simply seek them. I was trying
to work out the case that when the number of result rows increases
massively. Like in Anil's case, he wants to do a scan query against the
2ndary index(timestamp): select all rows from timestamp1 to timestamp2
given no customerId provided. During that time period, he might have a big
chunk of rows from different customerIds. The index table returns a lot of
rowkey positions for different customerIds (I believe they are scattered in
different regions), then you end up seeking all different positions in
different regions and return all the rows needed. According to your
presentation page14 - Performance Test Results (Scan), without index, it's
a linear increase as result rows # increases. on the other hand, with
index, time spent climbs up way quicker than the case without index.

btw, quick question- in your presentation, the scale there is seconds or
mill-seconds:)

- Shengjie


On 27 December 2012 15:54, Anoop John anoop.hb...@gmail.com wrote:

 how the massive number of get() is going to
 perform againt the main table

 Didnt follow u completely here. There wont be any get() happening.. As the
 exact rowkey in a region we get from the index table, we can seek to the
 exact position and return that row.

 -Anoop-

 On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min kelvin@gmail.com
 wrote:

  how the massive number of get() is going to
  perform againt the main table
 




--
All the best,
Shengjie Min

RE: HBase - Secondary Index

2012-12-27 Thread Anoop Sam John
 Do you have link to that presentation?

http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf

-Anoop-


From: Mohit Anchlia [mohitanch...@gmail.com]
Sent: Friday, December 28, 2012 9:12 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John anoo...@huawei.com wrote:

 Yes as you say when the no of rows to be returned is becoming more and
 more the latency will be becoming more.  seeks within an HFile block is
 some what expensive op now. (Not much but still)  The new encoding prefix
 trie will be a huge bonus here. There the seeks will be flying.. [Ted also
 presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
 measure the scan performance with this new encoding . Trying to back port a
 simple patch for 94 version just for testing...   Yes when the no of
 results to be returned is more and more any index will become less
 performing as per my study  :)

 Do you have link to that presentation?


 btw, quick question- in your presentation, the scale there is seconds or
 mill-seconds:)

 It is seconds.  Dont consider the exact values. What is the % of increase
 in latency is important :) Those were not high end machines.

 -Anoop-
 
 From: Shengjie Min [kelvin@gmail.com]
 Sent: Thursday, December 27, 2012 9:59 PM
 To: user@hbase.apache.org
 Subject: Re: HBase - Secondary Index

  Didnt follow u completely here. There wont be any get() happening.. As
 the
 exact rowkey in a region we get from the index table, we can seek to the
 exact position and return that row.

 Sorry, When I misused get() here, I meant seeking. Yes, if it's just
 small number of rows returned, this works perfect. As you said you will get
 the exact rowkey positions per region, and simply seek them. I was trying
 to work out the case that when the number of result rows increases
 massively. Like in Anil's case, he wants to do a scan query against the
 2ndary index(timestamp): select all rows from timestamp1 to timestamp2
 given no customerId provided. During that time period, he might have a big
 chunk of rows from different customerIds. The index table returns a lot of
 rowkey positions for different customerIds (I believe they are scattered in
 different regions), then you end up seeking all different positions in
 different regions and return all the rows needed. According to your
 presentation page14 - Performance Test Results (Scan), without index, it's
 a linear increase as result rows # increases. on the other hand, with
 index, time spent climbs up way quicker than the case without index.

 btw, quick question- in your presentation, the scale there is seconds or
 mill-seconds:)

 - Shengjie


 On 27 December 2012 15:54, Anoop John anoop.hb...@gmail.com wrote:

  how the massive number of get() is going to
  perform againt the main table
 
  Didnt follow u completely here. There wont be any get() happening.. As
 the
  exact rowkey in a region we get from the index table, we can seek to the
  exact position and return that row.
 
  -Anoop-
 
  On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min kelvin@gmail.com
  wrote:
 
   how the massive number of get() is going to
   perform againt the main table
  
 



 --
 All the best,
 Shengjie Min


RE: how to use API to statistic how many message has been store in the table in hbase

2012-12-26 Thread Anoop Sam John
So you want to know the number of rows in a table?
Have a look at AggregationClient#rowCount() 

-Anoop-
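
For illustration, a minimal sketch of such a row count with the 0.94-era client
API. It assumes the AggregateImplementation coprocessor is enabled for the
table; the table and family names are placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
  import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RowCountExample {
      public static void main(String[] args) throws Throwable {
          Configuration conf = HBaseConfiguration.create();
          AggregationClient aggregationClient = new AggregationClient(conf);
          Scan scan = new Scan();
          scan.addFamily(Bytes.toBytes("cf"));            // placeholder family
          long rowCount = aggregationClient.rowCount(
              Bytes.toBytes("mytable"),                   // placeholder table name
              new LongColumnInterpreter(),
              scan);
          System.out.println("rows = " + rowCount);
      }
  }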

From: tgh [guanhua.t...@ia.ac.cn]
Sent: Thursday, December 27, 2012 7:51 AM
To: user@hbase.apache.org
Subject: how to use API to statistic how many message has been store in the 
table in hbase

Hi
I am trying to use the HBase API to store data, and I want to get the
number of messages stored in a table in HBase.
Which API can I use to find out how many messages have been stored in the
table?
Could you help me?


Thank you
-
Tian Guanhua

RE: HBase - Secondary Index

2012-12-19 Thread Anoop Sam John
David
  Not using any existing library like Lucene.  The index data of a table will 
be written in another HBase table.

-Anoop-

From: David Arthur [mum...@gmail.com]
Sent: Thursday, December 20, 2012 8:17 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Very cool design. Just curious, for the index did you write something
custom or using an existing library like Lucene?

-David

On 12/4/12 3:10 AM, Anoop Sam John wrote:
 Hi All

  Last week I got a chance to present the secondary indexing 
 solution what we have done in Huawei at the China Hadoop Conference.  You can 
 see the presentation from 
 http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf



 I would like to hear what others think on this. :)



 -Anoop-


RE: MR missing lines

2012-12-19 Thread Anoop Sam John
Hi All
   Be careful when choosing between Delete#deleteColumn() and
Delete#deleteColumns().
The deleteColumn() API deletes just one version of a column in a given row,
while deleteColumns() deletes all versions of that column.
In Jean's case which API is used will not matter functionally, as he has
only one version per column and only one column in every row.

But deleteColumn() carries an overhead. When it is used without passing a
timestamp (latestTimeStamp comes in by default), a Get operation happens
within the HRegion to find the timestamp of the most recent version of the
column. In short, deleteColumn(cf, qualifier) deletes only the most recent
version of cf:qualifier, while deleteColumns(cf, qualifier) deletes the whole
column from the row (all versions).

-Anoop-
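
To make the difference concrete, a small illustrative snippet (family and
qualifier names are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteExamples {
      static void deletes(HTable table, byte[] row) throws IOException {
          // Only the most recent version of cf:q (server does an internal Get first).
          Delete oneVersion = new Delete(row);
          oneVersion.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));

          // All versions of cf:q.
          Delete allVersions = new Delete(row);
          allVersions.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("q"));

          // The whole row (all families, all columns, all versions).
          Delete wholeRow = new Delete(row);

          table.delete(oneVersion);
          table.delete(allVersions);
          table.delete(wholeRow);
      }
  }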

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Thursday, December 20, 2012 6:09 AM
To: user@hbase.apache.org
Subject: Re: MR missing lines

Hi Anoop,

Thanks for the hint! Even if it's not fixing my issue, at least my
tests are going to be faster.

I will take a look at the documentation to understand what
deleteColumn was doing.

JM

2012/12/19, Anoop Sam John anoo...@huawei.com:
 Jean:  just one thought after seeing the description and the code.. Not
 related to the missing as such

 You want to delete the row fully right?
My table is only one CF with one C with one version
 And your code is like
  Delete delete_entry_proposed = new Delete(key);
  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
 KVs.get(0).getQualifier());

 deleteColumn() is useful when you want to delete specific column's specific
 version in a row.  In your case this may be really not needed. Just Delete
 delete_entry_proposed = new Delete(key);  may be enough so that the delete
 type is ROW delete.

 You can see the javadoc of the deleteColumn() API in which it clearly says
 it is an expensive op. At the server side there will be a need to do a Get
 call..
 In your case these are really unwanted over head .. I think...

 -Anoop-
 
 From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
 Sent: Tuesday, December 18, 2012 7:07 PM
 To: user@hbase.apache.org
 Subject: Re: MR missing lines

 I faced the issue again today...

 RowCounter gave me 104313 lines
 Here is the output of the job counters:
 12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_ADDED=81594
 12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_SIMILAR=434
 12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_NO_CHANGES=14250
 12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_DUPLICATE=428
 12/12/17 22:32:52 INFO mapred.JobClient: NON_DELETED_ROWS=0
 12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_EXISTING=7605
 12/12/17 22:32:52 INFO mapred.JobClient: ROWS_PARSED=104311

 There is a 2 lines difference between ROWS_PARSED and he counter.
 ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
 ENTRY_EXISTING are the 5 states an entry can have. Total of all those
 counters is equal to the ROWS_PARSED value, so it's alligned. Code is
 handling all the possibilities.

 The ROWS_PARSED counter is incremented right at the beginning like
 that (I removed the comments and javadoc for lisibility):
 /**
  * The comments ...
  */
 @Override
 public void map(ImmutableBytesWritable row__, Result values,
 Context
 context) throws IOException
 {


 context.getCounter(Counters.ROWS_PARSED).increment(1);
                List<KeyValue> KVs = values.list();
 try
 {

 // Get the current row.
 byte[] key = values.getRow();

 // First thing we do, we mark this line to
 be deleted.
 Delete delete_entry_proposed = new
 Delete(key);

 delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
 KVs.get(0).getQualifier());

 deletes_entry_proposed.add(delete_entry_proposed);


 The deletes_entry_proposed is a list of rows to delete. After each
 call to the delete method, the number of remaining lines into this
 list is added to NON_DELETED_ROWS which is 0 at the end, so all lines
 should be deleted correctly.

 I re-ran the rowcounter after the job, and I still have ROWS=5971
 lines into the table. I check all my feeding process and they are
 all closed.

 My table is only one CF with one C with one version.

 I can guess that the remaining 5971 lines into the table is an error
 on my side, but I'm not able to find where since all the counters are
 matching. I will add one counter which will add all the entries in the
 delete list before calling the delete method. This should match the
 number of rows.

 Again, I will re-feed the table today with fresh data and re-run the job...

 JM

RE: HBase - Secondary Index

2012-12-18 Thread Anoop Sam John
Anil:
If the scan from the client side does not specify any rowkey range but only the
filter condition, yes, it will go to all the primary table regions. In each
region it will first scan the index table region and then seek to the exact rows
in the main table region. If that region has no data at all matching the filter
condition, the entire region is simply skipped.

In a normal scan too, the request goes only to specific regions if there is a
rowkey range we can specify. It is the same in our secondary index case.

Put simply, for the scan there is no change at all in what happens at the
client side: from the meta data it learns which regions and RSs to contact, and
it contacts those regions one by one and gets data from each region. The only
difference is what happens at the server side. Without an index, all the data
from all the HFiles is fetched at the server side and the filter is applied to
every row; only the rows that pass the filter go back to the client. With an
index, when the scanning happens at the server side the index data is scanned
first from the index region. This region is in the same RS, so there are no
extra RPCs. The data to be scanned from the index table is limited, since we
can create a start key and stop key for it. Based on the result of the index
scan we know the rowkeys where the data we are interested in resides, so a
reseek happens to those rows and only those rows are read. The time spent at
the server side scanning a region is therefore reduced by a large amount.

Yes, but still there will be calls from the client side to the RS for each
region...

Now I think it should be clear. The presentation that I have shared says the
same thing; it shows what is happening at the server side.
-Anoop-


From: anil gupta [anilgupt...@gmail.com]
Sent: Tuesday, December 18, 2012 1:58 PM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Hi Anoop,

Please find my reply inline.

Thanks,
Anil Gupta

On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John anoo...@huawei.com wrote:

 Hi Anil
 During the scan, there is no need to fetch any index data
 to client side. So there is no need to create any scanner on the index
 table at the client side. This happens at the server side.




 For the Scan on the main table with condition on timestamp and customer
 id, a scanner to be created with Filters. Yes like normal when there is no
 secondary index. So this scan from the client will go through all the
 regions in the main table.


Anil: Do you mean that if the table is spread across 50 region servers in
60 node cluster then we need to send a scan request to all the 50 RS.
Right? Doesn't it sounds expensive? IMHO you were not doing this in your
solution. Your solution looked cleaner than this since you exactly knew
which Node you need to go to for querying while using secondary index due
to co-location(due to static begin part for secondary table rowkey) of
region of primary table and secondary index table. My problem is little
more complicated due to the constraints that: I cannot have a static begin
part in the rowkey of my secondary table.

When it scans one particular region say (x,y] on the main table, using the
 CP we can get the index table region object corresponding to this main
 table region from the RS.  There is no issue in creating the static part of
 the rowkey. You know 'x' is the region start key. Then at the server side
 will create a scanner on the index region directly and here we can specify
 the startkey. 'x' + timestamp value + customer id..  Using the results
 from the index scan we will make reseek on the main region to the exact
 rows where the data what we are interested in is available. So there wont
 be a full region data scan happening.


 When in the cases where only timestamp is there but no customer id, it
 will be simple again. Create a scanner on the main table with only one
 filter. At the CP side the scanner on the index region will get created
 with startkey as 'x' + timestamp value..When you create the scan
 object and set startRow on that it need not be the full rowkey. It can be
 part of the rowkey also. Yes like prefix.

 Hope u got it now :)

Anil: I hope now we are on same page. Thanks a lot for your valuable time
to discuss this stuff.


 -Anoop-
 
 From: anil gupta [anilgupt...@gmail.com]
 Sent: Friday, December 14, 2012 11:31 PM
 To: user@hbase.apache.org
 Subject: Re: HBase - Secondary Index

 On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John anoo...@huawei.com
 wrote:

  Hi Anil,
 
  1. In your presentation you mentioned that region of Primary Table and
  Region of Secondary Table are always located on the same region server.
 How
  do you achieve it? By using

RE: HBase - Secondary Index

2012-12-18 Thread Anoop Sam John
Hi Mike
My question is that since you don't have any formal SQL syntax, how are you 
doing this all server side?
I think the question is for Anil. In his case he is not doing the index data
scan at the server side; he scans the index table data back to the client and
then does gets from the client to fetch the main table data. Correct, Anil?
Just making it clear... :)

-Anoop-

From: Michel Segel [michael_se...@hotmail.com]
Sent: Tuesday, December 18, 2012 2:32 PM
To: user@hbase.apache.org
Cc: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one 
from a couple of choices. Keeping it simple, you choose an inverted table as 
your index.

In doing so, you have one column containing all of the row ids for a given 
value.
This means that it is a simple get().

My question is that since you don't have any formal SQL syntax, how are you 
doing this all server side?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 18, 2012, at 2:28 AM, anil gupta anilgupt...@gmail.com wrote:

 Hi Anoop,

 Please find my reply inline.

 Thanks,
 Anil Gupta

 On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John anoo...@huawei.com wrote:

 Hi Anil
During the scan, there is no need to fetch any index data
 to client side. So there is no need to create any scanner on the index
 table at the client side. This happens at the server side.



 For the Scan on the main table with condition on timestamp and customer
 id, a scanner to be created with Filters. Yes like normal when there is no
 secondary index. So this scan from the client will go through all the
 regions in the main table.


 Anil: Do you mean that if the table is spread across 50 region servers in
 60 node cluster then we need to send a scan request to all the 50 RS.
 Right? Doesn't it sounds expensive? IMHO you were not doing this in your
 solution. Your solution looked cleaner than this since you exactly knew
 which Node you need to go to for querying while using secondary index due
 to co-location(due to static begin part for secondary table rowkey) of
 region of primary table and secondary index table. My problem is little
 more complicated due to the constraints that: I cannot have a static begin
 part in the rowkey of my secondary table.

 When it scans one particular region say (x,y] on the main table, using the
 CP we can get the index table region object corresponding to this main
 table region from the RS.  There is no issue in creating the static part of
 the rowkey. You know 'x' is the region start key. Then at the server side
 will create a scanner on the index region directly and here we can specify
 the startkey. 'x' + timestamp value + customer id..  Using the results
 from the index scan we will make reseek on the main region to the exact
 rows where the data what we are interested in is available. So there wont
 be a full region data scan happening.

 When in the cases where only timestamp is there but no customer id, it
 will be simple again. Create a scanner on the main table with only one
 filter. At the CP side the scanner on the index region will get created
 with startkey as 'x' + timestamp value..When you create the scan
 object and set startRow on that it need not be the full rowkey. It can be
 part of the rowkey also. Yes like prefix.

 Hope u got it now :)
 Anil: I hope now we are on same page. Thanks a lot for your valuable time
 to discuss this stuff.


 -Anoop-
 
 From: anil gupta [anilgupt...@gmail.com]
 Sent: Friday, December 14, 2012 11:31 PM
 To: user@hbase.apache.org
 Subject: Re: HBase - Secondary Index

 On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John anoo...@huawei.com
 wrote:

 Hi Anil,

 1. In your presentation you mentioned that region of Primary Table and
 Region of Secondary Table are always located on the same region server.
 How
 do you achieve it? By using the Primary table rowkey as prefix of  Rowkey
 of Secondary Table? Will your implementation work if the rowkey of
 primary
 table cannot be used as prefix in rowkey of Secondary table( i have this
 limitation in my use case)?
 First all there will be same number of regions in both primary and index
 tables. All the start/stop keys of the regions also will be same.
 Suppose there are 2 regions on main table say for keys 0-10 and 10-20.
 Then we will create 2 regions in index table also with same key ranges.
 At the master balancing level it is easy to collocate these regions
 seeing
 the start and end keys.
 When the selection of the rowkey that will be used in the index table is
 the key here.
 What we will do is all the rowkeys in the index table will be prefixed
 with the start key of the region/
 When an entry is added to the main table with rowkey as 5 it will go to
 the 1st region (0-10)
 Now there will be index region with range as 0-10.  We

RE: MR missing lines

2012-12-18 Thread Anoop Sam John
Jean:  just one thought after seeing the description and the code; it is not
related to the missing lines as such.

You want to delete the row fully right?
My table is only one CF with one C with one version
And your code is like
  Delete delete_entry_proposed = new Delete(key);
  delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(), 
 KVs.get(0).getQualifier());

deleteColumn() is useful when you want to delete a specific version of a
specific column in a row. In your case this may really not be needed; just
Delete delete_entry_proposed = new Delete(key);  may be enough, so that the
delete type is a ROW delete.

You can see in the javadoc of the deleteColumn() API that it clearly says it
is an expensive op: at the server side a Get call is needed.
In your case this is really unwanted overhead, I think...

-Anoop-
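
For illustration, a minimal sketch of the simplified version (assuming the goal
really is to remove the whole row; the list name mirrors the code quoted below):

  import java.util.List;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.Result;

  public class RowDeleteSketch {
      // Queue a plain ROW delete for the current row - no deleteColumn() call,
      // so no extra Get on the server side.
      static void markRowForDeletion(Result values, List<Delete> deletes_entry_proposed) {
          byte[] key = values.getRow();
          deletes_entry_proposed.add(new Delete(key));
      }
  }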

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Tuesday, December 18, 2012 7:07 PM
To: user@hbase.apache.org
Subject: Re: MR missing lines

I faced the issue again today...

RowCounter gave me 104313 lines
Here is the output of the job counters:
12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_ADDED=81594
12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_SIMILAR=434
12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_NO_CHANGES=14250
12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_DUPLICATE=428
12/12/17 22:32:52 INFO mapred.JobClient: NON_DELETED_ROWS=0
12/12/17 22:32:52 INFO mapred.JobClient: ENTRY_EXISTING=7605
12/12/17 22:32:52 INFO mapred.JobClient: ROWS_PARSED=104311

There is a 2 lines difference between ROWS_PARSED and he counter.
ENTRY_ADDED, ENTRY_SIMILAR, ENTRY_NO_CHANGES, ENTRY_DUPLICATE and
ENTRY_EXISTING are the 5 states an entry can have. Total of all those
counters is equal to the ROWS_PARSED value, so it's alligned. Code is
handling all the possibilities.

The ROWS_PARSED counter is incremented right at the beginning like
that (I removed the comments and javadoc for lisibility):
/**
 * The comments ...
 */
@Override
public void map(ImmutableBytesWritable row__, Result values, 
Context
context) throws IOException
{

context.getCounter(Counters.ROWS_PARSED).increment(1);
                List<KeyValue> KVs = values.list();
try
{

// Get the current row.
byte[] key = values.getRow();

// First thing we do, we mark this line to be 
deleted.
Delete delete_entry_proposed = new Delete(key);

delete_entry_proposed.deleteColumn(KVs.get(0).getFamily(),
KVs.get(0).getQualifier());

deletes_entry_proposed.add(delete_entry_proposed);


The deletes_entry_proposed is a list of rows to delete. After each
call to the delete method, the number of remaining lines into this
list is added to NON_DELETED_ROWS which is 0 at the end, so all lines
should be deleted correctly.

I re-ran the rowcounter after the job, and I still have ROWS=5971
lines into the table. I check all my feeding process and they are
all closed.

My table is only one CF with one C with one version.

I can guess that the remaining 5971 lines into the table is an error
on my side, but I'm not able to find where since all the counters are
matching. I will add one counter which will add all the entries in the
delete list before calling the delete method. This should match the
number of rows.

Again, I will re-feed the table today with fresh data and re-run the job...

JM

2012/12/17, Jean-Marc Spaggiari jean-m...@spaggiari.org:
 The job run the morning, and of course, this time, all the rows got
 processed ;)

 So I will give it few other tries and will keep you posted if I'm able
 to reproduce that again.

 Thanks,

 JM

 2012/12/16, Jean-Marc Spaggiari jean-m...@spaggiari.org:
 Thanks for the suggestions.

 I already have logs to display all the exepctions and there is
 nothing. I can't display the work done, there is to much :(

 I have counters counting the rows processed and they match what is
 done, minus what is not processed. I have just added few other
 counters. One right at the beginning, and one to count what are the
 records remaining on the delete list, as suggested.

 I will run the job again tomorrow, see the result and keep you posted.

 JM


 2012/12/16, Asaf Mesika asaf.mes...@gmail.com:
 Did you check the returned array of the delete method to make sure all
 records sent for delete have been deleted?

 Sent from my iPhone

 On 16 בדצמ 2012, at 14:52, Jean-Marc Spaggiari jean-m...@spaggiari.org
 wrote:

 Hi,

 I have a table where I'm running MR each time is exceding 100 000 rows.

 When the target is reached, all the feeding process are stopped.

 Yesterday it reached 123608 

RE: HBase - Secondary Index

2012-12-16 Thread Anoop Sam John
Hi Anil
                During the scan, there is no need to fetch any index data to
the client side, so there is no need to create any scanner on the index table at
the client side. This happens at the server side.

For the Scan on the main table with a condition on timestamp and customer id, a
scanner is created with Filters, just as in the normal case when there is no
secondary index. So this scan from the client will go through all the regions in
the main table. When it scans one particular region, say (x,y], on the main
table, using the CP we can get the index table region object corresponding to
this main table region from the RS. There is no issue in creating the static
part of the rowkey: you know 'x' is the region start key. Then the server side
will create a scanner on the index region directly, and here we can specify the
startkey as 'x' + timestamp value + customer id. Using the results from the
index scan we reseek on the main region to the exact rows where the data we are
interested in is available, so there won't be a full region data scan.

In the case where only the timestamp is there but no customer id, it is simple
again. Create a scanner on the main table with only one filter. At the CP side
the scanner on the index region will get created with startkey as 'x' +
timestamp value. When you create the Scan object and set startRow on it, it
need not be the full rowkey; it can also be a part of the rowkey, i.e. a
prefix.

Hope you get it now :)

-Anoop-
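
Since the implementation itself is not published, the following is only a rough,
hypothetical sketch of how the index-region start row described above could be
composed ('x' + timestamp value + customer id); all names are illustrative.

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class IndexScanSketch {
      // Build the scan on the index region: startRow = region start key +
      // timestamp [+ customer id]. A prefix is enough; it need not be a full rowkey.
      static Scan buildIndexScan(byte[] regionStartKey, long timestamp, byte[] customerId) {
          byte[] startRow = Bytes.add(regionStartKey, Bytes.toBytes(timestamp));
          if (customerId != null) {
              startRow = Bytes.add(startRow, customerId);
          }
          Scan indexScan = new Scan();
          indexScan.setStartRow(startRow);
          return indexScan;
      }
  }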

From: anil gupta [anilgupt...@gmail.com]
Sent: Friday, December 14, 2012 11:31 PM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John anoo...@huawei.com wrote:

 Hi Anil,

 1. In your presentation you mentioned that region of Primary Table and
 Region of Secondary Table are always located on the same region server. How
 do you achieve it? By using the Primary table rowkey as prefix of  Rowkey
 of Secondary Table? Will your implementation work if the rowkey of primary
 table cannot be used as prefix in rowkey of Secondary table( i have this
 limitation in my use case)?
 First all there will be same number of regions in both primary and index
 tables. All the start/stop keys of the regions also will be same.
 Suppose there are 2 regions on main table say for keys 0-10 and 10-20.
  Then we will create 2 regions in index table also with same key ranges.
 At the master balancing level it is easy to collocate these regions seeing
 the start and end keys.
 When the selection of the rowkey that will be used in the index table is
 the key here.
 What we will do is all the rowkeys in the index table will be prefixed
 with the start key of the region/
 When an entry is added to the main table with rowkey as 5 it will go to
 the 1st region (0-10)
 Now there will be index region with range as 0-10.  We will select this
 region to store this index data.
 The row getting added into the index region for this entry will have a
 rowkey 0_x_5
 I am just using '_' as a seperator here just to show this. Actually we
 wont be having any seperator.
 So the rowkeys (in index region) will have a static begin part always.
  Will scan time also we know this part and so the startrow and endrow
 creation for the scan will be possible.. Note that we will store the actual
 table row key as the last part of the index rowkey itself not as a value.
 This is better option in our case of handling the scan index usage also at
 sever side.  There is no index data fetch to client side..


Anil: My primary table rowkey is customerId+event_id, and my secondary
table rowkey is timestamp+ customerid. In your implementation it seems like
for using secondary index the application needs to know about the
start_key of the region(static begin part) it wants to query. Right? Do
you separately manage the logic of determining the region
start_key(static begin part) for a scan?
Also, Its possible that while using secondary index the customerId is not
provided. So, i wont be having customer id for all the queries. Hence i
cannot use customer_id as a prefix in rowkey of my Secondary Table.


 I feel your use case perfectly fit with our model

Anil: Somehow i am unable to fit your implementation into my use case due
to the constraint of static begin part of rowkey in Secondary table. There
seems to be a disconnect. Can you tell me how does my use case fits into
your implementation?


 2. Are you using an Endpoint or Observer for building the secondary index
 table?
 Observer

 3. Custom balancer do collocation. Is it a custom load balancer of HBase
 Master or something else?
 It is a balancer implementation which will be plugged into Master

 4. Your region split looks interesting. I dont have much info about it.
 Can
 you point to some docs on IndexHalfStoreFileReader?
 Sorry I am not able to publish any design doc or code as the company has
 not decided

RE: Re:Re: Counter and Coprocessor Musing

2012-12-11 Thread Anoop Sam John
Agree with Azury
Ted : He mentions something different from HBASE-5982.
If the count of the rows is maintained in another meta table, then getting the
row count from that will be much faster than AggregateImplementation's
getRowNum, I think.

For a specific use case someone can build this using a CP, but a generic
implementation might be difficult. How would we handle versioning? When a new
version comes in for an existing row, we should not increment the count. TTLs
would also have to be handled.

-Anoop-
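
As a very rough illustration of the "row count in another meta table" idea
(ignoring the versioning and TTL caveats above), a client could keep one counter
row per table and bump it atomically; the table and column names here are
placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TableRowCounter {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable meta = new HTable(conf, "table_counters");   // hypothetical meta table
          // Bump the counter for 'mytable' by 1 for every newly inserted row.
          long newCount = meta.incrementColumnValue(
              Bytes.toBytes("mytable"), Bytes.toBytes("c"), Bytes.toBytes("rows"), 1L);
          System.out.println("rows in mytable = " + newCount);
          meta.close();
      }
  }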

From: Azury [ziqidonglai1...@126.com]
Sent: Wednesday, December 12, 2012 9:40 AM
To: user@hbase.apache.org
Subject: Re:Re: Counter and Coprocessor Musing

Hi Ted,
I think he want to table 'meta data', not similar to Coprocessor.
such as long rows = table.rows();

just probably, not sure about that.



At 2012-12-12 01:11:49,Ted Yu yuzhih...@gmail.com wrote:
Thanks for sharing your thoughts.

Which HBase version are you currently using ?
Have you looked at AggregateImplementation which is included in hbase jar ?
A count operation (getRowNum) is in AggregateImplementation.

It would be nice if you can tell us how much difference (in terms of
response time) this aggregation lags your expectation.

Also take a look at HBASE-5982 HBase Coprocessor Local Aggregation

Cheers

On Tue, Dec 11, 2012 at 6:50 AM, nicolas maillard 
nicolas.maill...@fifty-five.com wrote:

 Hi everyone

 While working with hbase and looking at what the tables and meta look like
 I
 hava
 thought of a couple things, maybe someone has insights.
 My thoughts are around the count situation it is a current database
 process to
 count entries for a given query.
 for example as a first check to see if everything is written or sometimes
 to get
 a
 feel of a population.
 I was wondering 2 things:
 - Should'nt Hbase keep in the metrics for a table it's total entry count?
 this would not take too much space and often comes in handy. Granted with a
 coprocessor you could easily create a table with counters for all the other
 tables in the system but it would be a nice have as a standard.

 - I was also wondering maybe every region could know the number of entries
 it
 contains. Every region already knows the start and endkey of it's entries.
 For a
 count on a given scan this would speed up the count. Every region who's
 start
 and
 and endkey are in the scan would just send back it's population count and
 only a
 region that is wider then the count would need to be scanned and counted.

 Wondering if these thoughts are already implemented and if I'm missing
 something
 or would not be a good idea. Altenratly if this is a not a definite No for
 some
 reason could coprocessors allow to implement these thoughts. Can I with a
 coprocessor write in the metrics part, or on a given scan first check if,
 for a
 region smaller than my scan, I already have written somewhere the count
 instead
 of
 scanning and couning.

 Thnaks for any thoughts you may have



RE: Heterogeneous cluster

2012-12-10 Thread Anoop Sam John
But if the job is running there, it can also be
considered as running locally, right? Or will it always be retrieved
from the datanode linked to the RS hosting the region we are dealing
with? Not sure I'm clear :(

Hi Jean,
                 Sorry, I have not seen the history of this mailing thread. As
far as I can see from your question, the MR job is scanning HTable data; even if
the job is running on a node holding a replica, I don't think the read will be
local. The MR job needs to fetch the data via HBase only, meaning it needs to
contact the RS hosting the region. HBase will then in turn contact any of the
DNs where the data is available, so there are multiple steps. There is nothing
like one RS being linked in some way to one DN. Which DN the data is fetched
from depends on the decision taken by the DFS client. It may not contact any DN
at all but do a local read, if the short-circuit read option is enabled and the
data is on the same server where the region is hosted. I hope I have made it
clear here. :)

-Anoop-


From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Monday, December 10, 2012 7:33 PM
To: user@hbase.apache.org
Subject: Re: Heterogeneous cluster

@Asaf  Robert: I have posted the code here. But be careful with it.
Read Mike's comment above.
http://www.spaggiari.org/index.php/hbase/changing-the-hbase-default-loadbalancer
I'm a newby on HBase, so you're better to rely on someone more
experienced feedback.

@Mike:

Hi Mike,

I totally agree with your opinion. My balancer is totally a hack on a
'Frankencluster' (BTW, I LOVE this description! Perfect fit!) and a
way for me to take a deeper look at HBase's code.

One question about data locality. When you run an HBase MR, even with
a factor 3 replication, data is considered local only if it's running
on the RS version the region is stored. But does HBase has a way to
see if it can be run on any of the replicats? The replicate might be
on a different rack. But if the job is running there, it can also be
considered as running locally, right? Or will it always be retrieved
from the datanode linked to the RS hosting the region we are dealing
with? Not sure I'm clear :(

JM

2012/12/9, Michael Segel michael_se...@hotmail.com:
 Ok...

 From a production/commercial grade answer...

 With respect to HBase, you will have 1 live copy and 2 replications.
 (Assuming you didn't change this.) So when you run against HBase, data
 locality becomes less of an issue.
 And again, you have to temper that with that it depends on the number of
 regions within the table...

 A lot of people, including committers tend to get hung up on some of the
 details and they tend to lose focus on the larger picture.

 If you were running a production cluster and your one node was radically
 different... then you would be better off taking it out of the cluster and
 making it an edge node. (Edge nodes are very important...)

 If we're talking about a very large cluster which has evolved... then you
 would want to work out your rack aware placements.  Note that rack aware is
 a logical and not a physical location. So you can modify it to let the
 distro's placement take the hint and move the data.  This is more of a cheat
 and even here... I think that at scale, the potential improvement gains are
 going to be minimal.

 This works for everything but HBase.

 On that note, it doesn't matter. Again, assume that you have your data
 equally distributed around the cluster and that your access pattern is to
 all nodes in the cluster.  The parallelization in the cluster will average
 out the slow ones.

 In terms of your small research clusters...

 You're not looking at performance when you build a 'Frankencluster'

 Specifically to your case... move all the data to that node and you end up
 with both a networking and disk i/o bottlenecks.

 You're worried about the noise.

 Having said that...

 If you want to improve the balancer code, sure, however, you're going to
 need to do some work where you capture your cluster's statistics so that the
 balancer has more intelligence.

 You may start off wanting to allow HBase to take hints about the cluster,
 but in truth, I don't think its a good idea. Note, I realize that you and
 Jean-Marc are not suggesting that it is your intent to add something like
 this, but that someone will create a JIRA and then someone else may act upon
 it

 IMHO, that's a lot of work, adding intelligence to the HBase Scheduler and I
 don't think it will really make a difference in terms of overall
 performance.


 Just saying...

 -Mike

 On Dec 8, 2012, at 5:50 PM, Robert Dyer rd...@iastate.edu wrote:

 I of course can not speak for Jean-Marc, however my use case is not very
 corporate.  It is a small cluster (9 nodes) and only 1 of those nodes is
 different (drastically different).

 And yes, I configured it so that node has a lot more map slots.  However,
 the problem is HBase balances without regard to 

RE: .META. region server DDOSed by too many clients

2012-12-05 Thread Anoop Sam John

is the META table cached just like other tables 
Yes Varun I think so. 

-Anoop-

From: Varun Sharma [va...@pinterest.com]
Sent: Thursday, December 06, 2012 6:10 AM
To: user@hbase.apache.org; lars hofhansl
Subject: Re: .META. region server DDOSed by too many clients

We only see this on the .META. region not otherwise...

On Wed, Dec 5, 2012 at 4:37 PM, Varun Sharma va...@pinterest.com wrote:

 I see but is this pointing to the fact that we are heading to disk for
 scanning META - if yes, that would be pretty bad, no ? Currently I am
 trying to see if the freeze coincides with Block Cache being full (we have
 an inmemory column) - is the META table cached just like other tables ?

 Varun


 On Wed, Dec 5, 2012 at 4:20 PM, lars hofhansl lhofha...@yahoo.com wrote:

 Looks like you're running into HBASE-5898.



 - Original Message -
 From: Varun Sharma va...@pinterest.com
 To: user@hbase.apache.org
 Cc:
 Sent: Wednesday, December 5, 2012 3:51 PM
 Subject: .META. region server DDOSed by too many clients

 Hi,

 I am running hbase 0.94.0 and I have a significant write load being put on
 a table with 98 regions on a 15 node cluster - also this write load comes
 from a very large number of clients (~ 1000). I am running with 10
 priority
 IPC handlers and 200 IPC handlers. It seems the region server holding
 .META
 is DDOSed. All the 200 handlers are busy serving the .META. region and
 they
 are all locked onto on object. The Jstack is here for the regoin server

 IPC Server handler 182 on 60020 daemon prio=10 tid=0x7f329872c800
 nid=0x4401 waiting on condition [0x7f328807f000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  0x000542d72e30 (a
 java.util.concurrent.locks.ReentrantLock$NonfairSync)
 at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:871)
 at

 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
 at

 java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
 at
 java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
 at

 java.util.concurrent.ConcurrentHashMap$Segment.put(ConcurrentHashMap.java:445)
 at

 java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:925)
 at
 org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:71)
 at

 org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:290)
 at

 org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:213)
 at

 org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:455)
 at

 org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:493)
 at

 org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:242)
 at

 org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:167)
 at

 org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
 at

 org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:299)
 at

 org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:244)
 at

 org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:521)
 - locked 0x00063b4965d0 (a
 org.apache.hadoop.hbase.regionserver.StoreScanner)
 at

 org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:402)
 - locked 0x00063b4965d0 (a
 org.apache.hadoop.hbase.regionserver.StoreScanner)
 at

 org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
 at

 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3354)
 at

 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3310)
 - locked 0x000523c211e0 (a
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
 at

 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3327)
 - locked 0x000523c211e0 (a
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4066)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4039)
 at

 org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1941)

 The client side trace shows that we are looking 

RE: Reg:delete performance on HBase table

2012-12-05 Thread Anoop Sam John
Hi Manoj
    If I read you correctly, you want to aggregate some 3-4 days of
data and then have that data deleted. Can you think of creating tables per
period (one table for every 4 days), aggregating, and then dropping the table?
Then another table for the next 4 days?

Another option is the TTL which HBase provides.

-Anoop-
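
For the TTL option, a small illustrative example of setting a time-to-live on
the column family at creation time (0.94-era API; names and the 4-day value are
placeholders, and expired cells are physically removed only on compaction):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateTableWithTtl {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          HTableDescriptor desc = new HTableDescriptor("events");   // placeholder
          HColumnDescriptor cf = new HColumnDescriptor("cf");
          cf.setTimeToLive(4 * 24 * 60 * 60);   // keep cells for ~4 days (seconds)
          desc.addFamily(cf);
          admin.createTable(desc);
      }
  }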

From: Manoj Babu [manoj...@gmail.com]
Sent: Thursday, December 06, 2012 8:44 AM
To: user
Subject: Re: Reg:delete performance on HBase table

Team,

Thank you very much for the valuable information.

HBase version am using is:
HBase Version0.90.3-cdh3u1, r

Use case is:
We are collecting information on where the user is spending time in our
site(tracking the user events) also we are doing historical data migration
from existing system also based on the data we need to populate metrics for
the year. like Customer A hits option x n times, hits option y n
times, Customer B hits option x1 n times, hits option y1 n time.

Earlier by using Hadoop MapReduce we are aggregating the whole year data
every 2 or 4 days once and using DBOutputFormat emiting to Oracle Table and
for inserting 181 Million rows it took only 20 mins through 20 reducers
hitting parallel so before populating the year table we use to delete
the existing 181 Million rows of that year alone but it tooks more than
3hrs even not deleted then by killing the session done a truncate actually
we are in development stage so planning to try HBase for this case since
delete is taking too much time in oracle for millions of rows.


Need to delete rows based on the year only cannot drop, In oracle also
truncate is extremely fast.

Cheers!
Manoj.



On Wed, Dec 5, 2012 at 11:44 PM, Nick Dimiduk ndimi...@gmail.com wrote:

 On Wed, Dec 5, 2012 at 7:46 AM, Doug Meil doug.m...@explorysmedical.com
 wrote:

  You probably want to read this section on the RefGuide about deleting
 from
  HBase.
 
  http://hbase.apache.org/book.html#perf.deleting


 So hold on. From the guide:

 11.9.2. Delete RPC Behavior
 

  Be aware that htable.delete(Delete) doesn't use the writeBuffer. It will
  execute a RegionServer RPC with each invocation. For a large number of
  deletes, consider htable.delete(List<Delete>).
 

  See
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29


 So Deletes are like Puts except they're not executed the same why. Indeed,
 HTable.put() is implemented using the write buffer while HTable.delete()
 makes a MutateRequest directly. What is the reason for this? Why is the
 semantic of Delete subtly different from Put?

 For that matter, why not buffer all mutation operations?
 HTable.checkAndPut(), checkAndDelete() both make direct MutateRequest calls
 as well.

 Thanks,
 -n


HBase - Secondary Index

2012-12-04 Thread Anoop Sam John
Hi All

            Last week I got a chance to present the secondary indexing solution
that we have built at Huawei at the China Hadoop Conference.  You can see the
presentation at
http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf



I would like to hear what others think on this. :)



-Anoop-


RE: Data Locality, HBase? Or Hadoop?

2012-12-03 Thread Anoop Sam John
I think all is clear now. Just to conclude, data locality is a feature
provided by HDFS. When the DFS client writes some data, Hadoop will try to
maintain data locality. The HBase region server writes and reads data via the
DFS client, which is in the same process as the RS, so when a flush happens
data locality is achieved for that data. Later, when the region is moved by the
balancer or manually, data locality may again become available after a
compaction, as the compaction rewrites the data into HDFS (merging many files
into one HFile).
If a major compaction is done, all the data becomes local. If it is a minor
compaction, only the data present in the minor-compacted files is rewritten
into a new HFile, and thus only that much locality is regained. :)

-Anoop-
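
For example, after regions have been moved, a major compaction can be triggered
from the shell (major_compact 'tablename') or via the client API to bring the
data local again; a minimal sketch, with a placeholder table name:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CompactForLocality {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          // Rewrites the table's HFiles; the new files are written by the hosting
          // region servers, so their blocks end up local to those servers.
          admin.majorCompact("mytable");
      }
  }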

From: Jean-Marc Spaggiari [jean-m...@spaggiari.org]
Sent: Monday, December 03, 2012 9:23 PM
To: user@hbase.apache.org
Subject: Re: Data Locality, HBase? Or Hadoop?

Ok. I will try the major compaction then ;)

Doug, thanks for pointing to the doc! I now totally understand why
it's moved locally when the compaction occurs!

Thanks all! I will give that a try very shortly.

JM

2012/12/3, Doug Meil doug.m...@explorysmedical.com:

 Hi there-

 This is also discussed in the Regions section in the RefGuide:

 http://hbase.apache.org/book.html#regions.arch

 9.7.3. Region-RegionServer Locality




 On 12/3/12 10:08 AM, Kevin O'dell kevin.od...@cloudera.com wrote:

JM,

  If you have disabled the balancer and are manually moving regions, you
will need to run a compaction on those regions.  That is the only(logical)
way of bringing the data local.  HDFS does not have a concept of HBase
locality.  HBase locality is all managed through major and minor
compactions.

On Mon, Dec 3, 2012 at 10:04 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi,

 I'm wondering who is taking care of the data locality. Is it hadoop? Or
 hbase?

 Let's say I have disabled the load balancer and I'm manually moving a
 region to a specific server. Who is going to take care that the data
 is going to be on the same datanode as the regionserver I moved the
 region to? Is hadoop going to see that my region is now on this region
 server and make sure my data is moved there too? Or is hbase going to
 ask hadoop to do it?

 Or, since I moved it manually, there is not any data locality
guaranteed?

 Thanks,

 JM




--
Kevin O'Dell
Customer Operations Engineer, Cloudera




RE: Long row + column keys

2012-12-03 Thread Anoop Sam John
Hi Varun
 It looks very clear that you need to use some sort of
encoding scheme. Prefix encoding may be fine; you can also look at the other
algorithms, like FastDiff, and see how much space each can save in your case. I
would also suggest using the encoding for data on disk as well as in memory
(block cache).
The total key size, as far as I know, would be 8 + 12 + 8 (timestamp) = 28
bytes
In every KV that gets stored, the size would be
4 (key length) + 4 (value length) + 2 (rowkey length) + 8 (rowkey) + 1 (cf length) +
12 (cf + qualifier) + 8 (timestamp) + 1 (type PUT/DELETE...) + value (0 bytes in
your case) = 40 bytes per KeyValue...

Just making it clear for you :)

-Anoop-
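
For reference, an illustrative snippet of enabling a data block encoding on the
column family with the 0.94-era API (the enum values are PREFIX, DIFF and
FAST_DIFF; the family name is a placeholder):

  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

  public class EncodingExample {
      static HColumnDescriptor encodedFamily() {
          HColumnDescriptor cf = new HColumnDescriptor("cf");
          cf.setDataBlockEncoding(DataBlockEncoding.PREFIX);   // or FAST_DIFF / DIFF
          cf.setEncodeOnDisk(true);   // apply on disk as well as in the block cache
          return cf;
      }
  }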

From: Varun Sharma [va...@pinterest.com]
Sent: Tuesday, December 04, 2012 2:36 AM
To: Marcos Ortiz
Cc: user@hbase.apache.org
Subject: Re: Long row + column keys

Hi Marcos,

Thanks for the links. We have gone through these and thought about the
schema. My question is about whether using PrefixDeltaEncoding makes sense
in our situation...

Varun

On Mon, Dec 3, 2012 at 12:36 PM, Marcos Ortiz mlor...@uci.cu wrote:

 Regards, Varun.
 I think that you can see Benoit Sigoure's (@tsuna) talk called
 "Lessons learned from OpenTSDB" at the last
 HBaseCon. [1]
 He explained in great detail how to design your schema to obtain the best
 performance from HBase.

 Other recommended talks are: HBase Internals from Lars, and HBase
 Schema Design from Ian
 [2][3]

 [1] http://www.slideshare.net/cloudera/4-opentsdb-hbasecon
 [2] http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final/
 [3] http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012


 On 12/03/2012 02:58 PM, Varun Sharma wrote:

 Hi,

 I have a schema where the rows are 8 bytes long and the columns are 12
 bytes long (roughly 1000 columns per row). The value is 0 bytes. Is this
 going to be space inefficient in terms of HFile size (large index +
 blocks)
 ? The total key size, as far as i know, would be 8 + 12 + 8 (timestamp) =
 28 bytes. I am using hbase 0.94.0 which has HFile v2.

 Yes, like you said, HFile v2 is included in 0.94, but although it is in trunk
 right now, you should
 keep following the development of HBase, focused on HBASE-5313 and
 HBASE-5521, because
 the development team is working on a new file storage format called HFile
 v3, based on a columnar
 format called Trevni for Avro by Doug Cutting. [4][5][6][7]


 [4] https://issues.apache.org/jira/browse/HBASE-5313
 [5] https://issues.apache.org/jira/browse/HBASE-5521
 [6] https://github.com/cutting/trevni
 [7] https://issues.apache.org/jira/browse/AVRO-806




 Also, should I be using an encoding technique to get the number of bytes
 down (like PrefixDeltaEncoding) which is provided by hbase ?

 Read Cloudera's blog post called "HBase I/O - HFile" to see how the Prefix
 and Diff encodings
 work, and decide which is the more suitable for you. [8]


 [8] http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/

 I hope that all this information could help you.
 Best wishes






RE: Changing column family in hbase

2012-11-28 Thread Anoop Sam John
Do you already have data in the current table schema,
and you want to somehow move that data to the new CF? If yes, I don't think it
is possible. A similar question was asked on the mailing list today. Is your
scenario also the same?

-Anoop-

From: raviprasa...@polarisft.com [raviprasa...@polarisft.com]
Sent: Wednesday, November 28, 2012 4:34 PM
To: user@hbase.apache.org
Subject: Re: Changing column family in hbase

Hi Mohammad,
  Our requirement is that,
  Initially we have created a table 'emp'  with the below detail in single 
column family (cfemp )
cfemp:eno
cfemp:ename
cfemp:address

After that we have added two columns  in the table  (  address2, address3 and  
city),
  We plan to create a new coloum family called  'cfadd'  to store all the 
address details as below
 cfemp:eno
 cfemp:ename
 cfadd:address   -- Previously it was in cfemp column family
 cfadd:address2
 cfadd:address3
 cfadd:city

Regards,
Ravi

-Mohammad Tariq donta...@gmail.com wrote: -
To: user@hbase.apache.org user@hbase.apache.org
From: Mohammad Tariq donta...@gmail.com
Date: 11/28/2012 02:39PM
Subject: Re: Changing column family in hbase

Hello Ravi,

Short answer, no. We don't have a way to achieve this. At lest I am not
aware of any. (Please share with us if you are able to achieve this.)

But, just out of curiosity, I would like to ask you, why would you want to
do that? I mean I don't see any fundamental difference between both the
schemata.

Regards,
Mohammad Tariq



On Wed, Nov 28, 2012 at 2:28 PM, raviprasa...@polarisft.com wrote:

  Hi,
   We need to change columns of Hbase table from one column family to
 another column family .

 Example :-
 HBase table Name :-  emp

 Column family :-  cf1
 Under the column family cf1, we have the following columns
   cf1: no
   cf1:name
   cf1:salary
   cf1: job

 We have created another column family called  'cf2'  in the same table
 'emp'

 How to replace Column family cf1's   columns to another column family
 'cf2' like below ?

   cf2: no
cf2:name
cf2:salary
cf2: job

 Regards
 Ravi







RE: Aggregation while Bulk Loading into HBase

2012-11-28 Thread Anoop Sam John

Hi,
Looks like you do not want more than one table instance in the Mapper. On
that one table instance you want to do a Get before doing the Put.
See TableOutputFormat and try changing its code to implement your requirement,
then use this custom output format.

-Anoop-
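
A rough, hypothetical sketch of that read-then-write step on a single HTable
instance (all names are illustrative; it assumes a simple long counter per
column and no concurrent writers - concurrent updates would need something
atomic such as Increment or checkAndPut):

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class AggregatingWriter {
      // Adds 'delta' to whatever amount is already stored under row, cf:qual.
      static void addToExisting(HTable table, byte[] row, byte[] cf, byte[] qual,
              long delta) throws IOException {
          Get get = new Get(row);
          get.addColumn(cf, qual);
          Result existing = table.get(get);
          long current = 0L;
          byte[] value = existing.getValue(cf, qual);
          if (value != null) {
              current = Bytes.toLong(value);
          }
          Put put = new Put(row);
          put.add(cf, qual, Bytes.toBytes(current + delta));
          table.put(put);
      }
  }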

From: andrew.purt...@gmail.com [andrew.purt...@gmail.com] on behalf of Andrew 
Purtell [apurt...@apache.org]
Sent: Thursday, November 29, 2012 12:23 AM
To: user@hbase.apache.org
Subject: Re: Aggregation while Bulk Loading into HBase

Have a look at https://issues.apache.org/jira/browse/HBASE-3936

Is this what you have in mind as something that could help you here?

On Wed, Nov 28, 2012 at 5:37 AM, Narayanan K knarayana...@gmail.com wrote:

 But in our case, we will need an instance of the HTable in the Mapper, do a
 GET operation and find the rowkey if it already exists and then add up the
 column amounts and then write back.


--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: Regarding rework in changing column family

2012-11-27 Thread Anoop Sam John

Also, what about the current data in the table? Right now all of it is under the
single CF. Modifying the table by adding a new CF will not move data to the new
family!
Remember that HBase only deals with CFs at the table schema level; there are no
qualifiers in the schema as such. A qualifier is specified only when data is
inserted/retrieved.

-Anoop-
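
In other words, every client Put names the family explicitly, so moving columns
to a new family means changing the code that writes them; a small illustrative
example with placeholder names:

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutFamilyExample {
      static Put buildPut(byte[] rowkey, byte[] value) {
          Put put = new Put(rowkey);
          // Before the schema change the column was written under cf1:
          //   put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col"), value);
          // After moving the column to the new family it must be written as cf2:
          put.add(Bytes.toBytes("cf2"), Bytes.toBytes("col"), value);
          return put;
      }
  }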

From: ramkrishna vasudevan [ramkrishna.s.vasude...@gmail.com]
Sent: Wednesday, November 28, 2012 11:41 AM
To: user@hbase.apache.org
Subject: Re: Regarding rework in changing column family

I am afraid it has to be changed...Because for your puts to go to the
specified Col family the col family name should appear in your Puts that is
created by the client.

Regards
Ram

On Wed, Nov 28, 2012 at 11:18 AM, Ramasubramanian Narayanan 
ramasubramanian.naraya...@gmail.com wrote:

 Thanks Ram!!!

 My question is like this...

 suppose I have create a table with 100 columns with single column family
 'cf1',

 now in production there are billions of records are there in that table and
 there are mulitiple programs that is feeding into this table (let us take
 some 50 programs)...

 In this scenario, if I change the column family like first 40 columns let
 it be in 'cf1', the last 60 columns I want to move to new column family
 'cf2', in this case, *do we need to change all 50 programs which are
 inserting into that table with 'cf1' for all columns?*
 *
 *
 regards,
 Rams

 On Wed, Nov 28, 2012 at 10:24 AM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:

  As far as i see altering the table with the new columnfamily should be
  easier.
  - disable the table
  - Issue modify table command with the new col family.
  - run a compaction.
  Now after this when you start doing your puts, they should be in
 alignment
  with the new schema defined for the table.  You may have to see one thing
  is how much your rate of puts is getting affected because now both of
 your
  CFs will start flushing whenever a memstore flush happens.
 
  Hope this helps.
 
  Regards
  Ram
 
  On Wed, Nov 28, 2012 at 10:10 AM, Ramasubramanian 
  ramasubramanian.naraya...@gmail.com wrote:
 
   Hi,
  
   I have created table in hbase with one column family and planned to
   release for development (in pentaho).
  
   Suppose later after doing the data profiling in production if I feel
 that
   out of 600 columns 200 is not going to get used frequently I am
 planning
  to
   group those into another column family.
  
   If I change the column family at later point of time I hope there will
 a
   lots of rework that has to be done (either if we use java or pentaho).
 Is
   my understanding is correct? Is there any other alternative available
 to
   overcome?
  
   Regards,
   Rams
 


RE: Hbase Region Split not working with JAVA API

2012-11-15 Thread Anoop Sam John
Please share the Put command you used in the shell, as well as the Java code for the put.

-Anoop-

From: msmdhussain [msmdhuss...@gmail.com]
Sent: Thursday, November 15, 2012 2:13 PM
To: user@hbase.apache.org
Subject: Hbase Region Split not working with JAVA API

Hi,

I created a table in hbase shell with pre split

create 'Test','C',{SPLITS=['\0x1','\0x2','\0x3','\0x4','\0x5']}

when i put a new value using put command in hbase shell the request is
passed to the different regions.

but, when i put the value in hbase table using java api the request is
passed to the first region.

can any one help me on this, i need to store the value in different region
on basis of the rowkey.

Thx,




--
View this message in context: 
http://apache-hbase.679495.n3.nabble.com/Hbase-Region-Split-not-working-with-JAVA-API-tp4034028.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: Column Family level bloom filters

2012-11-05 Thread Anoop Sam John
about column family level bloom filters
You mean column blooms, right? [Bloom on rowkey + cf + qualifier]
Are these filters in memory or are they just persisted as part of the HFile or 
both 
All blooms will get persisted while writing the HFile. When the HFile is opened 
for read the bloom info will be read and will be available in memory.
Only difference is what data is getting added to blooms.
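
For reference, a sketch (0.94-era API; table and family names are assumed) of
how the bloom type is chosen per column family when the table is created: a ROW
bloom is keyed on the rowkey only, a ROWCOL bloom on rowkey plus column.

HTableDescriptor htd = new HTableDescriptor("mytable");
HColumnDescriptor hcd = new HColumnDescriptor("cf");
hcd.setBloomFilterType(StoreFile.BloomType.ROWCOL);  // or ROW, or NONE
htd.addFamily(hcd);
admin.createTable(htd);                              // 'admin' is an existing HBaseAdmin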

-Anoop-

From: Varun Sharma [va...@pinterest.com]
Sent: Tuesday, November 06, 2012 7:11 AM
To: user@hbase.apache.org
Subject: Column Family level bloom filters

Hi,

I had a question about column family level bloom filters. Are these filters
in memory or are they just persisted as part of the HFile or both ? For
in-memory bloom filters, deletes could incur a high cost of rebuilding the
bloom filter.

Thanks
VArun

RE: Bulk Loading - LoadIncrementalHFiles

2012-11-01 Thread Anoop Sam John
Hi
 Yes, while doing the bulk load the table can be pre-split. The job will have
the same number of reducers as there are regions, one per region. Each HFile
that a reducer generates will respect the configured HFile max size.
You can see that while bulk loading there can also be splits on the HFiles if
needed (as per any new splits which may have happened on the regions).
Yes, in case the table is not pre-split, it will lead to splits later...

Better way would be to do the pre-split, I would say.
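
For reference, a minimal sketch of the two-step flow being discussed (paths,
table name and the mapper class are placeholders; imports omitted as in the
other snippets). configureIncrementalLoad() wires up one reducer per region of
the pre-split table, and doBulkLoad() then moves the generated HFiles under the
corresponding regions.

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");              // assumed pre-split table

Job job = new Job(conf, "hfile-prepare");
job.setJarByClass(MyKeyValueMapper.class);               // hypothetical mapper emitting
job.setMapperClass(MyKeyValueMapper.class);              // ImmutableBytesWritable / KeyValue pairs
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/hfile-out"));
HFileOutputFormat.configureIncrementalLoad(job, table);  // reducer, partitioner, output format
job.waitForCompletion(true);

LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/hfile-out"), table);        // splits HFiles again if region
                                                         // boundaries changed meanwhile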

-Anoop-

From: Amit Sela [am...@infolinks.com]
Sent: Thursday, November 01, 2012 10:33 PM
To: user@hbase.apache.org
Subject: Bulk Loading - LoadIncrementalHFiles

Hi everyone,

I'm using MR to bulk load into HBase by
using HFileOutputFormat.configureIncrementalLoad and after the job is
complete I use loadIncrementalHFiles.doBulkLoad

From what I see, the MR outputs a file for each CF written and to my
understanding these files are loaded as store files into a region.

What I don't understand is *how many regions will open* ? and *how is that
determined *?
If I have 3 CF's and a lot of data to load, does that mean 3 large store
files will load into 1 region (more ?) and this region will split on major
compaction ?

Can I pre-create regions and tell the bulk load to split the data between
them during the load ?

In general, if someone could elaborate about LoadIncrementalHFiles it would
save me a lot of time diving into it.


Another question I is about running over values, is it possible to load an
updated value ? or generally updating columns and values for an existing
key ?
I'd think that there's no problem but when I try to run the same bulk load
twice (MR and then load) with the same data, the second time fails.
Right after mapreduce.LoadIncrementalHFiles: Trying to load hfile=
I get: ERROR mapreduce.LoadIncrementalHFiles: Unexpected execution
exception during splitting...


Thanks!

RE: Filters for hbase scans require reboot.

2012-11-01 Thread Anoop Sam John

Yes Jonathan as of now we need a reboot..  Take a look at HBASE-1936. This is 
not completed. You can give your thoughts there and have a look at the 
patch/discussion...

-Anoop-

From: Jonathan Bishop [jbishop@gmail.com]
Sent: Friday, November 02, 2012 2:52 AM
To: user@hbase.apache.org
Subject: Filters for hbase scans require reboot.

Hi,

I am developing a filter to be used in a scan for hbase, and I find that I
need to...

1) make sure HBASE_CLASSPATH points to a jar or bin with my filter
2) reboot hbase (stop-hbase.sh, start-hbase.sh)

Otherwise, it seems hbase does not pick up my changes to my filter.

Is there an easier way to do this?

Thanks,

Jon

RE: Best technique for doing lookup with Secondary Index

2012-10-25 Thread Anoop Sam John
Hi Anil,
  Some confusion after seeing your reply.
You use bulk loading?  You created your own mapper?  You call HTable#put() from
the mappers?

I think there is confusion in another thread also..  I was referring to the
HFileOutputReducer.. There is a TableOutputFormat also... With TableOutputFormat
it will try to put to the HTable...  There, write to WAL is applicable...


[HFileOutputReducer] : As we discussed in another thread, in case of bulk
loading the approach is that the MR job creates KVs and writes them to files,
and each file is written as an HFile. Yes, this will contain all meta
information, trailer etc... Finally only the HBase cluster needs to be
contacted, just to load these HFile(s) into the cluster under the corresponding
regions.  This will be the fastest way to bulk load huge data...


-Anoop-

From: anil gupta [anilgupt...@gmail.com]
Sent: Friday, October 26, 2012 3:40 AM
To: user@hbase.apache.org
Subject: Re: Best technique for doing lookup with Secondary Index

Anoop:  In prePut hook u call HTable#put()?
Anil: Yes i call HTable#put() in prePut. Is there better way of doing it?

Anoop: Why use the network calls from server side here then?
Anil: I thought this is a cleaner approach since i am using BulkLoader. I
decided not to run two jobs since i am generating a UniqueIdentifier at
runtime in bulkloader.

Anoop: can not handle it from client alone?
Anil: I cannot handle it from client since i am using BulkLoader. Is it a
good idea to create Htable instance on B and do put in my mapper? I might
try this idea.

Anoop: You can have a look at Lily project.
Anil: It's little late for us to evaluate Lily now and at present we dont
need complex secondary index since our data is immutable.

Ram: what is rowkey B here?
Anil: Suppose I am storing customer events in table A. I have two
requirements for data query:
1. Query customer events on the basis of customer_Id and event_ID.
2. Query customer events on the basis of event_timestamp and customer_ID.

70% of querying is done by query#1, so I will create
customer_Id + event_ID as the row key of table A.
Now, in order to support fast results for query#2, I need to create a
secondary index on A. I store that secondary index in B; the rowkey of B is
event_timestamp + customer_ID. Every row stores the corresponding rowkey
of A.

Ram:How is the startRow determined for every query?
Anil: Its determined by a very simple application logic.

Thanks,
Anil Gupta

On Wed, Oct 24, 2012 at 10:16 PM, Ramkrishna.S.Vasudevan 
ramkrishna.vasude...@huawei.com wrote:

 Just out of curiosity,
  The secondary index is stored in table B as rowkey B --
  family:rowkey
  A
 what is rowkey B here?
  1. Scan the secondary table by using prefix filter and startRow.
 How is the startRow determined for every query ?

 Regards
 Ram

  -Original Message-
  From: Anoop Sam John [mailto:anoo...@huawei.com]
  Sent: Thursday, October 25, 2012 10:15 AM
  To: user@hbase.apache.org
  Subject: RE: Best technique for doing lookup with Secondary Index
 
  I build the secondary table B using a prePut RegionObserver.
 
  Anil,
 In prePut hook u call HTable#put()?  Why use the network calls
  from server side here then? can not handle it from client alone? You
  can have a look at Lily project.   Thoughts after seeing ur idea on put
  and scan..
 
  -Anoop-
  
  From: anil gupta [anilgupt...@gmail.com]
  Sent: Thursday, October 25, 2012 3:10 AM
  To: user@hbase.apache.org
  Subject: Best technique for doing lookup with Secondary Index
 
  Hi All,
 
  I am using HBase 0.92.1. I have created a secondary index on table A.
  Table A stores immutable data. I build the secondary table B using a
  prePut RegionObserver.
 
  The secondary index is stored in table B as rowkey B --
  family:rowkey
  A  . rowkey A is the column qualifier. Every row in B will only on
  have one column and the name of that column is the rowkey of A. So the
  value is blank. As per my understanding, accessing column qualifier is
  faster than accessing value. Please correct me if i am wrong.
 
 
  HBase Querying approach:
  1. Scan the secondary table by using prefix filter and startRow.
  2. Do a batch get on primary table by using HTable.get(ListGet)
  method.
 
  The above approach for retrieval works fine but i was wondering it
  there is
  a better approach. I was planning to try out doing the retrieval using
  coprocessors.
  Have anyone tried using coprocessors? I would appreciate if others can
  share their experience with secondary index for HBase queries.
 
  --
Thanks & Regards,
Anil Gupta




--
Thanks & Regards,
Anil Gupta

RE: Hbase import Tsv performance (slow import)

2012-10-25 Thread Anoop Sam John
As per Anoop and Ram, WAL is not used with bulk loading so turning off WAL
won't have any impact on performance.

This is true if HFileOutputFormat is being used..  There is a TableOutputFormat
which can also be used as the OutputFormat for the MR job. There, write to WAL
is applicable.
That one, instead of writing to HFiles and uploading them in one shot, puts
data into the HTable by calling the put() method...
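
A short sketch of that difference (0.94-era API; the table name and the Put are
placeholders): the bulk-load path never touches the WAL, while the
TableOutputFormat path issues normal puts, where the WAL can be skipped per Put
at the cost of durability.

// Bulk-load path: reducers write HFiles directly, the WAL is never involved.
HFileOutputFormat.configureIncrementalLoad(job, table);

// Online path: TableOutputFormat calls HTable.put() for every record.
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
// Inside the mapper, skipping the WAL per Put trades durability for speed:
put.setWriteToWAL(false);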

-Anoop-

From: anil gupta [anilgupt...@gmail.com]
Sent: Friday, October 26, 2012 2:05 AM
To: user@hbase.apache.org
Subject: Re: Hbase import Tsv performance (slow import)

@Jonathan,

As per Anoop and Ram, WAL is not used with bulk loading so turning off WAL
wont have any impact on performance.

On Thu, Oct 25, 2012 at 1:33 PM, anil gupta anilgupt...@gmail.com wrote:

 Hi Nicolas,

 As per my experience you wont get good performance if you run 3 Map task
 simultaneously on one Hard Drive. That seems like a lot of I/O on one disk.

 HBase performs well when you have at least 5 nodes in cluster. So, running
 HBase on 3 nodes is not something you would do in prod.

 Thanks,
 Anil

 On Thu, Oct 25, 2012 at 8:57 AM, Jonathan Bishop jbishop@gmail.comwrote:

 Nicolas,

 I just went through the same exercise. There are many ways to get this to
 go faster, but eventually I decided that bulk loading is the best solution
 as run times scaled with the number machines in my cluster when I used
 that
 approach.

 One thing you can try is to turn off hbase's write ahead log (WAL). But be
 aware that regionserver failure will cause data loss if you do this.

 Jon

 On Tue, Oct 23, 2012 at 8:48 AM, Nick maillard 
 nicolas.maill...@fifty-five.com wrote:

  Hi everyone
 
  I'm starting with hbase and testing for our needs. I have set up a
 hadoop
  cluster of Three machines and A Hbase cluster atop on the same three
  machines,
  one master two slaves.
 
  I am testing the Import of a 5GB csv file with the importTsv tool. I
  import the
  file in the HDFS and use the importTsv tool to import in Hbase.
 
  Right now it takes a little over an hour to complete. It creates around
 2
  million entries in one table with a single family.
  If I use bulk uploading it goes down to 20 minutes.
 
  My hadoop has 21 map tasks but they all seem to be taking a very long
 time
  to
  finish many tasks end up in time out.
 
  I am wondering what I have missed in my configuration. I have followed
 the
  different prerequisites in the documentations but I am really unsure as
 to
  what
  is causing this slow down. If I were to apply the wordcount example to
 the
  same
  file it takes only minutes to complete so I am guessing the issue lies
 in
  my
  Hbase configuration.
 
  Any help or pointers would by appreciated
 
 




 --
 Thanks & Regards,
 Anil Gupta




--
Thanks & Regards,
Anil Gupta

RE: problem with fliter in scan

2012-10-25 Thread Anoop Sam John

Use SingleColumnValueFilter#setFilterIfMissing(true).
Regarding s.setBatch(10): how many total columns are in the schema? When using
SingleColumnValueFilter, setBatch() might not always work out.. FYI
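
Concretely, applying that to the filters in the code below would look like this
sketch (same family, qualifier and values):

SingleColumnValueFilter minFilter = new SingleColumnValueFilter(
    Bytes.toBytes("data"), Bytes.toBytes("speed"),
    CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("0"));
// Drop rows that do not carry the "speed" qualifier at all:
minFilter.setFilterIfMissing(true);
SingleColumnValueFilter maxFilter = new SingleColumnValueFilter(
    Bytes.toBytes("data"), Bytes.toBytes("speed"),
    CompareOp.LESS_OR_EQUAL, Bytes.toBytes("20121016124537"));
maxFilter.setFilterIfMissing(true);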


-Anoop-

From: jian fan [xiaofanhb...@gmail.com]
Sent: Friday, October 26, 2012 7:24 AM
To: user@hbase.apache.org
Subject: problem with fliter in scan

HI:
   Guys, I have a program to filter the data by scan, the code is as
follows:

String familyName = "data";
String qualifierName = "speed";
String minValue = "0";
String maxValue = "20121016124537";
HTablePool pool = new HTablePool(cfg, 1000);
HTable table = (HTable) pool.getTable(tableName);
List<Filter> filters = new ArrayList<Filter>();
SingleColumnValueFilter minFilter = new
SingleColumnValueFilter(familyName.getBytes(), qualifierName.getBytes(),
CompareOp.GREATER_OR_EQUAL, minValue.getBytes());
SingleColumnValueFilter maxFilter = new
SingleColumnValueFilter(familyName.getBytes(), qualifierName.getBytes(),
CompareOp.LESS_OR_EQUAL, maxValue.getBytes());

filters.add(maxFilter);
filters.add(minFilter);
Scan s = new Scan();
s.setCaching(1);
s.setBatch(10);
FilterList fl = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);
s.setFilter(fl);
ResultScanner scanner = table.getScanner(s);
for (Result r : scanner) {
    KeyValue[] kv = r.raw();

    for (int i = 0; i < kv.length; i++) {
        System.out.println("RowKey:" + new String(kv[i].getRow()) + " ");
        System.out.print(new String(kv[i].getFamily()) + ":");
        System.out.println(new String(kv[i].getQualifier()) + " ");
        System.out.println("value:" + new String(kv[i].getValue()));
    }
}


The result is :

RowKey:020028
data:location
value:CA
RowKey:020028
data:speed
value:20121016124537

RowKey:2068098
data:location
CA


It seems that KVs without the qualifier speed are also included in the search
result. How do I solve the problem?

Thanks

Jian Fan

RE: Best technique for doing lookup with Secondary Index

2012-10-25 Thread Anoop Sam John
Anil
Have a look at MultiTableOutputFormat (I am referring to the 0.94 code base;
not sure whether it is available in older versions).
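
A minimal sketch (table names and the Puts are placeholders) of how a single
mapper can feed both the main table and the index table with it:

// In the driver:
job.setOutputFormatClass(MultiTableOutputFormat.class);

// Inside the mapper: the output key names the target table, the value is the Put.
ImmutableBytesWritable mainTable  = new ImmutableBytesWritable(Bytes.toBytes("A"));
ImmutableBytesWritable indexTable = new ImmutableBytesWritable(Bytes.toBytes("B"));
context.write(mainTable, mainPut);    // mainPut / indexPut are hypothetical Puts
context.write(indexTable, indexPut);  // built from the input record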

-Anoop-

From: Ramkrishna.S.Vasudevan [ramkrishna.vasude...@huawei.com]
Sent: Friday, October 26, 2012 9:50 AM
To: user@hbase.apache.org
Subject: RE: Best technique for doing lookup with Secondary Index

 Is it a
 good idea to create Htable instance on B and do put in my mapper? I
 might
 try this idea.
Yes you can do this..  May be the same mapper you can do a put for table
B.  This was how we have tried loading data to another table by using the
main table A
Puts.

Now your main question is lookups right
Now there are some more hooks in the scan flow called pre/postScannerOpen,
pre/postScannerNext.
May be you can try using them to do a look up on the secondary table and
then use those values and pass it to the main table next().
But this may involve more RPC calls as your regions of A and B may be in
different RS.

If something is wrong in my understanding of what you said, kindly spare me.
:)

Regards
Ram


 -Original Message-
 From: anil gupta [mailto:anilgupt...@gmail.com]
 Sent: Friday, October 26, 2012 3:40 AM
 To: user@hbase.apache.org
 Subject: Re: Best technique for doing lookup with Secondary Index

 Anoop:  In prePut hook u call HTable#put()?
 Anil: Yes i call HTable#put() in prePut. Is there better way of doing
 it?

 Anoop: Why use the network calls from server side here then?
 Anil: I thought this is a cleaner approach since i am using BulkLoader.
 I
 decided not to run two jobs since i am generating a UniqueIdentifier at
 runtime in bulkloader.

 Anoop: can not handle it from client alone?
 Anil: I cannot handle it from client since i am using BulkLoader. Is it
 a
 good idea to create Htable instance on B and do put in my mapper? I
 might
 try this idea.

 Anoop: You can have a look at Lily project.
 Anil: It's little late for us to evaluate Lily now and at present we
 dont
 need complex secondary index since our data is immutable.

 Ram: what is rowkey B here?
 Anil: Suppose i am storing customer events in table A. I have two
 requirement for data query:
 1. Query customer events on basis of customer_Id and event_ID.
 2. Query customer events on basis of event_timestamp and customer_ID.

 70% of querying is done by query#1, so i will create
 customer_Idevent_ID as row key of Table A.
 Now, in order to support fast results for query#2, i need to create a
 secondary index on A. I store that secondary index in B, rowkey of B is
 event_timestampcustomer_ID  .Every row stores the corresponding
 rowkey
 of A.

 Ram:How is the startRow determined for every query?
 Anil: Its determined by a very simple application logic.

 Thanks,
 Anil Gupta

 On Wed, Oct 24, 2012 at 10:16 PM, Ramkrishna.S.Vasudevan 
 ramkrishna.vasude...@huawei.com wrote:

  Just out of curiosity,
   The secondary index is stored in table B as rowkey B --
   family:rowkey
   A
  what is rowkey B here?
   1. Scan the secondary table by using prefix filter and startRow.
  How is the startRow determined for every query ?
 
  Regards
  Ram
 
   -Original Message-
   From: Anoop Sam John [mailto:anoo...@huawei.com]
   Sent: Thursday, October 25, 2012 10:15 AM
   To: user@hbase.apache.org
   Subject: RE: Best technique for doing lookup with Secondary Index
  
   I build the secondary table B using a prePut RegionObserver.
  
   Anil,
  In prePut hook u call HTable#put()?  Why use the network
 calls
   from server side here then? can not handle it from client alone?
 You
   can have a look at Lily project.   Thoughts after seeing ur idea on
 put
   and scan..
  
   -Anoop-
   
   From: anil gupta [anilgupt...@gmail.com]
   Sent: Thursday, October 25, 2012 3:10 AM
   To: user@hbase.apache.org
   Subject: Best technique for doing lookup with Secondary Index
  
   Hi All,
  
   I am using HBase 0.92.1. I have created a secondary index on table
 A.
   Table A stores immutable data. I build the secondary table B
 using a
   prePut RegionObserver.
  
   The secondary index is stored in table B as rowkey B --
   family:rowkey
   A  . rowkey A is the column qualifier. Every row in B will
 only on
   have one column and the name of that column is the rowkey of A. So
 the
   value is blank. As per my understanding, accessing column qualifier
 is
   faster than accessing value. Please correct me if i am wrong.
  
  
   HBase Querying approach:
   1. Scan the secondary table by using prefix filter and startRow.
   2. Do a batch get on primary table by using HTable.get(ListGet)
   method.
  
   The above approach for retrieval works fine but i was wondering it
   there is
   a better approach. I was planning to try out doing the retrieval
 using
   coprocessors.
   Have anyone tried using coprocessors? I would appreciate if others
 can
   share their experience with secondary index for HBase queries

RE: Best technique for doing lookup with Secondary Index

2012-10-24 Thread Anoop Sam John
I build the secondary table B using a prePut RegionObserver.

Anil,
   In prePut hook u call HTable#put()?  Why use the network calls from 
server side here then? can not handle it from client alone? You can have a look 
at Lily project.   Thoughts after seeing ur idea on put and scan..

-Anoop-

From: anil gupta [anilgupt...@gmail.com]
Sent: Thursday, October 25, 2012 3:10 AM
To: user@hbase.apache.org
Subject: Best technique for doing lookup with Secondary Index

Hi All,

I am using HBase 0.92.1. I have created a secondary index on table A.
Table A stores immutable data. I build the secondary table B using a
prePut RegionObserver.

The secondary index is stored in table B as rowkey B --> family:rowkey A.
rowkey A is the column qualifier. Every row in B will only
have one column, and the name of that column is the rowkey of A. So the
value is blank. As per my understanding, accessing column qualifier is
faster than accessing value. Please correct me if i am wrong.


HBase Querying approach:
1. Scan the secondary table by using prefix filter and startRow.
2. Do a batch get on the primary table by using the HTable.get(List<Get>) method.

The above approach for retrieval works fine but i was wondering it there is
a better approach. I was planning to try out doing the retrieval using
coprocessors.
Have anyone tried using coprocessors? I would appreciate if others can
share their experience with secondary index for HBase queries.
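
For reference, a minimal sketch of the two-step lookup described above
(configuration, startRow and prefix construction are omitted):

HTable indexTable = new HTable(conf, "B");
HTable dataTable  = new HTable(conf, "A");

Scan scan = new Scan(startRow);                 // startRow from the application logic
scan.setFilter(new PrefixFilter(prefix));       // prefix covers the event_timestamp part
List<Get> gets = new ArrayList<Get>();
ResultScanner scanner = indexTable.getScanner(scan);
for (Result r : scanner) {
    for (KeyValue kv : r.raw()) {
        gets.add(new Get(kv.getQualifier()));   // the qualifier holds the rowkey of A
    }
}
scanner.close();
Result[] rows = dataTable.get(gets);            // batch get on the primary table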

--
Thanks & Regards,
Anil Gupta

RE: repetita iuvant?

2012-10-24 Thread Anoop Sam John
Hi
Can you tell more details? How much data is your scan going to retrieve? What
is the time taken in each attempt?
Can you observe the cache hit ratio? What is the memory available in the RS?
Also the cluster details and regions...

-Anoop-

From: surfer [sur...@crs4.it]
Sent: Thursday, October 25, 2012 11:00 AM
To: user@hbase.apache.org
Subject: repetita iuvant?

Hi
I tried to run the same scan twice on my table data. I expected the time to
improve but that was not the case.
What am I doing wrong? I set scan.setCacheBlocks(true); before the
first scanning job to put, if not all, at least some blocks in memory.

thank you
surfer

RE: A question of storage structure for memstore?

2012-10-22 Thread Anoop Sam John
To be precise, there will be one memstore per family per region..
If a table has 2 CFs and there are 10 regions for that table, then there are
2*10 = 20 memstores in total..

-Anoop-

From: Kevin O'dell [kevin.od...@cloudera.com]
Sent: Monday, October 22, 2012 5:55 PM
To: user@hbase.apache.org
Subject: Re: A question of storage structure for memstore?

Yes, there will be two memstores if you have two CFs.
On Oct 22, 2012 7:25 AM, yonghu yongyong...@gmail.com wrote:

 Dear All,

 In the description it mentions that a Store file (per column family)
 is composed of one memstore and a set of HFiles. Does it imply that
 for every column family there is a corresponding memstore? For
 example. if a table has 2 column families, there will be 2 memstores
 in memory?

 regards!

 Yong


RE: HRegionInfo returns empty values.

2012-10-19 Thread Anoop Sam John

Actually, how many regions are in your table?
Only one region? In that case it will have an empty start key and end key..
So in your case, what it prints looks correct.
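
To make that easier to see, a small sketch (reusing the table handle from the
code below) that prints the actual key bytes instead of their lengths; with a
single region both print as empty strings:

for (HRegionInfo info : table.getRegionLocations().keySet()) {
    System.out.println("start : " + Bytes.toStringBinary(info.getStartKey()));
    System.out.println("end   : " + Bytes.toStringBinary(info.getEndKey()));
}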


-Anoop-

From: Henry JunYoung KIM [henry.jy...@gmail.com]
Sent: Friday, October 19, 2012 2:13 PM
To: user@hbase.apache.org
Subject: HRegionInfo returns empty values.

Hi, hbase-users.

To get a start-key and end-key from each region, I implemented simple code like 
this.



HTable table = new HTable(admin.getConf(), admin.getTableName());
NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();
for (Map.Entry<HRegionInfo, ServerName> entry : locations.entrySet()) {
    HRegionInfo info = entry.getKey();

    System.out.println("server : " + entry.getValue().getHostname());
    System.out.println("start : " + info.getStartKey().length);
    System.out.println("end : " + info.getEndKey().length);
}


but, this code returns

server : one-of-servers-name
start : 0
end : 0

The start-key and end-key are empty. Nothing!
The data size is about 10,000 rows,
from integer 0 to integer 10,000.

how could I get correct range of a region?

thanks for your concerns.

RE: Coprocessor end point vs MapReduce?

2012-10-18 Thread Anoop Sam John
A CP and endpoints operate at the region level.. Any operation within one
region we can perform using this..  I have seen in the below use case that
along with the delete there was a need for inserting data into some other table
also.. Also this was kind of a periodic action.. I really doubt how the
endpoints alone can be used here.. I also tend towards the MR..

  The idea behind the bulk delete CP is simple.  We have a use case of deleting
a bulk of rows, and this needs to be an online delete. I have also seen many
people in the mailing list asking questions regarding that... In all cases
people were using scans, getting the rowkeys to the client side and then doing
the deletes..  Yes, most of the time the complaint was the slowness..  One bulk
delete performance improvement was done in HBASE-6284..  Still, we thought we
can do the whole operation (scan + delete) on the server side, and we can make
use of the endpoints here.. This will be much faster and can be used for online
bulk deletes..
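
For contrast, this is roughly the client-side pattern people were using (a
sketch; table name and time range are placeholders), where every matching
rowkey has to travel to the client and back before the deletes are issued:

HTable table = new HTable(conf, "mytable");
Scan scan = new Scan();
scan.setTimeRange(startTs, endTs);              // rows written between two timestamps
List<Delete> deletes = new ArrayList<Delete>();
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
    deletes.add(new Delete(r.getRow()));
}
scanner.close();
table.delete(deletes);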

-Anoop-


From: Michael Segel [michael_se...@hotmail.com]
Sent: Thursday, October 18, 2012 11:31 PM
To: user@hbase.apache.org
Subject: Re: Coprocessor end point vs MapReduce?

Doug,

One thing that concerns me is that a lot of folks are gravitating to 
Coprocessors and may be using them for the wrong thing.
Has anyone done any sort of research as to some of the limitations and negative 
impacts on using coprocessors?

While I haven't really toyed with the idea of bulk deletes, periodic deletes are
probably not a good use of coprocessors; however, using them to synchronize
tables would be a valid use case.

Thx

-Mike

On Oct 18, 2012, at 7:36 AM, Doug Meil doug.m...@explorysmedical.com wrote:


 To echo what Mike said about KISS, would you use triggers for a large
 time-sensitive batch job in an RDBMS?  It's possible, but probably not.
 Then you might want to think twice about using co-processors for such a
 purpose with HBase.





 On 10/17/12 9:50 PM, Michael Segel michael_se...@hotmail.com wrote:

 Run your weekly job in a low priority fair scheduler/capacity scheduler
 queue.

 Maybe its just me, but I look at Coprocessors as a similar structure to
 RDBMS triggers and stored procedures.
 You need to restrain and use them sparingly otherwise you end up creating
 performance issues.

 Just IMHO.

 -Mike

 On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org wrote:

 I don't have any concern about the time it's taking. It's more about
 the load it's putting on the cluster. I have other jobs that I need to
 run (secondary index, data processing, etc.). So the more time this
 new job is taking, the less CPU the others will have.

 I tried the M/R and I really liked the way it's done. So my only
 concern will really be the performance of the delete part.

 That's why I'm wondering what's the best practice to move a row to
 another table.

 2012/10/17, Michael Segel michael_se...@hotmail.com:
 If you're going to be running this weekly, I would suggest that you
 stick
 with the M/R job.

 Is there any reason why you need to be worried about the time it takes
 to do
 the deletes?


 On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org
 wrote:

 Hi Mike,

 I'm expecting to run the job weekly. I initially thought about using
 end points because I found HBASE-6942 which was a good example for my
 needs.

 I'm fine with the Put part for the Map/Reduce, but I'm not sure about
 the delete. That's why I look at coprocessors. Then I figure that I
 also can do the Put on the coprocessor side.

 On a M/R, can I delete the row I'm dealing with based on some criteria
 like timestamp? If I do that, I will not do bulk deletes, but I will
 delete the rows one by one, right? Which might be very slow.

 If in the future I want to run the job daily, might that be an issue?

 Or should I go with the initial idea of doing the Put with the M/R job
 and the delete with HBASE-6942?

 Thanks,

 JM


 2012/10/17, Michael Segel michael_se...@hotmail.com:
 Hi,

 I'm a firm believer in KISS (Keep It Simple, Stupid)

 The Map/Reduce (map job only) is the simplest and least prone to
 failure.

 Not sure why you would want to do this using coprocessors.

 How often are you running this job? It sounds like its going to be
 sporadic.

 -Mike

 On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org
 wrote:

 Hi,

 Can someone please help me to understand the pros and cons between
 those 2 options for the following usecase?

 I need to transfer all the rows between 2 timestamps to another
 table.

 My first idea was to run a MapReduce to map the rows and store them
 on
 another table, and then delete them using an end point coprocessor.
 But the more I look into it, the more I think the MapReduce is not a
 good idea and I should use a coprocessor instead.

 BUT... The MapReduce framework guarantee me that it will run against
 all the regions. I tried to stop a regionserver 

RE: Unable to add co-processor to table through HBase api

2012-10-18 Thread Anoop Sam John

hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).addCoprocessor(className,
    new Path("hdfs://hbasecluster/tmp/hbase_cdh4.jar"),
    Coprocessor.PRIORITY_USER, map);

Anil,

Don't you have to modify the table by calling the Admin API? I am not seeing
that code here...

-Anoop-


From: anil gupta [anilgupt...@gmail.com]
Sent: Friday, October 19, 2012 2:46 AM
To: user@hbase.apache.org
Subject: Re: Unable to add co-processor to table through HBase api

Hi Folks,

Still, I am unable to add the co-processors through the HBase client API. This
time I tried loading the coprocessor by providing the jar path along with
parameters. But it failed.
I was able to add the same coprocessor to the table through the HBase shell.
I also don't see any logs regarding adding coprocessors in the regionservers
when I try to add the co-processor through the API. I strongly feel that the
HBase client API for adding a coprocessor is broken. Please let me know if
the code below seems to be problematic.

Here is the code i used to add the coprocessor through HBase api:
private static void modifyTable() throws IOException
{
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin hAdmin = new HBaseAdmin(conf);
    String tableName = "txn";
    hAdmin.disableTable(tableName);
    if (!hAdmin.isTableEnabled(tableName))
    {
        // using err so that it's easy to read this on the eclipse console
        System.out.println("Trying to add coproc to table");
        HashMap<String, String> map = new HashMap<String, String>();
        map.put("arg1", "batchdate");
        String className =
            "com.intuit.ihub.hbase.poc.coprocessor.observer.IhubTxnRegionObserver";

        hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).addCoprocessor(className,
            new Path("hdfs://hbasecluster/tmp/hbase_cdh4.jar"),
            Coprocessor.PRIORITY_USER, map);

        if (hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).hasCoprocessor(className))
        {
            System.err.println("YIPIE!!!");
        }
        hAdmin.enableTable(tableName);
    }
    hAdmin.close();
}

Thanks,
Anil Gupta

On Wed, Oct 17, 2012 at 9:27 PM, Ramkrishna.S.Vasudevan 
ramkrishna.vasude...@huawei.com wrote:

 Do let me know if you are stuck up.  May be I did not get your actual
 problem.

 All the best.

 Regards
 Ram

  -Original Message-
  From: anil gupta [mailto:anilgupt...@gmail.com]
  Sent: Wednesday, October 17, 2012 11:34 PM
  To: user@hbase.apache.org
  Subject: Re: Unable to add co-processor to table through HBase api
 
  Hi Ram,
 
  The table exists and I don't get any error while running the program(i
  would get an error if the table did not exist). I am running a
  distributed
  cluster.
 
  Tried following additional ways also:
 
 1. I tried loading the AggregationImplementation coproc.
 2. I also tried adding the coprocs while the table is enabled.
 
 
  Also had a look at the JUnit test cases and could not find any
  difference.
 
  I am going to try adding the coproc along with jar in Hdfs and see what
  happens.
 
  Thanks,
  Anil Gupta
 
  On Tue, Oct 16, 2012 at 11:44 PM, Ramkrishna.S.Vasudevan 
  ramkrishna.vasude...@huawei.com wrote:
 
   I tried out a sample test class.  It is working properly.  I just
  have a
   doubt whether you are doing the
   Htd.addCoprocessor() step before creating the table?  Try that way
  hope it
   should work.
  
   Regards
   Ram
  
-Original Message-
From: anil gupta [mailto:anilgupt...@gmail.com]
Sent: Wednesday, October 17, 2012 4:05 AM
To: user@hbase.apache.org
Subject: Unable to add co-processor to table through HBase api
   
Hi All,
   
I would like to add a RegionObserver to a HBase table through HBase
api. I
don't want to put this RegionObserver as a user or system co-
  processor
in
hbase-site.xml since this is specific to a table. So, option of
  using
hbase
properties is out. I have already copied the jar file in the
  classpath
of
region server and restarted the cluster.
   
Can any one point out the problem in following code for adding the
co-processor to the table:
private void modifyTable(String name) throws IOException
{
Configuration conf = HBaseConfiguration.create();
HBaseAdmin hAdmin = new HBaseAdmin(conf);
hAdmin.disableTable(txn_subset);
if(!hAdmin.isTableEnabled(txn_subset))
{
  System.err.println(Trying to add coproc to table); // using
  err
so
that it's easy to read this on eclipse console.
   
   
  hAdmin.getTableDescriptor(Bytes.toBytes(txn_subset)).addCoprocessor(
com.intuit.hbase.poc.coprocessor.observer.IhubTxnRegionObserver);
  if(
   
  hAdmin.getTableDescriptor(Bytes.toBytes(txn_subset)).hasCoprocessor(
com.intuit.hbase.poc.coprocessor.observer.IhubTxnRegionObserver)
)
  {
System.err.println(YIPIE!!!);
  }
  

RE: Unable to add co-processor to table through HBase api

2012-10-18 Thread Anoop Sam John

Anil
 Yes, the same. You got the HTD from the master into your client code and just
added the CP to that object. In order to reflect the change in the HBase
cluster you need to call the modifyTable API with your changed HTD. The master
will then change the table. When you enable the table back, the regions will
get opened on the RSs and will have the CP with them from then on..  :)  Hope
it is clear for you now..
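
In code, the missing step looks roughly like this sketch (reusing the names
from the code further down in this thread):

HTableDescriptor htd = hAdmin.getTableDescriptor(Bytes.toBytes(tableName));
htd.addCoprocessor(className,
    new Path("hdfs://hbasecluster/tmp/hbase_cdh4.jar"),
    Coprocessor.PRIORITY_USER, map);
hAdmin.disableTable(tableName);
hAdmin.modifyTable(Bytes.toBytes(tableName), htd);  // this call actually updates the table
hAdmin.enableTable(tableName);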

-Anoop-

From: anil gupta [anilgupt...@gmail.com]
Sent: Friday, October 19, 2012 11:01 AM
To: user@hbase.apache.org
Subject: Re: Unable to add co-processor to table through HBase api

Hi Guys,

Do you mean to say that i need to call the following method after the call
to addCoprocessor method:

public void modifyTable(byte[] tableName, HTableDescriptor htd) throws IOException

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#modifyTable%28byte[],%20org.apache.hadoop.hbase.HTableDescriptor%29

Thanks,
Anil Gupta

On Thu, Oct 18, 2012 at 10:23 PM, Ramkrishna.S.Vasudevan 
ramkrishna.vasude...@huawei.com wrote:

 I can attach the code that I tried.  Here as the HTD is getting modified we
 may need to call modifyTable().
 My testclass did try this while doing creation of table itself.

 I will attach shortly.

 Regards
 Ram

  -Original Message-
  From: anil gupta [mailto:anilgupt...@gmail.com]
  Sent: Friday, October 19, 2012 10:29 AM
  To: user@hbase.apache.org
  Subject: Re: Unable to add co-processor to table through HBase api
 
  Hi Anoop,
 
  Sorry, i am unable to understand what you mean by have to modify the
  table
  calling Admin API??. Am i missing some other calls in my code?
 
  Thanks,
  Anil Gupta
 
  On Thu, Oct 18, 2012 at 9:43 PM, Anoop Sam John anoo...@huawei.com
  wrote:
 
  
  
  
  hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).addCoprocessor(cla
  ssName,
 new Path(hdfs://hbasecluster/tmp/hbase_cdh4.jar),
   Coprocessor.PRIORITY_USER,map);
  
   Anil,
  
   Don't you have to modify the table calling Admin API??  !  Not
  seeing
   that code here...
  
   -Anoop-
  
   
   From: anil gupta [anilgupt...@gmail.com]
   Sent: Friday, October 19, 2012 2:46 AM
   To: user@hbase.apache.org
   Subject: Re: Unable to add co-processor to table through HBase api
  
   Hi Folks,
  
   Still, i am unable to add the co-processors through HBase client api.
  This
   time i tried loading the coprocessor by providing the jar path along
  with
   parameters. But, it failed.
   I was able to add the same coprocessor to the table through HBase
  shell.
   I also dont see any logs regarding adding coprocessors in
  regionservers
   when i try to add the co-processor through api.I strongly feel that
  HBase
   client api for adding coprocessor seems to be broken. Please let me
  know if
   the code below seems to be problematic.
  
   Here is the code i used to add the coprocessor through HBase api:
   private static void modifyTable() throws IOException
   {
   Configuration conf = HBaseConfiguration.create();
   HBaseAdmin hAdmin = new HBaseAdmin(conf);
   String tableName = txn;
   hAdmin.disableTable(tableName);
   if(!hAdmin.isTableEnabled(tableName))
   {
 System.out.println(Trying to add coproc to table); // using
  err so
   that it's easy to read this on eclipse console.
 HashMapString, String map = new HashMapString,String();
 map.put(arg1, batchdate);
 String className =
  
  com.intuit.ihub.hbase.poc.coprocessor.observer.IhubTxnRegionObserver;
  
  
  
  hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).addCoprocessor(clas
  sName,
 new Path(hdfs://hbasecluster/tmp/hbase_cdh4.jar),
   Coprocessor.PRIORITY_USER,map);
  
 if(
  
  
  hAdmin.getTableDescriptor(Bytes.toBytes(tableName)).hasCoprocessor(clas
  sName)
 )
 {
   System.err.println(YIPIE!!!);
 }
 hAdmin.enableTable(tableName);
  
   }
   hAdmin.close();
  }
  
   Thanks,
   Anil Gupta
  
   On Wed, Oct 17, 2012 at 9:27 PM, Ramkrishna.S.Vasudevan 
   ramkrishna.vasude...@huawei.com wrote:
  
Do let me know if you are stuck up.  May be I did not get your
  actual
problem.
   
All the best.
   
Regards
Ram
   
 -Original Message-
 From: anil gupta [mailto:anilgupt...@gmail.com]
 Sent: Wednesday, October 17, 2012 11:34 PM
 To: user@hbase.apache.org
 Subject: Re: Unable to add co-processor to table through HBase
  api

 Hi Ram,

 The table exists and I don't get any error while running the
  program(i
 would get an error if the table did not exist). I am running

RE: Where is code in hbase that physically delete a record?

2012-10-17 Thread Anoop Sam John
You can see the code in ScanQueryMatcher.
Basically, during a major compaction a scan happens over all the files... As
per the delete markers, the deleted KVs won't come out of the scanner and thus
get eliminated.  Also, in a major compaction the delete markers themselves will
get deleted (still, there are some more complicated conditions for these, like
keep deleted cells and time to purge deletes etc).
I would say check the code in that class...

-Anoop-

From: yun peng [pengyunm...@gmail.com]
Sent: Wednesday, October 17, 2012 5:54 PM
To: user@hbase.apache.org
Subject: Where is code in hbase that physically delete a record?

Hi, All,
I want to find internal code in hbase where physical deleting a record
occurs.

-some of my understanding.
Correct me if I am wrong. (It is largely based on my experience and even
speculation.) Logically deleting a KeyValue data in hbase is performed by
marking tombmarker (by Delete() per records) or setting TTL/max_version
(per Store). After these actions, however, the physical data are still
there, somewhere in the system. Physically deleting a record in hbase is
realised by *a scanner to discard a keyvalue data record* during the
major_compact.

-what I need
I want to extend hbase to associate some actions with physically deleting a
record. Does hbase provide such hook (or coprocessor API) to inject code
for each KV record that is skipped by hbase storescanner in major_compact.
If not, anyone knows where should I look into in hbase (-0.94.2) for such
code modification?

Thanks.
Yun
