Problems while exporting from Hbase to CSV file

2013-06-27 Thread Vimal Jain
Hi,
I am trying to export from HBase to a CSV file.
I am using the Scan class to scan all the data in the table.
But I am facing some problems while doing it.

1) My table has around 1.5 million rows and around 150 columns per row, so I
cannot use the default scan() constructor: it would scan the whole table in
one go, which results in an OutOfMemory error in the client process. I have
heard of using setCaching() and setBatch(), but I am not able to understand
how they will solve the OOM error.

I thought of providing startRow and stopRow in the Scan object, but I want to
scan the whole table, so how would that help?
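For reference, here is a minimal sketch of a chunked export using a 0.94-style client
API; the table name "mytable" and the output file are hypothetical. The point is that
setCaching() bounds how many rows are fetched per RPC and held in the client at once,
and setBatch() caps the number of columns per Result for very wide rows, so the whole
table is still scanned but never materialized in memory in one go:

import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ExportToCsv {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table name
        Scan scan = new Scan();
        scan.setCaching(500);   // rows fetched per RPC; only this many rows sit in client memory
        scan.setBatch(150);     // max columns per Result; a very wide row may span several Results
        ResultScanner scanner = table.getScanner(scan);
        PrintWriter out = new PrintWriter(new FileWriter("export.csv"));
        try {
            for (Result r : scanner) {            // streams results; never the whole table at once
                StringBuilder line = new StringBuilder(Bytes.toStringBinary(r.getRow()));
                for (KeyValue kv : r.raw()) {
                    line.append(',').append(Bytes.toStringBinary(kv.getValue()));
                }
                out.println(line);
            }
        } finally {
            out.close();
            scanner.close();
            table.close();
        }
    }
}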

2) Since HBase stores data for a row only when we explicitly provide it, and
there is no concept of a default value as in an RDBMS, I want each and every
column to appear in the CSV file I generate for every user. In case column
values are not present in HBase, I want to use default values for them (I have
a list of default values for each column). Is there any method in the Result
class, or any other class, to accomplish this?
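There is no method on Result that fills in defaults, so a common client-side approach
is to drive each CSV line off the known column list and substitute a default whenever
getValue() returns null. A minimal sketch (the family "cf", qualifiers and default
values below are made-up placeholders):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvRowBuilder {
    // Known schema: column qualifier -> default value (hypothetical names/defaults).
    static final Map<String, String> DEFAULTS = new LinkedHashMap<String, String>();
    static {
        DEFAULTS.put("name", "");
        DEFAULTS.put("age", "0");
        DEFAULTS.put("city", "unknown");
    }

    static String toCsvLine(Result r, byte[] family) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : DEFAULTS.entrySet()) {
            byte[] cell = r.getValue(family, Bytes.toBytes(e.getKey()));
            // HBase has no server-side default values, so substitute on the client.
            sb.append(cell == null ? e.getValue() : Bytes.toString(cell)).append(',');
        }
        sb.setLength(sb.length() - 1);  // drop the trailing comma
        return sb.toString();
    }
}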


Please help here.

-- 
Thanks and Regards,
Vimal Jain


Re: Problems while exporting from Hbase to CSV file

2013-06-27 Thread Azuryy Yu
Sorry, maybe Phoenix is not suitable for you.


On Thu, Jun 27, 2013 at 3:21 PM, Azuryy Yu azury...@gmail.com wrote:

 1) Scan.setCaching() specifies the number of rows for caching that will
 be passed to scanners.
 And what's your block cache size?

 But if the OOM is from the client, not the server side, then I don't think this is
 Scan-related; please check your client code.

 2) We cannot add default values from HBase, but you can add them on your
 client when iterating the Result.

 Also, you can use Phoenix; it is a good fit for your scenario.
 https://github.com/forcedotcom/phoenix



 On Thu, Jun 27, 2013 at 3:11 PM, Vimal Jain vkj...@gmail.com wrote:

 Hi,
 I am trying to export from hbase to a CSV file.
 I am using Scan class to scan all data  in the table.
 But i am facing some problems while doing it.

 1) My table has around 1.5 million rows  and around 150 columns for each
 row , so i can not use default scan() constructor as it will scan whole
 table in one go which results in OutOfMemory error in client process.I
 heard of using setCaching() and setBatch() but i am not able to understand
 how it will solve OOM error.

 I thought of providing startRow and stopRow in scan object but i want to
 scan whole table so how will this help ?

 2) As hbase stores data for a row only when we explicitly provide it and
 their is no concept of default value as found in RDBMS , i want to have
 each and evey column in the CSV file i generate for every user.In case
 column values are not there in hbase , i want to use default  values for
 them(I have list of default values for each column). Is there any method
 in
 Result class or any other class to accomplish this ?


 Please help here.

 --
 Thanks and Regards,
 Vimal Jain





flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
Hi All,

I wanted some help understanding what's going on with my current setup.
I updated my config to the following settings:

  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>107374182400</value>
  </property>

  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>

  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
  </property>

  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>50</value>
  </property>

  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>0</value>
  </property>

Prior to this, all the settings were default values. I wanted to increase
the write throughput on my system and also control when major compactions
happen. In addition to that, I wanted to make sure that my regions don't
split quickly.

After the change in settings, I am seeing a huge storm of memstore flushes
and minor compactions, some of which get promoted to major compactions. The
compaction queue is also way too high. For example, a few of the lines that I
see in the logs are as follows:

http://pastebin.com/Gv1S9GKX

The region server whose logs are pasted above keeps flushing and creating
those small files, and shows the following metrics:
memstoreSizeMB=657, compactionQueueSize=233, flushQueueSize=0,
usedHeapMB=3907, maxHeapMB=10231

I am unsure why it's causing such a high amount of flushing (< 100m files) even though
the flush size is set at 128m and there is no memory pressure.

Any thoughts ? Let me know if you need any more information, I also have
ganglia running and can provide more metrics if needed.

Thanks,
Viral


Re: flushing + compactions after config change

2013-06-27 Thread Anoop John
>the flush size is at 128m and there is no memory pressure
You mean there is enough memstore-reserved heap in the RS, so that there
won't be premature flushes because of global heap pressure? What is the RS
max mem, and how many regions and CFs in each? Can you check whether the
flushes are happening because of too many WAL files?

-Anoop-

On Thu, Jun 27, 2013 at 1:10 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 Hi All,

 I wanted some help on understanding what's going on with my current setup.
 I updated from config to the following settings:

   <property>
     <name>hbase.hregion.max.filesize</name>
     <value>107374182400</value>
   </property>

   <property>
     <name>hbase.hregion.memstore.block.multiplier</name>
     <value>4</value>
   </property>

   <property>
     <name>hbase.hregion.memstore.flush.size</name>
     <value>134217728</value>
   </property>

   <property>
     <name>hbase.hstore.blockingStoreFiles</name>
     <value>50</value>
   </property>

   <property>
     <name>hbase.hregion.majorcompaction</name>
     <value>0</value>
   </property>

 Prior to this, all the settings were default values. I wanted to increase
 the write throughput on my system and also control when major compactions
 happen. In addition to that, I wanted to make sure that my regions don't
 split quickly.

 After the change in settings, I am seeing a huge storm of memstore flushing
 and minor compactions some of which get promoted to major compaction. The
 compaction queue is also way too high. For example a few of the line that I
 see in the logs are as follows:

 http://pastebin.com/Gv1S9GKX

 The regionserver whose logs are pasted above keeps on flushing and creating
 those small files shows the follwoing metrics:
 memstoreSizeMB=657, compactionQueueSize=233, flushQueueSize=0,
 usedHeapMB=3907, maxHeapMB=10231

 I am unsure why it's causing such high amount of flush ( 100m) even though
 the flush size is at 128m and there is no memory pressure.

 Any thoughts ? Let me know if you need any more information, I also have
 ganglia running and can provide more metrics if needed.

 Thanks,
 Viral



Re: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
Thanks for the quick response Anoop.

The current memstore reservation (IIRC) would be 0.35 of the total heap, right?

The RS total heap is 10231MB; used is at 5000MB. The total number of regions is
217, with approx 150 regions having 2 families, ~60 having 1 family, and the
remaining having 3 families.

How do I check if the flushes are due to too many WAL files? Does it get
logged?

Thanks,
Viral

On Thu, Jun 27, 2013 at 12:51 AM, Anoop John anoop.hb...@gmail.com wrote:

 You mean there is enough memstore reserved heap in the RS, so that there
 wont be premature flushes because of global heap pressure?  What is the RS
 max mem and how many regions and CFs in each?  Can you check whether the
 flushes happening because of too many WAL files?

 -Anoop-



答复: flushing + compactions after config change

2013-06-27 Thread 谢良
If you reached the memstore global upper limit, you'll find "Blocking updates on" in
your log files (see MemStoreFlusher.reclaimMemStoreMemory);
if it's caused by too many log files, you'll find "Too many hlogs: logs=" (see
HLog.cleanOldLogs).
Hope it's helpful for you :)

Best,
Liang

From: Viral Bajaria [viral.baja...@gmail.com]
Sent: June 27, 2013 16:18
To: user@hbase.apache.org
Subject: Re: flushing + compactions after config change

Thanks for the quick response Anoop.

The current memstore reserved (IIRC) would be 0.35 of total heap right ?

The RS total heap is 10231MB, used is at 5000MB. Total number of regions is
217 and there are approx 150 regions with 2 families, ~60 with 1 family and
remaining with 3 families.

How to check if the flushes are due to too many WAL files ? Does it get
logged ?

Thanks,
Viral

On Thu, Jun 27, 2013 at 12:51 AM, Anoop John anoop.hb...@gmail.com wrote:

 You mean there is enough memstore reserved heap in the RS, so that there
 wont be premature flushes because of global heap pressure?  What is the RS
 max mem and how many regions and CFs in each?  Can you check whether the
 flushes happening because of too many WAL files?

 -Anoop-


Re: 答复: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
Thanks Liang!

Found the logs. I had gone overboard with my greps and missed the "Too
many hlogs" line for the regions that I was trying to debug.

A few sample log lines:

2013-06-27 07:42:49,602 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14
2013-06-27 08:10:29,996 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s):
0e940167482d42f1999b29a023c7c18a
2013-06-27 08:17:44,719 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
0e940167482d42f1999b29a023c7c18a, e380fd8a7174d34feb903baa97564e08
2013-06-27 08:23:45,357 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many hlogs: logs=33, maxlogs=32; forcing flush of 3 regions(s):
0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14,
e380fd8a7174d34feb903baa97564e08

Any pointers on what's the best practice for avoiding this scenario?

Thanks,
Viral

On Thu, Jun 27, 2013 at 1:21 AM, 谢良 xieli...@xiaomi.com wrote:

 If  reached memstore global up-limit,  you'll find Blocking updates on
 in your files(see MemStoreFlusher.reclaimMemStoreMemory);
 If  it's caused by too many log files, you'll find Too many hlogs:
 logs=(see HLog.cleanOldLogs)
 Hope it's helpful for you:)

 Best,
 Liang



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Anoop John
The config hbase.regionserver.maxlogs specifies the maximum number of log files and
defaults to 32. But remember that if there are many log files to replay, then
the MTTR will increase (in the RS-down case).

-Anoop-
On Thu, Jun 27, 2013 at 1:59 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 Thanks Liang!

 Found the logs. I had gone overboard with my grep's and missed the Too
 many hlogs line for the regions that I was trying to debug.

 A few sample log lines:

 2013-06-27 07:42:49,602 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
 Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
 0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14
 2013-06-27 08:10:29,996 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
 Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s):
 0e940167482d42f1999b29a023c7c18a
 2013-06-27 08:17:44,719 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
 Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
 0e940167482d42f1999b29a023c7c18a, e380fd8a7174d34feb903baa97564e08
 2013-06-27 08:23:45,357 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
 Too many hlogs: logs=33, maxlogs=32; forcing flush of 3 regions(s):
 0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14,
 e380fd8a7174d34feb903baa97564e08

 Any pointers on what's the best practice for avoiding this scenario ?

 Thanks,
 Viral

 On Thu, Jun 27, 2013 at 1:21 AM, 谢良 xieli...@xiaomi.com wrote:

  If  reached memstore global up-limit,  you'll find Blocking updates on
  in your files(see MemStoreFlusher.reclaimMemStoreMemory);
  If  it's caused by too many log files, you'll find Too many hlogs:
  logs=(see HLog.cleanOldLogs)
  Hope it's helpful for you:)
 
  Best,
  Liang
 



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Azuryy Yu
hey Viral,
Which hbase version are you using?



On Thu, Jun 27, 2013 at 5:03 PM, Anoop John anoop.hb...@gmail.com wrote:

 The config hbase.regionserver.maxlogs specifies what is the max #logs and
 defaults to 32.  But remember if there are so many log files to replay then
 the MTTR will become more (RS down case )

 -Anoop-
 On Thu, Jun 27, 2013 at 1:59 PM, Viral Bajaria viral.baja...@gmail.com
 wrote:

  Thanks Liang!
 
  Found the logs. I had gone overboard with my grep's and missed the Too
  many hlogs line for the regions that I was trying to debug.
 
  A few sample log lines:
 
  2013-06-27 07:42:49,602 INFO
 org.apache.hadoop.hbase.regionserver.wal.HLog:
  Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
  0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14
  2013-06-27 08:10:29,996 INFO
 org.apache.hadoop.hbase.regionserver.wal.HLog:
  Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s):
  0e940167482d42f1999b29a023c7c18a
  2013-06-27 08:17:44,719 INFO
 org.apache.hadoop.hbase.regionserver.wal.HLog:
  Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s):
  0e940167482d42f1999b29a023c7c18a, e380fd8a7174d34feb903baa97564e08
  2013-06-27 08:23:45,357 INFO
 org.apache.hadoop.hbase.regionserver.wal.HLog:
  Too many hlogs: logs=33, maxlogs=32; forcing flush of 3 regions(s):
  0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14,
  e380fd8a7174d34feb903baa97564e08
 
  Any pointers on what's the best practice for avoiding this scenario ?
 
  Thanks,
  Viral
 
  On Thu, Jun 27, 2013 at 1:21 AM, 谢良 xieli...@xiaomi.com wrote:
 
   If  reached memstore global up-limit,  you'll find Blocking updates
 on
   in your files(see MemStoreFlusher.reclaimMemStoreMemory);
   If  it's caused by too many log files, you'll find Too many hlogs:
   logs=(see HLog.cleanOldLogs)
   Hope it's helpful for you:)
  
   Best,
   Liang
  
 



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
0.94.4 with plans to upgrade to the latest 0.94 release.

On Thu, Jun 27, 2013 at 2:22 AM, Azuryy Yu azury...@gmail.com wrote:

 hey Viral,
 Which hbase version are you using?



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Azuryy Yu
Can you paste your JVM options here? And do you have an extensive write load on
your HBase cluster?


On Thu, Jun 27, 2013 at 5:47 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 0.94.4 with plans to upgrade to the latest 0.94 release.

 On Thu, Jun 27, 2013 at 2:22 AM, Azuryy Yu azury...@gmail.com wrote:

  hey Viral,
  Which hbase version are you using?
 



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
I do have a heavy write operation going on. Actually, heavy is relative. Not
all tables/regions are seeing the same amount of writes at the same time.
There is definitely a burst of writes that can happen on some regions. In
addition to that, there are some processing jobs which play catch-up and
could be processing data from the past, and they could have heavier write
operations.

I think my main problem is that my writes are well distributed across regions.
A batch of puts most probably ends up hitting every region since they get
distributed fairly well. In that scenario, I am guessing I get a lot of
WALs, though I am just speculating.

Regarding the JVM options (minus some settings for remote profiling):
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M

On Thu, Jun 27, 2013 at 2:48 AM, Azuryy Yu azury...@gmail.com wrote:

 Can you paste your JVM options here? and Do you have an extensive write on
 your hbase cluster?



答复: 答复: flushing + compactions after config change

2013-06-27 Thread 谢良
BTW, don't use CMSIncrementalMode; IIRC, it has actually been removed from
upstream HotSpot.

From: Viral Bajaria [viral.baja...@gmail.com]
Sent: June 27, 2013 18:08
To: user@hbase.apache.org
Subject: Re: 答复: flushing + compactions after config change

I do have a heavy write operation going on. Actually heavy is relative. Not
all tables/regions are seeing the same amount of writes at the same time.
There is definitely a burst of writes that can happen on some regions. In
addition to that there are some processing jobs which play catch up and
could be processing data in the past and they could have more heavy write
operations.

I think my main problem is, my writes are well distributed across regions.
A batch of puts most probably end up hitting every region since they get
distributed fairly well. In that scenario, I am guessing I get a lot of
WALs though I am just speculating.

Regarding the JVM options (minus some settings for remote profiling):
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M

On Thu, Jun 27, 2013 at 2:48 AM, Azuryy Yu azury...@gmail.com wrote:

 Can you paste your JVM options here? and Do you have an extensive write on
 your hbase cluster?


Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store

2013-06-27 Thread Shahab Yunus
I think you will need to update your hash function and redistribute the data.
As far as I know this has been one of the drawbacks of this approach (and of
the Sematext library).
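To make that concrete, here is a small illustrative sketch (hypothetical key and bucket
counts, not taken from any particular library) of a modulus-based salt prefix; changing
the bucket count changes the prefix computed for existing keys, which is why the data
has to be re-shuffled:

import java.util.Arrays;

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    // Prefix each key with hash(key) mod N so writes spread over N key ranges.
    static byte[] salted(byte[] originalKey, int numBuckets) {
        int bucket = (Arrays.hashCode(originalKey) & Integer.MAX_VALUE) % numBuckets;
        return Bytes.add(Bytes.toBytes((short) bucket), originalKey);
    }

    public static void main(String[] args) {
        byte[] key = Bytes.toBytes("user-12345");   // hypothetical row key
        byte[] before = salted(key, 8);             // bucket layout before scaling
        byte[] after = salted(key, 10);             // after adding region servers
        // The two salted keys usually differ, so rows written under numBuckets=8 are
        // not where a numBuckets=10 reader expects them; hence the re-shuffle.
        System.out.println(Bytes.toStringBinary(before));
        System.out.println(Bytes.toStringBinary(after));
    }
}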

Regards,
Shahab


On Wed, Jun 26, 2013 at 7:24 PM, Joarder KAMAL joard...@gmail.com wrote:

 Maybe a simple question to answer for the experienced HBase users and
 developers:

 If I use hash partitioning to evenly distribute write workloads into my
 region servers and later add a new region server to scale or split an
 existing region, then do I need to change my hash function and re-shuffle
 all the existing data in between all the region servers (old and new)? Or,
 is there any better solution for this? Any guidance would be very much
 helpful.

 Thanks in advance.

  
 Regards,
 Joarder Kamal



Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store

2013-06-27 Thread Joarder KAMAL
Thanks Shahab for the reply. I was also thinking along the same lines.
Could you point me to any reference which can confirm this
understanding?

Regards,
Joarder Kamal



On 27 June 2013 23:24, Shahab Yunus shahab.yu...@gmail.com wrote:

 I think you will need to update your hash function and redistribute data.
 As far as I know this has been on of the drawbacks of this approach (and
 the SemaText library)

 Regards,
 Shahab


 On Wed, Jun 26, 2013 at 7:24 PM, Joarder KAMAL joard...@gmail.com wrote:

  May be a simple question to answer for the experienced HBase users and
  developers:
 
  If I use hash partitioning to evenly distribute write workloads into my
  region servers and later add a new region server to scale or split an
  existing region, then do I need to change my hash function and re-shuffle
  all the existing data in between all the region servers (old and new)?
 Or,
  is there any better solution for this? Any guidance would be very much
  helpful.
 
  Thanks in advance.
 
 
  Regards,
  Joarder Kamal
 



hbase replication and dfs replication?

2013-06-27 Thread Jason Huang
Hello,

I am a bit confused about how the configurations for HBase replication and DFS
replication work together.

My application is deployed on an HBase cluster (0.94.3) with two region
servers. The two Hadoop datanodes run on the same two region servers.

Because we only have two datanodes, dfs.replication was set to 2.

The person who configured the small cluster didn't explicitly set the HBase
replication configs, which include:

(1) in ${HBASE_HOME}/conf/hbase-site.xml, hbase.replication is not set. I
think the default value is false according to
http://hbase.apache.org/replication.html.

(2) in the table, REPLICATION_SCOPE is set to 0 (by default).

However, even without setting hbase.replication and REPLICATION_SCOPE, it
appears that the tables are duplicated on the two region servers (as I can
go to the shells of these two region servers and find the duplicate rows
from a scan).

My question is: does the default DFS replication take care of replicating
HBase tables within the same cluster, so we don't need to set up the HBase
replication configs? And should we only set up the HBase replication configs
(1) and (2) above when we need to replicate HBase from one cluster to
another?

thanks!

Jason


Re: hbase replication and dfs replication?

2013-06-27 Thread Dave Wang
Jason,

HBase replication is for between two HBase clusters as you state.

What you are seeing is merely the expected behavior within a single
cluster.  DFS replication is not involved directly here - the shell ends up
acting like any other HBase client and constructing the scan the same way
(i.e. finding the right region servers to do the scan by contacting ZK,
region server serving .META., issuing the scan requests to the proper RSes,
etc.).  It doesn't matter where you are running the client from.

There is no replicating HBase tables within the same cluster - you're
just accessing the same table from different clients.

Hope this helps,

- Dave


On Thu, Jun 27, 2013 at 7:04 AM, Jason Huang jason.hu...@icare.com wrote:

 Hello,

 I am a bit confused how configurations of hbase replication and dfs
 replication works together.

 My application deploys on an HBase cluster (0.94.3) with two Region
 servers. The two hadoop datanodes run on the same two Region severs.

 Because we only have two datanodes, dfs.replication was set to 2.

 The person who configured the small cluster didn't explicitly set the hbase
 replication configs, which includes:

 (1) in ${HBASE_HOME}/conf/hbase-site.xml, hbase.replication is not set. I
 think the default value is false according to
 http://hbase.apache.org/replication.html.

 (2) in the table,Replication_Scope is set to 0 (by default).

 However, even without setting hbase.replication and replication_scope, it
 appears that the tables are duplicated in the two Region servers (as I can
 go to the shells of these two region servers and find the duplicate rows
 from a scan).

 My question is - does the default dfs replication takes care of replicating
 hbase tables within the same cluster so we don't need to set up the hbase
 replication configs? And only when we need to replicate hbase from one
 cluster to another cluster should we set up the hbase replication configs
 (1) and (2) above?

 thanks!

 Jason



Re: hbase replication and dfs replication?

2013-06-27 Thread Jason Huang
makes a lot of sense.

thanks Dave,

Jason

On Thu, Jun 27, 2013 at 10:26 AM, Dave Wang d...@cloudera.com wrote:

 Jason,

 HBase replication is for between two HBase clusters as you state.

 What you are seeing is merely the expected behavior within a single
 cluster.  DFS replication is not involved directly here - the shell ends up
 acting like any other HBase client and constructing the scan the same way
 (i.e. finding the right region servers to do the scan by contacting ZK,
 region server serving .META., issuing the scan requests to the proper RSes,
 etc.).  It doesn't matter where you are running the client from.

 There is no replicating HBase tables within the same cluster - you're
 just accessing the same table from different clients.

 Hope this helps,

 - Dave


 On Thu, Jun 27, 2013 at 7:04 AM, Jason Huang jason.hu...@icare.com
 wrote:

  Hello,
 
  I am a bit confused how configurations of hbase replication and dfs
  replication works together.
 
  My application deploys on an HBase cluster (0.94.3) with two Region
  servers. The two hadoop datanodes run on the same two Region severs.
 
  Because we only have two datanodes, dfs.replication was set to 2.
 
  The person who configured the small cluster didn't explicitly set the
 hbase
  replication configs, which includes:
 
  (1) in ${HBASE_HOME}/conf/hbase-site.xml, hbase.replication is not set. I
  think the default value is false according to
  http://hbase.apache.org/replication.html.
 
  (2) in the table,Replication_Scope is set to 0 (by default).
 
  However, even without setting hbase.replication and replication_scope, it
  appears that the tables are duplicated in the two Region servers (as I
 can
  go to the shells of these two region servers and find the duplicate rows
  from a scan).
 
  My question is - does the default dfs replication takes care of
 replicating
  hbase tables within the same cluster so we don't need to set up the hbase
  replication configs? And only when we need to replicate hbase from one
  cluster to another cluster should we set up the hbase replication configs
  (1) and (2) above?
 
  thanks!
 
  Jason
 



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Azuryy Yu
Your JVM options are not enough. I will give you some details when I go back
to the office tomorrow.

--Send from my Sony mobile.
On Jun 27, 2013 6:09 PM, Viral Bajaria viral.baja...@gmail.com wrote:

 I do have a heavy write operation going on. Actually heavy is relative. Not
 all tables/regions are seeing the same amount of writes at the same time.
 There is definitely a burst of writes that can happen on some regions. In
 addition to that there are some processing jobs which play catch up and
 could be processing data in the past and they could have more heavy write
 operations.

 I think my main problem is, my writes are well distributed across regions.
 A batch of puts most probably end up hitting every region since they get
 distributed fairly well. In that scenario, I am guessing I get a lot of
 WALs though I am just speculating.

 Regarding the JVM options (minus some settings for remote profiling):
 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -verbose:gc
 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
 -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M

 On Thu, Jun 27, 2013 at 2:48 AM, Azuryy Yu azury...@gmail.com wrote:

  Can you paste your JVM options here? and Do you have an extensive write
 on
  your hbase cluster?
 



Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store

2013-06-27 Thread Shahab Yunus
I don't have a particular document or source stating this, but I think it is
actually kind of self-explanatory if you think about the algorithm.

Anyway, you can read this
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

And some older discussions by experts on this topic:
http://search-hadoop.com/?q=prefix+salt+key+hotspotfc_project=HBase

Regards,
Shahab


On Thu, Jun 27, 2013 at 9:44 AM, Joarder KAMAL joard...@gmail.com wrote:

 Thanks Shahab for the reply. I was also thinking in the same way.
 Could you able to guide me through any reference which can confirm this
 understanding?

 
 Regards,
 Joarder Kamal



 On 27 June 2013 23:24, Shahab Yunus shahab.yu...@gmail.com wrote:

  I think you will need to update your hash function and redistribute data.
  As far as I know this has been on of the drawbacks of this approach (and
  the SemaText library)
 
  Regards,
  Shahab
 
 
  On Wed, Jun 26, 2013 at 7:24 PM, Joarder KAMAL joard...@gmail.com
 wrote:
 
   May be a simple question to answer for the experienced HBase users and
   developers:
  
   If I use hash partitioning to evenly distribute write workloads into my
   region servers and later add a new region server to scale or split an
   existing region, then do I need to change my hash function and
 re-shuffle
   all the existing data in between all the region servers (old and new)?
  Or,
   is there any better solution for this? Any guidance would be very much
   helpful.
  
   Thanks in advance.
  
  
   Regards,
   Joarder Kamal
  
 



Re: Schema design for filters

2013-06-27 Thread Michael Segel
Not an easy task. 

You first need to determine how you want to store the data within a column 
and/or apply a type constraint to a column. 

Even if you use JSON records to store your data within a column, does an 
equality comparator exist? If not, you would have to write one. 
(I kinda think that one may already exist...)


On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 Hi
 
 Working with the standard filtering mechanism to scan rows that have
 columns matching certain criterias.
 
 There are columns of numeric (integer and decimal) and string types. These
 columns are single- or multi-valued, like "1", "2", "1,2,3", "a", "b" or
 "a,b,c" - not sure what the separator would be in the case of list types.
 Maybe none?
 
 I would like to compose the following queries to filter out rows that do
 not match.
 
 - contains(String column, String value)
  Single-valued column whose value String.contains() the provided value.

 - equal(String column, Object value)
  Single-valued column whose value Object.equals() the provided value.
  Value is either of string or numeric type.

 - greaterThan(String column, java.lang.Number value)
  Single-valued column whose value is > the provided numeric value.

 - in(String column, Object value...)
  Multi-valued column whose values Object.equals() all of the provided values.
  Values are of string or numeric type.
 
  How would I design a schema that can take advantage of the already existing
  filters and comparators to accomplish this?

  I already looked at the string and binary comparators but fail to see how to
  solve this in a clean way for multi-valued column values.

  I'm aware of custom filters but would like to avoid them if possible.
 
 Cheers,
 -Kristoffer



Profiling map reduce jobs?

2013-06-27 Thread David Poisson
Howdy,
 I want to take a look at an MR job which seems to be slower than I had
hoped. Mind you, this MR job is only running on a pseudo-distributed VM
(Cloudera CDH4).

I have modified my mapred-site.xml with the following (that last one is 
commented out because it crashes my MR job):

  <property>
    <name>mapred.task.profile</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.task.profile.maps</name>
    <value>0-2</value>
  </property>
  <property>
    <name>mapred.task.profile.reduces</name>
    <value>0-2</value>
  </property>
  <!--property>
    <name>mapred.task.profile.params</name>
    <value>agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property-->
Are there any resources that explain how to interpret the results?
Or maybe an open-source app that could help display the results in a more
intuitive manner?

Ideally, we'd want to know where we are spending most of our time.

Cheers,

David


Re: 答复: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
Thanks Azuryy. Look forward to it.

Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be created?
I tried looking around but could not find the details.

On Thu, Jun 27, 2013 at 7:53 AM, Azuryy Yu azury...@gmail.com wrote:

 your JVM options arenot enough. I will give you some detail when I go back
 office tomorrow.

 --Send from my Sony mobile.



Re: 答复: flushing + compactions after config change

2013-06-27 Thread Jean-Daniel Cryans
No, all your data eventually makes it into the log, just potentially
not as quickly :)

J-D

On Thu, Jun 27, 2013 at 2:06 PM, Viral Bajaria viral.baja...@gmail.com wrote:
 Thanks Azuryy. Look forward to it.

 Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be created
 ? Tried looking around but could not find the details.

 On Thu, Jun 27, 2013 at 7:53 AM, Azuryy Yu azury...@gmail.com wrote:

 your JVM options arenot enough. I will give you some detail when I go back
 office tomorrow.

 --Send from my Sony mobile.



Re: Schema design for filters

2013-06-27 Thread Kristoffer Sjögren
I realize standard comparators cannot solve this.

However I do know the type of each column so writing custom list
comparators for boolean, char, byte, short, int, long, float, double seems
quite straightforward.

Long arrays, for example, are stored as a byte array with 8 bytes per item
so a comparator might look like this.

public class LongsComparator extends WritableByteArrayComparable {
    public int compareTo(byte[] value, int offset, int length) {
        long[] values = BytesUtils.toLongs(value, offset, length);
        for (long longValue : values) {
            if (longValue == val) {
                return 0;
            }
        }
        return 1;
    }
}

public static long[] toLongs(byte[] value, int offset, int length) {
    int num = (length - offset) / 8;
    long[] values = new long[num];
    for (int i = offset; i < num; i++) {
        values[i] = getLong(value, i * 8);
    }
    return values;
}


Strings are similar but would require charset and length for each string.

public class StringsComparator extends WritableByteArrayComparable {
    public int compareTo(byte[] value, int offset, int length) {
        String[] values = BytesUtils.toStrings(value, offset, length);
        for (String stringValue : values) {
            if (val.equals(stringValue)) {
                return 0;
            }
        }
        return 1;
    }
}

public static String[] toStrings(byte[] value, int offset, int length) {
    ArrayList<String> values = new ArrayList<String>();
    int idx = 0;
    ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
    while (idx < length) {
        int size = buffer.getInt();
        byte[] bytes = new byte[size];
        buffer.get(bytes);
        values.add(new String(bytes));
        idx += 4 + size;
    }
    return values.toArray(new String[values.size()]);
}


Am I on the right track or maybe overlooking some implementation details?
Not really sure how to target each comparator to a specific column value?


On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel michael_se...@hotmail.comwrote:

 Not an easy task.

 You first need to determine how you want to store the data within a column
 and/or apply a type constraint to a column.

 Even if you use JSON records to store your data within a column, does an
 equality comparator exist? If not, you would have to write one.
 (I kinda think that one may already exist...)


 On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com wrote:

  Hi
 
  Working with the standard filtering mechanism to scan rows that have
  columns matching certain criterias.
 
  There are columns of numeric (integer and decimal) and string types.
 These
  columns are single or multi-valued like 1, 2, 1,2,3, a, b or
  a,b,c - not sure what the separator would be in the case of list types.
  Maybe none?
 
  I would like to compose the following queries to filter out rows that
 does
  not match.
 
  - contains(String column, String value)
   Single valued column that String.contain() provided value.
 
  - equal(String column, Object value)
   Single valued column that Object.equals() provided value.
   Value is either string or numeric type.
 
  - greaterThan(String column, java.lang.Number value)
   Single valued column that  provided numeric value.
 
  - in(String column, Object value...)
   Multi-valued column have values that Object.equals() all provided
 values.
   Values are of string or numeric type.
 
  How would I design a schema that can take advantage of the already
 existing
  filters and comparators to accomplish this?
 
  Already looked at the string and binary comparators but fail to see how
 to
  solve this in a clean way for multi-valued column values.
 
  Im aware of custom filters but would like to avoid it if possible.
 
  Cheers,
  -Kristoffer




Re: Schema design for filters

2013-06-27 Thread Michael Segel
You have to remember that HBase doesn't enforce any sort of typing. 
That's why this can be difficult. 

You'd have to write a coprocessor to enforce a schema on a table. 
Even then YMMV if you're writing JSON structures to a column because while the 
contents of the structures could be the same, the actual strings could differ.  

HTH

-Mike

On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 I realize standard comparators cannot solve this.
 
 However I do know the type of each column so writing custom list
 comparators for boolean, char, byte, short, int, long, float, double seems
 quite straightforward.
 
 Long arrays, for example, are stored as a byte array with 8 bytes per item
 so a comparator might look like this.
 
 public class LongsComparator extends WritableByteArrayComparable {
public int compareTo(byte[] value, int offset, int length) {
long[] values = BytesUtils.toLongs(value, offset, length);
for (long longValue : values) {
if (longValue == val) {
return 0;
}
}
return 1;
}
 }
 
 public static long[] toLongs(byte[] value, int offset, int length) {
int num = (length - offset) / 8;
long[] values = new long[num];
for (int i = offset; i < num; i++) {
values[i] = getLong(value, i * 8);
}
return values;
 }
 
 
 Strings are similar but would require charset and length for each string.
 
 public class StringsComparator extends WritableByteArrayComparable  {
public int compareTo(byte[] value, int offset, int length) {
String[] values = BytesUtils.toStrings(value, offset, length);
for (String stringValue : values) {
if (val.equals(stringValue)) {
return 0;
}
}
return 1;
}
 }
 
 public static String[] toStrings(byte[] value, int offset, int length) {
ArrayList<String> values = new ArrayList<String>();
int idx = 0;
ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
while (idx < length) {
int size = buffer.getInt();
byte[] bytes = new byte[size];
buffer.get(bytes);
values.add(new String(bytes));
idx += 4 + size;
}
return values.toArray(new String[values.size()]);
 }
 
 
 Am I on the right track or maybe overlooking some implementation details?
 Not really sure how to target each comparator to a specific column value?
 
 
 On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
 Not an easy task.
 
 You first need to determine how you want to store the data within a column
 and/or apply a type constraint to a column.
 
 Even if you use JSON records to store your data within a column, does an
 equality comparator exist? If not, you would have to write one.
 (I kinda think that one may already exist...)
 
 
 On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 Hi
 
 Working with the standard filtering mechanism to scan rows that have
 columns matching certain criterias.
 
 There are columns of numeric (integer and decimal) and string types.
 These
 columns are single or multi-valued like 1, 2, 1,2,3, a, b or
 a,b,c - not sure what the separator would be in the case of list types.
 Maybe none?
 
 I would like to compose the following queries to filter out rows that
 does
 not match.
 
 - contains(String column, String value)
 Single valued column that String.contain() provided value.
 
 - equal(String column, Object value)
 Single valued column that Object.equals() provided value.
 Value is either string or numeric type.
 
 - greaterThan(String column, java.lang.Number value)
 Single valued column that  provided numeric value.
 
 - in(String column, Object value...)
 Multi-valued column have values that Object.equals() all provided
 values.
 Values are of string or numeric type.
 
 How would I design a schema that can take advantage of the already
 existing
 filters and comparators to accomplish this?
 
 Already looked at the string and binary comparators but fail to see how
 to
 solve this in a clean way for multi-valued column values.
 
 Im aware of custom filters but would like to avoid it if possible.
 
 Cheers,
 -Kristoffer
 
 



Re: Schema design for filters

2013-06-27 Thread Kristoffer Sjögren
I see your point. Everything is just bytes.

However, the schema is known and every row is formatted according to this
schema, although some columns may not exist, that is, no value exists for
that property on that row.

So if I'm able to apply these typed comparators to the right cell values,
it may be possible? But I can't find a filter that targets specific columns.

It seems like all filters scan every column/qualifier and there is no way of
knowing which column is currently being evaluated?


On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
michael_se...@hotmail.comwrote:

 You have to remember that HBase doesn't enforce any sort of typing.
 That's why this can be difficult.

 You'd have to write a coprocessor to enforce a schema on a table.
 Even then YMMV if you're writing JSON structures to a column because while
 the contents of the structures could be the same, the actual strings could
 differ.

 HTH

 -Mike

 On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren sto...@gmail.com wrote:

  I realize standard comparators cannot solve this.
 
  However I do know the type of each column so writing custom list
  comparators for boolean, char, byte, short, int, long, float, double
 seems
  quite straightforward.
 
  Long arrays, for example, are stored as a byte array with 8 bytes per
 item
  so a comparator might look like this.
 
  public class LongsComparator extends WritableByteArrayComparable {
 public int compareTo(byte[] value, int offset, int length) {
 long[] values = BytesUtils.toLongs(value, offset, length);
 for (long longValue : values) {
 if (longValue == val) {
 return 0;
 }
 }
 return 1;
 }
  }
 
  public static long[] toLongs(byte[] value, int offset, int length) {
 int num = (length - offset) / 8;
 long[] values = new long[num];
 for (int i = offset; i < num; i++) {
 values[i] = getLong(value, i * 8);
 }
 return values;
  }
 
 
  Strings are similar but would require charset and length for each string.
 
  public class StringsComparator extends WritableByteArrayComparable  {
 public int compareTo(byte[] value, int offset, int length) {
 String[] values = BytesUtils.toStrings(value, offset, length);
 for (String stringValue : values) {
 if (val.equals(stringValue)) {
 return 0;
 }
 }
 return 1;
 }
  }
 
  public static String[] toStrings(byte[] value, int offset, int length) {
 ArrayList<String> values = new ArrayList<String>();
 int idx = 0;
 ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
 while (idx < length) {
 int size = buffer.getInt();
 byte[] bytes = new byte[size];
 buffer.get(bytes);
 values.add(new String(bytes));
 idx += 4 + size;
 }
 return values.toArray(new String[values.size()]);
  }
 
 
  Am I on the right track or maybe overlooking some implementation details?
  Not really sure how to target each comparator to a specific column value?
 
 
  On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
  Not an easy task.
 
  You first need to determine how you want to store the data within a
 column
  and/or apply a type constraint to a column.
 
  Even if you use JSON records to store your data within a column, does an
  equality comparator exist? If not, you would have to write one.
  (I kinda think that one may already exist...)
 
 
  On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  Hi
 
  Working with the standard filtering mechanism to scan rows that have
  columns matching certain criterias.
 
  There are columns of numeric (integer and decimal) and string types.
  These
  columns are single or multi-valued like 1, 2, 1,2,3, a, b or
  a,b,c - not sure what the separator would be in the case of list
 types.
  Maybe none?
 
  I would like to compose the following queries to filter out rows that
  does
  not match.
 
  - contains(String column, String value)
  Single valued column that String.contain() provided value.
 
  - equal(String column, Object value)
  Single valued column that Object.equals() provided value.
  Value is either string or numeric type.
 
  - greaterThan(String column, java.lang.Number value)
  Single valued column that  provided numeric value.
 
  - in(String column, Object value...)
  Multi-valued column have values that Object.equals() all provided
  values.
  Values are of string or numeric type.
 
  How would I design a schema that can take advantage of the already
  existing
  filters and comparators to accomplish this?
 
  Already looked at the string and binary comparators but fail to see how
  to
  solve this in a clean way for multi-valued column values.
 
  Im aware of custom filters but would like to avoid it if possible.
 
  Cheers,
  -Kristoffer
 
 




Re: Problems while exporting from Hbase to CSV file

2013-06-27 Thread Michael Segel
Phoenix, Hive, Pig, and Java would all work.
But to Azuryy Yu's post...

The OP is doing a simple scan() to get rows.
If the OP is hitting an OOM exception, then it's a code issue on the part of the
OP.


On Jun 27, 2013, at 2:22 AM, Azuryy Yu azury...@gmail.com wrote:

 Sorry, maybe Phonex is not suitable for you.
 
 
 On Thu, Jun 27, 2013 at 3:21 PM, Azuryy Yu azury...@gmail.com wrote:
 
 1) Scan.setCaching() to specify the number of rows for caching that will
 be passed to scanners.
and what's your block cache size?
 
but if OOM from the client, not sever side, then I don't think this is
 Scan related, please check your client code.
 
 2) we cannot add default value from HBase,  but you can add it on your
 client when iterate the Result.
 
 Also, you can using Phonex, this is cool for your scenario.
 https://github.com/forcedotcom/phoenix
 
 
 
 On Thu, Jun 27, 2013 at 3:11 PM, Vimal Jain vkj...@gmail.com wrote:
 
 Hi,
 I am trying to export from hbase to a CSV file.
 I am using Scan class to scan all data  in the table.
 But i am facing some problems while doing it.
 
 1) My table has around 1.5 million rows  and around 150 columns for each
 row , so i can not use default scan() constructor as it will scan whole
 table in one go which results in OutOfMemory error in client process.I
 heard of using setCaching() and setBatch() but i am not able to understand
 how it will solve OOM error.
 
 I thought of providing startRow and stopRow in scan object but i want to
 scan whole table so how will this help ?
 
 2) As hbase stores data for a row only when we explicitly provide it and
 their is no concept of default value as found in RDBMS , i want to have
 each and evey column in the CSV file i generate for every user.In case
 column values are not there in hbase , i want to use default  values for
 them(I have list of default values for each column). Is there any method
 in
 Result class or any other class to accomplish this ?
 
 
 Please help here.
 
 --
 Thanks and Regards,
 Vimal Jain
 
 
 



Re: Schema design for filters

2013-06-27 Thread Michael Segel
Ok... 

If you want to do type checking and schema enforcement... 

You will need to do this as a coprocessor. 

The quick and dirty way (not recommended) would be to hard-code the schema
into the coprocessor code.

A better way: at start-up, load up ZK to manage the set of known table
schemas, which would be a map of column qualifier to data type.
(If JSON, then you need to do a separate lookup to get the record's schema.)

Then a single Java class does the lookup and handles the known data
type comparators.

Does this make sense? 
(Sorry, I was kind of thinking this out as I typed the response, but it should
work.)

At least it would be a design approach I would take. YMMV.
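A minimal sketch of that ZK-backed lookup, assuming a made-up znode layout of one
"qualifier=type" line per table; this is not an existing HBase facility, just one
possible shape for the helper described above:

import java.util.HashMap;
import java.util.Map;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SchemaRegistry {
    // Reads a znode whose payload is lines of "qualifier=type" and builds the
    // qualifier -> type map used to pick the right comparator.
    static Map<String, String> load(String zkQuorum, String tableZnode) throws Exception {
        ZooKeeper zk = new ZooKeeper(zkQuorum, 30000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });
        try {
            byte[] data = zk.getData(tableZnode, false, null);
            Map<String, String> schema = new HashMap<String, String>();
            for (String line : new String(data, "UTF-8").split("\n")) {
                String[] kv = line.split("=", 2);
                if (kv.length == 2) {
                    schema.put(kv[0].trim(), kv[1].trim());
                }
            }
            return schema;
        } finally {
            zk.close();
        }
    }
}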

Having said that, I expect someone to say it's a bad idea and that they have a
better solution.

HTH

-Mike

On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 I see your point. Everything is just bytes.
 
 However, the schema is known and every row is formatted according to this
 schema, although some columns may not exist, that is, no value exist for
 this property on this row.
 
 So if im able to apply these typed comparators to the right cell values
 it may be possible? But I cant find a filter that target specific columns?
 
 Seems like all filters scan every column/qualifier and there is no way of
 knowing what column is currently being evaluated?
 
 
 On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
 michael_se...@hotmail.comwrote:
 
 You have to remember that HBase doesn't enforce any sort of typing.
 That's why this can be difficult.
 
 You'd have to write a coprocessor to enforce a schema on a table.
 Even then YMMV if you're writing JSON structures to a column because while
 the contents of the structures could be the same, the actual strings could
 differ.
 
 HTH
 
 -Mike
 
 On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 I realize standard comparators cannot solve this.
 
 However I do know the type of each column so writing custom list
 comparators for boolean, char, byte, short, int, long, float, double
 seems
 quite straightforward.
 
 Long arrays, for example, are stored as a byte array with 8 bytes per
 item
 so a comparator might look like this.
 
 public class LongsComparator extends WritableByteArrayComparable {
   public int compareTo(byte[] value, int offset, int length) {
   long[] values = BytesUtils.toLongs(value, offset, length);
   for (long longValue : values) {
   if (longValue == val) {
   return 0;
   }
   }
   return 1;
   }
 }
 
 public static long[] toLongs(byte[] value, int offset, int length) {
   int num = (length - offset) / 8;
   long[] values = new long[num];
   for (int i = offset; i < num; i++) {
   values[i] = getLong(value, i * 8);
   }
   return values;
 }
 
 
 Strings are similar but would require charset and length for each string.
 
 public class StringsComparator extends WritableByteArrayComparable  {
   public int compareTo(byte[] value, int offset, int length) {
   String[] values = BytesUtils.toStrings(value, offset, length);
   for (String stringValue : values) {
   if (val.equals(stringValue)) {
   return 0;
   }
   }
   return 1;
   }
 }
 
 public static String[] toStrings(byte[] value, int offset, int length) {
   ArrayList<String> values = new ArrayList<String>();
   int idx = 0;
   ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
   while (idx < length) {
   int size = buffer.getInt();
   byte[] bytes = new byte[size];
   buffer.get(bytes);
   values.add(new String(bytes));
   idx += 4 + size;
   }
   return values.toArray(new String[values.size()]);
 }
 
 
 Am I on the right track or maybe overlooking some implementation details?
 Not really sure how to target each comparator to a specific column value?
 
 
 On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
 Not an easy task.
 
 You first need to determine how you want to store the data within a
 column
 and/or apply a type constraint to a column.
 
 Even if you use JSON records to store your data within a column, does an
 equality comparator exist? If not, you would have to write one.
 (I kinda think that one may already exist...)
 
 
 On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Hi
 
 Working with the standard filtering mechanism to scan rows that have
 columns matching certain criterias.
 
 There are columns of numeric (integer and decimal) and string types.
 These
 columns are single or multi-valued like 1, 2, 1,2,3, a, b or
 a,b,c - not sure what the separator would be in the case of list
 types.
 Maybe none?
 
 I would like to compose the following queries to filter out rows that
 does
 not match.
 
 - contains(String column, String value)
 Single valued column that String.contain() provided value.
 
 - equal(String column, Object value)
 Single valued column that 

Re: 答复: flushing + compactions after config change

2013-06-27 Thread Viral Bajaria
Hey JD,

Thanks for the clarification. I also came across a previous thread which
sort of talks about a similar problem.
http://mail-archives.apache.org/mod_mbox/hbase-user/201204.mbox/%3ccagptdnfwnrsnqv7n3wgje-ichzpx-cxn1tbchgwrpohgcos...@mail.gmail.com%3E

I guess my problem is also similar: my writes are well
distributed, and at a given time I could be writing to a lot of regions.
Some of the regions receive very little data, but since the flush algorithm
chooses at random what to flush when the "too many hlogs" limit is hit, it will
flush a region with less than 10mb of data, causing too many small files. This
in turn causes compaction storms where, even though major compaction is
disabled, some of the minor compactions get upgraded to major, and that's when
things start getting worse.

My compaction queues are still the same, so I doubt I will be coming out
of this storm without bumping up max hlogs for now. Reducing regions per
server is one option, but then I would be wasting my resources since the
servers at current load are at < 30% CPU and < 25% RAM. Maybe I can bump up
heap space and give more memory to the memstore. Sorry, I am just
thinking out loud.

Thanks,
Viral

On Thu, Jun 27, 2013 at 2:40 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote:

 No, all your data eventually makes it into the log, just potentially
 not as quickly :)



Re: Schema design for filters

2013-06-27 Thread Kristoffer Sjögren
Thanks for your help Mike. Much appreciated.

I don't store rows/columns in JSON format. The schema is exactly that of a
specific Java class, where the rowkey is a unique object identifier with
the class type encoded into it. Columns are the field names of the class
and the values are those of the object instance.

I did think about coprocessors, but the schema is discovered at runtime and I
can't hard-code it.

However, I still believe that filters might work. I had a look
at SingleColumnValueFilter, and this filter is able to target specific
column qualifiers with specific WritableByteArrayComparables.

But list comparators are still missing... So I guess the only way is to
write these comparators?

Do you follow my reasoning? Will it work?
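For what it's worth, here is a minimal sketch of the targeting part with the stock 0.94
API; the table/family/qualifier names are hypothetical, and BinaryComparator stands in
where a custom list comparator (like the LongsComparator above) would go. A custom
comparator would, as far as I know, also need to be on the region servers' classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TargetedColumnScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");          // hypothetical table name
        // The filter is bound to one family/qualifier, so the comparator only ever
        // sees that column's value; swap in a custom comparator here as needed.
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("cf"),                          // hypothetical family
                Bytes.toBytes("someLongField"),               // hypothetical qualifier
                CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes(42L)));
        filter.setFilterIfMissing(true);                      // drop rows that lack the column
        Scan scan = new Scan();
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(Bytes.toStringBinary(r.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}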




On Fri, Jun 28, 2013 at 12:58 AM, Michael Segel
michael_se...@hotmail.comwrote:

 Ok...

 If you want to do type checking and schema enforcement...

 You will need to do this as a coprocessor.

 The quick and dirty way... (Not recommended) would be to hard code the
 schema in to the co-processor code.)

 A better way... at start up, load up ZK to manage the set of known table
 schemas which would be a map of column qualifier to data type.
 (If JSON then you need to do a separate lookup to get the records schema)

 Then a single java class that does the look up and then handles the known
 data type comparators.

 Does this make sense?
 (Sorry, kinda was thinking this out as I typed the response. But it should
 work )

 At least it would be a design approach I would talk. YMMV

 Having said that, I expect someone to say its a bad idea and that they
 have a better solution.

 HTH

 -Mike

 On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren sto...@gmail.com wrote:

  I see your point. Everything is just bytes.
 
  However, the schema is known and every row is formatted according to this
  schema, although some columns may not exist, that is, no value exist for
  this property on this row.
 
  So if im able to apply these typed comparators to the right cell values
  it may be possible? But I cant find a filter that target specific
 columns?
 
  Seems like all filters scan every column/qualifier and there is no way of
  knowing what column is currently being evaluated?
 
 
  On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
  michael_se...@hotmail.comwrote:
 
  You have to remember that HBase doesn't enforce any sort of typing.
  That's why this can be difficult.
 
  You'd have to write a coprocessor to enforce a schema on a table.
  Even then YMMV if you're writing JSON structures to a column because
 while
  the contents of the structures could be the same, the actual strings
 could
  differ.
 
  HTH
 
  -Mike
 
  On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  I realize standard comparators cannot solve this.
 
  However I do know the type of each column so writing custom list
  comparators for boolean, char, byte, short, int, long, float, double
  seems
  quite straightforward.
 
  Long arrays, for example, are stored as a byte array with 8 bytes per
  item
  so a comparator might look like this.
 
  public class LongsComparator extends WritableByteArrayComparable {
public int compareTo(byte[] value, int offset, int length) {
long[] values = BytesUtils.toLongs(value, offset, length);
for (long longValue : values) {
if (longValue == val) {
return 0;
}
}
return 1;
}
  }
 
  public static long[] toLongs(byte[] value, int offset, int length) {
int num = (length - offset) / 8;
long[] values = new long[num];
for (int i = offset; i < num; i++) {
values[i] = getLong(value, i * 8);
}
return values;
  }
 
 
  Strings are similar but would require charset and length for each
 string.
 
  public class StringsComparator extends WritableByteArrayComparable {
      private final String val; // the string being searched for

      public StringsComparator(String val) {
          this.val = val;
      }

      @Override
      public int compareTo(byte[] value, int offset, int length) {
          String[] values = BytesUtils.toStrings(value, offset, length);
          for (String stringValue : values) {
              if (val.equals(stringValue)) {
                  return 0; // found
              }
          }
          return 1; // not found
      }
  }

  public static String[] toStrings(byte[] value, int offset, int length) {
      ArrayList<String> values = new ArrayList<String>();
      int idx = 0;
      ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
      while (idx < length) {
          int size = buffer.getInt();     // 4-byte length prefix
          byte[] bytes = new byte[size];
          buffer.get(bytes);
          values.add(new String(bytes));  // charset is still an open question
          idx += 4 + size;
      }
      return values.toArray(new String[values.size()]);
  }
 
 
  Am I on the right track, or am I overlooking some implementation
  details? I'm not really sure how to target each comparator at a specific
  column value?
 
 
  On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel 
  michael_se...@hotmail.comwrote:
 
  Not an easy task.
 
  You first need to determine how you want to store the data within a
  column
  

Re: flushing + compactions after config change

2013-06-27 Thread Azuryy Yu
Hi Viral,
the following are all needed for CMS:

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled
-XX:CMSMaxAbortablePrecleanTime=300
-XX:+CMSScavengeBeforeRemark

and if your JDK version is newer than 1.6.0_23, then add:
-XX:+UseCompressedOops
-XX:SoftRefLRUPolicyMSPerMB=0


and you should also add GC logging:
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:$HBASE_LOG_DIR/gc.log

if your JDK version is newer than 1.6.0_23, then add:
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=9
-XX:GCLogFileSize=20m

Hope that's helpful.
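
For reference, these flags would normally go into hbase-env.sh on each server via HBASE_OPTS; a configuration sketch (the 8 GB heap is an assumption, size it to your machine):

    # hbase-env.sh
    export HBASE_OPTS="$HBASE_OPTS -Xms8g -Xmx8g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails \
      -Xloggc:${HBASE_LOG_DIR}/gc.log"
    # plus the remaining CMS and GC-log flags listed above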





On Fri, Jun 28, 2013 at 5:06 AM, Viral Bajaria viral.baja...@gmail.com wrote:

 Thanks Azuryy. Look forward to it.

 Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be
 created? I tried looking around but could not find the details.

 On Thu, Jun 27, 2013 at 7:53 AM, Azuryy Yu azury...@gmail.com wrote:

  Your JVM options are not enough. I will give you some details when I get
  back to the office tomorrow.
 
  --Send from my Sony mobile.
 



Re: Schema design for filters

2013-06-27 Thread James Taylor
Hi Kristoffer,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? You 
could model your schema much like an O/R mapper and issue SQL queries through 
Phoenix for your filtering.

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com
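
For illustration only: a minimal sketch of what a Phoenix query could look like over a table created through Phoenix. The JDBC URL (ZooKeeper quorum) and the table and column names are assumptions; depending on the Phoenix version you may also need to load the driver class explicitly.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Typed columns let the filtering become a SQL predicate instead of a
    // hand-written comparator.
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id FROM my_table WHERE long_field = 42");
    while (rs.next()) {
        System.out.println(rs.getString("id"));
    }
    conn.close();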

On Jun 27, 2013, at 4:39 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 Thanks for your help Mike. Much appreciated.
 
 I don't store rows/columns in JSON format. The schema is exactly that of a
 specific Java class, where the rowkey is a unique object identifier with
 the class type encoded into it. Columns are the field names of the class
 and the values are those of the object instance.
 
 I did think about coprocessors, but the schema is discovered at runtime and I
 can't hard-code it.
 
 However, I still believe that filters might work. Had a look
 at SingleColumnValueFilter, and this filter is able to target specific
 column qualifiers with specific WritableByteArrayComparables.
 
 But list comparators are still missing... So I guess the only way is to
 write these comparators?
 
 Do you follow my reasoning? Will it work?
 
 
 
 

what is the max number of columns that a column family can have?

2013-06-27 Thread ch huang
ATT


Re: what is the max number of columns that a column family can have?

2013-06-27 Thread Ted Yu
Your row can be very wide.

Take a look at the first paragraph in this comment:
https://issues.apache.org/jira/browse/HBASE-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633620#comment-13633620

Cheers

On Fri, Jun 28, 2013 at 10:40 AM, ch huang justlo...@gmail.com wrote:

 ATT



Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store

2013-06-27 Thread ramkrishna vasudevan
I would suggest you write a custom load balancer and then have your
hashing algo determine how the load balancing should happen. Hope this
helps.

Regards
Ram
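
For illustration (assumed 0.94-era LoadBalancer API and class names): a custom balancer could inspect the hash prefixes in region start keys before handing placement back to the default logic, and would be wired in through the hbase.master.loadbalancer.class property.

    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.master.DefaultLoadBalancer;
    import org.apache.hadoop.hbase.master.RegionPlan;

    public class HashAwareBalancer extends DefaultLoadBalancer {
        @Override
        public List<RegionPlan> balanceCluster(Map<ServerName, List<HRegionInfo>> clusterState) {
            // Inspect region start keys (hash prefixes) here and build RegionPlans
            // that place each hash bucket where you want it; this sketch simply
            // falls back to the default behaviour.
            return super.balanceCluster(clusterState);
        }
    }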


On Fri, Jun 28, 2013 at 5:32 AM, Joarder KAMAL joard...@gmail.com wrote:

 Thanks St.Ack for mentioning about the load-balancer.

 But my question was twofold:
 Case-1. If a new RS is added, then the load-balancer will do its job,
 assuming no new region has been created in the meantime. // As you've
 already answered.

 Case-2. Whether a new RS is added or not, if an existing region is split
 into two, how will the new writes go to the new region? Because, let's say
 the hash function was initially calculated with *N* regions and now there
 are *N+1* regions in the cluster.

 In that case, do I need to change the hash function and reshuffle all the
 existing data within the cluster? Or does HBase have some mechanism to handle
 this?


 Many thanks again for helping me out...


 
 Regards,
 Joarder Kamal

 On 28 June 2013 02:26, Stack st...@duboce.net wrote:

  On Wed, Jun 26, 2013 at 4:24 PM, Joarder KAMAL joard...@gmail.com
 wrote:
 
   Maybe a simple question to answer for the experienced HBase users and
   developers:
  
   If I use hash partitioning to evenly distribute write workloads into my
   region servers and later add a new region server to scale or split an
   existing region, then do I need to change my hash function and
 re-shuffle
   all the existing data in between all the region servers (old and new)?
  Or,
   is there any better solution for this? Any guidance would be very much
   helpful.
  
 
  You do not need to change your hash function.
 
  When you add a new regionserver, the balancer will move some of the
  existing regions to the new host.
 
  St.Ack
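
For illustration (not from the thread): the usual reason the client-side hash can stay fixed is that it maps keys to a fixed set of salt buckets baked into the rowkey, rather than to regions; HBase routes by rowkey range, so region splits and balancer moves stay invisible to writers. A minimal sketch, with the bucket count as an assumption:

    import java.util.Arrays;

    public class SaltedKey {
        private static final int BUCKETS = 16; // fixed, independent of the region count

        public static byte[] salt(byte[] key) {
            byte bucket = (byte) ((Arrays.hashCode(key) & 0x7fffffff) % BUCKETS);
            byte[] salted = new byte[key.length + 1];
            salted[0] = bucket;                               // 1-byte salt prefix
            System.arraycopy(key, 0, salted, 1, key.length);  // original key
            return salted;
        }
    }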
 



does an hbase cluster support multiple instances?

2013-06-27 Thread ch huang
hi all:
can hbase start more than one instance, like mysql? if so, how do I manage
these instances? thanks a lot


what's the relationship between hadoop datanode and hbase region node?

2013-06-27 Thread ch huang
ATT


How many column families in one table?

2013-06-27 Thread Vimal Jain
Hi,
How many column families should there be in an HBase table? Is there any
performance issue in read/write if we have more column families?
I have designed one table with around 14 column families, each
having on average 6 qualifiers.
Is it a good design?

-- 
Thanks and Regards,
Vimal Jain


RE: what's the relationship between hadoop datanode and hbase region node?

2013-06-27 Thread Sandeep L
HBase region data is stored in HFiles, and HFiles live on HDFS, so their blocks
are stored on the DataNodes.

Thanks,
Sandeep.

 Date: Fri, 28 Jun 2013 13:08:58 +0800
 Subject: what's the relationship between hadoop datanode and hbase region 
 node?
 From: justlo...@gmail.com
 To: user@hbase.apache.org
 
 ATT