Problems while exporting from HBase to CSV file
Hi, I am trying to export from HBase to a CSV file. I am using the Scan class to scan all the data in the table, but I am facing some problems while doing it.
1) My table has around 1.5 million rows and around 150 columns per row, so I cannot use the default Scan() constructor as it will scan the whole table in one go, which results in an OutOfMemory error in the client process. I have heard of using setCaching() and setBatch(), but I am not able to understand how they will solve the OOM error. I thought of providing startRow and stopRow in the Scan object, but I want to scan the whole table, so how will that help?
2) As HBase stores data for a row only when we explicitly provide it, and there is no concept of a default value as found in an RDBMS, I want to have each and every column in the CSV file I generate for every user. In case column values are not present in HBase, I want to use default values for them (I have a list of default values for each column). Is there any method in the Result class, or any other class, to accomplish this?
Please help here. -- Thanks and Regards, Vimal Jain
Re: Problems while exporting from HBase to CSV file
Sorry, maybe Phoenix is not suitable for you. On Thu, Jun 27, 2013 at 3:21 PM, Azuryy Yu azury...@gmail.com wrote: 1) Scan.setCaching() specifies the number of rows for caching that will be passed to scanners. And what's your block cache size? But if the OOM is from the client, not the server side, then I don't think this is Scan related; please check your client code. 2) We cannot add default values from HBase, but you can add them on your client when iterating the Result. Also, you can use Phoenix, which is cool for your scenario. https://github.com/forcedotcom/phoenix
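To make the suggestion above concrete, here is a minimal sketch of a client-side export loop that uses setCaching()/setBatch() to bound memory and falls back to a default when a column is absent. It assumes the 0.94-era HBase client API; the table, family, qualifier, and default value are made-up placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCsvExport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name
        Scan scan = new Scan();
        scan.setCaching(500);   // rows fetched per RPC instead of the default 1
        scan.setBatch(100);     // cap on columns returned per Result, bounds client memory per row
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // read one column, falling back to a default when the cell is missing
                byte[] raw = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("age"));
                String age = (raw != null) ? Bytes.toString(raw) : "0";  // "0" is the assumed default
                // ... append the value to the CSV line for this row ...
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

Note that with setBatch() a single row can be split across several Result objects, so when batching is in play the export code has to group Results by row key before emitting a CSV line.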
flushing + compactions after config change
Hi All, I wanted some help on understanding what's going on with my current setup. I updated my config to the following settings:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>50</value>
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

Prior to this, all the settings were default values. I wanted to increase the write throughput on my system and also control when major compactions happen. In addition to that, I wanted to make sure that my regions don't split quickly. After the change in settings, I am seeing a huge storm of memstore flushing and minor compactions, some of which get promoted to major compactions. The compaction queue is also way too high. For example, a few of the lines that I see in the logs are as follows: http://pastebin.com/Gv1S9GKX The regionserver whose logs are pasted above keeps on flushing and creating those small files, and shows the following metrics: memstoreSizeMB=657, compactionQueueSize=233, flushQueueSize=0, usedHeapMB=3907, maxHeapMB=10231 I am unsure why it's causing such a high number of small flushes (< 100m) even though the flush size is at 128m and there is no memory pressure. Any thoughts? Let me know if you need any more information; I also have ganglia running and can provide more metrics if needed. Thanks, Viral
Re: flushing + compactions after config change
"the flush size is at 128m and there is no memory pressure" You mean there is enough memstore reserved heap in the RS, so that there won't be premature flushes because of global heap pressure? What is the RS max mem, and how many regions and CFs in each? Can you check whether the flushes are happening because of too many WAL files? -Anoop-
Re: flushing + compactions after config change
Thanks for the quick response Anoop. The current memstore reserved (IIRC) would be 0.35 of total heap, right? The RS total heap is 10231MB, used is at 5000MB. The total number of regions is 217, and there are approx 150 regions with 2 families, ~60 with 1 family, and the remaining with 3 families. How do I check if the flushes are due to too many WAL files? Does it get logged? Thanks, Viral On Thu, Jun 27, 2013 at 12:51 AM, Anoop John anoop.hb...@gmail.com wrote: You mean there is enough memstore reserved heap in the RS, so that there won't be premature flushes because of global heap pressure? What is the RS max mem and how many regions and CFs in each? Can you check whether the flushes are happening because of too many WAL files? -Anoop-
Re: flushing + compactions after config change
If the memstore global upper limit is reached, you'll find "Blocking updates on ..." in your log files (see MemStoreFlusher.reclaimMemStoreMemory); if it's caused by too many log files, you'll find "Too many hlogs: logs=..." (see HLog.cleanOldLogs). Hope it's helpful for you :) Best, Liang From: Viral Bajaria [viral.baja...@gmail.com] Sent: June 27, 2013 16:18 To: user@hbase.apache.org Subject: Re: flushing + compactions after config change
Re: flushing + compactions after config change
Thanks Liang! Found the logs. I had gone overboard with my greps and missed the "Too many hlogs" line for the regions that I was trying to debug. A few sample log lines:
2013-06-27 07:42:49,602 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s): 0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14
2013-06-27 08:10:29,996 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s): 0e940167482d42f1999b29a023c7c18a
2013-06-27 08:17:44,719 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 2 regions(s): 0e940167482d42f1999b29a023c7c18a, e380fd8a7174d34feb903baa97564e08
2013-06-27 08:23:45,357 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 3 regions(s): 0e940167482d42f1999b29a023c7c18a, 3f486a879418257f053aa75ba5b69b14, e380fd8a7174d34feb903baa97564e08
Any pointers on what's the best practice for avoiding this scenario? Thanks, Viral
Re: flushing + compactions after config change
The config hbase.regionserver.maxlogs specifies the max number of WAL files and defaults to 32. But remember that if there are many log files to replay, then the MTTR will increase (in the RS-down case). -Anoop-
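For reference, that knob lives in hbase-site.xml alongside the settings quoted earlier in the thread; a sketch of raising it (the value 64 is only an illustration, not a recommendation, and raising it trades fewer forced flushes for a longer WAL replay on recovery):

<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>64</value>
</property>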
Re: flushing + compactions after config change
Hey Viral, which HBase version are you using?
Re: flushing + compactions after config change
0.94.4 with plans to upgrade to the latest 0.94 release. On Thu, Jun 27, 2013 at 2:22 AM, Azuryy Yu azury...@gmail.com wrote: hey Viral, Which hbase version are you using?
Re: flushing + compactions after config change
Can you paste your JVM options here? And do you have extensive writes on your HBase cluster? On Thu, Jun 27, 2013 at 5:47 PM, Viral Bajaria viral.baja...@gmail.com wrote: 0.94.4 with plans to upgrade to the latest 0.94 release. On Thu, Jun 27, 2013 at 2:22 AM, Azuryy Yu azury...@gmail.com wrote: Hey Viral, which HBase version are you using?
Re: flushing + compactions after config change
I do have a heavy write operation going on. Actually, heavy is relative. Not all tables/regions are seeing the same amount of writes at the same time. There is definitely a burst of writes that can happen on some regions. In addition to that, there are some processing jobs which play catch-up and could be processing data from the past, and they could have heavier write operations. I think my main problem is that my writes are well distributed across regions. A batch of puts most probably ends up hitting every region since they get distributed fairly well. In that scenario, I am guessing I get a lot of WALs, though I am just speculating. Regarding the JVM options (minus some settings for remote profiling): -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M On Thu, Jun 27, 2013 at 2:48 AM, Azuryy Yu azury...@gmail.com wrote: Can you paste your JVM options here? And do you have extensive writes on your HBase cluster?
Re: flushing + compactions after config change
By the way, don't use CMSIncrementalMode; IIRC it has actually been removed from HotSpot upstream. From: Viral Bajaria [viral.baja...@gmail.com] Sent: June 27, 2013 18:08 To: user@hbase.apache.org Subject: Re: flushing + compactions after config change
Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store
I think you will need to update your hash function and redistribute the data. As far as I know this has been one of the drawbacks of this approach (and the Sematext library). Regards, Shahab On Wed, Jun 26, 2013 at 7:24 PM, Joarder KAMAL joard...@gmail.com wrote: Maybe a simple question to answer for the experienced HBase users and developers: If I use hash partitioning to evenly distribute write workloads onto my region servers, and later add a new region server to scale or split an existing region, then do I need to change my hash function and re-shuffle all the existing data between all the region servers (old and new)? Or is there any better solution for this? Any guidance would be very much helpful. Thanks in advance. Regards, Joarder Kamal
Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store
Thanks Shahab for the reply. I was also thinking the same way. Could you guide me to any reference which can confirm this understanding? Regards, Joarder Kamal On 27 June 2013 23:24, Shahab Yunus shahab.yu...@gmail.com wrote: I think you will need to update your hash function and redistribute the data. As far as I know this has been one of the drawbacks of this approach (and the Sematext library). Regards, Shahab
HBase replication and DFS replication?
Hello, I am a bit confused about how the configurations for HBase replication and DFS replication work together. My application is deployed on an HBase cluster (0.94.3) with two region servers. The two Hadoop datanodes run on the same two region servers. Because we only have two datanodes, dfs.replication was set to 2. The person who configured this small cluster didn't explicitly set the HBase replication configs, which include: (1) in ${HBASE_HOME}/conf/hbase-site.xml, hbase.replication is not set. I think the default value is false according to http://hbase.apache.org/replication.html. (2) in the table, REPLICATION_SCOPE is set to 0 (by default). However, even without setting hbase.replication and REPLICATION_SCOPE, it appears that the tables are duplicated on the two region servers (as I can go to the shells of these two region servers and find the duplicate rows from a scan). My question is: does the default DFS replication take care of replicating HBase tables within the same cluster, so we don't need to set up the HBase replication configs? And only when we need to replicate HBase from one cluster to another cluster should we set up the HBase replication configs (1) and (2) above? Thanks! Jason
Re: HBase replication and DFS replication?
Jason, HBase replication is for between two HBase clusters, as you state. What you are seeing is merely the expected behavior within a single cluster. DFS replication is not involved directly here - the shell ends up acting like any other HBase client and constructing the scan the same way (i.e. finding the right region servers to do the scan by contacting ZK and the region server serving .META., issuing the scan requests to the proper RSes, etc.). It doesn't matter where you are running the client from. There is no replicating of HBase tables within the same cluster - you're just accessing the same table from different clients. Hope this helps, - Dave
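For completeness, cross-cluster replication (the feature hbase.replication actually controls) would be switched on roughly as below in hbase-site.xml on both clusters, and the column families to ship would need REPLICATION_SCOPE set to 1. This is only a sketch of the two settings mentioned in the question, not a full replication setup:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>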
Re: HBase replication and DFS replication?
Makes a lot of sense. Thanks Dave, Jason
Re: flushing + compactions after config change
Your JVM options are not enough. I will give you some details when I am back in the office tomorrow. --Sent from my Sony mobile.
Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store
I don't have a particular document or source stating this, but I think it is actually kind of self-explanatory if you think about the algorithm. Anyway, you can read this: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ And some older discussions by experts on this topic: http://search-hadoop.com/?q=prefix+salt+key+hotspot&fc_project=HBase Regards, Shahab On Thu, Jun 27, 2013 at 9:44 AM, Joarder KAMAL joard...@gmail.com wrote: Thanks Shahab for the reply. I was also thinking the same way. Could you guide me to any reference which can confirm this understanding? Regards, Joarder Kamal
Re: Schema design for filters
Not an easy task. You first need to determine how you want to store the data within a column and/or apply a type constraint to a column. Even if you use JSON records to store your data within a column, does an equality comparator exist? If not, you would have to write one. (I kinda think that one may already exist...) On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren sto...@gmail.com wrote: Hi, I am working with the standard filtering mechanism to scan rows that have columns matching certain criteria. There are columns of numeric (integer and decimal) and string types. These columns are single- or multi-valued, like "1", "2", "1,2,3", "a", "b" or "a,b,c" - not sure what the separator would be in the case of list types. Maybe none? I would like to compose the following queries to filter out rows that do not match.
- contains(String column, String value): single-valued column whose value String.contains() the provided value.
- equal(String column, Object value): single-valued column whose value Object.equals() the provided value. Value is either a string or numeric type.
- greaterThan(String column, java.lang.Number value): single-valued column whose value is greater than the provided numeric value.
- in(String column, Object value...): multi-valued column whose values Object.equals() all provided values. Values are of string or numeric type.
How would I design a schema that can take advantage of the already existing filters and comparators to accomplish this? I have already looked at the string and binary comparators but fail to see how to solve this in a clean way for multi-valued column values. I'm aware of custom filters but would like to avoid them if possible. Cheers, -Kristoffer
Profiling map reduce jobs?
Howdy, I want to take a look at an MR job which seems to be slower than I had hoped. Mind you, this MR job is only running on a pseudo-distributed VM (Cloudera CDH4). I have modified my mapred-site.xml with the following (that last one is commented out because it crashes my MR job):

<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-2</value>
</property>
<!--
<property>
  <name>mapred.task.profile.params</name>
  <value>agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
</property>
-->

Are there any resources that explain how to interpret the results? Or maybe an open-source app that could help display the results in a more intuitive manner? Ideally, we'd want to know where we are spending most of our time. Cheers, David
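One thing that stands out in the commented-out property: the value Hadoop itself ships as the default for mapred.task.profile.params begins with a leading dash, since it is passed straight to the task JVM as an agent option, so the missing "-" may be what crashes the job. That is stated from memory of the Hadoop defaults and has not been verified against this cluster, so treat it as an assumption; a sketch of the usual form:

<property>
  <name>mapred.task.profile.params</name>
  <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
</property>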
Re: flushing + compactions after config change
Thanks Azuryy, I look forward to it. Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be created? I tried looking around but could not find the details. On Thu, Jun 27, 2013 at 7:53 AM, Azuryy Yu azury...@gmail.com wrote: Your JVM options are not enough. I will give you some details when I am back in the office tomorrow. --Sent from my Sony mobile.
Re: flushing + compactions after config change
No, all your data eventually makes it into the log, just potentially not as quickly :) J-D On Thu, Jun 27, 2013 at 2:06 PM, Viral Bajaria viral.baja...@gmail.com wrote: Thanks Azuryy. Look forward to it. Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be created ? Tried looking around but could not find the details. On Thu, Jun 27, 2013 at 7:53 AM, Azuryy Yu azury...@gmail.com wrote: your JVM options arenot enough. I will give you some detail when I go back office tomorrow. --Send from my Sony mobile.
Re: Schema design for filters
I realize standard comparators cannot solve this. However, I do know the type of each column, so writing custom list comparators for boolean, char, byte, short, int, long, float, double seems quite straightforward. Long arrays, for example, are stored as a byte array with 8 bytes per item, so a comparator might look like this.

public class LongsComparator extends WritableByteArrayComparable {
    public int compareTo(byte[] value, int offset, int length) {
        long[] values = BytesUtils.toLongs(value, offset, length);
        for (long longValue : values) {
            if (longValue == val) {
                return 0;
            }
        }
        return 1;
    }
}

public static long[] toLongs(byte[] value, int offset, int length) {
    int num = (length - offset) / 8;
    long[] values = new long[num];
    for (int i = offset; i < num; i++) {
        values[i] = getLong(value, i * 8);
    }
    return values;
}

Strings are similar but would require a charset and a length for each string.

public class StringsComparator extends WritableByteArrayComparable {
    public int compareTo(byte[] value, int offset, int length) {
        String[] values = BytesUtils.toStrings(value, offset, length);
        for (String stringValue : values) {
            if (val.equals(stringValue)) {
                return 0;
            }
        }
        return 1;
    }
}

public static String[] toStrings(byte[] value, int offset, int length) {
    ArrayList<String> values = new ArrayList<String>();
    int idx = 0;
    ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
    while (idx < length) {
        int size = buffer.getInt();
        byte[] bytes = new byte[size];
        buffer.get(bytes);
        values.add(new String(bytes));
        idx += 4 + size;
    }
    return values.toArray(new String[values.size()]);
}

Am I on the right track, or maybe overlooking some implementation details? I am not really sure how to target each comparator to a specific column value? On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel michael_se...@hotmail.com wrote: Not an easy task. You first need to determine how you want to store the data within a column and/or apply a type constraint to a column. Even if you use JSON records to store your data within a column, does an equality comparator exist? If not, you would have to write one. (I kinda think that one may already exist...)
Re: Schema design for filters
You have to remember that HBase doesn't enforce any sort of typing. That's why this can be difficult. You'd have to write a coprocessor to enforce a schema on a table. Even then YMMV if you're writing JSON structures to a column, because while the contents of the structures could be the same, the actual strings could differ. HTH -Mike
Re: Schema design for filters
I see your point. Everything is just bytes. However, the schema is known and every row is formatted according to this schema, although some columns may not exist, that is, no value exists for this property on this row. So if I'm able to apply these typed comparators to the right cell values it may be possible? But I can't find a filter that targets specific columns? It seems like all filters scan every column/qualifier and there is no way of knowing which column is currently being evaluated? On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel michael_se...@hotmail.com wrote: You have to remember that HBase doesn't enforce any sort of typing. That's why this can be difficult. You'd have to write a coprocessor to enforce a schema on a table. Even then YMMV if you're writing JSON structures to a column, because while the contents of the structures could be the same, the actual strings could differ. HTH -Mike
Re: Problems while exporting from HBase to CSV file
Phoenix, Hive, Pig, Java would all work. But to Azuryy Yu's post... the OP is doing a simple scan() to get rows. If the OP is hitting an OOM exception, then it's a code issue on the part of the OP. On Jun 27, 2013, at 2:22 AM, Azuryy Yu azury...@gmail.com wrote: Sorry, maybe Phoenix is not suitable for you.
Re: Schema design for filters
OK... if you want to do type checking and schema enforcement, you will need to do this as a coprocessor. The quick and dirty way (not recommended) would be to hard-code the schema into the coprocessor code. A better way: at start-up, load up ZK to manage the set of known table schemas, which would be a map of column qualifier to data type. (If JSON, then you need to do a separate lookup to get the record's schema.) Then a single Java class does the lookup and handles the known data type comparators. Does this make sense? (Sorry, I was kinda thinking this out as I typed the response. But it should work.) At least it would be a design approach I would take. YMMV. Having said that, I expect someone to say it's a bad idea and that they have a better solution. HTH -Mike
Re: flushing + compactions after config change
Hey JD, Thanks for the clarification. I also came across a previous thread which sort of talks about a similar problem: http://mail-archives.apache.org/mod_mbox/hbase-user/201204.mbox/%3ccagptdnfwnrsnqv7n3wgje-ichzpx-cxn1tbchgwrpohgcos...@mail.gmail.com%3E I guess my problem is also similar in that my writes are well distributed, and at a given time I could be writing to a lot of regions. Some of the regions receive very little data, but since the flush algorithm chooses at random what to flush when "too many hlogs" is hit, it will flush a region with less than 10mb of data, causing too many small files. This in turn causes compaction storms where, even though major compaction is disabled, some of the minor compactions get upgraded to major, and that's when things start getting worse. My compaction queues are still the same, so I doubt I will come out of this storm without bumping up max hlogs for now. Reducing regions per server is one option, but then I will be wasting my resources since the servers at current load are at 30% CPU and 25% RAM. Maybe I can bump up heap space and give more memory to the memstore. Sorry, I am just thinking out loud. Thanks, Viral On Thu, Jun 27, 2013 at 2:40 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: No, all your data eventually makes it into the log, just potentially not as quickly :)
Re: Schema design for filters
Thanks for your help Mike. Much appreciated. I don't store rows/columns in JSON format. The schema is exactly that of a specific Java class, where the rowkey is a unique object identifier with the class type encoded into it. Columns are the field names of the class and the values are those of the object instance. I did think about coprocessors, but the schema is discovered at runtime and I can't hard-code it. However, I still believe that filters might work. I had a look at SingleColumnValueFilter, and this filter is able to target specific column qualifiers with specific WritableByteArrayComparables. But list comparators are still missing... so I guess the only way is to write these comparators? Do you follow my reasoning? Will it work?
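For reference, this is roughly what the thread is converging on: targeting a single qualifier with SingleColumnValueFilter and plugging in a custom comparator. A minimal sketch against the 0.94-era API; the family/qualifier names and the LongsComparator constructor are assumptions based on the code posted earlier in the thread, not an existing API.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TypedColumnScans {
    // Build a Scan that keeps only rows whose cf:longs cell "contains" 42L,
    // as decided by the custom WritableByteArrayComparable sketched above.
    public static Scan containsLong() {
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("cf"),          // column family (placeholder name)
                Bytes.toBytes("longs"),       // qualifier (placeholder name)
                CompareOp.EQUAL,              // a comparator result of 0 counts as a match
                new LongsComparator(42L));    // hypothetical constructor for the custom comparator
        filter.setFilterIfMissing(true);      // skip rows that do not have the column at all
        Scan scan = new Scan();
        scan.setFilter(filter);
        return scan;
    }
}

Several such filters, one per (column, predicate) pair, could then be combined in a FilterList to express the contains/equal/greaterThan/in queries from the original post.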
Re: flushing + compactions after config change
Hi Viral, the following are all needed for CMS:
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:CMSMaxAbortablePrecleanTime=300 -XX:+CMSScavengeBeforeRemark
and if your JDK version is greater than 1.6.23, then add:
-XX:+UseCompressedOops -XX:SoftRefLRUPolicyMSPerMB=0
and you'd better add GC logging:
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/gc.log
if your JDK version is greater than 1.6.23, then add:
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20m
Hope that's helpful. On Fri, Jun 28, 2013 at 5:06 AM, Viral Bajaria viral.baja...@gmail.com wrote: Thanks Azuryy, I look forward to it. Does DEFERRED_LOG_FLUSH impact the number of WAL files that will be created? I tried looking around but could not find the details.
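For anyone following along, flags like these are normally appended to the region server JVM through conf/hbase-env.sh; a sketch, assuming the standard HBASE_REGIONSERVER_OPTS hook (the exact flag list should be whatever you settle on from the advice above):

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_LOG_DIR/gc.log"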
Re: Schema design for filters
Hi Kristoffer, Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? You could model your schema much like an O/R mapper and issue SQL queries through Phoenix for your filtering. James @JamesPlusPlus http://phoenix-hbase.blogspot.com
What is the max number of columns that a column family can have?
ATT
Re: What is the max number of columns that a column family can have?
Your row can be very wide. Take a look at the first paragraph in this comment: https://issues.apache.org/jira/browse/HBASE-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633620#comment-13633620 Cheers On Fri, Jun 28, 2013 at 10:40 AM, ch huang justlo...@gmail.com wrote: ATT
Re: Adding a new region server or splitting an old region in a Hash-partitioned HBase Data Store
I would suggest you write a custom load balancer and have your hashing algo determine how the load balancing should happen. Hope this helps. Regards, Ram On Fri, Jun 28, 2013 at 5:32 AM, Joarder KAMAL joard...@gmail.com wrote: Thanks St.Ack for mentioning the load-balancer. But my question was two-fold: Case-1. If a new RS is added, then the load-balancer will do its job, assuming no new region has been created in the meanwhile. // As you've already answered. Case-2. Whether a new RS is added or not, if an existing region is split into two, then how will the new writes go to the new region? Because, let's say, initially the hash function was calculated with *N* regions and now there are *N+1* regions in the cluster. In that case, do I need to change the hash function and reshuffle all the existing data within the cluster? Or does HBase have some mechanism to handle this? Many thanks again for helping me out... Regards, Joarder Kamal On 28 June 2013 02:26, Stack st...@duboce.net wrote: You do not need to change your hash function. When you add a new regionserver, the balancer will move some of the existing regions to the new host. St.Ack
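If the custom-balancer route is taken, HBase loads the balancer class from configuration; a sketch of wiring one in via hbase-site.xml (the class name here is a hypothetical placeholder for your own implementation of the LoadBalancer interface):

<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>com.example.HashAwareLoadBalancer</value>
</property>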
Does an HBase cluster support multiple instances?
Hi all: can HBase start more than one instance, like MySQL? If it can, how do I manage these instances? Thanks a lot.
What's the relationship between a Hadoop datanode and an HBase region node?
ATT
How many column families in one table ?
Hi, How many column families should there be in an HBase table? Is there any performance issue in read/write if we have more column families? I have designed one table with around 14 column families in it, with each having on average 6 qualifiers. Is it a good design? -- Thanks and Regards, Vimal Jain
RE: What's the relationship between a Hadoop datanode and an HBase region node?
HBase regions are stored in HFiles, and HFiles use the datanodes to store their data. Thanks, Sandeep. Date: Fri, 28 Jun 2013 13:08:58 +0800 Subject: What's the relationship between a Hadoop datanode and an HBase region node? From: justlo...@gmail.com To: user@hbase.apache.org ATT