Re: Hbase restart issues
Hello Rishabh,

Is your NN able to come out of safe mode by itself?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Mon, Apr 1, 2013 at 12:19 PM, Rishabh Agrawal rishabh.agra...@impetus.co.in wrote:

Hello,

Whenever I stop HBase and Hadoop gracefully (in that order) and then start Hadoop and HBase (in that order), HMaster refuses to start, citing ZooKeeper config issues. It seems that it is not able to re-connect with Hadoop. Any help will be really appreciated.

Thanks and Regards,
Rishabh Agrawal
Re: Hbase restart issues
Please delete the files in the /tmp folder on all the master/slave nodes and you should be good to go.

-Vibhav

On Mon, Apr 1, 2013 at 3:14 PM, Mohammad Tariq donta...@gmail.com wrote:

Hello Rishabh, is your NN able to come out of safe mode by itself?
RE: Hbase restart issues
Thanks everyone. It is working now. I deleted the temp files and it started working. But I am not able to understand this behavior; any thoughts on that?

-----Original Message-----
From: Vibhav Mundra [mailto:mun...@gmail.com]
Sent: Monday, April 01, 2013 3:40 PM
To: user@hbase.apache.org
Subject: Re: Hbase restart issues

Please delete the files in the /tmp folder on all the master/slave nodes and you should be good to go.
Re: Read thruput
Hi,

How big is your row? Are they wide rows, and what would be the size of every cell? How many read threads are getting used? Were you able to take a thread dump when this was happening? Have you seen the GC log? Maybe we need some more info before we can think about the problem.

Regards
Ram

On Mon, Apr 1, 2013 at 3:39 PM, Vibhav Mundra mun...@gmail.com wrote:

Hi All,

I am trying to use HBase for real-time data retrieval with a timeout of 50 ms. I am using 2 machines as datanodes and regionservers, and one machine as a master for Hadoop and HBase. But I am able to fire only 3000 queries per sec and 10% of them are timing out. The database has 60 million rows. Are these figures okay, or am I missing something?

I have set scanner caching to one, because each time we fetch a single row only.

Here are the various configurations:

Our schema:
{NAME => 'mytable', FAMILIES => [{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Configuration:
1 machine having both the HBase and Hadoop masters
2 machines each having both a region server and a datanode
285 regions in total

Machine-level optimizations:
a) Number of file descriptors is 100 (ulimit -n gives 100)
b) Increased the read-ahead value to 4096
c) Added noatime,nodiratime to the disks

Hadoop optimizations:
dfs.datanode.max.xcievers = 4096
dfs.block.size = 33554432
dfs.datanode.handler.count = 256
io.file.buffer.size = 65536
Hadoop data is split across 4 directories, so that different disks are being accessed.

HBase optimizations:
hbase.client.scanner.caching=1   # We have specifically added this, as we always return one row.
hbase.regionserver.handler.count=3200
hfile.block.cache.size=0.35
hbase.hregion.memstore.mslab.enabled=true
hfile.min.blocksize.size=16384
hfile.min.blocksize.size=4
hbase.hstore.blockingStoreFiles=200
hbase.regionserver.optionallogflushinterval=6
hbase.hregion.majorcompaction=0
hbase.hstore.compaction.max=100
hbase.hstore.compactionThreshold=100

HBase GC: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=20 -XX:ParallelGCThreads=16
Hadoop GC: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

-Vibhav
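As an aside, a minimal sketch of the kind of single-row read being described (my own illustration against the 0.92/0.94-era client API; the table name, family, row key and the use of hbase.rpc.timeout for the 50 ms budget are assumptions, not taken from the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 50);             // assumption: 50 ms per-RPC budget, as described above
    HTable table = new HTable(conf, "mytable");       // table name taken from the schema above
    Get get = new Get(Bytes.toBytes("someRowKey"));   // placeholder row key; a point read needs no scanner
    get.addFamily(Bytes.toBytes("cf"));               // column family from the schema above
    Result result = table.get(get);
    System.out.println(result);
    table.close();
  }
}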
Re: Read thruput
The typical size of each of my rows is less than 1 KB.

Regarding the memory, I have given 8 GB to the HBase regionservers and 4 GB to the datanodes, and I don't see them completely used. So I ruled out the GC aspect. In case you still believe that GC is an issue, I will upload the GC logs.

-Vibhav

On Mon, Apr 1, 2013 at 3:46 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote:

How big is your row? Are they wide rows, and what would be the size of every cell?
Re: Read thruput
Can you increase the block cache size? What version of HBase are you using?

Thanks

On Apr 1, 2013, at 3:47 AM, Vibhav Mundra mun...@gmail.com wrote:

The typical size of each of my rows is less than 1 KB.
Re: What is the output format of org.apache.hadoop.examples.Join?
Could you give more detailed information about your run parameters, Hadoop version, etc.? From the principle and the source code, your output is not reasonable. The reduce stage of MR will merge the values into a TupleWritable.

2013/3/28 jingguo yao yaojing...@gmail.com

Yanbo: Sorry for pasting the wrong result. The output for joining a.txt, b.txt and c.txt is as follows (still not the same as the one produced by Chris):

a0 [,,]
b0 [,,]
c0 [,,]
a1 [,,]
b1 [,,]
b2 [,,]
b3 [,,]
c1 [,,]
a2 [,,]
a3 [,,]
c2 [,,]
c3 [,,]

On Thu, Mar 28, 2013 at 11:46 AM, Yanbo Liang yanboha...@gmail.com wrote:

Your output is only a.txt joined with b.txt. You need to join c.txt as well.

2013/3/26 jingguo yao yaojing...@gmail.com

I am reading the following mail: http://www.mail-archive.com/core-user@hadoop.apache.org/msg04066.html

After running the following command (I am using Hadoop 1.0.4):

bin/hadoop jar hadoop-examples-1.0.4.jar join \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outKey org.apache.hadoop.io.Text \
  -joinOp outer \
  join/a.txt join/b.txt join/c.txt joinout

Then I run bin/hadoop fs -text joinout/part-0. I see the following result:

a0 [,]
b0 [,]
a1 [,]
b1 [,]
b2 [,]
b3 [,]
a2 [,]
a3 [,]

But Chris said that the result should be:

[a0,b0,c0]
[a1,b1,c1]
[a1,b2,c1]
[a1,b3,c1]
[a2,,]
[a3,,]
[,,c2]
[,,c3]

Is Join's output format changed for Hadoop 1.0.4?

-- Jingguo
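As an aside for readers unfamiliar with the example: with -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat each input line is split at its first tab into a key and a value, and the join example merges the per-source values for a key into one TupleWritable. A purely illustrative input and the output one would expect (the file contents below are made up, not taken from the thread):

a.txt: a0<TAB>A0
b.txt: a0<TAB>B0
c.txt: a0<TAB>C0

outer-join output: a0  [A0,B0,C0]

If an input line contains no tab, the whole line becomes the key and the value is empty, which shows up as empty slots in the tuple.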
Re: Read thruput
Can you output the GC log? The CMS GC settings should be optimized further; please look it up on the official site. Also, use vmstat to monitor the paging rate during queries.

On Apr 1, 2013 6:09 PM, Vibhav Mundra mun...@gmail.com wrote:

Hi All, I am trying to use HBase for real-time data retrieval with a timeout of 50 ms.
Re: Read thruput
I have used the following thread: http://grokbase.com/t/hbase/user/11bat80x7m/row-get-very-slow as the basis for lowering the value of the block cache.

-Vibhav

On Mon, Apr 1, 2013 at 4:23 PM, Ted yuzhih...@gmail.com wrote:

Can you increase the block cache size? What version of HBase are you using?
Inconsistencies in comparisons using KeyComparator
Hi,

I need to write some code that sorts row keys identically to HBase. I looked at the KeyValue.KeyComparator code, and it seems that, by default, HBase elects to use the 'Unsafe' comparer as the basis of its comparison, with a fall-back to the PureJavaComparer should Unsafe not be available (for example, in tests).

However, I'm finding that the sort order from a call to KeyValue.KeyComparator appears to be inconsistent between the two forms. As an example, comparing (first param, second param):

ffff616c1b to ffff61741b

gives 1 for the default (presumably Unsafe) call, and -1 using the PureJavaComparer. I would actually expect it to be a negative number, based on the difference of 6c to 74 in the 3rd-from-last byte above. Similarly,

0000616c1b to 000061741b

gives a result > 0 instead of < 0. The PureJavaComparer does a byte-by-byte comparison.

Is this expected? From the definition of lexicographical compare that I found, I don't think so. There's no issue of signed comparison here, because 0x6c and 0x74 are still positive byte values.

Regards
Alan
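For reference, lexicographic byte comparison treats each byte as unsigned and compares left to right, with a shorter prefix sorting first. A minimal standalone sketch of that contract (my own illustration of the expected behaviour, not HBase's internal comparer):

public class LexCompareSketch {
  // Compare two byte arrays as unsigned bytes, left to right; a shorter prefix sorts first.
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int ai = a[i] & 0xff;   // treat each byte as unsigned (0..255)
      int bi = b[i] & 0xff;
      if (ai != bi) {
        return ai - bi;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    byte[] left  = {(byte) 0xff, (byte) 0xff, 0x61, 0x6c, 0x1b};
    byte[] right = {(byte) 0xff, (byte) 0xff, 0x61, 0x74, 0x1b};
    // 0x6c < 0x74, so both example pairs in the mail should compare negative.
    System.out.println(compare(left, right));   // prints -8
  }
}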
Re: Inconsistencies in comparisons using KeyComparator
That is an interesting (disturbing) find, Alan. Hopefully the fallback is rare. Did you have a technique for making the compare fall back to the pure-Java compare?

Thank you,
St.Ack

On Mon, Apr 1, 2013 at 7:54 AM, Alan Chaney a...@mechnicality.com wrote:

I need to write some code that sorts row keys identically to HBase.
Re: Read thruput
I was aware of that discussion, which was about MAX_FILESIZE and BLOCKSIZE. My suggestion was about the block cache percentage.

Cheers

On Mon, Apr 1, 2013 at 4:57 AM, Vibhav Mundra mun...@gmail.com wrote:

I have used the following thread: http://grokbase.com/t/hbase/user/11bat80x7m/row-get-very-slow as the basis for lowering the value of the block cache.
Re: Inconsistencies in comparisons using KeyComparator
On 4/1/2013 9:42 AM, Stack wrote:

That is an interesting (disturbing) find, Alan. Hopefully the fallback is rare. Did you have a technique for making the compare fall back to the pure-Java compare?

I agree it's disturbing! I based my findings on reading the source code for 0.92.1 (the CDH4.1.2 distro).

It seems to me that, from org.apache.hadoop.hbase.KeyValue$KVComparator, the KeyComparator calls KeyComparator.compareRows, which in turn calls Bytes.compareTo(left, loffset, llength, right, roffset, rlength), which in turn calls Bytes.compareTo, which uses LexicographicalComparerHolder.BEST_COMPARER, which appears to be implemented thus:

static class LexicographicalComparerHolder {
  static final String UNSAFE_COMPARER_NAME =
      LexicographicalComparerHolder.class.getName() + "$UnsafeComparer";

  static final Comparer<byte[]> BEST_COMPARER = getBestComparer();

  /**
   * Returns the Unsafe-using Comparer, or falls back to the pure-Java
   * implementation if unable to do so.
   */
  static Comparer<byte[]> getBestComparer() {
    try {
      Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);
      ...
  }

  enum PureJavaComparer implements Comparer<byte[]> {
    INSTANCE;

    @Override
    public int compareTo(byte[] buffer1, int offset1, int length1, ...
  }
}

So it looks to me like Unsafe is the default. However, it's not really very easy to debug this, except by invoking KeyValue.KeyComparator and seeing what you get, which is what I did.

Either I'm doing something very stupid (extremely plausible) or there is a bit of an issue here. I was hoping that someone would point out my error! I've got some unit tests that appear to show the difference.

Thanks
Alan

On Mon, Apr 1, 2013 at 7:54 AM, Alan Chaney a...@mechnicality.com wrote:

I need to write some code that sorts row keys identically to HBase.
Re: Read thruput
Yes, I have changed the BLOCK CACHE % to 0.35.

-Vibhav

On Mon, Apr 1, 2013 at 10:20 PM, Ted Yu yuzhih...@gmail.com wrote:

I was aware of that discussion, which was about MAX_FILESIZE and BLOCKSIZE. My suggestion was about the block cache percentage.
Re: Read thruput
What is the general read throughput that one gets when using HBase? I am not able to achieve more than 3000/sec with a timeout of 50 milliseconds, and in this case too 10% of them are timing out.

-Vibhav

On Mon, Apr 1, 2013 at 11:20 PM, Vibhav Mundra mun...@gmail.com wrote:

Yes, I have changed the BLOCK CACHE % to 0.35.
HBase Types: Explicit Null Support
Heya,

Thinking about data types and serialization. I think null support is an important characteristic for the serialized representations, especially when considering the compound type. However, doing so is directly incompatible with fixed-width representations for numerics.

For instance, if we want to have a fixed-width signed long stored on 8 bytes, where do you put null? float and double types can cheat a little by folding negative and positive NaNs into a single representation (this isn't strictly correct!), leaving a place to represent null. In the long example case, the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This will free up an additional encoding which can be used for null. My experience working with scientific data, however, makes me wince at the idea.

The variable-width encodings have it a little easier. There's already enough going on that it's simpler to make room.

Remember, the final goal is to support order-preserving serialization. This imposes some limitations on our encoding strategies. For instance, it's not enough to simply encode null, it really needs to be encoded as 0x00 so as to sort lexicographically earlier than any other value.

What do you think? Any ideas, experiences, etc?

Thanks,
Nick
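To make the fixed-width tension concrete, here is a minimal sketch (my own illustration, not a proposed API) of the usual order-preserving 8-byte encoding for a signed long: flip the sign bit and write the bytes big-endian, so unsigned lexicographic byte order matches signed numeric order. Every byte pattern is already claimed by a legal long, which is why a null needs either a stolen value (e.g. MIN_VALUE) or an extra prefix byte:

import java.util.Arrays;

public class OrderPreservingLong {
  // Flip the sign bit so negative longs sort before positive ones when the
  // 8 bytes are compared as unsigned, left to right.
  static byte[] encode(long v) {
    long flipped = v ^ Long.MIN_VALUE;
    byte[] b = new byte[8];
    for (int i = 7; i >= 0; i--) {
      b[i] = (byte) flipped;   // write least-significant byte last => big-endian
      flipped >>>= 8;
    }
    return b;
  }

  public static void main(String[] args) {
    // -1 encodes as 0x7f ff .. ff and +1 as 0x80 00 .. 01, so -1 sorts first,
    // matching numeric order under unsigned lexicographic comparison.
    System.out.println(Arrays.toString(encode(-1L)));
    System.out.println(Arrays.toString(encode(1L)));
    // encode(Long.MIN_VALUE) is all 0x00 bytes, the lexicographically smallest
    // encoding, so there is no spare pattern left for a null that must sort first.
    System.out.println(Arrays.toString(encode(Long.MIN_VALUE)));
  }
}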
Re: Read thruput
Your hbase.regionserver.handler.count seems very high. The following is from hbase-default.xml: for an estimate of server-side memory used, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count. In your case, the above product would be 6GB :-)

On Mon, Apr 1, 2013 at 3:09 AM, Vibhav Mundra mun...@gmail.com wrote:

Hi All, I am trying to use HBase for real-time data retrieval with a timeout of 50 ms.
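For the arithmetic behind that estimate, assuming the default hbase.client.write.buffer of 2 MB (2,097,152 bytes):

2,097,152 bytes x 3,200 handlers = 6,710,886,400 bytes, i.e. roughly 6.25 GB of potential server-side buffering.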
Re: Inconsistencies in comparisons using KeyComparator
Looking at http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/share/classes/sun/misc/Unsafe.java, it looks like Unsafe is provided by OpenJDK as well. I guess this issue, though disturbing, wouldn't show up.

On Mon, Apr 1, 2013 at 10:04 AM, Alan Chaney a...@mechnicality.com wrote:

I agree it's disturbing! I based my findings on reading the source code for 0.92.1 (the CDH4.1.2 distro).
Re: HBase Types: Explicit Null Support
Hmmm... good question. I think that fixed-width support is important for a great many rowkey construction cases, so I'd rather see something like losing MIN_VALUE and keeping fixed width.

On 4/1/13 2:00 PM, Nick Dimiduk ndimi...@gmail.com wrote:

Thinking about data types and serialization. I think null support is an important characteristic for the serialized representations, especially when considering the compound type.
Re: HBase Types: Explicit Null Support
I spent some time this weekend extracting bits of our serialization code to a public github repo at http://github.com/hotpads/data-tools. Contributions are welcome - I'm sure we all have this stuff laying around.

You can see I've bumped into the NULL problem in a few places:
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java

Looking back, I think my latest opinion on the topic is to reject nullability as the rule since it can cause unexpected behavior and confusion. It's cleaner to provide a wrapper class (so both LongArrayList plus NullableLongArrayList) that explicitly defines the behavior and costs a little more in performance. If the user can't find a pre-made wrapper class, it's not very difficult for each user to provide their own interpretation of null and check for it themselves.

If you reject nullability, the question becomes what to do in situations where you're implementing existing interfaces that accept nullable params. The LongArrayList above implements List<Long>, which requires an add(Long) method. In the above implementation I chose to swap nulls with Long.MIN_VALUE; however, I'm now thinking it best to force the user to make that swap and then throw IllegalArgumentException if they pass null.

On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil doug.m...@explorysmedical.com wrote:

Hmmm... good question. I think that fixed-width support is important for a great many rowkey construction cases.
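A minimal sketch of the reject-nullability idea described above (an illustration only; the class name is made up and the actual hotpads data-tools code may differ): add(Long) refuses nulls outright, and a caller who wants a null-like value must choose a sentinel explicitly at the call site.

import java.util.ArrayList;
import java.util.List;

public class NoNullLongList {
  private final List<Long> backing = new ArrayList<Long>();

  // Reject nulls instead of silently swapping in a sentinel such as Long.MIN_VALUE.
  public void add(Long value) {
    if (value == null) {
      throw new IllegalArgumentException("null not supported; map it to a sentinel yourself");
    }
    backing.add(value);
  }

  public static void main(String[] args) {
    NoNullLongList list = new NoNullLongList();
    list.add(42L);             // fine
    list.add(Long.MIN_VALUE);  // caller-chosen sentinel, made explicit at the call site
    // list.add(null);         // would throw IllegalArgumentException
  }
}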
Re: Read thruput
What does your client call look like? Get? Scan? Filters? Is 3000/sec the number of client-side calls, or the number of rows per sec? If you measure in MB/sec, how much read throughput do you get? Where is your client located - on the same router as the cluster? Have you activated dfs read short circuit? If not, try it. Compression - try switching to Snappy; it should be faster. What else is running on the cluster in parallel to your reading client?

On Monday, April 1, 2013, Vibhav Mundra wrote:

What is the general read throughput that one gets when using HBase?
Re: HBase Types: Explicit Null Support
Thanks for the thoughtful response (and code!). I'm thinking I will press forward with a base implementation that does not support nulls. The idea is to provide an extensible set of interfaces, so I think this will not box us into a corner later. That is, a mirroring package could be implemented that supports null values and accepts the relevant trade-offs.

Thanks,
Nick

On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan mcor...@hotpads.com wrote:

I spent some time this weekend extracting bits of our serialization code to a public github repo at http://github.com/hotpads/data-tools.
Flume EventSerializer vs hbase coprocessor
I have a calculation that I'm doing in a custom AsyncHbaseEventSerializer. I want to do the calculation in real time, but it looks like it could be done either here or in a coprocessor. I'm just doing it in the serializer for now because the code is simpler that way, and data will only ever come in through Flume anyway. But is this good practice? I would welcome any advice or guidance.

A simplified version of the calculation: every row has a groupID and a data timestamp field; each groupID represents a distinct group of rows and the timestamp distinguishes between individual rows in the group. We can assume the combination is always unique. So I construct the rowkey as the concatenation of groupID, '.', and the reverse timestamp.

The task: for each such row to be inserted into HBase, find the latest row already inserted having the same groupID (based on the timestamp part of the key), and insert another column holding the difference between its time and that of the previous record. For each row the serializer sees, it looks up the previous row using a scan and takes the first row from the scan (that's why I'm using the reverse timestamp), finds the difference, and adds that to the list of PutRequests.

Example: the data having 2 rows looks like this:

<groupID>,123456, 'hello'
<groupID>,123400, 'there'

The result in HBase would look like this:

Row: <groupID>.123456, cf:v = 'hello', cf:dt = null   -- no previous row, so dt is null
Row: <groupID>.123400, cf:v = 'there', cf:dt = 56     -- dt is 56 ms, from 123456 - 123400

As shown, I've calculated the dt field from the previous record. The dt=56 means this record came from an event that was logged 56 ms later than the first one.

Is this a common practice, or am I crazy to be doing this in the serializer? Are there performance or reliability issues that I should be considering?
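A minimal sketch of the previous-row lookup described above, written against the plain synchronous HBase client for brevity (the poster's serializer uses asynchbase PutRequests); the table name, column names and the zero-padded decimal encoding of the reverse timestamp are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DeltaLookupSketch {
  // Zero-pad the reverse timestamp so lexicographic key order matches numeric order.
  static String rowKey(String groupId, long ts) {
    return groupId + "." + String.format("%019d", Long.MAX_VALUE - ts);
  }

  // Find the most recently timestamped row already stored for this group and
  // return the time difference, or null if the group has no rows yet.
  static Long deltaFromPrevious(HTable table, String groupId, long ts) throws Exception {
    byte[] groupPrefix = Bytes.toBytes(groupId + ".");
    Scan scan = new Scan(groupPrefix);              // reverse timestamps: latest event sorts first
    scan.setFilter(new PrefixFilter(groupPrefix));  // stay within this group
    scan.setCaching(1);
    ResultScanner scanner = table.getScanner(scan);
    try {
      Result latest = scanner.next();               // first row = latest row already inserted
      if (latest == null || latest.isEmpty()) {
        return null;                                // no previous row -> dt stays null
      }
      String key = Bytes.toString(latest.getRow());
      long prevTs = Long.MAX_VALUE - Long.parseLong(key.substring(key.lastIndexOf('.') + 1));
      return prevTs - ts;                           // e.g. 123456 - 123400 = 56, as in the example
    } finally {
      scanner.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");      // assumed table name
    System.out.println(deltaFromPrevious(table, "group42", 123400L));
    table.close();
  }
}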
Re: HBase Types: Explicit Null Support
From the SQL perspective, handling null is important. Phoenix supports null in the following ways:
- the absence of a key value
- an empty value in a key value
- an empty value in a multi-part row key
- for variable-length types (VARCHAR and DECIMAL), a null byte separator is used if it is not the last column
- for fixed-width types, only the last column is allowed to be null
As you mentioned, it's important to maintain the lexicographical sort order with nulls being first.
On 04/01/2013 01:32 PM, Nick Dimiduk wrote: Thanks for the thoughtful response (and code!). I'm thinking I will press forward with a base implementation that does not support nulls. The idea is to provide an extensible set of interfaces, so I think this will not box us into a corner later. That is, a mirroring package could be implemented that supports null values and accepts the relevant trade-offs. Thanks, Nick
On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan mcor...@hotpads.com wrote: I spent some time this weekend extracting bits of our serialization code to a public github repo at http://github.com/hotpads/data-tools. Contributions are welcome - I'm sure we all have this stuff lying around. You can see I've bumped into the NULL problem in a few places:
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
Looking back, I think my latest opinion on the topic is to reject nullability as the rule since it can cause unexpected behavior and confusion. It's cleaner to provide a wrapper class (so both LongArrayList plus NullableLongArrayList) that explicitly defines the behavior and costs a little more in performance. If the user can't find a pre-made wrapper class, it's not very difficult for each user to provide their own interpretation of null and check for it themselves. If you reject nullability, the question becomes what to do in situations where you're implementing existing interfaces that accept nullable params. The LongArrayList above implements List<Long>, which requires an add(Long) method. In the above implementation I chose to swap nulls with Long.MIN_VALUE; however, I'm now thinking it best to force the user to make that swap and then throw IllegalArgumentException if they pass null.
On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil doug.m...@explorysmedical.com wrote: Hmmm... good question. I think that fixed-width support is important for a great many rowkey construct cases, so I'd rather see something like losing MIN_VALUE and keeping fixed width.
On 4/1/13 2:00 PM, Nick Dimiduk ndimi...@gmail.com wrote: Heya, Thinking about data types and serialization. I think null support is an important characteristic for the serialized representations, especially when considering the compound type. However, doing so is directly incompatible with fixed-width representations for numerics. For instance, if we want to have a fixed-width signed long stored on 8 bytes, where do you put null? float and double types can cheat a little by folding negative and positive NaNs into a single representation (this isn't strictly correct!), leaving a place to represent null. In the long example case, the obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an additional encoding which can be used for null. My experience working with scientific data, however, makes me wince at the idea. The variable-width encodings have it a little easier. There's already enough going on that it's simpler to make room. Remember, the final goal is to support order-preserving serialization. This imposes some limitations on our encoding strategies. For instance, it's not enough to simply encode null; it really needs to be encoded as 0x00 so as to sort lexicographically earlier than any other value. What do you think? Any ideas, experiences, etc? Thanks, Nick
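To make the null byte separator rule concrete, here is a minimal sketch of encoding a two-part variable-length key. The encodeComposite helper and its byte layout are made up for illustration; this is not Phoenix's implementation.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CompositeKeySketch {

    // Hypothetical encoder: each variable-length part is followed by a 0x00
    // separator unless it is the last column. A null part contributes no
    // bytes of its own, so the separator alone marks it and the encoded key
    // still sorts nulls before any real value.
    static byte[] encodeComposite(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < parts.length; i++) {
            if (parts[i] != null) {
                byte[] bytes = parts[i].getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
            if (i < parts.length - 1) {
                out.write(0x00); // separator doubles as the null marker
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] nullFirst = encodeComposite(null, "b"); // 0x00 'b'
        byte[] aFirst = encodeComposite("a", "b");     // 'a' 0x00 'b'
        System.out.println(nullFirst.length + " bytes vs " + aFirst.length + " bytes");
    }
}

The obvious caveat: a value that can legitimately contain 0x00 bytes breaks this scheme, which is why the separator trick is usually limited to types such as VARCHAR.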
Re: HBase Types: Explicit Null Support
On Mon, Apr 1, 2013 at 4:31 PM, James Taylor jtay...@salesforce.com wrote: From the SQL perspective, handling null is important.
From your perspective, it is critical to support NULLs, even at the expense of having fixed-width encodings at all, or of supporting the full range of values. That is, you'd rather be able to represent NULL than -2^31?
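For readers weighing that trade-off, here is a sketch of the "give up one value, keep fixed width" option Doug raises: reserve Long.MIN_VALUE for NULL and flip the sign bit so the 8-byte encoding still sorts in numeric order. This only illustrates the idea under discussion; it is not code from the thread.

import java.nio.ByteBuffer;

public class NullableOrderedLong {

    static byte[] encode(Long v) {
        if (v != null && v.longValue() == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for NULL in this encoding");
        }
        long bits = (v == null) ? Long.MIN_VALUE : v.longValue();
        // Flipping the sign bit makes unsigned byte order match signed numeric
        // order; NULL becomes all-zero bytes and therefore sorts first.
        return ByteBuffer.allocate(8).putLong(bits ^ Long.MIN_VALUE).array();
    }

    static Long decode(byte[] b) {
        long bits = ByteBuffer.wrap(b).getLong() ^ Long.MIN_VALUE;
        return (bits == Long.MIN_VALUE) ? null : Long.valueOf(bits);
    }
}

The cost is exactly the one Nick winces at: one encodable value, -2^63, disappears from the domain.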
Re: HBase Types: Explicit Null Support
On 04/01/2013 04:41 PM, Nick Dimiduk wrote: From your perspective, it is critical to support NULLs, even at the expense of having fixed-width encodings at all, or of supporting the full range of values. That is, you'd rather be able to represent NULL than -2^31?
We've been able to get away with supporting NULL through the absence of the value rather than by restricting the data range. We haven't had any pushback on not allowing a fixed-width, nullable, leading row key column. Since our variable-length DECIMAL supports null and is a superset of the fixed-width numeric types, users have a reasonable alternative. I'd rather not restrict the range of values, since it doesn't seem like this would be necessary.
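As a small check on the "null must sort before everything" requirement mentioned earlier in the thread, here is a sketch using HBase's Bytes utility. Encoding null as a zero-length array is just an assumption for the demo, not a decision from the discussion.

import org.apache.hadoop.hbase.util.Bytes;

public class NullSortsFirst {
    public static void main(String[] args) {
        byte[] encodedNull = new byte[0];             // assumed null encoding
        byte[] smallValue = Bytes.toBytes("\u0001");  // smallest non-empty value
        byte[] ordinary = Bytes.toBytes("a");
        // Lexicographic comparison: the empty encoding is lowest, so a null
        // column never disturbs the ordering of real values.
        System.out.println(Bytes.compareTo(encodedNull, smallValue) < 0); // true
        System.out.println(Bytes.compareTo(smallValue, ordinary) < 0);    // true
    }
}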
Re: HBase Types: Explicit Null Support
I generally don't allow nulls in my composite row keys. Does SQL allow nulls in the PK? In the rare case I wanted to do that, I might create a separate format called NullableCInt32 with 5 bytes, where the first byte determines null. It's important to keep the pure types pure. I have lots of null *values*, however, but they're represented by the lack of a qualifier in the Put. If a row has all null values, I create a dummy qualifier with a dummy value to make sure the row key gets inserted, as it would be in SQL.
On Mon, Apr 1, 2013 at 4:49 PM, James Taylor jtay...@salesforce.com wrote: We've been able to get away with supporting NULL through the absence of the value rather than by restricting the data range. We haven't had any pushback on not allowing a fixed-width, nullable, leading row key column. Since our variable-length DECIMAL supports null and is a superset of the fixed-width numeric types, users have a reasonable alternative. I'd rather not restrict the range of values, since it doesn't seem like this would be necessary.
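A sketch of the convention Matt describes, written against the 0.94-era client API: null values simply produce no cell, and an all-null row gets a dummy cell so the row key still exists. The column family, column names, and the "_exists_" qualifier are invented for this example.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NullAsAbsentCell {
    private static final byte[] CF = Bytes.toBytes("cf");

    // Null values are not written at all; a row with every value null still
    // gets one dummy cell so the row key is inserted, as it would be in SQL.
    static Put toPut(byte[] rowKey, String name, Long score) {
        Put put = new Put(rowKey);
        boolean wroteSomething = false;
        if (name != null) {
            put.add(CF, Bytes.toBytes("name"), Bytes.toBytes(name));
            wroteSomething = true;
        }
        if (score != null) {
            put.add(CF, Bytes.toBytes("score"), Bytes.toBytes(score.longValue()));
            wroteSomething = true;
        }
        if (!wroteSomething) {
            put.add(CF, Bytes.toBytes("_exists_"), new byte[0]); // dummy cell
        }
        return put;
    }
}

Reads then treat a missing qualifier as NULL, which is exactly the "absence of a key value" case from Phoenix's list above.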
Re: HBase Types: Explicit Null Support
Furthermore, is it more important to support null values than to squeeze all representations into the minimum size (4 bytes for int32, etc.)?
On Apr 1, 2013 4:41 PM, Nick Dimiduk ndimi...@gmail.com wrote: On Mon, Apr 1, 2013 at 4:31 PM, James Taylor jtay...@salesforce.com wrote: From the SQL perspective, handling null is important. From your perspective, it is critical to support NULLs, even at the expense of having fixed-width encodings at all, or of supporting the full range of values. That is, you'd rather be able to represent NULL than -2^31?
Re: HBase Types: Explicit Null Support
I think having Int32 and NullableInt32 would support minimum overhead, as well as allowing SQL semantics.
On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk ndimi...@gmail.com wrote: Furthermore, is it more important to support null values than to squeeze all representations into the minimum size (4 bytes for int32, etc.)?
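One possible reading of the Int32/NullableInt32 pairing (echoing Matt's earlier NullableCInt32 idea): the plain type stays at 4 bytes, and the nullable variant pays one extra flag byte so that NULL sorts first and the full int range is preserved. The layout below is an assumption for illustration, not an agreed format.

import java.nio.ByteBuffer;

public class NullableInt32Sketch {

    // Pure 4-byte Int32: flip the sign bit so byte order matches numeric order.
    static byte[] encodeInt32(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    // NullableInt32: one leading flag byte. 0x00 means NULL (remaining bytes
    // stay zero), 0x01 means "value follows", so NULLs sort before every real
    // value and non-null ordering matches encodeInt32.
    static byte[] encodeNullableInt32(Integer v) {
        ByteBuffer buf = ByteBuffer.allocate(5);
        if (v == null) {
            return buf.array(); // all zeros
        }
        buf.put((byte) 0x01).putInt(v.intValue() ^ Integer.MIN_VALUE);
        return buf.array();
    }
}

The design choice matches the question just above: callers who never need NULL keep the minimum 4-byte footprint, and only nullable columns pay the extra byte.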
Re: Errors when starting Hbase service
Have you checked the region server log on server.epicoders.com,60020,1364559783898 around the time the NotServingRegionException was seen in the master log? What version of HBase are you using? Thanks
On Mon, Apr 1, 2013 at 9:20 PM, Praveen Bysani praveen.ii...@gmail.com wrote: Hi, when I try to restart the HBase service I see the following errors in my HBase master log:
2013-04-02 03:37:29,713 INFO org.apache.hadoop.hbase.master.metrics.MasterMetrics: Initialized
2013-04-02 03:37:29,797 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/server.epicoders.com,6,1364873849167 from backup master directory
2013-04-02 03:37:29,816 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/backup-masters/server.epicoders.com,6,1364873849167 already deleted, and this is not a retry
2013-04-02 03:37:29,816 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Master=server.epicoders.com,6,1364873849167
2013-04-02 03:37:31,830 WARN org.apache.hadoop.conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
2013-04-02 03:37:31,848 INFO org.apache.hadoop.hbase.master.SplitLogManager: found 0 orphan tasks and 0 rescan nodes
2013-04-02 03:37:32,349 WARN org.apache.hadoop.conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
2013-04-02 03:37:32,774 INFO org.apache.hadoop.hbase.master.HMaster: Server active/primary master; server.epicoders.com,6,1364873849167, sessionid=0x13daf9ed2b90086, cluster-up flag was=false
2013-04-02 03:37:32,817 INFO org.apache.hadoop.hbase.master.snapshot.SnapshotManager: Snapshot feature is not enabled, missing log and hfile cleaners.
2013-04-02 03:37:32,846 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/online-snapshot/acquired already exists and this is not a retry
2013-04-02 03:37:32,856 INFO org.apache.hadoop.hbase.procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
2013-04-02 03:37:33,095 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-04-02 03:37:33,175 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2013-04-02 03:37:33,178 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context master
2013-04-02 03:37:33,178 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2013-04-02 03:37:33,200 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 60010
2013-04-02 03:37:33,200 INFO org.mortbay.log: jetty-6.1.26.cloudera.2
2013-04-02 03:37:33,880 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:60010
2013-04-02 03:37:33,881 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2013-04-02 03:37:34,079 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=test3.jayeson.com.sg,60020,1364873839936
2013-04-02 03:37:34,084 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=test2.jayeson.com.sg,60020,1364873841105
2013-04-02 03:37:34,085 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=server.epicoders.com,60020,1364873849637
2013-04-02 03:37:34,091 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 210 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2013-04-02 03:37:34,103 WARN org.apache.hadoop.conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
2013-04-02 03:37:35,634 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 3, slept for 1752 ms, expecting minimum of 1, maximum of 2147483647, master is running.
2013-04-02 03:37:35,639 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://server.epicoders.com:8020/hbase/.logs/server.epicoders.com,60020,1364873849637 belongs to an existing region server
2013-04-02 03:37:35,639 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://server.epicoders.com:8020/hbase/.logs/test2.jayeson.com.sg,60020,1364873841105 belongs to an existing region server
2013-04-02 03:37:35,640 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://server.epicoders.com:8020/hbase/.logs/test3.jayeson.com.sg,60020,1364873839936 belongs to an existing region server
2013-04-02 03:37:35,640
Re: Read thruput
If you are concerned about latencies below 50 ms you should disable Nagle's algorithm. In hbase-site.xml:
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
You might get a further latency improvement if you do the same for HDFS. In hdfs-site.xml:
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
Also (as others have pointed out) you need to carefully control your garbage collections. Watch the HDFS replication count (3 by default, which does not make any sense with only 2 DNs), but since you're reading, that should make no difference. -- Lars
From: Vibhav Mundra mun...@gmail.com To: user@hbase.apache.org Sent: Monday, April 1, 2013 3:09 AM Subject: Read thruput
Hi All, I am trying to use HBase for real-time data retrieval with a timeout of 50 ms. I am using 2 machines as datanodes and region servers, and one machine as the master for Hadoop and HBase. But I am able to fire only 3000 queries per second and 10% of them are timing out. The database has 60 million rows. Are these figures okay, or am I missing something? I have set the scanner caching to one, because each time we are fetching a single row only. Here are the various configurations:
*Our schema*
{NAME => 'mytable', FAMILIES => [{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
*Configuration*
1 machine having both the HBase and Hadoop masters
2 machines having both a region server and a datanode; 285 regions in total
*Machine-level optimizations:*
a) Number of file descriptors is 100 (ulimit -n gives 100)
b) Increased the read-ahead value to 4096
c) Added noatime,nodiratime to the disks
*Hadoop optimizations:*
dfs.datanode.max.xcievers = 4096
dfs.block.size = 33554432
dfs.datanode.handler.count = 256
io.file.buffer.size = 65536
Hadoop data is split across 4 directories, so that different disks are being accessed
*HBase optimizations:*
hbase.client.scanner.caching=1  # we have specifically added this, as we always return one row
hbase.regionserver.handler.count=3200
hfile.block.cache.size=0.35
hbase.hregion.memstore.mslab.enabled=true
hfile.min.blocksize.size=16384
hfile.min.blocksize.size=4
hbase.hstore.blockingStoreFiles=200
hbase.regionserver.optionallogflushinterval=6
hbase.hregion.majorcompaction=0
hbase.hstore.compaction.max=100
hbase.hstore.compactionThreshold=100
*HBase GC*
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=20 -XX:ParallelGCThreads=16
*Hadoop GC*
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-Vibhav
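For what it's worth, the client-side half of these knobs can also be set programmatically; below is a rough sketch against the 0.94-era client API (HTable is deprecated in later releases). The table name comes from the thread, while the row key, the 50 ms hbase.rpc.timeout value, and the idea of setting the property in code rather than in hbase-site.xml are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LowLatencyGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Client-side half of the Nagle's setting; the server-side properties
        // still belong in hbase-site.xml / hdfs-site.xml as described above.
        conf.setBoolean("hbase.ipc.client.tcpnodelay", true);
        // Fail fast rather than letting a request sit past the 50 ms budget.
        conf.setInt("hbase.rpc.timeout", 50);

        HTable table = new HTable(conf, "mytable"); // table name from the thread
        try {
            Get get = new Get(Bytes.toBytes("example-row-key")); // made-up key
            Result result = table.get(get);
            System.out.println("row empty: " + result.isEmpty());
        } finally {
            table.close();
        }
    }
}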
Re: Errors when starting Hbase service
Hi, I have set up HBase using Cloudera; the version shows 'HBase 0.94.2-cdh4.2.0'. In this case both the master and the region server are the same machine, but during other instances the logs show the hostname of another region server with similar errors. Just a note that the master has a different system time from the region servers; is that an issue? I didn't find any errors while starting the service, but there is an error while shutting it down. Following is the log from the region server on the same machine during shutdown:
2013-04-02 03:36:54,390 INFO org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager gracefully.
2013-04-02 03:36:54,396 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 2 regions to close
2013-04-02 03:36:54,397 INFO org.apache.hadoop.hbase.regionserver.Store: Closed info
2013-04-02 03:36:54,398 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed -ROOT-,,0.70236052
2013-04-02 03:36:54,426 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://server.epicoders.com:8020/hbase/.META./1028785192/.tmp/1104b09fbaeb41829c2493875d7475c1: CompoundBloomFilterWriter
2013-04-02 03:36:54,558 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://server.epicoders.com:8020/hbase/.META./1028785192/.tmp/1104b09fbaeb41829c2493875d7475c1)
2013-04-02 03:36:54,558 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=90407, memsize=2.3k, into tmp file hdfs://server.epicoders.com:8020/hbase/.META./1028785192/.tmp/1104b09fbaeb41829c2493875d7475c1
2013-04-02 03:36:54,612 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://server.epicoders.com:8020/hbase/.META./1028785192/info/1104b09fbaeb41829c2493875d7475c1, entries=8, sequenceid=90407, filesize=1.6k
2013-04-02 03:36:54,669 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~2.3k/2320, currentsize=0/0 for region .META.,,1.1028785192 in 273ms, sequenceid=90407, compaction requested=false
2013-04-02 03:36:54,676 INFO org.apache.hadoop.hbase.regionserver.Store: Closed info
2013-04-02 03:36:54,676 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed .META.,,1.1028785192
2013-04-02 03:36:54,799 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server server.epicoders.com,60020,1364559783898; all regions closed.
2013-04-02 03:36:54,800 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
2013-04-02 03:36:54,861 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing leases
2013-04-02 03:36:54,862 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed leases
2013-04-02 03:36:54,864 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/server.epicoders.com,60020,1364559783898
2013-04-02 03:36:54,864 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 2000ms before retry #1...
2013-04-02 03:36:56,868 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/server.epicoders.com,60020,1364559783898
2013-04-02 03:36:56,871 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 4000ms before retry #2...
2013-04-02 03:36:59,051 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closing leases
2013-04-02 03:36:59,054 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closed leases
2013-04-02 03:37:00,872 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/server.epicoders.com,60020,1364559783898
2013-04-02 03:37:00,873 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 8000ms before retry #3...
2013-04-02 03:37:08,875 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/server.epicoders.com,60020,1364559783898
2013-04-02 03:37:08,876 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
2013-04-02 03:37:08,876 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/server.epicoders.com,60020,1364559783898
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)