[jira] [Created] (HBASE-12941) CompactionRequestor - a private interface class with no users

2015-01-28 Thread ryan rawson (JIRA)
ryan rawson created HBASE-12941:
---

 Summary: CompactionRequestor - a private interface class with no 
users
 Key: HBASE-12941
 URL: https://issues.apache.org/jira/browse/HBASE-12941
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: ryan rawson


CompactionRequestor is an 'interface audience private' class with no users in 
the HBase code base.  Unused things should be deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: A face-lift for 1.0

2014-12-05 Thread Ryan Rawson
I'm checking to see if our marketing and web team can help.  The
primary requirement is going to be ditching the mvn:site from the
front page.  Reskinning it might not be so easy.

-ryan

On Thu, Dec 4, 2014 at 9:52 AM, lars hofhansl la...@apache.org wrote:
 +1
 I just came across one of the various HBase vs. Cassandra articles, and one 
 of the main tenets of the article was how much better the Cassandra 
 documentation was. Thank god we have Misty now. :)

 (not sure how much just a skin would help, but it surely won't hurt)

 -- Lars
   From: Nick Dimiduk ndimi...@gmail.com
  To: hbase-dev dev@hbase.apache.org
  Sent: Tuesday, December 2, 2014 9:46 AM
  Subject: A face-lift for 1.0

 Heya,

 In mind of the new release, I was thinking we should clean up our act a
 little bit in regard to hbase.apache.org and our book. Just because the
 project started in 2007 doesn't mean we need a site that looks like it's
 from 2007. Phoenix's site looks great in this regard.

 For the home page, I was thinking of converting it over to bootstrap [0] so
 that it'll be easier to pick up a theme, either one of our own or something
 pre-canned [1]. I'm no web designer, but the idea is this would make it
 easier for someone who is to help us out.

 For the book, I just want to skin it -- no intention of changing the docbook
 part (such a decision I'll leave up to Misty). I'm less sure on this
 project, but Riak's docs [2] are a nice inspiration.

 What do you think? Do we know any web designers who can help out with the
 CSS?

 -n

 [0]: http://getbootstrap.com
 [1]: https://wrapbootstrap.com/
 [2]: http://docs.basho.com/riak/latest/





Call for Presentations - HBase User group meeting

2014-11-10 Thread Ryan Rawson
Hi all,

The next HBase user group meeting is on November the 20th.  We need a
few more presenters still!

Please send me your proposals - summary and outline of your talk!

Thanks!
-ryan


[jira] [Created] (HBASE-12260) MasterServices - remove from coprocessor API (Discuss)

2014-10-14 Thread ryan rawson (JIRA)
ryan rawson created HBASE-12260:
---

 Summary: MasterServices - remove from coprocessor API (Discuss)
 Key: HBASE-12260
 URL: https://issues.apache.org/jira/browse/HBASE-12260
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: ryan rawson
Priority: Minor


A major issue with MasterServices is that the MasterCoprocessorEnvironment exposes 
this class even though MasterServices is tagged with @InterfaceAudience.Private.

This means that the entire internals of the HMaster are essentially part of the 
coprocessor API.  Many of the classes returned by the MasterServices API are 
highly internal, extremely powerful, and subject to constant change.

Perhaps a new API to replace MasterServices, one that is use-case focused and 
justified by real-world coprocessors, would suit things better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12192) Remove EventHandlerListener

2014-10-07 Thread ryan rawson (JIRA)
ryan rawson created HBASE-12192:
---

 Summary: Remove EventHandlerListener
 Key: HBASE-12192
 URL: https://issues.apache.org/jira/browse/HBASE-12192
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: ryan rawson


EventHandlerListener isn't actually being used by internal HBase code right 
now.  No one actually calls 'ExecutorService.registerListener()' according to 
IntelliJ.

It might be possible that some coprocessors use it. Perhaps people can comment 
on whether they find this functionality useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Reply: DISCUSSION: Component Lieutenants?

2012-09-22 Thread ryan rawson
I'd like to contribute; we are learning some very interesting things here, and 
I'd like to feed back as much as possible, but I just can't guarantee a response 
time.

Sent from your iPhone

On Sep 21, 2012, at 10:14 AM, Andrew Purtell apurt...@apache.org wrote:

 On Fri, Sep 21, 2012 at 1:49 AM, Ryan Rawson ryano...@gmail.com wrote:
 This is a cool idea, I'd like to contribute, but I'll need coverage
 since I cannot guarantee my time (since it doesn't belong to me
 anyways).
 
 What do you need Ryan?
 
 -- 
 Best regards,
 
   - Andy
 
 Problems worthy of attack prove their worth by hitting back. - Piet
 Hein (via Tom White)


Re: Reply: DISCUSSION: Component Lieutenants?

2012-09-21 Thread Ryan Rawson
This is a cool idea, I'd like to contribute, but I'll need coverage
since I cannot guarantee my time (since it doesn't belong to me
anyways).


Metrics, be better already (The legend of graphite)

2012-09-21 Thread Ryan Rawson
Hey folks,

I built something out here at DTS, and I wanted to get feedback to see
if it was interesting for anyone else... I built a no-dependency
GraphiteReportingContext module that allows hadoop metrics to be
exported to a graphite service.  We've been using it here for a month
and it works quite nicely.  Graphite is not as easy as it should be to
set up (I have some automated packaging scripts too), but it sure
beats everything else, and doubly so Ganglia.

Anyone think this is interesting?
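
[For context, Graphite's plaintext protocol is just newline-delimited
"metric.path value timestamp" records sent over a TCP socket (port 2003 by
default), so a reporter along these lines boils down to something like the
sketch below. This is illustrative only - the host and metric name are made
up, and it is not Ryan's actual GraphiteReportingContext code.]

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

// Minimal sketch: push one metric sample to Graphite's plaintext listener.
public class GraphitePushExample {
  public static void main(String[] args) throws Exception {
    long nowSeconds = System.currentTimeMillis() / 1000L;
    try (Socket socket = new Socket("graphite.example.com", 2003);
         Writer out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8")) {
      // Format: <dotted.metric.path> <value> <unix-timestamp>\n
      out.write("hbase.regionserver.requests 1234 " + nowSeconds + "\n");
      out.flush();
    }
  }
}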


Re: testing, powermock, jmockit - from pow-wow

2012-09-21 Thread Ryan Rawson
I've used mockito a few times, and it's great... but it can make your
tests very brittle.  It can also be hard to successfully use if the
code is complex.  For example I had a class that took an HBaseAdmin,
and I mocked out the few calls it used.  Then when I needed to access
Configuration, things went downhill fast.  I ended up abandoning
easymock even.

The issue ultimately stems from not writing your code in a certain way,
with a minimal set of easy-to-mock external interfaces.  When this isn't
true, then easymock does nothing for you.  It can save your bacon if
you are trying to unit test something deep, though.
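
[To make the brittleness concrete, a hedged Mockito sketch: only the calls we
planned for are stubbed, so the moment the code under test reaches for
anything else (here assuming HBaseAdmin's tableExists and getConfiguration
methods, and a made-up table name) it gets Mockito's default return value and
falls over far from the real cause.]

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.hbase.client.HBaseAdmin;

public class BrittleMockExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = mock(HBaseAdmin.class);
    when(admin.tableExists("mytable")).thenReturn(true);  // the call we planned for

    System.out.println(admin.tableExists("mytable"));     // true - stubbed, works fine

    // Unstubbed interaction: Mockito returns its default (null here), so any
    // code that goes on to use the Configuration blows up later with an NPE.
    System.out.println(admin.getConfiguration());         // null
  }
}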

The other question I guess is integration testing... there is no
specific good reason why everything is done in one JVM, except 'because
we can'.  A longer-lived 'minicluster' could amortize the cost of
running one.

-ryan

On Fri, Sep 21, 2012 at 9:06 AM, Rogerio rliesenf...@gmail.com wrote:
 lars hofhansl lhofhansl@... writes:

  To get the low-level access we could instead use jmockit at the cost of
 dealing with code-weaving.

 As we had discussed, this scares me :).
 I do not want to have to debug some test code that was weaved (i.e. has no
 matching source code lying around *anywhere*).


 I think you are imagining a problem that does not exist. JMockit users can 
 debug
 Java code just fine...




Re: HBASE-2182

2012-06-30 Thread Ryan Rawson
On Fri, Jun 29, 2012 at 5:04 PM, Todd Lipcon t...@cloudera.com wrote:
 A few inline notes below:

 On Fri, Jun 29, 2012 at 4:42 PM, Elliott Clark ecl...@stumbleupon.comwrote:

 I just posted a pretty early skeleton(
 https://issues.apache.org/jira/browse/HBASE-2182) on what I think a netty
 based hbase client/server could look like.

 Pros:

   - Faster
      - Giraph got a 3x perf improvement by dropping hadoop rpc


 Whats the reference for this? The 3x perf I heard about from Giraph was
 from switching to using LMAX's Disruptor instead of queues, internally. We
 could do the same, but I'm not certain the model works well for our use
 cases where the RPC processing can end up blocked on disk access, etc.


      - Asynchbase trounces our client when JD benchmarked them


 I'm still convinced that the majority of this has to do with the way our
 batching happens to the server, not async vs sync. (in the current sync
 client, once we fill up the buffer, we flush from the same thread, and
 block the flush until all buffered edits have made it, vs doing it in the
 background). We could fix this without going to a fully async model.

I also agree here; if you do the a priori code analysis, it becomes
obvious that the issue is that slower regionservers can hold up entire
batches even if 90%+ of the Puts were already acked...

And don't forget that we used to issue Puts to regionservers SERIALLY
until we did the current parallelism code... (not that the code is
great, but it was relatively easy to fix at the time).
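
[For readers who don't remember the client of that era, the blocking being
described looks roughly like this sketch against the old HTable write-buffer
API; the table, family and qualifier names are made up.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    table.setAutoFlush(false);                 // buffer puts client-side

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
      // When the write buffer fills, this call flushes from the same thread
      // and blocks until every regionserver in the batch has acked -- the
      // "slowest server holds up the whole batch" effect described above.
      table.put(put);
    }
    table.flushCommits();                      // final synchronous flush
    table.close();
  }
}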






   - Could encourage things to be a little more modular if everything isn't
   hanging directly off of HRegionServer

 Sure, but not sure I see why this is Netty vs not-Netty


   - Netty is better about thread usage than hadoop rpc server.

 Can you explain further?


   - Pretty easy to define an rpc protocol after all of the work on
   protobuf (Thanks everyone)
   - Decoupling the rpc server library from the hadoop library could allow
   us to rev the server code easier.
   - The filter model is very easy to work with (see the pipeline sketch at
     the end of this message).
      - Security can be just a single filter.
      - Logging can be another.
      - Stats can be another.

 Cons:

   - Netty and non-Apache RPC servers don't play well together.  They might
   be able to, but I haven't gotten there yet.

 What do you mean non apache rpc servers?


   - Complexity
      - Two different servers in the src
      - Confusing users who don't know which to pick
   - Non-blocking could make the client harder to write.


 I'm really just trying to gauge what people think of the direction and if
 it's still something that is wanted.  The code is a long way from even
 being a tech demo, and I'm not a Netty expert, so suggestions would be
 welcomed.

 Thoughts? Are people interested in this? Should I push this to my github
 so others can help?


 IMO, I'd want to see a noticeable perf difference from the change -
 unfortunately it would take a fair amount of work to get to the point where
 you could benchmark it. But if you're willing to spend the time to get to
 that point, seems worth investigating.

 --
 Todd Lipcon
 Software Engineer, Cloudera
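
[On the "filter model" point in the pros list above: in Netty, cross-cutting
concerns such as security, logging and stats are just handlers composed in a
ChannelPipeline. A minimal sketch, written against the modern Netty 4 API
rather than the version current in 2012, with invented handler names:]

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;

public class RpcPipelineSketch extends ChannelInitializer<SocketChannel> {

  static class LoggingHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
      System.out.println("request from " + ctx.channel().remoteAddress());
      ctx.fireChannelRead(msg);        // hand the message to the next handler
    }
  }

  static class StatsHandler extends ChannelInboundHandlerAdapter {
    private long requests;
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
      requests++;                      // count the request, then keep going
      ctx.fireChannelRead(msg);
    }
  }

  @Override
  protected void initChannel(SocketChannel ch) {
    ch.pipeline()
      .addLast("logging", new LoggingHandler())
      .addLast("stats", new StatsHandler());
    // an auth/security handler and the actual RPC dispatcher would slot in here
  }
}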


Re: Bay Area HBase User Group organizer change (?)

2011-10-01 Thread Ryan Rawson
Good job Andrew. Don't forget to expense it - problem solved!

-ryan

On Sat, Oct 1, 2011 at 6:26 PM, Ted Dunning tdunn...@maprtech.com wrote:
 I can get some sponsorship going on my end as well.

 On Sun, Oct 2, 2011 at 12:09 AM, Ted Yu yuzhih...@gmail.com wrote:

 I agree. We should share the payment.

 On Sat, Oct 1, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote:

  Thanks, Andrew! Let us know if we can chip in for the dues.
 
  -Todd
 
  On Sat, Oct 1, 2011 at 4:38 PM, Andrew Purtell apurt...@apache.org
  wrote:
   I went to RSVP for the upcoming HUG in NYC after Hadoop World. Meetup
  complained the Bay Area HBase User Group was missing an organizer (who
 pays
  dues), and would be deleted after 14 days. I've paid up for us for the
 next
  6 months, and am now the organizer. I'll figure out what is required of
  that, but please pardon if something is not quite right at first.
  
   Best regards,
  
      - Andy
  
   Problems worthy of attack prove their worth by hitting back. - Piet
 Hein
  (via Tom White)
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 




Re: prefix compression implementation

2011-09-19 Thread Ryan Rawson
I was just pushing back at the idea of 'turn everything into
interfaces! problem solved!', and thinking about what was really
necessary to get to where you want to go...

On Mon, Sep 19, 2011 at 3:26 PM, Matt Corgan mcor...@hotpads.com wrote:
 Ryan - i answered your question on another thread yesterday.  Will use this
 thread to continue conversation on the KeyValue interface.

 I don't think the name is all that important, though i thought HCell was
 less clumsy than KeyValue or KeyValueInterface.  Take a look at this
 interface on github:

 https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/model/HCell.java

 Seems like it should be trivially easy to get KeyValue to implement that.
  Then it provides the right methods to make compareTo methods that will work
 across different implementations.  The implementations of those methods
 might have an if-statement to determine the class of the other HCell, and
 choose the fastest byte comparison method behind the scenes.

 I need to look into the KeyValue scanner interfaces


 On Fri, Sep 16, 2011 at 7:34 PM, Ryan Rawson ryano...@gmail.com wrote:

 On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan mcor...@hotpads.com wrote:
  Ryan - thanks for the feedback.  The situation I'm thinking of where it's
  useful to parse DirectBB without copying to heap is when you are serving
  small random values out of the block cache.  At HotPads, we'd like to
 store
  hundreds of GB of real estate listing data in memory so it can be quickly
  served up at random.  We want to access many small values that are
 already
  in memory, so basically skipping step 1 of 3 because values are already
 in
  memory.  That being said, the DirectBB are not essential for us since we
  haven't run into gb problems, i just figured it would be nice to support
  them since they seem to be important to other people.
 
  My motivation for doing this is to make hbase a viable candidate for a
  large, auto-partitioned, sorted, *in-memory* database.  Not the usual
  analytics use case, but i think hbase would be great for this.

 What exactly about the current system makes it not a viable candidate?





 
 
  On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson ryano...@gmail.com wrote:
 
  On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan mcor...@hotpads.com
 wrote:
   I'm a little confused over the direction of the DBBs in general, hence
  the
   lack of clarity in my code.
  
   I see value in doing fine-grained parsing of the DBB if you're going
 to
  have
   a large block of data and only want to retrieve a small KV from the
  middle
   of it.  With this trie design, you can navigate your way through the
 DBB
    without copying hardly anything to the heap.  It would be a shame to blow
   away
   your entire L1 cache by loading a whole 256KB block onto heap if you
 only
   want to read 200 bytes out of the middle... it can be done
   ultra-efficiently.
 
  This paragraph is not factually correct.  The DirectByteBuffer vs main
  heap has nothing to do with the CPU cache.  Consider the following
  scenario:
 
  - read block from DFS
  - scan block in ram
  - prepare result set for client
 
  Pretty simple, we have a choice in step 1:
  - write to java heap
  - write to DirectByteBuffer off-heap controlled memory
 
  in either case, you are copying to memory, and therefore cycling thru
  the cpu cache (of course).  The difference is whether the Java GC has
  to deal with the aftermath or not.
 
  So the question DBB or not is not one about CPU caches, but one
  about garbage collection.  Of course, nothing is free, and dealing
  with DBB requires extensive in-situ bounds checking (look at the
  source code for that class!), and also requires manual memory
  management on the behalf of the programmer.  So you are faced with an
   expensive API (getByte is not as good as an array get), and a lot more
   homework to do.  I have decided it's not worth it personally and am not
   chasing that line as a potential performance improvement, and I would
   also encourage you not to.
 
  Ultimately the DFS speed issues need to be solved by the DFS - HDFS
  needs more work, but alternatives are already there and are a lot
  faster.
 
 
 
 
 
  
   The problem is if you're going to iterate through an entire block made
 of
   5000 small KV's doing thousands of DBB.get(index) calls.  Those are
 like
  10x
   slower than byte[index] calls.  In that case, if it's a DBB, you want
 to
   copy the full block on-heap and access it through the byte[]
 interface.
   If
   it's a HeapBB, then you already have access to the underlying byte[].
 
  Yes this is the issue - you have to take an extra copy one way or
  another.  Doing effective prefix compression with DBB is not really
  feasible imo, and that's another reason why I have given up on DBBs.
 
  
   So there's possibly value in implementing both methods.  The main
 problem
  i
   see is a lack of interfaces in the current code base.  I'll throw one

Re: prefix compression implementation

2011-09-19 Thread Ryan Rawson
So if the HCell or whatever ends up returning ByteBuffers, then that
plays straight into scatter/gather NIO calls, and if some of them are
DBB, then so much the merrier.

For example, the thrift stuff takes ByteBuffers when it's calling for a
byte sequence.
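
[For reference, "scatter/gather" here means handing the kernel an array of
buffers in a single call; in Java that is GatheringByteChannel.write(ByteBuffer[]),
which SocketChannel implements, so cells exposed as ByteBuffers - heap or
direct - can go to the wire without first being merged into one array. A
minimal sketch with a made-up destination and payload:]

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class GatherWriteExample {
  public static void main(String[] args) throws Exception {
    ByteBuffer[] cells = {
        ByteBuffer.wrap("row1/cf:q1=v1".getBytes(StandardCharsets.UTF_8)),
        ByteBuffer.wrap("row1/cf:q2=v2".getBytes(StandardCharsets.UTF_8)),
        ByteBuffer.allocateDirect(16)      // a direct buffer mixes in just as well
    };
    try (SocketChannel ch =
             SocketChannel.open(new InetSocketAddress("localhost", 9090))) {
      long written = ch.write(cells);      // one gathering write, no merge copy
      System.out.println("wrote " + written + " bytes");
    }
  }
}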

-ryan

On Mon, Sep 19, 2011 at 10:39 PM, Stack st...@duboce.net wrote:
 One other thought is that exposing ByteRange, ByteBuffer, and v1 array
 stuff in Interface seems like you are exposing 'implementation'
 details that perhaps shouldn't show through.  I'm guessing it's
 unavoidable though if the Interface is to be used in a few different
 contexts: i.e. v1 has to work if we are to get this new stuff in,
 some srcs will be DBBs, etc.

 St.Ack

 On Mon, Sep 19, 2011 at 10:33 PM, Stack st...@duboce.net wrote:
 On Mon, Sep 19, 2011 at 3:26 PM, Matt Corgan mcor...@hotpads.com wrote:
 I don't think the name is all that important, though i thought HCell was
 less clumsy than KeyValue or KeyValueInterface.  Take a look at this
 interface on github:

 https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/model/HCell.java

 Seems like it should be trivially easy to get KeyValue to implement that.
  Then it provides the right methods to make compareTo methods that will work
 across different implementations.  The implementations of those methods
 might have an if-statement to determine the class of the other HCell, and
 choose the fastest byte comparison method behind the scenes.


 I'd say call it Cell rather than HCell.

 You have getRowArray rather than getRow which we currently have but I
 suppose it makes sense since you can then group by suffix.

 There is a patch lying around that adds a version to KV by using top
 two bits of the type byte.  If you need me to dig it up, just say
 (then you might not have to have v1 stuff in your Interface).

 You might need to add some equals for stuff like same row, cf, and
 qualifier... but they can come later.

 The comparator stuff is currently horrid because it depends on
 context; i.e. whether the KVs are from -ROOT- or .META. or from a
 userspace table.  There are some ideas for having it so only one
 comparator for all types but that's another issue.

 St.Ack




Re: prefix compression implementation

2011-09-16 Thread Ryan Rawson
Hey this stuff looks really interesting!

On the ByteBuffer, the 'array' byte[] access to the underlying data is
totally incompatible with the 'off heap' features that are implemented
by DirectByteBuffer.  While people talk about DBB in terms of nio
performance, if you have to roundtrip the data thru java code, I'm not
sure it buys you much - you still need to move data in and out of the
main Java heap.  Typically this is geared more towards apps which read
and write from/to socket/files with minimal processing.

While in the past I have been pretty bullish on off-heap caching for
HBase, I have since changed my mind due to the poor API (ByteBuffer is
a sucky way to access data structures in ram), and other reasons (ping
me off list if you want).  The KeyValue code pretty much presumes that
data is in byte[] anyways, and I had thought that even with off-heap
caching, we'd still have to copy KeyValues into main-heap during
scanning anyways.

Given the minimal size of the hfile blocks, I really don't see an issue
with buffering a block output - especially if the savings is fairly
substantial.

Thanks,
-ryan

On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan mcor...@hotpads.com wrote:
 Jacek,

 Thanks for helping out with this.  I implemented most of the DeltaEncoder
 and DeltaEncoderSeeker.  I haven't taken the time to generate a good set of
 test data for any of this, but it does pass on some very small input data
 that aims to cover the edge cases i can think of.  Perhaps you have full
 HFiles you can run through it.

 https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder

 I also put some notes on the PtDeltaEncoder regarding how the prefix trie
 should be optimally used.  I can't think of a situation where you'd want to
 blow it up into the full uncompressed KeyValue ByteBuffer, so implementing
 the DeltaEncoder interface is a mismatch, but I realize it's only a starting
 point.

 You also would never really have a full ByteBuffer of KeyValues to pass to
 it for compression.  Typically, you'd be passing individual KeyValues from
 the memstore flush or from a collection of HFiles being merged through a
 PriorityQueue.

 The end goal is to operate on the encoded trie without decompressing it.
  Long term, and in certain circumstances, it may even be possible to pass
 the compressed trie over the wire to the client who can then decode it.

 Let me know if I implemented that the way you had in mind.  I haven't done
 the seekTo method yet, but will try to do that next week.

 Matt

 On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal ja...@fb.com wrote:

 Matt,

 Thanks a lot for the code. Great job!

 As I mentioned in JIRA, I work full time on the delta encoding [1]. Right
 now the code and integration are almost done. Most of the parts are under
 review. Since it is a big change, we plan to test it very carefully.
 After that, it will be ported to trunk and open sourced.

 From a quick glimpse, I have taken a different approach. I implemented
 a few different algorithms which are simpler. They also aim mostly to
 save space while having fast decompress/compress code. However, the access
 is still sequential. The goal of my project is to save some RAM by having
 a compressed BlockCache in memory.

 On the other hand, it seems that you are most concerned about seeking
 performance.

 I will read your code more carefully. A quick glimpse: we both implemented
 some routines (like vint), but I expect that there is no overlap.

 I also saw that you spent some time investigating ByteBuffer vs. byte[].
 I experienced a significant negative performance impact when I switched to
 ByteBuffer. However, I postponed this optimization.

 Right now I think the easiest way to go would be for you to implement the
 DeltaEncoder interface after my change:
 http://pastebin.com/Y8UxUByG
 (note, there might be some minor changes)

 That way, you will reuse my integration with existing code for free.

 Jacek

 [1] - I prefer to call it that way. Prefix is one of the algorithms, but
 there are also different approaches.

 On 9/13/11 1:36 AM, Ted Yu yuzhih...@gmail.com wrote:

 Matt:
 Thanks for the update.
 Cacheable interface is defined in:
 src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
 
 You can find the implementation at:
 src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
 
 I will browse your code later.
 
 On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan mcor...@hotpads.com
 wrote:
 
  Hi devs,
 
  I put a developer preview of a prefix compression algorithm on github.
   It
  still needs some details worked out, a full set of iterators, about 200
  optimizations, and a bunch of other stuff...  but, it successfully
 passes
  some preliminary tests so I thought I'd get it in front of more eyeballs
  sooner than later.
 
  https://github.com/hotpads/hbase-prefix-trie
 
  It depends on HBase's Bytes.java and KeyValue.java, which depends on
  hadoop.
   Those jars are 

Re: prefix compression implementation

2011-09-16 Thread Ryan Rawson
, but ultimately hbase is an
integrated whole, and the concurrency problems have been really tough
to crack.  Things are better than they have ever been, but still a lot
of testing to do.


 Matt

 On Fri, Sep 16, 2011 at 6:08 PM, Ryan Rawson ryano...@gmail.com wrote:

 Hey this stuff looks really interesting!

 On the ByteBuffer, the 'array' byte[] access to the underlying data is
 totally incompatible with the 'off heap' features that are implemented
 by DirectByteBuffer.  While people talk about DBB in terms of nio
 performance, if you have to roundtrip the data thru java code, I'm not
 sure it buys you much - you still need to move data in and out of the
 main Java heap.  Typically this is geared more towards apps which read
 and write from/to socket/files with minimal processing.

 While in the past I have been pretty bullish on off-heap caching for
 HBase, I have since changed my mind due to the poor API (ByteBuffer is
 a sucky way to access data structures in ram), and other reasons (ping
 me off list if you want).  The KeyValue code pretty much presumes that
 data is in byte[] anyways, and I had thought that even with off-heap
 caching, we'd still have to copy KeyValues into main-heap during
 scanning anyways.

 Given the minimal size of the hfile blocks, I really dont see an issue
 with buffering a block output - especially if the savings is fairly
 substantial.

 Thanks,
 -ryan

 On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan mcor...@hotpads.com wrote:
  Jacek,
 
  Thanks for helping out with this.  I implemented most of the DeltaEncoder
  and DeltaEncoderSeeker.  I haven't taken the time to generate a good set
 of
  test data for any of this, but it does pass on some very small input data
  that aims to cover the edge cases i can think of.  Perhaps you have full
  HFiles you can run through it.
 
 
 https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder
 
  I also put some notes on the PtDeltaEncoder regarding how the prefix trie
  should be optimally used.  I can't think of a situation where you'd want
 to
  blow it up into the full uncompressed KeyValue ByteBuffer, so
 implementing
  the DeltaEncoder interface is a mismatch, but I realize it's only a
 starting
  point.
 
  You also would never really have a full ByteBuffer of KeyValues to pass
 to
  it for compression.  Typically, you'd be passing individual KeyValues
 from
  the memstore flush or from a collection of HFiles being merged through a
  PriorityQueue.
 
  The end goal is to operate on the encoded trie without decompressing it.
   Long term, and in certain circumstances, it may even be possible to pass
  the compressed trie over the wire to the client who can then decode it.
 
  Let me know if I implemented that the way you had in mind.  I haven't
 done
  the seekTo method yet, but will try to do that next week.
 
  Matt
 
  On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal ja...@fb.com wrote:
 
  Matt,
 
  Thanks a lot for the code. Great job!
 
  As I mentioned in JIRA I work full time on the delta encoding [1]. Right
  now the code and integration is almost done. Most of the parts are under
  review. Since it is a big change will plan to test it very carefully.
  After that, It will be ported to trunk and open sourced.
 
  I have a quick glimpse I have taken the different approach. I
 implemented
  a few different algorithms which are simpler. They also aims mostly to
  save space while having fast decompress/compress code. However the
 access
  is still sequential. The goal of my project is to save some RAM by
 having
  compressed BlockCache in memory.
 
  On the other hand, it seems that you are most concerned about seeking
  performance.
 
  I will read your code more carefully. A quick glimpse: we both
 implemented
  some routines (like vint), but expect that there is no overlap.
 
  I also seen that you spend some time investigating ByteBuffer vs.
 Byte[].
  I experienced significant negative performance impact when I switched to
  ByteBuffer. However I postpone this optimization.
 
  Right now I think the easiest way to go would be that you will implement
  DeltaEncoder interface after my change:
  http://pastebin.com/Y8UxUByG
  (note, there might be some minor changes)
 
  That way, you will reuse my integration with existing code for free.
 
  Jacek
 
  [1] - I prefer to call it that way. Prefix is one of the algorithm, but
  there are also different approach.
 
  On 9/13/11 1:36 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Matt:
  Thanks for the update.
  Cacheable interface is defined in:
  src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
  
  You can find the implementation at:
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
  
  I will browse your code later.
  
  On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan mcor...@hotpads.com
  wrote:
  
   Hi devs,
  
   I put a developer preview of a prefix compression algorithm on
 github.
    It
   still needs some

Re: prefix compression implementation

2011-09-16 Thread Ryan Rawson
On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan mcor...@hotpads.com wrote:
 Ryan - thanks for the feedback.  The situation I'm thinking of where it's
 useful to parse DirectBB without copying to heap is when you are serving
 small random values out of the block cache.  At HotPads, we'd like to store
 hundreds of GB of real estate listing data in memory so it can be quickly
 served up at random.  We want to access many small values that are already
 in memory, so basically skipping step 1 of 3 because values are already in
 memory.  That being said, the DirectBB are not essential for us since we
 haven't run into gb problems, i just figured it would be nice to support
 them since they seem to be important to other people.

 My motivation for doing this is to make hbase a viable candidate for a
 large, auto-partitioned, sorted, *in-memory* database.  Not the usual
 analytics use case, but i think hbase would be great for this.

What exactly about the current system makes it not a viable candidate?







 On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson ryano...@gmail.com wrote:

 On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan mcor...@hotpads.com wrote:
  I'm a little confused over the direction of the DBBs in general, hence
 the
  lack of clarity in my code.
 
  I see value in doing fine-grained parsing of the DBB if you're going to
 have
  a large block of data and only want to retrieve a small KV from the
 middle
  of it.  With this trie design, you can navigate your way through the DBB
  without copying hardly anything to the heap.  It would be a shame to blow
 away
  your entire L1 cache by loading a whole 256KB block onto heap if you only
  want to read 200 bytes out of the middle... it can be done
  ultra-efficiently.

 This paragraph is not factually correct.  The DirectByteBuffer vs main
 heap has nothing to do with the CPU cache.  Consider the following
 scenario:

 - read block from DFS
 - scan block in ram
 - prepare result set for client

 Pretty simple, we have a choice in step 1:
 - write to java heap
 - write to DirectByteBuffer off-heap controlled memory

 in either case, you are copying to memory, and therefore cycling thru
 the cpu cache (of course).  The difference is whether the Java GC has
 to deal with the aftermath or not.

 So the question DBB or not is not one about CPU caches, but one
 about garbage collection.  Of course, nothing is free, and dealing
 with DBB requires extensive in-situ bounds checking (look at the
 source code for that class!), and also requires manual memory
 management on the behalf of the programmer.  So you are faced with an
 expensive API (getByte is not as good as an array get), and a lot more
 homework to do.  I have decided it's not worth it personally and am not
 chasing that line as a potential performance improvement, and I would
 also encourage you not to.

 Ultimately the DFS speed issues need to be solved by the DFS - HDFS
 needs more work, but alternatives are already there and are a lot
 faster.





 
  The problem is if you're going to iterate through an entire block made of
  5000 small KV's doing thousands of DBB.get(index) calls.  Those are like
 10x
  slower than byte[index] calls.  In that case, if it's a DBB, you want to
  copy the full block on-heap and access it through the byte[] interface.
  If
  it's a HeapBB, then you already have access to the underlying byte[].

 Yes this is the issue - you have to take an extra copy one way or
 another.  Doing effective prefix compression with DBB is not really
 feasible imo, and that's another reason why I have given up on DBBs.

 
  So there's possibly value in implementing both methods.  The main problem
 i
  see is a lack of interfaces in the current code base.  I'll throw one
  suggestion out there as food for thought.  Create a new interface:
 
  interface HCell{
   byte[] getRow();
   byte[] getFamily();
   byte[] getQualifier();
   long getTimestamp();
   byte getType();
   byte[] getValue();
 
   //plus an endless list of convenience methods:
   int getKeyLength();
   KeyValue getKeyValue();
   boolean isDelete();
   //etc, etc (or put these in sub-interface)
  }
 
  We could start by making KeyValue implement that interface and then
 slowly
  change pieces of the code base to use HCell.  That will allow us to start
  elegantly working in different implementations.
  PtKeyValue
 https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/keyvalue/trie/compact/read/PtKeyValue.java
 would
  be one of them.  During the transition, you can always call
  PtKeyValue.getCopiedKeyValue() which will instantiate a new byte[] in the
  traditional KeyValue format.

 I am not really super keen here, and while the interface of course
 makes plenty of sense, the issue is that you will need to turn an
 array of KeyValues (aka a Result instance) into a bunch of bytes on
 the wire.  So there HAS to be a method that returns a ByteBuffer that
 the IO layer can then use to write out

Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal

2011-09-04 Thread Ryan Rawson
We thought about it earlier, but a single machine needing to come back
up to restore didn't seem like a good idea.

-ryan

On Sat, Sep 3, 2011 at 11:43 PM, Mathias Herberts
mathias.herbe...@gmail.com wrote:
 On Sep 4, 2011 1:39 AM, Bill de hÓra li...@dehora.net wrote:

 On 02/09/11 19:06, Stack wrote:

 What do folks think?


 Not putting the log into hdfs seems like a good idea.

 I was somehow thinking the opposite as it makes irrecoverable machine
 failures much more problematic. What makes you say it's a good idea?



Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal

2011-09-03 Thread Ryan Rawson
My understanding is that the ASF is about community, not code. So what
is the goal for Accumulo?  Build a community. How much would it
intersect with the HBase community?  Sounds like a lot.  Does it still
make sense to incubate it then?

To the point earlier that ASF has hosted multiple competitors of
various core projects, notably httpd, I had a look: there are exactly 2
projects that serve HTTP exclusively:
Apache HTTPD
Apache Traffic Server

But these 2 are complementary; although some features kind of overlap
(mod_proxy, for example), they don't really compete directly.

So, would the ASF allow incubation of a web server product, for
example nginx (which is a direct httpd competitor)?  If the answer is
'no, either work with the httpd community or go elsewhere', then surely
Accumulo should get the same treatment?

-ryan


On Sat, Sep 3, 2011 at 12:00 AM, Bernd Fondermann
bernd.fonderm...@googlemail.com wrote:
 On Saturday, September 3, 2011, Andrew Purtell apurt...@apache.org wrote:
 I'm simply pointing out a lack of community involvement to date.


 I would only add to this that the incubation proposal makes a
 controversial statement regarding existing involvement with the HBase
 community. It may be technically true if a certain company with involvement
 in HBase has also been interacting with Accumulo, but is disingenuous to
 claim that the community has been involved here.

 It looks like strictly a one way street: They have been able to observe or
 borrow the fruits of our labor for years, and now at a suitable point wish
 to incubate at the ASF to compete with our project for community. That is
 not community involvement. That is leeching.

 are you saying that the proposal is actually some kind of HBase fork?

 And, isn't this 'competition' already happening between all the BT and
 Dynamo implementations?

 I fail to see anything bad happening here.

  Bernd



Re: already online on this server - still buggy?

2011-08-10 Thread Ryan Rawson
Oh yes I need to dig this up.

But is the solution to 'find the potential problem and fix the hole'?
Because it's quite possible the problem is that regionserver and
master were being bounced around at the same time, leading to ? In any
case, why fail the assignment.

On Mon, Aug 8, 2011 at 3:36 PM, Stack st...@duboce.net wrote:
 On Sun, Aug 7, 2011 at 9:21 PM, Ryan Rawson ryano...@gmail.com wrote:
 Why is this still happening? This was a major issue in the old master.
  And still broke?


 What happened with this region when you trace it in master logs?
 St.Ack



already online on this server - still buggy?

2011-08-07 Thread Ryan Rawson
Hi all,

I think we still have a hole in the RIT graph... I get messages like
this in my RS:

2011-08-08 04:17:48,469 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Attempted open of region_name. but already online on this server

And the master UI says the region continues to hang out in
PENDING_OPEN in the RIT graph.

Why is this still happening? This was a major issue in the old master.
 And still broke?

-ryan


Re: Msft.... (renamed thread)

2011-08-02 Thread Ryan Rawson
Stack can talk about this, but essentially for a period he could
contribute to HBase only, but not to Hadoop.  As you note, since Stack
has joined my former employer the situation is good now.

While it might be technically correct to say that MSFT has supported
HBase in the past, this is a legalistic view imo, since ultimately I
don't think that Powerset uses HBase anymore.

-ryan

On Tue, Aug 2, 2011 at 8:59 AM, Andrew Purtell apurt...@apache.org wrote:
 When Microsoft acquired Powerset, Stack and Jim were still working there, but 
 were disallowed by policy to contribute to HBase ... for months. My 
 understanding is this was due to concerns about intellectual property -- open 
 source fright?. Anyway, it was a disruptive period for the project that was 
 resolved when Stack left.

 Best regards,


    - Andy

 Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
 Tom White)



From: Doug Meil doug.m...@explorysmedical.com
To: dev@hbase.apache.org dev@hbase.apache.org
Sent: Tuesday, August 2, 2011 4:43 AM
Subject: Msft  (renamed thread)


This is a reasonably interesting history question.  Powerset folks were
working on Hbase in 2007, but per...

http://en.wikipedia.org/wiki/Powerset_%28company%29

...  Microsoft didn't buy Powerset until mid-2008.  But that's all in the
past.


However, Microsoft is currently a platinum sponsor of the Apache Software
Foundation...

http://www.apache.org/foundation/thanks.html

... along with Google and Yahoo, which means they each donate at least
$100k per year to the ASF.


So in an extended financial sense, Msft supports Hbase by way of their
donation to ASF, but they also support everything else in ASF, just like
the rest of the big donors.



On 8/1/11 10:56 PM, Ryan Rawson ryano...@gmail.com wrote:

No one at Powerset is currently contributing to HBase.  'Was' is the
key word here - in the past, etc.

I guess MSFT never got to integrating the HBase API with the C# LINQ
system and Visual Studio.  Maybe it's that azure table services?



On Mon, Aug 1, 2011 at 7:50 PM, Fuad Efendi f...@efendi.ca wrote:



re:  Is it really-really supported by Microsoft employees?!

It is really, really not.



 I believe HBase was contributed to Apache by Powerset, currently owned
 by Microsoft; the same contributors were full-time supporting HBase and
 drawing salaries from Microsoft for at least a year; it was the first
 (implicit) contribution from Microsoft to Apache.









Re: HBASE-4089 HBASE-4147 - on the topic of ops output

2011-07-31 Thread Ryan Rawson
You should ask for your money back!!

On Sun, Jul 31, 2011 at 3:10 PM, Fuad Efendi f...@efendi.ca wrote:
 What is it all about? HBase sucks. Too many problems for newcomers,
 a few-weeks warm-up to begin with... Is it really-really
 supported by Microsoft employees?!
















 And, SEO of course:
 ===


 --
 Fuad Efendi
 416-993-2060
 Tokenizer Inc., Canada
 Data Mining, Search Engines
 http://www.tokenizer.ca








 On 11-07-29 7:49 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,

I'm for publishing all performance metrics in JMX (in addition to
exposing it wherever else you guys decide).  That's because JMX is
probably the easiest for our SPM for HBase [1] to get to HBase
performance metrics and I suspect we are not alone.

Otis
[1] http://sematext.com/spm/hbase-performance-monitoring/index.html

Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/




From: Andrew Purtell apurt...@apache.org
To: Doug Meil doug.m...@explorysmedical.com; dev@hbase.apache.org
dev@hbase.apache.org
Sent: Friday, July 29, 2011 4:34 PM
Subject: Re: HBASE-4089  HBASE-4147 - on the topic of ops output

 I'd rather see this output being able to be captured by something like the
sink that Todd suggested, rather than focusing on shell access.


I don't agree.


Look at what we have existing and proposed:

    - Java API access to server and region load information, that the
shell uses

    - A proposal to dump some stats into log files, that then has to be
scraped

    - A proposal (by the FB guys) to export some JSON via a HTTP servlet

This is not good design, this is a bunch of random shit stuck together.

Note that what Todd proposed does not preclude adding Java client API
support for retrieving it.

At a minimum all of this information must be accessible via the Java
client API, to enable programmatic monitoring and analysis use cases.
I'll add the shell support if nobody else cares about it, that is a
relatively small detail, but one I think is important.

Best regards,


    - Andy


Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)



From: Doug Meil doug.m...@explorysmedical.com
To: dev@hbase.apache.org dev@hbase.apache.org;
apurt...@apache.org apurt...@apache.org
Sent: Friday, July 29, 2011 11:39 AM
Subject: Re: HBASE-4089  HBASE-4147 - on the topic of ops output


I'd rather see this output being able to be captured by something like the
sink
that Todd suggested, rather than focusing on shell access.  HServerLoad
is
super-summary at the RS level, and both the items in 4089 and 4147 are
proposed to be summarized but still have reasonable detail (e.g., even
table/CF summary there could be dozens of entries given a reasonably
complex system).




On 7/29/11 1:15 PM, Andrew Purtell apurt...@apache.org wrote:

There is also the matter of HServerLoad and how that is used by the
shell
and master UI to report on cluster status.

I'd like the shell to be able to let the user explore all of these
different reports interactively.

At the very least, they should all be handled the same way.

And then there is Riley's work over at FB on a slow query log. How does
that fit in?

Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein
(via Tom White)



From: Todd Lipcon t...@cloudera.com
To: dev@hbase.apache.org
Sent: Friday, July 29, 2011 9:58 AM
Subject: Re: HBASE-4089  HBASE-4147 - on the topic of ops output

What I'd prefer is something like:

interface BlockCacheReportSink {
  public void reportStats(BlockCacheReport report);
}

class LoggingBlockCacheReportSink implements BlockCacheReportSink {
  public void reportStats(BlockCacheReport report) {
    // log it with whatever formatting you want
  }
}

then a configuration which could default to the logging
implementation,
but
orgs could easily substitute their own implementation. For example, I
could
see wanting to do an implementation where it keeps local RRD graphs of
some
stats, or pushes them to a central management server.

The assumption is that BlockCacheReport is a fairly straightforward
struct
with the non-formatted information available.
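
[A hedged sketch of the "configuration which could default to the logging
implementation" part, using the standard Hadoop reflection pattern. The config
key is invented and BlockCacheReport is stubbed; the sink types are the ones
sketched earlier in this message.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

class BlockCacheReport { /* the non-formatted stats fields would live here */ }

public class BlockCacheReportSinkLoader {
  public static BlockCacheReportSink create(Configuration conf) {
    // Defaults to the logging sink; an org can substitute its own class name,
    // e.g. one that keeps local RRD graphs or pushes to a management server.
    Class<? extends BlockCacheReportSink> clazz = conf.getClass(
        "hbase.blockcache.report.sink.class",       // illustrative key name
        LoggingBlockCacheReportSink.class,
        BlockCacheReportSink.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}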

-Todd

On Fri, Jul 29, 2011 at 4:15 AM, Doug Meil
doug.m...@explorysmedical.comwrote:


 Hi Folks-

  You probably already saw my email yesterday on this...
  https://issues.apache.org/jira/browse/HBASE-4089 (block cache
report)

 ...and I just created this one...
  https://issues.apache.org/jira/browse/HBASE-4147 (StoreFile query
 report)

 What I'd like to run past the dev-list is this:  if HBase had periodic
 summary usage statistics, where should they go?  What I'd like to throw
 out for discussion is the suggestion that they should simply go to the
 log files and users can slice and dice this on their own.  No UI (i.e.,
 JSPs), no JMX, etc.


 The summary of the output is this:
 BlockCacheReport:  on a configured interval, print 

Re: heapSize() implementation of KeyValue

2011-07-31 Thread Ryan Rawson
Each array is really a pointer to an array (hence the references);
then we are accounting for the overhead of the 'bytes' array
itself.

And I see 3 integers pasted in (offset, length, keyLength), so things are
looking good to me.

On Sun, Jul 31, 2011 at 10:01 PM, Akash Ashok thehellma...@gmail.com wrote:
 Hi,
     I was going thru the heapSize() method in the class KeyValue and i
 couldn't seem to understand a few things which are in bold below


  private byte [] bytes = null;
  private int offset = 0;
  private int length = 0;
  private int keyLength = 0;

  // the row cached
  private byte [] rowCache = null;

  // default value is 0, aka DNC
  private long memstoreTS = 0;
  // the timestamp cached
  private long timestampCache = -1;


  public long heapSize() {
    return ClassSize.align(
    // Fixed Object size
    ClassSize.OBJECT +

 *    // Why this ?
    (2 * ClassSize.REFERENCE) +*

    // bytes Array
    ClassSize.align(ClassSize.ARRAY) +

    //Size of int length
    ClassSize.align(length) +

 *    // Why this ?? There are only 2 ints leaving length which are int (
 offset, length)
    (3 * Bytes.SIZEOF_INT) +
 *
    // rowCache byte array
        ClassSize.align(ClassSize.ARRAY) +

    // Accounts for the longs memstoreTS and timestampCache
    (2 * Bytes.SIZEOF_LONG));
  }
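
[An annotated copy of the same method; the comments are editorial and map each
term onto the fields pasted above, answering the two bolded questions: the two
REFERENCE slots are the bytes and rowCache pointers, and the three ints are
offset, length and keyLength.]

  public long heapSize() {
    return ClassSize.align(
        // fixed object header
        ClassSize.OBJECT +
        // the two byte[] reference fields: bytes and rowCache
        (2 * ClassSize.REFERENCE) +
        // header overhead of the 'bytes' array itself
        ClassSize.align(ClassSize.ARRAY) +
        // the byte data the object points at
        ClassSize.align(length) +
        // the three int fields: offset, length, keyLength
        (3 * Bytes.SIZEOF_INT) +
        // header overhead of the rowCache array
        ClassSize.align(ClassSize.ARRAY) +
        // the longs memstoreTS and timestampCache
        (2 * Bytes.SIZEOF_LONG));
  }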



Re: Avro connector

2011-07-14 Thread Ryan Rawson
Someone, but not necessarily the original contributor, should step up and
maintain. Ideally someone who is also using it :)

This could be a good chance to get on the good sides of everyone!
On Jul 14, 2011 11:48 AM, Doug Meil doug.m...@explorysmedical.com wrote:
 +1


 On 7/14/11 2:16 PM, Andrew Purtell apurt...@apache.org wrote:

HBASE-2400 introduced a new connector contrib architecturally equivalent
to the Thrift connector, but using Avro serialization and associated
transport and RPC server work. However, it remains unfinished, was
developed against an old version of Avro, is currently not maintained,
and is regarded as not production quality (see:
http://www.quora.com/What-is-the-current-status-for-using-Avro-with-HBase)
. Therefore I propose:

If a contributor steps up, then this person should bring the Avro
connector up to par with the Thrift connector.

Otherwise, we should deprecate and remove the Avro connector.

Best regards,

 - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)



Re: Converting byte[] to ByteBuffer

2011-07-09 Thread Ryan Rawson
I think my general point is we could hack up the hbase source, add
refcounting, circumvent the gc, etc or we could demand more from the dfs.

If a variant of hdfs-347 was committed, reads could come from the Linux
buffer cache and life would be good.

The choice isn't fast hbase vs slow hbase, there are elements of bugs there
as well.
On Jul 9, 2011 12:25 PM, M. C. Srivas mcsri...@gmail.com wrote:
 On Fri, Jul 8, 2011 at 6:47 PM, Jason Rutherglen 
jason.rutherg...@gmail.com
 wrote:

 There are couple of things here, one is direct byte buffers to put the
 blocks outside of heap, the other is MMap'ing the blocks directly from
 the underlying HDFS file.


 I think they both make sense. And I'm not sure MapR's solution will
 be that much better if the latter is implemented in HBase.


 There're some major issues with mmap'ing the local hdfs file (the block)
 directly:
 (a) no checksums to detect data corruption from bad disks
 (b) when a disk does fail, the dfs could start reading from an alternate
 replica ... but that option is lost when mmap'ing and the RS will crash
 immediately
 (c) security is completely lost, but that is minor given hbase's current
 status

 For those hbase deployments that don't care about the absence of the (a)
and
 (b), especially (b), its definitely a viable option that gives good perf.

 At MapR, we did consider similar direct-access capability and rejected it
 due to the above concerns.




 On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote:
  The overhead in a byte buffer is the extra integers to keep track of
the
  mark, position, limit.
 
  I am not sure that putting the block cache in to heap is the way to go.
  Getting faster local dfs reads is important, and if you run hbase on
top
 of
  Mapr, these things are taken care of for you.
  On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com
  wrote:
  Also, it's for a good cause, moving the blocks out of main heap using
  direct byte buffers or some other more native-like facility (if DBB's
  don't work).
 
  On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com
wrote:
  Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API
  is...annoying.
  On Jul 8, 2011 4:51 PM, Jason Rutherglen 
jason.rutherg...@gmail.com
 
  wrote:
  Is there an open issue for this? How hard will this be? :)
 
 



Re: Converting byte[] to ByteBuffer

2011-07-09 Thread Ryan Rawson
No lines of hbase were changed to run on Mapr. Mapr implements the hdfs API
and uses jni to get local data. If hdfs wanted to it could use more
sophisticated methods to get data rapidly from local disk to a client's
memory space...as Mapr does.
On Jul 9, 2011 6:05 PM, Doug Meil doug.m...@explorysmedical.com wrote:

 re: If a variant of hdfs-347 was committed,

 I agree with what Ryan is saying here, and I'd like to second (third?
 fourth?) the call to keep pushing for HDFS improvements. Anything else is
 coding around the bigger I/O issue.



 On 7/9/11 6:13 PM, Ryan Rawson ryano...@gmail.com wrote:

I think my general point is we could hack up the hbase source, add
refcounting, circumvent the gc, etc or we could demand more from the dfs.

If a variant of hdfs-347 was committed, reads could come from the Linux
buffer cache and life would be good.

The choice isn't fast hbase vs slow hbase, there are elements of bugs
there
as well.
On Jul 9, 2011 12:25 PM, M. C. Srivas mcsri...@gmail.com wrote:
 On Fri, Jul 8, 2011 at 6:47 PM, Jason Rutherglen 
jason.rutherg...@gmail.com
 wrote:

 There are couple of things here, one is direct byte buffers to put the
 blocks outside of heap, the other is MMap'ing the blocks directly from
 the underlying HDFS file.


 I think they both make sense. And I'm not sure MapR's solution will
 be that much better if the latter is implemented in HBase.


 There're some major issues with mmap'ing the local hdfs file (the
block)
 directly:
 (a) no checksums to detect data corruption from bad disks
 (b) when a disk does fail, the dfs could start reading from an alternate
 replica ... but that option is lost when mmap'ing and the RS will crash
 immediately
 (c) security is completely lost, but that is minor given hbase's current
 status

 For those hbase deployments that don't care about the absence of the (a)
and
 (b), especially (b), its definitely a viable option that gives good
perf.

 At MapR, we did consider similar direct-access capability and rejected
it
 due to the above concerns.




 On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote:
  The overhead in a byte buffer is the extra integers to keep track of
the
  mark, position, limit.
 
  I am not sure that putting the block cache in to heap is the way to
go.
  Getting faster local dfs reads is important, and if you run hbase on
top
 of
  Mapr, these things are taken care of for you.
  On Jul 8, 2011 6:20 PM, Jason Rutherglen
jason.rutherg...@gmail.com
  wrote:
  Also, it's for a good cause, moving the blocks out of main heap
using
  direct byte buffers or some other more native-like facility (if
DBB's
  don't work).
 
  On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com
wrote:
  Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the
API
  is...annoying.
  On Jul 8, 2011 4:51 PM, Jason Rutherglen 
jason.rutherg...@gmail.com
 
  wrote:
  Is there an open issue for this? How hard will this be? :)
 
 




Re: Converting byte[] to ByteBuffer

2011-07-08 Thread Ryan Rawson
The overhead in a byte buffer is the extra integers to keep track of the
mark, position, limit.

I am not sure that putting the block cache in to heap is the way to go.
Getting faster local dfs reads is important, and if you run hbase on top of
Mapr, these things are taken care of for you.
On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com
wrote:
 Also, it's for a good cause, moving the blocks out of main heap using
 direct byte buffers or some other more native-like facility (if DBB's
 don't work).

 On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote:
 Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API
 is...annoying.
 On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
 Is there an open issue for this? How hard will this be? :)



Re: Converting byte[] to ByteBuffer

2011-07-08 Thread Ryan Rawson
Hey,

When running on top of Mapr, hbase has fast cached access to locally stored
files, the Mapr client ensures that. Likewise, hdfs should also ensure that
local reads are fast and come out of cache as necessary. Eg: the kernel
block cache.

I wouldn't support mmap, it would require 2 different read path
implementations. You will never know when a read is not local.

Hdfs needs to provide faster local reads imo. Managing the block cache in
non-heap memory might work, but you also might get there and find the dbb
accounting overhead kills it.
On Jul 8, 2011 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com
wrote:
 There are couple of things here, one is direct byte buffers to put the
 blocks outside of heap, the other is MMap'ing the blocks directly from
 the underlying HDFS file.

 I think they both make sense. And I'm not sure MapR's solution will
 be that much better if the latter is implemented in HBase.

 On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote:
 The overhead in a byte buffer is the extra integers to keep track of the
 mark, position, limit.

 I am not sure that putting the block cache in to heap is the way to go.
 Getting faster local dfs reads is important, and if you run hbase on top
of
 Mapr, these things are taken care of for you.
 On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
 Also, it's for a good cause, moving the blocks out of main heap using
 direct byte buffers or some other more native-like facility (if DBB's
 don't work).

 On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote:
 Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API
 is...annoying.
 On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
 Is there an open issue for this? How hard will this be? :)




Re: Converting byte[] to ByteBuffer

2011-07-08 Thread Ryan Rawson
On Jul 8, 2011 7:19 PM, Jason Rutherglen jason.rutherg...@gmail.com
wrote:

  When running on top of Mapr, hbase has fast cached access to locally
stored
  files, the Mapr client ensures that. Likewise, hdfs should also ensure
that
  local reads are fast and come out of cache as necessary. Eg: the kernel
  block cache.

 Agreed!  However I don't see how that's possible today.  Eg, it'd
 require more of a byte buffer type of API to HDFS, random reads not
 using streams.  It's easy to add.

I don't think it's as easy as you say. And even using the stream API, Mapr
delivers a lot more performance. And this is from my own tests, not a white
paper.


 I think the biggest win for HBase with MapR is the lack of the
 NameNode issues and snapshotting.  In particular, snapshots are pretty
 much a standard RDBMS feature.

That is good too - if you are using hbase in real time prod you need to look
at Mapr.

But even beyond that the performance improvements are insane. We are talking
like 8-9x perf on my tests. Not to mention substantially reduced latency.

I'll repeat again, local accelerated access is going to be a required
feature. It already is.

I investigated using DBBs once upon a time; I concluded that managing the ref
counts would be a nightmare, and the better solution was to copy KeyValues
out of the DBB during scans.
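
The copy-out idea, sketched (illustrative only, not the scanner code): the block
stays in the direct buffer and every cell handed to a caller is a fresh on-heap
byte[], so nothing needs to be ref counted.

    import java.nio.ByteBuffer;

    public class CopyOutSketch {
      // Copy one cell's bytes out of a (possibly direct) block buffer.
      static byte[] copyCell(ByteBuffer block, int offset, int length) {
        byte[] onHeap = new byte[length];
        ByteBuffer dup = block.duplicate(); // independent position/limit
        dup.position(offset);
        dup.get(onHeap, 0, length);
        return onHeap; // still valid after the block is evicted
      }
    }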

Injecting refcount code seems like a worse remedy than the problem. HBase
doesn't have that many bugs today, but explicit ref counting everywhere seems
dangerous, especially when a perf solution is already here: use Mapr or
HDFS-347/local reads.

  Managing the block cache in not heap might work but you also might get
there and find the dbb accounting
  overhead kills.

 Lucene uses/abuses ref counting so I'm familiar with the downsides.
 When it works, it's great, when it doesn't it's a nightmare to debug.
 It is possible to make it work though.  I don't think there would be
 overhead from it, ie, any pool of objects implements ref counting.

 It'd be nice to not have a block cache however it's necessary for
 caching compressed [on disk] blocks.

 On Fri, Jul 8, 2011 at 7:05 PM, Ryan Rawson ryano...@gmail.com wrote:
  Hey,
 
  When running on top of Mapr, hbase has fast cached access to locally
stored
  files, the Mapr client ensures that. Likewise, hdfs should also ensure
that
  local reads are fast and come out of cache as necessary. Eg: the kernel
  block cache.
 
  I wouldn't support mmap, it would require 2 different read path
  implementations. You will never know when a read is not local.
 
  Hdfs needs to provide faster local reads imo. Managing the block cache
in
  not heap might work but you also might get there and find the dbb
accounting
  overhead kills.
  On Jul 8, 2011 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com
  wrote:
  There are couple of things here, one is direct byte buffers to put the
  blocks outside of heap, the other is MMap'ing the blocks directly from
  the underlying HDFS file.
 
  I think they both make sense. And I'm not sure MapR's solution will
  be that much better if the latter is implemented in HBase.
 
  On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote:
  The overhead in a byte buffer is the extra integers to keep track of
the
  mark, position, limit.
 
  I am not sure that putting the block cache in to heap is the way to
go.
  Getting faster local dfs reads is important, and if you run hbase on
top
  of
  Mapr, these things are taken care of for you.
  On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com

  wrote:
  Also, it's for a good cause, moving the blocks out of main heap using
  direct byte buffers or some other more native-like facility (if DBB's
  don't work).
 
  On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com
wrote:
  Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the
API
  is...annoying.
  On Jul 8, 2011 4:51 PM, Jason Rutherglen 
jason.rutherg...@gmail.com
  wrote:
  Is there an open issue for this? How hard will this be? :)
 
 
 


Re: zoo.cfg vs hbase-site.xml

2011-07-06 Thread Ryan Rawson
I was thinking that perhaps the normative use case for talking to a cluster
is to specify the quorum name and path... The implicit config can be really
confusing and is outside the norm compared to other data store systems, e.g.
MySQL, memcache, etc.
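
Explicit client config would look something like this (standard property names,
placeholder hosts):

    <configuration>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
      </property>
      <property>
        <name>zookeeper.znode.parent</name>
        <value>/hbase</value>
      </property>
    </configuration>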
On Jul 6, 2011 2:14 PM, Stack st...@duboce.net wrote:
 I agree that we should be more consistent in how we get zk config
 (Your original report looks like a bug Lars). I also recently tripped
 over the fact that hbase uses different names for one or two zk
 configs. We need to fix that too.
 St.Ack

 On Mon, Jul 4, 2011 at 8:59 AM, Jesse Yates jesse.k.ya...@gmail.com
wrote:
 Isn't that kind of the point though? If you drop in a zk config file on a
 machine, you should be able to update all your apps on that machine to
the
 new config.
 Whats more important though is being able to easily distribute a changed
zk
 config across your cluster and simultaneously across multiple
applications.
 Rather than rewriting the confs for a handful of applications, and
possibly
 making a mistake dealing with each application's own special semantics,
and
 single conf to update everything just makes sense.

 I would lobby then that we make usage more consistent (as Lars
recommends)
 and make some of the hbase conf values to more closely match the zk conf
 values (though hbase.${zk.value} is really not bad).

 -Jesse


 From: Ryan Rawson [ryano...@gmail.com]
 Sent: Monday, July 04, 2011 5:25 AM
 To: dev@hbase.apache.org
 Subject: Re: zoo.cfg vs hbase-site.xml

 Should just fully deprecate zoo.cfg, it ended up being more trouble
 than it was worth.  When you use zoo.cfg you cannot connect to more
 than 1 cluster from a single JVM.  Annoying!

 On Sun, Jul 3, 2011 at 10:22 AM, Ted Yu yuzhih...@gmail.com wrote:
  I looked at conf/zoo_sample.cfg from zookeeper trunk. The naming of
  properties is different from the way we name
  hbase.zookeeper.property.maxClientCnxns
 
  e.g.
  # the port at which the clients will connect
  clientPort=2181
 
  FYI
 
  On Sun, Jul 3, 2011 at 9:53 AM, Lars George lars.geo...@gmail.com
 wrote:
 
  Hi,
 
  Usually the zoo.cfg overrides *all* settings off the hbase-site.xml
  (including the ones from hbase-default.xml) - when present. But in
some
  places we do not consider this, for example in HConnectionManager:
 
   static {
     // We set instances to one more than the value specified for {@link
     // HConstants#ZOOKEEPER_MAX_CLIENT_CNXNS}. By default, the zk default max
     // connections to the ensemble from the one client is 30, so in that case we
     // should run into zk issues before the LRU hit this value of 31.
     MAX_CACHED_HBASE_INSTANCES = HBaseConfiguration.create().getInt(
         HConstants.ZOOKEEPER_MAX_CLIENT_CNXNS,
         HConstants.DEFAULT_ZOOKEPER_MAX_CLIENT_CNXNS) + 1;
     HBASE_INSTANCES = new LinkedHashMap<HConnectionKey, HConnectionImplementation>(
         (int) (MAX_CACHED_HBASE_INSTANCES / 0.75F) + 1, 0.75F, true) {
       @Override
       protected boolean removeEldestEntry(
           Map.Entry<HConnectionKey, HConnectionImplementation> eldest) {
         return size() > MAX_CACHED_HBASE_INSTANCES;
       }
     };
 
 
  This only reads it from hbase-site.xml+hbase-default.xml. This is
  inconsistent, I think this should use ZKConfig.makeZKProps(conf) and
 then
  get the value.
 
  Thoughts?
 
  Lars
 
 




Re: zoo.cfg vs hbase-site.xml

2011-07-04 Thread Ryan Rawson
We should just fully deprecate zoo.cfg; it ended up being more trouble
than it was worth.  When you use zoo.cfg you cannot connect to more
than one cluster from a single JVM.  Annoying!
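
With per-Configuration settings you can do something like the following from one
JVM (a sketch; cluster names and the table are placeholders), which a single
zoo.cfg on the classpath can't express:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class TwoClusters {
      public static void main(String[] args) throws IOException {
        Configuration confA = HBaseConfiguration.create();
        confA.set("hbase.zookeeper.quorum", "zk-a1,zk-a2,zk-a3");
        HTable tableA = new HTable(confA, "users");   // cluster A

        Configuration confB = HBaseConfiguration.create();
        confB.set("hbase.zookeeper.quorum", "zk-b1,zk-b2,zk-b3");
        HTable tableB = new HTable(confB, "users");   // cluster B
      }
    }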

On Sun, Jul 3, 2011 at 10:22 AM, Ted Yu yuzhih...@gmail.com wrote:
 I looked at conf/zoo_sample.cfg from zookeeper trunk. The naming of
 properties is different from the way we name
 hbase.zookeeper.property.maxClientCnxns

 e.g.
 # the port at which the clients will connect
 clientPort=2181

 FYI

 On Sun, Jul 3, 2011 at 9:53 AM, Lars George lars.geo...@gmail.com wrote:

 Hi,

 Usually the zoo.cfg overrides *all* settings off the hbase-site.xml
 (including the ones from hbase-default.xml) - when present. But in some
 places we do not consider this, for example in HConnectionManager:

  static {
    // We set instances to one more than the value specified for {@link
    // HConstants#ZOOKEEPER_MAX_CLIENT_CNXNS}. By default, the zk default max
    // connections to the ensemble from the one client is 30, so in that case we
    // should run into zk issues before the LRU hit this value of 31.
    MAX_CACHED_HBASE_INSTANCES = HBaseConfiguration.create().getInt(
        HConstants.ZOOKEEPER_MAX_CLIENT_CNXNS,
        HConstants.DEFAULT_ZOOKEPER_MAX_CLIENT_CNXNS) + 1;
    HBASE_INSTANCES = new LinkedHashMap<HConnectionKey, HConnectionImplementation>(
        (int) (MAX_CACHED_HBASE_INSTANCES / 0.75F) + 1, 0.75F, true) {
      @Override
      protected boolean removeEldestEntry(
          Map.Entry<HConnectionKey, HConnectionImplementation> eldest) {
        return size() > MAX_CACHED_HBASE_INSTANCES;
      }
    };


 This only reads it from hbase-site.xml+hbase-default.xml. This is
 inconsistent, I think this should use ZKConfig.makeZKProps(conf) and then
 get the value.

 Thoughts?

 Lars




Re: Pluggable block index

2011-06-06 Thread Ryan Rawson
On Sun, Jun 5, 2011 at 11:33 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Ok, the block index is only storing the first key of each block?
 Hmm... I think we can store a pointer to an exact position in the
 block, or at least allow that (for the FST implementation).

Are you sure that is a good idea?  Surely the disk seeks would destroy
you on index load?
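
For context, a first-key-per-block index only needs one in-memory binary search
before touching disk (a generic sketch, not the HFile code):

    import java.util.Arrays;
    import java.util.Comparator;

    public class BlockIndexSketch {
      // firstKeys[i] is the first key of block i, in sorted order.
      // Returns the index of the only block that could contain 'key'.
      static int findBlock(byte[][] firstKeys, byte[] key, Comparator<byte[]> cmp) {
        int pos = Arrays.binarySearch(firstKeys, key, cmp);
        if (pos >= 0) return pos;          // key is exactly a block's first key
        int insertion = -(pos + 1);
        return Math.max(0, insertion - 1); // otherwise it falls in the prior block
      }
    }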




 How efficient is the current seeking?

 I have previously thought about prefix compression, it seemed doable,

 It does look like prefix compression should be doable.  Eg, we'd seek
 to a position based on the block index (from which we'd have the
 entire key).  From the seek'd to position, we could scan and load up
 each subsequent prefix compressed key into a KeyValue, though right
 the KV wouldn't be 'pointing' back to the internals of the block, it'd
 be creating a whole new byte[] for each KV (which could have it's own
 garbage related ramifications).

 you'd need a compressing algorithm

 Lucene's terms dict is very simple.  The next key has the position at
 which the previous key differs.

 On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote:
 Also, dont break it :-)

 Part of the goal of HFile was to build something quick and reliable.
 It can be hard to know you have all the corner cases down and you
 won't find out in 6 months that every single piece of data you have
 put in HBase is corrupt.  Keeping it simple is one strategy.

 I have previously thought about prefix compression, it seemed doable,
 you'd need a compressing algorithm, then in the Scanner you would
 expand KeyValues and callers would end up with copies, not views on,
 the original data.  The JVM is fairly good about short lived objects
 (up to a certain allocation rate that is), and while the original goal
 was to reduce memory usage, it could make sense to take a higher short
 term allocation rate if the wins from prefix compression are there.

 Also note that in whole-system profiling, often repeated methods in
 KeyValue do pop up.  The goal of KeyValue was to have a format that
 didnt require deserialization into larger data structures (hence the
 lack of vint), and would be simple and fast.  Undoing that work should
 be accompanied with profiling evidence that new slowdowns were not
 introduced.

 -ryan

 On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 You'd have to change how the Scanner code works, etc.  You'll find out.

 Nice!  Sounds fun.

 On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote:
 What are the specs/goals of a pluggable block index?  Right now the
 block index is fairly tied deep in how HFile works. You'd have to
 change how the Scanner code works, etc.  You'll find out.



 On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote:
 I do not know of one.  FYI hfile is pretty standalone regards tests etc.  
 There is even a perf testing class for hfile



 On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com 
 wrote:

 I want to take a wh/hack at creating a pluggable block index, is there
 an open issue for this?  I looked and couldn't find one.







Re: Pluggable block index

2011-06-06 Thread Ryan Rawson
When I thought about it, I didn't think cross-block compression would
be a good idea - this is because you want to be able to decompress
each block independently of each other.  Perhaps a master HFile
dictionary or something.

-ryan

On Mon, Jun 6, 2011 at 12:06 AM, M. C. Srivas mcsri...@gmail.com wrote:
 On Sun, Jun 5, 2011 at 11:37 PM, Ryan Rawson ryano...@gmail.com wrote:

 On Sun, Jun 5, 2011 at 11:33 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  Ok, the block index is only storing the first key of each block?
  Hmm... I think we can store a pointer to an exact position in the
  block, or at least allow that (for the FST implementation).

 Are you sure that is a good idea?  Surely the disk seeks would destroy
 you on index load?


 I agree, it would be pretty bad.

 But, assuming that the block size is set appropriately, copying one key per
 100 or so values into the block index does not really bloat the hfile and is
 good trade-off to avoid the seeking. Plus, it does not prevent
 prefix-compression inside the block itself. Are we considering
 prefix-compression of keys across blocks?





 
  How efficient is the current seeking?
 
  I have previously thought about prefix compression, it seemed doable,
 
  It does look like prefix compression should be doable.  Eg, we'd seek
  to a position based on the block index (from which we'd have the
  entire key).  From the seek'd to position, we could scan and load up
  each subsequent prefix compressed key into a KeyValue, though right
  the KV wouldn't be 'pointing' back to the internals of the block, it'd
  be creating a whole new byte[] for each KV (which could have it's own
  garbage related ramifications).
 
  you'd need a compressing algorithm
 
  Lucene's terms dict is very simple.  The next key has the position at
  which the previous key differs.
 
  On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote:
  Also, dont break it :-)
 
  Part of the goal of HFile was to build something quick and reliable.
  It can be hard to know you have all the corner cases down and you
  won't find out in 6 months that every single piece of data you have
  put in HBase is corrupt.  Keeping it simple is one strategy.
 
  I have previously thought about prefix compression, it seemed doable,
  you'd need a compressing algorithm, then in the Scanner you would
  expand KeyValues and callers would end up with copies, not views on,
  the original data.  The JVM is fairly good about short lived objects
  (up to a certain allocation rate that is), and while the original goal
  was to reduce memory usage, it could make sense to take a higher short
  term allocation rate if the wins from prefix compression are there.
 
  Also note that in whole-system profiling, often repeated methods in
  KeyValue do pop up.  The goal of KeyValue was to have a format that
  didnt require deserialization into larger data structures (hence the
  lack of vint), and would be simple and fast.  Undoing that work should
  be accompanied with profiling evidence that new slowdowns were not
  introduced.
 
  -ryan
 
  On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
  You'd have to change how the Scanner code works, etc.  You'll find
 out.
 
  Nice!  Sounds fun.
 
  On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com
 wrote:
  What are the specs/goals of a pluggable block index?  Right now the
  block index is fairly tied deep in how HFile works. You'd have to
  change how the Scanner code works, etc.  You'll find out.
 
 
 
  On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote:
  I do not know of one.  FYI hfile is pretty standalone regards tests
 etc.  There is even a perf testing class for hfile
 
 
 
  On Jun 4, 2011, at 14:44, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:
 
  I want to take a wh/hack at creating a pluggable block index, is
 there
  an open issue for this?  I looked and couldn't find one.
 
 
 
 
 




Re: Pluggable block index

2011-06-04 Thread Ryan Rawson
Also, don't break it :-)

Part of the goal of HFile was to build something quick and reliable.
It can be hard to know you have all the corner cases down and you
won't find out in 6 months that every single piece of data you have
put in HBase is corrupt.  Keeping it simple is one strategy.

I have previously thought about prefix compression, it seemed doable,
you'd need a compressing algorithm, then in the Scanner you would
expand KeyValues and callers would end up with copies, not views on,
the original data.  The JVM is fairly good about short lived objects
(up to a certain allocation rate that is), and while the original goal
was to reduce memory usage, it could make sense to take a higher short
term allocation rate if the wins from prefix compression are there.
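
A toy version of what I mean by a compressing algorithm (not a proposed HFile
format): store each key as the length of the prefix shared with the previous key
plus the differing suffix, and have the scanner rebuild full keys into fresh
copies.

    public class PrefixCompressionSketch {
      // Encoder side: how many leading bytes this key shares with the previous one.
      static int sharedPrefix(byte[] prev, byte[] cur) {
        int n = Math.min(prev.length, cur.length), i = 0;
        while (i < n && prev[i] == cur[i]) i++;
        return i;
      }

      // Decoder side: rebuild the full key from the previous key, the shared
      // length, and the suffix bytes; the caller gets its own copy.
      static byte[] rebuild(byte[] prevKey, int sharedLen, byte[] suffix) {
        byte[] key = new byte[sharedLen + suffix.length];
        System.arraycopy(prevKey, 0, key, 0, sharedLen);
        System.arraycopy(suffix, 0, key, sharedLen, suffix.length);
        return key;
      }
    }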

Also note that in whole-system profiling, often repeated methods in
KeyValue do pop up.  The goal of KeyValue was to have a format that
didn't require deserialization into larger data structures (hence the
lack of vint), and would be simple and fast.  Undoing that work should
be accompanied with profiling evidence that new slowdowns were not
introduced.

-ryan

On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 You'd have to change how the Scanner code works, etc.  You'll find out.

 Nice!  Sounds fun.

 On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote:
 What are the specs/goals of a pluggable block index?  Right now the
 block index is fairly tied deep in how HFile works. You'd have to
 change how the Scanner code works, etc.  You'll find out.



 On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote:
 I do not know of one.  FYI hfile is pretty standalone regards tests etc.  
 There is even a perf testing class for hfile



 On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com 
 wrote:

 I want to take a wh/hack at creating a pluggable block index, is there
 an open issue for this?  I looked and couldn't find one.





Re: Pluggable block index

2011-06-04 Thread Ryan Rawson
Oh BTW, you can't mmap anything in HBase unless you copy it to local
disk first.  HDFS = no mmap.

just thought you'd like to know.

On Sat, Jun 4, 2011 at 3:41 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 It can be hard to know you have all the corner cases down and you
 won't find out in 6 months that every single piece of data you have
 put in HBase is corrupt.  Keeping it simple is one strategy.

 Isn't the block index separate from the actual data?  So corruption in
 that case is unlikely.

 I have previously thought about prefix compression, it seemed doable,
 you'd need a compressing algorithm, then in the Scanner you would
 expand KeyValues

 I think we can try that later.  I'm not sure one can make a hard and
 fast rule to always load the keys into RAM as an FST.  The block index
 would seem to be fairly separate.

 On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote:
 Also, dont break it :-)

 Part of the goal of HFile was to build something quick and reliable.
 It can be hard to know you have all the corner cases down and you
 won't find out in 6 months that every single piece of data you have
 put in HBase is corrupt.  Keeping it simple is one strategy.

 I have previously thought about prefix compression, it seemed doable,
 you'd need a compressing algorithm, then in the Scanner you would
 expand KeyValues and callers would end up with copies, not views on,
 the original data.  The JVM is fairly good about short lived objects
 (up to a certain allocation rate that is), and while the original goal
 was to reduce memory usage, it could make sense to take a higher short
 term allocation rate if the wins from prefix compression are there.

 Also note that in whole-system profiling, often repeated methods in
 KeyValue do pop up.  The goal of KeyValue was to have a format that
 didnt require deserialization into larger data structures (hence the
 lack of vint), and would be simple and fast.  Undoing that work should
 be accompanied with profiling evidence that new slowdowns were not
 introduced.

 -ryan

 On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 You'd have to change how the Scanner code works, etc.  You'll find out.

 Nice!  Sounds fun.

 On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote:
 What are the specs/goals of a pluggable block index?  Right now the
 block index is fairly tied deep in how HFile works. You'd have to
 change how the Scanner code works, etc.  You'll find out.



 On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote:
 I do not know of one.  FYI hfile is pretty standalone regards tests etc.  
 There is even a perf testing class for hfile



 On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com 
 wrote:

 I want to take a wh/hack at creating a pluggable block index, is there
 an open issue for this?  I looked and couldn't find one.







Re: modular build and pluggable rpc

2011-05-31 Thread Ryan Rawson
The cost of serialization is non-trivial and a substantial expense in
conveying information from regionserver -> client.  I did some
timings, and sending data across the wire is surprisingly slow, and
attempting to compress it with various compression systems ended up
taking 50-100ms in the average case (1-5mb Result[] sets).
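
If you want to reproduce that kind of number, a rough harness looks like this
(codec, payload and the resulting timings are machine dependent; this is not the
harness I used):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Random;
    import java.util.zip.GZIPOutputStream;

    public class CompressTiming {
      public static void main(String[] args) throws IOException {
        byte[] payload = new byte[2 * 1024 * 1024]; // stand-in for a ~2mb Result[] batch
        new Random(42).nextBytes(payload);
        long start = System.nanoTime();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(payload);
        gz.close();
        long ms = (System.nanoTime() - start) / 1000000L;
        System.out.println("compressed " + payload.length + " -> " + bos.size()
            + " bytes in " + ms + " ms");
      }
    }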

Originally, when conceptualizing thrift, the thought was to just send
the KeyValue byte[] over thrift as an opaque blob rather than doing a
whole structure thing, e.g. no KeyValue struct with a field for each
of the parts of a KeyValue.  On large results that structured cost becomes
prohibitive.

While HTTP has a high overhead of headers, if one wanted to be
http-oriented you could do: http://www.chromium.org/spdy

The nice thing is that HTTP has a good set of interops and the like.
The bad thing is it is too verbose.

-ryan

On Tue, May 31, 2011 at 1:22 PM, Stack st...@duboce.net wrote:
 On Mon, May 30, 2011 at 9:55 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Maven modulation could be enhanced to have a structure looks like this:

 Super POM
  +- common
  +- shell
  +- master
  +- region-server
  +- coprocessor

 The software is basically group by processor type (role of the process) and 
 a shared library.


 I'd change the list above.  shell should be client and perhaps master
 and regionserver should be both inside a single 'server' submodule.
 We need to add security in there.  Perhaps we'd have a submodule for
 thrift, avro, rest (and perhaps rest war file)?  (Is this too many
 submodules  -- I suppose once we are submodularized, adding new ones
 is trivial.  Its the initial move to submodules that is painful)


 For RPC, there are several feasible options, avro, thrift and jackson+jersey 
 (REST).  Avro may seems cumbersome to define the schema in JSON string.  
 Thrift comes with it's own rpc server, it is not trivial to add 
 authorization and authentication to secure the rpc transport.  
 Jackson+Jersey RPC message is biggest message size compare to Avro and 
 thrift.  All three frameworks have pros and cons but I think Jackson+jersey 
 have the right balance for rpc framework.  In most of the use case, 
 pluggable RPC can be narrow down to two main category of use cases:

 1. Freedom of creating most efficient rpc but hard to integrate with 
 everything else because it's custom made.
 2. Being able to evolve message passing and versioning.

 If we can see beyond first reason, and realize second reason is in part 
 polymorphic serialization.  This means, Jackson+Jersey is probably the 
 better choice as a RPC framework because Jackson supports polymorphic 
 serialization, and Jersey builds on HTTP protocol.  It would be easier to 
 versioning and add security on top of existing standards.  The syntax and 
 feature set seems more engineering proper to me.


 I always considered http attactive but much too heavy-weight for hbase
 rpc; each request/response would carry a bunch of what are for the
 most part extraneous headers.  I suppose we should just measure.
 Regards JSON messages, thats interesting but hbase is all about binary
 data.  Does jackson/jersey do BSON?

 St.Ack



Re: modular build and pluggable rpc

2011-05-27 Thread Ryan Rawson
The build modules are fine, I just wanted to voice my opinions on avro
vs thrift.  I don't think we should spend a lot of time attempting to
build an avro-vs-thrift abstraction; we should plan to eventually move to
thrift as our RPC serialization.  I also concur with Todd: our server-side
code has had a lot of work and it isn't half bad now :-)

+1 to maven modules, they are pretty cool

On Fri, May 27, 2011 at 2:38 PM, Andrew Purtell apurt...@apache.org wrote:
 I don't disagree with any of this but the fact is we have compile time 
 differences if going against secure Hadoop 0.20 or non-secure Hadoop 0.20.

 So either we decide to punt on integration with secure Hadoop 0.20 or we deal 
 with the compile time differences. If dealing with them, we can do it by 
 reflection, which is brittle and can be difficult to understand and debug, 
 and someone would have to do this work; or we can wholesale replace RPC with 
 something based on Thrift, and someone would have to do the work; or we take 
 the pluggable RPC changes that Gary has already developed and modularize the 
 build, which Eric has already volunteered to do.

  - Andy

 --- On Fri, 5/27/11, Todd Lipcon t...@cloudera.com wrote:

 From: Todd Lipcon t...@cloudera.com
 Subject: Re: modular build and pluggable rpc
 To: dev@hbase.apache.org
 Cc: apurt...@apache.org
 Date: Friday, May 27, 2011, 1:30 PM
 Agreed - I'm all for Thrift.

 Though, I actually, contrary to Ryan, think that the
 existing HBaseRPC
 handler/client code is pretty good -- better than the
 equivalents from
 Thrift Java.

 We could start by using Thrift serialization on our
 existing transport
 -- then maybe work towards contributing it upstream to the
 Thrift
 project. HDFS folks are potentially interested in doing
 that as well.

 -Todd

 On Fri, May 27, 2011 at 1:10 PM, Ryan Rawson ryano...@gmail.com
 wrote:
  I'm -1 on avro as a RPC format.  Thrift is the way to
 go, any of the
  advantages of smaller serialization of avro is lost by
 the sheer
  complexity of avro and therefore the potential bugs.
 
  I understand the desire to have a pluggable RPC
 engine, but it feels
  like the better approach would be to adopt a unified
 RPC and just be
  done with it.  I had a look at the HsHa mechanism in
 thrift and it is
  very good, it in fact matches our 'handler' approach -
 async
  recieving/sending of data, but single threaded for
 processing a
  message.
 
  -ryan
 
  On Fri, May 27, 2011 at 1:00 PM, Andrew Purtell apurt...@apache.org
 wrote:
  Also needing, perhaps later, consideration:
 
  - HDFS-347 or not
 
   - Lucene embedding for hbase-search, though as a
 coprocessor this is already pretty much handled if we have
 platform support (therefore a platform module) for a HDFS
 that can do local read shortcutting and block placement
 requests
 
  - HFile v1 versus v2
 
  Making decoupled development at several downstream
 sites manageable, with a home upstream for all the work,
 while simultaneously providing clean migration paths for
 users, basically.
 
  --- On Fri, 5/27/11, Andrew Purtell apurt...@apache.org
 wrote:
 
  From: Andrew Purtell apurt...@apache.org
  Subject: modular build and pluggable rpc
  To: dev@hbase.apache.org
  Date: Friday, May 27, 2011, 12:49 PM
  From IRC:
 
  apurtell    i propose we take the build
 modular as early as possible to deal with multiple platform
 targets
  apurtell    secure vs nonsecure
  apurtell    0.20 vs 0.22 vs trunk
  apurtell    i understand the maintenence
 issues with multiple rpc engines, for example, but a lot of
 reflection twistiness is going to be worse
  apurtell    i propose we take up esammer on
 his offer
  apurtell    so branch 0.92 asap, get trunk
 modular and working against multiple platform targets
  apurtell    especially if we're going to
 see rpc changes coming from downstream projects...
  apurtell    also what about supporting
 secure and nonsecure clients with the same deployment?
  apurtell    zookeeper does this
  apurtell    so that is selectable rpc
 engine per connection, with a negotiation
  apurtell    we don't have or want to be
 crazy about it but a rolling upgrade should be possible if
 for example we are taking in a new rpc from fb (?) or
 cloudera (avro based?)
  apurtell    also looks like hlog modules
 for 0.20 vs 0.22 and successors
  apurtell    i think over time we can
 roadmap the rpc engines, if we have multiple, by
 deprecation
  apurtell    now that we're on the edge of
 supporting both 0.20 and 0.22, and secure vs nonsecure,
 let's get it as manageable as possible right away
 
  St^Ack_        apurtell: +1
 
  apurtell    also i think there is some
 interest in async rpc engine
 
  St^Ack_        we should stick this up
 on dev i'd say
 
  Best regards,
 
      - Andy
 
  Problems worthy of attack prove their worth by
 hitting
  back. - Piet Hein (via Tom White)
 
 
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




Re: HBase version numbers

2011-05-11 Thread Ryan Rawson
0.91, if used, will be used for a developer preview, much as there
was a 0.89 Developer Preview and then 0.90.x.

DPs tend to be marked by the date they were cut, since there is no
real version number, and the average user is not expected (nor advised!)
to run a DP in production.

-ryan

On Tue, May 10, 2011 at 9:59 PM, lohit lohit.vijayar...@gmail.com wrote:
 Hello

 I see that for HBase released version is 0.90.2
 0.90.3 vote is open and a branch/tag for 0.90.4

 After this for trunk, the version seems to be 0.92.0 , right?
 What happened to 0.91 ?

 Thanks,
 Lohit



Re: why not obtain row lock when geting a row

2011-03-28 Thread Ryan Rawson
That is not the case, please see:

https://issues.apache.org/jira/browse/HBASE-2248

There are alternative mechanisms (outlined in that JIRA) to assure
atomic row reads.
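
The gist of the alternative (a heavily simplified sketch of the read-point idea,
not the actual HBASE-2248 code): writers are numbered, readers pin a read point,
and a cell is visible only when its write has completed at or below that point.

    public class ReadPointSketch {
      private volatile long completedUpTo = 0; // highest fully committed write number
      private long nextWriteNumber = 1;

      synchronized long beginWrite() { return nextWriteNumber++; }

      // Simplification: assumes writes complete in order.
      synchronized void completeWrite(long writeNumber) {
        completedUpTo = Math.max(completedUpTo, writeNumber);
      }

      long openReadPoint() { return completedUpTo; }

      boolean isVisible(long cellWriteNumber, long readPoint) {
        return cellWriteNumber <= readPoint;
      }
    }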

-ryan

On Sun, Mar 27, 2011 at 11:54 PM, jiangwen w wjiang...@gmail.com wrote:
 so client may read dirty data, considering the following case
 client#1 update firstName and lastName for a user.
 client#2 read the information of the user when client#1 updated firstName
 and will update lastName.
 so client#1 read the latest firstName, but the old lastName.

 Sincerely

 On Mon, Mar 28, 2011 at 1:45 PM, Ryan Rawson ryano...@gmail.com wrote:

 Row locks are not necessary when reading. this changed, that is why that is
 still there.
 On Mar 27, 2011 10:42 PM, jiangwen w wjiang...@gmail.com wrote:
  I think a row lock should be obtained before getting a row.
  but the following method in HRegion class show a row lock won't be
 obtained
 
  *public Result get(final Get get, final Integer lockid)*
  *
  *
  although there is a* lockid* parameter, but it is not used in this
 method.
 
  Sincerely
  Vince Wei




Re: why not obtain row lock when geting a row

2011-03-27 Thread Ryan Rawson
Row locks are not necessary when reading. This changed at some point; that is
why the parameter is still there.
On Mar 27, 2011 10:42 PM, jiangwen w wjiang...@gmail.com wrote:
 I think a row lock should be obtained before getting a row.
 but the following method in HRegion class show a row lock won't be
obtained

 *public Result get(final Get get, final Integer lockid)*
 *
 *
 although there is a* lockid* parameter, but it is not used in this method.

 Sincerely
 Vince Wei


Re: negotiated timeout

2011-03-24 Thread Ryan Rawson
The HQuorumPeer uses hbase-site.xml/hbase-default.xml to configure ZK,
including the line Patrick pointed out.  You can increase that setting to
raise the max session timeout.
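
Concretely, that means something like this in hbase-site.xml (the value is just
an example, in milliseconds):

    <property>
      <name>zookeeper.session.timeout</name>
      <value>90000</value>
    </property>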

-ryan

On Thu, Mar 24, 2011 at 5:27 PM, Ted Yu yuzhih...@gmail.com wrote:
 Seeking more comment.

 -- Forwarded message --
 From: Patrick Hunt ph...@apache.org
 Date: Thu, Mar 24, 2011 at 4:15 PM
 Subject: Re: negotiated timeout
 To: Ted Yu yuzhih...@gmail.com
 Cc: d...@zookeeper.apache.org, Mahadev Konar maha...@apache.org,
 zookeeper-...@hadoop.apache.org


 Ted, you'll need to ask the hbase guys about this if you are not
 running a dedicated zk cluster. I'm not sure how they manage embedded
 zk.

 However a quick search of the HBASE code results in:

 ./src/main/java/org/apache/hadoop/hbase/zookeeper/HQuorumPeer.java:

   // Set the max session timeout from the provided client-side timeout
   properties.setProperty(maxSessionTimeout,
       conf.get(zookeeper.session.timeout, 18));

 Patrick

 On Thu, Mar 24, 2011 at 4:00 PM, Ted Yu yuzhih...@gmail.com wrote:
 Patrick:
 Do you want me to look at maxSessionTimeout ?
 Since hbase manages zookeeper, I am not sure I can control this parameter
 directly.

 On Thu, Mar 24, 2011 at 3:50 PM, Patrick Hunt ph...@apache.org wrote:



 http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_advancedConfiguration

 On Thu, Mar 24, 2011 at 3:43 PM, Mahadev Konar maha...@apache.org
 wrote:
  Hi Ted,
   The session timeout can be changed by the server depending on min/max
  bounds set on the servers. Are you servers configured to have a max
  timeout of 60 seconds? usually the default is 20 * tickTime. Looks
  like your ticktime is 3 seconds?
 
  thanks
  mahadev
 
 
 
  On Thu, Mar 24, 2011 at 3:20 PM, Ted Yu yuzhih...@gmail.com wrote:
  Hi,
  hbase 0.90.1 uses zookeeper 3.3.2
  I specified:
  <property>
    <name>zookeeper.session.timeout</name>
    <value>49</value>
  </property>
 
  In zookeeper log I see:
  2011-03-24 19:58:09,499 INFO
 org.apache.zookeeper.server.NIOServerCnxn:
  Client attempting to establish new session at /10.202.50.111:50325
  2011-03-24 19:58:09,499 INFO
 org.apache.zookeeper.server.NIOServerCnxn:
  Established session 0x12ebb99d686a012 with negotiated timeout 6
 for
  client /10.202.50.112:62386
  2011-03-24 19:58:09,499 INFO
 org.apache.zookeeper.server.NIOServerCnxn:
  Client attempting to establish new session at /10.202.50.112:62387
  2011-03-24 19:58:09,499 INFO
  org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
  KeeperException when processing sessionid:0x12ebb99d686a012
 type:create
  cxid:0x1 zxid:0xfffe txntype:unknown reqpath:n/a Error
  Path:/hbase Error:KeeperErrorCode = NodeExists for /hbase
  2011-03-24 19:58:09,499 INFO
 org.apache.zookeeper.server.NIOServerCnxn:
  Established session 0x12ebb99d686a013 with negotiated timeout 6
 for
  client /10.202.50.111:50324
 
  Can someone tell me how the negotiated timeout of 6 was computed ?
 
  Thanks
 
 





Re: gauging cost of region movement

2011-03-21 Thread Ryan Rawson
It would make sense to avoid moving regions, so the more recently a region
was moved, the less likely we should be to move it again.

You could imagine a hypothetical perfect 'region move cost' function
that might look like:

F(r) = timeSinceMoved(r) + size(r) + loadAvg(r)

Each term should probably be normalized to [0,1], so the range of
F would be [0,3], with 3 == 'don't move' and 0 == 'move first'.

The goal is to minimize the sum of F(r[i]) over the regions we choose to move.
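
In code the cost function would look roughly like this (my sketch; note the
recency term is inverted so a just-moved region scores near 'don't move'):

    public class MoveCostSketch {
      // All inputs are assumed to be pre-normalized into [0,1].
      // Returns a cost in [0,3]: ~3 means "don't move", ~0 means "move first".
      static double moveCost(double normTimeSinceMoved, double normSize,
          double normLoadAvg) {
        double recencyCost = 1.0 - normTimeSinceMoved; // recently moved => high cost
        return recencyCost + normSize + normLoadAvg;
      }
    }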

-ryan

On Mon, Mar 21, 2011 at 4:26 PM, Jonathan Gray jg...@fb.com wrote:
 Also, using more stable measures of request count will help, such as 30 
 minute rolling averages.

 -Original Message-
 From: Jonathan Gray [mailto:jg...@fb.com]
 Sent: Monday, March 21, 2011 4:23 PM
 To: dev@hbase.apache.org
 Subject: RE: gauging cost of region movement

 This is an interesting direction, and definitely file a JIRA as this could 
 be an
 additional metric in the future, but it's not exactly what I had in mind.

 One of the hardest parts of load balancing based on request count and other
 dynamic/transient measures is that you can get some pretty pathological
 conditions where you are always moving stuff around.

 To guard against it, I think we'll need to move to more of a cost-based
 algorithm that is taking not just the difference in request counts into 
 account
 but also a baseline cost of moving a region.  The cost difference in load
 between two unbalanced servers would have to outweigh the cost
 associated with moving a region.  As you say, looking at the number of live
 operations to a given region could contribute to the cost of moving that
 region, but the best measure for that is probably just looking at request
 count (it's all requests that incur a cost, not just active scanners).

 JG

  -Original Message-
  From: Ted Yu [mailto:yuzhih...@gmail.com]
  Sent: Monday, March 21, 2011 3:44 PM
  To: dev@hbase.apache.org
  Subject: gauging cost of region movement
 
  Can we add a counter for the number of InternalScanner's to HRegion ?
  We decrement this counter when close() is called.
 
  Such counter can be used to gauge the cost of moving the underlying
 region.
 
  Cheers



Re: trimming RegionLoad fields

2011-03-17 Thread Ryan Rawson
How much memory does profiling indicate these objects use?  How much
are you expecting to save?

Saving 4-8 bytes per region, even on a 10k-region cluster, is still only 80k of
RAM, which is not really significant.


On Thu, Mar 17, 2011 at 2:32 PM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 See email thread 'One of the regionserver aborted, then the master shut down
 itself' for background.
 I am evaluating various ways of trimming the memory footprint of RegionLoad
 because there would be so many regions in production cluster.

 Looking at field memstoreSizeMB of RegionLoad, I only found this reference -
 AvroUtil.hslToASL()
 Load balancer currently isn't checking this metric. And HRegion has
 memstoreSize field.

 I wonder whether we can trim field memstoreSizeMB off RegionLoad.

 Please comment.



Re: trimming RegionLoad fields

2011-03-17 Thread Ryan Rawson
Without solid evidence that we'll be saving X megabytes, I don't see a
compelling reason to hack that stuff out yet.

We sort of need a better out-of-the-box monitoring system. One idea I
had was to embed OpenTSDB inside the HMaster. This way OpenTSDB would
store info about a HBase cluster back in the same cluster it monitors.
While this may sound weird I think it makes sense because every
great database system provides strong self monitoring tools. Eg:
Oracle, etc.  Due to the LGPL, this is not currently viable.  Perhaps
there is an alternative floating out there we can ship with? And not
ganglia :-)

On Thu, Mar 17, 2011 at 2:47 PM, Andrew Purtell apurt...@apache.org wrote:
 memstoreSizeMB is part of the output printed by the shell when you do status 
 'detailed'.

 I use that.

 Isn't that information useful to others?

  - Andy

 --- On Thu, 3/17/11, Ryan Rawson ryano...@gmail.com wrote:

 From: Ryan Rawson ryano...@gmail.com
 Subject: Re: trimming RegionLoad fields
 To: dev@hbase.apache.org
 Cc: Ted Yu yuzhih...@gmail.com
 Date: Thursday, March 17, 2011, 2:37 PM
 How much memory does profiling
 indicating these objects use?  How much
 are you expecting to save?

 Saving 4-8 bytes even on a 10k region cluster is still only
 80k of ram, not really significant.


 On Thu, Mar 17, 2011 at 2:32 PM, Ted Yu yuzhih...@gmail.com
 wrote:
  Hi,
  See email thread 'One of the regionserver aborted,
 then the master shut down
  itself' for background.
  I am evaluating various ways of trimming the memory
 footprint of RegionLoad
  because there would be so many regions in production
 cluster.
 
  Looking at field memstoreSizeMB of RegionLoad, I only
 found this reference -
  AvroUtil.hslToASL()
  Load balancer currently isn't checking this metric.
 And HRegion has
  memstoreSize field.
 
  I wonder whether we can trim field memstoreSizeMB off
 RegionLoad.
 
  Please comment.
 







Re: move meta table to ZK

2011-03-17 Thread Ryan Rawson
Is it possible to search a list of znodes? That is what we do now with meta
in hbase.

I used to be a fan, but I think self-hosting all important metadata is the
best approach. It makes lots of things easier, like replication, snapshots,
etc.
On Mar 17, 2011 9:27 PM, jiangwen w wjiang...@gmail.com wrote:
 how do you think about moving meta table to ZK, so no meta table are
needed.
 if we do so, we need enhance ZK in the following way:
 1. let children of ZNode in order.

 if we do so, we can benifit:

 1. no need to treat meta table as a special way. this will simplify the
code
 a lot
 2. ZK is highly available, so we don't worry the availablility of the meta
 data.
 3. currently if the region server where meta table is on failed, the whole
 cluster may pause.
 if we move meta table to ZK, there is no such problem.
 4. meta table may be a hotspot, but in ZK reading is scalable by adding
more
 observers.


 Sincerely


Re: When a region is spliting, what will be done with its memstores?

2011-03-14 Thread Ryan Rawson
The split transaction closes the region, at which time the memstore is
flushed to disk.

At that point the memstores are empty.  They are dereferenced when the HRegion
is removed from the regionserver's maps, and then GCed.


On Mon, Mar 14, 2011 at 1:55 AM, Zhou Shuaifeng
zhoushuaif...@huawei.com wrote:
 I read the SplitTransaction and flush code, but still don't understand the
 procedure of this question, can someone tell me?



 Zhou Shuaifeng(Frank)




 
 -
 This e-mail and its attachments contain confidential information from
 HUAWEI, which
 is intended only for the person or entity whose address is listed above. Any
 use of the
 information contained herein in any way (including, but not limited to,
 total or partial
 disclosure, reproduction, or dissemination) by persons other than the
 intended
 recipient(s) is prohibited. If you receive this e-mail in error, please
 notify the sender by
 phone or email immediately and delete it!






Re: HTable thread safety in 0.20.6

2011-03-06 Thread Ryan Rawson
On Sun, Mar 6, 2011 at 9:25 PM, Suraj Varma svarma...@gmail.com wrote:
 Thanks all for your insights into this.

 I would agree that providing mechanisms to support no-outage upgrades going
 forward would really be widely beneficial. I was looking forward to Avro for
 this reason.

 Some follow up questions:
 1) If asynchbase client to do this (i.e. talk wire protocol and adjust based
 on server versions), why not the native hbase client? Is there something in
 the native client design that would make this too hard / not worth
 emulating?

Typically this has not been an issue.  The particular design of
hadoop rpc (the rpc we use) makes it difficult to offer
multiple protocol/version support. To fix it would more or less
require rewriting the entire protocol stack. I'm glad we spent serious
time making the base storage layer and query paths fast, since without
those fundamentals a better RPC would be moot. From my measurements
I don't think we are losing a lot of performance in our current RPC
system, and unless we are very careful we'll lose a lot in a
thrift/avro transition.


 2) Does asynchbase have any limitations (functionally or otherwise) compared
 to the native HBase client?

 3) If Avro were the native protocol that HBase  client talks through,
 that is one thing (and that's what I'm hoping we end up with) - however,
 isn't spinning up Avro gateways on each node (like what is currently
 available) require folks to scale up two layers (Avro gateway layer + HBase
 layer)? i.e. now we need to be worried about whether the Avro gateways can
 handle the traffic, etc.

The hbase client is fairly 'thick', it must intelligently route
between different regionservers, handle errors, relook up meta data,
use zookeeper to bootstrap, etc. This is part of making a scalable
client though. Having the RPC serialization in thrift or avro would
make it easier to write those kinds of clients for non-Java languages.
The gateway approach will probably be necessary for a while, alas. At
SU I am not sure that the gateway is adding a lot of latency to
small queries, since average/median latency is around 1ms.  One
strategy is to deploy gateways on all client nodes and use localhost
as much as possible.

 In our application, we have Java clients talking directly to HBase. We
 debated using Thrift or Stargate layer (even though we have a Java client)
 just because of this easier upgrade-ability. But we finally decided to use
 the native HBase client because we didn't want to have to scale two layers
 rather than just HBase ... and Avro was on the road map. An HBase client
 talking native Avro directly to RS (i.e. without intermediate gateways
 would have worked - but that was a ways ...

So again, avro isn't going to be a magic bullet. Neither is thrift.  You
can't just have a dumb client with little logic open up a socket and
start talking to HBase.  That isn't congruent with a scalable system,
unfortunately. You need your clients to be smart and do a bunch of
work that otherwise would have to be done by a centralized node
or another middleman. Only if the client is smart can we send the
minimal number of RPCs over the shortest network path. Other systems have
servers bounce requests to other servers, but that generates
extra traffic in exchange for a simpler client.

 I think now that we are in the .90s, an option to do no-outage upgrades
 (from client's perspective) would be really beneficial.

We'd all like this; it's foremost in pretty much every committer's mind
all the time. It's just a HUGE body of work. One that is fraught with
perils and danger zones. For example it seemed avro would reign
supreme, but the RPC landscape is shifting back towards thrift.


 Thanks,
 --Suraj


 On Sat, Mar 5, 2011 at 2:21 PM, Todd Lipcon t...@cloudera.com wrote:

 On Sat, Mar 5, 2011 at 2:10 PM, Ryan Rawson ryano...@gmail.com wrote:
  As for the past RPC, it's all well to complain that we didn't spend
  more time making it more compatible, but in a world where evolving
  features in an early platform is more important than keeping backwards
  compatibility (how many hbase 18 jars want to talk to a modern
  cluster? Like none.), I am confident we did the right choice.  Moving
  forward I think the goal should NOT be to maintain the current system
  compatible at all costs, but to look at things like avro and thrift,
  make a calculated engineering tradeoff and get ourselves on to a
  extendable platform, even if there is a flag day.  We aren't out of
  the woods yet, but eventually we will be.

 Hear hear! +1!

 -Todd
 --
 Todd Lipcon
 Software Engineer, Cloudera




Re: HTable thread safety in 0.20.6

2011-03-06 Thread Ryan Rawson
So when you look at the interface that the client uses to talk to the
regionservers it has calls like this:

  public <R> MultiResponse multi(MultiAction<R> multi) throws IOException;
  public long openScanner(final byte [] regionName, final Scan scan)
  throws IOException;

etc

Note that this is the interface you get _AFTER_ you are talking to a
particular regionserver.  If you send a regionName that is not being
served you get a 'region not served' exception.

In other words, a blind client wouldn't know which servers to talk to.
You have to first:
- bootstrap the ROOT table's region server location from ZK (there is
only one ROOT region, and there always will be only one)
- get the META region(s) location(s)
- query the META region(s) to find out which server contains the
region for the specific request
- talk to the individual regionserver; if you get exceptions, do the
lookup in META again and retry.

Putting these smarts in the client makes it scalable, at the cost of a
thicker client.
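
In rough pseudo-Java, the routing a smart client has to do looks like this
(method and type names are illustrative, not the real client API):

    // Simplified sketch of the lookup/retry loop described above.
    Result get(byte[] table, byte[] row, Get get) throws IOException {
      ServerName rootServer = zk.readRootLocation();       // bootstrap from ZK
      RegionLocation meta = queryRoot(rootServer, table);   // locate META
      RegionLocation region = queryMeta(meta, table, row);  // locate the user region
      try {
        return rpcTo(region.server).get(region.regionName, get);
      } catch (NotServingRegionException e) {
        invalidateCachedLocation(table, row);               // region moved
        return get(table, row, get);                        // re-lookup and retry
      }
    }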

To make an API that has a '1 shot' type of interface, we'd end up
creating something that looks like the thrift gateway.  But now you
have bottlenecks in the thrift gateway servers.

There really is no free lunch. Sorry.


On Sun, Mar 6, 2011 at 10:09 PM, Suraj Varma svarma...@gmail.com wrote:
 Sorry - missed the user group in my previous mail.
 --Suraj

 On Sun, Mar 6, 2011 at 10:07 PM, Suraj Varma svarma...@gmail.com wrote:

 Very interesting.
 I was just about to send an additional mail asking why HBase client also
 needs the hadoop jar (thereby tying the client onto the hadoop version as
 well) - but, I guess at the least the hadoop rpc is the dependency. So, now
 that makes sense.


  One strategy is to deploy gateways on all client nodes and use localhost
  as much as possible.

 This certainly scales up the gateway nodes - but complicates the upgrades.
 For instance, we will have a 100+ clients talking to the cluster and
 upgrading from 0.20.x to 0.90.x would be that much harder with version
 specific gateway nodes all over the place.

  So again avro isn't going to be a magic bullet. Neither thrift.
 This is interesting (disappointing?) ... isn't the plan to substitute
 hadoop rpc with avro (or thrift) while still keeping all the smart logic in
 the client in place? I thought avro with its cross-version capabilities
 would have solved the versioning issues and allowed the backward/forward
 compatibility. I mean, a thick client talking avro was what I had imagined
 the solution to be.

 Glad to know that client compatibility is very much in the commiter's /
 community's mind.

 Based on discussion below, is async-hbase a thick / smart client or
 something less than that?
  2) Does asynchbase have any limitations (functionally or otherwise)
 compared
  to the native HBase client?

 Thanks again.
 --Suraj


 On Sun, Mar 6, 2011 at 9:40 PM, Ryan Rawson ryano...@gmail.com wrote:

 On Sun, Mar 6, 2011 at 9:25 PM, Suraj Varma svarma...@gmail.com wrote:
  Thanks all for your insights into this.
 
  I would agree that providing mechanisms to support no-outage upgrades
 going
  forward would really be widely beneficial. I was looking forward to Avro
 for
  this reason.
 
  Some follow up questions:
  1) If asynchbase client to do this (i.e. talk wire protocol and adjust
 based
  on server versions), why not the native hbase client? Is there something
 in
  the native client design that would make this too hard / not worth
  emulating?

 Typically this has not been an issue.  The particular design of the
 way that hadoop rpc (the rpc we use) makes it difficult to offer
 multiple protocol/version support. To fix it would more or less
 require rewriting the entire protocol stack. I'm glad we spent serious
 time making the base storage layer and query paths fast, since without
 those fundamentals a better RPC would be moot. From my measurements
 I dont think we are losing a lot of performance in our current RPC
 system, and unless we are very careful we'll lose a lot in a
 thrift/avro transition.


  2) Does asynchbase have any limitations (functionally or otherwise)
 compared
  to the native HBase client?
 
  3) If Avro were the native protocol that HBase  client talks through,
  that is one thing (and that's what I'm hoping we end up with) - however,
  isn't spinning up Avro gateways on each node (like what is currently
  available) require folks to scale up two layers (Avro gateway layer +
 HBase
  layer)? i.e. now we need to be worried about whether the Avro gateways
 can
  handle the traffic, etc.

 The hbase client is fairly 'thick', it must intelligently route
 between different regionservers, handle errors, relook up meta data,
 use zookeeper to bootstrap, etc. This is part of making a scalable
 client though. Having the RPC serialization in thrift or avro would
 make it easier to write those kinds of clients for non-Java languages.
 The gateway approach will probably be necessary for a while alas. At
 SU I am not sure that the gateway

Re: HTable thread safety in 0.20.6

2011-03-05 Thread Ryan Rawson
I don't think protobuf is winning the war out there; it's either thrift
or avro at this point.  Protobuf just isn't a bazaar-style open-source
project, its non-Java/C++/Python support isn't first class, and it has no
RPC.

As for the past RPC, it's all well and good to complain that we didn't spend
more time making it more compatible, but in a world where evolving
features in an early platform is more important than keeping backwards
compatibility (how many hbase 18 jars want to talk to a modern
cluster? Like none.), I am confident we made the right choice.  Moving
forward I think the goal should NOT be to maintain the current system
compatible at all costs, but to look at things like avro and thrift,
make a calculated engineering tradeoff and get ourselves on to a
extendable platform, even if there is a flag day.  We aren't out of
the woods yet, but eventually we will be.

-ryan

On Fri, Mar 4, 2011 at 8:50 PM, M. C. Srivas mcsri...@gmail.com wrote:
 Google's protobufs make this problem more palatable with optional params. Of
 course, you will have to break versions once more 

 On Fri, Mar 4, 2011 at 10:04 AM, Stack st...@duboce.net wrote:

 On Fri, Mar 4, 2011 at 12:24 AM, tsuna tsuna...@gmail.com wrote:
  In practice, bear in mind that HBase has a bad track record of
  breaking backward compatibility between virtually every release (even
  minor ones), although they often bump the protocol version number even
  though there are no client-visible API changes (e.g. because only some
  internal APIs used by the master or other administrative APIs
  irrelevant for the client changed).

 At Benoit's suggestion, we've changed the way we version Interfaces;
 rather than a global version for all, we now version each Interface
 separately.  More to come...
 St.Ack




Re: Build failed in Hudson: HBase-TRUNK #1763

2011-03-02 Thread Ryan Rawson
I'll fix this in a few hours. Not quite awake :)


Re: Coprocessor tax?

2011-03-01 Thread Ryan Rawson
I don't think we need a lock even for updating; check out
CopyOnWriteArrayList.
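A minimal sketch of what I mean (my example, not the RegionCoprocessorHost
code): loading/unloading copies the array, and the hot path iterates with no
lock at all.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class CoprocessorListSketch {
      interface ObserverLike { void prePut(); }

      // Writes (rare: load/unload) copy the backing array; reads (every request)
      // iterate over an immutable snapshot without locking.
      private final List<ObserverLike> observers =
          new CopyOnWriteArrayList<ObserverLike>();

      void register(ObserverLike o) { observers.add(o); }

      void invokePrePut() {
        for (ObserverLike o : observers) {
          o.prePut();
        }
      }
    }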
On Mar 1, 2011 12:45 PM, Gary Helmling ghelml...@gmail.com wrote:
 Yeah, I was just looking at the write lock references as well.

 I'm not sure RegionCoprocessorHost.preClose() would really need the write
 lock? As you say, there is still a race in HRegion.doClose() between
 preClose() completing and HRegion.lock.writeLock() being taken out, so
other
 methods could still be called after.

 RegionCoprocessorHost.postClose() occurs under the HRegion write lock, so
 any read lock operations would already have to have completed by this
point.
 So here we wouldn't really need the coprocessor write lock either?

 It seems like we could actually drop the coprocessor lock, since
 coprocessors are currently loaded prior to region open completing.

 Online coprocessor loading (not currently provided) could be handled in
the
 future by a lock just for loading, and creating a new coprocessor
collection
 and assigning when done.

 On Tue, Mar 1, 2011 at 12:08 PM, Ryan Rawson ryano...@gmail.com wrote:

 My own profiling shows that a read write lock can be up to 3-6% of the
 CPU budget in our put/get query path. Adding another one if not
 necessary would probably not be good.

 In fact in the region coprocessor the only thing the write lock is
 used for is the preClose and postClose, but looking in the
 implementation of those methods I don't really get why this is
 necessary. The write lock ensures single thread access, but there is
 nothing that prevents other threads from calling other methods AFTER
 the postClose?

 -ryan

 On Tue, Mar 1, 2011 at 12:02 PM, Gary Helmling ghelml...@gmail.com
 wrote:
  All the CoprocessorHost invocations should be wrapped in if (cpHost !=
  null). We could just added an extra check for whether any coprocessors
 are
  loaded -- if (cpHost != null  cpHost.isActive()), something like
 that?
  Or the CoprocessorHost methods could do this checking internally.
 
  Either way should be relatively easy to bypass the lock acquisition. Is
  there much overhead to acquiring a read lock if the write lock is never
  taken though? (just wondering)
 
 
 
  On Tue, Mar 1, 2011 at 11:51 AM, Stack st...@duboce.net wrote:
 
  So, I'm debugging something else but thread dumping I see a bunch of
 this:
 
 
  IPC Server handler 6 on 61020 daemon prio=10 tid=0x422d2800
  nid=0x7714 runnable [0x7f1c5acea000]
  java.lang.Thread.State: RUNNABLE
  at
 

java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:434)
  at
 

java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:404)
  at
 

java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1260)
  at
 

java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
  at
 

org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:532)
  at
 

org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1476)
  at
  org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1454)
  at
 

org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2652)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at
 

org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:309)
  at
 

org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1060)
 
 
  Do others? I don't have any CPs loaded. I'm wondering if we can do
  more to just avoid the CP codepath if no CPs loaded.
 
  St.Ack
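
A minimal, self-contained sketch of the guard idea discussed in this thread:
skip the coprocessor code path (and its lock) entirely when no observers are
loaded. The class and method names below are hypothetical, not the actual
HBase API.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class CoprocessorHostSketch {
  private final List<Runnable> observers = new ArrayList<Runnable>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  boolean hasObservers() {
    return !observers.isEmpty();
  }

  void prePut() {
    // Only callers that already passed the hasObservers() guard pay for this lock.
    lock.readLock().lock();
    try {
      for (Runnable o : observers) {
        o.run();
      }
    } finally {
      lock.readLock().unlock();
    }
  }
}

final class WritePathSketch {
  private final CoprocessorHostSketch cpHost; // may be null when CPs are off

  WritePathSketch(CoprocessorHostSketch cpHost) {
    this.cpHost = cpHost;
  }

  void put() {
    // The guard from the thread: no host, or no observers loaded, means the
    // lock in prePut() is never touched on the hot write path.
    if (cpHost != null && cpHost.hasObservers()) {
      cpHost.prePut();
    }
    // ... normal memstore/WAL work would follow here ...
  }
}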
 
 



Re: putting a border around 0.92 release

2011-02-28 Thread Ryan Rawson
I'd generally vote for a time-based release.  The big feature releases,
while good for attracting new users with new features, present a problem
in that they can really delay releases for a long time.  More releases are
better!  If a feature takes more than 3 months then it's too big to
implement in one go.

On Mon, Feb 28, 2011 at 2:00 PM, Todd Lipcon t...@cloudera.com wrote:
 On Sat, Feb 26, 2011 at 2:24 PM, Jean-Daniel Cryans jdcry...@apache.org 
 wrote:
 Woah those are huge tasks!

 Also to consider:

  - integration with hadoop 0.22, should we support it and should we
 also support 0.20 at the same time? 0.22 was branched but IIRC it
 still has quite a few blockers.
  - removing heartbeats, this is in the pipeline from Stack and IMO
 will have ripple effects on online schema editing.
  - HBASE-2856, pretty critical.
  - replication-related issues like multi-slave (which I'm working on),
 and ideally multi-master. I'd like to add better management tools too.

 And lastly we need to plan when we want to branch 0.92... should we
 target late May in order to be ready for the Hadoop Summit in June?
 For once it would be nice to offer more than an alpha release :)

 In my view, we can do one or the other: either it's a feature-based
 release, in which case we release it when it's done, or it's a
 time-based release, in which case we release at some decided-upon time
 with whatever's done.

 I personally prefer time-based releases, though we need to make sure
 if we decide to do this that any large destabilizing (or half
 complete) features are guarded either by config flags or are developed
 in a branch. Thus trunk stays relatively releasable at all times and
 we can be pretty confident we'll hit the decided-upon timeline.

 Looking back at the 0.90 release, we got caught in a bind because we
 were trying to do both feature-based (new master) and time-based (end
 of 2010).

 So, my vote is either:
 plan a: hybrid model - 0.91.X becomes a time-based release series
 where we drop trunk once every month or two, and 0.92.0 is gated on
 features
 or:
 plan b: strict time-based: we release 0.92.0 around summit, and lock
 down the branch at least a month or so ahead of time for bugfix only.

 Thoughts?

 -Todd



 On Sat, Feb 26, 2011 at 12:34 PM, Andrew Purtell apurt...@apache.org wrote:
 Stack and I were chatting on IRC about settling what should get into 0.92 
 before pulling the trigger on the release.

 Stack thinks we need online region schema editing. I agree because 
 per-table coprocessor loading is configured via table attributes. We'd also 
 need some kind of notification of schema update to trigger various actions 
 in the regionserver. (For CPs, (re)load.)

 I'd also really like to see some form of secondary indexing. This is an 
 important feature for HBase to have. All of our in-house devs ask for this 
 sooner or later in one form or another. Other projects have options in this 
 arena, while HBase used to have this in core, but no longer does. We have 
 three people starting on this ASAP. I'd like to at least do co-design with 
 the community. We should aim for 'simple and effective'.

 There are 14 blockers: 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.92.0%22+AND+resolution+%3D+Unresolved+AND+priority+%3D+Blocker

 Additionally, 22 marked as critical: 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.92.0%22+AND+resolution+%3D+Unresolved+AND+priority+%3D+Critical

 Best regards,

    - Andy

 Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)








 --
 Todd Lipcon
 Software Engineer, Cloudera



Let's Switch to TestNG

2011-02-23 Thread Ryan Rawson
I filed HBASE-3555, and I listed the following reasons:

- test groups allow us to separate slow/fast tests from each other
- surefire support for running specific groups would allow 'check-in
tests' vs 'hudson/integration tests' (i.e. fast/slow)
- it supports all the features of JUnit 4, plus it is VERY similar,
making the transition easy.
- they have assertEquals(byte[],byte[])

What do other people think?
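
A short sketch of the test-group idea using the standard TestNG annotations
(group names here are only examples, not a proposed convention):

import static org.testng.Assert.assertEquals;

import org.testng.annotations.Test;

public class ExampleGroupedTests {

  // Fast "check-in" test: cheap, no mini-cluster needed.
  @Test(groups = { "fast" })
  public void bytesRoundTrip() {
    byte[] expected = { 1, 2, 3 };
    byte[] actual = { 1, 2, 3 };
    // TestNG's assertEquals compares byte[] contents directly.
    assertEquals(actual, expected);
  }

  // Slow integration-style test, meant for the hudson/integration run only.
  @Test(groups = { "slow" })
  public void bulkLoadScenario() throws InterruptedException {
    Thread.sleep(10); // stand-in for expensive cluster work
  }
}

Surefire (or the TestNG runner) can then be pointed at just the "fast" group
for local check-in runs and at all groups for the full build.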


Re: Hbase packaging

2011-02-17 Thread Ryan Rawson
Can there be a way to turn it off for those of us who build and use
the .tar.gz but don't want the time sink of generating debs/rpms?



On Thu, Feb 17, 2011 at 1:25 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Thanks Ted.  I will include this build phase patch with the rpm/deb packaging 
 patch. :)

 Regards,
 Eric

 On 2/17/11 12:58 PM, Ted Dunning tdunn...@maprtech.com wrote:

 Attaching the packaging to the normal life cycle step is a great idea.

 Having the packaging to RPM and deb packaging all in one step is very nice.

 On Thu, Feb 17, 2011 at 12:40 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Sorry the attachment didn't make it through the mailing list.  The patch
 looks like this:

 Index: pom.xml
 ===
 --- pom.xml     (revision 1071461)
 +++ pom.xml     (working copy)
 @@ -321,6 +321,15 @@
             <descriptor>src/assembly/all.xml</descriptor>
           </descriptors>
         </configuration>
 +        <executions>
 +          <execution>
 +            <id>tarball</id>
 +            <phase>package</phase>
 +            <goals>
 +              <goal>single</goal>
 +            </goals>
 +          </execution>
 +        </executions>
       </plugin>

       <!-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without
 running tests (this is needed for upstream projects whose tests need this
 jar simply for compilation)-->
 @@ -329,6 +338,7 @@
         <artifactId>maven-jar-plugin</artifactId>
         <executions>
           <execution>
 +            <phase>prepare-package</phase>
             <goals>
               <goal>test-jar</goal>
             </goals>
 @@ -355,7 +365,7 @@
         <executions>
           <execution>
             <id>attach-sources</id>
 -            <phase>package</phase>
 +            <phase>prepare-package</phase>
             <goals>
               <goal>jar-no-fork</goal>
             </goals>



 On 2/17/11 12:30 PM, Eric Yang ey...@yahoo-inc.com wrote:

 Hi Stack,

 Thanks for the pointer.  This is very useful.  What do you think about
 making jar file creation to prepare-package phase, and having
 assembly:single be part of package phase?  This would make room for running
 both rpm plugin and jdeb plugin in the packaging phase.  Enclosed patch can
 express my meaning better.  User can run:

 mvn -DskipTests package

 The result would be jars, tarball, rpm, debian packages in target directory.

 Another approach is to use -P rpm,deb to control package type generation.

 The current assumption is to leave hbase bundled zookeeper outside of the
 rpm/deb package to improve project integrations.  There will be a submodule
 called hbase-conf-pseudo package, which deploys a single node hbase cluster
 on top of Hadoop+Zookeeper rpms. Would this work for you?

 Regards,
 Eric

 On 2/17/11 11:41 AM, Stack st...@duboce.net wrote:

 On Thu, Feb 17, 2011 at 11:34 AM, Eric Yang ey...@yahoo-inc.com wrote:
 Hi,

 I am trying to understand the release package process for HBase.  In the
 current maven pom.xml, I don't see tarball generation as part of the
 packaging phase.

 The assembly plugin does it for us.  Run:

 $ mvn assembly:assembly

 or

 $ mvn -DskipTests assembly:assembly

 ... to skip the running of the test suite (1 hour).

 See http://wiki.apache.org/hadoop/Hbase/MavenPrimer.



 What about having a inline process which creates both release tarball, rpm,
 and debian packages?  This is to collect feedback for HADOOP-6255 to ensure
 HBase integrates well with rest of the stack.  Thanks



 This sounds great Eric.  Let us know how we can help.  It looks like
 there is an rpm plugin for maven but I've not played with it in the
 past.  If you have input on this, and you'd like me to mess with it,
 I'd be happy to help out.

 Good stuff,
 St.Ack









Re: Hbase packaging

2011-02-17 Thread Ryan Rawson
Sounds good, thanks!

-ryan

On Thu, Feb 17, 2011 at 1:40 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Hi Ryan,

 This would fall in the second proposal, use profile as toggle to switch
 between packaging mechanism. I.e.

  mvn -DskipTests package

 builds tarball.

  mvn -DskipTests package -P rpm,deb

 builds tarball, rpm and deb.

 Does this work for you?

 Regards,
 Eric

 On 2/17/11 1:27 PM, Ryan Rawson ryano...@gmail.com wrote:

 Can there be a way to turn it off for those of us who build and use
 the .tar.gz but dont want the time sink in generating deb/rpms?



 On Thu, Feb 17, 2011 at 1:25 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Thanks Ted.  I will include this build phase patch with the rpm/deb
 packaging patch. :)

 Regards,
 Eric

 On 2/17/11 12:58 PM, Ted Dunning tdunn...@maprtech.com wrote:

 Attaching the packaging to the normal life cycle step is a great idea.

 Having the packaging to RPM and deb packaging all in one step is very
 nice.

 On Thu, Feb 17, 2011 at 12:40 PM, Eric Yang ey...@yahoo-inc.com wrote:
 Sorry the attachment didn't make it through the mailing list.  The patch
 looks like this:

 Index: pom.xml
 ===
 --- pom.xml     (revision 1071461)
 +++ pom.xml     (working copy)
 @@ -321,6 +321,15 @@
             <descriptor>src/assembly/all.xml</descriptor>
           </descriptors>
         </configuration>
 +        <executions>
 +          <execution>
 +            <id>tarball</id>
 +            <phase>package</phase>
 +            <goals>
 +              <goal>single</goal>
 +            </goals>
 +          </execution>
 +        </executions>
       </plugin>

       <!-- Run with -Dmaven.test.skip.exec=true to build -tests.jar
 without
 running tests (this is needed for upstream projects whose tests need this
 jar simply for compilation)-->
 @@ -329,6 +338,7 @@
         <artifactId>maven-jar-plugin</artifactId>
         <executions>
           <execution>
 +            <phase>prepare-package</phase>
             <goals>
               <goal>test-jar</goal>
             </goals>
 @@ -355,7 +365,7 @@
         <executions>
           <execution>
             <id>attach-sources</id>
 -            <phase>package</phase>
 +            <phase>prepare-package</phase>
             <goals>
               <goal>jar-no-fork</goal>
             </goals>



 On 2/17/11 12:30 PM, Eric Yang ey...@yahoo-inc.com wrote:

 Hi Stack,

 Thanks for the pointer.  This is very useful.  What do you think about
 making jar file creation to prepare-package phase, and having
 assembly:single be part of package phase?  This would make room for
 running
 both rpm plugin and jdeb plugin in the packaging phase.  Enclosed patch
 can
 express my meaning better.  User can run:

 mvn -DskipTests package

 The result would be jars, tarball, rpm, debian packages in target
 directory.

 Another approach is to use -P rpm,deb to control package type generation.

 The current assumption is to leave hbase bundled zookeeper outside of the
 rpm/deb package to improve project integrations.  There will be a
 submodule
 called hbase-conf-pseudo package, which deploys a single node hbase
 cluster
 on top of Hadoop+Zookeeper rpms. Would this work for you?

 Regards,
 Eric

 On 2/17/11 11:41 AM, Stack st...@duboce.net wrote:

 On Thu, Feb 17, 2011 at 11:34 AM, Eric Yang ey...@yahoo-inc.com wrote:
 Hi,

 I am trying to understand the release package process for HBase.  In
 the
 current maven pom.xml, I don't see tarball generation as part of the
 packaging phase.

 The assembly plugin does it for us.  Run:

 $ mvn assembly:assembly

 or

 $ mvn -DskipTests assembly:assembly

 ... to skip the running of the test suite (1 hour).

 See http://wiki.apache.org/hadoop/Hbase/MavenPrimer.



 What about having a inline process which creates both release tarball,
 rpm,
 and debian packages?  This is to collect feedback for HADOOP-6255 to
 ensure
 HBase integrates well with rest of the stack.  Thanks



 This sounds great Eric.  Let us know how we can help.  It looks like
 there is an rpm plugin for maven but I've not played with it in the
 past.  If you have input on this, and you'd like me to mess with it,
 I'd be happy to help out.

 Good stuff,
 St.Ack











Re: API changes between 0.20.6 and 0.90.1

2011-02-16 Thread Ryan Rawson
Well done Andrew.

People who want to know the API differences should probably just read:

https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html

And specifically the HTable, Put, Get, Delete, Scan classes.



On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell apurt...@apache.org wrote:
 I ran jdiff by hand. See:

   https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html

 Best regards,

     - Andy

 Problems worthy of attack prove their worth by hitting back.
   - Piet Hein (via Tom White)


 --- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote:

 From: Lars George lars.geo...@gmail.com
 Subject: Re: API changes between 0.20.6 and 0.90.1
 To: dev@hbase.apache.org
 Date: Wednesday, February 16, 2011, 1:22 AM
 +1, I like that idea.

 On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com
 wrote:
  Hi Ted,
 
  I'd recommend setting up jdiff to answer this
 question. Would be a good
  contribution to our source base to be able to run this
 automatically and
  generate a report as part of our build. We do this in
 Hadoop and it's very
  useful.
 
  -Todd
 
  On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com
 wrote:
 
  Can someone tell me which classes from the list
 below changed API between
  0.20.6 and 0.90.1 ?
  http://pastebin.com/TkZfPt52
 
  Thanks
 
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 









Re: API changes between 0.20.6 and 0.90.1

2011-02-16 Thread Ryan Rawson
Sounds like Ted volunteered to do it!

Good job!
-ryan

On Wed, Feb 16, 2011 at 12:15 PM, Ted Yu yuzhih...@gmail.com wrote:
 Definitely.

 On Wed, Feb 16, 2011 at 11:57 AM, Todd Lipcon t...@cloudera.com wrote:

 In Hadoop land, Tom White did some awesome work to add special annotations
 that we stick on all the public classes that classify the interfaces as:

 Stability:
  - Unstable: may change and likely to change between point releases,
  - Evolving: possibly change between point releases but unlikely, could
 well change between bigger releases
  - Stable: hasn't changed in a long time, unlikely to change

 Audience: Private, Limited, Public
  - Private: not meant for users, even if it's Stable we might change it
 and break you without a deprecation path
  - Limited: meant only for a certain set of specified projects (eg we might
 say this API is only for use by Hive, and we'll change it so long as the
 hive people are OK with it)
  - Public: won't change without deprecation path for one major release

 He also built some cool tools to do jdiff and javadoc with these
 annotations
 taken into account (eg javadoc won't show private APIs)

 Are people interested in bringing this system over to HBase?

 -Todd

 On Wed, Feb 16, 2011 at 11:51 AM, Ryan Rawson ryano...@gmail.com wrote:

  Well done Andrew.
 
  People who want to know the API differences should probably mostly only
  read:
 
 
 
 https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html
 
  And specifically the HTable, Put, Get, Delete, Scan classes.
 
 
 
  On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell apurt...@apache.org
  wrote:
   I ran jdiff by hand. See:
  
  
  https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html
  
   Best regards,
  
       - Andy
  
   Problems worthy of attack prove their worth by hitting back.
     - Piet Hein (via Tom White)
  
  
   --- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote:
  
   From: Lars George lars.geo...@gmail.com
   Subject: Re: API changes between 0.20.6 and 0.90.1
   To: dev@hbase.apache.org
   Date: Wednesday, February 16, 2011, 1:22 AM
   +1, I like that idea.
  
   On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com
   wrote:
Hi Ted,
   
I'd recommend setting up jdiff to answer this
   question. Would be a good
contribution to our source base to be able to run this
   automatically and
generate a report as part of our build. We do this in
   Hadoop and it's very
useful.
   
-Todd
   
On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com
   wrote:
   
Can someone tell me which classes from the list
   below changed API between
0.20.6 and 0.90.1 ?
http://pastebin.com/TkZfPt52
   
Thanks
   
   
   
   
--
Todd Lipcon
Software Engineer, Cloudera
   
  
  
  
  
  
  
  
 



 --
 Todd Lipcon
 Software Engineer, Cloudera
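
For reference, a small sketch of what the audience/stability annotations
described above look like when applied to a class. The imports assume the
Hadoop classification annotations; treat the exact package as an assumption
here:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/**
 * Public, evolving client API: users may depend on it, but it may still
 * change between major releases with a deprecation path.
 */
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class ExampleClientFacade {

  /** Internals: no compatibility promise, may change or vanish at any time. */
  @InterfaceAudience.Private
  static final class InternalHelper {
    // implementation detail, hidden from the public javadoc by the tooling
  }
}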




Re: API changes between 0.20.6 and 0.90.1

2011-02-16 Thread Ryan Rawson
Step 1 is to add the jdiff framework in, that is a non-trivial but
straightforward change.
Step 2 is to annotate all the APIs, something that should be done by
various domain experts over time. Even if this is not complete, there
is still value in #1.
Step 3: ?
Step 4: profit!

On Wed, Feb 16, 2011 at 3:17 PM, Ted Yu yuzhih...@gmail.com wrote:
 Looking at what Stack is doing in
 https://issues.apache.org/jira/browse/HBASE-1502, I think we can use the
 following appoach:
 1. create the annotations below
 2. Committers who actively refactor code place proper annotation on the
 classes they touch
 3. after some time, we should be able to mark the classes/methods untouched
 by #2 stable.

 My two cents.

 On Wed, Feb 16, 2011 at 2:12 PM, Ted Yu yuzhih...@gmail.com wrote:

 The following annotation can only be attached by HBase committer(s):


 Stability:
  - Unstable: may change and likely to change between point releases,
  - Evolving: possibly change between point releases but unlikely, could
 well change between bigger releases

 Contributors would have a hard time keeping up with current development.


 On Wed, Feb 16, 2011 at 12:46 PM, Ted Yu yuzhih...@gmail.com wrote:

 I am not very familiar with (internal) HBase APIs which grow quite large.
 I have a full-time job.

 And this task is quite big.

 Community effort should be the best approach.


 On Wed, Feb 16, 2011 at 12:20 PM, Todd Lipcon t...@cloudera.com wrote:

 On Wed, Feb 16, 2011 at 12:16 PM, Ryan Rawson ryano...@gmail.comwrote:

 Sounds like Ted volunteered to do it!


 Woohoo, thanks Ted!

 -Todd


  On Wed, Feb 16, 2011 at 12:15 PM, Ted Yu yuzhih...@gmail.com wrote:
  Definitely.
 
  On Wed, Feb 16, 2011 at 11:57 AM, Todd Lipcon t...@cloudera.com
 wrote:
 
  In Hadoop land, Tom White did some awesome work to add special
 annotations
  that we stick on all the public classes that classify the interfaces
 as:
 
  Stability:
   - Unstable: may change and likely to change between point releases,
   - Evolving: possibly change between point releases but unlikely,
 could
  well change between bigger releases
   - Stable: hasn't changed in a long time, unlikely to change
 
  Audience: Private, Limited, Public
   - Private: not meant for users, even if it's Stable we might
 change it
  and break you without a deprecation path
   - Limited: meant only for a certain set of specified projects (eg we
 might
  say this API is only for use by Hive, and we'll change it so long as
 the
  hive people are OK with it)
   - Public: won't change without deprecation path for one major
 release
 
  He also built some cool tools to do jdiff and javadoc with these
  annotations
  taken into account (eg javadoc won't show private APIs)
 
  Are people interested in bringing this system over to HBase?
 
  -Todd
 
  On Wed, Feb 16, 2011 at 11:51 AM, Ryan Rawson ryano...@gmail.com
 wrote:
 
   Well done Andrew.
  
   People who want to know the API differences should probably mostly
 only
   read:
  
  
  
 
 https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html
  
   And specifically the HTable, Put, Get, Delete, Scan classes.
  
  
  
   On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell 
 apurt...@apache.org
   wrote:
I ran jdiff by hand. See:
   
   
  
 https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html
   
Best regards,
   
    - Andy
   
Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)
   
   
--- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote:
   
From: Lars George lars.geo...@gmail.com
Subject: Re: API changes between 0.20.6 and 0.90.1
To: dev@hbase.apache.org
Date: Wednesday, February 16, 2011, 1:22 AM
+1, I like that idea.
   
On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com
 
wrote:
 Hi Ted,

 I'd recommend setting up jdiff to answer this
question. Would be a good
 contribution to our source base to be able to run this
automatically and
 generate a report as part of our build. We do this in
Hadoop and it's very
 useful.

 -Todd

 On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com
wrote:

 Can someone tell me which classes from the list
below changed API between
 0.20.6 and 0.90.1 ?
 http://pastebin.com/TkZfPt52

 Thanks




 --
 Todd Lipcon
 Software Engineer, Cloudera

   
   
   
   
   
   
   
  
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 
 




 --
 Todd Lipcon
 Software Engineer, Cloudera







Re: HRegionInfo and HRegion

2011-02-11 Thread Ryan Rawson
HRegion is the internal implementation of a region inside the
regionserver; you don't get it from a client.

That data is being sent to the master, and it's being published to Ganglia
and the metrics system.

On Fri, Feb 11, 2011 at 1:23 PM, Ted Yu yuzhih...@gmail.com wrote:
 HTable can return region information:
  Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
 However, request count (HBASE-3507) is contained in HRegion.

 How do I access to HRegion for regions in a table ?

 Thanks



Re: [VOTE] HBase 0.90.1 rc0 is available for download

2011-02-11 Thread Ryan Rawson
I am generally +1, but we'll need another RC to address HBASE-3524.

Here is some of my other report of running this:

Been running a variant of this found here:

https://github.com/stumbleupon/hbase/tree/su_prod_90

Running in dev here at SU now.

Also been testing that against our Hadoop CDH3b2 patched with
HDFS-347.  In uncontended YCSB runs this did improve 'get' numbers
somewhat, but in a 15-thread contended test the average get time goes
from 12.1 ms to 6.9 ms.  We plan to test this more and roll it into our
production environment.  With 0.90.1 + a number of our patches and
Hadoop w/ HDFS-347, I loaded 30 GB in using YCSB.

Still working on getting VerifyingWorkload to run and verify this
data. But no exceptions.

-ryan

On Fri, Feb 11, 2011 at 7:10 PM, Andrew Purtell apurt...@apache.org wrote:
 Seems reasonable to stay -1 given HBASE-3524.

 This weekend I'm rolling RPMs of 0.90.1rc0 + ... a few patches (including 
 3524) ... for deployment to preproduction staging. Depending how that goes we 
 may have jiras and patches for you next week.

 Best regards,

    - Andy



 From: Stack st...@duboce.net
 Subject: Re: [VOTE] HBase 0.90.1 rc0 is available for download
 To: apurt...@apache.org
 Cc: dev@hbase.apache.org
 Date: Friday, February 11, 2011, 9:35 AM

 Yes.  We need to fix the assembly.  Its going to trip folks up.  I
 don't think it a sinker on the RC though, especially as we
 shipped 0.90.0 w/ this same issue.  What you think boss?

 St.Ack


 On Fri, Feb 11, 2011 at 9:30 AM, Andrew Purtell apurt...@apache.org
 wrote:
  No an earlier version from before that I failed to
 delete while moving jars around. So this is a user problem,
 but I forsee it coming up again and again.







Re: Build patched cdh3b2

2011-02-11 Thread Ryan Rawson
I put up the patch I used; I then changed the version to 0.20.2-322
and just did 'ant jar'. I crippled the Forrest crap in build.xml... I
didn't check the file size of the resulting jar though.

-ryan

On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote:
 Ryan:
 Can you share how you built patched cdh3b2 ?

 When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
 which was much larger than the official hadoop-core-0.20.2+320.jar
 hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
 in place of official jar.

 Thanks

 On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote:

 I am generally +1, but we'll need another RC to address HBASE-3524.

 Here is some of my other report of running this:

 Been running a variant of this found here:

 https://github.com/stumbleupon/hbase/tree/su_prod_90

 Running in dev here at SU now.

 Also been testing that against our Hadoop CDH3b2 patched in with
 HDFS-347.  In uncontended YCSB runs this did improve much 'get'
 numbers, but in a 15 thread contended test the average get time goes
 from 12.1 ms - 6.9ms.  We plan to test this more and roll in to our
 production environment.  With 0.90.1 + a number of our patches,
 Hadoopw/347 I loaded 30gb in using YCSB.

 Still working on getting VerifyingWorkload to run and verify this
 data. But no exceptions.

 -ryan





Re: Build patched cdh3b2

2011-02-11 Thread Ryan Rawson
my jar looks like:

-rw-r--r--  1 hadoop hadoop 2861459 2011-02-09 16:34 hadoop-core-0.20.2+322.jar

-ryan

On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com wrote:
 I put up the patch I used, I then changed the version to 0.20.2-322
 and just did ant jar. I crippled the forrest crap in build.xml... I
 didnt check the filesize of the resulting jar though.

 -ryan

 On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote:
 Ryan:
 Can you share how you built patched cdh3b2 ?

 When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
 which was much larger than the official hadoop-core-0.20.2+320.jar
 hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
 in place of official jar.

 Thanks

 On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote:

 I am generally +1, but we'll need another RC to address HBASE-3524.

 Here is some of my other report of running this:

 Been running a variant of this found here:

 https://github.com/stumbleupon/hbase/tree/su_prod_90

 Running in dev here at SU now.

 Also been testing that against our Hadoop CDH3b2 patched in with
 HDFS-347.  In uncontended YCSB runs this did improve much 'get'
 numbers, but in a 15 thread contended test the average get time goes
 from 12.1 ms - 6.9ms.  We plan to test this more and roll in to our
 production environment.  With 0.90.1 + a number of our patches,
 Hadoopw/347 I loaded 30gb in using YCSB.

 Still working on getting VerifyingWorkload to run and verify this
 data. But no exceptions.

 -ryan






Re: Build patched cdh3b2

2011-02-11 Thread Ryan Rawson
i call it 0.20.2-322 and its at http://people.apache.org/~rawson/repo/ (m2 repo)

for just the jar you can find it there.

On Fri, Feb 11, 2011 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote:
 Is it possible for you to share the hadoop-core-0.20.2+320.jar that you
 built ?

 Thanks

 On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com wrote:

 I put up the patch I used, I then changed the version to 0.20.2-322
 and just did ant jar. I crippled the forrest crap in build.xml... I
 didnt check the filesize of the resulting jar though.

 -ryan

 On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote:
  Ryan:
  Can you share how you built patched cdh3b2 ?
 
  When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
  which was much larger than the official hadoop-core-0.20.2+320.jar
  hadoop had trouble starting if I used
 hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
  in place of official jar.
 
  Thanks
 
  On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote:
 
  I am generally +1, but we'll need another RC to address HBASE-3524.
 
  Here is some of my other report of running this:
 
  Been running a variant of this found here:
 
  https://github.com/stumbleupon/hbase/tree/su_prod_90
 
  Running in dev here at SU now.
 
  Also been testing that against our Hadoop CDH3b2 patched in with
  HDFS-347.  In uncontended YCSB runs this did improve much 'get'
  numbers, but in a 15 thread contended test the average get time goes
  from 12.1 ms - 6.9ms.  We plan to test this more and roll in to our
  production environment.  With 0.90.1 + a number of our patches,
  Hadoopw/347 I loaded 30gb in using YCSB.
 
  Still working on getting VerifyingWorkload to run and verify this
  data. But no exceptions.
 
  -ryan
 
 
 




Re: Build patched cdh3b2

2011-02-11 Thread Ryan Rawson
Oh right, the groupId is com.cloudera not org.apache, so the other dir...

On Fri, Feb 11, 2011 at 10:41 PM, Ted Yu yuzhih...@gmail.com wrote:
 I don't see it under
 http://people.apache.org/~rawson/repo/org/apache/hadoop/hadoop-core/
 Should I look somewhere else ?

 On Fri, Feb 11, 2011 at 10:37 PM, Ryan Rawson ryano...@gmail.com wrote:

 i call it 0.20.2-322 and its at http://people.apache.org/~rawson/repo/ (m2
 repo)

 for just the jar you can find it there.

 On Fri, Feb 11, 2011 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote:
  Is it possible for you to share the hadoop-core-0.20.2+320.jar that you
  built ?
 
  Thanks
 
  On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com
 wrote:
 
  I put up the patch I used, I then changed the version to 0.20.2-322
  and just did ant jar. I crippled the forrest crap in build.xml... I
  didnt check the filesize of the resulting jar though.
 
  -ryan
 
  On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote:
   Ryan:
   Can you share how you built patched cdh3b2 ?
  
   When I used 'ant jar', I got
 build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
   which was much larger than the official hadoop-core-0.20.2+320.jar
   hadoop had trouble starting if I used
  hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar
   in place of official jar.
  
   Thanks
  
   On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com
 wrote:
  
   I am generally +1, but we'll need another RC to address HBASE-3524.
  
   Here is some of my other report of running this:
  
   Been running a variant of this found here:
  
   https://github.com/stumbleupon/hbase/tree/su_prod_90
  
   Running in dev here at SU now.
  
   Also been testing that against our Hadoop CDH3b2 patched in with
   HDFS-347.  In uncontended YCSB runs this did improve much 'get'
   numbers, but in a 15 thread contended test the average get time goes
   from 12.1 ms - 6.9ms.  We plan to test this more and roll in to our
   production environment.  With 0.90.1 + a number of our patches,
   Hadoopw/347 I loaded 30gb in using YCSB.
  
   Still working on getting VerifyingWorkload to run and verify this
   data. But no exceptions.
  
   -ryan
  
  
  
 
 




Re: initial experience with HBase 0.90.1 rc0

2011-02-10 Thread Ryan Rawson
You don't have both the old and the new hbase jars in there do you?

-ryan

On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu yuzhih...@gmail.com wrote:
 .META. went offline during second flow attempt.

 The time out I mentioned happened for 1st and 3rd attempts. HBase was
 restarted before the 1st and 3rd attempts.

 Here is jstack:
 http://pastebin.com/EHMSvsRt

 On Thu, Feb 10, 2011 at 3:04 PM, Stack st...@duboce.net wrote:

 So, .META. is not online?  What happens if you use shell at this time.

 Your attachement did not come across Ted.  Mind postbin'ing it?

 St.Ack

 On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote:
  I replaced hbase jar with hbase-0.90.1.jar
  I also upgraded client side jar to hbase-0.90.1.jar
 
  Our map tasks were running faster than before for about 50 minutes.
 However,
  map tasks then timed out calling flushCommits(). This happened even after
  fresh restart of hbase.
 
  I don't see any exception in region server logs.
 
  In master log, I found:
 
  2011-02-10 18:24:15,286 DEBUG
  org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
  -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
  2011-02-10 18:24:15,349 INFO
 org.apache.hadoop.hbase.catalog.CatalogTracker:
  Failed verification of .META.,,1 at address=null;
  org.apache.hadoop.hbase.NotServingRegionException:
  org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
  .META.,,1
  2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
  master:6-0x12e10d0e31e Creating (or updating) unassigned node for
  1028785192 with OFFLINE state
 
  I am attaching region server (which didn't respond to stop-hbase.sh)
 jstack.
 
  FYI
 
  On Thu, Feb 10, 2011 at 10:10 AM, Stack st...@duboce.net wrote:
 
  Thats probably enough Ted.  The 0.90.1 hbase-default.xml has an extra
  config. to enable the experimental HBASE-3455 feature but you can copy
  that over if you want to try playing with it (it defaults off so you'd
  copy over the config. if you wanted to set it to true).
 
  St.Ack
 
 




Re: initial experience with HBase 0.90.1 rc0

2011-02-10 Thread Ryan Rawson
As I suspected.

It's a byproduct of our Maven assembly process. The process could be
fixed; I wouldn't mind. I don't support runtime checking of jars:
there is such a thing as too many tests, and this is an example of it.
The check would then need a test, etc., etc.

At SU we use new directories for each upgrade, copying the config
over. With the lack of -default.xml this is easier than ever (just
copy everything in conf/).  A symlink switchover makes rolling
forward/back as simple as flipping the symlink. I have to recommend
this to everyone who doesn't have a management scheme.

On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu yuzhih...@gmail.com wrote:
 hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
 I wonder
 1. why hbase jar is placed in two directories - 0.20.6 didn't use such
 structure
 2. what from lib/hbase-0.90.0.jar could be picked up and why there wasn't
 exception in server log

 I think a JIRA should be filed for item 2 above - bail out when the two
 hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different versions.

 Cheers

 On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson ryano...@gmail.com wrote:

 What do you get when you:

 ls lib/hbase*

 I'm going to guess there is hbase-0.90.0.jar there



 On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu yuzhih...@gmail.com wrote:
  hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist
  Would this be a problem ?
 
  On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson ryano...@gmail.com wrote:
 
  You don't have both the old and the new hbase jars in there do you?
 
  -ryan
 
  On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu yuzhih...@gmail.com wrote:
   .META. went offline during second flow attempt.
  
   The time out I mentioned happened for 1st and 3rd attempts. HBase was
   restarted before the 1st and 3rd attempts.
  
   Here is jstack:
   http://pastebin.com/EHMSvsRt
  
   On Thu, Feb 10, 2011 at 3:04 PM, Stack st...@duboce.net wrote:
  
   So, .META. is not online?  What happens if you use shell at this
 time.
  
   Your attachement did not come across Ted.  Mind postbin'ing it?
  
   St.Ack
  
   On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote:
I replaced hbase jar with hbase-0.90.1.jar
I also upgraded client side jar to hbase-0.90.1.jar
   
Our map tasks were running faster than before for about 50 minutes.
   However,
map tasks then timed out calling flushCommits(). This happened even
  after
fresh restart of hbase.
   
I don't see any exception in region server logs.
   
In master log, I found:
   
2011-02-10 18:24:15,286 DEBUG
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened
  region
-ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
2011-02-10 18:24:15,349 INFO
   org.apache.hadoop.hbase.catalog.CatalogTracker:
Failed verification of .META.,,1 at address=null;
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not
  online:
.META.,,1
2011-02-10 18:24:15,350 DEBUG
  org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:6-0x12e10d0e31e Creating (or updating) unassigned
 node
  for
1028785192 with OFFLINE state
   
I am attaching region server (which didn't respond to
 stop-hbase.sh)
   jstack.
   
FYI
   
On Thu, Feb 10, 2011 at 10:10 AM, Stack st...@duboce.net wrote:
   
Thats probably enough Ted.  The 0.90.1 hbase-default.xml has an
 extra
config. to enable the experimental HBASE-3455 feature but you can
  copy
that over if you want to try playing with it (it defaults off so
  you'd
copy over the config. if you wanted to set it to true).
   
St.Ack
   
   
  
  
 
 




Re: Data upgrade from 0.89x to 0.90.0.

2011-02-10 Thread Ryan Rawson
We only major compact a region every 24 hours, therefore if it was JUST
compacted within the last 24 hours we skip it.

This is how it used to work, and how it should still work; not really
looking at code right now, busy elsewhere :-)

-ryan

On Thu, Feb 10, 2011 at 11:17 PM, James Kennedy
james.kenn...@troove.net wrote:
 Can you define 'come due'?

 The NPE occurs at the first isMajorCompaction() test in the main loop of 
 MajorCompactionChecker.
 That cycle is executed every 2.78 hours.
 Yet I know that I've kept healthy QA test data up and running for much longer 
 than that.


 James Kennedy
 Project Manager
 Troove Inc.

 On 2011-02-10, at 10:46 PM, Ryan Rawson wrote:

 I am speaking off the hip here, but the major compaction algorithm
 attempts to keep the number of major compactions to a minimum by
 checking the timestamp of the file. So it's possible that the other
 regions just 'didnt come due' yet.

 -ryan

 On Thu, Feb 10, 2011 at 10:42 PM, James Kennedy
 james.kenn...@troove.net wrote:
 I've tested HBase 0.90 + HBase-trx 0.90.0 and i've run it over old data 
 from 0.89x using a variety of seeded unit test/QA data and cluster 
 configurations.

 But when it came time to upgrade some production data I got snagged on 
 HBASE-3524. The gist of it is in Ryan's last points:

 * compaction is optional, meaning if it fails no data is lost, so you
 should probably be fine.

 * Older versions of the code did not write out time tracker data and
 that is why your older files were giving you NPEs.

 Makes sense.  But why did I not encounter this with my initial data 
 upgrades on very similar data pkgs?

 So I applied Ryan's patch, which simply assigns a default value 
 (Long.MIN_VALUE) when a StoreFile lacks a timeRangeTracker and I fixed 
 the data by forcing major compactions on the regions affected.  Preliminary 
 poking has not shown any instability in the data since.

 But I confess that I just don't have the time right now to really dig into 
 the code and validate that there are no more gotchya's or data corruption 
 that could have resulted.

 I guess the questions that I have for the team are:

 * What state would 9 out of 50 tables be in to miss the new 0.90.0 
 timeRangeTracker injection before the first major compaction check?
 * Where else is the new TimeRangeTracker used?  Could a StoreFile with a 
 null timeRangeTracker have corrupted the data in other subtler ways?
 * What other upgrade-related data changes might not have completed 
 elsewhere?

 Thanks,

 James Kennedy
 Project Manager
 Troove Inc.
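
A minimal sketch of the age-based check Ryan describes above (simplified;
the real isMajorCompaction() check in the store involves more conditions
than this, and the names below are illustrative only):

final class MajorCompactionCheckSketch {

  // Roughly hbase.hregion.majorcompaction: one day by default in this era.
  private final long majorCompactionPeriodMs = 24L * 60 * 60 * 1000;

  // If every store file was written within the last period, the region was
  // JUST major compacted, so it is skipped until the period elapses again.
  boolean isMajorCompactionDue(long oldestStoreFileTimestampMs, long nowMs) {
    return nowMs - oldestStoreFileTimestampMs >= majorCompactionPeriodMs;
  }
}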






Re: 0.90.1?

2011-02-01 Thread Ryan Rawson
Gary pointed this out on IRC when we were talking about making the tests
faster:

http://jira.codehaus.org/browse/SUREFIRE-656

TestNG has this support ready to roll _now_.

Basically we could have a 'smoke test' run for the release... Do the
larger integration tests outside the mvn release line or something.

Thoughts?
-ryan

On Tue, Feb 1, 2011 at 11:57 AM, Stack st...@duboce.net wrote:
 Oh, you have to use the release plugin if you want to get stuff into
 Apache repository -- else I'd sidestep it.
 St.Ack

 On Tue, Feb 1, 2011 at 11:56 AM, Stack st...@duboce.net wrote:
 On Tue, Feb 1, 2011 at 11:43 AM, Ryan Rawson ryano...@gmail.com wrote:
 A mvn release (to maven central) is different than our standard
 tarball (assembly:assembly), right?


 Yeah.  There is a 'release' mvn plugin that wants to 'help' you in the
 way that mswindows is always trying to help you; you know, Would you
 like to do XYZ? when you do NOT want to do XYZ.  It wants to update
 versions in poms, add tags to svn, put stuff into 'repositories', but
 you have to wrestle it to make it use right repository locations and
 version numbers.

 St.Ack




Re: Scan operator ignores setMaxVersions (0.20.6)

2011-01-27 Thread Ryan Rawson
How many versions is the column family configured for?  maxVersions
will never return more than that, so if it is 1 you won't
get more than 1.

-ryan

On Thu, Jan 27, 2011 at 3:08 PM, Vladimir Rodionov
vrodio...@carrieriq.com wrote:

 Although this version is not supported but may be somebody can advice how to 
 get ALL versions of rows from HTable scan?

 This code:

        public Iterator<OtaUploadWritable> getUploadsByProfileId(String 
 profile,
                        long start, long end) throws IOException {

                Scan scan = new Scan(getStartKey(profile), getEndKey(profile));
                scan.addColumn(COLFAMILY, COL_REF);
                scan.addColumn(COLFAMILY, COL_UPLOAD);
                scan.setTimeRange(start, end);
                scan.setMaxVersions();
                ResultScanner rs = this.getScanner(scan);
                return new ResultScannerIterator(rs);
        }

 does not seem work correctly (only last versions of rows get into the result 
 set)

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com
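
For reference, a small sketch of the interaction described above, written
against the 0.90-era client API (table, family and version numbers are made
up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class MaxVersionsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // The column family keeps at most 5 versions; older versions are dropped
    // at flush/compaction time, so no scan setting can bring them back.
    HTableDescriptor desc = new HTableDescriptor("uploads");
    HColumnDescriptor family = new HColumnDescriptor("d");
    family.setMaxVersions(5);
    desc.addFamily(family);
    new HBaseAdmin(conf).createTable(desc);

    HTable table = new HTable(conf, "uploads");
    Scan scan = new Scan();
    scan.setMaxVersions();   // ask for "all" versions...
    ResultScanner rs = table.getScanner(scan);
    // ...but each row can still contain at most the 5 versions the CF kept.
    rs.close();
  }
}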



Re: Looks like duplicate in MemoryStoreFlusher flushSomeRegions()

2011-01-25 Thread Ryan Rawson
The call to compactionRequested() only puts the region on a queue to
be compacted, so if there is unintended duplication, it won't actually
hold anything up.

-ryan

On Tue, Jan 25, 2011 at 6:05 PM, mac fang mac.had...@gmail.com wrote:
 Guys, since the flushCache will suspend writes/reads, I am NOT sure
 if it is necessary here.

 On Mon, Jan 24, 2011 at 1:48 PM, mac fang mac.had...@gmail.com wrote:

 Yes, I mean the server.compactSplitThread.compactionRequested(region,
 getName());

 in flushRegion, it will do the 
 server.compactSplitThread.compactionRequested(region,
 getName());

 Seems we don't need to do it again in the following logic (can you guys
 see the two highlighted spots: !flushRegion(biggestMemStoreRegion, true) and

     for (HRegion region : regionsToCompact) {
       server.compactSplitThread.compactionRequested(region, getName());
     }
 )

 regards
 macf


      if (!flushRegion(biggestMemStoreRegion, true)) {
          LOG.warn("Flush failed");
          break;
        }
        regionsToCompact.add(biggestMemStoreRegion);
      }
      for (HRegion region : regionsToCompact) {
        server.compactSplitThread.compactionRequested(region, getName());
      }

   in flushRegion
  
    private boolean flushRegion(final HRegion region, final boolean
   emergencyFlush) {
      synchronized (this.regionsInQueue) {
        FlushQueueEntry fqe = this.regionsInQueue.remove(region);
        if (fqe != null  emergencyFlush) {
          // Need to remove from region from delay queue.  When NOT an
          // emergencyFlush, then item was removed via a flushQueue.poll.
          flushQueue.remove(fqe);
        }
        lock.lock();
      }
      try {
        if (region.flushcache()) {
          server.compactSplitThread.compactionRequested(region,
 getName());
        }

 On Mon, Jan 24, 2011 at 6:40 AM, Ted Yu yuzhih...@gmail.com wrote:

 I think he was referring to this line:

 server.compactSplitThread.compactionRequested(region, getName());

 On Sun, Jan 23, 2011 at 10:52 AM, Stack st...@duboce.net wrote:

  Hello Mac Fang:  Which lines in the below?  Your colorizing didn't
  come across in the mail.  Thanks, St.Ack
 
  On Sun, Jan 23, 2011 at 6:23 AM, mac fang mac.had...@gmail.com wrote:
   Hi, guys,
  
   see the below codes in* MemStoreFlusher.java*, i am not sure if those
   lines
   in orange are the same and looks like they are trying to do the same
  logic.
   Are they redundant?
  
   regards
   macf
  
      if (!flushRegion(biggestMemStoreRegion, true)) {
          LOG.warn(Flush failed);
          break;
        }
        regionsToCompact.add(biggestMemStoreRegion);
      }
      for (HRegion region : regionsToCompact) {
        server.compactSplitThread.compactionRequested(region, getName());
      }
  
   in flushRegion
  
    private boolean flushRegion(final HRegion region, final boolean
   emergencyFlush) {
      synchronized (this.regionsInQueue) {
        FlushQueueEntry fqe = this.regionsInQueue.remove(region);
        if (fqe != null  emergencyFlush) {
          // Need to remove from region from delay queue.  When NOT an
          // emergencyFlush, then item was removed via a flushQueue.poll.
          flushQueue.remove(fqe);
        }
        lock.lock();
      }
      try {
        if (region.flushcache()) {
          server.compactSplitThread.compactionRequested(region,
 getName());
        }
  
 






Re: parallelizing HBaseAdmin.flush()

2011-01-24 Thread Ryan Rawson
Don't forget that this won't parallelize the flushes or compactions
themselves, since they happen region-server side and there are built-in
limits there to keep I/O down.

This will accelerate sending all the command messages, though.

-ryan

On Mon, Jan 24, 2011 at 11:18 AM, Ted Yu yuzhih...@gmail.com wrote:
 https://issues.apache.org/jira/browse/HBASE-3471 is created

 On Mon, Jan 24, 2011 at 10:56 AM, Jean-Daniel Cryans 
 jdcry...@apache.orgwrote:

 I'd guess so and the same could be done for splits and compactions
 since it's (almost) the same code.

 J-D

 On Sat, Jan 22, 2011 at 8:00 AM, Ted Yu yuzhih...@gmail.com wrote:
  In 0.90, HBaseAdmin.flush() uses a loop to go over List<Pair<HRegionInfo,
  HServerAddress>>.
 
  Should executor service be used for higher parallelism ?
 
  Thanks
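
A minimal sketch of the idea behind HBASE-3471: issue the per-region flush
requests from a thread pool instead of a serial loop. This is illustrative
only; the flushOneRegion() callback stands in for whatever the admin client
actually does per region, and as Ryan notes above it only parallelizes the
RPCs, not the server-side work.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFlushSketch {

  // Stand-in for the per-region call HBaseAdmin.flush() makes in its loop.
  interface RegionFlusher {
    void flushOneRegion(String regionName) throws Exception;
  }

  // Submit all per-region flush requests concurrently instead of one by one.
  static void flushAll(final RegionFlusher flusher, List<String> regionNames,
      int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (final String region : regionNames) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            flusher.flushOneRegion(region);
          } catch (Exception e) {
            e.printStackTrace(); // real code would collect and report failures
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}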
 




Re: Items to contribute (plan)

2011-01-22 Thread Ryan Rawson
Hopefully, to do #1 you would not require many (or any) changes in HFile or
HBase.  Implementing the HDFS stream API should be enough.

#2 is interesting; what is the benefit?  How did you measure said benefit?

-ryan

On Sat, Jan 22, 2011 at 5:45 PM, Ted Yu yuzhih...@gmail.com wrote:
 #1 looks similar to what MapR has done.

 On Sat, Jan 22, 2011 at 5:18 PM, Tatsuya Kawano tatsuya6...@gmail.comwrote:


 Hi,

 I wanted to let you know that I'm planning to contribute the following
 items to the HBase community. These are my spare time projects and I'll only
 be able to spend my time about 7 hours a week, so the progress will be very
 slow. I want some feedback from you guys to prioritize them. Also, if
 someone/team wants to work on them (with me or alone), I'll be happy to
 provide more details.


 1. RADOS integration

 Run HBase not only on HDFS but also RADOS distributed object store (the
 lower layer of Ceph), so that the following options will become available to
 HBase users:

 -- No SPOF (RADOS doesn't have the name node(s), but only ZK-like monitors
 and data nodes)
 -- Instant backup of HBase tables (RADOS provides copy-on-write snapshot
 per object pool)
 -- Extra durability option on WAL (RADOS can do both synchronous and
 asynchronous disk flush. HDFS doesn't have the former option)

 Note:
 RADOS object = HFile, WAL
 object pool = group of HFiles or WAL

 Current status: Design phase


 2. mapreduce.HFileInputFormat

 MR library to read data directly from HFiles. (Roughly 2.5 times faster
 than TableInputFormat in my tests)

 Current status: Completed a proof-of-concept prototype and measured
 performance.


 3. Enhance Get/Scan performance of RS

 Add an hash code and a couple of flags to HFile at the flush time and
 change scanner implementation so that:

 -- Get/Scan operations will get faster. (fewer key comparisons for
 reconstructing a row: O(h * c) -> O(h).  [h = number of HFiles for the row,
 c = number of columns in an HFile])
 -- The size of HFiles will become a bit smaller. (The flags will eliminate
 duplicate bytes in keys (row, column family and qualifier) from HFiles.)

 Current status: Completed a proof-of-concept prototype and measured
 performance.

 Detals:
 https://github.com/tatsuya6502/hbase-mr-pof/
 (I meant poc not pof...)


 4. Writing Japanese books and documents

 -- Currently I'm authoring a book chapter about HBase for a Japanese NOSQL
 book
 -- I'll translate The Apache HBase Book to Japanese


 Thank you,


 --
 Tatsuya Kawano (Mr.)
 Tokyo, Japan

 http://twitter.com/#!/tatsuya6502 http://twitter.com/#%21/tatsuya6502






Re: VERSIONS in Shell

2011-01-18 Thread Ryan Rawson
The parse code inside table.rb is wacky; maybe this fixes it:

diff --git a/src/main/ruby/hbase/table.rb b/src/main/ruby/hbase/table.rb
index c8e0076..cd90132 100644
--- a/src/main/ruby/hbase/table.rb
+++ b/src/main/ruby/hbase/table.rb
@@ -138,19 +138,17 @@ module Hbase
   get.addFamily(family)
 end
   end
-
-  # Additional params
-  get.setMaxVersions(args[VERSIONS] || 1)
-  get.setTimeStamp(args[TIMESTAMP]) if args[TIMESTAMP]
 else
   # May have passed TIMESTAMP and row only; wants all columns from ts.
-  unless ts = args[TIMESTAMP]
-raise ArgumentError, "Failed parse of #{args.inspect}, #{args.class}"
+  if ts = args[TIMESTAMP]
+# Set the timestamp
+   get.setTimeStamp(ts.to_i)
   end
-
-  # Set the timestamp
-  get.setTimeStamp(ts.to_i)
 end
+
+# Additional params
+get.setMaxVersions(args[VERSIONS] || 1)
+get.setTimeStamp(args[TIMESTAMP]) if args[TIMESTAMP]
   end

   # Call hbase for the results


On Tue, Jan 18, 2011 at 12:36 AM, Lars George lars.geo...@gmail.com wrote:
 Hi,

 On hbase-0.89.20100924+28 I tried to get all versions for a cell that
 has 3 versions and on the shell I got:

hbase(main):014:0> get 'hbase_table_1', '498', {VERSIONS=>10}
COLUMN                                        CELL

ERROR: Failed parse of {VERSIONS=>10}, Hash

 Here is some help for this command:
          Get row or cell contents; pass table name, row, and optionally
          a dictionary of column(s), timestamp and versions. Examples:

            hbase> get 't1', 'r1'
            hbase> get 't1', 'r1', {COLUMN => 'c1'}
            hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
            hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
            hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1,
 VERSIONS => 4}
            hbase> get 't1', 'r1', 'c1'
            hbase> get 't1', 'r1', 'c1', 'c2'
            hbase> get 't1', 'r1', ['c1', 'c2']


 hbase(main):015:0> scan 'hbase_table_1', { STARTROW=>'498',
 STOPROW=>'498', VERSIONS=>10}
 ROW                                           COLUMN+CELL
  498                                          column=cf1:val,
 timestamp=1295335912913, value=val_498
  498                                          column=cf1:val,
 timestamp=1295335912913, value=val_498
  498                                          column=cf1:val,
 timestamp=1295335912913, value=val_498
 1 row(s) in 0.0520 seconds

 hbase(main):016:0>

 So the scan works but not the get. That's wrong, right?

 Lars



Re: zookeeper.session.timeout

2011-01-18 Thread Ryan Rawson
No, it does not; ZooKeeper fixed that.

-ryan

On Tue, Jan 18, 2011 at 3:29 PM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 In hbase 0.20.6, I see the following in description for
 zookeeper.session.timeout:

 The current implementation
      requires that the timeout be a minimum of 2 times the tickTime
      (as set in the server configuration) and a maximum of 20 times
      the tickTime. Set the zk ticktime with
 hbase.zookeeper.property.tickTime.
      In milliseconds.

 Does the above hold for hbase 0.90 as well ?

 Thanks
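
For illustration, both properties mentioned in this thread can also be set
programmatically on the client/site configuration; a minimal sketch (the
numeric values are just examples, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkTimeoutConfigSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // How long a regionserver can go silent before its ZK session expires.
    conf.setInt("zookeeper.session.timeout", 60000);         // 60s, example
    // Tick time for the HBase-managed ZooKeeper quorum.
    conf.setInt("hbase.zookeeper.property.tickTime", 3000);  // 3s, example
    System.out.println(conf.get("zookeeper.session.timeout"));
  }
}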



Re: Zookeeper tuning, was: YouAreDeadException

2011-01-15 Thread Ryan Rawson
Also remember that with higher session timeouts it takes longer to discover
that a regionserver is dead.  So it's a trade-off.

On Sat, Jan 15, 2011 at 6:37 PM, Ted Yu yuzhih...@gmail.com wrote:
 I want region server to be more durable.
 If zookeeper.session.timeout is set high, it takes master long to discover
 dead region server.

 Can you share zookeeper tuning experiences ?

 Thanks

 On Sat, Jan 15, 2011 at 5:14 PM, Stack st...@duboce.net wrote:

 Yes.

 Currently, there are two heartbeats: the zk client one and then the
 hbase which used to be what we relied on figuring whether a
 regionserver is alive but now its just used to post the master the
 regionserver stats such as requests per second.  This latter is going
 away in 0.92 (Pre-0.90.0 regionserver and master would swap 'messages'
 on the back of the heartbeat -- i.e. open this region, I've just split
 region X, etc. but now 90% of this stuff is done via zk.  In 0.92.
 we'll finish the cleanup).

 Hope this helps,
 St.Ack

 On Sat, Jan 15, 2011 at 5:03 PM, Ted Yu yuzhih...@gmail.com wrote:
  For #1, I assume I should look for 'received expired from ZooKeeper,
  aborting'
 
  On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  For #1, what string should I look for in region server log ?
  For #4, what's the rationale behind sending YADE after receiving
 heartbeat
  ? I thought heartbeat means the RS is alive.
 
  Thanks
 
 
  On Sat, Jan 15, 2011 at 4:49 PM, Stack st...@duboce.net wrote:
 
  FYI Ted, the YourAreDeadException usually happens in following context:
 
  1. Regionserver has some kinda issue -- long GC pause for instance --
  and it stops tickling zk.
  2. Master gets zk session expired event.  Starts up recovery of the
 hung
  region.
  3. Regionserver recovers but has not yet processed its session expired
  event.  It heartbeats the Master as though nothing wrong.
  4. Master is mid-recovery or beyond server recovery and on receipt of
  the heartbeat in essence tells the regionserver to 'go away' by
  sending him the YouAreDeadException.
  5. By now the regionserver will have gotten its session expired
  notification and will have started an abort so the YADE is not news
  when it receives the exception.
 
  St.Ack
 
 
  On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu yuzhih...@gmail.com wrote:
   Thanks for your analysis, Ryan.
   The dev cluster has half as many nodes as our staging cluster. Each
 node
  has
   half the number of cores as the node in staging.
  
   I agree with your conclusion.
  
   I will report back after I collect more data - the flow uses hbase
  heavily
   toward the end.
  
   On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson ryano...@gmail.com
  wrote:
  
   I'm seeing not much in the way of errors, timeouts, all to one
 machine
   ending with .80, so that is probably your failed node.
  
   Other than that, the log doesnt seem to say too much.  Searching for
   strings like FATAL and Exception is the way to go here.
  
   Also things like this:
   2011-01-14 23:38:52,936 INFO
   org.apache.hadoop.hbase.master.AssignmentManager: Region has been
   PENDING_OPEN for too long, reassigning region=
  
  
 
 NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v
   [h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602
   7.2c40637c6c648a67162cc38d8c6d8ee9.
  
  
   Guessing, I'd probably say your nodes hit some performance wall,
 with
   io-wait, or networking, or something, and Regionserver processes
   stopped responding, but did not time out from zookeeper yet... so
 you
   would run into a situation where some nodes are unresponsive, so any
   data hosted there would be difficult to talk to.  Until the
   regionserver times out it's zookeeper node, the master doesnt know
   about the fault of the regionserver.
  
   The master web UI is probably inaccessible because the META table is
   on a regionserver that went AWOL.  You should check your load, your
   ganglia graphs.  Also remember, despite having lots of disks, each
   node is a gigabit ethernet which means about 110-120 MB/sec.  It's
   quite possible you are running into network limitations, remember
 that
   regionservers must write to 2 additional datanodes, and there will
 be
   overlap, thus you have to share some of that 110-120MB/sec per node
   figure with other nodes, not to mention that you also need to factor
   inbound bandwidth (from client -> hbase regionserver) and outbound
   bandwidth (from datanode replica 1 -> dn replica 2).
  
   -ryan
  
   On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu yuzhih...@gmail.com
 wrote:
Now I cannot access master web UI, This happened after I doubled
 the
   amount
of data processed in our flow.
   
I am attaching master log.
   
On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson ryano...@gmail.com
  wrote:
   
This is the cause:
   
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
 region
   server
serverName=sjc1-hadoop1.sjc1.carrieriq.com

Re: ANN: The fourth hbase 0.90.0 release candidate is available for download X

2011-01-14 Thread Ryan Rawson
Hey,

It's a pretty unusual situation that got you into 3445?  It's been a
few weeks of RCs, and we need to push out a 0.90.0 so everyone can
   benefit from it.  We can release point releases fairly quickly once a
   stable base release is out; does that sound reasonable to you?

Thanks for testing!
-ryan

On Fri, Jan 14, 2011 at 2:14 PM, James Kennedy james.kenn...@troove.net wrote:
 -1 for the following bug:

 https://issues.apache.org/jira/browse/HBASE-3445

 Note however that aside from this issue RC 3 looks pretty stable:

 * All HBase tests pass (on a Mac)
 * All hbase-trx tests pass after I upgraded 
 https://github.com/hbase-trx/hbase-transactional-tableindexed
 * All tests pass in our web app.
 * Our application performs well on local machine.

 * Still todo after 3445 fixed:  Full cluster testing


 James Kennedy

 On 2011-01-07, at 5:03 PM, Stack wrote:

 The fourth hbase 0.90.0 release candidate is available for download:

 http://people.apache.org/~stack/hbase-0.90.0-candidate-3/

 This is going to be the one!

 Should we release this candidate as hbase 0.90.0?  Take it for a spin.
 Check out the doc., etc.  Vote +1/-1 by next Friday, the 14th of January.

 HBase 0.90.0 is the major HBase release that follows 0.20.0 and the
 fruit of the 0.89.x development release series we've been running of
 late.

 Over 1k issues have been closed since 0.20.0.  Release notes are
 available here: http://su.pr/8LbgvK.

 HBase 0.90.0 runs on Hadoop 0.20.x.  It does not currently run on
 Hadoop 0.21.0 nor on Hadoop TRUNK.   HBase will lose data unless it is
 running on an Hadoop HDFS 0.20.x that has a durable sync. Currently
 only the branch-0.20-append branch [1] has this attribute (See
 CHANGES.txt [3] in branch-0.20-append to see the list of patches
 involved adding an append). No official releases have been made from
 this branch as yet so you will have to build your own Hadoop from the
 tip of this branch, OR install Cloudera's CDH3 [2] (Its currently in
 beta).  CDH3b2 or CDHb3 have the 0.20-append patches needed to add a
 durable sync. If using CDH, be sure to replace the hadoop jars that
 are bundled with HBase with those from your CDH distribution.

 There is no migration necessary.  Your data written with HBase 0.20.x
 (or with HBase 0.89.x) is readable by HBase 0.90.0.  A shutdown and
 restart after putting in place the new HBase should be all thats
 involved.  That said, once done, there is no going back to 0.20.x once
 the transition has been made.   HBase 0.90.0 and HBase 0.89.x write
 region names differently in the filesystem.  Rolling restart from
 0.20.x or 0.89.x to 0.90.0RC1 will not work.

 Yours,
 The HBasistas
 P.S. For why the version 0.90 and whats new in HBase 0.90, see slides
 4-10 in this deck [4]

 1. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append
 2. http://archive.cloudera.com/docs/
 3. 
 http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt
 4. http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/




Re: YouAreDeadException

2011-01-14 Thread Ryan Rawson
This is the cause:

org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:

Why did the session expire?  Typically it's GC; what do your GC logs
say?  Otherwise, network issues perhaps?  Swapping?  Other machine-related
system problems?

-ryan


On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu yuzhih...@gmail.com wrote:
 I ran 0.90 RC3 in dev cluster.

 I saw the following in region server log:

 Caused by: org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
 currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as
 dead server
    at
 org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
    at
 org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
    at
 org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at
 org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)

    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
    at
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy0.regionServerReport(Unknown Source)
    at
 org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
    ... 2 more
 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating
 client connection,
 connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181sessionTimeout=9
 watcher=hconnection
 2011-01-13 03:55:08,914 FATAL
 org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
 serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
 load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
 regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
 received expired from ZooKeeper, aborting
 org.apache.zookeeper.KeeperException$SessionExpiredException:
 KeeperErrorCode = Session expired
    at
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
    at
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
    at
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)

 ---

 And the following from master log:

 2011-01-13 03:52:42,003 INFO
 org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
 ephemeral node deleted, processing expiration [
 sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
 Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers,
 submitted shutdown handler to be executed, root=false, meta=false
 2011-01-13 03:52:42,005 INFO
 org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
 for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
 2011-01-13 03:52:42,092 INFO
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s)
 in hdfs://
 sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
 2011-01-13 03:52:42,093 DEBUG
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
 Thread[WriterThread-0,5,main]: starting
 2011-01-13 03:52:42,094 DEBUG
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
 Thread[WriterThread-1,5,main]: starting
 2011-01-13 03:52:42,096 DEBUG
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of
 1: hdfs://
 sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407,
 length=0

 Please advise what could be the cause.

 Thanks



Re: YouAreDeadException

2011-01-14 Thread Ryan Rawson
I'm not seeing much in the way of errors; the timeouts are all to one machine
ending with .80, so that is probably your failed node.

Other than that, the log doesnt seem to say too much.  Searching for
strings like FATAL and Exception is the way to go here.

Also things like this:
2011-01-14 23:38:52,936 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_OPEN for too long, reassigning region=
NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602
7.2c40637c6c648a67162cc38d8c6d8ee9.


Guessing, I'd probably say your nodes hit some performance wall, with
io-wait, or networking, or something, and Regionserver processes
stopped responding, but did not time out from zookeeper yet... so you
would run into a situation where some nodes are unresponsive, so any
data hosted there would be difficult to talk to.  Until the
regionserver times out its zookeeper node, the master doesn't know
about the fault of the regionserver.

The master web UI is probably inaccessible because the META table is
on a regionserver that went AWOL.  You should check your load, your
ganglia graphs.  Also remember, despite having lots of disks, each
node has gigabit ethernet, which means about 110-120 MB/sec.  It's
quite possible you are running into network limitations: regionservers
must write to 2 additional datanodes, and there will be overlap, thus
you have to share some of that 110-120 MB/sec per-node figure with
other nodes, not to mention that you also need to factor in inbound
bandwidth (from client -> hbase regionserver) and outbound bandwidth
(from datanode replica 1 -> dn replica 2).

-ryan
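
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch; the ingest figure is made up, the 115 MB/sec is the usable-GigE estimate from above, and it deliberately ignores full duplex, locality of the first replica, and compression.

public class BandwidthBudget {
  public static void main(String[] args) {
    double nicMBs = 115.0;    // rough usable gigabit throughput per node
    double ingestMBs = 50.0;  // hypothetical client write rate hitting one regionserver
    int extraReplicas = 2;    // each edit is forwarded to 2 additional datanodes
    double replicationMBs = ingestMBs * extraReplicas;  // pipeline traffic on top of client writes
    double total = ingestMBs + replicationMBs;
    System.out.printf("client in: %.0f MB/s + replication: %.0f MB/s = %.0f of ~%.0f MB/s%n",
        ingestMBs, replicationMBs, total, nicMBs);
  }
}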

On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu yuzhih...@gmail.com wrote:
 Now I cannot access master web UI, This happened after I doubled the amount
 of data processed in our flow.

 I am attaching master log.

 On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson ryano...@gmail.com wrote:

 This is the cause:

 org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
 serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
 load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
 regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
 received expired from ZooKeeper, aborting
 org.apache.zookeeper.KeeperException$SessionExpiredException:

 Why did the session expire?  Typically it's GC, what does your GC logs
 say?  Otherwise, network issues perhaps?  Swapping?  Other machine
 related systems problems?

 -ryan


 On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu yuzhih...@gmail.com wrote:
  I ran 0.90 RC3 in dev cluster.
 
  I saw the following in region server log:
 
  Caused by: org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
  currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
  as
  dead server
     at
 
  org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
     at
 
  org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
     at
 
  org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
     at
 
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at
  org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
     at
 
  org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
 
     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
     at
  org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
     at $Proxy0.regionServerReport(Unknown Source)
     at
 
  org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
     ... 2 more
  2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating
  client connection,
  connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181sessionTimeout=9
  watcher=hconnection
  2011-01-13 03:55:08,914 FATAL
  org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
  server
  serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
  load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
  regionserver:60020-0x12d7b7b1c760004
  regionserver:60020-0x12d7b7b1c760004
  received expired from ZooKeeper, aborting
  org.apache.zookeeper.KeeperException$SessionExpiredException:
  KeeperErrorCode = Session expired
     at
 
  org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
     at
 
  org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
     at
 
  org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
     at
  org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506

Re: question about Hregion.incrementColumnValue

2011-01-10 Thread Ryan Rawson
There was a good, but complex reason. It's going away with Stack's timestamp
patch. I'll see if I can do a better email tomorrow.
On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com wrote:
 I am looking at Hregion.incrementColumnValue(). It has the following piece
 of code

 // build the KeyValue now:
 3266 KeyValue newKv = new KeyValue(row, family,
 3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(),
 3268 Bytes.toBytes(result));
 3269
 3270 // now log it:
 3271 if (writeToWAL) {
 3272 long now = EnvironmentEdgeManager.currentTimeMillis();
 3273 WALEdit walEdit = new WALEdit();
 3274 walEdit.add(newKv);
 3275 this.log.append(regionInfo,
 regionInfo.getTableDesc().getName(),
 3276 walEdit, now);
 3277 }

 It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once for
 creating the new KV and then another time to add it to the WAL. Is this
 significant or just an oversight? Can we instead invoke it once before we
 create the new key-value and then use it for both code paths?

 Thanks,
 dhruba

 --
 Connect to me at http://www.facebook.com/dhruba


Re: question about Hregion.incrementColumnValue

2011-01-10 Thread Ryan Rawson
I put more comments on this:
HBASE-3021

Basically we needed to avoid duplicate timestamp KVs in memstore &
hfile; otherwise we might end up 'getting' the wrong value and thus
messing up the count.

With work on ACID by stack we can avoid using that.

-ryan

On Mon, Jan 10, 2011 at 11:23 AM, Stack st...@duboce.net wrote:
 Yeah, thats going away unless Ryan comes up w/ a reason for why we
 should keep it.
 St.Ack

 On Mon, Jan 10, 2011 at 12:29 AM, Ryan Rawson ryano...@gmail.com wrote:
 There was a good, but complex reason. Its going away with stacks time stamp
 patch. I'll see if I can do a better email tomorrow.
 On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com wrote:
 I am looking at Hregion.incrementColumnValue(). It has the following piece
 of code

 // build the KeyValue now:
 3266 KeyValue newKv = new KeyValue(row, family,
 3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(),
 3268 Bytes.toBytes(result));
 3269
 3270 // now log it:
 3271 if (writeToWAL) {
 3272 long now = EnvironmentEdgeManager.currentTimeMillis();
 3273 WALEdit walEdit = new WALEdit();
 3274 walEdit.add(newKv);
 3275 this.log.append(regionInfo,
 regionInfo.getTableDesc().getName(),
 3276 walEdit, now);
 3277 }

 It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once for
 creating the new KV and then another time to add it to the WAL. Is this
 significant or just an oversight? Can we instead invoke it once before we
 create the new key-value and then use it for both code paths?

 Thanks,
 dhruba

 --
 Connect to me at http://www.facebook.com/dhruba




Re: question about Hregion.incrementColumnValue

2011-01-10 Thread Ryan Rawson
That is just an artifact of the way the code was written; the
math.max() and +1 code was the guarantee.  Remember, without that code
the old ICV would _LOSE DATA_.  So a little hacking was in order.

I expect to clean up this with the HBASE-2856 patch.

-ryan
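
A self-contained illustration of the math.max()/+1 guarantee being discussed (not the HRegion code itself): two increments that land in the same millisecond still get strictly increasing timestamps, so a subsequent read cannot pick up a stale duplicate.

public class IncrementTimestamp {
  private long lastTs = Long.MIN_VALUE;

  public synchronized long nextTimestamp() {
    long now = System.currentTimeMillis();
    lastTs = Math.max(now, lastTs + 1);  // never reuse a timestamp or go backwards
    return lastTs;
  }

  public static void main(String[] args) {
    IncrementTimestamp ts = new IncrementTimestamp();
    long a = ts.nextTimestamp();
    long b = ts.nextTimestamp();  // same millisecond, still strictly greater
    System.out.println(a + " < " + b + " : " + (a < b));
  }
}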

On Mon, Jan 10, 2011 at 4:34 PM, Jonathan Gray jg...@fb.com wrote:
 How does doing currentTimeMillis() twice in a row guarantee different 
 timestamps?  And in this case, we're talking about the MemStore vs. HLog not 
 HFile.

 There is another section of the code where there is a timestamp+1 to avoid 
 duplicates but this is something else.

 -Original Message-
 From: Ryan Rawson [mailto:ryano...@gmail.com]
 Sent: Monday, January 10, 2011 2:27 PM
 To: dev@hbase.apache.org
 Subject: Re: question about Hregion.incrementColumnValue

 I put more comments on this:
 HBASE-3021

 Basically we needed to avoid duplicate timestamp KVs in memstore & hfile;
 otherwise we might end up 'getting' the wrong value and thus messing up the
 count.

 With work on ACID by stack we can avoid using that.

 -ryan

 On Mon, Jan 10, 2011 at 11:23 AM, Stack st...@duboce.net wrote:
  Yeah, thats going away unless Ryan comes up w/ a reason for why we
  should keep it.
  St.Ack
 
  On Mon, Jan 10, 2011 at 12:29 AM, Ryan Rawson ryano...@gmail.com
 wrote:
  There was a good, but complex reason. Its going away with stacks time
  stamp patch. I'll see if I can do a better email tomorrow.
  On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com
 wrote:
  I am looking at Hregion.incrementColumnValue(). It has the following
  piece of code
 
  // build the KeyValue now:
  3266 KeyValue newKv = new KeyValue(row, family,
  3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(),
  3268 Bytes.toBytes(result));
  3269
  3270 // now log it:
  3271 if (writeToWAL) {
  3272 long now = EnvironmentEdgeManager.currentTimeMillis();
  3273 WALEdit walEdit = new WALEdit();
  3274 walEdit.add(newKv);
  3275 this.log.append(regionInfo,
  regionInfo.getTableDesc().getName(),
  3276 walEdit, now);
  3277 }
 
  It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once
  for creating the new KV and then another time to add it to the WAL.
  Is this significant or just an oversight? Can we instead invoke it
  once before we create the new key-value and then use it for both code
 paths?
 
  Thanks,
  dhruba
 
  --
  Connect to me at http://www.facebook.com/dhruba
 
 



Re: Good VLDB paper on WALs

2010-12-29 Thread Ryan Rawson
Oh no, let's be wary of those server rewrites.  My micro profiling is
showing about 30 usec for a lock handoff in the HBase client...

I think we should be able to get big wins with minimal things.  A big
rewrite has its major costs; not to mention, to be effectively async
we'd have to rewrite every single piece of code more complex than
Bytes.*.  If you need to block you will need to push context onto a
context-store (aka stack) and manage all of that ourselves.

I've been seeing papers that talk about threading improvements
that could get us better performance.  That assumes ctx switching is the
actual reason why we aren't as fast as we could be (note: we are NOT slow!).

As for the DI, I think I'd like to see more study on the costs and
benefits.  We have a relatively small number of interfaces and
concrete objects, and for the interfaces we do have, there are 1 or 2
implementations at most, usually 1.  There is a cost, and I'd like to see
more description of the costs vs. the benefits.

-ryan
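
For readers following the ELR/flush-pipelining discussion, here is a compact, purely illustrative sketch of the idea Dhruba describes below (release the handler thread early, ack the client only after the sync); the queue/future plumbing is invented for the example and is not the HBase or HDFS WAL code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

public class GroupCommitSketch {
  static final class Pending {
    final byte[] edit;
    final CompletableFuture<Void> durable = new CompletableFuture<Void>();
    Pending(byte[] edit) { this.edit = edit; }
  }

  private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<Pending>();

  // Handler thread: enqueue and return immediately; the RPC layer would hold
  // the client response until the returned future completes.
  public CompletableFuture<Void> append(byte[] edit) {
    Pending p = new Pending(edit);
    queue.add(p);
    return p.durable;
  }

  // Single syncer thread: drain whatever accumulated, sync once, ack the batch.
  public void syncLoop() throws InterruptedException {
    while (true) {
      List<Pending> batch = new ArrayList<Pending>();
      batch.add(queue.take());
      queue.drainTo(batch);
      // (stand-in) write the batch to the log and force it to disk here
      for (Pending p : batch) {
        p.durable.complete(null);
      }
    }
  }
}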

On Wed, Dec 29, 2010 at 11:32 AM, Stack st...@duboce.net wrote:
 Nice list of things we need to do to make logging faster (with useful
 citations on current state of art).  This notion of early lock release
 (ELR) is worth looking into (Jon, for high rates of counter
 transactions, you've been talking about aggregating counts in front of
 the WAL lock... maybe an ELR and then a hold on the transaction until
 confirmation of flush would be way to go?).  Regards flush-pipelining,
 it would be interesting to see if there are traces of the sys-time
 that Dhruba is seeing in his NN out in HBase servers.  My guess is
 that its probably drowned by other context switches done in our
 servers.  Definitely worth study.

 St.Ack
 P.S. Minimizing context switches, a system for ELR and
 flush-pipelining, recasting the server to make use of one of the DI or
 OSGi frameworks, moving off log4j, etc. Is it just me or do others
 feel a server rewrite coming on?


 On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur dhr...@gmail.com wrote:
 HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
 written to disk. In earlier deployments, I thought we could safely ignore
 flush-pipelining by creating more server threads. But in our largest HDFS
 systems, I am starting to see > 20% sys-time usage on the namenode machine;
 most of this  could be thread scheduling. If so, then it makes sense to
 enhance the logging code to release server threads even before the WAL is
 flushed to disk (but, of course, we still have to delay the transaction
 response to the client till the WAL is synced to disk).

 Does anybody have any idea on how to figure out what percentage of the above
 sys-time is spent in thread scheduling vs the time spent in other system
 calls (especially in the Namenode context)?

 thanks,
 dhruba


 On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon t...@cloudera.com wrote:

 Via Hammer - I thought this was a pretty good read, some good ideas for
 optimizations for our WAL.

 http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

 -Todd
 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Connect to me at http://www.facebook.com/dhruba




Re: deploying hbase 0.90 to internal maven repository

2010-12-29 Thread Ryan Rawson
Just run 'mvn install' in our directory and that should do the trick.
Everything else is implied by pom.xml, well, except the repository
stuff.

-ryan

On Wed, Dec 29, 2010 at 10:29 AM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 I used the following script to deploy hbase 0.90 jar to internal maven
 repository but was not successful:

 #!/usr/bin/env bash
 set -x
 mvn deploy:deploy-file -Dfile=target/hbase-0.90.0.jar -Dpackaging=jar
 -DgroupId=org.apache.hbase -DartifactId=hbase -Dversion=0.90.0
 -DrepositoryId=carrieriq.thirdParty
 -Durl=scp://maven2:mav...@repository.eng.carrieriq.com:
 /data/maven2/repository/thirdparty

 Comment about how the following error can be fixed is appreciated.

 Here is the output:

 [INFO] Scanning for projects...
 [WARNING]
        Profile with id: 'property-overrides' has not been activated.

 [INFO]
 
 [ERROR] BUILD ERROR
 [INFO]
 
 [INFO] Error building POM (may not be this project's POM).


 Project ID: com.agilejava.docbkx:docbkx-maven-plugin
 POM Location: Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.11]
 Validation Messages:

    [0]  'dependencies.dependency.version' is missing for
 com.agilejava.docbkx:docbkx-maven-base:jar

 Reason: Failed to validate POM for project
 com.agilejava.docbkx:docbkx-maven-plugin at Artifact
 [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.11]



Re: provide a 0.20-append tarball?

2010-12-23 Thread Ryan Rawson
Looks like the fight is not going well.  A lot of hdfs developers are
concerned that it would detract resources.  I'm not sure whose
resources.

I hope my 13-15 month comment helped... I've heard "wait for the
next version" before and I am not interested in it.  If that had indeed
worked, a year ago we'd have had stable, working sync/hlog recovery
support.

-ryan

On Wed, Dec 22, 2010 at 3:41 PM, Stack st...@duboce.net wrote:
 On Wed, Dec 22, 2010 at 11:14 AM, Stack st...@duboce.net wrote:
 Let me ask Dhruba what he thinks about making a 0.20-append release
 (He's the release manager).  Will also sound out the hadoop pmc since
 they'll have an opinion.


 I asked Dhruba.  He's fine w/ a release off tip of branch--0.20-append.

 I just wrote a message to general up on hadoop to gauge what hadoopers
 think of the idea.

 St.Ack



Re: 0.90.0RC2 tomorrow?

2010-12-21 Thread Ryan Rawson
The default xml is in the jar and is intended to be that way. The other is
a bug. Can you file a jira? Thanks!
On Dec 21, 2010 7:18 PM, Tatsuya Kawano tatsuya6...@gmail.com wrote:
 Hi,

 Just noticed a couple of things in the last candidate (rc1).

 1. conf/hbase-default.xml is missing.

 2. bin/start-hbase.sh displays the following warning.

 cat: ... /hbase-0.90.0/bin/../target/cached_classpath.txt: No such file or
directory


 Thank you,
 Tatsuya

 --
 Tatsuya Kawano
 Tokyo, Japan


 On Dec 21, 2010, at 2:36 PM, Stack st...@duboce.net wrote:

 We should be able to post our third release candidate tomorrow (Our
 RCs are zero-based). All current blockers and criticals have been
 cleared. Speak up if there is anything you want to get into 0.90.0RC2
 or if there is a good reason for not cutting the RC now.

 Thanks,
 St.Ack



Re: Hypertable claiming upto 900% random-read throughput vs HBase

2010-12-15 Thread Ryan Rawson
So if that is the case, I'm not sure how that is a fair test.  One
system reads from RAM, the other from disk.  The results are as expected.

Why not test one system with SSDs and the other without?

It's really hard to avoid an apples/oranges comparison. Even if you are
running the same workloads on 2 diverse systems, you are not testing the
code quality, you are testing overall systems and other issues.

As G1 GC improves, I expect our ability to use larger and larger heaps
would blunt the advantage of a C++ program using malloc.

-ryan

On Wed, Dec 15, 2010 at 11:15 AM, Ted Dunning tdunn...@maprtech.com wrote:
 From the small comments I have heard, the RAM versus disk difference is
 mostly what I have heard they were testing.

 On Wed, Dec 15, 2010 at 11:11 AM, Ryan Rawson ryano...@gmail.com wrote:

 We don't have the test source code, so it isn't very objective.  However
 I believe there are 2 things which help them:
 - They are able to harness larger amounts of RAM, so they are really
 just testing that vs HBase




Re: Hypertable claiming upto 900% random-read throughput vs HBase

2010-12-15 Thread Ryan Rawson
Purtell has more details, but he told me it no longer crashes, just minor pauses
between 50-250 ms, as of 1.6.0_23.

Still not usable in a latency-sensitive prod setting.  Maybe in other settings?

-ryan

On Wed, Dec 15, 2010 at 11:31 AM, Ted Dunning tdunn...@maprtech.com wrote:
 Does anybody have a recent report about how G1 is coming along?

 On Wed, Dec 15, 2010 at 11:22 AM, Ryan Rawson ryano...@gmail.com wrote:

 As G1 GC improves, I expect our ability to use larger and larger heaps
 would blunt the advantage of a C++ program using malloc.




Re: Hypertable claiming upto 900% random-read throughput vs HBase

2010-12-15 Thread Ryan Rawson
Why do that?  You reduce the cache effectiveness and up the logistical
complexity.  As a stopgap maybe, but not as a long term strategy.

Sun just needs to fix their GC.  Er, Oracle.

-ryan

On Wed, Dec 15, 2010 at 11:55 AM, Chad Walters
chad.walt...@microsoft.com wrote:
 Why not run multiple JVMs per machine?

 Chad

 -Original Message-
 From: Ryan Rawson [mailto:ryano...@gmail.com]
 Sent: Wednesday, December 15, 2010 11:52 AM
 To: dev@hbase.apache.org
 Subject: Re: Hypertable claiming upto 900% random-read throughput vs HBase

 The malloc thing was pointing out that we have to contend with Xmx and GC.  
 So it makes it harder for us to maximally use all the available ram for block 
 cache in the regionserver.  Which you may or may not want to do for 
 alternative reasons.  At least with Xmx you can plan and control your 
 deployments, and you wont suffer from heap growth due to heap fragmentation.

 -ryan

 On Wed, Dec 15, 2010 at 11:49 AM, Todd Lipcon t...@cloudera.com wrote:
 On Wed, Dec 15, 2010 at 11:44 AM, Gaurav Sharma
 gaurav.gs.sha...@gmail.com wrote:
 Thanks Ryan and Ted. I also think if they were using tcmalloc, it
 would have given them a further advantage but as you said, not much
 is known about the test source code.

 I think Hypertable does use tcmalloc or jemalloc (forget which)

 You may be interested in this thread from back in August:
 http://search-hadoop.com/m/pG6SM1xSP7r/hypertablesubj=Re+Finding+on+H
 Base+Hypertable+comparison

 -Todd


 On Wed, Dec 15, 2010 at 2:22 PM, Ryan Rawson ryano...@gmail.com wrote:

 So if that is the case, I'm not sure how that is a fair test.  One
 system reads from RAM, the other from disk.  The results as expected.

 Why not test one system with SSDs and the other without?

 It's really hard to get apples/oranges comparison. Even if you are
 doing the same workloads on 2 diverse systems, you are not testing
 the code quality, you are testing overall systems and other issues.

 As G1 GC improves, I expect our ability to use larger and larger
 heaps would blunt the advantage of a C++ program using malloc.

 -ryan

 On Wed, Dec 15, 2010 at 11:15 AM, Ted Dunning
 tdunn...@maprtech.com
 wrote:
  From the small comments I have heard, the RAM versus disk
  difference is mostly what I have heard they were testing.
 
  On Wed, Dec 15, 2010 at 11:11 AM, Ryan Rawson ryano...@gmail.com
 wrote:
 
  We dont have the test source code, so it isnt very objective.
  However I believe there are 2 things which help them:
  - They are able to harness larger amounts of RAM, so they are
  really just testing that vs HBase
 
 





 --
 Todd Lipcon
 Software Engineer, Cloudera





Re: Local sockets

2010-12-06 Thread Ryan Rawson
Hi,

I'd like to hear more on how you think this paper and the associated
topics apply to HBase.  Remember, unlike the paper, everyone will
always run replication in a real environment, it would be suicide not
to.

-ryan


On Mon, Dec 6, 2010 at 11:39 AM, Vladimir Rodionov
vrodio...@carrieriq.com wrote:
 Todd,

 There are  some curious people who had spent time (and tax payers money :)  
 and have came to  the same conclusion (as me):

 http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf


 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Todd Lipcon [t...@cloudera.com]
 Sent: Monday, December 06, 2010 10:04 AM
 To: dev@hbase.apache.org
 Subject: Re: Local sockets

 On Mon, Dec 6, 2010 at 9:59 AM, Vladimir Rodionov
 vrodio...@carrieriq.comwrote:

 Todd,

 The major hdfs problem is inefficient processing of multiple streams in
 parallel -
 multiple readers/writers per one physical drive result in significant drop
 in overall
 I/O throughput on Linux (tested with ext3, ext4). There should be only one
 reader thread,
 one writer thread per physical drive (until we get AIO support in Java)

 Multiple data buffer copies in pipeline do not improve situation as well.


 In my benchmarks, the copies account for only a minor amount of the
 overhead. Do a benchmark of ChecksumLocalFilesystem vs RawLocalFilesystem
 and you should see the 2x difference I mentioned for data that's in buffer
 cache.

 As for parallel reader streams, I disagree with your assessment. After
 tuning readahead and with a decent elevator algorithm (anticipatory seems
 best in my benchmarks) it's better to have multiple threads reading from a
 drive compared to one, unless we had AIO. Otherwise we won't be able to have
 multiple outstanding requests to the block device, and the elevator will be
 powerless to do any reordering of reads.


 CRC32 can be fast btw and some other hashing algos can be even faster (like
 murmur2 -1.5GB per sec)


 Our CRC32 implementation goes around 750MB/sec on raw data, but for whatever
 undiscovered reason it adds a lot more overhead when you mix it into the
 data pipeline. HDFS-347 has some interesting benchmarks there.

 -Todd


 
 From: Todd Lipcon [t...@cloudera.com]
 Sent: Saturday, December 04, 2010 3:04 PM
 To: dev@hbase.apache.org
 Subject: Re: Local sockets

 On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov
 vrodio...@carrieriq.comwrote:

  From my own experiments performance difference is huge even on
  sequential R/W operations (up to 300%) when you do local File I/O vs HDFS
  File I/O
 
  Overhead of HDFS I/O is substantial to say the least.
 
 
 Much of this is from checksumming, though - turn off checksums and you
 should see about a 2x improvement at least.

 -Todd


  Best regards,
  Vladimir Rodionov
  Principal Platform Engineer
  Carrier IQ, www.carrieriq.com
  e-mail: vrodio...@carrieriq.com
 
  
  From: Todd Lipcon [t...@cloudera.com]
  Sent: Saturday, December 04, 2010 12:30 PM
  To: dev@hbase.apache.org
  Subject: Re: Local sockets
 
  Hi Leen,
 
  Check out HDFS-347 for more info on this. I hope to pick this back up in
  2011 - in 2010 we mostly focused on stability above performance in
 HBase's
  interactions with HDFS.
 
  Thanks
  -Todd
 
  On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote:
 
   Hi,
  
   has anyone tested the performance impact (when there is a hdfs
   datanode and a hbase node on the same machine) of using unix domain
   sockets communication or shared memory ipc using nio? I guess this
   should make a difference on reads?
  
   Regards,
   Leen
  
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Local sockets

2010-12-06 Thread Ryan Rawson
So are we talking about re-implementing IO scheduling in Hadoop at the
application level?



On Mon, Dec 6, 2010 at 12:13 PM, Rajappa Iyer r...@panix.com wrote:
 Jay Booth jaybo...@gmail.com writes:

 I don't get what they're talking about with hiding I/O limitations..  if the
 OS is doing a poor job of handling sequential readers, that's on the OS and
 not Hadoop, no?  In other words, I didn't see anything specific to Hadoop in
 their multiple readers slow down sequential access statement, it just may
 or may not be true for a given I/O subsystem.  The operating system is still
 getting open file, read, read, read, close, whether you're accessing that
 file locally or via a datanode.  Datanodes don't close files in between read
 calls, except at block boundaries.

 The root cause of the problem is the way map jobs are scheduled.  Since
 the job execution overlaps, the reads from different jobs also overlap
 and hence increase seeks.  Realistically, there's not much that the OS
 can do about it.

 What Vladimir is talking about is reducing the seek times by essentially
 serializing the reads through a single thread per disk.  You could
 either cleverly reorganize the reads so that seek is minimized and/or
 read the entire block in one call.

 -rsi


 On Mon, Dec 6, 2010 at 2:39 PM, Vladimir Rodionov
 vrodio...@carrieriq.comwrote:

 Todd,

 There are  some curious people who had spent time (and tax payers money :)
  and have came to  the same conclusion (as me):

 http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf


 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Todd Lipcon [t...@cloudera.com]
 Sent: Monday, December 06, 2010 10:04 AM
 To: dev@hbase.apache.org
 Subject: Re: Local sockets

 On Mon, Dec 6, 2010 at 9:59 AM, Vladimir Rodionov
 vrodio...@carrieriq.comwrote:

  Todd,
 
  The major hdfs problem is inefficient processing of multiple streams in
  parallel -
  multiple readers/writers per one physical drive result in significant
 drop
  in overall
  I/O throughput on Linux (tested with ext3, ext4). There should be only
 one
  reader thread,
  one writer thread per physical drive (until we get AIO support in Java)
 
  Multiple data buffer copies in pipeline do not improve situation as well.
 

 In my benchmarks, the copies account for only a minor amount of the
 overhead. Do a benchmark of ChecksumLocalFilesystem vs RawLocalFilesystem
 and you should see the 2x difference I mentioned for data that's in buffer
 cache.

 As for parallel reader streams, I disagree with your assessment. After
 tuning readahead and with a decent elevator algorithm (anticipatory seems
 best in my benchmarks) it's better to have multiple threads reading from a
 drive compared to one, unless we had AIO. Otherwise we won't be able to
 have
 multiple outstanding requests to the block device, and the elevator will be
 powerless to do any reordering of reads.


  CRC32 can be fast btw and some other hashing algos can be even faster
 (like
  murmur2 -1.5GB per sec)
 

 Our CRC32 implementation goes around 750MB/sec on raw data, but for
 whatever
 undiscovered reason it adds a lot more overhead when you mix it into the
 data pipeline. HDFS-347 has some interesting benchmarks there.

 -Todd

 
  
  From: Todd Lipcon [t...@cloudera.com]
  Sent: Saturday, December 04, 2010 3:04 PM
  To: dev@hbase.apache.org
  Subject: Re: Local sockets
 
  On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov
  vrodio...@carrieriq.comwrote:
 
   From my own experiments performance difference is huge even on
   sequential R/W operations (up to 300%) when you do local File I/O vs
 HDFS
   File I/O
  
   Overhead of HDFS I/O is substantial to say the least.
  
  
  Much of this is from checksumming, though - turn off checksums and you
  should see about a 2x improvement at least.
 
  -Todd
 
 
   Best regards,
   Vladimir Rodionov
   Principal Platform Engineer
   Carrier IQ, www.carrieriq.com
   e-mail: vrodio...@carrieriq.com
  
   
   From: Todd Lipcon [t...@cloudera.com]
   Sent: Saturday, December 04, 2010 12:30 PM
   To: dev@hbase.apache.org
   Subject: Re: Local sockets
  
   Hi Leen,
  
   Check out HDFS-347 for more info on this. I hope to pick this back up
 in
   2011 - in 2010 we mostly focused on stability above performance in
  HBase's
   interactions with HDFS.
  
   Thanks
   -Todd
  
   On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote:
  
Hi,
   
has anyone tested the performance impact (when there is a hdfs
datanode and a hbase node on the same machine) of using unix domain
sockets communication or shared memory ipc using nio? I guess this
should make a difference on reads?
   
Regards,
Leen
   
  
  
  
   --
   Todd Lipcon
   Software Engineer, Cloudera
  
 
 
 
  --
  

Re: Local sockets

2010-12-04 Thread Ryan Rawson
While I applaud these experiments, the next challenge is getting them
into a shipping Hadoop.  I think it's a relative nonstarter if we
require someone to apply a bunch of patches that are, or were, being
refused for commit.

Keep on experimenting and collecting that evidence though!  One day!

-ryan

On Sat, Dec 4, 2010 at 3:04 PM, Todd Lipcon t...@cloudera.com wrote:
 On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov
 vrodio...@carrieriq.comwrote:

 From my own experiments performance difference is huge even on
 sequential R/W operations (up to 300%) when you do local File I/O vs HDFS
 File I/O

 Overhead of HDFS I/O is substantial to say the least.


 Much of this is from checksumming, though - turn off checksums and you
 should see about a 2x improvement at least.

 -Todd


 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Todd Lipcon [t...@cloudera.com]
 Sent: Saturday, December 04, 2010 12:30 PM
 To: dev@hbase.apache.org
 Subject: Re: Local sockets

 Hi Leen,

 Check out HDFS-347 for more info on this. I hope to pick this back up in
 2011 - in 2010 we mostly focused on stability above performance in HBase's
 interactions with HDFS.

 Thanks
 -Todd

 On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote:

  Hi,
 
  has anyone tested the performance impact (when there is a hdfs
  datanode and a hbase node on the same machine) of using unix domain
  sockets communication or shared memory ipc using nio? I guess this
  should make a difference on reads?
 
  Regards,
  Leen
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Review Request: Add option to cache blocks on hfile write and evict blocks on hfile close

2010-11-30 Thread Ryan Rawson


 On 2010-11-30 09:57:27, Ryan Rawson wrote:
  branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java, 
  line 765
  http://review.cloudera.org/r/1261/diff/1/?file=17902#file17902line765
 
  why would you not want to evict blocks from the cache on close?
 
 stack wrote:
 I think this a good point.  Its different behavior but its behavior we 
 should have always had?  One less option too.

I'm still confused why we are adding config for something that we should always 
be doing.  While we'll never be zero-conf, I am not seeing the reason why 
we'd want to keep things in the LRU.

It would make more sense not to evict on a split, but to evict every other time, 
since a split will probably reopen the same hfiles and need those blocks again.


- Ryan
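
A toy model of the behaviour under debate (this is not LruBlockCache; the class, key scheme, and closingForSplit flag are invented for illustration): evict a file's blocks when its reader closes, except when the close is part of a split, since the daughters will likely re-read the same blocks.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EvictOnCloseSketch {
  // cache key: "<hfileName>_<blockOffset>"
  private final Map<String, byte[]> blockCache = new ConcurrentHashMap<String, byte[]>();

  public void cacheBlock(String hfileName, long offset, byte[] block) {
    blockCache.put(hfileName + "_" + offset, block);
  }

  // Called when a store file reader is closed; returns how many blocks were dropped.
  public int close(String hfileName, boolean closingForSplit) {
    if (closingForSplit) {
      return 0;  // keep the blocks warm; the reopened halves will want them
    }
    int evicted = 0;
    for (Iterator<String> it = blockCache.keySet().iterator(); it.hasNext();) {
      if (it.next().startsWith(hfileName + "_")) {
        it.remove();
        evicted++;
      }
    }
    return evicted;
  }
}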


---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1261/#review2010
---


On 2010-11-29 23:22:38, Jonathan Gray wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 http://review.cloudera.org/r/1261/
 ---
 
 (Updated 2010-11-29 23:22:38)
 
 
 Review request for hbase, stack and khemani.
 
 
 Summary
 ---
 
 This issue is about adding configuration options to add/remove from the block 
 cache when creating/closing files. For use cases with lots of flushing and 
 compacting, this might be desirable to prevent cache misses and maximize the 
 effective utilization of total block cache capacity.
 
 The first option, hbase.rs.cacheblocksonwrite, will make it so we pre-cache 
 blocks as we are writing out new files.
 
 The second option, hbase.rs.evictblocksonclose, will make it so we evict 
 blocks when files are closed.
 
 
 This addresses bug HBASE-3287.
 http://issues.apache.org/jira/browse/HBASE-3287
 
 
 Diffs
 -
 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
  1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java 
 1040422 
   branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java 
 1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
  1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
  1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
  1040422 
   branches/0.90/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 
 1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
  1040422 
   
 branches/0.90/src/main/java/org/apache/hadoop/hbase/util/CompressionTest.java 
 1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/RandomSeek.java 
 1040422 
   branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFile.java 
 1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFilePerformance.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileSeek.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java 
 1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
  1040422 
   
 branches/0.90/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java
  1040422 
 
 Diff: http://review.cloudera.org/r/1261/diff
 
 
 Testing
 ---
 
 Added a unit test to TestStoreFile.  That passes.
 
 Need to do perf testing on a cluster.
 
 
 Thanks,
 
 Jonathan
 




Re: Review Request: delete followed by a put with the same timestamp

2010-11-26 Thread Ryan Rawson

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1252/#review1993
---



trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
http://review.cloudera.org/r/1252/#comment6297

What are all the consequences of not sorting by type when using 
KVComparator?  Does this mean we might create HFiles that are not sorted properly, 
because the HFile comparator uses the KeyComparator directly with ignoreType = 
false?

While in memstore we can rely on memstoreTS to roughly order by insertion 
time, and the Put/Delete case should probably work in that situation, you are 
talking about modifying a pretty core and important concept in how we sort 
things.

There are other ways to reconcile bugs like this; one of them is to extend 
the memstoreTS concept into the HFile and use that to reconcile during reads.  
There is another JIRA where I proposed this.

If we are talking about 0.92 and beyond I'd prefer building a solid base 
rather than dangerous hacks like this.  Our unit tests are not extremely 
extensive, so while they might pass, that doesn't guarantee a lack of bad 
behaviour later on.



- Ryan
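
To make the tie-break question concrete, here is a toy comparator (not KeyValue.KVComparator; the Cell class and type codes are simplified) showing what "ignore type" changes: with the flag on, a Put and a Delete at the same row/column/timestamp compare as equal, so their relative order has to come from somewhere else (memstoreTS in memory, scanner/file order on disk).

import java.util.Comparator;

class Cell {
  final String row, column;
  final long ts;
  final int type;  // e.g. 4 = Put, 8 = Delete

  Cell(String row, String column, long ts, int type) {
    this.row = row; this.column = column; this.ts = ts; this.type = type;
  }
}

class CellComparator implements Comparator<Cell> {
  private final boolean ignoreType;

  CellComparator(boolean ignoreType) { this.ignoreType = ignoreType; }

  @Override
  public int compare(Cell a, Cell b) {
    int c = a.row.compareTo(b.row);
    if (c != 0) return c;
    c = a.column.compareTo(b.column);
    if (c != 0) return c;
    c = Long.compare(b.ts, a.ts);  // newest timestamp first
    if (c != 0) return c;
    // old rule: higher type code (Delete) sorts ahead of lower (Put);
    // new rule: types tie, leaving the order to insertion/scanner order
    return ignoreType ? 0 : Integer.compare(b.type, a.type);
  }
}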


On 2010-11-26 07:47:02, Pranav Khaitan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 http://review.cloudera.org/r/1252/
 ---
 
 (Updated 2010-11-26 07:47:02)
 
 
 Review request for hbase, Jonathan Gray and Kannan Muthukkaruppan.
 
 
 Summary
 ---
 
 This is a design change suggested in HBASE-3276 so adequate thought should be 
 given before proceeding. 
 
 The main code change is just one line which is to ignore key type while doing 
 KV comparisons. When the key type is ignored, then all the keys for the same 
 timestamp are sorted according the order in which they were interested. It is 
 still ensured that the delete family and delete column will be at the top 
 because they have the default column name and default timestamp.
 
 
 This addresses bug HBASE-3276.
 http://issues.apache.org/jira/browse/HBASE-3276
 
 
 Diffs
 -
 
   trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1039233 
   
 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/KeyValueScanFixture.java
  1039233 
   
 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreScanner.java
  1039233 
 
 Diff: http://review.cloudera.org/r/1252/diff
 
 
 Testing
 ---
 
 Test cases added. Since there is a change in semantics, some previous tests 
 were failing because of this change. Those tests have been modified to test 
 the newer behavior.
 
 
 Thanks,
 
 Pranav
 




Re: HRegion.RegionScanner.nextInternal()

2010-11-26 Thread Ryan Rawson
Yes, in this case 'batch' and 'limit' refer to how many cells to return
at a time within a row.  The 'scanner caching' comes across in the
next(int) argument, which can change on a per-call basis (although the
HTable API doesn't quite allow it).

-ryan
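
A small client-side sketch of the two knobs being distinguished here, against the 0.90-era client API (table and family names are made up): setBatch() caps the cells per Result, i.e. the intra-row 'limit', while setCaching() controls how many rows each RPC prefetches.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchVsCaching {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "wide_table");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setBatch(10);     // at most 10 cells per Result; a wide row spans several Results
    scan.setCaching(100);  // number of rows fetched per round trip to the regionserver
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result partial : scanner) {
        System.out.println(partial.size() + " cells from row "
            + Bytes.toString(partial.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}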

On Fri, Nov 26, 2010 at 3:12 AM, Lars George lars.geo...@gmail.com wrote:
 OK, got it. I missed the HRegionServers.next() in the mix. It calls
 the RegionScanner.next(results) and that uses the batch. Tricksy! I
 should have started on the client side instead.

 Lars

 On Fri, Nov 26, 2010 at 3:08 AM, Ryan Rawson ryano...@gmail.com wrote:
  No, the batch size when limit is set is 1. You get partial results for a row,
  then get more from the same row, then the next row.
 On Nov 25, 2010 4:54 PM, Lars George lars.geo...@gmail.com wrote:
 Mkay, I will look into it more for the latter. But for the limit this is
 still confusing to me as limit == batch and that is in he client side the
 number of rows. But not the number of columns. Does that mean if I had 100
 columns and set batch to 10 that it would only return 10 rows with 10
 columns but not what I would have expected ie. 10 rows with all columns? Is
 this implicitly mean batch is also the intra row batch size?

 Lars

 On Nov 25, 2010, at 21:53, Ryan Rawson ryano...@gmail.com wrote:

 limit is for retrieving partial results of a row. Ie: give me a row
 in chunks. Filters that want to operate on the entire row cannot be
 used with this mode. i forget why it's in the loop but there was a
 good reason at the time.

 -ryan

 On Thu, Nov 25, 2010 at 10:51 AM, Lars George lars.geo...@gmail.com
 wrote:
 Does hbase-dev still get forwarded? Did you see the below message?

 -- Forwarded message --
 From: Lars George lars.geo...@gmail.com
 Date: Tue, Nov 23, 2010 at 4:25 PM
 Subject: HRegion.RegionScanner.nextInternal()
 To: hbase-...@hadoop.apache.org

 Hi,

 I am officially confused:

 byte [] nextRow;
 do {
 this.storeHeap.next(results, limit - results.size());
  if (limit > 0 && results.size() == limit) {
  if (this.filter != null && filter.hasFilterRow()) throw
  new IncompatibleFilterException(
  "Filter with filterRow(List<KeyValue>) incompatible
  with scan with limit!");
 return true; // we are expecting more yes, but also
 limited to how many we can return.
 }
 } while (Bytes.equals(currentRow, nextRow = peekRow()));

 This is from the nextInternal() call. Questions:

 a) Why is that check for the filter and limit both being set inside the
 loop?

 b) if limit is the batch size (which for a Get is -1, not 1 as I
 would have thought) then what does that limit - results.size()
 achieve?

 I mean, this loops gets all columns for a given row, so batch/limit
 should not be handled here, right? what if limit were set to 1 by
 the client? Then even if the Get had 3 columns to retrieve it would
 not be able to since this limit makes it bail out. So there would be
 multiple calls to nextInternal() to complete what could be done in one
 loop?

 Eh?

 Lars





Re: Review Request: delete followed by a put with the same timestamp

2010-11-26 Thread Ryan Rawson


 On 2010-11-26 14:54:45, Ryan Rawson wrote:
  trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java, line 1373
  http://review.cloudera.org/r/1252/diff/1/?file=17712#file17712line1373
 
  what are all the consequences for not sorting by type when using 
  KVComparator?  Does this mean we might create HFiles that not sorted 
  properly, because the HFile comparator uses the KeyComparator directly with 
  ignoreType = false. 
  
  While in memstore we can rely on memstoreTS to roughly order by 
  insertion time, and the Put/Delete should probably work in that situation, 
  you are talking about modifiying a pretty core and important concept in how 
  we sort things.
  
  There are other ways to reconcile bugs like this, one of them is to 
  extend the memstoreTS concept into the HFile and use that to reconcile 
  during reads.  There is another JIRA where I proposed this.  
  
  If we are talking about 0.92 and beyond I'd prefer building a solid 
  base rather than dangerous hacks like this.  Our unit tests are not 
  extremely extensive, so while they might pass, that doesnt guarantee lack 
  of bad behaviour later on.
 
 
 Pranav Khaitan wrote:
 Agree. As I mentioned, this is a major change and more thought needs to 
 be given to it.
 
 However, to resolve issues like HBASE-3276, we need either such a change 
 or extend the memstoreTS concept to HFile as you mentioned.
 
 About consequences, I don't see anything negative here. This change only 
 affects the sorting of keys having same row, col, timestamp. After this 
 change, all keys with the same row, col, ts will be sorted purely based on 
 the order in which they were inserted. When a memstore is flushed to HFile, 
 the memstoreTS takes care of ordering. During compactions, the KeyValueHeap 
 breaks ties by using the sequence ids of storefiles.

The problem is you are now changing how things are ordered some of the time but not 
all of the time.  HFile uses the raw comparator directly, instantiating it 
rather than getting it via the code path you changed.  So now you create a 
memstore in this order:

row,col,100,Put  (memstoreTS=1)
row,col,100,Delete (memstoreTS=2)
row,col,100,Put (memstoreTS=3)

But the HFile comparator will consider this out of order since it doesn't know 
about memstoreTS and still expects things to be in a certain order.

I'm a little wary of having implicit ordering in the HFiles... in your new 
scheme, Put,Delete,Put are in that order 'just because they are', and the 
comparator cannot put them back in order, and must rely on scanner order.  
During compactions we would place keys in order based on which files they came 
from, but they wouldn't themselves have an order.  Basically we should get rid 
of 'type sorting' and use memstoreTS sorting in memory and implicit sorting in 
the HFiles.  


- Ryan


---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1252/#review1993
---


On 2010-11-26 07:47:02, Pranav Khaitan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 http://review.cloudera.org/r/1252/
 ---
 
 (Updated 2010-11-26 07:47:02)
 
 
 Review request for hbase, Jonathan Gray and Kannan Muthukkaruppan.
 
 
 Summary
 ---
 
 This is a design change suggested in HBASE-3276 so adequate thought should be 
 given before proceeding. 
 
 The main code change is just one line which is to ignore key type while doing 
 KV comparisons. When the key type is ignored, then all the keys for the same 
 timestamp are sorted according the order in which they were interested. It is 
 still ensured that the delete family and delete column will be at the top 
 because they have the default column name and default timestamp.
 
 
 This addresses bug HBASE-3276.
 http://issues.apache.org/jira/browse/HBASE-3276
 
 
 Diffs
 -
 
   trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1039233 
   
 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/KeyValueScanFixture.java
  1039233 
   
 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreScanner.java
  1039233 
 
 Diff: http://review.cloudera.org/r/1252/diff
 
 
 Testing
 ---
 
 Test cases added. Since there is a change in semantics, some previous tests 
 were failing because of this change. Those tests have been modified to test 
 the newer behavior.
 
 
 Thanks,
 
 Pranav
 




Re: HRegion.RegionScanner.nextInternal()

2010-11-25 Thread Ryan Rawson
limit is for retrieving partial results of a row, i.e. give me a row
in chunks.  Filters that want to operate on the entire row cannot be
used with this mode.  I forget why it's in the loop, but there was a
good reason at the time.

-ryan

On Thu, Nov 25, 2010 at 10:51 AM, Lars George lars.geo...@gmail.com wrote:
 Does hbase-dev still get forwarded? Did you see the below message?

 -- Forwarded message --
 From: Lars George lars.geo...@gmail.com
 Date: Tue, Nov 23, 2010 at 4:25 PM
 Subject: HRegion.RegionScanner.nextInternal()
 To: hbase-...@hadoop.apache.org

 Hi,

 I am officially confused:

          byte [] nextRow;
          do {
            this.storeHeap.next(results, limit - results.size());
             if (limit > 0 && results.size() == limit) {
               if (this.filter != null && filter.hasFilterRow()) throw
                 new IncompatibleFilterException(
                   "Filter with filterRow(List<KeyValue>) incompatible
                   with scan with limit!");
              return true; // we are expecting more yes, but also
 limited to how many we can return.
            }
          } while (Bytes.equals(currentRow, nextRow = peekRow()));

 This is from the nextInternal() call. Questions:

 a) Why is that check for the filter and limit both being set inside the loop?

 b) if limit is the batch size (which for a Get is -1, not 1 as I
 would have thought) then what does that limit - results.size()
 achieve?

 I mean, this loops gets all columns for a given row, so batch/limit
 should not be handled here, right? what if limit were set to 1 by
 the client? Then even if the Get had 3 columns to retrieve it would
 not be able to since this limit makes it bail out. So there would be
 multiple calls to nextInternal() to complete what could be done in one
 loop?

 Eh?

 Lars



Re: code review: HBASE-3251

2010-11-24 Thread Ryan Rawson
Please include a diff instead; it's hard to compare otherwise.

Also I'm not sure there will be a 0.20.7.

-ryan

On Wed, Nov 24, 2010 at 3:45 PM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 I wanted to automate the manual deletion of dangling row(s) in .META. table.
 Please kindly comment on the following modification to HMaster.createTable()
 which is base on 0.20.6 codebase:

    long scannerid = srvr.openScanner(metaRegionName, scan);
    try {
         HashSet<byte[]> regions = new HashSet<byte[]>();
        boolean cleanTable = false,        // whether the table has a row in
 .META. whose start key is empty
                exists = false;
        Result data = srvr.next(scannerid);
        while (data != null) {
             if (data != null && data.size() > 0) {
                HRegionInfo info = Writables.getHRegionInfo(
                        data.getValue(CATALOG_FAMILY,
 REGIONINFO_QUALIFIER));
                if (info.getTableDesc().getNameAsString().equals(tableName))
 {
                    exists = true;
                    if (info.getStartKey().length == 0) {
                        cleanTable = true;
                    } else {
                        regions.add(info.getRegionName());
                    }
                }
            }
            data = srvr.next(scannerid);
        }
        if (exists) {
            if (!cleanTable) {
                HTable meta = new HTable(HConstants.META_TABLE_NAME);
                for (byte[] region : regions) {
                    Delete d = new Delete(region);
                    meta.delete(d);
                    LOG.info(dangling row  + Bytes.toString(region) + 
 deleted from .META.);
                }
            } else {
                // A region for this table already exists. Ergo table
 exists.
                throw new TableExistsException(tableName);
            }
        }
    } finally {
      srvr.close(scannerid);
    }

 Thanks



Re: How to put() and get() when setAutoFlush(false)?

2010-11-22 Thread Ryan Rawson
Hi,

You could implement this in a code structure like so:

HTable table = new HTable(conf, tableName);  // note: conf first, then table name
Put lastPut = null;
while ( moreData ) {
  // build the next Put from the previous one rather than reading it back
  Put put = makeNewPutBasedOnLastPutToo( lastPut, dataSource );
  table.put(put);
  lastPut = put;
  dataSource.next();
}

If that is unsatisfactory, you may access the write buffer via
HTable.getWriteBuffer().
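
And a hedged sketch of that second option -- looking a value up in the
unflushed client-side write buffer before falling back to a normal get().
This assumes a 0.90-era HTable (where getWriteBuffer() exposes the pending
Puts), reuses the table from the loop above, and the row/column names are
invented; note the lookup is linear in the buffer size:

  // assumed imports: java.util.List, org.apache.hadoop.hbase.KeyValue,
  //                  org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
  byte[] row = Bytes.toBytes("row-42");
  byte[] cf = Bytes.toBytes("cf");
  byte[] qual = Bytes.toBytes("q");
  byte[] value = null;
  // walk the unflushed Puts newest-first and take the first match
  List<Put> pending = table.getWriteBuffer();
  for (int i = pending.size() - 1; i >= 0; i--) {
    Put p = pending.get(i);
    if (Bytes.equals(p.getRow(), row) && p.has(cf, qual)) {
      List<KeyValue> kvs = p.get(cf, qual);
      value = kvs.get(kvs.size() - 1).getValue();
      break;
    }
  }
  if (value == null) {
    // not buffered locally, so read it from the region server
    Result r = table.get(new Get(row));
    value = r.getValue(cf, qual);
  }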

-ryan


On Mon, Nov 22, 2010 at 5:41 PM, Xin Wang and...@gmail.com wrote:
 Hello everyone,

  I am a beginner to HBase. I want to load a data file of 2 million lines
 into an HBase table.
  I want to load data as fast as possible, so I called
 HTable.setAutoFlush(false) at the beginning. However, when I HTable.put() a
 row and then HTable.get() the same row, the result is empty. I know this is
 because setAutoFlush(false) makes put() write into the buffer. But the
 algorithm in my loading process requires reading the value of the previous
 one that was just put into the HTable cell. I have tried
 setAutoFlush(true); although the previous value can then be read, the loading
 process is slowed down by about an order of magnitude. Can I get() values
 directly from the write buffer? Are there any other solutions to this
 problem that I do not know of? Thank you in advance!

  Best regards,

 Xin Wang



Re: ANN: hbase 0.90.0 Release Candidate 0 available for download

2010-11-17 Thread Ryan Rawson
I concur. Next week?

On Wed, Nov 17, 2010 at 4:39 PM, Stack st...@duboce.net wrote:
 Good one.  Want to make an issue J-D?

 Seems like this RC is sunk going by the issues filed against it.  If it's
 OK w/ you all, let's let this RC hang out there a little longer to
 see if the RC catches more bad bugs before we cut a new RC?

 St.Ack

 On Wed, Nov 17, 2010 at 6:28 PM, Jean-Daniel Cryans jdcry...@apache.org 
 wrote:
 Currently both trunk and 0.90's pom.xml are incomplete; we were
 relying on Ryan's repo to have the thrift pom, but now that it was changed
 to Stack's, newcomers cannot compile the project since that pom is
 missing. Reported by kzk9 on IRC.

 So either we add Ryan's repo back to the pom, or Stack copies the
 files to his own repo, or we add an FB repo that has it.

 J-D

 On Tue, Nov 16, 2010 at 3:38 PM, Stack st...@duboce.net wrote:
 Agreed.  Keep testing and keep the sinkers coming in so it's all the more
 likely that the next RC we put out graduates.  Make sure issues are
 filed against 0.90.0.
 Good stuff,
 St.Ack

 On Tue, Nov 16, 2010 at 5:27 PM, Todd Lipcon t...@cloudera.com wrote:
 The web UI split and compact buttons are currently not hooked up - 
 filed
 last night, will try to get a patch done today.

 The good news is I ran some YCSB tests and on the whole performance is much
 improved!

 I agree, let's keep going with this rc until people stop finding new 
 issues,
 or we reach something that blocks further testing.

 -Todd

 On Tue, Nov 16, 2010 at 7:57 AM, Gary Helmling ghelml...@gmail.com wrote:

 -1 on RC

 I opened HBASE-3235 for an issue with ICVs that should also sink the RC.
 When a put and subsequent ICV go in with the same timestamp for the same
 row/family/qualifier, the initial put masks the ICV, effectively causing 
 it
 to disappear.  There's a fix up on review board.

 We may want to give a couple more days for any other issues to shake out 
 as
 well?

 Gary


 On Tue, Nov 16, 2010 at 4:53 AM, Mathias Herberts 
 mathias.herbe...@gmail.com wrote:

  Hi,
 
  I just filed HBASE-3238 which appears to me as a blocker as HBase
  won't start if its zookeeper.parent.znode exists and HBase does not
  have the CREATE permission on this znode's parent znode.
 
  Mathias.
 




 --
 Todd Lipcon
 Software Engineer, Cloudera






Re: ANN: hbase 0.90.0 Release Candidate 0 available for download

2010-11-15 Thread Ryan Rawson
That is correct, those classes were deprecated in 0.20, and now gone in 0.90.

Now you will want to use HTable and Result.

Also, Filter.getNextKeyHint() is an implementation detail; have a look
at the other filters to get a sense of what it does.
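
For anyone porting, a small sketch of what the replacement read/write path
looks like in 0.90 -- Get/Result/KeyValue in place of the old RowResult/Cell,
and a per-add() timestamp in place of Put.setTimeStamp(); conf is an existing
Configuration and the table/column names are invented:

  // assumed imports: org.apache.hadoop.hbase.KeyValue,
  //                  org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
  HTable table = new HTable(conf, "mytable");
  Get get = new Get(Bytes.toBytes("row1"));
  get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
  Result result = table.get(get);                      // replaces RowResult
  byte[] value = result.getValue(Bytes.toBytes("cf"),  // replaces Cell.getValue()
                                 Bytes.toBytes("qual"));
  // the timestamp that used to live on Cell is now on the KeyValue
  KeyValue kv = result.getColumnLatest(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
  long ts = (kv == null) ? -1L : kv.getTimestamp();
  // instead of Put.setTimeStamp(), pass the timestamp to each add()
  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), System.currentTimeMillis(), Bytes.toBytes("v"));
  table.put(put);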

On Mon, Nov 15, 2010 at 12:33 PM, Ted Yu yuzhih...@gmail.com wrote:
 Just a few findings when I tried to compile our 0.20.6 based code with this
 new release:

 HConstants is final class now instead of interface
 RowFilterInterface is gone
 org.apache.hadoop.hbase.io.Cell is gone
 org.apache.hadoop.hbase.io.RowResult is gone
 constructor
 HColumnDescriptor(byte[],int,java.lang.String,boolean,boolean,int,boolean)
 is gone
 Put.setTimeStamp() is gone
 org.apache.hadoop.hbase.filter.Filter has added
 getNextKeyHint(org.apache.hadoop.hbase.KeyValue)

 If you know the alternative to some of the old classes, please share.

 On Mon, Nov 15, 2010 at 2:51 AM, Stack st...@duboce.net wrote:

 The first hbase 0.90.0 release candidate is available for download:

  http://people.apache.org/~stack/hbase-0.90.0-candidate-0/

 HBase 0.90.0 is the major HBase release that follows 0.20.0 and the
 fruit of the 0.89.x development release series we've been running of
 late.

 More than 920 issues have been closed since 0.20.0.  Release notes are
 available here: http://su.pr/8LbgvK.

 HBase 0.90.0 runs on Hadoop 0.20.x.  It does not currently run on
 Hadoop 0.21.0.   HBase will lose data unless it is running on an
 Hadoop HDFS 0.20.x that has a durable sync. Currently only the
 branch-0.20-append branch [1] has this attribute. No official releases
 have been made from this branch as yet so you will have to build your
 own Hadoop from the tip of this branch or install Cloudera's CDH3 [2]
 (It's currently in beta.)  CDH3b2 or CDH3b3 have the 0.20-append patches
 needed to add a durable sync. See CHANGES.txt [3] in
 branch-0.20-append to see list of patches involved.

 There is no migration necessary.  Your data written with HBase 0.20.x
 (or with HBase 0.89.x) is readable by HBase 0.90.0.  A shutdown and
 restart after putting in place the new HBase should be all thats
 involved.  That said, once done, there is no going back to 0.20.x once
 the transition has been made.   HBase 0.90.0 and HBase 0.89.x write
 region names differently in the filesystem.  Rolling restart from
 0.20.x or 0.89.x to 0.90.0RC0 will not work.

 Should we release this candidate as hbase 0.90.0?  Take it for a spin.
  Check out the doc.  Vote +1/-1 by November 22nd.

 Yours,
 The HBasistas
 P.S. For why the version is 0.90 and what's new in HBase 0.90, see slides
 4-10 in this deck [4]

 1. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append
 2. http://archive.cloudera.com/docs/
 3.
 http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt
 4. http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/



