[jira] [Created] (HBASE-12941) CompactionRequestor - a private interface class with no users
ryan rawson created HBASE-12941:
---
Summary: CompactionRequestor - a private interface class with no users
Key: HBASE-12941
URL: https://issues.apache.org/jira/browse/HBASE-12941
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: ryan rawson

CompactionRequestor is an 'interface audience private' class with no users in the HBase code base. Unused things should be deleted.
Re: A face-lift for 1.0
I'm checking to see if our marketing and web team can help. The primary requirement is going to be ditching the mvn:site from the front page. Reskinning it might not be so easy. -ryan On Thu, Dec 4, 2014 at 9:52 AM, lars hofhansl la...@apache.org wrote: +1 I just came across one of the various HBase vs. Cassandra articles, and one of the main tenets of the articles was how much better the Cassandra documentation was. Thank god we have Misty now. :) (not sure how much just a skin would help, but it surely won't hurt) -- Lars From: Nick Dimiduk ndimi...@gmail.com To: hbase-dev dev@hbase.apache.org Sent: Tuesday, December 2, 2014 9:46 AM Subject: A face-lift for 1.0 Heya, In mind of the new release, I was thinking we should clean up our act a little bit in regard to hbase.apache.org and our book. Just because the project started in 2007 doesn't mean we need a site that looks like it's from 2007. Phoenix's site looks great in this regard. For the home page, I was thinking of converting it over to bootstrap [0] so that it'll be easier to pick up a theme, either one of our own or something pre-canned [1]. I'm no web designer, but the idea is this would make it easier for someone who is to help us out. For the book, I just want to skin it -- no intention of changing the docbook part (such a decision I'll leave up to Misty). I'm less sure on this project, but Riak's docs [2] are a nice inspiration. What do you think? Do we know any web designers who can help out with the CSS? -n [0]: http://getbootstrap.com [1]: https://wrapbootstrap.com/ [2]: http://docs.basho.com/riak/latest/
Call for Presentations - HBase User group meeting
Hi all, The next HBase user group meeting is on November the 20th. We need a few more presenters still! Please send me your proposals - summary and outline of your talk! Thanks! -ryan
[jira] [Created] (HBASE-12260) MasterServices - remove from coprocessor API (Discuss)
ryan rawson created HBASE-12260:
---
Summary: MasterServices - remove from coprocessor API (Discuss)
Key: HBASE-12260
URL: https://issues.apache.org/jira/browse/HBASE-12260
Project: HBase
Issue Type: Bug
Components: master
Reporter: ryan rawson
Priority: Minor

A major issue with MasterServices is that MasterCoprocessorEnvironment exposes this class even though MasterServices is tagged with @InterfaceAudience.Private. This means that the entire internals of the HMaster is essentially part of the coprocessor API. Many of the classes returned by the MasterServices API are highly internal, extremely powerful, and subject to constant change. Perhaps a new API to replace MasterServices that is use-case focused, and justified based on real-world coprocessors, would suit things better.
[jira] [Created] (HBASE-12192) Remove EventHandlerListener
ryan rawson created HBASE-12192:
---
Summary: Remove EventHandlerListener
Key: HBASE-12192
URL: https://issues.apache.org/jira/browse/HBASE-12192
Project: HBase
Issue Type: Bug
Components: master
Reporter: ryan rawson

EventHandlerListener isn't actually being used by internal HBase code right now. No one actually calls 'ExecutorService.registerListener()' according to IntelliJ. It might be possible that some coprocessors use it. Perhaps people can comment if they find this functionality useful or not.
Re: Reply: DISCUSSION: Component Lieutenants?
I'd like to contribute; we are learning some very interesting things here and I'd like to feed back as much as possible, but I just can't guarantee a response time. Sent from your iPhone On Sep 21, 2012, at 10:14 AM, Andrew Purtell apurt...@apache.org wrote: On Fri, Sep 21, 2012 at 1:49 AM, Ryan Rawson ryano...@gmail.com wrote: This is a cool idea, I'd like to contribute, but I'll need coverage since I cannot guarantee my time (since it doesn't belong to me anyways). What do you need, Ryan? -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Reply: DISCUSSION: Component Lieutenants?
This is a cool idea, I'd like to contribute, but I'll need coverage since I cannot guarantee my time (since it doesn't belong to me anyways).
Metrics, be better already (The legend of graphite)
Hey folks, I built something out here at DTS, and I wanted to get feedback to see if it was interesting for anyone else... I built a no-dependency GraphiteReportingContext module that allows Hadoop metrics to be exported to a Graphite service. We've been using it here for a month and it works quite nicely. Graphite is not as easy as it should be to set up (I have some automated packaging scripts too), but it sure beats everything else, and doubly so compared to ganglia. Anyone think this is interesting?
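For context, the wire format such a reporter emits is Graphite's plaintext protocol: one "metric.path value timestamp" line per data point, pushed to the carbon listener. The sketch below shows only that protocol; the host, port, and metric names are illustrative and not taken from the GraphiteReportingContext module itself.

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

// Minimal sketch of a Graphite plaintext push, the kind of payload a
// reporting context would send each period. Hostname/port/metric names
// are placeholders.
public class GraphitePushExample {
  public static void main(String[] args) throws Exception {
    try (Socket sock = new Socket("graphite.example.com", 2003);
         Writer out = new OutputStreamWriter(sock.getOutputStream(), "UTF-8")) {
      long now = System.currentTimeMillis() / 1000L;  // Graphite expects epoch seconds
      out.write("hbase.regionserver.requests 1234 " + now + "\n");
      out.write("hbase.regionserver.blockCacheHitRatio 0.87 " + now + "\n");
      out.flush();
    }
  }
}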
Re: testing, powermock, jmockit - from pow-wow
I've used mockito a few times, and it's great... but it can make your tests very brittle. It can also be hard to successfully use if the code is complex. For example, I had a class that took an HBaseAdmin, and I mocked out the few calls it used. Then when I needed to access Configuration, things went downhill fast. I ended up abandoning easymock even. The issue ultimately stems from not writing your code in a certain way with a minimum of easy-to-mock external interfaces. When this isn't true, then easymock does nothing for you. It can save your bacon if you are trying to unit test something deep though. The other question I guess is integration testing... there is no specific good reason why everything is done in one JVM, except 'because we can'. A longer-lived 'minicluster' could amortize the cost of running one. -ryan On Fri, Sep 21, 2012 at 9:06 AM, Rogerio rliesenf...@gmail.com wrote: lars hofhansl lhofhansl@... writes: To get the low-level access we could instead use jmockit at the cost of dealing with code-weaving. As we had discussed, this scares me :). I do not want to have to debug some test code that was woven (i.e. has no matching source code lying around *anywhere*). I think you are imagining a problem that does not exist. JMockit users can debug Java code just fine...
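A small sketch of the brittleness described above, assuming the era-appropriate HBaseAdmin methods (tableExists, getConfiguration) and standard Mockito: only the calls you thought to stub behave, and the first unstubbed call fails far from the mock.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import static org.mockito.Mockito.*;

// Illustrative only - not the class Ryan describes.
public class BrittleMockExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = mock(HBaseAdmin.class);
    when(admin.tableExists("t1")).thenReturn(true);   // the one call we thought about

    System.out.println(admin.tableExists("t1"));      // true - the stub works

    // Downhill fast: an unstubbed call silently returns null,
    // and the code under test blows up somewhere else entirely.
    Configuration conf = admin.getConfiguration();            // null
    System.out.println(conf.get("hbase.zookeeper.quorum"));   // NullPointerException
  }
}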
Re: HBASE-2182
On Fri, Jun 29, 2012 at 5:04 PM, Todd Lipcon t...@cloudera.com wrote: A few inline notes below: On Fri, Jun 29, 2012 at 4:42 PM, Elliott Clark ecl...@stumbleupon.com wrote: I just posted a pretty early skeleton (https://issues.apache.org/jira/browse/HBASE-2182) on what I think a Netty-based HBase client/server could look like. Pros: - Faster - Giraph got a 3x perf improvement by dropping Hadoop RPC What's the reference for this? The 3x perf I heard about from Giraph was from switching to using LMAX's Disruptor instead of queues, internally. We could do the same, but I'm not certain the model works well for our use cases where the RPC processing can end up blocked on disk access, etc. - Asynchbase trounces our client when JD benchmarked them I'm still convinced that the majority of this has to do with the way our batching happens to the server, not async vs sync. (in the current sync client, once we fill up the buffer, we flush from the same thread, and block the flush until all buffered edits have made it, vs doing it in the background). We could fix this without going to a fully async model. I also agree here; if you do the a priori code analysis, it becomes obvious that the issue is that slower regionservers can hold up entire batches even if 90%+ of the Puts were already acked... And don't forget that we used to issue Puts to regionservers SERIALLY until we did the current parallelism code... (not that the code is great, but it was relatively easy to fix at the time). - Could encourage things to be a little more modular if everything isn't hanging directly off of HRegionServer Sure, but not sure I see why this is Netty vs not-Netty - Netty is better about thread usage than the Hadoop RPC server. Can you explain further? - Pretty easy to define an RPC protocol after all of the work on protobuf (Thanks everyone) - Decoupling the RPC server library from the hadoop library could allow us to rev the server code easier. - The filter model is very easy to work with. - Security can be just a single filter. - Logging can be another - Stats can be another. Cons: - Netty and non-Apache RPC servers don't play well together. They might be able to but I haven't gotten there yet. What do you mean by non-Apache RPC servers? - Complexity - Two different servers in the src - Confusing users who don't know which to pick - Non-blocking could make the client harder to write. I'm really just trying to gauge what people think of the direction and if it's still something that is wanted. The code is a long way from even being a tech demo, and I'm not a Netty expert, so suggestions would be welcomed. Thoughts? Are people interested in this? Should I push this to my github so others can help? IMO, I'd want to see a noticeable perf difference from the change - unfortunately it would take a fair amount of work to get to the point where you could benchmark it. But if you're willing to spend the time to get to that point, seems worth investigating. -- Todd Lipcon Software Engineer, Cloudera
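A minimal sketch of the "everything is a pipeline filter" idea discussed above, with security, stats, and RPC dispatch as separate handlers. This uses the Netty 4 API for brevity and is not the HBASE-2182 skeleton; the handler names and port are illustrative placeholders.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Each concern (auth, stats, decode/dispatch) is its own handler in the pipeline,
// rather than being woven into one monolithic RPC server class.
public class NettyRpcServerSketch {
  public static void main(String[] args) throws Exception {
    EventLoopGroup boss = new NioEventLoopGroup(1);
    EventLoopGroup workers = new NioEventLoopGroup();
    try {
      ServerBootstrap b = new ServerBootstrap()
          .group(boss, workers)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              ch.pipeline()
                .addLast("auth", new ChannelInboundHandlerAdapter() { /* security filter */ })
                .addLast("stats", new ChannelInboundHandlerAdapter() { /* per-call metrics */ })
                .addLast("rpc", new ChannelInboundHandlerAdapter() {
                  @Override
                  public void channelRead(ChannelHandlerContext ctx, Object msg) {
                    // decode the protobuf request, hand off to a handler pool, write response
                  }
                });
            }
          });
      b.bind(60020).sync().channel().closeFuture().sync();
    } finally {
      boss.shutdownGracefully();
      workers.shutdownGracefully();
    }
  }
}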
Re: Bay Area HBase User Group organizer change (?)
Good job Andrew. Don't forget to expense it - problem solved! -ryan On Sat, Oct 1, 2011 at 6:26 PM, Ted Dunning tdunn...@maprtech.com wrote: I can get some sponsorship going on my end as well. On Sun, Oct 2, 2011 at 12:09 AM, Ted Yu yuzhih...@gmail.com wrote: I agree. We should share the payment. On Sat, Oct 1, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote: Thanks, Andrew! Let us know if we can chip in for the dues. -Todd On Sat, Oct 1, 2011 at 4:38 PM, Andrew Purtell apurt...@apache.org wrote: I went to RSVP for the upcoming HUG in NYC after Hadoop World. Meetup complained the Bay Area HBase User Group was missing an organizer (who pays dues), and would be deleted after 14 days. I've paid up for us for the next 6 months, and am now the organizer. I'll figure out what is required of that, but please pardon if something is not quite right at first. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) -- Todd Lipcon Software Engineer, Cloudera
Re: prefix compression implementation
I was just pushing back at the idea of 'turn everything into interfaces! problem solved!', and thinking about what was really necessary to get to where you want to go... On Mon, Sep 19, 2011 at 3:26 PM, Matt Corgan mcor...@hotpads.com wrote: Ryan - i answered your question on another thread yesterday. Will use this thread to continue conversation on the KeyValue interface. I don't think the name is all that important, though i thought HCell was less clumsy than KeyValue or KeyValueInterface. Take a look at this interface on github: https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/model/HCell.java Seems like it should be trivially easy to get KeyValue to implement that. Then it provides the right methods to make compareTo methods that will work across different implementations. The implementations of those methods might have an if-statement to determine the class of the other HCell, and choose the fastest byte comparison method behind the scenes. I need to look into the KeyValue scanner interfaces On Fri, Sep 16, 2011 at 7:34 PM, Ryan Rawson ryano...@gmail.com wrote: On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan mcor...@hotpads.com wrote: Ryan - thanks for the feedback. The situation I'm thinking of where it's useful to parse DirectBB without copying to heap is when you are serving small random values out of the block cache. At HotPads, we'd like to store hundreds of GB of real estate listing data in memory so it can be quickly served up at random. We want to access many small values that are already in memory, so basically skipping step 1 of 3 because values are already in memory. That being said, the DirectBB are not essential for us since we haven't run into gb problems, i just figured it would be nice to support them since they seem to be important to other people. My motivation for doing this is to make hbase a viable candidate for a large, auto-partitioned, sorted, *in-memory* database. Not the usual analytics use case, but i think hbase would be great for this. What exactly about the current system makes it not a viable candidate? On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson ryano...@gmail.com wrote: On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan mcor...@hotpads.com wrote: I'm a little confused over the direction of the DBBs in general, hence the lack of clarity in my code. I see value in doing fine-grained parsing of the DBB if you're going to have a large block of data and only want to retrieve a small KV from the middle of it. With this trie design, you can navigate your way through the DBB without copying hardly anything to the heap. It would be a shame blow away your entire L1 cache by loading a whole 256KB block onto heap if you only want to read 200 bytes out of the middle... it can be done ultra-efficiently. This paragraph is not factually correct. The DirectByteBuffer vs main heap has nothing to do with the CPU cache. Consider the following scenario: - read block from DFS - scan block in ram - prepare result set for client Pretty simple, we have a choice in step 1: - write to java heap - write to DirectByteBuffer off-heap controlled memory in either case, you are copying to memory, and therefore cycling thru the cpu cache (of course). The difference is whether the Java GC has to deal with the aftermath or not. So the question DBB or not is not one about CPU caches, but one about garbage collection. 
Of course, nothing is free, and dealing with DBB requires extensive in-situ bounds checking (look at the source code for that class!), and also requires manual memory management on behalf of the programmer. So you are faced with an expensive API (getByte is not as good as an array get), and a lot more homework to do. I have decided it's not worth it personally and am not chasing that line as a potential performance improvement, and I would encourage you not to either. Ultimately the DFS speed issues need to be solved by the DFS - HDFS needs more work, but alternatives are already there and are a lot faster. The problem is if you're going to iterate through an entire block made of 5000 small KV's doing thousands of DBB.get(index) calls. Those are like 10x slower than byte[index] calls. In that case, if it's a DBB, you want to copy the full block on-heap and access it through the byte[] interface. If it's a HeapBB, then you already have access to the underlying byte[]. Yes, this is the issue - you have to take an extra copy one way or another. Doing effective prefix compression with DBB is not really feasible imo, and that's another reason why I have given up on DBBs. So there's possibly value in implementing both methods. The main problem I see is a lack of interfaces in the current code base. I'll throw one
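A small sketch of the "copy the full block on-heap and scan via byte[]" point made above: one bulk copy out of a direct buffer instead of thousands of per-byte get(int) calls. Nothing HBase-specific is assumed.

import java.nio.ByteBuffer;

public final class BlockBytes {
  // Returns a heap byte[] view of the block: zero-copy for a plain heap buffer,
  // a single bulk copy for a DirectByteBuffer.
  static byte[] asHeapArray(ByteBuffer block) {
    ByteBuffer dup = block.duplicate();   // don't disturb the caller's position/limit
    if (dup.hasArray() && dup.arrayOffset() == 0 && dup.array().length == dup.remaining()) {
      return dup.array();                 // heap buffer: already backed by a byte[]
    }
    byte[] onHeap = new byte[dup.remaining()];
    dup.get(onHeap);                      // one bulk copy out of the off-heap buffer
    return onHeap;
  }
}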
Re: prefix compression implementation
So if the HCell or whatever ends up returning ByteBuffers, then that plays straight into scatter/gather NIO calls, and if some of them are DBB, then so much the merrier. For example, the thrift stuff takes ByteBuffers when it's calling for a byte sequence. -ryan On Mon, Sep 19, 2011 at 10:39 PM, Stack st...@duboce.net wrote: One other thought is that exposing ByteRange, ByteBuffer, and v1 array stuff in the Interface seems like you are exposing 'implementation' details that perhaps shouldn't show through. I'm guessing it's unavoidable though if the Interface is to be used in a few different contexts: i.e. v1 has to work if we are to get this new stuff in, some srcs will be DBBs, etc. St.Ack On Mon, Sep 19, 2011 at 10:33 PM, Stack st...@duboce.net wrote: On Mon, Sep 19, 2011 at 3:26 PM, Matt Corgan mcor...@hotpads.com wrote: I don't think the name is all that important, though I thought HCell was less clumsy than KeyValue or KeyValueInterface. Take a look at this interface on github: https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/model/HCell.java Seems like it should be trivially easy to get KeyValue to implement that. Then it provides the right methods to make compareTo methods that will work across different implementations. The implementations of those methods might have an if-statement to determine the class of the other HCell, and choose the fastest byte comparison method behind the scenes. I'd say call it Cell rather than HCell. You have getRowArray rather than getRow which we currently have, but I suppose it makes sense since you can then group by suffix. There is a patch lying around that adds a version to KV by using the top two bytes of the type byte. If you need me to dig it up, just say (then you might not have to have v1 stuff in your Interface). You might need to add some equals for stuff like same row, cf, and qualifier... but they can come later. The comparator stuff is currently horrid because it depends on context; i.e. whether the KVs are from -ROOT- or .META. or from a userspace table. There are some ideas for having it so only one comparator for all types but that's another issue. St.Ack
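To make the scatter/gather point concrete: if cells expose ByteBuffers (heap or direct), the IO layer can hand them straight to a gathering write without first flattening everything into one array. A minimal sketch below; the address and the key/value contents are placeholders.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class GatherWriteExample {
  public static void main(String[] args) throws Exception {
    ByteBuffer key = ByteBuffer.wrap("row1/cf:q/ts".getBytes(StandardCharsets.UTF_8));
    ByteBuffer value = ByteBuffer.wrap("value-bytes".getBytes(StandardCharsets.UTF_8));
    try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("localhost", 12345))) {
      // One gathering write: the kernel pulls from both buffers, no intermediate copy.
      ch.write(new ByteBuffer[] { key, value });
    }
  }
}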
Re: prefix compression implementation
Hey this stuff looks really interesting! On the ByteBuffer, the 'array' byte[] access to the underlying data is totally incompatible with the 'off heap' features that are implemented by DirectByteBuffer. While people talk about DBB in terms of nio performance, if you have to roundtrip the data thru java code, I'm not sure it buys you much - you still need to move data in and out of the main Java heap. Typically this is geared more towards apps which read and write from/to socket/files with minimal processing. While in the past I have been pretty bullish on off-heap caching for HBase, I have since changed my mind due to the poor API (ByteBuffer is a sucky way to access data structures in ram), and other reasons (ping me off list if you want). The KeyValue code pretty much presumes that data is in byte[] anyways, and I had thought that even with off-heap caching, we'd still have to copy KeyValues into main-heap during scanning anyways. Given the minimal size of the hfile blocks, I really dont see an issue with buffering a block output - especially if the savings is fairly substantial. Thanks, -ryan On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan mcor...@hotpads.com wrote: Jacek, Thanks for helping out with this. I implemented most of the DeltaEncoder and DeltaEncoderSeeker. I haven't taken the time to generate a good set of test data for any of this, but it does pass on some very small input data that aims to cover the edge cases i can think of. Perhaps you have full HFiles you can run through it. https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder I also put some notes on the PtDeltaEncoder regarding how the prefix trie should be optimally used. I can't think of a situation where you'd want to blow it up into the full uncompressed KeyValue ByteBuffer, so implementing the DeltaEncoder interface is a mismatch, but I realize it's only a starting point. You also would never really have a full ByteBuffer of KeyValues to pass to it for compression. Typically, you'd be passing individual KeyValues from the memstore flush or from a collection of HFiles being merged through a PriorityQueue. The end goal is to operate on the encoded trie without decompressing it. Long term, and in certain circumstances, it may even be possible to pass the compressed trie over the wire to the client who can then decode it. Let me know if I implemented that the way you had in mind. I haven't done the seekTo method yet, but will try to do that next week. Matt On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal ja...@fb.com wrote: Matt, Thanks a lot for the code. Great job! As I mentioned in JIRA I work full time on the delta encoding [1]. Right now the code and integration is almost done. Most of the parts are under review. Since it is a big change will plan to test it very carefully. After that, It will be ported to trunk and open sourced. I have a quick glimpse I have taken the different approach. I implemented a few different algorithms which are simpler. They also aims mostly to save space while having fast decompress/compress code. However the access is still sequential. The goal of my project is to save some RAM by having compressed BlockCache in memory. On the other hand, it seems that you are most concerned about seeking performance. I will read your code more carefully. A quick glimpse: we both implemented some routines (like vint), but expect that there is no overlap. I also seen that you spend some time investigating ByteBuffer vs. Byte[]. 
I experienced significant negative performance impact when I switched to ByteBuffer. However I postpone this optimization. Right now I think the easiest way to go would be that you will implement DeltaEncoder interface after my change: http://pastebin.com/Y8UxUByG (note, there might be some minor changes) That way, you will reuse my integration with existing code for free. Jacek [1] - I prefer to call it that way. Prefix is one of the algorithm, but there are also different approach. On 9/13/11 1:36 AM, Ted Yu yuzhih...@gmail.com wrote: Matt: Thanks for the update. Cacheable interface is defined in: src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java You can find the implementation at: src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java I will browse your code later. On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan mcor...@hotpads.com wrote: Hi devs, I put a developer preview of a prefix compression algorithm on github. It still needs some details worked out, a full set of iterators, about 200 optimizations, and a bunch of other stuff... but, it successfully passes some preliminary tests so I thought I'd get it in front of more eyeballs sooner than later. https://github.com/hotpads/hbase-prefix-trie It depends on HBase's Bytes.java and KeyValue.java, which depends on hadoop. Those jars are
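Both implementations mentioned above include their own "vint" routines. For readers unfamiliar with the idea, here is a generic LEB128-style varint (7 data bits per byte, high bit means "more follows"); it is in the same spirit but not necessarily the exact format used in the prefix-trie or delta-encoding code.

public final class VarInt {
  // Writes v starting at dst[off]; returns the next free position.
  static int write(byte[] dst, int off, int v) {
    while ((v & ~0x7F) != 0) {
      dst[off++] = (byte) ((v & 0x7F) | 0x80);  // low 7 bits, continuation bit set
      v >>>= 7;
    }
    dst[off++] = (byte) v;                      // final byte, continuation bit clear
    return off;
  }

  // Reads a varint starting at src[off]; the caller tracks how many bytes were consumed.
  static int read(byte[] src, int off) {
    int result = 0, shift = 0;
    byte b;
    do {
      b = src[off++];
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }
}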
Re: prefix compression implementation
, but ultimately hbase is an integrated whole, and the concurrency problems have been really tough to crack. Things are better than they have ever been, but still a lot of testing to do. Matt On Fri, Sep 16, 2011 at 6:08 PM, Ryan Rawson ryano...@gmail.com wrote: Hey this stuff looks really interesting! On the ByteBuffer, the 'array' byte[] access to the underlying data is totally incompatible with the 'off heap' features that are implemented by DirectByteBuffer. While people talk about DBB in terms of nio performance, if you have to roundtrip the data thru java code, I'm not sure it buys you much - you still need to move data in and out of the main Java heap. Typically this is geared more towards apps which read and write from/to socket/files with minimal processing. While in the past I have been pretty bullish on off-heap caching for HBase, I have since changed my mind due to the poor API (ByteBuffer is a sucky way to access data structures in ram), and other reasons (ping me off list if you want). The KeyValue code pretty much presumes that data is in byte[] anyways, and I had thought that even with off-heap caching, we'd still have to copy KeyValues into main-heap during scanning anyways. Given the minimal size of the hfile blocks, I really dont see an issue with buffering a block output - especially if the savings is fairly substantial. Thanks, -ryan On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan mcor...@hotpads.com wrote: Jacek, Thanks for helping out with this. I implemented most of the DeltaEncoder and DeltaEncoderSeeker. I haven't taken the time to generate a good set of test data for any of this, but it does pass on some very small input data that aims to cover the edge cases i can think of. Perhaps you have full HFiles you can run through it. https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder I also put some notes on the PtDeltaEncoder regarding how the prefix trie should be optimally used. I can't think of a situation where you'd want to blow it up into the full uncompressed KeyValue ByteBuffer, so implementing the DeltaEncoder interface is a mismatch, but I realize it's only a starting point. You also would never really have a full ByteBuffer of KeyValues to pass to it for compression. Typically, you'd be passing individual KeyValues from the memstore flush or from a collection of HFiles being merged through a PriorityQueue. The end goal is to operate on the encoded trie without decompressing it. Long term, and in certain circumstances, it may even be possible to pass the compressed trie over the wire to the client who can then decode it. Let me know if I implemented that the way you had in mind. I haven't done the seekTo method yet, but will try to do that next week. Matt On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal ja...@fb.com wrote: Matt, Thanks a lot for the code. Great job! As I mentioned in JIRA I work full time on the delta encoding [1]. Right now the code and integration is almost done. Most of the parts are under review. Since it is a big change will plan to test it very carefully. After that, It will be ported to trunk and open sourced. I have a quick glimpse I have taken the different approach. I implemented a few different algorithms which are simpler. They also aims mostly to save space while having fast decompress/compress code. However the access is still sequential. The goal of my project is to save some RAM by having compressed BlockCache in memory. 
On the other hand, it seems that you are most concerned about seeking performance. I will read your code more carefully. A quick glimpse: we both implemented some routines (like vint), but expect that there is no overlap. I also seen that you spend some time investigating ByteBuffer vs. Byte[]. I experienced significant negative performance impact when I switched to ByteBuffer. However I postpone this optimization. Right now I think the easiest way to go would be that you will implement DeltaEncoder interface after my change: http://pastebin.com/Y8UxUByG (note, there might be some minor changes) That way, you will reuse my integration with existing code for free. Jacek [1] - I prefer to call it that way. Prefix is one of the algorithm, but there are also different approach. On 9/13/11 1:36 AM, Ted Yu yuzhih...@gmail.com wrote: Matt: Thanks for the update. Cacheable interface is defined in: src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java You can find the implementation at: src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java I will browse your code later. On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan mcor...@hotpads.com wrote: Hi devs, I put a developer preview of a prefix compression algorithm on github. It still needs some
Re: prefix compression implementation
On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan mcor...@hotpads.com wrote: Ryan - thanks for the feedback. The situation I'm thinking of where it's useful to parse DirectBB without copying to heap is when you are serving small random values out of the block cache. At HotPads, we'd like to store hundreds of GB of real estate listing data in memory so it can be quickly served up at random. We want to access many small values that are already in memory, so basically skipping step 1 of 3 because values are already in memory. That being said, the DirectBB are not essential for us since we haven't run into gb problems, i just figured it would be nice to support them since they seem to be important to other people. My motivation for doing this is to make hbase a viable candidate for a large, auto-partitioned, sorted, *in-memory* database. Not the usual analytics use case, but i think hbase would be great for this. What exactly about the current system makes it not a viable candidate? On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson ryano...@gmail.com wrote: On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan mcor...@hotpads.com wrote: I'm a little confused over the direction of the DBBs in general, hence the lack of clarity in my code. I see value in doing fine-grained parsing of the DBB if you're going to have a large block of data and only want to retrieve a small KV from the middle of it. With this trie design, you can navigate your way through the DBB without copying hardly anything to the heap. It would be a shame blow away your entire L1 cache by loading a whole 256KB block onto heap if you only want to read 200 bytes out of the middle... it can be done ultra-efficiently. This paragraph is not factually correct. The DirectByteBuffer vs main heap has nothing to do with the CPU cache. Consider the following scenario: - read block from DFS - scan block in ram - prepare result set for client Pretty simple, we have a choice in step 1: - write to java heap - write to DirectByteBuffer off-heap controlled memory in either case, you are copying to memory, and therefore cycling thru the cpu cache (of course). The difference is whether the Java GC has to deal with the aftermath or not. So the question DBB or not is not one about CPU caches, but one about garbage collection. Of course, nothing is free, and dealing with DBB requires extensive in-situ bounds checking (look at the source code for that class!), and also requires manual memory management on the behalf of the programmer. So you are faced with an expensive API (getByte is not as good at an array get), and a lot more homework to do. I have decided it's not worth it personally and aren't chasing that line as a potential performance improvement, and I also would encourage you not to as well. Ultimately the DFS speed issues need to be solved by the DFS - HDFS needs more work, but alternatives are already there and are a lot faster. The problem is if you're going to iterate through an entire block made of 5000 small KV's doing thousands of DBB.get(index) calls. Those are like 10x slower than byte[index] calls. In that case, if it's a DBB, you want to copy the full block on-heap and access it through the byte[] interface. If it's a HeapBB, then you already have access to the underlying byte[]. Yes this is the issue - you have to take an extra copy one way or another. Doing effective prefix compression with DBB is not really feasible imo, and that's another reason why I have given up on DBBs. So there's possibly value in implementing both methods. 
The main problem I see is a lack of interfaces in the current code base. I'll throw one suggestion out there as food for thought. Create a new interface:

interface HCell {
  byte[] getRow();
  byte[] getFamily();
  byte[] getQualifier();
  long getTimestamp();
  byte getType();
  byte[] getValue();
  // plus an endless list of convenience methods:
  int getKeyLength();
  KeyValue getKeyValue();
  boolean isDelete();
  // etc, etc (or put these in a sub-interface)
}

We could start by making KeyValue implement that interface and then slowly change pieces of the code base to use HCell. That will allow us to start elegantly working in different implementations. PtKeyValue https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/keyvalue/trie/compact/read/PtKeyValue.java would be one of them. During the transition, you can always call PtKeyValue.getCopiedKeyValue(), which will instantiate a new byte[] in the traditional KeyValue format. I am not really super keen here, and while the interface of course makes plenty of sense, the issue is that you will need to turn an array of KeyValues (aka a Result instance) into a bunch of bytes on the wire. So there HAS to be a method that returns a ByteBuffer that the IO layer can then use to write out
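A hypothetical sketch of the first migration step Matt proposes (KeyValue satisfying the interface). It is written as a delegating adapter for clarity, assumes the era's KeyValue accessors (getRow, getFamily, getQualifier, getTimestamp, getType, getValue, getKeyLength, isDelete), and is not actual HBase code.

public class KeyValueCell implements HCell {
  private final KeyValue kv;

  public KeyValueCell(KeyValue kv) { this.kv = kv; }

  // Everything delegates to what KeyValue already stores in its backing byte[].
  @Override public byte[] getRow()        { return kv.getRow(); }
  @Override public byte[] getFamily()     { return kv.getFamily(); }
  @Override public byte[] getQualifier()  { return kv.getQualifier(); }
  @Override public long getTimestamp()    { return kv.getTimestamp(); }
  @Override public byte getType()         { return kv.getType(); }
  @Override public byte[] getValue()      { return kv.getValue(); }
  @Override public int getKeyLength()     { return kv.getKeyLength(); }
  @Override public KeyValue getKeyValue() { return kv; }
  @Override public boolean isDelete()     { return kv.isDelete(); }
}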
Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal
We thought about it earlier, but single machine needing to come back up to restore didnt seem like a good idea. -ryan On Sat, Sep 3, 2011 at 11:43 PM, Mathias Herberts mathias.herbe...@gmail.com wrote: On Sep 4, 2011 1:39 AM, Bill de hÓra li...@dehora.net wrote: On 02/09/11 19:06, Stack wrote: What do folks think? Not putting the log into hdfs seems like a good idea. I was somehow thinking the opposite as it makes irrecoverable machine failures much more problematic. What makes you say it's a good idea?
Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal
My understanding is that the ASF is about community, not code. So what is the goal for Accumulo? Build a community. How much would it intersect with the HBase community? Sounds like a lot. Does it still make sense to incubate it then? To the point earlier that ASF has hosted multiple competitors of various core projects, notably httpd, I had a look, there is exactly 2 projects that serve HTTP exclusively: Apache HTTPD Apache Traffic Server But these 2 are complementary, although some features kind of overlap (mod_proxy for eg), they dont really compete directly. So, would the ASF allow incubation of a web server product, for example nginx (which is a direct httpd competitor)? If the answer is no either work with the httpd community or go elsewhere, then sure Accumulo should have the same treatment? -ryan On Sat, Sep 3, 2011 at 12:00 AM, Bernd Fondermann bernd.fonderm...@googlemail.com wrote: On Saturday, September 3, 2011, Andrew Purtell apurt...@apache.org wrote: I'm simply pointing out a lack of community involvement to date. I would only add to this that the incubation proposal makes a controversial statement regarding existing involvement with the HBase community. It may be technically true if a certain company with involvement in HBase has also been interacting with Accumulo, but is disingenuous to claim that the community has been involved here. It looks like strictly a one way street: They have been able to observe or borrow the fruits of our labor for years, and now at a suitable point wish to incubate at the ASF to compete with our project for community. That is not community involvement. That is leeching. are you saying that the proposal is actually some kind of HBase fork? And, isn't this 'competition' already happening between all the BT and Dynamo implementations? I fail to see anything bad happening here. Bernd
Re: already online on this server - still buggy?
Oh yes I need to dig this up. But is the solution to 'find the potential problem and fix the hole'? Because it's quite possible the problem is that regionserver and master were being bounced around at the same time, leading to ? In any case, why fail the assignment. On Mon, Aug 8, 2011 at 3:36 PM, Stack st...@duboce.net wrote: On Sun, Aug 7, 2011 at 9:21 PM, Ryan Rawson ryano...@gmail.com wrote: Why is this still happening? This was a major issue in the old master. And still broke? What happened with this region when you trace it in master logs? St.Ack
already online on this server - still buggy?
Hi all, I think we still have a hole in the RIT graph... I get messages like this in my RS: 2011-08-08 04:17:48,469 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of region_name. but already online on this server And the master UI says the region continues to hang out in PENDING_OPEN in the RIT graph. Why is this still happening? This was a major issue in the old master. And still broke? -ryan
Re: Msft.... (renamed thread)
Stack can talk about this, but essentially for a period he could contribute to HBase only, but not to Hadoop. As you note, since Stack has joined my former employer the situation is good now. While it might be technically correct to say that MSFT has supported HBase in the past, this is a legalistic view imo, since ultimately I dont think that powerset uses HBase anymore. -ryan On Tue, Aug 2, 2011 at 8:59 AM, Andrew Purtell apurt...@apache.org wrote: When Microsoft acquired Powerset, Stack and Jim were still working there, but were disallowed by policy to contribute to HBase ... for months. My understanding is this was due to concerns about intellectual property -- open source fright?. Anyway, it was a disruptive period for the project that was resolved when Stack left. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) From: Doug Meil doug.m...@explorysmedical.com To: dev@hbase.apache.org dev@hbase.apache.org Sent: Tuesday, August 2, 2011 4:43 AM Subject: Msft (renamed thread) This is a reasonably interesting history question. Powerset folks were working on Hbase in 2007, but per... http://en.wikipedia.org/wiki/Powerset_%28company%29 ... Microsoft didn't buy Powerset until mid-2008. But that's all in the past. However, Microsoft is currently a platinum sponsor of the Apache Software Foundation... http://www.apache.org/foundation/thanks.html ... along with Google and Yahoo, which means they all each donate at least $100k per year to ASF. So in an extended financial sense, Msft supports Hbase by way of their donation to ASF, but they also support everything else in ASF, just like the rest of the big donors. On 8/1/11 10:56 PM, Ryan Rawson ryano...@gmail.com wrote: No one at powerset is currently contributing to HBase. Was is the key here - in the past, etc. I guess MSFT never got to integrating the HBase API with the C# LINQ system and Visual Studio. Maybe it's that azure table services? On Mon, Aug 1, 2011 at 7:50 PM, Fuad Efendi f...@efendi.ca wrote: re: Is it really-really supported by Microsoft employees?! It is really, really not. I believe Hbase was contributed to Apache by a Powerset, currently owned by Microsoft; and (same contributors) were full-time supporting Hbase and having salaries from Microsoft for at least a year; it was first (implicit) contribution from Microsoft to Apache.
Re: HBASE-4089 HBASE-4147 - on the topic of ops output
You should ask for your money back!! On Sun, Jul 31, 2011 at 3:10 PM, Fuad Efendi f...@efendi.ca wrote: What is it all about? HBase sucks. Too many problems to newcomers, few-weeks-warm-up to begin with Is it really-really supported by Microsoft employees?! And, SEO of course: === -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada Data Mining, Search Engines http://www.tokenizer.ca On 11-07-29 7:49 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I'm for publishing all performance metrics in JMX (in addition to exposing it wherever else you guys decide). That's because JMX is probably the easiest for our SPM for HBase [1] to get to HBase performance metrics and I suspect we are not alone. Otis [1] http://sematext.com/spm/hbase-performance-monitoring/index.html Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase Hadoop ecosystem search :: http://search-hadoop.com/ From: Andrew Purtell apurt...@apache.org To: Doug Meil doug.m...@explorysmedical.com; dev@hbase.apache.org dev@hbase.apache.org Sent: Friday, July 29, 2011 4:34 PM Subject: Re: HBASE-4089 HBASE-4147 - on the topic of ops output I'd rather see this output being able to be captured by something the sink that Todd suggested, rather than focusing on shell access. I don't agree. Look at what we have existing and proposed: - Java API access to server and region load information, that the shell uses - A proposal to dump some stats into log files, that then has to be scraped - A proposal (by the FB guys) to export some JSON via a HTTP servlet This is not good design, this is a bunch of random shit stuck together. Note that what Todd proposed does not preclude adding Java client API support for retrieving it. At a minimum all of this information must be accessible via the Java client API, to enable programmatic monitoring and analysis use cases. I'll add the shell support if nobody else cares about it, that is a relatively small detail, but one I think is important. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) From: Doug Meil doug.m...@explorysmedical.com To: dev@hbase.apache.org dev@hbase.apache.org; apurt...@apache.org apurt...@apache.org Sent: Friday, July 29, 2011 11:39 AM Subject: Re: HBASE-4089 HBASE-4147 - on the topic of ops output I'd rather see this output being able to be captured by something the sink that Todd suggested, rather than focusing on shell access. HServerLoad is super-summary at the RS level, and both the items in 4089 and 4147 are proposed to be summarized but still have reasonable detail (e.g., even table/CF summary there could be dozens of entries given a reasonably complex system). On 7/29/11 1:15 PM, Andrew Purtell apurt...@apache.org wrote: There is also the matter of HServerLoad and how that is used by the shell and master UI to report on cluster status. I'd like the shell to be able to let the user explore all of these different reports interactively. At the very least, they should all be handled the same way. And then there is Riley's work over at FB on a slow query log. How does that fit in? Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) From: Todd Lipcon t...@cloudera.com To: dev@hbase.apache.org Sent: Friday, July 29, 2011 9:58 AM Subject: Re: HBASE-4089 HBASE-4147 - on the topic of ops output What I'd prefer is something like: interface BlockCacheReportSink { public void reportStats(BlockCacheReport report); } class LoggingBlockCacheReportSink { ... 
{ log it with whatever formatting you want } } then a configuration which could default to the logging implementation, but orgs could easily substitute their own implementation. For example, I could see wanting to do an implementation where it keeps local RRD graphs of some stats, or pushes them to a central management server. The assumption is that BlockCacheReport is a fairly straightforward struct with the non-formatted information available. -Todd On Fri, Jul 29, 2011 at 4:15 AM, Doug Meil doug.m...@explorysmedical.com wrote: Hi Folks- You probably already saw my email yesterday on this... https://issues.apache.org/jira/browse/HBASE-4089 (block cache report) ...and I just created this one... https://issues.apache.org/jira/browse/HBASE-4147 (StoreFile query report) What I'd like to run past the dev-list is this: if HBase had periodic summary usage statistics, where should they go? What I'd like to throw out for discussion is that I'm suggesting that it should simply go to the log files and users can slice and dice this on their own. No UI (i.e., JSPs), no JMX, etc. The summary of the output is this: BlockCacheReport: on configured interval, print
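A fleshed-out sketch of Todd's sink proposal above: a report interface with a logging default, chosen via configuration so operators can plug in their own implementation (RRD graphs, a central collector, ...). The config key, class names, and report fields are illustrative, not an existing HBase API.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class BlockCacheReportSinkExample {
  public interface BlockCacheReportSink {
    void reportStats(BlockCacheReport report);
  }

  /** Plain struct carrying the unformatted numbers. */
  public static class BlockCacheReport {
    public long hits, misses, evictions, sizeBytes;
  }

  /** Default sink: just log it with whatever formatting you want. */
  public static class LoggingBlockCacheReportSink implements BlockCacheReportSink {
    private static final Log LOG = LogFactory.getLog(LoggingBlockCacheReportSink.class);
    @Override
    public void reportStats(BlockCacheReport r) {
      LOG.info("blockCache hits=" + r.hits + " misses=" + r.misses
          + " evictions=" + r.evictions + " sizeBytes=" + r.sizeBytes);
    }
  }

  // Orgs substitute their own implementation through configuration.
  static BlockCacheReportSink createSink(Configuration conf) {
    Class<? extends BlockCacheReportSink> cls = conf.getClass(
        "hbase.blockcache.report.sink",            // hypothetical key
        LoggingBlockCacheReportSink.class, BlockCacheReportSink.class);
    return ReflectionUtils.newInstance(cls, conf);
  }
}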
Re: heapSize() implementation of KeyValue
Each array is really a pointer to an array (hence the references), then we are taking account of the overhead of the 'bytes' array itself. And I see 3 integers pasted in, so things are looking good to me. On Sun, Jul 31, 2011 at 10:01 PM, Akash Ashok thehellma...@gmail.com wrote: Hi, I was going through the heapSize() method in the class KeyValue and I couldn't seem to understand a few things, marked with the "Why this?" comments below:

private byte [] bytes = null;
private int offset = 0;
private int length = 0;
private int keyLength = 0;
// the row cached
private byte [] rowCache = null;
// default value is 0, aka DNC
private long memstoreTS = 0;
/** @return Timestamp */
private long timestampCache = -1;

public long heapSize() {
  return ClassSize.align(
      // Fixed Object size
      ClassSize.OBJECT +
      // Why this?
      (2 * ClassSize.REFERENCE) +
      // bytes Array
      ClassSize.align(ClassSize.ARRAY) +
      // Size of int length
      ClassSize.align(length) +
      // Why this?? There are only 2 ints leaving length which are int (offset, length)
      (3 * Bytes.SIZEOF_INT) +
      // rowCache byte array
      ClassSize.align(ClassSize.ARRAY) +
      // Accounts for the longs memstoreTS and timestampCache
      (2 * Bytes.SIZEOF_LONG));
}
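A term-by-term reading of that heapSize(), matching Ryan's answer and the fields pasted above; the exact padding depends on ClassSize.align and the JVM:

// ClassSize.OBJECT                  KeyValue's own object header
// 2 * ClassSize.REFERENCE           the two array *references*: bytes, rowCache
// ClassSize.align(ClassSize.ARRAY)  header of the bytes[] array object itself
// ClassSize.align(length)           the payload bytes this KeyValue spans
// 3 * Bytes.SIZEOF_INT              the three int fields: offset, length, keyLength
// ClassSize.align(ClassSize.ARRAY)  header of the rowCache[] array object
// 2 * Bytes.SIZEOF_LONG             the two long fields: memstoreTS, timestampCache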
Re: Avro connector
Someone, but not necessarily the original contributor, should step up and maintain. Ideally someone who is also using it :) This could be a good chance to get on the good sides of everyone! On Jul 14, 2011 11:48 AM, Doug Meil doug.m...@explorysmedical.com wrote: +1 On 7/14/11 2:16 PM, Andrew Purtell apurt...@apache.org wrote: HBASE-2400 introduced a new connector contrib architecturally equivalent to the Thrift connector, but using Avro serialization and associated transport and RPC server work. However, it remains unfinished, was developed against an old version of Avro, is currently not maintained, and is regarded as not production quality (see: http://www.quora.com/What-is-the-current-status-for-using-Avro-with-HBase) . Therefore I propose: If a contributor steps up, then this person should bring the Avro connector up to par with the Thrift connector. Otherwise, we should deprecate and remove the Avro connector. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Converting byte[] to ByteBuffer
I think my general point is we could hack up the hbase source, add refcounting, circumvent the gc, etc or we could demand more from the dfs. If a variant of hdfs-347 was committed, reads could come from the Linux buffer cache and life would be good. The choice isn't fast hbase vs slow hbase, there are elements of bugs there as well. On Jul 9, 2011 12:25 PM, M. C. Srivas mcsri...@gmail.com wrote: On Fri, Jul 8, 2011 at 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: There are couple of things here, one is direct byte buffers to put the blocks outside of heap, the other is MMap'ing the blocks directly from the underlying HDFS file. I think they both make sense. And I'm not sure MapR's solution will be that much better if the latter is implemented in HBase. There're some major issues with mmap'ing the local hdfs file (the block) directly: (a) no checksums to detect data corruption from bad disks (b) when a disk does fail, the dfs could start reading from an alternate replica ... but that option is lost when mmap'ing and the RS will crash immediately (c) security is completely lost, but that is minor given hbase's current status For those hbase deployments that don't care about the absence of the (a) and (b), especially (b), its definitely a viable option that gives good perf. At MapR, we did consider similar direct-access capability and rejected it due to the above concerns. On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote: The overhead in a byte buffer is the extra integers to keep track of the mark, position, limit. I am not sure that putting the block cache in to heap is the way to go. Getting faster local dfs reads is important, and if you run hbase on top of Mapr, these things are taken care of for you. On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Also, it's for a good cause, moving the blocks out of main heap using direct byte buffers or some other more native-like facility (if DBB's don't work). On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote: Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API is...annoying. On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is there an open issue for this? How hard will this be? :)
Re: Converting byte[] to ByteBuffer
No lines of hbase were changed to run on Mapr. Mapr implements the hdfs API and uses jni to get local data. If hdfs wanted to it could use more sophisticated methods to get data rapidly from local disk to a client's memory space...as Mapr does. On Jul 9, 2011 6:05 PM, Doug Meil doug.m...@explorysmedical.com wrote: re: If a variant of hdfs-347 was committed, I agree with what Ryan is saying here, and I'd like to second (third? fourth?) keep pushing for HDFS improvements. Anything else is coding around the bigger I/O issue. On 7/9/11 6:13 PM, Ryan Rawson ryano...@gmail.com wrote: I think my general point is we could hack up the hbase source, add refcounting, circumvent the gc, etc or we could demand more from the dfs. If a variant of hdfs-347 was committed, reads could come from the Linux buffer cache and life would be good. The choice isn't fast hbase vs slow hbase, there are elements of bugs there as well. On Jul 9, 2011 12:25 PM, M. C. Srivas mcsri...@gmail.com wrote: On Fri, Jul 8, 2011 at 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: There are couple of things here, one is direct byte buffers to put the blocks outside of heap, the other is MMap'ing the blocks directly from the underlying HDFS file. I think they both make sense. And I'm not sure MapR's solution will be that much better if the latter is implemented in HBase. There're some major issues with mmap'ing the local hdfs file (the block) directly: (a) no checksums to detect data corruption from bad disks (b) when a disk does fail, the dfs could start reading from an alternate replica ... but that option is lost when mmap'ing and the RS will crash immediately (c) security is completely lost, but that is minor given hbase's current status For those hbase deployments that don't care about the absence of the (a) and (b), especially (b), its definitely a viable option that gives good perf. At MapR, we did consider similar direct-access capability and rejected it due to the above concerns. On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote: The overhead in a byte buffer is the extra integers to keep track of the mark, position, limit. I am not sure that putting the block cache in to heap is the way to go. Getting faster local dfs reads is important, and if you run hbase on top of Mapr, these things are taken care of for you. On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Also, it's for a good cause, moving the blocks out of main heap using direct byte buffers or some other more native-like facility (if DBB's don't work). On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote: Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API is...annoying. On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is there an open issue for this? How hard will this be? :)
Re: Converting byte[] to ByteBuffer
The overhead in a byte buffer is the extra integers to keep track of the mark, position, limit. I am not sure that putting the block cache in to heap is the way to go. Getting faster local dfs reads is important, and if you run hbase on top of Mapr, these things are taken care of for you. On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Also, it's for a good cause, moving the blocks out of main heap using direct byte buffers or some other more native-like facility (if DBB's don't work). On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote: Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API is...annoying. On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is there an open issue for this? How hard will this be? :)
Re: Converting byte[] to ByteBuffer
Hey, When running on top of Mapr, hbase has fast cached access to locally stored files, the Mapr client ensures that. Likewise, hdfs should also ensure that local reads are fast and come out of cache as necessary. Eg: the kernel block cache. I wouldn't support mmap, it would require 2 different read path implementations. You will never know when a read is not local. Hdfs needs to provide faster local reads imo. Managing the block cache in not heap might work but you also might get there and find the dbb accounting overhead kills. On Jul 8, 2011 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: There are couple of things here, one is direct byte buffers to put the blocks outside of heap, the other is MMap'ing the blocks directly from the underlying HDFS file. I think they both make sense. And I'm not sure MapR's solution will be that much better if the latter is implemented in HBase. On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote: The overhead in a byte buffer is the extra integers to keep track of the mark, position, limit. I am not sure that putting the block cache in to heap is the way to go. Getting faster local dfs reads is important, and if you run hbase on top of Mapr, these things are taken care of for you. On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Also, it's for a good cause, moving the blocks out of main heap using direct byte buffers or some other more native-like facility (if DBB's don't work). On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote: Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API is...annoying. On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is there an open issue for this? How hard will this be? :)
Re: Converting byte[] to ByteBuffer
On Jul 8, 2011 7:19 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: When running on top of Mapr, hbase has fast cached access to locally stored files, the Mapr client ensures that. Likewise, hdfs should also ensure that local reads are fast and come out of cache as necessary. Eg: the kernel block cache. Agreed! However I don't see how that's possible today. Eg, it'd require more of a byte buffer type of API to HDFS, random reads not using streams. It's easy to add. I don't think its as easy as you say. And even using the stream API Mapr delivers a lot more performance. And this is from my own tests not a white paper. I think the biggest win for HBase with MapR is the lack of the NameNode issues and snapshotting. In particular, snapshots are pretty much a standard RDBMS feature. That is good too - if you are using hbase in real time prod you need to look at Mapr. But even beyond that the performance improvements are insane. We are talking like 8-9x perf on my tests. Not to mention substantially reduced latency. I'll repeat again, local accelerated access is going to be a required feature. It already is. I investigated using dbb once upon a time, I concluded that managing the ref counts would be a nightmare, and the better solution was to copy keyvalues out of the dbb during scans. Injecting refcount code seems like a worse remedy than the problem. Hbase doesn't have as many bugs but explicit ref counting everywhere seems dangerous. Especially when a perf solution is already here. Use Mapr or hdfs-347/local reads. Managing the block cache in not heap might work but you also might get there and find the dbb accounting overhead kills. Lucene uses/abuses ref counting so I'm familiar with the downsides. When it works, it's great, when it doesn't it's a nightmare to debug. It is possible to make it work though. I don't think there would be overhead from it, ie, any pool of objects implements ref counting. It'd be nice to not have a block cache however it's necessary for caching compressed [on disk] blocks. On Fri, Jul 8, 2011 at 7:05 PM, Ryan Rawson ryano...@gmail.com wrote: Hey, When running on top of Mapr, hbase has fast cached access to locally stored files, the Mapr client ensures that. Likewise, hdfs should also ensure that local reads are fast and come out of cache as necessary. Eg: the kernel block cache. I wouldn't support mmap, it would require 2 different read path implementations. You will never know when a read is not local. Hdfs needs to provide faster local reads imo. Managing the block cache in not heap might work but you also might get there and find the dbb accounting overhead kills. On Jul 8, 2011 6:47 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: There are couple of things here, one is direct byte buffers to put the blocks outside of heap, the other is MMap'ing the blocks directly from the underlying HDFS file. I think they both make sense. And I'm not sure MapR's solution will be that much better if the latter is implemented in HBase. On Fri, Jul 8, 2011 at 6:26 PM, Ryan Rawson ryano...@gmail.com wrote: The overhead in a byte buffer is the extra integers to keep track of the mark, position, limit. I am not sure that putting the block cache in to heap is the way to go. Getting faster local dfs reads is important, and if you run hbase on top of Mapr, these things are taken care of for you. 
On Jul 8, 2011 6:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Also, it's for a good cause, moving the blocks out of main heap using direct byte buffers or some other more native-like facility (if DBB's don't work). On Fri, Jul 8, 2011 at 5:34 PM, Ryan Rawson ryano...@gmail.com wrote: Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API is...annoying. On Jul 8, 2011 4:51 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is there an open issue for this? How hard will this be? :)
Re: zoo.cfg vs hbase-site.xml
I was thinking that perhaps the normative use case for talking to a cluster is to specify the quorum name and path... The implicit config can be really confusing and is out of norms compared to other data store systems. Eg MySQL, memcache, etc. On Jul 6, 2011 2:14 PM, Stack st...@duboce.net wrote: I agree that we should be more consistent in how we get zk config (Your original report looks like a bug Lars). I also recently tripped over the fact that hbase uses different names for one or two zk configs. We need to fix that too. St.Ack On Mon, Jul 4, 2011 at 8:59 AM, Jesse Yates jesse.k.ya...@gmail.com wrote: Isn't that kind of the point though? If you drop in a zk config file on a machine, you should be able to update all your apps on that machine to the new config. What's more important though is being able to easily distribute a changed zk config across your cluster and simultaneously across multiple applications. Rather than rewriting the confs for a handful of applications, and possibly making a mistake dealing with each application's own special semantics, a single conf to update everything just makes sense. I would lobby then that we make usage more consistent (as Lars recommends) and make some of the hbase conf values more closely match the zk conf values (though hbase.${zk.value} is really not bad). -Jesse From: Ryan Rawson [ryano...@gmail.com] Sent: Monday, July 04, 2011 5:25 AM To: dev@hbase.apache.org Subject: Re: zoo.cfg vs hbase-site.xml Should just fully deprecate zoo.cfg, it ended up being more trouble than it was worth. When you use zoo.cfg you cannot connect to more than 1 cluster from a single JVM. Annoying! On Sun, Jul 3, 2011 at 10:22 AM, Ted Yu yuzhih...@gmail.com wrote: I looked at conf/zoo_sample.cfg from zookeeper trunk. The naming of properties is different from the way we name hbase.zookeeper.property.maxClientCnxns e.g. # the port at which the clients will connect clientPort=2181 FYI On Sun, Jul 3, 2011 at 9:53 AM, Lars George lars.geo...@gmail.com wrote: Hi, Usually the zoo.cfg overrides *all* settings of the hbase-site.xml (including the ones from hbase-default.xml) - when present. But in some places we do not consider this, for example in HConnectionManager: static { // We set instances to one more than the value specified for {@link // HConstants#ZOOKEEPER_MAX_CLIENT_CNXNS}. By default, the zk default max // connections to the ensemble from the one client is 30, so in that case we // should run into zk issues before the LRU hit this value of 31. MAX_CACHED_HBASE_INSTANCES = HBaseConfiguration.create().getInt( HConstants.ZOOKEEPER_MAX_CLIENT_CNXNS, HConstants.DEFAULT_ZOOKEPER_MAX_CLIENT_CNXNS) + 1; HBASE_INSTANCES = new LinkedHashMap<HConnectionKey, HConnectionImplementation>( (int) (MAX_CACHED_HBASE_INSTANCES / 0.75F) + 1, 0.75F, true) { @Override protected boolean removeEldestEntry( Map.Entry<HConnectionKey, HConnectionImplementation> eldest) { return size() > MAX_CACHED_HBASE_INSTANCES; } }; This only reads it from hbase-site.xml+hbase-default.xml. This is inconsistent, I think this should use ZKConfig.makeZKProps(conf) and then get the value. Thoughts? Lars
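If the consistency fix Lars suggests were applied, the lookup would go through the merged ZK properties rather than straight through HBaseConfiguration. A rough sketch of that direction, assuming the 0.90-era helper and constant names quoted in the thread (the exact property key and fallback behavior are illustrative, not the committed fix):

  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.zookeeper.ZKConfig;

  class ZkAwareConnectionLimits {
    // Sketch: honor zoo.cfg overrides when sizing the cached-connection map.
    static int maxCachedInstances() {
      Configuration conf = HBaseConfiguration.create();
      Properties zkProps = ZKConfig.makeZKProps(conf);   // merges zoo.cfg (if present) over hbase-site/default
      String raw = zkProps.getProperty("maxClientCnxns");
      int maxCnxns = (raw != null)
          ? Integer.parseInt(raw.trim())
          : conf.getInt(HConstants.ZOOKEEPER_MAX_CLIENT_CNXNS,
                        HConstants.DEFAULT_ZOOKEPER_MAX_CLIENT_CNXNS);
      return maxCnxns + 1;                                // same "+ 1" headroom as the current static block
    }
  }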
Re: zoo.cfg vs hbase-site.xml
Should just fully deprecate zoo.cfg, it ended up being more trouble than it was worth. When you use zoo.cfg you cannot connect to more than 1 cluster from a single JVM. Annoying! On Sun, Jul 3, 2011 at 10:22 AM, Ted Yu yuzhih...@gmail.com wrote: I looked at conf/zoo_sample.cfg from zookeeper trunk. The naming of properties is different from the way we name hbase.zookeeper.property.maxClientCnxns e.g. # the port at which the clients will connect clientPort=2181 FYI On Sun, Jul 3, 2011 at 9:53 AM, Lars George lars.geo...@gmail.com wrote: Hi, Usually the zoo.cfg overrides *all* settings of the hbase-site.xml (including the ones from hbase-default.xml) - when present. But in some places we do not consider this, for example in HConnectionManager: static { // We set instances to one more than the value specified for {@link // HConstants#ZOOKEEPER_MAX_CLIENT_CNXNS}. By default, the zk default max // connections to the ensemble from the one client is 30, so in that case we // should run into zk issues before the LRU hit this value of 31. MAX_CACHED_HBASE_INSTANCES = HBaseConfiguration.create().getInt( HConstants.ZOOKEEPER_MAX_CLIENT_CNXNS, HConstants.DEFAULT_ZOOKEPER_MAX_CLIENT_CNXNS) + 1; HBASE_INSTANCES = new LinkedHashMap<HConnectionKey, HConnectionImplementation>( (int) (MAX_CACHED_HBASE_INSTANCES / 0.75F) + 1, 0.75F, true) { @Override protected boolean removeEldestEntry( Map.Entry<HConnectionKey, HConnectionImplementation> eldest) { return size() > MAX_CACHED_HBASE_INSTANCES; } }; This only reads it from hbase-site.xml+hbase-default.xml. This is inconsistent, I think this should use ZKConfig.makeZKProps(conf) and then get the value. Thoughts? Lars
Re: Pluggable block index
On Sun, Jun 5, 2011 at 11:33 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ok, the block index is only storing the first key of each block? Hmm... I think we can store a pointer to an exact position in the block, or at least allow that (for the FST implementation). Are you sure that is a good idea? Surely the disk seeks would destroy you on index load? How efficient is the current seeking? I have previously thought about prefix compression, it seemed doable, It does look like prefix compression should be doable. Eg, we'd seek to a position based on the block index (from which we'd have the entire key). From the seek'd to position, we could scan and load up each subsequent prefix compressed key into a KeyValue, though right the KV wouldn't be 'pointing' back to the internals of the block, it'd be creating a whole new byte[] for each KV (which could have it's own garbage related ramifications). you'd need a compressing algorithm Lucene's terms dict is very simple. The next key has the position at which the previous key differs. On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote: Also, dont break it :-) Part of the goal of HFile was to build something quick and reliable. It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. I have previously thought about prefix compression, it seemed doable, you'd need a compressing algorithm, then in the Scanner you would expand KeyValues and callers would end up with copies, not views on, the original data. The JVM is fairly good about short lived objects (up to a certain allocation rate that is), and while the original goal was to reduce memory usage, it could make sense to take a higher short term allocation rate if the wins from prefix compression are there. Also note that in whole-system profiling, often repeated methods in KeyValue do pop up. The goal of KeyValue was to have a format that didnt require deserialization into larger data structures (hence the lack of vint), and would be simple and fast. Undoing that work should be accompanied with profiling evidence that new slowdowns were not introduced. -ryan On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: You'd have to change how the Scanner code works, etc. You'll find out. Nice! Sounds fun. On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote: What are the specs/goals of a pluggable block index? Right now the block index is fairly tied deep in how HFile works. You'd have to change how the Scanner code works, etc. You'll find out. On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote: I do not know of one. FYI hfile is pretty standalone regards tests etc. There is even a perf testing class for hfile On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com wrote: I want to take a wh/hack at creating a pluggable block index, is there an open issue for this? I looked and couldn't find one.
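To make the Lucene-style scheme described above concrete: each key after the first stores how many leading bytes it shares with the previous key, plus the differing suffix. A minimal encoding sketch (names invented for illustration; this is just the idea being discussed, not a format HBase actually shipped):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  class PrefixCompressedKeys {
    // Writes keys as (sharedPrefixLen, suffixLen, suffixBytes) relative to the previous key.
    static byte[] encode(byte[][] sortedKeys) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      byte[] prev = new byte[0];
      for (byte[] key : sortedKeys) {
        int shared = 0;
        int max = Math.min(prev.length, key.length);
        while (shared < max && prev[shared] == key[shared]) shared++;
        out.writeShort(shared);                     // bytes shared with the previous key
        out.writeShort(key.length - shared);        // length of the differing suffix
        out.write(key, shared, key.length - shared);
        prev = key;
      }
      return bytes.toByteArray();
    }
  }

Decoding has to walk forward from a block boundary rebuilding each key, which is exactly why the scanner would hand out copies rather than views, as noted above.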
Re: Pluggable block index
When I thought about it, I didn't think cross-block compression would be a good idea - this is because you want to be able to decompress each block independently of each other. Perhaps a master HFile dictionary or something. -ryan On Mon, Jun 6, 2011 at 12:06 AM, M. C. Srivas mcsri...@gmail.com wrote: On Sun, Jun 5, 2011 at 11:37 PM, Ryan Rawson ryano...@gmail.com wrote: On Sun, Jun 5, 2011 at 11:33 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ok, the block index is only storing the first key of each block? Hmm... I think we can store a pointer to an exact position in the block, or at least allow that (for the FST implementation). Are you sure that is a good idea? Surely the disk seeks would destroy you on index load? I agree, it would be pretty bad. But, assuming that the block size is set appropriately, copying one key per 100 or so values into the block index does not really bloat the hfile and is good trade-off to avoid the seeking. Plus, it does not prevent prefix-compression inside the block itself. Are we considering prefix-compression of keys across blocks? How efficient is the current seeking? I have previously thought about prefix compression, it seemed doable, It does look like prefix compression should be doable. Eg, we'd seek to a position based on the block index (from which we'd have the entire key). From the seek'd to position, we could scan and load up each subsequent prefix compressed key into a KeyValue, though right the KV wouldn't be 'pointing' back to the internals of the block, it'd be creating a whole new byte[] for each KV (which could have it's own garbage related ramifications). you'd need a compressing algorithm Lucene's terms dict is very simple. The next key has the position at which the previous key differs. On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote: Also, dont break it :-) Part of the goal of HFile was to build something quick and reliable. It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. I have previously thought about prefix compression, it seemed doable, you'd need a compressing algorithm, then in the Scanner you would expand KeyValues and callers would end up with copies, not views on, the original data. The JVM is fairly good about short lived objects (up to a certain allocation rate that is), and while the original goal was to reduce memory usage, it could make sense to take a higher short term allocation rate if the wins from prefix compression are there. Also note that in whole-system profiling, often repeated methods in KeyValue do pop up. The goal of KeyValue was to have a format that didnt require deserialization into larger data structures (hence the lack of vint), and would be simple and fast. Undoing that work should be accompanied with profiling evidence that new slowdowns were not introduced. -ryan On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: You'd have to change how the Scanner code works, etc. You'll find out. Nice! Sounds fun. On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote: What are the specs/goals of a pluggable block index? Right now the block index is fairly tied deep in how HFile works. You'd have to change how the Scanner code works, etc. You'll find out. On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote: I do not know of one. FYI hfile is pretty standalone regards tests etc. 
There is even a perf testing class for hfile On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com wrote: I want to take a wh/hack at creating a pluggable block index, is there an open issue for this? I looked and couldn't find one.
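The trade-off discussed above, keeping roughly one key per block in the index and paying a short in-block scan after the seek, looks like this in outline. This is a sketch under the assumption that the index stores each block's first key and file offset; it is not the HFile code itself:

  class BlockIndexSketch {
    final byte[][] firstKeys;   // first key of each block, sorted
    final long[] blockOffsets;  // file offset of each block

    BlockIndexSketch(byte[][] firstKeys, long[] blockOffsets) {
      this.firstKeys = firstKeys;
      this.blockOffsets = blockOffsets;
    }

    // Returns the offset of the block that may contain searchKey.
    long seekToBlock(byte[] searchKey) {
      int lo = 0, hi = firstKeys.length - 1, found = 0;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        int cmp = compare(firstKeys[mid], searchKey);
        if (cmp <= 0) { found = mid; lo = mid + 1; } else { hi = mid - 1; }
      }
      return blockOffsets[found];  // caller then scans within the block to the exact key
    }

    private static int compare(byte[] a, byte[] b) {
      for (int i = 0; i < Math.min(a.length, b.length); i++) {
        int d = (a[i] & 0xff) - (b[i] & 0xff);
        if (d != 0) return d;
      }
      return a.length - b.length;
    }
  }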
Re: Pluggable block index
Also, dont break it :-) Part of the goal of HFile was to build something quick and reliable. It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. I have previously thought about prefix compression, it seemed doable, you'd need a compressing algorithm, then in the Scanner you would expand KeyValues and callers would end up with copies, not views on, the original data. The JVM is fairly good about short lived objects (up to a certain allocation rate that is), and while the original goal was to reduce memory usage, it could make sense to take a higher short term allocation rate if the wins from prefix compression are there. Also note that in whole-system profiling, often repeated methods in KeyValue do pop up. The goal of KeyValue was to have a format that didnt require deserialization into larger data structures (hence the lack of vint), and would be simple and fast. Undoing that work should be accompanied with profiling evidence that new slowdowns were not introduced. -ryan On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: You'd have to change how the Scanner code works, etc. You'll find out. Nice! Sounds fun. On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote: What are the specs/goals of a pluggable block index? Right now the block index is fairly tied deep in how HFile works. You'd have to change how the Scanner code works, etc. You'll find out. On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote: I do not know of one. FYI hfile is pretty standalone regards tests etc. There is even a perf testing class for hfile On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com wrote: I want to take a wh/hack at creating a pluggable block index, is there an open issue for this? I looked and couldn't find one.
Re: Pluggable block index
Oh BTW, you can't mmap anything in HBase unless you copy it to local disk first. HDFS = no mmap. just thought you'd like to know. On Sat, Jun 4, 2011 at 3:41 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. Isn't the block index separate from the actual data? So corruption in that case is unlikely. I have previously thought about prefix compression, it seemed doable, you'd need a compressing algorithm, then in the Scanner you would expand KeyValues I think we can try that later. I'm not sure one can make a hard and fast rule to always load the keys into RAM as an FST. The block index would seem to be fairly separate. On Sat, Jun 4, 2011 at 3:35 PM, Ryan Rawson ryano...@gmail.com wrote: Also, dont break it :-) Part of the goal of HFile was to build something quick and reliable. It can be hard to know you have all the corner cases down and you won't find out in 6 months that every single piece of data you have put in HBase is corrupt. Keeping it simple is one strategy. I have previously thought about prefix compression, it seemed doable, you'd need a compressing algorithm, then in the Scanner you would expand KeyValues and callers would end up with copies, not views on, the original data. The JVM is fairly good about short lived objects (up to a certain allocation rate that is), and while the original goal was to reduce memory usage, it could make sense to take a higher short term allocation rate if the wins from prefix compression are there. Also note that in whole-system profiling, often repeated methods in KeyValue do pop up. The goal of KeyValue was to have a format that didnt require deserialization into larger data structures (hence the lack of vint), and would be simple and fast. Undoing that work should be accompanied with profiling evidence that new slowdowns were not introduced. -ryan On Sat, Jun 4, 2011 at 3:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: You'd have to change how the Scanner code works, etc. You'll find out. Nice! Sounds fun. On Sat, Jun 4, 2011 at 3:27 PM, Ryan Rawson ryano...@gmail.com wrote: What are the specs/goals of a pluggable block index? Right now the block index is fairly tied deep in how HFile works. You'd have to change how the Scanner code works, etc. You'll find out. On Sat, Jun 4, 2011 at 3:17 PM, Stack saint@gmail.com wrote: I do not know of one. FYI hfile is pretty standalone regards tests etc. There is even a perf testing class for hfile On Jun 4, 2011, at 14:44, Jason Rutherglen jason.rutherg...@gmail.com wrote: I want to take a wh/hack at creating a pluggable block index, is there an open issue for this? I looked and couldn't find one.
Re: modular build and pluggable rpc
The cost of serialization is non trivial and a substantial expense in conveying information from regionserver - client. I did some timings, and sending data across the wire is surprisingly slow, but attempting to compress it with various compression systems ended up taking 50-100ms on average case (1-5mb Result[] sets). Originally when conceptualizing thrift, the thought was to just send the KeyValue byte[] over thrift as an opaque blob and not doing a whole structure thing, eg: no KeyValue structure with parts for each of the parts of a KeyValue. On large results that cost becomes prohibitive. While HTTP has a high overhead of headers, if one wanted to be http-oriented you could do: http://www.chromium.org/spdy The nice thing is that HTTP has a good set of interops and the like. The bad thing is it is too verbose. -ryan On Tue, May 31, 2011 at 1:22 PM, Stack st...@duboce.net wrote: On Mon, May 30, 2011 at 9:55 PM, Eric Yang ey...@yahoo-inc.com wrote: Maven modulation could be enhanced to have a structure looks like this: Super POM +- common +- shell +- master +- region-server +- coprocessor The software is basically group by processor type (role of the process) and a shared library. I'd change the list above. shell should be client and perhaps master and regionserver should be both inside a single 'server' submodule. We need to add security in there. Perhaps we'd have a submodule for thrift, avro, rest (and perhaps rest war file)? (Is this too many submodules -- I suppose once we are submodularized, adding new ones is trivial. Its the initial move to submodules that is painful) For RPC, there are several feasible options, avro, thrift and jackson+jersey (REST). Avro may seems cumbersome to define the schema in JSON string. Thrift comes with it's own rpc server, it is not trivial to add authorization and authentication to secure the rpc transport. Jackson+Jersey RPC message is biggest message size compare to Avro and thrift. All three frameworks have pros and cons but I think Jackson+jersey have the right balance for rpc framework. In most of the use case, pluggable RPC can be narrow down to two main category of use cases: 1. Freedom of creating most efficient rpc but hard to integrate with everything else because it's custom made. 2. Being able to evolve message passing and versioning. If we can see beyond first reason, and realize second reason is in part polymorphic serialization. This means, Jackson+Jersey is probably the better choice as a RPC framework because Jackson supports polymorphic serialization, and Jersey builds on HTTP protocol. It would be easier to versioning and add security on top of existing standards. The syntax and feature set seems more engineering proper to me. I always considered http attactive but much too heavy-weight for hbase rpc; each request/response would carry a bunch of what are for the most part extraneous headers. I suppose we should just measure. Regards JSON messages, thats interesting but hbase is all about binary data. Does jackson/jersey do BSON? St.Ack
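The 'opaque blob' idea above, shipping the already-serialized KeyValue bytes instead of re-encoding each field, is what keeps serialization cheap regardless of whether Thrift, Avro, or HTTP/SPDY carries it. A rough sketch of what that response encoding amounts to (the framing shown is invented for illustration):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.List;

  class OpaqueResultEncoder {
    // Each KeyValue is already a self-describing byte[]; just length-prefix and concatenate.
    static byte[] encode(List<byte[]> keyValues) throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeInt(keyValues.size());
      for (byte[] kv : keyValues) {
        out.writeInt(kv.length);   // no per-field (row/family/qualifier/ts/value) re-encoding
        out.write(kv);
      }
      return buf.toByteArray();
    }
  }

The transport then only sees one binary field per KeyValue, so its per-field overhead stays off the hot path of large Result[] sets.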
Re: modular build and pluggable rpc
The build modules are fine, I just wanted to voice my opinions on avro vs thrift. I dont think we should spend a lot of time attempting to build a avro vs thrift thing, we should plan to eventually move to thrift as our RPC serialization. I also concur with Todd, our server side code has had a lot of work and it isnt half bad now :-) +1 to maven modules, they are pretty cool On Fri, May 27, 2011 at 2:38 PM, Andrew Purtell apurt...@apache.org wrote: I don't disagree with any of this but the fact is we have compile time differences if going against secure Hadoop 0.20 or non-secure Hadoop 0.20. So either we decide to punt on integration with secure Hadoop 0.20 or we deal with the compile time differences. If dealing with them, we can do it by reflection, which is brittle and can be difficult to understand and debug, and someone would have to do this work; or we can wholesale replace RPC with something based on Thrift, and someone would have to do the work; or we take the pluggable RPC changes that Gary has already developed and modularize the build, which Eric has already volunteered to do. - Andy --- On Fri, 5/27/11, Todd Lipcon t...@cloudera.com wrote: From: Todd Lipcon t...@cloudera.com Subject: Re: modular build and pluggable rpc To: dev@hbase.apache.org Cc: apurt...@apache.org Date: Friday, May 27, 2011, 1:30 PM Agreed - I'm all for Thrift. Though, I actually, contrary to Ryan, think that the existing HBaseRPC handler/client code is pretty good -- better than the equivalents from Thrift Java. We could start by using Thrift serialization on our existing transport -- then maybe work towards contributing it upstream to the Thrift project. HDFS folks are potentially interested in doing that as well. -Todd On Fri, May 27, 2011 at 1:10 PM, Ryan Rawson ryano...@gmail.com wrote: I'm -1 on avro as a RPC format. Thrift is the way to go, any of the advantages of smaller serialization of avro is lost by the sheer complexity of avro and therefore the potential bugs. I understand the desire to have a pluggable RPC engine, but it feels like the better approach would be to adopt a unified RPC and just be done with it. I had a look at the HsHa mechanism in thrift and it is very good, it in fact matches our 'handler' approach - async recieving/sending of data, but single threaded for processing a message. -ryan On Fri, May 27, 2011 at 1:00 PM, Andrew Purtell apurt...@apache.org wrote: Also needing, perhaps later, consideration: - HDFS-347 or not - Lucene embedding for hbase-search, though as a coprocessor this is already pretty much handled if we have platform support (therefore a platform module) for a HDFS that can do local read shortcutting and block placement requests - HFile v1 versus v2 Making decoupled development at several downstream sites manageable, with a home upstream for all the work, while simultaneously providing clean migration paths for users, basically. 
--- On Fri, 5/27/11, Andrew Purtell apurt...@apache.org wrote: From: Andrew Purtell apurt...@apache.org Subject: modular build and pluggable rpc To: dev@hbase.apache.org Date: Friday, May 27, 2011, 12:49 PM From IRC: apurtell i propose we take the build modular as early as possible to deal with multiple platform targets apurtell secure vs nonsecure apurtell 0.20 vs 0.22 vs trunk apurtell i understand the maintenence issues with multiple rpc engines, for example, but a lot of reflection twistiness is going to be worse apurtell i propose we take up esammer on his offer apurtell so branch 0.92 asap, get trunk modular and working against multiple platform targets apurtell especially if we're going to see rpc changes coming from downstream projects... apurtell also what about supporting secure and nonsecure clients with the same deployment? apurtell zookeeper does this apurtell so that is selectable rpc engine per connection, with a negotiation apurtell we don't have or want to be crazy about it but a rolling upgrade should be possible if for example we are taking in a new rpc from fb (?) or cloudera (avro based?) apurtell also looks like hlog modules for 0.20 vs 0.22 and successors apurtell i think over time we can roadmap the rpc engines, if we have multiple, by deprecation apurtell now that we're on the edge of supporting both 0.20 and 0.22, and secure vs nonsecure, let's get it as manageable as possible right away St^Ack_ apurtell: +1 apurtell also i think there is some interest in async rpc engine St^Ack_ we should stick this up on dev i'd say Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) -- Todd Lipcon Software Engineer, Cloudera
Re: HBase version numbers
0.91, if used, will be a developer preview, much as there was a 0.89 Developer Preview before 0.90.x. DPs tend to be marked by the date they were cut, since there is no real version, and the average user is not expected (nor advised!) to run a DP in production. -ryan On Tue, May 10, 2011 at 9:59 PM, lohit lohit.vijayar...@gmail.com wrote: Hello, I see that the current released HBase version is 0.90.2, a 0.90.3 vote is open, and there is a branch/tag for 0.90.4. After this, the version for trunk seems to be 0.92.0, right? What happened to 0.91? Thanks, Lohit
Re: why not obtain row lock when getting a row
That is not the case, please see: https://issues.apache.org/jira/browse/HBASE-2248 There are alternative mechanisms (outlined in that JIRA) to assure atomic row reads. -ryan On Sun, Mar 27, 2011 at 11:54 PM, jiangwen w wjiang...@gmail.com wrote: So a client may read dirty data, considering the following case: client#1 updates firstName and lastName for a user. client#2 reads the information of the user when client#1 has updated firstName but has not yet updated lastName. So client#2 reads the latest firstName, but the old lastName. Sincerely On Mon, Mar 28, 2011 at 1:45 PM, Ryan Rawson ryano...@gmail.com wrote: Row locks are not necessary when reading; this changed, that is why that is still there. On Mar 27, 2011 10:42 PM, jiangwen w wjiang...@gmail.com wrote: I think a row lock should be obtained before getting a row, but the following method in the HRegion class shows a row lock won't be obtained: *public Result get(final Get get, final Integer lockid)* Although there is a *lockid* parameter, it is not used in this method. Sincerely Vince Wei
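For context, HBASE-2248 replaced read-side row locking with a multi-version scheme: writes only become visible once fully committed, and readers pin a read point so they never observe half of a row mutation. A much-simplified sketch of that idea follows; it is not the actual HBase consistency-control code, and it assumes writes complete in order for brevity:

  import java.util.concurrent.atomic.AtomicLong;

  class MvccSketch {
    private final AtomicLong nextWriteNumber = new AtomicLong(0); // stamps each row mutation
    private volatile long readPoint = 0;                          // highest fully completed write

    long beginWrite() { return nextWriteNumber.incrementAndGet(); }

    // Called only after the whole row mutation (e.g. firstName AND lastName) is in the memstore.
    void completeWrite(long writeNumber) {
      // The real code tracks out-of-order completions; assume in-order completion here.
      readPoint = writeNumber;
    }

    // A Get/Scan only returns KeyValues stamped <= readPoint, so it observes either
    // none or all of a row update -- no half-updated rows, and no read lock needed.
    long currentReadPoint() { return readPoint; }
  }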
Re: why not obtain row lock when getting a row
Row locks are not necessary when reading; this changed, that is why that is still there. On Mar 27, 2011 10:42 PM, jiangwen w wjiang...@gmail.com wrote: I think a row lock should be obtained before getting a row, but the following method in the HRegion class shows a row lock won't be obtained: *public Result get(final Get get, final Integer lockid)* Although there is a *lockid* parameter, it is not used in this method. Sincerely Vince Wei
Re: negotiated timeout
the HQuorumPeer uses hbase-site.xml/hbase-default.xml to configure ZK, including the line Patrick pointed out. You can increase that to increase the max timeout. -ryan On Thu, Mar 24, 2011 at 5:27 PM, Ted Yu yuzhih...@gmail.com wrote: Seeking more comment. -- Forwarded message -- From: Patrick Hunt ph...@apache.org Date: Thu, Mar 24, 2011 at 4:15 PM Subject: Re: negotiated timeout To: Ted Yu yuzhih...@gmail.com Cc: d...@zookeeper.apache.org, Mahadev Konar maha...@apache.org, zookeeper-...@hadoop.apache.org Ted, you'll need to ask the hbase guys about this if you are not running a dedicated zk cluster. I'm not sure how they manage embedded zk. However a quick search of the HBASE code results in: ./src/main/java/org/apache/hadoop/hbase/zookeeper/HQuorumPeer.java: // Set the max session timeout from the provided client-side timeout properties.setProperty(maxSessionTimeout, conf.get(zookeeper.session.timeout, 18)); Patrick On Thu, Mar 24, 2011 at 4:00 PM, Ted Yu yuzhih...@gmail.com wrote: Patrick: Do you want me to look at maxSessionTimeout ? Since hbase manages zookeeper, I am not sure I can control this parameter directly. On Thu, Mar 24, 2011 at 3:50 PM, Patrick Hunt ph...@apache.org wrote: http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_advancedConfiguration On Thu, Mar 24, 2011 at 3:43 PM, Mahadev Konar maha...@apache.org wrote: Hi Ted, The session timeout can be changed by the server depending on min/max bounds set on the servers. Are you servers configured to have a max timeout of 60 seconds? usually the default is 20 * tickTime. Looks like your ticktime is 3 seconds? thanks mahadev On Thu, Mar 24, 2011 at 3:20 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, hbase 0.90.1 uses zookeeper 3.3.2 I specified: property namezookeeper.session.timeout/name value49/value /property In zookeeper log I see: 2011-03-24 19:58:09,499 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.202.50.111:50325 2011-03-24 19:58:09,499 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x12ebb99d686a012 with negotiated timeout 6 for client /10.202.50.112:62386 2011-03-24 19:58:09,499 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.202.50.112:62387 2011-03-24 19:58:09,499 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x12ebb99d686a012 type:create cxid:0x1 zxid:0xfffe txntype:unknown reqpath:n/a Error Path:/hbase Error:KeeperErrorCode = NodeExists for /hbase 2011-03-24 19:58:09,499 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x12ebb99d686a013 with negotiated timeout 6 for client /10.202.50.111:50324 Can someone tell me how the negotiated timeout of 6 was computed ? Thanks
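For what it's worth, the negotiation described in the thread is simple to state: the server clamps the client's requested session timeout into its [minSessionTimeout, maxSessionTimeout] range, which default to 2 * tickTime and 20 * tickTime. A sketch of the arithmetic, assuming the truncated figures in the quoted logs stand for millisecond values (e.g. a negotiated timeout of 60000 ms) and using the 3-second tickTime mentioned above:

  class ZkSessionTimeoutSketch {
    // ZooKeeper server side: clamp the requested timeout into [min, max].
    static int negotiate(int requestedMs, int tickTimeMs, Integer minCfg, Integer maxCfg) {
      int min = (minCfg != null) ? minCfg : tickTimeMs * 2;   // default lower bound
      int max = (maxCfg != null) ? maxCfg : tickTimeMs * 20;  // default upper bound
      return Math.max(min, Math.min(requestedMs, max));
    }

    public static void main(String[] args) {
      // With tickTime = 3000 ms and no explicit maxSessionTimeout, the ceiling is 60000 ms,
      // so any larger zookeeper.session.timeout negotiates down to 60 seconds.
      System.out.println(negotiate(300000, 3000, null, null));  // prints 60000
    }
  }

Since HQuorumPeer copies zookeeper.session.timeout into maxSessionTimeout (as the quoted code shows), raising that property on the HBase side raises the ceiling the server will grant.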
Re: gauging cost of region movement
it would make sense to avoid moving regions, so therefore the more recently a region was moved, the less likely we should move it. you could imagine a hypothetical perfect 'region move cost' function that might look like: F(r) = timeSinceMoved(r) + size(r) + loadAvg(r) The functions should probably be normalized to [0,1], so the range of F would be [0,3] with 3 == 'dont move' and 0 == 'move first'. The goal is to minimize all the F(r[i]) in the moves. -ryan On Mon, Mar 21, 2011 at 4:26 PM, Jonathan Gray jg...@fb.com wrote: Also, using more stable measures of request count will help, such as 30 minute rolling averages. -Original Message- From: Jonathan Gray [mailto:jg...@fb.com] Sent: Monday, March 21, 2011 4:23 PM To: dev@hbase.apache.org Subject: RE: gauging cost of region movement This is an interesting direction, and definitely file a JIRA as this could be an additional metric in the future, but it's not exactly what I had in mind. One of the hardest parts of load balancing based on request count and other dynamic/transient measures is that you can get some pretty pathological conditions where you are always moving stuff around. To guard against it, I think we'll need to move to more of a cost-based algorithm that is taking not just the difference in request counts into account but also a baseline cost of moving a region. The cost difference in load between two unbalanced servers would have to outweigh the cost associated with moving a region. As you say, looking at the number of live operations to a given region could contribute to the cost of moving that region, but the best measure for that is probably just looking at request count (it's all requests that incur a cost, not just active scanners). JG -Original Message- From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Monday, March 21, 2011 3:44 PM To: dev@hbase.apache.org Subject: gauging cost of region movement Can we add a counter for the number of InternalScanner's to HRegion ? We decrement this counter when close() is called. Such counter can be used to gauge the cost of moving the underlying region. Cheers
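A literal reading of that cost function, with each term normalized into [0,1] before summing. Note the first term is inverted so a recently moved region scores high, matching "the more recently a region was moved, the less likely we should move it"; the normalization constants are placeholders, not tuned values:

  class RegionMoveCost {
    // F(r) = timeSinceMoved(r) + size(r) + loadAvg(r), each term normalized to [0,1].
    // Higher score = more expensive to move; the balancer prefers low-scoring regions.
    static double cost(long msSinceMoved, long regionSizeBytes, double requestRate,
                       long maxAgeMs, long maxSizeBytes, double maxRequestRate) {
      double recency = 1.0 - clamp((double) msSinceMoved / maxAgeMs);   // recently moved => high cost
      double size    = clamp((double) regionSizeBytes / maxSizeBytes);
      double load    = clamp(requestRate / maxRequestRate);
      return recency + size + load;                                     // range [0, 3]
    }

    private static double clamp(double v) { return Math.max(0.0, Math.min(1.0, v)); }
  }

Using a 30-minute rolling average for requestRate, as suggested above, would keep the load term from thrashing the balancer.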
Re: trimming RegionLoad fields
How much memory does profiling indicate these objects use? How much are you expecting to save? Saving 4-8 bytes even on a 10k region cluster is still only 80k of ram, not really significant. On Thu, Mar 17, 2011 at 2:32 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, See email thread 'One of the regionserver aborted, then the master shut down itself' for background. I am evaluating various ways of trimming the memory footprint of RegionLoad because there would be so many regions in a production cluster. Looking at field memstoreSizeMB of RegionLoad, I only found this reference - AvroUtil.hslToASL() The load balancer currently isn't checking this metric, and HRegion has a memstoreSize field. I wonder whether we can trim field memstoreSizeMB off RegionLoad. Please comment.
Re: trimming RegionLoad fields
Without solid evidence of we'll be saving X megabytes I don't see a compelling reason to hacking that stuff out yet. We sort of need a better out-of-the-box monitoring system. One idea I had was to embed OpenTSDB inside the HMaster. This way OpenTSDB would store info about a HBase cluster back in the same cluster it monitors. While this may sound weird I think it makes sense because every great database system provides strong self monitoring tools. Eg: Oracle, etc. Due to the LGPL, this is not currently viable. Perhaps there is an alternative floating out there we can ship with? And not ganglia :-) On Thu, Mar 17, 2011 at 2:47 PM, Andrew Purtell apurt...@apache.org wrote: memstoreSizeMB is part of the output printed by the shell when you do status 'detailed'. I use that. Isn't that information useful to others? - Andy --- On Thu, 3/17/11, Ryan Rawson ryano...@gmail.com wrote: From: Ryan Rawson ryano...@gmail.com Subject: Re: trimming RegionLoad fields To: dev@hbase.apache.org Cc: Ted Yu yuzhih...@gmail.com Date: Thursday, March 17, 2011, 2:37 PM How much memory does profiling indicating these objects use? How much are you expecting to save? Saving 4-8 bytes even on a 10k region cluster is still only 80k of ram, not really significant. On Thu, Mar 17, 2011 at 2:32 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, See email thread 'One of the regionserver aborted, then the master shut down itself' for background. I am evaluating various ways of trimming the memory footprint of RegionLoad because there would be so many regions in production cluster. Looking at field memstoreSizeMB of RegionLoad, I only found this reference - AvroUtil.hslToASL() Load balancer currently isn't checking this metric. And HRegion has memstoreSize field. I wonder whether we can trim field memstoreSizeMB off RegionLoad. Please comment.
Re: move meta table to ZK
Is it possible to search a list of znodes? That is what we do now with meta in HBase. I used to be a fan, but I think self-hosting all important meta data is the best approach. It makes lots of things easier, like replication, snapshots, etc. On Mar 17, 2011 9:27 PM, jiangwen w wjiang...@gmail.com wrote: What do you think about moving the meta table to ZK, so that no meta table is needed? If we do so, we need to enhance ZK in the following way: 1. keep the children of a ZNode in order. If we do so, we can benefit: 1. no need to treat the meta table in a special way; this will simplify the code a lot 2. ZK is highly available, so we don't worry about the availability of the meta data. 3. currently, if the region server hosting the meta table fails, the whole cluster may pause. If we move the meta table to ZK, there is no such problem. 4. the meta table may be a hotspot, but in ZK reading is scalable by adding more observers. Sincerely
Re: When a region is splitting, what will be done with its memstores?
The split transaction closes the region, at which time the memstores are flushed to disk; at that point they are empty. They are dereferenced when the HRegion is removed from the maps, then GCed. On Mon, Mar 14, 2011 at 1:55 AM, Zhou Shuaifeng zhoushuaif...@huawei.com wrote: I read the SplitTransaction and flush code, but still don't understand the procedure here; can someone tell me? Zhou Shuaifeng(Frank)
Re: HTable thread safety in 0.20.6
On Sun, Mar 6, 2011 at 9:25 PM, Suraj Varma svarma...@gmail.com wrote: Thanks all for your insights into this. I would agree that providing mechanisms to support no-outage upgrades going forward would really be widely beneficial. I was looking forward to Avro for this reason. Some follow up questions: 1) If asynchbase client to do this (i.e. talk wire protocol and adjust based on server versions), why not the native hbase client? Is there something in the native client design that would make this too hard / not worth emulating? Typically this has not been an issue. The particular design of the way that hadoop rpc (the rpc we use) makes it difficult to offer multiple protocol/version support. To fix it would more or less require rewriting the entire protocol stack. I'm glad we spent serious time making the base storage layer and query paths fast, since without those fundamentals a better RPC would be moot. From my measurements I dont think we are losing a lot of performance in our current RPC system, and unless we are very careful we'll lose a lot in a thrift/avro transition. 2) Does asynchbase have any limitations (functionally or otherwise) compared to the native HBase client? 3) If Avro were the native protocol that HBase client talks through, that is one thing (and that's what I'm hoping we end up with) - however, isn't spinning up Avro gateways on each node (like what is currently available) require folks to scale up two layers (Avro gateway layer + HBase layer)? i.e. now we need to be worried about whether the Avro gateways can handle the traffic, etc. The hbase client is fairly 'thick', it must intelligently route between different regionservers, handle errors, relook up meta data, use zookeeper to bootstrap, etc. This is part of making a scalable client though. Having the RPC serialization in thrift or avro would make it easier to write those kinds of clients for non-Java languages. The gateway approach will probably be necessary for a while alas. At SU I am not sure that the gateway is adding a lot of of latency to small queries, since average/median latency is around 1ms. One strategy is to deploy gateways on all client nodes and use localhost as much as possible. In our application, we have Java clients talking directly to HBase. We debated using Thrift or Stargate layer (even though we have a Java client) just because of this easier upgrade-ability. But we finally decided to use the native HBase client because we didn't want to have to scale two layers rather than just HBase ... and Avro was on the road map. An HBase client talking native Avro directly to RS (i.e. without intermediate gateways would have worked - but that was a ways ... So again avro isn't going to be a magic bullet. Neither thrift. You can't just have a dumb client with little logic open up a socket and start talking to HBase. That isn't congruent with a scalable system unfortunately. You need your clients to be smart and do a bunch of work that otherwise would have to be done by a centralized type node or another middleman. Only if the client is smart can we send the minimal RPCs to the shortest network length. Other systems have servers bounce the requests to other servers but that can promote extra traffic at the cost of a simpler client. I think now that we are in the .90s, an option to do no-outage upgrades (from client's perspective) would be really beneficial. We'd all like this, it's formost in pretty much every committer's mind all the time. It's just a HUGE body of work. 
One that is fraught with perils and danger zones. For example it seemed avro would reign supreme, but the RPC landscape is shifting back towards thrift. Thanks, --Suraj On Sat, Mar 5, 2011 at 2:21 PM, Todd Lipcon t...@cloudera.com wrote: On Sat, Mar 5, 2011 at 2:10 PM, Ryan Rawson ryano...@gmail.com wrote: As for the past RPC, it's all well to complain that we didn't spend more time making it more compatible, but in a world where evolving features in an early platform is more important than keeping backwards compatibility (how many hbase 18 jars want to talk to a modern cluster? Like none.), I am confident we did the right choice. Moving forward I think the goal should NOT be to maintain the current system compatible at all costs, but to look at things like avro and thrift, make a calculated engineering tradeoff and get ourselves on to a extendable platform, even if there is a flag day. We aren't out of the woods yet, but eventually we will be. Hear hear! +1! -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: HTable thread safety in 0.20.6
So when you look at the interface that the client uses to talk to the regionservers it has calls like this: public R MultiResponse multi(MultiActionR multi) throws IOException; public long openScanner(final byte [] regionName, final Scan scan) throws IOException; etc Note that this is the interface you get _AFTER_ you are talking to a particular regionserver. If you send a regionName that is not being served you get a 'region not served' exception. In other words a blind client wouldnt know which servers to talk to. You have to first: - bootstrap the ROOT table region server location from ZK (there is only 1, always will only be one) - get the META region(s) location(s). - query the META region(s) to find out which server contains the region for the specific request. - talk to the individual regionserver. If you get exceptions, do the lookup in META again and try again. Putting these smarts in the client makes it scalable, at the cost of a thicker client. To make an API that has a '1 shot' type of interface, we'd end up creating something that looks like the thrift gateway. But now you have bottlenecks in the thrift gateway servers. There really is no free lunch. Sorry. On Sun, Mar 6, 2011 at 10:09 PM, Suraj Varma svarma...@gmail.com wrote: Sorry - missed the user group in my previous mail. --Suraj On Sun, Mar 6, 2011 at 10:07 PM, Suraj Varma svarma...@gmail.com wrote: Very interesting. I was just about to send an additional mail asking why HBase client also needs the hadoop jar (thereby tying the client onto the hadoop version as well) - but, I guess at the least the hadoop rpc is the dependency. So, now that makes sense. One strategy is to deploy gateways on all client nodes and use localhost as much as possible. This certainly scales up the gateway nodes - but complicates the upgrades. For instance, we will have a 100+ clients talking to the cluster and upgrading from 0.20.x to 0.90.x would be that much harder with version specific gateway nodes all over the place. So again avro isn't going to be a magic bullet. Neither thrift. This is interesting (disappointing?) ... isn't the plan to substitute hadoop rpc with avro (or thrift) while still keeping all the smart logic in the client in place? I thought avro with its cross-version capabilities would have solved the versioning issues and allowed the backward/forward compatibility. I mean, a thick client talking avro was what I had imagined the solution to be. Glad to know that client compatibility is very much in the commiter's / community's mind. Based on discussion below, is async-hbase a thick / smart client or something less than that? 2) Does asynchbase have any limitations (functionally or otherwise) compared to the native HBase client? Thanks again. --Suraj On Sun, Mar 6, 2011 at 9:40 PM, Ryan Rawson ryano...@gmail.com wrote: On Sun, Mar 6, 2011 at 9:25 PM, Suraj Varma svarma...@gmail.com wrote: Thanks all for your insights into this. I would agree that providing mechanisms to support no-outage upgrades going forward would really be widely beneficial. I was looking forward to Avro for this reason. Some follow up questions: 1) If asynchbase client to do this (i.e. talk wire protocol and adjust based on server versions), why not the native hbase client? Is there something in the native client design that would make this too hard / not worth emulating? Typically this has not been an issue. The particular design of the way that hadoop rpc (the rpc we use) makes it difficult to offer multiple protocol/version support. 
To fix it would more or less require rewriting the entire protocol stack. I'm glad we spent serious time making the base storage layer and query paths fast, since without those fundamentals a better RPC would be moot. From my measurements I dont think we are losing a lot of performance in our current RPC system, and unless we are very careful we'll lose a lot in a thrift/avro transition. 2) Does asynchbase have any limitations (functionally or otherwise) compared to the native HBase client? 3) If Avro were the native protocol that HBase client talks through, that is one thing (and that's what I'm hoping we end up with) - however, isn't spinning up Avro gateways on each node (like what is currently available) require folks to scale up two layers (Avro gateway layer + HBase layer)? i.e. now we need to be worried about whether the Avro gateways can handle the traffic, etc. The hbase client is fairly 'thick', it must intelligently route between different regionservers, handle errors, relook up meta data, use zookeeper to bootstrap, etc. This is part of making a scalable client though. Having the RPC serialization in thrift or avro would make it easier to write those kinds of clients for non-Java languages. The gateway approach will probably be necessary for a while alas. At SU I am not sure that the gateway
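The bootstrap sequence listed above maps fairly directly onto client pseudocode. The following is a sketch of the control flow only, with invented stub names; the real HConnectionManager code handles location caching, retries, and backoff far more carefully:

  // Sketch of the "thick client" lookup path described above; all names are invented stubs.
  class RegionLookupSketch {
    interface Locator {                      // stands in for the ZK + ROOT + META lookups
      String rootServerFromZk();
      String metaRegionServer(String rootServer, byte[] table, byte[] row);
      String userRegionServer(String metaServer, byte[] table, byte[] row);
    }
    interface Rpc { byte[] send(String server, byte[] request) throws Exception; }

    byte[] call(Locator locator, Rpc rpc, byte[] table, byte[] row, byte[] request) throws Exception {
      Exception last = null;
      for (int attempt = 0; attempt < 3; attempt++) {
        try {
          String root = locator.rootServerFromZk();                     // 1. bootstrap ROOT location from ZK
          String meta = locator.metaRegionServer(root, table, row);     // 2. find the META region
          String server = locator.userRegionServer(meta, table, row);   // 3. find the user region
          return rpc.send(server, request);                             // 4. talk to that regionserver
        } catch (Exception regionMoved) {                               // e.g. a 'region not served' error
          last = regionMoved;                                           // drop cached location, look up again
        }
      }
      throw last;
    }
  }

All of this routing logic is what makes the client "thick"; a gateway (Thrift, REST) just moves the same work behind another hop.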
Re: HTable thread safety in 0.20.6
I dont think protobuf is winning the war out there, it's either thrift or avro at this point. Protobuf just isn't an bazzar open-source type project, and it's non-Java/C++/python support isn't 1st class, plus no RPC. As for the past RPC, it's all well to complain that we didn't spend more time making it more compatible, but in a world where evolving features in an early platform is more important than keeping backwards compatibility (how many hbase 18 jars want to talk to a modern cluster? Like none.), I am confident we did the right choice. Moving forward I think the goal should NOT be to maintain the current system compatible at all costs, but to look at things like avro and thrift, make a calculated engineering tradeoff and get ourselves on to a extendable platform, even if there is a flag day. We aren't out of the woods yet, but eventually we will be. -ryan On Fri, Mar 4, 2011 at 8:50 PM, M. C. Srivas mcsri...@gmail.com wrote: Google's protobufs make this problem more palatable with optional params. Of course, you will have to break versions once more On Fri, Mar 4, 2011 at 10:04 AM, Stack st...@duboce.net wrote: On Fri, Mar 4, 2011 at 12:24 AM, tsuna tsuna...@gmail.com wrote: In practice, bear in mind that HBase has a bad track record of breaking backward compatibility between virtually every release (even minor ones), although they often bump the protocol version number even though there are no client-visible API changes (e.g. because only some internal APIs used by the master or other administrative APIs irrelevant for the client changed). At Benoit's suggestion, we've changed the way we version Interfaces; rather than a global version for all, we now version each Interface separately. More to come... St.Ack
Re: Build failed in Hudson: HBase-TRUNK #1763
I'll fix this in a few hours. Not quite awake :)
Re: Coprocessor tax?
I don't think we need a lock even for updating, check it copy on write array list. On Mar 1, 2011 12:45 PM, Gary Helmling ghelml...@gmail.com wrote: Yeah, I was just looking at the write lock references as well. I'm not sure RegionCoprocessorHost.preClose() would really need the write lock? As you say, there is still a race in HRegion.doClose() between preClose() completing and HRegion.lock.writeLock() being taken out, so other methods could still be called after. RegionCoprocessorHost.postClose() occurs under the HRegion write lock, so any read lock operations would already have to have completed by this point. So here we wouldn't really need the coprocessor write lock either? It seems like we could actually drop the coprocessor lock, since coprocessors are currently loaded prior to region open completing. Online coprocessor loading (not currently provided) could be handled in the future by a lock just for loading, and creating a new coprocessor collection and assigning when done. On Tue, Mar 1, 2011 at 12:08 PM, Ryan Rawson ryano...@gmail.com wrote: My own profiling shows that a read write lock can be up to 3-6% of the CPU budget in our put/get query path. Adding another one if not necessary would probably not be good. In fact in the region coprocessor the only thing the write lock is used for is the preClose and postClose, but looking in the implementation of those methods I don't really get why this is necessary. The write lock ensures single thread access, but there is nothing that prevents other threads from calling other methods AFTER the postClose? -ryan On Tue, Mar 1, 2011 at 12:02 PM, Gary Helmling ghelml...@gmail.com wrote: All the CoprocessorHost invocations should be wrapped in if (cpHost != null). We could just added an extra check for whether any coprocessors are loaded -- if (cpHost != null cpHost.isActive()), something like that? Or the CoprocessorHost methods could do this checking internally. Either way should be relatively easy to bypass the lock acquisition. Is there much overhead to acquiring a read lock if the write lock is never taken though? 
(just wondering) On Tue, Mar 1, 2011 at 11:51 AM, Stack st...@duboce.net wrote: So, I'm debugging something else but thread dumping I see a bunch of this: IPC Server handler 6 on 61020 daemon prio=10 tid=0x422d2800 nid=0x7714 runnable [0x7f1c5acea000] java.lang.Thread.State: RUNNABLE at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:434) at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:404) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1260) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.prePut(RegionCoprocessorHost.java:532) at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1476) at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1454) at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2652) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:309) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1060) Do others? I don't have any CPs loaded. I'm wondering if we can do more to just avoid the CP codepath if no CPs loaded. St.Ack
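The CopyOnWriteArrayList suggestion sidesteps the read lock entirely: hook invocations iterate an immutable snapshot of the array with no locking, and the rare online load/unload pays the cost of copying it. A minimal sketch of that shape (not the actual CoprocessorHost code; names are illustrative):

  import java.util.List;
  import java.util.concurrent.CopyOnWriteArrayList;

  class CoprocessorHostSketch<E> {
    private final List<E> coprocessors = new CopyOnWriteArrayList<E>();

    boolean isActive() { return !coprocessors.isEmpty(); }   // cheap check so hot paths can skip the walk

    void load(E cp)   { coprocessors.add(cp); }               // rare: pays an array copy
    void unload(E cp) { coprocessors.remove(cp); }

    interface Hook<T> { void call(T cp); }

    // Hot path (e.g. prePut): plain iteration over a snapshot, no ReentrantReadWriteLock acquisition.
    void invoke(Hook<E> hook) {
      for (E cp : coprocessors) {
        hook.call(cp);
      }
    }
  }

Combined with the "if (cpHost != null && cpHost.isActive())" guard discussed above, the put/get path with no coprocessors loaded would do essentially no extra work.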
Re: putting a border around 0.92 release
I'd generally vote for a time-based release. The big feature releases while are good for attracting new users with new features, present a problem in that it can really delay releases for a long time. More releases are better! If a feature takes more than 3 months then it's too big to implement in one go. On Mon, Feb 28, 2011 at 2:00 PM, Todd Lipcon t...@cloudera.com wrote: On Sat, Feb 26, 2011 at 2:24 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Woah those are huge tasks! Also to consider: - integration with hadoop 0.22, should we support it and should we also support 0.20 at the same time? 0.22 was branched but IIRC it still has quite a few blockers. - removing heartbeats, this is in the pipeline from Stack and IMO will have ripple effects on online schema editing. - HBASE-2856, pretty critical. - replication-related issues like multi-slave (which I'm working on), and ideally multi-master. I'd like to add better management tools too. And lastly we need to plan when we want to branch 0.92... should we target late May in order to be ready for the Hadoop Summit in June? For once it would be nice to offer more than an alpha release :) In my view, we can do one or the other: either it's a feature-based release, in which case we release it when it's done, or it's a time-based release, in which case we release at some decided-upon time with whatever's done. I personally prefer time-based releases, though we need to make sure if we decide to do this that any large destabilizing (or half complete) features are guarded either by config flags or are developed in a branch. Thus trunk stays relatively releasable at all times and we can be pretty confident we'll hit the decided-upon timeline. Looking back at the 0.90 release, we got caught in a bind because we were trying to do both feature-based (new master) and time-based (end of 2010). So, my vote is either: plan a: hybrid model - 0.91.X becomes a time-based release series where we drop trunk once every month or two, and 0.92.0 is gated on features or: plan b: strict time-based: we release 0.92.0 around summit, and lock down the branch at least a month or so ahead of time for bugfix only. Thoughts? -Todd On Sat, Feb 26, 2011 at 12:34 PM, Andrew Purtell apurt...@apache.org wrote: Stack and I were chatting on IRC about settling with should get into 0.92 before pulling the trigger on the release. Stack thinks we need online region schema editing. I agree because per-table coprocessor loading is configured via table attributes. We'd also need some kind of notification of schema update to trigger various actions in the regionserver. (For CPs, (re)load.) I'd also really like to see some form of secondary indexing. This is an important feature for HBase to have. All of our in house devs ask for this sooner or later in one form or another. Other projects have options in this arena, while HBase used to in core, but no longer. We have three people starting on this ASAP. I'd like to at least do co-design with the community. We should aim for 'simple and effective'. 
There are 14 blockers: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.92.0%22+AND+resolution+%3D+Unresolved+AND+priority+%3D+Blocker Additionally, 22 marked as critical: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.92.0%22+AND+resolution+%3D+Unresolved+AND+priority+%3D+Critical Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) -- Todd Lipcon Software Engineer, Cloudera
Let's Switch to TestNG
I filed HBASE-3555, and I listed the following reasons: - test groups allow us to separate slow/fast tests from each other - surefire support for running specific groups would allow 'check in tests' vs 'hudson/integration tests' (ie fast/slow) - it supports all the features of junit 4, plus it is VERY similar, making the transition easy. - they have assertEquals(byte[],byte[]) What do other people think?
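For anyone who hasn't used TestNG groups, the shape of it is roughly this; the group names and the byte[] assertion below are only meant to illustrate the points in the list, not a proposed HBase convention:

  import org.testng.Assert;
  import org.testng.annotations.Test;

  public class TestKeyValueRoundTrip {

    @Test(groups = "fast")                 // "check in" tests: run on every build
    public void bytesSurviveRoundTrip() {
      byte[] expected = new byte[] { 1, 2, 3 };
      byte[] actual = expected.clone();
      Assert.assertEquals(actual, expected);   // TestNG compares byte[] element by element
    }

    @Test(groups = "slow")                 // hudson/integration tests: run less often
    public void bigScanBehavesUnderLoad() throws Exception {
      // ... long-running cluster test would go here ...
    }
  }

Surefire (or a TestNG suite XML) can then be pointed at the fast group for the quick pre-commit run and at everything for the nightly build.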
Re: Hbase packaging
Can there be a way to turn it off for those of us who build and use the .tar.gz but dont want the time sink in generating deb/rpms? On Thu, Feb 17, 2011 at 1:25 PM, Eric Yang ey...@yahoo-inc.com wrote: Thanks Ted. I will include this build phase patch with the rpm/deb packaging patch. :) Regards, Eric On 2/17/11 12:58 PM, Ted Dunning tdunn...@maprtech.com wrote: Attaching the packaging to the normal life cycle step is a great idea. Having the packaging to RPM and deb packaging all in one step is very nice. On Thu, Feb 17, 2011 at 12:40 PM, Eric Yang ey...@yahoo-inc.com wrote: Sorry the attachment didn't make it through the mailing list. The patch looks like this: Index: pom.xml === --- pom.xml (revision 1071461) +++ pom.xml (working copy) @@ -321,6 +321,15 @@ descriptorsrc/assembly/all.xml/descriptor /descriptors /configuration + executions + execution + idtarball/id + phasepackage/phase + goals + goalsingle/goal + /goals + /execution + /executions /plugin !-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without running tests (this is needed for upstream projects whose tests need this jar simply for compilation)-- @@ -329,6 +338,7 @@ artifactIdmaven-jar-plugin/artifactId executions execution + phaseprepare-package/phase goals goaltest-jar/goal /goals @@ -355,7 +365,7 @@ executions execution idattach-sources/id - phasepackage/phase + phaseprepare-package/phase goals goaljar-no-fork/goal /goals On 2/17/11 12:30 PM, Eric Yang ey...@yahoo-inc.com wrote: Hi Stack, Thanks for the pointer. This is very useful. What do you think about making jar file creation to prepare-package phase, and having assembly:single be part of package phase? This would make room for running both rpm plugin and jdeb plugin in the packaging phase. Enclosed patch can express my meaning better. User can run: mvn -DskipTests package The result would be jars, tarball, rpm, debian packages in target directory. Another approach is to use -P rpm,deb to control package type generation. The current assumption is to leave hbase bundled zookeeper outside of the rpm/deb package to improve project integrations. There will be a submodule called hbase-conf-pseudo package, which deploys a single node hbase cluster on top of Hadoop+Zookeeper rpms. Would this work for you? Regards, Eric On 2/17/11 11:41 AM, Stack st...@duboce.net wrote: On Thu, Feb 17, 2011 at 11:34 AM, Eric Yang ey...@yahoo-inc.com wrote: Hi, I am trying to understand the release package process for HBase. In the current maven pom.xml, I don't see tarball generation as part of the packaging phase. The assembly plugin does it for us. Run: $ mvn assembly:assembly or $ mvn -DskipTests assembly:assembly ... to skip the running of the test suite (1 hour). See http://wiki.apache.org/hadoop/Hbase/MavenPrimer. What about having a inline process which creates both release tarball, rpm, and debian packages? This is to collect feedback for HADOOP-6255 to ensure HBase integrates well with rest of the stack. Thanks This sounds great Eric. Let us know how we can help. It looks like there is an rpm plugin for maven but I've not played with it in the past. If you have input on this, and you'd like me to mess with it, I'd be happy to help out. Good stuff, St.Ack
Re: Hbase packaging
Sounds good, thanks! -ryan On Thu, Feb 17, 2011 at 1:40 PM, Eric Yang ey...@yahoo-inc.com wrote: Hi Ryan, This would fall in the second proposal, use profile as toggle to switch between packaging mechanism. I.e. mvn –DskipTests package builds tarball. mvn –DskipTests package –p rpm,deb builds tarball, rpm and deb. Does this work for you? Regards, Eric On 2/17/11 1:27 PM, Ryan Rawson ryano...@gmail.com wrote: Can there be a way to turn it off for those of us who build and use the .tar.gz but dont want the time sink in generating deb/rpms? On Thu, Feb 17, 2011 at 1:25 PM, Eric Yang ey...@yahoo-inc.com wrote: Thanks Ted. I will include this build phase patch with the rpm/deb packaging patch. :) Regards, Eric On 2/17/11 12:58 PM, Ted Dunning tdunn...@maprtech.com wrote: Attaching the packaging to the normal life cycle step is a great idea. Having the packaging to RPM and deb packaging all in one step is very nice. On Thu, Feb 17, 2011 at 12:40 PM, Eric Yang ey...@yahoo-inc.com wrote: Sorry the attachment didn't make it through the mailing list. The patch looks like this: Index: pom.xml === --- pom.xml (revision 1071461) +++ pom.xml (working copy) @@ -321,6 +321,15 @@ descriptorsrc/assembly/all.xml/descriptor /descriptors /configuration + executions + execution + idtarball/id + phasepackage/phase + goals + goalsingle/goal + /goals + /execution + /executions /plugin !-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without running tests (this is needed for upstream projects whose tests need this jar simply for compilation)-- @@ -329,6 +338,7 @@ artifactIdmaven-jar-plugin/artifactId executions execution + phaseprepare-package/phase goals goaltest-jar/goal /goals @@ -355,7 +365,7 @@ executions execution idattach-sources/id - phasepackage/phase + phaseprepare-package/phase goals goaljar-no-fork/goal /goals On 2/17/11 12:30 PM, Eric Yang ey...@yahoo-inc.com wrote: Hi Stack, Thanks for the pointer. This is very useful. What do you think about making jar file creation to prepare-package phase, and having assembly:single be part of package phase? This would make room for running both rpm plugin and jdeb plugin in the packaging phase. Enclosed patch can express my meaning better. User can run: mvn -DskipTests package The result would be jars, tarball, rpm, debian packages in target directory. Another approach is to use -P rpm,deb to control package type generation. The current assumption is to leave hbase bundled zookeeper outside of the rpm/deb package to improve project integrations. There will be a submodule called hbase-conf-pseudo package, which deploys a single node hbase cluster on top of Hadoop+Zookeeper rpms. Would this work for you? Regards, Eric On 2/17/11 11:41 AM, Stack st...@duboce.net wrote: On Thu, Feb 17, 2011 at 11:34 AM, Eric Yang ey...@yahoo-inc.com wrote: Hi, I am trying to understand the release package process for HBase. In the current maven pom.xml, I don't see tarball generation as part of the packaging phase. The assembly plugin does it for us. Run: $ mvn assembly:assembly or $ mvn -DskipTests assembly:assembly ... to skip the running of the test suite (1 hour). See http://wiki.apache.org/hadoop/Hbase/MavenPrimer. What about having a inline process which creates both release tarball, rpm, and debian packages? This is to collect feedback for HADOOP-6255 to ensure HBase integrates well with rest of the stack. Thanks This sounds great Eric. Let us know how we can help. It looks like there is an rpm plugin for maven but I've not played with it in the past. 
If you have input on this, and you'd like me to mess with it, I'd be happy to help out. Good stuff, St.Ack
Re: API changes between 0.20.6 and 0.90.1
Well done Andrew. People who want to know the API differences should probably mostly only read: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html And specifically the HTable, Put, Get, Delete, Scan classes. On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell apurt...@apache.org wrote: I ran jdiff by hand. See: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) --- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote: From: Lars George lars.geo...@gmail.com Subject: Re: API changes between 0.20.6 and 0.90.1 To: dev@hbase.apache.org Date: Wednesday, February 16, 2011, 1:22 AM +1, I like that idea. On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com wrote: Hi Ted, I'd recommend setting up jdiff to answer this question. Would be a good contribution to our source base to be able to run this automatically and generate a report as part of our build. We do this in Hadoop and it's very useful. -Todd On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com wrote: Can someone tell me which classes from the list below changed API between 0.20.6 and 0.90.1 ? http://pastebin.com/TkZfPt52 Thanks -- Todd Lipcon Software Engineer, Cloudera
Re: API changes between 0.20.6 and 0.90.1
Sounds like Ted volunteered to do it! Good job! -ryan On Wed, Feb 16, 2011 at 12:15 PM, Ted Yu yuzhih...@gmail.com wrote: Definitely. On Wed, Feb 16, 2011 at 11:57 AM, Todd Lipcon t...@cloudera.com wrote: In Hadoop land, Tom White did some awesome work to add special annotations that we stick on all the public classes that classify the interfaces as: Stability: - Unstable: may change and likely to change between point releases, - Evolving: possibly change between point releases but unlikely, could well change between bigger releases - Stable: hasn't changed in a long time, unlikely to change Audience: Private, Limited, Public - Private: not meant for users, even if it's Stable we might change it and break you without a deprecation path - Limited: meant only for a certain set of specified projects (eg we might say this API is only for use by Hive, and we'll change it so long as the hive people are OK with it) - Public: won't change without deprecation path for one major release He also built some cool tools to do jdiff and javadoc with these annotations taken into account (eg javadoc won't show private APIs) Are people interested in bringing this system over to HBase? -Todd On Wed, Feb 16, 2011 at 11:51 AM, Ryan Rawson ryano...@gmail.com wrote: Well done Andrew. People who want to know the API differences should probably mostly only read: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html And specifically the HTable, Put, Get, Delete, Scan classes. On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell apurt...@apache.org wrote: I ran jdiff by hand. See: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) --- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote: From: Lars George lars.geo...@gmail.com Subject: Re: API changes between 0.20.6 and 0.90.1 To: dev@hbase.apache.org Date: Wednesday, February 16, 2011, 1:22 AM +1, I like that idea. On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com wrote: Hi Ted, I'd recommend setting up jdiff to answer this question. Would be a good contribution to our source base to be able to run this automatically and generate a report as part of our build. We do this in Hadoop and it's very useful. -Todd On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com wrote: Can someone tell me which classes from the list below changed API between 0.20.6 and 0.90.1 ? http://pastebin.com/TkZfPt52 Thanks -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera
Re: API changes between 0.20.6 and 0.90.1
Step 1 is to add the jdiff framework in, that is a non-trivial but straightforward change. Step 2 is to annotate all the APIs, something that should be done by various domain experts over time. Even if this is not complete there is value with #1. Step 3: ? Step 4: profit! On Wed, Feb 16, 2011 at 3:17 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at what Stack is doing in https://issues.apache.org/jira/browse/HBASE-1502, I think we can use the following appoach: 1. create the annotations below 2. Committers who actively refactor code place proper annotation on the classes they touch 3. after some time, we should be able to mark the classes/methods untouched by #2 stable. My two cents. On Wed, Feb 16, 2011 at 2:12 PM, Ted Yu yuzhih...@gmail.com wrote: The following annotation can only be attached by HBase committer(s): Stability: - Unstable: may change and likely to change between point releases, - Evolving: possibly change between point releases but unlikely, could well change between bigger releases Contributors would have a hard time keeping up with current development. On Wed, Feb 16, 2011 at 12:46 PM, Ted Yu yuzhih...@gmail.com wrote: I am not very familiar with (internal) HBase APIs which grow quite large. I have a full-time job. And this task is quite big. Community effort should be the best approach. On Wed, Feb 16, 2011 at 12:20 PM, Todd Lipcon t...@cloudera.com wrote: On Wed, Feb 16, 2011 at 12:16 PM, Ryan Rawson ryano...@gmail.comwrote: Sounds like Ted volunteered to do it! Woohoo, thanks Ted! -Todd On Wed, Feb 16, 2011 at 12:15 PM, Ted Yu yuzhih...@gmail.com wrote: Definitely. On Wed, Feb 16, 2011 at 11:57 AM, Todd Lipcon t...@cloudera.com wrote: In Hadoop land, Tom White did some awesome work to add special annotations that we stick on all the public classes that classify the interfaces as: Stability: - Unstable: may change and likely to change between point releases, - Evolving: possibly change between point releases but unlikely, could well change between bigger releases - Stable: hasn't changed in a long time, unlikely to change Audience: Private, Limited, Public - Private: not meant for users, even if it's Stable we might change it and break you without a deprecation path - Limited: meant only for a certain set of specified projects (eg we might say this API is only for use by Hive, and we'll change it so long as the hive people are OK with it) - Public: won't change without deprecation path for one major release He also built some cool tools to do jdiff and javadoc with these annotations taken into account (eg javadoc won't show private APIs) Are people interested in bringing this system over to HBase? -Todd On Wed, Feb 16, 2011 at 11:51 AM, Ryan Rawson ryano...@gmail.com wrote: Well done Andrew. People who want to know the API differences should probably mostly only read: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes/pkg_org.apache.hadoop.hbase.client.html And specifically the HTable, Put, Get, Delete, Scan classes. On Wed, Feb 16, 2011 at 7:19 AM, Andrew Purtell apurt...@apache.org wrote: I ran jdiff by hand. See: https://tm-files.s3.amazonaws.com/hbase/jdiff-hbase-0.90.1/changes.html Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) --- On Wed, 2/16/11, Lars George lars.geo...@gmail.com wrote: From: Lars George lars.geo...@gmail.com Subject: Re: API changes between 0.20.6 and 0.90.1 To: dev@hbase.apache.org Date: Wednesday, February 16, 2011, 1:22 AM +1, I like that idea. 
On Wed, Feb 16, 2011 at 2:43 AM, Todd Lipcon t...@cloudera.com wrote: Hi Ted, I'd recommend setting up jdiff to answer this question. Would be a good contribution to our source base to be able to run this automatically and generate a report as part of our build. We do this in Hadoop and it's very useful. -Todd On Tue, Feb 15, 2011 at 5:14 PM, Ted Yu yuzhih...@gmail.com wrote: Can someone tell me which classes from the list below changed API between 0.20.6 and 0.90.1 ? http://pastebin.com/TkZfPt52 Thanks -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera
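To make the proposal concrete, here is a rough sketch of how the Audience/Stability annotations Todd describes could look on HBase classes. It assumes the Hadoop-style org.apache.hadoop.classification annotations Tom White added (the package name is an assumption here, and whether HBase would reuse or copy those classes is part of the open question); the annotated classes are just examples:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Client-facing API: stays compatible, deprecation path required before changes.
@InterfaceAudience.Public
@InterfaceStability.Stable
public class ExampleClientApi {
}

// Internal class: may change at any time, even between point releases.
@InterfaceAudience.Private
@InterfaceStability.Evolving
class ExampleInternalHelper {
}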
Re: HRegionInfo and HRegion
HRegion is the internal implementation of a region inside the regionserver; you don't get it from a client. That data is sent to the master and published to ganglia and the metrics system. On Fri, Feb 11, 2011 at 1:23 PM, Ted Yu yuzhih...@gmail.com wrote: HTable can return region information: Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo(); However, the request count (HBASE-3507) is contained in HRegion. How do I access HRegion for the regions in a table? Thanks
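For the client-side half of the question, a small sketch against the 0.90 client API (the table name is hypothetical); this gives region metadata and locations only, not the per-region request counts HBASE-3507 adds:

import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HTable;

public class ListRegions {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    // Region boundaries and hosting servers, as published to the client.
    Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
    for (Map.Entry<HRegionInfo, HServerAddress> entry : regions.entrySet()) {
      System.out.println(entry.getKey().getRegionNameAsString() + " on " + entry.getValue());
    }
  }
}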
Re: [VOTE] HBase 0.90.1 rc0 is available for download
I am generally +1, but we'll need another RC to address HBASE-3524. Here is some of my other report of running this: Been running a variant of this found here: https://github.com/stumbleupon/hbase/tree/su_prod_90 Running in dev here at SU now. Also been testing that against our Hadoop CDH3b2 patched in with HDFS-347. In uncontended YCSB runs this did improve much 'get' numbers, but in a 15 thread contended test the average get time goes from 12.1 ms - 6.9ms. We plan to test this more and roll in to our production environment. With 0.90.1 + a number of our patches, Hadoopw/347 I loaded 30gb in using YCSB. Still working on getting VerifyingWorkload to run and verify this data. But no exceptions. -ryan On Fri, Feb 11, 2011 at 7:10 PM, Andrew Purtell apurt...@apache.org wrote: Seems reasonable to stay -1 given HBASE-3524. This weekend I'm rolling RPMs of 0.90.1rc0 + ... a few patches (including 3524) ... for deployment to preproduction staging. Depending how that goes we may have jiras and patches for you next week. Best regards, - Andy From: Stack st...@duboce.net Subject: Re: [VOTE] HBase 0.90.1 rc0 is available for download To: apurt...@apache.org Cc: dev@hbase.apache.org Date: Friday, February 11, 2011, 9:35 AM Yes. We need to fix the assembly. Its going to trip folks up. I don't think it a sinker on the RC though, especially as we shipped 0.90.0 w/ this same issue. What you think boss? St.Ack On Fri, Feb 11, 2011 at 9:30 AM, Andrew Purtell apurt...@apache.org wrote: No an earlier version from before that I failed to delete while moving jars around. So this is a user problem, but I forsee it coming up again and again.
Re: Build patched cdh3b2
I put up the patch I used, I then changed the version to 0.20.2-322 and just did ant jar. I crippled the forrest crap in build.xml... I didnt check the filesize of the resulting jar though. -ryan On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote: Ryan: Can you share how you built patched cdh3b2 ? When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar which was much larger than the official hadoop-core-0.20.2+320.jar hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar in place of official jar. Thanks On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote: I am generally +1, but we'll need another RC to address HBASE-3524. Here is some of my other report of running this: Been running a variant of this found here: https://github.com/stumbleupon/hbase/tree/su_prod_90 Running in dev here at SU now. Also been testing that against our Hadoop CDH3b2 patched in with HDFS-347. In uncontended YCSB runs this did improve much 'get' numbers, but in a 15 thread contended test the average get time goes from 12.1 ms - 6.9ms. We plan to test this more and roll in to our production environment. With 0.90.1 + a number of our patches, Hadoopw/347 I loaded 30gb in using YCSB. Still working on getting VerifyingWorkload to run and verify this data. But no exceptions. -ryan
Re: Build patched cdh3b2
my jar looks like: -rw-r--r-- 1 hadoop hadoop 2861459 2011-02-09 16:34 hadoop-core-0.20.2+322.jar -ryan On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com wrote: I put up the patch I used, I then changed the version to 0.20.2-322 and just did ant jar. I crippled the forrest crap in build.xml... I didnt check the filesize of the resulting jar though. -ryan On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote: Ryan: Can you share how you built patched cdh3b2 ? When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar which was much larger than the official hadoop-core-0.20.2+320.jar hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar in place of official jar. Thanks On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote: I am generally +1, but we'll need another RC to address HBASE-3524. Here is some of my other report of running this: Been running a variant of this found here: https://github.com/stumbleupon/hbase/tree/su_prod_90 Running in dev here at SU now. Also been testing that against our Hadoop CDH3b2 patched in with HDFS-347. In uncontended YCSB runs this did improve much 'get' numbers, but in a 15 thread contended test the average get time goes from 12.1 ms - 6.9ms. We plan to test this more and roll in to our production environment. With 0.90.1 + a number of our patches, Hadoopw/347 I loaded 30gb in using YCSB. Still working on getting VerifyingWorkload to run and verify this data. But no exceptions. -ryan
Re: Build patched cdh3b2
i call it 0.20.2-322 and its at http://people.apache.org/~rawson/repo/ (m2 repo) for just the jar you can find it there. On Fri, Feb 11, 2011 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote: Is it possible for you to share the hadoop-core-0.20.2+320.jar that you built ? Thanks On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com wrote: I put up the patch I used, I then changed the version to 0.20.2-322 and just did ant jar. I crippled the forrest crap in build.xml... I didnt check the filesize of the resulting jar though. -ryan On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote: Ryan: Can you share how you built patched cdh3b2 ? When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar which was much larger than the official hadoop-core-0.20.2+320.jar hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar in place of official jar. Thanks On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote: I am generally +1, but we'll need another RC to address HBASE-3524. Here is some of my other report of running this: Been running a variant of this found here: https://github.com/stumbleupon/hbase/tree/su_prod_90 Running in dev here at SU now. Also been testing that against our Hadoop CDH3b2 patched in with HDFS-347. In uncontended YCSB runs this did improve much 'get' numbers, but in a 15 thread contended test the average get time goes from 12.1 ms - 6.9ms. We plan to test this more and roll in to our production environment. With 0.90.1 + a number of our patches, Hadoopw/347 I loaded 30gb in using YCSB. Still working on getting VerifyingWorkload to run and verify this data. But no exceptions. -ryan
Re: Build patched cdh3b2
Oh right, the groupId is com.cloudera not org.apache, so the other dir... On Fri, Feb 11, 2011 at 10:41 PM, Ted Yu yuzhih...@gmail.com wrote: I don't see it under http://people.apache.org/~rawson/repo/org/apache/hadoop/hadoop-core/ Should I look somewhere else ? On Fri, Feb 11, 2011 at 10:37 PM, Ryan Rawson ryano...@gmail.com wrote: i call it 0.20.2-322 and its at http://people.apache.org/~rawson/repo/ (m2 repo) for just the jar you can find it there. On Fri, Feb 11, 2011 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote: Is it possible for you to share the hadoop-core-0.20.2+320.jar that you built ? Thanks On Fri, Feb 11, 2011 at 10:29 PM, Ryan Rawson ryano...@gmail.com wrote: I put up the patch I used, I then changed the version to 0.20.2-322 and just did ant jar. I crippled the forrest crap in build.xml... I didnt check the filesize of the resulting jar though. -ryan On Fri, Feb 11, 2011 at 10:08 PM, Ted Yu yuzhih...@gmail.com wrote: Ryan: Can you share how you built patched cdh3b2 ? When I used 'ant jar', I got build/hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar which was much larger than the official hadoop-core-0.20.2+320.jar hadoop had trouble starting if I used hadoop-core-0.20.2-CDH3b2-SNAPSHOT.jar in place of official jar. Thanks On Fri, Feb 11, 2011 at 9:59 PM, Ryan Rawson ryano...@gmail.com wrote: I am generally +1, but we'll need another RC to address HBASE-3524. Here is some of my other report of running this: Been running a variant of this found here: https://github.com/stumbleupon/hbase/tree/su_prod_90 Running in dev here at SU now. Also been testing that against our Hadoop CDH3b2 patched in with HDFS-347. In uncontended YCSB runs this did improve much 'get' numbers, but in a 15 thread contended test the average get time goes from 12.1 ms - 6.9ms. We plan to test this more and roll in to our production environment. With 0.90.1 + a number of our patches, Hadoopw/347 I loaded 30gb in using YCSB. Still working on getting VerifyingWorkload to run and verify this data. But no exceptions. -ryan
Re: initial experience with HBase 0.90.1 rc0
You don't have both the old and the new hbase jars in there do you? -ryan On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu yuzhih...@gmail.com wrote: .META. went offline during second flow attempt. The time out I mentioned happened for 1st and 3rd attempts. HBase was restarted before the 1st and 3rd attempts. Here is jstack: http://pastebin.com/EHMSvsRt On Thu, Feb 10, 2011 at 3:04 PM, Stack st...@duboce.net wrote: So, .META. is not online? What happens if you use shell at this time. Your attachement did not come across Ted. Mind postbin'ing it? St.Ack On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote: I replaced hbase jar with hbase-0.90.1.jar I also upgraded client side jar to hbase-0.90.1.jar Our map tasks were running faster than before for about 50 minutes. However, map tasks then timed out calling flushCommits(). This happened even after fresh restart of hbase. I don't see any exception in region server logs. In master log, I found: 2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595 2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=null; org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1 2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x12e10d0e31e Creating (or updating) unassigned node for 1028785192 with OFFLINE state I am attaching region server (which didn't respond to stop-hbase.sh) jstack. FYI On Thu, Feb 10, 2011 at 10:10 AM, Stack st...@duboce.net wrote: Thats probably enough Ted. The 0.90.1 hbase-default.xml has an extra config. to enable the experimental HBASE-3455 feature but you can copy that over if you want to try playing with it (it defaults off so you'd copy over the config. if you wanted to set it to true). St.Ack
Re: initial experience with HBase 0.90.1 rc0
As I suspected. It's a byproduct of our maven assembly process. The process could be fixed. I wouldn't mind. I don't support runtime checking of jars, there is such thing as too much tests, and this is an example of it. The check would then need a test, etc, etc. At SU we use new directories for each upgrade, copying the config over. With the lack of -default.xml this is easier than ever (just copy everything in conf/). With symlink switchover it makes roll forward/back as simple as doing a symlink switchover or back. I have to recommend this to everyone who doesnt have a management scheme. On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu yuzhih...@gmail.com wrote: hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath. I wonder 1. why hbase jar is placed in two directories - 0.20.6 didn't use such structure 2. what from lib/hbase-0.90.0.jar could be picked up and why there wasn't exception in server log I think a JIRA should be filed for item 2 above - bail out when the two hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different versions. Cheers On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson ryano...@gmail.com wrote: What do you get when you: ls lib/hbase* I'm going to guess there is hbase-0.90.0.jar there On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu yuzhih...@gmail.com wrote: hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist Would this be a problem ? On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson ryano...@gmail.com wrote: You don't have both the old and the new hbase jars in there do you? -ryan On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu yuzhih...@gmail.com wrote: .META. went offline during second flow attempt. The time out I mentioned happened for 1st and 3rd attempts. HBase was restarted before the 1st and 3rd attempts. Here is jstack: http://pastebin.com/EHMSvsRt On Thu, Feb 10, 2011 at 3:04 PM, Stack st...@duboce.net wrote: So, .META. is not online? What happens if you use shell at this time. Your attachement did not come across Ted. Mind postbin'ing it? St.Ack On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote: I replaced hbase jar with hbase-0.90.1.jar I also upgraded client side jar to hbase-0.90.1.jar Our map tasks were running faster than before for about 50 minutes. However, map tasks then timed out calling flushCommits(). This happened even after fresh restart of hbase. I don't see any exception in region server logs. In master log, I found: 2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595 2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=null; org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1 2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x12e10d0e31e Creating (or updating) unassigned node for 1028785192 with OFFLINE state I am attaching region server (which didn't respond to stop-hbase.sh) jstack. FYI On Thu, Feb 10, 2011 at 10:10 AM, Stack st...@duboce.net wrote: Thats probably enough Ted. The 0.90.1 hbase-default.xml has an extra config. to enable the experimental HBASE-3455 feature but you can copy that over if you want to try playing with it (it defaults off so you'd copy over the config. if you wanted to set it to true). St.Ack
Re: Data upgrade from 0.89x to 0.90.0.
We only major compact a region every 24 hours, therefore if it was JUST compacted within the last 24 hours we skip it. This is how it used to work, and how it should still work; not really looking at code right now, busy elsewhere :-) -ryan On Thu, Feb 10, 2011 at 11:17 PM, James Kennedy james.kenn...@troove.net wrote: Can you define 'come due'? The NPE occurs at the first isMajorCompaction() test in the main loop of MajorCompactionChecker. That cycle is executed every 2.78 hours. Yet I know that I've kept healthy QA test data up and running for much longer than that. James Kennedy Project Manager Troove Inc. On 2011-02-10, at 10:46 PM, Ryan Rawson wrote: I am speaking off the hip here, but the major compaction algorithm attempts to keep the number of major compactions to a minimum by checking the timestamp of the file. So it's possible that the other regions just 'didn't come due' yet. -ryan On Thu, Feb 10, 2011 at 10:42 PM, James Kennedy james.kenn...@troove.net wrote: I've tested HBase 0.90 + HBase-trx 0.90.0 and I've run it over old data from 0.89x using a variety of seeded unit test/QA data and cluster configurations. But when it came time to upgrade some production data I got snagged on HBASE-3524. The gist of it is in Ryan's last points: * compaction is optional, meaning if it fails no data is lost, so you should probably be fine. * Older versions of the code did not write out time tracker data and that is why your older files were giving you NPEs. Makes sense. But why did I not encounter this with my initial data upgrades on very similar data pkgs? So I applied Ryan's patch, which simply assigns a default value (Long.MIN_VALUE) when a StoreFile lacks a timeRangeTracker, and I fixed the data by forcing major compactions on the regions affected. Preliminary poking has not shown any instability in the data since. But I confess that I just don't have the time right now to really dig into the code and validate that there are no more gotchas or data corruption that could have resulted. I guess the questions that I have for the team are: * What state would 9 out of 50 tables be in to miss the new 0.90.0 timeRangeTracker injection before the first major compaction check? * Where else is the new TimeRangeTracker used? Could a StoreFile with a null timeRangeTracker have corrupted the data in other subtler ways? * What other upgrade-related data changes might not have completed elsewhere? Thanks, James Kennedy Project Manager Troove Inc.
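A sketch of the skip rule Ryan describes, purely for illustration (the real check lives in the region server's store code and looks at store file timestamps, not a field like this):

public class MajorCompactionDueCheck {
  // Roughly hbase.hregion.majorcompaction: one major compaction per region per day.
  static final long PERIOD_MS = 24L * 60 * 60 * 1000;

  // A region is only due if its last major compaction is older than the period,
  // so a region that was just compacted gets skipped on the next checker cycle.
  static boolean isDue(long lastMajorCompactionTs, long now) {
    return now - lastMajorCompactionTs > PERIOD_MS;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(isDue(now - 60L * 60 * 1000, now));       // compacted 1h ago -> false
    System.out.println(isDue(now - 25L * 60 * 60 * 1000, now));  // compacted 25h ago -> true
  }
}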
Re: 0.90.1?
Gary pointed this out on irc: http://jira.codehaus.org/browse/SUREFIRE-656 When we were talking about making the tests faster. Test-ng has this support ready to roll _now_. Basically we could have a 'smoke test' run for the release... Do the larger integration tests outside the mvn release line or something. Thoughts? -ryan On Tue, Feb 1, 2011 at 11:57 AM, Stack st...@duboce.net wrote: Oh, you have to use the release plugin if you want to get stuff into Apache repository -- else I'd sidestep it. St.Ack On Tue, Feb 1, 2011 at 11:56 AM, Stack st...@duboce.net wrote: On Tue, Feb 1, 2011 at 11:43 AM, Ryan Rawson ryano...@gmail.com wrote: A mvn release (to maven central) is different than our standard tarball (assembly:assembly), right? Yeah. There is a 'release' mvn plugin that wants to 'help' you in the way that mswindows is always trying to help you; you know, Would you like to do XYZ? when you do NOT want to do XYZ. It wants to update versions in poms, add tags to svn, put stuff into 'repositories', but you have to wrestle it to make it use right repository locations and version numbers. St.Ack
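For comparison with the TestNG route, a rough sketch of what a smoke-test split could look like with JUnit 4 categories (the marker interfaces and test names are made up; whether surefire can select on them is exactly what SUREFIRE-656 is about):

import static org.junit.Assert.assertTrue;
import org.junit.Test;
import org.junit.experimental.categories.Category;

// Marker interfaces used as categories.
interface SmokeTests {}
interface IntegrationTests {}

public class TestCategorySketch {

  // Fast check, suitable for a 'smoke test' run during a release build.
  @Category(SmokeTests.class)
  @Test
  public void rowKeysCompareLexicographically() {
    assertTrue("a".compareTo("m") < 0);
  }

  // Larger integration test, run outside the mvn release line.
  @Category(IntegrationTests.class)
  @Test
  public void clusterSurvivesRegionServerRestart() throws Exception {
    // long-running mini-cluster work would go here
  }
}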
Re: Scan operator ignores setMaxVersions (0.20.6)
How many versions is the column family configured for? maxVersions will never return more than that, so if it is 1 you won't get more than 1. -ryan On Thu, Jan 27, 2011 at 3:08 PM, Vladimir Rodionov vrodio...@carrieriq.com wrote: Although this version is not supported, maybe somebody can advise how to get ALL versions of rows from an HTable scan? This code: public Iterator<OtaUploadWritable> getUploadsByProfileId(String profile, long start, long end) throws IOException { Scan scan = new Scan(getStartKey(profile), getEndKey(profile)); scan.addColumn(COLFAMILY, COL_REF); scan.addColumn(COLFAMILY, COL_UPLOAD); scan.setTimeRange(start, end); scan.setMaxVersions(); ResultScanner rs = this.getScanner(scan); return new ResultScannerIterator(rs); } does not seem to work correctly (only the last versions of rows get into the result set) Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com
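For reference, a sketch against the 0.90-era client API showing both halves: the family has to be created to keep more than one version, and the scan has to ask for them (the table and column names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiVersionScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // The family must be allowed to keep multiple versions, otherwise
    // setMaxVersions() on the Scan can only ever return one.
    HColumnDescriptor family = new HColumnDescriptor("cf");
    family.setMaxVersions(10);
    HTableDescriptor desc = new HTableDescriptor("uploads");
    desc.addFamily(family);
    new HBaseAdmin(conf).createTable(desc);

    HTable table = new HTable(conf, "uploads");
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("ref"));
    scan.setMaxVersions(); // return every stored version, up to the family limit
    ResultScanner rs = table.getScanner(scan);
    for (Result r : rs) {
      System.out.println(r);
    }
    rs.close();
  }
}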
Re: Looks like duplicate in MemoryStoreFlusher flushSomeRegions()
the call to compactionRequested() only puts the region on a queue to be compacted, so if there is unintended duplication, it wont actually hold anything up. -ryan On Tue, Jan 25, 2011 at 6:05 PM, mac fang mac.had...@gmail.com wrote: Guys, since the flushCache will make the write/read suspend. I am NOT sure if it is necessary here. On Mon, Jan 24, 2011 at 1:48 PM, mac fang mac.had...@gmail.com wrote: Yes, I mean the server.compactSplitThread.compactionRequested(region, getName()); in flushRegion, it will do the server.compactSplitThread.compactionRequested(region, getName()); *seems we don't need to do it again in the following logic (can you guys see the lines in bold, **!flushRegion(biggestMemStoreRegion, true) and * * * * for (HRegion region : regionsToCompact) { server.compactSplitThread.compactionRequested(region, getName()); }* * * * * *regards* macf if (*!flushRegion(biggestMemStoreRegion, true)*) { LOG.warn(Flush failed); break; } regionsToCompact.add(biggestMemStoreRegion); } for (HRegion region : regionsToCompact) { *server.compactSplitThread.compactionRequested(region, getName());* } in flushRegion private boolean flushRegion(final HRegion region, final boolean emergencyFlush) { synchronized (this.regionsInQueue) { FlushQueueEntry fqe = this.regionsInQueue.remove(region); if (fqe != null emergencyFlush) { // Need to remove from region from delay queue. When NOT an // emergencyFlush, then item was removed via a flushQueue.poll. flushQueue.remove(fqe); } lock.lock(); } try { if (region.flushcache()) { *server.compactSplitThread.compactionRequested(region, getName());* } On Mon, Jan 24, 2011 at 6:40 AM, Ted Yu yuzhih...@gmail.com wrote: I think he was referring to this line: server.compactSplitThread.compactionRequested(region, getName()); On Sun, Jan 23, 2011 at 10:52 AM, Stack st...@duboce.net wrote: Hello Mac Fang: Which lines in the below? Your colorizing didn't come across in the mail. Thanks, St.Ack On Sun, Jan 23, 2011 at 6:23 AM, mac fang mac.had...@gmail.com wrote: Hi, guys, see the below codes in* MemStoreFlusher.java*, i am not sure if those lines in orange are the same and looks like they are trying to do the same logic. Are they redundant? regards macf if (!flushRegion(biggestMemStoreRegion, true)) { LOG.warn(Flush failed); break; } regionsToCompact.add(biggestMemStoreRegion); } for (HRegion region : regionsToCompact) { server.compactSplitThread.compactionRequested(region, getName()); } in flushRegion private boolean flushRegion(final HRegion region, final boolean emergencyFlush) { synchronized (this.regionsInQueue) { FlushQueueEntry fqe = this.regionsInQueue.remove(region); if (fqe != null emergencyFlush) { // Need to remove from region from delay queue. When NOT an // emergencyFlush, then item was removed via a flushQueue.poll. flushQueue.remove(fqe); } lock.lock(); } try { if (region.flushcache()) { server.compactSplitThread.compactionRequested(region, getName()); }
Re: parallelizing HBaseAdmin.flush()
Don't forget that this won't parallelize the flushes or compactions themselves, since they happen on the region-server side and there are built-in limits there to keep IO down. It will accelerate sending all the command messages, though. -ryan On Mon, Jan 24, 2011 at 11:18 AM, Ted Yu yuzhih...@gmail.com wrote: https://issues.apache.org/jira/browse/HBASE-3471 is created On Mon, Jan 24, 2011 at 10:56 AM, Jean-Daniel Cryans jdcry...@apache.org wrote: I'd guess so, and the same could be done for splits and compactions since it's (almost) the same code. J-D On Sat, Jan 22, 2011 at 8:00 AM, Ted Yu yuzhih...@gmail.com wrote: In 0.90, HBaseAdmin.flush() uses a loop to go over List<Pair<HRegionInfo, HServerAddress>>. Should an executor service be used for higher parallelism? Thanks
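A sketch of what the HBASE-3471 idea could look like from the client side, just to illustrate the point above: the pool only parallelizes the flush RPCs, the servers still throttle the actual work. The target names and pool size are made up, and this assumes HBaseAdmin tolerates concurrent calls:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ParallelFlushSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    final HBaseAdmin admin = new HBaseAdmin(conf);
    // Hypothetical targets; HBaseAdmin.flush() accepts a table or region name.
    List<String> targets = Arrays.asList("table_a", "table_b", "table_c");

    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final String target : targets) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            admin.flush(target); // only the request is parallel; the flush runs server-side
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}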
Re: Items to contribute (plan)
Hopefully to do #1, you would not require many/any changes in HFile or HBase. Implementing the HDFS stream API should be enough. #2 is interesting, what is the benefit? How did you measure said benefit? -ryan On Sat, Jan 22, 2011 at 5:45 PM, Ted Yu yuzhih...@gmail.com wrote: #1 looks similar to what MapR has done. On Sat, Jan 22, 2011 at 5:18 PM, Tatsuya Kawano tatsuya6...@gmail.comwrote: Hi, I wanted to let you know that I'm planning to contribute the following items to the HBase community. These are my spare time projects and I'll only be able to spend my time about 7 hours a week, so the progress will be very slow. I want some feedback from you guys to prioritize them. Also, if someone/team wants to work on them (with me or alone), I'll be happy to provide more details. 1. RADOS integration Run HBase not only on HDFS but also RADOS distributed object store (the lower layer of Ceph), so that the following options will become available to HBase users: -- No SPOF (RADOS doesn't have the name node(s), but only ZK-like monitors and data nodes) -- Instant backup of HBase tables (RADOS provides copy-on-write snapshot per object pool) -- Extra durability option on WAL (RADOS can do both synchronous and asynchronous disk flush. HDFS doesn't have the earlier option) Note: RADOS object = HFile, WAL object pool = group of HFiles or WAL Current status: Design phase 2. mapreduce.HFileInputFormat MR library to read data directly from HFiles. (Roughly 2.5 times faster than TableInputFormat in my tests) Current status: Completed a proof-of-concept prototype and measured performance. 3. Enhance Get/Scan performance of RS Add an hash code and a couple of flags to HFile at the flush time and change scanner implementation so that: -- Get/Scan operations will get faster. (less key comparisons for reconstructing a row: O(h * c) - O(h). [h = number of HFiles for the row, c = number of columns in an HFile]) -- The size of HFiles will become a bit smaller. (The flags will eliminate duplicate bytes in keys (row, column family and qualifier) from HFiles.) Current status: Completed a proof-of-concept prototype and measured performance. Detals: https://github.com/tatsuya6502/hbase-mr-pof/ (I meant poc not pof...) 4. Writing Japanese books and documents -- Currently I'm authoring a book chapter about HBase for a Japanese NOSQL book -- I'll translate The Apache HBase Book to Japanese Thank you, -- Tatsuya Kawano (Mr.) Tokyo, Japan http://twitter.com/#!/tatsuya6502 http://twitter.com/#%21/tatsuya6502
Re: VERSIONS in Shell
the parse code inside table.rb is wacky, maybe this fixes it: diff --git a/src/main/ruby/hbase/table.rb b/src/main/ruby/hbase/table.rb index c8e0076..cd90132 100644 --- a/src/main/ruby/hbase/table.rb +++ b/src/main/ruby/hbase/table.rb @@ -138,19 +138,17 @@ module Hbase get.addFamily(family) end end - - # Additional params - get.setMaxVersions(args[VERSIONS] || 1) - get.setTimeStamp(args[TIMESTAMP]) if args[TIMESTAMP] else # May have passed TIMESTAMP and row only; wants all columns from ts. - unless ts = args[TIMESTAMP] -raise ArgumentError, Failed parse of #{args.inspect}, #{args.class} + if ts = args[TIMESTAMP] +# Set the timestamp + get.setTimeStamp(ts.to_i) end - - # Set the timestamp - get.setTimeStamp(ts.to_i) end + +# Additional params +get.setMaxVersions(args[VERSIONS] || 1) +get.setTimeStamp(args[TIMESTAMP]) if args[TIMESTAMP] end # Call hbase for the results On Tue, Jan 18, 2011 at 12:36 AM, Lars George lars.geo...@gmail.com wrote: Hi, On hbase-0.89.20100924+28 I tried to get all versions for a cell that has 3 versions and on the shell I got: hbase(main):014:0 get 'hbase_table_1', '498', {VERSIONS=10} COLUMN CELL ERROR: Failed parse of {VERSIONS=10}, Hash Here is some help for this command: Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions. Examples: hbase get 't1', 'r1' hbase get 't1', 'r1', {COLUMN = 'c1'} hbase get 't1', 'r1', {COLUMN = ['c1', 'c2', 'c3']} hbase get 't1', 'r1', {COLUMN = 'c1', TIMESTAMP = ts1} hbase get 't1', 'r1', {COLUMN = 'c1', TIMESTAMP = ts1, VERSIONS = 4} hbase get 't1', 'r1', 'c1' hbase get 't1', 'r1', 'c1', 'c2' hbase get 't1', 'r1', ['c1', 'c2'] hbase(main):015:0 scan 'hbase_table_1', { STARTROW='498', STOPROW='498',VERSIONS=10} ROW COLUMN+CELL 498 column=cf1:val, timestamp=1295335912913, value=val_498 498 column=cf1:val, timestamp=1295335912913, value=val_498 498 column=cf1:val, timestamp=1295335912913, value=val_498 1 row(s) in 0.0520 seconds hbase(main):016:0 So the scan works but not the get. That's wrong, right? Lars
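For anyone hitting this from Java rather than the shell, the equivalent of the failing get is straightforward; a small sketch against the same table, row and column Lars uses:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetAllVersionsSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "hbase_table_1");
    Get get = new Get(Bytes.toBytes("498"));
    get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("val"));
    get.setMaxVersions(10); // what the shell's VERSIONS option should map to
    Result result = table.get(get);
    // One KeyValue per stored version of the cell.
    for (KeyValue kv : result.raw()) {
      System.out.println(kv);
    }
  }
}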
Re: zookeeper.session.timeout
No, it does not; ZooKeeper fixed that. -ryan On Tue, Jan 18, 2011 at 3:29 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, In hbase 0.20.6, I see the following in the description for zookeeper.session.timeout: The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime. Set the zk ticktime with hbase.zookeeper.property.tickTime. In milliseconds. Does the above still hold for hbase 0.90? Thanks
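The timeout itself is still driven by the zookeeper.session.timeout property. A small sketch of setting it programmatically (normally it goes in hbase-site.xml), with the caveat that the ZooKeeper server may still negotiate the value it actually grants:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SessionTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Ask for a 60 second session; the server may clamp this to its own bounds.
    conf.setInt("zookeeper.session.timeout", 60000);
    System.out.println("requested: " + conf.getInt("zookeeper.session.timeout", -1) + " ms");
  }
}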
Re: Zookeeper tuning, was: YouAreDeadException
also remember that higher session timeouts take longer to discover a regionserver is dead. so it's a trade off. On Sat, Jan 15, 2011 at 6:37 PM, Ted Yu yuzhih...@gmail.com wrote: I want region server to be more durable. If zookeeper.session.timeout is set high, it takes master long to discover dead region server. Can you share zookeeper tuning experiences ? Thanks On Sat, Jan 15, 2011 at 5:14 PM, Stack st...@duboce.net wrote: Yes. Currently, there are two heartbeats: the zk client one and then the hbase which used to be what we relied on figuring whether a regionserver is alive but now its just used to post the master the regionserver stats such as requests per second. This latter is going away in 0.92 (Pre-0.90.0 regionserver and master would swap 'messages' on the back of the heartbeat -- i.e. open this region, I've just split region X, etc. but now 90% of this stuff is done via zk. In 0.92. we'll finish the cleanup). Hope this helps, St.Ack On Sat, Jan 15, 2011 at 5:03 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, I assume I should look for 'received expired from ZooKeeper, aborting' On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, what string should I look for in region server log ? For #4, what's the rationale behind sending YADE after receiving heartbeat ? I thought heartbeat means the RS is alive. Thanks On Sat, Jan 15, 2011 at 4:49 PM, Stack st...@duboce.net wrote: FYI Ted, the YourAreDeadException usually happens in following context: 1. Regionserver has some kinda issue -- long GC pause for instance -- and it stops tickling zk. 2. Master gets zk session expired event. Starts up recovery of the hung region. 3. Regionserver recovers but has not yet processed its session expired event. It heartbeats the Master as though nothing wrong. 4. Master is mid-recovery or beyond server recovery and on receipt of the heartbeat in essence tells the regionserver to 'go away' by sending him the YouAreDeadException. 5. By now the regionserver will have gotten its session expired notification and will have started an abort so the YADE is not news when it receives the exception. St.Ack On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu yuzhih...@gmail.com wrote: Thanks for your analysis, Ryan. The dev cluster has half as many nodes as our staging cluster. Each node has half the number of cores as the node in staging. I agree with your conclusion. I will report back after I collect more data - the flow uses hbase heavily toward the end. On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson ryano...@gmail.com wrote: I'm seeing not much in the way of errors, timeouts, all to one machine ending with .80, so that is probably your failed node. Other than that, the log doesnt seem to say too much. Searching for strings like FATAL and Exception is the way to go here. Also things like this: 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region= NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v [h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602 7.2c40637c6c648a67162cc38d8c6d8ee9. Guessing, I'd probably say your nodes hit some performance wall, with io-wait, or networking, or something, and Regionserver processes stopped responding, but did not time out from zookeeper yet... so you would run into a situation where some nodes are unresponsive, so any data hosted there would be difficult to talk to. 
Until the regionserver times out it's zookeeper node, the master doesnt know about the fault of the regionserver. The master web UI is probably inaccessible because the META table is on a regionserver that went AWOL. You should check your load, your ganglia graphs. Also remember, despite having lots of disks, each node is a gigabit ethernet which means about 110-120 MB/sec. It's quite possible you are running into network limitations, remember that regionservers must write to 2 additional datanodes, and there will be overlap, thus you have to share some of that 110-120MB/sec per node figure with other nodes, not to mention that you also need to factor inbound bandwidth (from client-hbase regionserver) and outbound bandwidth (from datanode replica 1 - dn replica 2). -ryan On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu yuzhih...@gmail.com wrote: Now I cannot access master web UI, This happened after I doubled the amount of data processed in our flow. I am attaching master log. On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson ryano...@gmail.com wrote: This is the cause: org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com
Re: ANN: The fourth hbase 0.90.0 release candidate is available for download X
Hey, It's a pretty unusual situation that got you into 3445? It's been a few weeks of RCs, and we need to push out a 0.90.0 so everyone can benefit from it. We can release point releases fairly quickly once a stable base release is out, does that sound reasonable to you? Thanks for testing! -ryan On Fri, Jan 14, 2011 at 2:14 PM, James Kennedy james.kenn...@troove.net wrote: -1 for the following bug: https://issues.apache.org/jira/browse/HBASE-3445 Note however that aside from this issue RC 3 looks pretty stable: * All HBase tests pass (on a Mac) * All hbase-trx tests pass after I upgraded https://github.com/hbase-trx/hbase-transactional-tableindexed * All tests pass in our web app. * Our application performs well on local machine. * Still todo after 3445 fixed: Full cluster testing James Kennedy On 2011-01-07, at 5:03 PM, Stack wrote: The fourth hbase 0.90.0 release candidate is available for download: http://people.apache.org/~stack/hbase-0.90.0-candidate-3/ This is going to be the one! Should we release this candidate as hbase 0.90.0? Take it for a spin. Check out the doc., etc. Vote +1/-1 by next Friday, the 14th of January. HBase 0.90.0 is the major HBase release that follows 0.20.0 and the fruit of the 0.89.x development release series we've been running of late. Over 1k issues have been closed since 0.20.0. Release notes are available here: http://su.pr/8LbgvK. HBase 0.90.0 runs on Hadoop 0.20.x. It does not currently run on Hadoop 0.21.0 nor on Hadoop TRUNK. HBase will lose data unless it is running on an Hadoop HDFS 0.20.x that has a durable sync. Currently only the branch-0.20-append branch [1] has this attribute (See CHANGES.txt [3] in branch-0.20-append to see the list of patches involved adding an append). No official releases have been made from this branch as yet so you will have to build your own Hadoop from the tip of this branch, OR install Cloudera's CDH3 [2] (Its currently in beta). CDH3b2 or CDHb3 have the 0.20-append patches needed to add a durable sync. If using CDH, be sure to replace the hadoop jars that are bundled with HBase with those from your CDH distribution. There is no migration necessary. Your data written with HBase 0.20.x (or with HBase 0.89.x) is readable by HBase 0.90.0. A shutdown and restart after putting in place the new HBase should be all thats involved. That said, once done, there is no going back to 0.20.x once the transition has been made. HBase 0.90.0 and HBase 0.89.x write region names differently in the filesystem. Rolling restart from 0.20.x or 0.89.x to 0.90.0RC1 will not work. Yours, The HBasistas P.S. For why the version 0.90 and whats new in HBase 0.90, see slides 4-10 in this deck [4] 1. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append 2. http://archive.cloudera.com/docs/ 3. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt 4. http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
Re: YouAreDeadException
This is the cause: org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: Why did the session expire? Typically it's GC, what does your GC logs say? Otherwise, network issues perhaps? Swapping? Other machine related systems problems? -ryan On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu yuzhih...@gmail.com wrote: I ran 0.90 RC3 in dev cluster. I saw the following in region server log: Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy0.regionServerReport(Unknown Source) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702) ... 
2 more 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181sessionTimeout=9 watcher=hconnection 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) --- And the following from master log: 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [ sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378] 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs:// sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs:// sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0 Please advise what could be the cause. Thanks
Re: YouAreDeadException
I'm seeing not much in the way of errors, timeouts, all to one machine ending with .80, so that is probably your failed node. Other than that, the log doesnt seem to say too much. Searching for strings like FATAL and Exception is the way to go here. Also things like this: 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region= NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602 7.2c40637c6c648a67162cc38d8c6d8ee9. Guessing, I'd probably say your nodes hit some performance wall, with io-wait, or networking, or something, and Regionserver processes stopped responding, but did not time out from zookeeper yet... so you would run into a situation where some nodes are unresponsive, so any data hosted there would be difficult to talk to. Until the regionserver times out it's zookeeper node, the master doesnt know about the fault of the regionserver. The master web UI is probably inaccessible because the META table is on a regionserver that went AWOL. You should check your load, your ganglia graphs. Also remember, despite having lots of disks, each node is a gigabit ethernet which means about 110-120 MB/sec. It's quite possible you are running into network limitations, remember that regionservers must write to 2 additional datanodes, and there will be overlap, thus you have to share some of that 110-120MB/sec per node figure with other nodes, not to mention that you also need to factor inbound bandwidth (from client-hbase regionserver) and outbound bandwidth (from datanode replica 1 - dn replica 2). -ryan On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu yuzhih...@gmail.com wrote: Now I cannot access master web UI, This happened after I doubled the amount of data processed in our flow. I am attaching master log. On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson ryano...@gmail.com wrote: This is the cause: org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: Why did the session expire? Typically it's GC, what does your GC logs say? Otherwise, network issues perhaps? Swapping? Other machine related systems problems? -ryan On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu yuzhih...@gmail.com wrote: I ran 0.90 RC3 in dev cluster. 
I saw the following in region server log: Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy0.regionServerReport(Unknown Source) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702) ... 2 more 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181sessionTimeout=9 watcher=hconnection 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506
Re: question about Hregion.incrementColumnValue
There was a good, but complex reason. Its going away with stacks time stamp patch. I'll see if I can do a better email tomorrow. On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com wrote: I am looking at Hregion.incrementColumnValue(). It has the following piece of code // build the KeyValue now: 3266 KeyValue newKv = new KeyValue(row, family, 3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(), 3268 Bytes.toBytes(result)); 3269 3270 // now log it: 3271 if (writeToWAL) { 3272 long now = EnvironmentEdgeManager.currentTimeMillis(); 3273 WALEdit walEdit = new WALEdit(); 3274 walEdit.add(newKv); 3275 this.log.append(regionInfo, regionInfo.getTableDesc().getName(), 3276 walEdit, now); 3277 } It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once for creating the new KV and then another time to add it to the WAL. Is this significant or just an oversight? Can we instead invoke it once before we create the new key-value and then use it for both code paths? Thanks, dhruba -- Connect to me at http://www.facebook.com/dhruba
Re: question about Hregion.incrementColumnValue
I put more comments on this: HBASE-3021 Basically we needed to avoid duplicate timestamp KVs in memstore hfile, elsewise we might end up 'getting' the wrong value and thus messing up the count. With work on ACID by stack we can avoid using that. -ryan On Mon, Jan 10, 2011 at 11:23 AM, Stack st...@duboce.net wrote: Yeah, thats going away unless Ryan comes up w/ a reason for why we should keep it. St.Ack On Mon, Jan 10, 2011 at 12:29 AM, Ryan Rawson ryano...@gmail.com wrote: There was a good, but complex reason. Its going away with stacks time stamp patch. I'll see if I can do a better email tomorrow. On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com wrote: I am looking at Hregion.incrementColumnValue(). It has the following piece of code // build the KeyValue now: 3266 KeyValue newKv = new KeyValue(row, family, 3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(), 3268 Bytes.toBytes(result)); 3269 3270 // now log it: 3271 if (writeToWAL) { 3272 long now = EnvironmentEdgeManager.currentTimeMillis(); 3273 WALEdit walEdit = new WALEdit(); 3274 walEdit.add(newKv); 3275 this.log.append(regionInfo, regionInfo.getTableDesc().getName(), 3276 walEdit, now); 3277 } It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once for creating the new KV and then another time to add it to the WAL. Is this significant or just an oversight? Can we instead invoke it once before we create the new key-value and then use it for both code paths? Thanks, dhruba -- Connect to me at http://www.facebook.com/dhruba
Re: question about Hregion.incrementColumnValue
That is just an artifact of the way the code was written, the math.max() and +1 code was the guarantee. Remember without that code, the old ICV would _LOSE DATA_. So a little hacking was in order. I expect to clean up this with the HBASE-2856 patch. -ryan On Mon, Jan 10, 2011 at 4:34 PM, Jonathan Gray jg...@fb.com wrote: How does doing currentTimeMillis() twice in a row guarantee different timestamps? And in this case, we're talking about the MemStore vs. HLog not HFile. There is another section of the code where there is a timestamp+1 to avoid duplicates but this is something else. -Original Message- From: Ryan Rawson [mailto:ryano...@gmail.com] Sent: Monday, January 10, 2011 2:27 PM To: dev@hbase.apache.org Subject: Re: question about Hregion.incrementColumnValue I put more comments on this: HBASE-3021 Basically we needed to avoid duplicate timestamp KVs in memstore hfile, elsewise we might end up 'getting' the wrong value and thus messing up the count. With work on ACID by stack we can avoid using that. -ryan On Mon, Jan 10, 2011 at 11:23 AM, Stack st...@duboce.net wrote: Yeah, thats going away unless Ryan comes up w/ a reason for why we should keep it. St.Ack On Mon, Jan 10, 2011 at 12:29 AM, Ryan Rawson ryano...@gmail.com wrote: There was a good, but complex reason. Its going away with stacks time stamp patch. I'll see if I can do a better email tomorrow. On Jan 9, 2011 11:42 PM, Dhruba Borthakur dhr...@gmail.com wrote: I am looking at Hregion.incrementColumnValue(). It has the following piece of code // build the KeyValue now: 3266 KeyValue newKv = new KeyValue(row, family, 3267 qualifier, EnvironmentEdgeManager.currentTimeMillis(), 3268 Bytes.toBytes(result)); 3269 3270 // now log it: 3271 if (writeToWAL) { 3272 long now = EnvironmentEdgeManager.currentTimeMillis(); 3273 WALEdit walEdit = new WALEdit(); 3274 walEdit.add(newKv); 3275 this.log.append(regionInfo, regionInfo.getTableDesc().getName(), 3276 walEdit, now); 3277 } It invokes EnvironmentEdgeManager.currentTimeMillis() twice, once for creating the new KV and then another time to add it to the WAL. Is this significant or just an oversight? Can we instead invoke it once before we create the new key-value and then use it for both code paths? Thanks, dhruba -- Connect to me at http://www.facebook.com/dhruba
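For illustration, the single-clock-read variant Dhruba is asking about would look roughly like the sketch below. This is a minimal sketch based only on the snippet quoted above (names such as row, family, qualifier, result, writeToWAL, regionInfo and this.log come from that snippet), not the actual HBASE-2856 cleanup: EnvironmentEdgeManager.currentTimeMillis() is read once and the same value is used for both the KeyValue and the WAL append.

long now = EnvironmentEdgeManager.currentTimeMillis();
// build the KeyValue with the same timestamp the WAL append will use
KeyValue newKv = new KeyValue(row, family, qualifier, now, Bytes.toBytes(result));
// now log it, reusing 'now' instead of reading the clock a second time
if (writeToWAL) {
  WALEdit walEdit = new WALEdit();
  walEdit.add(newKv);
  this.log.append(regionInfo, regionInfo.getTableDesc().getName(), walEdit, now);
}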
Re: Good VLDB paper on WALs
Oh no, let's be wary of those server rewrites. My micro profiling is showing about 30 usec for a lock handoff in the HBase client... I think we should be able to get big wins with minimal things. A big rewrite has its major costs, not to mention that to be effectively async we'd have to rewrite every single piece of code more complex than Bytes.*. If you need to block you will need to push context on a context-store (aka stack) and manage all of that ourselves. I've been seeing papers that are talking about threading improvements that could get us better performance. Assuming that context switching is the actual reason why we aren't as fast as we could be (note: we are NOT slow!). As for the DI, I think I'd like to see more study on the costs and benefits. We have a relatively small number of interfaces and concrete objects, and for the interfaces we do have, there are 1 or 2 implementations at most. Usually 1. There is a cost; I'd like to see more description of the costs vs the benefits. -ryan On Wed, Dec 29, 2010 at 11:32 AM, Stack st...@duboce.net wrote: Nice list of things we need to do to make logging faster (with useful citations on current state of art). This notion of early lock release (ELR) is worth looking into (Jon, for high rates of counter transactions, you've been talking about aggregating counts in front of the WAL lock... maybe an ELR and then a hold on the transaction until confirmation of flush would be way to go?). Regards flush-pipelining, it would be interesting to see if there are traces of the sys-time that Dhruba is seeing in his NN out in HBase servers. My guess is that its probably drowned by other context switches done in our servers. Definitely worth study. St.Ack P.S. Minimizing context switches, a system for ELR and flush-pipelining, recasting the server to make use of one of the DI or OSGi frameworks, moving off log4j, etc. Is it just me or do others feel a server rewrite coming on? On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur dhr...@gmail.com wrote: HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is written to disk. In earlier deployments, I thought we could safely ignore flush-pipelining by creating more server threads. But in our largest HDFS systems, I am starting to see 20% sys-time usage on the namenode machine; most of this could be thread scheduling. If so, then it makes sense to enhance the logging code to release server threads even before the WAL is flushed to disk (but, of course, we still have to delay the transaction response to the client till the WAL is synced to disk). Does anybody have any idea on how to figure out what percentage of the above sys-time is spent in thread scheduling vs the time spent in other system calls (especially in the Namenode context)? thanks, dhruba On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon t...@cloudera.com wrote: Via Hammer - I thought this was a pretty good read, some good ideas for optimizations for our WAL. http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf -Todd -- Todd Lipcon Software Engineer, Cloudera -- Connect to me at http://www.facebook.com/dhruba
Re: deploying hbase 0.90 to internal maven repository
just run 'mvn install' in our directory and that should do the trick. everything else is implied by pom.xml. well except the repository stuff. -ryan On Wed, Dec 29, 2010 at 10:29 AM, Ted Yu yuzhih...@gmail.com wrote: Hi, I used the following script to deploy hbase 0.90 jar to internal maven repository but was not successful: #!/usr/bin/env bash set -x mvn deploy:deploy-file -Dfile=target/hbase-0.90.0.jar -Dpackaging=jar -DgroupId=org.apache.hbase -DartifactId=hbase -Dversion=0.90.0 -DrepositoryId=carrieriq.thirdParty -Durl=scp://maven2:mav...@repository.eng.carrieriq.com: /data/maven2/repository/thirdparty Comment about how the following error can be fixed is appreciated. Here is the output: [INFO] Scanning for projects... [WARNING] Profile with id: 'property-overrides' has not been activated. [INFO] [ERROR] BUILD ERROR [INFO] [INFO] Error building POM (may not be this project's POM). Project ID: com.agilejava.docbkx:docbkx-maven-plugin POM Location: Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.11] Validation Messages: [0] 'dependencies.dependency.version' is missing for com.agilejava.docbkx:docbkx-maven-base:jar Reason: Failed to validate POM for project com.agilejava.docbkx:docbkx-maven-plugin at Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.11]
Re: provide a 0.20-append tarball?
Looks like the fight is not going well. A lot of HDFS developers are concerned that it would divert resources; I'm not sure whose resources. I hope my 13-15 month comment helped... I've heard 'wait for the next version' before and I am not interested in it. If that had actually worked, a year ago we'd have had stable, working sync/HLog recovery support. -ryan On Wed, Dec 22, 2010 at 3:41 PM, Stack st...@duboce.net wrote: On Wed, Dec 22, 2010 at 11:14 AM, Stack st...@duboce.net wrote: Let me ask Dhruba what he thinks about making a 0.20-append release (He's the release manager). Will also sound out the hadoop pmc since they'll have an opinion. I asked Dhruba. He's fine w/ a release off tip of branch--0.20-append. I just wrote a message to general up on hadoop to gauge what hadoopers think of the idea. St.Ack
Re: 0.90.0RC2 tomorrow?
The default xml is in the jar and is intended to be that way. The other is a bug. Can you file a JIRA? Thanks! On Dec 21, 2010 7:18 PM, Tatsuya Kawano tatsuya6...@gmail.com wrote: Hi, Just noticed a couple of things in the last candidate (rc1). 1. conf/hbase-default.xml is missing. 2. bin/start-hbase.sh displays the following warning. cat: ... /hbase-0.90.0/bin/../target/cached_classpath.txt: No such file or directory Thank you, Tatsuya -- Tatsuya Kawano Tokyo, Japan On Dec 21, 2010, at 2:36 PM, Stack st...@duboce.net wrote: We should be able to post our third release candidate tomorrow (Our RCs are zero-based). All current blockers and criticals have been cleared. Speak up if there is anything you want to get into 0.90.0RC2 or if there is a good reason for not cutting the RC now. Thanks, St.Ack
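For context on why the missing conf/hbase-default.xml is intentional: the defaults ship inside the hbase jar and are read off the classpath when a configuration is built. A minimal sketch of how that is typically consumed (assuming the 0.90 client API; the property name shown is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShowDefaults {
  public static void main(String[] args) {
    // Loads hbase-default.xml from the hbase jar on the classpath,
    // then overlays any hbase-site.xml also found on the classpath.
    Configuration conf = HBaseConfiguration.create();
    System.out.println(conf.get("hbase.zookeeper.quorum"));
  }
}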
Re: Hypertable claiming upto 900% random-read throughput vs HBase
So if that is the case, I'm not sure how that is a fair test. One system reads from RAM, the other from disk. The results are as expected. Why not test one system with SSDs and the other without? It's really an apples-to-oranges comparison. Even if you are running the same workloads on 2 diverse systems, you are not testing the code quality, you are testing overall systems and other issues. As G1 GC improves, I expect our ability to use larger and larger heaps would blunt the advantage of a C++ program using malloc. -ryan On Wed, Dec 15, 2010 at 11:15 AM, Ted Dunning tdunn...@maprtech.com wrote: From the small comments I have heard, the RAM versus disk difference is mostly what I have heard they were testing. On Wed, Dec 15, 2010 at 11:11 AM, Ryan Rawson ryano...@gmail.com wrote: We dont have the test source code, so it isnt very objective. However I believe there are 2 things which help them: - They are able to harness larger amounts of RAM, so they are really just testing that vs HBase
Re: Hypertable claiming upto 900% random-read throughput vs HBase
Purtell has more details, but he told me G1 no longer crashes, just minor pauses between 50-250 ms, as of 1.6_23. Still not usable in a latency-sensitive prod setting. Maybe in other settings? -ryan On Wed, Dec 15, 2010 at 11:31 AM, Ted Dunning tdunn...@maprtech.com wrote: Does anybody have a recent report about how G1 is coming along? On Wed, Dec 15, 2010 at 11:22 AM, Ryan Rawson ryano...@gmail.com wrote: As G1 GC improves, I expect our ability to use larger and larger heaps would blunt the advantage of a C++ program using malloc.
Re: Hypertable claiming upto 900% random-read throughput vs HBase
Why do that? You reduce the cache effectiveness and up the logistical complexity. As a stopgap maybe, but not as a long term strategy. Sun just needs to fix their GC. Er, Oracle. -ryan On Wed, Dec 15, 2010 at 11:55 AM, Chad Walters chad.walt...@microsoft.com wrote: Why not run multiple JVMs per machine? Chad -Original Message- From: Ryan Rawson [mailto:ryano...@gmail.com] Sent: Wednesday, December 15, 2010 11:52 AM To: dev@hbase.apache.org Subject: Re: Hypertable claiming upto 900% random-read throughput vs HBase The malloc thing was pointing out that we have to contend with Xmx and GC. So it makes it harder for us to maximally use all the available ram for block cache in the regionserver. Which you may or may not want to do for alternative reasons. At least with Xmx you can plan and control your deployments, and you wont suffer from heap growth due to heap fragmentation. -ryan On Wed, Dec 15, 2010 at 11:49 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, Dec 15, 2010 at 11:44 AM, Gaurav Sharma gaurav.gs.sha...@gmail.com wrote: Thanks Ryan and Ted. I also think if they were using tcmalloc, it would have given them a further advantage but as you said, not much is known about the test source code. I think Hypertable does use tcmalloc or jemalloc (forget which) You may be interested in this thread from back in August: http://search-hadoop.com/m/pG6SM1xSP7r/hypertablesubj=Re+Finding+on+H Base+Hypertable+comparison -Todd On Wed, Dec 15, 2010 at 2:22 PM, Ryan Rawson ryano...@gmail.com wrote: So if that is the case, I'm not sure how that is a fair test. One system reads from RAM, the other from disk. The results as expected. Why not test one system with SSDs and the other without? It's really hard to get apples/oranges comparison. Even if you are doing the same workloads on 2 diverse systems, you are not testing the code quality, you are testing overall systems and other issues. As G1 GC improves, I expect our ability to use larger and larger heaps would blunt the advantage of a C++ program using malloc. -ryan On Wed, Dec 15, 2010 at 11:15 AM, Ted Dunning tdunn...@maprtech.com wrote: From the small comments I have heard, the RAM versus disk difference is mostly what I have heard they were testing. On Wed, Dec 15, 2010 at 11:11 AM, Ryan Rawson ryano...@gmail.com wrote: We dont have the test source code, so it isnt very objective. However I believe there are 2 things which help them: - They are able to harness larger amounts of RAM, so they are really just testing that vs HBase -- Todd Lipcon Software Engineer, Cloudera
Re: Local sockets
Hi, I'd like to hear more on how you think this paper and the associated topics apply to HBase. Remember, unlike the paper, everyone will always run replication in a real environment, it would be suicide not to. -ryan On Mon, Dec 6, 2010 at 11:39 AM, Vladimir Rodionov vrodio...@carrieriq.com wrote: Todd, There are some curious people who had spent time (and tax payers money :) and have came to the same conclusion (as me): http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Todd Lipcon [t...@cloudera.com] Sent: Monday, December 06, 2010 10:04 AM To: dev@hbase.apache.org Subject: Re: Local sockets On Mon, Dec 6, 2010 at 9:59 AM, Vladimir Rodionov vrodio...@carrieriq.comwrote: Todd, The major hdfs problem is inefficient processing of multiple streams in parallel - multiple readers/writers per one physical drive result in significant drop in overall I/O throughput on Linux (tested with ext3, ext4). There should be only one reader thread, one writer thread per physical drive (until we get AIO support in Java) Multiple data buffer copies in pipeline do not improve situation as well. In my benchmarks, the copies account for only a minor amount of the overhead. Do a benchmark of ChecksumLocalFilesystem vs RawLocalFilesystem and you should see the 2x difference I mentioned for data that's in buffer cache. As for parallel reader streams, I disagree with your assessment. After tuning readahead and with a decent elevator algorithm (anticipatory seems best in my benchmarks) it's better to have multiple threads reading from a drive compared to one, unless we had AIO. Otherwise we won't be able to have multiple outstanding requests to the block device, and the elevator will be powerless to do any reordering of reads. CRC32 can be fast btw and some other hashing algos can be even faster (like murmur2 -1.5GB per sec) Our CRC32 implementation goes around 750MB/sec on raw data, but for whatever undiscovered reason it adds a lot more overhead when you mix it into the data pipeline. HDFS-347 has some interesting benchmarks there. -Todd From: Todd Lipcon [t...@cloudera.com] Sent: Saturday, December 04, 2010 3:04 PM To: dev@hbase.apache.org Subject: Re: Local sockets On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov vrodio...@carrieriq.comwrote: From my own experiments performance difference is huge even on sequential R/W operations (up to 300%) when you do local File I/O vs HDFS File I/O Overhead of HDFS I/O is substantial to say the least. Much of this is from checksumming, though - turn off checksums and you should see about a 2x improvement at least. -Todd Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Todd Lipcon [t...@cloudera.com] Sent: Saturday, December 04, 2010 12:30 PM To: dev@hbase.apache.org Subject: Re: Local sockets Hi Leen, Check out HDFS-347 for more info on this. I hope to pick this back up in 2011 - in 2010 we mostly focused on stability above performance in HBase's interactions with HDFS. Thanks -Todd On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote: Hi, has anyone tested the performance impact (when there is a hdfs datanode and a hbase node on the same machine) of using unix domain sockets communication or shared memory ipc using nio? I guess this should make a difference on reads? 
Regards, Leen -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera
Re: Local sockets
So are we talking about re-implementing IO scheduling in Hadoop at the application level? On Mon, Dec 6, 2010 at 12:13 PM, Rajappa Iyer r...@panix.com wrote: Jay Booth jaybo...@gmail.com writes: I don't get what they're talking about with hiding I/O limitations.. if the OS is doing a poor job of handling sequential readers, that's on the OS and not Hadoop, no? In other words, I didn't see anything specific to Hadoop in their multiple readers slow down sequential access statement, it just may or may not be true for a given I/O subsystem. The operating system is still getting open file, read, read, read, close, whether you're accessing that file locally or via a datanode. Datanodes don't close files in between read calls, except at block boundaries. The root cause of the problem is the way map jobs are scheduled. Since the job execution overlaps, the reads from different jobs also overlap and hence increase seeks. Realistically, there's not much that the OS can do about it. What Vladimir is talking about is reducing the seek times by essentially serializing the reads through a single thread per disk. You could either cleverly reorganize the reads so that seek is minimized and/or read the entire block in one call. -rsi On Mon, Dec 6, 2010 at 2:39 PM, Vladimir Rodionov vrodio...@carrieriq.comwrote: Todd, There are some curious people who had spent time (and tax payers money :) and have came to the same conclusion (as me): http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Todd Lipcon [t...@cloudera.com] Sent: Monday, December 06, 2010 10:04 AM To: dev@hbase.apache.org Subject: Re: Local sockets On Mon, Dec 6, 2010 at 9:59 AM, Vladimir Rodionov vrodio...@carrieriq.comwrote: Todd, The major hdfs problem is inefficient processing of multiple streams in parallel - multiple readers/writers per one physical drive result in significant drop in overall I/O throughput on Linux (tested with ext3, ext4). There should be only one reader thread, one writer thread per physical drive (until we get AIO support in Java) Multiple data buffer copies in pipeline do not improve situation as well. In my benchmarks, the copies account for only a minor amount of the overhead. Do a benchmark of ChecksumLocalFilesystem vs RawLocalFilesystem and you should see the 2x difference I mentioned for data that's in buffer cache. As for parallel reader streams, I disagree with your assessment. After tuning readahead and with a decent elevator algorithm (anticipatory seems best in my benchmarks) it's better to have multiple threads reading from a drive compared to one, unless we had AIO. Otherwise we won't be able to have multiple outstanding requests to the block device, and the elevator will be powerless to do any reordering of reads. CRC32 can be fast btw and some other hashing algos can be even faster (like murmur2 -1.5GB per sec) Our CRC32 implementation goes around 750MB/sec on raw data, but for whatever undiscovered reason it adds a lot more overhead when you mix it into the data pipeline. HDFS-347 has some interesting benchmarks there. 
-Todd From: Todd Lipcon [t...@cloudera.com] Sent: Saturday, December 04, 2010 3:04 PM To: dev@hbase.apache.org Subject: Re: Local sockets On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov vrodio...@carrieriq.comwrote: From my own experiments performance difference is huge even on sequential R/W operations (up to 300%) when you do local File I/O vs HDFS File I/O Overhead of HDFS I/O is substantial to say the least. Much of this is from checksumming, though - turn off checksums and you should see about a 2x improvement at least. -Todd Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Todd Lipcon [t...@cloudera.com] Sent: Saturday, December 04, 2010 12:30 PM To: dev@hbase.apache.org Subject: Re: Local sockets Hi Leen, Check out HDFS-347 for more info on this. I hope to pick this back up in 2011 - in 2010 we mostly focused on stability above performance in HBase's interactions with HDFS. Thanks -Todd On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote: Hi, has anyone tested the performance impact (when there is a hdfs datanode and a hbase node on the same machine) of using unix domain sockets communication or shared memory ipc using nio? I guess this should make a difference on reads? Regards, Leen -- Todd Lipcon Software Engineer, Cloudera --
Re: Local sockets
While I applaud these experiments, the next challenge is getting them in to a shipping Hadoop. I think it's a relative nonstarter if we require someone to patch in a bunch of patches that are/were being refused to be committed. Keep on experimenting and collecting that evidence though! One day! -ryan On Sat, Dec 4, 2010 at 3:04 PM, Todd Lipcon t...@cloudera.com wrote: On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov vrodio...@carrieriq.comwrote: From my own experiments performance difference is huge even on sequential R/W operations (up to 300%) when you do local File I/O vs HDFS File I/O Overhead of HDFS I/O is substantial to say the least. Much of this is from checksumming, though - turn off checksums and you should see about a 2x improvement at least. -Todd Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com From: Todd Lipcon [t...@cloudera.com] Sent: Saturday, December 04, 2010 12:30 PM To: dev@hbase.apache.org Subject: Re: Local sockets Hi Leen, Check out HDFS-347 for more info on this. I hope to pick this back up in 2011 - in 2010 we mostly focused on stability above performance in HBase's interactions with HDFS. Thanks -Todd On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen toe...@gmail.com wrote: Hi, has anyone tested the performance impact (when there is a hdfs datanode and a hbase node on the same machine) of using unix domain sockets communication or shared memory ipc using nio? I guess this should make a difference on reads? Regards, Leen -- Todd Lipcon Software Engineer, Cloudera -- Todd Lipcon Software Engineer, Cloudera
Re: Review Request: Add option to cache blocks on hfile write and evict blocks on hfile close
On 2010-11-30 09:57:27, Ryan Rawson wrote: branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java, line 765 http://review.cloudera.org/r/1261/diff/1/?file=17902#file17902line765 why would you not want to evict blocks from the cache on close? stack wrote: I think this a good point. Its different behavior but its behavior we should have always had? One less option too. I'm still confused why we are adding config for something that we should always be doing it. While we'll never be zero conf, I am not seeing the reason why we'd want to keep things in the LRU. It would make more sense not to evict on a split, but evict every other time, since a split will probably reopen the same hfiles and need those blocks again. - Ryan --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1261/#review2010 --- On 2010-11-29 23:22:38, Jonathan Gray wrote: --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1261/ --- (Updated 2010-11-29 23:22:38) Review request for hbase, stack and khemani. Summary --- This issue is about adding configuration options to add/remove from the block cache when creating/closing files. For use cases with lots of flushing and compacting, this might be desirable to prevent cache misses and maximize the effective utilization of total block cache capacity. The first option, hbase.rs.cacheblocksonwrite, will make it so we pre-cache blocks as we are writing out new files. The second option, hbase.rs.evictblocksonclose, will make it so we evict blocks when files are closed. This addresses bug HBASE-3287. http://issues.apache.org/jira/browse/HBASE-3287 Diffs - branches/0.90/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java 1040422 branches/0.90/src/main/java/org/apache/hadoop/hbase/util/CompressionTest.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/RandomSeek.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFile.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFilePerformance.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileSeek.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java 1040422 branches/0.90/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreFile.java 1040422 Diff: http://review.cloudera.org/r/1261/diff Testing --- Added a unit test to TestStoreFile. That passes. Need to do perf testing on a cluster. Thanks, Jonathan
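For anyone who wants to experiment with the two options while the review is in flight, a minimal sketch of the names and types involved (assuming the option keys from the summary above; they are regionserver-side settings, so in a real deployment they belong in the regionserver's hbase-site.xml rather than in client code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CacheOnWriteSettings {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Pre-cache blocks into the block cache as new store files are written (flushes/compactions).
    conf.setBoolean("hbase.rs.cacheblocksonwrite", true);
    // Evict a file's blocks from the block cache when that file is closed.
    conf.setBoolean("hbase.rs.evictblocksonclose", true);
    System.out.println(conf.getBoolean("hbase.rs.cacheblocksonwrite", false));
  }
}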
Re: Review Request: delete followed by a put with the same timestamp
--- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1252/#review1993 --- trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java http://review.cloudera.org/r/1252/#comment6297 What are all the consequences of not sorting by type when using KVComparator? Does this mean we might create HFiles that are not sorted properly, because the HFile comparator uses the KeyComparator directly with ignoreType = false? While in the memstore we can rely on memstoreTS to roughly order by insertion time, and the Put/Delete case should probably work in that situation, you are talking about modifying a pretty core and important concept in how we sort things. There are other ways to reconcile bugs like this; one of them is to extend the memstoreTS concept into the HFile and use that to reconcile during reads. There is another JIRA where I proposed this. If we are talking about 0.92 and beyond I'd prefer building a solid base rather than dangerous hacks like this. Our unit tests are not extremely extensive, so while they might pass, that doesn't guarantee a lack of bad behaviour later on. - Ryan On 2010-11-26 07:47:02, Pranav Khaitan wrote: --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1252/ --- (Updated 2010-11-26 07:47:02) Review request for hbase, Jonathan Gray and Kannan Muthukkaruppan. Summary --- This is a design change suggested in HBASE-3276 so adequate thought should be given before proceeding. The main code change is just one line which is to ignore key type while doing KV comparisons. When the key type is ignored, then all the keys for the same timestamp are sorted according to the order in which they were inserted. It is still ensured that the delete family and delete column will be at the top because they have the default column name and default timestamp. This addresses bug HBASE-3276. http://issues.apache.org/jira/browse/HBASE-3276 Diffs - trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1039233 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/KeyValueScanFixture.java 1039233 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreScanner.java 1039233 Diff: http://review.cloudera.org/r/1252/diff Testing --- Test cases added. Since there is a change in semantics, some previous tests were failing because of this change. Those tests have been modified to test the newer behavior. Thanks, Pranav
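To make the behaviour under discussion concrete, here is a small illustrative sketch (assuming the 0.90-era KeyValue API; the exact type codes are an internal detail): with identical row, family, qualifier and timestamp, the stock comparator orders a Delete ahead of a Put, which is why a Put issued after a Delete at the same timestamp ends up masked (the HBASE-3276 symptom).

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class TypeOrderingDemo {
  public static void main(String[] args) {
    byte[] row = Bytes.toBytes("r1");
    byte[] fam = Bytes.toBytes("f");
    byte[] qual = Bytes.toBytes("q");
    long ts = 100L;
    KeyValue put = new KeyValue(row, fam, qual, ts, KeyValue.Type.Put, Bytes.toBytes("v"));
    KeyValue del = new KeyValue(row, fam, qual, ts, KeyValue.Type.Delete);
    // Negative result: the Delete sorts before the Put even though the Put may have been
    // written later, so a read sees the Delete first and the Put is masked.
    System.out.println(KeyValue.COMPARATOR.compare(del, put));
  }
}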
Re: HRegion.RegionScanner.nextInternal()
Yes, in this case 'batch' and 'limit' refer to how many cells to return at a time within a row. The 'scanner caching' comes across in the next(int) argument, which can change on a per-call basis (although the HTable API doesn't quite allow it). -ryan On Fri, Nov 26, 2010 at 3:12 AM, Lars George lars.geo...@gmail.com wrote: OK, got it. I missed the HRegionServer.next() in the mix. It calls the RegionScanner.next(results) and that uses the batch. Tricksy! I should have started on the client side instead. Lars On Fri, Nov 26, 2010 at 3:08 AM, Ryan Rawson ryano...@gmail.com wrote: No, batch size when limit is set is 1. You get partial results for a row, then get more from the same row. Then the next row. On Nov 25, 2010 4:54 PM, Lars George lars.geo...@gmail.com wrote: Mkay, I will look into it more for the latter. But for the limit this is still confusing to me, as limit == batch and that is on the client side the number of rows, but not the number of columns. Does that mean if I had 100 columns and set batch to 10 that it would only return 10 rows with 10 columns, and not what I would have expected, i.e. 10 rows with all columns? Does this implicitly mean batch is also the intra-row batch size? Lars On Nov 25, 2010, at 21:53, Ryan Rawson ryano...@gmail.com wrote: limit is for retrieving partial results of a row. Ie: give me a row in chunks. Filters that want to operate on the entire row cannot be used with this mode. I forget why it's in the loop but there was a good reason at the time. -ryan On Thu, Nov 25, 2010 at 10:51 AM, Lars George lars.geo...@gmail.com wrote: Does hbase-dev still get forwarded? Did you see the below message? -- Forwarded message -- From: Lars George lars.geo...@gmail.com Date: Tue, Nov 23, 2010 at 4:25 PM Subject: HRegion.RegionScanner.nextInternal() To: hbase-...@hadoop.apache.org Hi, I am officially confused: byte [] nextRow; do { this.storeHeap.next(results, limit - results.size()); if (limit > 0 && results.size() == limit) { if (this.filter != null && filter.hasFilterRow()) throw new IncompatibleFilterException( "Filter with filterRow(List<KeyValue>) incompatible with scan with limit!"); return true; // we are expecting more yes, but also limited to how many we can return. } } while (Bytes.equals(currentRow, nextRow = peekRow())); This is from the nextInternal() call. Questions: a) Why is that check for the filter and limit both being set inside the loop? b) If limit is the batch size (which for a Get is -1, not 1 as I would have thought) then what does that limit - results.size() achieve? I mean, this loop gets all columns for a given row, so batch/limit should not be handled here, right? What if limit were set to 1 by the client? Then even if the Get had 3 columns to retrieve it would not be able to, since this limit makes it bail out. So there would be multiple calls to nextInternal() to complete what could be done in one loop? Eh? Lars
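Since batch vs. caching regularly trips people up, a short client-side sketch may help (assuming the 0.90 Scan API and a hypothetical table name): setCaching controls how many rows are shipped per RPC, while setBatch is the per-row 'limit' discussed above, i.e. how many cells of one row come back in each Result.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BatchVsCaching {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table
    Scan scan = new Scan();
    scan.setCaching(100); // rows fetched per RPC
    scan.setBatch(10);    // at most 10 cells of a row per Result; wide rows come back in chunks
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // with a batch set, several consecutive Results can belong to the same row
      System.out.println(r);
    }
    scanner.close();
    table.close();
  }
}

Note that, as discussed above, filters implementing filterRow(List<KeyValue>) cannot be combined with a batch/limit.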
Re: Review Request: delete followed by a put with the same timestamp
On 2010-11-26 14:54:45, Ryan Rawson wrote: trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java, line 1373 http://review.cloudera.org/r/1252/diff/1/?file=17712#file17712line1373 what are all the consequences for not sorting by type when using KVComparator? Does this mean we might create HFiles that not sorted properly, because the HFile comparator uses the KeyComparator directly with ignoreType = false. While in memstore we can rely on memstoreTS to roughly order by insertion time, and the Put/Delete should probably work in that situation, you are talking about modifiying a pretty core and important concept in how we sort things. There are other ways to reconcile bugs like this, one of them is to extend the memstoreTS concept into the HFile and use that to reconcile during reads. There is another JIRA where I proposed this. If we are talking about 0.92 and beyond I'd prefer building a solid base rather than dangerous hacks like this. Our unit tests are not extremely extensive, so while they might pass, that doesnt guarantee lack of bad behaviour later on. Pranav Khaitan wrote: Agree. As I mentioned, this is a major change and more thought needs to be given to it. However, to resolve issues like HBASE-3276, we need either such a change or extend the memstoreTS concept to HFile as you mentioned. About consequences, I don't see anything negative here. This change only affects the sorting of keys having same row, col, timestamp. After this change, all keys with the same row, col, ts will be sorted purely based on the order in which they were inserted. When a memstore is flushed to HFile, the memstoreTS takes care of ordering. During compactions, the KeyValueHeap breaks ties by using the sequence ids of storefiles. the problem is you are now changing how things are ordered sometimes but not all the time. HFile directly uses the rawcomparator, instantiating it directly rather than getting it via the code path you changed. So now you create a memstore in this order: row,col,100,Put (memstoreTS=1) row,col,100,Delete (memstoreTS=2) row,col,100,Put (memstoreTS=3) But the HFile comparator will consider this out of order since it doesnt know about memstoreTS and it still expects things to be in a certain order. I'm a little wary of having implicit ordering in the HFiles... in your new scheme, Put,Delete,Put are in that order 'just because they are', and the comparator cannot put them back in order, and must rely on scanner order. During compactions we would place keys in order based on which files they came from, but they wouldn't themselves have an order. Basically we should get rid of 'type sorting' and use memstoreTS sorting in memory and implicit sorting in the HFiles. - Ryan --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1252/#review1993 --- On 2010-11-26 07:47:02, Pranav Khaitan wrote: --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1252/ --- (Updated 2010-11-26 07:47:02) Review request for hbase, Jonathan Gray and Kannan Muthukkaruppan. Summary --- This is a design change suggested in HBASE-3276 so adequate thought should be given before proceeding. The main code change is just one line which is to ignore key type while doing KV comparisons. When the key type is ignored, then all the keys for the same timestamp are sorted according the order in which they were interested. 
It is still ensured that the delete family and delete column will be at the top because they have the default column name and default timestamp. This addresses bug HBASE-3276. http://issues.apache.org/jira/browse/HBASE-3276 Diffs - trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java 1039233 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/KeyValueScanFixture.java 1039233 trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestStoreScanner.java 1039233 Diff: http://review.cloudera.org/r/1252/diff Testing --- Test cases added. Since there is a change in semantics, some previous tests were failing because of this change. Those tests have been modified to test the newer behavior. Thanks, Pranav
Re: HRegion.RegionScanner.nextInternal()
limit is for retrieving partial results of a row. Ie: give me a row in chunks. Filters that want to operate on the entire row cannot be used with this mode. I forget why it's in the loop but there was a good reason at the time. -ryan On Thu, Nov 25, 2010 at 10:51 AM, Lars George lars.geo...@gmail.com wrote: Does hbase-dev still get forwarded? Did you see the below message? -- Forwarded message -- From: Lars George lars.geo...@gmail.com Date: Tue, Nov 23, 2010 at 4:25 PM Subject: HRegion.RegionScanner.nextInternal() To: hbase-...@hadoop.apache.org Hi, I am officially confused: byte [] nextRow; do { this.storeHeap.next(results, limit - results.size()); if (limit > 0 && results.size() == limit) { if (this.filter != null && filter.hasFilterRow()) throw new IncompatibleFilterException( "Filter with filterRow(List<KeyValue>) incompatible with scan with limit!"); return true; // we are expecting more yes, but also limited to how many we can return. } } while (Bytes.equals(currentRow, nextRow = peekRow())); This is from the nextInternal() call. Questions: a) Why is that check for the filter and limit both being set inside the loop? b) If limit is the batch size (which for a Get is -1, not 1 as I would have thought) then what does that limit - results.size() achieve? I mean, this loop gets all columns for a given row, so batch/limit should not be handled here, right? What if limit were set to 1 by the client? Then even if the Get had 3 columns to retrieve it would not be able to, since this limit makes it bail out. So there would be multiple calls to nextInternal() to complete what could be done in one loop? Eh? Lars
Re: code review: HBASE-3251
Please include a diff instead; it's hard to compare otherwise. Also I'm not sure there will be a 0.20.7. -ryan On Wed, Nov 24, 2010 at 3:45 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I wanted to automate the manual deletion of dangling row(s) in the .META. table. Please kindly comment on the following modification to HMaster.createTable() which is based on the 0.20.6 codebase: long scannerid = srvr.openScanner(metaRegionName, scan); try { HashSet<byte[]> regions = new HashSet<byte[]>(); boolean cleanTable = false, exists = false; // cleanTable: whether the table has a row in .META. whose start key is empty Result data = srvr.next(scannerid); while (data != null) { if (data != null && data.size() > 0) { HRegionInfo info = Writables.getHRegionInfo( data.getValue(CATALOG_FAMILY, REGIONINFO_QUALIFIER)); if (info.getTableDesc().getNameAsString().equals(tableName)) { exists = true; if (info.getStartKey().length == 0) { cleanTable = true; } else { regions.add(info.getRegionName()); } } } data = srvr.next(scannerid); } if (exists) { if (!cleanTable) { HTable meta = new HTable(HConstants.META_TABLE_NAME); for (byte[] region : regions) { Delete d = new Delete(region); meta.delete(d); LOG.info("dangling row " + Bytes.toString(region) + " deleted from .META."); } } else { // A region for this table already exists. Ergo table exists. throw new TableExistsException(tableName); } } } finally { srvr.close(scannerid); } Thanks
Re: How to put() and get() when setAutoFlush(false)?
Hi, You could implement this in a code structure like so: HTable table = new HTable(conf, tableName); Put lastPut = null; while ( moreData ) { Put put = makeNewPutBasedOnLastPutToo( lastPut, dataSource ); table.put(put); lastPut = put; dataSource.next(); } If that is unsatisfactory you may access the write buffer via HTable.getWriteBuffer(). -ryan On Mon, Nov 22, 2010 at 5:41 PM, Xin Wang and...@gmail.com wrote: Hello everyone, I am a beginner to HBase. I want to load a data file of 2 million lines into an HBase table. I want to load the data as fast as possible, so I called HTable.setAutoFlush(false) at the beginning. However, when I HTable.put() a row and then HTable.get() the same row, the result is empty. I know this is because setAutoFlush(false) makes put() write into the buffer. But the algorithm in my loading process requires reading the value of the previous row that was just put into the HTable cell. I have tried setAutoFlush(true); the previous value can then be read, but the loading process is slowed down by about an order of magnitude. Can I get() the value directly from the write buffer? Are there any other solutions to this problem that I do not know of? Thank you in advance! Best regards, Xin Wang
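For completeness, the same pattern written out as a more self-contained sketch (assuming the 0.90 client API, a hypothetical table/family/qualifier, and an entirely hypothetical tab-separated input file): the value just written is kept in application memory, so there is no need to get() it back through the write buffer.

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table
    table.setAutoFlush(false);                    // buffer puts client-side for load speed

    byte[] fam = Bytes.toBytes("f");
    byte[] qual = Bytes.toBytes("q");
    byte[] lastValue = null;                      // what the algorithm would otherwise get() back

    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t", 2);       // hypothetical "rowkey<TAB>value" line format
      byte[] value = Bytes.toBytes(parts[1]);
      // derive the new value from the previous one held in memory, not from a get()
      byte[] toWrite = (lastValue == null) ? value : Bytes.add(lastValue, value);
      Put put = new Put(Bytes.toBytes(parts[0]));
      put.add(fam, qual, toWrite);
      table.put(put);                             // lands in the client write buffer
      lastValue = toWrite;
    }
    in.close();
    table.flushCommits();                         // flush any remaining buffered puts
    table.close();
  }
}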
Re: ANN: hbase 0.90.0 Release Candidate 0 available for download
I concur. Next week? On Wed, Nov 17, 2010 at 4:39 PM, Stack st...@duboce.net wrote: Good one. Want to make an issue J-D? Seems like this RC is sunk going by issues filed against it. If its OK w/ you all lets let this RC hang out there a little longer to see if the RC catches more bad bugs before we cut a new RC? St.Ack On Wed, Nov 17, 2010 at 6:28 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Currently both trunk and 0.90's pom.xml are incomplete, we were relying on Ryan's repo have the thrift pom but now that it was changed to Stack's new comers cannot compile the project since that pom is missing. Reported by kzk9 on IRC. So either we had Ryan's repo back in the pom, or Stack copies the files to his own repo, or we add a FB's repo that has it. J-D On Tue, Nov 16, 2010 at 3:38 PM, Stack st...@duboce.net wrote: Agreed. Keep testing and keep the sinkers coming in so its the more likely that the next RC we put out graduates. Make sure issues are filed against 0.90.0. Good stuff, St.Ack On Tue, Nov 16, 2010 at 5:27 PM, Todd Lipcon t...@cloudera.com wrote: The web UI split and compact buttons are currently not hooked up - filed last night, will try to get a patch done today. The good news is I ran some YCSB tests and on the whole performance is much improved! I agree, let's keep going with this rc until people stop finding new issues, or we reach something that blocks further testing. -Todd On Tue, Nov 16, 2010 at 7:57 AM, Gary Helmling ghelml...@gmail.com wrote: -1 on RC I opened HBASE-3235 for an issue with ICVs that should also sink the RC. When a put and subsequent ICV go in with the same timestamp for the same row/family/qualifier, the initial put masks the ICV, effectively causing it to disappear. There's a fix up on review board. We may want to give a couple more days for any other issues to shake out as well? Gary On Tue, Nov 16, 2010 at 4:53 AM, Mathias Herberts mathias.herbe...@gmail.com wrote: Hi, I just filed HBASE-3238 which appears to me as a blocker as HBase won't start if its zookeeper.parent.znode exists and HBase does not have the CREATE permission on this znode's parent znode. Mathias. -- Todd Lipcon Software Engineer, Cloudera
Re: ANN: hbase 0.90.0 Release Candidate 0 available for download
That is correct, those classes were deprecated in 0.20, and now gone in 0.90. Now you will want to use HTable and Result. Also Filter.getNextKeyHint() is an implementation detail, have a look at the other filters to get a sense of what it does. On Mon, Nov 15, 2010 at 12:33 PM, Ted Yu yuzhih...@gmail.com wrote: Just a few findings when I tried to compile our 0.20.6 based code with this new release: HConstants is final class now instead of interface RowFilterInterface is gone org.apache.hadoop.hbase.io.Cell is gone org.apache.hadoop.hbase.io.RowResult is gone constructor HColumnDescriptor(byte[],int,java.lang.String,boolean,boolean,int,boolean) is gone Put.setTimeStamp() is gone org.apache.hadoop.hbase.filter.Filter has added getNextKeyHint(org.apache.hadoop.hbase.KeyValue) If you know the alternative to some of the old classes, please share. On Mon, Nov 15, 2010 at 2:51 AM, Stack st...@duboce.net wrote: The first hbase 0.90.0 release candidate is available for download: http://people.apache.org/~stack/hbase-0.90.0-candidate-0/http://people.apache.org/%7Estack/hbase-0.90.0-candidate-0/ HBase 0.90.0 is the major HBase release that follows 0.20.0 and the fruit of the 0.89.x development release series we've been running of late. More than 920 issues have been closed since 0.20.0. Release notes are available here: http://su.pr/8LbgvK. HBase 0.90.0 runs on Hadoop 0.20.x. It does not currently run on Hadoop 0.21.0. HBase will lose data unless it is running on an Hadoop HDFS 0.20.x that has a durable sync. Currently only the branch-0.20-append branch [1] has this attribute. No official releases have been made from this branch as yet so you will have to build your own Hadoop from the tip of this branch or install Cloudera's CDH3 [2] (Its currently in beta). CDH3b2 or CDHb3 have the 0.20-append patches needed to add a durable sync. See CHANGES.txt [3] in branch-0.20-append to see list of patches involved. There is no migration necessary. Your data written with HBase 0.20.x (or with HBase 0.89.x) is readable by HBase 0.90.0. A shutdown and restart after putting in place the new HBase should be all thats involved. That said, once done, there is no going back to 0.20.x once the transition has been made. HBase 0.90.0 and HBase 0.89.x write region names differently in the filesystem. Rolling restart from 0.20.x or 0.89.x to 0.90.0RC0 will not work. Should we release this candidate as hbase 0.90.0? Take it for a spin. Check out the doc. Vote +1/-1 by November 22nd. Yours, The HBasistas P.S. For why the version 0.90 and whats new in HBase 0.90, see slides 4-10 in this deck [4] 1. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append 2. http://archive.cloudera.com/docs/ 3. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt 4. http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
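For the Cell and RowResult removals specifically, the replacements are Get and Result on HTable; below is a minimal sketch of the 0.90-style read path (assuming a hypothetical table and column, since the original 0.20.6 code isn't shown). Put.setTimeStamp() is likewise covered by passing the timestamp to the Put constructor or to Put.add().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadWithResult {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");            // hypothetical table
    Get get = new Get(Bytes.toBytes("row1"));
    get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"));
    Result result = table.get(get);                        // Result replaces the old RowResult
    byte[] value = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q")); // replaces Cell.getValue()
    System.out.println(value == null ? "not found" : Bytes.toString(value));
    table.close();
  }
}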