Congrats Talat. You are our GSoC. We'll try and be nice (smile).

St.Ack

On Fri, Mar 25, 2016 at 1:52 PM, Stack <[email protected]> wrote:
> Thanks Talat... I shoved some comments up in it but it looks basically sound.
> Thanks for sending it in.
> St.Ack
>
> On Fri, Mar 25, 2016 at 11:09 AM, Talat Uyarer <[email protected]> wrote:
>
>> Hi all,
>>
>> I created my GSoC proposal for Block Encoding and Compression for the RPC
>> Layer [1]. I would appreciate it if you could review it and share your
>> comments.
>>
>> [1] https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
>> [2] https://issues.apache.org/jira/browse/HBASE-15530
>>
>> Thanks
>>
>> On Tue, Mar 22, 2016 at 6:44 PM, Talat Uyarer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'd appreciate having you as a mentor, Stack :) As far as I know, the ASF
>>> already participates and you can sign up [1]. Last year I was a mentor. I
>>> just sent an email to private@ and [email protected]. Would you
>>> like to check it?
>>>
>>> [1] https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>>>
>>> 2016-03-22 17:32 GMT-07:00 Enis Söztutar <[email protected]>:
>>>
>>>>> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it
>>>>> too late for us to participate now?
>>>>
>>>> ASF participates in GSOC, so HBase automatically can participate AFAIK.
>>>>
>>>>> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed
>>>>> the mentor signup deadline.
>>>>
>>>> I did not check the deadline; if that is the case, does it mean this
>>>> year is over?
>>>>
>>>> Your list is pretty good. We can PoC with Cap'n Proto as well as grpc.
>>>>
>>>>>> BTW I talked with Enis Soztutar. He offered some topics for GSoC.
>>>>>> These are:
>>>>>> - He mentioned that data blocks are stored with PREFIX, FAST_DIFF,
>>>>>> etc. encodings, but these encodings can only be used in the HFile
>>>>>> context. In RPC and WAL we use KeyValueEncoding for cell blocks. He
>>>>>> said "You can improve them or use the HFile encodings in RPC and WAL"
>>>>>> (he didn't give the issue number, but I guess it is HBASE-12883,
>>>>>> Support block encoding based on knowing set of column qualifiers up
>>>>>> front).
>>>>>
>>>>> Sounds like a fine project (Someone was just asking about this
>>>>> offline...)
>>>>>
>>>>>> - HBASE-14379 Replication V2
>>>>>> - HBASE-8691 High-Throughput Streaming Scan API
>>>>>> - HBASE-3529 Native Solr Indexer for HBase (he just mentioned
>>>>>> HBase -> Solr indexing; I guess it could be this issue)
>>>>>>
>>>>>> Could you help me select a topic, or could you suggest another issue?
>>>>>
>>>>> All of the above are good.
>>>>>
>>>>> Here's a few others made for another context:
>>>>>
>>>>> + Become a Jepsen distributed-systems test tool expert: run it against
>>>>> HBase and HDFS. Analyze results. E.g. see
>>>>> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
>>>>> + Deep dive on HBase compactions. Own it. Review the current options:
>>>>> the defaults, the experimental, and the stale. Build tooling and
>>>>> surface metrics that give better insight into the effectiveness of
>>>>> compaction mechanics and policies. Develop tunings and alternate, new
>>>>> policies. For further credit, develop a master-orchestrated compaction
>>>>> algorithm.
>>>>> + Reimplement HBase append and increment as write-only with rollup on
>>>>> read, or using CRDTs
>>>>> (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
>>>>> + Make the HBase server async/event-driven/SEDA, moving it off its
>>>>> current thread-per-request basis
>>>>> + UI: build out more pages and tabs on the HBase master exposing more
>>>>> of our cluster metrics (make the master into a metrics sink). Extra
>>>>> points for views, histograms, or dashboards that are both informative
>>>>> AND pretty (D3, etc.). A good benchmark would be subsuming the
>>>>> Hannibal tool https://github.com/sentric/hannibal
>>>>> + Build an example application on HBase for test and illustration:
>>>>> e.g. use Jimmy Lin's/The Internet Archive's
>>>>> https://github.com/lintool/warcbase to load Common Crawl regular
>>>>> webcrawls https://commoncrawl.org/, or load HBase with Wikipedia, the
>>>>> Flickr dataset, or any dataset that appeals. Extra credit for
>>>>> documenting the steps involved and filing issues where the API is
>>>>> awkward or hard to follow.
>>>>> + Add actionable statistics to HBase internals that capture vitals
>>>>> about the data being served and that we exploit when responding to
>>>>> queries; e.g. rough sizes of rows, column families,
>>>>> columns-per-row-per-region, etc. For example, if a client has been
>>>>> stepping sequentially through the data, the stats would allow us to
>>>>> recognize this state so we could switch to a different scan type, one
>>>>> that is optimal for a sequential progression.
>>>>> + Review and redo our fundamental merge sort, the basis of our read.
>>>>> There are a few techniques to try, such as a "loser tree merge"
>>>>> (http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html), but ideally
>>>>> we'd make our merge sort block-based rather than Cell-based. Set
>>>>> yourself up in a rig and try different Cell formats to get to a
>>>>> cache-friendly Cell format that maximizes instructions per cycle.
>>>>> + Our client is heavy-weight and has accumulated lots of logic over
>>>>> time. E.g. it is hard to set a single timeout for a request because
>>>>> the client is layered, each layer with its own running timeouts. At
>>>>> its core is a mostly-done async engine. Review and finish the async
>>>>> work. Rewrite where it makes sense after analysis.
>>>>> + Our RPC is based on protobuf Service, where we plugged in our own
>>>>> RPC transport.
>>>>> An exploratory PoC putting HBase up on grpc was done by the grpc
>>>>> team. Bring this project home. Extra points if you reveal a streaming
>>>>> interface between client and server.
>>>>> + Tiering... if regions are cold, close them so they don't occupy
>>>>> resources (close files, purge their data from cache...)... reopen when
>>>>> a request comes in...
>>>>> + Dynamic configuration of running HBase
>>>>>
>>>>> St.Ack
>>>>>
>>>>>> Thanks
>>>>>> --
>>>>>> Talat UYARER
>>>
>>> --
>>> Talat UYARER
>>> Website: http://talat.uyarer.com
>>> Twitter: http://twitter.com/talatuyarer
>>> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>>
>> --
>> Talat UYARER
>> Website: http://talat.uyarer.com
>> Twitter: http://twitter.com/talatuyarer
>> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
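[Editor's note: the cell-block encoding idea discussed above (HBASE-15530 / HBASE-12883) can be illustrated with a minimal PREFIX-style sketch. This is not HBase's actual `DataBlockEncoder` API; the class and method names are hypothetical, and real cell blocks carry binary KeyValues rather than strings. The point is only the technique: each key in a sorted run is stored as the length of the prefix it shares with the previous key plus its distinct suffix.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of PREFIX-style encoding applied to a sorted run of
// keys, as one might apply to an RPC cell block. Not an HBase API.
public class PrefixEncodingSketch {

    // Encode each key as "<shared-prefix-length>|<remaining suffix>".
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int common = commonPrefixLength(prev, key);
            out.add(common + "|" + key.substring(common));
            prev = key;
        }
        return out;
    }

    // Decode by re-applying each stored prefix length to the previous key.
    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int sep = entry.indexOf('|');
            int common = Integer.parseInt(entry.substring(0, sep));
            String key = prev.substring(0, common) + entry.substring(sep + 1);
            out.add(key);
            prev = key;
        }
        return out;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("row1/cf:a", "row1/cf:b", "row2/cf:a");
        List<String> enc = encode(keys);
        // Round trip must reproduce the original keys.
        if (!decode(enc).equals(keys)) throw new AssertionError(enc.toString());
        System.out.println(enc); // [0|row1/cf:a, 8|b, 3|2/cf:a]
    }
}
```

Adjacent cells in a sorted block share long row/family/qualifier prefixes, which is why PREFIX and FAST_DIFF pay off on HFile blocks and could plausibly pay off on RPC cell blocks too.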
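[Editor's note: the "reimplement increment as write-only with rollup on read, or using CRDTs" idea above can be sketched as a grow-only counter (G-Counter). All names here are hypothetical illustrations, not HBase APIs: each writer bumps only its own slot so writes never read-modify-write a shared total, the total is rolled up at read time, and replicas merge by taking the per-writer maximum (valid because increments are monotone).]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical G-Counter CRDT sketch for write-only increments with
// rollup on read. Not an HBase API.
public class GCounterSketch {
    private final Map<String, Long> slots = new HashMap<>();

    // Write path: touch only this writer's slot; no contention on a total.
    public void increment(String writerId, long delta) {
        slots.merge(writerId, delta, Long::sum);
    }

    // Read path: roll all slots up into the observed total.
    public long value() {
        return slots.values().stream().mapToLong(Long::longValue).sum();
    }

    // Replica merge: per-writer maximum, since each slot only ever grows.
    public void merge(GCounterSketch other) {
        other.slots.forEach((id, v) -> slots.merge(id, v, Math::max));
    }

    public static void main(String[] args) {
        GCounterSketch a = new GCounterSketch();
        GCounterSketch b = new GCounterSketch();
        a.increment("rs1", 3);
        b.increment("rs2", 4);
        a.merge(b);
        if (a.value() != 7) throw new AssertionError();
        System.out.println(a.value()); // 7
    }
}
```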
