Thanks Talat... I shoved some comments up in it, but it looks basically sound. Thanks for sending it in.

St.Ack
On Fri, Mar 25, 2016 at 11:09 AM, Talat Uyarer <[email protected]> wrote:
> Hi all,
>
> I created my GSoC proposal for Block Encoding and Compression for the RPC
> Layer [1]. If you can review it and share your comments, I would appreciate it.
>
> [1] https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
> [2] https://issues.apache.org/jira/browse/HBASE-15530
>
> Thanks
>
> On Tue, Mar 22, 2016 at 6:44 PM, Talat Uyarer <[email protected]> wrote:
> > Hi,
> >
> > I would be glad to have you as a mentor, Stack :) As far as I know, the ASF
> > already participates and you can sign up [1]. Last year I was a mentor. I
> > just sent an email to private@ and [email protected]. Would you
> > like to check it?
> >
> > [1] https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
> >
> > 2016-03-22 17:32 GMT-07:00 Enis Söztutar <[email protected]>:
> >>>
> >>> I didn't sign up for GSoC, Talat. Not sure anyone else did either. Is it
> >>> too late for us to participate now?
> >>>
> >> ASF participates in GSoC, so HBase can automatically participate AFAIK.
> >>
> >>> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
> >>> mentor signup deadline.
> >>>
> >> I did not check the deadline. If that is the case, does it mean this year
> >> is over?
> >>
> >> Your list is pretty good. We can PoC with Cap'n Proto as well as grpc.
> >>
> >>> > BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> >>> > are:
> >>> > - He mentioned that the data blocks are stored with PREFIX, FAST_DIFF,
> >>> > etc. encodings, but these encodings can only be used in the HFile
> >>> > context. In RPC and the WAL we use KeyValueEncoding for cell blocks. He
> >>> > told me, "You can improve them or use the HFile encodings in RPC and
> >>> > the WAL." (He didn't say the issue number, but I guessed it is
> >>> > HBASE-12883, "Support block encoding based on knowing set of column
> >>> > qualifiers up front".) (A toy prefix-encoding sketch is appended below
> >>> > the quoted thread.)
> >>>
> >>> Sounds like a fine project (someone was just asking about this offline...)
> >>>
> >>> > - HBASE-14379 Replication V2
> >>> > - HBASE-8691 High-Throughput Streaming Scan API
> >>> > - HBASE-3529 Native Solr Indexer for HBase (he just mentioned HBase ->
> >>> >   Solr indexing; I guess it could be this issue.)
> >>> >
> >>> > Could you help me select a topic, or could you suggest another issue?
> >>>
> >>> All of the above are good.
> >>>
> >>> Here are a few others made for another context:
> >>>
> >>> + Become a Jepsen distributed-systems test tool expert: run it against
> >>>   HBase and HDFS. Analyze the results. E.g. see
> >>>   https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
> >>> + Deep dive on HBase compactions. Own it. Review the current options: the
> >>>   defaults, the experimental, and the stale. Build tooling and surface
> >>>   metrics that give better insight into the effectiveness of compaction
> >>>   mechanics and policies. Develop tunings and alternate, new policies. For
> >>>   further credit, develop a master-orchestrated compaction algorithm.
> >>> + Reimplement HBase append and increment as write-only with rollup on
> >>>   read, or using CRDTs
> >>>   (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type). (A
> >>>   rollup-on-read sketch is appended below the quoted thread.)
> >>> + Make the HBase server async/event-driven/SEDA, moving it off its
> >>>   current thread-per-request basis.
> >>> + UI: build out more pages and tabs on the HBase master exposing more of
> >>>   our cluster metrics (make the master into a metrics sink). Extra points
> >>>   for views, histograms, or dashboards that are both informative AND
> >>>   pretty (D3, etc.). A good benchmark would be subsuming the Hannibal
> >>>   tool: https://github.com/sentric/hannibal
> >>> + Build an example application on HBase for test and illustration: e.g.
> >>>   use Jimmy Lin's/The Internet Archive's https://github.com/lintool/warcbase
> >>>   to load Common Crawl regular webcrawls (https://commoncrawl.org/), or
> >>>   load HBase with Wikipedia, the Flickr dataset, or any dataset that
> >>>   appeals. Extra credit for documenting the steps involved and filing
> >>>   issues where the API is awkward or hard to follow.
> >>> + Add actionable statistics to HBase internals that capture vitals about
> >>>   the data being served and that we exploit when responding to queries;
> >>>   e.g. rough sizes of rows, column families, columns-per-row-per-region,
> >>>   etc. For example, if a client has been stepping sequentially through
> >>>   the data, the stats would let us recognize this state so we could
> >>>   switch to a different scan type, one that is optimal for a sequential
> >>>   progression.
> >>> + Review and redo our fundamental merge sort, the basis of our read.
> >>>   There are a few techniques to try, such as a "loser tree merge"
> >>>   (http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html), but ideally
> >>>   we'd make our merge sort block-based rather than Cell-based. Set
> >>>   yourself up in a rig and try different Cell formats to get yourself to
> >>>   a cache-friendly Cell format that maximizes instructions per cycle. (A
> >>>   toy k-way merge sketch is appended below the quoted thread.)
> >>> + Our client is heavyweight and has accumulated lots of logic over time.
> >>>   E.g. it is hard to set a single timeout for a request because the
> >>>   client is layered, each layer with its own running timeouts. At its
> >>>   core is a mostly-done async engine. Review and finish the async work.
> >>>   Rewrite where it makes sense after analysis.
> >>> + Our RPC is based on protobuf Service, where we plugged in our own RPC
> >>>   transport. An exploratory PoC putting HBase up on grpc was done by the
> >>>   grpc team. Bring this project home. Extra points if you reveal a
> >>>   streaming interface between client and server.
> >>> + Tiering... if regions are cold, close them so they don't occupy
> >>>   resources (close files, purge their data from cache...); reopen when a
> >>>   request comes in.
> >>> + Dynamic configuration of a running HBase.
> >>>
> >>> St.Ack
> >>>
> >>> > Thanks
> >>> > --
> >>> > Talat UYARER
> >
> > --
> > Talat UYARER
> > Websitesi: http://talat.uyarer.com
> > Twitter: http://twitter.com/talatuyarer
> > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>
> --
> Talat UYARER
> Websitesi: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
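For illustration, a few toy sketches of the quoted ideas follow. First, the cell-block encoding idea: HFile-style PREFIX encoding exploits the fact that sorted keys share long prefixes, and the same observation is what could shrink a cell block travelling over RPC or into the WAL. This is a hypothetical, standalone sketch in plain Java; none of the class or method names are HBase APIs, and the real work would be wiring an encoder into the cell-block codec path.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy common-prefix encoder over a sorted list of keys. Real HFile encodings
// (PREFIX, FAST_DIFF, ...) work at the byte level over whole KeyValues, but
// the space win comes from the same sharing of prefixes shown here.
public class PrefixEncodeSketch {

  // Encode each key as "<sharedPrefixLen>:<suffix relative to previous key>".
  static List<String> encode(List<String> sortedKeys) {
    List<String> out = new ArrayList<>();
    byte[] prev = new byte[0];
    for (String key : sortedKeys) {
      byte[] cur = key.getBytes(StandardCharsets.UTF_8);
      int shared = 0;
      while (shared < prev.length && shared < cur.length && prev[shared] == cur[shared]) {
        shared++;
      }
      out.add(shared + ":" + new String(cur, shared, cur.length - shared, StandardCharsets.UTF_8));
      prev = cur;
    }
    return out;
  }

  public static void main(String[] args) {
    // Sorted row/column keys sharing long prefixes shrink a lot when encoded.
    System.out.println(encode(Arrays.asList(
        "user123/colA", "user123/colB", "user124/colA")));
    // prints [0:user123/colA, 11:B, 6:4/colA]
  }
}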
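Second, the "write-only increment with rollup on read" idea from the append/increment item: each increment is appended as its own delta with no read-modify-write, and the read (or a compaction) sums the deltas. A minimal in-memory sketch under that assumption; a plain map stands in for the store and all names are hypothetical.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy "write-only increment, rollup on read" counter. A TreeMap stands in for
// the store; in HBase the deltas would be ordinary cells and the rollup would
// happen in the read path and/or at compaction time.
public class RollupCounter {

  // key -> list of delta cells, newest appended last
  private final NavigableMap<String, List<Long>> deltas = new TreeMap<>();

  // Increment is a blind append: no read-modify-write, so no row-lock round trip.
  public void increment(String key, long delta) {
    deltas.computeIfAbsent(key, k -> new ArrayList<Long>()).add(delta);
  }

  // Read rolls the accumulated deltas up into a single value.
  public long get(String key) {
    long sum = 0;
    for (long d : deltas.getOrDefault(key, Collections.<Long>emptyList())) {
      sum += d;
    }
    return sum;
  }

  // "Compaction": collapse the deltas back down to one cell.
  public void compact(String key) {
    long total = get(key);
    List<Long> collapsed = new ArrayList<>();
    collapsed.add(total);
    deltas.put(key, collapsed);
  }

  public static void main(String[] args) {
    RollupCounter c = new RollupCounter();
    c.increment("row1", 1);
    c.increment("row1", 5);
    System.out.println(c.get("row1")); // 6, summed at read time
    c.compact("row1");
    System.out.println(c.get("row1")); // still 6, now stored as one delta
  }
}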
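Finally, the merge-sort item: the read path is a k-way merge across sorted sources. The toy below uses a binary heap of per-source cursors; a loser tree keeps the same shape but replaces the heap with a tournament tree to cut comparisons, and the block-based variant would advance whole blocks of Cells instead of one Cell at a time. Strings stand in for Cells; the names are hypothetical, not HBase code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge across pre-sorted sources using a binary heap of cursors.
// A loser-tree merge does the same job with fewer comparisons per pop.
public class KWayMergeSketch {

  // One cursor per source: the iterator plus the element currently at its head.
  static final class Cursor {
    final Iterator<String> it;
    String current;
    Cursor(Iterator<String> it) { this.it = it; this.current = it.next(); }
  }

  static List<String> merge(List<List<String>> sortedSources) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current));
    for (List<String> src : sortedSources) {
      if (!src.isEmpty()) {
        heap.add(new Cursor(src.iterator()));
      }
    }
    List<String> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor c = heap.poll();      // smallest head across all sources
      out.add(c.current);
      if (c.it.hasNext()) {        // advance that source and re-seat its cursor
        c.current = c.it.next();
        heap.add(c);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(merge(Arrays.asList(
        Arrays.asList("a", "d", "g"),
        Arrays.asList("b", "e"),
        Arrays.asList("c", "f", "h"))));
    // prints [a, b, c, d, e, f, g, h]
  }
}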
