Re: [PROPOSAL] Apache DataSketches

2019-03-25 Thread leerho
I went ahead and performed the following searches based on the list someone
else provided.  Perhaps you can use this?

Note: the term "sketch" commonly refers to an artistic visualization or
drawing.
The use of the term "sketch" in the study of algorithms refers to a synopsis
of some larger set of data where the synopsis is approximate, simplified
(not all the detail), and can be executed quickly.  These properties are
shared with artistic sketches, but there the similarity ends. DataSketches
have nothing to do with visualization at all.

Search results.

https://github.com/search?o=desc=datasketches
Returned links are indirect references to our site or references to a site
about data art.

https://opensource.google.com/projects/search?q=datasketches
No hits

https://sourceforge.net/directory/os:mac/?q=datasketches
No hits

https://www.openhub.net/p?ref=homepage=datasketches
No hits

https://www.trademarkia.com
No hits: "data sketch", "data sketches", "data-sketch", "data-sketches",
"datasketch", or "datasketches".

https://trademarks.justia.com/search?q=datasketches
No hits: "data sketch", "data sketches", "data-sketch", "data-sketches",
"datasketch", or "datasketches".

http://tmsearch.uspto.gov/
No hits: "data sketch", "data sketches", "data-sketch", "data-sketches",
"datasketch", or "datasketches".

https://www.google.com/search?q=datasketches=datasketches
About 37,600 results; almost all are indirect references to our site or to
sites about artistic visual renderings of data. Searching for
"datasketches" (with quotes) returns a much smaller set (6800) that mostly
refers to our software.

https://en.wikipedia.org/wiki/datasketches
q: "datasketches": No hits
q: "data sketches" One hit: the common data science use of the pair of
words referring to sketching algorithms: "The different techniques can be
classified according to the data sketches they store."

https://stackoverflow.com/search?q=datasketches
2 hits that refer back to our software (Druid-datasketches is our software)
q:data sketches

https://www.linkedin.com/company/datasketches/about/
No hits

https://en.oxforddictionaries.com/search?filter=dictionary=datasketches
No hits

On Mon, Mar 25, 2019 at 1:36 PM Kenneth Knowles  wrote:

> The vote is passed to accept into the incubator. Since there is a cost to
> changing the name once infrastructure is set up, I suggest doing the name
> search immediately. There seemed to be some consensus to try to keep the
> DataSketches name. If there are no objections, I will file a
> PODLINGNAMESEARCH for this.
>
> Kenn
>
> On Tue, Feb 26, 2019 at 3:58 PM Liang Chen 
> wrote:
>
> > Hi Justin
> >
> > You are right, should be "Liang Chen", already updated it.
> >
> > Justin, could you please help to check my right to create new proposal on
> > incubator wiki at :
> > https://wiki.apache.org/incubator/ProjectProposals
> >
> > Regards
> > Liang
> >
> > Justin Mclean wrote
> > > Hi,
> > >
> > >> Currently only IPMC members can be official mentors, of the 3 people
> > >> listed here I believe only Jean-Baptiste Onofré is an IPMC member.
> > >
> > > Sorry, my apologies, Liang Chen is also an IPMC member, (Chen Liang,
> and
> > > presumedly a different person, is a committer but not an IPMC member)
> but
> > > I cannot find Gil Yehuda, do you mind provide a link to the roster for
> > > them or their Apache id?
> > >
> > > Thanks,
> > > Justin
> > > -
> > > To unsubscribe, e-mail:
> >
> > > general-unsubscribe@.apache
> >
> > > For additional commands, e-mail:
> >
> > > general-help@.apache
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-incubator-general.996316.n3.nabble.com/
> >
> > -
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>


Re: [PROPOSAL] Apache DataSketches

2019-03-25 Thread Kenneth Knowles
The vote is passed to accept into the incubator. Since there is a cost to
changing the name once infrastructure is set up, I suggest doing the name
search immediately. There seemed to be some consensus to try to keep the
DataSketches name. If there are no objections, I will file a
PODLINGNAMESEARCH for this.

Kenn

On Tue, Feb 26, 2019 at 3:58 PM Liang Chen  wrote:

> Hi Justin
>
> You are right, should be "Liang Chen", already updated it.
>
> Justin, could you please help to check my right to create new proposal on
> incubator wiki at :
> https://wiki.apache.org/incubator/ProjectProposals
>
> Regards
> Liang
>
> Justin Mclean wrote
> > Hi,
> >
> >> Currently only IPMC members can be official mentors, of the 3 people
> >> listed here I believe only Jean-Baptiste Onofré is an IPMC member.
> >
> > Sorry, my apologies, Liang Chen is also an IPMC member, (Chen Liang, and
> > presumedly a different person, is a committer but not an IPMC member) but
> > I cannot find Gil Yehuda, do you mind provide a link to the roster for
> > them or their Apache id?
> >
> > Thanks,
> > Justin
> > -
> > To unsubscribe, e-mail:
>
> > general-unsubscribe@.apache
>
> > For additional commands, e-mail:
>
> > general-help@.apache
>
>
>
>
>
> --
> Sent from: http://apache-incubator-general.996316.n3.nabble.com/
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Apache DataSketches

2019-02-26 Thread Liang Chen
Hi Justin

You are right, should be "Liang Chen", already updated it.

Justin, could you please help to check my right to create new proposal on
incubator wiki at :
https://wiki.apache.org/incubator/ProjectProposals

Regards
Liang

Justin Mclean wrote
> Hi,
> 
>> Currently only IPMC members can be official mentors, of the 3 people
>> listed here I believe only Jean-Baptiste Onofré is an IPMC member.
> 
> Sorry, my apologies, Liang Chen is also an IPMC member, (Chen Liang, and
> presumedly a different person, is a committer but not an IPMC member) but
> I cannot find Gil Yehuda, do you mind provide a link to the roster for
> them or their Apache id?
> 
> Thanks,
> Justin
> -
> To unsubscribe, e-mail: 

> general-unsubscribe@.apache

> For additional commands, e-mail: 

> general-help@.apache





--
Sent from: http://apache-incubator-general.996316.n3.nabble.com/




Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread Ted Dunning
Actually, I find that early drafts are easier to work on using gdocs. Once
things settle down, the wiki is a very good place.


On Sat, Feb 23, 2019 at 12:35 PM lee...@gmail.com  wrote:

>
>
> On 2019/02/23 18:54:57, leerho  wrote:
> > Forgive me I am a newbie, but there has got to be a better way to post a
> > document that everyone can see and allow it to be updated without having
> to
> > resend it as raw text.  I have an easier to read version of the proposal
> as
> > a Google doc where I could post the link, but I sense that that is a
> no-no
> > in this community.  Any suggestions?
>
>


Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread leerho



On 2019/02/23 18:54:57, leerho  wrote: 
> Forgive me I am a newbie, but there has got to be a better way to post a
> document that everyone can see and allow it to be updated without having to
> resend it as raw text.  I have an easier to read version of the proposal as
> a Google doc where I could post the link, but I sense that that is a no-no
> in this community.  Any suggestions?
> 
> Lee.
> 
> On Sat, Feb 23, 2019 at 4:07 AM Justin Mclean 
> wrote:
> 
> > Hi,
> >
> > > Currently only IPMC members can be official mentors, of the 3 people
> > listed here I believe only Jean-Baptiste Onofré is an IPMC member.
> >
> > Sorry, my apologies, Liang Chen is also an IPMC member, (Chen Liang, and
> > presumedly a different person, is a committer but not an IPMC member) but I
> > cannot find Gil Yehuda, do you mind provide a link to the roster for them
> > or their Apache id?
> >
> > Thanks,
> > Justin
> > -
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
> I got it.  It is the Incubator Wiki.  Now trying to get this proposal posted 
> there. 




Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread leerho
Forgive me, I am a newbie, but there has got to be a better way to post a
document that everyone can see and allow it to be updated without having to
resend it as raw text.  I have an easier to read version of the proposal as
a Google doc where I could post the link, but I sense that that is a no-no
in this community.  Any suggestions?

Lee.

On Sat, Feb 23, 2019 at 4:07 AM Justin Mclean 
wrote:

> Hi,
>
> > Currently only IPMC members can be official mentors, of the 3 people
> listed here I believe only Jean-Baptiste Onofré is an IPMC member.
>
> Sorry, my apologies, Liang Chen is also an IPMC member, (Chen Liang, and
> presumedly a different person, is a committer but not an IPMC member) but I
> cannot find Gil Yehuda, do you mind provide a link to the roster for them
> or their Apache id?
>
> Thanks,
> Justin
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread Justin Mclean
Hi,

> Currently only IPMC members can be official mentors, of the 3 people listed 
> here I believe only Jean-Baptiste Onofré is an IPMC member.

Sorry, my apologies, Liang Chen is also an IPMC member (Chen Liang,
presumably a different person, is a committer but not an IPMC member), but I
cannot find Gil Yehuda. Would you mind providing a link to the roster for them
or their Apache id?

Thanks,
Justin



Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread Furkan KAMACI
Hi,

If possible, I would like to contribute DataSketches as mentor too.

Kind Regards,
Furkan KAMACI

On Sat, Feb 23, 2019 at 2:59 PM Justin Mclean <
jus...@classsoftware.com> wrote:

> Hi,
>
> > === Nominated Mentors ===
> > (Recommended to me: )
> >
> > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> > dot org]
> > Jean-Baptiste Onofré, jb at nanthrax dot net
> > Gil Yehuda, gyehuda at verizonmedia dot com
>
> Currently only IPMC members can be official mentors, of the 3 people
> listed here I believe only Jean-Baptiste Onofré is an IPMC member.
>
> Thanks,
> Justin
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Apache DataSketches

2019-02-23 Thread Justin Mclean
Hi,

> === Nominated Mentors ===
> (Recommended to me: )
> 
> Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
> dot org]
> Jean-Baptiste Onofré, jb at nanthrax dot net
> Gil Yehuda, gyehuda at verizonmedia dot com

Currently only IPMC members can be official mentors, of the 3 people listed 
here I believe only Jean-Baptiste Onofré is an IPMC member.

Thanks,
Justin 



[PROPOSAL] Apache DataSketches

2019-02-23 Thread leerho



Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread Kenneth Knowles
Nice.

I would very much like to help mentor this project, though you already have
a couple good ones.

I concur with incubator as sponsoring entity.

Kenn (VP Apache Beam)

On Fri, Feb 22, 2019 at 9:45 PM leerho  wrote:

> I didn't realize that this mail list does not accept PDF files, apparently
> only text.  So let me try one more time ... :)  Please let me know if
> this works!
>
>
> = Apache DataSketches Proposal[1] =
>
> == Abstract ==
>
> DataSketches.GitHub.io is an open source, high-performance library of
> stochastic streaming algorithms commonly called "sketches" in the data
> sciences. Sketches are small, stateful programs that process massive data
> as a stream and can provide approximate answers, with mathematical
> guarantees, to computationally difficult queries orders-of-magnitude faster
> than traditional, exact methods.
>
> This proposal is to move DataSketches to the Apache Software
> Foundation (ASF), transferring ownership of its copyright intellectual
> property to the ASF.  Thereafter, DataSketches would be officially known as
> Apache DataSketches and its evolution and governance would come under the
> rules and guidance of the ASF.
>
> == Introduction ==
>
> The DataSketches library contains carefully crafted implementations of
> sketch algorithms that meet rigorous standards of quality and performance
> and provide capabilities required for large-scale production systems that
> must process and analyze massive data. The DataSketches core repository is
> written in Java with a parallel core repository written in C++ that
> includes Python wrappers. The DataSketches library also includes special
> repositories for extending the core library for Apache Hive and Apache Pig.
> The sketches developed in the different languages share a common binary
> storage format so that sketches created and stored in Java, for example,
> can be fully used in C++, and vice versa.  Because the stored sketch
> "images" are just a "blob" of bytes (similar to picture images), they can
> be shared across many different systems, languages and platforms.
>
> The DataSketches documentation website, https://datasketches.github.io ,
> includes general tutorials, a comprehensive research section with
> references to relevant academic papers, extensive examples for using the
> core library directly as well as examples for accessing the library in
> Hive, Pig, and Apache Spark.
>
> The DataSketches library also includes a characterization repository for
> long running test programs that are used for studying accuracy and
> performance of these sketches over wide ranges of input variables. The data
> produced by these programs is used for generating the many performance
> plots contained in the documentation website and for academic
> publications.
>
> The code repositories used for production are versioned and published to
> Maven Central on periodic intervals as the library evolves.
>
> The DataSketches library also includes several experimental repositories
> for use-cases outside the large-scale systems environments, such as
> sketches for mobile, IoT devices (Android), command-line access of the
> sketch library, and an experimental repository for vector-based sketches
> that performs approximate Singular Value Decomposition (SVD) analysis that
> could potentially be used in Machine Learning (ML) applications.
>
> == Background ==
>
> The DataSketches library was started in 2012 as an internal Yahoo project to
> dramatically reduce time and resources required for distinct (unique)
> counting.  An extensive search on the Internet at the time yielded a number
> of theoretical papers on stochastic streaming algorithms with pseudocode
> examples, but we did not find any usable open-source code of the quality we
> felt we needed for our internal production systems.  So we started a small
> project (one person) to develop our own sketches working directly from
> published theoretical papers.
>
> The DataSketches library was designed from the start with the objective of
> making these algorithms, usually only described in theoretical papers,
> easily accessible to systems developers for use in our internal production
> systems. By necessity, the code had to be of the highest quality and
> thoroughly tested. The wide variety of our internal production systems
> drove the requirement that the sketch implementations had to have an
> absolute minimum of external, run-time dependencies in order to simplify
> integration and troubleshooting.
>
> Our internal experiments demonstrated dramatic positive impact on the
> performance of our systems.  As a result, the DataSketches library quickly
> evolved to include different types of sketches for different types of
> queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> quantile/histogram algorithms, and weighted and unweighted sampling
> algorithms.
>
> We quickly discovered that developing these sketch algorithms to be truly
> robust in production environments is 

Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread leerho
I didn't realize that this mail list does not accept PDF files, apparently
only text.  So let me try one more time ... :)  Please let me know if
this works!


= Apache DataSketches Proposal[1] =

== Abstract ==

DataSketches.GitHub.io is an open source, high-performance library of
stochastic streaming algorithms commonly called "sketches" in the data
sciences. Sketches are small, stateful programs that process massive data
as a stream and can provide approximate answers, with mathematical
guarantees, to computationally difficult queries orders-of-magnitude faster
than traditional, exact methods.
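
To make the idea of a small, stateful, streaming program concrete, here is a
minimal illustrative sketch of approximate distinct counting using the
textbook K-Minimum-Values (KMV) technique. This is not the DataSketches
library's code or API: the class name KmvSketch, its methods, the choice of
SHA-256 as the hash, and the default k are all assumptions made for this
example. It keeps only the k smallest hash values seen so far, so memory
stays bounded no matter how long the stream is.

```python
import hashlib
import heapq


class KmvSketch:
    """Toy K-Minimum-Values sketch (hypothetical, for illustration only)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated values: a max-heap of the k smallest hashes
        self._members = set()  # hashes currently retained, to skip duplicates

    @staticmethod
    def _hash01(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash01(item)
        if h in self._members:
            return  # already retained; duplicates never change the state
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest
```

With this construction the relative standard error is roughly 1/sqrt(k), so
a larger k trades memory for accuracy.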

This proposal is to move DataSketches to the Apache Software
Foundation (ASF), transferring ownership of its copyright intellectual
property to the ASF.  Thereafter, DataSketches would be officially known as
Apache DataSketches and its evolution and governance would come under the
rules and guidance of the ASF.

== Introduction ==

The DataSketches library contains carefully crafted implementations of
sketch algorithms that meet rigorous standards of quality and performance
and provide capabilities required for large-scale production systems that
must process and analyze massive data. The DataSketches core repository is
written in Java with a parallel core repository written in C++ that
includes Python wrappers. The DataSketches library also includes special
repositories for extending the core library for Apache Hive and Apache Pig.
The sketches developed in the different languages share a common binary
storage format so that sketches created and stored in Java, for example,
can be fully used in C++, and vice versa.  Because the stored sketch
"images" are just a "blob" of bytes (similar to picture images), they can
be shared across many different systems, languages and platforms.
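
As a rough illustration of why a fixed byte layout makes a stored sketch
portable, the following example packs a sketch's state into a blob and reads
it back. The layout here (a 4-byte tag, two 32-bit fields, then 64-bit hash
values, all little-endian) and the names MAGIC, to_bytes and from_bytes are
hypothetical; this is not the actual DataSketches storage format. Any
language that can read fixed-width little-endian integers could decode the
same bytes.

```python
import struct

MAGIC = b"SKCH"          # hypothetical 4-byte tag, not the real preamble
HEADER_FMT = "<4sII"     # tag, nominal size k, number of retained hashes


def to_bytes(k, hashes):
    """Serialize k and a list of unsigned 64-bit hash values to a blob."""
    header = struct.pack(HEADER_FMT, MAGIC, k, len(hashes))
    body = struct.pack(f"<{len(hashes)}Q", *hashes)
    return header + body


def from_bytes(blob):
    """Rebuild (k, hashes) from a blob produced by to_bytes."""
    magic, k, count = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != MAGIC:
        raise ValueError("not a sketch image")
    offset = struct.calcsize(HEADER_FMT)
    hashes = list(struct.unpack_from(f"<{count}Q", blob, offset))
    return k, hashes
```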

The DataSketches documentation website, https://datasketches.github.io ,
includes general tutorials, a comprehensive research section with
references to relevant academic papers, extensive examples for using the
core library directly as well as examples for accessing the library in
Hive, Pig, and Apache Spark.

The DataSketches library also includes a characterization repository for
long-running test programs that are used for studying accuracy and
performance of these sketches over wide ranges of input variables. The data
produced by these programs is used for generating the many performance
plots contained in the documentation website and for academic
publications.

The code repositories used for production are versioned and published to
Maven Central at periodic intervals as the library evolves.

The DataSketches library also includes several experimental repositories
for use-cases outside the large-scale systems environments, such as
sketches for mobile, IoT devices (Android), command-line access of the
sketch library, and an experimental repository for vector-based sketches
that performs approximate Singular Value Decomposition (SVD) analysis that
could potentially be used in Machine Learning (ML) applications.

== Background ==

The DataSketches library was started in 2012 as an internal Yahoo project to
dramatically reduce time and resources required for distinct (unique)
counting.  An extensive search on the Internet at the time yielded a number
of theoretical papers on stochastic streaming algorithms with pseudocode
examples, but we did not find any usable open-source code of the quality we
felt we needed for our internal production systems.  So we started a small
project (one person) to develop our own sketches working directly from
published theoretical papers.

The DataSketches library was designed from the start with the objective of
making these algorithms, usually only described in theoretical papers,
easily accessible to systems developers for use in our internal production
systems. By necessity, the code had to be of the highest quality and
thoroughly tested. The wide variety of our internal production systems
drove the requirement that the sketch implementations had to have an
absolute minimum of external, run-time dependencies in order to simplify
integration and troubleshooting.

Our internal experiments demonstrated dramatic positive impact on the
performance of our systems.  As a result, the DataSketches library quickly
evolved to include different types of sketches for different types of
queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
quantile/histogram algorithms, and weighted and unweighted sampling
algorithms.

We quickly discovered that developing these sketch algorithms to be truly
robust in production environments is quite difficult and requires deep
understanding of the underlying mathematics and statistics as well as
extensive experience in developing high quality code for 24/7 production
systems. This is a difficult combination of skills for any one organization
to collect and maintain over time. It became clear that this technology
needed a community larger than Yahoo to evolve.  In 

Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread Kenneth Knowles
The subject line has me interested already. Follow examples like this maybe?

1.
https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
2.
https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E

Kenn

On Fri, Feb 22, 2019 at 8:05 PM leerho  wrote:

> I'll try again ... :)
>
> On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning  wrote:
>
>> It didn't make it again
>>
>> On Fri, Feb 22, 2019, 8:35 PM leerho  wrote:
>>
>> > I'm not sure the attached document made it through.
>> >
>> > On Fri, Feb 22, 2019 at 7:28 PM leerho  wrote:
>> >
>> > >
>> > >
>> >
>>
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org


Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread leerho
I'll try again ... :)

On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning  wrote:

> It didn't make it again
>
> On Fri, Feb 22, 2019, 8:35 PM leerho  wrote:
>
> > I'm not sure the attached document made it through.
> >
> > On Fri, Feb 22, 2019 at 7:28 PM leerho  wrote:
> >
> > >
> > >
> >
>


Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread Ted Dunning
It didn't make it again

On Fri, Feb 22, 2019, 8:35 PM leerho  wrote:

> I'm not sure the attached document made it through.
>
> On Fri, Feb 22, 2019 at 7:28 PM leerho  wrote:
>
> >
> >
>


Re: [PROPOSAL] Apache DataSketches

2019-02-22 Thread leerho
I'm not sure the attached document made it through.

On Fri, Feb 22, 2019 at 7:28 PM leerho  wrote:

>
>