Re: [discuss] Apache Gobblin Incubator Proposal

2017-02-15 Thread Olivier Lamy
Hi
Thanks for the offer, Jim!
I will add you as a mentor, then start the vote.
Cheers
Olivier

On 16 February 2017 at 02:35, Jim Jagielski wrote:

> If you need/want another mentor, I volunteer
>
> > On Feb 14, 2017, at 3:53 PM, Olivier Lamy wrote:
> >
> > Hi
> > Well, I don't see any issues, since no one has discussed the proposal.
> > So I will start the official vote tomorrow.
> > Cheers
> > Olivier

Re: [discuss] Apache Gobblin Incubator Proposal

2017-02-15 Thread Jean-Baptiste Onofré
Thanks for the offer, Jim.

Regards
JB

On Feb 15, 2017, at 11:37, Jim Jagielski wrote:
>If you need/want another mentor, I volunteer
>
>> On Feb 14, 2017, at 3:53 PM, Olivier Lamy wrote:
>>
>> Hi
>> Well, I don't see any issues, since no one has discussed the proposal.
>> So I will start the official vote tomorrow.
>> Cheers
>> Olivier

Re: [discuss] Apache Gobblin Incubator Proposal

2017-02-15 Thread Jim Jagielski
If you need/want another mentor, I volunteer

> On Feb 14, 2017, at 3:53 PM, Olivier Lamy wrote:
> 
> Hi
> Well, I don't see any issues, since no one has discussed the proposal.
> So I will start the official vote tomorrow.
> Cheers
> Olivier

Re: [discuss] Apache Gobblin Incubator Proposal

2017-02-14 Thread Olivier Lamy
Hi
Well, I don't see any issues, since no one has discussed the proposal.
So I will start the official vote tomorrow.
Cheers
Olivier

[discuss] Apache Gobblin Incubator Proposal

2017-02-05 Thread Olivier Lamy
Hello everyone,
I would like to submit a proposal to bring Gobblin to the Apache
Software Foundation.
The text of the proposal is included below and available as a draft here in
the Wiki: https://wiki.apache.org/incubator/GobblinProposal

We would appreciate any feedback and input.

Olivier on behalf of the Gobblin community


= Apache Gobblin Proposal =
== Abstract ==
Gobblin is a distributed data integration framework that simplifies common
aspects of big data integration, such as data ingestion, replication,
organization, and lifecycle management, for both streaming and batch data
ecosystems.

== Proposal ==

Gobblin is a universal data integration framework. It has been used to
build a variety of big data applications, such as ingestion, replication,
and data retention pipelines. The fundamental constructs provided by the
Gobblin framework are:

 1. An expandable set of connectors that allow data to be integrated
between a variety of sources and sinks. The set of connectors already
available in Gobblin is diverse and ever expanding. To highlight just a
few examples, connectors exist for databases (e.g., MySQL, Oracle,
Teradata, Couchbase), web-based technologies (REST APIs, FTP/SFTP servers,
filers), scalable storage (HDFS, S3, Ambry, etc.), streaming data (Kafka,
EventHubs, etc.), and a variety of proprietary data sources and sinks
(e.g., Salesforce, Google Analytics, Google Webmaster). Similarly, Gobblin
has a rich library of converters that transform data from one format to
another as it moves across system boundaries (e.g., Avro in HDFS to JSON
in another system).


 2. Gobblin has a well-defined and customizable state management layer
that allows writing stateful applications. This is particularly useful
when solving problems like bulk incremental ingestion and keeping several
replicated clusters in sync. The ability to record, in a scalable manner,
what work has been completed and what remains is critical to writing such
diverse applications successfully.


 3. Gobblin is agnostic to the underlying execution engine. It can be
tailored to run on top of a variety of execution frameworks, ranging from
multiple processes on a single node, to open source execution engines like
MapReduce, Spark, or Samza, to raw containers managed by YARN or Mesos, to
public clouds like Amazon AWS or Microsoft Azure. We are also extending
Gobblin to run on top of a self-managed cluster where security is vital.
This allows applications that require different degrees of scalability,
latency, or security to be customized for their specific needs. For
example, highly latency-sensitive applications can be executed in a
streaming environment, while batch-based execution might benefit
applications where the priority is optimal container utilization.

 4. Gobblin comes out of the box with several diagnosability features,
such as Gobblin metrics and error handling. Collectively, these features
allow Gobblin to operate at the scale of petabytes of data. To give just
one example, the ability to quarantine a few bad records from an isolated
Kafka topic, without halting the entire flow, is vital when the number of
Kafka topics runs into the thousands and the collective data handled is in
the petabytes. A minimal job configuration sketched after this list shows
how these four constructs surface to a user.
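
To make these constructs concrete, below is a minimal, illustrative job
definition in Gobblin's properties-file format, loosely following the
project's Kafka-to-HDFS quick-start conventions. The class names assume the
pre-Apache gobblin.* packages, and values such as broker addresses and
paths are placeholders; treat it as a sketch rather than a verified
configuration.

{{{
# Illustrative Kafka-to-HDFS ingestion job (all values are placeholders).
job.name=KafkaToHdfsExample
job.group=Examples

# Construct 1: connectors -- a Kafka source feeding an HDFS-bound writer.
kafka.brokers=localhost:9092
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher

# Construct 2: state management -- watermarks/offsets recording completed
# work are persisted in a state store between runs.
state.store.fs.uri=hdfs://namenode:8020
state.store.dir=/gobblin/state-store

# Construct 3: execution engine -- the same job definition can be launched
# in-process (LOCAL) or, by switching this key to MAPREDUCE, on a cluster.
launcher.type=LOCAL

# Construct 4: diagnosability -- file-based reporting of Gobblin metrics.
metrics.enabled=true
metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin/metrics
}}}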

Gobblin thus provides crisply defined software constructs that can be used
to build a vast array of data integration applications customizable for
varied user needs. It has become a preferred technology for data
integration use cases at many organizations worldwide (see a partial list
here).
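
As one illustration of these constructs in code, the sketch below shows a
custom converter, one of Gobblin's core Java interfaces, that turns Avro
records into JSON strings. It assumes the pre-Apache gobblin.* package
layout and the Converter<SI, SO, DI, DO> API; exact signatures may vary
between releases, so read it as a sketch, not a definitive implementation.

{{{
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;

/**
 * Sketch of a custom converter: Avro records in, JSON strings out. The
 * type parameters are <input schema, output schema, input record,
 * output record>.
 */
public class AvroToJsonStringConverter
    extends Converter<Schema, String, GenericRecord, String> {

  @Override
  public String convertSchema(Schema inputSchema, WorkUnitState workUnit) {
    // The JSON output carries no schema object; pass the Avro schema
    // along in its JSON text form.
    return inputSchema.toString();
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema,
      GenericRecord inputRecord, WorkUnitState workUnit) {
    // GenericRecord.toString() renders the record as JSON. A converter
    // may emit zero, one, or many output records per input record.
    return Collections.singleton(inputRecord.toString());
  }
}
}}}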

== Background ==

Over the last decade, data integration has evolved use case by use case
in most companies. For example, at LinkedIn, when Kafka became a
significant part of the data ecosystem, a system called Camus was built to
ingest this data into Hadoop for analytics processing. Similarly, we had
custom pipelines to ingest data from Salesforce, Oracle, and myriad other
sources. This pattern became the norm rather than the exception, and at
one point LinkedIn was running at least fifteen different types of
ingestion pipelines. This fragmentation has several unfortunate
implications. Operational costs scale with the number of pipelines, even
when those pipelines share a vast array of common features. Bug fixes and
performance optimizations cannot be shared across pipelines. A common set
of practices around debugging and deployment does not emerge. Each
pipeline operator continues to invest in their own little silo of the data
integration world, oblivious to the challenges of a fellow operator
sitting five tables down.

These experiences were the genesis of the design and implementation of
Gobblin. Gobblin thus started out as a universal data ingestion framework
focused on extracting, transforming, and synchronizing large volumes of
data between different data sources and sinks. Not surprisingly, given its
origins, the initial design