Re: [discuss] Apache Gobblin Incubator Proposal
Hi,

Thanks for the proposal, Jim! I will add you as a mentor, then start the vote.

Cheers,
Olivier

On 16 February 2017 at 02:35, Jim Jagielski wrote:
> If you need/want another mentor, I volunteer
Re: [discuss] Apache Gobblin Incubator Proposal
Thanks for the proposal, Jim.

Regards,
JB

On Feb 15, 2017, at 11:37, Jim Jagielski wrote:
> If you need/want another mentor, I volunteer
Re: [discuss] Apache Gobblin Incubator Proposal
If you need/want another mentor, I volunteer.

> On Feb 14, 2017, at 3:53 PM, Olivier Lamy wrote:
>
> Hi
> Well, I don't see issues, as no one has discussed the proposal.
> So I will start the official vote tomorrow.
> Cheers
> Olivier
Re: [discuss] Apache Gobblin Incubator Proposal
Hi,

Well, I don't see issues, as no one has discussed the proposal. So I will start the official vote tomorrow.

Cheers,
Olivier

On 6 February 2017 at 14:08, Olivier Lamy wrote:
> Hello everyone,
> I would like to submit to you a proposal to bring Gobblin to the Apache
> Software Foundation.
[discuss] Apache Gobblin Incubator Proposal
Hello everyone,

I would like to submit to you a proposal to bring Gobblin to the Apache Software Foundation. The text of the proposal is included below and available as a draft in the Wiki: https://wiki.apache.org/incubator/GobblinProposal

We would appreciate any feedback and input.

Olivier, on behalf of the Gobblin community

= Apache Gobblin Proposal =

== Abstract ==

Gobblin is a distributed data integration framework that simplifies common aspects of big data integration, such as data ingestion, replication, organization, and lifecycle management, for both streaming and batch data ecosystems.

== Proposal ==

Gobblin is a universal data integration framework. The framework has been used to build a variety of big data applications such as ingestion, replication, and data retention. The fundamental constructs provided by the Gobblin framework are:

1. An expandable set of connectors that allow data to be integrated from a variety of sources and sinks. The range of connectors already available in Gobblin is quite diverse and ever expanding. To highlight just a few examples, connectors exist for databases (e.g., MySQL, Oracle, Teradata, Couchbase), web-based technologies (REST APIs, FTP/SFTP servers, filers), scalable storage (HDFS, S3, Ambry, etc.), streaming data (Kafka, EventHubs, etc.), and a variety of proprietary data sources and sinks (e.g., Salesforce, Google Analytics, Google Webmaster). Similarly, Gobblin has a rich library of converters that allow for conversion of data from one format to another as data moves across system boundaries (e.g., Avro in HDFS to JSON in another system).

2. Gobblin has a well-defined and customizable state management layer that allows writing stateful applications. This is particularly useful when solving problems like bulk incremental ingest and keeping several clusters replicated in sync. The ability to record, in a scalable manner, what work has been completed and what remains is critical to writing such diverse applications successfully.

3. Gobblin is agnostic to the underlying execution engine. It can be tailored to run on top of a variety of execution frameworks, ranging from multiple processes on a single node, to open source execution engines like MapReduce, Spark, or Samza, natively on top of raw container managers like YARN or Mesos, and on public clouds like Amazon AWS or Microsoft Azure. We are extending Gobblin to run on top of a self-managed cluster when security is vital. This allows applications that require different degrees of scalability, latency, or security to be customized for their specific needs. For example, highly latency-sensitive applications can be executed in a streaming environment, while batch-based execution might benefit applications where the priority is optimal container utilization.

4. Gobblin comes out of the box with several diagnosability features, like Gobblin metrics and error handling. Collectively, these features allow Gobblin to operate at the scale of petabytes of data. To give just one example, the ability to quarantine a few bad records from an isolated Kafka topic, without stopping the entire flow from continued execution, is vital when the number of Kafka topics ranges in the thousands and the collective data handled is in the petabytes.

Gobblin thus provides crisply defined software constructs that can be used to build a vast array of data integration applications customizable for varied user needs. It has become a preferred technology for data integration use cases by many organizations worldwide (see a partial list here).

== Background ==

Over the last decade, data integration has evolved use case by use case in most companies. For example, at LinkedIn, when Kafka became a significant part of the data ecosystem, a system called Camus was built to ingest this data for analytics processing on Hadoop. Similarly, we had custom pipelines to ingest data from Salesforce, Oracle, and myriad other sources. This pattern became the norm rather than the exception, and at one point LinkedIn was running at least fifteen different types of ingestion pipelines. This fragmentation has several unfortunate implications. Operational costs scale with the number of pipelines, even if the myriad pipelines share a vast array of common features. Bug fixes and performance optimizations cannot be shared across the pipelines. A common set of practices around debugging and deployment does not emerge. Each pipeline operator continues to invest in their own silo of the data integration world, completely oblivious to the challenges of a fellow operator sitting five tables down. These experiences were the genesis behind the design and implementation of Gobblin. Gobblin thus started out as a universal data ingestion framework focused on extracting, transforming, and synchronizing large volumes of data between different data sources and sinks. Not surprisingly, given its origins, the initial design
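The proposal describes its constructs at the architectural level. As a purely illustrative sketch in Java (the interface and class names below are hypothetical, not Gobblin's actual API), the combination of a connector feeding records through a format converter, with bad records quarantined rather than failing the whole flow, might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the constructs described in the proposal:
 * a source produces records, a converter transforms them across a
 * format boundary, and malformed records are quarantined so the
 * rest of the flow keeps running. Names are illustrative only.
 */
public class PipelineSketch {

    /** A converter transforms one record format into another. */
    interface Converter<I, O> {
        O convert(I input) throws Exception;
    }

    /** Converts a two-field CSV record into a JSON-style string. */
    static class CsvToJsonConverter implements Converter<String, String> {
        @Override
        public String convert(String csv) throws Exception {
            String[] parts = csv.split(",");
            if (parts.length != 2) {
                throw new Exception("malformed record: " + csv);
            }
            return String.format("{\"key\":\"%s\",\"value\":\"%s\"}",
                                 parts[0], parts[1]);
        }
    }

    /** Runs source records through the converter, quarantining failures. */
    static List<String> run(List<String> source,
                            Converter<String, String> converter,
                            List<String> quarantine) {
        List<String> sink = new ArrayList<>();
        for (String record : source) {
            try {
                sink.add(converter.convert(record));
            } catch (Exception e) {
                // Isolate the bad record; do not stop the entire flow.
                quarantine.add(record);
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> quarantine = new ArrayList<>();
        List<String> out = run(List.of("a,1", "broken", "b,2"),
                               new CsvToJsonConverter(), quarantine);
        System.out.println(out);        // the two well-formed records, converted
        System.out.println(quarantine); // the one malformed record, isolated
    }
}
```

The design point this sketch tries to capture is the one the proposal emphasizes: the converter and the error-handling policy are pluggable pieces, so the same pipeline skeleton can be reused across many source/sink pairs instead of rewriting quarantine and conversion logic per pipeline.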