AHeise commented on a change in pull request #16932:
URL: https://github.com/apache/flink/pull/16932#discussion_r693955740
##########
File path: docs/content/docs/connectors/datastream/hybridsource.md
##########
@@ -0,0 +1,101 @@
+---
+title: Hybrid Source
+weight: 8
+type: docs
+aliases:
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Hybrid Source
+
+`HybridSource` is a source that contains a list of concrete sources.
+It solves the problem of sequentially reading input from heterogeneous sources to produce a single input stream.
+
+For example, a bootstrap use case may need to read several days worth of bounded input from S3 before continuing with the latest unbounded input from Kafka.
+`HybridSource` switches from `FileSource` to `KafkaSource` when the bounded file input finishes.
+
+Prior to `HybridSource`, it was necessary to create a topology with multiple sources and define a switching mechanism in user land, which leads to operational complexity and inefficiency.
+
+With `HybridSource` the multiple sources appear as a single source in the Flink job graph and from `DataStream` API perspective.
+
+For more background see [FLIP-150](https://cwiki.apache.org/confluence/display/FLINK/FLIP-150%3A+Introduce+Hybrid+Source)
+
+To use the connector, add the ```flink-connector-base``` dependency to your project:
+
+{{< artifact flink-connector-base >}}
+
+(Typically comes as transitive dependency with concrete sources.)
+
+## Start position for next source
+
+To arrange multiple sources in a `HybridSource` each source typically needs to be assigned a
+start and end position (end position for bounded input for all but the final source).
+Details depend on the specific source and the external storage systems.

Review comment:
```suggestion
To arrange multiple sources in a `HybridSource`, all sources except the last one need to be bounded.
Therefore, the sources typically need to be assigned a start and end position.
The last source may be bounded in which case the `HybridSource` is bounded and unbounded otherwise.
Details depend on the specific source and the external storage systems.
```

##########
File path: docs/content/docs/connectors/datastream/hybridsource.md
##########
+# Hybrid Source
+
+`HybridSource` is a source that contains a list of concrete sources.

Review comment:
I'd not explicitly mention FLIP-27. Eventually source=FLIP-27, if we do a good enough job. But maybe we could link to https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/sources/?

##########
File path: docs/content/docs/connectors/datastream/hybridsource.md
##########
+Here we cover the most basic and then a more complex scenario, following the File/Kafka example.
+
+#### Fixed start position at graph construction time
+
+Example: Read till pre-determined switch time from files and then continue reading from Kafka.
+Each source covers an upfront known range and therefore the contained sources can be created upfront as if they were used directly:
+
+```java
+long switchTimestamp = t2; // derive from file input paths

Review comment:
```suggestion
long switchTimestamp = ...; // derive from file input paths
```

##########
File path: docs/content/docs/connectors/datastream/hybridsource.md
##########
+For example, a bootstrap use case may need to read several days worth of bounded input from S3 before continuing with the latest unbounded input from Kafka.
+`HybridSource` switches from `FileSource` to `KafkaSource` when the bounded file input finishes.

Review comment:
```suggestion
For example, a bootstrap use case may need to read several days worth of bounded input from S3 before continuing with the latest unbounded input from Kafka.
`HybridSource` switches from `FileSource` to `KafkaSource` when the bounded file input finishes without interrupting the application.
```
