This is an automated email from the ASF dual-hosted git repository.

hxb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
commit f2a8270e26adb8a79d64294270a503fe8f34cc93
Author: huangxingbo <[email protected]>
AuthorDate: Wed Nov 9 19:13:14 2022 +0800

    [FLINK-29966][python][docs] Adds PyFlink API doc

    This closes #21280.
---
 flink-python/docs/index.rst                        |  38 +--
 flink-python/docs/{ => reference}/index.rst        |  36 +--
 .../pyflink.common/config.rst}                     |  63 +++--
 .../docs/{ => reference/pyflink.common}/index.rst  |  39 +--
 .../pyflink.common/job_info.rst}                   |  53 ++--
 .../pyflink.common/serializer.rst}                 |  43 ++-
 .../pyflink.common/typeinfo.rst}                   |  63 +++--
 .../reference/pyflink.datastream/checkpoint.rst    | 136 ++++++++++
 .../reference/pyflink.datastream/connectors.rst    | 229 ++++++++++++++++
 .../reference/pyflink.datastream/datastream.rst    | 278 ++++++++++++++++++++
 .../pyflink.datastream/formats.rst}                |  80 ++++--
 .../reference/pyflink.datastream/functions.rst     |  76 ++++++
 .../{ => reference/pyflink.datastream}/index.rst   |  43 ++-
 .../pyflink.datastream/sideoutput.rst}             |  43 ++-
 .../docs/reference/pyflink.datastream/state.rst    | 174 +++++++++++++
 .../stream_execution_environment.rst               | 135 ++++++++++
 .../pyflink.datastream/timer.rst}                  |  49 ++--
 .../docs/reference/pyflink.datastream/window.rst   |  85 ++++++
 .../docs/reference/pyflink.table/catalog.rst       | 238 +++++++++++++++++
 .../docs/reference/pyflink.table/data_types.rst    |  74 ++++++
 .../docs/reference/pyflink.table/descriptors.rst   | 142 ++++++++++
 .../docs/reference/pyflink.table/expressions.rst   | 288 +++++++++++++++++++++
 .../docs/{ => reference/pyflink.table}/index.rst   |  46 ++--
 .../docs/reference/pyflink.table/statement_set.rst |  81 ++++++
 .../docs/reference/pyflink.table/table.rst         | 205 +++++++++++++++
 .../reference/pyflink.table/table_environment.rst  | 272 +++++++++++++++++++
 flink-python/docs/reference/pyflink.table/udf.rst  |  95 +++++++
 .../docs/reference/pyflink.table/window.rst        | 131 ++++++++++
 28 files changed, 2940 insertions(+), 295 deletions(-)

diff --git a/flink-python/docs/index.rst b/flink-python/docs/index.rst
index 7a174fd75ab..23b29517a26 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/index.rst
@@ -16,35 +16,15 @@
    limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
+=============================
+Welcome to Flink Python Docs!
+=============================

-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
-
-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
-
-
-Core Classes:
----------------
-
-    :class:`pyflink.table.TableEnvironment`
-
-    Main entry point for Flink Table functionality.
+.. mdinclude:: ../README.md

-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
+.. toctree::
+    :maxdepth: 2
+    :hidden:

-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    reference/index
+    examples/index
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/index.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/index.rst
index 7a174fd75ab..9776d64718f 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/index.rst
@@ -16,35 +16,13 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
+=============
+API Reference
+=============

 .. toctree::
-    :maxdepth: 2
-    :caption: Contents
+    :maxdepth: 2

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
-
-
-Core Classes:
----------------
-
-    :class:`pyflink.table.TableEnvironment`
-
-    Main entry point for Flink Table functionality.
-
-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    pyflink.table/index
+    pyflink.datastream/index
+    pyflink.common/index
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.common/config.rst
similarity index 57%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.common/config.rst
index 7a174fd75ab..fa9611e409a 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.common/config.rst
@@ -16,35 +16,60 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+======
+Config
+======

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+Configuration
+-------------

+.. currentmodule:: pyflink.common.configuration

-Core Classes:
+.. autosummary::
+    :toctree: api/
+
+    Configuration
+
+.. currentmodule:: pyflink.common.config_options
+
+.. autosummary::
+    :toctree: api/
+
+    ConfigOptions
+    ConfigOption
+
+
+ExecutionConfig
 ---------------

-    :class:`pyflink.table.TableEnvironment`
+.. currentmodule:: pyflink.common.execution_config
+
+.. autosummary::
+    :toctree: api/

-    Main entry point for Flink Table functionality.
+    ExecutionConfig

-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+ExecutionMode
+-------------
+
+.. currentmodule:: pyflink.common.execution_mode
+
+.. autosummary::
+    :toctree: api/
+
+    ExecutionMode
+
+
+RestartStrategy
+---------------

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. currentmodule:: pyflink.common.restart_strategy

-    Main entry point for Flink DataStream functionality.
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.datastream.DataStream`
+    RestartStrategyConfiguration
+    RestartStrategies

-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.common/index.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.common/index.rst
index 7a174fd75ab..f663b0359ac 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.common/index.rst
@@ -16,35 +16,16 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
+==============
+PyFlink Common
+==============

-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
-
-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
-
-
-Core Classes:
----------------
-
-    :class:`pyflink.table.TableEnvironment`
-
-    Main entry point for Flink Table functionality.
+This page gives an overview of all public PyFlink Common APIs.

-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
+.. toctree::
+    :maxdepth: 2

-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    config
+    typeinfo
+    job_info
+    serializer
\ No newline at end of file
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.common/job_info.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.common/job_info.rst
index 7a174fd75ab..77e110146d2 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.common/job_info.rst
@@ -16,35 +16,48 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+===============
+Job Information
+===============

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+Job Client
+----------

+.. currentmodule:: pyflink.common.job_client

-Core Classes:
----------------
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.table.TableEnvironment`
+    JobClient.get_job_id
+    JobClient.get_job_status
+    JobClient.cancel
+    JobClient.stop_with_savepoint
+    JobClient.trigger_savepoint
+    JobClient.get_accumulators
+    JobClient.get_job_execution_result

-    Main entry point for Flink Table functionality.

-    :class:`pyflink.table.Table`
+Job Execution Result
+--------------------

-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+.. currentmodule:: pyflink.common.job_execution_result

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. autosummary::
+    :toctree: api/

-    Main entry point for Flink DataStream functionality.
+    JobExecutionResult.get_job_id
+    JobExecutionResult.get_net_runtime
+    JobExecutionResult.get_accumulator_result
+    JobExecutionResult.get_all_accumulator_results

-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+Job Status
+----------
+
+.. currentmodule:: pyflink.common.job_status
+
+.. autosummary::
+    :toctree: api/
+
+    JobStatus
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.common/serializer.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.common/serializer.rst
index 7a174fd75ab..0eb60a2d36e 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.common/serializer.rst
@@ -16,35 +16,34 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+==========
+Serializer
+==========

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+Serializer
+----------

+.. currentmodule:: pyflink.common.serializer

-Core Classes:
----------------
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.table.TableEnvironment`
+    TypeSerializer
+    VoidNamespaceSerializer

-    Main entry point for Flink Table functionality.

-    :class:`pyflink.table.Table`
+Serialization
+-------------

-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+.. currentmodule:: pyflink.common.serialization

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. autosummary::
+    :toctree: api/

-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    SerializationSchema
+    DeserializationSchema
+    SimpleStringSchema
+    Encoder
+    BulkWriterFactory
+    RowDataBulkWriterFactory
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.common/typeinfo.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.common/typeinfo.rst
index 7a174fd75ab..0d288db38a5 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.common/typeinfo.rst
@@ -16,35 +16,52 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+========
+TypeInfo
+========

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+TypeInfo
+--------

+.. currentmodule:: pyflink.common.typeinfo

-Core Classes:
----------------
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.table.TableEnvironment`
+    Types.STRING
+    Types.BYTE
+    Types.BOOLEAN
+    Types.SHORT
+    Types.INT
+    Types.LONG
+    Types.FLOAT
+    Types.DOUBLE
+    Types.CHAR
+    Types.BIG_INT
+    Types.BIG_DEC
+    Types.INSTANT
+    Types.SQL_DATE
+    Types.SQL_TIME
+    Types.SQL_TIMESTAMP
+    Types.PICKLED_BYTE_ARRAY
+    Types.ROW
+    Types.ROW_NAMED
+    Types.TUPLE
+    Types.PRIMITIVE_ARRAY
+    Types.BASIC_ARRAY
+    Types.OBJECT_ARRAY
+    Types.MAP
+    Types.LIST

-    Main entry point for Flink Table functionality.
+Row
+---

-    :class:`pyflink.table.Table`
+.. currentmodule:: pyflink.common.types

-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    Row
+    RowKind
diff --git a/flink-python/docs/reference/pyflink.datastream/checkpoint.rst b/flink-python/docs/reference/pyflink.datastream/checkpoint.rst
new file mode 100644
index 00000000000..23239b85775
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/checkpoint.rst
@@ -0,0 +1,136 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.
+    See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+
+==========
+Checkpoint
+==========
+
+CheckpointConfig
+----------------
+
+Configuration that captures all checkpointing related settings.
+
+:data:`DEFAULT_MODE`:
+
+The default checkpoint mode: exactly once.
+
+:data:`DEFAULT_TIMEOUT`:
+
+The default timeout of a checkpoint attempt: 10 minutes.
+
+:data:`DEFAULT_MIN_PAUSE_BETWEEN_CHECKPOINTS`:
+
+The default minimum pause to be made between checkpoints: none.
+
+:data:`DEFAULT_MAX_CONCURRENT_CHECKPOINTS`:
+
+The default limit of concurrently happening checkpoints: one.
+
+.. currentmodule:: pyflink.datastream.checkpoint_config
+
+.. autosummary::
+    :toctree: api/
+
+    CheckpointConfig.DEFAULT_MODE
+    CheckpointConfig.DEFAULT_TIMEOUT
+    CheckpointConfig.DEFAULT_MIN_PAUSE_BETWEEN_CHECKPOINTS
+    CheckpointConfig.DEFAULT_MAX_CONCURRENT_CHECKPOINTS
+    CheckpointConfig.is_checkpointing_enabled
+    CheckpointConfig.get_checkpointing_mode
+    CheckpointConfig.set_checkpointing_mode
+    CheckpointConfig.get_checkpoint_interval
+    CheckpointConfig.set_checkpoint_interval
+    CheckpointConfig.get_checkpoint_timeout
+    CheckpointConfig.set_checkpoint_timeout
+    CheckpointConfig.get_min_pause_between_checkpoints
+    CheckpointConfig.set_min_pause_between_checkpoints
+    CheckpointConfig.get_max_concurrent_checkpoints
+    CheckpointConfig.set_max_concurrent_checkpoints
+    CheckpointConfig.is_fail_on_checkpointing_errors
+    CheckpointConfig.set_fail_on_checkpointing_errors
+    CheckpointConfig.get_tolerable_checkpoint_failure_number
+    CheckpointConfig.set_tolerable_checkpoint_failure_number
+    CheckpointConfig.enable_externalized_checkpoints
+    CheckpointConfig.set_externalized_checkpoint_cleanup
+    CheckpointConfig.is_externalized_checkpoints_enabled
+    CheckpointConfig.get_externalized_checkpoint_cleanup
+    CheckpointConfig.is_unaligned_checkpoints_enabled
+    CheckpointConfig.enable_unaligned_checkpoints
+    CheckpointConfig.disable_unaligned_checkpoints
+    CheckpointConfig.set_alignment_timeout
+    CheckpointConfig.get_alignment_timeout
+    CheckpointConfig.set_force_unaligned_checkpoints
+    CheckpointConfig.is_force_unaligned_checkpoints
+    CheckpointConfig.set_checkpoint_storage
+    CheckpointConfig.set_checkpoint_storage_dir
+    CheckpointConfig.get_checkpoint_storage
+    ExternalizedCheckpointCleanup
+
+
+CheckpointStorage
+-----------------
+
+Checkpoint storage defines how :class:`StateBackend`'s store their state for fault-tolerance
+in streaming applications. Various implementations store their checkpoints in different fashions
+and have different requirements and availability guarantees.
+
+For example, :class:`JobManagerCheckpointStorage` stores checkpoints in the memory of the
+`JobManager`.
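+
+As a minimal sketch (assuming an existing ``StreamExecutionEnvironment`` named ``env``),
+checkpointing and a checkpoint storage can be wired up as follows:
+::
+
+    >>> from pyflink.datastream.checkpoint_storage import JobManagerCheckpointStorage
+    >>> env.enable_checkpointing(30000)  # trigger a checkpoint every 30 seconds
+    >>> env.get_checkpoint_config().set_checkpoint_storage(JobManagerCheckpointStorage())
+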
+The :class:`JobManagerCheckpointStorage` is lightweight and has no additional dependencies,
+but it is not scalable and only supports small state sizes. This checkpoint storage policy
+is convenient for local testing and development.
+
+:class:`FileSystemCheckpointStorage` stores checkpoints in a filesystem. For systems like HDFS,
+NFS drives, S3, and GCS, this storage policy supports large state sizes, on the order of many
+terabytes, while providing a highly available foundation for streaming applications. This
+checkpoint storage policy is recommended for most production deployments.
+
+**Raw Bytes Storage**
+
+The `CheckpointStorage` creates services for raw bytes storage.
+
+The raw bytes storage (through the CheckpointStreamFactory) is the fundamental service that
+simply stores bytes in a fault tolerant fashion. This service is used by the JobManager to
+store checkpoint and recovery metadata and is typically also used by the keyed and operator
+state backends to store checkpoint state.
+
+**Serializability**
+
+Implementations need to be serializable (`java.io.Serializable`), because they are distributed
+across parallel processes (for distributed execution) together with the streaming application
+code.
+
+Because of that, `CheckpointStorage` implementations are meant to be like *factories* that create
+the proper state stores that provide access to the persistent layer. That way, the storage
+policy can be very lightweight (contain only configurations), which makes it easier to be
+serializable.
+
+**Thread Safety**
+
+Checkpoint storage implementations have to be thread-safe. Multiple threads may be creating
+streams concurrently.
+
+.. currentmodule:: pyflink.datastream.checkpoint_storage
+
+.. autosummary::
+    :toctree: api/
+
+    JobManagerCheckpointStorage
+    FileSystemCheckpointStorage
+    CustomCheckpointStorage
diff --git a/flink-python/docs/reference/pyflink.datastream/connectors.rst b/flink-python/docs/reference/pyflink.datastream/connectors.rst
new file mode 100644
index 00000000000..c0bce856baf
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/connectors.rst
@@ -0,0 +1,229 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+
+==========
+Connectors
+==========
+
+File System
+===========
+
+File Source
+-----------
+
+.. currentmodule:: pyflink.datastream.connectors.file_system
+
+.. autosummary::
+    :toctree: api/
+
+    FileEnumeratorProvider
+    FileSplitAssignerProvider
+    StreamFormat
+    BulkFormat
+    FileSourceBuilder
+    FileSource
+
+
+File Sink
+---------
+
+.. currentmodule:: pyflink.datastream.connectors.file_system
+
+.. autosummary::
+    :toctree: api/
+
+    BucketAssigner
+    RollingPolicy
+    DefaultRollingPolicy
+    OnCheckpointRollingPolicy
+    OutputFileConfig
+    FileCompactStrategy
+    FileCompactor
+    FileSink
+    StreamingFileSink
+
+
+Number Sequence
+===============
+
+.. currentmodule:: pyflink.datastream.connectors.number_seq
+
+.. autosummary::
+    :toctree: api/
+
+    NumberSequenceSource
+
+
+Kafka
+=====
+
+Kafka Producer and Consumer
+---------------------------
+
+.. currentmodule:: pyflink.datastream.connectors.kafka
+
+.. autosummary::
+    :toctree: api/
+
+    FlinkKafkaConsumer
+    FlinkKafkaProducer
+    Semantic
+
+
+Kafka Source and Sink
+---------------------
+
+.. currentmodule:: pyflink.datastream.connectors.kafka
+
+.. autosummary::
+    :toctree: api/
+
+    KafkaSource
+    KafkaSourceBuilder
+    KafkaTopicPartition
+    KafkaOffsetResetStrategy
+    KafkaOffsetsInitializer
+    KafkaSink
+    KafkaSinkBuilder
+    KafkaRecordSerializationSchema
+    KafkaRecordSerializationSchemaBuilder
+    KafkaTopicSelector
+
+
+Kinesis
+=======
+
+Kinesis Source
+--------------
+
+.. currentmodule:: pyflink.datastream.connectors.kinesis
+
+.. autosummary::
+    :toctree: api/
+
+    KinesisShardAssigner
+    KinesisDeserializationSchema
+    WatermarkTracker
+    FlinkKinesisConsumer
+
+
+Kinesis Sink
+------------
+
+.. currentmodule:: pyflink.datastream.connectors.kinesis
+
+.. autosummary::
+    :toctree: api/
+
+    PartitionKeyGenerator
+    KinesisStreamsSink
+    KinesisStreamsSinkBuilder
+    KinesisFirehoseSink
+    KinesisFirehoseSinkBuilder
+
+
+Pulsar
+======
+
+Pulsar Source
+-------------
+
+.. currentmodule:: pyflink.datastream.connectors.pulsar
+
+.. autosummary::
+    :toctree: api/
+
+    PulsarDeserializationSchema
+    SubscriptionType
+    StartCursor
+    StopCursor
+    PulsarSource
+    PulsarSourceBuilder
+
+
+Pulsar Sink
+-----------
+
+.. currentmodule:: pyflink.datastream.connectors.pulsar
+
+.. autosummary::
+    :toctree: api/
+
+    PulsarSerializationSchema
+    TopicRoutingMode
+    MessageDelayer
+    PulsarSink
+    PulsarSinkBuilder
+
+
+Jdbc
+====
+
+.. currentmodule:: pyflink.datastream.connectors.jdbc
+
+.. autosummary::
+    :toctree: api/
+
+    JdbcSink
+    JdbcConnectionOptions
+    JdbcExecutionOptions
+
+
+RMQ
+===
+
+.. currentmodule:: pyflink.datastream.connectors.rabbitmq
+
+.. autosummary::
+    :toctree: api/
+
+    RMQConnectionConfig
+    RMQSource
+    RMQSink
+
+
+Elasticsearch
+=============
+
+.. currentmodule:: pyflink.datastream.connectors.elasticsearch
+
+.. autosummary::
+    :toctree: api/
+
+    FlushBackoffType
+    ElasticsearchEmitter
+    Elasticsearch6SinkBuilder
+    Elasticsearch7SinkBuilder
+    ElasticsearchSink
+
+
+Cassandra
+=========
+
+.. currentmodule:: pyflink.datastream.connectors.cassandra
+
+.. autosummary::
+    :toctree: api/
+
+    CassandraSink
+    ConsistencyLevel
+    MapperOptions
+    ClusterBuilder
+    CassandraCommitter
+    CassandraFailureHandler
diff --git a/flink-python/docs/reference/pyflink.datastream/datastream.rst b/flink-python/docs/reference/pyflink.datastream/datastream.rst
new file mode 100644
index 00000000000..57038189e67
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/datastream.rst
@@ -0,0 +1,278 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+==========
+DataStream
+==========
+
+DataStream
+----------
+
+A DataStream represents a stream of elements of the same type.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    DataStream.get_name
+    DataStream.name
+    DataStream.uid
+    DataStream.set_uid_hash
+    DataStream.set_parallelism
+    DataStream.set_max_parallelism
+    DataStream.get_type
+    DataStream.get_execution_environment
+    DataStream.force_non_parallel
+    DataStream.set_buffer_timeout
+    DataStream.start_new_chain
+    DataStream.disable_chaining
+    DataStream.slot_sharing_group
+    DataStream.set_description
+    DataStream.map
+    DataStream.flat_map
+    DataStream.key_by
+    DataStream.filter
+    DataStream.window_all
+    DataStream.union
+    DataStream.connect
+    DataStream.shuffle
+    DataStream.project
+    DataStream.rescale
+    DataStream.rebalance
+    DataStream.forward
+    DataStream.broadcast
+    DataStream.process
+    DataStream.assign_timestamps_and_watermarks
+    DataStream.partition_custom
+    DataStream.add_sink
+    DataStream.sink_to
+    DataStream.execute_and_collect
+    DataStream.print
+    DataStream.get_side_output
+    DataStream.cache
+
+
+DataStreamSink
+--------------
+
+A Stream Sink. This is used for emitting elements from a streaming topology.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    DataStreamSink.name
+    DataStreamSink.uid
+    DataStreamSink.set_uid_hash
+    DataStreamSink.set_parallelism
+    DataStreamSink.set_description
+    DataStreamSink.disable_chaining
+    DataStreamSink.slot_sharing_group
+
+
+KeyedStream
+-----------
+
+A KeyedStream represents a :class:`DataStream` on which operator state is partitioned by key,
+using a provided KeySelector.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    KeyedStream.map
+    KeyedStream.flat_map
+    KeyedStream.reduce
+    KeyedStream.filter
+    KeyedStream.sum
+    KeyedStream.min
+    KeyedStream.max
+    KeyedStream.min_by
+    KeyedStream.max_by
+    KeyedStream.add_sink
+    KeyedStream.key_by
+    KeyedStream.process
+    KeyedStream.window
+    KeyedStream.count_window
+    KeyedStream.union
+    KeyedStream.connect
+    KeyedStream.partition_custom
+    KeyedStream.print
+
+
+CachedDataStream
+----------------
+
+CachedDataStream represents a DataStream whose intermediate result will be cached the first
+time it is computed, so that later jobs using the same CachedDataStream can reuse the cached
+intermediate result instead of re-computing it.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    CachedDataStream.get_type
+    CachedDataStream.get_execution_environment
+    CachedDataStream.set_description
+    CachedDataStream.map
+    CachedDataStream.flat_map
+    CachedDataStream.key_by
+    CachedDataStream.filter
+    CachedDataStream.window_all
+    CachedDataStream.union
+    CachedDataStream.connect
+    CachedDataStream.shuffle
+    CachedDataStream.project
+    CachedDataStream.rescale
+    CachedDataStream.rebalance
+    CachedDataStream.forward
+    CachedDataStream.broadcast
+    CachedDataStream.process
+    CachedDataStream.assign_timestamps_and_watermarks
+    CachedDataStream.partition_custom
+    CachedDataStream.add_sink
+    CachedDataStream.sink_to
+    CachedDataStream.execute_and_collect
+    CachedDataStream.print
+    CachedDataStream.get_side_output
+    CachedDataStream.cache
+    CachedDataStream.invalidate
+
+
+WindowedStream
+--------------
+
+A WindowedStream represents a data stream where elements are grouped by key, and for each
+key, the stream of elements is split into windows based on a WindowAssigner. Window emission
+is triggered based on a Trigger.
+
+The windows are conceptually evaluated for each key individually, meaning windows can trigger
+at different points for each key.
+
+Note that the WindowedStream is purely an API construct; during runtime the WindowedStream will
+be collapsed together with the KeyedStream and the operation over the window into one single
+operation.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    WindowedStream.get_execution_environment
+    WindowedStream.get_input_type
+    WindowedStream.trigger
+    WindowedStream.allowed_lateness
+    WindowedStream.side_output_late_data
+    WindowedStream.reduce
+    WindowedStream.aggregate
+    WindowedStream.apply
+    WindowedStream.process
+
+
+AllWindowedStream
+-----------------
+
+An AllWindowedStream represents a data stream where the stream of elements is split into windows
+based on a WindowAssigner. Window emission is triggered based on a Trigger.
+
+If an Evictor is specified, it will be used to evict elements from the window after evaluation
+was triggered by the Trigger but before the actual evaluation of the window.
+When using an evictor, window performance will degrade significantly, since pre-aggregation of
+window results cannot be used.
+
+Note that the AllWindowedStream is purely an API construct; during runtime the AllWindowedStream
+will be collapsed together with the operation over the window into one single operation.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    AllWindowedStream.get_execution_environment
+    AllWindowedStream.get_input_type
+    AllWindowedStream.trigger
+    AllWindowedStream.allowed_lateness
+    AllWindowedStream.side_output_late_data
+    AllWindowedStream.reduce
+    AllWindowedStream.aggregate
+    AllWindowedStream.apply
+    AllWindowedStream.process
+
+
+ConnectedStreams
+----------------
+
+ConnectedStreams represent two connected streams of (possibly) different data types.
+Connected streams are useful for cases where operations on one stream directly
+affect the operations on the other stream, usually via shared state between the streams.
+
+An example for the use of connected streams would be to apply rules that change over time
+onto another stream. One of the connected streams has the rules, the other stream the
+elements to apply the rules to. The operation on the connected stream maintains the
+current set of rules in the state.
+It may receive either a rule update and update the state,
+or a data element and apply the rules in the state to the element.
+
+The connected stream can be conceptually viewed as a union stream of an Either type, that
+holds either the first stream's type or the second stream's type.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    ConnectedStreams.key_by
+    ConnectedStreams.map
+    ConnectedStreams.flat_map
+    ConnectedStreams.process
+
+
+BroadcastStream
+---------------
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    BroadcastStream
+
+
+BroadcastConnectedStream
+------------------------
+
+A BroadcastConnectedStream represents the result of connecting a keyed or non-keyed stream with
+a :class:`BroadcastStream` with :class:`~state.BroadcastState` (s). As in the case of
+:class:`ConnectedStreams`, these streams are useful for cases where operations on one stream
+directly affect the operations on the other stream, usually via shared state between the
+streams.
+
+An example for the use of such connected streams would be to apply rules that change over time
+onto another, possibly keyed stream. The stream with the broadcast state has the rules, and will
+store them in the broadcast state, while the other stream will contain the elements to apply the
+rules to. By broadcasting the rules, these will be available in all parallel instances, and can
+be applied to all partitions of the other stream.
+
+.. currentmodule:: pyflink.datastream.data_stream
+
+.. autosummary::
+    :toctree: api/
+
+    BroadcastConnectedStream.process
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.datastream/formats.rst
similarity index 50%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.datastream/formats.rst
index 7a174fd75ab..e1c19959943 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.datastream/formats.rst
@@ -16,35 +16,75 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+=======
+Formats
+=======

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+Avro
+----

+.. currentmodule:: pyflink.datastream.formats.avro

-Core Classes:
----------------
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.table.TableEnvironment`
+    AvroSchema
+    GenericRecordAvroTypeInfo
+    AvroInputFormat
+    AvroBulkWriters
+    AvroRowDeserializationSchema
+    AvroRowSerializationSchema

-    Main entry point for Flink Table functionality.

-    :class:`pyflink.table.Table`
+CSV
+---

-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+.. currentmodule:: pyflink.datastream.formats.csv

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. autosummary::
+    :toctree: api/

-    Main entry point for Flink DataStream functionality.
+    CsvSchema
+    CsvSchemaBuilder
+    CsvReaderFormat
+    CsvBulkWriters
+    CsvRowDeserializationSchema
+    CsvRowSerializationSchema

-    :class:`pyflink.datastream.DataStream`

-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+Json
+----
+
+.. currentmodule:: pyflink.datastream.formats.json
+
+.. autosummary::
+    :toctree: api/
+
+    JsonRowDeserializationSchema
+    JsonRowSerializationSchema
+
+
+Orc
+---
+
+.. currentmodule:: pyflink.datastream.formats.orc
+
+.. autosummary::
+    :toctree: api/
+
+    OrcBulkWriters
+
+
+Parquet
+-------
+
+.. currentmodule:: pyflink.datastream.formats.parquet
+
+.. autosummary::
+    :toctree: api/
+
+    AvroParquetReaders
+    AvroParquetWriters
+    ParquetColumnarRowInputFormat
+    ParquetBulkWriters
diff --git a/flink-python/docs/reference/pyflink.datastream/functions.rst b/flink-python/docs/reference/pyflink.datastream/functions.rst
new file mode 100644
index 00000000000..6337688c696
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/functions.rst
@@ -0,0 +1,76 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+
+=========
+Functions
+=========
+
+RuntimeContext
+--------------
+
+.. currentmodule:: pyflink.datastream.functions
+
+.. autosummary::
+    :toctree: api/
+
+    RuntimeContext.get_task_name
+    RuntimeContext.get_number_of_parallel_subtasks
+    RuntimeContext.get_max_number_of_parallel_subtasks
+    RuntimeContext.get_index_of_this_subtask
+    RuntimeContext.get_attempt_number
+    RuntimeContext.get_task_name_with_subtasks
+    RuntimeContext.get_job_parameter
+    RuntimeContext.get_metrics_group
+    RuntimeContext.get_state
+    RuntimeContext.get_list_state
+    RuntimeContext.get_map_state
+    RuntimeContext.get_reducing_state
+    RuntimeContext.get_aggregating_state
+
+
+Function
+--------
+
+All user-defined functions.
+
+.. currentmodule:: pyflink.datastream.functions
+
+.. autosummary::
+    :toctree: api/
+
+    MapFunction
+    CoMapFunction
+    FlatMapFunction
+    CoFlatMapFunction
+    ReduceFunction
+    AggregateFunction
+    ProcessFunction
+    KeyedProcessFunction
+    CoProcessFunction
+    KeyedCoProcessFunction
+    WindowFunction
+    AllWindowFunction
+    ProcessWindowFunction
+    ProcessAllWindowFunction
+    KeySelector
+    NullByteKeySelector
+    FilterFunction
+    Partitioner
+    BroadcastProcessFunction
+    KeyedBroadcastProcessFunction
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.datastream/index.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.datastream/index.rst
index 7a174fd75ab..f824a64956f 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.datastream/index.rst
@@ -16,35 +16,24 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
+==================
+PyFlink DataStream
+==================

-..
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
-
-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
-
-
-Core Classes:
----------------
-
-    :class:`pyflink.table.TableEnvironment`
+This page gives an overview of all public PyFlink DataStream APIs.

-    Main entry point for Flink Table functionality.
-
-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. toctree::
+    :maxdepth: 2

-    Main entry point for Flink DataStream functionality.
+    stream_execution_environment
+    datastream
+    functions
+    state
+    timer
+    window
+    checkpoint
+    sideoutput
+    connectors
+    formats

-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.datastream/sideoutput.rst
similarity index 55%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.datastream/sideoutput.rst
index 7a174fd75ab..b8b297dcede 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.datastream/sideoutput.rst
@@ -16,35 +16,24 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+============
+Side Outputs
+============

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+OutputTag
+---------

+An :class:`OutputTag` is a typed and named tag to use for tagging side outputs of an operator.

-Core Classes:
----------------
+Example:
+::

-    :class:`pyflink.table.TableEnvironment`
-
-    Main entry point for Flink Table functionality.
-
-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+    # Explicitly specify output type
+    >>> info = OutputTag("late-data", Types.TUPLE([Types.STRING(), Types.LONG()]))
+    # Implicitly wrap list to Types.ROW
+    >>> info_row = OutputTag("row", [Types.STRING(), Types.LONG()])
+    # Implicitly use pickle serialization
+    >>> info_side = OutputTag("side")
+    # ERROR: tag id cannot be empty string (extra requirement for Python API)
+    >>> info_error = OutputTag("")
diff --git a/flink-python/docs/reference/pyflink.datastream/state.rst b/flink-python/docs/reference/pyflink.datastream/state.rst
new file mode 100644
index 00000000000..ae8be7ea430
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/state.rst
@@ -0,0 +1,174 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+
+=====
+State
+=====
+
+OperatorStateStore
+------------------
+
+.. currentmodule:: pyflink.datastream.state
+
+.. autosummary::
+    :toctree: api/
+
+    OperatorStateStore.get_broadcast_state
+
+
+State
+-----
+
+.. currentmodule:: pyflink.datastream.state
+
+.. autosummary::
+    :toctree: api/
+
+    ValueState
+    AppendingState
+    MergingState
+    ReducingState
+    AggregatingState
+    ListState
+    MapState
+    ReadOnlyBroadcastState
+    BroadcastState
+
+
+StateDescriptor
+---------------
+
+.. currentmodule:: pyflink.datastream.state
+
+.. autosummary::
+    :toctree: api/
+
+    ValueStateDescriptor
+    ListStateDescriptor
+    MapStateDescriptor
+    ReducingStateDescriptor
+    AggregatingStateDescriptor
+
+
+StateTtlConfig
+--------------
+
+.. currentmodule:: pyflink.datastream.state
+
+.. autoclass:: pyflink.datastream.state::StateTtlConfig.UpdateType
+    :members:
+
+.. autoclass:: pyflink.datastream.state::StateTtlConfig.StateVisibility
+    :members:
+
+.. autoclass:: pyflink.datastream.state::StateTtlConfig.TtlTimeCharacteristic
+    :members:
+
+.. autoclass:: pyflink.datastream.state::StateTtlConfig.CleanupStrategies
+    :members:
+
+.. currentmodule:: pyflink.datastream.state
+
+.. autosummary::
+    :toctree: api/
+
+    StateTtlConfig.new_builder
+    StateTtlConfig.get_update_type
+    StateTtlConfig.get_state_visibility
+    StateTtlConfig.get_ttl
+    StateTtlConfig.get_ttl_time_characteristic
+    StateTtlConfig.is_enabled
+    StateTtlConfig.get_cleanup_strategies
+    StateTtlConfig.Builder.set_update_type
+    StateTtlConfig.Builder.update_ttl_on_create_and_write
+    StateTtlConfig.Builder.update_ttl_on_read_and_write
+    StateTtlConfig.Builder.set_state_visibility
+    StateTtlConfig.Builder.return_expired_if_not_cleaned_up
+    StateTtlConfig.Builder.never_return_expired
+    StateTtlConfig.Builder.set_ttl_time_characteristic
+    StateTtlConfig.Builder.use_processing_time
+    StateTtlConfig.Builder.cleanup_full_snapshot
+    StateTtlConfig.Builder.cleanup_incrementally
+    StateTtlConfig.Builder.cleanup_in_rocksdb_compact_filter
+    StateTtlConfig.Builder.disable_cleanup_in_background
+    StateTtlConfig.Builder.set_ttl
+    StateTtlConfig.Builder.build
+
+
+StateBackend
+------------
+
+A **State Backend** defines how the state of a streaming application is stored locally within
+the cluster. Different state backends store their state in different fashions, and use different
+data structures to hold the state of running applications.
+
+For example, the :class:`HashMapStateBackend` keeps working state in the memory of the
+TaskManager. The backend is lightweight and without additional dependencies.
+
+The :class:`EmbeddedRocksDBStateBackend` keeps working state in the memory of the TaskManager
+and stores state checkpoints in a filesystem (typically a replicated highly-available filesystem,
+like `HDFS <https://hadoop.apache.org/>`_, `Ceph <https://ceph.com/>`_,
+`S3 <https://aws.amazon.com/documentation/s3/>`_, `GCS <https://cloud.google.com/storage/>`_,
+etc).
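+
+As a minimal sketch (assuming an existing ``StreamExecutionEnvironment`` named ``env``),
+a state backend can be configured as follows:
+::
+
+    >>> from pyflink.datastream import EmbeddedRocksDBStateBackend
+    >>> env.set_state_backend(EmbeddedRocksDBStateBackend())
+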
+The :class:`EmbeddedRocksDBStateBackend` stores working state in an embedded
+`RocksDB <http://rocksdb.org/>`_ instance and is able to scale working state to many
+terabytes in size, only limited by available disk space across all task managers.
+
+**Raw Bytes Storage and Backends**
+
+The :class:`StateBackend` creates services for *raw bytes storage* and for *keyed state*
+and *operator state*.
+
+The `org.apache.flink.runtime.state.AbstractKeyedStateBackend` and
+`org.apache.flink.runtime.state.OperatorStateBackend` created by this state backend define how
+to hold the working state for keys and operators. They also define how to checkpoint that
+state, frequently using the raw bytes storage (via the
+`org.apache.flink.runtime.state.CheckpointStreamFactory`). However, it is also possible that,
+for example, a keyed state backend simply implements the bridge to a key/value store, and that
+it does not need to store anything in the raw byte storage upon a checkpoint.
+
+**Serializability**
+
+State Backends need to be serializable (`java.io.Serializable`), because they are distributed
+across parallel processes (for distributed execution) together with the streaming application
+code.
+
+Because of that, :class:`StateBackend` implementations are meant to be like *factories* that
+create the proper state stores that provide access to the persistent storage and hold the
+keyed and operator state data structures. That way, the State Backend can be very lightweight
+(contain only configurations), which makes it easier to be serializable.
+
+**Thread Safety**
+
+State backend implementations have to be thread-safe. Multiple threads may be creating
+streams and keyed-/operator state backends concurrently.
+
+.. currentmodule:: pyflink.datastream.state_backend
+
+.. autosummary::
+    :toctree: api/
+
+    HashMapStateBackend
+    EmbeddedRocksDBStateBackend
+    MemoryStateBackend
+    FsStateBackend
+    RocksDBStateBackend
+    CustomStateBackend
+    PredefinedOptions
diff --git a/flink-python/docs/reference/pyflink.datastream/stream_execution_environment.rst b/flink-python/docs/reference/pyflink.datastream/stream_execution_environment.rst
new file mode 100644
index 00000000000..3a7b16de9cd
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/stream_execution_environment.rst
@@ -0,0 +1,135 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+==========================
+StreamExecutionEnvironment
+==========================
+
+StreamExecutionEnvironment
+--------------------------
+
+The StreamExecutionEnvironment is the context in which a streaming program is executed.
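+
+As a minimal sketch of the typical entry point (source and sink setup is elided here),
+an environment is created, configured, and executed as follows:
+::
+
+    >>> env = StreamExecutionEnvironment.get_execution_environment()
+    >>> env.set_parallelism(2)
+    >>> ds = env.from_collection([(1, 'a'), (2, 'b')])
+    >>> ds.print()
+    >>> env.execute("my_job")
+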
+A *LocalStreamEnvironment* will cause execution in the attached JVM, while a
+*RemoteStreamEnvironment* will cause execution on a remote setup.
+
+The environment provides methods to control the job execution (such as setting the parallelism
+or the fault tolerance/checkpointing parameters) and to interact with the outside world (data
+access).
+
+.. currentmodule:: pyflink.datastream.stream_execution_environment
+
+.. autosummary::
+    :toctree: api/
+
+    StreamExecutionEnvironment.get_config
+    StreamExecutionEnvironment.set_parallelism
+    StreamExecutionEnvironment.set_max_parallelism
+    StreamExecutionEnvironment.register_slot_sharing_group
+    StreamExecutionEnvironment.get_parallelism
+    StreamExecutionEnvironment.get_max_parallelism
+    StreamExecutionEnvironment.set_runtime_mode
+    StreamExecutionEnvironment.set_buffer_timeout
+    StreamExecutionEnvironment.get_buffer_timeout
+    StreamExecutionEnvironment.disable_operator_chaining
+    StreamExecutionEnvironment.is_chaining_enabled
+    StreamExecutionEnvironment.get_checkpoint_config
+    StreamExecutionEnvironment.enable_checkpointing
+    StreamExecutionEnvironment.get_checkpoint_interval
+    StreamExecutionEnvironment.get_checkpointing_mode
+    StreamExecutionEnvironment.get_state_backend
+    StreamExecutionEnvironment.set_state_backend
+    StreamExecutionEnvironment.enable_changelog_state_backend
+    StreamExecutionEnvironment.is_changelog_state_backend_enabled
+    StreamExecutionEnvironment.set_default_savepoint_directory
+    StreamExecutionEnvironment.get_default_savepoint_directory
+    StreamExecutionEnvironment.set_restart_strategy
+    StreamExecutionEnvironment.get_restart_strategy
+    StreamExecutionEnvironment.add_default_kryo_serializer
+    StreamExecutionEnvironment.register_type_with_kryo_serializer
+    StreamExecutionEnvironment.register_type
+    StreamExecutionEnvironment.set_stream_time_characteristic
+    StreamExecutionEnvironment.get_stream_time_characteristic
+    StreamExecutionEnvironment.configure
+    StreamExecutionEnvironment.add_python_file
+    StreamExecutionEnvironment.set_python_requirements
+    StreamExecutionEnvironment.add_python_archive
+    StreamExecutionEnvironment.set_python_executable
+    StreamExecutionEnvironment.add_jars
+    StreamExecutionEnvironment.add_classpaths
+    StreamExecutionEnvironment.get_default_local_parallelism
+    StreamExecutionEnvironment.set_default_local_parallelism
+    StreamExecutionEnvironment.execute
+    StreamExecutionEnvironment.execute_async
+    StreamExecutionEnvironment.get_execution_plan
+    StreamExecutionEnvironment.register_cached_file
+    StreamExecutionEnvironment.get_execution_environment
+    StreamExecutionEnvironment.create_input
+    StreamExecutionEnvironment.add_source
+    StreamExecutionEnvironment.from_source
+    StreamExecutionEnvironment.read_text_file
+    StreamExecutionEnvironment.from_collection
+    StreamExecutionEnvironment.is_unaligned_checkpoints_enabled
+    StreamExecutionEnvironment.is_force_unaligned_checkpoints
+    StreamExecutionEnvironment.close
+
+
+RuntimeExecutionMode
+--------------------
+
+Runtime execution mode of DataStream programs. Among other things, this controls task
+scheduling, network shuffle behavior, and time semantics. Some operations will also change
+their record emission behavior based on the configured execution mode.
+
+:data:`STREAMING`:
+
+The Pipeline will be executed with Streaming Semantics. All tasks will be deployed before
+execution starts, checkpoints will be enabled, and both processing and event time will be
+fully supported.
+
+:data:`BATCH`:
+
+The Pipeline will be executed with Batch Semantics.
+Tasks will be scheduled gradually based
+on the scheduling regions they belong to, shuffles between regions will be blocking, watermarks
+are assumed to be "perfect", i.e. no late data, and processing time is assumed to not advance
+during execution.
+
+:data:`AUTOMATIC`:
+
+Flink will set the execution mode to BATCH if all sources are bounded, or STREAMING if there
+is at least one source which is unbounded.
+
+.. currentmodule:: pyflink.datastream.execution_mode
+
+.. autosummary::
+    :toctree: api/
+
+    RuntimeExecutionMode.STREAMING
+    RuntimeExecutionMode.BATCH
+    RuntimeExecutionMode.AUTOMATIC
+
+
+SlotSharingGroup
+----------------
+
+.. currentmodule:: pyflink.datastream.slot_sharing_group
+
+.. autosummary::
+    :toctree: api/
+
+    SlotSharingGroup
+    MemorySize
\ No newline at end of file
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.datastream/timer.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.datastream/timer.rst
index 7a174fd75ab..cc1ee50709e 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.datastream/timer.rst
@@ -16,35 +16,44 @@ limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
-.. toctree::
-    :maxdepth: 2
-    :caption: Contents
+=====
+Timer
+=====

-    pyflink
-    pyflink.common
-    pyflink.table
-    pyflink.datastream
-    pyflink.metrics
+TimerService
+------------

+.. currentmodule:: pyflink.datastream.timerservice

-Core Classes:
----------------
+.. autosummary::
+    :toctree: api/

-    :class:`pyflink.table.TableEnvironment`
+    TimerService.current_processing_time
+    TimerService.current_watermark
+    TimerService.register_processing_time_timer
+    TimerService.register_event_time_timer
+    TimerService.delete_processing_time_timer
+    TimerService.delete_event_time_timer

-    Main entry point for Flink Table functionality.

-    :class:`pyflink.table.Table`
+TimeCharacteristic
+------------------

-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
+.. currentmodule:: pyflink.datastream.time_characteristic

-    :class:`pyflink.datastream.StreamExecutionEnvironment`
+.. autosummary::
+    :toctree: api/

-    Main entry point for Flink DataStream functionality.
+    TimeCharacteristic

-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+TimeDomain
+----------
+
+.. currentmodule:: pyflink.datastream.time_domain
+
+.. autosummary::
+    :toctree: api/
+
+    TimeDomain
diff --git a/flink-python/docs/reference/pyflink.datastream/window.rst b/flink-python/docs/reference/pyflink.datastream/window.rst
new file mode 100644
index 00000000000..dd508dcee8e
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.datastream/window.rst
@@ -0,0 +1,85 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+
+======
+Window
+======
+
+Window
+------
+
+.. currentmodule:: pyflink.datastream.window
+
+.. autosummary::
+    :toctree: api/
+
+    TimeWindow
+    CountWindow
+    GlobalWindow
+
+
+Trigger
+-------
+
+.. currentmodule:: pyflink.datastream.window
+
+.. autosummary::
+    :toctree: api/
+
+    TriggerResult
+    EventTimeTrigger
+    ContinuousEventTimeTrigger
+    ProcessingTimeTrigger
+    ContinuousProcessingTimeTrigger
+    PurgingTrigger
+    CountTrigger
+    NeverTrigger
+
+
+WindowAssigner
+--------------
+
+.. currentmodule:: pyflink.datastream.window
+
+.. autosummary::
+    :toctree: api/
+
+    MergingWindowAssigner
+    CountTumblingWindowAssigner
+    CountSlidingWindowAssigner
+    TumblingProcessingTimeWindows
+    TumblingEventTimeWindows
+    SlidingProcessingTimeWindows
+    SlidingEventTimeWindows
+    ProcessingTimeSessionWindows
+    EventTimeSessionWindows
+    DynamicProcessingTimeSessionWindows
+    DynamicEventTimeSessionWindows
+    GlobalWindows
+
+
+SessionWindowTimeGapExtractor
+-----------------------------
+
+.. currentmodule:: pyflink.datastream.window
+
+.. autosummary::
+    :toctree: api/
+
+    SessionWindowTimeGapExtractor
diff --git a/flink-python/docs/reference/pyflink.table/catalog.rst b/flink-python/docs/reference/pyflink.table/catalog.rst
new file mode 100644
index 00000000000..75ca6fdd80a
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/catalog.rst
@@ -0,0 +1,238 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+   ################################################################################
+
+=======
+Catalog
+=======
+
+Catalog
+-------
+
+Catalog is responsible for reading and writing metadata such as databases/tables/views/UDFs
+from a registered catalog. It connects a registered catalog and Flink's Table API.
+
+.. currentmodule:: pyflink.table.catalog
+
diff --git a/flink-python/docs/reference/pyflink.table/catalog.rst b/flink-python/docs/reference/pyflink.table/catalog.rst
new file mode 100644
index 00000000000..75ca6fdd80a
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/catalog.rst
@@ -0,0 +1,238 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+=======
+Catalog
+=======
+
+Catalog
+-------
+
+Catalog is responsible for reading and writing metadata such as databases/tables/views/UDFs
+from a registered catalog. It connects a registered catalog to Flink's Table API.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    Catalog.get_default_database
+    Catalog.list_databases
+    Catalog.get_database
+    Catalog.database_exists
+    Catalog.create_database
+    Catalog.drop_database
+    Catalog.alter_database
+    Catalog.list_tables
+    Catalog.list_views
+    Catalog.get_table
+    Catalog.table_exists
+    Catalog.drop_table
+    Catalog.rename_table
+    Catalog.create_table
+    Catalog.alter_table
+    Catalog.list_partitions
+    Catalog.get_partition
+    Catalog.partition_exists
+    Catalog.create_partition
+    Catalog.drop_partition
+    Catalog.alter_partition
+    Catalog.list_functions
+    Catalog.get_function
+    Catalog.function_exists
+    Catalog.create_function
+    Catalog.alter_function
+    Catalog.drop_function
+    Catalog.get_table_statistics
+    Catalog.get_table_column_statistics
+    Catalog.get_partition_statistics
+    Catalog.bulk_get_partition_statistics
+    Catalog.get_partition_column_statistics
+    Catalog.bulk_get_partition_column_statistics
+    Catalog.alter_table_statistics
+    Catalog.alter_table_column_statistics
+    Catalog.alter_partition_statistics
+    Catalog.alter_partition_column_statistics
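+
+A short usage sketch (this assumes a catalog named ``my_catalog`` was registered on the table
+environment ``t_env`` beforehand):
+
+Example:
+::
+
+    >>> from pyflink.table.catalog import ObjectPath
+    >>> catalog = t_env.get_catalog("my_catalog")
+    >>> catalog.list_databases()
+    >>> # a table is addressed by its database name and object name
+    >>> table = catalog.get_table(ObjectPath("my_db", "my_table"))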
+
+
+CatalogDatabase
+---------------
+
+Represents a database object in a catalog.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogDatabase.create_instance
+    CatalogDatabase.get_properties
+    CatalogDatabase.get_comment
+    CatalogDatabase.copy
+    CatalogDatabase.get_description
+    CatalogDatabase.get_detailed_description
+
+
+CatalogBaseTable
+----------------
+
+CatalogBaseTable is the common parent of table and view. It has a map of key-value pairs defining
+the properties of the table.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogBaseTable.create_table
+    CatalogBaseTable.create_view
+    CatalogBaseTable.get_options
+    CatalogBaseTable.get_schema
+    CatalogBaseTable.get_unresolved_schema
+    CatalogBaseTable.get_comment
+    CatalogBaseTable.copy
+    CatalogBaseTable.get_description
+    CatalogBaseTable.get_detailed_description
+
+
+CatalogPartition
+----------------
+
+Represents a partition object in a catalog.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogPartition.create_instance
+    CatalogPartition.get_properties
+    CatalogPartition.copy
+    CatalogPartition.get_description
+    CatalogPartition.get_detailed_description
+    CatalogPartition.get_comment
+
+
+CatalogFunction
+---------------
+
+Represents a function object in a catalog.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogFunction.create_instance
+    CatalogFunction.get_class_name
+    CatalogFunction.copy
+    CatalogFunction.get_description
+    CatalogFunction.get_detailed_description
+    CatalogFunction.is_generic
+    CatalogFunction.get_function_language
+
+
+ObjectPath
+----------
+
+A database name and object (table/view/function) name combo in a catalog.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    ObjectPath.from_string
+    ObjectPath.get_database_name
+    ObjectPath.get_object_name
+    ObjectPath.get_full_name
+
+
+CatalogPartitionSpec
+--------------------
+
+Represents a partition spec object in a catalog.
+Partition columns and values are NOT of strict order, and they need to be re-arranged to the
+correct order by comparing with a list of strictly ordered partition keys.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogPartitionSpec.get_partition_spec
+
+
+CatalogTableStatistics
+----------------------
+
+Statistics for a non-partitioned table or a partition of a partitioned table.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogTableStatistics.get_row_count
+    CatalogTableStatistics.get_field_count
+    CatalogTableStatistics.get_total_size
+    CatalogTableStatistics.get_raw_data_size
+    CatalogTableStatistics.get_properties
+    CatalogTableStatistics.copy
+
+
+CatalogColumnStatistics
+-----------------------
+
+Column statistics of a table or partition.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    CatalogColumnStatistics.get_column_statistics_data
+    CatalogColumnStatistics.get_properties
+    CatalogColumnStatistics.copy
+
+
+HiveCatalog
+-----------
+
+A catalog implementation for Hive.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    HiveCatalog
+
+
+JdbcCatalog
+-----------
+
+A catalog implementation for JDBC.
+
+.. currentmodule:: pyflink.table.catalog
+
+.. autosummary::
+    :toctree: api/
+
+    JdbcCatalog
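+
+A short registration sketch (the constructor arguments, in particular the local Hive conf
+directory, are assumptions for illustration):
+
+Example:
+::
+
+    >>> from pyflink.table.catalog import HiveCatalog
+    >>> # point the catalog at an existing Hive Metastore configuration
+    >>> catalog = HiveCatalog("myhive", "default", "/opt/hive-conf")
+    >>> t_env.register_catalog("myhive", catalog)
+    >>> t_env.use_catalog("myhive")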
diff --git a/flink-python/docs/reference/pyflink.table/data_types.rst b/flink-python/docs/reference/pyflink.table/data_types.rst
new file mode 100644
index 00000000000..c75c5c1604d
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/data_types.rst
@@ -0,0 +1,74 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+==========
+Data Types
+==========
+
+Describes the data type of a value in the table ecosystem. Instances of this class can be used
+to declare input and/or output types of operations.
+
+:class:`DataType` has two responsibilities: declaring a logical type and giving hints
+about the physical representation of data to the optimizer. While the logical type is mandatory,
+hints are optional but useful at the edges to other APIs.
+
+The logical type is independent of any physical representation and is close to the "data type"
+terminology of the SQL standard.
+
+Physical hints are required at the edges of the table ecosystem. Hints indicate the data format
+that an implementation expects.
+
+.. currentmodule:: pyflink.table.types
+
+.. autosummary::
+    :toctree: api/
+
+    DataTypes.NULL
+    DataTypes.CHAR
+    DataTypes.VARCHAR
+    DataTypes.STRING
+    DataTypes.BOOLEAN
+    DataTypes.BINARY
+    DataTypes.VARBINARY
+    DataTypes.BYTES
+    DataTypes.DECIMAL
+    DataTypes.TINYINT
+    DataTypes.SMALLINT
+    DataTypes.INT
+    DataTypes.BIGINT
+    DataTypes.FLOAT
+    DataTypes.DOUBLE
+    DataTypes.DATE
+    DataTypes.TIME
+    DataTypes.TIMESTAMP
+    DataTypes.TIMESTAMP_WITH_LOCAL_TIME_ZONE
+    DataTypes.TIMESTAMP_LTZ
+    DataTypes.ARRAY
+    DataTypes.LIST_VIEW
+    DataTypes.MAP
+    DataTypes.MAP_VIEW
+    DataTypes.MULTISET
+    DataTypes.ROW
+    DataTypes.FIELD
+    DataTypes.SECOND
+    DataTypes.MINUTE
+    DataTypes.HOUR
+    DataTypes.DAY
+    DataTypes.MONTH
+    DataTypes.YEAR
+    DataTypes.INTERVAL
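+
+For illustration, a minimal sketch of a row type built from the types above:
+
+Example:
+::
+
+    >>> from pyflink.table import DataTypes
+    >>> # a row with an id, a name and a millisecond-precision event timestamp
+    >>> row_type = DataTypes.ROW([
+    ...     DataTypes.FIELD("id", DataTypes.BIGINT()),
+    ...     DataTypes.FIELD("name", DataTypes.STRING()),
+    ...     DataTypes.FIELD("ts", DataTypes.TIMESTAMP(3))])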
diff --git a/flink-python/docs/reference/pyflink.table/descriptors.rst b/flink-python/docs/reference/pyflink.table/descriptors.rst
new file mode 100644
index 00000000000..931ca24e1bf
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/descriptors.rst
@@ -0,0 +1,142 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+
+===========
+Descriptors
+===========
+
+TableDescriptor
+---------------
+
+Describes a CatalogTable representing a source or sink.
+
+TableDescriptor is a template for creating a CatalogTable instance. It closely resembles the
+"CREATE TABLE" SQL DDL statement, containing schema, connector options, and other
+characteristics. Since tables in Flink are typically backed by external systems, the
+descriptor describes how a connector (and possibly its format) is configured.
+
+This can be used to register a table in the Table API, see :func:`create_temporary_table` in
+TableEnvironment.
+
+.. currentmodule:: pyflink.table.table_descriptor
+
+.. autosummary::
+    :toctree: api/
+
+    TableDescriptor.for_connector
+    TableDescriptor.get_schema
+    TableDescriptor.get_options
+    TableDescriptor.get_partition_keys
+    TableDescriptor.get_comment
+    TableDescriptor.Builder.schema
+    TableDescriptor.Builder.option
+    TableDescriptor.Builder.format
+    TableDescriptor.Builder.partitioned_by
+    TableDescriptor.Builder.comment
+    TableDescriptor.Builder.build
+
+FormatDescriptor
+----------------
+
+Describes a Format and its options for use with :class:`~pyflink.table.TableDescriptor`.
+
+Formats are responsible for encoding and decoding data in table connectors. Note that not
+every connector has a format, while others may have multiple formats (e.g. the Kafka connector
+has separate formats for keys and values). Common formats are "json", "csv", "avro", etc.
+
+.. currentmodule:: pyflink.table.table_descriptor
+
+.. autosummary::
+    :toctree: api/
+
+    FormatDescriptor.for_format
+    FormatDescriptor.get_format
+    FormatDescriptor.get_options
+    FormatDescriptor.Builder.option
+    FormatDescriptor.Builder.build
+
+Schema
+------
+
+Schema of a table or view.
+
+A schema represents the schema part of a ``CREATE TABLE (schema) WITH (options)`` DDL
+statement in SQL. It defines columns of different kinds, constraints, time attributes, and
+watermark strategies. It is possible to reference objects (such as functions or types) across
+different catalogs.
+
+This class is used in the API and catalogs to define an unresolved schema that will be
+translated to ResolvedSchema. Some methods of this class perform basic validation; however, the
+main validation happens during resolution. Thus, an unresolved schema can be incomplete and
+might be enriched or merged with a different schema at a later stage.
+
+Since an instance of this class is unresolved, it should not be directly persisted. Its
+``str()`` representation shows only a summary of the contained objects.
+
+.. currentmodule:: pyflink.table.schema
+
+.. autosummary::
+    :toctree: api/
+
+    Schema.Builder.from_schema
+    Schema.Builder.from_row_data_type
+    Schema.Builder.from_fields
+    Schema.Builder.column
+    Schema.Builder.column_by_expression
+    Schema.Builder.column_by_metadata
+    Schema.Builder.watermark
+    Schema.Builder.primary_key
+    Schema.Builder.primary_key_named
+    Schema.Builder.build
+
+
+TableSchema
+-----------
+
+A table schema that represents a table's structure with field names and data types.
+
+.. currentmodule:: pyflink.table.table_schema
+
+.. autosummary::
+    :toctree: api/
+
+    TableSchema.copy
+    TableSchema.get_field_data_types
+    TableSchema.get_field_data_type
+    TableSchema.get_field_count
+    TableSchema.get_field_names
+    TableSchema.get_field_name
+    TableSchema.to_row_data_type
+    TableSchema.Builder.field
+    TableSchema.Builder.build
+
+
+ChangelogMode
+-------------
+
+The set of changes contained in a changelog.
+
+.. currentmodule:: pyflink.table.changelog_mode
+
+.. autosummary::
+    :toctree: api/
+
+    ChangelogMode.insert_only
+    ChangelogMode.upsert
+    ChangelogMode.all
diff --git a/flink-python/docs/reference/pyflink.table/expressions.rst b/flink-python/docs/reference/pyflink.table/expressions.rst
new file mode 100644
index 00000000000..595fa39f82b
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/expressions.rst
@@ -0,0 +1,288 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+
+===========
+Expressions
+===========
+
+Expressions
+===========
+
+.. currentmodule:: pyflink.table.expressions
+
+.. 
autosummary:: + :toctree: api/ + + col + lit + range_ + and_ + or_ + not_ + current_database + current_date + current_time + current_timestamp + current_watermark + local_time + local_timestamp + to_date + to_timestamp + to_timestamp_ltz + temporal_overlaps + date_format + timestamp_diff + convert_tz + from_unixtime + unix_timestamp + array + row + map_ + row_interval + pi + e + rand + rand_integer + atan2 + negative + concat + concat_ws + uuid + null_of + log + source_watermark + if_then_else + coalesce + with_columns + without_columns + json_string + json_object + json_object_agg + json_array + json_array_agg + call + call_sql + + +Expression +========== + +arithmetic functions +-------------------- + +.. currentmodule:: pyflink.table.expression + +.. autosummary:: + :toctree: api/ + + Expression.exp + Expression.log10 + Expression.log2 + Expression.ln + Expression.log + Expression.cosh + Expression.sinh + Expression.sin + Expression.cos + Expression.tan + Expression.cot + Expression.asin + Expression.acos + Expression.atan + Expression.tanh + Expression.degrees + Expression.radians + Expression.sqrt + Expression.abs + Expression.sign + Expression.round + Expression.between + Expression.not_between + Expression.then + Expression.if_null + Expression.is_null + Expression.is_not_null + Expression.is_true + Expression.is_false + Expression.is_not_true + Expression.is_not_false + Expression.distinct + Expression.sum + Expression.sum0 + Expression.min + Expression.max + Expression.count + Expression.avg + Expression.first_value + Expression.last_value + Expression.list_agg + Expression.stddev_pop + Expression.stddev_samp + Expression.var_pop + Expression.var_samp + Expression.collect + Expression.alias + Expression.cast + Expression.try_cast + Expression.asc + Expression.desc + Expression.in_ + Expression.start + Expression.end + Expression.bin + Expression.hex + Expression.truncate + +string functions +---------------- + +.. currentmodule:: pyflink.table.expression + +.. autosummary:: + :toctree: api/ + + Expression.substring + Expression.substr + Expression.trim_leading + Expression.trim_trailing + Expression.trim + Expression.replace + Expression.char_length + Expression.upper_case + Expression.lower_case + Expression.init_cap + Expression.like + Expression.similar + Expression.position + Expression.lpad + Expression.rpad + Expression.overlay + Expression.regexp + Expression.regexp_replace + Expression.regexp_extract + Expression.from_base64 + Expression.to_base64 + Expression.ascii + Expression.chr + Expression.decode + Expression.encode + Expression.left + Expression.right + Expression.instr + Expression.locate + Expression.parse_url + Expression.ltrim + Expression.rtrim + Expression.repeat + Expression.over + Expression.reverse + Expression.split_index + Expression.str_to_map + +temporal functions +------------------ + +.. currentmodule:: pyflink.table.expression + +.. autosummary:: + :toctree: api/ + + Expression.to_date + Expression.to_time + Expression.to_timestamp + Expression.extract + Expression.floor + Expression.ceil + + +advanced type helper functions +------------------------------ + +.. currentmodule:: pyflink.table.expression + +.. autosummary:: + :toctree: api/ + + Expression.get + Expression.flatten + Expression.at + Expression.cardinality + Expression.element + Expression.array_contains + + +time definition functions +------------------------- + +.. currentmodule:: pyflink.table.expression + +.. 
autosummary::
+    :toctree: api/
+
+    Expression.rowtime
+    Expression.proctime
+    Expression.year
+    Expression.years
+    Expression.quarter
+    Expression.quarters
+    Expression.month
+    Expression.months
+    Expression.week
+    Expression.weeks
+    Expression.day
+    Expression.days
+    Expression.hour
+    Expression.hours
+    Expression.minute
+    Expression.minutes
+    Expression.second
+    Expression.seconds
+    Expression.milli
+    Expression.millis
+
+
+hash functions
+--------------
+
+.. currentmodule:: pyflink.table.expression
+
+.. autosummary::
+    :toctree: api/
+
+    Expression.md5
+    Expression.sha1
+    Expression.sha224
+    Expression.sha256
+    Expression.sha384
+    Expression.sha512
+    Expression.sha2
+
+
+JSON functions
+--------------
+
+.. currentmodule:: pyflink.table.expression
+
+.. autosummary::
+    :toctree: api/
+
+    Expression.is_json
+    Expression.json_exists
+    Expression.json_value
+    Expression.json_query
diff --git a/flink-python/docs/index.rst b/flink-python/docs/reference/pyflink.table/index.rst
similarity index 56%
copy from flink-python/docs/index.rst
copy to flink-python/docs/reference/pyflink.table/index.rst
index 7a174fd75ab..b4d26648ad8 100644
--- a/flink-python/docs/index.rst
+++ b/flink-python/docs/reference/pyflink.table/index.rst
@@ -16,35 +16,21 @@
 limitations under the License.
 ################################################################################

-Welcome to Flink Python API Docs!
-==================================================
+=============
+PyFlink Table
+=============

-.. toctree::
-   :maxdepth: 2
-   :caption: Contents
-
-   pyflink
-   pyflink.common
-   pyflink.table
-   pyflink.datastream
-   pyflink.metrics
-
-
-Core Classes:
---------------
-
-    :class:`pyflink.table.TableEnvironment`
-
-    Main entry point for Flink Table functionality.
+This page gives an overview of the public PyFlink Table API.

-    :class:`pyflink.table.Table`
-
-    Core component of the Flink Table API. The Flink Table API is built around :class:`~pyflink.table.Table`.
-
-    :class:`pyflink.datastream.StreamExecutionEnvironment`
-
-    Main entry point for Flink DataStream functionality.
-
-    :class:`pyflink.datastream.DataStream`
-
-    Core component of the Flink DataStream API. The Flink DataStream API is built around :class:`~pyflink.datastream.DataStream`.
+.. toctree::
+    :maxdepth: 2
+
+    table_environment
+    table
+    data_types
+    window
+    expressions
+    udf
+    descriptors
+    statement_set
+    catalog
diff --git a/flink-python/docs/reference/pyflink.table/statement_set.rst b/flink-python/docs/reference/pyflink.table/statement_set.rst
new file mode 100644
index 00000000000..db48591f33c
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/statement_set.rst
@@ -0,0 +1,81 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+ ################################################################################ + +============ +StatementSet +============ + +StatementSet +------------ + +A :class:`~StatementSet` accepts pipelines defined by DML statements or :class:`~Table` objects. +The planner can optimize all added statements together and then submit them as one job. + +The added statements will be cleared when calling the :func:`~StatementSet.execute` method. + +.. currentmodule:: pyflink.table.statement_set + +.. autosummary:: + :toctree: api/ + + StatementSet.add_insert_sql + StatementSet.attach_as_datastream + StatementSet.add_insert + StatementSet.explain + StatementSet.execute + + +TableResult +----------- + +A :class:`~pyflink.table.TableResult` is the representation of the statement execution result. + +.. currentmodule:: pyflink.table.table_result + +.. autosummary:: + :toctree: api/ + + TableResult.get_job_client + TableResult.wait + TableResult.get_table_schema + TableResult.get_result_kind + TableResult.collect + TableResult.print + + +ResultKind +---------- + +ResultKind defines the types of the result. + +:data:`SUCCESS`: + +The statement (e.g. DDL, USE) executes successfully, and the result only contains a simple "OK". + +:data:`SUCCESS_WITH_CONTENT`: + +The statement (e.g. DML, DQL, SHOW) executes successfully, and the result contains important +content. + +.. currentmodule:: pyflink.table.table_result + +.. autosummary:: + :toctree: api/ + + ResultKind.SUCCESS + ResultKind.SUCCESS_WITH_CONTENT diff --git a/flink-python/docs/reference/pyflink.table/table.rst b/flink-python/docs/reference/pyflink.table/table.rst new file mode 100644 index 00000000000..0f2101271f3 --- /dev/null +++ b/flink-python/docs/reference/pyflink.table/table.rst @@ -0,0 +1,205 @@ +.. ################################################################################ + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + ################################################################################ + +===== +Table +===== + +Table +===== + +A :class:`~pyflink.table.Table` object is the core abstraction of the Table API. +Similar to how the DataStream API has DataStream, +the Table API is built around :class:`~pyflink.table.Table`. + +A :class:`~pyflink.table.Table` object describes a pipeline of data transformations. It does not +contain the data itself in any way. Instead, it describes how to read data from a table source, +and how to eventually write data to a table sink. The declared pipeline can be +printed, optimized, and eventually executed in a cluster. The pipeline can work with bounded or +unbounded streams which enables both streaming and batch scenarios. + +By the definition above, a :class:`~pyflink.table.Table` object can actually be considered as +a view in SQL terms. 
+
+The initial :class:`~pyflink.table.Table` object is constructed by a
+:class:`~pyflink.table.TableEnvironment`. For example,
+:func:`~pyflink.table.TableEnvironment.from_path` obtains a table from a catalog.
+Every :class:`~pyflink.table.Table` object has a schema that is available through
+:func:`~pyflink.table.Table.get_schema`. A :class:`~pyflink.table.Table` object is
+always associated with its original table environment during programming.
+
+Every transformation (e.g. :func:`~pyflink.table.Table.select` or
+:func:`~pyflink.table.Table.filter`) on a :class:`~pyflink.table.Table` object leads to a new
+:class:`~pyflink.table.Table` object.
+
+Use :func:`~pyflink.table.Table.execute` to execute the pipeline and retrieve the transformed
+data locally during development. Otherwise, use :func:`~pyflink.table.Table.execute_insert` to
+write the data into a table sink.
+
+Many methods of this class take one or more :class:`~pyflink.table.Expression` as parameters.
+For fluent definition of expressions and easier readability, we recommend adding a star import:
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import *
+
+Check the documentation for more programming-language-specific APIs.
+
+The following example shows how to work with a :class:`~pyflink.table.Table` object.
+
+Example:
+::
+
+    >>> from pyflink.table import EnvironmentSettings, TableEnvironment
+    >>> from pyflink.table.expressions import *
+    >>> env_settings = EnvironmentSettings.in_streaming_mode()
+    >>> t_env = TableEnvironment.create(env_settings)
+    >>> table = t_env.from_path("my_table").select(col("colA").trim(), col("colB") + 12)
+    >>> table.execute().print()
+
+.. currentmodule:: pyflink.table
+
+.. autosummary::
+    :toctree: api/
+
+    Table.add_columns
+    Table.add_or_replace_columns
+    Table.aggregate
+    Table.alias
+    Table.distinct
+    Table.drop_columns
+    Table.execute
+    Table.execute_insert
+    Table.explain
+    Table.fetch
+    Table.filter
+    Table.flat_aggregate
+    Table.flat_map
+    Table.full_outer_join
+    Table.get_schema
+    Table.group_by
+    Table.intersect
+    Table.intersect_all
+    Table.join
+    Table.join_lateral
+    Table.left_outer_join
+    Table.left_outer_join_lateral
+    Table.limit
+    Table.map
+    Table.minus
+    Table.minus_all
+    Table.offset
+    Table.order_by
+    Table.over_window
+    Table.print_schema
+    Table.rename_columns
+    Table.right_outer_join
+    Table.select
+    Table.to_pandas
+    Table.union
+    Table.union_all
+    Table.where
+    Table.window
+
+
+GroupedTable
+============
+
+A table that has been grouped on a set of grouping keys.
+
+.. currentmodule:: pyflink.table
+
+.. autosummary::
+    :toctree: api/
+
+    GroupedTable.select
+    GroupedTable.aggregate
+    GroupedTable.flat_aggregate
+
+
+GroupWindowedTable
+==================
+
+A table that has been windowed for :class:`~pyflink.table.window.GroupWindow`.
+
+.. currentmodule:: pyflink.table
+
+.. autosummary::
+    :toctree: api/
+
+    GroupWindowedTable.group_by
+
+
+WindowGroupedTable
+==================
+
+A table that has been windowed and grouped for :class:`~pyflink.table.window.GroupWindow`.
+
+.. currentmodule:: pyflink.table
+
+.. autosummary::
+    :toctree: api/
+
+    WindowGroupedTable.select
+    WindowGroupedTable.aggregate
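+
+A minimal sketch of the window/group_by/select chain that produces and consumes such a table
+(the table ``tab`` and its ``rowtime`` attribute are assumed):
+
+Example:
+::
+
+    >>> from pyflink.table.window import Tumble
+    >>> from pyflink.table.expressions import col, lit
+    >>> tab.window(Tumble.over(lit(10).minutes).on(col("rowtime")).alias("w")) \
+    ...     .group_by(col("w"), col("a")) \
+    ...     .select(col("a"), col("w").start, col("b").sum)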
+
+
+OverWindowedTable
+=================
+
+A table that has been windowed for :class:`~pyflink.table.window.OverWindow`.
+
+Unlike group windows, which are specified in the GROUP BY clause, over windows do not collapse
+rows. Instead, over window aggregates compute an aggregate for each input row over a range of
+its neighboring rows.
+
+.. currentmodule:: pyflink.table
+
+.. autosummary::
+    :toctree: api/
+
+    OverWindowedTable.select
+
+
+AggregatedTable
+===============
+
+A table on which an aggregate function has been applied.
+
+.. currentmodule:: pyflink.table.table
+
+.. autosummary::
+    :toctree: api/
+
+    AggregatedTable.select
+
+
+FlatAggregateTable
+==================
+
+A table that performs flatAggregate on a :class:`~pyflink.table.Table`, a
+:class:`~pyflink.table.GroupedTable`, or a :class:`~pyflink.table.WindowGroupedTable`.
+
+.. currentmodule:: pyflink.table.table
+
+.. autosummary::
+    :toctree: api/
+
+    FlatAggregateTable.select
diff --git a/flink-python/docs/reference/pyflink.table/table_environment.rst b/flink-python/docs/reference/pyflink.table/table_environment.rst
new file mode 100644
index 00000000000..302dc3d5743
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/table_environment.rst
@@ -0,0 +1,272 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+================
+TableEnvironment
+================
+
+A table environment is the base class, entry point, and central context for creating Table
+and SQL API programs.
+
+EnvironmentSettings
+-------------------
+
+Defines all parameters that initialize a table environment. Those parameters are used only
+during instantiation of a :class:`~pyflink.table.TableEnvironment` and cannot be changed
+afterwards.
+
+Example:
+::
+
+    >>> EnvironmentSettings.new_instance() \
+    ...     .in_streaming_mode() \
+    ...     .with_built_in_catalog_name("my_catalog") \
+    ...     .with_built_in_database_name("my_database") \
+    ...     .build()
+
+:func:`~EnvironmentSettings.in_streaming_mode` or :func:`~EnvironmentSettings.in_batch_mode`
+might be convenient as shortcuts.
+
+.. currentmodule:: pyflink.table.environment_settings
+
+.. autosummary::
+    :toctree: api/
+
+    EnvironmentSettings.new_instance
+    EnvironmentSettings.from_configuration
+    EnvironmentSettings.in_streaming_mode
+    EnvironmentSettings.in_batch_mode
+    EnvironmentSettings.get_built_in_catalog_name
+    EnvironmentSettings.get_built_in_database_name
+    EnvironmentSettings.is_streaming_mode
+    EnvironmentSettings.to_configuration
+    EnvironmentSettings.get_configuration
+    EnvironmentSettings.Builder.with_configuration
+    EnvironmentSettings.Builder.in_batch_mode
+    EnvironmentSettings.Builder.in_streaming_mode
+    EnvironmentSettings.Builder.with_built_in_catalog_name
+    EnvironmentSettings.Builder.with_built_in_database_name
+    EnvironmentSettings.Builder.build
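+
+A usage sketch of these shortcuts (standard entry-point pattern):
+
+Example:
+::
+
+    >>> from pyflink.table import EnvironmentSettings, TableEnvironment
+    >>> # the shortcut replaces the new_instance()...build() chain above
+    >>> t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())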
+
+TableConfig
+-----------
+
+Configuration for the current :class:`TableEnvironment` session to adjust Table & SQL API
+programs.
+
+This class is a pure API class that abstracts configuration from various sources. Currently,
+configuration can be set in any of the following layers (in the given order):
+
+- flink-conf.yaml
+- CLI parameters
+- :class:`~pyflink.datastream.StreamExecutionEnvironment` when bridging to DataStream API
+- :func:`~EnvironmentSettings.Builder.with_configuration`
+- :func:`~TableConfig.set`
+
+The latter two represent the application-specific part of the configuration. They initialize
+and directly modify :func:`~TableConfig.get_configuration`. Other layers represent the
+configuration of the execution context and are immutable.
+
+The getter :func:`~TableConfig.get` gives read-only access to the full configuration. However,
+application-specific configuration has precedence. Configuration of outer layers is used for
+defaults and fallbacks. The setter :func:`~TableConfig.set` will only affect
+application-specific configuration.
+
+For common or important configuration options, this class provides getter and setter methods
+with detailed inline documentation.
+
+For more advanced configuration, users can directly access the underlying key-value map via
+:func:`~pyflink.table.TableConfig.get_configuration`.
+
+Example:
+::
+
+    >>> table_config = t_env.get_config()
+    >>> config = Configuration()
+    >>> config.set_string("parallelism.default", "128") \
+    ...     .set_string("pipeline.auto-watermark-interval", "800ms") \
+    ...     .set_string("execution.checkpointing.interval", "30s")
+    >>> table_config.add_configuration(config)
+
+.. note::
+
+    Because options are read at different points in time when performing operations, it is
+    recommended to set configuration options early after instantiating a table environment.
+
+.. currentmodule:: pyflink.table.table_config
+
+.. autosummary::
+    :toctree: api/
+
+    TableConfig
+
+TableEnvironment
+----------------
+
+A table environment is the base class, entry point, and central context for creating Table
+and SQL API programs.
+
+It is unified for bounded and unbounded data processing.
+
+A table environment is responsible for:
+
+    - Connecting to external systems.
+    - Registering and retrieving :class:`~pyflink.table.Table` and other meta objects from a
+      catalog.
+    - Executing SQL statements.
+    - Offering further configuration options.
+
+The path in methods such as :func:`create_temporary_view`
+should be a proper SQL identifier. The syntax follows
+[[catalog-name.]database-name.]object-name, where the catalog name and database are optional.
+For path resolution, see :func:`use_catalog` and :func:`use_database`. All keywords or other
+special characters need to be escaped.
+
+Example: `cat.1`.`db`.`Table` resolves to an object named 'Table' (table is a reserved
+keyword, and thus must be escaped) in a catalog named 'cat.1' and database named 'db'.
+
+.. note::
+
+    This environment is meant for pure table programs. If you would like to convert from or to
+    other Flink APIs, it might be necessary to use one of the available language-specific table
+    environments in the corresponding bridging modules.
+
+.. currentmodule:: pyflink.table.table_environment
+
+.. 
autosummary:: + :toctree: api/ + + TableEnvironment.add_python_archive + TableEnvironment.add_python_file + TableEnvironment.create + TableEnvironment.create_java_function + TableEnvironment.create_java_temporary_function + TableEnvironment.create_java_temporary_system_function + TableEnvironment.create_statement_set + TableEnvironment.create_table + TableEnvironment.create_temporary_function + TableEnvironment.create_temporary_system_function + TableEnvironment.create_temporary_table + TableEnvironment.create_temporary_view + TableEnvironment.drop_function + TableEnvironment.drop_temporary_function + TableEnvironment.drop_temporary_system_function + TableEnvironment.drop_temporary_table + TableEnvironment.drop_temporary_view + TableEnvironment.execute_sql + TableEnvironment.explain_sql + TableEnvironment.from_descriptor + TableEnvironment.from_elements + TableEnvironment.from_pandas + TableEnvironment.from_path + TableEnvironment.from_table_source + TableEnvironment.get_catalog + TableEnvironment.get_config + TableEnvironment.get_current_catalog + TableEnvironment.get_current_database + TableEnvironment.list_catalogs + TableEnvironment.list_databases + TableEnvironment.list_full_modules + TableEnvironment.list_functions + TableEnvironment.list_modules + TableEnvironment.list_tables + TableEnvironment.list_temporary_tables + TableEnvironment.list_temporary_views + TableEnvironment.list_user_defined_functions + TableEnvironment.list_views + TableEnvironment.load_module + TableEnvironment.register_catalog + TableEnvironment.register_function + TableEnvironment.register_java_function + TableEnvironment.register_table + TableEnvironment.register_table_sink + TableEnvironment.register_table_source + TableEnvironment.scan + TableEnvironment.set_python_requirements + TableEnvironment.sql_query + TableEnvironment.unload_module + TableEnvironment.use_catalog + TableEnvironment.use_database + TableEnvironment.use_modules + +StreamTableEnvironment +---------------------- + +.. currentmodule:: pyflink.table.table_environment + +.. 
autosummary:: + :toctree: api/ + + StreamTableEnvironment.add_python_archive + StreamTableEnvironment.add_python_file + StreamTableEnvironment.create + StreamTableEnvironment.create_java_function + StreamTableEnvironment.create_java_temporary_function + StreamTableEnvironment.create_java_temporary_system_function + StreamTableEnvironment.create_statement_set + StreamTableEnvironment.create_table + StreamTableEnvironment.create_temporary_function + StreamTableEnvironment.create_temporary_system_function + StreamTableEnvironment.create_temporary_table + StreamTableEnvironment.create_temporary_view + StreamTableEnvironment.drop_function + StreamTableEnvironment.drop_temporary_function + StreamTableEnvironment.drop_temporary_system_function + StreamTableEnvironment.drop_temporary_table + StreamTableEnvironment.drop_temporary_view + StreamTableEnvironment.execute_sql + StreamTableEnvironment.explain_sql + StreamTableEnvironment.from_descriptor + StreamTableEnvironment.from_elements + StreamTableEnvironment.from_pandas + StreamTableEnvironment.from_path + StreamTableEnvironment.from_table_source + StreamTableEnvironment.from_data_stream + StreamTableEnvironment.from_changelog_stream + StreamTableEnvironment.get_catalog + StreamTableEnvironment.get_config + StreamTableEnvironment.get_current_catalog + StreamTableEnvironment.get_current_database + StreamTableEnvironment.list_catalogs + StreamTableEnvironment.list_databases + StreamTableEnvironment.list_full_modules + StreamTableEnvironment.list_functions + StreamTableEnvironment.list_modules + StreamTableEnvironment.list_tables + StreamTableEnvironment.list_temporary_tables + StreamTableEnvironment.list_temporary_views + StreamTableEnvironment.list_user_defined_functions + StreamTableEnvironment.list_views + StreamTableEnvironment.load_module + StreamTableEnvironment.register_catalog + StreamTableEnvironment.register_function + StreamTableEnvironment.register_java_function + StreamTableEnvironment.register_table + StreamTableEnvironment.register_table_sink + StreamTableEnvironment.register_table_source + StreamTableEnvironment.scan + StreamTableEnvironment.set_python_requirements + StreamTableEnvironment.sql_query + StreamTableEnvironment.to_data_stream + StreamTableEnvironment.to_changelog_stream + StreamTableEnvironment.to_append_stream + StreamTableEnvironment.to_retract_stream + StreamTableEnvironment.unload_module + StreamTableEnvironment.use_catalog + StreamTableEnvironment.use_database + StreamTableEnvironment.use_modules diff --git a/flink-python/docs/reference/pyflink.table/udf.rst b/flink-python/docs/reference/pyflink.table/udf.rst new file mode 100644 index 00000000000..7029269670d --- /dev/null +++ b/flink-python/docs/reference/pyflink.table/udf.rst @@ -0,0 +1,95 @@ +.. ################################################################################ + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+======================
+User Defined Functions
+======================
+
+Scalar Function
+---------------
+
+A user-defined scalar function maps zero, one, or multiple scalar values to a new scalar value.
+
+.. currentmodule:: pyflink.table.udf
+
+.. autosummary::
+    :toctree: api/
+
+    ScalarFunction
+    udf
+
+
+Table Function
+--------------
+
+A user-defined table function maps zero, one, or multiple scalar values to zero, one, or
+multiple rows.
+
+.. currentmodule:: pyflink.table.udf
+
+.. autosummary::
+    :toctree: api/
+
+    TableFunction
+    udtf
+
+
+Aggregate Function
+------------------
+
+A user-defined aggregate function maps scalar values of multiple rows to a new scalar value.
+
+.. currentmodule:: pyflink.table.udf
+
+.. autosummary::
+    :toctree: api/
+
+    AggregateFunction
+    udaf
+
+
+Table Aggregate Function
+------------------------
+
+A user-defined table aggregate function maps scalar values of multiple rows to zero, one, or multiple
+rows (or structured types). If an output record consists of only one field, the structured record can
+be omitted, and a scalar value can be emitted that will be implicitly wrapped into a row by the runtime.
+
+.. currentmodule:: pyflink.table.udf
+
+.. autosummary::
+    :toctree: api/
+
+    TableAggregateFunction
+    udtaf
+
+DataView
+--------
+
+If an accumulator needs to store large amounts of data, `pyflink.table.ListView` and `pyflink.table.MapView`
+could be used instead of list and dict. These two data structures provide functionality similar to
+list and dict, but usually offer better performance by leveraging Flink's state backend to eliminate
+unnecessary state access. You can use them by declaring `DataTypes.LIST_VIEW(...)` and `DataTypes.MAP_VIEW(...)`
+in the accumulator type.
+
+.. currentmodule:: pyflink.table.data_view
+
+.. autosummary::
+    :toctree: api/
+
+    ListView
+    MapView
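+
+For illustration, a minimal sketch of an accumulator type declaring a list view (the field
+names are hypothetical):
+
+Example:
+::
+
+    >>> from pyflink.table import DataTypes
+    >>> # the "buffer" field is backed by Flink state rather than a plain Python list
+    >>> acc_type = DataTypes.ROW([
+    ...     DataTypes.FIELD("buffer", DataTypes.LIST_VIEW(DataTypes.STRING())),
+    ...     DataTypes.FIELD("count", DataTypes.BIGINT())])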
diff --git a/flink-python/docs/reference/pyflink.table/window.rst b/flink-python/docs/reference/pyflink.table/window.rst
new file mode 100644
index 00000000000..dfa264635bd
--- /dev/null
+++ b/flink-python/docs/reference/pyflink.table/window.rst
@@ -0,0 +1,131 @@
+.. ################################################################################
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+    ################################################################################
+
+======
+Window
+======
+
+Tumble Window
+-------------
+
+Tumbling windows are consecutive, non-overlapping
+windows of a specified fixed length. For example, a tumbling window with a size of 5 minutes
+groups elements into 5-minute intervals.
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import col, lit
+    >>> Tumble.over(lit(10).minutes) \
+    ...     .on(col("rowtime")) \
+    ...     .alias("w")
+
+.. currentmodule:: pyflink.table.window
+
+.. autosummary::
+    :toctree: api/
+
+    Tumble.over
+    TumbleWithSize.on
+    TumbleWithSizeOnTime.alias
+
+
+Sliding Window
+--------------
+
+Sliding windows have a fixed size and slide by
+a specified slide interval. If the slide interval is smaller than the window size, sliding
+windows are overlapping. Thus, an element can be assigned to multiple windows.
+
+For example, a sliding window of size 15 minutes with a 5 minute slide interval groups
+elements of 15 minutes and evaluates every 5 minutes. Each element is contained in three
+consecutive window evaluations.
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import col, lit
+    >>> Slide.over(lit(10).minutes) \
+    ...     .every(lit(5).minutes) \
+    ...     .on(col("rowtime")) \
+    ...     .alias("w")
+
+.. currentmodule:: pyflink.table.window
+
+.. autosummary::
+    :toctree: api/
+
+    Slide.over
+    SlideWithSize.every
+    SlideWithSizeAndSlide.on
+    SlideWithSizeAndSlideOnTime.alias
+
+
+Session Window
+--------------
+
+The boundaries of session windows are defined by
+intervals of inactivity, i.e., a session window closes if no event appears for a defined
+gap period.
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import col, lit
+    >>> Session.with_gap(lit(10).minutes) \
+    ...     .on(col("rowtime")) \
+    ...     .alias("w")
+
+.. currentmodule:: pyflink.table.window
+
+.. autosummary::
+    :toctree: api/
+
+    Session.with_gap
+    SessionWithGap.on
+    SessionWithGapOnTime.alias
+
+
+Over Window
+-----------
+
+Similar to SQL, over window aggregates compute an
+aggregate for each input row over a range of its neighboring rows.
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import col, UNBOUNDED_RANGE
+    >>> Over.partition_by(col("a")) \
+    ...     .order_by(col("rowtime")) \
+    ...     .preceding(UNBOUNDED_RANGE) \
+    ...     .alias("w")
+
+.. currentmodule:: pyflink.table.window
+
+.. autosummary::
+    :toctree: api/
+
+    Over.order_by
+    Over.partition_by
+    OverWindowPartitionedOrdered.alias
+    OverWindowPartitionedOrdered.preceding
+    OverWindowPartitionedOrderedPreceding.alias
+    OverWindowPartitionedOrderedPreceding.following
+    OverWindowPartitioned.order_by
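+
+A follow-up sketch showing how such an over window is consumed in a select (the table ``tab``
+and its ``rowtime`` attribute are assumed):
+
+Example:
+::
+
+    >>> from pyflink.table.expressions import col, UNBOUNDED_RANGE
+    >>> w = Over.partition_by(col("a")).order_by(col("rowtime")) \
+    ...     .preceding(UNBOUNDED_RANGE).alias("w")
+    >>> tab.over_window(w).select(col("a"), col("b").avg.over(col("w")))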
