MonsterChenzhuo opened a new issue, #4565: URL: https://github.com/apache/incubator-seatunnel/issues/4565
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

This guidebook provides guidance and advice for writing SeaTunnel connector docs, to help you write high-quality technical docs.

Doc Structure and Organization

A connector doc needs to provide users with the following five core elements: dependencies, how to create a connector, connector parameters, features, and data type mapping.

1. Dependency

2. How to create a connector

   Error ❌: The demo lists a bunch of optional configuration items, so the user cannot simply run a common example.

   Correct ✅: Provide a minimal demo with the minimum set of configuration items at the beginning of the doc, so that users can try the connector immediately by copying the demo.

3. Connector parameters

   Connector parameters are a complete list and overview of all options, described at the level of basic information. Think of this element as a "dictionary" that users consult to quickly look up parameters while using the connector.

4. Features

   The features section explains nested parameters and core function parameters in detail. Examples must be provided in this section.

5. Data type mapping

   Show the data types of the data source and the corresponding SeaTunnel data types in a table, so that users can clearly see how to write the `schema` in their data synchronization config file to map the structure of the source data.

Example

https://github.com/apache/incubator-seatunnel/pull/4389/files#diff-5384eeb077736ed8c196058d29007ba814deeb01e098960edbed793db900996c

# Apache MongoDB connector

- [x] [batch](../../concept/connector-v2-features.md)
- [x] [exactly-once](../../concept/connector-v2-features.md)
- [x] [column projection](../../concept/connector-v2-features.md)
- [x] [parallelism](../../concept/connector-v2-features.md)
- [x] [support user-defined split](../../concept/connector-v2-features.md)

The MongoDB Connector provides the ability to read data from and write data to MongoDB.
This document describes how to set up the MongoDB connector to read data from MongoDB.

Dependencies
------------

To use the MongoDB connector, the following dependencies are required.
They can be downloaded via install-plugin.sh or from the Maven central repository.

| MongoDB version | dependency |
|-----------------|------------|
| universal       | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-mongodb-v2) |

How to create a MongoDB data synchronization job
------------

The example below shows how to create a MongoDB data synchronization job:

```hocon
# Set the basic configuration of the task to be performed
env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

# Create a source to connect to MongoDB
source {
  MongodbV2 {
    connection = "mongodb://user:[email protected]:27017"
    database = "test_db"
    collection = "source_table"
    schema = {
      fields {
        c_map = "map<string, string>"
        c_array = "array<int>"
        c_string = string
        c_boolean = boolean
        c_tinyint = tinyint
        c_smallint = smallint
        c_int = int
        c_bigint = bigint
        c_float = float
        c_double = double
        c_bytes = bytes
        c_date = date
        c_decimal = "decimal(38, 18)"
        c_timestamp = timestamp
        c_row = {
          c_map = "map<string, string>"
          c_array = "array<int>"
          c_string = string
          c_boolean = boolean
          c_tinyint = tinyint
          c_smallint = smallint
          c_int = int
          c_bigint = bigint
          c_float = float
          c_double = double
          c_bytes = bytes
          c_date = date
          c_decimal = "decimal(38, 18)"
          c_timestamp = timestamp
        }
      }
    }
  }
}

# Print the data read from MongoDB to the console
sink {
  Console {
  }
}
```

Connector Options
----------------

| Option | Required | Forwarded | Default | Type | Description |
|--------------------|----------|-----------|---------|------------|-------------|
| connector | required | yes | (none) | String | The MongoDB connection uri. |
| database | required | yes | (none) | String | The name of the MongoDB database to read or write. |
| collection | required | yes | (none) | String | The name of the MongoDB collection to read or write. |
| schema | required | no | (none) | String | The mapping between MongoDB's BSON structure and the SeaTunnel data structure. |
| matchQuery | optional | no | (none) | String | In MongoDB, `$match` is one of the aggregation pipeline operators, used to filter documents. |
| projection | optional | no | (none) | String | In MongoDB, projection is used to control the fields contained in the query results. |
| partition.strategy | optional | no | default | String | Specifies the partition strategy. Available strategies are `single`, `sample`, `split-vector`, `sharded` and `default`. See the Partitioned Scan section below for more details. |
| partition.size | optional | no | 64mb | MemorySize | Specifies the partition memory size. |
| partition.samples | optional | no | 10 | Integer | Specifies the number of samples per partition. It only takes effect when the partition strategy is `sample`. The sample partitioner samples the collection, then projects and sorts by the partition fields. It then uses every `partition.samples`-th sampled value to calculate the partition boundaries. The total number of samples taken is calculated as: `samples per partition * (count of documents / number of documents per partition)`. |
| no-timeout | optional | no | true | Boolean | The MongoDB server normally times out idle cursors after an inactivity period (10 minutes) to prevent excess memory use. Set this option to true to prevent that. However, if the application takes longer than 30 minutes to process the current batch of documents, the session is marked as expired and closed. |

Features
----------------

**MatchQuery Scan**

In MongoDB, `$match` is one of the aggregation pipeline operators used to filter documents.
Its position in the pipeline determines when documents are filtered.
`$match` uses MongoDB's standard query operators to filter data. It can be thought of as the "WHERE" clause of the aggregation pipeline.

Here is a simple `$match` example, assuming we have a collection called orders and want to keep only the documents whose status field is "A":

```javascript
db.orders.aggregate([
  { $match: { status: "A" } }
]);
```

In data synchronization scenarios, matchQuery should be applied as early as possible to reduce the number of documents that subsequent operators need to process, thus improving performance.
Here is a simple example of SeaTunnel using `$match`:

```hocon
source {
  MongoDB {
    uri = "mongodb://user:[email protected]:27017"
    database = "test_db"
    collection = "orders"
    matchQuery = "{ status: \"A\" }"
    schema = {
      fields {
        id = bigint
        status = string
      }
    }
  }
}
```

**Projection Scan**

In MongoDB, projection is used to control which fields are included in the query results. This is done by specifying which fields should be returned and which should not.
In the find() method, a projection object can be passed as the second argument. Each key of the projection object names a field to include or exclude: a value of 1 means inclusion and a value of 0 means exclusion.
Here is a simple example, assuming we have a collection named users:

```javascript
// Returns only the name and email fields (plus _id, which is included by default)
db.users.find({}, { name: 1, email: 1 });
```

In data synchronization scenarios, projection should be applied as early as possible to reduce the number of fields that subsequent operators need to process, thus improving performance.
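
The include/exclude rule described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not MongoDB's or SeaTunnel's implementation: the `apply_projection` helper and the sample `user` document are hypothetical, and only top-level fields are handled.

```python
def apply_projection(doc, projection):
    """Emulate a top-level MongoDB-style projection on a plain dict.

    Hypothetical helper for illustration only: no nested paths, no operators.
    """
    # _id follows its own rule: it is kept unless explicitly set to 0.
    keep_id = projection.get("_id", 1) == 1
    flags = {k: v for k, v in projection.items() if k != "_id"}

    if flags and all(v == 1 for v in flags.values()):
        # Inclusion mode: return only the listed fields.
        result = {k: doc[k] for k in flags if k in doc}
    elif all(v == 0 for v in flags.values()):
        # Exclusion mode (or no field flags at all): drop the listed fields.
        result = {k: v for k, v in doc.items() if k not in flags and k != "_id"}
    else:
        raise ValueError("cannot mix inclusion and exclusion (except for _id)")

    if keep_id and "_id" in doc:
        result = {"_id": doc["_id"], **result}
    return result


# Sample document (made up for this sketch).
user = {"_id": 1, "name": "Ada", "email": "ada@example.com", "age": 36}

included = apply_projection(user, {"name": 1, "email": 1})
excluded = apply_projection(user, {"age": 0})
print(included)
print(excluded)
```

Both calls keep `_id`, `name` and `email` for this document: the first by listing the fields to include, the second by listing the field to exclude.
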
Here is a simple example of SeaTunnel using projection:

```hocon
source {
  MongoDB {
    uri = "mongodb://user:[email protected]:27017"
    database = "test_db"
    collection = "users"
    projection = "{ name: 1, email: 1 }"
    schema = {
      fields {
        name = string
        email = string
      }
    }
  }
}
```

**Partitioned Scan**

To speed up reading data in parallel source task instances, SeaTunnel provides a partitioned scan feature for MongoDB collections. The following partitioning strategies are provided.

- single: treats the entire collection as a single partition.
- sample: samples the collection and generates partitions, which is fast but possibly uneven.
- split-vector: uses the splitVector command to generate partitions for non-sharded collections, which is fast and even. The splitVector permission is required.
- sharded: reads config.chunks (MongoDB splits a sharded collection into chunks, and the ranges of the chunks are stored in this collection) directly as the partitions. The sharded strategy is only used for sharded collections, and is fast and even. Read permission on the config database is required.
- default: uses the sharded strategy for sharded collections, and the split-vector strategy otherwise.

```hocon
source {
  MongoDB {
    uri = "mongodb://user:[email protected]:27017"
    database = "test_db"
    collection = "users"
    partition.strategy = sample
    partition.samples = 100
    schema = {
      fields {
        id = bigint
        status = string
      }
    }
  }
}
```

Data Type Mapping
----------------

The following table lists the field data type mapping from MongoDB BSON types to SeaTunnel data types.

| MongoDB BSON type | SeaTunnel type   |
|-------------------|------------------|
| ObjectId          | STRING           |
| String            | STRING           |
| Boolean           | BOOLEAN          |
| Binary            | BINARY           |
| Int32             | INTEGER          |
| Int64             | BIGINT           |
| Double            | DOUBLE           |
| Decimal128        | DECIMAL          |
| DateTime          | TIMESTAMP_LTZ(3) |
| Timestamp         | TIMESTAMP_LTZ(0) |
| Object            | ROW              |
| Array             | ARRAY            |

Format and layout

1. Title

   Dependencies
   ------------

2. Sub-headings

   **Partitioned Scan**

3. Table display

   | MongoDB BSON type | SeaTunnel type   |
   |-------------------|------------------|
   | ObjectId          | STRING           |
   | String            | STRING           |

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
