[GitHub] [incubator-seatunnel] MonsterChenzhuo opened a new issue, #4565: [Proposal] ST Connector Document Open Instruction Manual

via GitHub Wed, 12 Apr 2023 23:38:15 -0700


MonsterChenzhuo opened a new issue, #4565:
URL: https://github.com/apache/incubator-seatunnel/issues/4565


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   This guidebook provides guidance and advice for writing SeaTunnel connector docs, helping you produce high-quality technical documentation.
   Doc Structure and Organization
   A connector doc needs to provide users with the following five core elements: dependencies, how to create a connector, connector parameters, features, and data type mapping.
   1. Dependency
   2. How to create a connector
   Error ❌: The demo lists many optional configuration items, so the user cannot get the basic example running.
   Correct ✅: Provide a minimal demo containing only the required configuration items at the beginning of the doc, so that users can try the connector immediately by copying the demo.
   3. Connector parameters
   The connector parameters section lists every option with a short overview, described at the level of basic information.
   Think of this section as a "dictionary" that lets users quickly look up parameters during use.
   4. Features
   The features section explains nested parameters and core functional parameters in detail. Examples are required in this section.
   5. Data type mapping
   The data types of the source database and the corresponding SeaTunnel data types are shown in a table.
   This lets users see clearly how to write the schema in their data synchronization conf file so that it maps to the structure of the data source.
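One way to make this structure checkable is a small script that scans a connector doc for the expected section headings. The sketch below is only an illustration: the `missing_sections` helper and the heading names in `REQUIRED_SECTIONS` are assumptions taken from the MongoDB example in this proposal, not an existing SeaTunnel tool.

```python
# Hypothetical heading names, taken from the MongoDB example doc in this proposal.
REQUIRED_SECTIONS = [
    "Dependencies",
    "How to create",
    "Connector Options",
    "Features",
    "Data Type Mapping",
]


def missing_sections(doc_text):
    """Return the required headings that no line of the doc starts with."""
    lines = [line.strip().lower() for line in doc_text.splitlines()]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section.lower()) for line in lines)
    ]


sample_doc = "Dependencies\n------------\n\nFeatures\n--------\n"
print(missing_sections(sample_doc))
```

A check like this could run in CI against new connector docs, but whether that is desirable is a project decision outside the scope of this sketch.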
   Example
   
https://github.com/apache/incubator-seatunnel/pull/4389/files#diff-5384eeb077736ed8c196058d29007ba814deeb01e098960edbed793db900996c
   # Apache MongoDB connector
   - [x] [batch](../../concept/connector-v2-features.md)
   - [x] [exactly-once](../../concept/connector-v2-features.md)
   - [x] [column projection](../../concept/connector-v2-features.md)
   - [x] [parallelism](../../concept/connector-v2-features.md)
   - [x] [support user-defined split](../../concept/connector-v2-features.md)
   
   The MongoDB Connector provides the ability to read and write data from and 
to MongoDB. 
   This document describes how to set up the MongoDB connector to run data 
reads against MongoDB.
   
   Dependencies
   ------------
   
   To use the MongoDB connector, the following dependencies are required.
   They can be downloaded via install-plugin.sh or from the Maven central repository.
   
   | MongoDB version | dependency |
   |-----------------|------------|
   | universal       | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-mongodb-v2) |
   
   How to create a MongoDB data synchronization job
   ------------
   
   The example below shows how to create a MongoDB data synchronization job:
   ```bash
   # Set the basic configuration of the task to be performed
   env {
     execution.parallelism = 1
     job.mode = "BATCH"
   }
   
   # Create a source to connect to MongoDB
   source {
     MongodbV2 {
       connection = "mongodb://user:[email protected]:27017"
       database = "test_db"
       collection = "source_table"
       schema = {
         fields {
           c_map = "map<string, string>"
           c_array = "array<int>"
           c_string = string
           c_boolean = boolean
           c_tinyint = tinyint
           c_smallint = smallint
           c_int = int
           c_bigint = bigint
           c_float = float
           c_double = double
           c_bytes = bytes
           c_date = date
           c_decimal = "decimal(38, 18)"
           c_timestamp = timestamp
           c_row = {
             c_map = "map<string, string>"
             c_array = "array<int>"
             c_string = string
             c_boolean = boolean
             c_tinyint = tinyint
             c_smallint = smallint
             c_int = int
             c_bigint = bigint
             c_float = float
             c_double = double
             c_bytes = bytes
             c_date = date
             c_decimal = "decimal(38, 18)"
             c_timestamp = timestamp
           }
         }
       }
     }
   }
   
   # Print the data read from MongoDB to the console
   sink {
     Console {
     }
   }
   ```
   
   Connector Options
   ----------------
   
   | Option             | Required | Forwarded | Default | Type       | Description |
   |--------------------|----------|-----------|---------|------------|-------------|
   | connector          | required | yes       | (none)  | String     | The MongoDB connection uri. |
   | database           | required | yes       | (none)  | String     | The name of the MongoDB database to read or write. |
   | collection         | required | yes       | (none)  | String     | The name of the MongoDB collection to read or write. |
   | schema             | required | no        | (none)  | String     | Mapping between MongoDB's BSON structure and the SeaTunnel data structure. |
   | matchQuery         | optional | no        | (none)  | String     | In MongoDB, $match is one of the aggregation pipeline operators, used to filter documents. |
   | projection         | optional | no        | (none)  | String     | In MongoDB, projection is used to control the fields contained in the query results. |
   | partition.strategy | optional | no        | default | String     | Specifies the partition strategy. Available strategies are `single`, `sample`, `split-vector`, `sharded` and `default`. See the following Partitioned Scan section for more details. |
   | partition.size     | optional | no        | 64mb    | MemorySize | Specifies the partition memory size. |
   | partition.samples  | optional | no        | 10      | Integer    | Specifies the number of samples per partition. It only takes effect when the partition strategy is `sample`. The sample partitioner samples the collection, then projects and sorts by the partition fields. It then uses every `partition.samples`-th sampled value to calculate the partition boundaries. The total number of samples taken is calculated as: `samples per partition * (count of documents / number of documents per partition)`. |
   | no-timeout         | optional | no        | true    | Boolean    | MongoDB server normally times out idle cursors after an inactivity period (10 minutes) to prevent excess memory use. Set this option to true to prevent that. However, if the application takes longer than 30 minutes to process the current batch of documents, the session is marked as expired and closed. |
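
As a worked illustration of the sample-count formula in the `partition.samples` row, the sketch below evaluates `samples per partition * (count of documents / number of documents per partition)` for made-up numbers. The function name `total_samples`, the input values, and the rounding up are assumptions of this sketch; the proposal does not say how the connector rounds.

```python
import math


def total_samples(samples_per_partition, document_count, documents_per_partition):
    """samples per partition * (count of documents / documents per partition)."""
    partitions = document_count / documents_per_partition
    # Rounding up is an assumption of this sketch, not documented behaviour.
    return math.ceil(samples_per_partition * partitions)


# 1,000,000 documents at 50,000 documents per partition gives 20 partitions,
# so the default of 10 samples per partition yields 200 samples in total.
print(total_samples(10, 1_000_000, 50_000))  # 200
```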
   
   Features
   ----------------
   
   **MatchQuery Scan**
   
   In MongoDB, $match is one of the aggregation pipeline operators used to 
filter documents. Its position in the pipeline determines when documents are 
filtered.
   $match uses MongoDB's standard query operators to filter data. Basically, it 
can be thought of as the "WHERE" clause in the aggregation pipeline.
   
   Here is a simple $match example, assuming we have a collection called orders and want to keep only the documents whose status field is "A":
   ```bash
   db.orders.aggregate([
     {
       $match: {
         status: "A"
       }
     }
   ]);
   
   ```
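
To make the filtering semantics concrete without a running MongoDB server, the sketch below mimics an equality-only `$match` over plain Python dictionaries. The `match` helper and the sample `orders` data are hypothetical and cover only exact-value comparisons, not MongoDB's full set of query operators.

```python
def match(documents, query):
    """Keep documents whose fields equal every value in `query` (equality only)."""
    return [
        doc
        for doc in documents
        if all(doc.get(field) == value for field, value in query.items())
    ]


orders = [
    {"_id": 1, "status": "A"},
    {"_id": 2, "status": "B"},
    {"_id": 3, "status": "A"},
]

# Only the documents with status "A" pass the filter.
print(match(orders, {"status": "A"}))
```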
   In data synchronization scenarios, matchQuery should be applied early to reduce the number of documents that subsequent operators need to process, thus improving performance.
   Here is a simple SeaTunnel example that uses $match:
   ```bash
   source {
     MongoDB {
       uri = "mongodb://user:[email protected]:27017"
       database = "test_db"
       collection = "orders"
       matchQuery = "{status: \"A\"}"
       schema = {
         fields {
           id = bigint
           status = string
         }
       }
     }
   }
   ```
   
   **Projection Scan**
   
   In MongoDB, Projection is used to control which fields are included in the 
query results. This can be accomplished by specifying which fields need to be 
returned and which fields do not.
   In the find() method, a projection object can be passed as the second argument. The keys of the projection object name the fields to include or exclude; a value of 1 indicates inclusion and 0 indicates exclusion.
   Here is a simple example, assuming we have a collection named users:
   ```bash
   // Returns only the name and email fields
   db.users.find({}, { name: 1, email: 1 });
   ```
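
The inclusion and exclusion rules can be sketched over plain Python dictionaries as follows. The `apply_projection` helper and the sample document are hypothetical; the sketch handles only flat fields and the 1/0 values described above, and it keeps `_id` in inclusion mode unless it is explicitly excluded, which is MongoDB's default behaviour.

```python
def apply_projection(document, projection):
    """Apply a flat include (1) / exclude (0) projection to one document."""
    include = {key for key, flag in projection.items() if flag == 1}
    exclude = {key for key, flag in projection.items() if flag == 0}
    if include:
        # Inclusion mode: keep the listed fields, plus _id unless it is excluded.
        keep = set(include)
        if "_id" not in exclude:
            keep.add("_id")
        return {key: value for key, value in document.items() if key in keep}
    # Exclusion mode: drop only the listed fields.
    return {key: value for key, value in document.items() if key not in exclude}


user = {"_id": 1, "name": "Ann", "email": "ann@example.org", "age": 30}
print(apply_projection(user, {"name": 1, "email": 1}))
print(apply_projection(user, {"age": 0}))
```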
   In data synchronization scenarios, projection should be applied early to reduce the number of fields that subsequent operators need to process, thus improving performance.
   Here is a simple SeaTunnel example that uses projection:
   ```bash
   source {
     MongoDB {
       uri = "mongodb://user:[email protected]:27017"
       database = "test_db"
       collection = "users"
       projection = "{ name: 1, email: 1 }"
       schema = {
         fields {
           name = string
           email = string
         }
       }
     }
   }
   
   ```
   
   **Partitioned Scan**
   
   To speed up reading data in parallel source task instances, SeaTunnel provides a partitioned scan feature for MongoDB collections. The following partitioning strategies are provided:
   - single: treats the entire collection as a single partition.
   - sample: samples the collection and generates partitions, which is fast but possibly uneven.
   - split-vector: uses the splitVector command to generate partitions for non-sharded collections, which is fast and even. The splitVector permission is required.
   - sharded: reads config.chunks directly as the partitions (MongoDB splits a sharded collection into chunks, and the chunk ranges are stored in that collection). The sharded strategy applies only to sharded collections and is fast and even. Read permission on the config database is required.
   - default: uses the sharded strategy for sharded collections and the split-vector strategy otherwise.
   ```bash
   source {
     MongoDB {
       uri = "mongodb://user:[email protected]:27017"
       database = "test_db"
       collection = "users"
       partition.strategy = sample
       partition.samples = 100
       schema = {
         fields {
           id = bigint
           status = string
         }
       }
     }
   }
   ```
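
The way the `default` strategy falls back to one of the concrete strategies can be sketched as below. The `resolve_strategy` function and its `collection_is_sharded` argument are hypothetical names for illustration; how the connector actually detects a sharded collection is not covered by this proposal.

```python
STRATEGIES = {"single", "sample", "split-vector", "sharded", "default"}


def resolve_strategy(strategy, collection_is_sharded):
    """Map the configured strategy name to the strategy that would actually run."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown partition strategy: {strategy!r}")
    if strategy == "default":
        # default: sharded for sharded collections, split-vector otherwise.
        return "sharded" if collection_is_sharded else "split-vector"
    return strategy


print(resolve_strategy("default", collection_is_sharded=True))
print(resolve_strategy("default", collection_is_sharded=False))
print(resolve_strategy("sample", collection_is_sharded=False))
```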
   
   Data Type Mapping
   ----------------
   
   The following table lists the field data type mapping from MongoDB BSON types to SeaTunnel data types.
   
   | MongoDB BSON type | SeaTunnel type   |
   |-------------------|------------------|
   | ObjectId          | STRING           |
   | String            | STRING           |
   | Boolean           | BOOLEAN          |
   | Binary            | BINARY           |
   | Int32             | INTEGER          |
   | Int64             | BIGINT           |
   | Double            | DOUBLE           |
   | Decimal128        | DECIMAL          |
   | DateTime          | TIMESTAMP_LTZ(3) |
   | Timestamp         | TIMESTAMP_LTZ(0) |
   | Object            | ROW              |
   | Array             | ARRAY            |
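
For readers who want the mapping in a machine-readable form, the table above can be transcribed into a lookup like the one below. The `BSON_TO_SEATUNNEL` dictionary and the `seatunnel_type` helper are hypothetical names; the entries simply restate the rows of the table and are not read from the connector's source code.

```python
# Transcription of the mapping table above; not taken from connector code.
BSON_TO_SEATUNNEL = {
    "ObjectId": "STRING",
    "String": "STRING",
    "Boolean": "BOOLEAN",
    "Binary": "BINARY",
    "Int32": "INTEGER",
    "Int64": "BIGINT",
    "Double": "DOUBLE",
    "Decimal128": "DECIMAL",
    "DateTime": "TIMESTAMP_LTZ(3)",
    "Timestamp": "TIMESTAMP_LTZ(0)",
    "Object": "ROW",
    "Array": "ARRAY",
}


def seatunnel_type(bson_type):
    """Look up the listed type for a BSON type name, failing loudly if unlisted."""
    try:
        return BSON_TO_SEATUNNEL[bson_type]
    except KeyError:
        raise ValueError(f"no mapping listed for BSON type {bson_type!r}") from None


print(seatunnel_type("Int64"))
```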
   
   Format and layout
   1. Title
   Dependencies
   ------------
   2. Sub-headings
   **Partitioned Scan**
   3. Table display
   
   | MongoDB BSON type | SeaTunnel type   |
   |-------------------|------------------|
   | ObjectId          | STRING           |
   | String            | STRING           |
   
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


