zhilinli123 commented on code in PR #4871:
URL: https://github.com/apache/seatunnel/pull/4871#discussion_r1213206574


##########
docs/en/connector-v2/source/HdfsFile.md:
##########
@@ -1,18 +1,12 @@
 # HdfsFile
 
-> Hdfs file source connector
+> Hdfs File Source Connector
 
-## Description
-
-Read data from hdfs file system.
-
-:::tip
+## Support Those Engines
 
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-
-:::
+> Spark<br/>
+> Flink<br/>
+> Seatunnel Zeta<br/>
 
 ## Key features

Review Comment:
   Same as above
   



##########
docs/en/connector-v2/source/HdfsFile.md:
##########
@@ -33,233 +27,51 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
   - [x] json
   - [x] excel
 
-## Options
-
-|           name            |  type   | required |    default value    |
-|---------------------------|---------|----------|---------------------|
-| path                      | string  | yes      | -                   |
-| file_format_type          | string  | yes      | -                   |
-| fs.defaultFS              | string  | yes      | -                   |
-| read_columns              | list    | yes      | -                   |
-| hdfs_site_path            | string  | no       | -                   |
-| delimiter                 | string  | no       | \001                |
-| parse_partition_from_path | boolean | no       | true                |
-| date_format               | string  | no       | yyyy-MM-dd          |
-| datetime_format           | string  | no       | yyyy-MM-dd HH:mm:ss |
-| time_format               | string  | no       | HH:mm:ss            |
-| kerberos_principal        | string  | no       | -                   |
-| kerberos_keytab_path      | string  | no       | -                   |
-| skip_header_row_number    | long    | no       | 0                   |
-| schema                    | config  | no       | -                   |
-| common-options            |         | no       | -                   |
-| sheet_name                | string  | no       | -                   |
-
-### path [string]
-
-The source file path.
-
-### delimiter [string]
-
-Field delimiter, used to tell connector how to slice and dice fields when reading text files
-
-default `\001`, the same as hive's default delimiter
-
-### parse_partition_from_path [boolean]
-
-Control whether parse the partition keys and values from file path
-
-For example if you read a file from path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`
-
-Every record data from file will be added these two fields:
-
-|     name      | age |
-|---------------|-----|
-| tyrantlucifer | 26  |
-
-Tips: **Do not define partition fields in schema option**
-
-### date_format [string]
-
-Date type format, used to tell connector how to convert string to date, supported as the following formats:
-
-`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
-
-default `yyyy-MM-dd`
-
-### datetime_format [string]
-
-Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats:
-
-`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`
-
-default `yyyy-MM-dd HH:mm:ss`
-
-### time_format [string]
-
-Time type format, used to tell connector how to convert string to time, supported as the following formats:
-
-`HH:mm:ss` `HH:mm:ss.SSS`
-
-default `HH:mm:ss`
-
-### skip_header_row_number [long]
-
-Skip the first few lines, but only for the txt and csv.
-
-For example, set like following:
-
-`skip_header_row_number = 2`
-
-then Seatunnel will skip the first 2 lines from source files
-
-### file_format_type [string]
-
-File type, supported as the following file types:
-
-`text` `csv` `parquet` `orc` `json` `excel`
-
-If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
-
-For example:
-
-upstream data is the following:
-
-```json
-
-{"code":  200, "data":  "get success", "success":  true}
-
-```
-
-You can also save multiple pieces of data in one file and split them by newline:
-
-```json lines
-
-{"code":  200, "data":  "get success", "success":  true}
-{"code":  300, "data":  "get failed", "success":  false}
-
-```
-
-you should assign schema as the following:
-
-```hocon
-
-schema {
-    fields {
-        code = int
-        data = string
-        success = boolean
-    }
-}
-
-```
-
-connector will generate data as the following:
-
-| code |    data     | success |
-|------|-------------|---------|
-| 200  | get success | true    |
-
-If you assign file type to `parquet` `orc`, schema option not required, connector can find the schema of upstream data automatically.
-
-If you assign file type to `text` `csv`, you can choose to specify the schema information or not.
-
-For example, upstream data is the following:
-
-```text
+## Description
 
-tyrantlucifer#26#male
+Read data from hdfs file system.
 
-```
+## Source Options
 
-If you do not assign data schema connector will treat the upstream data as the following:
+|           name            |  type   | required |    default value    | Description |

Review Comment:
   Same as above
   
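For context on the options discussed in this hunk, a minimal HdfsFile source config using them might look like the sketch below. This is an illustrative assumption, not taken from the PR: the cluster address, path, and `json` choice are placeholders, and the `schema` block mirrors the example already shown in the removed docs.

```hocon
# Hypothetical example config; namenode address and path are placeholders.
source {
  HdfsFile {
    path = "/tmp/seatunnel/demo"
    file_format_type = "json"
    fs.defaultFS = "hdfs://namenode001:9000"
    # schema is required for json so the connector knows how to parse each record
    schema {
      fields {
        code = int
        data = string
        success = boolean
      }
    }
  }
}
```

For `parquet` or `orc` sources the `schema` block would be dropped, since the connector infers the schema from the files themselves.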



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
