Dear Apache SeaTunnel community members,
I am Shi Desheng, a big data engineer who has followed the development of the SeaTunnel project for a long time. While using and studying the project, I found that SeaTunnel cannot parse and store data from complex scenarios such as host logs, program run logs, and other irregular log formats, which may lead potential users who need this capability to give up on SeaTunnel. I therefore carried out secondary development and built RegexParseTransform, a log parsing Transform plugin based on regular expressions that parses irregular logs into structured logs. I would like to contribute this feature to the open source community and work together with the SeaTunnel community to help SeaTunnel become a world-class open source data synchronization tool.

1. Background and Motivation

As data sources grow more complex and business requirements evolve rapidly, general-purpose data integration frameworks face many challenges in practice. In particular, SeaTunnel lacks a general parsing capability for raw logs with variable, irregular, or even deeply nested formats, such as Apache/Nginx access logs, Linux host logs, system syslogs, and custom program print logs. Yet these data are an indispensable part of enterprise data governance and real-time monitoring. Improving SeaTunnel's ability to parse irregular logs by adding a RegexParseTransform plugin would not only broaden its application scenarios, but also strengthen its competitiveness in areas such as log analytics and observability platforms.

2. Goals

- Provide a Transform plugin named RegexParseTransform that parses irregular logs into structured logs.
- Support common irregular log formats, such as:
  - Apache/Nginx access logs
  - Linux host logs
  - system syslogs
  - custom program print logs
- Support extracting multiple pieces of key information from a single record.
- Support retaining the original log.
- Support parsing an entire irregular log with a single configuration.
- Compatible with both BATCH and STREAMING job modes.
- Transparent integration with all connector-v2 pipelines (no need to modify source/sink plugins).

3. Design Overview

3.1 Plugin Configuration

transform {
  # Example raw log:
  # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
  RegexParse {
    regex_parse_field = "value"
    regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"""
    groupMap = {
      what_log_orig: 0       # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
      device_ip: 1           # 192.168.73.1
      operation_time: 2      # Apr 13 16:27:33
      device_name: 3         # asap91
      operation_command: 4   # su
      how_op_res: 5          # FAILED
      slave_account_name: 6  # asap
    }
  }
}

- regex_parse_field: the upstream field whose value will be parsed.
- regex: the regular expression used for matching.
- groupMap: the mapping between output field names and regex capture group indexes (index 0 is the entire match).

3.2 Execution Logic

When starting a job, RegexParseTransform will:
- Read the three core parameters from the configuration and resolve the index of the regex_parse_field field in the upstream row.

On receiving a record, RegexParseTransform will:
- Retrieve the value of regex_parse_field, apply the regular expression, and validate the match.
- Traverse the groupMap entries and, for each capture group index, extract the corresponding captured group content from the regex matcher.
- Assemble the extracted values with the original values and pass the resulting row downstream.

When converting data structures, RegexParseTransform will:
- Traverse the groupMap entries and, by default, set the data type of every result field to String (or convert each field to its respective type based on its data format).

This transformation is implemented by extending SeaTunnel's AbstractCatalogSupportTransform API and runs fully in parallel.
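To make the per-record extraction step above concrete, here is a minimal, self-contained Java sketch of how the configured regex and groupMap can be applied with java.util.regex. The class name RegexParseSketch and the plain Map-based row are illustrative assumptions for this email only; the actual plugin operates on SeaTunnel rows through AbstractCatalogSupportTransform.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the per-record extraction step, not the plugin source code.
public class RegexParseSketch {

    public static void main(String[] args) {
        String rawLog = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";

        String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)";

        // groupMap from the configuration: output field name -> capture group index
        // (index 0 is the entire match, which retains the original log line).
        Map<String, Integer> groupMap = new LinkedHashMap<>();
        groupMap.put("what_log_orig", 0);
        groupMap.put("device_ip", 1);
        groupMap.put("operation_time", 2);
        groupMap.put("device_name", 3);
        groupMap.put("operation_command", 4);
        groupMap.put("how_op_res", 5);
        groupMap.put("slave_account_name", 6);

        Matcher matcher = Pattern.compile(regex).matcher(rawLog);

        // Extract every configured group; all values are treated as String by default.
        Map<String, String> structuredRow = new LinkedHashMap<>();
        if (matcher.find()) {
            for (Map.Entry<String, Integer> entry : groupMap.entrySet()) {
                structuredRow.put(entry.getKey(), matcher.group(entry.getValue()));
            }
        }

        // Prints device_ip = 192.168.73.1, operation_time = Apr 13 16:27:33, and so on.
        structuredRow.forEach((field, value) -> System.out.println(field + " = " + value));
    }
}

Mapping what_log_orig to group index 0 is how a configuration like the one in 3.1 keeps the original log line alongside the extracted fields.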
4. Implementation Plan

- Phase 1: Support extracting multiple pieces of key information with regular expressions.
- Phase 2: Support converting extracted fields to their respective types based on their data format.
- Phase 3: Add test coverage (unit + e2e) and documentation on the website.

5. End

I sincerely appreciate the opportunity to contribute to the Apache SeaTunnel open source project. My proposal is attached. Thank you for taking the time to review it amidst your busy schedule. I look forward to your reply.

Best regards,
Shi Desheng