This is a great enhancement for SeaTunnel and broadens its usage scenarios. And if we can also support reading log files in real time and combine that with this Transform, SeaTunnel will be able to fully cover log file reading and parsing scenarios.
David Zollo <davidzollo...@gmail.com> wrote on Tue, May 13, 2025 at 13:45:

> good suggestion.
>
> Best Regards
> ---------------
> David
> Linkedin: https://www.linkedin.com/in/davidzollo
> ---------------
>
> On Mon, May 12, 2025 at 8:59 PM 史德昇 <2870088...@qq.com.invalid> wrote:
>
> > No problem, I will create this issue and continue to track and complete it.
> >
> > ------------------ Original Message ------------------
> > From: "dev" <fanjia1...@gmail.com>
> > Sent: Monday, May 12, 2025, 8:00 PM
> > To: "dev" <dev@seatunnel.apache.org>
> > Subject: Re: Proposal: Add a 'RegexParseTransform' plugin in Apache SeaTunnel to parse irregular logs into structured logs
> >
> > Thanks Desheng! Could you create an issue to track it?
> >
> > 史德昇 <2870088...@qq.com.invalid> wrote on Mon, May 12, 2025 at 14:25:
> > >
> > > Dear Apache SeaTunnel community members,
> > >
> > > Hello! I am Shi Desheng, a big data engineer who has been following the
> > > development of the SeaTunnel project for a long time. While using and
> > > studying the project, I found that SeaTunnel cannot parse and store data
> > > in complex scenarios such as host logs, program run logs, and other
> > > irregular log formats, which may lead potential users who need this
> > > capability to give up on SeaTunnel. I therefore did some secondary
> > > development and built a RegexParseTransform plugin that uses regular
> > > expressions to parse irregular logs into structured logs. I would like to
> > > contribute this feature to the open source community and work with the
> > > SeaTunnel community to help it become the world's top open source data
> > > synchronization tool.
> > >
> > > *1. Background and Motivation*
> > >
> > > As data sources grow more complex and business requirements evolve
> > > rapidly, general-purpose data integration frameworks face many challenges
> > > in practice. In particular, SeaTunnel lacks a general parsing capability
> > > for raw logs with variable, irregular, or deeply nested formats (such as
> > > Apache/Nginx access logs, Linux host logs, system syslogs, and custom
> > > program print logs), yet such data is an indispensable part of enterprise
> > > data governance and real-time monitoring. Improving SeaTunnel's ability
> > > to parse irregular logs by adding a *RegexParseTransform plugin* would
> > > not only expand its application scenarios but also strengthen its
> > > competitiveness in areas such as log analysis and observability platforms.
> > >
> > > *2. Goals*
> > >
> > > - Provide a *Transform plugin named RegexParseTransform* for parsing
> > >   irregular logs into structured logs.
> > > - Support *parsing irregular logs*, such as:
> > >   - Apache/Nginx access logs
> > >   - Linux host logs
> > >   - system syslogs
> > >   - custom program print logs
> > > - Support *extracting multiple pieces of key information*.
> > > - Support *retaining the original log*.
> > > - Support *parsing an entire irregular log* with a single configuration.
> > > - Compatible with both *BATCH* and *STREAMING* job modes.
> > > - Transparent integration with all connector-v2 pipelines (no need to
> > >   modify source/sink plugins).
> > >
> > > *3. Design Overview*
> > >
> > > *3.1 Plugin Configuration*
> > >
> > > transform {
> > >   # Sample data:
> > >   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > >   RegexParse {
> > >     regex_parse_field = "value"
> > >     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"""
> > >     groupMap = {
> > >       what_log_orig: 0       # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > >       device_ip: 1           # 192.168.73.1
> > >       operation_time: 2      # Apr 13 16:27:33
> > >       device_name: 3         # asap91
> > >       operation_command: 4   # su
> > >       how_op_res: 5          # FAILED
> > >       slave_account_name: 6  # asap
> > >     }
> > >   }
> > > }
> > >
> > > - regex_parse_field: the upstream field to be parsed.
> > > - regex: the regular expression used for matching.
> > > - groupMap: the mapping between result field names and regex capture
> > >   group indexes.
> > >
> > > *3.2 Execution Logic*
> > >
> > > - When starting a job, RegexParseTransform will:
> > >   1. Read the three core parameters (regex_parse_field, regex, groupMap)
> > >      and resolve the index of the regex_parse_field field in the
> > >      upstream row.
> > > - On receiving a record, RegexParseTransform will:
> > >   1. Retrieve the value of regex_parse_field, run the regular expression
> > >      match, and validate the result.
> > >   2. Iterate over the groupMap entries and, for each capture group index,
> > >      extract the corresponding group content from the matcher.
> > >   3. Assemble the extracted values with the original values and pass
> > >      them downstream.
> > > - When converting data structures, RegexParseTransform will:
> > >   1. Iterate over the groupMap fields and, by default, set the type of
> > >      all result fields to String (or convert them to their respective
> > >      types based on their data format).
> > > - The transform is implemented by extending SeaTunnel's
> > >   AbstractCatalogSupportTransform and runs fully in parallel.
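
[A note from me, not part of Desheng's proposal.] To make the configuration
and the per-record flow above a bit more concrete, here is a rough, untested
sketch using plain java.util.regex against the sample line from section 3.1.
The class and variable names below are mine for illustration only and are not
taken from the actual patch; the real plugin would of course plug into
SeaTunnel's transform API rather than a main() method.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParseSketch {

    public static void main(String[] args) {
        // Sample log line and regex taken from the proposed configuration above.
        String value = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";
        Pattern pattern = Pattern.compile(
                "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                        + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");

        // groupMap from the configuration: result field name -> capture group index.
        // Group 0 is the entire match, so the original log line is retained.
        Map<String, Integer> groupMap = new LinkedHashMap<>();
        groupMap.put("what_log_orig", 0);
        groupMap.put("device_ip", 1);
        groupMap.put("operation_time", 2);
        groupMap.put("device_name", 3);
        groupMap.put("operation_command", 4);
        groupMap.put("how_op_res", 5);
        groupMap.put("slave_account_name", 6);

        // Per-record flow described in 3.2: match, validate, then extract each mapped group.
        Matcher matcher = pattern.matcher(value);
        if (!matcher.matches()) {
            System.out.println("No match; the record would need pass-through or error handling.");
            return;
        }

        // All result fields default to String, as the proposal describes.
        Map<String, String> structured = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : groupMap.entrySet()) {
            structured.put(entry.getKey(), matcher.group(entry.getValue()));
        }

        structured.forEach((field, val) -> System.out.println(field + " = " + val));
    }
}

Running it prints device_ip = 192.168.73.1, operation_time = Apr 13 16:27:33,
operation_command = su, and so on, which lines up with the comments next to
the groupMap entries above.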
> > > *4. Implementation Plan*
> > >
> > > Phase 1: Support extracting multiple pieces of key information via
> > >   regular expressions.
> > > Phase 2: Support converting extracted fields into their respective types
> > >   based on data format.
> > > Phase 3: Add test coverage (unit + e2e) and documentation on the website.
> > >
> > > *5. End*
> > >
> > > I sincerely appreciate the opportunity to contribute to the Apache
> > > SeaTunnel open source project. The attachment is my proposal. Thank you
> > > for taking the time to review it amidst your busy schedule. Looking
> > > forward to your reply.
> > >
> > > Best regards,
> > >
> > > Shi Desheng

--
Best Regards
------------
Apache ID: gaojun2048
Github ID: EricJoy2048
Mail: gaojun2...@gmail.com