What a great plugin!

On Sat, May 17, 2025 at 10:50, gaojun2048 <gaojun2...@apache.org> wrote:
> This is a great enhancement for SeaTunnel that broadens its usage
> scenarios. If we can also support real-time reading of log files and
> combine it with this Transform, SeaTunnel will be able to fully cover
> log file reading and parsing.
>
> On Tue, May 13, 2025 at 13:45, David Zollo <davidzollo...@gmail.com> wrote:
>
> > Good suggestion.
> >
> > Best Regards
> > ---------------
> > David
> > Linkedin: https://www.linkedin.com/in/davidzollo
> > ---------------
> >
> > On Mon, May 12, 2025 at 8:59 PM 史德昇 <2870088...@qq.com.invalid> wrote:
> >
> > > No problem, I will create this issue and keep tracking it through
> > > to completion.
> > >
> > > ------------------ Original Message ------------------
> > > From: "dev" <fanjia1...@gmail.com>;
> > > Sent: Monday, May 12, 2025, 8:00 PM
> > > To: "dev" <dev@seatunnel.apache.org>;
> > > Subject: Re: Proposal:Add a 'RegexParseTransform' plugin in Apache
> > > SeaTunnel to parse irregular logs into structured logs
> > >
> > > Thanks Desheng! Could you create an issue to track it?
> > >
> > > On Mon, May 12, 2025 at 14:25, 史德昇 <2870088...@qq.com.invalid> wrote:
> > >
> > > > Dear Apache SeaTunnel community members:
> > > >
> > > > Hello!
> > > > I am Shi Desheng, a big data engineer who has been following the
> > > > development of the SeaTunnel project for a long time. While using
> > > > and studying the project, I found that SeaTunnel cannot parse and
> > > > store irregular data such as host logs, program run logs, and
> > > > other irregular log formats. This may lead potential users who
> > > > need the feature to give up on SeaTunnel, so I did some secondary
> > > > development and built RegexParseTransform, a log parsing plugin
> > > > based on regular expressions that parses irregular logs into
> > > > structured logs.
> > > > I am willing to contribute this feature to the project and to the
> > > > open source community, and to work together with the SeaTunnel
> > > > community members to help SeaTunnel become the world's top open
> > > > source data synchronization tool.
> > > >
> > > > *1. Background and Motivation*
> > > >
> > > > With data sources growing more complex and business requirements
> > > > evolving rapidly, general-purpose data integration frameworks
> > > > face many challenges in real deployments. In particular,
> > > > SeaTunnel lacks a general parsing capability for raw logs whose
> > > > formats are variable, irregular, or even deeply nested (such as
> > > > Apache/Nginx access logs, Linux host logs, system syslogs, and
> > > > custom program print logs). Yet this data is an indispensable
> > > > part of enterprise data governance and real-time monitoring.
> > > > Improving SeaTunnel's ability to parse irregular logs by adding
> > > > the *RegexParseTransform plugin* would therefore both expand its
> > > > application scenarios and make it more competitive in areas such
> > > > as log analysis and observability platforms.
> > > > ------------------------------
> > > > *2. Goals*
> > > >
> > > >    - Provide a *Transform plugin named RegexParseTransform* for
> > > >      parsing irregular logs into structured logs.
> > > >    - Support *parsing irregular logs*, such as:
> > > >       - Apache/Nginx access logs
> > > >       - Linux host logs
> > > >       - system syslogs
> > > >       - custom program print logs
> > > >    - Support *extraction of multiple key fields*.
> > > >    - Support *retaining original logs*.
> > > >    - Support one configuration to *parse an entire irregular log*.
> > > >    - Compatible with both *BATCH* and *STREAMING* job modes.
> > > >    - Transparent integration with all connector-v2 pipelines (no
> > > >      need to modify source/sink plugins).
> > > >
> > > > ------------------------------
> > > > *3. Design Overview*
> > > >
> > > > *3.1 Plugin Configuration*
> > > >
> > > >     transform {
> > > >       # Sample data:
> > > >       # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > > >       RegexParse {
> > > >         regex_parse_field = "value"
> > > >         regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
> > > >         groupMap = {
> > > >           what_log_orig: 0       # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > > >           device_ip: 1           # 192.168.73.1
> > > >           operation_time: 2      # Apr 13 16:27:33
> > > >           device_name: 3         # asap91
> > > >           operation_command: 4   # su
> > > >           how_op_res: 5          # FAILED
> > > >           slave_account_name: 6  # asap
> > > >         }
> > > >       }
> > > >     }
> > > >
> > > >    - regex_parse_field: the upstream field that needs to be parsed.
> > > >    - regex: the regular expression.
> > > >    - groupMap: the mapping between result fields and regex capture
> > > >      group indexes.
> > > >
> > > > *3.2 Execution Logic*
> > > >
> > > >    - When starting a job, RegexParseTransform will:
> > > >       1. Read the values of the three core parameters and resolve
> > > >          the index of the regex_parse_field field.
> > > >    - On receiving a record, RegexParseTransform will:
> > > >       1. Retrieve the data value corresponding to
> > > >          regex_parse_field, then perform regex matching and
> > > >          logical verification.
> > > >       2. Traverse the entries of groupMap, take each capture
> > > >          group index, and extract the corresponding captured
> > > >          group content from the regex matcher.
> > > >       3. Assemble the extracted result values with the original
> > > >          values and pass them downstream.
> > > >    - When converting data structures, RegexParseTransform will:
> > > >       1. Traverse the entries of groupMap and, by default, set
> > > >          the data type of all result fields to String (or convert
> > > >          them to their respective types based on their data
> > > >          format).
> > > >    - This transformation is implemented by inheriting SeaTunnel's
> > > >      AbstractCatalogSupportTransform API and will be fully
> > > >      parallel.
> > > >
> > > > ------------------------------
> > > > *4. Implementation Plan*
> > > >
> > > >     Task     Description
> > > >     Phase 1  Support extracting multiple key fields via regular
> > > >              expressions.
> > > >     Phase 2  Support converting them into their respective types
> > > >              based on data format.
> > > >     Phase 3  Add test coverage (unit + e2e) and documentation on
> > > >              the website.
> > > >
> > > > *5. End*
> > > >
> > > > I sincerely appreciate the opportunity to contribute to the
> > > > Apache SeaTunnel open source project. The attachment is my
> > > > proposal. Thank you for taking the time to review it amidst your
> > > > busy schedule. Looking forward to your reply.
> > > >
> > > > Best regards,
> > > >
> > > > Shi Desheng
>
> --
> Best Regards
> ------------
> Apache ID: gaojun2048
> Github ID: EricJoy2048
> Mail: gaojun2...@gmail.com
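
For readers following the thread, the per-record logic in section 3.2 of the proposal (match the field value, walk groupMap, pull each capture group) can be illustrated with a minimal sketch in plain java.util.regex. This is not the plugin's code and does not use any SeaTunnel API: the class name RegexParseSketch, the helpers groupMap() and parse(), the choice of Matcher.find(), and returning an empty Optional for non-matching lines are all assumptions made for illustration. The pattern follows the regex in section 3.1 and escapes the dots in the IP address so that they match only a literal '.'.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; names and behavior are assumptions,
// not the RegexParseTransform implementation.
final class RegexParseSketch {

    // Regex from section 3.1 of the proposal, written as a Java string.
    static final Pattern LOG_PATTERN = Pattern.compile(
            "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                    + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})"
                    + "\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");

    // Mirrors groupMap: result field name -> capture group index.
    // Index 0 is the whole matched text, as in java.util.regex.
    static Map<String, Integer> groupMap() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("what_log_orig", 0);
        map.put("device_ip", 1);
        map.put("operation_time", 2);
        map.put("device_name", 3);
        map.put("operation_command", 4);
        map.put("how_op_res", 5);
        map.put("slave_account_name", 6);
        return map;
    }

    // Matches one log line and extracts every configured group as a String.
    // Returns Optional.empty() when the line does not match (an assumption;
    // the proposal does not say how non-matching records are handled).
    static Optional<Map<String, String>> parse(
            String line, Pattern pattern, Map<String, Integer> groupMap) {
        Matcher matcher = pattern.matcher(line);
        if (!matcher.find()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : groupMap.entrySet()) {
            int index = entry.getValue();
            // Reject indexes the regex cannot satisfy instead of failing later.
            if (index < 0 || index > matcher.groupCount()) {
                throw new IllegalArgumentException(
                        "No capture group " + index + " for field " + entry.getKey());
            }
            fields.put(entry.getKey(), matcher.group(index));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        String sample = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";
        parse(sample, LOG_PATTERN, groupMap())
                .ifPresent(fields -> fields.forEach(
                        (name, value) -> System.out.println(name + " = " + value)));
    }
}
```

All extracted values stay as String here, which matches the default described in section 3.2. Type conversion (Phase 2 of the plan) is left out of the sketch.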