This is a great enhancement for SeaTunnel that broadens its usage scenarios.
If we can also support real-time reading of log files and combine it with
this Transform, SeaTunnel will be able to fully cover log file reading and
parsing scenarios.

David Zollo <davidzollo...@gmail.com> wrote on Tue, May 13, 2025 at 13:45:

> good suggestion.
>
>
>
> Best Regards
>
> ---------------
> David
> Linkedin: https://www.linkedin.com/in/davidzollo
> ---------------
>
>
> On Mon, May 12, 2025 at 8:59 PM 史德昇 <2870088...@qq.com.invalid> wrote:
>
> > No problem, I will create this issue and continue to track and complete
> > it.
> >
> >
> > ------------------ Original Message ------------------
> > From: "dev" <fanjia1...@gmail.com>;
> > Sent: Monday, May 12, 2025, 8:00 PM
> > To: "dev" <dev@seatunnel.apache.org>;
> >
> > Subject: Re: Proposal: Add a 'RegexParseTransform' plugin in Apache
> > SeaTunnel to parse irregular logs into structured logs
> >
> >
> >
> > Thanks Desheng! Could you create an issue to track it?
> >
> > 史德昇 <2870088...@qq.com.invalid> wrote on Mon, May 12, 2025 at 14:25:
> >
> > >
> > > Dear Apache SeaTunnel community members:
> > >
> > > Hello!
> > >       I am Shi Desheng, a big data engineer who has been following the
> > > development of the SeaTunnel project for a long time. While using and
> > > studying the project, I found that SeaTunnel cannot parse and store
> > > irregular data such as host logs, program run logs, and other irregular
> > > log formats. This may lead potential users who need this feature to
> > > give up on SeaTunnel, so I did some secondary development and built
> > > RegexParseTransform, a log parsing plugin based on regular expressions
> > > that parses irregular logs into structured logs. I would like to
> > > contribute this feature to the open source community and work with
> > > SeaTunnel community members to help it become the world's top open
> > > source data synchronization tool.
> > &gt;
> > > *1. Background and Motivation*
> > >
> > > As data sources grow more complex and business requirements evolve
> > > rapidly, general-purpose data integration frameworks face many
> > > challenges in practice. In particular, SeaTunnel lacks a general
> > > parsing capability for raw logs with variable, irregular, or even
> > > deeply nested formats (such as Apache/Nginx access logs, Linux host
> > > logs, system syslogs, and custom program logs). Yet this data is an
> > > indispensable part of enterprise data governance and real-time
> > > monitoring. Adding the *RegexParseTransform plugin* to improve
> > > SeaTunnel's ability to parse irregular logs would therefore both
> > > expand its application scenarios and make it more competitive in areas
> > > such as log analysis and observability platforms.
> > > ------------------------------
> > > *2. Goals*
> > >
> > >    - Provide a *Transform plugin named RegexParseTransform* for parsing
> > >      irregular logs into structured logs.
> > >    - Support *parsing irregular logs*, such as:
> > >       - Apache/Nginx access logs
> > >       - Linux host logs
> > >       - system syslogs
> > >       - custom program logs
> > >    - Support *extracting multiple key fields*.
> > >    - Support *retaining original logs*.
> > >    - Support *parsing an entire irregular log* with a single
> > >      configuration.
> > >    - Compatible with both *BATCH* and *STREAMING* job modes.
> > >    - Transparent integration with all connector-v2 pipelines (no need to
> > >      modify source/sink plugins).
> > >
> > > ------------------------------
> > > *3. Design Overview*
> > >
> > > *3.1 Plugin Configuration*
> > >
> > > transform {
> > >   # Sample data
> > >   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > >   RegexParse {
> > >     regex_parse_field = "value"
> > >     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
> > >     groupMap = {
> > >       what_log_orig: 0        # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > >       device_ip: 1            # 192.168.73.1
> > >       operation_time: 2       # Apr 13 16:27:33
> > >       device_name: 3          # asap91
> > >       operation_command: 4    # su
> > >       how_op_res: 5           # FAILED
> > >       slave_account_name: 6   # asap
> > >     }
> > >   }
> > > }
> > >
> > >    - regex_parse_field: The upstream field to parse.
> > >    - regex: The regular expression to match against that field.
> > >    - groupMap: The mapping from result field names to regex capture
> > >      group indexes.
> > >
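
The mapping described above can be checked with a minimal sketch that applies the proposal's regex and groupMap to the sample line. It uses Python's `re` module purely for illustration (the plugin itself would presumably rely on Java's regex engine), the dots in the IP portion are escaped, and the helper name `parse_line` as well as the use of a "search" rather than a full-line match are assumptions, not part of the proposal.

```python
import re

# Pattern from the sample configuration (dots in the IP part escaped).
PATTERN = re.compile(
    r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>"
    r"(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"
)

# Result field name -> capture group index (0 is the whole match).
GROUP_MAP = {
    "what_log_orig": 0,
    "device_ip": 1,
    "operation_time": 2,
    "device_name": 3,
    "operation_command": 4,
    "how_op_res": 5,
    "slave_account_name": 6,
}


def parse_line(line):
    """Hypothetical helper: return {field: captured text}, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {name: m.group(idx) for name, idx in GROUP_MAP.items()}


sample = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
fields = parse_line(sample)
for name, value in fields.items():
    print(f"{name} = {value}")
```

Group index 0 is what lets `what_log_orig` retain the original log alongside the extracted fields.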
> > > *3.2 Execution Logic*
> > >
> > >    - When starting a job, RegexParseTransform will:
> > >       1. Construct the RegexParseTransform from its configuration,
> > >          obtaining the values of the three core parameters and the index
> > >          of the regex_parse_field field.
> > >    - On receiving a record, RegexParseTransform will:
> > >       1. Retrieve the value of regex_parse_field, then perform regex
> > >          matching and validation.
> > >       2. Traverse the groupMap entries and, for each capture group
> > >          index, extract the corresponding captured group from the regex
> > >          matcher.
> > >       3. Assemble the extracted values with the original values and pass
> > >          them downstream.
> > >    - When converting data structures, RegexParseTransform will:
> > >       1. Traverse the groupMap entries and, by default, set the data
> > >          type of all result fields to String (or convert them to their
> > >          respective types based on their data format).
> > >    - This transformation is implemented by inheriting SeaTunnel's
> > >      AbstractCatalogSupportTransform API and will be fully parallel.
> > >
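
The execution steps above can be sketched roughly as follows. This is an illustrative Python stand-in, not SeaTunnel's Java API: the class name `RegexParseSketch`, its method names, the shortened regex in the usage lines, and the policy of emitting `None` for unmatched records are all assumptions made for the example.

```python
import re


class RegexParseSketch:
    """Hypothetical stand-in for the proposed transform; not SeaTunnel's API."""

    def __init__(self, input_fields, regex_parse_field, regex, group_map):
        # Job start: resolve the source field index and compile the regex once.
        self.field_index = input_fields.index(regex_parse_field)
        self.pattern = re.compile(regex)
        self.group_map = dict(group_map)
        # Validation (assumed): each mapped index must be an existing group.
        for name, idx in self.group_map.items():
            if not 0 <= idx <= self.pattern.groups:
                raise ValueError(f"{name}: no capture group {idx}")
        # Output schema: original fields followed by the result fields,
        # which are all treated as strings by default.
        self.output_fields = list(input_fields) + list(self.group_map)

    def transform(self, row):
        # Per record: read the configured field and try to match it.
        value = row[self.field_index]
        m = self.pattern.search(value) if isinstance(value, str) else None
        if m is None:
            # Assumed policy: keep the row and fill result fields with None.
            extracted = [None] * len(self.group_map)
        else:
            extracted = [m.group(idx) for idx in self.group_map.values()]
        # Original values plus extracted values are passed downstream.
        return list(row) + extracted


line = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
t = RegexParseSketch(
    input_fields=["id", "value"],
    regex_parse_field="value",
    regex=r"(\w+):\s+(\w+)\s+(\w+)$",  # shortened pattern for the example
    group_map={"operation_command": 1, "how_op_res": 2, "slave_account_name": 3},
)
out = t.transform([1, line])
print(t.output_fields)
print(out)
```

Each record is handled independently of every other record, which is consistent with the claim that the transform can run fully in parallel.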
> > > ------------------------------
> > > *4. Implementation Plan*
> > >
> > > Task     Description
> > > Phase 1  Support extracting multiple key fields through regular
> > >          expressions.
> > > Phase 2  Support converting them into their respective types based on
> > >          data format.
> > > Phase 3  Add test coverage (unit + e2e) and documentation on the
> > >          website.
> > > *5. End*
> > >
> > >       I sincerely appreciate the opportunity to contribute to the Apache
> > > SeaTunnel open source project. The attachment is my proposal. Thank you
> > > for taking the time to review it amidst your busy schedule. Looking
> > > forward to your reply.
> > >
> > >       Best regards,
> > >
> > >       Shi Desheng
> > >
>


-- 

Best Regards

------------

Apache ID: gaojun2048

Github ID: EricJoy2048

Mail: gaojun2...@gmail.com
