What a great plugin!

gaojun2048 <gaojun2...@apache.org> wrote on Sat, May 17, 2025 at 10:50:

> This is a great enhancement for SeaTunnel that broadens its usage
> scenarios. Of course, if we can also support real-time reading of log files
> and combine it with this Transform, SeaTunnel will be able to fully cover
> log file reading and parsing.
>
> David Zollo <davidzollo...@gmail.com> wrote on Tue, May 13, 2025 at 13:45:
>
> > good suggestion.
> >
> >
> >
> > Best Regards
> >
> > ---------------
> > David
> > Linkedin: https://www.linkedin.com/in/davidzollo
> > ---------------
> >
> >
> > On Mon, May 12, 2025 at 8:59 PM 史德昇 <2870088...@qq.com.invalid> wrote:
> >
> > > No problem, I will create this issue and continue to track and
> > > complete it.
> > >
> > >
> > > ------------------ Original Message ------------------
> > > From: "dev" <fanjia1...@gmail.com>;
> > > Sent: Monday, May 12, 2025, 8:00 PM
> > > To: "dev" <dev@seatunnel.apache.org>;
> > >
> > > Subject: Re: Proposal:Add a 'RegexParseTransform' plugin in Apache
> > > SeaTunnel to parse irregular logs into structured logs
> > >
> > >
> > >
> > > Thanks Desheng! Could you create an issue to track it?
> > >
> > > 史德昇 <2870088...@qq.com.invalid> wrote on Mon, May 12, 2025 at 14:25:
> > >
> > > >
> > > > Dear Apache SeaTunnel community members:
> > > >
> > > > Hello!
> > > > I am Shi Desheng, a big data engineer who has been following the
> > > > development of the SeaTunnel project for a long time. While using and
> > > > studying the project, I found that SeaTunnel cannot parse and store
> > > > complex data such as host logs, program run logs, and other irregular
> > > > log formats. This may lead potential users who need this feature to
> > > > give up on SeaTunnel, so I did some secondary development and built
> > > > RegexParseTransform, a log-parsing plugin based on regular expressions
> > > > that can parse irregular logs into structured logs. I would like to
> > > > contribute this feature to the open source community and work with
> > > > its members to help SeaTunnel become the world's top open source data
> > > > synchronization tool.
> > > >
> > > > *1. Background and Motivation*
> > > >
> > > > As data sources grow more complex and business requirements evolve
> > > > rapidly, general-purpose data integration frameworks face many
> > > > challenges in practice. In particular, SeaTunnel lacks a general
> > > > parsing capability for raw logs with variable, irregular, or even
> > > > deeply nested formats (such as Apache/Nginx access logs, Linux host
> > > > logs, system syslogs, and custom program print logs). Yet this data
> > > > is an indispensable part of enterprise data governance and real-time
> > > > monitoring. Improving SeaTunnel's ability to parse irregular logs by
> > > > adding the *RegexParseTransform plugin* would therefore both expand
> > > > its application scenarios and make it more competitive in areas such
> > > > as log analysis and observability platforms.
> > > > ------------------------------
> > > > *2. Goals*
> > > >
> > > >    - Provide a *Transform plugin named RegexParseTransform* for
> > > >      parsing irregular logs into structured logs.
> > > >    - Support *parsing irregular logs*, such as:
> > > >       - Apache/Nginx access logs
> > > >       - Linux host logs
> > > >       - system syslogs
> > > >       - custom program print logs
> > > >    - Support *extracting multiple key fields*.
> > > >    - Support *retaining the original log*.
> > > >    - Support *parsing an entire irregular log* with a single
> > > >      configuration.
> > > >    - Compatible with both *BATCH* and *STREAMING* job modes.
> > > >    - Transparent integration with all connector-v2 pipelines (no need
> > > >      to modify source/sink plugins).
> > > >
> > > > ------------------------------
> > > > *3. Design Overview*
> > > >
> > > > *3.1 Plugin Configuration*
> > > >
> > > > transform {
> > > >   # Sample data
> > > >   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > > >   RegexParse {
> > > >     regex_parse_field = "value"
> > > >     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
> > > >     groupMap = {
> > > >       what_log_orig: 0        # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> > > >       device_ip: 1            # 192.168.73.1
> > > >       operation_time: 2       # Apr 13 16:27:33
> > > >       device_name: 3          # asap91
> > > >       operation_command: 4    # su
> > > >       how_op_res: 5           # FAILED
> > > >       slave_account_name: 6   # asap
> > > >     }
> > > >   }
> > > > }
> > > >
> > > >    - regex_parse_field: the upstream field to be parsed.
> > > >    - regex: the regular expression applied to that field.
> > > >    - groupMap: the mapping between result field names and regex
> > > >      capture group indexes.
> > > >
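For readers following along, here is a minimal Java sketch of how the regex and groupMap settings above could map onto java.util.regex. GroupMapDemo, extract, and sampleGroupMap are hypothetical names, not part of the proposal or of SeaTunnel. The pattern is adapted from the sample configuration (dots in the IP part are escaped), and returning an empty map for a non-matching line is my assumption. Group index 0 is the entire match, which is why what_log_orig: 0 yields the original log line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not SeaTunnel code: applies a regex and a
// field-name -> capture-group-index map to a single log line.
class GroupMapDemo {

    // Pattern adapted from the sample configuration, as a Java string literal.
    static final String REGEX =
        "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
        + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})"
        + "\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)";

    // Mirrors the groupMap block of the sample configuration.
    static Map<String, Integer> sampleGroupMap() {
        Map<String, Integer> groupMap = new LinkedHashMap<>();
        groupMap.put("what_log_orig", 0); // group 0 is the whole match
        groupMap.put("device_ip", 1);
        groupMap.put("operation_time", 2);
        groupMap.put("device_name", 3);
        groupMap.put("operation_command", 4);
        groupMap.put("how_op_res", 5);
        groupMap.put("slave_account_name", 6);
        return groupMap;
    }

    // Returns result field -> captured text, in groupMap order.
    // Assumption: a line that does not match yields an empty map.
    static Map<String, String> extract(
            Pattern pattern, Map<String, Integer> groupMap, String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = pattern.matcher(line);
        if (!matcher.matches()) {
            return result;
        }
        for (Map.Entry<String, Integer> entry : groupMap.entrySet()) {
            result.put(entry.getKey(), matcher.group(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";
        extract(Pattern.compile(REGEX), sampleGroupMap(), line)
            .forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```

Running main prints one "field = value" line per groupMap entry for the sample log line from the configuration comments.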
> > > > *3.2 Execution Logic*
> > > >
> > > >    - When starting a job, RegexParseTransform will:
> > > >       1. Initialize itself with the values of the three core
> > > >          parameters and the index of the regex_parse_field field.
> > > >    - On receiving a record, RegexParseTransform will:
> > > >       1. Retrieve the value of the regex_parse_field field, then
> > > >          perform regex matching and validation.
> > > >       2. Traverse the groupMap entries, take each capture group
> > > >          index, and extract the corresponding group from the regex
> > > >          matcher.
> > > >       3. Assemble the extracted values with the original values and
> > > >          pass them downstream.
> > > >    - When converting data structures, RegexParseTransform will:
> > > >       1. Traverse the groupMap entries and, by default, set the data
> > > >          type of all result fields to String (or convert them to
> > > >          their respective types based on their data format).
> > > >    - This transform is implemented by extending SeaTunnel's
> > > >      AbstractCatalogSupportTransform API and will be fully parallel.
> > > >
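The three stages above could be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: RegexParseSketch is a hypothetical class, rows are plain Object[] and the schema is a list of field names rather than SeaTunnel's real row and catalog types, and leaving the new fields null when the value is missing or does not match is my guess, since the proposal does not say what happens in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the transform, not SeaTunnel's actual API.
class RegexParseSketch {
    private final Pattern pattern;
    private final int sourceIndex;           // index of regex_parse_field
    private final int inputWidth;            // number of upstream fields
    private final int[] groupIndexes;        // capture groups, in groupMap order
    private final List<String> outputFields; // upstream fields + result fields

    // Job start: read the three parameters and resolve the source field index.
    RegexParseSketch(List<String> inputFields, String parseField,
                     String regex, Map<String, Integer> groupMap) {
        this.sourceIndex = inputFields.indexOf(parseField);
        if (sourceIndex < 0) {
            throw new IllegalArgumentException("Unknown field: " + parseField);
        }
        this.pattern = Pattern.compile(regex);
        this.inputWidth = inputFields.size();
        this.outputFields = new ArrayList<>(inputFields);
        this.groupIndexes = new int[groupMap.size()];
        int groupCount = pattern.matcher("").groupCount();
        int i = 0;
        for (Map.Entry<String, Integer> entry : groupMap.entrySet()) {
            int group = entry.getValue();
            if (group < 0 || group > groupCount) {
                throw new IllegalArgumentException("No capture group " + group);
            }
            outputFields.add(entry.getKey()); // every result field is a String
            groupIndexes[i++] = group;
        }
    }

    // Schema conversion: original fields are kept, result fields are appended.
    List<String> outputFields() {
        return outputFields;
    }

    // Per record: match the source value and append the captured groups.
    // Assumption: a null or non-matching value leaves the new fields null.
    Object[] transform(Object[] row) {
        Object[] out = Arrays.copyOf(row, inputWidth + groupIndexes.length);
        Object value = row[sourceIndex];
        if (value == null) {
            return out;
        }
        Matcher matcher = pattern.matcher(value.toString());
        if (!matcher.matches()) {
            return out;
        }
        for (int i = 0; i < groupIndexes.length; i++) {
            out[inputWidth + i] = matcher.group(groupIndexes[i]);
        }
        return out;
    }
}
```

Compiling the pattern and resolving the indexes once in the constructor keeps the per-record path down to a single match plus array writes, and because each instance only reads its own immutable state, separate instances can run in parallel without coordination.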
> > > > ------------------------------
> > > > *4. Implementation Plan*
> > > >
> > > > Task     Description
> > > > Phase 1  Support extracting multiple key fields using regular
> > > >          expressions.
> > > > Phase 2  Support converting extracted fields to their respective
> > > >          types based on data format.
> > > > Phase 3  Add test coverage (unit + e2e) and documentation on the
> > > >          website.
> > > >
> > > > *5. End*
> > > >
> > > > I sincerely appreciate the opportunity to contribute to the Apache
> > > > SeaTunnel open source project. The attachment is my proposal. Thank
> > > > you for taking the time to review it amidst your busy schedule. I
> > > > look forward to your reply.
> > > >
> > > > Best regards,
> > > >
> > > > Shi Desheng
> > > >
> >
>
>
> --
>
> Best Regards
>
> ------------
>
> Apache ID: gaojun2048
>
> Github ID: EricJoy2048
>
> Mail: gaojun2...@gmail.com
>
