[jira] [Commented] (NIFI-4789) Enhance ExtractGrok processor to handle multiple grok expressions

ASF GitHub Bot (JIRA) Mon, 29 Jan 2018 20:04:23 -0800

    [ 
https://issues.apache.org/jira/browse/NIFI-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344468#comment-16344468
 ]


ASF GitHub Bot commented on NIFI-4789:
--------------------------------------

Github user charlesporter commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2411#discussion_r164636057
  
    --- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java
 ---
    @@ -107,31 +120,70 @@
             .build();
     
         public static final PropertyDescriptor CHARACTER_SET = new 
PropertyDescriptor.Builder()
    -        .name("Character Set")
    +        .name(CHARACTER_SET_KEY)
             .description("The Character Set in which the file is encoded")
             .required(true)
             .addValidator(StandardValidators.CHARACTER_SET_VALIDATOR)
             .defaultValue("UTF-8")
             .build();
     
         public static final PropertyDescriptor MAX_BUFFER_SIZE = new 
PropertyDescriptor.Builder()
    -        .name("Maximum Buffer Size")
    +        .name(MAXIMUM_BUFFER_SIZE_KEY)
             .description("Specifies the maximum amount of data to buffer (per 
file) in order to apply the Grok expressions. Files larger than the specified 
maximum will not be fully evaluated.")
             .required(true)
             .addValidator(StandardValidators.DATA_SIZE_VALIDATOR)
             .addValidator(StandardValidators.createDataSizeBoundsValidator(0, 
Integer.MAX_VALUE))
             .defaultValue("1 MB")
             .build();
     
    -    public static final PropertyDescriptor NAMED_CAPTURES_ONLY = new 
PropertyDescriptor.Builder()
    -        .name("Named captures only")
    -        .description("Only store named captures from grok")
    +     public static final PropertyDescriptor NAMED_CAPTURES_ONLY = new 
PropertyDescriptor.Builder()
    +        .name(NAMED_CAPTURES_ONLY_KEY)
    +        .description("Only store named captures from grokList")
             .required(true)
             .allowableValues("true", "false")
             .addValidator(StandardValidators.BOOLEAN_VALIDATOR)
             .defaultValue("false")
             .build();
     
    +    public static final PropertyDescriptor BREAK_ON_FIRST_MATCH = new 
PropertyDescriptor.Builder()
    +        .name(SINGLE_MATCH_KEY)
    +        .description("Stop on first matched expression.")
    +        .required(true)
    +        .allowableValues("true", "false")
    +        .addValidator(StandardValidators.BOOLEAN_VALIDATOR)
    +        .defaultValue("true")
    +        .build();
    +
    +    public static final PropertyDescriptor RESULT_PREFIX = new 
PropertyDescriptor.Builder()
    +        .name(RESULT_PREFIX_KEY)
    +        .description("Value to prefix attribute results with (avoid 
collisions with existing properties)" +
    +                "\n\t (Does not apply when results returned as content)" +
    +                "\n\t (May be empty, the dot (.) separator is not 
implied)")
    +        .required(true)
    +        .defaultValue("grok.")
    +        .addValidator(Validator.VALID)
    +        .build();
    +
    +
    +    public static final PropertyDescriptor EXPRESSION_SEPARATOR = new 
PropertyDescriptor.Builder()
    --- End diff --
    
    I now recall my reasoning on this. The common way to use grok is to have 
most of the regexes in one or more grok pattern files, where they are easier to 
test than in an attribute property box. The expressions are then referenced by 
name in the expression passed in for evaluation, as in <q>%{EX_1}</q> or even 
<q>%{EX_1} %{EX_2}</q>. If someone really does want to write an interesting regex 
inline, the delimiter can be multiple characters, so it is easy to make a 
unique one like "<<DELIMETER>>". If I wanted to define a set of regexes 
locally, I could also use the ExtractText processor. 
    On the other hand, if we put each expression in a separate 
attribute, the attributes need to be named in a way that can be sorted, since 
we commit to an order of evaluation. If the user wants to insert a new 
expression into the middle of the list, they have to (on average) recreate 
about half of the attributes with new names. In my own use case of building a 
classifier, I am often tweaking the list, so having to recreate my attributes 
would be a nuisance.
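
To make the renaming cost concrete, here is a minimal sketch. It assumes a hypothetical naming scheme (`grok.expression.NNN`, zero-padded so that lexicographic order equals evaluation order) and distinct expressions; neither the class nor the property names come from the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedPropertySketch {

    // Hypothetical naming scheme: a zero-padded index keeps the
    // lexicographic key order identical to the evaluation order.
    static String key(int index) {
        return String.format("grok.expression.%03d", index);
    }

    // Build the name-sorted property map from expressions in evaluation order.
    static TreeMap<String, String> toProperties(List<String> expressions) {
        TreeMap<String, String> props = new TreeMap<>();
        for (int i = 0; i < expressions.size(); i++) {
            props.put(key(i), expressions.get(i));
        }
        return props;
    }

    // Count how many existing property names end up holding a different
    // expression after inserting 'added' at 'position' (expressions assumed distinct).
    static int renamesNeeded(List<String> expressions, int position, String added) {
        TreeMap<String, String> before = toProperties(expressions);
        List<String> updated = new ArrayList<>(expressions);
        updated.add(position, added);
        TreeMap<String, String> after = toProperties(updated);

        int renamed = 0;
        for (Map.Entry<String, String> e : before.entrySet()) {
            if (!e.getValue().equals(after.get(e.getKey()))) {
                renamed++;
            }
        }
        return renamed;
    }
}
```

Under this scheme, inserting at position k in a list of n expressions touches n - k existing properties, which averages out to roughly n / 2 over all insert positions.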
    
    PS: I will make the delimiter optional, with no default.
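
For illustration, a minimal sketch of the delimiter-based approach: split one property value on an optional, possibly multi-character delimiter, then evaluate the expressions in list order and optionally stop at the first match. The class and method names are hypothetical, and plain `java.util.regex` is used as a stand-in for grok compilation; this is not the code in the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ExpressionListSketch {

    // Split the raw property value on a literal delimiter.
    // A null or empty delimiter means the whole value is a single expression.
    static List<String> splitExpressions(String raw, String delimiter) {
        List<String> expressions = new ArrayList<>();
        if (delimiter == null || delimiter.isEmpty()) {
            expressions.add(raw);
            return expressions;
        }
        // Pattern.quote treats the delimiter as literal text, so characters
        // such as '|' or '.' in it are not interpreted as regex syntax.
        for (String part : raw.split(Pattern.quote(delimiter))) {
            if (!part.isEmpty()) {
                expressions.add(part);
            }
        }
        return expressions;
    }

    // Return the indexes of the expressions that match, in list order.
    // Assumption: a plain regex find() stands in for a grok match here.
    static List<Integer> matchingIndexes(List<String> expressions,
                                         String input,
                                         boolean breakOnFirstMatch) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < expressions.size(); i++) {
            if (Pattern.compile(expressions.get(i)).matcher(input).find()) {
                hits.add(i);
                if (breakOnFirstMatch) {
                    break;
                }
            }
        }
        return hits;
    }
}
```

With this shape, inserting a new expression in the middle of the list is a single edit to the property value, and the evaluation order is simply the order in which the expressions appear.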


> Enhance ExtractGrok processor to handle multiple grok expressions
> -----------------------------------------------------------------
>
>                 Key: NIFI-4789
>                 URL: https://issues.apache.org/jira/browse/NIFI-4789
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.5.0
>         Environment: all
>            Reporter: Charles Porter
>            Priority: Minor
>              Labels: features
>
> Many flows require running several grok expressions against an input to 
> correctly tag and extract data. Using many separate grok processors to 
> accomplish this is unwieldy and hard to maintain. Supporting multiple grok 
> expressions delimited by a comma or a user-selected delimiter greatly simplifies 
> this. 
> The feature is coded and tested, and ready for a pull request if the feature is approved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
