travis-cook-sfdc opened a new issue, #10611:
URL: https://github.com/apache/pinot/issues/10611

   According to the 
[docs](https://docs.pinot.apache.org/configuration-reference/job-specification#top-level-spec),
 `includeFileNamePattern` and `excludeFileNamePattern` are described as follows:
   
   > Only Files matching this pattern will be included from inputDirURI. Both 
glob and regex patterns are supported.
   Examples:
   Use 'glob:.avro' or 'regex:^..(avro)$' to include all avro files one level 
deep in the inputDirURI.
   Alternatively, use 'glob:*/.avro' to include all the avro files in 
inputDirURI as well as its subdirectories - bear in mind that, with this 
approach, the pattern needs to match the absolute path. You can use [Glob 
tool](https://www.digitalocean.com/community/tools/glob) or [Regex 
Tool](https://www.regextester.com/) to test out your patterns.
   
   
   A few issues here:
   
   1️⃣  The example of `regex:^..(avro)$` does not actually work.  When running 
a job with this pattern, you'll get an error like this:
   ```
   Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template 
script (your template may contain an error or be trying to use expressions not 
currently supported): startup failed:
   SimpleTemplateScript1.groovy: 1: illegal string body character after dollar 
sign;
      solution: either escape a literal dollar sign "\$5" or bracket the value 
expression "${5}" @ line 1, column 10.
      out.print("""
               ^
   
   1 error
   ```
   
   I'm assuming this is because of the templating that was introduced in #5341 
(also not documented), but job specs appear to have special handling for both 
`$`, which needs to be escaped as `\$`, and backslashes, which are automatically 
escaped to `\\`.
   
   2️⃣ Related to the above, it's not clear how someone would write a single 
backslash character in their regex.  For example, I think `.*\.parquet$` is 
impossible to express, because there is no apparent way to produce a single 
backslash: `\` turns into `\\` and `\\` stays as `\\`. 
   This issue can be worked around by using character classes and writing 
`.*[.]parquet$`, but it feels wrong.
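
   To back up the workaround, here is a minimal sketch in plain Java (not Pinot code; the class and method names are made up for illustration) that checks the character-class form and the backslash form give the same verdict under `java.util.regex`:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pinot: compares the two ways of
// writing a literal dot in a Java regex.
class EscapedDotCheck {
    // Character-class form, which needs no backslash in the job spec.
    static final Pattern WITH_CLASS = Pattern.compile(".*[.]parquet$");
    // Backslash form; "\\." in Java source is the regex \.
    static final Pattern WITH_BACKSLASH = Pattern.compile(".*\\.parquet$");

    static boolean matchesWithClass(String input) {
        return WITH_CLASS.matcher(input).matches();
    }

    static boolean sameVerdict(String input) {
        return WITH_CLASS.matcher(input).matches()
            == WITH_BACKSLASH.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-0.gz.parquet", "part-0Xparquet", "parquet"}) {
            System.out.println(name + " -> class form: " + matchesWithClass(name)
                + ", same verdict as backslash form: " + sameVerdict(name));
        }
    }
}
```

   So `[.]` is a faithful substitute for `\.`, but that only helps for characters that can be placed in a character class.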
   
   3️⃣ What flavor of regex is actually being used here?  `regextester.com` 
linked in the documentation only supports PCRE and JavaScript regex.  However, 
given that the code uses 
[PathMatcher](https://docs.oracle.com/javase/7/docs/api/java/nio/file/FileSystem.html#getPathMatcher(java.lang.String)),
 this is really Java regex, which has different syntax.  Pinot should link to 
a regex tester that will be accurate.
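
   In the meantime, patterns can be tried locally against the same JDK API. This is a minimal sketch using `FileSystems.getDefault().getPathMatcher`; the class name and the sample file names are made up, and it only shows how the JDK matcher behaves, not how Pinot invokes it:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper, not part of Pinot: evaluates a "glob:" or "regex:"
// spec against a path built from the given name elements.
class PathMatcherCheck {
    static boolean matches(String syntaxAndPattern, String first, String... more) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher(syntaxAndPattern);
        Path path = Paths.get(first, more);
        return matcher.matches(path);
    }

    public static void main(String[] args) {
        // A regex spec has to match the whole path string.
        System.out.println(matches("regex:.*[.]parquet$", "part-00000.gz.parquet"));
        // A single * in a glob does not cross directory boundaries.
        System.out.println(matches("glob:*.avro", "events.avro"));
        System.out.println(matches("glob:*.avro", "day1", "events.avro"));
        // ** does cross directory boundaries.
        System.out.println(matches("glob:**/*.avro", "day1", "events.avro"));
    }
}
```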
   
   4️⃣ Can you provide some examples of the _absolute path_ I should be 
matching against?  I've submitted an ingestion job spec that has 
`includeFileNamePattern: 
regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$`
   
   I have an s3 file with the following name at the path:
   
`s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet`
   
   According to regex101.com, this is a match using Java8 syntax: 
   https://regex101.com/r/9ZKOhm/1
   
   It's unclear to me what I'm doing wrong that's causing this pattern to not 
match.
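
   One thing that might explain it, assuming (I have not verified this in the Pinot source) that the file URI string is converted to a `java.nio.file.Path` before the matcher is applied: on Unix-like default filesystems, `Paths.get` collapses repeated slashes, so the matcher would see `s3:/...` instead of `s3://...`. The sketch below uses a made-up bucket and file name and is meant to run on Linux or macOS; on Windows, `Paths.get` rejects the `:` in the string instead.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper, not part of Pinot: shows what a regex PathMatcher
// sees when a URI string is passed through Paths.get.
class UriAsPathCheck {
    // The string form of the Path, which is what a "regex:" matcher tests.
    static String asSeenByMatcher(String uri) {
        return Paths.get(uri).toString();
    }

    static boolean matches(String regex, String uri) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + regex);
        return matcher.matches(Paths.get(uri));
    }

    public static void main(String[] args) {
        // Made-up URI for illustration only.
        String uri = "s3://example-bucket/table/ds=2023-03-02/part-00000.gz.parquet";
        System.out.println(asSeenByMatcher(uri));
        // Pattern written against the original URI, with two slashes.
        System.out.println(matches("^s3://example-bucket/.*[.]parquet$", uri));
        // Pattern that tolerates one or more slashes after the scheme.
        System.out.println(matches("^s3:/+example-bucket/.*[.]parquet$", uri));
    }
}
```

   If that assumption holds, a pattern starting with `^s3:/+` rather than `^s3://` would be worth trying, and the docs should say which form of the path the pattern is matched against.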
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

