travis-cook-sfdc opened a new issue, #10611: URL: https://github.com/apache/pinot/issues/10611
According to the [docs](https://docs.pinot.apache.org/configuration-reference/job-specification#top-level-spec), `includeFileNamePattern` and `excludeFileNamePattern` are documented like: > Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. Examples: Use 'glob:.avro'or 'regex:^..(avro)$' to include all avro files one level deep in the inputDirURI. Alternatively, use 'glob:*/.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use [Glob tool](https://www.digitalocean.com/community/tools/glob) or [Regex Tool ](https://www.regextester.com/)to test out your patterns. A few issues here: 1️⃣ The example of `regex:^..(avro)$` does not actually work. When running a job with this pattern, you'll get an error like this ``` Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed: SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign; solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10. out.print(""" ^ 1 error ``` I'm assuming this because of the templating that was introduced in #5341 (also not documented) , but job spec's appear to have special handling for both `$`, which needs to be escaped: `\$`, and backslashes which are automatically escaped to `\\` 2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think this is an impossible regex to use `.*\.parquet$` because it's not clear how to get the single backslash character. `\` turns into `\\` and `\\` stays as `\\`. This issue can be worked around by using character classes and writing `.*[.]parquet$`, but it feels wrong. 3️⃣ What flavor of regex is actually being used here? `regextester.com` linked in the documentation only supports PCRE and Javascript regex. However, I suspect this really java regex, which has different syntax. Given the code uses [PathMatcher](https://docs.oracle.com/javase/7/docs/api/java/nio/file/FileSystem.html#getPathMatcher(java.lang.String)), it's java regex. Pinot should link to a regex tester that will be accurate 4️⃣ Can you provide some examples of the _absolute path_ I should be matching to? I've submitted an ingestion job spec that has `includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$` I have an s3 file with the following name at the path: `s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet` According to regex101.com, this is a match using Java8 syntax: https://regex101.com/r/9ZKOhm/1 It's unclear to me what I'm doing wrong that's causing this pattern to not match. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
