[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-06-02 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r434196518



##
File path: processing/src/main/java/org/apache/druid/query/expression/RegexpLikeExprMacro.java
##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.expression;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class RegexpLikeExprMacro implements ExprMacroTable.ExprMacro
+{
+  private static final String FN_NAME = "regexp_like";
+
+  @Override
+  public String name()
+  {
+    return FN_NAME;
+  }
+
+  @Override
+  public Expr apply(final List<Expr> args)
+  {
+    if (args.size() != 2) {
+      throw new IAE("Function[%s] must have 2 arguments", name());
+    }
+
+    final Expr arg = args.get(0);
+    final Expr patternExpr = args.get(1);
+
+    if (!ExprUtils.isStringLiteral(patternExpr)) {
+      throw new IAE("Function[%s] pattern must be a string literal", name());
+    }
+
+    // Precompile the pattern.
+    final Pattern pattern = Pattern.compile(
+        StringUtils.nullToEmptyNonDruidDataString((String) patternExpr.getLiteralValue())
+    );
+
+    class RegexpLikeExpr extends ExprMacroTable.BaseScalarUnivariateMacroFunctionExpr
+    {
+      private RegexpLikeExpr(Expr arg)
+      {
+        super(FN_NAME, arg);
+      }
+
+      @Nonnull
+      @Override
+      public ExprEval eval(final ObjectBinding bindings)
+      {
+        final String s = NullHandling.nullToEmptyIfNeeded(arg.eval(bindings).asString());
+
+        if (s == null) {
+          // True nulls do not match anything. Note: this branch only executes in SQL-compatible null handling mode.
+          return ExprEval.of(false, ExprType.LONG);
+        } else {
+          final Matcher matcher = pattern.matcher(NullHandling.nullToEmptyIfNeeded(s));

Review comment:
   nit: Is this `nullToEmptyIfNeeded` call still needed, given the `if` block on line 77? Same comment for `RegexpExtractMacro`.
   
   It's unclear to me whether the extra function call costs any performance (I'd think it's probably not measurable).
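
To make the nit concrete, here is a minimal sketch of why the second conversion looks redundant. It does not use Druid itself: `nullToEmptyIfNeeded` below is a hypothetical stand-in that assumes the real `NullHandling.nullToEmptyIfNeeded` only ever turns `null` into `""` (in default-value mode) and otherwise returns its argument unchanged. The use of `find()` is also an assumption, since the quoted diff stops at `pattern.matcher(...)`.

```java
import java.util.regex.Pattern;

public class NullHandlingSketch
{
  // Hypothetical stand-in for Druid's NullHandling.nullToEmptyIfNeeded (assumed semantics):
  // in default-value mode a null becomes "", otherwise the value passes through unchanged.
  static String nullToEmptyIfNeeded(String value, boolean replaceWithDefault)
  {
    return (replaceWithDefault && value == null) ? "" : value;
  }

  // Mirrors the shape of eval() in the diff, but without the second conversion.
  static boolean regexpLike(Pattern pattern, String input, boolean replaceWithDefault)
  {
    final String s = nullToEmptyIfNeeded(input, replaceWithDefault);
    if (s == null) {
      // Only reachable when replaceWithDefault is false (SQL-compatible mode).
      return false;
    }
    // s is non-null here, so converting it again would hand back the same value.
    // find() is an assumption; the quoted diff does not show which Matcher method is called.
    return pattern.matcher(s).find();
  }

  public static void main(String[] args)
  {
    final Pattern p = Pattern.compile("^1");
    System.out.println(regexpLike(p, "10.1", true));
    System.out.println(regexpLike(p, null, true));
    System.out.println(regexpLike(p, null, false));
  }
}
```

Under those assumptions, dropping the inner call changes nothing in either null-handling mode, because the value reaching `pattern.matcher` is already non-null.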





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428846660



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+    // Cannot vectorize due to extractionFn in dimension spec.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' || dim1, '^Z2') IS NOT NULL",
+        ImmutableList.of(
+            Druids.newTimeseriesQueryBuilder()
+                  .dataSource(CalciteTests.DATASOURCE1)
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .granularity(Granularities.ALL)
+                  .virtualColumns(
+                      expressionVirtualColumn("v0", "regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+                  )
+                  .filters(
+                      or(
+                          not(selector("dim1", null, new RegexDimExtractionFn("^1", 0, true, null))),
+                          not(selector("v0", null, null))
+                      )
+                  )
+                  .aggregators(new CountAggregatorFactory("a0"))
+                  .context(TIMESERIES_CONTEXT_DEFAULT)
+                  .build()
+        ),
+        ImmutableList.of(
+            new Object[]{3L}
+        )
+    );
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+    // Cannot vectorize due to usage of regex filter.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   Sounds good to me!
   
   > I could see having a systematic way to test for this (a query generator, maybe) but I don't think adding one just for this function would make sense.
   
   I've been working on the beginnings of a query generator to match all the different rows in the `druid.foo` table. I was kinda sneakily hoping there was already a systematic way we add tests for all these different conditions and I just hadn't seen it yet :)








[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428809485



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+    // Cannot vectorize due to extractionFn in dimension spec.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' || dim1, '^Z2') IS NOT NULL",
+        ImmutableList.of(
+            Druids.newTimeseriesQueryBuilder()
+                  .dataSource(CalciteTests.DATASOURCE1)
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .granularity(Granularities.ALL)
+                  .virtualColumns(
+                      expressionVirtualColumn("v0", "regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+                  )
+                  .filters(
+                      or(
+                          not(selector("dim1", null, new RegexDimExtractionFn("^1", 0, true, null))),
+                          not(selector("v0", null, null))
+                      )
+                  )
+                  .aggregators(new CountAggregatorFactory("a0"))
+                  .context(TIMESERIES_CONTEXT_DEFAULT)
+                  .build()
+        ),
+        ImmutableList.of(
+            new Object[]{3L}
+        )
+    );
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+    // Cannot vectorize due to usage of regex filter.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   The docs say that "The pattern must match starting at the beginning of `expr`", but it looks like the regex pattern you are passing in asks for a match at the beginning of the string itself, via `^` in the pattern string. Can I use a `$` in my regex to ask that it match the end of the expr?
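
For reference, a small sketch of how `java.util.regex` treats anchoring, independent of Druid. The class and method names here are made up for illustration. Which `Matcher` method `regexp_like` actually calls is not visible in the quoted diff, so this only shows the three candidate behaviours and how `^` and `$` interact with each.

```java
import java.util.regex.Pattern;

public class RegexAnchorSketch
{
  // Unanchored search: succeeds if the pattern matches anywhere in the input.
  static boolean find(String regex, String input)
  {
    return Pattern.compile(regex).matcher(input).find();
  }

  // Anchored at the start only: the match must begin at index 0.
  static boolean lookingAt(String regex, String input)
  {
    return Pattern.compile(regex).matcher(input).lookingAt();
  }

  // Anchored at both ends: the entire input must match.
  static boolean matches(String regex, String input)
  {
    return Pattern.compile(regex).matcher(input).matches();
  }

  public static void main(String[] args)
  {
    System.out.println(find("1", "abc1"));       // true: "1" occurs somewhere
    System.out.println(lookingAt("1", "abc1"));  // false: input does not start with "1"
    System.out.println(find("^1", "abc1"));      // false: explicit start anchor
    System.out.println(find("1$", "abc1"));      // true: explicit end anchor
    System.out.println(matches("1", "10.1"));    // false: whole input is not "1"
  }
}
```

If the function uses `find()`, an explicit `^` or `$` in the pattern is what pins the match to the start or end; with `lookingAt()` the start anchor would be implicit and `$` would still be honoured.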








[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428808032



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+    // Cannot vectorize due to extractionFn in dimension spec.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' || dim1, '^Z2') IS NOT NULL",
+        ImmutableList.of(
+            Druids.newTimeseriesQueryBuilder()
+                  .dataSource(CalciteTests.DATASOURCE1)
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .granularity(Granularities.ALL)
+                  .virtualColumns(
+                      expressionVirtualColumn("v0", "regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+                  )
+                  .filters(
+                      or(
+                          not(selector("dim1", null, new RegexDimExtractionFn("^1", 0, true, null))),
+                          not(selector("v0", null, null))
+                      )
+                  )
+                  .aggregators(new CountAggregatorFactory("a0"))
+                  .context(TIMESERIES_CONTEXT_DEFAULT)
+                  .build()
+        ),
+        ImmutableList.of(
+            new Object[]{3L}
+        )
+    );
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+    // Cannot vectorize due to usage of regex filter.
+    cannotVectorize();
+
+    testQuery(
+        "SELECT COUNT(*)\n"
+        + "FROM foo\n"
+        + "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   Do we have tests that check how the function performs against
   * a multi-value column
   * a numeric column
   * matching against null









[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428792725



##
File path: processing/src/main/java/org/apache/druid/query/expression/RegexpLikeExprMacro.java
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.expression;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class RegexpLikeExprMacro implements ExprMacroTable.ExprMacro
+{
+  private static final String FN_NAME = "regexp_like";
+
+  @Override
+  public String name()
+  {
+    return FN_NAME;
+  }
+
+  @Override
+  public Expr apply(final List<Expr> args)
+  {
+    if (args.size() < 2 || args.size() > 3) {

Review comment:
   What is the 3rd argument for? I only see the first 2 being used in this expr.
   ```suggestion
   if (args.size() != 2) {
   ```







