[ 
https://issues.apache.org/jira/browse/PHOENIX-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135746#comment-14135746
 ] 

James Taylor commented on PHOENIX-1254:
---------------------------------------

Thanks for the patch, [~gabriel.reid]. Here's some feedback for your new 
RegexpSplitFunction built-in:
- Initialize initializedSplitter in a separate init method. You'll need to 
call it again after deserialization because, on the server side, the no-arg 
constructor is used.
{code}
+    public RegexpSplitFunction(List<Expression> children) {
+        super(children);
+        init();
+    }
+    private void init() {
+        Expression patternExpression = children.get(1);
+        if (patternExpression instanceof LiteralExpression) {
+            initializedSplitter = Splitter.onPattern(
+                    ((LiteralExpression) patternExpression).getValue().toString());
+        }
+    }
+        
+    @Override
+    public void readFields(DataInput input) throws IOException {
+        super.readFields(input);
+        init();
+    }
+
{code}
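The same init-after-deserialization idea can be sketched in isolation using only the JDK. This is a rough illustration, not Phoenix code: PatternHolder and roundTrip are hypothetical names, java.util.regex.Pattern stands in for the Guava Splitter, and the write/readFields pair only mimics a Writable-style contract.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical stand-in for an expression whose derived state must be
// rebuilt after deserialization.
public class PatternHolder {
    private String patternStr; // serialized state
    private Pattern compiled;  // derived state, never serialized

    // No-arg constructor, used on the receiving side before readFields().
    public PatternHolder() {
    }

    public PatternHolder(String patternStr) {
        this.patternStr = patternStr;
        init();
    }

    // Rebuilds the derived state; shared by both construction paths.
    private void init() {
        compiled = (patternStr == null) ? null : Pattern.compile(patternStr);
    }

    public void write(DataOutput output) throws IOException {
        output.writeBoolean(patternStr != null);
        if (patternStr != null) {
            output.writeUTF(patternStr);
        }
    }

    public void readFields(DataInput input) throws IOException {
        patternStr = input.readBoolean() ? input.readUTF() : null;
        init(); // without this call, compiled stays null after deserialization
    }

    public boolean isInitialized() {
        return compiled != null;
    }

    public String[] split(String source) {
        return compiled.split(source);
    }

    // Serializes a holder and reads it back through the no-arg constructor.
    public static PatternHolder roundTrip(PatternHolder original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PatternHolder copy = new PatternHolder();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the init() call in readFields is dropped, the copy returned by roundTrip has no compiled pattern, which is the failure mode described above.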
- In evaluate, only return false if either child could not be evaluated, as 
that's an indication to the framework that it should keep trying to evaluate 
the expression (i.e. we may not have seen the KeyValue that we need yet). 
Also, I think there's a typo in this function: the second children.get call 
should be get(1), to get the patternExpression. Maybe good to add a unit test 
for this if you don't have one already? It'd get triggered if you had a column 
that stored the patternExpression to use (kind of an edge case). There's also 
another edge case where patternExpression could have evaluated to null, in 
which case you want to return true, not false. Probably easiest to change the 
getSplitter() method to split() and do everything there, like this:
{code}
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        Expression sourceStrExpression = children.get(0);
+        if (!sourceStrExpression.evaluate(tuple, ptr)) {
+            return false; // source not available yet, so let the framework retry
+        }
+
+        String sourceStr = (String) PDataType.VARCHAR.toObject(
+                ptr, sourceStrExpression.getSortOrder());
+        if (sourceStr == null) { // sourceStr evaluated to null
+            ptr.set(ByteUtil.EMPTY_BYTE_ARRAY);
+            return true;
+        }
+
+        return split(tuple, ptr, sourceStr);
+    }
+
+    private boolean split(Tuple tuple, ImmutableBytesWritable ptr, String sourceStr) {
+        Splitter splitter = initializedSplitter;
+        if (splitter == null) { // pattern is not a literal, so evaluate it per row
+            Expression patternExpression = children.get(1);
+            if (!patternExpression.evaluate(tuple, ptr)) {
+                return false;
+            }
+            if (ptr.getLength() == 0) {
+                return true; // ptr is already set to null
+            }
+
+            String patternStr = (String) PDataType.VARCHAR.toObject(
+                    ptr, patternExpression.getSortOrder());
+            splitter = Splitter.onPattern(patternStr);
+        }
+
+        List<String> splitStrings = Lists.newArrayList(splitter.split(sourceStr));
+        PhoenixArray splitArray = new PhoenixArray(PDataType.VARCHAR, splitStrings.toArray());
+        ptr.set(PDataType.VARCHAR_ARRAY.toBytes(splitArray));
+        return true;
+    }
{code}
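For the unit test, the three outcomes of evaluate (not evaluable yet, evaluated to null, evaluated to an array) can be modelled with plain JDK types. This is only a sketch under assumptions: SplitContractSketch, Result, and the Map-based row are hypothetical stand-ins for the Phoenix Expression, ImmutableBytesWritable, and Tuple, and java.util.regex.Pattern replaces the Guava Splitter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of the evaluate() contract, not Phoenix code.
public class SplitContractSketch {

    // Stands in for the output pointer; a null value models SQL NULL.
    public static final class Result {
        public List<String> value;
    }

    // row holds the columns seen so far: a missing key models a KeyValue that
    // has not been seen yet, a key mapped to null models SQL NULL.
    public static boolean evaluate(Map<String, String> row, String sourceCol,
                                   String patternCol, Result out) {
        if (!row.containsKey(sourceCol)) {
            return false; // source not evaluable yet, caller should retry
        }
        String source = row.get(sourceCol);
        if (source == null) {
            out.value = null;
            return true; // evaluated, result is null
        }
        if (!row.containsKey(patternCol)) {
            return false; // pattern column not seen yet, caller should retry
        }
        String pattern = row.get(patternCol);
        if (pattern == null) {
            out.value = null;
            return true; // null pattern still counts as evaluated
        }
        out.value = Arrays.asList(Pattern.compile(pattern).split(source));
        return true;
    }
}
```

A test along these lines would add the pattern column to the row after the source column and check that the first call returns false while the later calls return true.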



> Add REGEXP_SPLIT function
> -------------------------
>
>                 Key: PHOENIX-1254
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1254
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: PHOENIX-1254.patch
>
>
> It is really useful in some situations to have a string split function to 
> split string values into arrays of strings based on a delimiter string or 
> pattern.
> The intention is to add a REGEXP_SPLIT function that works in similar fashion 
> to the [Postgresql REGEXP_SPLIT_TO_ARRAY 
> function|http://www.postgresql.org/docs/9.1/static/functions-string.html].
> The function will take two parameters:
> * the input value to be split
> * the regular expression pattern to be used to split the input into an array
> The output of the function will be {{VARCHAR_ARRAY}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
