[jira] [Comment Edited] (PHOENIX-1287) Use the joni byte[] regex engine in place of j.u.regex

Shuxiong Ye (JIRA) Wed, 01 Apr 2015 09:07:12 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385351#comment-14385351
 ]


Shuxiong Ye edited comment on PHOENIX-1287 at 4/1/15 4:04 PM:
--------------------------------------------------------------

I set up a test environment on my laptop.

I used performance.py to generate 10M rows and ran the following queries with 
both the ByteBased and StringBased regex implementations, 5 times each.

{code}
Query # 6 - Like + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE DOMAIN LIKE '%o%e%';
Query # 7 - Replace + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_REPLACE(DOMAIN, '[a-z]+')='G.';
Query # 8 - Substr + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_SUBSTR(DOMAIN, '[a-z]+')='oogle';
{code}
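
For illustration, the three predicates can be mimicked on a single value with 
Python's re module. This is only a sketch: the sample value 'Google.com' is an 
assumption about what DOMAIN contains, and Python's regex engine merely stands 
in for the Phoenix functions.

```python
import re

# Assumed sample DOMAIN value; not taken from the benchmark data itself.
domain = "Google.com"

# Query 6: LIKE '%o%e%' means an 'o' followed, anywhere later, by an 'e'.
like_match = re.fullmatch(r".*o.*e.*", domain, flags=re.DOTALL) is not None

# Query 7: REGEXP_REPLACE without a replacement argument removes every match.
replaced = re.sub(r"[a-z]+", "", domain)

# Query 8: REGEXP_SUBSTR returns the first match of the pattern.
first = re.search(r"[a-z]+", domain)
substr = first.group(0) if first else None

print(like_match, replaced, substr)
```

Under that assumption, the value passes all three filters: it matches the LIKE 
pattern, reduces to 'G.' and has 'oogle' as its first lowercase run.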

|| || ByteBased || StringBased || SpeedUp(String/Byte) ||
| Like | 8.644/ 7.995/ 7.868/ 7.865/ 7.763 | 9.803/ 9.497/ 8.706/ 8.796/ 8.805 | 1.136 |
| Replace | 11.725/11.071/11.199/10.988/10.970 | 10.576/10.495/10.271/10.354/10.178 | 0.927 |
| Substr | 8.380/ 8.107/ 8.248/ 8.319/ 8.302 | 9.478/ 9.227/ 9.294/ 9.024/ 9.158 | 1.116 |

Like and Substr show a slight speedup, while for Replace the ByteBased 
implementation is slower than the StringBased one.
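
The SpeedUp column can be reproduced from the raw timings. The sketch below 
assumes that SpeedUp(String/Byte) is the mean StringBased time divided by the 
mean ByteBased time; that definition is an assumption, chosen because it agrees 
with the table to within the last digit.

```python
# Timings copied from the table above, in the units reported there.
timings = {
    "Like": {
        "byte": [8.644, 7.995, 7.868, 7.865, 7.763],
        "string": [9.803, 9.497, 8.706, 8.796, 8.805],
    },
    "Replace": {
        "byte": [11.725, 11.071, 11.199, 10.988, 10.970],
        "string": [10.576, 10.495, 10.271, 10.354, 10.178],
    },
    "Substr": {
        "byte": [8.380, 8.107, 8.248, 8.319, 8.302],
        "string": [9.478, 9.227, 9.294, 9.024, 9.158],
    },
}


def mean(values):
    """Arithmetic mean of a non-empty list of run times."""
    return sum(values) / len(values)


def speedup(row):
    """Assumed definition: mean StringBased time / mean ByteBased time."""
    return mean(row["string"]) / mean(row["byte"])


for name, row in timings.items():
    # A value above 1 means the ByteBased implementation was faster.
    print("%-8s %.3f" % (name, speedup(row)))
```

With this definition the ratios come out at roughly 1.136, 0.927 and 1.117; the 
last differs from the table's 1.116 only in the final digit, which would be 
consistent with truncation rather than rounding.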

-------------------------------------------------------------------

I have finished RegexpSplitFunction. ByteBased seems to be a little faster than 
StringBased.
Query # 9 - Split + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\.'), 1)='Google';

ByteBased: 12.245 StringBased: 12.842 SpeedUp(String/Byte): 1.05

The following queries are added in performance.py:
{code}
queryex("6 - Like + Count", "SELECT COUNT(1) FROM %s WHERE DOMAIN LIKE '%%o%%e%%';" % (table))
queryex("7 - Replace + Count", "SELECT COUNT(1) FROM %s WHERE REGEXP_REPLACE(DOMAIN, '[a-z]+')='G.';" % (table))
queryex("8 - Substr + Count", "SELECT COUNT(1) FROM %s WHERE REGEXP_SUBSTR(DOMAIN, '[a-z]+')='oogle';" % (table))
queryex("9 - Split + Count", "SELECT COUNT(1) FROM %s WHERE ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\\\.'), 1)='Google';" % (table))
{code}
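
The doubled %% and the four backslashes in these strings are Python escaping, 
not part of the SQL. A small check of what the formatted strings look like, 
assuming the table name from the queries above and leaving out queryex itself:

```python
# Table name as it appears in the benchmark queries.
table = "PERFORMANCE_10000000"

# '%%' collapses to a literal '%' when the string goes through % formatting.
like_query = "SELECT COUNT(1) FROM %s WHERE DOMAIN LIKE '%%o%%e%%';" % (table)

# '\\\\' in a Python literal is two backslashes, so the SQL receives '\\.'.
split_query = (
    "SELECT COUNT(1) FROM %s WHERE "
    "ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\\\.'), 1)='Google';" % (table)
)

print(like_query)
print(split_query)
```

The printed statements should match Query # 6 and Query # 9 as listed earlier.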


> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
>                 Key: PHOENIX-1287
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1287
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Shuxiong Ye
>              Labels: gsoc2015
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[] 
> instead of strings. Thanks for the pointer, [~apurtell].



