[
https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385351#comment-14385351
]
Shuxiong Ye edited comment on PHOENIX-1287 at 4/1/15 4:04 PM:
--------------------------------------------------------------
I set up a test environment on my laptop.
I used performance.py to generate 10M rows and ran the following queries with
the ByteBased and StringBased regex implementations, 5 times each.
{code}
Query # 6 - Like + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE DOMAIN LIKE '%o%e%';
Query # 7 - Replace + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_REPLACE(DOMAIN, '[a-z]+')='G.';
Query # 8 - Substr + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_SUBSTR(DOMAIN, '[a-z]+')='oogle';
{code}
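For reference, the three predicates can be mimicked outside Phoenix with Python's re module. This is only a sketch under assumptions: it assumes a DOMAIN value such as 'Google.com' (inferred from the literals in the queries), that REGEXP_REPLACE with no replacement argument substitutes the empty string, and that the LIKE pattern can be expressed as the regex shown. The helper names are hypothetical and neither Phoenix regex engine is involved.

```python
import re


def like_o_e(domain):
    # Emulates LIKE '%o%e%': an 'o' followed somewhere later by an 'e'.
    return re.search(r"o.*e", domain, re.DOTALL) is not None


def regexp_replace(domain, pattern, replacement=""):
    # Assumption: an omitted replacement means "replace with empty string".
    return re.sub(pattern, replacement, domain)


def regexp_substr(domain, pattern):
    # Returns the first match of the pattern, or None if there is none.
    match = re.search(pattern, domain)
    return match.group(0) if match else None


domain = "Google.com"  # assumed sample value, not taken from the data set
print(like_o_e(domain))
print(regexp_replace(domain, "[a-z]+"))
print(regexp_substr(domain, "[a-z]+"))
```

With that sample value, all three predicates select the row, which is why each query compares against 'G.' and 'oogle'.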
|| || ByteBased || StringBased || SpeedUp(String/Byte) ||
| Like | 8.644/ 7.995/ 7.868/ 7.865/ 7.763 | 9.803/ 9.497/ 8.706/ 8.796/ 8.805 | 1.136 |
| Replace | 11.725/11.071/11.199/10.988/10.970 | 10.576/10.495/10.271/10.354/10.178 | 0.927 |
| Substr | 8.380/ 8.107/ 8.248/ 8.319/ 8.302 | 9.478/ 9.227/ 9.294/ 9.024/ 9.158 | 1.116 |
Like and Substr show a slight speedup with the Byte-Based implementation, while
for Replace the Byte-Based implementation is slower than the String-Based one.
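The SpeedUp column appears to be the mean StringBased time divided by the mean ByteBased time. That reading is an assumption, but it reproduces the Like and Replace figures. A small sketch of the arithmetic, using the run times reported above:

```python
from statistics import mean

# (ByteBased runs, StringBased runs) copied from the table above.
timings = {
    "Like": ([8.644, 7.995, 7.868, 7.865, 7.763],
             [9.803, 9.497, 8.706, 8.796, 8.805]),
    "Replace": ([11.725, 11.071, 11.199, 10.988, 10.970],
                [10.576, 10.495, 10.271, 10.354, 10.178]),
    "Substr": ([8.380, 8.107, 8.248, 8.319, 8.302],
               [9.478, 9.227, 9.294, 9.024, 9.158]),
}


def speedup(byte_runs, string_runs):
    # Assumed definition: ratio of mean times, String over Byte.
    # A value above 1 means the ByteBased implementation was faster.
    return mean(string_runs) / mean(byte_runs)


for name, (byte_runs, string_runs) in timings.items():
    print(f"{name}: {speedup(byte_runs, string_runs):.3f}")
```

Under this definition Substr comes out at about 1.117 rather than 1.116, which suggests the reported value was truncated rather than rounded.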
-------------------------------------------------------------------
I have finished RegexpSplitFunction. ByteBased seems to be a little faster than
StringBased.
Query # 9 - Split + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\.'), 1)='Google';
ByteBased: 12.245, StringBased: 12.842, SpeedUp(String/Byte): 1.05
The following queries are added in performance.py:
{code}
queryex("6 - Like + Count", "SELECT COUNT(1) FROM %s WHERE DOMAIN LIKE '%%o%%e%%';" % (table))
queryex("7 - Replace + Count", "SELECT COUNT(1) FROM %s WHERE REGEXP_REPLACE(DOMAIN, '[a-z]+')='G.';" % (table))
queryex("8 - Substr + Count", "SELECT COUNT(1) FROM %s WHERE REGEXP_SUBSTR(DOMAIN, '[a-z]+')='oogle';" % (table))
queryex("9 - Split + Count", "SELECT COUNT(1) FROM %s WHERE ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\\\.'), 1)='Google';" % (table))
{code}
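The doubled percent signs and the four backslashes are there because each statement goes through Python's % formatting and string-literal escaping before it reaches Phoenix. A minimal sketch of what the rendered SQL looks like, assuming table holds the name used in the queries above (the variable names like_sql and split_sql are illustrative only):

```python
# Assumed table name, taken from the queries quoted earlier in this comment.
table = "PERFORMANCE_10000000"

# '%%' collapses to a single '%' after % formatting.
like_sql = "SELECT COUNT(1) FROM %s WHERE DOMAIN LIKE '%%o%%e%%';" % (table)

# '\\\\' in a normal Python literal is two backslash characters,
# so the SQL text ends up containing '\\.'.
split_sql = ("SELECT COUNT(1) FROM %s WHERE "
             "ARRAY_ELEM(REGEXP_SPLIT(DOMAIN, '\\\\.'), 1)='Google';" % (table))

print(like_sql)
print(split_sql)
```

The printed statements match Query # 6 and Query # 9 as written above.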
> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
> Key: PHOENIX-1287
> URL: https://issues.apache.org/jira/browse/PHOENIX-1287
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Shuxiong Ye
> Labels: gsoc2015
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[]
> instead of strings. Thanks for the pointer, [~apurtell].