[jira] [Assigned] (PHOENIX-4902) Snappy compression benefit is lost when generate hash cache RPC

2018-09-15 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay reassigned PHOENIX-4902:


Assignee: Marcell Ortutay

> Snappy compression benefit is lost when generate hash cache RPC
> ---
>
> Key: PHOENIX-4902
> URL: https://issues.apache.org/jira/browse/PHOENIX-4902
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Minor
>
> Phoenix uses Snappy compression on hash caches before it sends them to region servers:
> {code}
> int maxCompressedSize = Snappy.maxCompressedLength(baOut.size());
> byte[] compressed = new byte[maxCompressedSize]; // size for worst case
> int compressedSize = Snappy.compress(baOut.getBuffer(), 0, baOut.size(), compressed, 0);
> // Last realloc to size of compressed buffer.
> ptr.set(compressed, 0, compressedSize);
> {code}
> However, looking at debug output, it seems like the serialized protobuf that 
> it sends to region servers does not have the benefits of snappy compression. 
> Below is an excerpt of some debug output I put in:
> {code}
> Building an RPC with a cache ptr of size: 39MB  // The compressed size is 39MB
> Done serializing the AddServerCacheRequest RPC, size is 206MB  // However the serialized RPC is 206MB
> And the cache ptr size is: 206MB  // And specifically, the byte array that contains the serialized hash cache is 206MB
> {code}
> I've made a simple test codebase to attempt to reproduce this bug. It shows 
> similar behavior:
> {code}
> bytes size: 1 bytes
> compressed bytes size: 721 bytes
> message size: 10003 bytes
> compressed message size: 11701 bytes
> {code}
> The code for the simplified example is here: 
> https://github.com/ortutay/snappy-bytes-buffer/blob/master/src/main/java/testprotobuf/Main.java
> I observed this behavior in Phoenix 4.14.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PHOENIX-4902) Snappy compression benefit is lost when generate hash cache RPC

2018-09-12 Thread Marcell Ortutay (JIRA)
Marcell Ortutay created PHOENIX-4902:


 Summary: Snappy compression benefit is lost when generate hash 
cache RPC
 Key: PHOENIX-4902
 URL: https://issues.apache.org/jira/browse/PHOENIX-4902
 Project: Phoenix
  Issue Type: Bug
Reporter: Marcell Ortutay


Phoenix uses Snappy compression on hash caches before it sends them to region servers:

{code}
int maxCompressedSize = Snappy.maxCompressedLength(baOut.size());
byte[] compressed = new byte[maxCompressedSize]; // size for worst case
int compressedSize = Snappy.compress(baOut.getBuffer(), 0, baOut.size(), compressed, 0);
// Last realloc to size of compressed buffer.
ptr.set(compressed, 0, compressedSize);
{code}

However, looking at debug output, it seems like the serialized protobuf that it 
sends to region servers does not have the benefits of snappy compression. Below 
is an excerpt of some debug output I put in:

{code}
Building an RPC with a cache ptr of size: 39MB  // The compressed size is 39MB
Done serializing the AddServerCacheRequest RPC, size is 206MB  // However the serialized RPC is 206MB
And the cache ptr size is: 206MB  // And specifically, the byte array that contains the serialized hash cache is 206MB
{code}

I've made a simple test codebase to attempt to reproduce this bug. It shows 
similar behavior:

{code}
bytes size: 1 bytes
compressed bytes size: 721 bytes
message size: 10003 bytes
compressed message size: 11701 bytes
{code}

The code for the simplified example is here: 
https://github.com/ortutay/snappy-bytes-buffer/blob/master/src/main/java/testprotobuf/Main.java

I observed this behavior in Phoenix 4.14.1
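
For illustration only (this is not Phoenix code, and the actual root cause may differ): the numbers above are consistent with the RPC carrying the whole worst-case-sized buffer rather than just the [0, compressedSize) slice. A minimal sketch of that pitfall, using the same Snappy calls and protobuf's ByteString:

{code:java}
// Minimal sketch (not Phoenix code): the compressed data lives at the front of a
// worst-case-sized buffer, so only the [0, compressedSize) slice should go into the
// RPC's bytes field. Copying the whole backing array ships the uncompressed-scale buffer.
import com.google.protobuf.ByteString;
import org.xerial.snappy.Snappy;

public class SnappySliceDemo {
    public static void main(String[] args) throws Exception {
        byte[] uncompressed = new byte[8 * 1024 * 1024];   // stand-in for the serialized hash cache
        byte[] buf = new byte[Snappy.maxCompressedLength(uncompressed.length)]; // worst-case size
        int compressedSize = Snappy.compress(uncompressed, 0, uncompressed.length, buf, 0);

        ByteString wholeBuffer = ByteString.copyFrom(buf);                    // loses the benefit
        ByteString sliceOnly   = ByteString.copyFrom(buf, 0, compressedSize); // keeps the benefit

        System.out.println("compressedSize=" + compressedSize
                + ", wholeBuffer=" + wholeBuffer.size()
                + ", sliceOnly=" + sliceOnly.size());
    }
}
{code}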



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PHOENIX-4903) Hash cache RPC uses O(N) memory on master

2018-09-12 Thread Marcell Ortutay (JIRA)
Marcell Ortutay created PHOENIX-4903:


 Summary: Hash cache RPC uses O(N) memory on master
 Key: PHOENIX-4903
 URL: https://issues.apache.org/jira/browse/PHOENIX-4903
 Project: Phoenix
  Issue Type: Improvement
Reporter: Marcell Ortutay


To distribute the hash cache to region servers, the master node makes an `AddServerCacheRequest` RPC to each region server. If there are N region servers, it makes N of these RPCs. For each region server, it generates a serialized RPC message and sends it out. This happens concurrently, and the result is that it uses O(N) memory on the master.

As an example, if the `AddServerCacheRequest` RPC message is 100MB and you have a cluster of 100 nodes, it would use 10GB of memory on the master, potentially resulting in an "OutOfMemory" exception.

It would be better if the master could use O(1) memory for the RPC.

I observed this behavior in Phoenix 4.14.1
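
A generic sketch of the pattern described above (hypothetical helper names, not Phoenix code): serializing one request per region server and holding them concurrently makes peak memory grow with the number of servers, while sharing a single serialized buffer across sends would keep it constant.

{code:java}
// Illustration only: N concurrent per-server copies of the payload => O(N) peak memory.
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BroadcastMemorySketch {
    // Hypothetical stand-in: in the O(N) pattern each task builds its own payload-sized RPC buffer.
    static byte[] buildAddServerCacheRequest(byte[] cachePayload) {
        return cachePayload.clone();
    }

    // Hypothetical stand-in for the network send.
    static void send(String regionServer, byte[] rpcBytes) { }

    public static void main(String[] args) {
        byte[] cachePayload = new byte[16 * 1024 * 1024];  // scaled down; the report cites ~100MB
        List<String> regionServers = Arrays.asList("rs1", "rs2", "rs3", "rs4"); // imagine N of these

        ExecutorService pool = Executors.newFixedThreadPool(regionServers.size());
        for (String rs : regionServers) {
            // Each task holds its own copy while in flight, so peak usage is ~N x payload size.
            pool.submit(() -> send(rs, buildAddServerCacheRequest(cachePayload)));
        }
        pool.shutdown();
        // O(1) alternative: serialize the request once and reuse/stream that single buffer per send.
    }
}
{code}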



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-22 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-22 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: (was: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch)

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: (was: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch)

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: (was: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch)

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch, 
> PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch, 
> PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch, 
> PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Comment: was deleted

(was:  [^PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch] )

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-08-20 Thread Marcell Ortutay (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4666:
-
Attachment: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
> Attachments: PHOENIX-4666-subquery-cache-4.x-HBase-1.4.patch
>
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-21 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519619#comment-16519619
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

OK, let me look at those two approaches; they both are probably workable.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-20 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518531#comment-16518531
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

> Can you throw this exception (or derive your PersistentHashJoinCacheNotFoundException from it)?

I don't think so. It's not really a matter of which exception to throw (I was throwing HashJoinCacheNotFoundException originally; it's just easier to throw a separate exception), but rather that by the time we are throwing the exception, we are past the point in query execution where we can (easily) re-run the hash join cache generation.

One option *might* be to generate a peeking result iterator, and do a "peek()" on the first result for uses of the persistent cache. Do you think this approach 
would work? [~jamestaylor] & co.
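
For what it's worth, a generic sketch of the "peek the first result" idea (plain java.util types, not Phoenix's ResultIterator): forcing the first fetch up front makes a missing-cache failure surface while the caller can still rebuild the cache and retry, instead of partway through returning results.

{code:java}
// Generic peeking wrapper: peek() eagerly fetches (and remembers) the next element,
// so any failure from the underlying fetch is raised at peek() time.
import java.util.Iterator;

public class PeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T peeked;
    private boolean hasPeeked;

    public PeekingIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    /** Fetches the next element without consuming it; a cache-not-found error would surface here. */
    public T peek() {
        if (!hasPeeked) {
            peeked = delegate.next();
            hasPeeked = true;
        }
        return peeked;
    }

    @Override
    public boolean hasNext() {
        return hasPeeked || delegate.hasNext();
    }

    @Override
    public T next() {
        if (hasPeeked) {
            hasPeeked = false;
            T result = peeked;
            peeked = null;
            return result;
        }
        return delegate.next();
    }
}
{code}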

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-19 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517546#comment-16517546
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~jamestaylor] :

> Easiest would be to just let Phoenix rerun the portion of the query that 
> didn't find the hash cache. What's the reason this approach won't work?
> Otherwise, the top level method of BaseResultIterators managing the parallel 
> scans is submitWork(). Maybe you could do what you want from there? Or higher 
> up in the stack where the hash cache is being created?
 
I haven't been able to find a place to put the exception handler that works. 
Below is some additional info:
 
The hash join cache is generated in the HashJoinPlan.iterator() method. This is the 
stack trace from that location:
{code:java}
org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:182)
org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:144)
org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:139)
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:316)
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:295)
org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:294)
org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:286)
org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1838)
sqlline.Commands.execute(Commands.java:822)
sqlline.Commands.sql(Commands.java:732)
sqlline.SqlLine.dispatch(SqlLine.java:813)
sqlline.SqlLine.begin(SqlLine.java:686)
sqlline.SqlLine.start(SqlLine.java:398)
sqlline.SqlLine.main(SqlLine.java:291){code}
 However, the actual execution of the query comes later. This is where 
PersistentHashJoinCacheNotFoundException is thrown:
{code:java}
org.apache.phoenix.coprocessor.PersistentHashJoinCacheNotFoundException: ERROR 
900 (HJ01): Hash Join cache not found joinId: 7. The cache might have expired 
and have been removed.
at org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:189)
at 
org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:174)
at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:141)
at 
org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1327)
at 
org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1245)
at 
org.apache.phoenix.iterate.RoundRobinResultIterator.getIterators(RoundRobinResultIterator.java:176)
at 
org.apache.phoenix.iterate.RoundRobinResultIterator.next(RoundRobinResultIterator.java:91)
at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:805)
at sqlline.BufferedRows.<init>(BufferedRows.java:37)
at sqlline.SqlLine.print(SqlLine.java:1660)
at sqlline.Commands.execute(Commands.java:833)
at sqlline.Commands.sql(Commands.java:732)
at sqlline.SqlLine.dispatch(SqlLine.java:813)
at sqlline.SqlLine.begin(SqlLine.java:686)
at sqlline.SqlLine.start(SqlLine.java:398)
at sqlline.SqlLine.main(SqlLine.java:291)
{code}
Looking at this, the first common ancestor of these two stack traces is `sqlline.Commands.execute`, which is outside the Phoenix codebase.

I took a quick look at the sqlline code base also. It appears that the flow for 
generating the hash join cache is started here: 
[https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L823]
 and then the flow which triggers the exception is started here: 
[https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L834]
 . 

I'm not sure if there's a way to do this kind of flow control via exceptions, though I might be missing something. Please let me know. I've experimented with 
putting the exception handler in various places, including the 
ParallelIterators.submitWork() at 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/iterate/ParallelIterators.java#L135]
 , but it does not re-run the hash join cache generation, as that is in a 
previous part of the query execution.

Guidance/advice appreciated; FWIW, the RPC check version does work as expected and improves our query runs substantially.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483241#comment-16483241
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~maryannxue] [~jamestaylor] I was hoping to get a bit of guidance on where/how 
to handle the exception. I tried adding an exception handler for 
HashJoinCacheNotFoundException in BaseResultIterators.java, as shown here: 
[https://github.com/apache/phoenix/commit/b336644a37f6c65524ee91a06a6859c0215b08f2#diff-8c3d3f644c66ef36d5bc604f017fabfcR1315]
, but that doesn't seem to be correct. What I was hoping to do was to re-run the entire query with caching disabled for specific cache IDs using the override mechanism. What actually happens, apparently, is that it tries to iterate again using the same query. I'm not entirely sure of this part of the code, but that is what seems to be happening.

Is there a good way / place to have it re-run the entire query with the change to the StatementContext?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-19 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481490#comment-16481490
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Yea I can add a summary on the PR. Until then, answering the questions here:

(1) I'm not sure, actually. It was there before I made any modification.

(2) The naming on this is a bit bad. But basically, my approach was this: on the first pass, assume the caches are present and run the query. Then catch the exception if they are not present. If they are not present, disable use of the persistent cache for the entries that are not available (tracked in the StatementContext's caches map). Then, re-run the query expecting it to work. Let me know if this is unclear at all; I'll put it in the summary.
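
A bare-bones sketch of that two-pass flow (hypothetical names, not the actual Phoenix classes): the first attempt assumes every persistent cache is present; a cache-miss exception records the missing cache id so the retry builds that one fresh.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class TwoPassQueryRunner {
    /** Thrown when a referenced persistent cache is missing; carries the cache id. */
    static class PersistentCacheNotFoundException extends RuntimeException {
        final String cacheId;
        PersistentCacheNotFoundException(String cacheId) {
            super("persistent cache not found: " + cacheId);
            this.cacheId = cacheId;
        }
    }

    /** Runs the query; cache ids in disabledCaches are rebuilt fresh instead of read from the cache. */
    interface QueryExecutor {
        void run(Set<String> disabledCaches);
    }

    static void runWithRetry(QueryExecutor executor) {
        Set<String> disabledCaches = new HashSet<>();
        while (true) {
            try {
                executor.run(disabledCaches);   // first pass assumes all persistent caches exist
                return;
            } catch (PersistentCacheNotFoundException e) {
                // Record the missing cache id so the retry disables the persistent cache for it.
                if (!disabledCaches.add(e.cacheId)) {
                    throw e;                    // same id failed twice; give up instead of looping
                }
            }
        }
    }
}
{code}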

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478175#comment-16478175
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~maryannxue] I've updated the PR to respond to comments. All but one are addressed; please see my question in QueryCompiler.java regarding LHS join tables.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-01 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460352#comment-16460352
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for review; working on deploying this internally with changes, will post 
revisions later this week or early next

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446159#comment-16446159
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

thanks [~maryannxue]

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438013#comment-16438013
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

FYI, I added a hint-to-enable ("USE_PERSISTENT_CACHE") to my fork. This is the last 
"feature" that I think is needed for a v1. I'm going to clean things up a bit 
and refactor for the noChildParentJoinOptimization approach suggested above and 
then submit for a full review.
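
Assuming the hint keeps the USE_PERSISTENT_CACHE name mentioned above (Phoenix hints are written inline as /*+ ... */), opting in from a JDBC client might look roughly like this, reusing the shape of the illustrative query from the issue description:

{code:java}
// Hedged sketch: opting into the proposed persistent cache via a query hint. The hint name and
// its exact behavior depend on the final patch; table/column names follow the issue's example.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PersistentCacheHintExample {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT /*+ USE_PERSISTENT_CACHE */ * "
                + "FROM table1 "
                + "JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result "
                + "ON table1.id_1 = expensive_result.id_1 "
                + "WHERE table1.id_1 = ?";
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setLong(1, 42L);  // the per-request parameter; the subquery result is reusable
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }
    }
}
{code}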

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437921#comment-16437921
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 4/13/18 9:10 PM:
---

Ah! got it, thanks


was (Author: ortutay):
Ah! got it, thank

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437901#comment-16437901
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Got it. For option (2), that would basically require calling setScanRange() with an argument to scan "all"?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437759#comment-16437759
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for taking a look, [~maryannxue]. Responses below:

> 1. First of all, I think it's important that we have an option to enable and 
>disable the persistent cache, making sure that users can still run join 
>queries in the default temp-cache way.

Yes, definitely. In fact I am adding a hint, and for now I think it makes sense 
to enable the persistent cache only when that hint is present, so we don't 
break any existing behavior.

> 2. Regarding to your change [2], can you explain what exactly is the problem 
>of key-range generation? Looks like checkCache() and addCache() are doing 
>redundant work, and CachedSubqueryResultIterator should be unnecessary. We do 
>not wish to read the cache on the client side and then re-add the cache again.

Yes, in my first attempt at this I did not have the redundant work, but I ran 
into a bug where I was getting empty results when using the cached code path. 
If you look at HashCacheClient.addHashCache() you'll notice that it calls 
serialize(): 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L85]

serialize() produces the serialized RHS join cache and iterates over all the 
results in the ResultIterator for the RHS query. In this line 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131]
 it adds entries to keyRangeRhsValues. This is a list of the key values on the 
RHS of the join. It is used here 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java#L226]
 in HashJoinPlan as part of the query, and at some point it becomes an argument 
to set the scan range (I can dig up where if you'd like).

For this reason the cached code path somehow needs to generate correct values 
for keyRangeRhsValues, or correct values for the scan range, and these values 
need to be available on the client side.

The approach I took simply re-runs the same code path for both no-cache and 
cached queries. The advantage is that it was fairly simple to implement and it 
guarantees identical execution. The downside is the redundant work. It would 
also be possible to have special-case code to set the scan range for cached 
queries; this is a bit harder to implement but is more efficient. A sketch of 
the re-run approach is below.
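
For illustration, here is a minimal, self-contained model of the trade-off 
(hypothetical names, not Phoenix's actual classes): serialize() both builds the 
cache payload and, as a side effect, collects the RHS key values, so feeding 
the cached rows back through the same step keeps the side effects identical to 
the uncached path.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of the discussion above (not Phoenix's actual
// classes). serialize() has two jobs: produce the hash-cache payload and, as a
// side effect, collect the RHS key values used later to set scan ranges.
// Re-running it over an iterator of cached results keeps both code paths identical.
public class HashCacheSketch {
    static byte[] serialize(Iterator<String> rhsRows, List<String> keyRangeRhsValues) {
        StringBuilder payload = new StringBuilder();
        while (rhsRows.hasNext()) {
            String row = rhsRows.next();
            payload.append(row).append('\n');
            keyRangeRhsValues.add(row);   // side effect the LHS scan depends on
        }
        return payload.toString().getBytes();
    }

    public static void main(String[] args) {
        List<String> cachedRhsRows = List.of("k1", "k2", "k3"); // pretend these came from the persistent cache
        List<String> keyRangeRhsValues = new ArrayList<>();
        // Cached path: feed the cached rows through the same serialize() step so
        // keyRangeRhsValues is populated exactly as in the uncached path.
        serialize(cachedRhsRows.iterator(), keyRangeRhsValues);
        System.out.println("key range values: " + keyRangeRhsValues);
    }
}
{code}

Caching keyRangeRhsValues (or the derived scan range) alongside the payload 
would avoid the redundant pass, at the cost of the specialized code path 
mentioned above.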

Happy to hear what people think about this. Maybe there is something much 
simpler that I am missing!

> 3. We need to be aware that the string representation of the sub-query 
>statement is not reliable, which means the same join-tables or sub-queries do 
>not necessarily map to the same string representation, and thus will have 
>different generated cache-id. It'd be optimal if we can have some 
>normalization here. We can consider leaving this as a future improvement, yet 
>at this point we'd better have some test cases (counter cases as well) to 
>cover this point.

Yes, definitely. I'd prefer to leave this as a future improvement to keep the 
initial PR focused. IIRC, for some complex queries the string representation 
contains a "$1" or "$2" placeholder, which changes even across identical 
queries. There are probably more cases like this, e.g. "x=10 AND y=20" is the 
same as "y=20 AND x=10".
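
As a toy illustration of the kind of normalization that could help later 
(purely hypothetical and deliberately naive; it only handles reordered AND 
predicates at the string level, which is not how Phoenix would actually 
canonicalize a statement):

{code:java}
import java.util.Arrays;

// Hypothetical, deliberately naive normalizer: sorts the conjuncts of a WHERE
// clause so that "x=10 AND y=20" and "y=20 AND x=10" hash to the same cache ID.
// Real normalization would have to work on the parsed/compiled plan instead.
public class NaivePredicateNormalizer {
    public static String normalizeConjunction(String whereClause) {
        String[] conjuncts = whereClause.split("(?i)\\s+AND\\s+");
        for (int i = 0; i < conjuncts.length; i++) {
            conjuncts[i] = conjuncts[i].trim();
        }
        Arrays.sort(conjuncts);
        return String.join(" AND ", conjuncts);
    }

    public static void main(String[] args) {
        System.out.println(normalizeConjunction("x=10 AND y=20"));  // x=10 AND y=20
        System.out.println(normalizeConjunction("y=20 AND x=10"));  // x=10 AND y=20
    }
}
{code}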

> 4. Is there a way for us to update the cache content if tables have been 
>updated? This might be related to what approach we take to add and re-validate 
>cache in (2).

Currently, no. I was thinking though that the user application can control 
invalidation using a hint, like this:

    /*+ CACHE_PERSISTENT('2018-01-01 12:00') */

The '2018-01-01 12:00' would be a suffix to whatever cacheId we generate, like 
this:

    cacheId = hash(cacheId + '2018-01-01 12:00')

which lets the application explicitly invalidate the cache when needed.
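
A minimal sketch of how that derivation could look (hypothetical helper class; 
the hash algorithm and encoding are assumptions, not the actual implementation):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a persistent-cache ID from the subquery SQL,
// optionally mixing in a user-supplied suffix from the hint so the application
// can force a new cache entry (i.e. invalidate the old one).
public class SubqueryCacheId {
    public static byte[] derive(String subquerySql, String hintSuffix) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(subquerySql.getBytes(StandardCharsets.UTF_8));
            if (hintSuffix != null) {
                // e.g. the '2018-01-01 12:00' value from CACHE_PERSISTENT('2018-01-01 12:00')
                digest.update(hintSuffix.getBytes(StandardCharsets.UTF_8));
            }
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
{code}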

> 5. A rather minor point as it just occurred to me: Can we have CacheEntry 
>implement Closable?

Yes. Just so I know, what is the benefit of this?

And yes, apologies for the messy code. I'm fixing it up today and it should be 
ready for a more thorough review today or tomorrow.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used 

[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-09 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 4/10/18 12:20 AM:


An update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client sends an RPC to 
each region server to check whether the subquery results are available (see the 
sketch at the end of this comment).
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 
cleanup/testing.

[1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131
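
A minimal sketch of the existence check described in (1), with hypothetical 
interfaces standing in for the real RPC endpoint:

{code:java}
import java.util.List;

// Hypothetical interfaces, not Phoenix's actual RPC surface.
interface RegionServerCacheRpc {
    boolean hasSubqueryCache(byte[] cacheId);     // "is the RHS already cached here?"
}

public class SubqueryCacheCheckSketch {
    // Returns true only if every region server involved in the join already has the
    // entry, so the client can skip re-executing the RHS subquery.
    static boolean cachedEverywhere(List<RegionServerCacheRpc> regionServers, byte[] cacheId) {
        for (RegionServerCacheRpc rs : regionServers) {
            if (!rs.hasSubqueryCache(cacheId)) {
                return false;                      // at least one server misses: fall back
            }
        }
        return true;
    }
}
{code}

On a miss, the client would execute the RHS subquery as usual and repopulate 
the cache on the servers under the same cache ID.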


was (Author: ortutay):
An update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client will send an RPC 
to each region server, and check if the subquery results are available.
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 
cleanup/testing.

[1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-09 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

An update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client will send an RPC 
to each region server, and check if the subquery results are available.
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 
cleanup/testing.

[1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PHOENIX-4679) Exit build-proto.sh if not using protoc v2.5.0

2018-03-29 Thread Marcell Ortutay (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcell Ortutay updated PHOENIX-4679:
-
Priority: Minor  (was: Major)

> Exit build-proto.sh if not using protoc v2.5.0
> --
>
> Key: PHOENIX-4679
> URL: https://issues.apache.org/jira/browse/PHOENIX-4679
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Minor
>
> If you use a version of protoc later than v2.5.0 to regenerate protobufs, 
> you'll get a diff from the current protobuf generated code, even if you made 
> no changes to the .proto files. I assume this is undesirable, so it would be 
> nice if the build-proto.sh script warned people about this. The following 
> check would do this:
> {code:bash}
> if [[ `protoc --version` != *"2.5.0"* ]]; then
>   echo "Must use protoc version 2.5.0"
>   exit 1
> fi
> {code}
> If this seems useful I can submit a PR to implement this



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PHOENIX-4679) Exit build-proto.sh if not using protoc v2.5.0

2018-03-29 Thread Marcell Ortutay (JIRA)
Marcell Ortutay created PHOENIX-4679:


 Summary: Exit build-proto.sh if not using protoc v2.5.0
 Key: PHOENIX-4679
 URL: https://issues.apache.org/jira/browse/PHOENIX-4679
 Project: Phoenix
  Issue Type: Improvement
Reporter: Marcell Ortutay


If you use a version of protoc later than v2.5.0 to regenerate protobufs, 
you'll get a diff from the current protobuf generated code, even if you made no 
changes to the .proto files. I assume this is undesirable, so it would be nice 
if the build-proto.sh script warned people about this. The following check 
would do this:
{code:bash}
if [[ `protoc --version` != *"2.5.0"* ]]; then
  echo "Must use protoc version 2.5.0"
  exit 1
fi
{code}
If this seems useful I can submit a PR to implement this



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-23 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/23/18 11:01 PM:


Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which can be used by the 
application to explicitly expire a cache in case the TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU-type system. A simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry 
(see the sketch below)
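
To make the ranking in (3) concrete, a small self-contained sketch (hypothetical 
class, not existing Phoenix code; entries backing a live query would be excluded 
from the queue entirely, per (1)):

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of the ranking idea above: entries with the lowest
// (recent hits / size) ratio are evicted first.
public class CacheEvictionSketch {
    static class Entry {
        final String cacheId;
        final long sizeBytes;
        final int hitsLastXMinutes;
        Entry(String cacheId, long sizeBytes, int hitsLastXMinutes) {
            this.cacheId = cacheId;
            this.sizeBytes = sizeBytes;
            this.hitsLastXMinutes = hitsLastXMinutes;
        }
        double rank() {
            return (double) hitsLastXMinutes / sizeBytes;
        }
    }

    public static void main(String[] args) {
        // Lowest rank at the head of the queue, i.e. the first eviction candidate.
        PriorityQueue<Entry> evictionOrder =
                new PriorityQueue<>(Comparator.comparingDouble(Entry::rank));
        evictionOrder.add(new Entry("subqueryA", 50_000_000L, 120));
        evictionOrder.add(new Entry("subqueryB", 200_000_000L, 3));
        System.out.println("evict first: " + evictionOrder.peek().cacheId); // subqueryB
    }
}
{code}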

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review


was (Author: ortutay):
Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which can be used by the 
application to explicitly expire a cache in case the TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not 

[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-23 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/23/18 11:01 PM:


Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which can be used by the 
application to explicitly expire a cache in case the TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review


was (Author: ortutay):
Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, eg.: /*+ SUBQUERY_CACHE('2018-03-23') */ which can be used by the 
application to explicitly expire a cache, in case TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-23 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, eg.: /*+ SUBQUERY_CACHE('2018-03-23') */ which can be used by the 
application to explicitly expire a cache, in case TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/21/18 10:46 PM:


Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . (Implemented on 4.7 for unrelated reasons, but the same idea I think is 
transferable to HEAD.) Instead of a random cache ID, it takes a hash of the 
query statement and uses that as the cache ID. Each Phoenix client maintains its own 
memory of which cache IDs have already been executed (this is not ideal, but it 
was easy to implement this way).

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to the Phoenix client that it should evaluate the subquery as usual (sketched 
below).
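
A minimal sketch of that optimistic flow (hypothetical exception and executor 
types, not Phoenix's actual classes):

{code:java}
// Hypothetical types illustrating the optimistic protocol described above.
class CacheNotFoundException extends Exception {}

interface JoinExecutor {
    void runJoinUsingCache(byte[] cacheId) throws CacheNotFoundException;
    void executeSubqueryAndPopulateCache(byte[] cacheId);
}

public class OptimisticCacheSketch {
    static void run(JoinExecutor executor, byte[] cacheId) {
        try {
            // Optimistically assume the region servers still hold the entry.
            executor.runJoinUsingCache(cacheId);
        } catch (CacheNotFoundException e) {
            // A region server no longer has it: evaluate the subquery as usual,
            // repopulate the cache under the same ID, then retry the join.
            executor.executeSubqueryAndPopulateCache(cacheId);
            try {
                executor.runJoinUsingCache(cacheId);
            } catch (CacheNotFoundException retryFailure) {
                throw new IllegalStateException("cache missing right after repopulation", retryFailure);
            }
        }
    }
}
{code}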


was (Author: ortutay):
Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . Instead of a random cache ID it takes a hash of the query statement and uses 
that as the cache ID. Each Phoenix client maintains its own memory of which 
cache IDs have already been executed (this is not ideal, but it was easy to 
implement this way).

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to Phoenix client that it should evaluate the subquery as usual.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . Instead of a random cache ID it takes a hash of the query statement and uses 
that as the cache ID. Each Phoenix client maintains its own memory of which 
cache IDs have already been executed (this is not ideal, but it was easy to 
implement this way).

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to Phoenix client that it should evaluate the subquery as usual.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408666#comment-16408666
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

As I mentioned above we’re working on a design proposal for this internally at 
23andMe, and there’s one big decision that I wanted to get feedback on.

There is currently a “server cache” that is used by the hash join process in 
Phoenix. Hash join tables are broadcast to all region servers that need it, and 
the hash joining happens via coprocessor. This cache is deleted after the query 
ends.

My first thought for a persistent cache was to re-use the server cache, and 
extend the TTL and change the key (“cacheId”) generation. I implemented this as 
a hacky proof-of-concept and it worked quite well; the performance was much 
improved.

However, I’m wondering if a separate cache makes more sense. The current server 
cache has a different use case than a persistent cache, and as such it may be a 
good idea to separate the two.

Some ways in which they are different:

- A persistent cache performs eviction when there is no space left. The server 
cache raises an exception, and the user must fall back to a sort-merge join 
instead.

- Users may want to configure the two differently, eg. allocate more space for 
a persistent cache than the server cache, and set a higher TTL

- The server cache data must be available on all region servers doing the hash 
join. In contrast, the persistent cache only needs one copy of the data across 
the system (i.e. across all region servers) until the data is needed. Doing 
this would be more space efficient, but would result in more network transfer.

- You could in theory have a pluggable system for the persistent cache, e.g. 
backed by memcached or something similar (see the sketch at the end of this 
comment)

 

That said, there are advantages to keeping it all in the server cache:

 

- Simpler implementation, does not add a new system to Phoenix

- Faster in the case that you get a cache hit, since there is no network 
transfer involved

 

Would love to get some feedback / opinions on this, thanks!
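
To make the pluggable-cache idea above concrete, a rough sketch of what such a 
seam could look like (hypothetical interface; neither a coprocessor-backed nor 
a memcached-backed implementation exists today):

{code:java}
import java.util.Optional;

// Hypothetical abstraction over where persistent subquery results live.
// One implementation could wrap the existing coprocessor server cache,
// another could talk to an external store such as memcached.
interface PersistentSubqueryCache {
    Optional<byte[]> get(byte[] cacheId);
    void put(byte[] cacheId, byte[] serializedRhs, long ttlMillis);
    void invalidate(byte[] cacheId);
}
{code}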

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)
Marcell Ortutay created PHOENIX-4666:


 Summary: Add a subquery cache that persists beyond the life of a 
query
 Key: PHOENIX-4666
 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
 Project: Phoenix
  Issue Type: Improvement
Reporter: Marcell Ortutay


The user list thread for additional context is here: 
[https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]



A Phoenix query may contain expensive subqueries, and moreover those expensive 
subqueries may be used across multiple different queries. While whole result 
caching is possible at the application level, it is not possible to cache 
subresults in the application. This can cause bad performance for queries in 
which the subquery is the most expensive part of the query, and the application 
is powerless to do anything at the query level. It would be good if Phoenix 
provided a way to cache subquery results, as it would provide a significant 
performance gain.

An illustrative example:

    SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = \{id}

In this case, the subquery "expensive_result" is expensive to compute, but it 
doesn't change between queries. The rest of the query does because of the \{id} 
parameter. This means the application can't cache it, but it would be good if 
there was a way to cache expensive_result.

Note that there is currently a coprocessor based "server cache", but the data 
in this "cache" is not persisted across queries. It is deleted after a TTL 
expires (30sec by default), or when the query completes.

This issue is fairly high priority for us at 23andMe and we'd be happy to 
provide a patch with some guidance from Phoenix maintainers. We are currently 
putting together a design document for a solution, and we'll post it to this 
Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)