[jira] [Commented] (PHOENIX-3271) Distribute UPSERT SELECT across cluster

Enis Soztutar (JIRA) Tue, 24 Jan 2017 11:51:42 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836531#comment-15836531
 ]


Enis Soztutar commented on PHOENIX-3271:
----------------------------------------

This is a nice improvement. 
[~rajeshbabu] has a patch which changes the rpc scheduler to be configured 
programmatically from the server side, related to PHOENIX-3360. Do we need that 
patch in before this? 

For this patch in general, depending on the handler priorities has proved to 
be brittle. However, this will work if we confirm that the index RPC handlers 
will be used in all cross-RS communication. Agreed that we have to fix the 
documentation, and also rename the "index" handlers. For the long term, I would 
rather have another approach, where the Phoenix RPC scheduler has a different 
thread pool (with low priority) to execute generic "tasks". In this case, the 
scan fragments would be executed from that task thread pool, while the upsert 
writes go to the normal thread pool. I think this kind of scan-with-inserts 
work should not be piggy-backed on the scan flow. 
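
To make the long-term idea concrete, here is a minimal sketch in plain Java. 
It is not Phoenix or HBase code: the class, pool names and pool sizes are all 
made up for illustration, and thread priority is only a hint to the OS. It 
just shows long-running scan fragments on a separate low-priority pool while 
the writes they produce go through the normal pool. 

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names exist in Phoenix or HBase.
class TaskPoolSketch {
    // Builds a fixed-size pool whose daemon threads run at the given priority.
    static ExecutorService newPool(String name, int size, int priority) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, name + "-" + counter.incrementAndGet());
            t.setPriority(priority);
            t.setDaemon(true);
            return t;
        };
        return Executors.newFixedThreadPool(size, factory);
    }

    // Normal pool: serves the upsert writes.
    static final ExecutorService WRITE_POOL = newPool("write", 4, Thread.NORM_PRIORITY);
    // Low-priority pool: runs the long scan fragments as generic "tasks".
    static final ExecutorService TASK_POOL = newPool("task", 2, Thread.MIN_PRIORITY);

    // Runs one simulated scan fragment on the task pool. Each batch it
    // produces is handed to the write pool; the total row count is returned.
    static long runFragment(int batches, int rowsPerBatch) {
        Future<Long> fragment = TASK_POOL.submit(() -> {
            long total = 0;
            for (int i = 0; i < batches; i++) {
                // Stand-in for an upsert write: just reports the batch size.
                Future<Integer> write = WRITE_POOL.submit(() -> rowsPerBatch);
                total += write.get();
            }
            return total;
        });
        try {
            return fragment.get(30, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException("scan fragment failed", e);
        }
    }
}
```

The point of the split is that a slow fragment only ties up a task thread, so 
the handlers serving normal writes are not held for the length of the scan. 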

One other thing is that these scan RPCs will take longer and may time out and 
be retried from the client, making the worst-case behavior a pretty bad user 
experience. Do we have any plans for dealing with that? In newer HBase 
versions, the scanner has heartbeats and can return early, close to the 
scanner lease timeout. Does that apply to these upsert selects? 

Maybe we should add a safeguard configuration in case larger clusters cannot 
execute the scan fragments within the RPC timeout. wdyt? 
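
Roughly what I have in mind for the safeguard, as a sketch only. The 
"example.*" key and the class are invented for illustration, and the 
fragment-time estimate is assumed to come from somewhere else. Only 
hbase.rpc.timeout is a real HBase setting. 

```java
import java.util.Properties;

// Hypothetical sketch: the "example.*" key is made up for illustration.
class UpsertSelectGuard {
    // Made-up switch that lets an operator turn the distributed path off.
    static final String ENABLED_KEY = "example.upsert.select.distributed.enabled";
    // Real HBase key for the RPC timeout, in milliseconds.
    static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";

    // Returns true only if the switch is on and the estimated time for one
    // scan fragment fits under the RPC timeout; otherwise the caller would
    // fall back to the existing client-side UPSERT SELECT path.
    static boolean runOnServer(Properties conf, long estimatedFragmentMillis) {
        boolean enabled = Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "true"));
        long rpcTimeout = Long.parseLong(conf.getProperty(RPC_TIMEOUT_KEY, "60000"));
        return enabled && estimatedFragmentMillis < rpcTimeout;
    }
}
```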

> Distribute UPSERT SELECT across cluster
> ---------------------------------------
>
>                 Key: PHOENIX-3271
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3271
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Ankit Singhal
>             Fix For: 4.10.0
>
>         Attachments: PHOENIX-3271.patch, PHOENIX-3271_v1.patch, 
> PHOENIX-3271_v2.patch, PHOENIX-3271_v3.patch, PHOENIX-3271_v4.patch, 
> PHOENIX-3271_v5.patch
>
>
> Based on some informal testing we've done, it seems that creation of a local 
> index is orders of magnitude faster than creation of global indexes (17 
> seconds versus 10-20 minutes - though more data is written in the global 
> index case). Under the covers, a global index is created through the running 
> of an UPSERT SELECT. Also, UPSERT SELECT provides an easy way of copying a 
> table. In both of these cases, the data being upserted must all flow back to 
> the same client which can become a bottleneck for a large table. Instead, 
> what can be done is to push each separate, chunked UPSERT SELECT call out to 
> a different region server for execution there. One way we could implement 
> this would be to have an endpoint coprocessor push the chunked UPSERT SELECT 
> out to each region server and return the number of rows that were upserted 
> back to the client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
