[
https://issues.apache.org/jira/browse/PHOENIX-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836531#comment-15836531
]
Enis Soztutar commented on PHOENIX-3271:
----------------------------------------
This is a nice improvement.
[~rajeshbabu] has a patch which changes the rpc scheduler to be configured
programmatically from the server side, related to PHOENIX-3360. Do we need that
patch in before this?
For this patch in general, depending on the handler priorities has proved to be
brittle. However, this will work if we confirm that the index RPC handlers are
used for all cross-RS communication. Agreed that we have to fix the
documentation and also rename the "index" handlers. For the long term, I would
rather have another approach, where the Phoenix RPC scheduler has a separate
thread pool (with low priority) to execute generic "tasks". In that case, the
scan fragments would be executed from that task thread pool, while the upsert
writes go to the normal thread pool. I don't think this kind of scan-with-inserts
work should be piggy-backed on the scan flow.
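To make the long-term idea above concrete, here is a minimal sketch of routing
work to two separate pools. It uses only java.util.concurrent and is not the
Phoenix or HBase RpcScheduler API; the class name, the Kind enum, and the pool
sizes are all hypothetical and chosen purely for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch only: NOT the actual Phoenix/HBase scheduler interface. */
final class TwoPoolDispatcher {

    /** Illustrative work categories; these names are assumptions. */
    public enum Kind { UPSERT_WRITE, SCAN_FRAGMENT }

    private final ExecutorService writePool; // normal handlers for upsert writes
    private final ExecutorService taskPool;  // low-priority pool for generic "tasks"

    TwoPoolDispatcher(int writeThreads, int taskThreads) {
        this.writePool = Executors.newFixedThreadPool(
                writeThreads, factory("write-handler", Thread.NORM_PRIORITY));
        this.taskPool = Executors.newFixedThreadPool(
                taskThreads, factory("task-handler", Thread.MIN_PRIORITY));
    }

    /** Builds named daemon threads. Java thread priority is only a hint to the OS. */
    private static ThreadFactory factory(String prefix, int priority) {
        AtomicInteger seq = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + seq.incrementAndGet());
            t.setPriority(priority);
            t.setDaemon(true);
            return t;
        };
    }

    /** Scan fragments go to the task pool; everything else goes to the write pool. */
    <T> Future<T> dispatch(Kind kind, Callable<T> work) {
        ExecutorService pool = (kind == Kind.SCAN_FRAGMENT) ? taskPool : writePool;
        return pool.submit(work);
    }

    void shutdown() {
        writePool.shutdown();
        taskPool.shutdown();
    }
}
```

With separate pools, a burst of long-running scan fragments can exhaust only the
task threads, so the upsert writes they issue still find free handlers in the
normal pool.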
One other thing is that these scan RPCs will take longer, so they may time out
and be retried from the client, making the worst-case behavior a pretty bad
user experience. Do we have any plans for dealing with that? On newer HBase
versions, the scanner has heartbeats and can return early when it gets close to
the scanner lease timeout. Does that apply to these upsert selects?
Maybe we should add a safeguard configuration in case larger clusters cannot
execute the scan fragments within the RPC timeout. wdyt?
> Distribute UPSERT SELECT across cluster
> ---------------------------------------
>
> Key: PHOENIX-3271
> URL: https://issues.apache.org/jira/browse/PHOENIX-3271
> Project: Phoenix
> Issue Type: Improvement
> Reporter: James Taylor
> Assignee: Ankit Singhal
> Fix For: 4.10.0
>
> Attachments: PHOENIX-3271.patch, PHOENIX-3271_v1.patch,
> PHOENIX-3271_v2.patch, PHOENIX-3271_v3.patch, PHOENIX-3271_v4.patch,
> PHOENIX-3271_v5.patch
>
>
> Based on some informal testing we've done, it seems that creation of a local
> index is orders of magnitude faster than creation of global indexes (17
> seconds versus 10-20 minutes - though more data is written in the global
> index case). Under the covers, a global index is created through the running
> of an UPSERT SELECT. Also, UPSERT SELECT provides an easy way of copying a
> table. In both of these cases, the data being upserted must all flow back to
> the same client which can become a bottleneck for a large table. Instead,
> what can be done is to push each separate, chunked UPSERT SELECT call out to
> a different region server for execution there. One way we could implement
> this would be to have an endpoint coprocessor push the chunked UPSERT SELECT
> out to each region server and return the number of rows that were upserted
> back to the client.