[ 
https://issues.apache.org/jira/browse/PHOENIX-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth updated PHOENIX-6944:
---------------------------------
    Description: 
Currently, the splits generated by PhoenixInputFormat are in ascending order.
MR does not use this ordering directly; instead, it orders the splits by size in 
descending order.
We set the size of each split to the region size (even when splitting by 
guideposts, though that is not really a problem in itself).
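
For reference, the scheduling behaviour described above can be illustrated roughly 
as follows. This is only a self-contained sketch of MR ordering splits by their 
reported length, largest first; it is not Phoenix or Hadoop source code.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;

public final class SplitOrderingDemo {

    // Returns the splits in the order MR schedules them: largest reported
    // length first. Because we report the region size for every split, splits
    // belonging to the same region compare equal and (with a stable sort)
    // stay next to each other, which is what groups the mappers by region.
    static List<InputSplit> scheduleOrder(List<InputSplit> splits) {
        List<InputSplit> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingLong(SplitOrderingDemo::lengthOf).reversed());
        return ordered;
    }

    private static long lengthOf(InputSplit split) {
        try {
            return split.getLength();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
{code}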
The result is that mapper tasks are grouped by region, so at any given time most 
of the running mappers are working on one or a few regions. As a result we have 
the following problems:

Read hotspotting:
All scan operations for the indexing job hit the same one or a few region 
servers, causing high load and slowdowns.

Write hotspotting:
If the data rowkeys and index rowkeys are strongly correlated, then the data read 
from one or a few data regions will be written to one or a few index regions, 
causing high load and slowdowns. This is a bit of a corner case; we have 
observed it when building an index on a column that starts with the same 
bytes as the primary key of the data table.

We can improve this by making sure that the generated mapper tasks are executed 
in a random order. The only way to change the execution order is to manipulate 
the reported length of the splits. As the length is only used for ordering and 
for calculating the completion percentage, this is unlikely to cause problems 
(we already report wildly inaccurate lengths when splitting by guideposts). A 
rough sketch of this approach follows below.
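
A minimal sketch of the idea. The class name, the RANDOMIZE_ORDER_KEY property 
and the setLength() mutator are illustrative assumptions, not the actual Phoenix 
API or the proposed patch:

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Sketch only: overrides the reported split lengths with random values so that
// MR's size-based ordering effectively becomes a random ordering of the mappers.
public abstract class RandomizingInputFormat<K, V> extends InputFormat<K, V> {

    // Hypothetical config key enabling the randomization.
    public static final String RANDOMIZE_ORDER_KEY =
            "phoenix.mapreduce.randomize.mapper.order";

    // Produces the real splits (e.g. one per region or per guidepost chunk).
    protected abstract List<InputSplit> generateSplits(JobContext context)
            throws IOException;

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = generateSplits(context);
        if (context.getConfiguration().getBoolean(RANDOMIZE_ORDER_KEY, true)) {
            Random rnd = new Random();
            for (InputSplit split : splits) {
                // The length is only used for ordering and progress reporting,
                // so replacing it with a random non-negative value is safe.
                // setLength() is a hypothetical mutator on the concrete split class.
                ((SizedSplit) split).setLength(rnd.nextLong() & Long.MAX_VALUE);
            }
        }
        return splits;
    }

    // Hypothetical interface for splits whose reported length can be overridden.
    public interface SizedSplit {
        void setLength(long length);
    }
}
{code}

Note that shuffling the split list itself would not help, because MR re-orders 
the splits by size before scheduling; only the reported length affects the final 
execution order.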

I've run some tests on a 50M-row, 40 GB data table, generating secondary indexes 
for a correlated field and for a random field.
The test system has three RegionServer workers and 12 YARN slots for running 
IndexTool.


||Index rebuild time||on correlated field||on random field||
|w/o randomization|50 min|28 min|
|w/ randomization|30 min|23 min|

> Randomize mapper task ordering for Indexing MR tools
> ----------------------------------------------------
>
>                 Key: PHOENIX-6944
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6944
>             Project: Phoenix
>          Issue Type: Improvement
>          Components: core
>            Reporter: Istvan Toth
>            Priority: Major
>


