[
https://issues.apache.org/jira/browse/KUDU-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723084#comment-17723084
]
ASF subversion and git services commented on KUDU-3452:
-------------------------------------------------------
Commit e734ae0216749c6aa7d85756eecf2b8be907de5a in kudu's branch
refs/heads/master from xinghuayu007
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e734ae021 ]
[KUDU-3452] Make validate tablet creating task not affected
Currently, creating a table with RF=n gets stuck when the number of
healthy tservers is less than n, because the catalog manager fails
to create tablets for it and retries continuously. At the same time,
creating a table with RF=m also gets stuck even if there are at
least m healthy tservers, because the catalog manager returns as
soon as it finds a failed tablet-creating task and does not try to
select replicas for the other PREPARING tablets. For example,
creating a three-replica table times out when one of three tablet
servers becomes unavailable. After that, creating a two-replica
table also times out even though there are enough tablet servers to
place its replicas: the valid two-replica table-creating task is
blocked by the invalid three-replica table-creating task.
This patch fixes this problem. If a tablet-creating task fails, the
catalog manager does not return immediately, but lets the tasks
creating other tablets keep running.
Change-Id: I64668651d0e8f58b92cfb841bdb20617de6776f9
Reviewed-on: http://gerrit.cloudera.org:8080/19594
Reviewed-by: Yifan Zhang <[email protected]>
Tested-by: Yifan Zhang <[email protected]>
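
The behaviour change described in the commit message can be illustrated with a small Python sketch. This is not the Kudu code (the catalog manager is written in C++); the names Tablet, select_replicas, process_pending_old and process_pending_new are hypothetical and only model "return on first failure" versus "record the failure and keep going".

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    # Hypothetical stand-in for a tablet waiting for replica placement.
    tablet_id: str
    replication_factor: int
    state: str = "PREPARING"


def select_replicas(tablet, live_tservers):
    # Fail when there are fewer live tservers than the requested RF.
    if len(live_tservers) < tablet.replication_factor:
        raise ValueError(
            f"Need at least {tablet.replication_factor} replicas, but only "
            f"{len(live_tservers)} tablet servers are available"
        )
    return live_tservers[: tablet.replication_factor]


def process_pending_old(tablets, live_tservers):
    # Models the pre-patch behaviour: the first failure ends the whole pass,
    # so later tablets are never considered.
    for tablet in tablets:
        try:
            select_replicas(tablet, live_tservers)
        except ValueError:
            return
        tablet.state = "CREATING"


def process_pending_new(tablets, live_tservers):
    # Models the patched behaviour: remember the failure, keep processing.
    failed = []
    for tablet in tablets:
        try:
            select_replicas(tablet, live_tservers)
        except ValueError:
            failed.append(tablet.tablet_id)
            continue
        tablet.state = "CREATING"
    return failed
```

With two live tservers and pending tablets of RF=3 followed by RF=2, the old loop leaves both tablets in PREPARING, while the new loop moves the RF=2 tablet forward and reports only the RF=3 tablet as failed.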
> Support creating three-replicas table or partition when only 2 tservers
> healthy
> -------------------------------------------------------------------------------
>
> Key: KUDU-3452
> URL: https://issues.apache.org/jira/browse/KUDU-3452
> Project: Kudu
> Issue Type: Improvement
> Reporter: Xixu Wang
> Priority: Major
>
> h1. Background
> In my case, a new Kudu table (called history_data_table) is created every
> day to store history data, and a new partition is added to another table
> (called business_data_table) to store today's data. These tables and
> partitions all require 3 replicas. This business logic is implemented by
> some Python scripts. My Kudu cluster contains 3 masters and 3 tservers. The
> flag --catalog_manager_check_ts_count_for_create_table is set to false.
> Sometimes one tserver becomes unavailable. The table-creating task then
> retries continuously and keeps failing until the tserver becomes healthy
> again. See the error:
> {color:#ff8b00}E0222 11:10:32.767140 3321 catalog_manager.cc:672] Error
> processing pending assignments: Invalid argument: error selecting replicas
> for tablet 41dffa9783f14f36a5b6c35e89075c1a, state:0: Not enough tablet
> servers are online for table 'test_table'. Need at least 3 replicas, but only
> 2 tablet servers are available{color}
> {color:#172b4d}As there are not enough tablet servers to host all replicas,
> the tablet is never created and its state is not running. Therefore, reads
> and writes to this tablet fail even though there are 2 tservers that could
> host 2 replicas.{color}
>
> An already created tablet can still serve requests even if one of its 3
> replicas becomes unavailable. Why can't a three-replica table be created
> when only 2 tservers are healthy?
>
> Besides, a valid table-creating task is affected by other invalid tasks.
> In the example above, a table-creating task with RF=1 still does not
> succeed even if more than one tablet server is alive, because the
> background task manager aborts the whole pass when it finds a failed
> tablet-creating task and begins a new pass that tries to execute all
> tasks again.
>
>
> h1. Design
> A new flag, --support_create_tablet_without_enough_healthy_tservers, is added.
> The original logic stays the same. When this flag is set to true, a
> three-replica tablet can be created successfully, and its status shows that
> it is missing one replica. This tablet can be read and written normally.
>
> There are 4 things to do:
> # A tool to cancel a table-creating task.
> # A tool to show the running table-creating tasks.
> # A method to create a table without enough healthy tservers.
> # Make valid table-creating tasks not affected by other invalid tasks.
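
As a rough illustration of the third item, the following Python sketch shows how replica selection might behave with and without an opt-in switch like the proposed flag. It is not Kudu code: choose_replicas and its allow_under_replicated parameter are hypothetical, and the requirement that a majority of the replicas be placeable is an assumption added here, not something stated in the issue.

```python
def choose_replicas(live_tservers, replication_factor, allow_under_replicated=False):
    # Enough live tservers: place exactly replication_factor replicas.
    if len(live_tservers) >= replication_factor:
        return list(live_tservers[:replication_factor])

    # Original behaviour: refuse to create the tablet.
    if not allow_under_replicated:
        raise ValueError(
            f"Need at least {replication_factor} replicas, but only "
            f"{len(live_tservers)} tablet servers are available"
        )

    # ASSUMPTION (not from the issue): still require a majority so that the
    # under-replicated tablet could elect a leader.
    majority = replication_factor // 2 + 1
    if len(live_tservers) < majority:
        raise ValueError(
            f"Need at least {majority} live tablet servers for RF="
            f"{replication_factor}, but only {len(live_tservers)} are available"
        )

    # Place replicas on every live tserver; the tablet starts out
    # missing one or more replicas.
    return list(live_tservers)
```

In this sketch, choose_replicas(["ts1", "ts2"], 3) raises an error, while choose_replicas(["ts1", "ts2"], 3, allow_under_replicated=True) returns both tservers and leaves the tablet one replica short.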
--
This message was sent by Atlassian Jira
(v8.20.10#820010)