[jira] [Updated] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle

2020-10-14 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3195:

Code Review: https://gerrit.cloudera.org/#/c/16581/

> Make DMS flush policy more robust when maintenance threads are idle
> ---
>
> Key: KUDU-3195
> URL: https://issues.apache.org/jira/browse/KUDU-3195
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.13.0
>Reporter: Alexey Serbin
>Priority: Major
>
> In one scenario I observed very long bootstrap times of tablet servers 
> (somewhere between 45 and 60 minutes) even though the tablet servers had a 
> relatively small amount of data under management (~80 GB).  It turned out 
> the time was spent on replaying WAL segments, with {{kudu cluster ksck}} 
> reporting something like the following the whole time during bootstrap:
> {noformat}
>   b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
> State:   BOOTSTRAPPING
> Data state:  TABLET_DATA_READY
> Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this 
> segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} 
> inserts{seen=5949247 
> ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
> {noformat}
> The workload I ran before shutting down the tablet servers consisted of many 
> small UPSERT operations, but the cluster had been idle for a long time (a few 
> hours or so) after the workload was terminated.  The workload was generated by:
> {noformat}
> kudu perf loadgen \
>   --table_name=$TABLE_NAME \
>   --num_rows_per_thread=8 \
>   --num_threads=4 \
>   --use_upsert \
>   --use_random_pk \
>   $MASTER_ADDR
> {noformat}
> The table that the UPSERT workload was running against had been pre-populated 
> by the following:
> {noformat}
> kudu perf loadgen --table_num_replicas=3 --keep-auto-table 
> --table_num_hash_partitions=5 --table_num_range_partitions=5 
> --num_rows_per_thread=8 --num_threads=4 $MASTER_ADDR
> {noformat}
> As it turned out, tablet servers accumulated a huge number of DMSs which 
> required flushing/compaction, but after the memory pressure subsided, the 
> compaction policy was scheduling just one operation per tablet every 120 
> seconds (the latter interval is controlled by {{\-\-flush_threshold_secs}}).  
> In fact, tablet servers could have flushed those rowsets non-stop since the 
> maintenance threads were completely idle otherwise and there was no active 
> workload running against the cluster.  Those DMSs had been around for a long 
> time (much more than 120 seconds) and were anchoring a lot of WAL segments.  
> So, the operations from the WAL had to be replayed once I restarted the 
> tablet servers.
> It would be great to update the flushing/compaction policy to allow tablet 
> servers to run {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older than 
> the age specified by {{\-\-flush_threshold_secs}} when the maintenance 
> threads are not busy otherwise.
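For illustration, a minimal C++ sketch of the proposed heuristic; the names and structures below are hypothetical and simplified, not Kudu's actual maintenance-manager code:

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the proposed policy (not Kudu's actual
// maintenance-manager code): when the maintenance threads have nothing else
// to do, flush every DMS whose age exceeds --flush_threshold_secs instead of
// flushing just one per tablet per scheduling interval, so the anchored WAL
// segments can be garbage-collected sooner.
struct DmsInfo {
  int64_t age_secs;  // time since the DMS received its first update
};

// Returns the indexes of the DMSs to flush in this scheduling pass.
std::vector<std::size_t> PickDmsToFlush(const std::vector<DmsInfo>& dms_list,
                                        int64_t flush_threshold_secs,
                                        bool maintenance_threads_idle) {
  std::vector<std::size_t> to_flush;
  for (std::size_t i = 0; i < dms_list.size(); ++i) {
    if (dms_list[i].age_secs < flush_threshold_secs) {
      continue;  // not old enough to be flushed on time-based criteria
    }
    to_flush.push_back(i);
    if (!maintenance_threads_idle) {
      break;  // busy: keep the old behavior of one time-based flush per pass
    }
  }
  return to_flush;
}
{code}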



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle

2020-10-14 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3195:

Status: In Review  (was: Open)

> Make DMS flush policy more robust when maintenance threads are idle
> ---
>
> Key: KUDU-3195
> URL: https://issues.apache.org/jira/browse/KUDU-3195
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.13.0
>Reporter: Alexey Serbin
>Priority: Major
>
> In one scenario I observed very long bootstrap times of tablet servers 
> (somewhere between 45 and 60 minutes) even though the tablet servers had a 
> relatively small amount of data under management (~80 GB).  It turned out 
> the time was spent on replaying WAL segments, with {{kudu cluster ksck}} 
> reporting something like the following the whole time during bootstrap:
> {noformat}
>   b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
> State:   BOOTSTRAPPING
> Data state:  TABLET_DATA_READY
> Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this 
> segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} 
> inserts{seen=5949247 
> ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
> {noformat}
> The workload I ran before shutting down the tablet servers consisted of many 
> small UPSERT operations, but the cluster had been idle for a long time (a few 
> hours or so) after the workload was terminated.  The workload was generated by:
> {noformat}
> kudu perf loadgen \
>   --table_name=$TABLE_NAME \
>   --num_rows_per_thread=8 \
>   --num_threads=4 \
>   --use_upsert \
>   --use_random_pk \
>   $MASTER_ADDR
> {noformat}
> The table that the UPSERT workload was running against had been pre-populated 
> by the following:
> {noformat}
> kudu perf loadgen --table_num_replicas=3 --keep-auto-table 
> --table_num_hash_partitions=5 --table_num_range_partitions=5 
> --num_rows_per_thread=8 --num_threads=4 $MASTER_ADDR
> {noformat}
> As it turned out, tablet servers accumulated a huge number of DMSs which 
> required flushing/compaction, but after the memory pressure subsided, the 
> compaction policy was scheduling just one operation per tablet every 120 
> seconds (the latter interval is controlled by {{\-\-flush_threshold_secs}}).  
> In fact, tablet servers could have flushed those rowsets non-stop since the 
> maintenance threads were completely idle otherwise and there was no active 
> workload running against the cluster.  Those DMSs had been around for a long 
> time (much more than 120 seconds) and were anchoring a lot of WAL segments.  
> So, the operations from the WAL had to be replayed once I restarted the 
> tablet servers.
> It would be great to update the flushing/compaction policy to allow tablet 
> servers to run {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older than 
> the age specified by {{\-\-flush_threshold_secs}} when the maintenance 
> threads are not busy otherwise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats

2020-10-14 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3149:

Code Review: https://gerrit.cloudera.org/#/c/16580/

> Lock contention between registering ops and computing maintenance op stats
> --
>
> Key: KUDU-3149
> URL: https://issues.apache.org/jira/browse/KUDU-3149
> Project: Kudu
>  Issue Type: Bug
>  Components: perf, tserver
>Reporter: Andrew Wong
>Priority: Critical
>
> We saw a bunch of tablets bootstrapping extremely slowly, and many seemingly 
> stuck bootstrapping but not showing up on the {{/tablets}} page, i.e. we 
> could only see INITIALIZED and RUNNING tablets, no BOOTSTRAPPING ones.
> Upon digging into the stacks, we saw a bunch of threads waiting to acquire 
> the MM lock:
> {code:java}
> TID 46577(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb59b99  kudu::tablet::Tablet::RegisterMaintenanceOps()
> @   0xb855a1  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> TID 46574(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb59c74  kudu::tablet::Tablet::RegisterMaintenanceOps()
> @   0xb855a1  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> 7 threads with same stack:
> TID 46575(tablet-open [wo):
> TID 46576(tablet-open [wo):
> TID 46578(tablet-open [wo):
> TID 46580(tablet-open [wo):
> TID 46581(tablet-open [wo):
> TID 46582(tablet-open [wo):
> TID 46583(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb85374  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> TID 46573(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb854c7  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> 2 threads with same stack:
> TID 43795(MaintenanceMgr ):
> TID 43796(MaintenanceMgr ):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x239a064  kudu::MaintenanceManager::LaunchOp()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> {code}
> A couple more stacks show some work being done by the maintenance manager:
> {code:java}
> TID 43794(MaintenanceMgr ):
> @ 0x7f1dd57147e0  (unknown)
> @   0xba7b41  
> kudu::tablet::BudgetedCompactionPolicy::RunApproximation()
> @   0xba8f5d  
> kudu::tablet::BudgetedCompactionPolicy::PickRowSets()
> @   0xb5b1a1  kudu::tablet::Tablet::PickRowSetsToCompact()
> @   0xb64e93  
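
The stacks above show tablet-open threads blocked in {{MaintenanceManager::RegisterOp()}} while maintenance threads hold the same lock doing expensive work. As a purely illustrative sketch (hypothetical names, not Kudu's actual implementation), one common way to reduce this kind of contention is to snapshot the registered ops under the lock and compute the expensive per-op stats outside of it:

{code:cpp}
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative sketch only: hold the registration lock just long enough to
// copy the list of registered ops, then compute per-op stats outside the
// lock so RegisterOp() callers (e.g. tablet bootstrap threads) are not
// blocked for the duration of the computation.
struct MaintenanceOp {
  virtual void UpdateStats() = 0;  // potentially expensive (e.g. compaction policy)
  virtual ~MaintenanceOp() = default;
};

class OpRegistry {
 public:
  void RegisterOp(std::shared_ptr<MaintenanceOp> op) {
    std::lock_guard<std::mutex> l(lock_);
    ops_.push_back(std::move(op));
  }

  void UpdateAllStats() {
    std::vector<std::shared_ptr<MaintenanceOp>> snapshot;
    {
      std::lock_guard<std::mutex> l(lock_);
      snapshot = ops_;  // cheap copy of shared_ptrs under the lock
    }
    for (auto& op : snapshot) {
      op->UpdateStats();  // expensive work done without holding lock_
    }
  }

 private:
  std::mutex lock_;
  std::vector<std::shared_ptr<MaintenanceOp>> ops_;
};
{code}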

[jira] [Updated] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats

2020-10-14 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3149:

Status: In Review  (was: Open)

> Lock contention between registering ops and computing maintenance op stats
> --
>
> Key: KUDU-3149
> URL: https://issues.apache.org/jira/browse/KUDU-3149
> Project: Kudu
>  Issue Type: Bug
>  Components: perf, tserver
>Reporter: Andrew Wong
>Priority: Critical
>
> We saw a bunch of tablets bootstrapping extremely slowly, and many seemingly 
> stuck bootstrapping but not showing up on the {{/tablets}} page, i.e. we 
> could only see INITIALIZED and RUNNING tablets, no BOOTSTRAPPING ones.
> Upon digging into the stacks, we saw a bunch of threads waiting to acquire 
> the MM lock:
> {code:java}
> TID 46577(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb59b99  kudu::tablet::Tablet::RegisterMaintenanceOps()
> @   0xb855a1  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> TID 46574(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb59c74  kudu::tablet::Tablet::RegisterMaintenanceOps()
> @   0xb855a1  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> 7 threads with same stack:
> TID 46575(tablet-open [wo):
> TID 46576(tablet-open [wo):
> TID 46578(tablet-open [wo):
> TID 46580(tablet-open [wo):
> TID 46581(tablet-open [wo):
> TID 46582(tablet-open [wo):
> TID 46583(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb85374  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> TID 46573(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb854c7  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> 2 threads with same stack:
> TID 43795(MaintenanceMgr ):
> TID 43796(MaintenanceMgr ):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x239a064  kudu::MaintenanceManager::LaunchOp()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> {code}
> A couple more stacks show some work being done by the maintenance manager:
> {code:java}
> TID 43794(MaintenanceMgr ):
> @ 0x7f1dd57147e0  (unknown)
> @   0xba7b41  
> kudu::tablet::BudgetedCompactionPolicy::RunApproximation()
> @   0xba8f5d  
> kudu::tablet::BudgetedCompactionPolicy::PickRowSets()
> @   0xb5b1a1  kudu::tablet::Tablet::PickRowSetsToCompact()
> @   0xb64e93  kudu::tablet::Tablet::Compact()
> @   

[jira] [Resolved] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-10-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3198.
-
Resolution: Fixed

Fixed with the patch mentioned in the prior comment.

Thank you [~zhangyifan27] for debugging and fixing the issue!

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0
>Reporter: YifanZhang
>Priority: Major
> Fix For: 1.14.0
>
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column from the table, this error 
> does not appear. The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested this with tables with different numbers of columns. The weird 
> thing is that I could delete full rows from a table with 8/16/32/63/65 
> columns, but couldn't do so if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-10-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3198:

Fix Version/s: 1.14.0

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0
>Reporter: YifanZhang
>Priority: Major
> Fix For: 1.14.0
>
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column from the table, this error 
> does not appear. The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested this with tables with different numbers of columns. The weird 
> thing is that I could delete full rows from a table with 8/16/32/63/65 
> columns, but couldn't do so if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-10-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3198:

Affects Version/s: 1.10.0
   1.10.1
   1.11.0
   1.11.1

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0
>Reporter: YifanZhang
>Priority: Major
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column from the table, this error 
> does not appear. The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested this with tables with different numbers of columns. The weird 
> thing is that I could delete full rows from a table with 8/16/32/63/65 
> columns, but couldn't do so if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2987) Intra location rebalance will crash in special case

2020-10-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2987:

Affects Version/s: 1.9.0
   1.10.0
   1.10.1

> Intra location rebalance will crash in special case
> ---
>
> Key: KUDU-2987
> URL: https://issues.apache.org/jira/browse/KUDU-2987
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.9.0, 1.10.0, 1.10.1, 1.11.0
>Reporter: ZhangYao
>Assignee: ZhangYao
>Priority: Major
> Fix For: 1.12.0, 1.11.1
>
>
> Recently I was doing a POC on rebalancing and got a core dump when running 
> intra-location rebalancing.
> Here is the log:
> {code:java}
> I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer 
> within location '/location/2044'
> F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != 
> collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
> *** Check failure stack trace: ***
>     @          0x111a75d  google::LogMessage::Fail()
>     @          0x111c6d3  google::LogMessage::SendToLog()
>     @          0x111a2b9  google::LogMessage::Flush()
>     @          0x111d0ef  google::LogMessageFatal::~LogMessageFatal()
>     @           0xe26da7  FindOrDie<>()
>     @           0xe1f204  
> kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
>     @           0xe162e0  
> kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
>     @           0xe15bf5  kudu::tools::RebalancerTool::RunWith()
>     @           0xe1db0e  kudu::tools::RebalancerTool::Run()
>     @           0xb6fea1  kudu::tools::(anonymous namespace)::RunRebalance()
>     @           0xb70e14  std::_Function_handler<>::_M_invoke()
>     @          0x11714a2  kudu::tools::Action::Run()
>     @           0xc00587  kudu::tools::DispatchCommand()
>     @           0xc00f4b  kudu::tools::RunTool()
>     @           0xb0fd6d  main
>     @     0x7f37086a4b15  __libc_start_main
>     @           0xb6b399  (unknown)
> {code}
> I found the problem may be in 
> {{RebalancerTool::AlgoBasedRunner::GetNextMovesImpl}}: when building 
> extra_info_by_tablet_id, it checks that the table id of every tablet occurs 
> in the table info. But when we build ClusterRawInfo in 
> {{RebalancerTool::KsckResultsToClusterRawInfo}}, we only collect the tables 
> that occur in the location, yet we collect all tablets in the cluster. 
>  This problem occurs when the location doesn't have replicas for every 
> table; it is likely to happen when the number of locations is far greater 
> than a table's replication factor.
>  
>  
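For illustration only, a small C++ sketch of the mismatch described above and one defensive way of handling it (hypothetical types, not the rebalancer's actual code; the actual fix may differ):

{code:cpp}
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative sketch of the consistency problem described above: if the
// per-location raw info contains every tablet in the cluster but only the
// tables that have replicas in that location, a strict map lookup by table
// id CHECK-fails (the "Map key not found" crash in the log). One defensive
// option is to skip tablets whose table is unknown in the location.
struct TabletInfo {
  std::string tablet_id;
  std::string table_id;
};

void BuildExtraInfo(
    const std::vector<TabletInfo>& tablets,
    const std::unordered_set<std::string>& tables_in_location,
    std::unordered_map<std::string, TabletInfo>* extra_info_by_tablet_id) {
  for (const auto& t : tablets) {
    if (tables_in_location.count(t.table_id) == 0) {
      // The table has no replicas in this location: ignore the tablet
      // instead of failing on a missing map key.
      continue;
    }
    (*extra_info_by_tablet_id)[t.tablet_id] = t;
  }
}
{code}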



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3199) kudu cluster rebalancer: add a flag to ignore all defined locations

2020-10-02 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3199:
---

 Summary: kudu cluster rebalancer: add a flag to ignore all defined 
locations
 Key: KUDU-3199
 URL: https://issues.apache.org/jira/browse/KUDU-3199
 Project: Kudu
  Issue Type: Improvement
  Components: CLI
Reporter: Alexey Serbin


When the location-aware rebalancer was designed, it was assumed that the tool 
should always honor the partitioning of the cluster defined by locations 
whatever the partitioning was.  The only options considered were to run the 
process excluding particular rebalancing phases:
* correction of placement policy violations ({{\-\-disable_policy_fixer}})
* inter-location rebalancing, i.e. rebalancing across different locations 
({{\-\-disable_cross_location_rebalancing}})
* intra-location rebalancing, i.e. rebalancing within a location 
({{\-\-disable_intra_location_rebalancing}})

As it turns out, there are use cases when people want to run the rebalancer on 
a location-aware cluster ignoring the location-awareness specifics.  Those 
cases are:
# The locations are defined by some higher-level cluster orchestration 
software, and people are reluctant to disable the location-awareness for Kudu 
specifically (i.e. by providing an alternative script for 
{{\-\-location_mapping_cmd}}), but want to even out the distribution of 
replicas.
# Having just two locations defined for some time.  Even if it's a transitional 
phase (e.g., waiting for a new zone/rack/datacenter to be added 'soon'), it 
could take some time.

For both cases, there is a workaround if every location has the same number of 
tablet servers: run the rebalancer tool with the {{\-\-disable_policy_fixer}} 
flag.  However, this workaround isn't applicable if there is a difference in 
the number of tablet replicas per location, and no combination of flags can 
make the location-aware rebalancer run as if there were no locations defined.

Let's add a new flag for the {{kudu cluster rebalance}} CLI tool to make it run 
on a location-aware cluster as if no locations were defined.  Of course, the 
flag should be {{off}} by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle

2020-09-17 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3195:
---

 Summary: Make DMS flush policy more robust when maintenance threads 
are idle
 Key: KUDU-3195
 URL: https://issues.apache.org/jira/browse/KUDU-3195
 Project: Kudu
  Issue Type: Improvement
  Components: tserver
Affects Versions: 1.13.0
Reporter: Alexey Serbin


In one scenario I observed very long bootstrap times of tablet servers 
(somewhere between 45 and 60 minutes) even though the tablet servers had a 
relatively small amount of data under management (~80 GB).  It turned out the 
time was spent on replaying WAL segments, with {{kudu cluster ksck}} reporting 
something like the following the whole time during bootstrap:

{noformat}
  b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
State:   BOOTSTRAPPING
Data state:  TABLET_DATA_READY
Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this 
segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} 
inserts{seen=5949247 
ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
{noformat}

The workload I ran before shutting down the tablet servers consisted of many 
small UPSERT operations, but the cluster had been idle for a long time (a few 
hours or so) after the workload was terminated.  The workload was generated by:
{noformat}
kudu perf loadgen \
  --table_name=$TABLE_NAME \
  --num_rows_per_thread=8 \
  --num_threads=4 \
  --use_upsert \
  --use_random_pk \
  $MASTER_ADDR
{noformat}

The table that the UPSERT workload was running against had been pre-populated 
by the following:
{noformat}
kudu perf loadgen --table_num_replicas=3 --keep-auto-table 
--table_num_hash_partitions=5 --table_num_range_partitions=5 
--num_rows_per_thread=8 --num_threads=4 $MASTER_ADDR
{noformat}

As it turned out, tablet servers accumulated a huge number of DMSs which required 
flushing/compaction, but after the memory pressure subsided, the compaction 
policy was scheduling just one operation per tablet every 120 seconds (the 
latter interval is controlled by {{\-\-flush_threshold_secs}}).  In fact, 
tablet servers could have flushed those rowsets non-stop since the maintenance 
threads were completely idle otherwise and there was no active workload running 
against the cluster.  Those DMSs had been around for a long time (much more than 
120 seconds) and were anchoring a lot of WAL segments.  So, the operations from 
the WAL had to be replayed once I restarted the tablet servers.

It would be great to update the flushing/compaction policy to allow tablet 
servers to run {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older than 
the age specified by {{\-\-flush_threshold_secs}} when the maintenance threads 
are not busy otherwise.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3194) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) sometimes fails

2020-09-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3194:
---

 Summary: 
testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) 
sometimes fails
 Key: KUDU-3194
 URL: https://issues.apache.org/jira/browse/KUDU-3194
 Project: Kudu
  Issue Type: Improvement
  Components: client, test
Affects Versions: 1.13.0, 1.14.0
Reporter: Alexey Serbin
 Attachments: test-output.txt.xz

The test scenario sometimes fails.

{noformat}  
Time: 55.485
There was 1 failure:
1) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest)
java.lang.AssertionError: expected:<100> but was:<99>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)

FAILURES!!!
Tests run: 30,  Failures: 1
{noformat}

The full log is attached (RELEASE build); the relevant stack trace looks like 
the following:
{noformat}
23:53:48.683 [ERROR - main] (RetryRule.java:219) 
org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot: 
failed attempt 1
java.lang.AssertionError: expected:<100> but was:<99>   
  at org.junit.Assert.fail(Assert.java:89) ~[junit-4.13.jar:4.13]   
  at org.junit.Assert.failNotEquals(Assert.java:835) ~[junit-4.13.jar:4.13] 
  at org.junit.Assert.assertEquals(Assert.java:647) ~[junit-4.13.jar:4.13]  
  at org.junit.Assert.assertEquals(Assert.java:633) ~[junit-4.13.jar:4.13]  
  at 
org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)
 ~[test/:?]
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_141] 
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_141]
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:1.8.0_141]
  at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]
  at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
 ~[junit-4.13.jar:4.13]
  at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 ~[junit-4.13.jar:4.13]
  at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
 ~[junit-4.13.jar:4.13]
  at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 ~[junit-4.13.jar:4.13]
  at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
~[junit-4.13.jar:4.13]
  at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
~[junit-4.13.jar:4.13]
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) 
~[junit-4.13.jar:4.13]
  at 
org.apache.kudu.test.junit.RetryRule$RetryStatement.doOneAttempt(RetryRule.java:217)
 [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
  at 
org.apache.kudu.test.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:234)
 [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
  at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
[junit-4.13.jar:4.13]
  at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
 [junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
[junit-4.13.jar:4.13]
  at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
 [junit-4.13.jar:4.13]
  at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
 [junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
[junit-4.13.jar:4.13]
  at org.junit.runners.Suite.runChild(Suite.java:128) [junit-4.13.jar:4.13] 
  at org.junit.runners.Suite.runChild(Suite.java:27) [junit-4.13.jar:4.13]  
  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
[junit-4.13.jar:4.13]
  at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 

[jira] [Resolved] (KUDU-1728) Parallelize tablet copy operations

2020-09-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1728.
-
Fix Version/s: 1.14.0
   Resolution: Fixed

> Parallelize tablet copy operations
> --
>
> Key: KUDU-1728
> URL: https://issues.apache.org/jira/browse/KUDU-1728
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tablet
>Reporter: Mike Percy
>Priority: Major
>  Labels: roadmap-candidate
> Fix For: 1.14.0
>
>
> Parallelize tablet copy operations. Right now all data is copied serially. We 
> may want to consider throttling on either side if we want to budget IO.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3182) 'last_known_addr' is not specified for single master Raft configuration

2020-08-31 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3182:

Code Review: https://gerrit.cloudera.org/#/c/16340/

> 'last_known_addr' is not specified for single master Raft configuration
> ---
>
> Key: KUDU-3182
> URL: https://issues.apache.org/jira/browse/KUDU-3182
> Project: Kudu
>  Issue Type: Task
>  Components: consensus, master
>Reporter: Bankim Bhavsar
>Assignee: Bankim Bhavsar
>Priority: Major
>
> The 'last_known_addr' field is not persisted for a single-master Raft
> configuration. This is okay as long as we have a single-master
> configuration. On dynamically transitioning from a single-master
> to a two-master configuration, the ChangeConfig() request to ADD_PEER
> fails the validation in VerifyRaftConfig().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-31 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-1587:

Fix Version/s: 1.13.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

With this 
[commit|https://github.com/apache/kudu/commit/ee3bb83575a051c2feade1f8c159b2902a7160d5],
 it's now possible to specify a threshold for op apply times.  While not yet 
enabled by default with a safe threshold for the apply queue times, this makes 
it possible to engage the new behavior of rejecting write requests, if needed.  
To do so, set the tablet server's {{\-\-tablet_apply_pool_overload_threshold_ms}} 
flag to the desired value (anything greater than 0 turns the new behavior on).

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 
> 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 
> 1.11.1
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Fix For: 1.13.0
>
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-1587:

Affects Version/s: 1.0.0
   1.0.1
   1.1.0
   1.2.0
   1.3.0
   1.3.1
   1.4.0
   1.5.0
   1.6.0
   1.7.0
   1.8.0
   1.7.1
   1.9.0
   1.10.0
   1.10.1
   1.11.0
   1.12.0
   1.11.1

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 
> 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 
> 1.11.1
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster

2020-08-28 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186743#comment-17186743
 ] 

Alexey Serbin commented on KUDU-3169:
-

This issue has been spotted elsewhere as well.  It seems to be specific to the 
Java Kudu client: the C++ and Python Kudu clients don't have this issue.

Below is a snippet from the client-side log:

{noformat}
20/07/20 00:26:41 INFO client.AsyncKuduClient: Invalidating location 
edd230034290421aa36bbf83c4b3b97e(tserver-00.local:7050) for tablet 
a3dbde5879d3486fa68f442dff1b86d5: Service unavailable: Scan request on 
kudu.tserver.TabletServerService from 10.80.34.23:54724 dropped due to 
backpressure. The service queue is full; it has 50 items.
20/07/20 00:26:42 WARN client.AsyncKuduScanner: 
a3dbde5879d3486fa68f442dff1b86d5@[592d79bf710046a88bf6da9799fe26d6(terver-01.local:7050),d8677f078c754b1dac4a1aad2c5c1c7e(tserver-01.local:7050)]
 pretends to not know KuduScanner(table=impala::t00.p00, tablet=null, 
scannerId="33e4c93f3ca84ef8b5cd40c4846573f7", scanRequestTimeout=3)
org.apache.kudu.client.NonRecoverableException: Scanner 
33e4c93f3ca84ef8b5cd40c4846573f7 not found (it may have expired)
{noformat}

The tablet server at {{tserver-00.local}} drops the RPC with the scan request, 
and the Kudu client proceeds to the next tablet server at {{tserver-01.local}}, 
sending a scan continuation (not a new scan) request there.  The tablet server 
at {{tserver-01.local}} responds with a {{Status::NotFound}} status carrying 
the specific error code {{TabletServerErrorPB::SCANNER_EXPIRED}}, hinting that 
the scanner with identifier {{33e4c93f3ca84ef8b5cd40c4846573f7}} might have 
already expired (see [the server-side 
code|https://github.com/apache/kudu/blob/c590a05778443bb6112e831d0b0ad0dce4b74724/src/kudu/tserver/scanners.cc#L170-L175]
 for details).

The tablet server at {{tserver-01.local}} could not find the scanner because 
the client hadn't started the scan with that tablet server; the scan had been 
started with the tablet server at {{tserver-00.local}}.
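
For illustration, some C++-style pseudocode of the failover decision described above (hypothetical names, not the actual client code): a scanner id is only valid on the tablet server where the scan was opened, so after failing over to another replica the client needs to open a new scanner resuming from the last key it received rather than send a continuation request.

{code:cpp}
#include <string>

// Illustrative sketch only (hypothetical names): decide whether a scan
// continuation request is valid for the chosen tablet server, or whether a
// brand-new scanner has to be opened there instead.
struct ScanState {
  std::string opened_on_ts_uuid;  // tserver where the scanner was created
  std::string scanner_id;         // valid only on opened_on_ts_uuid
  std::string last_primary_key;   // resume point for a fresh scan
};

enum class NextRequest { kContinue, kOpenNewScanner };

NextRequest ChooseRequest(const ScanState& state,
                          const std::string& target_ts_uuid) {
  if (!state.scanner_id.empty() && target_ts_uuid == state.opened_on_ts_uuid) {
    return NextRequest::kContinue;  // same server: continuation is valid
  }
  // Different server (e.g. after a backpressure-induced failover): the old
  // scanner id is unknown there, so open a new scanner at last_primary_key.
  return NextRequest::kOpenNewScanner;
}
{code}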


> kudu java client throws scanner expired error while processing large scan on  
> High-load cluster
> ---
>
> Key: KUDU-3169
> URL: https://issues.apache.org/jira/browse/KUDU-3169
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
>Reporter: mintao
>Priority: Major
>  Labels: scalability, stability
>
> A user submits a Spark task to scan a Kudu table with a large number of 
> records; after just a few minutes the job failed after 4 attempts, each 
> attempt failing with the error:
> {code:java}
>  org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) 
> org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at 
> org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>  at 
> org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) 
> at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at 
> org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Suppressed: 
> org.apache.kudu.client.KuduException$OriginalException: Original asynchronous 
> stack trace at 
> org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at 
> org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at 
> org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at 
> 

[jira] [Updated] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster

2020-08-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3169:

Labels: scalability stability  (was: )

> kudu java client throws scanner expired error while processing large scan on  
> High-load cluster
> ---
>
> Key: KUDU-3169
> URL: https://issues.apache.org/jira/browse/KUDU-3169
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
>Reporter: mintao
>Priority: Major
>  Labels: scalability, stability
>
> A user submits a Spark task to scan a Kudu table with a large number of 
> records; after just a few minutes the job failed after 4 attempts, each 
> attempt failing with the error:
> {code:java}
>  org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) 
> org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at 
> org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>  at 
> org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) 
> at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at 
> org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Suppressed: 
> org.apache.kudu.client.KuduException$OriginalException: Original asynchronous 
> stack trace at 
> org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at 
> org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at 
> org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at 
> org.apache.kudu.client.Connection.messageReceived(Connection.java:391) at 
> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>  at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243) at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>  at 
> 

[jira] [Updated] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster

2020-08-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3169:

Affects Version/s: 1.9.0
   1.10.0
   1.10.1
   1.11.0
   1.12.0
   1.11.1

> kudu java client throws scanner expired error while processing large scan on  
> High-load cluster
> ---
>
> Key: KUDU-3169
> URL: https://issues.apache.org/jira/browse/KUDU-3169
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
>Reporter: mintao
>Priority: Major
>
> A user submits a Spark task to scan a Kudu table with a large number of 
> records; after just a few minutes the job failed after 4 attempts, each 
> attempt failing with the error:
> {code:java}
>  org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) 
> org.apache.kudu.client.NonRecoverableException: Scanner 
> 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at 
> org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>  at 
> org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) 
> at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at 
> org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Suppressed: 
> org.apache.kudu.client.KuduException$OriginalException: Original asynchronous 
> stack trace at 
> org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at 
> org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at 
> org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at 
> org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at 
> org.apache.kudu.client.Connection.messageReceived(Connection.java:391) at 
> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>  at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243) at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>  at 
> 

[jira] [Resolved] (KUDU-3181) Compilation manager queue may have too many tasks

2020-08-15 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3181.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> Compilation manager queue may have too many tasks
> -
>
> Key: KUDU-3181
> URL: https://issues.apache.org/jira/browse/KUDU-3181
> Project: Kudu
>  Issue Type: Bug
>  Components: codegen
>Reporter: Li Zhiming
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: heap.svg
>
>
> When a client frequently scans for thousands of different columns, the 
> code_cache_hits rate is quite low. Compilation tasks are then frequently 
> submitted to the queue, but the compilation manager thread cannot consume 
> the queue quickly enough. The queue can accumulate tons of entries, each of 
> which retains a copy of the schema metadata, so a lot of memory is consumed 
> for a long time.
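For illustration, a minimal C++ sketch of one possible mitigation (hypothetical names, not Kudu's codegen module): bound the queue and de-duplicate pending compilation requests, so dropped requests simply fall back to non-generated code paths instead of pinning schema metadata.

{code:cpp}
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <unordered_set>

// Illustrative sketch only: cap the number of pending compilation tasks and
// drop duplicate requests for a schema that is already queued, since each
// queued task pins a copy of the schema metadata until it is processed.
class CompilationQueue {
 public:
  explicit CompilationQueue(std::size_t max_pending)
      : max_pending_(max_pending) {}

  // Returns true if the task was enqueued, false if it was dropped.
  bool Submit(const std::string& schema_fingerprint) {
    std::lock_guard<std::mutex> l(lock_);
    if (pending_.size() >= max_pending_) {
      return false;  // queue is full: skip codegen for this scan
    }
    if (!queued_keys_.insert(schema_fingerprint).second) {
      return false;  // an identical compilation request is already queued
    }
    pending_.push(schema_fingerprint);
    return true;
  }

 private:
  const std::size_t max_pending_;
  std::mutex lock_;
  std::queue<std::string> pending_;
  std::unordered_set<std::string> queued_keys_;
};
{code}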



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-14 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178138#comment-17178138
 ] 

Alexey Serbin edited comment on KUDU-1587 at 8/15/20, 2:55 AM:
---

I implemented the requested functionality with the following changelists:
* [a test scenario to simulate apply queue 
"overload"|https://gerrit.cloudera.org/#/c/16312/]
* [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/]
* [controlling the admission of write requests with CoDel-like 
approach|http://gerrit.cloudera.org:8080/16343]

These follow the approach suggested by [~tlipcon] in the comment above, but 
with a small variation: when deciding whether to reject an incoming write 
request, the number of already rejected requests isn't taken into account.  
Instead, the criterion is how long the queue has been in the 
'overloaded' state: the longer the queue stays overloaded, the greater the 
probability of rejection.
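
For illustration, a minimal C++ sketch of a rejection heuristic of this kind (hypothetical function and parameters, not Kudu's actual implementation):

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <random>

// Illustrative sketch only: once the apply queue has been over the configured
// threshold for a while, reject incoming writes with a probability that grows
// with the time the queue has continuously stayed in the overloaded state.
bool ShouldRejectWrite(int64_t overloaded_for_ms,
                       int64_t overload_threshold_ms,
                       std::mt19937* rng) {
  if (overload_threshold_ms <= 0 || overloaded_for_ms <= 0) {
    return false;  // feature disabled or queue not overloaded
  }
  // Rejection probability ramps up from 0 to 1 as the overloaded interval
  // grows to several multiples of the threshold (the multiplier is arbitrary
  // here, chosen just to illustrate the ramp).
  double p = std::min(1.0, static_cast<double>(overloaded_for_ms) /
                               (4.0 * overload_threshold_ms));
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(*rng) < p;
}
{code}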


was (Author: aserbin):
I implemented the requested functionality with the following changelists:
* [a test scenario to simulate apply queue 
"overload"|https://gerrit.cloudera.org/#/c/16312/]
* [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/]
* [controlling the admission of write requests with CoDel-like 
approach|http://gerrit.cloudera.org:8080/16343]


> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-14 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178138#comment-17178138
 ] 

Alexey Serbin commented on KUDU-1587:
-

I implemented the requested functionality with the following changelists:
* [a test scenario to simulate apply queue 
"overload"|https://gerrit.cloudera.org/#/c/16312/]
* [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/]
* [controlling the admission of write requests with CoDel-like 
approach|http://gerrit.cloudera.org:8080/16343]


> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-14 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-1587:

Code Review: http://gerrit.cloudera.org:8080/16343

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-08-11 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175778#comment-17175778
 ] 

Alexey Serbin edited comment on KUDU-3119 at 8/11/20, 6:44 PM:
---

I changed the priority to BLOCKER in the context of cutting a new release soon. 
It would be great to clarify the following:

# If this is a real race, could it affect data consistency, leading to data 
corruption or the like in the long run?
# If this is a race that could cause corruption (whether by the {{kudu}} CLI 
tool or by the kudu tablet server), I think it should be fixed before cutting 
the next release.


was (Author: aserbin):
I changed the priority to BLOCKER in the context of cutting a new release soon. 
It would be great to clarify the following:

# If this is a real race, could it affect data consistency, leading to data 
corruption or the like in the long run?
# If this is a race that could cause corruption (whether by the {{kudu}} CLI 
tool or by the kudu tablet server), it should be fixed before cutting the 
upcoming release.

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Blocker
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, 
> kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-08-11 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175778#comment-17175778
 ] 

Alexey Serbin commented on KUDU-3119:
-

I changed the priority to BLOCKER in the context of cutting a new release soon. 
It would be great to clarify the following:

# If this is a real race, could it affect data consistency, leading to data 
corruption or the like in the long run?
# If this is a race that could cause corruption (whether by the {{kudu}} CLI 
tool or by the kudu tablet server), it should be fixed before cutting the 
upcoming release.

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Blocker
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, 
> kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-08-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3119:

Priority: Blocker  (was: Minor)

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Blocker
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, 
> kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2020-08-07 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-1587:
---

Assignee: Alexey Serbin

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: roadmap-candidate
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-06 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172751#comment-17172751
 ] 

Alexey Serbin commented on KUDU-3180:
-

[~awong] put together a great summary of the recent discussion.  I just want to 
add my two cents, hoping they might be useful (if it makes sense).  It's about 
looking at this from a generic perspective (i.e., ignoring the compaction/flush 
implementation details :) ).

During our recent discussion of this issue with [~awong] and [~granthenke], one 
observation was that using {{memory_size * time_since_last_flush}} as the 
simplest proxy for the cost function makes it easier (at least for me) to 
comprehend an alternative policy that takes into account both the size and the 
age of the datasets to flush.  The idea is to make sure that, when 
flushes/compactions are starved relative to the rate of incoming updates, 
heavier chunks of data are more likely to be picked for flushing/compacting 
than smaller ones, even if the smaller ones have been around somewhat longer.  
However, super-old tiny data chunks are still picked up eventually, even if 
heavy updates arrive all the time.  So, picking the datasets with the highest 
values of the cost function among those which cross a pre-set threshold might 
be a model to think of.

As for compactions vs flushes, maybe it's possible to use a similar cost 
function but with a coefficient below 1.0 to reflect the notion that occupying 
disk storage is cheaper than occupying the same amount of RAM for the same 
time interval.
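
To illustrate (purely a toy scorer under the assumptions above, with made-up 
constants and field names, not the actual maintenance-manager cost model):

{noformat}
#include <algorithm>
#include <cstdint>

struct OpCandidate {
  int64_t anchored_bytes;         // memory (or bytes) the op would release
  int64_t secs_since_last_flush;  // how long the data has been sitting around
  bool is_compaction;             // flushes get full weight, compactions less
};

double PerfScore(const OpCandidate& c) {
  constexpr double kCompactionDiscount = 0.3;   // hypothetical "0.x" coefficient
  constexpr double kMinScoreThreshold = 1.0e6;  // pre-set threshold to cross
  double score = static_cast<double>(c.anchored_bytes) *
      static_cast<double>(std::max<int64_t>(c.secs_since_last_flush, 1));
  if (c.is_compaction) {
    score *= kCompactionDiscount;
  }
  // Candidates below the threshold aren't considered at all; among the rest,
  // the one with the highest score would be picked first.
  return score >= kMinScoreThreshold ? score : 0.0;
}
{noformat}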

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed for the tablet in a long time, which may lead to 
> starvation of ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which doesn't seem reasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3178) An option to terminate connections which have been open for very long time

2020-07-30 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168126#comment-17168126
 ] 

Alexey Serbin commented on KUDU-3178:
-

Ah, probably the essence was in _not_ terminating a connection, but instead 
sending back an authn error that would lead to re-authentication (e.g., 
re-acquiring an authn token).  Yes, that might be a good option for addressing 
the actual issue behind those long-lived connections.  Thanks!

> An option to terminate connections which have been open for very long time
> --
>
> Key: KUDU-3178
> URL: https://issues.apache.org/jira/browse/KUDU-3178
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security, tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} 
> and keep that connection open indefinitely by invoking some method at least 
> once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the 
> {{Ping()}} method).  This means there isn't a limit on how long a client can 
> have access to the cluster once it's authenticated, unless the 
> {{kudu-master}} and {{kudu-tserver}} processes are restarted.  When 
> fine-grained authorization is enforced, this issue is really benign because 
> such lingering clients are unable to call any methods that require an authz 
> token to be provided.
> It would be nice to address this by providing an option to terminate 
> connections which were established a long time ago.  Both the maximum 
> connection lifetime and whether to terminate over-the-TTL connections should 
> be controlled by flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3178) An option to terminate connections which have been open for very long time

2020-07-30 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168123#comment-17168123
 ] 

Alexey Serbin commented on KUDU-3178:
-

Why 'instead'?  That's exactly one of the ways I was thinking of implementing 
this :)

> An option to terminate connections which have been open for very long time
> --
>
> Key: KUDU-3178
> URL: https://issues.apache.org/jira/browse/KUDU-3178
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security, tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} 
> and keep that connection open indefinitely by invoking some method at least 
> once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the 
> {{Ping()}} method).  This means there isn't a limit on how long a client can 
> have access to the cluster once it's authenticated, unless the 
> {{kudu-master}} and {{kudu-tserver}} processes are restarted.  When 
> fine-grained authorization is enforced, this issue is really benign because 
> such lingering clients are unable to call any methods that require an authz 
> token to be provided.
> It would be nice to address this by providing an option to terminate 
> connections which were established a long time ago.  Both the maximum 
> connection lifetime and whether to terminate over-the-TTL connections should 
> be controlled by flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3178) An option to terminate connections which have been open for very long time

2020-07-29 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3178:

Summary: An option to terminate connections which have been open for very 
long time  (was: Terminate connections which have been open for long time)

> An option to terminate connections which have been open for very long time
> --
>
> Key: KUDU-3178
> URL: https://issues.apache.org/jira/browse/KUDU-3178
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security, tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} 
> and keep that connection open indefinitely by invoking some method at least 
> once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the 
> {{Ping()}} method).  This means there isn't a limit on how long a client can 
> have access to the cluster once it's authenticated, unless the 
> {{kudu-master}} and {{kudu-tserver}} processes are restarted.  When 
> fine-grained authorization is enforced, this issue is really benign because 
> such lingering clients are unable to call any methods that require an authz 
> token to be provided.
> It would be nice to address this by providing an option to terminate 
> connections which were established a long time ago.  Both the maximum 
> connection lifetime and whether to terminate over-the-TTL connections should 
> be controlled by flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3178) Terminate connections which have been open for longer than authn token expiration period

2020-07-29 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3178:
---

 Summary: Terminate connections which have been open for longer 
than authn token expiration period
 Key: KUDU-3178
 URL: https://issues.apache.org/jira/browse/KUDU-3178
 Project: Kudu
  Issue Type: Improvement
  Components: master, security, tserver
Reporter: Alexey Serbin


A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} and 
keep that connection open indefinitely by invoking some method at least once 
every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the 
{{Ping()}} method).  This means there isn't a limit on how long a client can 
have access to the cluster once it's authenticated, unless the {{kudu-master}} 
and {{kudu-tserver}} processes are restarted.  When fine-grained authorization 
is enforced, this issue is really benign because such lingering clients are 
unable to call any methods that require an authz token to be provided.

It would be nice to address this by providing an option to terminate 
connections which were established a long time ago.  Both the maximum 
connection lifetime and whether to terminate over-the-TTL connections should 
be controlled by flags.
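
For illustration, a minimal sketch of such a connection-TTL check, assuming a 
hypothetical per-connection tracker and flag (this does not reflect the actual 
RPC-layer implementation):

{noformat}
#include <chrono>

class ConnectionLifetimeTracker {
 public:
  // A max_lifetime of zero means the feature is disabled.
  explicit ConnectionLifetimeTracker(std::chrono::milliseconds max_lifetime)
      : max_lifetime_(max_lifetime),
        established_(std::chrono::steady_clock::now()) {}

  // Returns true once the connection has outlived the configured maximum.
  bool ShouldTerminate() const {
    if (max_lifetime_.count() == 0) {
      return false;
    }
    return std::chrono::steady_clock::now() - established_ >= max_lifetime_;
  }

 private:
  const std::chrono::milliseconds max_lifetime_;
  const std::chrono::steady_clock::time_point established_;
};
{noformat}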



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3178) Terminate connections which have been open for long time

2020-07-29 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3178:

Summary: Terminate connections which have been open for long time  (was: 
Terminate connections which have been open for longer than authn token 
expiration period)

> Terminate connections which have been open for long time
> 
>
> Key: KUDU-3178
> URL: https://issues.apache.org/jira/browse/KUDU-3178
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security, tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} 
> and keep that connection open indefinitely by invoking some method at least 
> once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the 
> {{Ping()}} method).  This means there isn't a limit on how long a client can 
> have access to the cluster once it's authenticated, unless the 
> {{kudu-master}} and {{kudu-tserver}} processes are restarted.  When 
> fine-grained authorization is enforced, this issue is really benign because 
> such lingering clients are unable to call any methods that require an authz 
> token to be provided.
> It would be nice to address this by providing an option to terminate 
> connections which were established a long time ago.  Both the maximum 
> connection lifetime and whether to terminate over-the-TTL connections should 
> be controlled by flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3173) Document time source options and recommendations

2020-07-22 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3173:

Description: 
It's necessary to document existing time source options and recommendations for 
Kudu.  Since the introduction of the [built-in NTP 
client|https://github.com/apache/kudu/commit/c103d51a52d00c3a9d062e06e20a5cc8c98df9a0]
 and the [{{auto}} time 
source|https://github.com/apache/kudu/commit/bd8e8f8b805bec5673590dffa67e48fbc9cfe208],
 more options are available while deploying Kudu clusters, but these are not 
properly documented yet.

Probably, the best place to add that information is at the [configuration 
page|https://kudu.apache.org/docs/configuration.html].

  was:
It's necessary to document existing time source options and recommendations for 
Kudu.  Since the introduction of the built-in NTP client and the {{auto}} time 
source, more options are available, but those are not documented.

Probably, the proper place to add that information is at the [configuration 
page|https://kudu.apache.org/docs/configuration.html].


> Document time source options and recommendations
> 
>
> Key: KUDU-3173
> URL: https://issues.apache.org/jira/browse/KUDU-3173
> Project: Kudu
>  Issue Type: Task
>  Components: documentation
>Reporter: Alexey Serbin
>Priority: Major
>
> It's necessary to document existing time source options and recommendations 
> for Kudu.  Since the introduction of the [built-in NTP 
> client|https://github.com/apache/kudu/commit/c103d51a52d00c3a9d062e06e20a5cc8c98df9a0]
>  and the [{{auto}} time 
> source|https://github.com/apache/kudu/commit/bd8e8f8b805bec5673590dffa67e48fbc9cfe208],
>  more options are available while deploying Kudu clusters, but these are not 
> properly documented yet.
> Probably, the best place to add that information is at the [configuration 
> page|https://kudu.apache.org/docs/configuration.html].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3173) Document time source options and recommendations

2020-07-22 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3173:
---

 Summary: Document time source options and recommendations
 Key: KUDU-3173
 URL: https://issues.apache.org/jira/browse/KUDU-3173
 Project: Kudu
  Issue Type: Task
Reporter: Alexey Serbin


It's necessary to document existing time source options and recommendations for 
Kudu.  Since the introduction of the built-in NTP client and the {{auto}} time 
source, more options are available, but those are not documented.

Probably, the proper place to add that information is at the [configuration 
page|https://kudu.apache.org/docs/configuration.html].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3172) Enable hybrid clock and built-in NTP client in Docker by default

2020-07-22 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162899#comment-17162899
 ] 

Alexey Serbin commented on KUDU-3172:
-

Another option is to set {{\-\-time_source=system_unsync}} if the entire 
dockerized Kudu cluster runs on a single host.

> Enable hybrid clock and built-in NTP client in Docker by default
> 
>
> Key: KUDU-3172
> URL: https://issues.apache.org/jira/browse/KUDU-3172
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.12.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Minor
>
> Currently the docker entrypoint sets `--use_hybrid_clock=false` by default. 
> This can cause unusual issues when snapshot scans are needed. Now that the 
> built-in NTP client is available, we should switch to using it by default in 
> the docker image by setting `--time_source=auto`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-07-10 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155693#comment-17155693
 ] 

Alexey Serbin commented on KUDU-3119:
-

One more TSAN trace: [^kudu-tool-test.3.txt.xz]

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, 
> kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-07-10 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3119:

Attachment: kudu-tool-test.3.txt.xz

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, 
> kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-07-09 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155096#comment-17155096
 ] 

Alexey Serbin commented on KUDU-3119:
-

It seems the issue started manifesting itself after 
https://github.com/apache/kudu/commit/98f44f4537ceddffedaf9afce26b634c4ab2142a

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-07-09 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155078#comment-17155078
 ] 

Alexey Serbin commented on KUDU-3119:
-

One more instance of the TSAN race report; the log is attached. 
[^kudu-tool-test.20200709.txt.xz] 

I guess there is a real race when trying to add a block at the same index from 
two different threads.  Yes, there is a lock per index, but consider what 
happens when two threads use {{operator[]}} to access an element that was not 
yet present:

{noformat}
bool LogBlockManager::AddLogBlock(LogBlockRefPtr lb) {
  // InsertIfNotPresent doesn't use move semantics, so instead we just
  // insert an empty scoped_refptr and assign into it down below rather
  // than using the utility function.
  int index = lb->block_id().id() & kBlockMapMask;
  std::lock_guard l(*managed_block_shards_[index].lock);
  auto& blocks_by_block_id = *managed_block_shards_[index].blocks_by_block_id;
  LogBlockRefPtr* entry_ptr = &blocks_by_block_id[lb->block_id()];
  if (*entry_ptr) {
// Already have an entry for this block ID.
return false;
  }
...
{noformat}
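
For readers less familiar with that code, here is a stripped-down, standalone 
rendering of the sharded get-or-insert pattern in the fragment above (shard 
count and types are illustrative; this is not the LogBlockManager code):

{noformat}
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

constexpr int kNumShards = 16;
constexpr uint64_t kShardMask = kNumShards - 1;

struct Shard {
  std::mutex lock;
  std::unordered_map<uint64_t, std::shared_ptr<int>> blocks_by_id;
};

Shard g_shards[kNumShards];

bool AddBlock(uint64_t block_id, std::shared_ptr<int> block) {
  Shard& shard = g_shards[block_id & kShardMask];
  std::lock_guard<std::mutex> l(shard.lock);
  // operator[] default-constructs an empty entry if the id isn't present yet;
  // the check below distinguishes "freshly inserted" from "already there".
  std::shared_ptr<int>& entry = shard.blocks_by_id[block_id];
  if (entry) {
    return false;  // already have an entry for this block id
  }
  entry = std::move(block);
  return true;
}
{noformat}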

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-07-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3119:

Attachment: kudu-tool-test.20200709.txt.xz

> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
> ---
>
> Key: KUDU-3119
> URL: https://issues.apache.org/jira/browse/KUDU-3119
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, test
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz
>
>
> Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
> reports races for TSAN builds:
> {noformat}
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
>  Failure
> Failed
> Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
> process exited with non-ze
> ro status 66
> Google Test trace:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
>  W0506 17:5
> 6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
> I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
> I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log 
> directory (fs_wal_dir) as metad
> ata directory
> I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
> manager: real 0.007s
> user 0.005s sys 0.002s
> I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
> files per process limi
> t of 1048576; it is already as high as it can go
> I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm 
> with capacity 419430
> ==
> WARNING: ThreadSanitizer: data race (pid=4432)
> ...
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3156) Whether the CVE-2019-17543 vulnerability of lz affects kudu

2020-07-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3156.
-
Fix Version/s: n/a
   Resolution: Information Provided

Kudu doesn't use the {{LZ4_compress_fast}} call, so it's not affected by 
CVE-2019-17543.

> Whether the CVE-2019-17543 vulnerability of lz affects kudu
> ---
>
> Key: KUDU-3156
> URL: https://issues.apache.org/jira/browse/KUDU-3156
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: yejiabao_h
>Priority: Major
> Fix For: n/a
>
>
> LZ4 before 1.9.2 has a heap-based buffer overflow in LZ4_write32 (related to 
> LZ4_compress_destSize), affecting applications that call LZ4_compress_fast 
> with a large input. (This issue can also lead to data corruption.) NOTE: the 
> vendor states "only a few specific / uncommon usages of the API are at risk." 
>      
> Does the CVE-2019-17543 vulnerability of LZ4 affect Kudu? If yes, what is 
> the impact?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3163) Long after restarting kudu-tserver nodes, follower replicas continue rejecting scan requests with 'Uninitialized: safe time has not yet been initialized' error

2020-07-09 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3163:
---

 Summary: Long after restarting kudu-tserver nodes, follower 
replicas continue rejecting scan requests with 'Uninitialized: safe time has 
not yet been initialized' error
 Key: KUDU-3163
 URL: https://issues.apache.org/jira/browse/KUDU-3163
 Project: Kudu
  Issue Type: Bug
  Components: tserver
Reporter: Alexey Serbin
 Attachments: logs.tar.bz2

There was a report on a strange state of tablet replicas after some sort of 
rolling restart.  ksck with checksum reported the tablet was fine, but follower 
replicas continued rejecting scan requests with {{Uninitialized: safe time has 
not yet been initialized}} error.  It seems the issue went away after forcing 
tablet leader re-election.  No new write operations (INSERT, UPDATE, DELETE) 
were issued against the tablet.

As already mentioned, some nodes in the cluster were restarted, and before 
doing that the {{\-\-follower_unavailable_considered_failed_sec}} flag was set 
to {{3600}}.

At this time, I don't have a clear picture of what was going on, but I just 
wanted to dump available information. I need to do a root cause analysis to 
produce a clear description and diagnosis for the issue.

The logs are attached (these are filtered tablet server logs containing the 
lines attributed only to the affected tablet: UUID 
{{c56432b0164e45d98175f26a54d65270}}).  At the time when the logs were 
captured, {{hdp025}} hosted the leader replica of the tablet, while {{hdp014}} 
and {{hdp035}} hosted the follower ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails

2020-07-06 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152197#comment-17152197
 ] 

Alexey Serbin commented on KUDU-3154:
-

This is one of the recent changelists that might be the culprit: 
https://github.com/apache/kudu/commit/15af717e8c200d4e7a33a25171a6bfe58d70aa65

However, I cannot repro the failure when running it via dist-test:

{noformat}
$ ~/bin/run_32_dist_test.sh ./bin/ranger_client-test 
--gtest_filter=RangerClientTestBase.TestLogging
[found] [hashed/size/to hash] [looked up/to lookup] [uploaded/size/to 
upload/size]
[3628] [3626/978.5Mib/3627] [0/3626] [0/0b/0/0b] 600ms
0d9dad6f00b682663cbf81cd1c93a849ce3a96f4  ranger_client-test.0
[3629] [3628/1.01Gib/3628] [3628/3628] [1/12.4Kib/2/2.05Mib] 8.3s
Hits:  3626 (1.01Gib)
Misses  : 2 (2.05Mib)
Duration: 9.607s
INFO:dist_test.client:Submitting job to 
http://dist-test.cloudera.org//submit_job
INFO:dist_test.client:Submitted job aserbin.1594055658.80007
INFO:dist_test.client:Watch your results at 
http://dist-test.cloudera.org//job?job_id=aserbin.1594055658.80007
 769.1s  32/32 tests complete
{noformat}

> RangerClientTestBase.TestLogging sometimes fails
> 
>
> Key: KUDU-3154
> URL: https://issues.apache.org/jira/browse/KUDU-3154
> Project: Kudu
>  Issue Type: Bug
>  Components: ranger, test
>Affects Versions: 1.13.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: ranger_client-test.txt.xz
>
>
> The {{RangerClientTestBase.TestLogging}} scenario of the 
> {{ranger_client-test}} sometimes fails (in all types of builds) with an error 
> message like the one below:
> {noformat}
> src/kudu/ranger/ranger_client-test.cc:398: Failure
> Failed
>   
> Bad status: Timed out: timed out while in flight  
>   
> I0620 07:06:02.907177  1140 server.cc:247] Received an EOF from the 
> subprocess  
> I0620 07:06:02.910923  1137 server.cc:317] get failed, inbound queue shut 
> down: Aborted:
> I0620 07:06:02.910964  1141 server.cc:380] outbound queue shut down: Aborted: 
>   
> I0620 07:06:02.910995  1138 server.cc:317] get failed, inbound queue shut 
> down: Aborted:
> I0620 07:06:02.910984  1139 server.cc:317] get failed, inbound queue shut 
> down: Aborted:
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-22 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Fix Version/s: 1.13.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, tserver
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
> Fix For: 1.13.0
>
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.
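
For illustration, a minimal sketch of the wait-free read idea mentioned above 
(packing the term and role into a single atomic word); this sketches the 
technique attributed to Yugabyte, not Kudu's actual implementation:

{noformat}
#include <atomic>
#include <cstdint>

enum class RaftRole : uint8_t { kFollower = 0, kCandidate = 1, kLeader = 2 };

class TermAndRole {
 public:
  // Writers (term/role changes) still run under the consensus lock; they just
  // additionally publish the new value here.
  void Store(int64_t term, RaftRole role) {
    state_.store(Pack(term, role), std::memory_order_release);
  }

  // Hot-path readers (e.g. a leadership/term check) can load the pair without
  // contending on the consensus lock.
  void Load(int64_t* term, RaftRole* role) const {
    const uint64_t v = state_.load(std::memory_order_acquire);
    *term = static_cast<int64_t>(v >> 8);
    *role = static_cast<RaftRole>(v & 0xff);
  }

 private:
  static uint64_t Pack(int64_t term, RaftRole role) {
    return (static_cast<uint64_t>(term) << 8) | static_cast<uint64_t>(role);
  }

  std::atomic<uint64_t> state_{0};
};
{noformat}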



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails

2020-06-20 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3154:
---

 Summary: RangerClientTestBase.TestLogging sometimes fails
 Key: KUDU-3154
 URL: https://issues.apache.org/jira/browse/KUDU-3154
 Project: Kudu
  Issue Type: Bug
  Components: ranger, test
Affects Versions: 1.13.0
Reporter: Alexey Serbin
 Attachments: ranger_client-test.txt.xz

The {{RangerClientTestBase.TestLogging}} scenario of the {{ranger_client-test}} 
sometimes fails (in all types of builds) with an error message like the one below:

{noformat}
src/kudu/ranger/ranger_client-test.cc:398: Failure
Failed  
Bad status: Timed out: timed out while in flight
I0620 07:06:02.907177  1140 server.cc:247] Received an EOF from the subprocess  
I0620 07:06:02.910923  1137 server.cc:317] get failed, inbound queue shut down: 
Aborted:
I0620 07:06:02.910964  1141 server.cc:380] outbound queue shut down: Aborted:   
I0620 07:06:02.910995  1138 server.cc:317] get failed, inbound queue shut down: 
Aborted:
I0620 07:06:02.910984  1139 server.cc:317] get failed, inbound queue shut down: 
Aborted:
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Component/s: (was: perf)
 tserver
 consensus

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, tserver
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Status: In Review  (was: In Progress)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Code Review: https://gerrit.cloudera.org/#/c/16034/

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2727:
---

Assignee: Alexey Serbin

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Labels: performance scalability  (was: )

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>  Labels: performance, scalability
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3129) ToolTest.TestHmsList can timeout

2020-06-16 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136729#comment-17136729
 ] 

Alexey Serbin commented on KUDU-3129:
-

The test also times out for RELEASE builds; the log is attached: 
[^kudu-tool-test.2.txt.xz]

> ToolTest.TestHmsList can timeout
> 
>
> Key: KUDU-3129
> URL: https://issues.apache.org/jira/browse/KUDU-3129
> Project: Kudu
>  Issue Type: Bug
>  Components: hms, test
>Affects Versions: 1.12.0
>Reporter: Andrew Wong
>Priority: Major
> Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz
>
>
> When running in TSAN mode, the test timed out, spending 10 minutes not really 
> doing anything. It isn't obvious why, but ToolTest.TestHmsList can time out, 
> appearing to hang while running the HMS tool.
> {code}
> I0521 22:31:49.436857  4601 catalog_manager.cc:1161] Initializing in-progress 
> tserver states...
> I0521 22:31:49.446161  4606 hms_notification_log_listener.cc:228] Skipping 
> Hive Metastore notification log poll: Service unavailable: Catalog manager is 
> not initialized. State: Starting
> I0521 22:31:49.839709  4488 heartbeater.cc:325] Connected to a master server 
> at 127.0.89.254:42487
> I0521 22:31:49.845547  4559 master_service.cc:295] Got heartbeat from unknown 
> tserver (permanent_uuid: "cf9e08c4271e4d9aa28b1aacbd630908" instance_seqno: 
> 1590100304311876) as {username='slave'} at 127.0.89.193:33867; Asking this 
> server to re-register.
> I0521 22:31:49.846786  4488 heartbeater.cc:416] Registering TS with master...
> I0521 22:31:49.847297  4488 heartbeater.cc:465] Master 127.0.89.254:42487 
> requested a full tablet report, sending...
> I0521 22:31:49.849771  4559 ts_manager.cc:191] Registered new tserver with 
> Master: cf9e08c4271e4d9aa28b1aacbd630908 (127.0.89.193:43527)
> I0521 22:31:49.852535   359 external_mini_cluster.cc:699] 1 TS(s) registered 
> with all masters
> W0521 22:32:23.142868  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b060 after lost signal to thread 4531
> W0521 22:32:23.14  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b780 after lost signal to thread 4591
> W0521 22:32:28.996440  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b740 after lost signal to thread 4531
> W0521 22:32:28.996966  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b520 after lost signal to thread 4591
> W0521 22:33:05.743249  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002aae0 after lost signal to thread 4360
> W0521 22:33:05.743983  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af00 after lost signal to thread 4486
> I0521 22:33:49.594769  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> FlushMRSOp(): perf score=0.033386
> I0521 22:33:49.637208  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> FlushMRSOp() complete. Timing: real 0.042s
> user 0.032s sys 0.008s Metrics: 
> {"bytes_written":6485,"cfile_init":1,"dirs.queue_time_us":675,"dirs.run_cpu_time_us":237,"dirs.run_wall_time_us":997,"drs_written":1,"lbm_read_time_us":231,"lbm_reads_lt_1ms":4,"lbm_write_time_us":1980,"lbm_writes_lt_1ms":27,"rows_written":5,"thread_start_us":953,"threads_started":2,"wal-append.queue_time_us":819}
> I0521 22:33:49.639096  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> UndoDeltaBlockGCOp(): 396 bytes on disk
> I0521 22:33:49.640486  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> UndoDeltaBlockGCOp() complete. Timing: real 
> 0.001suser 0.001s sys 0.000s Metrics: 
> {"cfile_init":1,"lbm_read_time_us":269,"lbm_reads_lt_1ms":4}
> W0521 22:34:17.794472  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002ade0 after lost signal to thread 4360
> W0521 22:34:17.795437  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a7e0 after lost signal to thread 4486
> W0521 22:34:20.286921  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b2e0 after lost signal to thread 4531
> W0521 22:34:20.287376  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b140 after lost signal to thread 4591
> W0521 22:35:27.726336  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af40 after lost signal to thread 4360
> W0521 22:35:27.727084  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a980 after lost signal to thread 4486
> W0521 22:36:12.250830  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b9c0 after lost signal to thread 4531
> W0521 22:36:12.251247  4545 

[jira] [Updated] (KUDU-3129) ToolTest.TestHmsList can timeout

2020-06-16 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3129:

Attachment: kudu-tool-test.2.txt.xz

> ToolTest.TestHmsList can timeout
> 
>
> Key: KUDU-3129
> URL: https://issues.apache.org/jira/browse/KUDU-3129
> Project: Kudu
>  Issue Type: Bug
>  Components: hms, test
>Affects Versions: 1.12.0
>Reporter: Andrew Wong
>Priority: Major
> Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz
>
>
> When running in TSAN mode, the test timed out, spending 10 minutes not really 
> doing anything. It isn't obvious why, but ToolTest.TestHmsList can time out, 
> appearing to hang while running the HMS tool.
> {code}
> I0521 22:31:49.436857  4601 catalog_manager.cc:1161] Initializing in-progress 
> tserver states...
> I0521 22:31:49.446161  4606 hms_notification_log_listener.cc:228] Skipping 
> Hive Metastore notification log poll: Service unavailable: Catalog manager is 
> not initialized. State: Starting
> I0521 22:31:49.839709  4488 heartbeater.cc:325] Connected to a master server 
> at 127.0.89.254:42487
> I0521 22:31:49.845547  4559 master_service.cc:295] Got heartbeat from unknown 
> tserver (permanent_uuid: "cf9e08c4271e4d9aa28b1aacbd630908" instance_seqno: 
> 1590100304311876) as {username='slave'} at 127.0.89.193:33867; Asking this 
> server to re-register.
> I0521 22:31:49.846786  4488 heartbeater.cc:416] Registering TS with master...
> I0521 22:31:49.847297  4488 heartbeater.cc:465] Master 127.0.89.254:42487 
> requested a full tablet report, sending...
> I0521 22:31:49.849771  4559 ts_manager.cc:191] Registered new tserver with 
> Master: cf9e08c4271e4d9aa28b1aacbd630908 (127.0.89.193:43527)
> I0521 22:31:49.852535   359 external_mini_cluster.cc:699] 1 TS(s) registered 
> with all masters
> W0521 22:32:23.142868  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b060 after lost signal to thread 4531
> W0521 22:32:23.14  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b780 after lost signal to thread 4591
> W0521 22:32:28.996440  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b740 after lost signal to thread 4531
> W0521 22:32:28.996966  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b520 after lost signal to thread 4591
> W0521 22:33:05.743249  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002aae0 after lost signal to thread 4360
> W0521 22:33:05.743983  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af00 after lost signal to thread 4486
> I0521 22:33:49.594769  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> FlushMRSOp(): perf score=0.033386
> I0521 22:33:49.637208  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> FlushMRSOp() complete. Timing: real 0.042s
> user 0.032s sys 0.008s Metrics: 
> {"bytes_written":6485,"cfile_init":1,"dirs.queue_time_us":675,"dirs.run_cpu_time_us":237,"dirs.run_wall_time_us":997,"drs_written":1,"lbm_read_time_us":231,"lbm_reads_lt_1ms":4,"lbm_write_time_us":1980,"lbm_writes_lt_1ms":27,"rows_written":5,"thread_start_us":953,"threads_started":2,"wal-append.queue_time_us":819}
> I0521 22:33:49.639096  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> UndoDeltaBlockGCOp(): 396 bytes on disk
> I0521 22:33:49.640486  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> UndoDeltaBlockGCOp() complete. Timing: real 
> 0.001suser 0.001s sys 0.000s Metrics: 
> {"cfile_init":1,"lbm_read_time_us":269,"lbm_reads_lt_1ms":4}
> W0521 22:34:17.794472  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002ade0 after lost signal to thread 4360
> W0521 22:34:17.795437  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a7e0 after lost signal to thread 4486
> W0521 22:34:20.286921  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b2e0 after lost signal to thread 4531
> W0521 22:34:20.287376  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b140 after lost signal to thread 4591
> W0521 22:35:27.726336  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af40 after lost signal to thread 4360
> W0521 22:35:27.727084  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a980 after lost signal to thread 4486
> W0521 22:36:12.250830  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b9c0 after lost signal to thread 4531
> W0521 22:36:12.251247  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b220 after lost signal to thread 4591
> W0521 

[jira] [Resolved] (KUDU-3145) KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called

2020-06-12 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3145.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
> -
>
> Key: KUDU-3145
> URL: https://issues.apache.org/jira/browse/KUDU-3145
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: zhaorenhai
>Assignee: huangtianhua
>Priority: Major
> Fix For: 1.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
>  
> Because the function APPEND_LINKER_FLAGS contains the following logic:
> {code:java}
> if ("${LINKER_FAMILY}" STREQUAL "gold")
>   if("${LINKER_VERSION}" VERSION_LESS "1.12" AND
>  "${KUDU_LINK}" STREQUAL "d")
> message(WARNING "Skipping gold <1.12 with dynamic linking.")
> continue()
>   endif()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126243#comment-17126243
 ] 

Alexey Serbin commented on KUDU-2727:
-

One more set of stack traces:

{noformat}
  tids=[1324418]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb4236d kudu::consensus::Peer::SendNextRequest()
0xb43771 
_ZN5boost6detail8function26void_function_obj_invoker0IZN4kudu9consensus4Peer13SignalRequestEbEUlvE_vE6invokeERNS1_15function_bufferE
   0x1eb1d1d kudu::FunctionRunnable::Run()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  
tids=[93293,93284,93285,93286,93287,93288,93289,93290,93291,93292,93304,93294,93295,93296,93297,93298,93299,93300,93301,93302,93303,93313,93322,93321,93320,93319,93318,93317,93316,93315,93314,93283,93312,93311,93310,93309,93308,93307,93306,93305]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync()
0xaa344c kudu::tablet::TabletReplica::SubmitWrite()
0x928fb0 kudu::tserver::TabletServiceImpl::Write()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  tids=[1324661]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7df8e kudu::consensus::RaftConsensus::Replicate()
0xaab8e7 kudu::tablet::TransactionDriver::Prepare()
0xaac009 kudu::tablet::TransactionDriver::PrepareTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  tids=[93383]
  0x7f61b79fc5e0 
  0x7f61b79f8cf2 __pthread_cond_timedwait
   0x1dfcfa9 kudu::ConditionVariable::WaitUntil()
0xb73bc7 kudu::consensus::RaftConsensus::UpdateReplica()
0xb75128 kudu::consensus::RaftConsensus::Update()
0x92c5d1 kudu::tserver::ConsensusServiceImpl::UpdateConsensus()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
{noformat}

Thread {{93383}} holds the lock while waiting on another condition variable, 
blocking many other threads.

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a 

[jira] [Comment Edited] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-03 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125466#comment-17125466
 ] 

Alexey Serbin edited comment on KUDU-2727 at 6/4/20, 3:14 AM:
--

Another set of stacks, just for more context (captured with code close to kudu 
1.10.1):

{noformat}
  tids=[1866940]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex()
0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
0xb47850 
_ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  
tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync()
0xaa344c kudu::tablet::TabletReplica::SubmitWrite()
0x928fb0 kudu::tserver::TabletServiceImpl::Write()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866932,1866929]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c24c kudu::log::Log::AsyncAppendCommit()
0xaad489 kudu::tablet::TransactionDriver::ApplyTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866928]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c493 kudu::log::Log::AsyncAppendReplicates()
0xb597e9 kudu::consensus::LogCache::AppendOperations()
0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations()
0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation()
0xb6f28c 
kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
0xb7dff8 kudu::consensus::RaftConsensus::Replicate()
0xaab8e7 kudu::tablet::TransactionDriver::Prepare()
0xaac009 kudu::tablet::TransactionDriver::PrepareTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
{noformat}

In the stacks above, thread {{1866928}} is holding a lock taken in 
{{RaftConsensus::Replicate()}} while waiting on a condition variable in 
{{Log::AsyncAppend()}}, calling 
{{entry_batch_queue_.BlockingPut(entry_batch.get())}}.
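As an aside, the contention pattern described in these comments can be sketched 
with standard-library primitives (a simplified illustration, not Kudu code): one 
thread holds a mutex across a blocking call that waits on a condition variable, 
and every thread that needs the same mutex stalls behind it.

{code}
// Simplified sketch of the pattern (not Kudu code): the "replicate" path holds
// a mutex across a blocking queue put, so all threads that need the same mutex
// (e.g. to check leadership) pile up behind it until a consumer drains the queue.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

std::mutex consensus_lock;               // stands in for the RaftConsensus lock
std::mutex queue_lock;
std::condition_variable queue_not_full;  // signaled by a consumer (not shown)
std::deque<int> entry_queue;
constexpr size_t kMaxQueueSize = 8;

void BlockingPut(int entry) {
  std::unique_lock<std::mutex> l(queue_lock);
  // Blocks until the consumer makes room in the queue.
  queue_not_full.wait(l, [] { return entry_queue.size() < kMaxQueueSize; });
  entry_queue.push_back(entry);
}

void Replicate(int entry) {
  std::lock_guard<std::mutex> l(consensus_lock);  // lock taken here...
  BlockingPut(entry);                             // ...and held across a blocking wait
}

bool CheckLeadershipAndBindTerm() {
  std::lock_guard<std::mutex> l(consensus_lock);  // stalls while Replicate() waits
  return true;
}
{code}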


was (Author: aserbin):
Another set of stacks, just for more context (captured with code close to kudu 
1.10.1):

{noformat}
  tids=[1866940]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex()
0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
0xb47850 
_ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  
tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 

[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2727:
---

Assignee: (was: Mike Percy)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-03 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125466#comment-17125466
 ] 

Alexey Serbin commented on KUDU-2727:
-

Another set of stacks, just for more context (captured with code close to kudu 
1.10.1):

{noformat}
  tids=[1866940]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex()
0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
0xb47850 
_ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  
tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync()
0xaa344c kudu::tablet::TabletReplica::SubmitWrite()
0x928fb0 kudu::tserver::TabletServiceImpl::Write()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866932,1866929]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c24c kudu::log::Log::AsyncAppendCommit()
0xaad489 kudu::tablet::TransactionDriver::ApplyTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866928]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c493 kudu::log::Log::AsyncAppendReplicates()
0xb597e9 kudu::consensus::LogCache::AppendOperations()
0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations()
0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation()
0xb6f28c 
kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
0xb7dff8 kudu::consensus::RaftConsensus::Replicate()
0xaab8e7 kudu::tablet::TransactionDriver::Prepare()
0xaac009 kudu::tablet::TransactionDriver::PrepareTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
{noformat}

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Mike Percy
>Priority: Major
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> 

[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Component/s: master

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> Add a few verification steps related to location assignment:
> * the location assignment executable is present and executable
> * the location assignment executable conforms to the expected interface: it 
> accepts one argument (an IP address or DNS name) and outputs the assigned 
> location to stdout
> * the same DNS name/IP address is assigned the same location
> * the resulting location output to stdout conforms to the format for 
> locations in Kudu
> It's possible to implement these in {{kudu-master}} using group flag 
> validators: see the {{GROUP_FLAG_VALIDATOR}} macro.
> Performing the few verification steps mentioned above should help avoid 
> situations where Kudu tablet servers cannot register with the Kudu master if 
> the location assignment executable path is misspelled or the executable 
> does not behave as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Description: 
Add a few verification steps related to location assignment:
* the location assignment executable is present and executable
* the location assignment executable conforms to the expected interface: it 
accepts one argument (an IP address or DNS name) and outputs the assigned location 
to stdout
* the same DNS name/IP address is assigned the same location
* the resulting location output to stdout conforms to the format for 
locations in Kudu

It's possible to implement these in {{kudu-master}} using group flag 
validators: see the {{GROUP_FLAG_VALIDATOR}} macro.

Performing the few verification steps mentioned above should help avoid 
situations where Kudu tablet servers cannot register with the Kudu master if 
the location assignment executable path is misspelled or the executable 
does not behave as expected.
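
For illustration, here is a rough sketch of how one of the checks above might be 
wired up as a group flag validator.  Everything here is an assumption made for 
the example: the flag name, the helper function, the header path, and the exact 
{{GROUP_FLAG_VALIDATOR}} signature should all be checked against the Kudu tree 
(presumably kudu/util/flag_validators.h); the sketch also treats the flag value 
as a bare executable path.

{code}
// Hypothetical sketch only; names and signatures are not confirmed against the
// Kudu sources.  At kudu-master startup, verify that the location-assignment
// command points at an existing executable file.
#include <sys/stat.h>
#include <unistd.h>

#include <string>

#include <gflags/gflags.h>
#include <glog/logging.h>

#include "kudu/util/flag_validators.h"  // assumed location of GROUP_FLAG_VALIDATOR

DECLARE_string(location_mapping_cmd);   // assumed name of the existing flag

namespace {

bool ValidateLocationMappingCmd() {
  const std::string& cmd = FLAGS_location_mapping_cmd;
  if (cmd.empty()) {
    return true;  // location awareness is disabled: nothing to verify
  }
  // For simplicity this treats the whole flag value as a path; a real
  // validator would split off any arguments first.
  struct stat st;
  if (stat(cmd.c_str(), &st) != 0 || !S_ISREG(st.st_mode) ||
      access(cmd.c_str(), X_OK) != 0) {
    LOG(ERROR) << "--location_mapping_cmd does not point to an executable file: "
               << cmd;
    return false;
  }
  return true;
}

}  // anonymous namespace

// Assumed usage: register a zero-argument callback that returns false on
// invalid configuration, preventing the master from starting up.
GROUP_FLAG_VALIDATOR(location_mapping_cmd, ValidateLocationMappingCmd);
{code}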

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> Add a few verification steps related to location assignment:
> * the location assignment executable is present and executable
> * the location assignment executable conforms to the expected interface: it 
> accepts one argument (an IP address or DNS name) and outputs the assigned 
> location to stdout
> * the same DNS name/IP address is assigned the same location
> * the resulting location output to stdout conforms to the format for 
> locations in Kudu
> It's possible to implement these in {{kudu-master}} using group flag 
> validators: see the {{GROUP_FLAG_VALIDATOR}} macro.
> Performing the few verification steps mentioned above should help avoid 
> situations where Kudu tablet servers cannot register with the Kudu master if 
> the location assignment executable path is misspelled or the executable 
> does not behave as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Labels: observability supportability  (was: )

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: observability, supportability
>
> Add a few verification steps related to location assignment:
> * the location assignment executable is present and executable
> * the location assignment executable conforms to the expected interface: it 
> accepts one argument (an IP address or DNS name) and outputs the assigned 
> location to stdout
> * the same DNS name/IP address is assigned the same location
> * the resulting location output to stdout conforms to the format for 
> locations in Kudu
> It's possible to implement these in {{kudu-master}} using group flag 
> validators: see the {{GROUP_FLAG_VALIDATOR}} macro.
> Performing the few verification steps mentioned above should help avoid 
> situations where Kudu tablet servers cannot register with the Kudu master if 
> the location assignment executable path is misspelled or the executable 
> does not behave as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-2169) Allow replicas that do not exist to vote

2020-06-02 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124314#comment-17124314
 ] 

Alexey Serbin edited comment on KUDU-2169 at 6/2/20, 9:08 PM:
--

Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 
scheme anymore.

With the 3-4-3 scheme there are scenarios where the system first evicts a 
replica and then adds a new non-voter replica: that happens when the replica to be 
evicted falls behind the WAL segment GC threshold or experiences a disk failure.  In 
very rare cases a tablet might end up with leader replica A not 
being able to replicate/commit the change in the Raft configuration, as 
described.

On the other hand, such a newly added replica D under the 3-4-3 scheme 
is a non-voter, and it cannot vote by definition.

In other words, some manual intervention would be necessary in the described 
scenario, but not in the way this JIRA proposes it to be implemented.

Closing as 'Won't Do'.


was (Author: aserbin):
Now we have the 3-4-3 replica management scheme, and we don't use 3-2-3 scheme 
anymore.

With the 3-4-3 scheme there are scenarios where the system first evicts a 
replica, and then adds a new non-voter replica: that's when the replica to be 
evicted fails behind WAL segment GC threshold or experience a disk failure.  In 
very rare cases it might happen that a tablet ends up with leader replica A, 
and replica A cannot replicate/commit the change in the Raft configuration as 
described.

>From the other side, such a newly replica D in case of the 3-4-3 scheme is a 
>non-voter, and it cannot vote by definition.

In other words, some manual intervention would be necessary in the described 
scenario, but not the way how this JIRA proposes.

Closing as 'Won't Do'.

> Allow replicas that do not exist to vote
> 
>
> Key: KUDU-2169
> URL: https://issues.apache.org/jira/browse/KUDU-2169
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Reporter: Mike Percy
>Priority: Major
> Fix For: n/a
>
>
> In certain scenarios it is desirable for replicas that do not exist on a 
> tablet server to be able to vote. After the implementation of KUDU-871, 
> tombstoned tablets are now able to vote. However, there are circumstances (at 
> least in a pre- KUDU-1097 world) where voters that do not have a copy of a 
> replica (running or tombstoned) would be needed to vote to ensure 
> availability in certain edge-case failure scenarios.
> The quick justification for why it would be safe for a non-existent replica 
> to vote is that it would be equivalent to a replica that has simply not yet 
> replicated any WAL entries, in which case it would be legal to vote for any 
> candidate. Of course, a candidate would only ask such a replica to vote for 
> it if it believed that replica to be a voter in its config.
> Some additional discussion can be found here: 
> https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted
> What follows is an example of a scenario where "non-existent" replicas being 
> able to vote would be desired:
> In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, 
> B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). 
> Before A is able to replicate this config change to B or D, A is partitioned 
> from a network perspective. However A writes this config change to its local 
> WAL. After this, the entire cluster is brought down, the network is restored, 
> and the entire cluster is restarted. However, B fails to come back online due 
> to a hardware failure.
> The only way to automatically recover in this scenario is to allow D, which 
> has no concept of the tablet being discussed, to vote for A to become leader, 
> which will then tablet copy to D and make the tablet available for writes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2169) Allow replicas that do not exist to vote

2020-06-02 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2169.
-
Fix Version/s: n/a
   Resolution: Won't Do

Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 scheme 
anymore.

With the 3-4-3 scheme there are scenarios where the system first evicts a 
replica and then adds a new non-voter replica: that happens when the replica to be 
evicted falls behind the WAL segment GC threshold or experiences a disk failure.  In 
very rare cases a tablet might end up with leader replica A, 
and replica A cannot replicate/commit the change in the Raft configuration, as 
described.

On the other hand, such a newly added replica D under the 3-4-3 scheme is a 
non-voter, and it cannot vote by definition.

In other words, some manual intervention would be necessary in the described 
scenario, but not in the way this JIRA proposes.

Closing as 'Won't Do'.

> Allow replicas that do not exist to vote
> 
>
> Key: KUDU-2169
> URL: https://issues.apache.org/jira/browse/KUDU-2169
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Reporter: Mike Percy
>Priority: Major
> Fix For: n/a
>
>
> In certain scenarios it is desirable for replicas that do not exist on a 
> tablet server to be able to vote. After the implementation of KUDU-871, 
> tombstoned tablets are now able to vote. However, there are circumstances (at 
> least in a pre- KUDU-1097 world) where voters that do not have a copy of a 
> replica (running or tombstoned) would be needed to vote to ensure 
> availability in certain edge-case failure scenarios.
> The quick justification for why it would be safe for a non-existent replica 
> to vote is that it would be equivalent to a replica that has simply not yet 
> replicated any WAL entries, in which case it would be legal to vote for any 
> candidate. Of course, a candidate would only ask such a replica to vote for 
> it if it believed that replica to be a voter in its config.
> Some additional discussion can be found here: 
> https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted
> What follows is an example of a scenario where "non-existent" replicas being 
> able to vote would be desired:
> In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, 
> B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). 
> Before A is able to replicate this config change to B or D, A is partitioned 
> from a network perspective. However A writes this config change to its local 
> WAL. After this, the entire cluster is brought down, the network is restored, 
> and the entire cluster is restarted. However, B fails to come back online due 
> to a hardware failure.
> The only way to automatically recover in this scenario is to allow D, which 
> has no concept of the tablet being discussed, to vote for A to become leader, 
> which will then tablet copy to D and make the tablet available for writes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-1621) Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND session

2020-06-02 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1621.
-
Fix Version/s: n/a
   Resolution: Won't Fix

Automatically flushing data in the {{KuduSession}} might block, indeed.  It 
seems the current approach of issuing a warning when data is not flushed is 
good enough: it's uniform across all flush modes and avoids the application 
hanging on close.
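
For reference, a short sketch of the usage implied by this resolution follows: 
with AUTO_FLUSH_BACKGROUND, the application flushes explicitly before the 
session goes out of scope.  The table name and column name are illustrative, 
and the snippet is a sketch based on the public C++ client API rather than a 
verified example.

{code}
// Sketch of explicit flushing with an AUTO_FLUSH_BACKGROUND session: the
// session destructor only warns about (and drops) pending operations, so the
// application calls Flush() itself before the session goes away.
#include <memory>

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduInsert;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

Status WriteRows(const shared_ptr<KuduClient>& client) {
  shared_ptr<KuduTable> table;
  Status s = client->OpenTable("example_table", &table);  // illustrative table
  if (!s.ok()) return s;

  shared_ptr<KuduSession> session = client->NewSession();
  s = session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND);
  if (!s.ok()) return s;

  for (int i = 0; i < 100; ++i) {
    std::unique_ptr<KuduInsert> insert(table->NewInsert());
    s = insert->mutable_row()->SetInt32("key", i);  // illustrative column
    if (!s.ok()) return s;
    s = session->Apply(insert.release());  // Apply() takes ownership
    if (!s.ok()) return s;
  }
  // Flush explicitly: pending operations are not flushed on destruction.
  return session->Flush();
}
{code}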

> Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND 
> session
> ---
>
> Key: KUDU-1621
> URL: https://issues.apache.org/jira/browse/KUDU-1621
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.0.0
>Reporter: Alexey Serbin
>Priority: Major
> Fix For: n/a
>
>
> In the current implementation of the AUTO_FLUSH_BACKGROUND mode, it's necessary to 
> call KuduSession::Flush() or KuduSession::FlushAsync() explicitly before 
> destroying/abandoning a session if it's desired to have any pending 
> operations flushed.
> As [~adar] noticed during review of https://gerrit.cloudera.org/#/c/4432/ , 
> it might make sense to change this behavior to automatically flush any 
> pending operations upon closing a Kudu AUTO_FLUSH_BACKGROUND session.  That 
> would be more consistent with the semantics of the AUTO_FLUSH_BACKGROUND mode 
> and more user-friendly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-05-31 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120608#comment-17120608
 ] 

Alexey Serbin commented on KUDU-3131:
-

Hi [~RenhaiZhao], on the server where I do a lot of compilation/testing, the 
version of glibc is {{2.12-1.149.el6_6.9}}.  It's a really old installation: 
CentOS 6.6.

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> We built and tested Kudu on aarch64; in release mode one test sometimes hangs 
> (maybe a deadlock?). The console output is as follows:
> [==] Running 2 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 2 tests from Priorities/RWMutexTest
> [ RUN  ] Priorities/RWMutexTest.TestDeadlocks/0
> It seems to be OK in debug mode.
> For now only this one test fails sometimes on aarch64. [~aserbin] [~adar], would 
> you please have a look at this, or give us some suggestions? Thanks very 
> much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-368) Run local benchmarks under perf-stat

2020-05-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-368:
--

Assignee: Alexey Serbin

> Run local benchmarks under perf-stat
> 
>
> Key: KUDU-368
> URL: https://issues.apache.org/jira/browse/KUDU-368
> Project: Kudu
>  Issue Type: Improvement
>  Components: test
>Affects Versions: M4.5
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: benchmarks, perf
>
> Would be nice to run a lot of our nightly benchmarks under perf-stat so we 
> can see on regression what factors changed (eg instruction count, cycles, 
> stalled cycles, cache misses, etc)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2604) Add label for tserver

2020-05-28 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118788#comment-17118788
 ] 

Alexey Serbin commented on KUDU-2604:
-

[~granthenke], yes, I think the remaining functionality can be broken down into 
smaller JIRA items.  At a higher level, I see the following pieces:
* Define and assign tags to tablet servers.
* Update the master's placement policies to take tags into account when 
adding/distributing tablet replicas.
* Add support to the C++ and Java clients: clients can specify a set of tags when 
creating tables.
* The {{kudu cluster rebalance}} tool and the auto-rebalancer honor the tags 
when rebalancing the corresponding tables.  The tool is also able to report on 
tablet replicas that are placed in a non-conforming way w.r.t. the tags specified 
for their tables (such non-conformantly placed replicas might appear during 
automatic re-replication, similar to what happens with the current placement 
policies).
* The {{kudu cluster ksck}} CLI tool provides information on the tags of tablet 
servers.

We can create sub-tasks for this if we decide to implement it.

> Add label for tserver
> -
>
> Key: KUDU-2604
> URL: https://issues.apache.org/jira/browse/KUDU-2604
> Project: Kudu
>  Issue Type: New Feature
>Reporter: Hong Shen
>Priority: Major
>  Labels: location-awareness, rack-awareness
> Fix For: n/a
>
> Attachments: image-2018-10-15-21-52-21-426.png
>
>
> As the cluster grows bigger and bigger, a big table with a lot of tablets will 
> be distributed across almost all the tservers; when a client writes batches to the big 
> table, it may cache connections to lots of tservers, so scalability may be 
> constrained.
> If the tablets of one table or partition live on only a subset of the tservers, a client 
> will only have to cache connections to that subset of tservers. So we propose to 
> add a label to tservers, with each tserver belonging to a unique label. The client 
> specifies a label when creating a table or adding a partition, and the tablets will only be 
> created on the tservers with the specified label; if no label is specified, a default label 
> will be used. 
>  It will also benefit:
> 1 Tservers spanning data centers.
> 2 Heterogeneous tservers, e.g. different disk, CPU, or memory.
> 3 Physical isolation, especially of IO, to isolate some tables from others.
> 4 Gated launch: upgrading tservers one label at a time.
> In our production cluster, we have encountered the above issues, and they need to be 
> resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-2604) Add label for tserver

2020-05-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reopened KUDU-2604:
-

It seems this JIRA item contains some useful ideas and details which are 
orthogonal to current implementation of the rack awareness feature.  If 
implemented, they might complement the overall functionality of the placement 
policies in Kudu.  I'm removing the duplicate of KUDU-1535 resolution.

> Add label for tserver
> -
>
> Key: KUDU-2604
> URL: https://issues.apache.org/jira/browse/KUDU-2604
> Project: Kudu
>  Issue Type: New Feature
>Reporter: Hong Shen
>Priority: Major
>  Labels: location-awareness, rack-awareness
> Fix For: n/a
>
> Attachments: image-2018-10-15-21-52-21-426.png
>
>
> As the cluster grows bigger and bigger, a big table with a lot of tablets will 
> be distributed across almost all the tservers.  When a client writes batches to 
> the big table, it may cache connections to lots of tservers, so scalability may 
> be constrained.
> If the tablets of one table or partition are placed only on a subset of tservers, 
> a client will only have to cache connections to that subset.  So we propose to 
> add labels to tservers, with each tserver belonging to a unique label.  A client 
> specifies a label when creating a table or adding a partition, and the tablets 
> will only be created on the tservers with the specified label; if none is 
> specified, the default label will be used. 
>  It will also be beneficial for:
> 1 Tservers spanning data centers.
> 2 Heterogeneous tservers, e.g. with different disks, CPU, or memory.
> 3 Physical isolation, especially of IO, to isolate some tables from others.
> 4 Gated launch: upgrade tservers one label at a time.
> In our production cluster, we have encountered the above issues and they need 
> to be resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1865) Create fast path for RespondSuccess() in KRPC

2020-05-26 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117387#comment-17117387
 ] 

Alexey Serbin commented on KUDU-1865:
-

Some more stacks captured from diagnostic logs for {{kudu-master}} process 
(kudu 1.10):

{noformat}
Stacks at 0516 18:53:00.042003 (service queue overflowed for 
kudu.master.MasterService):
  tids=[736230]
  0x7f803a76a5e0 
0xb6219e tcmalloc::ThreadCache::ReleaseToCentralCache()
0xb62530 tcmalloc::ThreadCache::Scavenge()
0xad8a27 
kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
0xaa3a31 kudu::master::MasterServiceImpl::GetTableSchema()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  tids=[736248,736245,736243,736242]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xac5814 kudu::master::CatalogManager::CheckOnline()
0xae5032 kudu::master::CatalogManager::GetTableSchema()
0xaa3a85 kudu::master::MasterServiceImpl::GetTableSchema()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  
tids=[736239,736229,736232,736233,736234,736235,736236,736237,736238,736240,736241,736244,736247]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xac5814 kudu::master::CatalogManager::CheckOnline()
0xaf102f kudu::master::CatalogManager::GetTableLocations()
0xaa36f8 kudu::master::MasterServiceImpl::GetTableLocations()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  tids=[736246,736231]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xad8b7c 
kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
0xaa369d kudu::master::MasterServiceImpl::GetTableLocations()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
{noformat}

> Create fast path for RespondSuccess() in KRPC
> -
>
> Key: KUDU-1865
> URL: https://issues.apache.org/jira/browse/KUDU-1865
> Project: Kudu
>  Issue Type: Improvement
>  Components: rpc
>Reporter: Sailesh Mukil
>Priority: Major
>  Labels: perfomance, rpc
> Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess() which returns the exact 
> payload every time. This takes the same path as any other response by 
> ultimately calling Connection::QueueResponseForCall() which has a few small 
> allocations. These small allocations (and their corresponding deallocations) 
> are called quite frequently (once for every IncomingCall) and end up taking 
> quite some time in the kernel (traversing the free list, spin locks etc.)
> This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20 
> node cluster and found the following:
> The exact % of time spent is hard to quantify from the profiles, but these 
> were the among the top 5 of the slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown 
> source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source 
> file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown 
> source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator::deallocate + 0x4 - 
> new_allocator.h:110
> impalad ! std::_Vector_base std::allocator>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425    'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - 
> connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! 

[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-05-26 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117008#comment-17117008
 ] 

Alexey Serbin commented on KUDU-3131:
-

I cannot reproduce this on x86_64 architecture and I don't have access to 
aarch64 at this point.

I'd try to attach to the hung process with a debugger and see what's going on.  
 [~huangtianhua], did you have a chance to try that?

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode one test sometimes hangs 
> (maybe a deadlock?).  The console output is as follows:
> [==] Running 2 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 2 tests from Priorities/RWMutexTest
> [ RUN  ] Priorities/RWMutexTest.TestDeadlocks/0
> It seems to be OK in debug mode.
> Now only this one test sometimes fails on aarch64.  [~aserbin] [~adar], would 
> you please have a look at this or give us some suggestions?  Thanks very 
> much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely

2020-05-18 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110649#comment-17110649
 ] 

Alexey Serbin commented on KUDU-2453:
-

There is a reproduction scenario for the issue described in this JIRA: 
https://gerrit.cloudera.org/#/c/15912/

> kudu should stop creating tablet infinitely
> ---
>
> Key: KUDU-2453
> URL: https://issues.apache.org/jira/browse/KUDU-2453
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.4.0, 1.7.2
>Reporter: LiFu He
>Priority: Major
>
> I have met this problem again on 2018/10/26. And now the kudu version is 
> 1.7.2.
> -
> We modified the flag 'max_create_tablets_per_ts' (2000) in master.conf, and 
> there was some load on the Kudu cluster. Then someone else created a big 
> table with tens of thousands of tablets from impala-shell (that was a 
> mistake). 
> {code:java}
> CREATE TABLE XXX(
> ...
>PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating the table (picking only one tablet as an example):
> {code:java}
> --Kudu-master log
> ==e884bda6bbd3482f94c07ca0f34f99a4==
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC 
> failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:50247 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 
> 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ==Be replaced by 0b144c00f35d48cca4d4981698faef72==
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Sending 
> DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending 
> DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 
> on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 
> 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for 
> tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: 
> Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 
> already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of 
> e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for 
> TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS 
> b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC 
> failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:59735 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS 
> b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75==
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
> --Kudu-tserver log
> I1024 11:40:52.014571 137358 

[jira] [Created] (KUDU-3124) A safer way to handle CreateTablet requests

2020-05-18 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3124:
---

 Summary: A safer way to handle CreateTablet requests
 Key: KUDU-3124
 URL: https://issues.apache.org/jira/browse/KUDU-3124
 Project: Kudu
  Issue Type: Improvement
  Components: master, tserver
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 
1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0, 1.2.0
Reporter: Alexey Serbin


As of now, the catalog manager (a part of kudu-master) sends 
{{CreateTabletRequest}} RPCs
as soon as they are realized by {{CatalogManager::ProcessPendingAssignments()}}
when processing the list of deferred DDL operations, and at this level there
aren't any restrictions on how many of those might be in flight or sent to
a particular tablet server (NOTE: there is the {{\-\-max_create_tablets_per_ts}} 
flag,
but it works at a higher level and only during the initial creation of a table).

The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't
created within {{\-\-tablet_creation_timeout_ms}} milliseconds, catalog manager
replaces all the tablet replicas, generating a new tablet UUID and sending
corresponding {{CreateTabletRequest}} RPCs to a potentially different set of 
tablet
servers.  Corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas of 
the
stalled-during-creation tablet) are sent separately in an asynchronous way
as well.

There are at least two issues with this approach:

# The {{\-\-max_create_tablets_per_ts}} threshold limits the number of 
concurrent requests hitting one tablet server only during the initial creation 
of a table. However, nothing limits how many requests to create a tablet replica 
might hit a tablet server when adding partitions to an existing table as a 
result of an ALTER TABLE request (see the sketch after this list for one 
possible way to bound those).
# {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of
the corresponding tablet servers, and the catalog manager stops retrying to send 
those after the {{\-\-unresponsive_ts_rpc_timeout_ms}} interval.  This might 
spiral into a situation where requests to create replacement tablet replicas 
pass through and are executed by tablet servers, but the corresponding requests 
to delete tablet replicas cannot get through because of queue overflows, with 
the catalog manager eventually giving up on retrying the latter.  Eventually, 
tablet servers end up with a huge number of tablet replicas created, and they 
crash after running out of memory.  The crashed tablet servers cannot start 
after that because they run out of memory again while trying to bootstrap the 
huge number of tablet replicas.  See 
https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and 
[KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for the 
corresponding issue reported some time ago. 
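For item 1 above, a minimal sketch of one possible throttling approach follows.  
This is an illustration only: the class, the limit, and the call sites are 
hypothetical and are not part of the actual catalog manager code; a real fix 
would have to integrate with the existing retry/backoff machinery:

{noformat}
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch: bound the number of in-flight CreateTablet RPCs per tablet server.
// The catalog manager would call TryAcquire() before sending a CreateTablet
// RPC to a tablet server and Release() once the RPC completes; requests that
// cannot acquire a slot stay in the pending queue and are retried later.
class CreateTabletThrottle {
 public:
  explicit CreateTabletThrottle(int32_t max_in_flight_per_ts)
      : max_in_flight_per_ts_(max_in_flight_per_ts) {}

  // Returns true if a CreateTablet RPC may be sent to the given tablet
  // server right now.
  bool TryAcquire(const std::string& ts_uuid) {
    std::lock_guard<std::mutex> l(lock_);
    int32_t& in_flight = in_flight_[ts_uuid];
    if (in_flight >= max_in_flight_per_ts_) {
      return false;
    }
    ++in_flight;
    return true;
  }

  // Called when the RPC completes (successfully or not), freeing a slot for
  // the next pending CreateTablet request targeting this tablet server.
  void Release(const std::string& ts_uuid) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = in_flight_.find(ts_uuid);
    if (it != in_flight_.end() && it->second > 0) {
      --it->second;
    }
  }

 private:
  const int32_t max_in_flight_per_ts_;
  std::mutex lock_;
  std::unordered_map<std::string, int32_t> in_flight_;
};
{noformat}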



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3000:

Attachment: ksck_remote-test.01.txt.xz

> RemoteKsckTest.TestChecksumSnapshot sometimes fails
> ---
>
> Key: KUDU-3000
> URL: https://issues.apache.org/jira/browse/KUDU-3000
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz
>
>
> The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test 
> sometimes fails with the following error message:
> {noformat}
> W1116 06:46:18.593114  3904 tablet_service.cc:2365] Rejecting scan request 
> for tablet 4ce9988aac744b
> 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized
> src/kudu/tools/ksck_remote-test.cc:407: Failure
> Failed
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105098#comment-17105098
 ] 

Alexey Serbin commented on KUDU-3000:
-

Another failure (probably with a different root cause this time):

{noformat}
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:407:
 Failure
Failed
Bad status: Aborted: 1 errors were detected
{noformat}

The log is attached. [^ksck_remote-test.01.txt.xz] 

> RemoteKsckTest.TestChecksumSnapshot sometimes fails
> ---
>
> Key: KUDU-3000
> URL: https://issues.apache.org/jira/browse/KUDU-3000
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz
>
>
> The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test 
> sometimes fails with the following error message:
> {noformat}
> W1116 06:46:18.593114  3904 tablet_service.cc:2365] Rejecting scan request 
> for tablet 4ce9988aac744b
> 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized
> src/kudu/tools/ksck_remote-test.cc:407: Failure
> Failed
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3120) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3120:
---

 Summary: 
testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) 
sometimes fails with timeout
 Key: KUDU-3120
 URL: https://issues.apache.org/jira/browse/KUDU-3120
 Project: Kudu
  Issue Type: Bug
  Components: test
Reporter: Alexey Serbin
 Attachments: test-output.txt.xz

The subject test sometimes fails due to a timeout:

{noformat}
Time: 56.114
There was 1 failure:
1) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster)
org.junit.runners.model.TestTimedOutException: test timed out after 5 
milliseconds
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.sendRequestToCluster(MiniKuduCluster.java:162)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.start(MiniKuduCluster.java:235)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.access$300(MiniKuduCluster.java:72)
at 
org.apache.kudu.test.cluster.MiniKuduCluster$MiniKuduClusterBuilder.build(MiniKuduCluster.java:697)
at 
org.apache.kudu.test.TestMiniKuduCluster.testHiveMetastoreIntegration(TestMiniKuduCluster.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3119:
---

 Summary: ToolTest.TestFsAddRemoveDataDirEndToEnd reports race 
under TSAN
 Key: KUDU-3119
 URL: https://issues.apache.org/jira/browse/KUDU-3119
 Project: Kudu
  Issue Type: Bug
  Components: CLI, test
Reporter: Alexey Serbin
 Attachments: kudu-tool-test.log.xz

Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
reports races for TSAN builds:

{noformat}
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
 Failure
Failed
Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
process exited with non-ze
ro status 66
Google Test trace:
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
 W0506 17:5
6:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log directory 
(fs_wal_dir) as metad
ata directory
I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
manager: real 0.007s
user 0.005s sys 0.002s
I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
files per process limi
t of 1048576; it is already as high as it can go
I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm with 
capacity 419430
==
WARNING: ThreadSanitizer: data race (pid=4432)
...
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3117:

Description: 
The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with the messages like below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}

The log is attached.

  was:
The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with the messages like below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}


> TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
> ---
>
> Key: KUDU-3117
> URL: https://issues.apache.org/jira/browse/KUDU-3117
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: tablet_server_quiescing-itest.txt.xz
>
>
> The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
> sometimes fails (TSAN builds) with the messages like below:
> {noformat}
> kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
> Failed
>   
> Bad status: Timed out: Unable to find leader of tablet 
> 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
> Error connecting to replica: Timed out: GetConsensusState RPC to 
> 127.0.177.65:42397 timed out after -0.003s (SENT)
> kudu/util/test_util.cc:349: Failure
> Failed
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3117:

Attachment: tablet_server_quiescing-itest.txt.xz

> TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
> ---
>
> Key: KUDU-3117
> URL: https://issues.apache.org/jira/browse/KUDU-3117
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: tablet_server_quiescing-itest.txt.xz
>
>
> The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
> sometimes fails (TSAN builds) with the messages like below:
> {noformat}
> kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
> Failed
>   
> Bad status: Timed out: Unable to find leader of tablet 
> 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
> Error connecting to replica: Timed out: GetConsensusState RPC to 
> 127.0.177.65:42397 timed out after -0.003s (SENT)
> kudu/util/test_util.cc:349: Failure
> Failed
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3117:
---

 Summary: TServerQuiescingITest.TestMajorityQuiescingElectsLeader 
sometimes fails
 Key: KUDU-3117
 URL: https://issues.apache.org/jira/browse/KUDU-3117
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin


The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with the messages like below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3115) Improve scalability of Kudu masters

2020-05-04 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3115:
---

 Summary: Improve scalability of Kudu masters
 Key: KUDU-3115
 URL: https://issues.apache.org/jira/browse/KUDU-3115
 Project: Kudu
  Issue Type: Improvement
Reporter: Alexey Serbin


Currently, multiple masters in a multi-master Kudu cluster are used only for 
high availability & fault tolerance use cases, but not for sharing the load 
among the available master nodes.  For example, Kudu clients detect the current 
leader master upon connecting to the cluster and send all their subsequent 
requests to the leader master, so serving many more clients requires running 
masters on more powerful nodes.  The current design assumes that masters store 
and process requests for metadata only, but that makes sense only up to some 
limit on the rate of incoming client requests.

It would be great to achieve better 'horizontal' scalability for Kudu masters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099242#comment-17099242
 ] 

Alexey Serbin commented on KUDU-3114:
-

Right, it's possible to disable coredumps for Kudu processes by adding 
{{\-\-disable_core_dumps}} even if the limit for core file size is set to 
non-zero.  My point was that enabling/disabling coredumps per {{LOG(FATAL)}} 
instance is not feasible.

Dumping a core file might make sense when troubleshooting an issue: e.g., if 
there is a bug in computing the number of bytes to allocate, or to find out what 
event triggered a request to allocate an unexpectedly high amount of space, etc.  
Probably, we can keep that for DEBUG builds only.

I'm OK with keeping this JIRA item open (so, I'm re-opening it).   Feel free to 
submit a patch to address the issue as needed.

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reopened KUDU-3114:
-

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3114.
-
Fix Version/s: n/a
   Resolution: Information Provided

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099075#comment-17099075
 ] 

Alexey Serbin commented on KUDU-3114:
-

Thank you for reporting the issue.

The way fatal inconsistencies are handled in Kudu doesn't provide a way to 
choose the coredump behavior per call site.  That behavior is controlled at a 
different level: the environment that Kudu processes are run with (check 
{{ulimit -c}}).

As a good operational practice, it's advised to separate the location for core 
files (e.g., a directory on the system partition/volume) from the directories 
where Kudu stores its data and WALs.  Also, consider [enabling mini-dumps in 
Kudu|https://kudu.apache.org/docs/troubleshooting.html#crash_reporting] and 
disabling core files if dumping cores isn't feasible due to space limitations.

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full

2020-04-30 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096864#comment-17096864
 ] 

Alexey Serbin commented on KUDU-3107:
-

I think the problem is that the code doesn't do proper conversion of the 
RPC-level status code into the application status code.  I think the following 
is missing:

{noformat}
if (controller.status().IsRemoteError()) {
  const ErrorStatusPB* err = rpc->error_response();
  CHECK(err && err->has_code() &&
  (err->code() == ErrorStatusPB::ERROR_SERVER_TOO_BUSY ||
   err->code() == ErrorStatusPB::ERROR_UNAVAILABLE));
}
{noformat}
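
For illustration, the test's assertion might then look like the sketch below.  
The {{controller}} object and the {{ErrorStatusPB}} error codes are taken from 
the snippet above; the rest (in particular reading the remote error via 
{{controller.error_response()}}) is an assumption, not the actual rpc-test.cc 
code:

{noformat}
// Sketch only: also tolerate the "service queue is full" outcome that the
// server reports as a remote error with ERROR_SERVER_TOO_BUSY.
const kudu::Status& s = controller.status();
bool server_too_busy = false;
if (s.IsRemoteError()) {
  const kudu::rpc::ErrorStatusPB* err = controller.error_response();
  server_too_busy = err && err->has_code() &&
      (err->code() == kudu::rpc::ErrorStatusPB::ERROR_SERVER_TOO_BUSY ||
       err->code() == kudu::rpc::ErrorStatusPB::ERROR_UNAVAILABLE);
}
CHECK(s.ok() || s.IsAborted() || s.IsServiceUnavailable() || server_too_busy);
{noformat}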

> TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service 
> queue is full
> ---
>
> Key: KUDU-3107
> URL: https://issues.apache.org/jira/browse/KUDU-3107
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: liusheng
>Priority: Major
> Attachments: rpc-test.txt
>
>
> The test TestRpc.TestCancellationMultiThreads sometimes fails on an ARM machine 
> due to the "service queue full" error. Related error message:
> {code:java}
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 318)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 319)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 320)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 321)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 324)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 332)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 334)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 335)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 336)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 337)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 338)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 339)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 340)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 341)
> F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: 
> controller.status().IsAborted() || controller.status().IsServiceUnavailable() 
> || controller.status().ok() Remote error: Service unavailable: PushStrings 
> request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due 
> to backpressure. The service queue is full; it has 100 items.
> *** Check failure stack trace: ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from 
> PID 27583; stack trace: ***
> @ 0x93cf0464 raise at ??:0
> @ 0x93cf18b4 abort at ??:0
> @ 0x942c5fdc google::logging_fail() at ??:0
> @ 0x942c7d40 google::LogMessage::Fail() at ??:0
> @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0
> @ 0x942c7874 google::LogMessage::Flush() at ??:0
> @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0
> @ 0xdcee4b98 
> _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv
>  at ??:0
> @ 0xdcee76bc 
> _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
>  at ??:0
> @ 0xdcee7484 
> _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_
>  at ??:0
> @ 0xdcee8208 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE
>  at ??:0
> @ 0xdcee8168 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv
>  at ??:0
> @ 0xdcee8110 
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv
>  at ??:0
> @ 0x93f22e94 (unknown) at ??:0
> @ 0x93e1e088 start_thread at ??:0
> @ 0x93d8e4ec (unknown) at ??:0
> {code}
> The attachment contains the full test log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle

2020-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Summary: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle  (was: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle 1.65)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle
> 
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3111) IWYU processes freestanding headers

2020-04-24 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3111:
---

 Summary: IWYU processes freestanding headers 
 Key: KUDU-3111
 URL: https://issues.apache.org/jira/browse/KUDU-3111
 Project: Kudu
  Issue Type: Improvement
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.8.0, 1.7.0, 
1.12.0
Reporter: Alexey Serbin


When working off the compilation database, IWYU processes only associated 
headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files.   It would be 
nice to make IWYU process so-called freestanding header files as well.  [This 
thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] 
contains very useful information on the topic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3111) Make IWYU processes freestanding headers

2020-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3111:

Summary: Make IWYU processes freestanding headers  (was: IWYU processes 
freestanding headers )

> Make IWYU processes freestanding headers
> 
>
> Key: KUDU-3111
> URL: https://issues.apache.org/jira/browse/KUDU-3111
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 
> 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> When working off the compilation database, IWYU processes only associated 
> headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files.   It would 
> be nice to make IWYU process so-called freestanding header files as well.  [This 
> thread|https://github.com/include-what-you-use/include-what-you-use/issues/268]
>  contains very useful information on the topic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3007) ARM/aarch64 platform support

2020-04-24 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091778#comment-17091778
 ] 

Alexey Serbin commented on KUDU-3007:
-

Yes, I'm planning to take a closer look this weekend.  Thank you for the 
contribution!

> ARM/aarch64 platform support
> 
>
> Key: KUDU-3007
> URL: https://issues.apache.org/jira/browse/KUDU-3007
> Project: Kudu
>  Issue Type: Improvement
>Reporter: liusheng
>Priority: Critical
>
> As an important alternative to the x86 architecture, the Aarch64 (ARM) 
> architecture is currently the dominant architecture in small devices like 
> phones, IoT devices, security cameras, drones, etc. More and more hardware and 
> cloud vendors are starting to provide ARM resources, such as AWS, Huawei, 
> Packet, Ampere, etc. Usually, ARM servers are low cost and cheaper than x86 
> servers; more and more ARM servers now have performance comparable to x86 
> servers, and are even more efficient in some areas.
> We want to propose adding an Aarch64 CI for Kudu to promote support for Kudu 
> on Aarch64 platforms. We are willing to provide machines to the current 
> CI system and manpower for managing the CI and fixing problems that occur.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables

2020-04-17 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2986.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

> Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables
> --
>
> Key: KUDU-2986
> URL: https://issues.apache.org/jira/browse/KUDU-2986
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, client, master, metrics
>Affects Versions: 1.11.0
>Reporter: YifanZhang
>Assignee: LiFu He
>Priority: Major
> Fix For: 1.12.0
>
>
> When we upgraded the cluster with pre-1.11.0 tables, we got inconsistent 
> values for the 'live_row_count' metric of these tables:
> When visiting masterURL:port/metrics, we got 0 for old tables, and got a 
> positive integer for an old table with a newly added partition, which is the 
> count of rows in the newly added partition.
> When getting table statistics via the `kudu table statistics` CLI tool, we got 0 
> for old tables and for the old table with a new partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Fix Version/s: 1.12.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Description: 
With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.

  was:
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.


> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Status: In Review  (was: In Progress)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Code Review: http://gerrit.cloudera.org:8080/15664

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3106:
---

Assignee: Alexey Serbin

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Summary: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle 1.65  (was: getEndpointChannelBindings() isn't working as expected 
with BouncyCastle 2.65)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 2.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Description: 
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.

  was:
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 2.65 converts the name of the certificate signature 
algorithm to uppercase.


> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65

2020-04-06 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3106:
---

 Summary: getEndpointChannelBindings() isn't working as expected 
with BouncyCastle 2.65
 Key: KUDU-3106
 URL: https://issues.apache.org/jira/browse/KUDU-3106
 Project: Kudu
  Issue Type: Bug
  Components: client, java, security
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 
1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0
Reporter: Alexey Serbin


With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 2.65 converts the name of the certificate signature 
algorithm to uppercase.
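
The root cause is a case-sensitive comparison against the certificate's signature algorithm name, which BouncyCastle 1.65 reports in upper case (e.g. {{SHA256WITHRSA}} instead of {{SHA256withRSA}}). Below is a minimal, hypothetical Java sketch of mapping the signature algorithm to the digest used for RFC 5929 tls-server-end-point channel bindings with a case-insensitive match; it is an illustration only, not the actual SecurityUtil.java code:
{noformat}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.cert.X509Certificate;
import java.util.Locale;

public class ChannelBindingDigest {
  // Pick the digest for tls-server-end-point channel bindings (RFC 5929):
  // MD5 and SHA-1 signatures map to SHA-256; otherwise use the signature's hash.
  static MessageDigest digestForCert(X509Certificate cert) throws NoSuchAlgorithmException {
    // Normalize once so "SHA256withRSA" and "SHA256WITHRSA" are treated alike.
    String sigAlg = cert.getSigAlgName().toUpperCase(Locale.ROOT);
    if (sigAlg.startsWith("MD5") || sigAlg.startsWith("SHA1") || sigAlg.startsWith("SHA-1")) {
      return MessageDigest.getInstance("SHA-256");
    }
    if (sigAlg.startsWith("SHA256")) {
      return MessageDigest.getInstance("SHA-256");
    }
    if (sigAlg.startsWith("SHA384")) {
      return MessageDigest.getInstance("SHA-384");
    }
    if (sigAlg.startsWith("SHA512")) {
      return MessageDigest.getInstance("SHA-512");
    }
    throw new NoSuchAlgorithmException("cert uses unknown signature algorithm: " + sigAlg);
  }
}
{noformat}
The key point is normalizing the provider-reported name before matching, so the behavior does not depend on which BouncyCastle version happens to be on the classpath.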



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP

2020-04-06 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076583#comment-17076583
 ] 

Alexey Serbin commented on KUDU-2573:
-

With this [changelist|https://gerrit.cloudera.org/#/c/15456/], the necessary 
piece of the documentation will be in the 1.12 release notes.
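
For context, here is a hypothetical chrony.conf fragment of the kind such documentation typically covers; the server hostname is a placeholder, the {{rtcsync}} note reflects the commonly cited requirement for Kudu's built-in clock check, and the release notes remain the authoritative reference:
{noformat}
# Upstream time source (placeholder hostname).
server ntp.example.com iburst
# Keep the kernel's clock-synchronized status maintained; Kudu's clock check
# relies on the kernel reporting a synchronized clock.
rtcsync
# Step the clock on large offsets during the first few updates at startup.
makestep 1.0 3
{noformat}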

> Fully support Chrony in place of NTP
> 
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
>  Issue Type: New Feature
>  Components: clock, master, tserver
>Reporter: Grant Henke
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
>
> This is to track fully supporting Chrony in place of NTP. Given that Chrony is the 
> default in RHEL 7+, running Kudu with Chrony is likely to be more common. 
> The work should entail:
>  * identifying and fixing or documenting any differences or gaps
>  * removing the experimental warnings from the documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2573) Fully support Chrony in place of NTP

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2573.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

> Fully support Chrony in place of NTP
> 
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
>  Issue Type: New Feature
>  Components: clock, master, tserver
>Reporter: Grant Henke
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
> Fix For: 1.12.0
>
>
> This is to track fully supporting Chrony in place of NTP. Given that Chrony is the 
> default in RHEL 7+, running Kudu with Chrony is likely to be more common. 
> The work should entail:
>  * identifying and fixing or documenting any differences or gaps
>  * removing the experimental warnings from the documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2798:

Affects Version/s: 1.10.1
   1.11.0
   1.11.1

> Fix logging on deleted TSK entries
> --
>
> Key: KUDU-2798
> URL: https://issues.apache.org/jira/browse/KUDU-2798
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: newbie
>
> It seems the identifiers of the deleted TSK entries in the log lines below 
> need decoding:
> {noformat}
> I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T 
>  P f05d759af7824df9aafedcc106674182: 
> Generated new TSK 2
> I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T 
>  P f05d759af7824df9aafedcc106674182: Deleted 
> TSKs: �, �
> {noformat}
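
This is not a patch for catalog_manager.cc, just a generic C++ illustration of the difference between logging raw identifier bytes (which renders as the garbled characters above) and logging a decoded, human-readable form, under the purely hypothetical assumption that each identifier wraps a big-endian sequence number:
{noformat}
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Decode each raw identifier into an integer and join the results with commas,
// instead of printing the raw (non-printable) bytes directly.
std::string FormatTskIds(const std::vector<std::string>& raw_ids) {
  std::string out;
  for (const auto& raw : raw_ids) {
    uint64_t seq = 0;
    for (unsigned char c : raw) {
      seq = (seq << 8) | c;  // interpret the bytes as a big-endian number
    }
    if (!out.empty()) out += ", ";
    out += std::to_string(seq);
  }
  return out;
}

int main() {
  // Two identifiers whose raw bytes are not printable, mimicking the
  // "Deleted TSKs: ..." output from the report above.
  std::vector<std::string> ids = {std::string("\x00\x01", 2), std::string("\x00\x02", 2)};
  std::cout << "Deleted TSKs: " << FormatTskIds(ids) << std::endl;  // prints "Deleted TSKs: 1, 2"
  return 0;
}
{noformat}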



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

