Sergey Shelukhin created HIVE-21676:
---------------------------------------

Summary: use a system table as an alternative proc store
Key: HIVE-21676
URL: https://issues.apache.org/jira/browse/HIVE-21676
Project: Hive
Issue Type: Bug
Reporter: Sergey Shelukhin

We keep hitting these issues:
{noformat}
2019-04-30 23:41:52,164 INFO [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Starting 16 core workers (bigger of cpus/4 or 16) with max (burst) worker count=160
2019-04-30 23:41:52,171 INFO [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000481.log
2019-04-30 23:41:52,176 INFO [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000481.log after 5ms
2019-04-30 23:41:52,288 INFO [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000482.log
2019-04-30 23:41:52,289 INFO [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000482.log after 1ms
2019-04-30 23:41:52,373 INFO [master/master:17000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=483
2019-04-30 23:41:52,375 INFO [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 206msec
2019-04-30 23:41:52,782 INFO [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 1556 entries in .../MasterProcWALs/pv2-00000000000000000482.log
2019-04-30 23:41:55,370 INFO [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 28113 entries in .../MasterProcWALs/pv2-00000000000000000481.log
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 166, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 178, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,389 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 359, max stack id is 360, root procedure is Procedure(pid=285640, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
{noformat}
After this, the affected procedures are lost and the cluster is stuck permanently. There were no errors writing these files in the log, and no issues reading them from HDFS, so it's purely a data loss issue in the structure.

I was thinking about debugging it, but on second thought, what we are trying to do is store PB state by key. Coincidentally, we have an "HBase" facility that we already deploy, that does just that... and it even has a WAL implementation. I don't know why we cannot use it for procedure state and have to invent another complex implementation of a KV store inside a KV store. In all/most cases, we don't even support rollback and use the latest state, but if we need multiple versions, this HBase product even supports that!

I think we should add a hbase:proc table that would be maintained similarly to meta. The latter part, especially given the existing code for meta, should be much simpler than a separate store implementation.
This should be pluggable and optional via the ProcStore interface (made more abstract as relevant: update state, scan state, get).
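
To make the "update state, scan state, get" shape concrete, here is a rough sketch of what such an abstraction might look like. It is not existing HBase code: the `ProcStore` signatures, the `InMemoryProcStore` class, and the demo are all hypothetical, and a sorted in-memory map stands in for the proposed hbase:proc table (row key = procedure id, value = serialized state, latest write wins). The pids are borrowed from the log above and the payload strings are placeholders for the serialized PB state.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical minimal store abstraction: update state, get, scan state. */
interface ProcStore {
  /** Insert or overwrite the serialized state for a procedure id. */
  void update(long procId, byte[] serializedState);

  /** Return the latest serialized state for a procedure id, if any. */
  Optional<byte[]> get(long procId);

  /** Return all stored procedures ordered by procedure id (e.g. on master startup). */
  List<Map.Entry<Long, byte[]>> scan();
}

/** In-memory stand-in for a table-backed store; assumption: one row per procedure id. */
final class InMemoryProcStore implements ProcStore {
  private final ConcurrentSkipListMap<Long, byte[]> rows = new ConcurrentSkipListMap<>();

  @Override
  public void update(long procId, byte[] serializedState) {
    // Defensive copy so later caller-side mutation cannot change the stored row.
    rows.put(procId, serializedState.clone());
  }

  @Override
  public Optional<byte[]> get(long procId) {
    byte[] value = rows.get(procId);
    return value == null ? Optional.empty() : Optional.of(value.clone());
  }

  @Override
  public List<Map.Entry<Long, byte[]>> scan() {
    List<Map.Entry<Long, byte[]>> out = new ArrayList<>();
    for (Map.Entry<Long, byte[]> e : rows.entrySet()) {
      out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue().clone()));
    }
    return out;
  }
}

final class ProcStoreDemo {
  public static void main(String[] args) {
    ProcStore store = new InMemoryProcStore();
    // Placeholder payloads; a real store would hold the serialized procedure state.
    store.update(289380L, "RUNNABLE".getBytes(StandardCharsets.UTF_8));
    store.update(285640L, "RUNNABLE".getBytes(StandardCharsets.UTF_8));
    store.update(289380L, "SUCCESS".getBytes(StandardCharsets.UTF_8)); // overwrites earlier state

    for (Map.Entry<Long, byte[]> row : store.scan()) {
      System.out.println(row.getKey() + " -> "
          + new String(row.getValue(), StandardCharsets.UTF_8));
    }
  }
}
```

A table-backed implementation would presumably map `update` to a put, `get` to a point read and `scan` to a full table scan, leaving durability to the table's own WAL rather than a custom log format.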