[
https://issues.apache.org/jira/browse/HIVE-21676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831806#comment-16831806
]
Sergey Shelukhin commented on HIVE-21676:
-----------------------------------------
Lol, no, it's supposed to be an HBase ticket
> use a system table as an alternative proc store
> -----------------------------------------------
>
> Key: HIVE-21676
> URL: https://issues.apache.org/jira/browse/HIVE-21676
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> We keep hitting these issues:
> {noformat}
> 2019-04-30 23:41:52,164 INFO [master/master:17000:becomeActiveMaster]
> procedure2.ProcedureExecutor: Starting 16 core workers (bigger of cpus/4 or
> 16) with max (burst) worker count=160
> 2019-04-30 23:41:52,171 INFO [master/master:17000:becomeActiveMaster]
> util.FSHDFSUtils: Recover lease on dfs file
> .../MasterProcWALs/pv2-00000000000000000481.log
> 2019-04-30 23:41:52,176 INFO [master/master:17000:becomeActiveMaster]
> util.FSHDFSUtils: Recovered lease, attempt=0 on
> file=.../MasterProcWALs/pv2-00000000000000000481.log after 5ms
> 2019-04-30 23:41:52,288 INFO [master/master:17000:becomeActiveMaster]
> util.FSHDFSUtils: Recover lease on dfs file
> .../MasterProcWALs/pv2-00000000000000000482.log
> 2019-04-30 23:41:52,289 INFO [master/master:17000:becomeActiveMaster]
> util.FSHDFSUtils: Recovered lease, attempt=0 on
> file=.../MasterProcWALs/pv2-00000000000000000482.log after 1ms
> 2019-04-30 23:41:52,373 INFO [master/master:17000:becomeActiveMaster]
> wal.WALProcedureStore: Rolled new Procedure Store WAL, id=483
> 2019-04-30 23:41:52,375 INFO [master/master:17000:becomeActiveMaster]
> procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 206msec
> 2019-04-30 23:41:52,782 INFO [master/master:17000:becomeActiveMaster]
> wal.ProcedureWALFormatReader: Read 1556 entries in
> .../MasterProcWALs/pv2-00000000000000000482.log
> 2019-04-30 23:41:55,370 INFO [master/master:17000:becomeActiveMaster]
> wal.ProcedureWALFormatReader: Read 28113 entries in
> .../MasterProcWALs/pv2-00000000000000000481.log
> 2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster]
> wal.WALProcedureTree: Missing stack id 166, max stack id is 181, root
> procedure is Procedure(pid=289380, ppid=-1,
> class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> 2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster]
> wal.WALProcedureTree: Missing stack id 178, max stack id is 181, root
> procedure is Procedure(pid=289380, ppid=-1,
> class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> 2019-04-30 23:41:55,389 ERROR [master/master:17000:becomeActiveMaster]
> wal.WALProcedureTree: Missing stack id 359, max stack id is 360, root
> procedure is Procedure(pid=285640, ppid=-1,
> class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> {noformat}
> After this, the affected procedures are lost and the cluster is stuck
> permanently. The log shows no errors writing these files and no issues
> reading them from HDFS, so this is purely a data loss issue in the store's
> structure.
> I was thinking about debugging it, but on second thought, what we are trying
> to store is just some PB blob, by key.
> Coincidentally, we have an "HBase" facility that we already deploy, that does
> just that... and it even has a WAL implementation. I don't know why we cannot
> use it for procedure state and have to invent another complex implementation
> of a KV store inside a KV store.
> In all/most cases, we don't even support rollback and use the latest state,
> but if we need multiple versions, this HBase product even supports that!
> I think we should add an hbase:proc table that would be maintained similarly
> to meta. The latter part, especially given the existing code for meta, should
> be much simpler than a separate store implementation.
> This should be pluggable and optional via the ProcStore interface (made more
> abstract as relevant: update state, scan state, get).
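
To make the "update state, scan state, get" shape concrete, here is a rough sketch of what such a pluggable store contract could look like. Everything in it is an assumption for illustration: the names ProcStoreSketch, ProcStore and InMemoryProcStore, the method signatures, and the extra delete call are hypothetical and are not the existing HBase interfaces. The in-memory map only stands in for a table keyed by procedure id; a real hbase:proc backend would go through the regular table client instead.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class ProcStoreSketch {

    /** Hypothetical minimal contract: update state, get, scan state. */
    interface ProcStore {
        /** Store the latest serialized state for a procedure, replacing any older one. */
        void update(long procId, byte[] serializedState);

        /** Fetch the latest state for one procedure, if present. */
        Optional<byte[]> get(long procId);

        /** Visit every stored procedure in ascending id order (e.g. on master startup). */
        void scan(BiConsumer<Long, byte[]> visitor);

        /** Assumed addition: drop a finished procedure. Not named in the proposal. */
        void delete(long procId);
    }

    /** In-memory stand-in for a table-backed store: one row per procedure id. */
    static final class InMemoryProcStore implements ProcStore {
        private final NavigableMap<Long, byte[]> rows = new TreeMap<>();

        @Override
        public synchronized void update(long procId, byte[] serializedState) {
            // Copy defensively so later caller mutations do not change the stored row.
            rows.put(procId, serializedState.clone());
        }

        @Override
        public synchronized Optional<byte[]> get(long procId) {
            byte[] state = rows.get(procId);
            return state == null ? Optional.empty() : Optional.of(state.clone());
        }

        @Override
        public synchronized void scan(BiConsumer<Long, byte[]> visitor) {
            for (Map.Entry<Long, byte[]> row : rows.entrySet()) {
                visitor.accept(row.getKey(), row.getValue().clone());
            }
        }

        @Override
        public synchronized void delete(long procId) {
            rows.remove(procId);
        }
    }

    public static void main(String[] args) {
        ProcStore store = new InMemoryProcStore();
        store.update(289380L, new byte[] {1});
        store.update(285640L, new byte[] {2});
        store.update(289380L, new byte[] {3}); // only the latest state is kept
        store.scan((procId, state) ->
            System.out.println("pid=" + procId + " bytes=" + state.length));
    }
}
```

Keeping only the latest blob per key matches the "no rollback, use the latest state" case described above; if multiple versions were needed, a table-backed implementation could lean on cell versions rather than changing this contract.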