[ 
https://issues.apache.org/jira/browse/KUDU-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-3151:
------------------------------
    Summary: segfault when repairing log block container with a missing LBM 
container data file  (was: segfault when starting repairing log block container 
with a missing LBM container data file)

> segfault when repairing log block container with a missing LBM container data 
> file
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-3151
>                 URL: https://issues.apache.org/jira/browse/KUDU-3151
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>            Reporter: Andrew Wong
>            Priority: Critical
>         Attachments: metadump.txt
>
>
> We upgraded a cluster from 1.7 to 1.12 and saw the following segfault on one 
> node:
> {code:java}
> *** SIGSEGV (@0x20f2008) received by PID 35899 (TID 0x7ff7e40cc700) from PID 
> 34545672; stack trace: ***
>     @     0x7ff7f2a395d0 (unknown)
>     @           0x9fe02e std::_Sp_counted_base<>::_M_release()
>     @          0x2049f77 kudu::fs::LogBlockManager::Repair()
>     @          0x204ae45 kudu::fs::LogBlockManager::RepairTask()
>     @          0x228e67e kudu::ThreadPool::DispatchThread()
>     @          0x228778f kudu::Thread::SuperviseThread()
>     @     0x7ff7f2a31dd5 start_thread
>     @     0x7ff7f0d0902d __clone
> {code}
> When running {{kudu fs check}} we saw the following logs:
> {code:java}
> I0617 09:17:37.681373 147811 fs_manager.cc:433] Time spent opening block 
> manager: real 10.871s        user 0.215s     sys 0.162s
> Not found: Could not open container 74e7b95f8ccb4c7b98e52dc48049e967: 
> /data/5/kudu/tablet/data/data/74e7b95f8ccb4c7b98e52dc48049e967.data: No such 
> file or directory (error 2)
> {code}
> and upon inspecting the files, we found that 
> 74e7b95f8ccb4c7b98e52dc48049e967.data was indeed missing, while the metadata 
> file 74e7b95f8ccb4c7b98e52dc48049e967.metadata was present and non-empty (more 
> creates than deletes, see attached).
> We were able to delete the metadata file, and I don't think we saw any failed 
> tablets upon doing so (failures could surface if a tablet were unable to find 
> some necessary blocks at startup, e.g. PK blocks when reading min/max keys).
> The metadata may be left over from an LBM compaction, but the exact cause 
> isn't clear so far. It's also unclear whether the data file went missing 
> before or after the upgrade, as we didn't run a {{kudu fs check}} before 
> upgrading.
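
The mismatch described above (a container's .metadata file present while its .data file is missing) could be spotted ahead of an upgrade with a simple directory scan. The following is only a minimal sketch, not part of Kudu: the function name find_unpaired_containers is hypothetical, and it assumes the on-disk naming seen in the log output above, where each container is a pair of <id>.data and <id>.metadata files in the same directory. It does not inspect file contents and is no substitute for {{kudu fs check}}.

```python
from pathlib import Path


def find_unpaired_containers(data_dir):
    """Return (metadata_only, data_only) container IDs found in data_dir.

    Hypothetical helper: assumes each container is stored as a pair of
    files named <id>.metadata and <id>.data directly inside data_dir.
    """
    metadata_ids = set()
    data_ids = set()
    for entry in Path(data_dir).iterdir():
        # Skip subdirectories and anything that is not a regular file.
        if not entry.is_file():
            continue
        if entry.suffix == ".metadata":
            metadata_ids.add(entry.stem)
        elif entry.suffix == ".data":
            data_ids.add(entry.stem)
    # IDs with only one half of the pair are the suspicious ones.
    metadata_only = sorted(metadata_ids - data_ids)
    data_only = sorted(data_ids - metadata_ids)
    return metadata_only, data_only
```

Calling find_unpaired_containers on a data directory such as the one named in the log above would, under these assumptions, list 74e7b95f8ccb4c7b98e52dc48049e967 in the first returned list if its .data file were missing.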



--
This message was sent by Atlassian Jira
(v8.3.4#803005)