[ 
https://issues.apache.org/jira/browse/KUDU-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887615#comment-16887615
 ] 

Adar Dembo commented on KUDU-2260:
----------------------------------

The Google+ shutdown means that the link Mike provided is now broken.

However, I think we saw this in the wild. Here's a very interesting MRS flush 
failure, on a tserver running Kudu 1.7.0:
{noformat}
I0716 02:23:42.355777 22937 tablet.cc:1153] T f4de49e24eb6420bb41a2391921d341d 
P 71430a6bb9b74e09b9767dacc6598102: Flush: entering stage 1 (old memrowset 
already frozen for inserts)
I0716 02:23:42.355800 22937 compaction.cc:914] Selected 1 rowsets to compact:
I0716 02:23:42.355805 22937 compaction.cc:917] memrowset(current size on disk: 
~0 bytes)
I0716 02:23:42.355813 22937 tablet.cc:1155] T f4de49e24eb6420bb41a2391921d341d 
P 71430a6bb9b74e09b9767dacc6598102: Memstore in-memory size: 509719 bytes
I0716 02:23:42.355821 22937 tablet.cc:1444] T f4de49e24eb6420bb41a2391921d341d 
P 71430a6bb9b74e09b9767dacc6598102: Flush: entering phase 1 (flushing 
snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6403105213255942144 
or (T in {6403105213255942144})}]
I0716 02:23:42.423743 22937 multi_column_writer.cc:98] Opened CFile writers for 
52 column(s)
W0716 02:23:42.695962 22937 log_block_manager.cc:1151] Container 
/data01/kudu/tserver/data/data/539eeb9a0c4b4c87b9cb2ed727f09a19 being marked 
read-only: IO error: Failed to Sync() file: 
/data01/kudu/tserver/data/data/539eeb9a0c4b4c87b9cb2ed727f09a19.metadata: 
Cannot allocate memory (error 12)
W0716 02:23:42.697571 22937 log_block_manager.cc:1370] Failed to abort block 
0000000002936068: IO error: container 
/data01/kudu/tserver/data/data/539eeb9a0c4b4c87b9cb2ed727f09a19 is read-only: 
Failed to Sync() file: 
/data01/kudu/tserver/data/data/539eeb9a0c4b4c87b9cb2ed727f09a19.metadata: 
Cannot allocate memory (error 12)
W0716 02:23:42.716284 22937 tablet_replica_mm_ops.cc:144] T 
f4de49e24eb6420bb41a2391921d341d P 71430a6bb9b74e09b9767dacc6598102: failed to 
flush MRS: IO error: Failed to finish DRS writer: Failed to Sync() file: 
/data01/kudu/tserver/data/data/539eeb9a0c4b4c87b9cb2ed727f09a19.metadata: 
Cannot allocate memory (error 12)
F0716 02:23:42.716315 22937 tablet_replica_mm_ops.cc:145] Check failed: 
tablet->HasBeenStopped() FlushMRS failure is only allowed if the tablet is 
stopped first
{noformat}

Looks like fdatasync() returned ENOMEM. After that, the container was corrupted 
with trailing null bytes, though it looks like parts of a record header are 
also in there:
{noformat}
F0716 02:23:59.869529 103240 tablet_server_main.cc:80] Check failed: _s.ok() 
Bad status: Corruption: Failed to load FS layout: Could not process records in 
container /data01/kudu/tserver/data/data/2a1552dd97d645689ef8c39a4f027707: Data 
length checksum does not match: Incorrect checksum in file 
/data01/kudu/tserver/data/data/2a1552dd97d645689ef8c39a4f027707.metadata at 
offset 425981: Checksum does not match. Expected: 4647048. Actual: 1699145864

$ hexdump 
/data01/kudu/tserver/data/data/2a1552dd97d645689ef8c39a4f027707.metadata:
...
0067ff0 fee3 e3ab d202 a931 1629 0000 8800 46e8
0068000 0000 0000 0000 0000 0000 0000 0000 0000
*
0068307

{noformat}
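
The numbers in the error message and the hexdump above can be cross-checked
with a little arithmetic. The sketch below rests on two assumptions that are
not stated in the logs: that the dump is default `hexdump` output (16-bit
words from a little-endian host), and that each metadata record starts with a
4-byte little-endian length followed by a 4-byte checksum of that length. All
names in it are made up for illustration.

```python
import struct

# Last non-zero line of the hexdump quoted above. Assumed to be default
# `hexdump` output: 16-bit words read on a little-endian host.
DUMP_LINE = "0067ff0 fee3 e3ab d202 a931 1629 0000 8800 46e8"
CHECKSUM_OFFSET = 425981        # "at offset 425981" in the error message
EXPECTED = 4647048              # "Expected" in the error message
ACTUAL = 1699145864             # "Actual" in the error message


def hexdump_line_to_bytes(line):
    """Return (start_offset, raw_bytes) for one default-format hexdump line."""
    fields = line.split()
    start = int(fields[0], 16)
    # Each field is a 16-bit word printed high byte first; re-packing it as
    # little-endian recovers the on-disk byte order.
    raw = b"".join(struct.pack("<H", int(word, 16)) for word in fields[1:])
    return start, raw


start, raw = hexdump_line_to_bytes(DUMP_LINE)
# The dump shows only zero bytes from 0x68000 onward, so pad with zeros.
raw += b"\x00" * 16

rel = CHECKSUM_OFFSET - start
# Assumed layout: 4-byte length immediately before the 4-byte checksum.
(length,) = struct.unpack_from("<I", raw, rel - 4)
(stored,) = struct.unpack_from("<I", raw, rel)

print(f"bytes at offset {CHECKSUM_OFFSET}: {raw[rel:rel + 4].hex()}")
print(f"length field: {length}")
print(f"stored checksum: {stored} (0x{stored:08x})")
print(f"actual checksum: {ACTUAL} (0x{ACTUAL:08x})")
print(f"zeros begin at a 4096-byte boundary: {(start + 16) % 4096 == 0}")
```

Under those assumptions, the four bytes at offset 425981 are 88 e8 46 00,
which decodes to 4647048, exactly the "Expected" value in the error. The
"Actual" value is 0x6546e888, so the first three checksum bytes match and only
the last one is missing: it falls at 0x68000, where the zeros begin. That fits
a record header that was cut off partway through.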



> Log block manager should handle null bytes in metadata on crash
> ---------------------------------------------------------------
>
>                 Key: KUDU-2260
>                 URL: https://issues.apache.org/jira/browse/KUDU-2260
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>            Reporter: Mike Percy
>            Assignee: Will Berkeley
>            Priority: Major
>             Fix For: 1.8.0
>
>
> The log block manager currently may leave null bytes at the end of the 
> metadata log file if there is a system crash in the middle of a write. The 
> log block manager should detect null bytes at the end of a metadata entry on 
> startup and potentially truncate the entry or close the container.
> Currently, it prints an error along the following lines:
> {code}
> F0111 09:30:27.327011 28843 tablet_server_main.cc:64] Check failed: _s.ok() 
> Bad status: Corruption: Failed to load FS layout: Could not read records from 
> container /data/3/kudu/data/f70391c7c6084e08bbae7448518e0b5e: Data length 
> checksum does not match: Incorrect checksum in file 
> /data/3/kudu/data/f70391c7c6084e08bbae7448518e0b5e.metadata at offset 372533: 
> Checksum does not match. Expected: 0. Actual: 1323915147
> {code}
> At the time of writing, the workaround for this issue is to truncate the 
> affected file at the start of the incomplete entry in the file. While this 
> may leave orphaned blocks, this should be safe because if the metadata entry 
> was never successfully written then it should not have been considered 
> durable, either.
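
For reference, the truncation workaround described in the issue could be
sketched roughly as follows. This is not a Kudu tool: the function names and
the ".bak" suffix are hypothetical, and the caller still has to work out the
offset at which the incomplete entry starts (for example from the startup
error message) with the tablet server stopped.

```python
import os
import shutil


def trailing_zero_run_start(path):
    """Return the offset where the file's trailing run of zero bytes begins.

    Returns the file size if the file does not end in a zero byte. The whole
    file is read into memory, which is fine for small metadata files.
    """
    with open(path, "rb") as f:
        data = f.read()
    return len(data.rstrip(b"\x00"))


def truncate_with_backup(path, offset):
    """Copy `path` to `path + ".bak"`, then truncate `path` to `offset` bytes.

    Returns the path of the backup copy.
    """
    size = os.path.getsize(path)
    if not 0 <= offset <= size:
        raise ValueError(f"offset {offset} is outside a file of {size} bytes")
    backup = path + ".bak"  # hypothetical naming convention
    shutil.copy2(path, backup)
    os.truncate(path, offset)
    return backup
```

Note that the start of the trailing zero run is only an upper bound on where
to truncate. If part of an entry's header made it to disk before the zeros,
the incomplete entry starts earlier than the first zero byte, so
trailing_zero_run_start() is a diagnostic aid rather than the truncation
point itself.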



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
