[ 
https://issues.apache.org/jira/browse/KUDU-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Berkeley resolved KUDU-531.
--------------------------------
       Resolution: Won't Do
    Fix Version/s: n/a

I spent a good bit of time working with ALICE and Kudu, trying to get ALICE to 
develop a logical characterization of Kudu's write path that it could use to 
test our durability semantics.

Here are the significant issues I ran into:
1. Some syscalls aren't supported. For example, ALICE doesn't support analyzing 
the ioctl(2) call Kudu uses on XFS to do hole punching.
2. Some syscalls are buggy. fallocate(2) isn't correctly handled. It's easy to 
work around but unclear if the obvious fixes are sufficient for the behavior to 
be correct.
3. It doesn't handle memory-mapped files. This is a small issue since AFAIK the 
only memory mapping we do is with log index files and they don't have 
durability requirements.
4. It doesn't track sync operations on directories and it's not clear why. 
These are a critical part of how Kudu makes things durable, but ALICE by 
default basically just skips them. Hacking them in is easy enough but it's not 
clear that ALICE will handle it correctly.
5. The project is just generally not mature or maintained. There have only been 
a few commits ever, and it appears the project has been mostly inactive since the 
paper was published. Given it's a complex mix of long Python scripts without any 
tests, it's hard to modify while remaining confident that it's working as 
desired.
6. I don't think it can handle the extra complications of using multiple 
devices. This is particularly troublesome for Kudu since we recommend putting 
the WAL on a separate device from data directories.
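
To make item 4 concrete, here is a minimal sketch of the generic POSIX pattern 
where a directory sync matters. It is not Kudu's actual code: the function name 
durable_write and the ".tmp" suffix are hypothetical, and it assumes a POSIX 
system where a directory can be opened read-only and passed to fsync(2). A tool 
that ignores step 3 cannot tell whether the rename is durable after a crash.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, syncing the file and its parent directory.

    Illustrative sketch only (hypothetical helper, not taken from Kudu).
    """
    parent = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file naming

    # 1. Write the new contents to a temp file and fsync the file itself.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # os.write may write fewer bytes
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # 2. Atomically move the temp file into place.
    os.replace(tmp_path, path)

    # 3. fsync the parent directory so the new directory entry is durable.
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling durable_write("/some/dir/meta", b"...") twice leaves only the final 
contents in place and no leftover temp file. The sketch assumes the temp file 
and the target live in the same directory on the same device.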

With enough patching and testing, I think ALICE would be a good tool for 
checking Kudu's fs update protocols (e.g. able to detect problems like 
KUDU-2260), but the tool is not nearly there, and I judge it's not worth the 
effort to get it there at this time.

We could take this up in the future, and we should be on the lookout for other 
frameworks that do something similar and that may be better tested and 
maintained.

> Run ALICE on Kudu
> -----------------
>
>                 Key: KUDU-531
>                 URL: https://issues.apache.org/jira/browse/KUDU-531
>             Project: Kudu
>          Issue Type: Task
>          Components: test
>    Affects Versions: Backlog
>            Reporter: Todd Lipcon
>            Assignee: Will Berkeley
>            Priority: Major
>             Fix For: n/a
>
>
> http://research.cs.wisc.edu/adsl/Software/alice/ is a cool tool which can 
> test correctness of recovery protocols under various disk faults. We should 
> run this to verify our ordering of operations doesn't result in any possible 
> data loss scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
