[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429019#comment-16429019 ]
Andrew Wong edited comment on KUDU-2359 at 4/7/18 6:31 AM: ----------------------------------------------------------- Clarifying this a bit based on some offline discussion with Adar, it's less that the file vanished; rather, our current code couldn't read anything from the instance file, and thus returned a "file not found" error. We took a look at `ls` via strace, we found that an EIO was triggered in getdents(). A snippet of the strace here: {quote}{{ioctl(1</dev/pts/0>, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, \{B38400 opost isig icanon echo ...}) = 0}} {{ioctl(1</dev/pts/0>, TIOCGWINSZ, \{ws_row=88, ws_col=357, ws_xpixel=0, ws_ypixel=0}) = 0}} {{stat("/data/6", \{st_mode=S_IFDIR|S_ISVTX|0777, st_size=2048, ...}) = 0}} {{open("/data/6", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3}} {{fcntl(3</data/6>, F_GETFD) = 0x1 (flags FD_CLOEXEC)}} {{getdents(3</data/6>, 0x8ecc90, 32768) = -1 EIO (Input/output error)}} {{open("/usr/share/locale/locale.alias", O_RDONLY) = 4}} {{fstat(4</usr/share/locale/locale.alias>, \{st_mode=S_IFREG|0644, st_size=2512, ...}) = 0}} {quote} This might help explain why Kudu wasn't getting hit with an EIO – presumably, the functions we call are hiding some EIO farther down the call stack. We discussed a few options, considering potentially having more stringent checking around a mount point for failures (e.g. snooping around the file system for more info on failures if we run into an error at the file level), but settled on the point that, at least for start up, treating missing instance files as failed instance files would have the desired behavior. The case for update_dirs is trickier, for the reasons mentioned above. One implementation we considered was to perhaps treat _all_ instances that returned errors upon loading as "missing" when running `kudu fs update_dirs`. As long as we don't do anything silly like prematurely overwrite files before knowing that the entire operation has completed, we _should_ be able to get away with this, since the update will eventually fail at some point throughout the run of the tool. What we lose out on is, rather than short-circuiting if we see a disk failure, the update tool will attempt to do stuff (read, rewrite on other drives, etc.) because we're not sure whether we're "failed" or "missing" or whatever. We could have some heuristics like, "If we notice a failed instance, definitely do not try to update, but if we see a missing disk, try to update and if we can't because the disk has actually failed, revert everything" to make the semantics better, but for now I'll see how well this works. was (Author: andrew.wong): Clarifying this a bit based on some offline discussion with Adar, it's less that the file vanished; rather, our current code couldn't read anything from the instance file, and thus returned a "file not found" error. Looking at the file system with strace, we found that the EIO was triggered in getdents(). A snippet of the strace here: {quote}{{ioctl(1</dev/pts/0>, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, \{B38400 opost isig icanon echo ...}) = 0}} {{ioctl(1</dev/pts/0>, TIOCGWINSZ, \{ws_row=88, ws_col=357, ws_xpixel=0, ws_ypixel=0}) = 0}} {{stat("/data/6", \{st_mode=S_IFDIR|S_ISVTX|0777, st_size=2048, ...}) = 0}} {{open("/data/6", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3}} {{fcntl(3</data/6>, F_GETFD) = 0x1 (flags FD_CLOEXEC)}} {{getdents(3</data/6>, 0x8ecc90, 32768) = -1 EIO (Input/output error)}} {{open("/usr/share/locale/locale.alias", O_RDONLY) = 4}} {{fstat(4</usr/share/locale/locale.alias>, \{st_mode=S_IFREG|0644, st_size=2512, ...}) = 0}}{quote} We discussed a few options, considering potentially having more stringent checking around a mount point for failures (snooping around the file system for more info on failures), but settled on the point that, at least for start up, treating missing instance files as failed instance files would have the desired behavior. The case for update_dirs is trickier, for the reasons mentioned above. One implementation we considered was to perhaps treat _all_ instances that returned errors upon loading as missing when running `kudu fs update_dirs`. As long as we don't do anything silly like prematurely overwrite files before knowing that the entire operation has completed, we _should_ be able to get away with this, since presumably the update will eventually fail at some point throughout the run of the tool. What we lose out on is, rather than short-circuiting if we see a disk failure, the update tool will attempt to do stuff (read, rewrite on other drives, etc.) because we're not sure whether we're "failed" or "missing" or whatever. We could have some heuristics like, "If we notice a failed instance, definitely do not try to update, but if we see a missing disk, try to update and if we can't because the disk has actually failed, revert everything" to make the semantics better, but for now I'll see how well this works. > tserver should allow starting with a small number of missing data dirs > ---------------------------------------------------------------------- > > Key: KUDU-2359 > URL: https://issues.apache.org/jira/browse/KUDU-2359 > Project: Kudu > Issue Type: Improvement > Components: fs, tserver > Reporter: Todd Lipcon > Assignee: Andrew Wong > Priority: Major > > Often when a disk fails, its mount point will not come back up when the > server is restarted. Currently, Kudu will respond to this by failing to > restart with an error like: > F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() > Bad status: Already present: FS layout already exists; not overwriting > existing layout. See > https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: > unable to create file system roots: FSManager roots already exist: > /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal > However, this defeats some of the advantages of the "allow single disk > failure" work. One could use the update_data_dirs tool to remove the missing > disk, but you'd also need to persistently change the configuration of the > daemon, which is hard to do with a consistent configuration management. -- This message was sent by Atlassian JIRA (v7.6.3#76005)