Thanks a lot for sharing.

One of the problems I am facing is not having enough real data. I can create simulated data, but my algorithm overfits to it. The second problem is that I am not sure which factors (called features in ML terms) are useful for pattern creation. Some of the factors I could think of were:

1. Memory used
2. CPU
3. Shared memory
4. vmstat
5. Message queue sizes
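To make the feature list concrete, here is a minimal sketch of how a few of those factors could be turned into one sample row on Linux. It assumes the data comes from /proc/meminfo and /proc/loadavg; load average is used only as a rough stand-in for CPU, and vmstat and message queue sizes are not covered. The function and key names (parse_meminfo, feature_row, mem_used_kb and so on) are hypothetical, chosen for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(parts[0])
    return fields


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg-style text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def feature_row(meminfo_text, loadavg_text):
    """Build one feature row (a dict) from the two text snapshots."""
    mem = parse_meminfo(meminfo_text)
    load1, load5, _load15 = parse_loadavg(loadavg_text)
    return {
        # Assumption: "memory used" means total minus available.
        "mem_used_kb": mem["MemTotal"] - mem["MemAvailable"],
        "shmem_kb": mem.get("Shmem", 0),
        "swap_used_kb": mem.get("SwapTotal", 0) - mem.get("SwapFree", 0),
        "load1": load1,
        "load5": load5,
    }


def read_live_row():
    """Sample the running Linux system once (requires /proc)."""
    return feature_row(
        Path("/proc/meminfo").read_text(),
        Path("/proc/loadavg").read_text(),
    )
```

Calling read_live_row() at a fixed interval and appending a timestamp would give a time series to train on. Keeping the parsers separate from the file reads also makes it possible to replay recorded snapshots from other machines.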
Regards,
Prathamesh

On Wed, Oct 9, 2019 at 2:28 PM Valdis Klētnieks <valdis.kletni...@vt.edu> wrote:

> On Wed, 09 Oct 2019 01:23:28 -0700, prathamesh naik said:
> > I want to work on project which can predict kernel process
> > crash or even user space process crash (or memory usage spikes) using
> > machine learning algorithms.
>
> This sounds like it's isomorphic to the Turing Halting Problem, and there's
> plenty of other good reasons to think that predicting a process crash is, in
> general, somewhere between "very difficult" and "impossible".
>
> Even "memory usage spikes" are going to be a challenge.
>
> Consider a program that's doing an in-memory sort. Your machine has 16 gig of
> memory, and 2 gig of swap. It's known that the sort algorithm requires 1.5G of
> memory for each gigabyte of input data.
>
> Does the system start to thrash, or crash entirely, or does the sort complete
> without issues? There's no way to make a prediction without knowing the size
> of the input data. And if you're dealing with something like
>
>     grep <regexp> file | predictable-memory-sort
>
> where 'file' is a logfile *much* bigger than memory....
>
> You can see where this is heading...
>
> Bottom line: I'm pretty convinced that in the general case, you can't do much
> better than current monitoring systems already do: Look at free space, look at
> the free space trendline for the past 5 minutes or whatever, and issue an alert
> if the current trend indicates exhaustion in under 15 minutes.
>
> Now, what *might* be interesting is seeing if machine learning across multiple
> events is able to suggest better values than 5 and 15 minutes, to provide a
> best tradeoff between issuing an alert early enough that a sysadmin can take
> action, and avoiding issuing early alerts that turn out to be false alarms.
>
> The problem there is that getting enough data on actual production systems
> will be difficult, because sysadmins usually don't leave sub-optimal
> configuration settings in place so you can gather data.
>
> And data gathered for machine learning on an intentionally misconfigured test
> system won't be applicable to other machines.
>
> Good luck, this problem is a lot harder than it looks....
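
The trendline check described in the quoted message (look at the last 5 minutes of free space, alert if the trend points to exhaustion in under 15 minutes) could be sketched roughly as below. This is only an illustration under assumptions: it fits a straight line by ordinary least squares, takes samples as (time in minutes, free bytes) pairs, and the function names are hypothetical. The 5 and 15 minute values are the ones from the message and are exactly the parameters a learning approach might try to tune.

```python
def minutes_until_exhaustion(samples, window_minutes=5.0):
    """Estimate minutes until free space hits zero, or None if not shrinking.

    samples: list of (t_minutes, free_bytes) pairs, in any order.
    Only samples within window_minutes of the newest one are used.
    """
    if not samples:
        return None
    last_t = max(t for t, _ in samples)
    recent = [(t, f) for t, f in samples if t >= last_t - window_minutes]
    n = len(recent)
    if n < 2:
        return None

    # Ordinary least-squares fit of free = intercept + slope * t.
    mean_t = sum(t for t, _ in recent) / n
    mean_f = sum(f for _, f in recent) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in recent)
    if sxx == 0:
        return None  # all samples share one timestamp
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in recent)
    slope = sxy / sxx
    if slope >= 0:
        return None  # free space is flat or growing

    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line reaches zero
    return max(t_zero - last_t, 0.0)


def should_alert(samples, window_minutes=5.0, horizon_minutes=15.0):
    """True if the fitted trend predicts exhaustion within horizon_minutes."""
    remaining = minutes_until_exhaustion(samples, window_minutes)
    return remaining is not None and remaining < horizon_minutes
```

For example, free space falling from 1000 to 500 units over five minutes gives about five minutes remaining, so should_alert returns True with the default horizon. A straight-line fit is the simplest possible choice here and will mislead on bursty workloads such as the sort example in the quoted message.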
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies