On 03/05/2013 09:57 AM, Brian Candler wrote:
On Tue, Mar 05, 2013 at 08:33:28AM -0800, Joe Julian wrote:
It comes up on this list from time to time that there's not
sufficient documentation on troubleshooting. I assume that's what
some people mean when they refer to disappointing documentation,
since the current documentation is far more detailed and useful than
it was 3 years ago when I got started. I'm not really sure what's being
asked for here, nor am I sure how one would document how to
troubleshoot. In my mind, if there's a problem that can be
documented with a clear path to resolution, then a bug report should
be filed and that should be fixed. Any other cases that cannot be
coded for require human intervention and are already documented.
When people come to this list and say "I am seeing split brain errors" or
"ls shows question marks for file attributes"
There are article(s) on the official Q&A site, but that [censored] site
can't find them with a search. Grrr.
or "I need to replace a failed server with a new one"
There's an article on the official Q&A site for that too, but again
search isn't finding it. I'll try to grab the contents of those and paste
them into the wiki somewhere (unless you do it first; it is a wiki, after all).
or "probing a server fails",
Agreed. This would be good. Does anyone actually know how to answer
this? Please write it up on the wiki. I know I even have trouble
sometimes figuring out why someone's probe fails.
I don't think there's
any official documentation to help them.
"Documenting how to troubleshoot" would include what log messages you should
look for and what they mean, what xattrs you should expect to see on the
bricks and what they mean (for each case of distributed, replicated etc).
Given a basic checklist of these things, it would be easy for users to
report to the list "I checked A, B and C and the output from B was XXXX when
the docs say it should be YYYY on a working system", which is at least a
starting point.
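A minimal sketch of what such a checklist report could look like, assuming the docs listed an expected output for each check. The check names A, B, C and the values XXXX/YYYY are just the placeholders from the paragraph above, not real GlusterFS checks, and `compare_checklist` is a hypothetical helper:

```python
def compare_checklist(expected, observed):
    """Return one report line per check whose observed output differs
    from the documented output for a working system."""
    report = []
    for check, want in expected.items():
        # A check the user never ran is reported as such.
        got = observed.get(check, "<not checked>")
        if got != want:
            report.append(
                f"{check}: output was {got!r}, docs say {want!r} on a working system"
            )
    return report


# Placeholder values only: what the docs would say vs. what the user saw.
expected = {"A": "YYYY", "B": "YYYY", "C": "YYYY"}
observed = {"A": "YYYY", "B": "XXXX", "C": "YYYY"}

for line in compare_checklist(expected, observed):
    print(line)
```

The point is only that a documented "expected" column turns a vague report into a specific one; the real checks would be log messages and brick xattrs per volume type.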
This is where all open source seems to hit problems. Sure, there are error
messages (at least they're not "Error ##" like MySQL does...) but they
generally only seem to make sense to whoever wrote the software. There
are 7216 log entries in the source. That's a lot of man-hours to
document all of those, even without any degree of detail.
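For anyone who wants to reproduce a count like that, here is a rough sketch. It assumes the log levels show up in the C sources as GF_LOG_<LEVEL> constants and simply counts every occurrence, so it will also pick up the constant definitions themselves and will not necessarily match the figures quoted in this thread. `count_log_levels` is a hypothetical helper, not part of any gluster tooling:

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: levels are written as GF_LOG_<LEVEL> in the C sources.
LEVEL_RE = re.compile(r"\bGF_LOG_(CRITICAL|ERROR|WARNING|INFO|DEBUG|TRACE)\b")


def count_log_levels(source_root):
    """Count GF_LOG_<LEVEL> occurrences in all .c and .h files under source_root."""
    counts = Counter()
    for path in Path(source_root).rglob("*.[ch]"):
        # errors="replace" keeps odd encodings from aborting the scan.
        text = path.read_text(errors="replace")
        counts.update(LEVEL_RE.findall(text))
    return counts
```

Calling `count_log_levels("/path/to/source/checkout")` returns a Counter keyed by level name, which would at least give a starting list to divide up.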
Now, there are only 136 critical errors, but I'm not sure I've ever seen
one of those. There are 2991 at the level of "error", though, so I'm really
not sure how that could be handled. Even if someone could volunteer
8 hours/day and spend 15 minutes describing each error message, it would
take them around 4 1/2 months of working days. That's longer than a
production cycle (granted, once they were documented the production cycle
would be unlikely to produce nearly 3000 new error messages).
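The estimate above can be checked with a quick back-of-the-envelope calculation. The message count, minutes per message, and hours per day come from the paragraph; the figure of 21 working days per month is my assumption:

```python
# Back-of-the-envelope check of the documentation effort estimate.
error_messages = 2991         # log entries at the "error" level
minutes_per_message = 15
hours_per_day = 8
working_days_per_month = 21   # assumption: roughly 21 working days per month

total_hours = error_messages * minutes_per_message / 60
working_days = total_hours / hours_per_day
months = working_days / working_days_per_month

print(f"{total_hours:.0f} hours, {working_days:.0f} working days, {months:.1f} months")
# prints "748 hours, 93 working days, 4.5 months"
```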
I'd be willing to make the list and document 1 or 2 a day. Anyone else?
As far as I'm aware, the official admin guide is completely oblivious to
internals like this.
Users may be able to find suggestions by perusing mailing list archives, or
by trying gluster 2.x wiki documentation (which may be stale), or some blog
postings.
Thanks for pointing these out. Some I (obviously) wasn't even aware were
a problem.
By the way - if anyone wants to copy-paste stuff from my blog into the
wiki, feel free. I keep meaning to but have been behind schedule at work
and just haven't had enough free time lately.
_______________________________________________
Gluster-users mailing list
http://supercolony.gluster.org/mailman/listinfo/gluster-users