from:"Jeremy Chadwick"

Re: [HEADSUP] Re: Is IPV6 option still necessary?

2019-10-10 Thread Jeremy Chadwick via freebsd-stable

On Wed, Oct 09, 2019 at 08:13:39PM -0700, Jeremy Chadwick wrote:
> > Now we can get back on the ipv6 option.
> > 
> > so if we want to proceed further in removing the option to build with or 
> > without
> > ipv6 for the ports side. Please speak up in reply to this email, if you are
> > building without ipv6, why are you doing so, what are the real benefit for 
> > it.
> > How bad it will impact you if we do remove that option?
> 
> Whenever I use ports over FreeBSD-provided packages (or to use ports to
> build my own packages), I often disable IPV6 support.  The lengthy
> response below should explain why.
> {brevity snip}

This was sent to the wrong mailing list; it was intended for -ports.
Sorry for the noise.

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: [HEADSUP] Re: Is IPV6 option still necessary?

2019-10-10 Thread Jeremy Chadwick via freebsd-stable

> Now we can get back on the ipv6 option.
> 
> so if we want to proceed further in removing the option to build with or 
> without
> ipv6 for the ports side. Please speak up in reply to this email, if you are
> building without ipv6, why are you doing so, what are the real benefit for it.
> How bad it will impact you if we do remove that option?

Whenever I use ports over FreeBSD-provided packages (or to use ports to
build my own packages), I often disable IPV6 support.  The lengthy
response below should explain why.

In short: the IPV6 option is useful and important.  Please keep it.

In length: I think anyone operating in the Real World knows quite well
that IPv6 is still treated as a third-class citizen when it comes to
both general connectivity/reliability* and general use cases
code-wise**.  It's still very much in utero; or a toddler, if you will.

When you encounter IPv6 vs. IPv4 prioritisation issues, they are painful
and annoying.  No user or administrator is going to sit for hours
fiddling with it all to restore things to a working state when simply
removing IPv6 relieves the problem permanently.  Time and time again I
see companies advertising AAAA records and webservers listening on IPv6
yet IPv6 transit fails but their A/IPv4 endpoint works fine.  It's the
dual-stack nature that makes a lot of this worse than it should be.  (I
do think this subject should be re-visited once the world as a whole
starts to seriously decommission IPv4, though.  Yes I'm serious.)

I've worked for several companies that are IPv4-only, where the belief
(and one I share) is that IPv6-only clients have some 6-to-4-ish
gateway/NAT somewhere upstream, otherwise they wouldn't be able to reach
most of the Internet.  IPv4 NAT still works for the majority of use
cases as of 2019.

Furthermore, faux-political statements like "IPv6 is more widely used
than 2012" should be ignored and facts reiterated: IPv6 adoption is
around 25% as of mid-2019.  And it's taken over 10 years to reach that.

IPv4 is also well-understood, and not, as Dave Horsfall accurately
described, "a horse designed by a committee"; people are still trying to
wrap their head around IPv6 NDP/RA, SLAAC, and a myriad of other things
(dare I mention syntax?).  It's this which explains the sluggish
adoption rate.

And yes, I am well aware of how important IPv6 is in other regions,
particularly Asia.  I am not belittling that need at all.  But not
everyone globally has the same needs.

What should really be asked for is the opposite: for the FreeBSD ports
folks to justify its removal.

How is this hurting you on a daily basis?  Is there a large percentage
of Mk/ framework bits causing you pain?  Are the bulk of per-port
patches inducing maintainer grief?  At what scale is this impacting you?
In 7 years (since the OP picked 2012), how much time has been spent by
maintainers ensuring IPV6=true works for their port(s)?  Are you truly
OK throwing away the integration work done by many, many people (not
just Project members!) over the past N years (see: per-port patches),
and forcing people who still need the option to make their own ports
tree to retain it?

Here's some harsh advice for the FreeBSD Project: quit changing shit for
sake of change, often masked by lies like "XXX is stagnant/old" or
similarly fallacious and loaded statements.  The project (both src and
ports, but especially ports) has lost many very good people in the past
10+ years (and I'm not talking about me) *because* of that change for
sake of change mindset -- the same mindset driving this request!  It's
changes like this that drive people away from FreeBSD.  Really.  It's
the same mindset that provoked people to stop using Linux distros due
to systemd integration.

I will not be replying to this thread past this point.  I have said all
that I care to say / spent enough time on it.  Just please stop hurting
administrators and end users with proposals/actions like this.

* - Real-world IPv6 failures impacting end users tend to be more frequent
than IPv4 ones; this is anecdotal on my part, but I have a myriad of peers
who have had to disable IPv6 for similar reasons.  The IPv4 fallback in
software (both userland apps and network stacks) does not always work
"correctly".  Just go see how often IPv6 failures/issues are reported on
both NANOG and the outages@ mailing list.  And yes I am quite aware that
a good portion of the Internet backbone at this point is IPv6 (that's
nice, and not what we're talking about here).
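
To make the fallback point concrete, here is a minimal Python sketch of sequential dual-stack fallback. The helper name connect_with_fallback and the 3-second default timeout are assumptions for illustration, not anything from the thread. It tries each address that getaddrinfo() returns, in resolver order, so a black-holed IPv6 path costs one full timeout before the IPv4 address is even attempted.

```python
import socket


def connect_with_fallback(host, port, timeout=3.0):
    """Try each address from getaddrinfo() in order until one connects.

    Hypothetical helper for illustration only; real clients often use
    more elaborate schemes (for example, racing both address families).
    """
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)   # applies to this attempt only
            sock.connect(sockaddr)
            return sock                # first address that answers wins
        except OSError as exc:
            last_error = exc           # remember why this address failed
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("no addresses returned for %r" % (host,))
    raise last_error
```

If the resolver lists an unreachable AAAA address first, the caller waits out the timeout before the A record gets a turn, which is the kind of delay that leads people to switch IPv6 off entirely.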

** - I still continue to see open-source software committing major fixes
to AF_INET6 related code bits.  Major pieces of software include curl,
wget, Busybox, DNS servers (pick one!), and ntp... just for starters.

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|


Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-09-20 Thread Jeremy Chadwick via freebsd-stable

> I've committed a fix to head and will MFC it in a few days.  Thanks
> for tracking this down!

Did HEAD r351557 get backported/MFC'd into stable/11 and stable/12?  Can
test stable/11 if needed.

Thanks!

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|


Re: Buildworld times (was Re: svn commit: r350256 - in stable/12: . contrib/compiler-rt/lib/sanitizer_common contrib/libunwind/src contrib/llvm/lib/DebugInfo/DWARF contrib/llvm/lib/MC contrib/llvm/lib

2019-07-29 Thread Jeremy Chadwick via freebsd-stable

On Mon, Jul 29, 2019 at 01:44:01PM -0400, mike tancsa wrote:
> On 7/26/2019 10:38 PM, Jeremy Chadwick via freebsd-stable wrote:
> > (Please retain CCs, I am not subscribed to the list)
> >
> > Below is hard evidence of 3 things on stable/11 (not 12) after r350259:
> >
> > 1. r350259 adds *substantial* time to buildworld.
> 
> Are you sure this is not the same as the issue in RELENG12 ? ie. the new
> version of clang is built as part of world since it differs from whats
> installed.  I had a RELENG11 box sitting around from July 4th

By "on stable/11 (not 12)" I meant: I do not run stable/12, thus I
cannot speak on its behalf.

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|


Re: Buildworld times (was Re: svn commit: r350256 - in stable/12: . contrib/compiler-rt/lib/sanitizer_common contrib/libunwind/src contrib/llvm/lib/DebugInfo/DWARF contrib/llvm/lib/MC contrib/llvm/lib

2019-07-26 Thread Jeremy Chadwick via freebsd-stable

(Please retain CCs, I am not subscribed to the list)

Below is hard evidence of 3 things on stable/11 (not 12) after r350259:

1. r350259 adds *substantial* time to buildworld.

2. WITHOUT_CLANG_EXTRAS+WITHOUT_CLANG_FULL+WITHOUT_LLDB can help improve
the situation after r350259, but it is still nowhere near as fast as
pre-r350259.

3. Kernel build times are fine; issue is with world.

TL;DR for lazy folks:

stable/11 r350330 world + minimal clang = 1:29:34
stable/11 r350330 world + full clang    = 1:46:31
stable/11 r350252 world + minimal clang =   56:52
stable/11 r350252 world + full clang    = 1:14:30
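
For a sense of scale, the four wall-clock times above can be turned into relative slowdowns. This is only a sketch: the times are copied from the summary, while the helper names (to_seconds, slowdown_percent) are made up for illustration.

```python
def to_seconds(clock):
    """Convert 'H:MM:SS' or 'MM:SS' wall-clock text to whole seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# buildworld wall-clock times, copied from the summary above
times = {
    ("r350330", "minimal"): "1:29:34",
    ("r350330", "full"): "1:46:31",
    ("r350252", "minimal"): "56:52",
    ("r350252", "full"): "1:14:30",
}


def slowdown_percent(config):
    """Percentage increase in buildworld time from r350252 to r350330."""
    before = to_seconds(times[("r350252", config)])
    after = to_seconds(times[("r350330", config)])
    return 100.0 * (after - before) / before


for config in ("minimal", "full"):
    print("%s clang: +%.1f%%" % (config, slowdown_percent(config)))
```

On these numbers the increase works out to roughly 57% for the minimal clang build and roughly 43% for the full one.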

I cannot even begin to tell you how big of an impact this has on my
low-end dual-core VPS box (world takes hours upon hours).

We've been down this road before, many many times, since the
introduction of clang/LLVM.  Here's just a few that went nowhere.  I
couldn't find the more-useful one that had some concrete numbers in it,
dating back to pre-2016 (maybe sometime in 2014 or 2015?):

https://lists.freebsd.org/pipermail/freebsd-current/2017-January/thread.html#64431
https://lists.freebsd.org/pipermail/freebsd-stable/2017-January/thread.html#86646
https://lists.freebsd.org/pipermail/freebsd-questions/2016-November/thread.html#274684

Does anyone have a good/recent write-up on how to switch to gcc?  :-)


System
======
* Intel Core 2 Quad Q9550 @ 2.83GHz
* 8GB ECC RAM
* Samsung SSD 840 EVO 250GB filesystem (UFS2 + SU (not SUJ) + TRIM) + 32GB swap
* Running stable/11 r349226
* Misc notes
  - r350330 happened to be what was "master" at the time of my test
  - r350252 was the commit on stable/11 immediately before r350259
  - Switching to r350252 accomplished via: cd /usr/src && svnlite up -r350252
  - System uses kern.maxvnodes=856944, last tuned 2018/06/07


Test #1, building r350330 minimal clang
=======================================
# cat /etc/src.conf
WITHOUT_ATM=true
WITHOUT_BLUETOOTH=true
WITHOUT_DEBUG_FILES=true
WITHOUT_FLOPPY=true
WITHOUT_FREEBSD_UPDATE=true
WITHOUT_IPFILTER=true
WITHOUT_IPX=true
WITHOUT_LIB32=true
WITHOUT_NDIS=true
WITHOUT_NETGRAPH=true
WITHOUT_PPP=true
WITHOUT_SENDMAIL=true
WITHOUT_TESTS=true
WITHOUT_WIRELESS=true
WITH_OPENSSH_NONE_CIPHER=true
WITHOUT_CLANG_EXTRAS=true
WITHOUT_CLANG_FULL=true
WITHOUT_LLDB=true
WITHOUT_LLVM_TARGET_AARCH64=true
WITHOUT_LLVM_TARGET_ARM=true
WITHOUT_LLVM_TARGET_MIPS=true
WITHOUT_LLVM_TARGET_POWERPC=true
WITHOUT_LLVM_TARGET_SPARC=true
WITHOUT_REPRODUCIBLE_BUILD=true
# cat /etc/make.conf
KERNCONF=X7SBA_RELENG_11_amd64
CPUTYPE?=core2
SVN_UPDATE=yes
STRIP=
CFLAGS+=-fno-omit-frame-pointer

Result:
# rm -fr /usr/obj/*
# cd /usr/src
# time make -j4 buildworld
19906.874u 1280.928s 1:29:33.51 394.3%  57966+778k 23504+14200io 13867pf+0w
# time make -j4 buildkernel
1592.460u 196.047s 7:36.61 391.6%   48704+614k 6627+18158io 7361pf+0w
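
A note on reading the time output: in csh/tcsh's default format the first three fields are user CPU seconds, system CPU seconds, and elapsed wall-clock time, and the percentage is CPU time divided by elapsed time. The sketch below recomputes that percentage for the buildworld line above; the function names are illustrative only.

```python
def elapsed_seconds(clock):
    """Convert '[H:]MM:SS.ss' wall-clock text to seconds."""
    total = 0.0
    for part in clock.split(":"):
        total = total * 60 + float(part)
    return total


def cpu_percent(user, system, elapsed):
    """CPU utilisation as (user + system) / elapsed, in percent."""
    return 100.0 * (user + system) / elapsed_seconds(elapsed)


# Fields taken from the buildworld line of Test #1
busy = cpu_percent(19906.874, 1280.928, "1:29:33.51")
print("buildworld kept about %.1f cores busy" % (busy / 100.0))
```

The result should land close to the reported 394.3%, i.e. nearly four cores' worth of CPU with -j4 on a quad-core CPU, which suggests the build is CPU-bound rather than waiting on disk.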


Test #2, building r350330 full clang
====================================
"full clang" means same as Test #1 but with these 3 src.conf lines
commented out, i.e. CLANG_EXTRAS, CLANG_FULL, and LLDB are ENABLED:

WITHOUT_CLANG_EXTRAS=true
WITHOUT_CLANG_FULL=true
WITHOUT_LLDB=true

Result:
# rm -fr /usr/obj/*
# cd /usr/src
# time make -j4 buildworld
23779.674u 1463.156s 1:46:30.75 394.9%  57621+783k 20093+15423io 7283pf+0w
# time make -j4 buildkernel
1594.079u 194.345s 7:36.48 391.7%   48707+614k 5301+18013io 5342pf+0w


Test #3, building r350252 minimal clang
=======================================
Same configs as Test #1

Result:
# rm -fr /usr/obj/*
# cd /usr/src
# time make -j4 buildworld
12582.693u 882.543s 56:52.35 394.6% 62698+760k 21432+9694io 6923pf+0w
# time make -j4 buildkernel
1649.559u 184.934s 7:48.01 391.9%   57053+622k 7566+18291io 5402pf+0w


Test #4, building r350252 full clang
====================================
Same configs as Test #2

# rm -fr /usr/obj/*
# cd /usr/src
# time make -j4 buildworld
16600.975u 1068.754s 1:14:29.53 395.3%  63271+774k 8683+10876io 4707pf+0w
# time make -j4 buildkernel
1650.654u 183.966s 7:47.47 392.4%   57117+623k 2829+17951io 1926pf+0w

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|


Re: /dev/crypto not being used in 12-STABLE

2018-12-07 Thread Jeremy Chadwick

On Fri, Dec 07, 2018 at 06:38:04PM -0500, Jung-uk Kim wrote:
> On 18. 12. 6., Jeremy Chadwick wrote:
> > I'm not subscribed to -stable.
> > 
> > This is in response to jkim@'s messages here:
> > 
> > https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090202.html
> > https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090202.html
> > 
> > Based on what I can tell, OpenSSL 1.1.1 or thereabouts removed the
> > cryptodev OpenSSL engine, which was a tie-in to BSD's cryptodev(4),
> > which is accessed via /dev/crypto and related crypto(4) ioctls.
> > 
> > Instead, they offered a replacement engine called devcrypto (what an
> > awful name), with the primary focus being against something from Linux
> > called cryptodev-linux, then was made to work on FreeBSD 8.4.  This code
> > was as of June 2017; 8.4 was EOL'd August 2015.  Interesting.
> > 
> > https://github.com/openssl/openssl/commit/4f79aff is not "add support
> > for BSD" at all.  It's "tweak further stuff for BSD", probably to get it
> > to work on newer FreeBSD; they seem to care about crypto/cryptodev.h
> > details.  I asked myself: why do they care about that if they're doing
> > it all themselves?  Looking at the code sheds light on that.  The actual
> > devcrypto engine commits that added BSD support are here:
> > 
> > https://github.com/openssl/openssl/pull/3744
> > https://github.com/openssl/openssl/pull/3744/files
> > 
> > The commits indicate that the devcrypto is enabled by default on
> > FreeBSD.  But we can tell from Herbert's post and jkim@'s patch that's
> > not true at all, i.e. FreeBSD disables it.  Why?  And is that a good
> > default?
> 
> Why do you think it is enabled by default?
> 
> https://github.com/openssl/openssl/blob/619eb33/Configure#L428

Because of this commit to OpenSSL's CHANGES file, which is part of what
I linked above; last sentence:

https://github.com/openssl/openssl/pull/3744/files#diff-e4eb329834da3d36278b1b7d943b3bc9

  *) Add devcrypto engine.  This has been implemented against cryptodev-linux,
     then adjusted to work on FreeBSD 8.4 as well.
     Enable by configuring with 'enable-devcryptoeng'.  This is done by default
     on BSD implementations, as cryptodev.h is assumed to exist on all of them.
     [Richard Levitte]

Is this message incorrect/false?  While I can read the perl code that is
the Configure script just fine, the CHANGES entry makes me think there
may be "other pieces" that affect the value of the key in that hash
(e.g. some script that uses uname detection and calls Configure with
that argument).  Are there?

> Note crypto(4) was imported from OpenBSD.  Since OpenBSD 4.9, it was
> disabled by default.
> 
> https://www.openbsd.org/plus49.html
> 
> Then, they killed it in 5.7.
> 
> https://www.openbsd.org/plus57.html
> 
> o Unlinked the crypto(4) pseudo device (disabled by default for about 4
> years).
> 
> Now FreeBSD is the only major BSD with /dev/crypto.  That's why new
> engine was not thoroughly tested.

Thanks for the information.

So this implies there is a desire to get rid of cryptodev(4) (which is
the /dev/crypto endpoint), at least on OpenBSD.

Apologies if this is off-topic, but: is "device cryptodev" something
that should be removed from one's kernel config (due to what sounds like
desired deprecation), while keeping "device crypto" (to ensure userland
applications that use libcrypto/crypto(4) functions can still get at
crypto(9))?

> > Here's why I ask:
> >
> > The new devcrypto engine most definitely utilises /dev/crypto (thus
> > cryptodev(4) and crypto(4)).  cipher_init(), prepare_cipher_methods(),
> > digest_init(), and prepare_digest_methods() all utilise that interface:
> > 
> > https://github.com/openssl/openssl/pull/3744/files#diff-027f92eb0a10c0986aec873d9fd1ab66
> > 
> > So while OpenSSL now uses more of its own native C and assembly code
> > (e.g. for AES-NI support), and that's certainly faster than all the
> > overhead that cryptodev(4) brings with it (see jhb@'s post), I wonder:
> > 
> > 1. What happens to people using crypto hardware accelerators, ex.
> > hifn(4), padlock(4), ubsec(4), and safe(4)?  How exactly would OpenSSL
> > utilise these H/W accelerators if the devcrypto engine is disabled?
> 
> padlock has a dynamic engine, i.e., /usr/lib/engines/padlock.so.  I
> believe glxsb, hifn(4), safe(4), and ubsec(4) users are very rare
> nowadays.  If we have significant number of users and they show
> reasonable performance, then I will reconsider my decision.

Consider me surprised by this approach.  See below/end of my response.

> > 2. If the devcrypto e

Re: /dev/crypto not being used in 12-STABLE

2018-12-06 Thread Jeremy Chadwick

I'm not subscribed to -stable.

This is in response to jkim@'s messages here:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090202.html
https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090202.html

Based on what I can tell, OpenSSL 1.1.1 or thereabouts removed the
cryptodev OpenSSL engine, which was a tie-in to BSD's cryptodev(4),
which is accessed via /dev/crypto and related crypto(4) ioctls.

Instead, they offered a replacement engine called devcrypto (what an
awful name), developed primarily against something from Linux called
cryptodev-linux, then made to work on FreeBSD 8.4.  This code dates
from June 2017; 8.4 was EOL'd August 2015.  Interesting.

https://github.com/openssl/openssl/commit/4f79aff is not "add support
for BSD" at all.  It's "tweak further stuff for BSD", probably to get it
to work on newer FreeBSD; they seem to care about crypto/cryptodev.h
details.  I asked myself: why do they care about that if they're doing
it all themselves?  Looking at the code sheds light on that.  The actual
devcrypto engine commits that added BSD support are here:

https://github.com/openssl/openssl/pull/3744
https://github.com/openssl/openssl/pull/3744/files

The commits indicate that the devcrypto engine is enabled by default on
FreeBSD.  But we can tell from Herbert's post and jkim@'s patch that's
not true at all, i.e. FreeBSD disables it.  Why?  And is that a good
default?  Here's why I ask:

The new devcrypto engine most definitely utilises /dev/crypto (thus
cryptodev(4) and crypto(4)).  cipher_init(), prepare_cipher_methods(),
digest_init(), and prepare_digest_methods() all utilise that interface:

https://github.com/openssl/openssl/pull/3744/files#diff-027f92eb0a10c0986aec873d9fd1ab66

So while OpenSSL now uses more of its own native C and assembly code
(e.g. for AES-NI support), and that's certainly faster than all the
overhead that cryptodev(4) brings with it (see jhb@'s post), I wonder:

1. What happens to people using crypto hardware accelerators, ex.
hifn(4), padlock(4), ubsec(4), and safe(4)?  How exactly would OpenSSL
utilise these H/W accelerators if the devcrypto engine is disabled?

2. If the devcrypto engine is *enabled*, and people have aesni(4)
loaded alongside cryptodev(4), which gets priority: OpenSSL's native
AES-NI code or cryptodev(4)/aesni(4)?

Likewise: if the devcrypto engine is to remain disabled as a default:
this needs to be made crystal clear in Release Notes, so that folks
using H/W accelerators know they'll no longer benefit from those cards
unless they use a patch (third-party so/module won't work, AFAICT, as
OpenSSL's dynamic engine loading is unavailable per openssl engine -t).
Might I suggest making devcrypto something that can be enabled via
src.conf, ex. WITH_OPENSSL_ENGINE_DEVCRYPTO=true?

-- 
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator  PGP 0x2A389531 |
| Making life hard for others since 1977.|


Re: lightly loaded system eats swap space

2018-06-19 Thread Jeremy Chadwick

(I am not subscribed to -stable, so please CC me, though I doubt I can
help in any way/shape/form past this Email)

Not the first time this has come up -- and every time it has, all that's
heard is crickets in the threads.  Recent proof:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-April/088727.html
https://lists.freebsd.org/pipermail/freebsd-stable/2018-April/088728.html
https://lists.freebsd.org/pipermail/freebsd-stable/2018-June/089094.html

I sent private mail to Peter Jeremy about his issue.  I will not
disclose that Email here.  However, I will disclose the commits I
included in said Email that have touched ZFS ARC-related code:

http://www.freshbsd.org/commit/freebsd/r332785
http://www.freshbsd.org/commit/freebsd/r332552
http://www.freshbsd.org/commit/freebsd/r332540 (may help give insights)
http://www.freshbsd.org/commit/freebsd/r330061
http://www.freshbsd.org/commit/freebsd/r328235
http://www.freshbsd.org/commit/freebsd/r327491
http://www.freshbsd.org/commit/freebsd/r326619
http://www.freshbsd.org/commit/freebsd/r326427 (quota-related, maybe irrelevant)
http://www.freshbsd.org/commit/freebsd/r323667

In short (and nebulous as hell; sorry, I cannot be more specific given
the nature of the problem): there have been changes to ZFS's memory
allocation/releasing decision-making scheme compared to ZFS on "older"
FreeBSD (i.e. earlier 11.x, and definitely 10.x and 9.x).

Recommendations like "limit your ARC" are nothing new in FreeBSD, but
are still ridiculous kludges: tech-lists' system clearly has 105GB MRU
(MRU = most recently used) in ARC, meaning there is memory that can be
released back to the rest of the OS for general use (re: memory
contention/pressure situation), but the OS is choosing to use swap
instead, eventually exhausting it.  That logic sounds broken, IMO.  (And
yes I did notice the size of the bhyve process.)

ZFS-related kernel folks need to be involved in this conversation.  For
whatever reason, in the past several years, related committers are no
longer participating in these types of discussions.  The opposite was
true back in the 7.x to 9.x days.  The answers have to come from them.
I don't know, today, a) how they prefer these problems get reported to
them, or b) what exact information they want that can help narrow it
down (tech-lists' provided data is, IMO, good and par for the course).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: kern.maxswzone causing serious problems

2018-03-29 Thread Jeremy Chadwick

I am not subscribed to -stable, so please keep me CC'd.

I mailed -stable about this problem, or a variation of it, earlier this
month:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088467.html

What isn't publicly visible is the list of individuals I CC'd on that
mail who had touched this code in recent days: k...@freebsd.org,
d...@freebsd.org, pluk...@freebsd.org, ead...@freebsd.org

I received no response from them on this matter.  At least two, however,
have been extremely busy commit-wise, so I imagine folks are just
swamped right now + have higher priorities.

I did not read or review your {naiveanalysis} section or your patch, as
tinkering with VM design/internals is *way* outside my comfort zone.

I will say that printing the sizes in a unit other than pages would be
generally helpful; I did try to figure out what value to use for
kern.maxswzone as a workaround by digging through kernel code but gave
up, as I wasn't able to truly determine what "pages" actually
represented (size-wise) in this specific context.

I hope someone with src commit bit will comment, as code slush for
11.2-RELEASE begins on April 20th:

https://www.freebsd.org/releases/11.2R/schedule.html

Else a separate PR can be opened if requested.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Stability of 11.1S

2018-03-20 Thread Jeremy Chadwick

(Please keep me CC'd as I am not subscribed to -stable)

I haven't seen any issues, but that means very little.  Details:

Two boxes -- one bare metal, one VPS (QEMU):

$ uname -a
FreeBSD XXX 11.1-STABLE FreeBSD 11.1-STABLE #0 r330529: Tue Mar  6 11:36:04 PST 2018 root@XXX:/usr/obj/usr/src/sys/X7SBA_RELENG_11_amd64  amd64
$ uptime
10:33a.m.  up 13 days, 18:10, 2 users, load averages: 0.15, 0.19, 0.16

$ uname -a
FreeBSD  11.1-STABLE FreeBSD 11.1-STABLE #0 r330753: Sat Mar 10 21:34:20 PST 2018 root@:/usr/obj/usr/src/sys/_RELENG_11_amd64  amd64
$ uptime
10:33a.m.  up 9 days, 10:46, 1 user, load averages: 0.31, 0.35, 0.31

Systems were updated recently because I wanted to test Meltdown/Spectre
mitigation (more on that below).  Prior to that, bare metal was running
9.x with 200+ day uptimes, VPS was running 10.x with 80-90 day uptimes
(VPS providers' HV crashed, i.e. not FreeBSD issues).

Since load averages on FreeBSD 10.x onward cannot be trusted[1][2], I
have to explain the general system specs and loads:

Bare metal box is an Intel Core 2 Quad Q9550, 8GB RAM, doing very little
other than running Apache + lots of cron jobs for systems stuff + ZFS
with several disks (but not OS disk; that's a dedicated SSD w/ UFS + SU
(not SUJ)).  The cron jobs tend to stress the network and disk I/O a bit;
ZFS gets used every day, but only "heavily" during LAN file copies
to/from it (Samba is involved), and during nightly backups with rsync.

VPS box is some form of QEMU-based Intel Haswell CPU, 1GB RAM, doing
general things like Apache + postfix + SpamAssassin + some other
daemons, and a lot of Perl.  Swap is used heavily on this machine.
Disks are all vtblk, and I use multiple to get capacity for the needed
space for /usr/src and /usr/obj.  Everything is UFS + SU (not SUJ).

Things off the top of my head that might be relevant to you:

1. r329462 added Meltdown/Spectre mitigation[3][4].

Bare metal box has the below in /boot/loader.conf, since this is a
machine that does not need either given its environment:

# Disable PTI (Meltdown mitigation) and IBRS (Spectre mitigation); these
# are not relevant on this bare-metal system given its environment and
# use case.  Details of these tunables are here:
# https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088526.html
#
vm.pmap.pti="0"
hw.ibrs_disable="1"

VPS box has no tunings of this sort, and ends up with the below, because
the hosting provider has not done BIOS + QEMU updates to add IBRS
support (they're very aware of it + have attempted it twice but
apparently it didn't go well):

vm.pmap.pti: 1
hw.ibrs_disable: 1
hw.ibrs_active: 0

2. If your CPU is an AMD Ryzen, there is a VERY long discussion on
-stable about problems with Ryzen manifesting itself in a very
uncomfortable way, leading to system lock-ups[5].  There are unofficial
patches you can try.  I would recommend chiming in there and not here,
if relevant to your systems.

And yes, the massive number of MFCs that eadler@ is doing makes tracking
down exact things more tedious than normal, especially when you have
sweeping commits like this one[6][7] (which, AFAIK, was acting as a
major blocker for several other MFCs and causing general merge
problems).

However, I commend his efforts; it's a massive undertaking (I would say
full-time job).  We stable users must accept that we are running
stable/11 for a reason -- not only to get fixes faster, but to act as a
form of "guinea pig" for those who don't want the risks of HEAD/CURRENT.  The
more people using stable/11, the better overall feedback devs can get on
bugs/issues before they make it into the next -RELEASE.  This is exactly
why, for those of you who have known me over the years, I actually
"track" or "follow" commits as they come across.  I do this by using the
FreshBSD site[8] alongside manual review of svnlite update output.  I
generally know what files/bits are relevant to my interests.

Hope this gives you some things to think about.  Good luck!

[1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541#c8
[2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541#c22
[3]: https://lists.freebsd.org/pipermail/freebsd-stable/2018-February/088396.html
[4]: https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088526.html
[5]: https://lists.freebsd.org/pipermail/freebsd-stable/2018-January/thread.html#88174
[6]: http://www.freshbsd.org/commit/freebsd/r330897
[7]: https://svnweb.freebsd.org/base?view=revision&revision=330897
[8]: http://www.freshbsd.org/?branch=RELENG_11=freebsd

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


total configured swap pages exceeds maximum recommended amount

2018-03-02 Thread Jeremy Chadwick

I am not subscribed to -stable, so please keep me CC'd.  I am CC'ing
folks who have touched this code or dealt with it recently or in the
past.

Something has changed regarding how FreeBSD determines when to emit this
message.  I do not know if this is a regression.  The message below
comes from a stable/11 r330260 amd64 box w/ 8GB RAM and 32GB swap during
boot:

warning: total configured swap (8358563 pages) exceeds maximum recommended amount (8141112 pages).
warning: increase kern.maxswzone or reduce amount of swap.
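
To put those page counts into more familiar units, here is a small sketch. It assumes the "pages" in the message are 4096-byte pages (the amd64 base page size); the message itself does not state the unit. The function name is illustrative.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the warning itself


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB under the assumed page size."""
    return pages * page_size / 2.0**30


configured = 8358563    # "total configured swap" from the warning
recommended = 8141112   # "maximum recommended amount" from the warning

print("configured : %.2f GiB" % pages_to_gib(configured))
print("recommended: %.2f GiB" % pages_to_gib(recommended))
print("excess     : %.0f MiB" % ((configured - recommended) * PAGE_SIZE / 2.0**20))
```

Under that assumption the configured figure comes out just under 32 GiB, which is consistent with the 32GB swap partition on this box, and the excess over the recommended amount is under 1 GiB.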

In stable/9, the message could be squelched via kern.maxswzone="0" in
loader.conf.  Confirmation is here (see Dag-Erling's responses):
https://lists.freebsd.org/pipermail/freebsd-stable/2012-August/069301.html

In stable/11, this no longer appears to work (the default value is 0).

The reason this box has 32GB swap (4x more than existing RAM) has to do
with planning ahead.  The system can support up to 32GB RAM, but does
not have all the DIMM slots populated at this time.  Swap on this
machine is a physical partition on its main disk, thus "shrinking swap"
is not possible without a full format/reinstall.

This code has been touched/tweaked semi-recently in PR 221356:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221356

Code references:

stable/9:  https://svnweb.freebsd.org/base/stable/9/sys/vm/swap_pager.c?annotate=284100#l2132
stable/10: https://svnweb.freebsd.org/base/stable/10/sys/vm/swap_pager.c?annotate=320557#l2156
stable/11: https://svnweb.freebsd.org/base/stable/11/sys/vm/swap_pager.c?annotate=329591#l2126

My questions: how does one squelch this warning message on such systems
running stable/11?  If it involves setting the tunable to a more useful
value, how does one reliably calculate that value?

Thank you.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


stable/11 r329462 - Meltdown/Spectre MFC questions

2018-02-17 Thread Jeremy Chadwick

Reference: https://svnweb.freebsd.org/base?view=revision&revision=329462

Do the following new loader tunables and sysctls have documentation
anywhere?  I ask because I wish to know how to turn all of this off (yes
you heard me correctly), as not all systems necessarily require
mitigation of these flaws.

Best I can tell from skimming source:

vm.pmap.pti
  - Description: Page Table Isolation enabled
  - Loader tunable, visible in sysctl (read-only)
  - Integer
  - Default value: depends on CPU model and capabilities, see
    function pti_get_default(); looks like AMD = 0, any CPU with
    RDCL_NO capability enabled = 0, else 1

hw.ibrs_active
  - Description: Indirect Branch Restricted Speculation active
  - sysctl (read-only)
  - Integer
  - Real-time indicator of whether IBRS is currently on or off

hw.ibrs_disable 
  - Description: Disable Indirect Branch Restricted Speculation
  - Loader tunable and sysctl tunable (read-write)
  - Integer
  - Default value: unsure.  Variable declaration has 1 but
    SYSCTL_PROC() macro has 0.
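
The vm.pmap.pti default described above can be restated as a small
sketch.  This is not the kernel code: the function name, its parameters
and the vendor-string check are illustrative assumptions, based only on
the skim of pti_get_default() summarised in the list.

```python
def pti_default(cpu_vendor, has_rdcl_no):
    """Return the assumed default for vm.pmap.pti (1 = enabled, 0 = disabled).

    cpu_vendor  -- CPUID vendor string, e.g. "AuthenticAMD" or "GenuineIntel"
    has_rdcl_no -- True if the CPU advertises the RDCL_NO capability
    """
    if cpu_vendor == "AuthenticAMD":
        return 0  # AMD: PTI off by default
    if has_rdcl_no:
        return 0  # CPU reports it is not affected by Meltdown
    return 1      # everything else: PTI on by default


for vendor, rdcl_no in [("AuthenticAMD", False),
                        ("GenuineIntel", True),
                        ("GenuineIntel", False)]:
    print(vendor, rdcl_no, pti_default(vendor, rdcl_no))
```

If that reading is right, only CPUs in the last category would need the
tunable set explicitly to turn PTI off.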

Thank you.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: svn commit: r296462 - in stable/9: crypto/openssl/crypto/bio crypto/openssl/crypto/bn crypto/openssl/doc/apps crypto/openssl/ssl secure/usr.bin/openssl/man

2016-03-09 Thread Jeremy Chadwick

self locally, as postfix's smtp(8) links to
libcrypt/libssl/libcrypto.  Bzzt, nope:

pid 5046 (smtp), uid 125: exited on signal 11

Mar  9 04:49:38 icarus postfix/master[802]: daemon started -- version 3.1.0, configuration /usr/local/etc/postfix
Mar  9 04:54:38 icarus postfix/pickup[5043]: 1835D1AF150: uid=1000 from=
Mar  9 04:54:38 icarus postfix/cleanup[5044]: 1835D1AF150: message-id=<20160309125438.ga5...@icarus.home.lan>
Mar  9 04:54:38 icarus postfix/qmgr[804]: 1835D1AF150: from=<j...@icarus.home.lan>, size=631, nrcpt=1 (queue active)
Mar  9 04:54:38 icarus postfix/qmgr[804]: warning: private/smtp socket: malformed response
Mar  9 04:54:38 icarus postfix/qmgr[804]: warning: transport smtp failure -- see a previous warning/fatal/panic logfile record for the problem description
Mar  9 04:54:38 icarus postfix/master[802]: warning: process /usr/local/libexec/postfix/smtp pid 5046 killed by signal 11
Mar  9 04:54:38 icarus postfix/master[802]: warning: /usr/local/libexec/postfix/smtp: bad command startup -- throttling
Mar  9 04:54:38 icarus postfix/error[5048]: 1835D1AF150: to=<j...@koitsu.org>, relay=none, delay=0.5, delays=0.05/0.44/0/0.01, dsn=4.3.0, status=deferred (unknown mail transport error)

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: stable/10: high load average when box is idle

2015-10-29 Thread Jeremy Chadwick

On Thu, Oct 29, 2015 at 11:00:32AM +0100, Miroslav Lachman wrote:
> Jeremy Chadwick wrote on 10/27/2015 06:05:
> >(I am not subscribed to the mailing list, please keep me CC'd)
> >
> >Issue: a stable/10 system that has an abnormally high load average (e.g.
> >0.15, but may be higher depending on other variables which I can't
> >account for) when the machine is definitely idle (i.e. cannot be traced
> >to high interrupt usage per vmstat -i, cannot be traced to a userland
> >process or kernel thread, etc.).
> >
> >This problem has been discussed many times on the FreeBSD mailing lists
> >and the FreeBSD forum (including some folks seeing it on 9.x, but my
> >complaint here is focused on 10.x so please focus there).
> >
> >I'd politely like to request that anyone experiencing this, or who has
> >experienced it (and if you know when it stopped or why, including what
> >you may have done, include that), to chime in on this ticket from 2012
> >(made for 9.x but style of issue still applies; c#5 is quite valid):
> >
> >https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541
> >
> >For those still experiencing it, I'd suggest reading c#8 and seeing if
> >sysctl kern.eventtimer.periodic=1 relieves the problem for you.  (At
> >this time I would not suggest leaving that set indefinitely, as it does
> >seem to increase the interrupt rate in cpuX:timer in vmstat -i.  But for
> >me kern.eventtimer.periodic=1 "fixes" the issue)
> 
> Is it on real HW server or in some kind of virtualization? I am seeing load
> 0.5 - 1.2 on three virtual machines in VMware. The machines are without any
> traffic. Just fresh instalation of FreeBSD 10.1 and some services without
> any public content.

I've seen it on both bare-metal and VMs.  Please see c#8 in the ticket;
there's an itemised list of where I've seen it, but I'm sure it's not
limited to just those.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

stable/10: high load average when box is idle

2015-10-26 Thread Jeremy Chadwick

(I am not subscribed to the mailing list, please keep me CC'd)

Issue: a stable/10 system that has an abnormally high load average (e.g.
0.15, but may be higher depending on other variables which I can't
account for) when the machine is definitely idle (i.e. cannot be traced
to high interrupt usage per vmstat -i, cannot be traced to a userland
process or kernel thread, etc.).

This problem has been discussed many times on the FreeBSD mailing lists
and the FreeBSD forum (including some folks seeing it on 9.x, but my
complaint here is focused on 10.x so please focus there).

I'd politely like to request that anyone experiencing this, or who has
experienced it (and if you know when it stopped or why, including what
you may have done, include that), to chime in on this ticket from 2012
(made for 9.x but style of issue still applies; c#5 is quite valid):

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541

For those still experiencing it, I'd suggest reading c#8 and seeing if
sysctl kern.eventtimer.periodic=1 relieves the problem for you.  (At
this time I would not suggest leaving that set indefinitely, as it does
seem to increase the interrupt rate in cpuX:timer in vmstat -i.  But for
me kern.eventtimer.periodic=1 "fixes" the issue)

Thanks.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Stable/9 from today mpssas_scsiio timeouts

2013-07-09 Thread Jeremy Chadwick

On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
> as of stable today im seeing alot of new mps time outs
>
> 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul  8 16:34:28 UTC 2013
> root@:/usr/obj/nas/usr/src/sys/
>
> mps1@pci0:130:0:0:  class=0x010700 card=0x30201000 chip=0x00721000 rev=0x03 hdr=0x00
>     vendor     = 'LSI Logic / Symbios Logic'
>     device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
>     class      = mass storage
>     subclass   = SAS
>
> mps0: mpssas_scsiio_timeout checking sc 0xff8002145000 cm 0xff80021a6b78
> (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 command timeout cm 0xff80021a6b78 ccb 0xfe002bb5f800
> mps0: mpssas_alloc_tm freezing simq
> mps0: timedout cm 0xff80021a6b78 allocated tm 0xff80021587b0
> (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 completed timedout cm 0xff80021a6b78 ccb 0xfe002bb5f800 during recovery ioc 8048 scsi 0 state c xfer 0
> (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count 1
> (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting TaskMID 983
> mps0: mpssas_free_tm releasing simq
> (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
> (probe40:mps0:0:40:0): CAM status: Command timeout
> (probe40:mps0:0:40:0): Retrying command
> mps1: mpssas_scsiio_timeout checking sc 0xff8002384000 cm 0xff80023e5b78
> (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 command timeout cm 0xff80023e5b78 ccb 0xfe002be14800
> mps1: mpssas_alloc_tm freezing simq
> mps1: timedout cm 0xff80023e5b78 allocated tm 0xff80023977b0
> (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983 completed timedout cm 0xff80023e5b78 ccb 0xfe002be14800 during recovery ioc 8048 scsi 0 state c xfer 0
> (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count 1
> (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting TaskMID 983
> mps1: mpssas_free_tm releasing simq
> (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00
> (probe292:mps1:0:37:0): CAM status: Command timeout
> (probe292:mps1:0:37:0): Retrying command

1. What revision were you running before (i.e. what were you on prior to
the upgrade)?

2. Something in your /usr/src differs from stock r253035, hence the M
at the end.  What is it?

Answer to #1 will help me narrow down the commits; there have been CAM
and mps changes fairly recently.  Otherwise you can dig through the
commits yourself (you'll need to go through many, many pages, as there
was a recent massive influx of SCTP changes (50+ commits)):

http://www.freshbsd.org/?branch=RELENG_9&project=freebsd

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Stable/9 from today mpssas_scsiio timeouts

2013-07-09 Thread Jeremy Chadwick

On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote:
 On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo outbackdi...@gmail.com wrote:
  On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick j...@koitsu.org wrote:
 
  On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
   {brevity snip}
 
  1. What revision were you running before (i.e. what were you on prior to
  the upgrade)?
 
 
 
  Sorry I was on 252595 from July 3

And does rolling back to r252595 resolve the problem for you?

Because the only commit I see between r253035 and r252595 that might
account for some kind of behavioural change, unless I missed one while
skimming the commit history, is the following:

r252730 -- http://www.freshbsd.org/commit/freebsd/r252730

If at all possible, please try updating to r253037 or newer to see
if that has some effect/improvement.  Why I mention that commit:

r253037 -- http://www.freshbsd.org/commit/freebsd/r253037

Because the only mps(4) changes done in recent days are:

http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log

r253037
r251899
r251874

Else I'd say what you're experiencing is legitimate/unrelated to kernel
changes.  I can only speculate.

The messages to me indicate that some part of the kernel is submitting a
SCSI INQUIRY request to the underlying device(s) which results in a CAM
timeout, i.e. the disk attached to the controller did not respond
promptly (while the controller seemed to be alive/well).
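
For reference, the CDB bytes in those log lines can be decoded by hand.
A minimal sketch, assuming the 6-byte INQUIRY layout as I understand it
from SPC-3 and later (opcode, EVPD bit, page code, 16-bit allocation
length, control); the function name is illustrative:

```python
def decode_inquiry_cdb(cdb_hex):
    """Decode a 6-byte SCSI INQUIRY CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 6 or cdb[0] != 0x12:
        raise ValueError("not a 6-byte INQUIRY CDB")
    return {
        "opcode": cdb[0],                             # 0x12 = INQUIRY
        "evpd": bool(cdb[1] & 0x01),                  # vital product data requested?
        "page_code": cdb[2],                          # only meaningful when EVPD is set
        "allocation_length": (cdb[3] << 8) | cdb[4],  # bytes the initiator will accept
        "control": cdb[5],
    }


fields = decode_inquiry_cdb("12 00 00 00 24 00")
print(fields)
```

The allocation length of 0x24 matches the "length 36" in the log, i.e.
this is the plain standard INQUIRY issued while probing a target, not a
vendor-specific or VPD request.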

If these disks are SSDs (we do not know the type -- no dmesg provided,
etc.), then possibly TRIM behaviour is causing the drives to take too
long to perform their TRIM cleanup, or the drives themselves are doing
some kind of garbage collection which is taking quite a long time.

Steven et al. may have a different (and almost certainly more accurate)
analysis.

It would really help if you could provide dmesg from the machine, as
well as any details about your setup (if ZFS, zpool status, etc.), in
addition to (if SSDs) sysctl -a | grep -i trim.  All this matters.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Stable/9 from today mpssas_scsiio timeouts

2013-07-09 Thread Jeremy Chadwick

On Tue, Jul 09, 2013 at 11:20:45AM -0400, Outback Dingo wrote:
 On Tue, Jul 9, 2013 at 10:46 AM, Jeremy Chadwick j...@koitsu.org wrote:
 
  {brevity snip}
 
 
 i can say this its between July 4, and 253048, im rolling back to 252723 to
 validate a good known working state

Looking at your dmesg, it looks like the errors might be for SAS ports
which don't have any actual devices (disks) attached to them, yet parts
of the kernel (not sure which layer) are still trying to submit INQUIRY
commands to those ports as if they did have disks attached.

It looks like you see this behaviour on boot up, and then later during
normal operation at some point (a LUN scan or rescan or bus taste
might cause this to happen; for example I know that zpool import in
effect can sometimes cause this behaviour -- on one of my systems zpool
import would cause the server's floppy drive to spin up/chunk briefly).

I'm hoping Steven or mav@ might be able to confirm/deny my theory here.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Stable/9 from today mpssas_scsiio timeouts

2013-07-09 Thread Jeremy Chadwick

On Tue, Jul 09, 2013 at 11:46:24AM -0400, Outback Dingo wrote:
 On Tue, Jul 9, 2013 at 11:30 AM, Jeremy Chadwick j...@koitsu.org wrote:
 
  {brevity snip}
 
 
 I see it even trying to write to the pool via NFS or FTP, which even times
 out on large files
 now, it was all working

Re: make buildworld is now 50% slower

2013-07-07 Thread Jeremy Chadwick

On Sun, Jul 07, 2013 at 11:50:29AM +0300, Daniel Braniss wrote:
> > On Fri, Jul 05, 2013 at 02:39:00PM +0200, Dimitry Andric wrote:
> > > [redirecting to the correct mailing list, freebsd-stable@ ...]
> > >
> > > On Jul 5, 2013, at 10:53, Daniel Braniss da...@cs.huji.ac.il wrote:
> > > > after today's update of 9.1-STABLE I noticed that make build[world|kernel]
> > > > are taking conciderable more time, is it because the upgrade of clang?
> > > > and if so, is the code produced any better?
> > > >
> > > > before:
> > > > buildwordl:  26m4.52s real 2h28m32.12s user 36m6.27s sys
> > > > buildkernel: 7m29.42s real 23m22.22s user 4m26.26s sys
> > > >
> > > > today:
> > > > buildwordl: 34m29.80s real 2h38m9.37s user 37m7.61s sys
> > > > buildkernel:15m31.52s real 22m59.40s user 4m33.06s sys
> > >
> > > Ehm, your user and sys times are not that much different at all, they
> > > add up to about 5% slower for buildworld, and 1% faster for build kernel.
> > > Are you sure nothing else is running on that machine, eating up CPU time
> > > while you are building? :)
> > >
> > > But yes, clang 3.3 is of course somewhat larger than 3.2.  You might
> > > especially notice that, if you are using gcc, which is very slow at
> > > compiling C++.
> > >
> > > In any case, if you do not care about clang, just set WITHOUT_CLANG= in
> > > your /etc/src.conf, and you can shave off some build time.
> >
> > I just built world/kernel (stable/9 r252769) 5 hours ago.  Results:
> >
> > time make -j4 buildworld  = roughly 21 minutes on my hardware
> > time make -j4 buildkernel = roughly 8 minutes on my hardware
> >
>
> It's been a long time since I saw such numbers, maybe it's time
> to see where time is being spent, I will run it without clang to compare with
> your numbers.
>
> > These numbers are about the norm for me, meaning I do not see a
> > substantial increase in build times.
> >
> > Key point: I do not use/build/grok clang, i.e. WITHOUT_CLANG=true is in
> > my src.conf.  But I am aware of the big clang change in r252723.
> >
> > If hardware details are wanted, ask, but I don't think it's relevant to
> > what the root cause is.
> >
>
> from what you are saying, I guess clang is not responsible.
> looking for my Sherlock Holmes hat.

Some points to those numbers I stated above:

- System is an Intel Q9550 with 8GB of RAM

- Single SSD (UFS2+SU+TRIM) is used for root, /usr, /var, /tmp, and swap

- /usr/src is on ZFS (raidz1 + 3 disks) -- however I got equally small
numbers when it was on the SSD

- /usr/src is using compression=lz4  (to folks from -fs: yeah, I'm
trying it out to see how much of an impact it has on interactivity.  I
can still tell when it kicks in, but it's way, way better than lzjb.
Rather not get into that here)

- Contents of /etc/src.conf (to give you some idea of what I disable):

WITHOUT_ATM=true
WITHOUT_BLUETOOTH=true
WITHOUT_CLANG=true
WITHOUT_FLOPPY=true
WITHOUT_FREEBSD_UPDATE=true
WITHOUT_INET6=true
WITHOUT_IPFILTER=true
WITHOUT_IPX=true
WITHOUT_KERBEROS=true
WITHOUT_LIB32=true
WITHOUT_LPR=true
WITHOUT_NDIS=true
WITHOUT_NETGRAPH=true
WITHOUT_PAM_SUPPORT=true
WITHOUT_PPP=true
WITHOUT_SENDMAIL=true
WITHOUT_WIRELESS=true
WITH_OPENSSH_NONE_CIPHER=true

It's WITHOUT_CLANG that cuts down the buildworld time by a *huge* amount
(I remember when it got introduced, my buildworld jumped up to something
like 40 minutes); the rest probably save a minute or two at most.

- /etc/make.conf doesn't contain much that's relevant, other than:

CPUTYPE?=core2

# For DTrace; also affects ports
STRIP=
CFLAGS+=-fno-omit-frame-pointer

- I do some tweaks in /etc/sysctl.conf (mainly vfs.read_min and
vfs.read_max), but I will admit I am not completely sure what those
do quite yet (I just saw the commit from scottl@ a while back talking
about how an increased vfs.read_min helps them at Netflix quite a
lot).  I also adjust kern.maxvnodes.

- Some ZFS ARC settings are adjusted in /boot/loader.conf (I'm playing
with some stuff I read in Andriy Gapon's ZFS PDF), but they definitely
do not have a major impact on the numbers I listed off.

- I do increase kern.maxdsiz, kern.dfldsiz, and kern.maxssiz in
/boot/loader.conf to 2560M/2560M/256M respectively, but that was mainly
from the days when I ran MySQL and needed huge userland processes.

All in all my numbers are low/small because of two things: the SSD, and
WITHOUT_CLANG.

Hope this gives you somewhere to start/stuff to ponder.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: USB ports on Lenovo T400 do not work after a suspend/resume

2013-07-07 Thread Jeremy Chadwick

On Sun, Jul 07, 2013 at 03:51:12PM +1000, Ian Smith wrote:
 On Sun, 30 Jun 2013 15:02:57 -0700, Adrian Chadd wrote:
   On 30 June 2013 07:22, Ian Smith smi...@nimnet.asn.au wrote:
 [..]
Nothing of note that I can see, if that usb hub-to-bus remapping is
normal.  As you said, 'CPU0: local APIC error 0x40' looks maybe sus.
Maybe someone who knows might comment on that?
 
 Does noone know what that signifies?  Maybe it's not relevant to this.

It's too vague to know.  The error comes from lapic_handle_error(),
which is a generic/small routine which pulls the local APIC error status
register.  (Note I'm saying APIC, not ACPI -- two different things)

apic_vector.S sets this up/makes use of this function, and it's done as
an interrupt handler.

I think this is one of those situations where you have to know *what* is
being set up/done at that moment in time for the error code to mean
something.  Maybe booting verbose would give more information as to what
was being done that led up to the line.

I've CC'd John Baldwin who might have some ideas.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: make buildworld is now 50% slower

2013-07-07 Thread Jeremy Chadwick

On Sun, Jul 07, 2013 at 05:47:31AM -0500, Matthew D. Fuller wrote:
> Apropos of nothing, but...
>
> On Sun, Jul 07, 2013 at 03:17:14AM -0700 I heard the voice of
> Jeremy Chadwick, and lo! it spake thus:
>
> > WITHOUT_LIB32=true
>
> suggests you're running amd64, which I'm pretty sure means
>
> > - I do increase kern.maxdsiz, kern.dfldsiz, and kern.maxssiz in
> > /boot/loader.conf to 2560M/2560M/256M respectively, but that was mainly
> > from the days when I ran MySQL and needed a huge userland processes.
>
> are not necessarily _in_creases, and may well be mostly _de_creases.
> e.g., on a RELENG_9 box with 8 gig of physical RAM:
>
> % sysctl kern.{max{d,s},dfld}siz
> kern.maxdsiz: 34359738368
> kern.maxssiz: 536870912
> kern.dfldsiz: 134217728
>
> while a -CURRENT box with 16 has dfldsiz blown all the way up too.  I
> don't recall doing anything to change them at all recently, and a
> glance over loader.conf, sysctl.conf, rc.local, and the kernel configs
> doesn't turn up anything.

Thanks!

The settings I mention are from ancient times -- specifically RELENG_6
on i386 (I know because I found an old mailing list post of mine
discussing the settings with a user).

The problem as I said was that mysqld would crap itself (crash and be
quite loud about it) if the process allocated too much memory/became too
large.  I am fairly certain the issue related to the data size, **not**
the stack size (but I didn't see the harm in increasing that either).

It's good to know I can remove these on amd64.  Yay, one less thing in
loader.conf I have to deal with...  :-)  Thanks again!
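
The comparison Matthew describes can be sketched as follows.  This
assumes loader.conf size suffixes are 1024-based (K/M/G) and uses the
sysctl output he posted from his RELENG_9 amd64 box as the defaults;
the helper names are illustrative.

```python
def parse_size(text):
    """Parse a loader.conf-style size such as "2560M" into bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def parse_sysctl(output):
    """Parse "name: value" lines of sysctl output into a dict of ints."""
    values = {}
    for line in output.strip().splitlines():
        name, _, value = line.partition(":")
        values[name.strip()] = int(value)
    return values


# Defaults reported on the RELENG_9 amd64 box with 8GB of RAM
defaults = parse_sysctl("""
kern.maxdsiz: 34359738368
kern.maxssiz: 536870912
kern.dfldsiz: 134217728
""")

# Values carried over in /boot/loader.conf from the i386 days
overrides = {
    "kern.maxdsiz": parse_size("2560M"),
    "kern.dfldsiz": parse_size("2560M"),
    "kern.maxssiz": parse_size("256M"),
}

for name, new in overrides.items():
    old = defaults[name]
    verdict = "increase" if new > old else "decrease" if new < old else "same"
    print(f"{name}: {old} -> {new} ({verdict})")
```

Against those defaults only kern.dfldsiz goes up; kern.maxdsiz and
kern.maxssiz both end up lower than what amd64 provides out of the box.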

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-07-07 Thread Jeremy Chadwick

  
 0 1
 /dev/da0s1d   /usr     ufs   noatime,rw                              0 2
 /dev/da0s1e   /var     ufs   noatime,nosuid,rw                       0 2
 /dev/da10p1   /share2  ufs   suiddir,groupquota,noatime,nosuid,rw    0 2
 /dev/da10p2   /raid2   ufs   userquota,noatime,nosuid,rw             0 2

Where is gstripe(8) in that picture?  Are you **sure** this is the same
system?  Surely I'm missing something here...

Can you provide details of the stripe, specifically "gstripe list" output,
so I can see what the disks are and then ask you for "smartctl -a" output
for each of them?  That is to try and rule out disk-level problems that
may be causing oddities at the layer underneath the filesystem (sometimes
fsck will not catch this).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: make buildworld is now 50% slower

2013-07-05 Thread Jeremy Chadwick

On Fri, Jul 05, 2013 at 02:39:00PM +0200, Dimitry Andric wrote:
 [redirecting to the correct mailing list, freebsd-stable@ ...]
 
 On Jul 5, 2013, at 10:53, Daniel Braniss da...@cs.huji.ac.il wrote:
  after today's update of 9.1-STABLE I noticed that make build[world|kernel]
  are taking considerably more time, is it because the upgrade of clang?
  and if so, is the code produced any better?
  
  before:
  buildworld:  26m4.52s real   2h28m32.12s user  36m6.27s sys
  buildkernel: 7m29.42s real   23m22.22s user    4m26.26s sys
  
  today:
  buildworld:  34m29.80s real  2h38m9.37s user   37m7.61s sys
  buildkernel: 15m31.52s real  22m59.40s user    4m33.06s sys
 
 Ehm, your user and sys times are not that much different at all, they
 add up to about 5% slower for buildworld, and 1% faster for buildkernel.
 Are you sure nothing else is running on that machine, eating up CPU time
 while you are building? :)
 
 But yes, clang 3.3 is of course somewhat larger than 3.2.  You might
 especially notice that, if you are using gcc, which is very slow at
 compiling C++.
 
 In any case, if you do not care about clang, just set WITHOUT_CLANG= in
 your /etc/src.conf, and you can shave off some build time.

I just built world/kernel (stable/9 r252769) 5 hours ago.  Results:

time make -j4 buildworld  = roughly 21 minutes on my hardware
time make -j4 buildkernel = roughly 8 minutes on my hardware

These numbers are about the norm for me, meaning I do not see a
substantial increase in build times.

Key point: I do not use/build/grok clang, i.e. WITHOUT_CLANG=true is in
my src.conf.  But I am aware of the big clang change in r252723.

If hardware details are wanted, ask, but I don't think it's relevant to
what the root cause is.
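
For anyone wanting to check the percentages Dimitry quotes, here is a
minimal sketch that redoes the arithmetic on Daniel's numbers.  The
helper names (to_seconds, pct_change, cpu_total) are made up for this
example; the times are copied from the quoted message.

```shell
#!/bin/sh
# Convert a time(1)-style value such as 2h28m32.12s or 36m6.27s to seconds.
to_seconds() {
    echo "$1" | awk '{
        t = $0; s = 0
        if (match(t, /^[0-9]+h/)) { s += substr(t, 1, RLENGTH - 1) * 3600; t = substr(t, RLENGTH + 1) }
        if (match(t, /^[0-9]+m/)) { s += substr(t, 1, RLENGTH - 1) * 60;   t = substr(t, RLENGTH + 1) }
        sub(/s$/, "", t); s += t
        printf "%.2f\n", s
    }'
}

# Percent change from $1 to $2 (both in seconds), with an explicit sign.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (b - a) / a * 100 }'
}

# CPU time = user + sys, in seconds.
cpu_total() {
    awk -v u="$(to_seconds "$1")" -v s="$(to_seconds "$2")" 'BEGIN { printf "%.2f\n", u + s }'
}

echo "buildworld  real: $(pct_change "$(to_seconds 26m4.52s)" "$(to_seconds 34m29.80s)")%"
echo "buildworld  cpu:  $(pct_change "$(cpu_total 2h28m32.12s 36m6.27s)" "$(cpu_total 2h38m9.37s 37m7.61s)")%"
echo "buildkernel real: $(pct_change "$(to_seconds 7m29.42s)" "$(to_seconds 15m31.52s)")%"
echo "buildkernel cpu:  $(pct_change "$(cpu_total 23m22.22s 4m26.26s)" "$(cpu_total 22m59.40s 4m33.06s)")%"
```

The point of splitting real from user+sys: wall-clock time grew a lot
while CPU time barely moved, which is why Dimitry asks whether something
else was running on the machine.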

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: When will subversion be ready for updating/upgrading src ports?

2013-07-05 Thread Jeremy Chadwick

On Fri, Jul 05, 2013 at 08:38:07PM -0700, bsd-li...@hush.com wrote:
 Greetings,
  Well after posting a couple of questions to the list regarding questions I
 had before migrating from (cv)sup to subversion, I took the leap:
 
 mv /usr/src/ /usr/src.old/
 
 mkdir /usr/src
 
 mv /usr/ports/ /usr/ports.old/
 
 mkdir /usr/ports
 
 rm -fr /var/db/sup/*
 rm -fr /var/db/portsnap/*
 
 svn checkout svn://svn.freebsd.org/base/stable/8 /usr/src
 
 svn checkout svn://svn.freebsd.org/ports/head /usr/ports
 
 I then performed a portmaster -a
 
 which left me with a non-working X desktop.
 Turned out to be a problem with the Nvidia driver -- was 2.9.40, now 3.10.14.
 But loading it in loader.conf didn't create /dev/nvidia0, or /dev/nvidiactl
 To make a long story short, I attempted to update my src & ports, and try
 again;
 
 svn update svn://svn.freebsd.org/ports/head /usr/ports
 FAILED! I don't have the exact output

Incorrect syntax -- should be one of the following (your choice):

  cd /usr/ports && svn update
  svn update /usr/ports

 So I tried:
 cd /usr/ports
 svn update
 Which replied:
 svn: E155036: Please see the 'svn upgrade' command
 svn: E155036: The working copy at '/usr/ports'
 is too old (format 29) to work with client version '1.8.0 (r1490375)'
 (expects format 31). You need to upgrade the working copy first.
 
 So I guess subversion isn't (yet) designed for this sort of stuff, which 
 leaves me with a useless box. :(

Incorrect.  Please look very, VERY closely at what the command is that
it's telling you to use.  Read it 4 times over.  Pay close attention.

The explanation:

You installed subversion 1.7 or earlier when you originally started
(i.e. subversion-1.7 or 1.6 or something else was installed).  No
problem.

You then updated your ports tree.  No problem.

You then ran portmaster -a to upgrade/update all your ports (rebuild
them).  No problem.  However this updated subversion to the latest in
ports, which is 1.8.

The subversion metadata (stored in the .svn directories, ex.
/usr/src/.svn, /usr/ports/.svn, etc.) has changed as of 1.8.  This is
why you need to do "svn upgrade" in those directories.

This is a one-time thing you have to do.  That's all.
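
As a rough sketch of that one-time step, assuming the two working copies
live in /usr/src and /usr/ports as in the original post (upgrade_wc is
just an illustrative helper name):

```shell
#!/bin/sh
# Upgrade one working copy to the format the installed svn client expects,
# then update it.  Returns 1 if the directory is not a working copy.
upgrade_wc() {
    wc=$1
    if [ ! -d "$wc/.svn" ]; then
        echo "skip: $wc has no .svn directory" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left alone.
    (cd "$wc" && svn upgrade && svn update)
}

for wc in /usr/src /usr/ports; do
    upgrade_wc "$wc" || true    # keep going if one tree is missing or fails
done
```

After this, a plain "svn update /usr/ports" should work again with the
1.8 client.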

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: UFS Trim wont stay set

2013-07-04 Thread Jeremy Chadwick

On Thu, Jul 04, 2013 at 03:37:28PM -0400, Mike Jakubik wrote:
 Hello,
 
 I've just installed a stable snapshot on a new machine with a SSD
 drive, after installing i booted single user mode and ran
 
 # tunefs -t enable /dev/ada0p2
 tunefs: issue TRIM to the disk set
 
 Great, back to multiuser mode, i check the partition
 
 # tunefs -p /dev/ada0p2
 tunefs: POSIX.1e ACLs: (-a)disabled
 tunefs: NFSv4 ACLs: (-N)   disabled
 tunefs: MAC multilabel: (-l)   disabled
 tunefs: soft updates: (-n) enabled
 tunefs: soft update journaling: (-j)   enabled
 tunefs: gjournal: (-J) disabled
 tunefs: trim: (-t) disabled
 
 What the heck.. did i miss something? Back to single user mode and
 
 # tunefs -t enable /dev/ada0p2
 tunefs: issue TRIM to the disk remains unchanged as enabled
 
 I check again in multiuser mode and it says disabled, any ideas what
 is going on here?

Yup, experienced this myself many times over.  The reasons are
understood (it's not limited to just the TRIM bits, it's related to
anything adjusting the superblock -- it gets cached in memory in certain
situations and not flushed back to disk).

Hint: are you booting into single user and then issuing a mount
command before doing your tunefs stuff?  If so, this is probably
what's causing it (at least it was in my case).

Instead just boot into single-user, do not mount anything, and use
/sbin/tunefs (if available -- depends on your filesystem setup) or
/rescue/tunefs.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: UFS Trim wont stay set

2013-07-04 Thread Jeremy Chadwick

On Thu, Jul 04, 2013 at 04:48:38PM -0400, Mike Jakubik wrote:
 On 07/04/13 16:33, Jeremy Chadwick wrote:
 Yup, experienced this myself many times over. The reasons are
 understood (it's not limited to just the TRIM bits, it's related
 to anything adjusting the superblock -- it gets cached in memory
 in certain situations and not flushed back to disk). Hint: are you
 booting into single user and then issuing a mount command before
 doing your tunefs stuff? If so, this is probably what's causing it
 (at least it was in my case). Instead just boot into single-user,
 do not mount anything, and use /sbin/tunefs (if available --
 depends on your filesystem setup) or /rescue/tunefs.
 
 I booted in to single user mode and the system mounted the only file
 system there, which is mounted at /. What i did now however is boot
 off a Live CD and run tunefs, this did the trick!

I talked with Andriy Gapon a couple years ago about this, actually.  I
had to dig up the thread.  Here are the relevant parts (read in order):

http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062921.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062922.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062923.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062924.html

Make sure you read Andriy's comments (2nd URL) in full.  My follow-up
(4th URL) confirms that the mount -a (which is what made / read-write
since /etc/fstab obviously has / as rw) was causing the issue.  He
explains the reason.
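
A minimal sketch of a pre-flight check based on the above: before
running tunefs, confirm that / has not been remounted read-write.  This
assumes FreeBSD mount(8) prints lines of the form
"/dev/ada0p2 on / (ufs, local, read-only)"; root_is_readonly is a
made-up helper name, and the script only prints advice rather than
touching the filesystem.

```shell
#!/bin/sh
# Reads mount(8)-style lines on stdin; succeeds only if / is listed read-only.
# Assumption: FreeBSD-style output, e.g. "/dev/ada0p2 on / (ufs, local, read-only)".
root_is_readonly() {
    grep ' on / (' | grep -q 'read-only'
}

if mount | root_is_readonly; then
    echo "/ is read-only: /rescue/tunefs -t enable /dev/ada0p2 can be tried now"
else
    echo "/ is read-write or not recognised: tunefs changes made now may not stick"
fi
```

If the setting still does not persist, booting from a Live CD (as Mike
did) avoids having the filesystem mounted at all.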

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ZFS Panic after freebsd-update

2013-07-02 Thread Jeremy Chadwick

On Tue, Jul 02, 2013 at 08:59:56AM +0300, Andriy Gapon wrote:
 on 01/07/2013 21:50 Jeremy Chadwick said the following:
  The issue is that ZFS on FreeBSD is still young compared to other
  filesystems (specifically UFS).
 
 That's a fact.
 
  Nothing is perfect, but FFS/UFS tends
  to have a significantly larger number of bugs worked out of it to the
  point where people can use it without losing sleep (barring the SUJ
  stuff, don't get me started).
 
 That's subjective.
 
  I have the same concerns over other
  things, like ext2fs and fusefs for that matter -- but this thread is
  about a ZFS-related crash, and that's why I'm over-focused on it.
 
 I have an impression that you seem to state your (negative) opinion of ZFS in
 every other thread about ZFS problems.

The OP in question ended his post with the line "Thoughts?", and I have
given those thoughts.  My thoughts/opinions/experience may differ from
that of others.  Diversity of thoughts/opinions/experiences is good.
I'm not some kind of authoritative ZFS guru -- far from it.  If I
misunderstood what "Thoughts?" meant/implied, then draw and quarter me
for it; my actions/words = my responsibility.

I do not feel I have a negative opinion of ZFS.  I still use it today
on FreeBSD, donated money to Pawel when the project was originally
announced (because I wanted to see something new and useful thrive on
FreeBSD), and try my best to assist with issues pertaining to it where
applicable.  These are not the actions of someone with a negative
opinion, these are the actions of someone who is supportive while
simultaneously very cautious.

Is ZFS better today than it was when it was introduced?  By a long shot.
For example, on my stable/9 system here I don't tune /boot/loader.conf
any longer.  But that doesn't change my viewpoint when it comes to using
ZFS exclusively on a FreeBSD box.

  A heterogeneous (UFS+ZFS) setup, rather than homogeneous (ZFS-only),
  results in a system where an admin can upgrade + boot into single-user
  and perform some tasks to test/troubleshoot; if the ZFS layer is
  broken, it doesn't mean an essentially useless box.  That isn't FUD,
  that's just the stage we're at right now.  I'm aware lots of people have
  working ZFS-exclusive setups; like I said, works great until it
  doesn't.
 
 Yeah, a heterogeneous setup can have its benefits, but it can have its
 drawbacks too.  This is true for heterogeneous vs monoculture in general.
 But the sword cuts both ways: what if something is broken in UFS layer or
 god forbid in VFS layer and you have only UFS?
 Besides, without mentioning specific classes of problems "ZFS layer is
 broken" is too vague.

The likelihood of something being broken in UFS is significantly lower
given its established history.  I have to go off of experience, both
personal and professional -- in my years of dealing with FreeBSD
(1997-present), I have only encountered issues with UFS a few times (I
can count them on one, maybe two hands), and I'm choosing to exclude
SU+J from the picture for what should be obvious reasons.  With ZFS,
well... just look at the mailing lists and PR count.  I don't want to be
a jerk about it, but you really have to look at the quantity.  It
doesn't mean ZFS is crap, it just means that for me, I don't think
we're quite there yet.

And I will gladly admit -- because you are the one who taught me this --
that every incident needs to be treated as unique.  But one can't deny that a
substantial percentage (I would say majority) of -fs and -stable posts
relate somehow to ZFS; I'm often thrilled when it turns out to be
something else.

Playing a strange devil's advocate, let me give you an interesting
example: softupdates.  When SU was introduced to FreeBSD back in the
late 90s, there were issues and concerns -- lots.  As such, SU was
chosen to be disabled by default on root filesystems given the
importance of that filesystem (re: we do not want to risk losing as
much data in the case of a crash -- see the official FAQ, section 8.3).
All other filesystems defaulted to SU enabled.  It's been like that up
until 9.x where it now defaults to enabled.  So that's what, 15 years?

You could say that my example could also apply to ZFS, i.e. the reports
are a part of its growth and maturity, and I'd agree.  But I don't feel
it's reached the point where I'm willing to risk going ZFS-only.  Down
the road, sure, but not now.  That's just my take on it.

Please make sure to also consider, politely, that a lot of people who
have issues with ZFS have not been subscribed to the lists for long
periods of time.  They sign up/post when they have a problem.  Meaning:
they do not necessarily know of the history.  If they did, I (again
politely) believe they're likely to use a UFS+ZFS mix, or maybe a
gmirror+UFS+ZFS mix (though the GPT/gmirror thing is... never mind...).

  So, how do you kernel guys debug a problem in this environment:
  
  - ZFS-only
  - Running -RELEASE (i.e. no source, thus a kernel cannot

Re: ZFS Panic after freebsd-update

2013-07-01 Thread Jeremy Chadwick

On Mon, Jul 01, 2013 at 11:35:30AM -0400, Scott Sipe wrote:
 *** Sorry for partial first message! (gmail sent after multiple returns
 apparently?) ***
 
 Hello,
 
 I have not had much time to research this problem yet, so please let me
 know what further information I might be able to provide.
 
 This weekend I attempted to upgrade a computer from 8.2-RELEASE-p3 to 8.4
 using freebsd-update. After I rebooted to test the new kernel, I got a
 panic. I had to take a picture of the screen. Here's a condensed version:
 
 panic: page fault
 cpuid = 1
 KDB: stack backtrace:
 #0 kdb_backtrace
 #1 panic
 #2 trap_fatal
 #3 trap_pfault
 #4 trap
 #5 calltrap
 #6 vdev_mirror_child_select
 #7 ved_mirror_io_start
 #8 zio_vdev_io_start
 #9 zio_execute
 #10 arc_read
 #11 dbuf_read
 #12 dbuf_findbp
 #13 dbuf_hold_impl
 #14 dbuf_hold
 #15 dnode_hold_impl
 #16 dnu_buf_hold
 #17 zap_lockdir
 Uptime: 5s
 Cannot dump. Device not defined or unavailable.
 Automatic reboot in 15 seconds - press a key on the console to abort
 
 uname -a from before (and after) the reboot:
 
 FreeBSD xeon 8.2-RELEASE-p3 FreeBSD 8.2-RELEASE-p3 #0: Tue Sep 27 18:45:57
 UTC 2011 r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
  amd64
 
 dmesg is attached.
 
 I was able to reboot to the old kernel and am up and running back on 8.2
 right now.
 
 Any thoughts?

Thoughts:

- All I see is an amd64 system with 16GB RAM and 4 disks driven by an ICH10
  in AHCI mode.

- Output from: zpool status

- Output from: zpool get all

- Output from: zfs get all

- Output from: gpart show -p for every disk on the system

- Output from: cat /etc/sysctl.conf

- Output from: cat /boot/loader.conf

- Is there a reason you do not have dumpdev defined in /etc/rc.conf (or
  alternately, no swap device defined in /etc/fstab (which will get
  used/honoured by the dumpdev=auto (the default)) ?  Taking photos of
  the console and manually typing backtraces in is borderline worthless.
  Of course when I see lines like this:

  Trying to mount root from zfs:zroot

  ...this greatly diminishes any chances of live debugging on the
  system.  It amazes me how often I see this come up on the lists -- people
  who have ZFS problems but use ZFS for their root/var/tmp/usr.  I wish
  that behaviour would stop, as it makes debugging ZFS a serious PITA.
  This comes up on the list almost constantly, sad panda.

- Get yourself stable/9 and try that:
  https://pub.allbsd.org/FreeBSD-snapshots/

- freebsd-fs is a better place for this discussion, especially since
  you're running a -RELEASE build, not a -STABLE build.
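
The outputs requested in the list above can be gathered in one pass.
This is only a sketch: the collect helper is an illustrative name, and
the commands are exactly the ones listed, run as-is with failures noted
rather than treated as fatal.

```shell
#!/bin/sh
# Run one command, label its output, and carry on if it fails or is missing.
collect() {
    echo "=== $* ==="
    "$@" 2>&1 || echo "(failed or unavailable: $1)"
    echo
}

collect zpool status
collect zpool get all
collect zfs get all
collect gpart show -p
collect cat /etc/sysctl.conf
collect cat /boot/loader.conf
```

Redirect the output to a file (for example "sh collect.sh > report.txt",
both names hypothetical) and attach that to the list post.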

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ZFS Panic after freebsd-update

2013-07-01 Thread Jeremy Chadwick

On Mon, Jul 01, 2013 at 08:49:25AM -0700, Jeremy Chadwick wrote:
 - Is there a reason you do not have dumpdev defined in /etc/rc.conf (or
   alternately, no swap device defined in /etc/fstab (which will get
   used/honoured by the dumpdev=auto (the default)) ?

This should have read "or alternately, ***A*** swap device defined in
/etc/fstab ..."
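
A quick way to check the second condition is sketched below: it looks
for an uncommented swap entry in an fstab-format file.  The
has_swap_entry helper is a made-up name, and the script only reports
what it finds.

```shell
#!/bin/sh
# Succeeds if the given fstab-format file has an uncommented line whose
# third field (the filesystem type) is "swap".
has_swap_entry() {
    awk '$1 !~ /^#/ && $3 == "swap" { found = 1 } END { exit !found }' "$1"
}

fstab=${1:-/etc/fstab}
if [ -r "$fstab" ] && has_swap_entry "$fstab"; then
    echo "swap entry found in $fstab; the default dumpdev setting can use it"
else
    echo "no swap entry in $fstab; a panic will have nowhere to dump"
fi
```

Without either a swap entry or an explicit dumpdev, the panic ends with
"Cannot dump. Device not defined or unavailable.", as in the original
report.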

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ZFS Panic after freebsd-update

2013-07-01 Thread Jeremy Chadwick

On Mon, Jul 01, 2013 at 12:23:45PM -0400, Paul Mather wrote:
 On Jul 1, 2013, at 11:49 AM, Jeremy Chadwick j...@koitsu.org wrote:
 
  On Mon, Jul 01, 2013 at 11:35:30AM -0400, Scott Sipe wrote:
  *** Sorry for partial first message! (gmail sent after multiple returns
  apparently?) ***
  
  Hello,
  
  I have not had much time to research this problem yet, so please let me
  know what further information I might be able to provide.
  [[...]]
  Any thoughts?
  
  Thoughts:
  
  [[..]]
  Of course when I see lines like this:
  
   Trying to mount root from zfs:zroot
  
   ...this greatly diminishes any chances of live debugging on the
   system.  It amazes me how often I see this come up on the lists -- people
   who have ZFS problems but use ZFS for their root/var/tmp/usr.  I wish
   that behaviour would stop, as it makes debugging ZFS a serious PITA.
   This comes up on the list almost constantly, sad panda.
 
 
 I'm not sure why it amazes you that people are making widespread use of ZFS.

It's not widespread use of ZFS.  It's widespread use of ZFS as their
sole filesystem (specifically root/var/tmp/usr, or more specifically
just root/usr).  People are operating with the belief that ZFS "just
works", when reality shows it "works until it doesn't".  The mentality
seems to be "it's so rock solid it'll never break" along with "it can't
happen to me".  I tend to err on the side of caution, hence avoidance of
ZFS for critical things like the aforementioned.

It's different if you have a UFS root/var/tmp/usr and ZFS for everything
else.  You then have a system you can boot/use without issue even if ZFS
is crapping the bed.

 You could make the same argument that people shouldn't use UFS2
 journaling on their file systems because bugs in the implementation
 might make debugging journaled UFS2 file systems a serious PITA.

Yup, and I do make that argument, quite regularly at that.  There is
even some evidence at this point in time that softupdates are broken:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-June/017424.html

 The point is that there are VERY compelling reasons why people might
 want to use ZFS for root/var/tmp/usr/etc. (pooled storage; easy
 snapshots; etc.) and there should come a time when a given file system
 is generally regarded as safe.

While there may be compelling reasons, those reasons quickly get shot
down when they realise they have a system they can't easily do
troubleshooting with when the issue is with ZFS.

 I'd say the time for ZFS came when they removed the big disclaimer
 from the boot messages.  If ZFS is dangerous, they should reinstate
 the not ready for production warning.  Until they do, I think it's
 unfair to castigate people for using ZFS universally.

The warning meant absolutely nothing at the time (it did not keep people
away from it), and would mean nothing now if brought back.  A single
kernel printf() is not the right choice of action.

Are we better off today than we were when ZFS was originally ported
over?  Yes, by far.  Lots of improvements, in many great/good ways.  No
argument there.  But there is no way I'd risk putting my root filesystem
(or other key filesystems) on it -- still too new, still too many bugs,
and users don't know about those problems until it's too late.

 Isn't it a recurring theme on freebsd-current and freebsd-stable that
 more people need to use features so they can be debugged in realistic
 environments?  If you're telling them, don't use that because it
 makes debugging harder, how are they supposed to get debugged and
 hence improved? :-)

95% of FreeBSD users cannot debug kernel problems**.  To debug a kernel
problem, you need: a crash dump, a usable system with the exact
kernel/world where the crash happened (i.e. you cannot crash 8.4 ZFS and
boot into 8.2 and reliably debug it using that), and (most important of
all) a developer who is familiar with kernel debugging *and* familiar
with the bits which are crashing.  Those who say what you're quoting are
often the latter.

Part of the "need people to try this" process you refer to is what
stable/X is about, *without* the extra chaos of head.  I'm one of those
who for the past 15 years has advocated stable/X usage for a lot of
reasons; I'll save the diatribe for some other time.

But the OP is running -RELEASE, and chooses to run that, along with use
of freebsd-update for binary updates.  Their choices are limited: stick
with 8.2, switch to stable/X, cease use of ZFS, or change OSes entirely.

But even stable/X doesn't provide enough coverage at times (the recent
fxp(4)/dhclient issue is proof of that).  It's just too bad so many
people have this broken mindset of what stability means on FreeBSD.

** = This number is probably more like 99%, especially when you consider
what FreeNAS is catering to/trying to accomplish.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others

Re: ZFS Panic after freebsd-update

2013-07-01 Thread Jeremy Chadwick

On Mon, Jul 01, 2013 at 02:04:24PM -0400, Scott Sipe wrote:
 On Mon, Jul 1, 2013 at 1:04 PM, Jeremy Chadwick j...@koitsu.org wrote:
 
  On Mon, Jul 01, 2013 at 12:23:45PM -0400, Paul Mather wrote:
   On Jul 1, 2013, at 11:49 AM, Jeremy Chadwick j...@koitsu.org wrote:
  
Of course when I see lines like this:
   
 Trying to mount root from zfs:zroot
   
 ...this greatly diminishes any chances of live debugging on the
 system.  It amazes me how often I see this come up on the lists --
  people
 who have ZFS problems but use ZFS for their root/var/tmp/usr.  I wish
 that behaviour would stop, as it makes debugging ZFS a serious PITA.
 This comes up on the list almost constantly, sad panda.
  
  
   I'm not sure why it amazes you that people are making widespread use of
  ZFS.
 
  It's not widespread use of ZFS.  It's widespread use of ZFS as their
  sole filesystem (specifically root/var/tmp/usr, or more specifically
  just root/usr).  People are operating with the belief that ZFS just
  works, when reality shows it works until it doesn't.  The mentality
  seems to be it's so rock solid it'll never break along with it can't
  happen to me.  I tend to err on the side of caution, hence avoidance of
  ZFS for critical things like the aforementioned.
 
  It's different if you have a UFS root/var/tmp/usr and ZFS for everything
  else.  You then have a system you can boot/use without issue even if ZFS
  is crapping the bed.
 
 
 
  ...
 
 
 
  95% of FreeBSD users cannot debug kernel problems**.  To debug a kernel
  problem, you need: a crash dump, a usable system with the exact
  kernel/world where the crash happened (i.e. you cannot crash 8.4 ZFS and
  boot into 8.2 and reliably debug it using that), and (most important of
  all) a developer who is familiar with kernel debugging *and* familiar
  with the bits which are crashing.  Those who say what you're quoting are
  often the latter.
 
 
 
  ...
 
 
 
  But the OP is running -RELEASE, and chooses to run that, along with use
  of freebsd-update for binary updates.  Their choices are limited: stick
  with 8.2, switch to stable/X, cease use of ZFS, or change OSes entirely.
 
 
 So I realize that neither 8.2-RELEASE or 8.4-RELEASE are stable, but I
 ultimately wasn't sure where the right place to go to discuss 8.4 is?

For filesystem issues, freebsd-fs@ is usually the best choice, because
it discusses filesystem-related things (regardless of stable vs. release,
but knowing what version you have of course is mandatory).

freebsd-stable@ is mainly for stable/X related discussions.

Sorry to add pedanticism to an already difficult situation for you (and
I sympathise, particularly since the purpose of the lists is often
difficult to discern, even with their terse descriptions in mailman).

 Beyond the FS mailing list, was there a better place for my question? I'll
 provide the other requested information (zfs outputs, etc) to wherever
 would be best.

Nope, not as far as I know.  The only other place is send-pr(1), once
you have an issue that can be reproduced.

Keep in mind, however, that none of these options (mailing lists,
send-pr, etc.) mandate a response from anyone.  You/your business (see
below) should be aware that there is always the possibility no one can
help solve the actual problem; as such it's important that companies
have proper upgrade/migration paths, rollback plans, and so on.

 This is a production machine (has been since late 2010) and after tweaking
 some ZFS settings initially has been totally stable. I wasn't incredibly
 closely involved in the initial configuration, but I've done at least one
 binary freebsd-update previously.

Well regardless it sounds like moving from 8.2-RELEASE to 8.4-RELEASE
causes ZFS to break for you, so that would classify as a regression.
What the root cause is, however, is still unknown.

Point: 8.2-RELEASE came out in February 2011, and 8.4-RELEASE came out
in June 2013 -- that's almost 2.5 years of changes between versions.
The number of changes between these two is major -- hundreds, maybe
thousands.  ZFS got worked on heavily during this time as well.

I tend to tell anyone using ZFS that they should be running a stable/X
(particularly stable/9) branch.  I can expand on that justification if
needed, as it's well-founded for a lot of reasons.

 Before this computer I had always done source upgrades. ZFS (and the
 thought of a panic like the one I saw this weekend!) made me leery of doing
 that. We're a small business--we have this server, an offsite backup
 server, and a firewall box. I understand that issues like this are
 going to happen when I don't have a dedicated testing box, I just like to
 try to minimize them and keep them to weekends!

Understood.

 It sounds like my best bet might be to add a new UFS disk, do a clean
 install of 9.1 onto that disk, and then import my existing ZFS pool?

I would suggest starting with this:

Get stable/9 from the place I mentioned, burn

Re: ZFS Panic after freebsd-update

2013-07-01 Thread Jeremy Chadwick

On Mon, Jul 01, 2013 at 09:10:45PM +0300, Andriy Gapon wrote:
 on 01/07/2013 20:04 Jeremy Chadwick said the following:
  People are operating with the belief that ZFS just
  works, when reality shows it works until it doesn't
 
 That reality applies to everything that a man creates with a purpose to work.
 I am not sure why you are so over-focused on ZFS.
 Please stop spreading FUD.  Thank you.

The issue is that ZFS on FreeBSD is still young compared to other
filesystems (specifically UFS).  Nothing is perfect, but FFS/UFS tends
to have a significantly larger number of bugs worked out of it to the
point where people can use it without losing sleep (barring the SUJ
stuff, don't get me started).  I have the same concerns over other
things, like ext2fs and fusefs for that matter -- but this thread is
about a ZFS-related crash, and that's why I'm over-focused on it.

A heterogeneous (UFS+ZFS) setup, rather than homogeneous (ZFS-only),
results in a system where an admin can upgrade + boot into single-user
and perform some tasks to test/troubleshoot; if the ZFS layer is
broken, it doesn't mean an essentially useless box.  That isn't FUD,
that's just the stage we're at right now.  I'm aware lots of people have
working ZFS-exclusive setups; like I said, works great until it
doesn't.

So, how do you kernel guys debug a problem in this environment:

- ZFS-only
- Running -RELEASE (i.e. no source, thus a kernel cannot be rebuilt
  with added debugging features, etc.)
- No swap configured
- No serial console

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Subversion 1.8 / FreeBSD 8 x86 STABLE Symlinks

2013-06-30 Thread Jeremy Chadwick

On Sun, Jun 30, 2013 at 02:20:21PM -0400, Jason Hellenthal wrote:
 When using svn 1.8 I have come across a situation where, when it is used
 pointing to a symlink that refers to a working directory, an update will
 either segfault or exit prematurely and leave a lock held on the working
 directory that the symlink points to.
 
 This leaves you with one choice but to run cleanup on the referenced actual 
 working directory which was AFAIK never the case for any version below 1.8.
 
 Not sure if this is a problem with svn or FreeBSD itself but thought I would 
 report the characteristics in case it's noticed elsewhere.
 
 Details:
 Using UFS
 FreeBSD 8-STABLE i386 as of this date.
 
 In the directory...
 cd /exports/usr
 ln -s src8 src
 svn up /exports/usr/src

Known bug/problem in Subversion, not FreeBSD:

http://svn.apache.org/viewvc?view=revision&revision=r1496007

Previous discussion:

http://lists.freebsd.org/pipermail/freebsd-questions/2013-June/251842.html

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: FREEBSD_INSTALL failed with error 19 during booting installer

2013-06-29 Thread Jeremy Chadwick

On Sun, Jun 30, 2013 at 02:09:36AM +1000, Ian Smith wrote:
 On Fri, 28 Jun 2013 11:26:15 -0700, Jeremy Chadwick wrote:
   On Fri, Jun 28, 2013 at 08:22:29PM +0200, Marek Salwerowicz wrote:
Hi list,

I am trying to install FreeBSD 9.1-Release amd64 on a Supermicro server:

SuperStorage Server 6027R-E1R12N

with Intel Xeon E5-2640 CPU and 32 GB (4 x 8 ) KVR16R11D4/8HC installed

Currently I have only 2 SSD Kingston  drives (working in mirror)
installed on that server.

during booting installer from the ISO CD (amd64),  the boot process
fails with message:

Mounting from cd9660:/dev/iso9660/FREEBSD_INSTALL failed with error 19.

As I found here: http://forums.freebsd.org/showthread.php?t=36579 ,
probably this could be issue with ACPI, but setting option in
loader:

# set debug.acpi.disabled =hostres
# boot

made nothing for me.



Any ideas?
   
   Try using a USB flash drive + memstick image instead of CD-based media.
 
 Last time I tried - 9.1-release i386 - the memstick boot gave no option 
 to drop to loader; I had to burn a disc1 CD so I could drop to loader to 
 turn cam.ctl off to succeed installing in 128MB.  Did I miss something?

I've used memstick images exclusively for years and have never seen
this.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: FREEBSD_INSTALL failed with error 19 during booting installer

2013-06-28 Thread Jeremy Chadwick

On Fri, Jun 28, 2013 at 08:22:29PM +0200, Marek Salwerowicz wrote:
 Hi list,
 
 I am trying to install FreeBSD 9.1-Release amd64 on a Supermicro server:
 
 SuperStorage Server 6027R-E1R12N
 
 with Intel Xeon E5-2640 CPU and 32 GB (4 x 8 ) KVR16R11D4/8HC installed
 
 Currently I have only 2 SSD Kingston  drives (working in mirror)
 installed on that server.
 
 during booting installer from the ISO CD (amd64),  the boot process
 fails with message:
 
 Mounting from cd9660:/dev/iso9660/FREEBSD_INSTALL failed with error 19.
 
 As I found here: http://forums.freebsd.org/showthread.php?t=36579 ,
 probably this could be issue with ACPI, but setting option in
 loader:
 
 # set debug.acpi.disabled =hostres
 # boot
 
 made nothing for me.
 
 
 
 Any ideas?

Try using a USB flash drive + memstick image instead of CD-based media.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: AHCI Patsburg SATA controller and slow transfer speed

2013-06-27 Thread Jeremy Chadwick

On Thu, Jun 27, 2013 at 02:21:57PM -0700, Dave Hayes wrote:
 Greetings all. I'm on FreeBSD 9.1-STABLE #0 r251391M. I'm noticing
 two of my SATA disks are at half speed. Is this normal or is there
 some configuration I'm forgetting?
 
 # dmesg | grep -C 4 ahc
 ...
 ahci0: Intel Patsburg AHCI SATA controller port
 0x2070-0x2077,0x2060-0x2063,0x2050-0x2057,0x2040-0x2043,0x2020-0x203f
 mem 0xd0b00000-0xd0b007ff irq 21 at device 31.2 on pci0
 ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
 ahcich0: AHCI channel at channel 0 on ahci0
 ahcich1: AHCI channel at channel 1 on ahci0
 ahcich2: AHCI channel at channel 2 on ahci0
 ahcich3: AHCI channel at channel 3 on ahci0
 ahcich4: AHCI channel at channel 4 on ahci0
 ahcich5: AHCI channel at channel 5 on ahci0
 ...
 ada0: WDC WD200MFYYZ-01D45B0 01.01K01 ATA-8 SATA 3.x device
 ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
 ada0: Command Queueing enabled
 ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
 ada0: Previously was known as ad4
 ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
 ada1: WDC WD200MFYYZ-01D45B0 01.01K01 ATA-8 SATA 3.x device
 ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
 ada1: Command Queueing enabled
 ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
 ada1: Previously was known as ad6
 ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
 ada2: WDC WD200MFYYZ-01D45B0 01.01K01 ATA-8 SATA 3.x device
 ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
   ^
 ada2: Command Queueing enabled
 ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
 ada2: Previously was known as ad8
 ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
 ada3: WDC WD200MFYYZ-01D45B0 01.01K01 ATA-8 SATA 3.x device
 ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
   ^^^
 ada3: Command Queueing enabled
 ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
 ada3: Previously was known as ad10
 # pciconf -lcvb
 ahci0@pci0:0:31:2:  class=0x010601 card=0x35ae8086
 chip=0x1d028086 rev=0x06 hdr=0x00
 vendor = 'Intel Corporation'
 device = 'Patsburg 6-Port SATA AHCI Controller'
 class  = mass storage
 subclass   = SATA
 bar   [10] = type I/O Port, range 32, base 0x2070, size  8, enabled
 bar   [14] = type I/O Port, range 32, base 0x2060, size  4, enabled
 bar   [18] = type I/O Port, range 32, base 0x2050, size  8, enabled
 bar   [1c] = type I/O Port, range 32, base 0x2040, size  4, enabled
 bar   [20] = type I/O Port, range 32, base 0x2020, size 32, enabled
 bar   [24] = type Memory, range 32, base 0xd0b00000, size 2048, enabled
 cap 05[80] = MSI supports 1 message enabled with 1 message
 cap 01[70] = powerspec 3  supports D0 D3  current D0
 cap 12[a8] = SATA Index-Data Pair
 cap 13[b0] = PCI Advanced Features: FLR TP
 
 Thanks for any insight provided.

Intel Patsburg is otherwise known as Intel X79.  The X79
chipset/southbridge offers 6 SATA ports, 2 of which are SATA600, and the
remaining 4 are SATA300:

http://en.wikipedia.org/wiki/Intel_X79

The intention of this was to offer 2 ports for people wanting to use
SSDs (which tend to throttle themselves based on negotiated PHY speed),
and a remaining 4 ports for MHDDs or ATAPI.  You can, of course, use
whatever ports for whatever you want.

More importantly (I think): your devices are MHDDs and will never be
able to reach SATA600 (or SATA300) speeds.  Pure MHDDs which use SATA600
PHYs are somewhat of a marketing gimmick (but my gut feeling is that the
MHDD vendors are choosing to narrow the number of on-disk SATA
controllers they use).  Hybrid HDDs may benefit from faster PHYs.

Next, this statement by ahci(4) then confuses the user:

 ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported

You see, when AHCI was invented, the existing idea was that all ports
would have the same speed (and that was the case at the time).  Only
somewhat recently have some vendors begun to mix-match speeds on the
same controller -- like this one.

The AHCI specification probably (I have not read it recently) only
provides a number indicating the total number of ports followed by a
single number indicating the speed.

There may be support somewhere within AHCI to provide an updated way to
get more granular information, but I do not know if that's the case.

If there is, FreeBSD's ahci(4) driver does not support such at this
time (see sys/dev/ahci/ahci.c around line 502 for the device_printf()
call and what the arguments are (specifically AHCI_CAP_ISS and
AHCI_CAP_NPMASK)).

TL;DR -- Your motherboard offers 6 ports, 2 of which are SATA600, 4 of
which are SATA300, and despite the line shown above by FreeBSD not
matching reality, everything is working as designed.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977

Re: AHCI Patsburg SATA controller and slow transfer speed

2013-06-27 Thread Jeremy Chadwick

On Thu, Jun 27, 2013 at 06:38:27PM -0700, Jeremy Chadwick wrote:
 Next, this statement by ahci(4) then confuses the user:
 
  ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
 
 You see, when AHCI was invented, the existing idea was that all ports
 would have the same speed (and that was the case at the time).  Only
 somewhat recently have some vendors begun to mix-match speeds on the
 same controller -- like this one.
 
 The AHCI specification probably (I have not read it even recently) only
 provides a number indicating the total number of ports followed by a
 single number indicating the speed.
 
 There may be support somewhere within AHCI to provide an updated way to
 get more granular information, but I do not know if that's the case.
 
 If there is, FreeBSD's ahci(4) driver does not support such at this
 time (see sys/dev/ahci/ahci.c around line 502 for the device_printf()
 call and what the arguments are (specifically AHCI_CAP_ISS and
 AHCI_CAP_NPMASK)).

Just a technical follow-up:

I spent some time this evening looking at AHCI specification 1.30.  I'll
try to explain the situation.

First, at the HBA level (meaning the entire AHCI controller):

Bits 23-20 of CAP (reg. offset 0x00): Interface Speed Support (ISS).
This indicates, quote, "the maximum speed the HBA can support on its
ports".

Next, on a per-port basis, there are two registers available relating to
speed: one indicates speed, the other controls/limits speed:

1) Bits 7-4 of PxSSTS (reg. offset 0x28): SPD: Port x Serial ATA Status
(SCR0: SStatus).  This indicates, quote, "the negotiated interface
speed".

2) Bits 7-4 of PxSCTL (reg. offset 0x2c): SPD: Port x Serial ATA Control
(SCR2: SControl).  The register controls, quote, "the highest allowable
speed of the interface".  The bit definitions indicate a way to limit
the speed of a port and do not indicate capability.
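
The bit positions above can be sketched with plain shell arithmetic. This is
a minimal illustration, not driver code: the register values are made-up
samples, the helper names are hypothetical, and the 1/2/3 speed encoding is
assumed from the SATA generation numbering.

```shell
#!/bin/sh
# Map a speed code to a label (assumed encoding: 1, 2, 3 = Gen1, Gen2, Gen3).
ahci_speed_label() {
    case "$1" in
        1) echo "1.5Gbps" ;;
        2) echo "3Gbps" ;;
        3) echo "6Gbps" ;;
        *) echo "unknown" ;;
    esac
}

# CAP.ISS: bits 23-20 of the HBA-wide CAP register.
cap_iss() { echo $(( ($1 >> 20) & 0xf )); }

# PxSSTS.SPD and PxSCTL.SPD: bits 7-4 of the per-port register.
port_spd() { echo $(( ($1 >> 4) & 0xf )); }

cap=0x00300005     # hypothetical CAP value with ISS=3
ssts=0x00000123    # hypothetical PxSSTS value with SPD=2

echo "HBA maximum: $(ahci_speed_label "$(cap_iss "$cap")")"
echo "port negotiated: $(ahci_speed_label "$(port_spd "$ssts")")"
```

The point of the two helpers is that the HBA-wide maximum and the per-port
negotiated speed come from different registers, so one can read 6Gbps while
the other reads 3Gbps.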

The actual 1.30 specification even has a section (10.5) on this whole
ordeal, which states clearly, quote:


10.5 Interface Speed Support

The HBA indicates the maximum speed it can support via the CAP.ISS
register. Software can further limit the speed of a port by manipulating
each port's PxSCTL.SPD field to a lower value.


AHCI spec proposal 1.31 also does not address/cover this (all that
adds is per-port sleep capabilities).

I will point out that SATA600 is not officially mentioned in any spec at
this time (that I can get my hands on), so what all the OSes run off of
are educated assumptions.  :-)  But theoretically, a newer AHCI spec
could support per-port maximum speed indication.

It's not easy to phrase all this tersely in a single device_printf(),
and there has already been opposition to adding printing of more lines
to the existing drivers/in dmesg (meaning, printing 6 lines, one for
each port, indicating active speed + maximum speed, would probably be
looked down upon outside of verbose booting).  The best I can come up
with is this:

ahci0: AHCI v1.30, 6 ports, maximum 6Gbps, Port Multiplier not supported

...which is better, but could still be interpreted as 6 ports
with a maximum of 6Gbps per port.
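
For verbose boots, the per-port idea mentioned above might look roughly like
the following. Both the line format and the sample PxSSTS values are
hypothetical; nothing here reflects what ahci(4) actually prints.

```shell
#!/bin/sh
# Assumed encoding: 1, 2, 3 = 1.5, 3, 6 Gbps.
speed_label() {
    case "$1" in
        1) echo "1.5Gbps" ;;
        2) echo "3Gbps" ;;
        3) echo "6Gbps" ;;
        *) echo "unknown" ;;
    esac
}

# print_port_speeds ISS PXSSTS0 [PXSSTS1 ...]
# Prints one line per port: negotiated speed (PxSSTS bits 7-4) next to
# the HBA-wide maximum (the CAP.ISS code passed as the first argument).
print_port_speeds() {
    max=$(speed_label "$1")
    shift
    n=0
    for ssts in "$@"; do
        spd=$(( ($ssts >> 4) & 0xf ))
        echo "ahcich$n: negotiated $(speed_label "$spd"), HBA maximum $max"
        n=$((n + 1))
    done
}

# Sample: two ports linked at Gen3, two at Gen2.
print_port_speeds 3 0x133 0x133 0x123 0x123
```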

Hope this sheds light in some way or another.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: portupgrade(1) | portmaster(8) -- which is more effective for large upgrade?

2013-06-26 Thread Jeremy Chadwick

On Wed, Jun 26, 2013 at 09:42:43AM -0700, Chris H wrote:
 Greetings,
  I haven't upgraded my tree(s) for awhile. My last attempt to rebuild after 
 an updating
 src  ports, resulted in nearly installing the entire ports tree, which is 
 why I've
 waited so long. Try as I might, I've had great difficulty finding something 
 that will
 _only_ upgrade what I already have installed, _and_ respect the options 
 used during the
 original make  make install, or those options expressed in make.conf.
 As portupgrade(1)  portmaster(8) appear to be the most used in this 
 scenario,
 I'm soliciting opinions on which of these works best, or if there is 
 something else to
 better manage this situation. Is there such a thing as a FreeBSD upgrade 
 easy button?

Use portmaster, avoid portupgrade.  And no I will not expand on my
reasoning -- I urge anyone even mentioning the word portupgrade to spend
a few hours of their day reading the horror stories on the mailing lists
over the past 10 years or so (including recently).  Choose wisely.

And before going on any sort of update crusade, I recommend you
re-examine your make.conf methodologies for options if you haven't
already.  The OPTIONS framework has been revamped and improved many
times over, so you will find things like this on a system whose admin
keeps up with the times (compare this to older ways/methods, which may
break or stop working):

OPTIONS_UNSET+= X11 IPV6 NLS

php5_SET+=  APACHE
php5_UNSET+=CGI
postfix_SET+=   PCRE TLS SASL2
samba36_SET+=   AIO_SUPPORT
samba36_UNSET+= LDAP CUPS ACL_SUPPORT WINBIND POPT
wget_SET+=  OPENSSL
wget_UNSET+=IDN

When rebuilding everything, I have always resorted to this:

rsync -avH /usr/local/ /usr/local.old/
pkg_delete -a -f
rm -fr /usr/local/*
rm -fr /var/db/ports/*
rm -fr /usr/ports/distfiles/*
cd /usr/ports/whatever
make install clean
{lather rinse repeat until done}

And add some pkg_add -r's in there for large-ish things I don't want to
rebuild from source (I think folks who use X probably do this quite a
bit; I remember hearing how Open/LibreOffice takes something like 3-4
hours to build on some systems).
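
The "lather rinse repeat" step can be sketched as a loop over a list of port
origins. This is only one way to do it: rebuild_ports, DRY_RUN and the list
format are hypothetical, and BATCH=yes is used on the assumption that all
options already come from make.conf, so no config dialogs are wanted.

```shell
#!/bin/sh
# rebuild_ports: read port origins (e.g. "ftp/wget") from stdin, one per
# line, and build each from source. With DRY_RUN=1 the commands are only
# printed. PORTSDIR defaults to /usr/ports.
rebuild_ports() {
    portsdir=${PORTSDIR:-/usr/ports}
    while IFS= read -r origin; do
        # Skip blank lines and comment lines in the list.
        case "$origin" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "cd $portsdir/$origin && make BATCH=yes install clean"
        else
            # stdin is redirected so make cannot swallow the rest of the list.
            (cd "$portsdir/$origin" && make BATCH=yes install clean </dev/null) || return 1
        fi
    done
}
```

Running it with DRY_RUN=1 and the list on stdin first shows what would be
built before anything is touched.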

But that's just how I do things.  My advice on using portmaster,
however, still stands.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: portupgrade(1) | portmaster(8) -- which is more effective for large upgrade?

2013-06-26 Thread Jeremy Chadwick

On Wed, Jun 26, 2013 at 01:23:32PM -0700, Jeremy Chadwick wrote:
 On Wed, Jun 26, 2013 at 09:42:43AM -0700, Chris H wrote:
 {snipping}

Also, hoping the OP is subscribed to -stable -- you should probably deal
with this.  This is not the first time I've seen problems with mail
delivery to a 1command.com address.

bsd-li...@1command.com: host male.ultimateDNS.NET[209.180.214.225] said: 550
5.0.0 SPAM and BULK mail REJECTED (in reply to MAIL FROM command)

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Another bug in SSH in FreeBSD 8.4 (sftp cannot create relative symlinks)

2013-06-24 Thread Jeremy Chadwick

On Mon, Jun 24, 2013 at 03:36:24PM -0700, Xin Li wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA512
 
 On 06/24/13 15:11, Miroslav Lachman wrote:
 [...]
  The patch seems really simple and I know how to apply it, but I am
  not able to compile and install only fixed sftp command instead of
  the whole userland. Can you push me to the right direction?
 
 I think you can go to /usr/src/secure/usr.bin/sftp and do:
 
 make depend
 make
 
 Then, as root:
 
 make install
 
 I usually do a full world build to make sure that this doesn't break
 something else but this change should only affect sftp(1).

I'm going to make this real simple:

Is the problem with symlinks in the client (sftp(1)), in the server
(sftp-server(8)), or both?  The impression I get from the original post
that started this thread is that it's in the server part.

So, I believe he'd want to poke about in src/secure/libexec/sftp-server.
However, that may not be enough, due to the fact that sftp-server(8)
depends (links to) libssh.so.X, libcrypt.so.X, and libcrypto.so.X.  I do
not know where the actual broken code lies.

Someone on -security might know exactly what all needs to be built/what
commands need to be run, but I will tell you this up front:

The official security announcements for SSL or SSH-related things have
historically told people to build world.  I went and read the mailing
list archives for -security-announcements and found proof/examples of
this fact when issues pertain to SSL or SSH.

My recommendation is just to build world.  Don't risk it -- this is a
key piece of your system, all you're trying to do is save some time.
Don't.  Just build/install world and don't screw around.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Another bug in SSH in FreeBSD 8.4 (sftp cannot create relative symlinks)

2013-06-24 Thread Jeremy Chadwick

On Tue, Jun 25, 2013 at 03:03:04AM +0200, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 On Mon, Jun 24, 2013 at 03:36:24PM -0700, Xin Li wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA512
 
 On 06/24/13 15:11, Miroslav Lachman wrote:
 [...]
 The patch seems really simple and I know how to apply it, but I am
 not able to compile and install only fixed sftp command instead of
 the whole userland. Can you push me to the right direction?
 
 I think you can go to /usr/src/secure/usr.bin/sftp and do:
 
 make depend
 make
 
 Then, as root:
 
 make install
 
 Thank you! I didn't know I must be in /usr/src/secure/usr.bin/sftp
 
 I tried your patch and can confirm it works for me!
 
 I usually do a full world build to make sure that this doesn't break
 something else but this change should only affect sftp(1).
 
 I'm going to make this real simple:
 
 Is the problem with symlinks in the client (sftp(1)), in the server
 (sftp-server(8)), or both?  The impression I get from the original post
 that started this thread is that it's in the server part.
 
 No, it is the problem on the client side. The server side in all
 cases is good old OpenSSH 5.4 on FreeBSD 8.3. Only the newer sftp
 client is broken and this bug is really fixed by patch provided by
 Xin Li.
 
 We tried OpenSSH 6.2 client side from Mac OS X and it is broken too.
 The same apply to openssh-portable from ports (openssh-portable-6.2.p2_3,1)
 
 So, I believe he'd want to poke about in src/secure/libexec/sftp-server.
 However, that may not be enough, due to the fact that sftp-server(8)
 depends (links to) libssh.so.X, libcrypt.so.X, and libcrypto.so.X.  I do
 not know where the actual broken code lies.
 
 Someone on -security might know exactly what all needs to be built/what
 commands need to be run, but I will tell you this up front:
 
 The official security announcements for SSL or SSH-related things have
 historically told people to build world.  I went and read the mailing
 list archives for -security-announcements and found proof/examples of
 this fact when issues pertain to SSL or SSH.
 
 My recommendation is just to build world.  Don't risk it -- this is a
 key piece of your system, all you're trying to do is save some time.
 Don't.  Just build/install world and don't screw around.
 
 I understand your concern and I will rebuild world if the patch
 changes anything in the server part, but this is realy just a fix in
 sftp client command and I want to try it quickly and to have a quick
 path to go back to original version of the sftp command.
 
 This is on testing machine anyway, I will not do this on production
 machines.

Understood -- it was my misunderstanding of the issue (being on the
client side, not server side), so Xin's advice is sound.  Sorry for the
noise on my part.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: slow bootloader on Dell R320

2013-06-22 Thread Jeremy Chadwick

On Sat, Jun 22, 2013 at 09:37:37PM +0200, Loc BLOT wrote:
 Hi all !
 Thanks for the very good support of Dell R320 hardware, perc H310 is
 well supported, BCM5720 seems to work correctly and performances are
 great.
 The only problem i have found is very strange. The FreeBSD bootloader
 take many times to load, 30sec-2minutes to boot the kernel and show the
 bootloader menu. After that, the system boots properly, at a normal
 speed.
 Is there any issue or optimization i can do ?
 The OpenBSD bootloader doesn't have this problem.

1. What FreeBSD version exactly?  (Please don't say 9.1, we need to
know the full version, e.g. 9.1-RELEASE, or if you built your own we
need uname -a output (you can hide the machine name))

2. How many disks are in the machine?

3. Are any of the disks used for ZFS?

There have been **many** improvements to the FreeBSD bootloader with
regards to things taking a long time on boot-up in semi-recent days, but
answers to the above questions will determine that.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-22 Thread Jeremy Chadwick

On Sun, Jun 23, 2013 at 02:41:27AM +0200, Willem Jan Withagen wrote:
 On 19-6-2013 17:04, Jeremy Chadwick wrote:
 - Adam runs 9.1-RELEASE because of business needs pertaining to
freebsd-update and binary updates.  (I ask more about this for
benefits of readers below, however -- because this situation comes
up a lot and I want to know what real-world admins do)
 
 The bug is very specifically available in 9.1-RELEASE because I got
 bit by it before the release of 9.1. But discussed it with avg@ and
 it did not make it into the release, but was submitted only like 2
 weeks later.
 
 So in that case you can probably stop looking.
 
 For just about any 9.1-STABLE after that should the fix be in the code.

I'm not sure why so many people (so far) seem to think that this problem
is always the same issue -- it isn't.  Multiple things have historically
caused (and/or presently cause) this issue.

Here's the list I composed only a few days ago, and it is far from
thorough:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-June/073863.html

My point is that the "shutdown -r" issue might manifest itself in
the same fashion for everyone, but the **root cause** often differs.
I.e. what fixed it for you may not fix it for Adam.  We must wait and
see (he's in the process of getting a system to try stable/9 on).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 06:35:57PM +0700, Adam Strohl wrote:
 Hello -STABLE@,
 
 So I've seen this situation seemingly randomly on a number of both
 physical 9.1 boxes as well as VMs for I would say 6-9 months at
 least.  I finally have a physical box here that reproduces it
 consistently that I can reboot easily (ie; not a production/client
 server).
 
 No matter what I do:
 
 reboot
 shutdown -p
 shutdown -r
 
 This specific server will stop at All buffers synced and not
 actually power down or reboot.  KB input seems to be ignored.  This
 server is a ZFS NAS (with GMIRROR for boot blocks) but the other
 boxes which show this are using GMIRRORs for root/swap/boot (no
 ZFS).
 
 Here is what happens on the console: http://i.imgur.com/1H8JMyB.jpg
 
 When I reset the server it appears that disks were not dismounted
 cleanly ... on this ZFS box it comes back quick because ZFS is good
 like that but on the other servers with GMIRROR roots rebuilding the
 GMIRROR and fscking at the same time is murder on the
 disk/performance until it finishes.

1. You mention "as well as VMs".  Anything under a virtual machine or
under a hypervisor is going to be very, very, **VERY** different than
bare metal.  So I hope the issues you're talking about above are on bare
metal -- I will assume so.

2. We need to know what version of 9.1 you're using, i.e. 9.1-RELEASE.
If you use stable/9 (RELENG_9) we need to see uname -a output (you can
hide the machine name if you want).

3. Can we please have dmesg from this machine?  The controller and some
other hardware details matter.

4. Does sysctl hw.usb.no_shutdown_wait=1 help you?

5. Does sysctl hw.acpi.handle_reboot=1 help you?

6. Does sysctl hw.acpi.disable_on_reboot=1 help you?

7. If none of the above helps, can you please boot in verbose mode and
then, when the system locks up on "shutdown -r now", take a picture of
the VGA console?

8. Does the machine run moused(8) (check the process list please, do not
rely on rc.conf) ?
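
Items 4 through 6 are meant to be tried one at a time, with a reboot attempt
after each. A small sketch of that, where try_knob and DRY_RUN are
hypothetical names and only the three sysctl names come from the list above:

```shell
#!/bin/sh
# The three knobs named in items 4-6.
SHUTDOWN_KNOBS="hw.usb.no_shutdown_wait hw.acpi.handle_reboot hw.acpi.disable_on_reboot"

# try_knob NAME: set a single knob to 1, leaving the others untouched.
# With DRY_RUN=1 the sysctl command is printed instead of executed.
try_knob() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sysctl $1=1"
    else
        sysctl "$1=1"
    fi
}

# Show what would be run for each knob without changing anything.
for knob in $SHUTDOWN_KNOBS; do
    DRY_RUN=1 try_knob "$knob"
done
```

Setting one knob, attempting the reboot, and noting the result before moving
on keeps it clear which setting (if any) made the difference.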

 Another interesting thing is that this particular server runs slapd
 (OpenLDAP) which, when it comes back up, has a corrupted DB
 (easily fixed with db_recover, but still).  This might be because FS
 commits aren't happening at the end.   I can even manually stop
 slapd (service slapd stop) then run sync(8) (I assume this does
 something for ZFS too) and it still comes back as hosed if I reboot
 shortly after.  If I start/stop slapd it's fine.  So I feel like
 there is an FS/dismount thing going on here.

sync(8) does not do what you think it does.  Please read (not skim) this
entire thread starting here:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/thread.html#16982
http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/016982.html

Your problem is related to unclean shutdown; fix that and your issues go
away.

 Additional information: I also have some boxes which will reboot
 (ie; they don't freeze like some do at the end) but they don't
 dismount cleanly either and have to rebuild both GMIRROR and fsck.
 This might be a different issue, too.

Every issue needs to be handled/treated separately.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 07:53:19PM +0700, Adam Strohl wrote:
 On 6/19/2013 19:21, Jeremy Chadwick wrote:
 On Wed, Jun 19, 2013 at 06:35:57PM +0700, Adam Strohl wrote:
 Hello -STABLE@,
 
 So I've seen this situation seemingly randomly on a number of both
 physical 9.1 boxes as well as VMs for I would say 6-9 months at
 least.  I finally have a physical box here that reproduces it
 consistently that I can reboot easily (ie; not a production/client
 server).
 
 No matter what I do:
 
 reboot
 shutdown -p
 shutdown -r
 
 This specific server will stop at All buffers synced and not
 actually power down or reboot.  KB input seems to be ignored.  This
 server is a ZFS NAS (with GMIRROR for boot blocks) but the other
 boxes which show this are using GMIRRORs for root/swap/boot (no
 ZFS).
 
 Here is what happens on the console: http://i.imgur.com/1H8JMyB.jpg
 
 When I reset the server it appears that disks were not dismounted
 cleanly ... on this ZFS box it comes back quick because ZFS is good
 like that but on the other servers with GMIRROR roots rebuilding the
 GMIRROR and fscking at the same time is murder on the
 disk/performance until it finishes.
 
 1. You mention as well as VMs.  Anything under a virtual machine or
 under a hypervisor is going to be very, very, **VERY** different than
 bare metal.  So I hope the issues you're talking about above are on bare
 metal -- I will assume so.
 
 Nope, I see basically the same thing sometimes under ESXi 5.0
 Hypervisor (and yes it worries me the implications of something so
 broad).  Those unites I just haven't been able to isolate on a
 server which isn't critical.  Lets focus on this server for now
 though per your suggestion below.

I'm sorry but I don't understand your first sentence -- the first part
of your sentence says "nope" (I have to assume in reply to my "on bare
metal" part), but then says "I see basically the same thing sometimes
under ESXi", which implies an alternate environment in comparison (i.e.
we *are* talking about bare metal).  Consider me confused.  :-)

 2. We need to know what version of 9.1 you're using, i.e. 9.1-RELEASE.
 If you use stable/9 (RELENG_9) we need to see uname -a output (you can
 hide the machine name if you want).
 
 Sorry, this ZFS box is 9.1-R P4 (kernel built today):
 
 FreeBSD ilos.dsn 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #6: Wed Jun
 19 15:31:12 ICT 2013
 root@hostname:/usr/obj/usr/src/sys/ATEAMSYSTEMS  amd64

I suggest trying stable/9 (and staying with it, for that matter).

 3. Can we please have dmesg from this machine?  The controller and some
 other hardware details matter.
 
 Sure take a look at the full log here: http://pastebin.com/k55gVVuU
 
 This includes a boot, then a reboot as I describe (you can see it
 logs the All Buffers Synced, etc) then powering back on.

Thanks.  I was mainly interested in the storage controller being used
(in this case ahci(4)) and the disks being used (notorious ST3000DM001,
known for excessively parking heads).  AFAIK this isn't one of the
controllers that was known for weird quirky issues pertaining to
flushing data to disk on shutdown.

I have to ask: is this FreeBSD box running under a HV?

If it *is not* running under a HV, could we please get exact motherboard
model and version (including BIOS version)?  Sometimes (not always) you
can get this from kenv | grep smbios.

I can also see you're running your own kernel.  We'll get to that in a
moment.

 4. Does sysctl hw.usb.no_shutdown_wait=1 help you?
 
 Weirdly this allowed it to reboot on the first try (without needing
 to be reset), but not the second.

I'm not surprised.  Please re-try with stable/9; Hans has been constantly
working on the USB stack and fixing major bugs.

 The Starting background file
 system checks in 60 seconds message appeared ... that only happens
 when something is dirty, right?

No it does not.  That message is always printed when you use background
fsck, which is the default.

I do not advocate using background fsck, because it has been known (and
may still do this -- I do not care to find out, I do not have time for
unreliable filesystem nonsense) to not always fix all filesystem
problems.  Meaning: people using background fsck have been known to boot
into single-user and issue fsck manually and find issues.

Place background_fsck=no in /etc/rc.conf.  If the machine does not
have a clean filesystem on boot-up, you'll know because the system will
immediately begin fsck (in the foreground actively).  You'll recognise
that output if it happens, trust me.
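
One way to make that rc.conf change without ending up with duplicate lines
is sketched below. set_rc_var is a hypothetical helper written for this
example; it rewrites whichever file it is given, so it can be tried on a
copy before pointing it at /etc/rc.conf.

```shell
#!/bin/sh
# set_rc_var FILE NAME VALUE: ensure FILE ends up with exactly one
# NAME="VALUE" line, replacing any earlier assignment of NAME.
set_rc_var() {
    file=$1
    name=$2
    value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Keep every line except existing assignments of NAME.
    grep -v "^${name}=" "$file" > "$tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    # Write back through cat so the file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (not run here):
#   set_rc_var /etc/rc.conf background_fsck NO
```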

 So the second try with just this I could ctrl alt del it and it
 responded .. kind of:
 http://i.imgur.com/POAIaNg.jpg
 
 Still had to reset it though.

This looks like a chicken-and-egg problem -- you're probably fighting
with background fsck, as the message there indicates "some processes
would not die".  I'm just taking a guess though.

I am now going to ask you for more information:

1. gpart show -p xxx where xxx is each disk you have in the system
2

Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 09:15:18PM +0700, Adam Strohl wrote:
On 6/19/2013 20:35, Jeremy Chadwick wrote:

I've snipped out portions which aren't relevant at this point in the
convo. I'm trying to be terse as much as possible here (honest).

To recap for readers/mailing list:

- Adam sees the same behaviour on bare-metal systems as well as on
FreeBSD guests running under the VMware ESXi 5.0 hypervisor. However,
as I stated on the list just yesterday about lock-ups on shutdown,
every situation may be different, and there is a well-established
history of this problem on FreeBSD where each root cause (bug)
was completely different from the others.

- The system we're discussing at this point in the thread is on
bare metal -- specifically an Asus P8B-X motherboard, with BIOS
version 6103, driven entirely by on-board Intel AHCI (not BIOS-level
RAID).

- Adam runs 9.1-RELEASE because of business needs pertaining to
freebsd-update and binary updates. (I ask more about this for
benefits of readers below, however -- because this situation comes
up a lot and I want to know what real-world admins do)

Thanks. I was mainly interested in the storage controller being used
(in this case ahci(4)) and the disks being used (notorious ST3000DM001,
known for excessively parking heads).

Yeah, was not my first choice but then again ... RAIDZ-2 :) HD
supply chain here (Thailand) is weird considering how many are made
here (and can't buy). Smartd screams about them possibly needing a
firmware update (they don't according to Seagate). Had no issues
aside from a failure a month or so ago (it's an HD ... it
happens).

Absolutely understood -- and FYI, in case you need backup, your thought
process/conclusion here is spot on (re: it's a MHDD, failures happen).

Irrelevant to your shutdown problem: as for smartmontools bitching about
the firmware: no vendors disclose what actual changes go into their
drive firmware updates (vendors if you are reading this: I will have
your souls...), so I have to read a bunch of end-user forums where
nobody knows what they're talking about, and then of course find this
highly educational *cough* article from Adaptec:

http://ask.adaptec.com/app/answers/detail/a_id/17241/~/known-issues-with-seagate-barracuda-7200.14-desktop-drives

The problem here is that there have been *so many* firmware bugs with
Seagate's drives in the past 2 years or so that it's impossible for me
to know which fixes what. You buy what you buy because that's what you
buy, and that's cool -- but I avoid their stuff like the plague.

<unrelated>
Readers: if any of you have a ST[123]000DM001 drive running the CC24
firmware, and can confirm high head parking counts (SMART attribute
193), and are willing to upgrade your drive firmware to the latest, then
see if the LCC increments stop (or at least settle down to normal
levels), I'd love to hear from you. I have been socially boycotting
these models of drives because of that idiotic firmware design choice
for quite some time now (not to mention the parking on those drives
is audibly loud in a normal living room), and if the F/W actually
inhibits the excessive parking then I have some drives to consider
upgrading. :-)
</unrelated>
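
For anyone wanting to report numbers, a minimal sketch of pulling the raw
Load_Cycle_Count out of smartmontools output follows.  The helper name
lcc_raw is made up for illustration, and it assumes the usual
"smartctl -A" attribute table, where the raw value is the tenth column:

```sh
# Print the raw value of SMART attribute 193 (Load_Cycle_Count) from
# "smartctl -A" output supplied on stdin.
# Assumption: the standard attribute table layout, raw value in column 10.
lcc_raw() {
    awk '$1 == 193 { print $10 }'
}

# Typical use (the device name is only an example):
#   smartctl -A /dev/ada0 | lcc_raw
```

Recording the value before and after a firmware update, over the same
length of time, should show whether the parking rate actually changed.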

I can also see you're running your own kernel. We'll get to that in a
moment.

It's GENERIC with the following added to the end:

# -- Add Support for nicer console
#
options VESA
options SC_PIXEL_MODE

Can you try removing VESA and SC_PIXEL_MODE please? I know that
sounds crazy (what on earth would that have to do with it?), but
please try it. I can explain the justification if need be -- I'm being
extra paranoid of something that got discovered here on -stable only a
few days ago. It's a stretch, but I can see potential relevance. I can
provide details/links later.

4. Does sysctl hw.usb.no_shutdown_wait=1 help you?

Weirdly this allowed it to reboot on the first try (without needing
to be reset), but not the second.

I'm not surprised. Please re-try with stable/9; Hans has been constantly
working on the USB stack and fixing major bugs.

Got it but probably not going to go this route as it means no more
binary upgrades. While I can reboot it, it is the office NAS here
and so 'testing out' -STABLE I think probably isn't going to happen.

I understand. I have a question relating to this below.

Place background_fsck=no in /etc/rc.conf. If the machine does not
have a clean filesystem on boot-up, you'll know because the system will
immediately begin fsck (in the foreground actively). You'll recognise
that output if it happens, trust me.

Preaching to the choir: we set this on all servers, but this one somehow
did not have it set (I think because ZFS made it unique and our rc.conf
template was not copied over properly).

Where should I send my bill for services rendered? (Totally kidding --
just had some breakfast so feeling chipper :-) )

So the second try with just this I could ctrl alt

Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 05:02:20PM +0200, Dennis Kgel wrote:
 Am 19.06.2013 um 16:47 schrieb Steven Hartland:
  I'm not familiar with that model of the Areca, but have you tried
  with the standard OS driver, or does it not support that card?
 
 The ARC1320 (non-raid) unfortunately isn't supported by the in-tree driver.

Which model of the ARC1320 are you using?  (There are 2.)  I'm having
trouble understanding their chart too:

http://www.areca.us/products/sasnoneraid6g.htm

The chart claims the controllers support up to 128 disks, via break-out
cables, but I'm not sure that's right.

You aren't using any port multipliers, are you?

  Also when you see hangs can you access the disk directly or not
  e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?
 
 Interesting idea. The dd then hangs right until everything else resumes as 
 well.
 
 ^T during hang says: load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 
 1632k

Is this *while* you have immense amounts of ZFS write I/O going to
those drives (your zpool iostat was showing ~250-300MB/sec to the pool)?

It's very important to note that the stats you showed were during
writes.

What we're trying to figure out here is where the blocking (waiting) is
happening:

a) the ZFS layer
b) the storage driver layer ('arcsas', the 3rd-party unofficial driver)
c) the CAM layer
d) the GEOM layer
e) something with the disk(s)
f) something with memory I/O going on (say between the storage driver
   and ZFS, for lack of better way to phrase it)
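
One cheap data point for narrowing this down is the wait channel that ^T
prints in square brackets.  The sketch below just extracts it from such a
status line; the helper name wchan_of is made up for illustration.  My
understanding (an assumption, not verified against this system) is that
"physrd" is the wait message of the kernel's raw-device read path, which
would put a dd reading /dev/da0 below the ZFS layer while it waits:

```sh
# Extract the wait channel (the text inside [...]) from a ^T / SIGINFO
# status line supplied on stdin.
wchan_of() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example with the line quoted above:
#   echo 'load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k' | wchan_of
```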

I have a very big Email written for you, but I wanted to let certain
answers to Ronald's questions come out first.

-rw---1 jdc   users 5576 Jun 19 06:49 dennis_kgel_response.txt

I need to re-word this and take into consideration some of the new stuff
said up to now, but I don't know if I'll have the time for this (you
should see my desktop right now, I have literally 4 IM messages to
answer and my Email box is non-stop).

The one I want to get out of the way right now is this:

Can you please try putting this in /boot/loader.conf + reboot and
see if the behaviour for you changes?

vfs.zfs.no_write_throttle=1

Warning: this may actually exacerbate the problem, depending on
what the nature/root cause is.  Right now I'm of the opinion ZFS is
actually doing the Right Thing(tm) and that the issue may be in Areca's
driver, but that's hearsay until I have proof.  But the write throttling
stuff added semi-recently (by the Illumos folks, this is not a FreeBSD
feature) has had some reports of problems where disabling it helped
immensely.

Important: 24 disks off a single controller is a lot of bandwidth.
That controller may be overwhelmed, in which case you would see
exactly this kind of behaviour as the controller is screaming "GOD HELP
ME, I'M TRYING TO DO ALL THIS STUFF AND YOU KEEP THROWING I/O AT ME."
:-)  This is also why I ask about port multiplier usage.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 10:53:46AM -0500, Matthew D. Fuller wrote:
 On Wed, Jun 19, 2013 at 08:04:14AM -0700 I heard the voice of
 Jeremy Chadwick, and lo! it spake thus:
  
  unrelated
  Readers: if any of you have a ST[123]000DM001 drive running the CC24
  firmware, and can confirm high head parking counts (SMART attribute
  193), and are willing to upgrade your drive firmware to the latest then
  see if the LCC increments stop (or at least settle down to normal
  levels), I'd love to hear from you.  I have been socially boycotting
  these models of drives because of that idiotic firmware design choice
  for quite some time now (not to mention the parking on those drives
  is audibly loud in a normal living room), and if the F/W actually
  inhibits the excessive parking then I have some drives to consider
  upgrading.  :-)
  /unrelated
 
 I dunno about firmware, but you can smack 'em with a big hammer...
 
 /etc/rc.local:
 for i in 0 1; do
 /sbin/camcontrol cmd ada${i} -a "EF 85 00 00 00 00 00 00 00 00 00 00"
 done
 
 x-ref:
 http://lists.freebsd.org/pipermail/freebsd-stable/2009-November/052997.html
 
 
 LCC was somewhere in the upper 400's (I wanna say 480-some?) a year
 and change ago when I dropped that in.  It's 506/493 now on the two
 drives.

The above CDB + subcommand disables APM entirely.  There is a lot more
to APM than just parking heads (and in all honesty, APM should have
nothing to do with parking heads).  Disabling APM can actually have
drastic effects on drive temperature (meaning there are certain chip
and/or motor operations that said feature controls *in addition* to head
parking), and other firmware-level features that aren't documented.

Furthermore, that CDB does not work for all drives.  There are Seagate
drives -- I know because I bought some and returned them when the APM
trick did not work -- that lack the LCC-disable tie-in to APM.  Some
drives rejected the CDB (an ATA status code error was returned), while
others accepted it but nothing reported by 0xec (IDENTIFY) changed.
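
A rough way to check whether the drive actually took the change is to look
at the APM row in the IDENTIFY data afterwards.  This is a sketch only: the
helper name apm_enabled is made up, and it assumes "camcontrol identify"
prints a feature table whose columns are name, Support, Enabled and Value,
which may not match every camcontrol version:

```sh
# Print the "Enabled" column of the APM row from "camcontrol identify"
# output supplied on stdin.
# Assumption: the row starts with "advanced power management" and is
# followed by the Support and Enabled columns.
apm_enabled() {
    awk '/^advanced power management/ { print $5 }'
}

# Typical use (the device name is only an example):
#   camcontrol identify ada0 | apm_enabled
```

If this still prints "yes" after the camcontrol cmd above, the drive
accepted the command without changing anything.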

The only model of drive I know that reliably works with this method is
the WD Green/-GP drive, and the drive temperatures do increase.  No idea
on the Blues.  (Another reason I recommend the Reds...)

What *should* have happened is that a new 0xef subcommand should have
been created for this.  Subs range from 0x00-0xff.  T13 spec shows
that a huge number of them (I'd say 30% or more) are marked Reserved
and an additional 30% or so are marked Obsolete.  And finally,
0x56-0x5c, 0xd6-0xdc and 0xe0 are Vendor Specific.

But looking at this from a more general view, the real issue is that
these types of features should not have been introduced to begin with.
The vendors introduced this problem, and now are marketing drives with
said feature disabled, claiming "we fixed the problem that annoys so
many of you!" -- the same problem **they introduced without asking
anyone**.

I will have -- and eat -- their souls.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly dismount

2013-06-19 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 11:34:39AM -0500, Matthew D. Fuller wrote:
 On Wed, Jun 19, 2013 at 09:16:35AM -0700 I heard the voice of
 Jeremy Chadwick, and lo! it spake thus:
  
  The above CDB + subcommand disables APM entirely.  There is a lot
  more to APM than just parking heads (and in all honesty, APM should
  have nothing to do with parking heads).  Disabling APM can actually
  have drastic effects on drive temperature (meaning there are certain
  chip and/or motor operations that said feature controls *in
  addition* to head parking), and other firmware-level features that
  aren't documented.
 
 True enough, in concept.  With all the drives sitting behind
 ventilation perfectly capable of dealing with 15kRPM drives, I don't
 worry about what that might do to the 7200's though...

Justified in your environment, but not in mine -- where most of my
systems (at home) are extremely quiet (1000-1200rpm fans, lots of noise
dampening material, etc.).  A 10C increase *during idle* is enough to
make me wary.  I also have extremely sensitive hearing, so drives
clicking is something I can hear from quite a distance -- I guess
working with them for so long over the years has made me sensitive to
'em.

  Furthermore, that CDB does not work for all drives.  There are
  Seagate drives -- I know because I bought some and returned them
  when the APM trick did not work -- that lack the LCC-disable tie-in
  to APM.  The drive either rejected the CDB (ATA status code error
  returned), while others accepted it but nothing in 0xec (IDENTIFY)
  reported as got changed.
 
 Well, I haven't seen it with these.  Several of
 ada0: ST1000DM003-9YN162 CC4D ATA-8 SATA 3.x device
 and some systems with CC4C too.

The drives I was testing were STx000DM001.  I don't remember if I had a
DM002.  I also don't remember the firmware version they had on them, but
I do remember there were no updates available from Seagate at that time.
On the other hand, their forum was *filled* with post after post about
the issue, including one fellow whose drive in something like 3 months
was almost reaching MTBF head park/reload count.

But my point is this: 3.5" drives do not need this feature in 95% of
environments.  In desktop systems it's worthless -- in consumer desktops
it accomplishes nothing but noise and annoyance and impacts I/O, and in
business desktop environments it serves no purpose because most
places have their desktops go into sleep mode (so drive standby/sleep
gets used).  And in the server environment it's pure 100% worthless.

With 2.5" drives I can see it being more useful, but only if the drive
is used in a laptop.  There are NASes (and now servers too!) which use
2.5" drives, and I sure as hell wouldn't want that happening there.

So really it's just a bad feature all around that should be specific to
one environment demographic; the vendors should have made a 2.5" drive
dedicated for laptops that had this feature enabled, while disabling it
on all other drives (2.5" and 3.5").  What we got was nearly the opposite.

  I will have -- and eat -- their souls.
 
 The problem with that is that the undigestible bits of soul just get
 passed right back into the ecosystem, and in a more concentrated form.
 
 Some might suggest that's already happened, and is what got us here in
 the first place  8-}

If you had what I do (moderate-to-severe IBS), you'd know that it
definitely doesn't get passed back in a more concentrated form.  First
joke I've been able to make about my health condition, yeah!  "Ha!  I
kill me!" -- Alf

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: system sporadically hangs on shutdown after switching to WITH_NEW_XORG

2013-06-18 Thread Jeremy Chadwick

On Tue, Jun 18, 2013 at 07:00:30PM +0430, Javad Kouhi wrote:
 Thanks for the reply, seems that our source trees are not same, I got this:
 
 % patch -p1 < /path/to/patch
 Hmm...  Looks like a unified diff to me...
 The text leading up to this was:
 --
 |diff --git a/sys/dev/drm2/i915/intel_fb.c b/sys/dev/drm2/i915/intel_fb.c
 |index 3cb3b78..e41a49f 100644
 |--- a/sys/dev/drm2/i915/intel_fb.c
 |+++ b/sys/dev/drm2/i915/intel_fb.c
 --
 Patching file sys/dev/drm2/i915/intel_fb.c using Plan A...
 Hunk #1 succeeded at 207 with fuzz 1.
 Hunk #2 failed at 231.
 1 out of 2 hunks failed--saving rejects to sys/dev/drm2/i915/intel_fb.c.rej
 Hmm...  The next patch looks like a unified diff to me...
 The text leading up to this was:
 --
 |diff --git a/sys/dev/syscons/scvgarndr.c b/sys/dev/syscons/scvgarndr.c
 |index 6e6663c..fc7f02f 100644
 |--- a/sys/dev/syscons/scvgarndr.c
 |+++ b/sys/dev/syscons/scvgarndr.c
 --
 Patching file sys/dev/syscons/scvgarndr.c using Plan A...
 Hunk #1 succeeded at 395.
 Hunk #2 failed at 447.
 1 out of 2 hunks failed--saving rejects to sys/dev/syscons/scvgarndr.c.rej
 done
 
 
 And the git way:
 
 % git apply /path/to/patch
 error: patch failed: sys/dev/drm2/i915/intel_fb.c:207
 error: sys/dev/drm2/i915/intel_fb.c: patch does not apply
 error: patch failed: sys/dev/syscons/scvgarndr.c:445
 error: sys/dev/syscons/scvgarndr.c: patch does not apply
 
 
 I have revision 251934 of -STABLE branch. (I updated my source tree
 about 3 hours ago using svn)

I do not use git, I use svn, so I cannot help you with git crap.

Please revert your sys/dev/drm2/i915/intel_fb.c and
sys/dev/syscons/scvgarndr.c back to r251934 (or newer) before following
what I tell you below.

The problem is either that:

- The patch you were given is probably for a different FreeBSD release,
  thus the code/line numbers/info in the code break the fuzzy logic
  matching,
- You copy-pasted the diff and because of tabs vs. spaces botched it,
- git apply/patch/whatever is weird,
- Multitudes of other possibilities I do not care to go into.

The hack kib@ gave you is not hard to manually add yourself.  It's very
few lines of code.  I'm very surprised you didn't try to manually add it
yourself.  So I have done that for you.  First, the proof -- this is
against r251939, by the way, but that shouldn't matter as nobody has
touched this between r251934 and r251939:

$ svn info
Path: .
Working Copy Root Path: /home/jdc/work/src
URL: svn://svn.freebsd.org/base/stable/9
Repository Root: svn://svn.freebsd.org/base
Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
Revision: 251939
Node Kind: directory
Schedule: normal
Last Changed Author: marius
Last Changed Rev: 251939
Last Changed Date: 2013-06-18 07:20:14 -0700 (Tue, 18 Jun 2013)

$ svn status
M   sys/dev/drm2/i915/intel_fb.c
M   sys/dev/syscons/scvgarndr.c

The diff itself is available here:

http://jdc.koitsu.org/freebsd/sysmouse_vsync.diff

I've also attached it here in Email (assuming the mailing list doesn't
delete it).

You should apply the patch using:

  cd /usr/src  (or wherever your source is)
  patch -p0 < sysmouse_vsync.diff

Assuming use of svn, you can revert this patch by doing:

  cd /usr/src  (or wherever your source is)
  svn revert sys/dev/drm2/i915/intel_fb.c
  svn revert sys/dev/syscons/scvgarndr.c
  rm sys/dev/drm2/i915/intel_fb.c.orig
  rm sys/dev/syscons/scvgarndr.c.orig

There is probably some other magical way to do all of this, but as
anyone here knows, I do things manually because in general I do not
trust VCSes or the magic they do under the hood; I prefer to do things
that I know work.

Good luck -- I cannot help with any other aspect to the issue.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Index: sys/dev/drm2/i915/intel_fb.c
===================================================================
--- sys/dev/drm2/i915/intel_fb.c	(revision 251939)
+++ sys/dev/drm2/i915/intel_fb.c	(working copy)
@@ -207,6 +207,8 @@ static void intel_fbdev_destroy(struct drm_device
 	}
 }
 
+extern int sc_txtmouse_no_retrace_wait;
+
 int intel_fbdev_init(struct drm_device *dev)
 {
 	struct intel_fbdev *ifbdev;
@@ -229,6 +231,7 @@ int intel_fbdev_init(struct drm_device *dev)
 
 	drm_fb_helper_single_add_all_connectors(&ifbdev->helper);
 	drm_fb_helper_initial_config(&ifbdev->helper, 32);
+	sc_txtmouse_no_retrace_wait = 1;
 	return 0;
 }
 
Index: sys/dev/syscons/scvgarndr.c
===================================================================
--- sys/dev/syscons/scvgarndr.c	(revision 251939)
+++ sys/dev/syscons/scvgarndr.c	(working copy)
@@ -395,6 +395,8 @@ vga_txtblink(scr_stat *scp, int at, int flip)
 {
 }
 
+int sc_txtmouse_no_retrace_wait;
+
 #ifndef SC_NO_CUTPASTE
 
 static void
@@ -445,7 +447,9

Re: system sporadically hangs on shutdown after switching to WITH_NEW_XORG

2013-06-18 Thread Jeremy Chadwick

On Tue, Jun 18, 2013 at 10:37:10PM +0430, Javad Kouhi wrote:
 On Tue, Jun 18, 2013 at 7:17 PM, Jeremy Chadwick j...@koitsu.org wrote:
 
  I do not use git, I use svn, So I cannot help you with git crap.
 
  Please revert your sys/dev/drm2/i915/intel_fb.c and
  sys/dev/syscons/scvgarndr.c back to r251934 (or newer) before following
  what I tell you below.
 
  The problem is either that:
 
  - The patch you were given is probably for a different FreeBSD release,
thus the code/line numbers/info in the code break the fuzzy logic
matching,
  - You copy-pasted the diff and because of tabs vs. spaces botched it,
  - git apply/patch/whatever is weird,
  - Multitudes of other possibilities I do not care to go into.
 
  The hack kib@ gave you is not hard to manually add yourself.  It's very
  few lines of code.  I'm very surprised you didn't try to manually add it
  yourself.  So I have done that for you.  First, the proof -- this is
  against r251939, by the way, but that shouldn't matter as nobody has
  touched this between r251934 and r251939:
 
  $ svn info
  Path: .
  Working Copy Root Path: /home/jdc/work/src
  URL: svn://svn.freebsd.org/base/stable/9
  Repository Root: svn://svn.freebsd.org/base
  Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
  Revision: 251939
  Node Kind: directory
  Schedule: normal
  Last Changed Author: marius
  Last Changed Rev: 251939
  Last Changed Date: 2013-06-18 07:20:14 -0700 (Tue, 18 Jun 2013)
 
  $ svn status
  M   sys/dev/drm2/i915/intel_fb.c
  M   sys/dev/syscons/scvgarndr.c
 
  The diff itself is available here:
 
  http://jdc.koitsu.org/freebsd/sysmouse_vsync.diff
 
  I've also attached it here in Email (assuming the mailing list doesn't
  delete it).
 
  You should apply the patch using:
 
cd /usr/src  (or wherever your source is)
patch -p0 < sysmouse_vsync.diff
 
  Assuming use of svn, you can revert this patch by doing:
 
cd /usr/src  (or wherever your source is)
svn revert sys/dev/drm2/i915/intel_fb.c
svn revert sys/dev/syscons/scvgarndr.c
rm sys/dev/drm2/i915/intel_fb.c.orig
rm sys/dev/syscons/scvgarndr.c.orig
 
  There is probably some other magical way to do all of this, but as
  anyone here knows, I do things manually because in general I do not
  trust VCSes or the magic they do under the hood; I prefer to do things
  that I know work.
 
  Good luck -- I cannot help with any other aspect to the issue.
 
  --
  | Jeremy Chadwick   j...@koitsu.org |
  | UNIX Systems Administratorhttp://jdc.koitsu.org/ |
  | Making life hard for others since 1977. PGP 4BD6C0CB |
 
 
 Many thanks for the detailed answer. I've applied your patch and then
 rebuilt the world and kernel. To be honest, I tried to apply the patch
 manually but the syntax was too complex for me. Thanks for the help to
 apply the patch.
 
 Unfortunately, the original issue still exists and shutdown(8)
 doesn't work properly. I'm a newbie and I don't know what information
 I should provide, but here is some basic information:
 
 % uname -a
 FreeBSD minootux 9.1-STABLE FreeBSD 9.1-STABLE #0 r251946M: Tue Jun 18
 21:16:56 IRDT 2013 root@minootux:/usr/obj/usr/src/sys/GIGABYTE
 amd64
 
 % pkg_info -I -x xorg-server -x drm
 libdrm-2.4.44   Userspace interface to kernel Direct Rendering Module 
 servi
 xorg-server-1.12.4,1 X.Org X server and related programs
 
 The machine is a laptop and the following link contains the details
 about the hardware:
 http://www.gigabyte.com/products/product-page.aspx?pid=3793#sp
 
 KMS and NEW_XORG are enabled in my /etc/make.conf.

First, what makes you think your issue is the same issue as reported by
Michiel Boland?  Let me point you to two of his posts (read them slowly
and in full please):

http://lists.freebsd.org/pipermail/freebsd-stable/2013-June/073821.html

http://lists.freebsd.org/pipermail/freebsd-stable/2013-June/073839.html

Second, the patch is not mine -- it's Konstantin's.  I did not write the
code/fix, nor do I understand it.  All I did was provide a version of
the same patch that applied cleanly on recent stable/9.  (I'm sorry for
needing to state this, but clear ownership of code/issues is important.)

TL;DR -- The patch kib@ wrote was not for you, it was for Michiel.  If
the patch works for Michiel and fixes his issue, great.  It seems clear
to me that you have a different problem, or an issue that manifests
itself in the same manner but has a different root cause.  Your issue
should be handled separately (preferably in another thread).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: system sporadically hangs on shutdown after switching to WITH_NEW_XORG

2013-06-18 Thread Jeremy Chadwick

On Wed, Jun 19, 2013 at 01:41:10AM +0430, Javad Kouhi wrote:
 I've read the posts again. Although the issue looks the same as Michiel
 Boland's (first link), I'm not sure the root cause is the same as
 Michiel's (second link). Anyway, it should be discussed in another
 thread, as you said.

Let me be more clear:

I have seen repeated reports from people complaining about lockups when
shutting down many times over the years.  The ones I remember:

- Certain oddities with SCSI/SATA storage drivers and disks (many of
  these have been fixed)
- ACPI-based reboot not working correctly on some motherboards
  (depends on hw.acpi.handle_reboot and sometimes
  hw.acpi.disable_on_reboot) -- not sure if this still pops up
- USB layer causing issues, or possibly some USB CAM integration
  problem (this is still an ongoing one)
- Now some sort of weird Intel graphics driver (and DRM?) quirk
  involving moused(8) and Vsync (the issue reported by Michiel)

And I'm certain I'm forgetting others.
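
As an illustration of the ACPI item in that list, the two sysctls can be
tried from /etc/sysctl.conf.  This is a sketch, not a recommendation: which
setting (if either) helps depends on the motherboard, and the values shown
are just the "on" state of each knob:

```sh
# /etc/sysctl.conf
# Use the ACPI reset register to reboot the machine.
hw.acpi.handle_reboot=1
# Alternatively, try disabling ACPI just before the reboot action.
#hw.acpi.disable_on_reboot=1
```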

What Kevin Oberman said also applies -- these are painful to debug
because the system is already in a shutting-down state where usability
and accessibility become bare minimum, and you're kind of at your
wits' end.

Booting verbose can help -- there are other messages printed to the VGA
(and/or serial) console during the shutdown phase when verbose.
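
A minimal sketch of making verbose boot persistent, assuming you want it on
every boot rather than choosing it once from the loader menu:

```sh
# /boot/loader.conf
# Boot the kernel in verbose mode (same effect as "boot -v" at the loader prompt).
boot_verbose="YES"
```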

All you can hope for is that the kernel is still alive, that
Ctrl-Alt-Esc to force a drop to DDB works (assuming all of this is
enabled in your kernel), and that someone familiar with the FreeBSD
kernel can help you debug it (possibly it's just easier to do that, type
"panic", then issue "call doadump" to force a dump to swap at that point
-- kib@ might have better recommendations).

Serial console can also greatly help, because quite often there are
pages upon pages of debugging information that are useful, otherwise you
have to hope the VGA console keyboard is functional (even more tricky
with USB) and that Scroll Lock + Page Up/Down function along with taking
photos of the screen; doing it this way is stressful and painful for
everyone involved.

I hope this sheds some light on why I said what I did.  :-)

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-06-16 Thread Jeremy Chadwick

On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
 On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
  On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
   Each day at 5:15 we are generating snapshots on various machines.
   This used to work perfectly under 7-STABLE for years but since
   we started to use 9.1-STABLE the machine reboots in about 10%
   of all cases.
   
   After rebooting we find a new snapshot file which is a bit
   smaller than the good ones and with different permissions
   It does not succeed a fsck. In this example it is the one
   whose name is beginning with s3:
   
   -r--r-   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
   -r   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
   
   After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
   I see the following LORs (mksnap_ffs starts exactly at 5:15):
   
   May 29 05:15:00 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:00 kern.crit palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
   May 29 05:15:00 kern.crit palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
   May 29 05:15:04 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:04 kern.crit palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
   May 29 05:15:04 kern.crit palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
   
   Unfortunately no corefiles are being generated ;-(.
   
   I have checked and even rebuilt the (UFS1) fs in question
   from scratch. I have also seen this happen on an UFS2 on
   another machine and on a third one when running dump -L
   on a root fs.
   
   Any hints of how to proceed?
  
  Would it be possible to set up a serial console that is logged on this
  machine to see if it is panic'ing but failing to write out a crashdump?
 
 I'll try to arrange that. It'll take a bit since this
 box is 200 km away... 
 
 Maybe I'll find another one nearby to reproduce it...

SPECIFICALLY regarding lack of crash dumps: I need to see the
following:

* cat /etc/rc.conf
* cat /etc/fstab

I may need output from other commands, but shall deal with that when I
see output from the above.  Thanks.
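
For context, these are the rc.conf knobs that normally govern crash dumps,
and presumably what would be looked for in that output.  The values are a
common setup shown as an illustration, not taken from the machine in
question:

```sh
# /etc/rc.conf
# "AUTO" selects the first suitable swap device listed in /etc/fstab as the
# kernel dump device; a device path can be given instead.
dumpdev="AUTO"
# Directory where savecore(8) stores the recovered dump on the next boot.
dumpdir="/var/crash"
```

The dump device has to be at least as large as the dump being written, and
dumpdir needs enough free space for savecore to copy it out afterwards.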

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-06-16 Thread Jeremy Chadwick

On Sun, Jun 16, 2013 at 10:02:39AM +0200, Andre Albsmeier wrote:
 On Sun, 16-Jun-2013 at 08:54:41 +0200, Jeremy Chadwick wrote:
  On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
   On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
 Each day at 5:15 we are generating snapshots on various machines.
 This used to work perfectly under 7-STABLE for years but since
 we started to use 9.1-STABLE the machine reboots in about 10%
 of all cases.
 
 After rebooting we find a new snapshot file which is a bit
 smaller than the good ones and with different permissions
 It does not succeed a fsck. In this example it is the one
 whose name is beginning with s3:
 
 -r--r-   1 root  operator  snapshot 72802894528 29 May 05:15 
 s2-2013.05.28-03.15.04
 -r   1 root  operator  snapshot 72802893824 29 May 05:15 
 s3-2013.05.29-03.15.03
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s4-2013.05.23-06.38.44
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s5-2013.05.24-03.15.03
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s6-2013.05.25-03.15.03
 
 After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
 I see the following LORs (mksnap_ffs starts exactly at 5:15):
 
 May 29 05:15:00 kern.crit palveli kernel: lock order reversal:
 May 29 05:15:00 kern.crit palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
 May 29 05:15:00 kern.crit palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
 May 29 05:15:04 kern.crit palveli kernel: lock order reversal:
 May 29 05:15:04 kern.crit palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
 May 29 05:15:04 kern.crit palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
 
 Unfortunatley no corefiles are being generated ;-(.
 
 I have checked and even rebuilt the (UFS1) fs in question
 from scratch. I have also seen this happen on an UFS2 on
 another machine and on a third one when running dump -L
 on a root fs.
 
 Any hints of how to proceed?

Would it be possible to setup a serial console that is logged on this 
machine
to see if it is panic'ing but failing to write out a crashdump?
   
   I'll try to arrange that. It'll take a bit since this
   box is 200 km away... 
   
   Maybe I'll find another one nearby to reproduce it...
  
  SPECIFICALLY regarding lack of crash dumps: I need to see the
  following:
  
  * cat /etc/rc.conf
  * cat /etc/fstab
  
  I may need output from other commands, but shall deal with that when I
  see output from the above.  Thanks.
 
 No problem, see below...
 
 To make a long story short, the machine dumps core perfectly
 (tested that a while ago), but not when dealing with _this_
 issue...
 
 I dump on da1s1b and savecore fetches it from there and puts
 it on /var (sitting on da0), that's faster.
 
 rc.conf (beware, rc.conf.local exists):
 ---
 rcshutdown_timeout="180"
 tmpmfs="YES"
 tmpsize="$(( `/sbin/sysctl -n hw.usermem` / 300 ))m"
 tmpmfs_flags="$tmpmfs_flags -v 1 -n"
 
 background_fsck="NO"
 
 nisdomainname="ofw.tld"
 pflog_flags="-S"
 
 syslogd_flags="-svv"
 inetd_enable="YES"
 inetd_flags="-l"
 named_flags="-S 1000"
 named_chrootdir=""
 rwhod_enable="YES"
 sshd_enable="YES"
 amd_enable="YES"
 amd_flags="-F /etc/amd.conf"
 nfs_client_enable="YES"
 nfs_access_cache="2"
 mountd_flags="-n"
 rpcbind_enable="YES"
 
 ntpdate_enable="YES"
 ntpdate_hosts="ntp"
 ntpd_enable="YES"
 ntpd_flags="-p /var/run/ntpd.pid"
 
 nis_client_enable="YES"
 nis_client_flags="-s -S ofw.tld,nis-16-1,nis-16-2"
 nis_server_flags="-n"
 nis_yppasswdd_flags="-t /var/yp/src/master.passwd -f -v"
 
 defaultrouter="192.168.16.2"
 
 keyrate="fast"
 
 sendmail_flags="-bd -q5m"
 sendmail_submit_flags="$sendmail_flags -ODaemonPortOptions=Addr=localhost"
 sendmail_msp_queue_flags="-Ac -q30m"
 sendmail_rebuild_aliases="NO"
 
 lpd_enable="YES"
 lpd_flags="-s"
 chkprintcap_enable="YES"
 dumpdev="AUTO"
 clear_tmp_X="NO"
 ldconfig_paths="/usr/local/lib"
 ldconfig_paths_aout=""
 entropy_file="/boot/entropy-file"
 
 
 rc.conf.local:
 --
 hostname="typhon.ofw.tld"
 ifconfig_msk0="inet 192.168.24.1/21"
 ifconfig_msk0_alias0="inet 192.168.24.10/32"
 
 named_enable="YES"
 nfs_server_enable="YES"
 
 nis_client_flags="-s -S ofw.tld,nis-24-1,nis-24-2"
 nis_server_enable="YES"
 
 defaultrouter="192.168.24.2"
 
 lpd_flags="-l"
 dumpdev="/dev/da1s1b"
 quota_enable="YES"
 
 
 fstab:
 --
 /dev/da0s1a   /       ufs     noatime,rw  0 1
 /dev/da0s1b   none    swap    sw          0 0
 proc          /proc   procfs  rw          0 0
 /dev/da0s1d   /usr    ufs

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-06-16 Thread Jeremy Chadwick

On Sun, Jun 16, 2013 at 11:55:38AM +0200, Andre Albsmeier wrote:
 On Sun, 16-Jun-2013 at 10:49:37 +0200, Jeremy Chadwick wrote:
  On Sun, Jun 16, 2013 at 10:02:39AM +0200, Andre Albsmeier wrote:
   On Sun, 16-Jun-2013 at 08:54:41 +0200, Jeremy Chadwick wrote:
On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
 On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
  On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
   Each day at 5:15 we are generating snapshots on various machines.
   This used to work perfectly under 7-STABLE for years but since
   we started to use 9.1-STABLE the machine reboots in about 10%
   of all cases.
   
   After rebooting we find a new snapshot file which is a bit
   smaller than the good ones and with different permissions
   It does not succeed a fsck. In this example it is the one
   whose name is beginning with s3:
   
   -r--r-   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
   -r   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
   
   After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
   I see the following LORs (mksnap_ffs starts exactly at 5:15):
   
   May 29 05:15:00 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:00 kern.crit palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
   May 29 05:15:00 kern.crit palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
   May 29 05:15:04 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:04 kern.crit palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
   May 29 05:15:04 kern.crit palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
   
   Unfortunatley no corefiles are being generated ;-(.
   
   I have checked and even rebuilt the (UFS1) fs in question
   from scratch. I have also seen this happen on an UFS2 on
   another machine and on a third one when running dump -L
   on a root fs.
   
   Any hints of how to proceed?
  
  Would it be possible to setup a serial console that is logged on 
  this machine
  to see if it is panic'ing but failing to write out a crashdump?
 
 I'll try to arrange that. It'll take a bit since this
 box is 200 km away... 
 
 Maybe I'll find another one nearby to reproduce it...

SPECIFICALLY regarding lack of crash dumps: I need to see the
following:

* cat /etc/rc.conf
* cat /etc/fstab

I may need output from other commands, but shall deal with that when I
see output from the above.  Thanks.
   
   No problem, see below...
   
   To make a long story short, the machine dumps core perfectly
   (tested that a while ago), but not when dealing with _this_
   issue...
   
   I dump on da1s1b and savecore fetches it from there and puts
   it on /var (sitting on da0), that's faster.
   
   rc.conf (beware, rc.conf.local exists):
   ---
   rcshutdown_timeout=180
   tmpmfs=YES
   tmpsize=$(( `/sbin/sysctl -n hw.usermem` / 300 ))m
   tmpmfs_flags=$tmpmfs_flags -v 1 -n
   
   background_fsck=NO
   
   nisdomainname=ofw.tld
   pflog_flags=-S
   
   syslogd_flags=-svv
   inetd_enable=YES
   inetd_flags=-l
   named_flags=-S 1000
   named_chrootdir=
   rwhod_enable=YES
   sshd_enable=YES
   amd_enable=YES
   amd_flags=-F /etc/amd.conf
   nfs_client_enable=YES
   nfs_access_cache=2
   mountd_flags=-n
   rpcbind_enable=YES
   
   ntpdate_enable=YES
   ntpdate_hosts=ntp
   ntpd_enable=YES
   ntpd_flags=-p /var/run/ntpd.pid
   
   nis_client_enable=YES
   nis_client_flags=-s -S ofw.tld,nis-16-1,nis-16-2
   nis_server_flags=-n
   nis_yppasswdd_flags=-t /var/yp/src/master.passwd -f -v
   
   defaultrouter=192.168.16.2
   
   keyrate=fast
   
   sendmail_flags=-bd -q5m
   sendmail_submit_flags=$sendmail_flags -ODaemonPortOptions=Addr=localhost
   sendmail_msp_queue_flags=-Ac -q30m
   sendmail_rebuild_aliases=NO
   
   lpd_enable=YES
   lpd_flags=-s
   chkprintcap_enable=YES
   dumpdev=AUTO
   clear_tmp_X=NO
   ldconfig_paths=/usr/local/lib
   ldconfig_paths_aout=
   entropy_file=/boot/entropy-file
   
   
   rc.conf.local:
   --
   hostname=typhon.ofw.tld
   ifconfig_msk0=inet 192.168.24.1/21
   ifconfig_msk0_alias0=inet 192.168.24.10/32
   
   named_enable=YES
   nfs_server_enable=YES
   
   nis_client_flags=-s -S ofw.tld,nis-24-1,nis-24-2

Re: system sporadically hangs on shutdown after switching to WITH_NEW_XORG

2013-06-16 Thread Jeremy Chadwick

On Sun, Jun 16, 2013 at 05:48:52PM +0200, Michiel Boland wrote:
 On 06/16/2013 17:37, Konstantin Belousov wrote:
 On Sun, Jun 16, 2013 at 05:11:15PM +0200, Michiel Boland wrote:
 Hi. Recently I switched to WITH_NEW_XORG, primarily because the stock X 
 server
 with Intel driver has some issues that make it unusable for me.
 
 The new X server and Intel driver works extremely well, so kudos to whoever 
 made
 this possible.
 
 Unfortunately, I am now experiencing random hangs on shutdown. On shutdown 
 the
 system randomly freezes after
 
 [...] syslogd: exiting on signal 15
 
 I would then expect to see 'Waiting (max 60 seconds) for system process 
 'XXX' to
 stop messages, but these never arrive.
 
 I paniced the machine in ddb, so I have a crash dump if someone want to 
 look at
 it. The crashinfo is at http://barrytown.boland.org/core.txt (I would have
 pasted it here but it is a bit verbose.)
 
 Machine has an Intel G41 chipset, with a SAMSUNG SSD 830 Series HD, running
 9.1-STABLE r251803. Serial console. GENERIC kernel, except for options DDB
 and ALT_BREAK_TO_DEBUGGER.
 
 Who knows what's going on here?
 
 I do not see anything related to i915 in the core.txt you provided.
 
 Next time the machine hangs, start with the output of ps command from
 ddb and 'show allpcpu', together with 'alltrace'.
 
 
 Ok.
 
 I appended 'thread apply all bt' from kgdb to the core.txt, maybe
 there is something interesting in there.
 
 I did notice the following
 
 Thread 17 (Thread 17):
 #0  cpustop_handler () at /usr/src/sys/amd64/amd64/mp_machdep.c:1392
 #1  0x80cbebbd in ipi_nmi_handler () at
 /usr/src/sys/amd64/amd64/mp_machdep.c:1374
 #2  0x80ccc159 in trap (frame=0x81424890) at
 /usr/src/sys/amd64/amd64/trap.c:211
 #3  0x80cb55af in nmi_calltrap () at
 /usr/src/sys/amd64/amd64/exception.S:501
 #4  0x80d0c029 in vga_txtmouse (scp=0xfe0005586600,
 x=320, y=200, on=value optimized out) at cpufunc.h:186
 Previous frame inner to this frame (corrupt stack?)
 
 Maybe the hang is caused by the removal of the text mouse cursor?
 (Just guessing here.)

vga_txtmouse comes from syscons(4).

Are you making use of vidcontrol(1) in any way to set the system console
(outside of X) to something that uses the VGA framebuffer?  There are
probably some loader.conf or rc.conf variables that control this (I do
not know).

Are you running moused(8)?  Actually, I can see quite clearly that you
are in your core.txt:

Starting ums0 moused.

Try turning that off.  Don't ask me how, because devd(8) / devd.conf(5)
might be involved.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: system sporadically hangs on shutdown after switching to WITH_NEW_XORG

2013-06-16 Thread Jeremy Chadwick

On Sun, Jun 16, 2013 at 06:01:49PM +0200, Michiel Boland wrote:
 On 06/16/2013 17:55, Jeremy Chadwick wrote:
 [...]
 
 Are you running moused(8)?  Actually, I can see quite clearly that you
 are in your core.txt:
 
 Starting ums0 moused.
 
 Try turning that off.  Don't ask me how, because devd(8) / devd.conf(5)
 might be involved.
 
 
 The moused is started by devd - I don't see a quick way of turning that off.

Comment out the relevant crap in devd.conf(5).  Search for ums
and comment out the two notify sections.
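
If editing devd.conf(5) is unappealing, another approach that may work (an assumption, not something confirmed in this thread) is to tell the moused rc script not to start for mice attached via devd. A minimal sketch, assuming the stock /etc/rc.d/moused script, which consults these rc.conf(5) variables when devd invokes it with a device name:

```sh
# /etc/rc.conf fragment; assumes the stock /etc/rc.d/moused behaviour
moused_nondefault_enable="NO"   # do not start moused for non-default (devd-attached) mice
moused_ums0_enable="NO"         # or disable it only for the ums0 device seen in core.txt
```

Either line on its own should be enough; the second is the narrower change if other mice still need moused(8).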

 As a workaround I'm trying to run a kernel with
 
  options SC_NO_SYSMOUSE
 
 to see if the hangs go away.

That's one way to do it, I guess.

Be aware that I do not use X; however, I have repeatedly seen mentioned
on these lists problems/complexities where people rely on moused(8)
to drive their mouse while inside of X (or possibly where X and
moused(8) are both simultaneously polling the mouse).  There's
apparently a very specific kind of X configuration you're supposed to
use to get proper mouse/keyboard/HAL/HID/whatever support, and tons of
people have it wrong.  I think Warren Block has some insights into
this, or could maybe help shed some light on what I'm remembering.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ACPI Warning, then hang

2013-06-13 Thread Jeremy Chadwick

On Thu, Jun 13, 2013 at 05:32:21PM -0500, Bryce Edwards wrote:
 On Mon, Jun 10, 2013 at 9:32 PM, Jeremy Chadwick j...@koitsu.org wrote:
  On Mon, Jun 10, 2013 at 09:18:47PM -0500, Bryce Edwards wrote:
  Verbose boot:
 
  https://www.dropbox.com/s/obm8rtavro68ea8/acpi-verbose.jpg
 
 
  On Mon, Jun 10, 2013 at 11:27 AM, Bryce Edwards br...@bryce.net wrote:
   On Mon, Jun 10, 2013 at 11:19 AM, John Baldwin j...@freebsd.org wrote:
   On Monday, June 10, 2013 10:35:07 am Jeremy Chadwick wrote:
   On Mon, Jun 10, 2013 at 09:18:14AM -0500, Bryce Edwards wrote:
I'm getting the following warning, and then the system locks:
   
ACPI Warning: Incorrect checksum in table [(bunch of spaces)] - 0x29,
should be 0x48
   
Here's a pic: http://db.tt/O6dxONzI
   
System is on a SuperMicro C7X58 motherboard that I just upgraded to
BIOS 2.0a, which I would like to stay on if possible.  I tried
adjusting all the ACPI related BIOS settings without success.
  
   The message in question refers to hard-coded data in one of the many
   ACPI tables (see acpidump(8) for the list -- there are many).  ACPI
   tables are stored within the BIOS -- the motherboard/BIOS vendor has
   full control over all of them and is fully 100% responsible for their
   content.
  
   It looks to me like they severely botched their BIOS, or somehow it got
   flashed wrong.
  
   You need to contact Supermicro Technical Support and tell them of the
   problem.  They need to either fix their BIOS, or help figure out what's
   become corrupted.  You can point them to this thread if you'd like.
  
   I should note that the corruption/issue is major enough that you are
   missing very key/important lines from your dmesg (after avail memory
   but before kdbX at kdbmuxX, which come from pure reliance upon ACPI.
   Lines such as:
  
   Event timer LAPIC quality 400
   ACPI APIC Table: PTLTDAPIC  
   FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
   FreeBSD/SMP: 1 package(s) x 4 core(s)
cpu0 (BSP): APIC ID:  0
cpu1 (AP): APIC ID:  1
cpu2 (AP): APIC ID:  2
cpu3 (AP): APIC ID:  3
   ioapic0 Version 2.0 irqs 0-23 on motherboard
   ioapic1 Version 2.0 irqs 24-47 on motherboard
  
   In the meantime, you can try booting without ACPI support (there should
   be a boot-up menu option for that) and pray that works.  If it doesn't,
   then your workaround is to roll back to an older BIOS version and/or 
   put
   pressure on Supermicro.  You will find their Technical Support folks 
   are
   quite helpful/responsive to technical issues.
  
   Good luck and keep us posted on what transpires.
  
   Actually, that message is mostly harmless.  All sorts of vendors ship
   tables with busted checksums that are in fact fine. :(  However, the 
   table
   name looks very odd which is more worrying.  Booting without ACPI 
   enabled
   would be a good first step.  Trying a verbose boot to capture the last
   message before the hang would also be useful.
  
   --
   John Baldwin
  
   Booting without ACPI did not work for me, although I might be able to
   hack away at lots of BIOS setting to make it work.  It didn't assign
   IRQ's to things like the storage controller, etc. soI thought it was
   probably not worth the effort.
  
   I did contact SuperMicro support as well, so we'll see what they have to 
   say.
  
   I'll get a verbose boot posted up in a bit.
 
  A screenshot of a verbose boot is insufficient; as I'm sure you noticed
  there are pages upon pages of information before the lock-up/crash.
  Those pages are what folks are interested in.
 
  Because the system is hung, I doubt hitting Scroll Lock + using
  PageUp/PageDown to go through the kernel message scrollback will work.
 
  You're going to need a serial-based console (i.e. hook something up to
  COM1 on the motherboard, and get a null modem cable to connect to
  another system where you use a serial port/terminal emulator (ex. PuTTY
  for Windows, etc.) that has a scrollback buffer which you can copy-paste
  or save.  Set your serial port for 9600 baud, 8 bits, no parity, and 1
  stop bit (9600bps, 8N1).  You'll need to have physical access to both
  systems simultaneously.
 
  At the VGA console, boot FreeBSD then escape to the loader prompt
  (ok) and issue the following commands:
 
  set boot_multicons=YES
  set boot_serial=YES
  set console=comconsole,vidconsole
  boot
 
  You should begin seeing output on the serial port, and the system will
  eventually hang/etc..  Then provide the captured output from the serial
  port here.  :-)
 
  --
  | Jeremy Chadwick   j...@koitsu.org |
  | UNIX Systems Administratorhttp://jdc.koitsu.org/ |
  | Making life hard for others since 1977. PGP 4BD6C0CB |
 
 
 I'm having a heck of a time getting the serial console working...

Come to think of it, depending on how they implement the interrupt
tie-ins for that (even with classic LPC/ISA, re: the whole IRQ

Re: Error in make buildkernel `

2013-06-10 Thread Jeremy Chadwick

On Mon, Jun 10, 2013 at 02:04:59PM +0200, Willem Jan Withagen wrote:
 I'm trying to build a stable kernel on a freshly built 8.4-Stable i386
 system.
 
 And I get:
 MAKE=make sh /usr/srcs/src9/src/sys/conf/newvers.sh GENERIC
 /usr/local/bin/svnversion
 cc -c -O -pipe  -std=c99 -g -Wall -Wredundant-decls -Wnested-externs
 -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith -Winline
 -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions
 -Wmissing-include-dirs -fdiagnostics-show-option   -nostdinc  -I.
 -I/usr/srcs/src9/src/sys -I/usr/srcs/src9/src/sys/contrib/altq -D_KERNEL
 -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common
 -finline-limit=8000 --param inline-unit-growth=100 --param
 large-function-growth=1000  -mno-align-long-strings
 -mpreferred-stack-boundary=2 -mno-mmx -mno-sse -msoft-float
 -ffreestanding -fstack-protector -Werror  vers.c
 ctfconvert -L VERSION -g vers.o
 linking kernel.debug
 ld:/usr/srcs/src9/src/sys/conf/ldscript.i386:66: syntax error
 *** Error code 1
 
 Stop in /usr/obj/usr/srcs/src9/src/sys/GENERIC.
 *** Error code 1
 
 Stop in /usr/srcs/src9/src.
 *** Error code 1
 
 Line 66 is:   .eh_frame   : ONLY_IF_RO { KEEP (*(.eh_frame)) }
 The piece of code around line 66 looks like:
 
   PROVIDE (__etext = .);
   PROVIDE (_etext = .);
   PROVIDE (etext = .);
   .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
   .rodata1: { *(.rodata1) }
   .eh_frame_hdr : { *(.eh_frame_hdr) }
   .eh_frame   : ONLY_IF_RO { KEEP (*(.eh_frame)) }
   .gcc_except_table   : ONLY_IF_RO { *(.gcc_except_table
 .gcc_except_table.*) }
   /* Adjust the address for the data segment.  We want to adjust up to
  the same address within the page on the next page up.  */
   . = ALIGN (CONSTANT (MAXPAGESIZE)) - ((CONSTANT (MAXPAGESIZE) - .) 
 (CONSTANT (MAXPAGESIZE) - 1)); . = DATA_SEGMENT_ALIGN (CONSTANT
 (MAXPAGESIZE), CONSTANT (COMMONPAGESI
 ZE));
   /* Exception handling  */
 
 Any suggestions on how to fix this??

I can't help with the actual syntax error, but from the path names
involved here, it looks like you:

1) are using an alternate location for src (/usr/srcs not /usr/src),

2) are trying to build FreeBSD 9.x on an 8.4-STABLE box
(/usr/obj/usr/srcs/src9)

Is that correct?  You might want to provide /etc/make.conf and
/etc/src.conf from this system or other details of the build framework
you might be using.  That might help/pertain to the situation.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ACPI Warning, then hang

2013-06-10 Thread Jeremy Chadwick

On Mon, Jun 10, 2013 at 09:18:14AM -0500, Bryce Edwards wrote:
 I'm getting the following warning, and then the system locks:
 
 ACPI Warning: Incorrect checksum in table [(bunch of spaces)] - 0x29,
 should be 0x48
 
 Here's a pic: http://db.tt/O6dxONzI
 
 System is on a SuperMicro C7X58 motherboard that I just upgraded to
 BIOS 2.0a, which I would like to stay on if possible.  I tried
 adjusting all the ACPI related BIOS settings without success.

The message in question refers to hard-coded data in one of the many
ACPI tables (see acpidump(8) for the list -- there are many).  ACPI
tables are stored within the BIOS -- the motherboard/BIOS vendor has
full control over all of them and is fully 100% responsible for their
content.

It looks to me like they severely botched their BIOS, or somehow it got
flashed wrong.

You need to contact Supermicro Technical Support and tell them of the
problem.  They need to either fix their BIOS, or help figure out what's
become corrupted.  You can point them to this thread if you'd like.

I should note that the corruption/issue is major enough that you are
missing very key/important lines from your dmesg (after "avail memory"
but before "kbdX at kbdmuxX"), which come from pure reliance upon ACPI.
Lines such as:

Event timer LAPIC quality 400
ACPI APIC Table: PTLTD  APIC  
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 24-47 on motherboard

In the meantime, you can try booting without ACPI support (there should
be a boot-up menu option for that) and pray that works.  If it doesn't,
then your workaround is to roll back to an older BIOS version and/or put
pressure on Supermicro.  You will find their Technical Support folks are
quite helpful/responsive to technical issues.

Good luck and keep us posted on what transpires.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ACPI Warning, then hang

2013-06-10 Thread Jeremy Chadwick

On Mon, Jun 10, 2013 at 09:18:47PM -0500, Bryce Edwards wrote:
 Verbose boot:
 
 https://www.dropbox.com/s/obm8rtavro68ea8/acpi-verbose.jpg
 
 
 On Mon, Jun 10, 2013 at 11:27 AM, Bryce Edwards br...@bryce.net wrote:
  On Mon, Jun 10, 2013 at 11:19 AM, John Baldwin j...@freebsd.org wrote:
  On Monday, June 10, 2013 10:35:07 am Jeremy Chadwick wrote:
  On Mon, Jun 10, 2013 at 09:18:14AM -0500, Bryce Edwards wrote:
   I'm getting the following warning, and then the system locks:
  
   ACPI Warning: Incorrect checksum in table [(bunch of spaces)] - 0x29,
   should be 0x48
  
   Here's a pic: http://db.tt/O6dxONzI
  
   System is on a SuperMicro C7X58 motherboard that I just upgraded to
   BIOS 2.0a, which I would like to stay on if possible.  I tried
   adjusting all the ACPI related BIOS settings without success.
 
  The message in question refers to hard-coded data in one of the many
  ACPI tables (see acpidump(8) for the list -- there are many).  ACPI
  tables are stored within the BIOS -- the motherboard/BIOS vendor has
  full control over all of them and is fully 100% responsible for their
  content.
 
  It looks to me like they severely botched their BIOS, or somehow it got
  flashed wrong.
 
  You need to contact Supermicro Technical Support and tell them of the
  problem.  They need to either fix their BIOS, or help figure out what's
  become corrupted.  You can point them to this thread if you'd like.
 
  I should note that the corruption/issue is major enough that you are
  missing very key/important lines from your dmesg (after avail memory
  but before kdbX at kdbmuxX, which come from pure reliance upon ACPI.
  Lines such as:
 
  Event timer LAPIC quality 400
  ACPI APIC Table: PTLTDAPIC  
  FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
  FreeBSD/SMP: 1 package(s) x 4 core(s)
   cpu0 (BSP): APIC ID:  0
   cpu1 (AP): APIC ID:  1
   cpu2 (AP): APIC ID:  2
   cpu3 (AP): APIC ID:  3
  ioapic0 Version 2.0 irqs 0-23 on motherboard
  ioapic1 Version 2.0 irqs 24-47 on motherboard
 
  In the meantime, you can try booting without ACPI support (there should
  be a boot-up menu option for that) and pray that works.  If it doesn't,
  then your workaround is to roll back to an older BIOS version and/or put
  pressure on Supermicro.  You will find their Technical Support folks are
  quite helpful/responsive to technical issues.
 
  Good luck and keep us posted on what transpires.
 
  Actually, that message is mostly harmless.  All sorts of vendors ship
  tables with busted checksums that are in fact fine. :(  However, the table
  name looks very odd which is more worrying.  Booting without ACPI enabled
  would be a good first step.  Trying a verbose boot to capture the last
  message before the hang would also be useful.
 
  --
  John Baldwin
 
  Booting without ACPI did not work for me, although I might be able to
  hack away at lots of BIOS setting to make it work.  It didn't assign
  IRQ's to things like the storage controller, etc. soI thought it was
  probably not worth the effort.
 
  I did contact SuperMicro support as well, so we'll see what they have to 
  say.
 
  I'll get a verbose boot posted up in a bit.

A screenshot of a verbose boot is insufficient; as I'm sure you noticed
there are pages upon pages of information before the lock-up/crash.
Those pages are what folks are interested in.

Because the system is hung, I doubt hitting Scroll Lock + using
PageUp/PageDown to go through the kernel message scrollback will work.

You're going to need a serial-based console: hook something up to
COM1 on the motherboard, and use a null modem cable to connect it to
another system running a serial port/terminal emulator (e.g. PuTTY
for Windows) that has a scrollback buffer which you can copy-paste
or save.  Set your serial port for 9600 baud, 8 bits, no parity, and 1
stop bit (9600bps, 8N1).  You'll need to have physical access to both
systems simultaneously.

At the VGA console, boot FreeBSD then escape to the loader prompt
(ok) and issue the following commands:

set boot_multicons=YES
set boot_serial=YES
set console=comconsole,vidconsole
boot

You should begin seeing output on the serial port, and the system will
eventually hang/etc..  Then provide the captured output from the serial
port here.  :-)
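
If the hang takes several attempts to capture, retyping those commands at every boot gets old; the same settings can be made persistent in /boot/loader.conf. A minimal sketch, assuming COM1 and the 9600 baud terminal settings described above (comconsole_speed is included only to make the speed explicit):

```sh
# /boot/loader.conf fragment (sketch; mirrors the loader commands above)
boot_multicons="YES"                # use more than one console at the same time
boot_serial="YES"                   # enable the serial console
comconsole_speed="9600"             # serial speed, matching the 9600bps 8N1 terminal setup
console="comconsole,vidconsole"     # serial console first, VGA console second
```

Remove the lines again once the capture is done if you don't want a permanent serial console.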

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: fxp0 interface going up/down/up/down (dhclient related?)

2013-06-09 Thread Jeremy Chadwick

On Sun, Jun 09, 2013 at 12:21:37PM +0200, Alban Hertroys wrote:
 I'm having an issue where my fxp0 interface keeps looping between DOWN/UP, 
 with dhclient requesting a lease each time in between. I think it's caused by 
 dhclient:
 
 solfertje # dhclient -d fxp0
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 send_packet: Network is down
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 DHCPACK from 109.72.40.1
 bound to 141.105.10.89 -- renewal in 7200 seconds.
 fxp0 link state up - down
 fxp0 link state down - up
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 DHCPACK from 109.72.40.1
 bound to 141.105.10.89 -- renewal in 7200 seconds.
 fxp0 link state up - down
 fxp0 link state down - up
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 DHCPACK from 109.72.40.1
 bound to 141.105.10.89 -- renewal in 7200 seconds.
 fxp0 link state up - down
 fxp0 link state down - up
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 DHCPACK from 109.72.40.1
 bound to 141.105.10.89 -- renewal in 7200 seconds.
 fxp0 link state up - down
 fxp0 link state down - up
 DHCPREQUEST on fxp0 to 255.255.255.255 port 67
 DHCPACK from 109.72.40.1
 bound to 141.105.10.89 -- renewal in 7200 seconds.
 fxp0 link state up - down
 ^C
 
 In the above test I turned off devd (/etc/rc.d/devd stop) and background dhclient 
 (/etc/rc.d/dhclient stop fxp0), and I still got the above result. There's 
 practically no time spent between up/down cycles, this just keeps going on 
 and on.
 fxp0 is the only interface that runs on DHCP. The others have static IPs.
 
 Initially I thought the issue might be caused by devd, because I have both 
 ethernet and 802.11 type NICs (2x ethernet, 1x wifi) in that system.
 
 This is 9-STABLE from yesterday.
 
 Before, I had 9-RELEASE running on this system with the same config, and that 
 worked well.

And so what I predicted begins...

The issue is described in the 8.4-RELEASE Errata Notes; stable/9 is
using the same driver version, hence you're experiencing
the same problem.  See "Open Issues":

http://www.freebsd.org/releases/8.4R/errata.html

No fix for this has been committed.  Multiple kernel folks are still
discussing where the fix should be applied (dhclient or
the fxp(4) driver), because the changes made to dhclient (that tickle
this bug) may actually affect more drivers than just fxp(4).

You can start by reading the (extremely long but very informative)
thread here.  I do urge you to read all the posts, not skim them:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073440.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/thread.html#73440

The only known workarounds at this time are:

a) Cease use of DHCP; set a static IP in rc.conf,

b) Try some of the patches mentioned within the above thread,
specifically this one:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073581.html

The patch is for head (CURRENT) so it may not patch cleanly.  If not,
you can try to work the patch in yourself/by hand, or you can ask
Yong-Hyeon or others for help.
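
Workaround (a) amounts to replacing the DHCP line for fxp0 in rc.conf with a static configuration. A minimal sketch; the addresses are placeholders from the documentation range (192.0.2.0/24), not values from this thread, so substitute whatever your provider actually assigned:

```sh
# /etc/rc.conf fragment (placeholder addresses; replace with your own)
# ifconfig_fxp0="DHCP"                                   # old setting, disabled
ifconfig_fxp0="inet 192.0.2.10 netmask 255.255.255.0"    # static address instead of DHCP
defaultrouter="192.0.2.1"                                # gateway that DHCP used to supply
```

Keep in mind that DNS servers previously handed out by DHCP then need to be listed in /etc/resolv.conf by hand.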

 I'm not sure it's related, but on the wireless interface I get a lot of:
 Jun  9 12:08:11 solfertje kernel: ath0: stuck beacon; resetting (bmiss count 
 4)

Absolutely 100% unrelated.  That issue has been around for years, and
the root cause varies tremendously.  I discussed it back in February
2011:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/061700.html

If you want to know how I solved that problem, I can tell you, but I'm
certain you won't be happy to hear what I have to say.

If you're concerned about this problem, please start another thread
discussing it.  I'm sure Adrian Chadd can provide you lots of insights,
but most of them are already in his response to my above thread/post.

 {snipping other stuff}

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: fxp0 interface going up/down/up/down (dhclient related?)

2013-06-09 Thread Jeremy Chadwick

On Sun, Jun 09, 2013 at 01:21:53PM +0200, ?ukasz Gruner wrote:
 On Sun, Jun 9, 2013, at 12:44, Jeremy Chadwick wrote:
  On Sun, Jun 09, 2013 at 12:21:37PM +0200, Alban Hertroys wrote:
   I'm having an issue where my fxp0 interface keeps looping between 
   DOWN/UP, with dhclient requesting a lease each time in between. I think 
   it's caused by dhclient:
  And so what I predicted begins...
 
 I have been suffering this issue since forever (which for me began at
 freebsd 9.0). Currently I'm at stable9.

The problem we're talking about was a direct result of this PR:

http://www.freebsd.org/cgi/query-pr.cgi?pr=166656

The commits (MFC) were made to stable/8 and stable/9 in these revisions,
on this date:

stable/9 commit: r247335 -- 2013/02/26
stable/8 commit: r247336 -- 2013/02/26

You can see the commit log/messages in the PR.

Now let's talk about versions:

FreeBSD 9.0-RELEASE came out 2012/01/12:

http://lists.freebsd.org/pipermail/freebsd-announce/2012-January/001406.html

FreeBSD 9.1-RELEASE came out 2012/12/30:

http://lists.freebsd.org/pipermail/freebsd-announce/2012-December/001448.html

So when you say the issue for you began at FreeBSD 9.0, you need to
be more specific (uname -a output would be a good start), because
otherwise to me it sounds like you're experiencing a *completely*
different problem.
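
To spell out the date arithmetic behind that statement, here is a
minimal sketch in Python.  The dates are the ones quoted above; the
names (MFC_DATE, RELEASES, release_predates_mfc) are made up purely for
illustration and are not part of any FreeBSD tool.

```python
from datetime import date

# Dates as quoted in this post.
MFC_DATE = date(2013, 2, 26)   # r247335 (stable/9) and r247336 (stable/8)
RELEASES = {
    "9.0-RELEASE": date(2012, 1, 12),
    "9.1-RELEASE": date(2012, 12, 30),
}


def release_predates_mfc(name: str) -> bool:
    """True if the release shipped before the dhclient change was merged."""
    return RELEASES[name] < MFC_DATE


for name, shipped in RELEASES.items():
    gap_days = (MFC_DATE - shipped).days
    print(f"{name}: shipped {gap_days} days before the MFC; "
          f"predates it: {release_predates_mfc(name)}")
```

Both releases predate the MFC, so a problem that was already present on
a stock 9.0-RELEASE install cannot have been introduced by that commit.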

 Much appreciated, shouldn't this be on the wiki?

What wiki?  How would people know to read it?  Using a web search engine
like Google?  That would return this mailing list thread, as well as
the ones I've referenced.

There is enough old/outdated/completely and absolutely WRONG crap on the
FreeBSD Wiki as is.  The Wiki is not the official source/list of
problems (there is no official source/list -- the mailing lists are,
and for a decade have been, as good as it gets).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: fxp0 interface going up/down/up/down (dhclient related?)

2013-06-09 Thread Jeremy Chadwick

On Sun, Jun 09, 2013 at 02:48:29PM +0200, Alban Hertroys wrote:
 On Jun 9, 2013, at 12:44, Jeremy Chadwick j...@koitsu.org wrote:
 
  On Sun, Jun 09, 2013 at 12:21:37PM +0200, Alban Hertroys wrote:
  I'm having an issue where my fxp0 interface keeps looping between DOWN/UP, 
  with dhclient requesting a lease each time in between. I think it's caused 
  by dhclient:
  
  solfertje # dhclient -d fxp0
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  send_packet: Network is down
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  DHCPACK from 109.72.40.1
  bound to 141.105.10.89 -- renewal in 7200 seconds.
  fxp0 link state up -> down
  fxp0 link state down -> up
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  DHCPACK from 109.72.40.1
  bound to 141.105.10.89 -- renewal in 7200 seconds.
  fxp0 link state up -> down
  fxp0 link state down -> up
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  DHCPACK from 109.72.40.1
  bound to 141.105.10.89 -- renewal in 7200 seconds.
  fxp0 link state up -> down
  fxp0 link state down -> up
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  DHCPACK from 109.72.40.1
  bound to 141.105.10.89 -- renewal in 7200 seconds.
  fxp0 link state up -> down
  fxp0 link state down -> up
  DHCPREQUEST on fxp0 to 255.255.255.255 port 67
  DHCPACK from 109.72.40.1
  bound to 141.105.10.89 -- renewal in 7200 seconds.
  fxp0 link state up -> down
  ^C
  
  In the above test I turned off devd (/etc/rc.d/devd stop) and the background
  dhclient (/etc/rc.d/dhclient stop fxp0), and I still got the above result.
  There's practically no time spent between up/down cycles, this just keeps
  going on and on.
  fxp0 is the only interface that runs on DHCP. The others have static IPs.
  
  Initially I thought the issue might be caused by devd, because I have both
  ethernet and 802.11 type NICs (2x ethernet, 1x wifi) in that system.
  
  This is 9-STABLE from yesterday.
  
  Before, I had 9-RELEASE running on this system with the same config, and 
  that worked well.
  
  And so what I predicted begins...
  
  The issue is described in the 8.4-RELEASE Errata Notes; the driver is
  using the same driver version as in stable/9, hence you're experiencing
  the same problem.  See Open Issues:
  
  http://www.freebsd.org/releases/8.4R/errata.html
  
  No fix for this has been committed.  It is still under discussions by
  multiple kernel folks as to where the fix should be applied (dhclient or
  the fxp(4) driver), because the changes made to dhclient (that tickle
  this bug) may actually affect more drivers than just fxp(4).
  
  You can start by reading the (extremely long but very informative)
  thread here.  I do urge you to read all the posts, not skim them:
  
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073440.html
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/thread.html#73440
 
 Goodness, and here I was hoping it was just a silly mistake I made?
 
 IIUC, the issue is a combination of:
 - dhclient now being aware of link state changes and
 - the fxp driver reinitializing for certain mode changes, such as assigning
   an IP address
 
 Which causes dhclient to think that the link state changed, fetch a new IP
 address and assign it to the fxp adapter again, causing the same link state
 change over and over again.
 
 Is that about correct?

Someone else can answer this.

  The only known workarounds at this time are:
  
  a) Cease use of DHCP; set a static IP in rc.conf,
  
  b) Try some of the patches mentioned within the above thread,
  specifically this one:
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073581.html
 
 Or c) Use DHCP with a static media setting:
 ifconfig_fxp0="DHCP media 100baseTX mediaopt full-duplex"

DO NOT DO THIS.  People who do this do not understand what it does.
It interferes with IEEE 802.3 autonegotiation and will not behave the
way you might think.  The short version:

The ONLY TIME you should be hard-setting speed and duplex in ifconfig is
when you have a managed switch on the other end where you can set the
speed/duplex for that port as well.  Otherwise, if you have autoneg on
one side and forced speed/duplex on the other, there is ABSOLUTELY NO
GUARANTEE it will work.  The behaviour at that point is generally
undefined (and chaotic).  In my experience the switch ends up picking
100/half while the FreeBSD box thinks 100/full, and you end up with an
insane collision rate + hilariously slow network speeds (but usually
only in one direction).  The behaviour varies per brand (and revision)
of switch, firmware, and other things.

So bottom line: if you're going to use autoneg, use it consistently on
both ends; if you're going to force speed/duplex, do so consistently on
both ends.  (If you don't own a managed switch, then autoneg is your
only choice.)
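
For the curious, the 100/half vs. 100/full outcome described above can
be modelled in a few lines of Python.  This is a deliberately simplified
sketch, not how any driver or switch is implemented: it assumes the
autonegotiating side falls back to half duplex when the forced side
sends no negotiation information (the usual parallel-detection result),
and real hardware may behave differently, as said above.  Port,
resolved_full_duplex and duplex_mismatch are hypothetical names.

```python
from typing import NamedTuple


class Port(NamedTuple):
    autoneg: bool             # True: negotiate; False: speed/duplex are forced
    full_duplex: bool = True  # forced duplex setting; ignored when autoneg is True


def resolved_full_duplex(me: Port, peer: Port) -> bool:
    """Return True if `me` ends up running full duplex, False for half."""
    if not me.autoneg:
        return me.full_duplex  # forced: uses exactly what it was told
    if peer.autoneg:
        return True            # both negotiate (assumes both are full-duplex capable)
    # The peer is forced and advertises nothing.  Speed can still be
    # detected from the signal, but duplex cannot, so assume half duplex.
    return False


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True if the two ends of the link disagree about duplex."""
    return resolved_full_duplex(a, b) != resolved_full_duplex(b, a)


freebsd_box = Port(autoneg=False, full_duplex=True)  # media forced via ifconfig
switch_port = Port(autoneg=True)                     # unmanaged switch: always autoneg
print(duplex_mismatch(freebsd_box, switch_port))
```

Under those assumptions the only consistent combinations are autoneg on
both ends, or the same forced setting on both ends.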

 That worked for two out of three people apparently.
 I'm not done reading this thread yet though and I noticed a patch by 
 YongHyeon that I'll test first.

The fact it didn't

Re: 8.4 and EHCI - regression?

2013-06-09 Thread Jeremy Chadwick

   # IPv6-to-IPv4 relaying (translation)
 #lena device    firmware        # firmware assist module
 
 # The `bpf' device enables the Berkeley Packet Filter.
 # Be aware of the administrative consequences of enabling this!
 # Note that 'bpf' is required for DHCP.
 device          bpf             # Berkeley packet filter
 
 # USB support
 options         USB_DEBUG       # enable debug msgs
 device          uhci            # UHCI PCI-USB interface
 device          ohci            # OHCI PCI-USB interface
 device          ehci            # EHCI PCI-USB interface (USB 2.0)
 device          usb             # USB Bus (required)
 #device         udbp            # USB Double Bulk Pipe devices
 device          uhid            # Human Interface Devices
 device          ukbd            # Keyboard
 #lena device    ulpt            # Printer
 device          umass           # Disks/Mass storage - Requires scbus and da
 #lena:load-as-module device ums # Mouse
 #lena device    urio            # Diamond Rio 500 MP3 player
 # USB Serial devices
 #lena device    u3g             # USB-based 3G modems (Option, Huawei, Sierra)
 #lena device    uark            # Technologies ARK3116 based serial adapters
 #lena device    ubsa            # Belkin F5U103 and compatible serial adapters
 #lena device    uftdi           # For FTDI usb serial adapters
 #lena device    uipaq           # Some WinCE based devices
 #lena device    uplcom          # Prolific PL-2303 serial adapters
 #lena device    uslcom          # SI Labs CP2101/CP2102 serial adapters
 #lena device    uvisor          # Visor and Palm devices
 #lena device    uvscom          # USB serial support for DDI pocket's PHS
 # USB Ethernet, requires miibus
 #lena device    aue             # ADMtek USB Ethernet
 #lena device    axe             # ASIX Electronics USB Ethernet
 #lena device    cdce            # Generic USB over Ethernet
 #lena device    cue             # CATC USB Ethernet
 #lena device    kue             # Kawasaki LSI USB Ethernet
 #lena device    rue             # RealTek RTL8150 USB Ethernet
 #lena device    udav            # Davicom DM9601E USB
 # USB Wireless
 #lena device    rum             # Ralink Technology RT2501USB wireless NICs
 #lena device    uath            # Atheros AR5523 wireless NICs
 #lena device    ural            # Ralink Technology RT2500USB wireless NICs
 #lena device    zyd             # ZyDAS zd1211/zd1211b wireless NICs
 
 # FireWire support
 #lena device    firewire        # FireWire bus code
 #device         sbp             # SCSI over FireWire (Requires scbus and da)
 #lena device    fwe             # Ethernet over FireWire (non-standard!)
 #lena device    fwip            # IP over FireWire (RFC 2734,3146)
 #lena device    dcons           # Dumb console driver
 #lena device    dcons_crom      # Configuration ROM for dcons
 
 # VirtIO support
 device          virtio          # Generic VirtIO bus (required)
 device          virtio_pci      # VirtIO PCI device
 device          vtnet           # VirtIO Ethernet device
 device          virtio_blk      # VirtIO Block device
 device          virtio_scsi     # VirtIO SCSI device
 device          virtio_balloon  # VirtIO Memory Balloon device
 
 #lenab
 # from /sys/conf/NOTES:
 
 # Optional character code conversion support with LIBICONV.
 # Each option requires their base file system and LIBICONV.
 
 options MSDOSFS_ICONV
 
 # Kernel side iconv library
 options LIBICONV
 
 # Set the amount of time (in seconds) the system will wait before
 # rebooting automatically when a kernel panic occurs.  If set to (-1),
 # the system will wait indefinitely until a key is pressed on the
 # console.
 options PANIC_REBOOT_WAIT_TIME=60   #lena was 16
 
 # from /sys/i386/conf/NOTES:
 
 # Enable Linux ABI emulation
 options COMPAT_LINUX
 
 #lenae

CC'ing freebsd-usb@, where Hans can probably help with this.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Serial terminal issues

2013-06-05 Thread Jeremy Chadwick

On Wed, Jun 05, 2013 at 09:29:56PM +0200, Alban Hertroys wrote:
 {snipping stuff that is pending or has been acknowledged}
 On Jun 5, 2013, at 2:59, Jeremy Chadwick j...@koitsu.org wrote:
  Serial port speed settings in a BIOS pertain to BIOS-level console
  redirection -- that redirection is lost the instant anything (boot
  loader, kernel, etc.) touches SMI and/or interrupts and starts
  fiddling with the serial port.
 
 That's the bit I wasn't entirely certain of - that there is no possible 
 interaction from having a BIOS console to the point where the OS takes over. 
 That's why I mentioned it.
 
 I assumed that if the BIOS had set up the serial port to 19200 baud and the 
 OS didn't specify it, that it would be possible that the speed set up in the 
 BIOS would still be in effect and that the serial terminal just incidentally 
 worked for the last 10 years because of that. Far-fetched, I know.

Not far-fetched.  Some system BIOSes with BIOS-level serial console
redirection offer what you describe -- on Supermicro systems, you can
toggle this capability in the BIOS; it's called "Continue CR After POST"
(CR stands for Console Redirection).

This is hard to explain without getting into the technicalities, so bear
with me here.  Get coffee, etc.

What this BIOS feature does is retain the SMI/interrupt mapping stuff,
so that certain calls to interrupt 0x10 (the BIOS video services
interrupt) for things like cursor movement, writing text/strings, etc.
are done on the native console (ex. VGA) *as well* as sent to the serial
port (and converted into escape sequences of your choice -- another BIOS
option, "Console Type", lets you pick between things like vt100, ANSI,
ASCII, etc.).

This option is useful for things like option ROMs or HBAs (SCSI/SAS
controllers, etc.) which print stuff *after* POST.  I'm sure you've seen
this.  With "Continue CR After POST" disabled, those types of messages
are only seen on the VGA console.

However, regardless of the setting of "Continue CR After POST", the
instant any x86 code starts tinkering with the SMI/interrupt stuff, that
functionality is lost (and cannot be restored).  In FreeBSD, this
definitely happens when the kernel starts, but AFAIR not during the
bootstraps.

Instead, the bootstraps (that is: boot0, as well as boot2/loader) have
the ability to speak to the serial port *directly*, rather than relying
on interrupt 0x10.

The "-S19200" parameter in /boot.config causes the bootstraps **very**
early on to set the serial port speed to 19200 baud.  This could cause
a problem if you have BIOS-level serial redirect set up and set to a
different speed (ex. 57600), so naturally you need to make sure
everything uses the same speed at all stages.

The "-Dh" parameter in /boot.config causes the bootstraps **very** early
on to tell FreeBSD to write data to the VGA console/text console, in
addition to the serial port (directly, not via interrupt 0x10).  For how
all that works (meaning how "-D" vs. "-Dh" behaves and at what stages
of the FreeBSD boot process), please see the FreeBSD Handbook
section 27.6.4.1:

http://www.freebsd.org/doc/en/books/handbook/serialconsole-setup.html

The handbook here is also outdated/wrong; it's talking about sio0 when
it means to refer to uart0.  "flags" for uart0 in this case will be
0x00010 (meaning uart0 is a potential serial console).

Finally, the important/key part: the -Dh capability when used in
/boot.config gets passed on to boot2/loader (so it knows to output
data to the serial port as a console), and boot2/loader **ALSO** passes
that information on to the kernel when it starts so that it knows to
print data to the serial port too.

Make sense?  :-)
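
If it helps to see the flags pulled apart, here is a rough Python sketch
of how a /boot.config line such as "-Dh -S19200" breaks down.  It is an
illustration only: the real parsing is done by the boot blocks, not by
anything like this, and parse_boot_config plus the dictionary keys are
names invented for the example.  The flag meanings follow boot(8) as I
understand it (-D dual console, -h serial console, -S speed).

```python
def parse_boot_config(line: str) -> dict:
    """Interpret the console-related flags of a /boot.config line."""
    result = {"dual_console": False, "serial_console": False, "speed": None}
    for word in line.split():
        if not word.startswith("-"):
            continue                     # e.g. a kernel path; ignored in this sketch
        flags = word[1:]
        if flags.startswith("S"):        # -S<speed>: serial port speed in baud
            result["speed"] = int(flags[1:])
            continue
        for flag in flags:
            if flag == "D":              # -D: dual console (video + serial)
                result["dual_console"] = True
            elif flag == "h":            # -h: use the serial console
                result["serial_console"] = True
    return result


print(parse_boot_config("-Dh -S19200"))
```

Whatever speed you put after -S has to match the speed used by the BIOS
redirection (if any) and by your terminal, per the paragraphs above.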

This is why I advocate using /boot.config (or you can use /boot/config
if you wish -- both in 9.1-RELEASE work (thanks des@ !)) rather than
mucking about with /boot/loader.conf -- the added advantage is that you
can actually get serial output at an earlier phase/stage, in case some
of your boot blocks don't work.  More specifically, with /boot.config
you can actually get this on the serial port (if you bang on Escape or
Enter repeatedly VERY early on in the boot process):

>> FreeBSD/i386 BOOT
Default: 0:ad(0,a)/boot/loader
boot:

But if you don't bang on keys, you won't ever see this.

Anyway, sorry for the long ramble there, but the above is how it works.
(I'm sure readers will go "My god, that is one of the best write-ups
I've seen of how the serial console/boot process stuff works, why isn't
this in the handbook!?", which I will opt out of/not respond to.)

  What you're adjusting in FreeBSD is 1) the FreeBSD boot loader touching
  the serial port, and 2) the FreeBSD kernel outputting to a serial port
  (it also initialises/sets the serial port), and 3) getty et al spawning
  a login prompt on the serial port.
 
 Regarding 2). I found some references on the internet pertaining to settings in
 the kernel config file to adjust serial console settings. Could that be what
 I'm missing?

You're referring to either

Re: TRIM support through ciss

2013-06-05 Thread Jeremy Chadwick

On Thu, Jun 06, 2013 at 02:00:36AM +0400, Dmitry Morozovsky wrote:
 Dear colleagues,
 
 I have a DB server with ciss and a bunch of disks (8 SAS + 2 Intel SATA SSD).
 
 However, this setup does not seem to support TRIM on SSDs:
 
 kstat.zfs.misc.zio_trim.bytes: 0
 kstat.zfs.misc.zio_trim.success: 0
 kstat.zfs.misc.zio_trim.unsupported: 418
 kstat.zfs.misc.zio_trim.failed: 0
 
 
 Excerpt from dmesg about SSD:
 
 da9 at ciss0 bus 0 scbus0 target 9 lun 0
 da9: <COMPAQ RAID 0 OK> Fixed Direct Access SCSI-5 device
 da9: Serial Number PACCR9SZ7KJS
 da9: 135.168MB/s transfers
 da9: Command Queueing enabled
 da9: 114439MB (234371520 512 byte sectors: 255H 32S/T 28722C)
 da9: quirks=0x1<NO_SYNC_CACHE>
 da9: Delete methods: <NONE(*)>
 
 the last line bothers me...
 
 Is there any tuning I missed?

I'm sure Steve will respond, but in the meantime...

I assume this is you running stable/9 with r251419 or newer (which just
got committed a few hours ago)?

I haven't looked at the code, but it is very, VERY important to remember
that you are *always* at the whim of 1) the controller driver (ciss(4)
in this case), and 2) the controller firmware, as to whether or not
certain pass-through commands are supported.  In this case, since you
have a SAS controller, this would be accomplished via a SCSI command
that your controller does not support.

Oh, it looks like Steve just replied and said more or less what I did.
:-)

Bottom line, as we (the royal we, I guess) have been saying for many
years now: any controller which operates in a RAID fashion and does not
support true JBOD (meaning the controller acts as a generic controller
with no concept of RAID) will almost always get in the way.  Instead,
stick with true non-RAID controllers -- and yes, I am aware choices are
limited.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: ZFS crashing while zfs recv in progress

2013-06-04 Thread Jeremy Chadwick

 the swap message?

I'm looking for a capture of "gstat -I500ms" output (you will need a
VERY long/big terminal window to capture this given how many disks you
have) while I/O is happening, as well as "top -s 1" in another window.
I would also like to see "zpool iostat -v 1" output while things are
going on, to help possibly narrow down if there is a single disk causing
the entire I/O subsystem for that controller to choke.

Next: are you using compression or dedup on any of your filesystems?
If not, have you ever in the past?

Next: could we have your loader.conf and sysctl.conf please?

My gut feeling is that if you're doing zfs {send,recv} for tank --
which you are -- multiple subsystems and busses are so incredibly
overwhelmed by all the I/O and interrupts and *everything* that it's
very hard for the swap I/O time slicer to get a decent share of time to
swap something out to swap (even worse if that controller is overwhelmed
with requests).  Worse, you're using raidz2, which means even more CPU
time + calculation overhead, which means less time for other tasks
(threads).  Everything on the system -- everything! -- is fighting for
time at multiple levels.

If you could put a swap disk on a dedicated controller (and no other
disks on it), that would be ideal.  Please do not use USB for this task
(the USB stack may introduce its own set of complexities pertaining to
interrupt usage).

If all this turns out to be an overall "system overwhelmed" situation,
my advice is to cut back on the usage.  I would STRONGLY suggest in that
case a 2nd system, and splitting the number of disks across both.

I'm really surprised given how many disks/etc. you have you didn't
choose to get an actual filer (Netapp).  I sure as hell would have.  I
really do not know why people think ZFS is a full-blown replacement for
a Netapp of this scale -- it isn't.

Anyway take what I say with a grain of salt -- really.  I'm just
throwing out thoughts/ideas as I look over everything.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Serial terminal issues

2013-06-04 Thread Jeremy Chadwick

 sure of its wiring (which is
very common -- sigh, stupid companies...), you will need to figure out
the wiring using a multimetre (continuity test is all that's needed).

 I didn't see any options in the BIOS to set the console speed (just
 address and IRQ, those are in the above). ISTR that my old mobo did
 allow to set that information, but then again, that board (Tyan Tiger)
 gave me access to the BIOS through the serial console.

This has absolutely no relevancy.

Serial port speed settings in a BIOS pertain to BIOS-level console
redirection -- that redirection is lost the instant anything (boot
loader, kernel, etc.) touches SMI and/or interrupts and starts
fiddling with the serial port.

What you're adjusting in FreeBSD is 1) the FreeBSD boot loader touching
the serial port, and 2) the FreeBSD kernel outputting to a serial port
(it also initialises/sets the serial port), and 3) getty et al spawning
a login prompt on the serial port.

I would point you to my "FreeBSD via serial console and PXE" document,
except there are one-offs specific to the PXE portions that are not
relevant to your situation.  The important part is that I've used
FreeBSD serial console for almost 16 years and have a very good
understanding of what works (including what actually works vs. what some
developers say should work; i.e. reality vs. pragmatism).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Corrupt GPT header on disk from twa array - fixable?

2013-06-03 Thread Jeremy Chadwick

On Mon, Jun 03, 2013 at 09:14:41AM +0200, Alban Hertroys wrote:
 
 On Jun 3, 2013, at 1:09, Warren Block wbl...@wonkity.com wrote:
 
  On Mon, 3 Jun 2013, Alban Hertroys wrote:
  
  Really, the easiest way would be to temporarily install the old RAID 
  controller and copy the data off the array.
  
  Well, that would mean I'd have to assemble the old server again, as the 
  controller is not compatible with the hardware in the new one. And that 
  would probably be unnecessary as well, since I already did copy the data 
  off those disks.
  
  I was just curious whether it would be possible to read that data off the 
  disks while I still have them (with their original contents) in the new 
  server in the eventuality that I _did_ forget to copy something over or 
  that something wasn't copied over correctly.
  
  I copied the data over a 100MBit ethernet link, which was the fastest 
  option I had with the old server; it had USB1 and no native SATA. Hence 
  the RAID controller, but that was on a now deprecated PCI-X channel (those 
  64-bit parallel things) and all 4 ports were in use. Not to mention that 
  the CPU was so old that it had a rather narrow margin for operating 
  temperatures and overheated several times during the copying process, 
  because rsync+sshd put a relatively high load on the CPU (An old Athlon XP 
  2000+).
  
  PCI-X cards will operate in PCI slots.  Or at least some will; I've done 
  that with an Intel network card.  The motherboard can't have components 
  that block the unused part of the edge connector, or the offending card 
  edge could be removed with extreme prejudice.
 
 Not this 3Ware card. I remember buying that particular motherboard because 
 the card wouldn't fit in the PCI slots on the board I had. There's a division 
 in those PCI-X slots opposite of where there's one in normal PCI slots and no 
 groove in the card to match the division in the PCI slot.

This is all beside the point, but to clarify: please see the following
diagram:

http://en.wikipedia.org/wiki/File:PCI_Keying.png

I recommend reading the caption under the diagram, in addition to reading
the "Mixing of 32-bit and 64-bit PCI cards in different width slots"
section:

http://en.wikipedia.org/wiki/PCI-X

It sounds like your 3Ware card is 5V PCI-X (32-bit or 64-bit is
irrelevant), and your new motherboard only supports 3.3V PCI (which is
pretty much the norm on all motherboards today when it comes to classic
PCI).

The 5V stuff is generally shunned (both with regards to PCI and PCI-X)
and is uncommon at this point in time.

You can find some server-class boards that offer this capability, such
as Supermicro's UIO slots, where you purchase the proper type of riser
(adapter) for the type of card you have (i.e. UIO-5.5V PCI-X 64-bit),
but you will not find this on consumer/desktop or even enthusiast
boards.  Example:

http://www.supermicro.com/support/resources/riser/riser.aspx

If you want to know what kind of card it is, ask 3Ware or see the user
manual.  Note that many vendors do not disclose all the relevant data in
the manual or on their site.  That info: voltage (3.3V vs. 5V vs.
universal), bus width (32-bit vs. 64-bit), and, if 64-bit, whether the
card will function in a 32-bit slot (some cards won't).
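
The voltage keying rule from the diagram boils down to a tiny lookup.
Here is a sketch in Python; it only covers the voltage notches (not bus
width or physical clearance), it says nothing about what your particular
card or board actually is, and FITS / card_fits_slot are names I made up
for the example.

```python
# Which card keyings physically fit which slot keying (voltage notches only).
FITS = {
    "3.3V": {"3.3V", "universal"},  # a 3.3V-keyed slot takes 3.3V and universal cards
    "5V": {"5V", "universal"},      # a 5V-keyed slot takes 5V and universal cards
}


def card_fits_slot(card_keying: str, slot_keying: str) -> bool:
    """True if a card with this keying can be inserted into this slot."""
    return card_keying in FITS[slot_keying]


for card in ("3.3V", "5V", "universal"):
    for slot in ("3.3V", "5V"):
        print(f"{card} card in {slot} slot: {card_fits_slot(card, slot)}")
```

A card keyed for only one voltage will not go into a slot keyed for the
other; only universal cards (two notches) fit both.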

Educational footnote: AGP is another one of those standards that went
through the same nonsense (specifically 3.3V vs. 1.5V), except the
situation was worse when some card manufacturers began selling 1.5V
cards with incorrect notchings, resulting in smoke/fire when installed
in a 3.3V slot.  I have one such card, and keep it solely as a reminder
of manufacturer/vendor idiocy.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-06-03 Thread Jeremy Chadwick

On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote:
 Ian Lepore wrote:
 On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
 Steven Hartland wrote:
Have you checked your sata cables and psu outputs?
   
Both of these could be the underlying cause of poor signalling.
 
 I can't easily check that because it is a cheap rented
 server in a remote location.
 
 But I don't believe it is bad cabling or PSU anyway, or
 otherwise the problem would occur intermittently all the
 time if the load on the disks is sufficiently high.
 But it only occurs at tags=3 and above.  At tags=2 it does
 not occur at all, no matter how hard I hammer on the disks.
 
 At the moment I'm inclined to believe that it is either
 a bug in the HDD firmware or in the controller.  The disks
 aren't exactly new, they're 400 GB Samsung ones that are
 several years old.  I think it's not uncommon to have bugs
 in the NCQ implementation in such disks.
 
 The only thing that puzzles me is the fact that the problem
 also disappears completely when I reduce the SATA rev from
 II to I, even at tags=32.
 
 
 It seems to me that you dismiss signaling problems too quickly.
 Consider the possibilities... A bad cable leads to intermittent errors
 at higher speeds.  When NCQ is disabled or limited the software handles
 these errors pretty much transparently.  When NCQ is not limited and
 there are many outstanding requests, suddenly the error handling in the
 software breaks down somehow and a minor recoverable problem becomes an
 in-your-face error.
 
 It could also be a software bug in the way CAM handles the failure
 of NCQ commands. When command queueing is used on a SCSI drive and a
 queued command fails only that command fails. A queued command
 failure on a SATA device fails ALL currently queued commands. I've
 not looked at the code but do the SATA CAM drivers do the right
 thing here?

Quoting T13/2015-D ATA8-ACS2 WD spec:

If an error occurs while the device is processing an NCQ command, then
the device shall return command aborted for all NCQ commands that are in
the queue and shall return command aborted for any new commands, except
a READ LOG EXT command requesting log address 10h, until the device
completes a READ LOG EXT command requesting log address 10h (i.e.,
reading the NCQ Command Error log) without error.

While I can't easily provide an answer to your question, I can tell you
that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for
certain scenarios (the code is in function ahci_issue_recovery()).
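
To make the quoted spec paragraph a bit more concrete, here is a toy
Python model of the device-side behaviour it describes.  This is only my
reading of that one paragraph, not a model of ahci(4) or of CAM, and the
class and method names (NcqDevice, submit_ncq, fail, read_log_ext) are
invented for the example.

```python
NCQ_ERROR_LOG = 0x10  # log address named in the spec text above


class NcqDevice:
    """Toy model: one NCQ error aborts the whole queue until log 10h is read."""

    def __init__(self):
        self.queue = []             # tags of outstanding NCQ commands
        self.error_pending = False  # True from an NCQ error until log 10h is read

    def submit_ncq(self, tag: int) -> str:
        if self.error_pending:
            return "aborted"        # new commands are aborted in the error state
        self.queue.append(tag)
        return "queued"

    def fail(self, tag: int) -> list:
        """Command `tag` hits an error; returns every tag that gets aborted."""
        if tag not in self.queue:
            raise ValueError("tag is not outstanding")
        aborted = list(self.queue)  # ALL queued commands are aborted, not just one
        self.queue.clear()
        self.error_pending = True
        return aborted

    def read_log_ext(self, log_address: int) -> str:
        if log_address == NCQ_ERROR_LOG:
            self.error_pending = False  # reading the NCQ Command Error log clears the state
            return "ok"
        return "aborted" if self.error_pending else "ok"


dev = NcqDevice()
for tag in (0, 1, 2):
    dev.submit_ncq(tag)
print(dev.fail(1))                      # tags aborted by the single failure
print(dev.submit_ncq(3))                # rejected until the log is read
print(dev.read_log_ext(NCQ_ERROR_LOG))
print(dev.submit_ncq(3))                # accepted again
```

The point being: with more tags outstanding, a single error takes more
commands down with it, and the host has to do the READ LOG EXT dance
before it can retry any of them.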

The one person who can answer this question is mav@, who is now CC'd.

 Less commands queued makes it less likely that multiple commands
 will be in progress when a failure occurs.  A lower link rate also
 makes you more immune to signal failures.

He isn't seeing SATA-level signal/link failure; the AHCI driver would
complain about that, and those messages aren't there.  Unless, of
course, those messages are only visible when verbose booting is enabled
(I hope not).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: 9.1-current disk throughput stalls ?

2013-06-03 Thread Jeremy Chadwick

 that the system clock slews badly - machine time drops
 behind wall clock time.  Something is locking the clock update off.
 
 (Hmmm, I see I'm running a pre-5000/feature flags ZFS pool, FWTW.
 I'll run zpool upgrade, my bad.)

1. There is no such thing as 9.1-CURRENT.  Either you meant 9.1-STABLE
(what should be called stable/9) or -CURRENT (what should be called
head).

2. Is there some reason you excluded details of your ZFS setup?  "zpool
status" would be a good start.

3. Do any of your filesystems/pools have ZFS compression enabled, or
have in the past?

4. Do any of your filesystems/pools have ZFS dedup enabled, or have in
the past?

5. Does the problem go away after a reboot?

6. Can you provide "smartctl -x" output for both ada0 and ada1?  You will
need to install ports/sysutils/smartmontools for this.  The reason I'm
asking for this is there may be one of your disks which is causing I/O
transactions to stall for the entire pool (i.e. single point of
annoyance).

7. Can you remove ZFS from the picture entirely (use UFS only) and
re-test?  My guess is that this is ZFS behaviour, particularly the ARC
being flushed to disk, and your disks are old/slow.  (Meaning: you have
16GB RAM + 4 core CPU but with very old disks).

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: 9.1-current disk throughput stalls ?

2013-06-03 Thread Jeremy Chadwick

On Mon, Jun 03, 2013 at 03:48:30PM -0600, Ross Alexander wrote:
 On Mon, 3 Jun 2013, Jeremy Chadwick wrote:
 
 1. There is no such thing as 9.1-CURRENT.  Either you meant 9.1-STABLE
 (what should be called stable/9) or -CURRENT (what should be called
 head).
 
 I wrote:
 The oldest kernel I have that shows the syndrome is -
 
 FreeBSD aukward.bogons 9.1-STABLE FreeBSD 9.1-STABLE #59 r250498:
 Sat May 11 00:03:15 MDT 2013
 toor@aukward.bogons:/usr/obj/usr/src/sys/GENERIC  amd64
 
 See above.  You're right, I shouldn't post after a 07:00 dentist's
 appt while my spouse is worrying me about the ins adjustor's report
 on the car damage :(.  Hey, I'm very fallible.  I'll try harder.
 
 2. Is there some reason you excluded details of your ZFS setup?
 zpool status would be a good start.
 
 Thanks for the useful hint as to what info you need to diagnose.
 
 One of the machines ran a 5 drive zraid-1 pool (Mnemosyne).
 
 Another was a 2 drive gmirror, in the simplest possible gpart/gmirror setup.
 (Mnemosyne-sub-1.)
 
 The third is a 2 drive ZFS raid-1, again in the simplest possible
 gpart/gmirror manner (Aukward).
 
 The fourth is a conceptually identical 2 drive ZFS raid-1, swapping
 to a zvol (Griffon.)
 
 If you look on the FreeBSD wiki, the pages that say bootable zfs
 gptzfsboot and bootable mirror -
 
 https://wiki.freebsd.org/RootOnZFS
 http://www.freebsdwiki.net/index.php/RAID1,_Software,_How_to_setup
 
 Well, I just followed those in cookbook style (modulo device and pool
 names).  Didn't see any reason to be creative; I build for
 reliability, not performance.
 
 Aukward is gpart/zfs raid-1 box #1:
 
 aukward:/u0/rwa  ls -l /dev/gpt
 total 0
 crw-r-----  1 root  operator  0x91 Jun  3 10:18 vol0
 crw-r-----  1 root  operator  0x8e Jun  3 10:18 vol1
 
 aukward:/u0/rwa  zpool list -v
 NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
 ult_root      111G   108G  2.53G    97%  1.00x  ONLINE  -
   mirror      111G   108G  2.53G      -
     gpt/vol0     -      -      -      -
     gpt/vol1     -      -      -      -
 
 aukward:/u0/rwa  zpool status
   pool: ult_root
  state: ONLINE
   scan: scrub repaired 0 in 1h13m with 0 errors on Sun May  5 04:29:30 2013
 config:
 
         NAME          STATE     READ WRITE CKSUM
         ult_root      ONLINE       0     0     0
           mirror-0    ONLINE       0     0     0
             gpt/vol0  ONLINE       0     0     0
             gpt/vol1  ONLINE       0     0     0
 
 errors: No known data errors
 
 (Yes, that machine has no swap.  Has NEVER had swap, has 16 GB and
 uses maybe 10% at max load.  Has been running 9.x since prerelease
 days, FWTW.  The ARC is throttled to 2 GB; zfs-stats says I never get
 near using even that.  It's just the box that drives the radios,
 a ham radio hobby machine.)
 
 Griffon is also gpart/zfs raid-1 -
 
 griffon:/u0/rwa  uname -a
   FreeBSD griffon.cs.athabascau.ca 9.1-STABLE FreeBSD 9.1-STABLE #25 r251062M:
   Tue May 28 10:39:13 MDT 2013
   t...@griffon.cs.athabascau.ca:/usr/obj/usr/src/sys/GENERIC
   amd64
 
 griffon:/u0/rwa  ls -l /dev/gpt
 total 0
 crw-r-----  1 root  operator  0x7b Jun  3 08:38 disk0
 crw-r-----  1 root  operator  0x80 Jun  3 08:38 disk1
 crw-r-----  1 root  operator  0x79 Jun  3 08:38 swap0
 crw-r-----  1 root  operator  0x7e Jun  3 08:38 swap1
 
 and the pool is fat and happy -
 
 griffon:/u0/rwa  zpool status -v
   pool: pool0
  state: ONLINE
   scan: none requested
 config:
 
         NAME           STATE     READ WRITE CKSUM
         pool0          ONLINE       0     0     0
           mirror-0     ONLINE       0     0     0
             gpt/disk0  ONLINE       0     0     0
             gpt/disk1  ONLINE       0     0     0
 
 errors: No known data errors
 
 Note that swap is through ZFS zvol;
 
 griffon:/u0/rwa  cat /etc/fstab
 # Device               Mountpoint      FStype  Options         Dump    Pass#
 #
 #
 /dev/zvol/pool0/swap   none            swap    sw              0       0
 
 pool0                  /               zfs     rw              0       0
 pool0/tmp              /tmp            zfs     rw              0       0
 pool0/var              /var            zfs     rw              0       0
 pool0/usr              /usr            zfs     rw              0       0
 pool0/u0               /u0             zfs     rw              0       0
 
 /dev/cd0               /cdrom          cd9660  ro,noauto       0       0
 /dev/ada2s1d           /mnt0           ufs     rw,noauto       0       0
 /dev/da0s1             /u0/rwa/camera  msdosfs rw,noauto       0       0
 
 The machine has 32 GB and never swaps.  It runs virtualbox loads, anything
 from one to forty virtuals (little OpenBSD images.)  Load is always light.
 
 As for the zraid-5 box (Mnemosyne), I first replaced the ZFS pool with
 a simple gpart/gmirror

Re: 9.1-current disk throughput stalls ?

2013-06-03 Thread Jeremy Chadwick

On Mon, Jun 03, 2013 at 03:34:26PM -0700, Jeremy Chadwick wrote:
 7. ZFS setup is a mirror (RAID-1-like),

Should have referenced [2].

 12. Rolling back to 8.4-STABLE (date/build unknown) apparently fixes
 your issue (I would appreciate you running the system for 72 hours
 before making this statement, and doing the *exact same things* on it
 that cause the problem with 9.1-STABLE) [2]

I should have used the word "exacerbate" instead of "cause".

 v) I really wish you would not have rolled this system back to
 8.4-STABLE.  For anyone to debug this, we need the system in a
 consistent state.  Changing kernels/etc. 

User error while using vim (I have an awful tendency to nuke entire
lines when switching between input mode vs. navigation mode); last line
should read "Changing kernels/etc. in the middle of troubleshooting a
problem you ask for assistance with makes things very difficult."  (And
I say that knowing that rolling back as a form of testing is good, since
it can help narrow things down to a specific version or release, i.e. a
software problem.)

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Corrupt GPT header on disk from twa array - fixable?

2013-06-02 Thread Jeremy Chadwick

 disks.

I think you're missing what Warren is telling you, because you have
multiple things going on/complexities to deal with simultaneously.

You haven't provided any details about your gmirror setup either.  All
we know at this point:

  GEOM_MIRROR: Device mirror/boot launched (2/2).
  GEOM_MIRROR: Device mirror/swap launched (2/2).
  GEOM_MIRROR: Device mirror/root launched (2/2).

My gut feeling is ada2 and ada3 make up the mirror, and the mirror is at
the disk level (ada2 and ada3).  I'm basing this on past evidence
presented in the thread, and having to make assumptions.  No gmirror
status output = we have to make assumptions.

Now, what Warren is telling you: gmirror + GPT do not play well
together.  This is a design flaw** on the part of gmirror.  If you want
to use gmirror with disks using GPT, your only solutions are to mirror
the partitions (adaXpX) and not the disk (adaX), which has its own set
of caveats, or to use the MBR scheme (and if these are 4K-sector disks,
or you plan on using those, you're even more screwed).  I will not bring
ZFS into this discussion since that also opens up a can of worms -- I'm
trying to stay focused.

The errors you see on ada4 and ada5 about the backup GPT header can be
dealt with in a different manner.

But for (again, assuming) ada2 and ada3, you will see GPT backup header
corruption messages indefinitely because of the above flaw.

** -- I will not get into a debate about terminology.  I am aware of the
history (which came first), and so on.  It's a flaw.  Linux md had the
same problem when GPT was introduced, and it has since been
fixed/addressed.
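
For readers of the archive, here is a rough sketch of why a whole-disk
mirror collides with GPT.  It assumes the commonly documented layout:
gmirror keeps its metadata in the last sector of the provider, and GPT
expects its backup header in the last LBA of whatever device it sits on.
The function names and the 1000-sector disk are made up for
illustration; nothing here reads or writes a real disk.

```python
# Illustrative model only: sector arithmetic, no disk access.
# Assumption: gmirror metadata lives in the provider's last sector and the
# GPT backup header lives in the last LBA of the device GPT was written to.

def whole_disk_mirror_layout(disk_sectors):
    """Return where things land when GPT is created on top of a mirror
    that was built from the whole disk (adaX rather than adaXpX)."""
    last_lba = disk_sectors - 1
    # The mirror device is one sector smaller than the raw disk, because
    # the final sector is reserved for gmirror's own metadata.
    mirror_sectors = disk_sectors - 1
    return {
        "gmirror_metadata_lba": last_lba,
        # GPT, seeing only the mirror, writes its backup header here:
        "backup_gpt_written_lba": mirror_sectors - 1,
        # Anything inspecting the raw disk looks for it here instead:
        "backup_gpt_expected_lba": last_lba,
    }


def backup_header_misplaced(disk_sectors):
    """True when the raw disk's last LBA does not hold the GPT backup."""
    layout = whole_disk_mirror_layout(disk_sectors)
    return layout["backup_gpt_written_lba"] != layout["backup_gpt_expected_lba"]


if __name__ == "__main__":
    print(whole_disk_mirror_layout(1000))
    print(backup_header_misplaced(1000))
```

Under those assumptions the raw disk always looks like it has a bad
backup header, which fits the messages never going away on whole-disk
mirror members.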

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: System doesn't dump

2013-05-29 Thread Jeremy Chadwick

On Wed, May 29, 2013 at 08:41:38AM +0200, Dominic Fandrey wrote:
 I have a number of actions that reliably panic the system, such as
 performing shutdown -p (yes I'm booting into an inconsistent file
 system every time). Both with my notebook and my workstation.
 
 However I cannot get the system to dump.
 
 dumpdir=/var/crash
 and I've tried ada0s2b, /dev/ada0s2b, label/5swap, /dev/label/5swap and AUTO
 for dumpdev to no avail.

 The swap partition is 16g, the machines have 8g RAM and there's plenty
 of hard disk space available for /var/crash.
 
 I'm looking for that secret, undocumented trigger, that makes the
 system dump if a panic occurs. Once upon a time dumping just worked
 if the swap partition was large enough. I miss those olden days.

Foremost: the fact you did not disclose your FreeBSD version (and SVN
rev if you have it) nor architecture is disappointing.  It matters more
than you think.  Please disclose it.

Onward ho...

If you have VGA console access, try dropping to ddb and issuing the
command "call doadump" (possibly preceded by "panic").

If you have serial console access, there are ways to drop to ddb but it
depends on your kernel config (look for BREAK_TO_DEBUGGER and
ALT_BREAK_TO_DEBUGGER in /sys/conf/NOTES).  "Break" with serial, by the
way, means a serial-level break signal (often why I prefer
ALT_BREAK_TO_DEBUGGER).

After doing "call doadump" you should definitely see the kernel dumping
memory to swap (it gives a progress indicator of sorts).  Google for the
phrase "call doadump" and look at some of the results to get an idea of
what the output normally is during that phase, for comparison.

If you don't see such, I'm sure many of the kernel folks here can help
figure out why.

See sysctl debug.ddb.scripting.scripts for what should get automatically
done on a panic.  This may or may not be affected by ddb_enable=yes in
rc.conf (which mandates DDB being enabled in your kernel) -- I can't
remember though, so someone else may want to comment.

If your issue is that the kernel actually *does* dump memory to swap but
that on boot-up savecore(8) doesn't recover the memory dump and populate
relevant files in /var/crash: that's a separate issue that has been
discussed for probably 10 years or longer with (to my knowledge) no
definitive explanation.  Theories presented (going off of memory here)
were that something ended up writing over parts of the panic
metadata on the swap disk/slice/etc. and thus savecore(8) finds
nothing.  This is why rc scripts/etc. have to make sure to look for the
swap panic metadata and run savecore(8) **before** issuing dumpon(8).

My opinion, others' may vary:

Stick with using dumpdev=auto in rc.conf, assuming you have a
/etc/fstab entry of swap somewhere.  Swap should ideally be a
partition or slice, not something abstracted out by other layers (see
above paragraph for why I advocate that, but my additional opinion is
that when it comes to getting a kernel dump and system configurations,
KISS principle applies heavily.  If your system is crashing, the last
thing you want to deal with is why you can't get a kernel dump -- you
could spend more time doing that than you do getting the panic info +
debugging the actual crash), but again, this is my own opinion and there
are legitimate other opinions as well -- I just follow what I do because
I know it works.

Likewise I always get wary of people's setups when I start seeing
labels mentioned.  *waves cane*  Screw all this newfandangled stuff.
:-)

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Jeremy Chadwick

 on how the
user set up the system.  In this case, the Samsung HD403LJ is supposedly
a 512-byte sector drive, but the drive probably complies with an older
ATA specification and thus only provides the logical sector size in ATA
IDENTIFY output, thus the system must assume physical=logical
(camcontrol and smartmontools will both say something to the effect of
512 bytes logical/physical).

I would appreciate the following:

1. smartctl -x {ada0,ada1} output using a recent version of
smartmontools (6.1 if possible please),

2. camcontrol identify {ada0,ada1} -v output (note the -v),

3. If you are running smartd(8) or not,

4. pciconf -lvbc output.

Anecdotal story:

A lot of people forget the infamous nVidia nForce 4 vs. Maxtor NCQ issue
that circulated PC enthusiast sites during the mid-2000s.  Neither
company wanted to own up to the problem, blaming each other instead.
There was never any official statement made as to where the problem was,
only that nVidia updated their nForce 4 controller drivers with some
sort of workaround (details were not disclosed), and Maxtor also quietly
added a document to their website stating that you could get a firmware
from Technical Support that would address the problem as well.  I had
a combination of the two at the time, which is why I remember it.  Still to
this day nobody knows who was really responsible.  I won't get into the
whole political/societal aspects of why vendors always blame one another
rather than solve real problems.

There is no way at this time (in real-time or via loader.conf) to
disable NCQ within the AHCI driver.  It is possible to add an entry to
the AHCI quirks table for your controller that sets AHCI_Q_NONCQ, if you
want to try that.  I can give you a patch for that, but I need to see
the output from the above (4) commands first -- it may not be necessary
to try, depending on the results.

I have probably left out key/important information within this mail,
which is an indicator of how tired I have grown of seeing it come up.
:-(

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: SunFire X2200 ilo's bge1 DOWN/UP

2013-05-28 Thread Jeremy Chadwick

On Tue, May 28, 2013 at 09:28:00AM +0300, Daniel Braniss wrote:
  On Mon, May 27, 2013 at 10:59:28AM +0300, Daniel Braniss wrote:
On Fri, May 24, 2013 at 05:31:13PM +0300, Daniel Braniss wrote:
 hi, after upgrading to 9.1-stable, this particular hardware - SunFire 
 X2200,

Show me dmesg(bge(4) and brgphy(4) only) and 'ifconfig bge1' output.

   
   bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdff-0xfdff,0xfdfe-0xfdfe irq 17 at device 4.0 on pci6
   bge0: CHIP ID 0x9003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
   miibus2: <MII bus> on bge0
   brgphy0: <BCM5714 1000BASE-T media interface> PHY 1 on miibus2
   brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
   bge0: Ethernet address: 00:1b:24:5d:5b:bd
   bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdfc-0xfdfc,0xfdfb-0xfdfb irq 18 at device 4.1 on pci6
   bge1: CHIP ID 0x9003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
   miibus3: <MII bus> on bge1
   brgphy1: <BCM5714 1000BASE-T media interface> PHY 1 on miibus3
   brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
   bge1: Ethernet address: 00:1b:24:5d:5b:be
   
   sf-10 ifconfig bge1
   bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
           options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
           ether 00:1b:24:5d:5b:be
           nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
           media: Ethernet autoselect (100baseTX full-duplex)
           status: active
   
  
  Because bge1 is not UP, I wonder how you get link UP/DOWN events.
  Do you have some network script run by cron?
 
 no scripts.
 this port is shared with the ILO/IPMI, and back in March you fixed a problem
 that it was hanging soon after it was initialized by the driver,
 (r248226 - but I'm not sure if it was ever MFC'ed).
 Initialy I thought it could be caused by connections to it from other
 hosts (either via the web, or ssh) so I killed them, but it didn't help.
 without that patch the connection fails, and I don't see any DOWN/UP.

Two things:

1. r248226 in head was MFC'd to stable/9 as r248858.  Validation:

http://svnweb.freebsd.org/base/stable/9/sys/dev/bge/if_bge.c?view=log

So the answer: whether or not you have that MFC in stable/9 depends on
what SVN rev your kernel is.

2. Is there some way to verify that the ASF/iLO/IPMI bits (i.e. the IPMI
firmware itself) are not shutting down bge1's PHY intentionally?  Unless
the IPMI module chooses to log something useful (e.g. "I'm doing this"),
I'm not sure how you'd figure that out.
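
Regarding point 1, a small sketch of the revision comparison, for
anyone reading this in the archive.  It assumes the kernel was built
from stable/9 and that "uname -a" includes the usual rNNNNNN tag; SVN
revision numbers are repository-wide, so on the same branch any
revision at or above r248858 should contain the MFC.  The helper names
and the sample strings are hypothetical.

```python
import re

# stable/9 revision that merged r248226 from head, per the message above.
MFC_REV = 248858


def kernel_svn_rev(uname_text):
    """Pull the SVN revision (e.g. r250498 or r251062M) out of uname output.

    Returns None when no revision tag is present.
    """
    match = re.search(r"\br(\d{5,})M?\b", uname_text)
    return int(match.group(1)) if match else None


def has_mfc(uname_text, mfc_rev=MFC_REV):
    """True/False if the kernel revision is at/after the MFC, None if unknown.

    Only meaningful when the kernel really was built from stable/9.
    """
    rev = kernel_svn_rev(uname_text)
    if rev is None:
        return None
    return rev >= mfc_rev


if __name__ == "__main__":
    # Hypothetical uname strings, for illustration only.
    newer = "FreeBSD host.example 9.1-STABLE FreeBSD 9.1-STABLE #0 r250000M: GENERIC amd64"
    older = "FreeBSD host.example 9.1-STABLE FreeBSD 9.1-STABLE #0 r248000: GENERIC amd64"
    print(has_mfc(newer), has_mfc(older))
```

A kernel built without SVN metadata prints no revision at all, in which
case the only reliable check is the source tree the kernel came from.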

Other question: is there any correlation between the amount of time that
goes by between events with, say, ARP/MAC address expiry in arp -a?  I
mention this because I know some of the ASF methods have historically
shown two MAC addresses on the same physif, and I can see how this might
confuse some stacks.

<rant>
That piggybacking crap never should have been invented.  All it has
done is cause problems for every OS I know of (including Windows) since
its inception, and is also exactly why today almost all vendors I've
seen provide a dedicated NIC and RJ45 port for the iLO/IPMI interface.
It's an admission the piggybacking method doesn't work.  And may it rot
in hell for all I care, while simultaneously feeling very sorry for
those who have to suffer/deal with it.

This is just another reason why I've always been very picky about what
hardware I'd buy for server deployments.  Vendors never actually
disclose this crap until you've shelled out money for the hardware, by
which point it's too late and you're suffering.  Really great model --
for the pocketbook.  :/
</rant>

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: SunFire X2200 ilo's bge1 DOWN/UP

2013-05-28 Thread Jeremy Chadwick

On Mon, May 27, 2013 at 11:49:31PM -0700, Jeremy Chadwick wrote:
 Other question: is there any correlation between the amount of time that
 goes by between events with, say, ARP/MAC address expiry in arp -a?  I
 mention this because I know some of the ASF methods have historically
 shown two MAC addresses on the same physif, and I can see how this might
 confuse some stacks.

Never mind -- I thought about this more, and it's irrelevant.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: SunFire X2200 ilo's bge1 DOWN/UP

2013-05-28 Thread Jeremy Chadwick

On Tue, May 28, 2013 at 10:57:22AM +0300, Daniel Braniss wrote:
 
 [...]
  1. r248226 in head was MFC'd to stable/9 as r248858.  Validation:
  
  http://svnweb.freebsd.org/base/stable/9/sys/dev/bge/if_bge.c?view=log
  
  So the answer: whether or not you have that MFC in stable/9 depends on
  what SVN rev your kernel is.
 
 I do a svnsync then I convert to mercurial so from the svn logs I see that
 the highest rev number is 250960.
 
 [...]
  rant
  That piggybacking crap never should have been invented.  All it has
  done is cause problems for every OS I know of (including Windows) since
  its inception, and is also exactly why today almost all vendors I've
  seen provide a dedicated NIC and RJ45 port for the iLO/IPMI interface.
  It's admission the piggybacking method doesn't work.  And may it rot
  in hell for all I care, while simultaneously feeling very sorry for
  those who have to suffer/deal with it.
  
  This is just another reason why I've always been very picky about what
  hardware I'd buy for server deployments.  Vendors never actually
  disclose this crap until you've shelled out money for the hardware, by
  which point it's too late and you're suffering.  Really great model --
  for the pocketbook.  :/
  /rant
 
 I couldn't agree more!
 
 [...]
 
 in the case of the SunFire X2200, it has 4 bge ports, the
 2nd, bge1, is only used by the ilo, it's not enabled (UP'ed),
 it doesn't have an interrupt assigned, it's, as far as I can tell,
 just anoying to have the DOWN/UP messages - unless something more sinester
 is lurking.

Does output from ps -auxH | grep kernel/bge show anything for
bge1?

What about vmstat -i -a (you might be surprised about the -a flag and
what shows up compared to just using -i).  Gut feeling says it will show
up there.  (See vmstat(8) for what -a does)

Possibly interrupt generation isn't what's triggering the bge(4)
device to see link going up/down; maybe this is done via some memory
mapped I/O, which would explain why vmstat -i shows nothing for bge1
(no interrupts ever generated).

That doesn't change the fact that the driver still is being told via
some means that link is going up/down.

Just a general FYI (probably not relevant here too much, but I often
have to point it out for younger SAs (not saying anyone here is one,
but the list is archived...)): there is a very distinct difference
between a link being physically up/down vs. administratively up/down.

With *IX ifconfig, the social assumption is that there's a 1:1
correlation between those (especially with Ethernet devices), when in
reality it depends on the device driver and all subsystems in between.
I remember quite clearly on some OSes (can't remember if BSD or Linux or
Solaris) where ifconfig xxx down on certain devices would still result
in packets being passed across xxx.  This used to shock me when I was
younger, but nowadays doesn't because I have a better understanding of
why.
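
To make the administrative-vs-physical distinction concrete, here is a
sketch that decodes the hex "flags=" value ifconfig prints.  The bit
values are, to the best of my recollection, the ones in FreeBSD's
net/if.h, and only a handful are listed; treat the table as an
assumption and check the header on your own system.  The value 0x8802
is the one from the bge1 output earlier in this thread.

```python
# Subset of interface flag bits (assumed to match FreeBSD's net/if.h).
IFF_FLAGS = {
    0x0001: "UP",         # administratively up
    0x0002: "BROADCAST",
    0x0040: "RUNNING",    # driver resources allocated
    0x0800: "SIMPLEX",
    0x8000: "MULTICAST",
}


def decode_flags(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(IFF_FLAGS.items()) if value & bit]


def admin_up(value):
    """True when the interface is administratively up (the UP bit is set)."""
    return bool(value & 0x0001)


if __name__ == "__main__":
    flags = 0x8802  # from "bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST>"
    print(decode_flags(flags), admin_up(flags))
```

Note that the same ifconfig output reported "status: active" on the
media line while the UP bit was clear: the PHY had link even though the
interface was administratively down.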

ifconfig is just a generic tool that interfaces with a lot of things and
tries to do too much, in my opinion.  On BSD we tend to cram as much
crap into ifconfig as humanly possible, while on other OSes separate
per-device tools/utilities have been developed to segregate the
intended behaviours/desires.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Swap Warning Message?

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 07:55:20AM -0500, Michael Gass wrote:
 Updated 9.1 to 9 stable on an old PII with 256 MB of memory.
 (FreeBSD runs fine on this machine).  After updating have
 been getting the following warning on startup:
 
 warning: total configured swap (524288 pages) exceeds maximum recommended
 amount (497056 pages).
 warning: increase kern.maxswzone or reduce amount of swap space.
 
 I allocated 2.0 GB of swap when I installed.  This was not a problem
 in the past. 
 
 Should I ignore this warning or do I need to do something?

Taken from my /boot/loader.conf:


# Set kern.maxswzone to 0 to squelch "total configured swap exceeds
# maximum recommended amount" warning, even with maxpages/2 fix.
# http://lists.freebsd.org/pipermail/freebsd-stable/2012-August/thread.html#69301
#
kern.maxswzone=0


Given the small amount of memory on your system, I would suggest using
the above /boot/loader.conf setting, since your system is quite likely
to make use of lots of swap; decreasing swap space in your case
seems downright silly.
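
For scale, the page counts in the warning can be converted to sizes
with a few lines of Python.  This assumes 4096-byte pages (the normal
page size on i386, which a PII would be running); the function name is
just for illustration.

```python
PAGE_SIZE = 4096  # bytes per page; assumed (standard on i386/amd64)


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to mebibytes."""
    return pages * page_size / (1024 * 1024)


configured = 524288       # "total configured swap" from the warning
recommended_max = 497056  # "maximum recommended amount" from the warning

if __name__ == "__main__":
    print(pages_to_mib(configured))                    # configured swap, MiB
    print(pages_to_mib(recommended_max))               # recommended ceiling, MiB
    print(pages_to_mib(configured - recommended_max))  # overshoot, MiB
```

With those numbers the configured swap is 2048 MiB against a
recommended ceiling of about 1941.6 MiB, i.e. the 2 GB of swap
overshoots by roughly 106 MiB.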

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 08:18:33PM -0400, Michael L. Squires wrote:
 I've just tested 8.4-RC3 using a different Supermicro 1U box with a fresh
 installation of 8.4-RC3.  I had problems with the installation, wouldn't
 boot until I used a Windows 98 FDISK to write a master boot record
 (no idea why; this system uses an Adaptec SATA 1.5 6-channel PCI-X
 board with two
 drives in RAID 1).
 
 Using the em0 interface there are no problems with DHCP; when I
 switch to the fxp0 interface the interface starts going up/down in
 the same manner as reported.
 
 The problem appears associated with world, not with the kernel (running
 the 8.4 kernel with the 8.3 world does not have this problem).
 
 This motherboard is an X5DPL-iGM with 2 Xeon 2.8GHz CPUs and 4 GB of RAM.
 The other unit (an earlier board) has a Serverworks chipset with a single
 Xeon CPU but also with a 100Mbit Intel Pro100 Ethernet port and a 1000Mbit
 Intel Pro1000 Ethernet port.
 
 This unit isn't doing anything useful, so testing isn't a problem.

Mike, Yong-Hyeon asked you a very important question which you didn't
answer:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073458.html

If you assign a static IP address, does fxp0 behave properly?

I'm also re-adding Yong-Hyeon to the CC list here.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 09:21:17PM -0400, Glen Barber wrote:
 On Thu, May 23, 2013 at 06:09:43PM -0700, Jeremy Chadwick wrote:
  On Thu, May 23, 2013 at 08:18:33PM -0400, Michael L. Squires wrote:
   I've just tested 8.4-RC3 using a different Supermicro 1U box with a fresh
   installation of 8.4-RC3.  I had problems with the installation, wouldn't
   boot until I used a Windows 98 FDISK to write a master boot record
   (no idea why; this system uses an Adaptec SATA 1.5 6-channel PCI-X
   board with two
   drives in RAID 1).
   
   Using the em0 interface there are no problems with DHCP; when I
   switch to the fxp0 interface the interface starts going up/down in
   the same manner as reported.
   
   The problem appears associated with world, not with the kernel (running
   the 8.4 kernel with the 8.3 world does not have this problem).
   
   This motherboard is an X5DPL-iGM with 2 Xeon 2.8GHz CPUs and 4 GB of RAM.
   The other unit (an earlier board) has a Serverworks chipset with a single
   Xeon CPU but also with a 100Mbit Intel Pro100 Ethernet port and a 1000Mbit
   Intel Pro1000 Ethernet port.
   
   This unit isn't doing anything useful, so testing isn't a problem.
  
  Mike, Yong-Hyeon asked you a very important question which you didn't
  answer:
  
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073458.html
  
  If you assign a static IP address, does fxp0 behave properly?
  
  I'm also re-adding Yong-Hyeon to the CC list here.
  
 
 At this point, I am not convinced we have a problem with what will turn
 out to be 8.4-RELEASE.
 
 There have been several attempts to ensure the upgraded version is
 actually 8.4-RC3 (and again, 'uname -a' is not provided in this
 email...).
 
 I find it very hard to believe that we have exactly one fxp(4) user
 upgrading to 8.4-*.
 
 I'd really like to make sure that this is not an issue that will affect
 an uncountable number of users, but truthfully, at this point have to
 consider it a local configuration problem.

I have numerous Supermicro 1U boxes sitting in my garage from closing
down my hosting organisation back in August 2012.  I am certain one or
two of them have Intel NICs that use fxp(4) -- the problem is that I
don't know what exact NIC and PHY model they use.

From what I can tell, there are at least two systems Mike has which
experience this anomaly.  One of those systems' dmesg:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073440.html

The relevant lines start at fxp0: Intel 82551 ... and continue all
the way down to pci0:0:8:0: bad VPD cksum, remain 14.  I'm not sure if
the bad VPD checksum message is relevant to the fxp0 device or not.

The 2nd system is mentioned above/in this post:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073530.html

But there's no verbose dmesg etc. for the 2nd system so I don't know if
it has the same NIC/PHY.

The model of NIC and PHY matters greatly; most users don't seem to
realise how important this is, they think in terms of Intel vs.
Broadcom vs. Realtek.

Output from pciconf -lvbc, specifically the lines relevant to the fxp0
device, from both systems, would be highly beneficial.

In the meantime, I'll head down to my garage to see if I can find those
fxp(4) boxes and see if they're 85551s (I sure hope I haven't pulled the
CPUs/RAM from them).  If I find a match, I can try to reproduce this.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator  http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 11:13:03PM -0400, Glen Barber wrote:
 On Thu, May 23, 2013 at 08:03:51PM -0700, Jeremy Chadwick wrote:
  On Thu, May 23, 2013 at 09:21:17PM -0400, Glen Barber wrote:
   On Thu, May 23, 2013 at 06:09:43PM -0700, Jeremy Chadwick wrote:
On Thu, May 23, 2013 at 08:18:33PM -0400, Michael L. Squires wrote:
 I've just tested 8.4-RC3 using a different Supermicro 1U box with a 
 fresh
 installation of 8.4-RC3.  I had problems with the installation, 
 wouldn't
 boot until I used a Windows 98 FDISK to write a master boot record
 (no idea why; this system uses an Adaptec SATA 1.5 6-channel PCI-X
 board with two
 drives in RAID 1).
 
 Using the em0 interface there are no problems with DHCP; when I
 switch to the fxp0 interface the interface starts going up/down in
 the same manner as reported.
 
 The problem appears associated with world, not with the kernel 
 (running
 the 8.4 kernel with the 8.3 world does not have this problem).
 
 This motherboard is an X5DPL-iGM with 2 Xeon 2.8GHz CPUs and 4 GB of 
 RAM.
 The other unit (an earlier board) has a Serverworks chipset with a 
 single
 Xeon CPU but also with a 100Mbit Intel Pro100 Ethernet port and a 
 1000Mbit
 Intel Pro1000 Ethernet port.
 
 This unit isn't doing anything useful, so testing isn't a problem.

Mike, Yong-Hyeon asked you a very important question which you didn't
answer:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073458.html

If you assign a static IP address, does fxp0 behave properly?

I'm also re-adding Yong-Hyeon to the CC list here.

   
   At this point, I am not convinced we have a problem with what will turn
   out to be 8.4-RELEASE.
   
   There have been several attempts to ensure the upgraded version is
   actually 8.4-RC3 (and again, 'uname -a' is not provided in this
   email...).
   
   I find it very hard to believe that we have exactly one fxp(4) user
   upgrading to 8.4-*.
   
   I'd really like to make sure that this is not an issue that will affect
   an uncountable number of users, but truthfully, at this point have to
   consider it a local configuration problem.
  
  I have numerous Supermicro 1U boxes sitting in my garage from closing
  down my hosting organisation back in August 2012.  I am certain one or
  two of them have Intel NICs that use fxp(4) -- the problem is that I
  don't know what exact NIC and PHY model they use.
  
  From what I can tell, there are at least two systems Mike has which
  experience this anomaly.  One of those systems' dmesg:
  
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073440.html
  
  The relevant lines start at fxp0: Intel 82551 ... and continue all
  the way down to pci0:0:8:0: bad VPD cksum, remain 14.  I'm not sure if
  the bad VPD checksum message is relevant to the fxp0 device or not.
  
  The 2nd system is mentioned above/in this post:
  
  http://lists.freebsd.org/pipermail/freebsd-stable/2013-May/073530.html
  
  But there's no verbose dmesg etc. for the 2nd system so I don't know if
  it has the same NIC/PHY.
  
 
 My understanding from the start of this thread is that both machines
 are actually the same machine, but with different combinations of
 userland/kernel.  (No, not arguing anything - only one person can answer
 if my understanding is correct or not.)


  The model of NIC and PHY matters greatly; most users don't seem to
  realise how important this is, they think in terms of Intel vs.
  Broadcom vs. Realtek.
  
  Output from pciconf -lvbc, specifically the lines relevant to the fxp0
  device, from both systems, would be highly beneficial.
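
  To narrow pciconf -lvbc output down to a single device, a filter along
  these lines may help.  It is only a sketch: the function name is made up
  for illustration, and it assumes the usual pciconf layout where each
  device record starts with an unindented "name@pciN:..." line followed by
  indented detail lines.

```shell
#!/bin/sh
# Hypothetical helper: print only the pciconf record for one device.
# Assumes each record starts with an unindented "name@..." line and that
# its detail lines are indented beneath it.
pci_block() {
    awk -v dev="$1" '
        /^[^ \t]/ { show = (index($0, dev "@") == 1) }  # a new record begins
        show                                            # print while inside the wanted record
    '
}

# Intended use on the affected FreeBSD box (not executed here):
#   pciconf -lvbc | pci_block fxp0
```

  The filter reads stdin, so it works equally well on a saved copy of the
  output.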
  
  In the meantime, I'll head down to my garage to see if I can find those
  fxp(4) boxes and see if they're 85551s (I sure hope I haven't pulled the
  CPUs/RAM from them).  If I find a match, I can try to reproduce this.
  
 
 It has been quite costly now (time) waiting for information on this
 particular issue at this point, unfortunately.

I made a typo above -- I meant to say 82551 not 85551.  Anyways...

Hearing you on FM.  :-)

Inventory of old systems of mine didn't take long.  I had 3 systems;
one used dual Broadcom NICs (so that's out), the other used dual Intel
82541EI NICs (which is driven by em(4) so that's out).  The remaining
system:

- Supermicro 5010E
  -- Have CPU + RAM
  -- NIC: Intel DA82562EM
  -- NIC: Intel GD82559
  -- Supermicro site states (1) Intel 82559 and (1) Intel VE CNR
  -- http://www.supermicro.com/products/system/1U/5010/SYS-5010E.cfm
  -- fxp(4) driver claims to support 82562EM and 82559
  -- http://svnweb.freebsd.org/base/stable/8/sys/dev/fxp/if_fxp.c?revision=242909&view=markup

However neither of these are 82551.

Summary: I do have a system I can use to test fxp(4), however it does
not use the Intel 82551 (in case this turns out to be a chip-specific
driver bug).

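If someone wants me to test DHCP via fxp(4) on the above system (I can
do so with both NICs), just let me know; it should only take me half an
hour or so.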

Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 11:42:44PM -0400, Glen Barber wrote:
 On Thu, May 23, 2013 at 08:38:06PM -0700, Jeremy Chadwick wrote:
  If someone wants me to test DHCP via fxp(4) on the above system (I can
  do so with both NICs), just let me know; it should only take me half an
  hour or so.
  
  I'll politely wait for someone to say "please do so" else won't bother.
  
 
 For the sake of completeness...
 
 Please do so.  :)

Issue reproduced 100% reliably, even within sysinstall.

ISO image used:

ftp://ftp4.freebsd.org/pub/FreeBSD/releases/ISO-IMAGES/8.4/FreeBSD-8.4-RC3-i386-disc1.iso

I just chose to Configure the system, selected Networking, chose NO to
the IPv6 configuration choice, and YES to the DHCP configuration choice,
then hit Alt-F2 to watch relevant output.

This was the result:

http://imgbin.org/index.php?page=image&id=13718

...with the fxp0 physif up/down messages continuing indefinitely.

fxp0 on the system is the Intel 82559.  Shot of console's dmesg:

http://imgbin.org/index.php?page=image&id=13720

Nothing is connected to fxp1.

Key points for those asking me to help debug:

- I only have VGA console on this box
- I do not have an IDE hard disk of any sort for temporary OS
  installation, setup, kernel testing, etc..
- The system cannot boot USB media of any sort, so memsticks are out
- The ATAPI drive is CD-only; there is no DVD support, so there's no
  easy way to get a real shell with full utilities (i.e. Fixit)

So if someone wants to take a stab at this, they'll need to do so and
make me an ISO.  Sorry that I can't make things easier.  :-(

This definitely needs to get fixed before 8.4-RELEASE.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Thu, May 23, 2013 at 09:40:35PM -0700, Jeremy Chadwick wrote:
 On Thu, May 23, 2013 at 11:42:44PM -0400, Glen Barber wrote:
  On Thu, May 23, 2013 at 08:38:06PM -0700, Jeremy Chadwick wrote:
   If someone wants me to test DHCP via fxp(4) on the above system (I can
   do so with both NICs), just let me know; it should only take me half an
   hour or so.
   
   I'll politely wait for someone to say "please do so" else won't bother.
   
  
  For the sake of completeness...
  
  Please do so.  :)
 
 Issue reproduced 100% reliably, even within sysinstall.

 {snip} 

Forgot to add:

This issue ONLY happens when using DHCP.

Statically assigning the IP address works fine; fxp0 goes down once,
up once, then stays up indefinitely.
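
For anyone reproducing the comparison on an installed system, the two
cases differ by a single rc.conf line.  A sketch, assuming the interface
is fxp0 and a /24 netmask (the netmask is an assumption; it is not stated
anywhere in this thread):

```shell
# /etc/rc.conf fragment (rc.conf is plain sh variable assignments).

# DHCP: the case that triggers the endless link up/down loop.
#ifconfig_fxp0="DHCP"

# Static: the case that works.  The address is the one pinged in this
# message; the netmask is assumed.
ifconfig_fxp0="inet 192.168.1.192 netmask 255.255.255.0"
```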

I also tested network I/O in the statically-assigned scenario.  Pinging
the box from another machine on the LAN:

$ ping 192.168.1.192
PING 192.168.1.192 (192.168.1.192): 56 data bytes
64 bytes from 192.168.1.192: icmp_seq=0 ttl=64 time=0.180 ms
64 bytes from 192.168.1.192: icmp_seq=1 ttl=64 time=0.138 ms
64 bytes from 192.168.1.192: icmp_seq=2 ttl=64 time=0.214 ms
64 bytes from 192.168.1.192: icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from 192.168.1.192: icmp_seq=4 ttl=64 time=0.114 ms
^C
--- 192.168.1.192 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.114/0.162/0.214/0.034 ms


Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Fri, May 24, 2013 at 12:56:20AM -0400, Glen Barber wrote:
 On Thu, May 23, 2013 at 09:40:35PM -0700, Jeremy Chadwick wrote:
  [...]
  So if someone wants to take a stab at this, they'll need to do so and
  make me an ISO.  Sorry that I can't make things easier.  :-(
  
  This definitely needs to get fixed before 8.4-RELEASE.
  
 
 *sigh*
 
 At this point, it is highly unlikely this will be fixed before
 8.4-RELEASE.  We are _far_ too deep into the release cycle.  In fact, we
 are effectively done with the release, and waiting on release notes to
 be completed.
 
 I think this will likely be included in errata notes for the release.

I urge you to meet with others in Release Engineering and discuss this
fully.  This is major enough that, once fixed, it warrants an immediate
binary update (to the kernel + if_fxp.ko) pushed out via freebsd-update.

fxp(4) is a commonly-used driver; it isn't something rare/uncommon.

Also remember at this stage we don't know if it's a specific PHY model
or specific NIC model (or series) which triggers it.  For all we know it
could affect everything that fxp(4) drives.

Please don't forget that FreeBSD has a very well-established history of
having rock-solid Intel NIC support.  Sure, mistakes happen, we're
human, bugs get introduced, but this does not bode well -- meaning I
would expect Slashdot et al to pick up on this.

 It is very unfortunate that this waited so long to be reported, as much
 time has passed since 8.4-BETA1...

This is what happens when people socially proliferate the belief that
"RELEASE is rock solid/stable, don't run stable/X" -- the number of
people who test what changes between RELEASE builds is vastly smaller
comparatively.  I've only been saying this for the past 15 years, so
it's even more unfortunate that people keep believing it.  :/


Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Fri, May 24, 2013 at 01:24:24AM -0400, Glen Barber wrote:
 Speaking entirely on behalf of myself now...
 
 On Thu, May 23, 2013 at 10:11:39PM -0700, Jeremy Chadwick wrote:
   I think this will likely be included in errata notes for the release.
  
  I urge you to meet with others in Release Engineering and discuss this
  fully.  This is major enough that, once fixed, it warrants an immediate
  binary update (to the kernel + if_fxp.ko) pushed out via freebsd-update.
  
 
 It can be solved with a -pN update after 8.4-RELEASE is out.

Not that I'm calling the shots or anything, but:

Let's go with that, combined with an included mention in the Errata
section of the Release Notes as you initially mentioned.

Sorry I can't be of more help; Charles' environment sounds like it would
be better-suited for testing, and I'm sure Michael can test out a patch
if/when someone gets around to poking at things.


Re: Apparent fxp regression in FreeBSD 8.4-RC3

2013-05-23 Thread Jeremy Chadwick

On Fri, May 24, 2013 at 02:47:20PM +0900, YongHyeon PYUN wrote:
 On Thu, May 23, 2013 at 09:49:19PM -0700, Jeremy Chadwick wrote:
  On Thu, May 23, 2013 at 09:40:35PM -0700, Jeremy Chadwick wrote:
   On Thu, May 23, 2013 at 11:42:44PM -0400, Glen Barber wrote:
    On Thu, May 23, 2013 at 08:38:06PM -0700, Jeremy Chadwick wrote:
     If someone wants me to test DHCP via fxp(4) on the above system (I can
     do so with both NICs), just let me know; it should only take me half an
     hour or so.
     
     I'll politely wait for someone to say "please do so" else won't bother.
     
    
    For the sake of completeness...
    
    Please do so.  :)
   
   Issue reproduced 100% reliably, even within sysinstall.
  
   {snip} 
  
  Forgot to add:
  
  This issue ONLY happens when using DHCP.
  
  Statically assigning the IP address works fine; fxp0 goes down once,
  up once, then stays up indefinitely.
 
 I asked Mike to try backing out dhclient(8) change (r247336) but it
 seems he missed that. Jeremy, could you try that?
 
 I guess dhclient(8) does not like flow-control negotiation of
 fxp(4) after link establishment.

I can't test anything without an ISO -- the system in question is truly
bare-bones (no hard disk, can't boot USB memsticks, etc.).  I'm not a
good test subject for changes on this one, I'm sorry to say.  :-(

If there's some way to disable flow-control negotiation in fxp(4) or
miibus(4) via loader, I can try that, but I don't know what the MIB
name would be.

<ignore me, I'm just pissed off>
If r247336 turns out to be the cause: ironic, as r247336 references PR
166656, which was tested against -- wait for it -- xl(4).

People in *this* thread are saying "screw legacy hardware" yet the PR is
for something as old as the 3C905B?  Maybe I should bow out of this
thread before I have an aneurysm.
</ignore>


Re: OpenSSH in -STABLE

2013-05-21 Thread Jeremy Chadwick

On Tue, May 21, 2013 at 11:02:27PM -0400, usa...@hushmail.com wrote:
 On Tue, 21 May 2013 22:20:08 -0400 David Wolfskill da...@catwhisker.org wrote:
  On Tue, May 21, 2013 at 09:42:39PM -0400, usa...@hushmail.com wrote:
   Hi. Are there any plans to get OpenSSH 6.2 in 9-STABLE? I'd like to
   check out the new AES-GCM stuff without going to -CURRENT on this
   system. If there are no plans, is there a possibility? Thanks
   
  
  Please refer to ports/security/openssh-portable; its Makefile says it's
  6.2p2,1, last updated about 5 days ago.
  
 
 Thanks, but that wasn't what I asked about. I'm aware of the version in
 ports.

Try freebsd-secur...@freebsd.org, I am certain you will get an answer
there.

Fact: OpenSSH 6.2p1 was imported to head/CURRENT on March 22nd, and
6.2p2 was imported to head/CURRENT on May 22nd:

http://svnweb.freebsd.org/base/head/crypto/openssh/ChangeLog

OpenSSH is such an important/key piece of software that, much like
OpenSSL, it is one that does not warrant haste when it comes to getting
MFC'd.  If you want something more recent on non-CURRENT, you will
usually be told to run the version from ports.
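
Whichever ssh binary ends up first in PATH, its version banner shows
whether it is new enough to be worth trying.  A minimal sketch, with a
made-up function name, that only compares version numbers; whether
AES-GCM is actually usable may also depend on the OpenSSL library the
binary was built against:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a version banner, as printed by
# "ssh -V", reports OpenSSH 6.2 or newer.  Version number check only.
ssh_at_least_6_2() {
    ver=$(printf '%s\n' "$1" |
        sed -n 's/^OpenSSH_\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$ver" ] || return 2          # banner not recognised
    set -- $ver                        # $1 = major, $2 = minor
    [ "$1" -gt 6 ] || { [ "$1" -eq 6 ] && [ "$2" -ge 2 ]; }
}

# Intended use (ssh -V writes its banner to stderr):
#   ssh_at_least_6_2 "$(ssh -V 2>&1)" && echo "6.2 or newer"
```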


Re: OpenSSH in -STABLE

2013-05-21 Thread Jeremy Chadwick

On Tue, May 21, 2013 at 08:11:09PM -0700, Jeremy Chadwick wrote:
 ... 6.2p2 was imported to head/CURRENT on May 22nd ...

Typo on my part: this should have read May 17th, as is obvious from
svnweb.


Re: still mbuf leak in 9.0 / 9.1?

2013-05-18 Thread Jeremy Chadwick

On Sat, May 18, 2013 at 12:14:28PM +0200, Ronald Klop wrote:
 On Fri, 17 May 2013 19:31:01 +0200, Jeremy Chadwick j...@koitsu.org wrote:
 
 On Fri, May 17, 2013 at 11:37:23AM +0200, dennis berger wrote:
 Hi List,
 I can confirm that it is the bug you mentioned steven.
 Here is how I found it.
 
 I recorded hourly zfskern and nfsd stats. like this.
 
 echo PROCSTAT >> $reportname
 pgrep -S "(zfskern|nfsd)" | xargs procstat -kk >> $reportname
 
 luckily it crashed this night and logged this.
 
  1910 101508 nfsd nfsd: service mi_switch+0x186
 sleepq_wait+0x42 _sleep+0x376 arc_lowmem+0x77 kmem_malloc+0xc1
 uma_large_malloc+0x4a malloc+0xd9 arc_get_data_buf+0xb5
 arc_read_nolock+0x1ec arc_read+0x93 dbuf_prefetch+0x12c
 dmu_zfetch_dofetch+0x10b dmu_zfetch+0xaf8 dbuf_read+0x4a7
 dmu_buf_hold_array_by_dnode+0x16b dmu_buf_hold_array+0x67
 dmu_read_uio+0x3f zfs_freebsd_read+0x3e3
 
 Maybe it would be good to merge this fix into RELENG_9_1 and
 distribute a fix via freebsd-update what do you think?
 
 best,
 -dennis
 
 
 Am 16.05.2013 um 11:42 schrieb dennis berger:
 
  This is indeed a ZFS+NFS system and I can see that istgt and
 nfs are stuck in some ZIO state. Maybe it's this.
  Thanks for pointing that out.
 
  Is it this ZFS+NFS deadlock?
 
  --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  @@ -3720,8 +3720,16 @@ arc_lowmem(void *arg __unused, int howto __unused)
        mutex_enter(&arc_reclaim_thr_lock);
        needfree = 1;
        cv_signal(&arc_reclaim_thr_cv);
  -     while (needfree)
  -             msleep(&needfree, &arc_reclaim_thr_lock, 0, "zfs:lowmem", 0);
  +
  +     /*
  +      * It is unsafe to block here in arbitrary threads, because we can come
  +      * here from ARC itself and may hold ARC locks and thus risk a deadlock
  +      * with ARC reclaim thread.
  +      */
  +     if (curproc == pageproc) {
  +             while (needfree)
  +                     msleep(&needfree, &arc_reclaim_thr_lock, 0, "zfs:lowmem", 0);
  +     }
        mutex_exit(&arc_reclaim_thr_lock);
        mutex_exit(&arc_lowmem_lock);
   }
 
  I'll try to crash our testsystem. I'll assume that stressing
 NFS backed with ZFS a lot might trigger this bug?
 
  -dennis
 
 
  Am 16.05.2013 um 00:03 schrieb Steven Hartland:
 
  - Original Message - From: dennis berger d...@nipsi.de
  FreeBSD  9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec
 4 09:23:10 UTC 2012
 
  3. Regarding this:
  A clean shutdown isn't possible though. It hangs after vnode
  cleaning, normally you would see detaching of usb devices
 here, or
  other devices maybe?
  Please don't conflate this with your above issue.  This is almost
  certainly unrelated.  Please start a new thread about that
 if desired.
 
  Maybe this is a misunderstanding normally this system will
 shutdown cleanly, of course.
  This hang only appears after the network problem above.
 
  If this is a ZFS system, it's a known issue which is fixed in current,
  stable-9, stable-8 and the upcoming 8.4 release.
 
  If not and you have USB devices see if the following sysctl helps:
  hw.usb.no_shutdown_wait=1
 
 I'm sorry to say it won't happen.  The only updates that the -RELEASE
 branches get are for security.  If you want fixes for other things, you
 need to follow/run stable branches (e.g. stable/9), otherwise you will
 need to wait until 9.2-RELEASE comes out.
 
 
 And errata notices? Are they for security?

Example case:

http://www.freebsd.org/releases/9.1R/errata.html

Only the items in section "Security Advisories" would get actual updates
pushed out to the 9.1-RELEASE branch (e.g. RELENG_9_1); the items in
sections "Open Issues" and "Late-breaking News" are purely FYIs.  There
are always hundreds of bugs that never show up in either of those
sections but are mentioned in the next official version's Release Notes.
I can speculate all day and night as to why this is, but it's easier for
me to just say "that's just the way it is".

For example, compare the Open Issues in the 9.0-RELEASE errata to all
the bugfixes in the 9.1-RELEASE Release Notes (you'll have to go through
each item by hand and read it):

http://www.freebsd.org/releases/9.0R/errata.html
http://www.freebsd.org/releases/9.1R/relnotes-detailed.html

...and you'll see what I mean.

So to recap: when you run a -RELEASE branch, you should only expect
fixes related to security.  For any other problems, you are expected to
run stable/X (e.g. stable/9) or backport the fix yourself.

And because I am certain someone will bring it up: no, the fixes done in
stable/X cannot necessarily be turned into a patch file for a -RELEASE
branch.  The reason is that there are often other commits to stable/X
branches which are for things other than bugfixes (i.e.
re-engineering/refactoring of code, semantics changes, or entire
portions nuked altogether).  Sometimes backported patches can be made,
but it isn't always the case -- it is not always as simple as "the patch
applied cleanly".  ZFS and NFS are two (of many) things which have been
undergoing

Re: Unexpected reboot/crash on 8.2-RELEASE.

2013-05-18 Thread Jeremy Chadwick

On Sat, May 18, 2013 at 09:45:21PM -0400, kpn...@pobox.com wrote:
 I had an unexpected reboot of my Dell R610 today around 2:05-06pm.
 I do not know if it crashed or if it was power cycled.
 
 This machine is running:
 FreeBSD gunsight1.neutralgood.org 8.2-RELEASE FreeBSD 8.2-RELEASE #1: Thu Dec  8 21:58:59 UTC 2011 root@:/usr/obj/usr/src/sys/GENERIC  amd64
 
 It's a stock 8.2-RELEASE kernel except I had to tweak it near the top of
 vfs_mountroot() to delay before attempting to mount the root filesystem.
 (Without my tweak it attempts to mount root before the USB drive is finished
 getting attached.)
 
 The dmesg shows this at the reboot:
 mfi0: 24272 (422106527s/0x0020/info) - Patrol Read complete
 mfi0: 24273 (422172000s/0x0020/info) - Patrol Read started 
 mfi0: 24318 (422192750s/0x0020/info) - Patrol Read complete
 mfi0: 24319 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0060/1000/1f0c/1028)
 mfi0: 24320 (boot + 3s/0x0020/info) - Firmware version 1.22.12-0952
 mfi0: 24321 (boot + 3s/0x0020/info) - Firmware initialization started (PCI ID 0060/1000/1f0c/1028)
 mfi0: 24322 (boot + 3s/0x0020/info) - Firmware version 1.22.12-0952
 
 Does this mean the machine did not lose power? I ask because my datacenter
 had some sort of power incident and I'm not sure if the server lost power
 or not. But if the kernel message buffer from before the incident is still
 present then the machine never lost power, correct? The datacenter's power
 incident I'm told happened somewhere around the time of the reboot so I
 have to ask.
 
 It looks like I didn't have dumps enabled. That's ... not helpful.
 
 The machine has been stable for:
  2:05PM  up 472 days, 21 mins, 7 users, load averages: 0.01, 0.02, 0.00
 
 http://www.neutralgood.org/~kpn/dmesg.boot
 
 Here's various stats I usually keep displayed. This is the last from
 before the reboot:
 http://www.neutralgood.org/~kpn/status.txt

Your system did not reboot nor did it crash.  If it did, your uptime
would not be showing 472 days.

Really, it's that simple. 
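
One way to cross-check this from the running system is the kernel's
recorded boot time.  The following is a sketch only: it assumes sysctl
kern.boottime prints its usual "{ sec = N, usec = M } date" form, and the
function names are purely illustrative.

```shell
#!/bin/sh
# Hypothetical helpers for confirming when the kernel last booted.

# Extract the boot time in epoch seconds from "sysctl -n kern.boottime"
# style text on stdin, e.g. "{ sec = 1368900326, usec = 0 } Sat May 18 ...".
boot_epoch() {
    sed -n 's/^{ sec = \([0-9][0-9]*\),.*/\1/p'
}

# Whole days between a boot time and "now", both in epoch seconds.
uptime_days() {
    echo $(( ($2 - $1) / 86400 ))
}

# Intended use on the FreeBSD host (not executed here):
#   boot=$(sysctl -n kern.boottime | boot_epoch)
#   uptime_days "$boot" "$(date +%s)"
```

A result of only a few hours or days would point at a reboot; several
hundred days would not.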

 I've got all the power savings features turned off in the BIOS and, like
 I said, the machine has been stable for all this time. However, one thing
 to note from a couple of days ago:
 
 May 14 00:49:13 gunsight1 -- MARK --
 May 14 01:00:45 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 35 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 65 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 95 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 125 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 155 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 185 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 215 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 245 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 275 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 305 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 335 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 365 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 395 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 425 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 455 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 485 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 515 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 545 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 575 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 605 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 635 SECONDS
 May 14 01:11:36 gunsight1 kernel: mfi0: COMMAND 0xff80009d1310 TIMEOUT AFTER 665 SECONDS
 May 14 01:19:36 gunsight1 -- MARK --
 May 14 01:39:36 gunsight1 -- MARK --
 May 14 01:59:37 gunsight1 -- MARK --
 May 14 02:10:55 gunsight1 kernel: mfi0: 24089 (421826400s/0x0020/info) - Patrol Read started

Your mfi device timeouts are unrelated.  If you want to talk about them,
please discuss them in a new/separate thread.


Re: still mbuf leak in 9.0 / 9.1?

2013-05-17 Thread Jeremy Chadwick

On Fri, May 17, 2013 at 11:37:23AM +0200, dennis berger wrote:
 Hi List,
 I can confirm that it is the bug you mentioned steven.
 Here is how I found it.
 
 I recorded hourly zfskern and nfsd stats. like this.
 
 echo PROCSTAT >> $reportname
 pgrep -S "(zfskern|nfsd)" | xargs procstat -kk >> $reportname
 
 luckily it crashed this night and logged this.
 
  1910 101508 nfsd nfsd: service mi_switch+0x186 
 sleepq_wait+0x42 _sleep+0x376 arc_lowmem+0x77 kmem_malloc+0xc1 
 uma_large_malloc+0x4a malloc+0xd9 arc_get_data_buf+0xb5 arc_read_nolock+0x1ec 
 arc_read+0x93 dbuf_prefetch+0x12c dmu_zfetch_dofetch+0x10b dmu_zfetch+0xaf8 
 dbuf_read+0x4a7 dmu_buf_hold_array_by_dnode+0x16b dmu_buf_hold_array+0x67 
 dmu_read_uio+0x3f zfs_freebsd_read+0x3e3 
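
 Once a number of such snapshots have been collected, the threads sleeping
 inside arc_lowmem can be picked out mechanically.  A rough sketch,
 assuming each thread occupies one line of procstat -kk output as it
 normally does (the function name is invented for illustration):

```shell
#!/bin/sh
# Hypothetical helper: from procstat -kk output on stdin, print the PID
# and TID of every thread whose kernel stack shows it sleeping in
# arc_lowmem(), the pattern seen in the trace above.
stuck_in_arc_lowmem() {
    awk '/ _sleep\+0x/ && / arc_lowmem\+0x/ { print $1, $2 }'
}

# Intended use against the hourly report file:
#   stuck_in_arc_lowmem < "$reportname"
```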
 
 Maybe it would be good to merge this fix into RELENG_9_1 and distribute a fix 
 via freebsd-update what do you think?
 
 best,
 -dennis
 
 
 Am 16.05.2013 um 11:42 schrieb dennis berger:
 
  This is indeed a ZFS+NFS system and I can see that istgt and nfs are stuck 
  in some ZIO state. Maybe it's this. 
  Thanks for pointing that out. 
  
  Is it this ZFS+NFS deadlock?
  
  --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  @@ -3720,8 +3720,16 @@ arc_lowmem(void *arg __unused, int howto __unused)
        mutex_enter(&arc_reclaim_thr_lock);
        needfree = 1;
        cv_signal(&arc_reclaim_thr_cv);
  -     while (needfree)
  -             msleep(&needfree, &arc_reclaim_thr_lock, 0, "zfs:lowmem", 0);
  +
  +     /*
  +      * It is unsafe to block here in arbitrary threads, because we can come
  +      * here from ARC itself and may hold ARC locks and thus risk a deadlock
  +      * with ARC reclaim thread.
  +      */
  +     if (curproc == pageproc) {
  +             while (needfree)
  +                     msleep(&needfree, &arc_reclaim_thr_lock, 0, "zfs:lowmem", 0);
  +     }
        mutex_exit(&arc_reclaim_thr_lock);
        mutex_exit(&arc_lowmem_lock);
   }
  
  I'll try to crash our testsystem. I'll assume that stressing NFS backed 
  with ZFS a lot might trigger this bug?
  
  -dennis
  
  
  Am 16.05.2013 um 00:03 schrieb Steven Hartland:
  
  - Original Message - From: dennis berger d...@nipsi.de
  FreeBSD  9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 
  UTC 2012
  
  3. Regarding this:
  A clean shutdown isn't possible though. It hangs after vnode
  cleaning, normally you would see detaching of usb devices here, or
  other devices maybe?
  Please don't conflate this with your above issue.  This is almost
  certainly unrelated.  Please start a new thread about that if desired.
  
  Maybe this is a misunderstanding normally this system will shutdown 
  cleanly, of course.
  This hang only appears after the network problem above.
  
  If this is a ZFS system, it's a known issue which is fixed in current,
  stable-9, stable-8 and the upcoming 8.4 release.
  
  If not and you have USB devices see if the following sysctl helps:
  hw.usb.no_shutdown_wait=1

I'm sorry to say it won't happen.  The only updates that the -RELEASE
branches get are for security.  If you want fixes for other things, you
need to follow/run stable branches (e.g. stable/9), otherwise you will
need to wait until 9.2-RELEASE comes out.


Re: Command line not responding

2013-05-17 Thread Jeremy Chadwick

On Fri, May 17, 2013 at 12:56:53PM -0500, Michael Gass wrote:
 Running 9.0-Stable on an i386.
 
 Whenever I type a command at the prompt I get
 the output
 
 /usr/local/lib/libintl.so.9: Undefined symbol _ThreadRuneLocale
 
 and nothing else - the command will not run. Just the
 above output.  Commands like ls and exit work, but not much
 else.  This happens whether I am logged in as a user or as root.
 Cannot even halt the system from the command line.
 
 Started to happen after trying to update the freetype2 port.
 Got an error msg while updating libXft-2.1.14.  From that point
 on I cannot use  the command line.
 
 I have no idea what to try.  Any suggestions.

First provide the contents of /etc/make.conf and /etc/src.conf.

The _ThreadRuneLocale thing has come up before, but on -CURRENT circa
early 2012.  It happened to a user when trying to build kernel (really)
and that user was tinkering about in make.conf and src.conf heavily,
messing with Clang.  I personally remove Clang from my systems entirely
for many reasons, by simply doing WITHOUT_CLANG=true in src.conf and
thus rely entirely on gcc.

My recommendation, and this isn't going to make you happy:

Boot into single-user, mount your filesystems, and try commands there,
in hopes that they work.  If they do:

pkg_delete -a -f
cp -pR /usr/local /usr/local.old
rm -fr /usr/local/*
reboot

Boot into multi-user, log in, and things should be fine.  Next:

rm -fr /var/db/ports/*
rm -fr /usr/ports/distfiles/*
find /usr/ports -type d -name work -exec rm -fr {} \;
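
Before running the find/rm line it may be worth previewing what it would
delete.  A small sketch (the function name is made up; -prune stops find
from descending into a work directory once it has matched):

```shell
#!/bin/sh
# Hypothetical helper: list port "work" directories under a tree without
# deleting anything.  Defaults to /usr/ports.
list_work_dirs() {
    find "${1:-/usr/ports}" -type d -name work -prune -print
}

# Intended use:
#   list_work_dirs              # preview under /usr/ports
#   list_work_dirs /usr/ports | wc -l
```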

Now begin rebuilding your ports.  If you prefer to use packages, go
right ahead, given that this was just announced a few days ago:

http://lists.freebsd.org/pipermail/freebsd-announce/2013-May/001476.html

But I tend to build everything from source, barring large-ish packages
(things like cmake, python27, perl) which I pkg_add -r.

My attitude has always been "when something catastrophic impacts a very
large number of commands (particularly a library with a missing symbol
that a very large number of programs link to), start fresh".  It's
not worth scrambling around with leftover cruft in place that could
appear months later and make you say "I thought I fixed that!", where
you then have to follow up to a thread months old and admit "actually
there is more breakage..."

Footnote: I am likely to get a large amount of backlash for proposing
the above, with claims that will equate it to "fixing a minor cut by
amputating the entire limb".  My response to such: "that's nice".


Re: Command line not responding

2013-05-17 Thread Jeremy Chadwick

On Fri, May 17, 2013 at 09:49:20PM -0500, Michael Gass wrote:
 On Fri, May 17, 2013 at 11:55:13AM -0700, Jeremy Chadwick wrote:
  On Fri, May 17, 2013 at 12:56:53PM -0500, Michael Gass wrote:
   Running 9.0-Stable on an i386.
   
   Whenever I type a command at the prompt I get
   the output
   
   /usr/local/lib/libintl.so.9: Undefined symbol _ThreadRuneLocale
   
   and nothing else - the command will not run. Just the
   above output.  Commands like ls and exit work, but not much
   else.  This happens whether I am logged in as a user or as root.
   Cannot even halt the system from the command line.
   
   Started to happen after trying to update the freetype2 port.
   Got an error msg while updating libXft-2.1.14.  From that point
   on I cannot use  the command line.
   
   I have no idea what to try.  Any suggestions.
  
 
 
  First provide the contents of /etc/make.conf and /etc/src.conf.
  
 
 Thanks for getting back to me. Here are the contents of the two
 files.  I rebuilt the kernel last fall and have updated ports
 fairly regularly since. Things have worked fine until today when
 I tried to update ports.
 
 # File:   make.conf
 # The ? in the below is for buildworld
 CPUTYPE?=pentium2
 # Uncomment the below for general builds.
 CFLAGS= -O -pipe
 # Uncomment the below for kernel builds.
 # COPTFLAGS= -O -pipe
 NO_PROFILE=true
 INSTALL_NODEBUG=true
 #WITHOUT_DILLO_IPV6=yes
 #WITH_DILLO_DLGUI=yes
 # added by use.perl 2013-05-17 11:04:30
 PERL_VERSION=5.12.4
 
 # File:   src.conf
 WITHOUT_PROFILE=true
 WITHOUT_BLUETOOTH=true

These confs look generally good, meaning there isn't the messing about
that the other user had.  I did catch one thing, however.

Speaking strictly about CFLAGS:

This should be CFLAGS+= (plus-equals), not CFLAGS= (equals).  Otherwise
you're effectively overriding CFLAGS for everything, which could cause
issues (some portions of the build infrastructure may set or adjust the
optimiser flags to something other than -O, and you'd be forcing -O on
them anyway).  I obviously don't know if that could/would explain the
missing symbol issue, but it's still something that's erroneous and
major.  In general I recommend people *do not* tinker with CFLAGS at
all in make.conf -- it's not worth the hassle on i386/amd64 if something
goes wrong.
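
To illustrate the distinction between the two assignment forms, here is
a minimal make.conf sketch.  The flag values are simply the ones from
the file quoted above and are placeholders, not a recommendation:

```make
# Plain assignment: replaces whatever CFLAGS value was set before this
# line is read, for everything that reads make.conf.
#CFLAGS= -O -pipe

# Append: keeps any CFLAGS value already set and adds these flags to it.
CFLAGS+= -O -pipe
```

Leaving both lines out entirely is the option that matches the advice
above about not tinkering with CFLAGS at all.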

If you ever want to know which syntaxes to use (for example, your
CPUTYPE?= is correct, and your COPTFLAGS= is correct), review
/usr/share/examples/etc/make.conf or src/share/examples/etc/make.conf.

Unrelated to all of this (just a useful comment in passing): NO_PROFILE
serves no purpose there, just keep WITHOUT_PROFILE=true in src.conf like
you have.  NO_PROFILE in make.conf would be from old FreeBSD days
(i.e. prior to src.conf existing).

Your src.conf looks fine.

Sorry I can't be of more help.  :-(

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: revision higher than 250508 breaks webcam support

2013-05-16 Thread Jeremy Chadwick

On Thu, May 16, 2013 at 08:38:39PM -0700, Adrian Chadd wrote:
> Are you able to narrow down the specific commit along 9-stable that broke it?
>
> Thanks!
>
>
> Adrian
>
> On 16 May 2013 18:00, Jože Zobec <jozze.z...@hotmail.com> wrote:
> > Sorry for waiting this long to post this problem. I thought it would be
> > dealt with this week, but since it wasn't, it's better to report it now.
> > I hope this is the right mailing list for this particular problem.
> >
> > I am running FreeBSD 9.1-STABLE and using a Logitech Webcam C525. It's
> > not listed amongst the supported hardware, but it was working perfectly
> > until the updates that came this Sunday, 2013-05-12.
> >
> > The problem I'm getting is this:
> >
> > I keep getting this error message from the kernel if I'm using
> > 9.1-STABLE r250707:
> >
> > ...
> > pcm6: detached
> > ugen7.2: <vendor 0x046d> at usbus7
> > uaudio0: <vendor 0x046d HD Webcam C525, class 239/2, rev 2.00/0.10, addr 2>
> > on usbus7
> > uaudio0: No playback.
> > uaudio0: Record: 48000 Hz, 1 ch, 16-bit S-LE PCM format, 2x8ms buffer.
> > uaudio0: Record: 32000 Hz, 1 ch, 16-bit S-LE PCM format, 2x8ms buffer.
> > uaudio0: Record: 24000 Hz, 1 ch, 16-bit S-LE PCM format, 2x8ms buffer.
> > uaudio0: Record: 16000 Hz, 1 ch, 16-bit S-LE PCM format, 2x8ms buffer.
> > uaudio: No MIDI sequencer.
> > pcm6: <USB audio> on uaudio0
> > uaudio0: No HID volume keys found.
> > ugen7.2: <vendor 0x046d> at usbus7 (disconnected)
> > uaudio0: at uhub7, port4, addr 2 (disconnected)
> > pcm6: detached
> > ...
> >
> > This message is displayed periodically ad infinitum, or at least until I
> > unplug the webcam. It stays this way even if I use the GENERIC kernel.
> > In the healthy case, revision 250508, the kernel message upon plugging
> > in the webcam is:
> >
> > ...
> > ugen7.2: <vendor 0x046d> at usbus7
> > uaudio0: <vendor 0x046d HD Webcam C525, class 239/2, rev 2.00/1.00, addr2>
> > on usbus7
> > uaudio: No playback.
> > uaudio: Record: 48000 Hz, 1 ch, 16 bit S-LE PCM format, 2x8ms buffer.
> > uaudio: No MIDI sequencer.
> > pcm6: <USB audio> on uaudio0
> > uaudio0: No HID volume keys found.
> >
> > And there it stops, and the webcam works in Skype.

Note: I told Joe to mail freebsd-usb@ about this, since it looks like it
pertains to the USB stack, and Hans tends to respond to stuff there.

That said...

Looking at commits between r250508 and r250707, my gut says it's very
likely one of these (with the most probable being marked with arrows):

http://www.freshbsd.org/commit/freebsd/r250581
http://www.freshbsd.org/commit/freebsd/r250561 <---
http://www.freshbsd.org/commit/freebsd/r250560 <---
http://www.freshbsd.org/commit/freebsd/r250559

How I got that list was by manually reviewing the following:

http://www.freshbsd.org/?branch=RELENG_9&project=freebsd

So I would recommend rolling back to r250558 (the last stable/9 commit
to happen before r250559) and see if things improve.  Again, my gut
feeling says that they will, and that r250561 or r250560 are
responsible.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: still mbuf leak in 9.0 / 9.1?

2013-05-15 Thread Jeremy Chadwick

 262144).

What you *have* shown is your mbuf count gradually increasing (sans
15-05-2013-13-09.txt vs. 15-05-2013-14-09.txt which shows mbufs almost
doubling (!)), which could indicate a leak but then again might not.

If you reach mbuf maximum, then yes, network I/O can cease or stall
(possibly indefinitely).  However, broken/busted network I/O can also
happen due to other issues unrelated to mbufs, such as network stack
issues, firewall stack issues, or network driver bugs.  Are you using
pf, ipfw, or ipfilter on this system?

2. I think we'd all appreciate if you disclosed **exactly** what version
of FreeBSD you're using (Subject says "9.0 or 9.1" which is
insufficient).  Please provide "uname -a" output (you can XXX out the
hostname if you want) -- and if you're still using csup/cvsup and built
your own kernel/world, we'll need to know exactly what date your src
files were from when you rebuilt.

I'm wary of CC'ing folks who can help troubleshoot mbuf exhaustion
issues until answers to the above can be provided, as I don't want to
waste their time.

3. Regarding this:

  A clean shutdown isn't possible though. It hangs after vnode
  cleaning, normally you would see detaching of usb devices here, or
  other devices maybe?

Please don't conflate this with your above issue.  This is almost
certainly unrelated.  Please start a new thread about that if desired.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |

Re: Build GENERIC with IPX support

2013-05-13 Thread Jeremy Chadwick

  * connection and fill server/user from it.
196  */
197         if (li->server[0] == 0 || li->user == NULL) {
198                 int connHandle;
199                 struct ncp_conn_stat cs;
200
201                 if ((error = ncp_conn_scan(li, &connHandle)) != 0) {
202                         ncp_error("no default connection found", errno);
203                         return error;
204                 }

To me, this may indicate you have some kind of ncp rc file (I believe
this is ~/.nwfsrc according to the ncplist(1) man page) that may contain
something invalid, or maybe you lack such a file altogether (creating one
might work around the problem).

Back to the actual segfault itself: ncp_error() is pretty simple:

src/lib/libncp/ncpl_subr.c --

447 /*
448  * Print a (descriptive) error message
449  * error values:
450  * 0 - no specific error code available;
451  *  -999..-1 - NDS error
452  *  1..32767 - system error
453  *  the rest - requester error;
454  */
455 void
456 ncp_error(const char *fmt, int error, ...) {
457         va_list ap;
458
459         fprintf(stderr, "%s: ", _getprogname());
460         va_start(ap, error);
461         vfprintf(stderr, fmt, ap);
462         va_end(ap);
463         if (error == -1)
464                 error = errno;
465         if (error > -1000 && error < 0) {
466                 fprintf(stderr, ": dserr = %d\n", error);
467         } else if (error & 0x8000) {
468                 fprintf(stderr, ": nwerr = %04x\n", error);
469         } else if (error) {
470                 fprintf(stderr, ": syserr = %s\n", strerror(error));
471         } else
472                 fprintf(stderr, "\n");
473 }
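
As an illustration only, the branch selection in ncp_error() can be
isolated into a small standalone function.  This is a sketch, not part
of libncp: the enum and the function name classify_ncp_error are
hypothetical, and it assumes the two conditions read
"error > -1000 && error < 0" and "error & 0x8000", which is consistent
with the ranges given in the function's own comment.

```c
/* Hypothetical labels for the four branches of ncp_error(); these names
 * are illustrative and do not exist in libncp. */
enum ncp_err_class {
	NCP_ERR_NONE,      /* 0: no specific error code, prints a newline */
	NCP_ERR_NDS,       /* -999..-1: printed as "dserr" */
	NCP_ERR_REQUESTER, /* bit 0x8000 set: printed as "nwerr" */
	NCP_ERR_SYSTEM     /* any other non-zero value: printed as "syserr" */
};

/*
 * Apply the same tests, in the same order, as the if/else chain in
 * ncp_error().  The caller is assumed to have already replaced -1 with
 * errno, as ncp_error() does before reaching the chain.
 */
enum ncp_err_class
classify_ncp_error(int error)
{
	if (error > -1000 && error < 0)
		return NCP_ERR_NDS;
	else if (error & 0x8000)
		return NCP_ERR_REQUESTER;
	else if (error)
		return NCP_ERR_SYSTEM;
	return NCP_ERR_NONE;
}
```

Under these assumptions, an errno value such as 2 passed in from the
ncp_conn_scan() failure path falls through to the "syserr" branch, which
is the one that calls strerror().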

What I don't understand from the calling stack is how gettimeofday() is
involved.  I have looked at the libc code, looked at the underlying
calling functions and so on (from fprintf() to vfprintf_l() and deeper),
and I don't see how or where gettimeofday() would be called.  The only
place I can think of might be the related locale stuff, but I doubt
that given what I've looked at (though I could still be wrong).

Have world/kernel on this system ever been rebuilt?  If they have,
were both kernel and world rebuilt together from the same source code
and not at different times?

If you're setting LANG, LC_CTYPE, LC_COLLATE, or other locale-oriented
settings in your environment (and my gut feeling is that you are), you
could try removing them and see if you get an actual useful error
message on stderr, but I'm not holding my breath.

I cannot help you with the remaining IPX-specific stuff; it's fairly
obvious though, as I said, that this code has been neglected.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |