Re: zfs arc and amount of wired memory

2012-02-09 Thread Andriy Gapon
on 09/02/2012 06:27 Eugene M. Zheganin said the following:
 The output I promised (if it's MORE acceptable in the form of a link to a
 paste site, just say it):

I prefer links, but both ways are acceptable to me.
Just one more hint on the reporting.  The most useful reports are coherent
reports.  That is, I now have your older reports from top and zfs-stat and I
have newer vmstat reports.  But I do not have all the reports taken at about the
same time, so I don't have a coherent picture of a system state.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: zfs arc and amount of wired memory

2012-02-09 Thread Andriy Gapon
on 09/02/2012 10:33 Andriy Gapon said the following:
 on 09/02/2012 06:27 Eugene M. Zheganin said the following:
 The output I promised (if it's MORE acceptable in the form of a link to a
 paste site, just say it):
 
 I prefer links, but both ways are acceptable to me.
 Just one more hint on the reporting.  The most useful reports are coherent
 reports.  That is, I now have your older reports from top and zfs-stat and I
 have newer vmstat reports.  But I do not have all the reports taken at about
 the same time, so I don't have a coherent picture of a system state.
 

And please take the reports after the discrepancy between ARC size and wired
size is large enough, e.g. 1GB.  That's when they are useful.
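
For reference, one way to check that discrepancy is sketched below.  It
assumes the sysctl names vm.stats.vm.v_wire_count, hw.pagesize and
kstat.zfs.misc.arcstats.size, which are not named anywhere in this thread, so
check them on your release.  The helper names wired_arc_gap and gap_is_large
are made up for illustration.

```sh
#!/bin/sh
# wired_arc_gap WIRED_PAGES PAGE_SIZE ARC_BYTES
# Prints wired memory minus ARC size, in bytes.
wired_arc_gap() {
    echo $(( $1 * $2 - $3 ))
}

# gap_is_large GAP_BYTES [THRESHOLD_BYTES]
# Succeeds when the gap is at least the threshold (default 1 GB).
gap_is_large() {
    [ "$1" -ge "${2:-1073741824}" ]
}

# Intended use on the affected machine (assumed sysctl names, not run here):
#   gap=$(wired_arc_gap "$(sysctl -n vm.stats.vm.v_wire_count)" \
#                       "$(sysctl -n hw.pagesize)" \
#                       "$(sysctl -n kstat.zfs.misc.arcstats.size)")
#   gap_is_large "$gap" && echo "gap is $gap bytes, time to take the reports"
```

The arithmetic is kept apart from the sysctl calls so it can be checked with
made-up numbers before it is trusted on a live system.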

-- 
Andriy Gapon


Re: zfs arc and amount of wired memory

2012-02-09 Thread Eugene M. Zheganin

Hi.

On 09.02.2012 14:35, Andriy Gapon wrote:

And please take the reports after the discrepancy between ARC size and wired
size is large enough, e.g. 1GB.  That's when they are useful.

Okay, I wrote a short script that captures the output of top -b, zfs-stats -a,
vmstat -m and vmstat -z in a timestamped file, and put it in crontab to run
every hour.
I will provide the files it creates (or a subset of them, if there are too
many) after the system enters a deadlock again.
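
A capture script of that kind could look roughly like the sketch below.  This
is not the script actually used here; the function name capture_reports, the
output directory and the file naming are assumptions made for illustration.

```sh
#!/bin/sh
# capture_reports OUTDIR CMD...
# Runs each CMD in turn and appends its output, under a heading, to one
# timestamped file in OUTDIR.  Prints the name of the file it wrote.
capture_reports() {
    outdir=$1
    shift
    out="$outdir/memreport-$(date +%Y%m%d-%H%M%S).txt"
    for cmd in "$@"; do
        printf '=== %s ===\n' "$cmd" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1
    done
    printf '%s\n' "$out"
}

# Intended use on the affected machine (hypothetical path, not run here):
#   capture_reports /var/tmp/memreports \
#       "top -b" "zfs-stats -a" "vmstat -m" "vmstat -z"
```

Writing all four outputs into one file per run keeps each set of reports
taken at about the same time, which is the coherence asked for earlier in
the thread.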

That usually takes between one and two weeks.

Thanks.
Eugene.


Re: zfs arc and amount of wired memory

2012-02-09 Thread Eugene M. Zheganin

Hi.

On 09.02.2012 14:35, Andriy Gapon wrote:

And please take the reports after the discrepancy between ARC size and wired
size is large enough, e.g. 1GB.  That's when they are useful.

One more thing: this machine is running a debug/ddb kernel.  So, to avoid
losing another two weeks: when/if it enters a deadlock, do you (or anyone
else) need a crashdump or anything else I can provide using ddb during the
deadlock?


Thanks.
Eugene.


Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

if you are not using USB3 and a fast memory stick, it will be slower than 
swapping to disk.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Freddie Cash fjwc...@gmail.com wrote:
On Wed, Feb 8, 2012 at 10:25 AM, Eugene M. Zheganin e...@norma.perm.ru wrote:
 On 08.02.2012 18:15, Alexander Leidinger wrote:
 I can't remember having seen any mention of SWAP on ZFS being safe
 now. So if nobody can provide a reference to a place which tells that
 the problems with SWAP on ZFS are fixed:
  1. do not use SWAP on ZFS
  2. see 1.
  3. check if you see the same problem without SWAP on ZFS (btw. see 1.)

 So, if swap has to be used, and it has to be backed by something
 like gmirror so it won't go down with one of the disks, there's no need to
 use zfs for the system.

 This makes zfs only useful in cases where you need to store something on a
 couple+ of terabytes, still having OS on ufs. Occam's razor and so on.

Or, you plug a USB stick into the back (or even inside the case as a
lot of mobos have internal USB connectors now) and use that for swap.

-- 
Freddie Cash
fjwc...@gmail.com

Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

this only applies to old systems (slooow disks, no NCQ support), or very fast 
USB3 memory sticks. Current (I would say at least 2-3 year old) hardware is 
slowed down by USB2.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Freddie Cash fjwc...@gmail.com wrote:
On Wed, Feb 8, 2012 at 10:40 AM, Freddie Cash fjwc...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 10:25 AM, Eugene M. Zheganin e...@norma.perm.ru wrote:
 On 08.02.2012 18:15, Alexander Leidinger wrote:
 I can't remember having seen any mention of SWAP on ZFS being safe
 now. So if nobody can provide a reference to a place which tells that
 the problems with SWAP on ZFS are fixed:
  1. do not use SWAP on ZFS
  2. see 1.
  3. check if you see the same problem without SWAP on ZFS (btw. see 1.)

 So, if swap has to be used, and it has to be backed by something
 like gmirror so it won't go down with one of the disks, there's no need to
 use zfs for the system.

 This makes zfs only useful in cases where you need to store something on a
 couple+ of terabytes, still having OS on ufs. Occam's razor and so on.

 Or, you plug a USB stick into the back (or even inside the case as a
 lot of mobos have internal USB connectors now) and use that for swap.

That also works well for adding L2ARC (cache) to the ZFS pool.

-- 
Freddie Cash
fjwc...@gmail.com

Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger

Hi,

a possible solution would be to start a wiki page with what you know, e.g. a
page which explains that solaris and zio* belong to ZFS. Over time people can
extend it with additional info.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Jeremy Chadwick free...@jdc.parodius.com wrote:
On Wed, Feb 08, 2012 at 10:29:36PM +0200, Andriy Gapon wrote:
 on 08/02/2012 12:31 Eugene M. Zheganin said the following:
  Hi.
  
  On 08.02.2012 02:17, Andriy Gapon wrote:
  [output snipped]
 
  Thank you.  I don't see anything suspicious/unusual there.
  Just in case, do you have ZFS dedup enabled by any chance?
 
  I think that examination of vmstat -m and vmstat -z outputs may provide some
  clues as to what got all that memory wired.
 
  Nope, I don't have deduplication feature enabled.
 
 OK.  So, did you have a chance to inspect vmstat -m and vmstat -z?

Andriy,

Politely -- recommending this to a user is a good choice of action, but
the problem is that no user, even an experienced user, is going to know
what all of the Types (vmstat -m) or ITEMs (vmstat -z) correlate
with on the system.

For example, for vmstat -m, the Type name is solaris.  For vmstat -z,
the ITEMs are named zio_* but I have a feeling there are more than just
that which pertain to ZFS.  I'm having to make *assumptions*.
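
As a rough illustration of that filtering, the sketch below picks out only
the two names mentioned above.  The function names zfs_malloc_types and
zfs_uma_zones are made up, the filters read the vmstat output on stdin, and
they assume each output starts with a header line (Type ... for -m, ITEM ...
for -z).  As said, other entries may pertain to ZFS too, so treat the result
as a lower bound.

```sh
#!/bin/sh
# zfs_malloc_types: reads `vmstat -m` output on stdin and keeps the header
# line plus the "solaris" malloc type.
zfs_malloc_types() {
    awk 'NR == 1 || $1 == "solaris"'
}

# zfs_uma_zones: reads `vmstat -z` output on stdin and keeps the header
# line plus every zone whose name starts with "zio_".
zfs_uma_zones() {
    awk '/^ITEM/ || /^zio_/'
}

# Intended use (not run here):
#   vmstat -m | zfs_malloc_types
#   vmstat -z | zfs_uma_zones
```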

The FreeBSD VM is highly complex and is not easy to understand even
remotely.  It becomes more complex when you consider that we use terms
like wired, active, inactive, cache, and free -- and none of
them, in simple English terms, actually represent the words chosen for
what they do.

Furthermore, the only definition I've been able to find over the years
for how any of these work, what they do/mean, etc. is here:

http://www.freebsd.org/doc/en/books/arch-handbook/vm.html

And this piece of documentation is only useful for people who understand
VMs (note: it was written by Matt Dillon, for example).  It is not
useful for end-users trying to track down what within the kernel is
actually eating up memory.  vmstat -m is about as good as it's going to get,
and like I said, with the names being borderline ambiguous
(depending on what you're looking for -- with VFS and so on it's spread
all over the place), this becomes a very tedious task, where the user or
admin have to continually ask developers on the mailing lists what it is
they're looking at.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: zfs arc and amount of wired memory

2012-02-09 Thread Alexander Leidinger
Hi,

feel free to register with FirstnameLastname in the wiki and tell us about it.
We provide write access to people who seriously want to help improve the wiki
content.

Bye,
Alexander.

-- 
Send via an Android device, please forgive brevity and typographic and spelling 
errors. 

Charles Sprickman sp...@bway.net wrote:
On Feb 8, 2012, at 7:43 PM, Artem Belevich wrote:

 On Wed, Feb 8, 2012 at 4:28 PM, Jeremy Chadwick
 free...@jdc.parodius.com wrote:
 On Thu, Feb 09, 2012 at 01:11:36AM +0100, Miroslav Lachman wrote:
 ...
 ARC Size:
  Current Size: 1769 MB (arcsize)
  Target Size (Adaptive):   512 MB (c)
  Min Size (Hard Limit):    512 MB (zfs_arc_min)
  Max Size (Hard Limit):    3584 MB (zfs_arc_max)
 
 The target size is going down to the min size and after a few more
 days, the system is so slow that I must reboot the machine. Then it
 runs fine for about 107 days and then it all repeats again.
 
 You can see more on MRTG graphs
 http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/
 You can see links to other useful information at the top of the page
 (arc_summary, top, dmesg, fs usage, loader.conf)
 
 There you can see nightly backups (higher CPU load started at
 01:13), otherwise the machine is idle.
 
 It corresponds with the ARC target size dropping over the last 5 days
 http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/local_zfs_arcstats_size.html
 
 And with ARC metadata cache overflowing the limit in last 5 days
 http://freebsd.quip.cz/ext/2012/2012-02-08-kiwi-mrtg-12-15/local_zfs_vfs_meta.html
 
 I don't know what's going on and I don't know if it is something
 known / fixed in newer releases. We are running a few more ZFS
 systems on 8.2 without this issue. But those systems are in
 different roles.
 
 This sounds like the... damn, what is it called... some kind of internal
 counter or ticks thing within the ZFS code that was discovered to
 only begin happening after a certain period of time (which correlated to
 some number of days, possibly 107).  I'm sorry that I can't be more
 specific, but it's been discussed heavily on the lists in the past, and
 fixes for all of that were committed to RELENG_8.  I wish I could
 remember the name of the function or macro or variable name it pertained
 to, something like LTHAW or TLOCK or something like that.  I would say
 I don't know why I can't remember, but I do know why I can't remember:
 because I gave up trying to track all of these problems.
 
 Does someone else remember this issue?  CC'ing Martin who might remember
 for certain.
 
 It's LBOLT. :-)
 
 And there was more than one related integer overflow. One of them
 manifested itself as L2ARC feeding thread hogging CPU time after about
 a month of uptime. Another one caused issue with ARC reclaim after 107
 days. See more details in this thread:
 
 http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011584.html

This would be an excellent piece of information to have on one of the ZFS
wiki pages.  The 107 day issue exists post-8.2, correct?  Anyone on this 
cc: list have permissions to edit those pages?

Thanks,

Charles

 
 --Artem

Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/8/2012 5:46 PM, Alexander Motin wrote:
 
 READ LOG EXT for NCQ, same as REQUEST SENSE for ATAPI sent by every
 specific controller driver. In this case by siis_issue_recovery()
 function in dev/siis/siis.c. In case of proper READ LOG EXT completion,
 fetched status returned to CAM together with original command.

Hi,
Is there a way to find out which drive is causing these errors?
Looking at the logs on the various drives, they all seem to have the odd
non-zero value.  I suspect it might be a Seagate disk, as smartctl flags
it as having bad firmware issues.


=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST31000333AS
Serial Number:    9TE14SRV
LU WWN Device Id: 5 000c50 010a39664
Firmware Version: SD35
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Feb  9 09:40:56 2012 EST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

 


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 09:43:01AM -0500, Mike Tancsa wrote:
 On 2/8/2012 5:46 PM, Alexander Motin wrote:
  
  READ LOG EXT for NCQ, same as REQUEST SENSE for ATAPI sent by every
  specific controller driver. In this case by siis_issue_recovery()
  function in dev/siis/siis.c. In case of proper READ LOG EXT completion,
  fetched status returned to CAM together with original command.
 
 Hi,
   Is there a way to find out which drive is causing these errors?
 Looking at the logs on the various drives, they all seem to have the odd
 non-zero value.  I suspect it might be a Seagate disk, as smartctl flags
 it as having bad firmware issues.
 
 
 === START OF INFORMATION SECTION ===
 Model Family:     Seagate Barracuda 7200.11
 Device Model:     ST31000333AS
 Serial Number:    9TE14SRV
 LU WWN Device Id: 5 000c50 010a39664
 Firmware Version: SD35
 User Capacity:    1,000,204,886,016 bytes [1.00 TB]
 Sector Size:      512 bytes logical/physical
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   8
 ATA Standard is:  ATA-8-ACS revision 4
 Local Time is:    Thu Feb  9 09:40:56 2012 EST
 
 ==> WARNING: There are known problems with these drives,
 see the following Seagate web pages:
 http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
 http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
 http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

The URLs listed are for firmware-level problems with this model of
Seagate drive.  This is a very famous firmware issue and got a lot of
media attention.  The bugs with that firmware, however, would not appear
as what you are seeing.

You stated in your original mail that you added a port multiplier then
started getting these errors.  You then provided SMART output of
/dev/ada9, so I made the assumption you had managed to figure out what
device was causing the problem.

I have to assume that devices connected on a port multiplier show up on
a separate scbusX number.  This is from your original mail:

 # camcontrol devlist
 WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 0 lun 0 (pass0,ada0)
 WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 1 lun 0 (pass1,ada1)
 WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 2 lun 0 (pass2,ada2)
 WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 3 lun 0 (pass3,ada3)
 Port Multiplier 47261095 1f06   at scbus0 target 15 lun 0 (pass4,pmp1)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 0 lun 0 (pass5,ada4)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 1 lun 0 (pass6,ada5)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 2 lun 0 (pass7,ada6)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 3 lun 0 (pass8,ada7)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 4 lun 0 (pass9,ada8)
 Port Multiplier 37261095 1706   at scbus1 target 15 lun 0 (pass10,pmp0)
 Areca usrvar R001   at scbus4 target 0 lun 0 (pass11,da0)
 Areca backup1 R001   at scbus4 target 0 lun 1 (pass12,da1)
 Areca RAID controller R001   at scbus4 target 16 lun 0 (pass13)
 AMCC 9650SE-2LP DISK 4.10   at scbus5 target 0 lun 0 (pass14,da2)
 ST31000333AS SD35   at scbus6 target 0 lun 0 (pass15,ada9)
 ST31000528AS CC35   at scbus7 target 0 lun 0 (pass16,ada10)
 ST31000340AS SD1A   at scbus8 target 0 lun 0 (pass17,ada11)
 WDC WD1002FAEX-00Z3A0 05.01D05   at scbus11 target 0 lun 0 (pass18,ada12)

Based on this, and assuming my understanding of how this setup works --
and please note I could be wrong, these port multiplier things I have no
familiarity with personally -- but it looks (to me) like this:

scbus0
  -- Associated with Port Multiplier device pmp1
  -- Disk ada0
  -- Disk ada1
  -- Disk ada2
  -- Disk ada3

scbus1
  -- Associated with Port Multiplier device pmp0
  -- Disk ada4
  -- Disk ada5
  -- Disk ada6
  -- Disk ada7
  -- Disk ada8

scbus4
  -- Appears to be an Areca controller of some kind, in RAID
  -- Disk da0, volume usrvar 
  -- Disk da1, volume backup1

scbus5
  -- Not sure what this thing is
  -- Disk or thing da2

scbus6
  -- Disk ada9

scbus7
  -- Disk ada10

scbus8
  -- Disk ada11

scbus11
  -- Disk ada12
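
The grouping above was done by hand.  A small filter along the lines of the
sketch below could produce the same bus-to-device mapping from camcontrol
devlist output.  The function name devices_by_bus is made up, and the parsing
assumes only what is visible in the listing: an scbusN token on each line and
a parenthesised device list as the last field.

```sh
#!/bin/sh
# devices_by_bus: reads `camcontrol devlist` output on stdin and prints one
# line per bus, e.g. "scbus0: ada0 ada1 pmp1".
devices_by_bus() {
    awk '
    {
        bus = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^scbus[0-9]+$/) bus = $i
        if (bus == "") next
        # The last field looks like "(pass0,ada0)" or "(pass13)".
        dev = $NF
        gsub(/[()]/, "", dev)
        n = split(dev, names, ",")
        # Prefer the non-pass name when there is one.
        name = names[1]
        for (j = 1; j <= n; j++)
            if (names[j] !~ /^pass[0-9]+$/) name = names[j]
        if (!(bus in seen)) { seen[bus] = 1; order[++count] = bus }
        list[bus] = list[bus] " " name
    }
    END {
        for (k = 1; k <= count; k++)
            print order[k] ":" list[order[k]]
    }'
}

# Intended use (not run here):
#   camcontrol devlist | devices_by_bus
```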

So which Port Multiplier did you add?  The one at scbus0 or scbus1?

A full dmesg (not just a snippet) would probably be helpful here.  What
you provided in your first post was too terse, especially given how many
disks you have in this system.  :-)

I really see no problem with looking at all disks -- specifically disks
ada0 through ada3, and ada4 through ada8 -- to determine which one may
be having problems.  You're welcome to run smartctl -a on each one and
put them up on the web, preferably segregated by disk name (e.g.
ada0.txt, ada1.txt, etc.) and I can review them all.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems 

Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/9/2012 10:22 AM, Jeremy Chadwick wrote:
 
 I have to assume that devices connected on a port multiplier show up on
 a separate scbusX number.  This is from your original mail:

 Based on this, and assuming my understanding of how this setup works --
 and please note I could be wrong, these port multiplier things I have no
 familiarity with personally -- but it looks (to me) like this:
 
 scbus0
   -- Associated with Port Multiplier device pmp1
   -- Disk ada0
   -- Disk ada1
   -- Disk ada2
   -- Disk ada3

Correct. This is the original hardware.  It too was showing the odd
error prior to adding the new set of disks to expand the zfs pool.  e.g.
here are some errors on the original PM

Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 24
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 24002a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 13
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 24000a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 29
Feb  4 22:55:02 backup3 kernel: siisch0: siis_timeout is 0004 ss 25002a00 rs 25002a00 es  sts 80182000 serr
Feb  4 22:55:02 backup3 kernel: siisch0:  ... waiting for slots 04000a00
Feb  4 22:55:02 backup3 kernel: siisch0: Timeout on slot 11


 
 scbus1
   -- Associated with Port Multiplier device pmp0
   -- Disk ada4
   -- Disk ada5
   -- Disk ada6
   -- Disk ada7
   -- Disk ada8

Correct, this is the new PM. 4 disks in use, and one spare.

 
 scbus4
   -- Appears to be an Areca controller of some kind, in RAID

yes.

   -- Disk da0, volume usrvar 
   -- Disk da1, volume backup1
 
 scbus5
   -- Not sure what this thing is

3ware with a pair of faster disks that holds a large DB to slice and
dice netflow data.

   -- Disk or thing da2
 
 scbus6
 scbus7
 scbus8
 scbus11
   -- Disk ada12

Disks off the motherboard.

 
 So which Port Multiplier did you add?  The one at scbus0 or scbus1?

1
WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 0 lun 0 (pass5,ada4)
WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 1 lun 0 (pass6,ada5)
WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 2 lun 0 (pass7,ada6)
WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 3 lun 0 (pass8,ada7)
WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 4 lun 0 (pass9,ada8)
Port Multiplier 37261095 1706   at scbus1 target 15 lun 0 (pass10,pmp0)





 
 A full dmesg (not just a snippet) would probably be helpful here.  What
 you provided in your first post was too terse, especially given how many
 disks you have in this system.  :-)
 
 I really see no problem with looking at all disks -- specifically disks
 ada0 through ada3, and ada4 through ada8 -- to determine which one may
 be having problems.  You're welcome to run smartctl -a on each one and
 put them up on the web, preferably segregated by disk name (e.g.
 ada0.txt, ada1.txt, etc.) and I can review them all.

Actually, I just had a look at another server at our DR site. Its
hardware has not changed in a while, but I did bring the kernel up to date.
It's now logging the odd 'READ LOG EXT' error as well.  Its kernel is
from Jan 22.  Prior to that kernel update, I had not seen these errors.
Perhaps something in the driver (ahci or cam layer?) has changed?

Feb  4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT

The output is in one giant txt file, but each section has the heading
of the disk:

  for i in `jot 10 0`; do
    echo "ada$i ==" >> d.rep
    smartctl -x /dev/ada$i >> d.rep
    smartctl -l gplog,0x10 /dev/ada$i >> d.rep
  done



http://www.tancsa.com/ahci.txt


---Mike







-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: serious packet routing issue causing ntpd high load?

2012-02-09 Thread Qing Li
Hi Vlad,

Sorry about the delayed response. No, this one just fell through the cracks.

Has anyone responded?  Does it still exist in 9.x?

--Qing

On Mon, Feb 6, 2012 at 10:16 AM, Vlad Galu d...@dudu.ro wrote:
 Hi Qing,

 Any luck with this?

 Thanks
 Vlad


 On Thu, Nov 3, 2011 at 2:05 PM, Li, Qing qing...@bluecoat.com wrote:

 This endless route lookup miss message problem is reproducible without
 FLOWTABLE.  The problem is with the multiple FIBs. I cannot reproduce
 this problem in my home network but the problem is easily seen at work.

 The route lookup miss itself in multi-FIBs configuration may be normal
 depending on the actual system configuration. It's the flooding of
 RTM_MISS messages that is abnormal. For example, if the route to the
 DNS servers is not configured in all FIBs, then the RTM_MISS
 message will be generated when a userland application sends to an
 explicit IP address in a specific FIB.
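
 To see whether a given address (a DNS server, say) has a route in every FIB,
 something like the sketch below may help.  It assumes setfib(1) and
 "route -n get" are available and that the lookup exits non-zero when no
 route is found.  The function name check_fibs and the example address are
 made up for illustration.

```sh
#!/bin/sh
# check_fibs NFIBS ADDR
# Looks ADDR up in FIB 0 .. NFIBS-1 and reports which FIBs have no route.
# SETFIB can be overridden for testing; it defaults to setfib(1).
check_fibs() {
    nfibs=$1
    addr=$2
    fib=0
    while [ "$fib" -lt "$nfibs" ]; do
        if ${SETFIB:-setfib} "$fib" route -n get "$addr" > /dev/null 2>&1; then
            echo "fib $fib: route to $addr found"
        else
            echo "fib $fib: no route to $addr"
        fi
        fib=$((fib + 1))
    done
}

# Intended use (192.0.2.53 is a documentation address, not run here):
#   check_fibs "$(sysctl -n net.fibs)" 192.0.2.53
```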

 In any case, I can reproduce the issue consistently and am just trying to get
 a few uninterrupted hours to get it done.

 --Qing



Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Gary Palmer
On Thu, Feb 09, 2012 at 07:22:40AM -0800, Jeremy Chadwick wrote:
 I have to assume that devices connected on a port multiplier show up on
 a separate scbusX number.  This is from your original mail:
 
  # camcontrol devlist
  WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 0 lun 0 (pass0,ada0)
  WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 1 lun 0 (pass1,ada1)
  WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 2 lun 0 (pass2,ada2)
  WDC WD2001FASS-00U0B0 01.00101   at scbus0 target 3 lun 0 (pass3,ada3)
  Port Multiplier 47261095 1f06   at scbus0 target 15 lun 0 (pass4,pmp1)
  WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 0 lun 0 (pass5,ada4)
  WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 1 lun 0 (pass6,ada5)
  WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 2 lun 0 (pass7,ada6)
  WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 3 lun 0 (pass8,ada7)
  WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 4 lun 0 (pass9,ada8)
  Port Multiplier 37261095 1706   at scbus1 target 15 lun 0 (pass10,pmp0)
  Areca usrvar R001   at scbus4 target 0 lun 0 (pass11,da0)
  Areca backup1 R001   at scbus4 target 0 lun 1 (pass12,da1)
  Areca RAID controller R001   at scbus4 target 16 lun 0 (pass13)
  AMCC 9650SE-2LP DISK 4.10   at scbus5 target 0 lun 0 (pass14,da2)
  ST31000333AS SD35   at scbus6 target 0 lun 0 (pass15,ada9)
  ST31000528AS CC35   at scbus7 target 0 lun 0 (pass16,ada10)
  ST31000340AS SD1A   at scbus8 target 0 lun 0 (pass17,ada11)
  WDC WD1002FAEX-00Z3A0 05.01D05   at scbus11 target 0 lun 0 (pass18,ada12)
 
 Based on this, and assuming my understanding of how this setup works --
 and please note I could be wrong, these port multiplier things I have no
 familiarity with personally -- but it looks (to me) like this:
 
 scbus5
   -- Not sure what this thing is
   -- Disk or thing da2

3ware 9650SE controller (twa driver, I believe)

Gary


Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 11:12:06AM -0500, Mike Tancsa wrote:
 {snipping}

  So which Port Multiplier did you add?  The one at scbus0 or scbus1?
 
 1
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 0 lun 0 (pass5,ada4)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 1 lun 0 (pass6,ada5)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 2 lun 0 (pass7,ada6)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 3 lun 0 (pass8,ada7)
 WDC WD2002FAEX-007BA0 05.01D05   at scbus1 target 4 lun 0 (pass9,ada8)
 Port Multiplier 37261095 1706   at scbus1 target 15 lun 0 (pass10,pmp0)

I'll provide analysis for all 5 of these disks below.

  A full dmesg (not just a snippet) would probably be helpful here.  What
  you provided in your first post was too terse, especially given how many
  disks you have in this system.  :-)
  
  I really see no problem with looking at all disks -- specifically disks
  ada0 through ada3, and ada4 through ada8 -- to determine which one may
  be having problems.  You're welcome to run smartctl -a on each one and
  put them up on the web, preferably segregated by disk name (e.g.
  ada0.txt, ada1.txt, etc.) and I can review them all.
 
 Actually, I just had a look at another server at our DR site. Its
 hardware has not changed in a while, but I did bring the kernel up to date.
 It's now logging the odd 'READ LOG EXT' error as well.  Its kernel is
 from Jan 22.  Prior to that kernel update, I had not seen these errors.
 Perhaps something in the driver (ahci or cam layer?) has changed?
 
 Feb  4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT

Perhaps, but mav@ would be the authority on that.

 http://www.tancsa.com/ahci.txt

So here are the results of analysis for disks ada4 through ada8:

ada4
  -- When the errors below happened is 100% unknown.  Just noting
  that here.
  -- SMART attribute 199 shows 13 CRC errors.  These would be
  caused by issues between the disk and the device it's
  attached to (port multiplier I guess).  Causes could be
  bad SATA cables, bad ports, dirty/dusty ports, or flaky
  PCB (on the disk itself).
  -- SATA PHY log/counters confirm the above problem:
  ID      Size  Value  Description
  0x0001  2     13     Command failed due to ICRC error
  0x0002  2     13     R_ERR response for data FIS
  0x0003  2     13     R_ERR response for device-to-host data FIS
  -- Given this behaviour, possibly the ATA commands submitted which
  experienced errors were NCQ-related.
  -- The NCQ command error log does have non-zero values in it.
  The format of the output is proprietary, sadly, and smartmontools
  does not know how to decode it.  But, compare it to your other
  drives and you'll see there is non-zero data there.
  -- This is a likely candidate for the behaviour seen on this PM.

ada5
  -- When the errors below happened is 100% unknown.  Just noting
  that here.
  -- SMART attribute 199 shows 11 CRC errors.  These would be
  caused by issues between the disk and the device it's
  attached to (port multiplier I guess).  Causes could be
  bad SATA cables, bad ports, dirty/dusty ports, or flaky
  PCB (on the disk itself).
  -- SATA PHY log/counters confirm the above problem:
  ID      Size  Value  Description
  0x0001  2     11     Command failed due to ICRC error
  0x0002  2     11     R_ERR response for data FIS
  0x0003  2     11     R_ERR response for device-to-host data FIS
  -- Given this behaviour, possibly the ATA commands submitted which
  experienced errors were NCQ-related.
  -- The NCQ command error log does have non-zero values in it.
  The format of the output is proprietary, sadly, and smartmontools
  does not know how to decode it.  But, compare it to your other
  drives and you'll see there is non-zero data there.
  -- This is a likely candidate for the behaviour seen on this PM.

ada6
  -- When the below errors happened are 100% unknown.  Just noting
  that here.
  -- SMART attribute 199 shows 8 CRC errors.  These would be
  caused by issues between the disk and the device its
  attached to (port multiplier I guess).  Causes could be
  bad SATA cables, bad ports, dirty/dusty ports, or flaky
  PCB (on the disk itself).
  -- SATA PHY log/counters confirms above problem:
  ID  Size Value  Description
  0x0001  2   8  Command failed due to ICRC error
  0x0002  2   8  R_ERR response for data FIS
  0x0003  2   8  R_ERR response for device-to-host data FIS
  -- Given this behaviour, the ATA commands which experienced
  errors were possibly NCQ-related.
  -- The NCQ command error log does have non-zero values in it.
  The format of the output is proprietary, sadly, and smartmontools
  does not know how to decode it.  But, compare it to your other
  drives and you'll see there is non-zero data there.
  -- This is a likely candidate for 

Re: siisch1: Error while READ LOG EXT

2012-02-09 Thread Mike Tancsa
On 2/9/2012 11:34 AM, Jeremy Chadwick wrote:
 
 You will probably need to track these drives on a regular basis.  That
 is to say, set up some cronjob or similar that logs the above output to
 a file (appends data to it), specifically output from smartctl -A (not
 -a and not -x) and smartctl -l sataphy on a per-disk basis.  smartd can
 track SMART attribute changes, but does not track GPLog changes.  Make
 sure to put timestamps in your logs.
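
A minimal sketch of such a logging job, in Python.  The device names, the log path and the helper names are assumptions made for illustration; only the two smartctl invocations (smartctl -A and smartctl -l sataphy) come from the suggestion above.  smartctl exit codes are not inspected here.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical values: adjust the device names and log path for your system.
DEVICES = ["/dev/ada5", "/dev/ada6"]
LOG_PATH = "/var/log/smart-track.log"


def smartctl_commands(device):
    """Return the two smartctl invocations named in the suggestion."""
    return [
        ["smartctl", "-A", device],
        ["smartctl", "-l", "sataphy", device],
    ]


def format_entry(timestamp, command, output):
    """Build one log entry: a timestamped header line, then the raw output."""
    header = "=== {} {}".format(timestamp, " ".join(command))
    return header + "\n" + output.rstrip("\n") + "\n"


def collect(devices, run=subprocess.run):
    """Run every command for every device and return the formatted entries."""
    entries = []
    for device in devices:
        for command in smartctl_commands(device):
            timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            result = run(command, capture_output=True, text=True)
            entries.append(format_entry(timestamp, command, result.stdout))
    return entries


def main(devices=DEVICES, log_path=LOG_PATH):
    """Append one timestamped entry per command and per disk to the log."""
    with open(log_path, "a") as log:
        log.writelines(collect(devices))
```

Calling main() from a small wrapper script run by cron, at whatever interval suits, appends to the same file on every run, so the history of each counter is kept.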

Thanks very much for having a look, and for the suggestions.  I think this
is the way to go to see which drive may have errors incrementing.
Alexander, is there a better way you can suggest?
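
One possible way to spot incrementing errors is to compare two snapshots of the SATA PHY counters.  The sketch below, in Python, assumes rows laid out as in the analysis quoted earlier in this thread (ID, size, value, description); the function names and the "later" sample values are made up for illustration.

```python
import re

# Matches rows such as "0x0001  2   11  Command failed due to ICRC error".
ROW = re.compile(r"^\s*(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_phy_counters(text):
    """Map counter ID -> (value, description) for each counter row in text."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            counter_id, _size, value, description = match.groups()
            counters[counter_id] = (int(value), description)
    return counters


def incremented(old_text, new_text):
    """Return (id, old value, new value, description) for counters that grew."""
    old = parse_phy_counters(old_text)
    new = parse_phy_counters(new_text)
    changes = []
    for counter_id, (value, description) in sorted(new.items()):
        previous = old.get(counter_id, (0, description))[0]
        if value > previous:
            changes.append((counter_id, previous, value, description))
    return changes


# Sample snapshots; the values in "later" are invented for the example.
earlier = """\
ID  Size Value  Description
0x0001  2   11  Command failed due to ICRC error
0x0002  2   11  R_ERR response for data FIS
"""
later = """\
ID  Size Value  Description
0x0001  2   13  Command failed due to ICRC error
0x0002  2   11  R_ERR response for data FIS
"""

for counter_id, before, after, description in incremented(earlier, later):
    print(f"{counter_id}: {before} -> {after}  ({description})")
```

Only counters whose value grew between the two snapshots are reported, so a drive with old but stable CRC errors stays quiet.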

 
 As for fixing the problem: I have no idea how you would go about this.
 Use of port multipliers involves additional cables, possibly of shoddy
 quality, or other components which may not be decent/reliable.  


Possibly.  Cables are one of those things I am happy to pay extra for
better quality, but how does one assess the quality of such parts?

 
 Overall, this is just one of the many reasons why I avoid PMs, as well
 as avoid eSATA (especially eSATA).  

Yeah, at some point it doesn't really work with too many PMs, especially
if you can't query the thing to find out where things are bad.  I think
for the next version of this box I will use the newer generation 3ware
SAS/SATA controller.

---Mike



-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Julian Elischer

Does anyone know of problems with FreeBSD and this system?

The kernel we tried to boot seems to stop somewhere in the ahci probing.



Re: known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 01:48:29PM -0800, Julian Elischer wrote:
 does anyone know of problems with freebsd and this system?
 
 the kernel We tried to boot seems to stop somewhere in the ahci probing.

Few things:

1) Possible to get full console output (e.g. serial, etc.) from a verbose
boot?

2) Can you also provide the exact release/tag/kernel/thing you're trying
to install or upgrade to (8.x is a little vague; there are all sorts
of changes that happen between tags).  For example 8.1 is not going to
behave the same necessarily as 8.2.

3) When you say "ahci probing", are you booting a standard installation
CD/DVD/memstick of, say, 8.2?  If so, those won't make use of the
AHCI-to-CAM translation layer (and that AHCI code is also different than
the native-ATA-AHCI code), so you might try, when booting the system,
dropping to the loader prompt and issuing "load ahci.ko" before typing
"boot".  See if that helps.  If it does, great, use it (ahci_load="yes"
in /boot/loader.conf) permanently (and benefit from things like NCQ
too).

4) If it's an Intel ESB2 controller, I believe there were some fixes or
identification shims put in place for this in recent RELENG_8, which
wouldn't be available in RELENG_8_2 or 8.2-RELEASE CD/DVDs.  I could be
remembering the wrong controller though.  Hmm...
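
For point 3, a small sketch in Python that checks whether loader.conf-style text enables the ahci module.  The parser is deliberately simplified, the function names are hypothetical, and treating the value case-insensitively is an assumption of this sketch rather than a statement about the loader.

```python
import re

# A loader.conf assignment: name="value" (quoted) or name=value (unquoted).
ASSIGNMENT = re.compile(r'^\s*([A-Za-z0-9_.-]+)\s*=\s*(?:"([^"]*)"|(\S*))')


def loader_settings(text):
    """Return a dict of name -> value, skipping blank and comment lines."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = ASSIGNMENT.match(line)
        if match:
            name, quoted, bare = match.groups()
            settings[name] = quoted if quoted is not None else bare
    return settings


def ahci_enabled(text):
    """True if ahci_load is set to yes (compared case-insensitively)."""
    return loader_settings(text).get("ahci_load", "").lower() == "yes"


# Hypothetical file contents, for illustration only.
sample = """\
# /boot/loader.conf
zfs_load="YES"
ahci_load="YES"
"""
print(ahci_enabled(sample))
```

On a real system the text would come from reading /boot/loader.conf; the sample string stands in for it here.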

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: serious packet routing issue causing ntpd high load?

2012-02-09 Thread Steven Hartland
- Original Message - 
From: Qing Li qin...@freebsd.org

Sorry about the delayed response. No, this one just fell through the cracks.

Has anyone responded ?  Does it still exist in 9.x ?


We discovered yesterday that adding the following routes fixed the
issue.  They are present in /etc/rc.d/network_ipv6, but are not
active unless ipv6_enable=YES is set:

route add -inet6 ::ffff:0.0.0.0 -prefixlen 96 ::1 -reject
route add -inet6 ::0.0.0.0 -prefixlen 96 ::1 -reject

I haven't confirmed it, but these routes are reported to be set
by default on 9.x due to the changes in the rc.d scripts.
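
To illustrate which destinations the two reject routes cover, here is a short Python sketch using the standard ipaddress module.  It assumes the first prefix is the IPv4-mapped range ::ffff:0.0.0.0/96 and the second the IPv4-compatible range ::0.0.0.0/96.  It is a plain membership test, not a routing lookup, so more specific routes (such as a host route for ::1) are not modelled.

```python
import ipaddress

# Assumed prefixes: IPv4-mapped and IPv4-compatible IPv6 addresses.
REJECTED = [
    ipaddress.ip_network("::ffff:0.0.0.0/96"),
    ipaddress.ip_network("::0.0.0.0/96"),
]


def is_rejected(address):
    """Return True if the address falls inside one of the rejected prefixes."""
    addr = ipaddress.ip_address(address)
    return any(addr in network for network in REJECTED)


# 192.0.2.1 and 2001:db8::1 are documentation addresses used as samples.
for sample in ["::ffff:192.0.2.1", "::192.0.2.1", "2001:db8::1"]:
    print(sample, is_rejected(sample))
```

The first two samples fall inside the rejected prefixes and the third does not, which matches the idea that only IPv4-embedded destinations are turned away.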

   Regards
   Steve





Re: known problems with 8.x and HP DL16 G5 server?

2012-02-09 Thread Jeremy Chadwick
On Thu, Feb 09, 2012 at 04:02:12PM -0800, Julian Elischer wrote:
 On 2/9/12 1:56 PM, Jeremy Chadwick wrote:
 On Thu, Feb 09, 2012 at 01:48:29PM -0800, Julian Elischer wrote:
 does anyone know of problems with freebsd and this system?
 
 the kernel We tried to boot seems to stop somewhere in the ahci probing.
 Few things:
 
 1) Possible to get full console output (e.g. serial, etc.) from a verbose
 boot?
 
 It's FreeBSD 8.2 from a TrueNAS/FreeNAS.  I'm actually at ix-systems at the
 moment, but I was hoping someone could save us some time by saying
 "Oh yeah, merge in change number xx".
 
 2) Can you also provide the exact release/tag/kernel/thing you're trying
 to install or upgrade to (8.x is a little vague; there are all sorts
 of changes that happen between tags).  For example 8.1 is not going to
 behave the same necessarily as 8.2.
 
 3) When you say ahci probing, are you booting a standard installation
 CD/DVD/memstick of, say, 8.2?  If so, those won't make use of the
 AHCI-to-CAM translation layer (and that AHCI code is also different than
 the native-ATA-AHCI code), so you might try, when booting the system,
 dropping to the loader prompt and issuing load ahci.ko before typing
 boot.  See if that helps.  If it does, great, use it (ahci_load=yes
 in /boot/loader.conf) permanently (and benefit from things like NCQ
 too).
 let me forward you an image...
 4) If it's an Intel ESB2 controller, I believe there were some fixes or
 identification shims put in place for this in recent RELENG_8, which
 wouldn't be available in RELENG_8_2 or 8.2-RELEASE CD/DVDs.  I could be
 remembering the wrong controller though.  Hmm...
 
 
 that may be what we are looking for.
 
 I'll try to get more info.

For others: the last few lines in the kernel log are:

acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
acpi_hpet0: vend: 0x8086 rev: 0x1 num: 3 hz: 14318180 opts: legacy_route 64-bit
Timecounter HPET frequency 14318180 Hz quality 900
acpi: wakeup code va 0xff848311d000 pa 0x4000
ahc_isa_probe 0: ioport 0xc00 alloc failed

I don't see any indication of AHCI problems here (or AHCI at all).
ahc_isa_probe is for the ahc(4) controller -- Adaptec SCSI.

A verbose boot might be more helpful.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
