Re: [zfs-discuss] SSDs with a SCSI SCA interface?
On Thu, Dec 3, 2009 at 11:06 PM, Erik Trimble erik.trim...@sun.com wrote: I need either: (a) a SSD with an Ultra160/320 parallel interface (I can always find an interface adapter, so I'm not particular about whether it's a 68-pin or SCA) (b) a SAS or SATA to UltraSCSI adapter (preferably with a SCA interface) Check http://lmgtfy.com/?q=scsi+ssd 3 of the top 5 results might work for you. -B -- Brandon High : bh...@freaks.com For sale: One moral compass, never used. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS with hundreds of millions of files
I would like to ask a question regarding ZFS performance overhead when there are hundreds of millions of files. We have a storage solution where one of the datasets has a folder structure containing about 400 million files and folders (very small 1K files). What kind of overhead do we get from this kind of thing? Our storage performance has degraded over time, and we have been looking in different places for the cause of the problem, but now I am wondering if it's simply a file pointer issue? Cheers //Rey -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
I fully agree. This needs fixing. I can think of so many situations where device names change in OpenSolaris (especially with movable pools). This problem can lead to serious data corruption. Besides persistent L2ARC (which is much more difficult, I would say), making the L2ARC rely on labels instead of device paths is essential. Can someone open a CR for this? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
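Until such a CR is filed and fixed, one workaround after a device renumbering is simply to drop the stale cache vdev and re-add it under its current name; a rough sketch, with pool and device names as placeholders only:

# remove the cache device under its old name, then add it back under the new one
zpool remove tank c4t0d0
zpool add tank cache c6t0d0

Losing the L2ARC contents this way is harmless; the cache just has to warm up again.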
Re: [zfs-discuss] zfs sequential read performance
I wonder if it is a real problem, i.e. does it, for example, cause longer backup times, and will it be addressed in the future?

It doesn't cause longer backup time as long as you're doing a zfs send | zfs receive. But it could cause longer backup time if you're using something like tar. The only way to solve it is to eliminate copy on write (negating the value of ZFS), or choose to pay the price during regular operation, resulting in an overall slower system. You can expect it won't be changed or addressed in any way. You can also expect you'll never be able to detect or measure this as a performance problem that you care about. ZFS and copy on write are so much faster at other things, such as backups, and add so much value in terms of snapshots and data reliability ... There is a special case where the performance is lower. I don't mean to disrespect the concerns of anybody who is affected by that special case, but I believe it's uncommon.

So I should ask another question: is zfs suitable for an environment that has lots of data changes? I think for random I/O there will be no such performance penalty, but if you back up a zfs dataset, must the backup utility sequentially read the blocks of the dataset? Will a zfs dataset be suitable for database temporary tablespaces or online redo logs?

Yes, ZFS is great for environments that write a lot, and do random writes a lot. There is only one situation where the performance is lower, and it's specific:
* You write a large amount of sequential data.
* Then you randomly write a lot *inside* that large sequential file.
* Then you sequentially read the data back.
Performance is not hurt if you eliminate any one of those points:
* If you did not start by writing a large file in one shot, you won't have a problem.
* If you do lots of random writes, but they're not in the middle of a large sequential file, you won't have a problem.
* If you always read or write that file randomly, you won't have a problem.
* The only time you have a problem is when you sequentially read a large file that previously had many random writes in the middle.
Even in that case, the penalty you pay is usually small enough that you wouldn't notice. But it's possible. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
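For anyone who wants to see the effect rather than take it on faith, a rough reproduction sketch follows. The pool name, file size and iteration count are made up, and the final read should be done with a cold cache (or against a file larger than RAM) for the timing to mean anything:

# 1. create a large file with one sequential write
dd if=/dev/zero of=/tank/test/big bs=1M count=1024

# 2. rewrite random 8 KB chunks inside it; copy-on-write relocates every rewritten block
i=0
while [ $i -lt 10000 ]; do
  dd if=/dev/urandom of=/tank/test/big bs=8k count=1 \
     seek=$(( (RANDOM * 32768 + RANDOM) % 131072 )) conv=notrunc 2>/dev/null
  i=$((i + 1))
done

# 3. time a sequential read back and compare with a never-rewritten copy of the same size
ptime dd if=/tank/test/big of=/dev/null bs=1M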
Re: [zfs-discuss] Import zpool from FreeBSD in OpenSolaris
On Wed, Feb 24, 2010 at 3:31 AM, Ethan notet...@gmail.com wrote: On Tue, Feb 23, 2010 at 21:22, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: Just a couple of days ago there was discussion of importing disks from Linux FUSE zfs. The import was successful. The same methods used (directory containing symbolic links to desired device partitions) might be used. Yes, I ran into this very recently, moving from zfs-fuse on linux to OpenSolaris. My import looked almost exactly like yours. I did something along the lines of: mkdir dskp0s # create a temp directory to point to the p0 partitions of the relevant disks cd dskp0s ln -s /dev/dsk/c8t1d0p0 ln -s /dev/dsk/c8t2d0p0 ln -s /dev/dsk/c8t3d0p0 ln -s /dev/dsk/c8t4d0p0 zpool import -d . secure (substituting in info for your pool) and it imported, no problem. I have read your thread now (Help with corrupted pool for anyone interested). It raised some questions in my head... Someone wrote that this method will not work at boot? Does that mean that the pool won't automount at boot or that I can't boot from it (that one I don't mind)? Also, if I import the pool successfully in OpenSolaris (and don't do zfs/zpool upgrade), will that destroy my chances of importing it in FreeBSD again in the (near) future? I haven't tried importing because I'm not yet ready to make the switch to OpenSolaris, I need to replace a controller (Promise PDC40518 SATA150 TX4) that isn't recognized, but that's another issue. I too would want an answer to Is there any significant advantage to having a partition table?, but seeing that your last post is pretty new, maybe that will come. Thanks for the help. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
What kind of overhead do we get from this kind of thing? Overheadache... (Thanks to Kronberg for the answer) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
Steve steve.jack...@norman.com writes: I would like to ask a question regarding ZFS performance overhead when having hundreds of millions of files. We have a storage solution, where one of the datasets has a folder containing about 400 million files and folders (very small 1K files). What kind of overhead do we get from this kind of thing?

At least 50%. I don't think this is obvious, so I'll state it: RAID-Z will not gain you any additional capacity over mirroring in this scenario. Remember that each individual file gets its own stripe. If the file is 512 bytes or less, you'll need another 512-byte block for the parity (actually, as a special case, it's not parity but a copy; parity would just be an inversion of all bits, so it's not useful to spend time computing it). What's more, even if the file is 1024 bytes or less, ZFS will allocate an additional padding block to reduce the chance of unusable single disk blocks. A 1536-byte file will also consume 2048 bytes of physical disk, however. The reasoning for RAID-Z2 is similar, except it will add a padding block even for the 1536-byte file. To summarise (net file size versus physical bytes consumed):

net     raid-z1          raid-z2
 512    1024  (2x)       1536  (3x)
1024    2048  (2x)       3072  (3x)
1536    2048  (1⅓x)      3072  (2x)
2048    3072  (1½x)      3072  (1½x)
2560    3072  (1⅕x)      3584  (1⅖x)

The above assumes at least 8 (9) disks in the vdev; otherwise you'll get a little more overhead for the larger file sizes.

Our storage performance has degraded over time, and we have been looking in different places for cause of problems, but now I am wondering if its simply a file pointer issue?

Adding new files will fragment directories, and that might cause performance degradation depending on access patterns. I don't think many files in itself will cause problems, but since you get a lot more ZFS records in your dataset (128x!), more of the disk space is wasted on block pointers, and you may get more block pointer writes since more levels are needed. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
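If you want a quick sanity check of what one of those tiny files actually costs on your own pool, comparing the logical size with the blocks charged is usually enough; a sketch, with the path being just an example:

# ls -ls: first column is 512-byte blocks actually charged, including parity/padding/copies
ls -ls /tank/data/00/00/00/01/smallfile
# du reports allocated space too; compare with the logical size from ls -l
du -k /tank/data/00/00/00/01/smallfile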
Re: [zfs-discuss] ZFS with hundreds of millions of files
Hei Kjetil. Actually we are using hardware RAID5 on this setup... so Solaris only sees a single device... The overhead I was thinking of was more in the pointer structures (bearing in mind this is a 128-bit file system); I would guess that memory requirements would be HUGE for all these files... otherwise the ARC is going to struggle, and the paging system is going mental? //Rey -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Adding a zfs mirror drive to rpool - new drive formats to one cylinder less
On Tue, February 23, 2010 17:58, tomwaters wrote: Thanks for that. It seems strange though that the two disks, which are from the same manufacturer, same model, same firmware and similar batch/serial's behave differently. I've found that the ways of writing labels and partitions in Solaris are arcane and unpredictable. The two sides of my rpool mirror are slightly different sizes, even though they're identical drives installed at the same time. I presume it's somehow something I did in the process, but I haven't nailed down exactly what (and eventually got past the point of caring). I was replacing rpool disks, so the pattern was attach new, resilver, attach new, resilver, remove old, remove old -- so it just auto-expanded to the smaller of the two new at the time the second old was removed; I didn't run into the problem of being unable to attach the disk because it's too small. (Also have to run installgrub of course, not listed in above pattern.) I am also puzzled that the rpool disk appears to start at cylinder 0 and not 1. Historical habit, I think, from when cylinders were much smaller and filesystems expected boot information to be outside the filesystem space (my understanding is that ZFS is set up to not overwrite where the boot stuff would be in the slice ZFS is using, just in case it's there). -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
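For reference, the attach/resilver/detach pattern described above looks roughly like this; the device names are placeholders, and each attach should be allowed to finish resilvering (watch zpool status) before the next step:

zpool attach rpool c8t0d0s0 c9t0d0s0    # attach first new disk, wait for resilver
zpool attach rpool c8t0d0s0 c9t1d0s0    # attach second new disk, wait for resilver
zpool detach rpool c8t0d0s0             # drop first old disk
zpool detach rpool c8t1d0s0             # drop second old disk
# make the new disks bootable
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c9t0d0s0
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c9t1d0s0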
Re: [zfs-discuss] zfs sequential read performance
Once the famous bp rewriter is integrated and a defrag functionality built on top of it you will be able to re-arrange your data again so it is sequential again. Then again, this would also rearrange your data to be sequential again: cp -p somefile somefile.tmp ; mv -f somefile.tmp somefile ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Import zpool from FreeBSD in OpenSolaris
On Wed, Feb 24, 2010 at 08:12, li...@dentarg.net wrote: On Wed, Feb 24, 2010 at 3:31 AM, Ethan notet...@gmail.com wrote: On Tue, Feb 23, 2010 at 21:22, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: Just a couple of days ago there was discussion of importing disks from Linux FUSE zfs. The import was successful. The same methods used (directory containing symbolic links to desired device partitions) might be used. Yes, I ran into this very recently, moving from zfs-fuse on linux to OpenSolaris. My import looked almost exactly like yours. I did something along the lines of: mkdir dskp0s # create a temp directory to point to the p0 partitions of the relevant disks cd dskp0s ln -s /dev/dsk/c8t1d0p0 ln -s /dev/dsk/c8t2d0p0 ln -s /dev/dsk/c8t3d0p0 ln -s /dev/dsk/c8t4d0p0 zpool import -d . secure (substituting in info for your pool) and it imported, no problem. I have read your thread now (Help with corrupted pool for anyone interested). It raised some questions in my head... Someone wrote that this method will not work at boot? Does that mean that the pool won't automount at boot or that I can't boot from it (that one I don't mind)? Also, if I import the pool successfully in OpenSolaris (and don't do zfs/zpool upgrade), will that destroy my chances of importing it in FreeBSD again in the (near) future? I haven't tried importing because I'm not yet ready to make the switch to OpenSolaris, I need to replace a controller (Promise PDC40518 SATA150 TX4) that isn't recognized, but that's another issue. I too would want an answer to Is there any significant advantage to having a partition table?, but seeing that your last post is pretty new, maybe that will come. Thanks for the help. My pool, using p0 devices, does currently mount at boot. This is using the p0 devices in /dev/dsk, not symlinks in a directory (those have been deleted). This does seem like it might indicate that importing the devices changed something on disk - I'm not sure what, but something that has caused it to know that it should use the p0 devices rather than s2 or s8. I have no idea whether this would affect ability to import in BSD. It does not seem like it should, intuitively, but I don't have much basis in actual knowledge of the inner workings to say that - you shouldn't take my word for it. As for partition tables, everything seems to be working happily without them. I'd still like to know from somebody more knowledgeable about what I might lose, not having them, but I haven't run into any issues yet. -Ethan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSDs with a SCSI SCA interface?
On Tue, Feb 23, 2010 at 2:09 PM, Erik Trimble erik.trim...@sun.com wrote: Al Hopper wrote: On Fri, Dec 4, 2009 at 1:06 AM, Erik Trimble erik.trim...@sun.com mailto:erik.trim...@sun.com wrote: Hey folks. I've looked around quite a bit, and I can't find something like this: I have a bunch of older systems which use Ultra320 SCA hot-swap connectors for their internal drives (e.g. v20z and similar). I'd love to be able to use modern flash SSDs with these systems, but I have yet to find someone who makes anything that would fit the bill. I need either: (a) an SSD with an Ultra160/320 parallel interface (I can always find an interface adapter, so I'm not particular about whether it's a 68-pin or SCA) or (b) a SAS or SATA to UltraSCSI adapter (preferably with a SCA interface).

Hi Erik, One of the less well known facts about SCSI is that all SCSI commands are sent in legacy 8-bit mode. And it takes multiple SCSI commands to make a SCSI drive do something useful! Translation - it's s-l-o-w. Since one of the big upsides of an SSD is I/O Ops/Sec, get ready for a disappointment if you use a SCSI-based connection. Sure, after the drive has received the necessary commands it can move data blocks reasonably quickly, but the limit, in terms of an SSD, will *definitely* be the rate at which commands can be received by the drive. This (8-bit command) design decision was responsible for SCSI's long-lasting upward compatibility, but it also turned into the achilles heel that ultimately doomed SCSI to extinction.

Really? I hadn't realized this was a problem with SSDs and SCSI. Exactly how does this impact SSDs with a SAS connection, since that's still using the SCSI command set, just over a serial link rather than a parallel one? Or am I missing something, and is SAS considerably different (protocol-wise) from traditional parallel SCSI?

The key difference here is that the SCSI protocol commands and other data are sent to/from the SAS drive at the same (high) speed over the serial link. And another point - SAS and SATA are full duplex. This is why parallel SCSI had to die: you simply can't send enough SCSI commands over a SCSI parallel link to keep a modern, mechanical, 7,200 RPM drive busy, let alone an SSD. Think about that for a second - the mechanical drive is probably going to max out at 400 to 500 I/O Ops/Sec. By way of contrast, today's SSDs will do 33,000+ I/O Ops/Sec (for a workload that is I/O-Ops/Sec intensive). And tomorrow's SSDs are going to be much faster.

Given the enormous amount of legacy hardware out there that has parallel SCSI drive bays (I mean, SAS is really only 2-3 years old in terms of server hardware adoption), I am just flabbergasted that there's no parallel-SCSI SSD around.

Now you know why. There is simply no way to get around the parallel SCSI standard spec and the fact that *all* SCSI commands are sent 8 bits wide at the very slow (original) 8-bit rate. And if you do find a converter, you're going to be bitterly disappointed with the results, even with a low-end SSD. PS: I think if someone does build/sell a parallel SCSI to SATA SSD converter board, they are going to get a very high percentage of them returned from angry customers telling them they get better performance from a USB key than they do with this piece of *...@$!$# converter. And it's going to be very difficult to explain to the customer why the converter board is so slow - and working perfectly.
I understand exactly the problem you're solving - and you're not alone (I've got 4 V20Zs in a CoLo in Menlo Park CA that I maintain for Genunix.Org, and I visit them less than once a year at great expense - both in terms of time and dollars)! IMHO any kind of a hardware hack job and a couple of 1.8-inch or 2.5-inch SATA SSDs, combined with an OpenSolaris plugin SATA controller, would be a better solution. But I don't like this solution any more than I'm sure you do! Please contact me off-list if you have any ideas, and please let us know (on the list) how this works out for you. Regards, -- Al Hopper Logical Approach Inc, Plano, TX a...@logical-approach.com mailto:a...@logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/

I've got stacks of both v20z/v40z hardware, plus a whole raft of IBM xSeries (/not/ System X) machines which really, really, really need an SSD for improved I/O. At this point, I'd kill for a parallel SCSI to SATA adapter thingy; something that would plug into a SCA connector on one side, and a SATA port on the other. I could at least hack together a mounting bracket for something like that... -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA

-- Al Hopper Logical Approach Inc, Plano, TX
[zfs-discuss] Interrupt sharing
On Tue, February 23, 2010 17:20, Chris Ridd wrote: To see what interrupts are being shared: # echo ::interrupts -d | mdb -k Running intrstat might also be interesting.

This just caught my attention. I'm not the original poster, but this sparked something I've been wanting to know about for a while. I know from startup log messages that I've got several interrupts being shared. I've been wondering how serious this is. I don't have any particular performance problems, but then again my cpu and motherboard are from 2006 and I'd like to extend their service life, so using them more efficiently isn't a bad idea. Plus it's all a learning experience :-).

While I see the relevance to diagnosing performance problems, for my case, is there likely to be anything I can do about interrupt assignments? Or is this something that, if it's a problem, is an unfixable problem (short of changing hardware)? I think there's BIOS stuff to shuffle interrupt assignments some, but do changes at that level survive kernel startup, or get overwritten? If there's nothing I can do, then no real point in my investigating further. However, if there's possibly something to do, what kinds of things should I look for as problems in the mdb or intrstat data?

mdb reports:
# echo ::interrupts -d | mdb -k
IRQ  Vect IPL Bus Trg Type  CPU Share APIC/INT# Driver Name(s)
1    0x42 5   ISA Edg Fixed 1   1     0x0/0x1   i8042#0
4    0xb0 12  ISA Edg Fixed 1   1     0x0/0x4   asy#0
6    0x44 5   ISA Edg Fixed 0   1     0x0/0x6   fdc#0
9    0x81 9   PCI Lvl Fixed 1   1     0x0/0x9   acpi_wrapper_isr
12   0x43 5   ISA Edg Fixed 0   1     0x0/0xc   i8042#0
14   0x45 5   ISA Edg Fixed 1   1     0x0/0xe   ata#0
16   0x83 9   PCI Lvl Fixed 0   1     0x0/0x10  pci-ide#1
19   0x86 9   PCI Lvl Fixed 1   1     0x0/0x13  hci1394#0
20   0x41 5   PCI Lvl Fixed 0   2     0x0/0x14  nv_sata#1, nv_sata#0
21   0x84 9   PCI Lvl Fixed 1   2     0x0/0x15  nv_sata#2, ehci#0
22   0x85 9   PCI Lvl Fixed 0   2     0x0/0x16  audiohd#0, ohci#0
23   0x60 6   PCI Lvl Fixed 1   2     0x0/0x17  nge#1, nge#0
24   0x82 7   PCI Edg MSI   0   1     -         pcie_pci#0
25   0x40 5   PCI Edg MSI   1   1     -         mpt#0
26   0x30 4   PCI Edg MSI   1   1     -         pcie_pci#5
27   0x87 7   PCI Edg MSI   0   1     -         pcie_pci#4
160  0xa0 0       Edg IPI   all 0     -         poke_cpu
192  0xc0 13      Edg IPI   all 1     -         xc_serv
208  0xd0 14      Edg IPI   all 1     -         kcpc_hw_overflow_intr
209  0xd1 14      Edg IPI   all 1     -         cbe_fire
210  0xd3 14      Edg IPI   all 1     -         cbe_fire
240  0xe0 15      Edg IPI   all 1     -         xc_serv
241  0xe1 15      Edg IPI   all 1     -         apic_error_intr

-- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
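To see whether the shared lines (20-23 above) are actually busy enough to matter, intrstat(1M) samples per-device interrupt activity and the CPU time spent servicing it; run it alongside the mdb output, for example:

# sample interrupt activity every 5 seconds, 10 samples
intrstat 5 10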
[zfs-discuss] snv_133 - high cpu - update
Hi all, I still didn't find the problem, but it seems to be related to interrupt sharing between the onboard network cards (Broadcom) and the Intel 10GbE PCI-e card. Running a simple iperf from a Linux box to my ZFS box, if I use bnx2 or bnx3 I get performance over 100 mbs, but if I use bnx0, bnx1 or the 10GbE I hardly pass 10 mbs and the load on the box goes from 1.0 to infinite... :( If I fix this issue I will let you guys know. Bruno smime.p7s Description: S/MIME Cryptographic Signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Import zpool from FreeBSD in OpenSolaris
On Tue, 23 Feb 2010, patrik wrote: I want to import my zpools from FreeBSD 8.0 in OpenSolaris 2009.06.

  secure        UNAVAIL  insufficient replicas
    raidz1      UNAVAIL  insufficient replicas
      c8t1d0p0  ONLINE
      c8t2d0s2  ONLINE
      c8t3d0s8  UNAVAIL  corrupted data
      c8t4d0s8  UNAVAIL  corrupted data

The zpool import command is finding the wrong slices to import. Notice this says s8 for the last two slice numbers. It's also using s2 for the second disk, but s2 is typically the entire disk anyway, so it's able to see all the data. On the other hand, s8 is typically just the first cylinder -- just enough information for zfs to see the front labels, but not enough to see all the data. zfs should probably be better about what it chooses for a disk to import. A coworker suggests looking at the zdb -l output, fdisk output, and prtvtoc output to see if there's a common theme that will lead to a solution independent of modifying zfs import code. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
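The data suggested above could be gathered with something along these lines (device names taken from the pool listing; exact output will vary by release):

zdb -l /dev/dsk/c8t3d0s8 | head -40    # labels visible on the slice zpool import picked
zdb -l /dev/dsk/c8t3d0p0 | head -40    # versus the whole-disk p0 device
fdisk -W - /dev/rdsk/c8t3d0p0          # dump the fdisk partition table
prtvtoc /dev/rdsk/c8t3d0s2             # dump the Solaris VTOC / slice layout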
Re: [zfs-discuss] snv_133 - high cpu
On 02/23/10 15:20, Chris Ridd wrote: On 23 Feb 2010, at 19:53, Bruno Sousa wrote: The system becames really slow during the data copy using network, but i copy data between 2 pools of the box i don't notice that issue, so probably i may be hitting some sort of interrupt conflit in the network cards...This system is configured with alot of interfaces, being : 4 internal broadcom gigabit 1 PCIe 4x, Intel Dual Pro gigabit 1 PCIe 4x, Intel 10gbE card 2 PCIe 8x Sun non-raid HBA With all of this, is there any way to check if there is indeed an interrupt conflit or some other type of conflit that leads this high load? I also noticed some messages about acpi..can this acpi also affect the performance of the system? To see what interrupts are being shared: # echo ::interrupts -d | mdb -k Running intrstat might also be interesting. Cheers, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Is this using mpt driver? There's an issue w/ the fix for 6863127 that causes performance problems on larger memory machines, filed as 6908360. - Bart -- Bart Smaalders Solaris Kernel Performance ba...@cyber.eng.sun.com http://blogs.sun.com/barts You will contribute more with mercurial than with thunderbird. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_133 - high cpu
Hi Bart, yep, I got Bruno to run a kernel profile lockstat... it does look like the mpt issue.. andy

---------------------------------------------------------------------
Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
 2861   7%  55% 0.00     4889 cpu[1]+5         do_splx

  nsec   count
  1024       1
  2048     213
  4096    1136
  8192    1237
 16384     256
 32768      15
 65536       1
131072       2

Stack: xc_common, xc_call, hat_tlb_inval, x86pte_inval, hat_pte_unmap, hat_unload_callback, hat_unload, segkmem_free_vn, segkmem_free, vmem_xfree, vmem_free, kfreea, i_ddi_mem_free, rootnex_teardown_copybuf, rootnex_coredma_unbindhdl, rootnex_dma_unbindhdl, ddi_dma_unbind_handle, scsi_dmafree_attr, scsi_free_cache_pkt
---------------------------------------------------------------------
Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
 1857   5%  59% 0.00     1907 cpu[0]+5         getctgsz

  nsec   count
  1024     206
  2048    1203
  4096     387
  8192      24
 16384      25
 32768      12

Stack: kfreea, i_ddi_mem_free, rootnex_teardown_copybuf, rootnex_coredma_unbindhdl, rootnex_dma_unbindhdl, ddi_dma_unbind_handle, scsi_dmafree_attr, scsi_free_cache_pkt, scsi_destroy_pkt, vhci_scsi_destroy_pkt, scsi_destroy_pkt, sd_destroypkt_for_buf, sd_return_command, sdintr, scsi_hba_pkt_comp, vhci_intr, scsi_hba_pkt_comp, mpt_doneq_empty, mpt_intr
---------------------------------------------------------------------

On 24 Feb 2010, at 10:31, Bart Smaalders wrote: Is this using mpt driver? There's an issue w/ the fix for 6863127 that causes performance problems on larger memory machines, filed as 6908360. - Bart -- Bart Smaalders Solaris Kernel Performance ba...@cyber.eng.sun.com http://blogs.sun.com/barts You will contribute more with mercurial than with thunderbird. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
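For anyone wanting to collect the same data on their own box, a kernel profile of this sort is typically gathered with an invocation along the lines of:

# profile kernel time for 30 seconds and show the 20 hottest call stacks
lockstat -kIW -D 20 sleep 30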
Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris
On Wed, Feb 24, 2010 at 2:02 PM, Troy Campbell troy.campb...@fedex.comwrote: http://www.oracle.com/technology/community/sun-oracle-community-continuity.html Half way down it says: Will Oracle support Java and OpenSolaris User Groups, as Sun has? Yes, Oracle will indeed enthusiastically support the Java User Groups, OpenSolaris User Groups, and other Sun-related user group communities (including the Java Champions), just as Oracle actively supports hundreds of product-oriented user groups today. We will be reaching out to these groups soon. Supporting doesn't necessarily mean continuing the Open Source projects! -marc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] disks in zpool gone at the same time
Hi, Yesterday I got all my disks in two zpools disconnected. They are not real disks - LUNs from a StorageTek 2530 array. What could that be - a failing LSI card or the mpt driver in 2009.06? After a reboot I got four disks in FAILED state - zpool clear fixed things with resilvering.

Here is how it started (/var/adm/messages):

Feb 23 12:39:03 nexus scsi: [ID 365881 kern.info] /p...@0,0/pci10de,5...@e/pci1000,3...@0 (mpt0):
Feb 23 12:39:03 nexus   Log info 0x3114 received for target 2.
...
Feb 23 12:39:06 nexus scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,5...@e/pci1000,3...@0/s...@2,9 (sd58):
Feb 23 12:39:06 nexus   Command failed to complete...Device is gone
Feb 23 12:39:06 nexus scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,5...@e/pci1000,3...@0/s...@2,7 (sd56):
Feb 23 12:39:06 nexus   Command failed to complete...Device is gone
...

# fmdump -eV -t 23Feb10 12:00
TIME                           CLASS
Feb 23 2010 12:39:03.856423656 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x37f293365c100801
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci10de,5...@e/pci1000,3...@0/s...@2,2
        (end detector)
        driver-assessment = retry
        op-code = 0x2a
        cdb = 0x2a 0x0 0x22 0x14 0x51 0xab 0x0 0x0 0x4 0x0
        pkt-reason = 0x4
        pkt-state = 0x0
        pkt-stats = 0x8
        __ttl = 0x1
        __tod = 0x4b8412b7 0x330bfce8
...
Feb 23 2010 12:39:06.840406312 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x37fdb0f5dc000401
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x26b9a51f199f72bf
                vdev = 0xaf3ea54be8e5909c
        (end detector)
        pool = pool2530-2
        pool_guid = 0x26b9a51f199f72bf
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xaf3ea54be8e5909c
        vdev_type = disk
        vdev_path = /dev/dsk/c8t2d9s0
        vdev_devid = id1,s...@n600a0b800036a8ba07484adc4dec/a
        parent_guid = 0xff4853b09cdcb0bb
        parent_type = raidz
        zio_err = 6
        zio_offset = 0x42000
        zio_size = 0x2000
        __ttl = 0x1
        __tod = 0x4b8412ba 0x32179528

The system configuration: SunFire X4200, LSI_1068E - 1.18.00.00, StorageTek 2530 with 1TB WD3 SATA drives, not JBOD:

Port Name  Chip Vendor/Type/Rev      MPT Rev  Firmware Rev  IOC
1.   mpt0  LSI Logic SAS1068E B1     105      0112          0

Current active firmware version is 0112 (1.18.00)
Firmware image's version is MPTFW-01.18.00.00-IT
LSI Logic x86 BIOS image's version is MPTBIOS-6.12.00.00 (2006.10.31)
FCode image's version is MPT SAS FCode Version 1.00.40 (2006.03.02)

# uname -a
SunOS nexus 5.11 snv_111b i86pc i386 i86pc Solaris

Does anyone have any problem with these LSI cards in NexentaStor? Thanks Evgueni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote: I have a directory here containing a million files and it has not caused any strain for zfs at all although it can cause considerable stress on applications. The biggest problem is always the apps. For example, ls by default sorts, and if you're using a locale with a non-trivial collation (e.g., any UTF-8 locales) then the sort gets very expensive. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_133 - high cpu
Yes i'm using the mtp driver . In total this system has 3 HBA's, 1 internal (Dell perc), and 2 Sun non-raid HBA's. I'm also using multipath, but if i disable multipath i have pretty much the same results.. Bruno On 24-2-2010 19:42, Andy Bowers wrote: Hi Bart, yep, I got Bruno to run a kernel profile lockstat... it does look like the mpt issue.. andy :--- Count indv cuml rcnt nsec Hottest CPU+PILCaller 2861 7% 55% 0.00 4889 cpu[1]+5 do_splx nsec -- Time Distribution -- count Stack 1024 | 1 xc_common 2048 |@@ 213 xc_call 4096 |@@@1136 hat_tlb_inval 8192 | 1237 x86pte_inval 16384 |@@ 256 hat_pte_unmap 32768 | 15hat_unload_callback 65536 | 1 hat_unload 131072 | 2 segkmem_free_vn segkmem_free vmem_xfree vmem_free kfreea i_ddi_mem_free rootnex_teardown_copybuf rootnex_coredma_unbindhdl rootnex_dma_unbindhdl ddi_dma_unbind_handle scsi_dmafree_attr scsi_free_cache_pkt --- Count indv cuml rcnt nsec Hottest CPU+PILCaller 1857 5% 59% 0.00 1907 cpu[0]+5 getctgsz nsec -- Time Distribution -- count Stack 1024 |@@@206 kfreea 2048 |@@@1203 i_ddi_mem_free 4096 |@@ 387 rootnex_teardown_copybuf 8192 | 24rootnex_coredma_unbindhdl 16384 | 25rootnex_dma_unbindhdl 32768 | 12ddi_dma_unbind_handle scsi_dmafree_attr scsi_free_cache_pkt scsi_destroy_pkt vhci_scsi_destroy_pkt scsi_destroy_pkt sd_destroypkt_for_buf sd_return_command sdintr scsi_hba_pkt_comp vhci_intr scsi_hba_pkt_comp mpt_doneq_empty mpt_intr On 24 Feb 2010, at 10:31, Bart Smaalders wrote: Is this using mpt driver? There's an issue w/ the fix for 6863127 that causes performance problems on larger memory machines, filed as 6908360. - Bart -- Bart Smaalders Solaris Kernel Performance ba...@cyber.eng.sun.com http://blogs.sun.com/barts You will contribute more with mercurial than with thunderbird. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss smime.p7s Description: S/MIME Cryptographic Signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On 24-Feb-10, at 3:38 PM, Tomas Ögren wrote: On 24 February, 2010 - Bob Friesenhahn sent me these 1,0K bytes: On Wed, 24 Feb 2010, Steve wrote: The overhead I was thinking of was more in the pointer structures... (bearing in mind this is a 128 bit file system), I would guess that memory requirements would be HUGE for all these files...otherwise arc is gonna struggle, and paging system is going mental? It is not reasonable to assume that zfs has to retain everything in memory. I have a directory here containing a million files and it has not caused any strain for zfs at all although it can cause considerable stress on applications. 400 million tiny files is quite a lot and I would hate to use anything but mirrors with so many tiny files. Another thought is: am I using the correct storage model for this data? You're not the only one wondering that. :) --Toby /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24 at 14:09, Bob Friesenhahn wrote: 400 million tiny files is quite a lot and I would hate to use anything but mirrors with so many tiny files. And at 400 million, you're in the realm of needing mirrors of SSDs, with their fast random reads. Even at the 500+ IOPS of good SAS drives, you're looking at a TON of spindles to move through 400 million 1KB files quickly. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
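A back-of-the-envelope number makes the point; assuming one random read per file and ~500 IOPS per spindle (both assumptions, obviously):

# hours to touch 400 million files once, on 12 spindles at 500 IOPS each
echo 'scale=1; 400000000 / (500 * 12) / 3600' | bc    # roughly 18.5 hours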
Re: [zfs-discuss] ZFS with hundreds of millions of files
It was never the intention that this storage system should be used in this way... and I am now clearing a lot of this stuff out. These are very static files, and they are rarely used... so traversing them in any way is a rare occasion. What has happened is that reading and writing large files which are unrelated to these ones has become appallingly slow... So I was wondering if just the presence of so many files was in some way putting a lot of stress on the pool, even if these files aren't used very often... -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 11:09 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Wed, 24 Feb 2010, Steve wrote: The overhead I was thinking of was more in the pointer structures... (bearing in mind this is a 128 bit file system), I would guess that memory requirements would be HUGE for all these files...otherwise arc is gonna struggle, and paging system is going mental? It is not reasonable to assume that zfs has to retain everything in memory. At the same time 400M files in a single directory should lead to a lot of contention on locks associated with look-ups. Spreading files between a reasonable number of dirs could mitigate this. Regards, Andrey I have a directory here containing a million files and it has not caused any strain for zfs at all although it can cause considerable stress on applications. 400 million tiny files is quite a lot and I would hate to use anything but mirrors with so many tiny files. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
That's not the issue here, as they are spread out in a folder structure based on an integer split into hex blocks... 00/00/00/01 etc... but the number of pointers involved with all these files, and directories (which are files), must have an impact on a system with limited RAM? There is 4GB RAM in this system btw... -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, 24 Feb 2010, Steve wrote: What has happened is that reading and writing large files which are unrelated to these ones has become appallingly slow... So I was wondering if just the presence of so many files was in some way putting a lot of stress on the pool, even if these files aren't used very often... If these millions of files were built up over a long period of time while large files were also being created, then they may contribute to an increased level of filesystem fragmentation. With millions of such tiny files, it makes sense to put the small files in a separate zfs filesystem which has its recordsize property set to a size not much larger than the size of the files. This should reduce waste, resulting in reduced potential for fragmentation in the rest of the pool. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
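A sketch of that suggestion, with the dataset name and recordsize purely illustrative (and note the follow-ups later in the thread questioning whether the recordsize tuning is actually needed):

# give the tiny files their own dataset; recordsize is an upper bound, not a fixed block size
zfs create -o recordsize=4k tank/smallfiles
zfs get recordsize tank/smallfiles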
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Feb 24, 2010, at 1:17 PM, Steve wrote: It was never the intention that this storage system should be used in this way... And I am now clearning alot of this stuff out.. This is very static files, and is rarely used... so traversing it any way is a rare occasion... What has happened is that reading and writing large files which are unrelated to these ones has become appallingly slow... So I was wondering if just the presence of so many files was in some way putting alot of stress on the pool, even if these files arent used very often... There are (recent) improvements to the allocator that should help this scenario. What release are you running? -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance http://nexenta-atlanta.eventbrite.com (March 16-18, 2010) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Thu, Feb 25, 2010 at 12:26 AM, Steve steve.jack...@norman.com wrote: thats not the issue here, as they are spread out in a folder structure based on an integer split into hex blocks... 00/00/00/01 etc... but the number of pointers involved with all these files, and directories (which are files) must have an impact on a system with limited RAM? There is 4GB RAM in this system btw... If any significant portion of these 400M files is accessed on a regular basis, you'd be (1) stressing ARC to the limits (2) stressing spindles so that any concurrent sequential I/O would suffer. Small files are always an issue, try moving them off HDDs onto a mirrored SSDs, not necessarily most expensive ones. 400M 2K files is just 400GB, within the reach of a few SSDs. Regards, Andrey -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to know the recordsize of a file
I would like to know the blocksize of a particular file. I know the blocksize for a particular file is decided at creation time, as a function of the write size done and the recordsize property of the dataset. How can I access that information? Some zdb magic? -- Jesus Cea Avion, j...@jcea.es - http://www.jcea.es/ - jabber / xmpp:j...@jabber.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Thu, Feb 25, 2010 at 12:34 AM, Andrey Kuzmin andrey.v.kuz...@gmail.com wrote: On Thu, Feb 25, 2010 at 12:26 AM, Steve steve.jack...@norman.com wrote: thats not the issue here, as they are spread out in a folder structure based on an integer split into hex blocks... 00/00/00/01 etc... but the number of pointers involved with all these files, and directories (which are files) must have an impact on a system with limited RAM? There is 4GB RAM in this system btw... If any significant portion of these 400M files is accessed on a regular basis, you'd be (1) stressing ARC to the limits (2) stressing spindles so that any concurrent sequential I/O would suffer. Small files are always an issue, try moving them off HDDs onto a mirrored SSDs, not necessarily most expensive ones. 400M 2K files is 1K meant, fat fingers. Regards, Andrey just 400GB, within the reach of a few SSDs. Regards, Andrey -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
Well, I am deleting most of them anyway... they are not needed anymore... Will deletion solve the problem... or do I need to do something more to defrag the file system? I have understood that defrag will not be available until this block rewrite thing is done? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On 24/02/2010 21:31, Bob Friesenhahn wrote: On Wed, 24 Feb 2010, Steve wrote: What has happened is that reading and writing large files which are unrelated to these ones has become appallingly slow... So I was wondering if just the presence of so many files was in some way putting a lot of stress on the pool, even if these files aren't used very often... If these millions of files were built up over a long period of time while large files are also being created, then they may contribute to an increased level of filesystem fragmentation. With millions of such tiny files, it makes sense to put the small files in a separate zfs filesystem which has its recordsize property set to a size not much larger than the size of the files. This should reduce waste, resulting in reduced potential for fragmentation in the rest of the pool. Except for one bug (which has been fixed) that had to do with consuming lots of CPU to find a free block, I don't think you are right. You don't have to set recordsize to a smaller value for small files. The recordsize property sets the maximum allowed recordsize, but other than that it is selected automatically when a file is created, so for small files their recordsize will be small even if it is set to the default 128KB. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, 24 Feb 2010, Robert Milkowski wrote: except for one bug which has been fixed which had to do with consuming lots of CPU to find a free block I don't think you are right. You don't have to set recordsize to smaller value for small files. Recordsize property sets a maximum allowed recordsize but other than that it is being selected automatically when file is being created so for small files their recordsize will be small even if it is set to default 128KB. Didn't we hear on this list just recently that zfs no longer writes short tail blocks (i.e. zfs behavior has been changed)? Did I misunderstand? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, February 24, 2010 14:39, Nicolas Williams wrote: On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote: I have a directory here containing a million files and it has not caused any strain for zfs at all although it can cause considerable stress on applications. The biggest problem is always the apps. For example, ls by default sorts, and if you're using a locale with a non-trivial collation (e.g., any UTF-8 locales) then the sort gets very expensive. Which is bad enough if you say ls. And there's no option to say don't sort that I know of, either. If you say ls * it's in some ways worse, in that the * is expanded by the shell, and most of the filenames don't make it to ls at all. (ls abc* is more likely, but with a million files that can still easily overflow the argument limit.) There really ought to be an option to make ls not sort, at least. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 03:31:51PM -0600, Bob Friesenhahn wrote: With millions of such tiny files, it makes sense to put the small files in a separate zfs filesystem which has its recordsize property set to a size not much larger than the size of the files. This should reduce waste, resulting in reduced potential for fragmentation in the rest of the pool. Tuning the dataset recordsize down does not help in this case. The files are already small, so their recordsize is already small. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
I manage several systems with near a billion objects (largest is currently 800M) on each, and also discovered slowness over time. This is on X4540 systems with average file sizes being ~5KB. In our environment the following readily sped up performance significantly:

- Do not use RAID-Z. Use as many mirrored disks as you can. This has been discussed before.
- Nest data in directories as deeply as possible. Although ZFS doesn't really care, client utilities certainly do, and operations in large directories cause needless overhead.
- Make sure you do not use the filesystem past 80% capacity. As available space decreases, the overhead of allocating new files increases.
- Do not keep snapshots around forever (although we keep them around for months now without issue).
- Use ZFS compression (gzip worked best for us). Record size did not make a significant change with our data, so we left it at 128K.
- You need lots of memory for a big ARC. Do not use the system for anything other than serving files. Don't put pressure on system memory, and let the ARC do its thing.
- We now use the F20 cache cards as a huge L2ARC in each server, which makes a large impact once the cache is primed. Caching all that file metadata really helps. I found using SSDs over iSCSI as an L2ARC was just as effective, so you don't necessarily need expensive PCIe flash.

After these tweaks the systems are blazingly quick, able to do many 1000's of ops/second and deliver full GigE line speed even on fully random workloads. Your mileage may vary, but for now I am finally very happy with the systems (and rightfully so given their performance potential!) -- Adam Serediuk ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
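The pool- and dataset-level parts of that list translate to roughly the following; device names are placeholders and the gzip level is left at the default:

# mirrors rather than raidz
zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0
# a large L2ARC on SSD
zpool add tank cache c3t0d0 c3t1d0
# compression on the dataset holding the small files
zfs set compression=gzip tank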
Re: [zfs-discuss] ZFS with hundreds of millions of files
Also you will need to ensure that atime is turned off for the ZFS volume(s) in question as well as any client-side NFS mount settings. There are a number of client-side NFS tuning parameters that can be done if you are using NFS clients with this system. Attributes caches, atime, diratime, etc all make a large different when dealing with very large data sets. On 24-Feb-10, at 2:05 PM, Adam Serediuk wrote: I manage several systems with near a billion objects (largest is currently 800M) on each and also discovered slowness over time. This is on X4540 systems with average file sizes being ~5KB. In our environment the following readily sped up performance significantly: Do not use RAID-Z. Use as many mirrored disks as you can. This has been discussed before. Nest data in directories as deeply as possible. Although ZFS doesn't really care, client utilities certainly do and operations in large directories causes needless overhead. Make sure you do not use the filesystem past 80% capacity. As available space decreases so does overhead for allocating new files. Do not keep snapshots around forever, (although we keep them around for months now without issue.) Use ZFS compression (gzip worked best for us.) Record size did not make a significant change with our data, so we left it at 128K. You need lots of memory for a big ARC. Do not use the system for anything else other than serving files. Don't put pressure on system memory and let ARC do its thing. We now use the F20 cache cards as a huge L2ARC in each server which makes a large impact. one the cache is primed. Caching all that file metadata really helps I found using SSD's over iSCSI as a L2ARC was just as effective, so you don't necessarily need expensive PCIe flash. After these tweaks the systems are blazingly quick, able to do many 1000's of ops/second and deliver full gigE line speed even on fully random workloads. Your mileage may very but for now I am very happy with the systems finally (and rightfully so given their performance potential!) -- Adam Serediuk ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to know the recordsize of a file
On 24/02/2010 21:35, Jesus Cea wrote: I would like to know the blocksize of a particular file. I know the blocksize for a particular file is decided at creation time, as a function of the write size done and the recordsize property of the dataset. How can I access that information? Some zdb magic?

mi...@r600:~# ls -li /bin/bash
1713998 -r-xr-xr-x 1 root bin 799040 2009-10-30 00:41 /bin/bash
mi...@r600:~# zdb -v rpool/ROOT/osol-916 1713998
Dataset rpool/ROOT/osol-916 [ZPL], ID 302, cr_txg 6206087, 24.2G, 1053147 objects

    Object  lvl  iblk  dblk  dsize  lsize   %full  type
   1713998    2   16K  128K   898K   896K  100.00  ZFS plain file

mi...@r600:~# -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On 24/02/2010 21:54, Bob Friesenhahn wrote: On Wed, 24 Feb 2010, Robert Milkowski wrote: except for one bug which has been fixed which had to do with consuming lots of CPU to find a free block I don't think you are right. You don't have to set recordsize to smaller value for small files. Recordsize property sets a maximum allowed recordsize but other than that it is being selected automatically when file is being created so for small files their recordsize will be small even if it is set to default 128KB. Didn't we hear on this list just recently that zfs no longer writes short tail blocks (i.e. zfs behavior has been changed)? Did I misunderstand? yep, but the last block will be the same as all the other block. So if you have a small file where zfs used 1kb block the tail block will be 1kb as well and not 128kb even if the default recordsize is 128k. I think that only if you would make the recordsize property considerably smaller than average blocksize you could in theory save some space on tail blocks. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On 24/02/2010 21:40, Steve wrote: Well I am deleting most of them anyway... they are not needed anymore... Will deletion solve the problem... or do I need to do something more to defrag the file system? I have understood that defrag will not be available until this block rewrite thing is done? First, the question is whether fragmentation is actually your problem. Then, if it is, and you have lots of (contiguous) free disk space, then by copying a file you want to defrag you will defragment that file. You might get better results if you do it after you remove most of the files, even if you have plenty of disk space available now. It is not as elegant as having a background defrag, though. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On 2/24/2010 4:11 PM, Stefan Walk wrote: On 24 Feb 2010, at 22:57, David Dyer-Bennet wrote: On Wed, February 24, 2010 14:39, Nicolas Williams wrote: On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote: I have a directory here containing a million files and it has not caused any strain for zfs at all although it can cause considerable stress on applications. The biggest problem is always the apps. For example, ls by default sorts, and if you're using a locale with a non-trivial collation (e.g., any UTF-8 locales) then the sort gets very expensive. Which is bad enough if you say ls. And there's no option to say don't sort that I know of, either. If you say ls * it's in some ways worse, in that the * is expanded by the shell, and most of the filenames don't make it to ls at all. (ls abc* is more likely, but with a million files that can still easily overlow the argument limit.) There really ought to be an option to make ls not sort, at least. ls -f? The man page sure doesn't sound like it: -f Forces each argument to be interpreted as a directory and list the name found in each slot. This option turns off -l, -t, -s, -S, and -r, and turns on -a. The order is the order in which entries appear in the directory. And it doesn't look like it plays well with others. Playing with it, it does look like it works for listing all the files in one directory without sorting, though, so yes, that's a useful solution to the problem. (Yikes what an awful description in the man pages!) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
David Dyer-Bennet d...@dd-b.net writes: Which is bad enough if you say ls. And there's no option to say don't sort that I know of, either. /bin/ls -f /bin/ls makes sure an alias for ls to ls -F or similar doesn't cause extra work. you can also write \ls -f to ignore a potential alias. without an argument, GNU ls and SunOS ls behave the same. if you write ls -f * you'll only get output for directories in SunOS, while GNU ls will list all files. (ls -f has been there since SunOS 4.0 at least) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Adding a zfs mirror drive to rpool - new drive formats to one cylinder less
Thanks David. Re. the starting cylinder, it was more that on c8t0d0 the partition started at cylinder 0 while on c8t1d0 it started at cylinder 1, i.e.:

c8t0d0:
Partition  Status  Type      Start  End    Length  %
=========  ======  ========  =====  =====  ======  ===
1          Active  Solaris2  0      30401  30402   100

c8t1d0:
Partition  Status  Type      Start  End    Length  %
=========  ======  ========  =====  =====  ======  ===
1          Active  Solaris2  1      30400  30400   100

All good now and yes, I did run installgrub... yet to test booting from c8t1d0 (with c8t0d0 removed). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
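For the record, a common way to make the second disk's label match the first when mirroring rpool is something along these lines (a sketch using the device names above; the slice numbers are assumptions, so double-check before running):

    # Copy the VTOC from the first disk to the second
    prtvtoc /dev/rdsk/c8t0d0s2 | fmthard -s - /dev/rdsk/c8t1d0s2
    # Install the boot blocks so the machine can boot from the new mirror half
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c8t1d0s0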
[zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.
Ok, I know NOW that I should have used zfs rename...but just for the record, and to give you folks a laugh, this is the mistake I made... I created a zfs file system, cloud/movies and shared it. I then filled it with movies and music. I then decided to rename it, so I used rename in the GNOME file manager to change the folder name to media...ie cloud/media. MISTAKE I then noticed the zfs share was pointing to /cloud/movies which no longer exists. So, I removed cloud/movies with zfs destroy --- BIGGER MISTAKE So, now I am restoring my media from backup. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
Could also try /usr/gnu/bin/ls -U. I'm working on improving the memory profile of /bin/ls (as it gets somewhat excessive when dealing with large directories), which as a side effect should also help with this. Currently /bin/ls allocates a structure for every file, and doesn't output anything until it's finished reading the entire directory, so even if it skips the sort, that's generally a fraction of the total time spent, and doesn't save you much. The structure also contains some duplicative data (I'm guessing that at the time -- a _long_ time ago -- the decision was made to precompute some stuff versus testing the mode bits -- probably premature optimization, even then). I'm trying to make it so that it does what's necessary and avoids duplicate work (so for example, if the output doesn't need to be sorted it can display the entries as they are read -- though the situations where this can be done are not as common as you might think). Hopefully once I'm done (I've been tied down with some other stuff), I'll be able to post some results. On Wed, Feb 24, 2010 at 7:29 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: David Dyer-Bennet d...@dd-b.net writes: Which is bad enough if you say ls. And there's no option to say don't sort that I know of, either. /bin/ls -f /bin/ls makes sure an alias for ls to ls -F or similar doesn't cause extra work. you can also write \ls -f to ignore a potential alias. without an argument, GNU ls and SunOS ls behave the same. if you write ls -f * you'll only get output for directories in SunOS, while GNU ls will list all files. (ls -f has been there since SunOS 4.0 at least) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
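To summarise the unsorted-listing options from this thread in one place (assuming a hypothetical large directory /tank/bigdir):

    /bin/ls -f /tank/bigdir            # SunOS ls: directory order, no sorting
    \ls -f /tank/bigdir                # backslash sidesteps any shell alias
    /usr/gnu/bin/ls -U /tank/bigdir    # GNU ls: -U also skips the sort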
Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.
On Thursday 25 of February 2010 03:46, tomwaters wrote: Ok, I know NOW that I should have used zfs rename...but just for the record, and to give you folks a laugh, this is the mistake I made... I created a zfs file system, cloud/movies and shared it. I then filled it with movies and music. I then decided to rename it, so I used rename in the Gnome to change the folder name to media...ie cloud/media. MISTAKE I then noticed the zfs share was pointing to /cloud/movies which no longer exists. So, I removed cloud/movies with zfs destroy --- BIGGER MISTAKE So, now I am restoring my media from backup. just for the record: did you want to rename the mountpoint or the dataset? -- Real programmers don't document. If it was hard to write, it should be hard to understand. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_133 - high cpu
On 02/24/10 12:57, Bruno Sousa wrote: Yes, I'm using the mpt driver. In total this system has 3 HBA's, 1 internal (Dell perc), and 2 Sun non-raid HBA's. I'm also using multipath, but if I disable multipath I have pretty much the same results.. Bruno From what I understand, the fix is expected very soon; your performance is getting killed by the over-aggressive use of bounce buffers... - Bart -- Bart Smaalders Solaris Kernel Performance ba...@cyber.eng.sun.com http://blogs.sun.com/barts You will contribute more with mercurial than with thunderbird. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.
well, both I guess... I thought the dataset name was based upon the file system...so I was assuming that if I renamed the zfs filesystem (with zfs rename) it would also rename the dataset... ie... #zfs create tank/fred gives...

NAME        USED  AVAIL  REFER  MOUNTPOINT
tank/fred  26.0K  4.81G  10.0K  /tank/fred

and then #zfs rename tank/fred tank/mary will give...

NAME        USED  AVAIL  REFER  MOUNTPOINT
tank/mary  26.0K  4.81G  10.0K  /tank/mary

It's all rather confusing to a newbie like me I must admit...so please post examples so I can understand it. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
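To spell out the difference with examples (a sketch using the dataset names from this thread): renaming the dataset moves its default mountpoint along with it, while changing only the mountpoint leaves the dataset name alone.

    # Rename the dataset; its default mountpoint follows the new name
    zfs rename cloud/movies cloud/media
    # ...or keep the dataset name and only change where it is mounted
    zfs set mountpoint=/cloud/media cloud/movies
    # Renaming the directory in a file manager does neither: the dataset and
    # its share keep pointing at the old path.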
Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.
On 2/24/2010 7:46 PM, tomwaters wrote: Ok, I know NOW that I should have used zfs rename...but just for the record, and to give you folks a laugh, this is the mistake I made... I created a zfs file system, cloud/movies and shared it. I then filled it with movies and music. I then decided to rename it, so I used rename in the Gnome to change the folder name to media...ie cloud/media. MISTAKE I then noticed the zfs share was pointing to /cloud/movies which no longer exists. So, I removed cloud/movies with zfs destroy--- BIGGER MISTAKE So, now I am restoring my media from backup. And THAT is one of the reasons why backups are important even with redundant safe fileservers. (Software bugs, physical destruction, and user error!) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Moving dataset to another zpool but same mount?
I need to move a dataset to another zpool, but I need to keep the same mount point. I have a zpool called files and datasets called mail, home and VM:

files
files/home
files/mail
files/VM

I want to move files/VM to another zpool, but keep the same mount point. What would be the right steps to create the new zpool, move the data and mount it in the same spot? Also, to test this: if I mount to the same spot, will it hide the old dataset or destroy it? I want to test the migration before destroying the old dataset. Or can I? Thanks, Greg -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
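One way to do this with send/receive (a sketch, assuming a hypothetical new pool called files2 and that the VMs can be quiesced during the copy):

    zfs snapshot files/VM@migrate
    zfs send files/VM@migrate | zfs receive files2/VM
    # Two datasets cannot be mounted on the same path at once, so the old one
    # is moved aside (not hidden or destroyed) before the new one takes over:
    zfs set mountpoint=/files/VM-old files/VM
    zfs set mountpoint=/files/VM files2/VM
    # Once the copy has been verified, the old dataset can be destroyed:
    # zfs destroy -r files/VM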
[zfs-discuss] raidz2 array FAULTED with only 1 drive down
I recently had a hard drive die on my 6 drive raidz2 array (4+2). Unfortunately, once the dead drive stopped registering, Linux rearranged the drive names such that zfs couldn't figure out which drives went where. After much hair pulling, I gave up on Linux and went to OpenSolaris, but it wouldn't recognize my SATA controller, so I'm now using FreeNAS (BSD based). I now have zfs recognizing the remaining 5 drives, but the pool still registers as FAULTED:

freenas:/var/log# zpool import
  pool: tank
    id: 7069795341511677483
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        tank        UNAVAIL  insufficient replicas
          raidz2    UNAVAIL  corrupted data
            ad6     ONLINE
            ad8     ONLINE
            ad10    ONLINE
            sda     UNAVAIL  cannot open
            ad12    ONLINE
            ad14    ONLINE

Isn't raidz2 supposed to be able to continue operation with up to 2 drives missing? What's going on here? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
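For what it's worth, the "last accessed by another system" part of that status usually only needs a forced import; whether the "corrupted data" complaint on the raidz2 vdev clears is another matter (a sketch using the pool name from the output above):

    # Force the import despite the foreign hostid
    zpool import -f tank
    # If device names have moved around, let ZFS rescan a directory and match
    # the disks by their on-disk labels rather than their old paths
    zpool import -d /dev -f tank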
Re: [zfs-discuss] snv_133 - high cpu
Hi, Until it's fixed, should the 132 build be used instead of 133? Bruno On 25-2-2010 3:22, Bart Smaalders wrote: On 02/24/10 12:57, Bruno Sousa wrote: Yes, I'm using the mpt driver. In total this system has 3 HBA's, 1 internal (Dell perc), and 2 Sun non-raid HBA's. I'm also using multipath, but if I disable multipath I have pretty much the same results.. Bruno From what I understand, the fix is expected very soon; your performance is getting killed by the over-aggressive use of bounce buffers... - Bart ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss