Hi,

On 26/12/10 06:14, Thomas Mueller wrote:
>> Fragmentation at file system (OS) level will have much more impact on large 
>> files, caching (at OS level) will be less effective too,
> read-ahead capabilities will be less effective too and finally IO load
> will increase inevitably.
>
> Could you please provide links to back this up? Or provide a test case
> that shows multiple small files are significantly faster than one
> large file (given the same file operations)?
There is a lot of information about file system efforts to mitigate the
impact of fragmentation, and about smart caching strategies. All of this is
strongly OS- and file-system-dependent, but there are many common factors.
This document has some interesting metrics: 
http://www.linuxsymposium.org/2006/filesys_frag_slides.pdf

In regard to H2 FileStore usage and fragmentation, "file scattering" may be
the most relevant type of fragmentation.
For the general subject, start with:
http://en.wikipedia.org/wiki/File_system_fragmentation and
http://en.wikipedia.org/wiki/File_sequence ; these pages have many references
to technical documents and papers.

In regard to caching and read-ahead (or prefetch), note that this happens at
the hardware, file system and OS levels. Read-ahead (at the hardware level) is
a way to reduce IO operations, mainly for sequential access patterns.
Large LOBs are ideal candidates for streaming (a sequential-access IO
pattern), in contrast to indexed table rows, which produce mainly
random-access IO patterns.
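To make that contrast concrete, here is a small hypothetical Python sketch
(not H2 code; the 4 KB page size and the offset lists are invented for the
example). It only counts how many file-position jumps (seeks) each access
pattern forces; a streaming LOB read causes none, while index-style page
lookups force one on almost every read, which is exactly what defeats
read-ahead:

```python
import os
import tempfile

PAGE = 4096  # assumed page size, for illustration only

def count_seeks(path, offsets):
    """Read one page at each offset; count non-contiguous jumps (seeks)."""
    seeks = 0
    with open(path, "rb") as f:
        pos = 0
        for off in offsets:
            if off != pos:       # the next page is not where we already are
                f.seek(off)
                seeks += 1
            f.read(PAGE)
            pos = off + PAGE
    return seeks

# Build a small 256-page test file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (PAGE * 256))
    path = tmp.name

sequential = [i * PAGE for i in range(256)]            # LOB streaming
scattered = [((i * 97) % 256) * PAGE for i in range(256)]  # index lookups

seq_seeks = count_seeks(path, sequential)   # 0: read-ahead friendly
rnd_seeks = count_seeks(path, scattered)    # 255: almost every read seeks
print(seq_seeks, rnd_seeks)
os.unlink(path)
```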

>> It is easy to measure the degradation of the performance of a database as 
>> the data volume is significantly increased.
> Do you think the database will be faster if you split it into multiple
> files? I don't think so. 

Other DBMSs use many directories with many files as storage, but that isn't
the point.
If you just do the same thing with many files, it will probably be worse.

I'm talking about separating storage into only two files with different
FileStore implementations: one for general database metadata, table rows and
indexes (suited to a random-access IO pattern), kept as small as possible;
and another file containing only LOB objects, organized to facilitate
sequential-access IO patterns (over each internal LOB) and to reduce the
size of the first file.

Motivation: in many applications we see that, for any table with LOB columns,
only one in four queries (or fewer) retrieves the LOB columns. I know this
can't be generalized, but it doesn't seem unreasonable to think that big LOBs
are accessed less frequently than the values of common data types in the same
table.
Moreover, it is a common database design practice to use LOB-specific tables
like (id, lob_value), putting all LOB values in one table while the master
tables hold only FK references.
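As an illustration of that design practice, here is a minimal sketch using
Python's built-in sqlite3 module (only to keep the example self-contained and
runnable; the schema pattern is database-independent). All table and column
names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- all LOB values live in one dedicated table
    CREATE TABLE lobs (id INTEGER PRIMARY KEY, lob_value BLOB);
    -- the master table holds only an FK reference, not the blob itself
    CREATE TABLE document (
        id INTEGER PRIMARY KEY,
        title TEXT,
        lob_id INTEGER REFERENCES lobs(id)
    );
""")
con.execute("INSERT INTO lobs VALUES (1, ?)", (b"\0" * 1_000_000,))
con.execute("INSERT INTO document VALUES (1, 'report', 1)")

# The common case (3 of 4 queries or more): metadata only, no LOB IO at all.
title, = con.execute("SELECT title FROM document WHERE id = 1").fetchone()

# The rare case: follow the FK and fetch the LOB.
blob, = con.execute(
    "SELECT l.lob_value FROM document d JOIN lobs l ON l.id = d.lob_id"
    " WHERE d.id = 1").fetchone()
print(title, len(blob))
```

The point is that queries touching only `document` never drag the megabyte
blobs through the cache.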

About performance and IO load: I don't have a thorough benchmark, but I have
an application in production at two sites under similar conditions except for
database size, and we see up to a 20% difference in query performance.
I will try to use "sar", "iostat", etc. to analyze whether this is due to IO
wait time or to a change in cache hit rate, but it can be tricky to isolate
H2's load from the rest of the IO load.

> But if you want, H2 supports the "split file
> system". You can easily find out. This also has the advantage that all
> files are about the same size (less files)

Remember that we are discussing whether or not it is a good idea, when
LobsInDatabase is used, to store big LOB values together with the rest of the
database objects in a single file.
Anyway, the "split file system" can be useful for performance comparisons
between a single large file and many fixed-size files.
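For such a comparison, as far as I recall from the H2 documentation the split
file system is enabled with a file name prefix in the JDBC URL; please verify
the exact syntax against the current docs before relying on it:

```
jdbc:h2:split:~/test      # database stored as fixed-size parts (1 GB default)
jdbc:h2:split:30:~/test   # parts of 2^30 bytes, if the size prefix is given
```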

regards,
Dario
