Re: [sqlite] BLOB data performance?

2007-11-15 Thread Dimitris Servis
Hi all,

I don't want to drive anyone away from SQLite (don't think that I can anyway
:-)) but a good solution for storing large amounts of data is HDF5.

HTH

-- ds



-- 
What is the difference between mechanical engineers and civil engineers?
Mechanical engineers build weapons; civil engineers build targets.


Re: [sqlite] BLOB data performance?

2007-11-14 Thread Roger Binns

Asif Lodhi wrote:
> Interestingly, Microsoft's SourceSafe (at least VS-6.0) apparently
> uses file system 

It basically uses a whole bunch of directories and uses a scheme very
similar to RCS to store the versioning content.

> while SVN uses Berkeley DB as I read once.

SVN initially only had Berkeley DB and it drove people nuts.  In
particular it used to keep getting wedged and required manual
administrator intervention to fix.  See this and the two following
questions:

http://subversion.tigris.org/faq.html#stuck-bdb-repos

SVN later added a filesystem-based backend that uses a directory to
store the deltas for each revision, and that is by far the most popular.

The moral of the tale is to make sure your backend database library
never needs human attention.  I always wondered why they didn't use SQLite.

Roger




Re: [sqlite] BLOB data performance?

2007-11-14 Thread Asif Lodhi
Hi,

On 11/15/07, Andreas Volz <[EMAIL PROTECTED]> wrote:
> On Tue, 13 Nov 2007 12:46:11 -0800 (PST), Ken wrote:
>
> > I think your blob file performance may greatly depend upon the file
> > system being used and the workload.
> >
> > I found this article:
> >
> > http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf
>
> Very interesting document. But I can't know or rely on what
> filesystem the user has. There should be another way to ensure good
> file access speed.

Interestingly, Microsoft's SourceSafe (at least VS-6.0) apparently
uses the file system, while SVN uses Berkeley DB, as I read once. Both
products serve a common purpose but use different storage mechanisms.
It boils down to what algorithm you are using for your particular
problem.

Just my two cents.
--
Asif




Re: [sqlite] BLOB data performance?

2007-11-14 Thread John Stanton

Andreas Volz wrote:
> On Tue, 13 Nov 2007 15:15:49 -0600, John Stanton wrote:
> 
> > You might find the method used by Squid to manage its cache would be
> > worth emulating.
> 
> I don't know how squid works. Could you explain it in simple steps?

I haven't looked at the code, but it builds a tree of directories so
that no directory is excessively loaded.  You can download the source
and take a look.

> > Using TransmitFile on Windows or sendfile on Unix to despatch the
> > file to the network is by far the most efficient way to pass on files
> > from a cache.  It avoids a number of levels of buffer shadowing.
> 
> Hm, I looked into the manpage. I get the webpage with curl directly
> into memory or into a file. I don't copy files to the hard disk later.
> I would assume curl does it in the most efficient way available.

Sendfile and TransmitFile basically post the file directly to the
network interface without going through multiple layers of intermediate
buffering, as happens when you use send().

I find using that interface in a web server results in very low
overhead and consequently a large traffic-handling ability.




Re: [sqlite] BLOB data performance?

2007-11-14 Thread Andreas Volz
On Tue, 13 Nov 2007 12:46:11 -0800 (PST), Ken wrote:

> I think your blob file performance may greatly depend upon the file
> system being used and the workload.
> 
> I found this article:
> 
> http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf

Very interesting document. But I can't know or rely on what filesystem
the user has. There should be another way to ensure good file access
speed.

regards
Andreas




Re: [sqlite] BLOB data performance?

2007-11-14 Thread Andreas Volz
On Tue, 13 Nov 2007 15:15:49 -0600, John Stanton wrote:

> You might find the method used by Squid to manage its cache would be 
> worth emulating.

I don't know how squid works. Could you explain it in simple steps?
 
> Using TransmitFile on Windows or sendfile on Unix to despatch the
> file to the network is by far the most efficient way to pass on files
> from a cache.  It avoids a number of levels of buffer shadowing.

Hm, I looked into the manpage. I get the webpage with curl directly
into memory or into a file. I don't copy files to the hard disk later.
I would assume curl does it in the most efficient way available.

regards
Andreas




Re: [sqlite] BLOB data performance?

2007-11-13 Thread John Stanton
You might find the method used by Squid to manage its cache would be 
worth emulating.


Using TransmitFile on Windows or sendfile on Unix to despatch the file 
to the network is by far the most efficient way to pass on files from a 
cache.  It avoids a number of levels of buffer shadowing.
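
A minimal sketch of the Unix path, assuming Linux's sendfile(2) and an
already-connected socket (error handling trimmed):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a cached file straight to a socket.  The kernel moves file
 * pages directly to the network stack, skipping the user-space
 * read()/send() buffer round trip. */
static int send_cached_file(int sock_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, fd, &offset,
                                (size_t)(st.st_size - offset));
        if (sent <= 0) { close(fd); return -1; }
    }
    close(fd);
    return 0;
}

TransmitFile on Windows serves the same purpose with a different
signature.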


Andreas Volz wrote:
> On Tue, 13 Nov 2007 07:18:19 -0600, John Stanton wrote:
> 
> > In a cache situation I would expect that keeping the binary data in
> > files would be preferable because you can use far more efficient
> > mechanisms for loading them into your cache and in particular in
> > transmitting them downstream.  Your DB only needs to store a pathname.
> > 
> > Just be wary of directory size, and do not put them all in the one
> > directory.
> 
> I noticed that problem in my current situation. I don't know the file
> number and size limit in Linux or Windows, but I'm sure there is a
> limit.
> 
> My main problem is to find a good algorithm to name the cached files
> and split them into directories. My current idea is:
> 
> 1) Put the URL into the DB
> 2) Use a hash function to create a unique name for the cache file
> 3) Insert the hash name into the same row as the URL
> 
> The problem with many files in a directory:
> 
> 4) Use e.g. 'modulo 10' on the URL hash value to get one of ten
> directory names in which to store the file.
> 
> But this has the drawback of a static number of cache directories.
> The algorithm doesn't scale as the number of files grows.
> 
> Do you think this is a good way? Or do you have another idea?
> 
> regards
> Andreas




Re: [sqlite] BLOB data performance?

2007-11-13 Thread Ken
I think your blob file performance may greatly depend upon the file
system being used and the workload.

I found this article:

http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf


Andreas Volz <[EMAIL PROTECTED]> wrote:
> On Tue, 13 Nov 2007 07:18:19 -0600, John Stanton wrote:
> 
> > In a cache situation I would expect that keeping the binary data in
> > files would be preferable because you can use far more efficient
> > mechanisms for loading them into your cache and in particular in
> > transmitting them downstream.  Your DB only needs to store a pathname.
> > 
> > Just be wary of directory size, and do not put them all in the one
> > directory.
> 
> I noticed that problem in my current situation. I don't know the file
> number and size limit in Linux or Windows, but I'm sure there is a
> limit.
> 
> My main problem is to find a good algorithm to name the cached files
> and split them into directories. My current idea is:
> 
> 1) Put the URL into the DB
> 2) Use a hash function to create a unique name for the cache file
> 3) Insert the hash name into the same row as the URL
> 
> The problem with many files in a directory:
> 
> 4) Use e.g. 'modulo 10' on the URL hash value to get one of ten
> directory names in which to store the file.
> 
> But this has the drawback of a static number of cache directories.
> The algorithm doesn't scale as the number of files grows.
> 
> Do you think this is a good way? Or do you have another idea?
> 
> regards
> Andreas





RE: [sqlite] BLOB data performance?

2007-11-13 Thread Griggs, Donald
Regarding:
> The problem with many files in a directory:

Another possible algorithm:
Simply name the blob using the ROWID, e.g. Image0783.png

In ancient days, a DOS directory of several hundred files might bog
things down, but you can put thousands into a modern O.S.'s directory if
necessary.  So even though you can't predict the exact size of the
customer's data, you can probably get the maximum order of magnitude --
and that's all you need.

For example, you could create a two-level directory structure of 100
directories as:
/0/0/
/0/1/
/0/2/
etc., up to /9/9

Then, a blob with ROWID of "783" could be stored in directory /8/3/.
(Alternately, if you want to allow for future three-level, four-level,
etc. schemes, you could store it in /3/8/, with the first directory
chosen for the LEAST significant digit, and get better dispersion.)
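
A minimal sketch of the path computation under that
least-significant-digit layout (a hypothetical helper, not from any of
the products discussed):

#include <stdio.h>

/* Build a cache path like "/3/8/Image0783.png" for ROWID 783.  The
 * least significant digits pick the directories, so consecutive
 * ROWIDs disperse across all 100 buckets. */
static void blob_path(long long rowid, char *buf, size_t len)
{
    int d0 = (int)(rowid % 10);         /* least significant digit */
    int d1 = (int)((rowid / 10) % 10);  /* next digit */
    snprintf(buf, len, "/%d/%d/Image%04lld.png", d0, d1, rowid);
}

For example, blob_path(783, ...) yields "/3/8/Image0783.png".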






Re: [sqlite] BLOB data performance?

2007-11-13 Thread Andreas Volz
On Tue, 13 Nov 2007 07:18:19 -0600, John Stanton wrote:

> In a cache situation I would expect that keeping the binary data in 
> files would be preferable because you can use far more efficient 
> mechanisms for loading them into your cache and in particular in 
> transmitting them downstream.  Your DB only needs to store a pathname.
> 
> Just be wary of directory size, and do not put them all in the one 
> directory.

I noticed that problem in my current situation. I don't know the file
number and size limit in Linux or Windows, but I'm sure there is a
limit.

My main problem is to find a good algorithm to name the cached files
and split them into directories. My current idea is:

1) Put the URL into the DB
2) Use a hash function to create a unique name for the cache file
3) Insert the hash name into the same row as the URL

The problem with many files in a directory:

4) Use e.g. 'modulo 10' on the URL hash value to get one of ten
directory names in which to store the file.

But this has the drawback of a static number of cache directories.
The algorithm doesn't scale as the number of files grows.

Do you think this is a good way? Or do you have another idea?
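
For concreteness, a minimal sketch of steps 1-4, assuming FNV-1a as
the hash function and a made-up table name (just an illustration):

#include <stdint.h>
#include <stdio.h>

/* FNV-1a: a simple, well-known string hash (not cryptographic). */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Steps 2 and 4: derive the cache file name and its bucket directory
 * from the URL hash.  Steps 1 and 3 could use a schema like:
 *   CREATE TABLE cache(url TEXT PRIMARY KEY, file TEXT); */
static void cache_path(const char *url, char *buf, size_t len)
{
    uint64_t h = fnv1a(url);
    snprintf(buf, len, "cache/%u/%016llx",
             (unsigned)(h % 10),      /* one of ten directories */
             (unsigned long long)h);  /* unique-ish file name   */
}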

regards
Andreas




Re: [sqlite] BLOB data performance?

2007-11-13 Thread John Stanton
In a cache situation I would expect that keeping the binary data in 
files would be preferable because you can use far more efficient 
mechanisms for loading them into your cache and in particular in 
transmitting them downstream.  Your DB only needs to store a pathname.


Just be wary of directory size, and do not put them all in the one 
directory.


Andreas Volz wrote:
> Hello,
> 
> I'll tell you my current situation. I implemented a web cache function
> for images and other data in my application. In the past I saved the
> data on the hard disk with a special name and had a text file with the
> relation "cache file name <-> url". But I didn't like it. Now I'd like
> to evaluate SQLite as a solution.
> 
> So my question is about the binary data. Is it better to insert the
> images and other media data (e.g. videos with < 10 MB of size) into the
> DB or only a "pointer" to a file lying around on my hard disk? I would
> estimate a maximum DB size of several hundred MB.
> 
> How good/bad is reading/writing this data into a BLOB compared to
> writing it as a file beside the DB and writing only a small name into
> the DB? What is the difference between the two approaches regarding
> memory and CPU usage?
> 
> BTW: My current use case writes data slowly, but reads data fast.
> Reading BLOBs must be as fast as reading from the hard disk.
> 
> regards
> Andreas




Re: [sqlite] BLOB data performance?

2007-11-12 Thread Andreas Volz
On Mon, 12 Nov 2007 23:41:59 +0000, [EMAIL PROTECTED] wrote:

> Andreas Volz <[EMAIL PROTECTED]> wrote:
> > Hello,
> > 
> > I'll tell you my current situation. I implemented a web cache
> > function for images and other data in my application. In the past I
> > saved the data on the hard disk with a special name and had a text
> > file with the relation "cache file name <-> url". But I didn't like
> > it. Now I'd like to evaluate SQLite as a solution.
> > 
> > So my question is about the binary data. Is it better to insert the
> > images and other media data (e.g. videos with < 10 MB of size) into
> > the DB or only a "pointer" to a file lying around on my hard disk?
> > I would estimate a maximum DB size of several hundred MB.
> > 
> > How good/bad is reading/writing this data into a BLOB compared to
> > writing it as a file beside the DB and writing only a small name
> > into the DB? What is the difference between the two approaches
> > regarding memory and CPU usage?
> > 
> > BTW: My current use case writes data slowly, but reads data fast.
> > Reading BLOBs must be as fast as reading from the hard disk.
> > 
> 
> In my studies, BLOB I/O is faster than disk I/O for BLOBs of about
> 50KiB or less on Linux.  Disk I/O is faster for larger BLOBs.  I
> have received reports that the transition threshold is about 14KiB
> on win32.  In my experiments, BLOB I/O is about 10% slower than
> direct disk I/O for multi-megabyte blobs.

Less than 50 KiB would be fine, at least for images and HTML files.
I'll give it a try and do my own experiments. Thanks for sharing your
experience so far.

regards
Andreas




Re: [sqlite] BLOB data performance?

2007-11-12 Thread drh
Andreas Volz <[EMAIL PROTECTED]> wrote:
> Hello,
> 
> I'll tell you my current situation. I implemented a web cache function
> for images and other data in my application. In the past I saved the
> data on the hard disk with a special name and had a text file with the
> relation "cache file name <-> url". But I didn't like it. Now I'd like
> to evaluate SQLite as a solution.
> 
> So my question is about the binary data. Is it better to insert the
> images and other media data (e.g. videos with < 10 MB of size) into the
> DB or only a "pointer" to a file lying around on my hard disk? I would
> estimate a maximum DB size of several hundred MB.
> 
> How good/bad is reading/writing this data into a BLOB compared to
> writing it as a file beside the DB and writing only a small name into
> the DB? What is the difference between the two approaches regarding
> memory and CPU usage?
> 
> BTW: My current use case writes data slowly, but reads data fast.
> Reading BLOBs must be as fast as reading from the hard disk.
> 

In my studies, BLOB I/O is faster than disk I/O for BLOBs of about
50KiB or less on Linux.  Disk I/O is faster for larger BLOBs.  I
have received reports that the transition threshold is about 14KiB
on win32.  In my experiments, BLOB I/O is about 10% slower than
direct disk I/O for multi-megabyte blobs.
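
For reference, the BLOB side of such a measurement can be done with
the incremental BLOB API in the C interface.  A rough sketch, assuming
a hypothetical table cache(id INTEGER PRIMARY KEY, data BLOB), with
error handling trimmed:

#include <sqlite3.h>
#include <stdlib.h>

/* Read the whole BLOB stored in cache.data at the given rowid using
 * sqlite3_blob_open/read, which avoids materializing the value
 * through a SELECT for large rows.  Caller frees the buffer. */
static unsigned char *read_blob(sqlite3 *db, sqlite3_int64 rowid, int *pSize)
{
    sqlite3_blob *blob;
    unsigned char *buf = NULL;
    if (sqlite3_blob_open(db, "main", "cache", "data",
                          rowid, 0 /* read-only */, &blob) != SQLITE_OK)
        return NULL;
    *pSize = sqlite3_blob_bytes(blob);
    buf = malloc((size_t)*pSize);
    if (buf && sqlite3_blob_read(blob, buf, *pSize, 0) != SQLITE_OK) {
        free(buf);
        buf = NULL;
    }
    sqlite3_blob_close(blob);
    return buf;
}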
--
D. Richard Hipp <[EMAIL PROTECTED]>

