Digy,
You are *always* going to incur the CPU cost, because in order to get
to an offset in the ^uncompressed^ file, you have to decompress the 100MB
file back into the 1GB file. That cost is incurred no matter what.
Now, what you do with that 1GB and what the libraries do and how
they expose it is implementation-dependent.
If the libraries just stream the uncompressed data back to you in a
forward-only, read-only way, then it's up to the library consumer to take
care of that in some way. This usually means keeping it in memory (in which
case, you have to worry about excessive memory consumption) or writing it to
disk (in which case, you incur I/O costs).
If the libraries are offering you a read/write stream which is
seekable, then it becomes completely implementation-dependent. It might
very well use temp files, which incur I/O costs, or place data in memory (or
memory mapped files), which incurs a memory (and possibly I/O) cost. I
don't have details about the specific libraries, but that's generally what
you are looking at in terms of strategies for providing this kind of
functionality.
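You can see that cost directly with any deflate-based stream. A minimal
sketch in Python (the mechanics are identical for SharpZipLib's or
DotNetZip's deflate streams; a 1MB payload stands in for the 1GB file):
seeking to a deep offset forces the library to decompress and throw away
everything before it, because there is no index into the compressed stream.

```python
import gzip
import io

# A 1 MB stand-in for the 1 GB uncompressed file.
payload = bytes(range(256)) * 4096
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(payload)

# Reading 4 KB at a deep offset: GzipFile has no index into the deflate
# stream, so seek() decompresses and discards every byte before the target.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    gz.seek(900_000)          # CPU cost proportional to the offset
    block = gz.read(4096)     # only now do we get the 4 KB we wanted

assert block == payload[900_000:904_096]
```

Scale the offset up to 900MB and the same proportional decompress-and-discard
cost is exactly what Digy's question runs into.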
- Nick
-----Original Message-----
From: Digy [mailto:[email protected]]
Sent: Friday, February 26, 2010 6:19 PM
To: [email protected]
Subject: RE: Lucene index file container
Hi Nick,
Suppose that I have a 1GB file with a compressed size of 100MB, and I want to
read just a 4KB block from offset 900MB.
Considering SharpZipLib, DotNetZip, or similar libraries, would the cost be
more CPU and less I/O? Or more CPU and more I/O?
DIGY
-----Original Message-----
From: Nicholas Paldino [.NET/C# MVP] [mailto:[email protected]]
Sent: Saturday, February 27, 2010 12:49 AM
To: [email protected]
Subject: RE: Lucene index file container
Andrew,
If you are going to unpack the index into a temp directory and then
repack the file when you are done, then you are going to incur a cost
on startup and on teardown of the process which is mainly I/O- and CPU-bound
(I/O because you have to read the zip file from disk and then write the
unpacked files from the zip to another location, and CPU-bound because you
are translating the byte stream while unpacking).
That approach doesn't do anything but add that additional I/O and
CPU overhead on startup. The "big win" for compressing the file is to save
space on disk, or whatever medium the byte stream is being persisted to.
If all you do is unzip the file in the beginning and zip it up at
the end, then from your app's point of view, you do a lot of extra work for
nothing. Unless you have real disk space issues, I'd recommend against
this.
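For reference, the unpack-at-startup / repack-at-teardown lifecycle being
discussed looks roughly like this. A Python sketch for brevity (the DotNetZip
version has the same shape); `unpack_index` and `repack_index` are
illustrative names, not library APIs:

```python
import os
import tempfile
import zipfile

def unpack_index(zip_path: str) -> str:
    """Startup cost: read the archive, inflate it, and write every file
    back out to disk (I/O + CPU over the whole index)."""
    workdir = tempfile.mkdtemp(prefix="index-")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(workdir)
    return workdir

def repack_index(workdir: str, zip_path: str) -> None:
    """Teardown cost: re-read, deflate, and rewrite the whole archive,
    even if only a few files changed."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(workdir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, workdir))
```

Note that both ends of the lifecycle touch every byte of the index, which is
exactly the overhead that makes this approach a poor fit unless disk space is
the real constraint.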
Now, if you were to create a new Directory class which uses a
GZipStream or DeflateStream as a façade over the FileStream which writes to
disk, then you are reaping the benefits of compressing the file. The index
will always be compressed on disk and you are realizing the gains.
The cost of doing this, however, is more CPU time (to perform the
translation), with a gain of fewer I/O operations to disk (since fewer
bytes are being written to disk).
Depending on how much activity you have reading from and writing to
the index, it might or might not make an impact. You have to measure that
yourself given your application's use of the index.
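The façade idea in concrete terms: the caller reads and writes plain bytes,
while the wrapper deflates on the way to disk and inflates on the way back,
so the file is always stored compressed. A Python sketch using gzip (in .NET
this would be a GZipStream or DeflateStream wrapped around the FileStream
inside a custom Directory implementation):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "segment.gz")

# Write path: 10,000 logical bytes in, deflated bytes on disk.
with gzip.open(path, "wb") as f:
    f.write(b"term data " * 1000)

# Highly repetitive data compresses well, so far fewer bytes hit the disk.
assert os.path.getsize(path) < 10_000

# Read path: the wrapper inflates transparently for the caller.
with gzip.open(path, "rb") as f:
    assert f.read() == b"term data " * 1000
```

The trade shown here is the one described above: CPU spent on deflate/inflate
in exchange for fewer bytes crossing the disk boundary.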
If file size is ^truly^ a concern, have you considered just setting
the compression flag on the *folder* that contains the index files? Any
files that are added/updated/deleted will automatically be compressed if the
flag is set on the folder, so doing it in code is busywork when the OS
automatically provides it for you (assuming you are on Windows, which is a
safe bet given you are running .NET, but not absolute, of course).
- Nick
-----Original Message-----
From: Andrew Schuler [mailto:[email protected]]
Sent: Friday, February 26, 2010 4:48 PM
To: [email protected]
Subject: Re: Lucene index file container
Thanks for both answers on this.
I considered a zip file but was unsure of the associated overhead of
unpacking the file. Does anyone have experience running an index directly out
of a zip file?
Are my worries unfounded? I was just trying to leverage the experience of
the group, but otherwise I'll just have to run some tests on my own.
On Fri, Feb 26, 2010 at 11:55 AM, Nicholas Petersen
<[email protected]>wrote:
> <Can anyone recommend a way to package the index into say some type of
> file container>
>
> If I understand correctly, it sounds like you're asking for a text-book
> implementation of an archiver, like a zip file. If so, DotNetZip is a
> solid product, very easy to use, very fast. Highly recommended.
> http://www.codeplex.com/DotNetZip.
>
> Best,
> Nick
>
>
>
> On Fri, Feb 26, 2010 at 2:47 PM, Andrew Schuler <[email protected]
> >wrote:
>
> > Yes, that is do-able. I was just thinking it would be cleaner to wrap
> > the indexes (there will be more than one) in some sort of file
> > container. One of the things I'd like to do is be able to allow the
> > user to download pre-packaged indexes and load them into the app. This
> > would be easier with a file than a directory of files, no?
> >
> >
> > On Fri, Feb 26, 2010 at 11:41 AM, Hans Merkl <[email protected]> wrote:
> >
> > > Can't you add all the files in the index directory to the installer
> > > package?
> > > This should be pretty straightforward.
> > >
> > > -----Original Message-----
> > > From: Andrew Schuler [mailto:[email protected]]
> > > Sent: Friday, February 26, 2010 12:16 PM
> > > To: [email protected]
> > > Subject: Lucene index file container
> > >
> > > The discussion about encrypting an index has me thinking about a
> > > current use I have for Lucene.net. I'm building a small app with a
> > > static index distributed with it. Can anyone recommend a way to
> > > package the index into say some type of file container for inclusion
> > > in an installer package?
> > >
> > > -andy
> > >
> > >
> > >
> >
>