Digy,
Yes, at least with DeflateStream and GZipStream, you would have to
close the stream, open it again, and then read forward to the appropriate
place in the uncompressed stream, incurring the overhead I made reference to
in previous emails.
None of the other libraries that I've seen offer a seekable zip
stream.
    I've also pointed out what is, in my opinion, wasted overhead in
zipping/unzipping indexes when space isn't a concern, given that the index
will be accessed on a read/write basis far more often than it will need to
be zipped for transport at any particular time.
    To that end, it isn't really worth it for Andrew to go through the
motions of writing a seekable zip stream implementation (by his own
admission, he doesn't have space concerns).
    What I haven't seen is any reaction to the obvious solution in the
event that space *is* an issue: just setting the compression flag on the
folder the indexes are kept in. That will compress the files on disk, and
all the APIs remain intact. It's an instant win.
- Nick
-----Original Message-----
From: Digy [mailto:[email protected]]
Sent: Saturday, February 27, 2010 3:15 AM
To: [email protected]
Subject: RE: Lucene index file container
Hi Nick,
> "If the libraries are offering you a read/write stream which is seekable,"
As I mentioned in my first mail, that is the problem. They are not seekable.
You have to decompress 900 MB of data just to reach offset 900 MB.
So, to avoid reading from 0 to the offset whenever a seek request is made,
you have to unzip the whole file at the beginning and use that in your app.
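[This constraint is easy to demonstrate. The sketch below uses Python's stdlib gzip module as a stand-in for GZipStream, since both wrap the same deflate format; the file name and sizes are made up for illustration. A "seek" in the compressed stream can only be emulated by decompressing everything from byte 0 up to the target offset:

```python
import gzip
import os
import tempfile

# Build a small compressed file: 1 MiB of patterned data stands in
# for the 1 GB index in Digy's example.
raw = bytes(range(256)) * 4096
path = os.path.join(tempfile.mkdtemp(), "index.gz")
with gzip.open(path, "wb") as f:
    f.write(raw)

# Deflate streams have no random access: gzip emulates seek() by
# decompressing every byte from the start up to the requested offset.
with gzip.open(path, "rb") as f:
    f.seek(900_000)        # forces ~900 KB of decompression work first
    block = f.read(4096)   # the 4 KB block we actually wanted

assert block == raw[900_000:904_096]
```

The seek succeeds, but the cost is proportional to the offset, not to the 4 KB actually read. -- inserted for illustration]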
But this is not what I understand from Andrew's statement
"Does anyone have experience running an index directly out of a zip file?"
DIGY
-----Original Message-----
From: Nicholas Paldino [.NET/C# MVP] [mailto:[email protected]]
Sent: Saturday, February 27, 2010 2:09 AM
To: [email protected]
Subject: RE: Lucene index file container
Digy,
You are *always* going to incur the CPU cost because in order to get
to an offset in the *uncompressed* file, you have to process the 100MB file
and translate it into the 1GB file. That cost is always incurred no matter
what.
Now, what you do with that 1GB and what the libraries do and how
they expose it is implementation-dependent.
If the libraries just stream the uncompressed data back to you in a
forward-only, read-only way, then it's up to the library consumer to take
care of that in some way. This usually means keeping it in memory (in which
case, you have to worry about excessive memory consumption) or writing it to
disk (in which case, you incur I/O costs).
If the libraries are offering you a read/write stream which is
seekable, then it becomes completely implementation-dependent. It might
very well use temp files, which incur I/O costs, or place data in memory (or
memory mapped files), which incurs a memory (and possibly I/O) cost. I
don't have details about the specific libraries, but that's generally what
you are looking at in terms of strategies for providing this kind of
functionality.
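[The temp-file strategy Nick describes can be sketched as follows, again in Python with the stdlib gzip module standing in for the .NET streams; `open_seekable` is an invented helper, not any library's actual API. You pay the full decompression cost (CPU plus disk I/O) once up front, and every seek afterwards is just ordinary file I/O:

```python
import gzip
import os
import shutil
import tempfile

def open_seekable(gz_path):
    """Decompress the whole .gz once into a temp file and return a truly
    seekable handle: up-front CPU and disk I/O, then free random access."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with gzip.open(gz_path, "rb") as src:
        shutil.copyfileobj(src, tmp)  # single forward pass of decompression
    tmp.seek(0)
    return tmp

# Demo: after the one-time unpack, seeks cost nothing extra.
raw = b"0123456789" * 100_000  # ~1 MB stand-in for the index
gz_path = os.path.join(tempfile.mkdtemp(), "index.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(raw)

handle = open_seekable(gz_path)
handle.seek(900_000)
block = handle.read(10)
handle.close()
assert block == raw[900_000:900_010]
```

The memory-buffer strategy is the same shape with io.BytesIO in place of the temp file, trading the I/O cost for memory consumption. -- inserted for illustration]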
- Nick
-----Original Message-----
From: Digy [mailto:[email protected]]
Sent: Friday, February 26, 2010 6:19 PM
To: [email protected]
Subject: RE: Lucene index file container
Hi Nick,
Suppose that I have a 1 GB file with a compressed size of 100 MB and I want
to read just a 4 KB block from offset 900 MB.
Considering SharpZipLib, DotNetZip, or similar libraries, would the cost be
more CPU and less I/O? Or more CPU and more I/O?
DIGY
-----Original Message-----
From: Nicholas Paldino [.NET/C# MVP] [mailto:[email protected]]
Sent: Saturday, February 27, 2010 12:49 AM
To: [email protected]
Subject: RE: Lucene index file container
Andrew,
    If you are going to unpack the index into a temp directory and then
repack the file when you are done, then you are going to incur a cost on
startup and on teardown of the process which is mainly I/O- and CPU-bound
(I/O because you have to read the zip file from disk and then write the
unpacked file from the zip to another location, and CPU-bound because you
are translating the byte stream while unpacking).
That approach doesn't do anything but add that additional I/O and
CPU overhead on startup. The "big win" for compressing the file is to save
space on disk, or whatever medium the byte stream is being persisted to.
If all you do is unzip the file in the beginning and zip it up at
the end, then from your app's point of view, you do a lot of extra work for
nothing. Unless you have real disk space issues, I'd recommend against
this.
Now, if you were to create a new Directory class which uses a
GZipStream or DeflateStream as a façade over the FileStream which writes to
disk, then you are reaping the benefits of compressing the file. The index
will always be compressed on disk and you are realizing the gains.
    The cost of doing this, however, is more CPU time (to perform the
translation) but with a gain of fewer I/O operations to disk (since fewer
bytes are being written to disk).
    Depending on how much activity you have reading from and writing to
the index, it might or might not make an impact. You have to measure that
yourself given your application's use of the index.
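[A hypothetical sketch of that façade idea, in Python with the stdlib gzip module standing in for GZipStream over a FileStream; `CompressedFile` and its methods are invented for illustration and are not Lucene's actual Directory API. Every write is compressed on its way to disk and every read decompressed on the way back, so the index is always compressed at rest:

```python
import gzip
import os
import tempfile

class CompressedFile:
    """Facade sketch: callers see plain bytes, disk sees compressed bytes."""

    def __init__(self, path):
        self.path = path

    def write_all(self, data):
        with gzip.open(self.path, "wb") as f:  # compress on the way down
            f.write(data)

    def read_all(self):
        with gzip.open(self.path, "rb") as f:  # decompress on the way up
            return f.read()

path = os.path.join(tempfile.mkdtemp(), "segment.gz")
data = b"term postings " * 50_000  # repetitive, like real index data
CompressedFile(path).write_all(data)

on_disk = os.path.getsize(path)                  # bytes actually written
assert on_disk < len(data)                       # fewer bytes hit the disk
assert CompressedFile(path).read_all() == data   # reads still round-trip
```

The extra CPU is the gzip translation on each call; the I/O saving is the gap between `on_disk` and `len(data)`. -- inserted for illustration]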
    If file size is *truly* a concern, have you considered just setting
the compression flag on the *folder* that contains the index files? Any
files that are added/updated/deleted will automatically be compressed if the
flag is set on the folder, so doing it in code is busywork when the OS
automatically provides it for you (assuming you are on Windows, which is a
safe bet given you are running .NET, but not absolute, of course).
- Nick
-----Original Message-----
From: Andrew Schuler [mailto:[email protected]]
Sent: Friday, February 26, 2010 4:48 PM
To: [email protected]
Subject: Re: Lucene index file container
Thanks for both answers on this.
I considered a zip file but was unsure of the overhead associated with
unpacking the file. Does anyone have experience running an index directly
out of a zip file?
Are my worries unfounded? I was just trying to leverage the experience of
the group, but otherwise I'll just have to run some tests on my own.
On Fri, Feb 26, 2010 at 11:55 AM, Nicholas Petersen
<[email protected]>wrote:
> <Can anyone recommend a way to package the index into say some type of
> file container>
>
> If I understand correctly, it sounds like you're asking for a text-book
> implementation of an archiver, like a zip file. If so, DotNetZip is a
> solid product, very easy to use, very fast. Highly recommended.
> http://www.codeplex.com/DotNetZip.
>
> Best,
> Nick
>
>
>
> On Fri, Feb 26, 2010 at 2:47 PM, Andrew Schuler <[email protected]>
> wrote:
>
> > Yes, that is do-able. I was just thinking it would be cleaner to wrap
> > the indexes (there will be more than one) in some sort of file
> > container. One of the things I'd like to do is be able to allow the
> > user to download pre-packaged indexes and load them into the app. This
> > would be easier with a file than a directory of files, no?
> >
> >
> > On Fri, Feb 26, 2010 at 11:41 AM, Hans Merkl <[email protected]> wrote:
> >
> > > Can't you add all the files in the index directory to the installer
> > > package?
> > > This should be pretty straightforward.
> > >
> > > -----Original Message-----
> > > From: Andrew Schuler [mailto:[email protected]]
> > > Sent: Friday, February 26, 2010 12:16 PM
> > > To: [email protected]
> > > Subject: Lucene index file container
> > >
> > > The discussion about encrypting an index has me thinking about a
> > > current use I have for Lucene.net. I'm building a small app with a
> > > static index distributed with it. Can anyone recommend a way to
> > > package the index into say some type of file container for inclusion
> > > in an installer package?
> > >
> > > -andy
> > >
> > >
> > >
> >
>
