Re: stop words and index size

2005-01-14 Thread Doug Cutting
David Spencer wrote:
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.

The unstopped version is indeed bigger and slower to build, but it's 
only slower to search when folks search on stop words.  One approach to 
minimizing stopwords in searches (used by, e.g. Nutch & Google) is to 
index all stop words but remove them from queries unless they're (a) in 
a phrase or (b) explicitly required with a "+".  (It might be nice if 
Lucene included a query parser that had this feature.)
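
The sketch below is only an illustration of that query-side idea (it is not
Nutch's or any Lucene query parser's actual implementation): it strips bare
stop words from a whitespace-split query string but keeps anything inside a
quoted phrase or prefixed with "+".  The stop list and the crude tokenization
are placeholders.

import java.util.HashSet;
import java.util.Set;

public class QueryStopFilter {
    // Illustrative stop list only; a real one would be much longer.
    private static final Set STOP = new HashSet();
    static {
        STOP.add("the"); STOP.add("a"); STOP.add("to"); STOP.add("of");
    }

    // Drop bare stop words, but keep "+required" terms and quoted phrases.
    public static String filter(String query) {
        StringBuffer out = new StringBuffer();
        boolean inPhrase = false;
        String[] tokens = query.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            boolean opensPhrase = t.startsWith("\"");
            boolean closesPhrase = t.endsWith("\"") && t.length() > 1;
            boolean keep = inPhrase || opensPhrase        // inside a phrase
                || t.startsWith("+")                      // explicitly required
                || !STOP.contains(t.toLowerCase());       // not a stop word
            if (opensPhrase) inPhrase = true;
            if (closesPhrase) inPhrase = false;
            if (keep) {
                if (out.length() > 0) out.append(' ');
                out.append(t);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("the lucene index size"));   // lucene index size
        System.out.println(filter("+the beatles"));            // +the beatles
        System.out.println(filter("\"to be or not to be\""));  // phrase kept intact
    }
}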

Nutch also optimizes phrase searches involving a few very common stop 
words (e.g., "the", "a", "to") by indexing these as bigrams and 
converting phrases involving them to bigram phrases.  So, if someone 
searches for "to be or not to be" then this turns into a search for 
"to-be be or not-to to-be" which is considerably faster since it 
involves rarer terms.  But the more words you bigram the bigger the 
index gets and the slower updates get, so you probably can't afford to 
do this for your full stop list.  (It might be nice if Lucene included 
support for this technique too!)
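
For flavor only, here is a much-simplified, string-level sketch of the bigram
idea: pair each very common word with the word that follows it, so phrase
searches hit rarer composite terms instead of huge postings lists.  Nutch's
real implementation works inside the analyzer and also pairs common words
with the preceding word (which is how Doug's example arrives at "not-to");
the common-word list below is only a placeholder.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BigramSketch {
    // Placeholder list of "very common" words.
    private static final Set COMMON =
        new HashSet(Arrays.asList(new String[] { "the", "a", "to" }));

    // Replace each common word with a common-next bigram.
    public static String bigramPhrase(String phrase) {
        String[] words = phrase.split("\\s+");
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (COMMON.contains(w) && i + 1 < words.length) {
                w = w + "-" + words[i + 1];    // e.g. "to" + "be" -> "to-be"
            }
            if (out.length() > 0) out.append(' ');
            out.append(w);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: to-be be or not to-be be
        System.out.println(bigramPhrase("to be or not to be"));
    }
}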

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: stop words and index size

2005-01-13 Thread Chris Hostetter


: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.

That doesn't seem horribly surprising.

Consider that for every Term in the index, Lucene keeps track of a
list of <docId, freq> pairs, one for every document that contains that term.

Assume that something has to be in at least 25% of the docs before you
decide it's worth making it a stop word.  Your URL indicates you are
dealing with 400k docs, which means that for each stop word the space
needed to store those int pairs is at least...

(4B + 4B) * 100,000 =~ 780KB  (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to
improve the lookup of a Term.  Assuming some of those words are in more or
less than 25% of your documents, that could easily account for a
difference of 100MB.

I suspect that an interesting exercise would be to use some of the code
I've seen tossed around on this list that lets you iterate over all Terms
and find the most common ones to help you determine your stopword list
programmatically.  Then remove/reindex any documents that have each word as
you add it to your stoplist (one word at a time) and watch your index
shrink.
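
For anyone who wants to try that, a minimal sketch of the term-iteration idea
against the Lucene 1.4-era TermEnum API follows; the index path and the size
of the top-N list are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        TermEnum terms = reader.terms();

        // Keep a small "top N by document frequency" list as stopword candidates.
        String[] topTerms = new String[20];
        int[] topFreqs = new int[20];

        while (terms.next()) {
            Term t = terms.term();
            int df = terms.docFreq();
            // find the current minimum slot and replace it if this term beats it
            int min = 0;
            for (int i = 1; i < topFreqs.length; i++) {
                if (topFreqs[i] < topFreqs[min]) min = i;
            }
            if (df > topFreqs[min]) {
                topFreqs[min] = df;
                topTerms[min] = t.field() + ":" + t.text();
            }
        }
        terms.close();
        reader.close();

        for (int i = 0; i < topTerms.length; i++) {
            if (topTerms[i] != null) {
                System.out.println(topTerms[i] + "\t" + topFreqs[i]);
            }
        }
    }
}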




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



stop words and index size

2005-01-13 Thread David Spencer
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.
Thus, the index grows by about 45% in this case, which I found surprising, as I 
expected it not to grow as much. I haven't dug into the details of the 
Lucene file formats, but I thought compression (fields/term vectors/sparse 
lists/VInts) would negate the effect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36
-- Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
You don't need to optimize to simulate an incremental update.  You just
have to re-open your index with the IndexSearcher to see newly added
documents.
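
A minimal sketch of that, assuming the Lucene 1.4-era API; the index path,
field name, and query are placeholders.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ReopenExample {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/index";
        Query q = QueryParser.parse("lucene", "contents", new StandardAnalyzer());

        IndexSearcher searcher = new IndexSearcher(indexDir);
        System.out.println("before: " + searcher.search(q).length() + " hits");

        // ... elsewhere an IndexWriter adds documents and is closed ...

        // A searcher sees a point-in-time snapshot of the index, so open a
        // new one to pick up the newly added documents.
        searcher.close();
        searcher = new IndexSearcher(indexDir);
        System.out.println("after: " + searcher.search(q).length() + " hits");
        searcher.close();
    }
}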

Otis

--- aurora <[EMAIL PROTECTED]> wrote:

> Thanks for the heads up. I'm using Lucene 1.4.2.
> 
> I tried to do optimize() again but it has no effect. Adding a just
> tiny  
> dummy document would get rid of it.
> 
> I'm doing optimize every few hundred documents because I tried to
> simulate  
> incremental update. This lead to another question I would post
> separately.
> 
> Thanks.
> 
> 
> > Another possibility is that you are using an older version of
> Lucene,
> > which was known to have a bug with similar symptoms.  Get the
> latest
> > version of Lucene.
> >
> > You shouldn't really have multiple .cfs files after optimizing your
> > index.  Also, optimize only at the end, if you care about indexing
> > speed.
> >
> > Otis
> >
> > --- Paul Elschot <[EMAIL PROTECTED]> wrote:
> >
> >> On Tuesday 21 December 2004 05:49, aurora wrote:
> >> > I'm testing the rebuilding of the index. I add several hundred
> >> documents,
> >> > optimize and add another few hundred and so on. Right now I have
> >> around
> >> > 7000 files. I observed after the index gets to certain size.
> >> Everytime
> >> > after optimize, the are two files roughly the same size like
> below:
> >> >
> >> > 12/20/2004  01:57p  13 deletable
> >> > 12/20/2004  01:57p  29 segments
> >> > 12/20/2004  01:53p  14,460,367 _5qf.cfs
> >> > 12/20/2004  01:57p  15,069,013 _5zr.cfs
> >> >
> >> > The index total index is double of what I expect. This is not
> >> always
> >> > reproducible. (I'm constantly tuning my program and the set of
> >> document).
> >> > Sometime I get a decent single document after optimize. What was
> >> happening?
> >>
> >> Lucene tried to delete the older version (_5cf.cfs above), but got
> an
> >> error
> >> back from the file system. After that it has put the name of that
> >> segment in
> >> the deletable file, so it can try later to delete that segment.
> >>
> >> This is known behaviour on FAT file systems. These randomly take
> some
> >> time
> >> for themselves to finish closing a file after it has been
> correctly
> >> closed by
> >> a program.
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >>
> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail:
> [EMAIL PROTECTED]
> >>
> >>
> 
> 
> 
> -- 
> Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index size doubled?

2004-12-21 Thread aurora
Thanks for the heads up. I'm using Lucene 1.4.2.
I tried to run optimize() again but it had no effect. Adding just a tiny
dummy document would get rid of it.

I'm doing optimize every few hundred documents because I am trying to simulate
incremental updates. This leads to another question I will post separately.

Thanks.

Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.
You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.
Otis
--- Paul Elschot <[EMAIL PROTECTED]> wrote:
On Tuesday 21 December 2004 05:49, aurora wrote:
> I'm testing the rebuilding of the index. I add several hundred
documents,
> optimize and add another few hundred and so on. Right now I have
around
> 7000 files. I observed after the index gets to certain size.
Everytime
> after optimize, the are two files roughly the same size like below:
>
> 12/20/2004  01:57p  13 deletable
> 12/20/2004  01:57p  29 segments
> 12/20/2004  01:53p  14,460,367 _5qf.cfs
> 12/20/2004  01:57p  15,069,013 _5zr.cfs
>
> The index total index is double of what I expect. This is not
always
> reproducible. (I'm constantly tuning my program and the set of
document).
> Sometime I get a decent single document after optimize. What was
happening?
Lucene tried to delete the older version (_5cf.cfs above), but got an
error
back from the file system. After that it has put the name of that
segment in
the deletable file, so it can try later to delete that segment.
This is known behaviour on FAT file systems. These randomly take some
time
for themselves to finish closing a file after it has been correctly
closed by
a program.
Regards,
Paul Elschot
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: index size doubled?

2004-12-21 Thread Otis Gospodnetic
Another possibility is that you are using an older version of Lucene,
which was known to have a bug with similar symptoms.  Get the latest
version of Lucene.

You shouldn't really have multiple .cfs files after optimizing your
index.  Also, optimize only at the end, if you care about indexing
speed.

Otis

--- Paul Elschot <[EMAIL PROTECTED]> wrote:

> On Tuesday 21 December 2004 05:49, aurora wrote:
> > I'm testing the rebuilding of the index. I add several hundred
> documents,  
> > optimize and add another few hundred and so on. Right now I have
> around  
> > 7000 files. I observed after the index gets to certain size.
> Everytime  
> > after optimize, the are two files roughly the same size like below:
> > 
> > 12/20/2004  01:57p  13 deletable
> > 12/20/2004  01:57p  29 segments
> > 12/20/2004  01:53p  14,460,367 _5qf.cfs
> > 12/20/2004  01:57p  15,069,013 _5zr.cfs
> > 
> > The index total index is double of what I expect. This is not
> always  
> > reproducible. (I'm constantly tuning my program and the set of
> document).  
> > Sometime I get a decent single document after optimize. What was
> happening?
> 
> Lucene tried to delete the older version (_5cf.cfs above), but got an
> error
> back from the file system. After that it has put the name of that
> segment in
> the deletable file, so it can try later to delete that segment.
> 
> This is known behaviour on FAT file systems. These randomly take some
> time
> for themselves to finish closing a file after it has been correctly
> closed by
> a program.
> 
> Regards,
> Paul Elschot
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index size doubled?

2004-12-21 Thread Paul Elschot
On Tuesday 21 December 2004 05:49, aurora wrote:
> I'm testing the rebuilding of the index. I add several hundred documents,  
> optimize and add another few hundred and so on. Right now I have around  
> 7000 files. I observed after the index gets to certain size. Everytime  
> after optimize, the are two files roughly the same size like below:
> 
> 12/20/2004  01:57p  13 deletable
> 12/20/2004  01:57p  29 segments
> 12/20/2004  01:53p  14,460,367 _5qf.cfs
> 12/20/2004  01:57p  15,069,013 _5zr.cfs
> 
> The index total index is double of what I expect. This is not always  
> reproducible. (I'm constantly tuning my program and the set of document).  
> Sometime I get a decent single document after optimize. What was happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error
back from the file system. After that it has put the name of that segment in
the deletable file, so it can try later to delete that segment.

This is known behaviour on FAT file systems. These randomly take some time
for themselves to finish closing a file after it has been correctly closed by
a program.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



index size doubled?

2004-12-20 Thread aurora
I'm testing the rebuilding of the index. I add several hundred documents,
optimize, add another few hundred, and so on. Right now I have around
7000 files. I observed that after the index gets to a certain size, every time
after optimize there are two files of roughly the same size, like below:

12/20/2004  01:57p  13 deletable
12/20/2004  01:57p  29 segments
12/20/2004  01:53p  14,460,367 _5qf.cfs
12/20/2004  01:57p  15,069,013 _5zr.cfs
The total index size is double what I expect. This is not always
reproducible (I'm constantly tuning my program and the set of documents).
Sometimes I get a single file after optimize, as expected. What is happening?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Impact of stored fields on index size and performance

2004-12-01 Thread Venkatraju
Hi,

I have read (or somehow got this notion into my head) that having many
or large stored fields makes the index much larger and affects
performance. Is this true? I can understand how it increases index
size, but where does the performance impact come from? If the stored
field is stored directly in the index somewhere, why does that affect
search performance (searching should still take the same time, because
the inverted index is not directly affected in any way)? Retrieving
each hit doc may take longer because the large stored fields must be
transferred from disk.
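
As background for the distinction the question draws, here is a small sketch
using the Lucene 1.3/1.4-era Field factory methods; the field names and
values are placeholders.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldChoices {
    public static Document makeDoc(String id, String title, String body) {
        Document doc = new Document();
        // stored + indexed, not tokenized: adds to the stored-field data
        doc.add(Field.Keyword("id", id));
        // stored + indexed + tokenized: searchable and retrievable
        doc.add(Field.Text("title", title));
        // indexed + tokenized, NOT stored: contributes only to the inverted index
        doc.add(Field.UnStored("contents", body));
        return doc;
    }
}

Stored field values end up in the .fdt/.fdx files and are read only when a
hit document is loaded, which matches the intuition above: stored fields
mainly cost index size and per-hit retrieval time rather than query
evaluation time.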

Thanks in advance,
Venkat

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote:
I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?

Lucene can currently handle up to 2^31 documents in a single index.  To 
a large degree this is limited by Java ints and arrays (which are 
accessed by ints).  There are also a few places where the file format 
limits things to 2^32.

On typical PC hardware, 2-3 word searches of an index with 10M 
documents, each with around 10k of text, require around 1 second, 
including index i/o time.  Performance is more-or-less linear, so that a 
100M document index might require nearly 10 seconds per search.  Thus, 
as indexes grow folks tend to distribute searches in parallel to many 
smaller indexes.  That's what Nutch and Google 
(http://www.computer.org/micro/mi2003/m2022.pdf) do.
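
A minimal sketch of searching several smaller indexes together, assuming the
Lucene 1.4-era MultiSearcher (which queries the sub-indexes sequentially in a
single JVM; Nutch and Google distribute the work across machines).  The shard
paths, field name, and query are placeholders.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // Each directory holds one smaller sub-index.
        Searchable[] shards = {
            new IndexSearcher("/indexes/shard0"),
            new IndexSearcher("/indexes/shard1"),
            new IndexSearcher("/indexes/shard2"),
        };
        MultiSearcher searcher = new MultiSearcher(shards);

        Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println("total hits: " + hits.length());

        searcher.close();
    }
}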

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: maximum index size

2004-09-08 Thread Otis Gospodnetic
Given adequate hardware, it can.  Take a look at nutch.org.  Nutch uses
Lucene at its core.

Otis

--- Chris Fraschetti <[EMAIL PROTECTED]> wrote:

> I know the index size is very dependent on the content being index...
> 
> but running on a unix based machine w/o a filesize limit, best case
> scenario... what is the largest number of documents that can be
> indexed.
> 
> I've seen throughout the list mentions of millions of documents.. 8
> million, 20 million, etc etc.. but can lucene potentially handle
> billions of documents and still efficiently search through them?
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



maximum index size

2004-09-08 Thread Chris Fraschetti
I know the index size is very dependent on the content being indexed...

but running on a Unix-based machine w/o a file-size limit, best-case
scenario... what is the largest number of documents that can be
indexed?

I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Grant
Thanks for your response.  I have fixed this issue.  I have indexed 5 MB
worth of text files and I now only use 224 KB.  I was getting 80 MB.  The
only change I made was to change the way I merge my temp index into my prod
index.  My code changed from:
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

To:

int iNumDocs = tempReader.numDocs();
for (int y = 0; y < iNumDocs; y++) {
    Document tempDoc = tempReader.document(y);
    prodWriter.addDocument(tempDoc);
}



I don't know if this is a bug in the IndexWriter.addIndexes(IndexReader)
method or something else I am doing that caused this, but I am getting much
better results now.



Thanks to everyone who helped, I really appreciate it.



Rob

- Original Message - 
From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 10:51 AM
Subject: Re: Index Size


How many fields do you have and what analyzer are you using?

>>> [EMAIL PROTECTED] 8/19/2004 11:54:25 AM >>>
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8
MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traverse the
IndexReader on the temp index and use
IndexWriter.addDocument(Document)
method?

Thanks again for your input, I appreciate it.

Rob
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 8:00 AM
Subject: Re: Index Size


Just go for 1.4.1 and look at the CHANGES.txt file to see if there
were
any index format changes.  If there were, you'll need to re-index.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Otis
> I am using Lucene 1.3 final.  Would it help if I move to Lucene 1.4
> final?
>
> Rob
> - Original Message - 
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 7:13 AM
> Subject: Re: Index Size
>
>
> I thought this was the case.  I believe there was a bug in one of
the
> recent Lucene releases that caused old CFS files not to be removed
> when
> they should be removed.  This resulted in your index directory
> containing a bunch of old CFS files consuming your disk space.
>
> Try getting a recent nightly build and see if using that takes car
> eof
> your problem.
>
> Otis
>
> --- Rob Jose <[EMAIL PROTECTED]> wrote:
>
> > Hey George
> > Thanks for responding.  I am using windows and I don't see any
> hidden
> > files.
> > I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2,
etc.)
> > files.
> > I have two FDT files and two FDX files. And three FNM files.  Add
> > these
> > files to the deletable and segments file and that is all of the
> files
> > that I
> > have.   The CFS files are appoximately 11 MB each.  The totals I
> gave
> > you
> > before were for all of my indexes together.  This particular index
> > has a
> > size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> >
> > OK - I just removed all of the CFS files from the directory and I
> can
> > still
> > read my indexes.  So know I have to ask what are these CFS files?
> > Why are
> > they created?  And how can I get rid of them if I don't need them.
> I
> > will
> > also take a look at the Lucene website to see if I can find any
> > information.
> >
> > Thanks
> > Rob
> >
> > - Original Message - 
> > From: "Honey George" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Thursday, August 19, 2004 12:29 AM
> > Subject: Re: Index Size
> >
> >
> > Hi,
> >  Please check for hidden files in the index folder. If
> > you are using linx, do something like
> >
> > ls -al 
> >
> > I am also facing a similar problem where the index
> > size is greater than the data size. In my case there
> > were some hidden temproary files which the lucene
> > creates.
> > That was taking half of the total size.
> >
> > My problem is that after deleting the temporary files,
> > the index size is same as that of the data size. That
> > again seems to be a problem. I am yet to find out the
> > reason..
> >
> > Thanks,
> >george
> >
> >
> >  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > > Hello
> > > I have indexed several thousand (52 to be exact)
> 

Re: Index Size

2004-08-19 Thread Grant Ingersoll
How many fields do you have and what analyzer are you using?

>>> [EMAIL PROTECTED] 8/19/2004 11:54:25 AM >>>
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8
MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traverse the
IndexReader on the temp index and use
IndexWriter.addDocument(Document)
method?

Thanks again for your input, I appreciate it.

Rob
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 8:00 AM
Subject: Re: Index Size


Just go for 1.4.1 and look at the CHANGES.txt file to see if there
were
any index format changes.  If there were, you'll need to re-index.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Otis
> I am using Lucene 1.3 final.  Would it help if I move to Lucene 1.4
> final?
>
> Rob
> - Original Message - 
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 7:13 AM
> Subject: Re: Index Size
>
>
> I thought this was the case.  I believe there was a bug in one of
the
> recent Lucene releases that caused old CFS files not to be removed
> when
> they should be removed.  This resulted in your index directory
> containing a bunch of old CFS files consuming your disk space.
>
> Try getting a recent nightly build and see if using that takes car
> eof
> your problem.
>
> Otis
>
> --- Rob Jose <[EMAIL PROTECTED]> wrote:
>
> > Hey George
> > Thanks for responding.  I am using windows and I don't see any
> hidden
> > files.
> > I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2,
etc.)
> > files.
> > I have two FDT files and two FDX files. And three FNM files.  Add
> > these
> > files to the deletable and segments file and that is all of the
> files
> > that I
> > have.   The CFS files are appoximately 11 MB each.  The totals I
> gave
> > you
> > before were for all of my indexes together.  This particular index
> > has a
> > size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> >
> > OK - I just removed all of the CFS files from the directory and I
> can
> > still
> > read my indexes.  So know I have to ask what are these CFS files?
> > Why are
> > they created?  And how can I get rid of them if I don't need them.
> I
> > will
> > also take a look at the Lucene website to see if I can find any
> > information.
> >
> > Thanks
> > Rob
> >
> > - Original Message ----- 
> > From: "Honey George" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Thursday, August 19, 2004 12:29 AM
> > Subject: Re: Index Size
> >
> >
> > Hi,
> >  Please check for hidden files in the index folder. If
> > you are using linx, do something like
> >
> > ls -al 
> >
> > I am also facing a similar problem where the index
> > size is greater than the data size. In my case there
> > were some hidden temproary files which the lucene
> > creates.
> > That was taking half of the total size.
> >
> > My problem is that after deleting the temporary files,
> > the index size is same as that of the data size. That
> > again seems to be a problem. I am yet to find out the
> > reason..
> >
> > Thanks,
> >george
> >
> >
> >  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > > Hello
> > > I have indexed several thousand (52 to be exact)
> > > text files and I keep running out of disk space to
> > > store the indexes.  The size of the documents I have
> > > indexed is around 2.5 GB.  The size of the Lucene
> > > indexes is around 287 GB.  Does this seem correct?
> > > I am not storing the contents of the file, just
> > > indexing and tokenizing.  I am using Lucene 1.3
> > > final.  Can you guys let me know what you are
> > > experiencing?  I don't want to go into production
> > > with something that I should be configuring better.
> > >
> > >
> > > I am not sure if this helps, but I have a temp index
> > > and a real index.  I index the file into the temp
> > > index, and then merge the temp index into the real
> > > index using the addIndexes method on the
> > > IndexWriter.  I have also set the production writer
> > > setUseCompoundFile to true.  I did not set this on
> > > the temp index.  The last thing that I do before
> > > closing the production writer is to call the
> > > optimize method.
> > >
> > > I would really appreciate any ideas to get the index
> > > size smaller if it is at all possible.
> > >
> > > Thanks
> > > Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Stephane

Thanks for your response.  I have wondered about that same question.  In fact, after
I went home last night that is exactly what I thought I was doing.  But I
just used Luke to go through all of my documents, and I don't see any
duplicates.  I will go check again just to make sure.

Rob
- Original Message - 
From: "Stephane James Vaucher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 9:34 AM
Subject: Re: Index Size


Stupid question:

Are you sure you have the right number of docs in your index? i.e. you're
not adding the same document twice into or via your tmp index.

sv

On Thu, 19 Aug 2004, Rob Jose wrote:

> Paul
> Thank you for your response.  I have appended to the bottom of this
message
> the field structure that I am using.  I hope that this helps.  I am using
> the StandardAnalyzer.  I do not believe that I am changing any default
> values, but I have also appended the code that adds the temp index to the
> production index.
>
> Thanks for you help
> Rob
>
> Here is the code that describes the field structure.
> public static Document Document(String contents, String path, Date
modified,
> String runDate, String totalpages, String pagecount, String countycode,
> String reportnum, String reportdescr)
>
> {
>
> SimpleDateFormat showFormat = new
> SimpleDateFormat(TurbineResources.getString("date.default.format"));
>
> SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
>
> Document doc = new Document();
>
> doc.add(Field.Keyword("path", path));
>
> doc.add(Field.Keyword("modified", showFormat.format(modified)));
>
> doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
>
> doc.add(Field.Keyword("runDate", runDate==null?"":runDate));
>
> doc.add(Field.UnStored("searchRunDate",
>
runDate==null?"":runDate.substring(6)+runDate.substring(0,2)+runDate.substri
> ng(3,5)));
>
> doc.add(Field.Keyword("reportnum", reportnum));
>
> doc.add(Field.Text("reportdescr", reportdescr));
>
> doc.add(Field.UnStored("cntycode", countycode));
>
> doc.add(Field.Keyword("totalpages", totalpages));
>
> doc.add(Field.Keyword("page", pagecount));
>
> doc.add(Field.UnStored("contents", contents));
>
> return doc;
>
> }
>
>
>
> Here is the code that adds the temp index to the production index.
>
> File tempFile = new File(sIndex + File.separatorChar + "temp" +
sCntyCode);
>
> tempReader = IndexReader.open(tempFile);
>
> try
>
> {
>
> boolean createIndex = false;
>
> File f = new File(sIndex + File.separatorChar + sCntyCode);
>
> if (!f.exists())
>
> {
>
> createIndex = true;
>
> }
>
> prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode, new
> StandardAnalyzer(), createIndex);
>
> }
>
> catch (Exception e)
>
> {
>
> IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
> sCntyCode, false));
>
> CasesReports.log("Tried to Unlock " + sIndex);
>
> prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
>
> CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
> sCntyCode);
>
> }
>
> prodWriter.setUseCompoundFile(true);
>
> prodWriter.addIndexes(new IndexReader[] { tempReader });
>
>
>
>
>
> - Original Message -
> From: "Paul Elschot" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 12:16 AM
> Subject: Re: Index Size
>
>
> On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> > Hello
> > I have indexed several thousand (52 to be exact) text files and I keep
> > running out of disk space to store the indexes.  The size of the
documents
> > I have indexed is around 2.5 GB.  The size of the Lucene indexes is
around
> > 287 GB.  Does this seem correct?  I am not storing the contents of the
>
> As noted, one would expect the index size to be about 35%
> of the original text, ie. about 2.5GB * 35% = 800MB.
> That is two orders of magnitude off from what you have.
>
> Could you provide some more information about the field structure,
> ie. how many fields, which fields are stored, which fields are indexed,
> evt. use of non standard analyzers, and evt. non standard
> Lucene settings?
>
> You might also try to change to non compound format to have a look
> at the sizes of the individual index files, see file formats on the lucene
> web site.
> You can then see the total disk size of for example

Re: Index Size

2004-08-19 Thread Stephane James Vaucher
Stupid question:

Are you sure you have the right number of docs in your index? i.e. you're
not adding the same document twice into or via your tmp index.

sv

On Thu, 19 Aug 2004, Rob Jose wrote:

> Paul
> Thank you for your response.  I have appended to the bottom of this message
> the field structure that I am using.  I hope that this helps.  I am using
> the StandardAnalyzer.  I do not believe that I am changing any default
> values, but I have also appended the code that adds the temp index to the
> production index.
>
> Thanks for you help
> Rob
>
> Here is the code that describes the field structure.
> public static Document Document(String contents, String path, Date modified,
> String runDate, String totalpages, String pagecount, String countycode,
> String reportnum, String reportdescr)
>
> {
>
> SimpleDateFormat showFormat = new
> SimpleDateFormat(TurbineResources.getString("date.default.format"));
>
> SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
>
> Document doc = new Document();
>
> doc.add(Field.Keyword("path", path));
>
> doc.add(Field.Keyword("modified", showFormat.format(modified)));
>
> doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
>
> doc.add(Field.Keyword("runDate", runDate==null?"":runDate));
>
> doc.add(Field.UnStored("searchRunDate",
> runDate==null?"":runDate.substring(6)+runDate.substring(0,2)+runDate.substri
> ng(3,5)));
>
> doc.add(Field.Keyword("reportnum", reportnum));
>
> doc.add(Field.Text("reportdescr", reportdescr));
>
> doc.add(Field.UnStored("cntycode", countycode));
>
> doc.add(Field.Keyword("totalpages", totalpages));
>
> doc.add(Field.Keyword("page", pagecount));
>
> doc.add(Field.UnStored("contents", contents));
>
> return doc;
>
> }
>
>
>
> Here is the code that adds the temp index to the production index.
>
> File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
>
> tempReader = IndexReader.open(tempFile);
>
> try
>
> {
>
> boolean createIndex = false;
>
> File f = new File(sIndex + File.separatorChar + sCntyCode);
>
> if (!f.exists())
>
> {
>
> createIndex = true;
>
> }
>
> prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode, new
> StandardAnalyzer(), createIndex);
>
> }
>
> catch (Exception e)
>
> {
>
> IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
> sCntyCode, false));
>
> CasesReports.log("Tried to Unlock " + sIndex);
>
> prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
>
> CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
> sCntyCode);
>
> }
>
> prodWriter.setUseCompoundFile(true);
>
> prodWriter.addIndexes(new IndexReader[] { tempReader });
>
>
>
>
>
> - Original Message -
> From: "Paul Elschot" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 12:16 AM
> Subject: Re: Index Size
>
>
> On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> > Hello
> > I have indexed several thousand (52 to be exact) text files and I keep
> > running out of disk space to store the indexes.  The size of the documents
> > I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
> > 287 GB.  Does this seem correct?  I am not storing the contents of the
>
> As noted, one would expect the index size to be about 35%
> of the original text, ie. about 2.5GB * 35% = 800MB.
> That is two orders of magnitude off from what you have.
>
> Could you provide some more information about the field structure,
> ie. how many fields, which fields are stored, which fields are indexed,
> evt. use of non standard analyzers, and evt. non standard
> Lucene settings?
>
> You might also try to change to non compound format to have a look
> at the sizes of the individual index files, see file formats on the lucene
> web site.
> You can then see the total disk size of for example the stored fields.
>
> Regards,
> Paul Elschot
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Dan
Thanks for your response.  Yes, I have used Luke to look at the index and
everything looks good.

Rob
- Original Message - 
From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 9:14 AM
Subject: RE: Index Size


Have you tried looking at the contents of this small index with Luke, to see
what actually got put into it?  Maybe one of your stored fields is being fed
something you didn't expect.

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Index Size

2004-08-19 Thread Armbrust, Daniel C.
Have you tried looking at the contents of this small index with Luke, to see what 
actually got put into it?  Maybe one of your stored fields is being fed something you 
didn't expect.

Dan 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8 MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traversed the
IndexReader on the temp index and used the
IndexWriter.addDocument(Document) method?

Thanks again for your input, I appreciate it.

Rob
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 8:00 AM
Subject: Re: Index Size


Just go for 1.4.1 and look at the CHANGES.txt file to see if there were
any index format changes.  If there were, you'll need to re-index.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Otis
> I am using Lucene 1.3 final.  Would it help if I move to Lucene 1.4
> final?
>
> Rob
> - Original Message - 
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 7:13 AM
> Subject: Re: Index Size
>
>
> I thought this was the case.  I believe there was a bug in one of the
> recent Lucene releases that caused old CFS files not to be removed
> when
> they should be removed.  This resulted in your index directory
> containing a bunch of old CFS files consuming your disk space.
>
> Try getting a recent nightly build and see if using that takes car
> eof
> your problem.
>
> Otis
>
> --- Rob Jose <[EMAIL PROTECTED]> wrote:
>
> > Hey George
> > Thanks for responding.  I am using windows and I don't see any
> hidden
> > files.
> > I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.)
> > files.
> > I have two FDT files and two FDX files. And three FNM files.  Add
> > these
> > files to the deletable and segments file and that is all of the
> files
> > that I
> > have.   The CFS files are appoximately 11 MB each.  The totals I
> gave
> > you
> > before were for all of my indexes together.  This particular index
> > has a
> > size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> >
> > OK - I just removed all of the CFS files from the directory and I
> can
> > still
> > read my indexes.  So know I have to ask what are these CFS files?
> > Why are
> > they created?  And how can I get rid of them if I don't need them.
> I
> > will
> > also take a look at the Lucene website to see if I can find any
> > information.
> >
> > Thanks
> > Rob
> >
> > - Original Message - 
> > From: "Honey George" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Thursday, August 19, 2004 12:29 AM
> > Subject: Re: Index Size
> >
> >
> > Hi,
> >  Please check for hidden files in the index folder. If
> > you are using linx, do something like
> >
> > ls -al 
> >
> > I am also facing a similar problem where the index
> > size is greater than the data size. In my case there
> > were some hidden temproary files which the lucene
> > creates.
> > That was taking half of the total size.
> >
> > My problem is that after deleting the temporary files,
> > the index size is same as that of the data size. That
> > again seems to be a problem. I am yet to find out the
> > reason..
> >
> > Thanks,
> >george
> >
> >
> >  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > > Hello
> > > I have indexed several thousand (52 to be exact)
> > > text files and I keep running out of disk space to
> > > store the indexes.  The size of the documents I have
> > > indexed is around 2.5 GB.  The size of the Lucene
> > > indexes is around 287 GB.  Does this seem correct?
> > > I am not storing the contents of the file, just
> > > indexing and tokenizing.  I am using Lucene 1.3
> > > final.  Can you guys let me know what you are
> > > experiencing?  I don't want to go into production
> > > with something that I should be configuring better.
> > >
> > >
> > > I am not sure if this helps, but I have a temp index
> > > and a real index.  I index the file into the temp
> > > index, and then merge the temp index into the real
> > > index using the addIndexes method on the
> > > IndexWriter.  I have also set the production writer
> > > setUseCompoundFile to true.  I did not set this on
> > > the temp index.  The last thing that I do before
> > > closing the production writer is to call the
> > > optimize method.
> > >
> > > I would really appreciate any ideas to get the index
> > > size smaller if it is at all possible.
> > >
> > > Thanks
> > > Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Otis Gospodnetic
Just go for 1.4.1 and look at the CHANGES.txt file to see if there were
any index format changes.  If there were, you'll need to re-index.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Otis
> I am using Lucene 1.3 final.  Would it help if I move to Lucene 1.4
> final?
> 
> Rob
> - Original Message - 
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 7:13 AM
> Subject: Re: Index Size
> 
> 
> I thought this was the case.  I believe there was a bug in one of the
> recent Lucene releases that caused old CFS files not to be removed
> when
> they should be removed.  This resulted in your index directory
> containing a bunch of old CFS files consuming your disk space.
> 
> Try getting a recent nightly build and see if using that takes car
> eof
> your problem.
> 
> Otis
> 
> --- Rob Jose <[EMAIL PROTECTED]> wrote:
> 
> > Hey George
> > Thanks for responding.  I am using windows and I don't see any
> hidden
> > files.
> > I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.)
> > files.
> > I have two FDT files and two FDX files. And three FNM files.  Add
> > these
> > files to the deletable and segments file and that is all of the
> files
> > that I
> > have.   The CFS files are appoximately 11 MB each.  The totals I
> gave
> > you
> > before were for all of my indexes together.  This particular index
> > has a
> > size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> > 
> > OK - I just removed all of the CFS files from the directory and I
> can
> > still
> > read my indexes.  So know I have to ask what are these CFS files? 
> > Why are
> > they created?  And how can I get rid of them if I don't need them. 
> I
> > will
> > also take a look at the Lucene website to see if I can find any
> > information.
> > 
> > Thanks
> > Rob
> > 
> > - Original Message - 
> > From: "Honey George" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Thursday, August 19, 2004 12:29 AM
> > Subject: Re: Index Size
> > 
> > 
> > Hi,
> >  Please check for hidden files in the index folder. If
> > you are using linx, do something like
> > 
> > ls -al 
> > 
> > I am also facing a similar problem where the index
> > size is greater than the data size. In my case there
> > were some hidden temproary files which the lucene
> > creates.
> > That was taking half of the total size.
> > 
> > My problem is that after deleting the temporary files,
> > the index size is same as that of the data size. That
> > again seems to be a problem. I am yet to find out the
> > reason..
> > 
> > Thanks,
> >george
> > 
> > 
> >  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > > Hello
> > > I have indexed several thousand (52 to be exact)
> > > text files and I keep running out of disk space to
> > > store the indexes.  The size of the documents I have
> > > indexed is around 2.5 GB.  The size of the Lucene
> > > indexes is around 287 GB.  Does this seem correct?
> > > I am not storing the contents of the file, just
> > > indexing and tokenizing.  I am using Lucene 1.3
> > > final.  Can you guys let me know what you are
> > > experiencing?  I don't want to go into production
> > > with something that I should be configuring better.
> > >
> > >
> > > I am not sure if this helps, but I have a temp index
> > > and a real index.  I index the file into the temp
> > > index, and then merge the temp index into the real
> > > index using the addIndexes method on the
> > > IndexWriter.  I have also set the production writer
> > > setUseCompoundFile to true.  I did not set this on
> > > the temp index.  The last thing that I do before
> > > closing the production writer is to call the
> > > optimize method.
> > >
> > > I would really appreciate any ideas to get the index
> > > size smaller if it is at all possible.
> > >
> > > Thanks
> > > Rob
> > 
> > 
> > 
> > 
> > 
> > ___ALL-NEW
> > Yahoo!
> > Messenger - all new features - even more fun! 
> > http://uk.messenger.yahoo.com
> > 
> >
> -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> > 
> >
> -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I am using Lucene 1.3 final.  Would it help if I move to Lucene 1.4 final?

Rob
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 7:13 AM
Subject: Re: Index Size


I thought this was the case.  I believe there was a bug in one of the
recent Lucene releases that caused old CFS files not to be removed when
they should be removed.  This resulted in your index directory
containing a bunch of old CFS files consuming your disk space.

Try getting a recent nightly build and see if using that takes care of
your problem.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Hey George
> Thanks for responding.  I am using windows and I don't see any hidden
> files.
> I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.)
> files.
> I have two FDT files and two FDX files. And three FNM files.  Add
> these
> files to the deletable and segments file and that is all of the files
> that I
> have.   The CFS files are appoximately 11 MB each.  The totals I gave
> you
> before were for all of my indexes together.  This particular index
> has a
> size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> 
> OK - I just removed all of the CFS files from the directory and I can
> still
> read my indexes.  So know I have to ask what are these CFS files? 
> Why are
> they created?  And how can I get rid of them if I don't need them.  I
> will
> also take a look at the Lucene website to see if I can find any
> information.
> 
> Thanks
> Rob
> 
> - Original Message - 
> From: "Honey George" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 12:29 AM
> Subject: Re: Index Size
> 
> 
> Hi,
>  Please check for hidden files in the index folder. If
> you are using linx, do something like
> 
> ls -al 
> 
> I am also facing a similar problem where the index
> size is greater than the data size. In my case there
> were some hidden temproary files which the lucene
> creates.
> That was taking half of the total size.
> 
> My problem is that after deleting the temporary files,
> the index size is same as that of the data size. That
> again seems to be a problem. I am yet to find out the
> reason..
> 
> Thanks,
>george
> 
> 
>  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > Hello
> > I have indexed several thousand (52 to be exact)
> > text files and I keep running out of disk space to
> > store the indexes.  The size of the documents I have
> > indexed is around 2.5 GB.  The size of the Lucene
> > indexes is around 287 GB.  Does this seem correct?
> > I am not storing the contents of the file, just
> > indexing and tokenizing.  I am using Lucene 1.3
> > final.  Can you guys let me know what you are
> > experiencing?  I don't want to go into production
> > with something that I should be configuring better.
> >
> >
> > I am not sure if this helps, but I have a temp index
> > and a real index.  I index the file into the temp
> > index, and then merge the temp index into the real
> > index using the addIndexes method on the
> > IndexWriter.  I have also set the production writer
> > setUseCompoundFile to true.  I did not set this on
> > the temp index.  The last thing that I do before
> > closing the production writer is to call the
> > optimize method.
> >
> > I would really appreciate any ideas to get the index
> > size smaller if it is at all possible.
> >
> > Thanks
> > Rob
> 
> 
> 
> 
> 
> ___ALL-NEW
> Yahoo!
> Messenger - all new features - even more fun! 
> http://uk.messenger.yahoo.com
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Otis Gospodnetic
I thought this was the case.  I believe there was a bug in one of the
recent Lucene releases that caused old CFS files not to be removed when
they should be removed.  This resulted in your index directory
containing a bunch of old CFS files consuming your disk space.

Try getting a recent nightly build and see if using that takes care of
your problem.

Otis

--- Rob Jose <[EMAIL PROTECTED]> wrote:

> Hey George
> Thanks for responding.  I am using windows and I don't see any hidden
> files.
> I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.)
> files.
> I have two FDT files and two FDX files. And three FNM files.  Add
> these
> files to the deletable and segments file and that is all of the files
> that I
> have.   The CFS files are appoximately 11 MB each.  The totals I gave
> you
> before were for all of my indexes together.  This particular index
> has a
> size of 21.6 GB.  The files that it indexed have a size of 89 MB.
> 
> OK - I just removed all of the CFS files from the directory and I can
> still
> read my indexes.  So know I have to ask what are these CFS files? 
> Why are
> they created?  And how can I get rid of them if I don't need them.  I
> will
> also take a look at the Lucene website to see if I can find any
> information.
> 
> Thanks
> Rob
> 
> - Original Message - 
> From: "Honey George" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, August 19, 2004 12:29 AM
> Subject: Re: Index Size
> 
> 
> Hi,
>  Please check for hidden files in the index folder. If
> you are using linx, do something like
> 
> ls -al 
> 
> I am also facing a similar problem where the index
> size is greater than the data size. In my case there
> were some hidden temproary files which the lucene
> creates.
> That was taking half of the total size.
> 
> My problem is that after deleting the temporary files,
> the index size is same as that of the data size. That
> again seems to be a problem. I am yet to find out the
> reason..
> 
> Thanks,
>george
> 
> 
>  --- Rob Jose <[EMAIL PROTECTED]> wrote:
> > Hello
> > I have indexed several thousand (52 to be exact)
> > text files and I keep running out of disk space to
> > store the indexes.  The size of the documents I have
> > indexed is around 2.5 GB.  The size of the Lucene
> > indexes is around 287 GB.  Does this seem correct?
> > I am not storing the contents of the file, just
> > indexing and tokenizing.  I am using Lucene 1.3
> > final.  Can you guys let me know what you are
> > experiencing?  I don't want to go into production
> > with something that I should be configuring better.
> >
> >
> > I am not sure if this helps, but I have a temp index
> > and a real index.  I index the file into the temp
> > index, and then merge the temp index into the real
> > index using the addIndexes method on the
> > IndexWriter.  I have also set the production writer
> > setUseCompoundFile to true.  I did not set this on
> > the temp index.  The last thing that I do before
> > closing the production writer is to call the
> > optimize method.
> >
> > I would really appreciate any ideas to get the index
> > size smaller if it is at all possible.
> >
> > Thanks
> > Rob
> 
> 
> 
> 
> 
> ___ALL-NEW
> Yahoo!
> Messenger - all new features - even more fun! 
> http://uk.messenger.yahoo.com
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
I did a little more research into my production indexes, and so far the
first index is the only one that has any other files besides the CFS files.
The other indexes that I have seen have just the deletable and segments
files and a whole bunch of CFS files.  Very interesting.  Also worth noting
is that once in a while one of the production indexes will have a 0-length
FNM file.

Rob
- Original Message - 
From: "Rob Jose" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 6:42 AM
Subject: Re: Index Size


Bernhard
Thanks for responding.  I do have an IndexReader open on the Temp index.  I
pass this IndexReader into the addIndexes method on the IndexWriter to add
these files.  I did notice that I have a ton of CFS files that I removed and
was still able to read the indexes.  Are these the temporary segment files
you are talking about?  Here is my code that adds the temp index to the prod
index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);

try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}

prodWriter.setUseCompoundFile(true);

prodWriter.addIndexes(new IndexReader[] { tempReader });



Am I doing something wrong?  Any help would be extremely appreciated.



Thanks

Rob

- Original Message - 
From: "Bernhard Messer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 1:09 AM
Subject: Re: Index Size


Rob,

as Doug and Paul already mentioned, the index size is definitely too big :-(.

What could raise the problem, especially when running on a windows
platform, is that an IndexReader is open during the whole index process.
During indexing, the writer creates temporary segment files which will
be merged into bigger segments. If done, the old segment files will be
deleted. If there is an open IndexReader, the environment is unable to
unlock the files and they still stay in the index directory. You will
end up with an index, several times bigger than the dataset.

Can you check your code for any open IndexReaders when indexing, or
paste the relevant part to the list so we could have a look on it.

hope this helps
Bernhard


Rob Jose wrote:

>Hello
>I have indexed several thousand (52 to be exact) text files and I keep
running out of disk space to store the indexes.  The size of the documents I
have indexed is around 2.5 GB.  The size of the Lucene indexes is around 287
GB.  Does this seem correct?  I am not storing the contents of the file,
just indexing and tokenizing.  I am using Lucene 1.3 final.  Can you guys
let me know what you are experiencing?  I don't want to go into production
with something that I should be configuring better.
>
>I am not sure if this helps, but I have a temp index and a real index.  I
index the file into the temp index, and then merge the temp index into the
real index using the addIndexes method on the IndexWriter.  I have also set
the production writer setUseCompoundFile to true.  I did not set this on the
temp index.  The last thing that I do before closing the production writer
is to call the optimize method.
>
>I would really appreciate any ideas to get the index size smaller if it is
at all possible.
>
>Thanks
>Rob
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Bernhard
Thanks for responding.  I do have an IndexReader open on the Temp index.  I
pass this IndexReader into the addIndexes method on the IndexWriter to add
these files.  I did notice that I have a ton of CFS files that I removed and
was still able to read the indexes.  Are these the temporary segment files
you are talking about?  Here is my code that adds the temp index to the prod
index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);

try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}

prodWriter.setUseCompoundFile(true);

prodWriter.addIndexes(new IndexReader[] { tempReader });



Am I doing something wrong?  Any help would be extremely appreciated.



Thanks

Rob

- Original Message - 
From: "Bernhard Messer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 1:09 AM
Subject: Re: Index Size


Rob,

as Doug and Paul already mentioned, the index size is definitely too big :-(.

What could cause the problem, especially when running on a Windows
platform, is an IndexReader that stays open during the whole index process.
During indexing, the writer creates temporary segment files which are
merged into bigger segments. Once that is done, the old segment files are
deleted. If there is an open IndexReader, the files stay locked and cannot
be deleted, so they remain in the index directory. You will end up with an
index several times bigger than the dataset.

Can you check your code for any open IndexReaders when indexing, or
paste the relevant part to the list so we can have a look at it.

hope this helps
Bernhard


Rob Jose wrote:

>Hello
>I have indexed several thousand (52 to be exact) text files and I keep
running out of disk space to store the indexes.  The size of the documents I
have indexed is around 2.5 GB.  The size of the Lucene indexes is around 287
GB.  Does this seem correct?  I am not storing the contents of the file,
just indexing and tokenizing.  I am using Lucene 1.3 final.  Can you guys
let me know what you are experiencing?  I don't want to go into production
with something that I should be configuring better.
>
>I am not sure if this helps, but I have a temp index and a real index.  I
index the file into the temp index, and then merge the temp index into the
real index using the addIndexes method on the IndexWriter.  I have also set
the production writer setUseCompoundFile to true.  I did not set this on the
temp index.  The last thing that I do before closing the production writer
is to call the optimize method.
>
>I would really appreciate any ideas to get the index size smaller if it is
at all possible.
>
>Thanks
>Rob
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Karthik
Thanks for responding.  Yes, I optimize right before I close the index
writer.  I added this a little while ago to try and get the size down.

Rob
- Original Message - 
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 12:59 AM
Subject: RE: Index Size


Guys

   Are you using optimize on the index before the close process?

  If not, try using it...  :}



karthik




-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 1:00 PM
To: Lucene Users List
Subject: Re: Index Size


Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates. They were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.

Thanks,
   george


 --- Rob Jose <[EMAIL PROTECTED]> wrote:
> Hello
> I have indexed several thousand (52 to be exact)
> text files and I keep running out of disk space to
> store the indexes.  The size of the documents I have
> indexed is around 2.5 GB.  The size of the Lucene
> indexes is around 287 GB.  Does this seem correct?
> I am not storing the contents of the file, just
> indexing and tokenizing.  I am using Lucene 1.3
> final.  Can you guys let me know what you are
> experiencing?  I don't want to go into production
> with something that I should be configuring better.
>
>
> I am not sure if this helps, but I have a temp index
> and a real index.  I index the file into the temp
> index, and then merge the temp index into the real
> index using the addIndexes method on the
> IndexWriter.  I have also set the production writer
> setUseCompoundFile to true.  I did not set this on
> the temp index.  The last thing that I do before
> closing the production writer is to call the
> optimize method.
>
> I would really appreciate any ideas to get the index
> size smaller if it is at all possible.
>
> Thanks
> Rob






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: Index Size

2004-08-19 Thread Rob Jose
Hey George
Thanks for responding.  I am using Windows and I don't see any hidden files.
I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.) files.
I have two FDT files and two FDX files, and three FNM files.  Add these
files to the deletable and segments files and that is all of the files that I
have.  The CFS files are approximately 11 MB each.  The totals I gave you
before were for all of my indexes together.  This particular index has a
size of 21.6 GB.  The files that it indexed have a size of 89 MB.

OK - I just removed all of the CFS files from the directory and I can still
read my indexes.  So now I have to ask: what are these CFS files?  Why are
they created?  And how can I get rid of them if I don't need them?  I will
also take a look at the Lucene website to see if I can find any information.

Thanks
Rob

- Original Message - 
From: "Honey George" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 12:29 AM
Subject: Re: Index Size


Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates. They were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.

Thanks,
   george


 --- Rob Jose <[EMAIL PROTECTED]> wrote:
> Hello
> I have indexed several thousand (52 to be exact)
> text files and I keep running out of disk space to
> store the indexes.  The size of the documents I have
> indexed is around 2.5 GB.  The size of the Lucene
> indexes is around 287 GB.  Does this seem correct?
> I am not storing the contents of the file, just
> indexing and tokenizing.  I am using Lucene 1.3
> final.  Can you guys let me know what you are
> experiencing?  I don't want to go into production
> with something that I should be configuring better.
>
>
> I am not sure if this helps, but I have a temp index
> and a real index.  I index the file into the temp
> index, and then merge the temp index into the real
> index using the addIndexes method on the
> IndexWriter.  I have also set the production writer
> setUseCompoundFile to true.  I did not set this on
> the temp index.  The last thing that I do before
> closing the production writer is to call the
> optimize method.
>
> I would really appreciate any ideas to get the index
> size smaller if it is at all possible.
>
> Thanks
> Rob






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: Index Size

2004-08-19 Thread Rob Jose
Paul
Thank you for your response.  I have appended to the bottom of this message
the field structure that I am using.  I hope that this helps.  I am using
the StandardAnalyzer.  I do not believe that I am changing any default
values, but I have also appended the code that adds the temp index to the
production index.

Thanks for your help
Rob

Here is the code that describes the field structure.
public static Document Document(String contents, String path, Date modified,
    String runDate, String totalpages, String pagecount, String countycode,
    String reportnum, String reportdescr)
{
    SimpleDateFormat showFormat = new
        SimpleDateFormat(TurbineResources.getString("date.default.format"));
    SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");

    Document doc = new Document();
    doc.add(Field.Keyword("path", path));
    doc.add(Field.Keyword("modified", showFormat.format(modified)));
    doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
    doc.add(Field.UnStored("searchRunDate",
        runDate == null ? "" : runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
    doc.add(Field.Keyword("reportnum", reportnum));
    doc.add(Field.Text("reportdescr", reportdescr));
    doc.add(Field.UnStored("cntycode", countycode));
    doc.add(Field.Keyword("totalpages", totalpages));
    doc.add(Field.Keyword("page", pagecount));
    doc.add(Field.UnStored("contents", contents));
    return doc;
}



Here is the code that adds the temp index to the production index.

File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);

try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar + sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
}

prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });





- Original Message - 
From: "Paul Elschot" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, August 19, 2004 12:16 AM
Subject: Re: Index Size


On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> Hello
> I have indexed several thousand (52 to be exact) text files and I keep
> running out of disk space to store the indexes.  The size of the documents
> I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
> 287 GB.  Does this seem correct?  I am not storing the contents of the

As noted, one would expect the index size to be about 35%
of the original text, i.e. about 2.5GB * 35% = 875MB.
That is two orders of magnitude off from what you have.

Could you provide some more information about the field structure,
i.e. how many fields, which fields are stored, which fields are indexed,
possibly the use of non-standard analyzers, and possibly non-standard
Lucene settings?

You might also try changing to the non-compound format to have a look
at the sizes of the individual index files; see the file formats page on the
Lucene web site.
You can then see the total disk size of, for example, the stored fields.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: Index Size

2004-08-19 Thread Bernhard Messer
Rob,
as Doug and Paul already mentioned, the index size is definitely too big :-(.

What could cause the problem, especially when running on a Windows
platform, is an IndexReader that stays open during the whole index process.
During indexing, the writer creates temporary segment files which are
merged into bigger segments. Once that is done, the old segment files are
deleted. If there is an open IndexReader, the files stay locked and cannot
be deleted, so they remain in the index directory. You will end up with an
index several times bigger than the dataset.

Can you check your code for any open IndexReaders when indexing, or
paste the relevant part to the list so we can have a look at it.
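
For illustration only, a rough sketch of the shape to aim for (assuming the Lucene
1.3-era API; tempFile and createIndex as in the code posted in this thread, and
prodDir standing in for the production index path). The important part is closing
every reader as soon as the merge is done:

// org.apache.lucene.index / analysis.standard imports assumed
IndexReader tempReader = IndexReader.open(tempFile);
IndexWriter prodWriter = new IndexWriter(prodDir, new StandardAnalyzer(), createIndex);
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });
prodWriter.optimize();
prodWriter.close();
tempReader.close();   // once closed, Windows can delete the obsolete segment files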

hope this helps
Bernhard
Rob Jose wrote:
Hello
I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes.  The size of the documents I have indexed is around 2.5 GB.  The size of the Lucene indexes is around 287 GB.  Does this seem correct?  I am not storing the contents of the file, just indexing and tokenizing.  I am using Lucene 1.3 final.  Can you guys let me know what you are experiencing?  I don't want to go into production with something that I should be configuring better.  

I am not sure if this helps, but I have a temp index and a real index.  I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter.  I have also set the production writer setUseCompoundFile to true.  I did not set this on the temp index.  The last thing that I do before closing the production writer is to call the optimize method.  

I would really appreciate any ideas to get the index size smaller if it is at all 
possible.
Thanks
Rob
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Index Size

2004-08-19 Thread Karthik N S
Guys

   Are you using optimize on the index before the close process?

  If not, try using it...  :}
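
For example, a minimal sketch (assuming an open IndexWriter named prodWriter, as in
the code posted earlier in this thread):

prodWriter.optimize();   // merges all segments down before shutting the writer
prodWriter.close();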



karthik




-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 1:00 PM
To: Lucene Users List
Subject: Re: Index Size


Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates. They were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.

Thanks,
   george


 --- Rob Jose <[EMAIL PROTECTED]> wrote:
> Hello
> I have indexed several thousand (52 to be exact)
> text files and I keep running out of disk space to
> store the indexes.  The size of the documents I have
> indexed is around 2.5 GB.  The size of the Lucene
> indexes is around 287 GB.  Does this seem correct?
> I am not storing the contents of the file, just
> indexing and tokenizing.  I am using Lucene 1.3
> final.  Can you guys let me know what you are
> experiencing?  I don't want to go into production
> with something that I should be configuring better.
>
>
> I am not sure if this helps, but I have a temp index
> and a real index.  I index the file into the temp
> index, and then merge the temp index into the real
> index using the addIndexes method on the
> IndexWriter.  I have also set the production writer
> setUseCompoundFile to true.  I did not set this on
> the temp index.  The last thing that I do before
> closing the production writer is to call the
> optimize method.
>
> I would really appreciate any ideas to get the index
> size smaller if it is at all possible.
>
> Thanks
> Rob






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: Index Size

2004-08-19 Thread Honey George
Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates. They were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.
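
In case it is useful, a tiny Java sketch that lists every file in an index directory
(hidden ones included) together with its size; the path is only a placeholder:

import java.io.File;

public class ListIndexFiles {
    public static void main(String[] args) {
        File dir = new File("/data/index");   // placeholder index directory
        File[] files = dir.listFiles();
        long total = 0;
        for (int i = 0; i < files.length; i++) {
            total += files[i].length();
            System.out.println(files[i].getName()
                + (files[i].isHidden() ? " (hidden)" : "")
                + "\t" + files[i].length() + " bytes");
        }
        System.out.println("total\t" + total + " bytes");
    }
}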

Thanks,
   george


 --- Rob Jose <[EMAIL PROTECTED]> wrote: 
> Hello
> I have indexed several thousand (52 to be exact)
> text files and I keep running out of disk space to
> store the indexes.  The size of the documents I have
> indexed is around 2.5 GB.  The size of the Lucene
> indexes is around 287 GB.  Does this seem correct? 
> I am not storing the contents of the file, just
> indexing and tokenizing.  I am using Lucene 1.3
> final.  Can you guys let me know what you are
> experiencing?  I don't want to go into production
> with something that I should be configuring better. 
> 
> 
> I am not sure if this helps, but I have a temp index
> and a real index.  I index the file into the temp
> index, and then merge the temp index into the real
> index using the addIndexes method on the
> IndexWriter.  I have also set the production writer
> setUseCompoundFile to true.  I did not set this on
> the temp index.  The last thing that I do before
> closing the production writer is to call the
> optimize method.  
> 
> I would really appreciate any ideas to get the index
> size smaller if it is at all possible.
> 
> Thanks
> Rob 






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Paul Elschot
On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> Hello
> I have indexed several thousand (52 to be exact) text files and I keep
> running out of disk space to store the indexes.  The size of the documents
> I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
> 287 GB.  Does this seem correct?  I am not storing the contents of the

As noted, one would expect the index size to be about 35%
of the original text, i.e. about 2.5GB * 35% = 875MB.
That is two orders of magnitude off from what you have.

Could you provide some more information about the field structure,
i.e. how many fields, which fields are stored, which fields are indexed,
possibly the use of non-standard analyzers, and possibly non-standard
Lucene settings?

You might also try changing to the non-compound format to have a look
at the sizes of the individual index files; see the file formats page on the
Lucene web site.
You can then see the total disk size of, for example, the stored fields.
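
A rough sketch of that suggestion (the path is a placeholder; setUseCompoundFile is the
same call already used in the code posted in this thread):

IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), true);
writer.setUseCompoundFile(false);   // keep individual .fdt/.fdx/.frq/.prx/... files
// ... add documents, optimize(), close() ...

The stored fields then live in the .fdt/.fdx files, so their combined size shows how
much of the index is stored content rather than postings.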

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-18 Thread Stephane James Vaucher
From: Doug Cutting
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08757.html

> An index typically requires around 35% of the plain text size.

I think it's a little big.

sv

On Wed, 18 Aug 2004, Rob Jose wrote:

> Hello
> I have indexed several thousand (52 to be exact) text files and I keep 
> running out of disk space to store the indexes.  The size of the 
> documents I have indexed is around 2.5 GB.  The size of the Lucene 
> indexes is around 287 GB.  Does this seem correct?  I am not storing the 
> contents of the file, just indexing and tokenizing.  I am using Lucene 
> 1.3 final.  Can you guys let me know what you are experiencing?  I don't 
> want to go into production with something that I should be configuring 
> better.  
> 
> I am not sure if this helps, but I have a temp index and a real index.  I index the 
> file into the temp index, and then merge the temp index into the real index using 
> the addIndexes method on the IndexWriter.  I have also set the production writer 
> setUseCompoundFile to true.  I did not set this on the temp index.  The last thing 
> that I do before closing the production writer is to call the optimize method.  
> 
> I would really appreciate any ideas to get the index size smaller if it is at all 
> possible.
> 
> Thanks
> Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index Size

2004-08-18 Thread Rob Jose
Hello
I have indexed several thousand (52 to be exact) text files and I keep running out of 
disk space to store the indexes.  The size of the documents I have indexed is around 
2.5 GB.  The size of the Lucene indexes is around 287 GB.  Does this seem correct?  I 
am not storing the contents of the file, just indexing and tokenizing.  I am using 
Lucene 1.3 final.  Can you guys let me know what you are experiencing?  I don't want 
to go into production with something that I should be configuring better.  

I am not sure if this helps, but I have a temp index and a real index.  I index the 
file into the temp index, and then merge the temp index into the real index using the 
addIndexes method on the IndexWriter.  I have also set the production writer 
setUseCompoundFile to true.  I did not set this on the temp index.  The last thing 
that I do before closing the production writer is to call the optimize method.  

I would really appreciate any ideas to get the index size smaller if it is at all 
possible.

Thanks
Rob

RE: index size question

2004-01-14 Thread Chong, Herb
who says a business has to run on a modern operating system?

Herb...

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 14, 2004 12:27 PM
To: Lucene Users List
Subject: Re: index size question


It's in the jguru FAQ
http://www.jguru.com/faq/view.jsp?EID=538304

By the way, what platforms don't support files greater than 2GB in this
day and age?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index size question

2004-01-14 Thread Dror Matalon
It's in the jguru FAQ
http://www.jguru.com/faq/view.jsp?EID=538304

By the way, what platforms don't support files greater than 2GB in this
day and age?

Answer
This question is often brought up because of the 2GB file size limit of
some 32-bit operating systems.

This is a slightly modified answer from Doug Cutting:

The easiest thing is to set IndexWriter.maxMergeDocs.
If, for instance, you hit the 2GB limit at 8M documents, set maxMergeDocs
to 7M. That will keep Lucene from trying to merge an index that won't
fit in your filesystem. It will effectively round this down to
the next lower power of IndexWriter.mergeFactor.
So with the default mergeFactor set to 10 and maxMergeDocs set to 7M,
Lucene will generate a series of 1M document indexes, since merging 10
of these would exceed the maximum.

A slightly more complex solution:
You could further minimize the number of segments if, when you've added
7M documents, you optimize the index and start a new index. Then use
MultiSearcher to search the indexes.

An even more complex and optimal solution:
Write a version of FSDirectory that, when a file exceeds 2GB, creates a
subdirectory and represents the file as a series of files.
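
A minimal sketch of the first option (in the 1.3-era API, mergeFactor and maxMergeDocs
are public fields on IndexWriter; the path is a placeholder):

IndexWriter writer = new IndexWriter("/data/index", new StandardAnalyzer(), true);
writer.mergeFactor = 10;         // default: how many segments get merged at once
writer.maxMergeDocs = 7000000;   // never build a merged segment larger than 7M docs
// ... writer.addDocument(doc) for each document ...
writer.close();

Several smaller indexes produced this way can then be searched together with a
MultiSearcher.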



On Wed, Jan 14, 2004 at 10:50:48AM -0500, Chong, Herb wrote:
> this should probably be in the FAQ. what happens when I index tens of gigabytes of
> documents on a platform that doesn't support files larger than 2GB? does Lucene
> automatically stop merging index files intelligently so that its files don't exceed
> 2GB in size, or must I manage the incoming documents so that no index file exceeds
> 2GB?
> 
> Herb
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



index size question

2004-01-14 Thread Chong, Herb
this should probably be in the FAQ. what happens when I index tens of gigabytes of
documents on a platform that doesn't support files larger than 2GB? does Lucene
automatically stop merging index files intelligently so that its files don't exceed
2GB in size, or must I manage the incoming documents so that no index file exceeds
2GB?

Herb

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Biggest index size/document in Lucene

2003-11-04 Thread Doug Cutting
There was a bug (recently fixed) when creating indexes with over a 
couple hundred million documents.  So you should use 1.3 RC2, which has 
a fix for this bug.

The biggest indexes I've personally created have around 30M documents. 
I maintain these as a set of separately updated indexes, then merge them 
together into a single big index for deployment.  I find this easier 
than trying to maintain a single massive index.
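
A bare-bones sketch of that kind of merge (paths are placeholders; 1.3-era API):

// org.apache.lucene.index / store / analysis.standard imports assumed
IndexWriter deployWriter = new IndexWriter("/deploy/index", new StandardAnalyzer(), true);
Directory[] parts = new Directory[] {
    FSDirectory.getDirectory("/parts/index0", false),
    FSDirectory.getDirectory("/parts/index1", false)
};
deployWriter.addIndexes(parts);   // merge the separately maintained indexes
deployWriter.optimize();
deployWriter.close();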

My guess is that your search times won't be too fast, probably on the 
order of a few seconds (more than one, less than ten).  It will be disk 
bound.  You could improve performance by distributing search over 
multiple machines, each searching a smaller index, a subset of the 
entire data.

Doug

Victor Hadianto wrote:
Hi all,

I'm interested to know how big a Lucene index/document collection you have
experience with. We are trying to index on the order of 300 million text
documents. Each document will be quite small, around 10 KB.
Any insight about the scalability of Lucene with this many documents?
Creating the index and searching?
thanks,

/victor

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Biggest index size/document in Lucene

2003-11-04 Thread Victor Hadianto
Hi all,

I'm interested to know how big a Lucene index/document collection you have
experience with. We are trying to index on the order of 300 million text
documents. Each document will be quite small, around 10 KB.

Any insight about the scalability of Lucene with this many documents?
Creating the index and searching?

thanks,

/victor


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MAX Index Size POLL

2003-02-27 Thread David Spencer
Samir Satam wrote:

Thanks for your reply.

Maybe I asked the wrong question.

Let's say, just:

Number of documents indexed (no. of Document objects in the index)
AND
The largest index size one has had so far, regardless of the number of
document objects (to determine the maximum index size people are working with).
We use Lucene (v1.2) extensively internally.
The largest single database looks like 11,552 docs, and the db size (the sum
of the sizes of the files that make up the index) is 842MB.

All indexes are also merged into one "mega" index which contains 249,184 docs
and is 1.2GB - I think this is done because the JSP search pages were
not coded to use MultiSearcher.

Overall there are 24 indexes with 536,415 docs indexed into a total of
2GB of index space - though this count includes the mega index, so you
could halve the numbers, which I guess agrees with the previous paragraph...




thanks
samir
-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 27, 2003 12:47 PM
To: Lucene Users List
Subject: Re: MAX Index Size POLL
Samir,

The size of the index depends on (a) the size of the documents, (b) the
number of fields per document, (c) the fields that are kept in the index.
The time taken to index depends on the same plus the characteristics of the
processor and storage i/o.  With so many variables, I don't think the simple
listing you're requesting will be of much use.
Regards,

Terry

- Original Message -
From: "Samir Satam" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, February 27, 2003 12:22 PM
Subject: MAX Index Size POLL
Hello friends,
If it is not much of a trouble, I would like to ask as many of you as
possible, to post some statistics.
This would preferably include
1. Size of the index.
2. No of documents indexed.
3. Time taken to index new documents.
4. Time taken for a typical query.
thank you,
Samir
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: MAX Index Size POLL

2003-02-27 Thread Samir Satam
Thanks for your reply.

Maybe I asked the wrong question.

Let's say, just:

Number of documents indexed (no. of Document objects in the index)
AND
The largest index size one has had so far, regardless of the number of
document objects (to determine the maximum index size people are working with).

thanks
samir


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 27, 2003 12:47 PM
To: Lucene Users List
Subject: Re: MAX Index Size POLL


Samir,

The size of the index depends on (a) the size of the documents, (b) the
number of fields per document, (c) the fields that are kept in the index.
The time taken to index depends on the same plus the characteristics of the
processor and storage i/o.  With so many variables, I don't think the simple
listing you're requesting will be of much use.

Regards,

Terry

- Original Message -
From: "Samir Satam" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, February 27, 2003 12:22 PM
Subject: MAX Index Size POLL


Hello friends,
If it is not much of a trouble, I would like to ask as many of you as
possible, to post some statistics.
This would preferably include

1. Size of the index.
2. No of documents indexed.
3. Time taken to index new documents.
4. Time taken for a typical query.


thank you,
Samir

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






Re: MAX Index Size POLL

2003-02-27 Thread Terry Steichen
Samir,

The size of the index depends on (a) the size of the documents, (b) the
number of fields per document, (c) the fields that are kept in the index.
The time taken to index depends on the same plus the characteristics of the
processor and storage i/o.  With so many variables, I don't think the simple
listing you're requesting will be of much use.

Regards,

Terry

- Original Message -
From: "Samir Satam" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, February 27, 2003 12:22 PM
Subject: MAX Index Size POLL


Hello friends,
If it is not much of a trouble, I would like to ask as many of you as
possible, to post some statistics.
This would preferably include

1. Size of the index.
2. No of documents indexed.
3. Time taken to index new documents.
4. Time taken for a typical query.


thank you,
Samir

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






MAX Index Size POLL

2003-02-27 Thread Samir Satam
Hello friends,
If it is not much of a trouble, I would like to ask as many of you as possible, to 
post some statistics.
This would preferably include 

1. Size of the index.
2. No of documents indexed.
3. Time taken to index new documents.
4. Time taken for a typical query.


thank you,
Samir

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sanity check - index size

2002-05-21 Thread Erik Hatcher

Duh... that was my issue.  I am storing the content also.  Sorry for the
newbie question.  I'll crawl back under my rock now.


- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, May 20, 2002 8:50 PM
Subject: Re: sanity check - index size


> This seems big depending on what you are storing.
>
> For example, I have a set of data with 457MB and my Lucene index is 115MB.
> However, I don't store much.
>
> If you are storing the complete text (even if you don't index it), then it
> will be about the same size as (probably bigger than) your original data
> set.
>
> --Peter
>
> On 5/20/02 4:16 PM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:
>
> > I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> > These are text files and HTML files.  I only index them into a few
fields
> > (title, content, filename).  My index (specifically _sd.fdt) is 20MB.
The
> > bulk of the HTML files are Javadoc files (Ant's own documentation,
> > actually).
> >
> > Does that seem at all close to being reasonable/normal?  I am calling
> > optimize() before closing the index.
> >
> > Thanks for the sanity check.
> >
> >   Erik
> >
> >
> >
> > --
> > To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: sanity check - index size

2002-05-21 Thread James Rozee

Erik,

Mine is amazingly small too.  I store a very small field (a key into my
MySQL database for the full document) and index the content without
storing it. I have 40,000+ docs and it is only 28MB in size.  I've done
several searches and it seems to be correct.

James

*
The Game Development Search Engine
and DQuest E-zine
http://www.gdse.com/

A Member of the Future Games Network
http://www.fgn.com/


On Mon, 20 May 2002, Erik Hatcher wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> These are text files and HTML files.  I only index them into a few fields
> (title, content, filename).  My index (specifically _sd.fdt) is 20MB.  The
> bulk of the HTML files are Javadoc files (Ant's own documentation,
> actually).
> 
> Does that seem at all close to being reasonable/normal?  I am calling
> optimize() before closing the index.
> 
> Thanks for the sanity check.
> 
> Erik
> 
> 
> 
> --
> To unsubscribe, e-mail:   
> For additional commands, e-mail: 
> 
> 


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: sanity check - index size

2002-05-20 Thread Peter Carlson

This seems big depending on what you are storing.

For example, I have a set of data with 457MB and my Lucene index is 115MB.
However, I don't store much.

If you are storing the complete text (even if you don't index it), then it
will be about the same size as (probably bigger than) your original data
set.
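
For example, a quick sketch using the Field factory methods from the 1.2/1.3 API
("text" is a placeholder String holding the document contents):

Document doc = new Document();

// indexed AND stored: the whole text is kept in the index (.fdt/.fdx files),
// so the index grows by roughly the size of the original content
doc.add(Field.Text("contents", text));

// indexed but NOT stored: terms only, typically a small fraction of the text size
// doc.add(Field.UnStored("contents", text));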

--Peter

On 5/20/02 4:16 PM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> These are text files and HTML files.  I only index them into a few fields
> (title, content, filename).  My index (specifically _sd.fdt) is 20MB.  The
> bulk of the HTML files are Javadoc files (Ant's own documentation,
> actually).
> 
> Does that seem at all close to being reasonable/normal?  I am calling
> optimize() before closing the index.
> 
> Thanks for the sanity check.
> 
>   Erik
> 
> 
> 
> --
> To unsubscribe, e-mail:   
> For additional commands, e-mail: 
> 
> 


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: sanity check - index size

2002-05-20 Thread James Cooper

On Mon, 20 May 2002, Erik Hatcher wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> My index (specifically _sd.fdt) is 20MB.
> 
> Does that seem at all close to being reasonable/normal?  I am calling
> optimize() before closing the index.

hi,

I've wondered the same thing.  The indexes I build with Lucene are
generally around the same size as the corpus.  That was larger than I
thought it would be, but it doesn't really matter since disk is pretty
cheap (and my corpus isn't very big).

-- James


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




sanity check - index size

2002-05-20 Thread Erik Hatcher

I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
These are text files and HTML files.  I only index them into a few fields
(title, content, filename).  My index (specifically _sd.fdt) is 20MB.  The
bulk of the HTML files are Javadoc files (Ant's own documentation,
actually).

Does that seem at all close to being reasonable/normal?  I am calling
optimize() before closing the index.

Thanks for the sanity check.

Erik



--
To unsubscribe, e-mail:   
For additional commands, e-mail: