There are other factors too, such as how broad the vocabulary of the
content is and which analyzers you used. Have you tried running your
filters to generate plain text files and comparing the size of the
text with that of the originals?
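A minimal sketch of that comparison, assuming your extraction filters have already written the plain-text versions into a separate directory (the directory names here are placeholders, not anything from Lucene itself):

```python
import os

# Hypothetical paths -- point these at your originals and at the
# plain-text output of your extraction filters.
ORIG_DIR = "originals"
TEXT_DIR = "extracted_text"

def total_size(directory):
    """Sum the sizes (in bytes) of all regular files under a directory."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

if os.path.isdir(ORIG_DIR) and os.path.isdir(TEXT_DIR):
    orig = total_size(ORIG_DIR)
    text = total_size(TEXT_DIR)
    print(f"originals: {orig} bytes, plain text: {text} bytes, "
          f"text/original ratio: {text / orig:.1%}")
```

If the plain text is already a small fraction of the originals, a 1% index is much less surprising, since the index only ever sees the extracted text.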
C
On Jun 24, 2009, at 9:28 PM, pof wrote:
It would seem that .doc files have about 30KB of overhead (not including
pictures, graphs, metadata, etc.) on top of the plain text, and .pdfs
about 3KB.
Otis Gospodnetic wrote:
Hi Brett,
Try creating a simple MS Word document with just a single character in
it. Save it as .doc and check the size. Export to PDF and check the
size. I don't know exactly how big those docs will be, but I bet
they'll be many, many times larger than that one-byte character. Open
up your index with Luke to see what's in it.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: pof <[email protected]>
To: [email protected]
Sent: Wednesday, June 24, 2009 8:47:39 PM
Subject: Index Ratio
Hi, I just completed a batch test index of ~1100 documents of various
file types and I noticed that the original documents take up about
145MB but my index is only 1.7MB?? I remember reading somewhere that
the typical compression rate is about 20-30% or something, but mine is
a little over 1%! I'm not complaining or anything; it just struck me
as odd, especially as I have a lot of archive files and emails with
attachments that I parse as well. Has anyone else experienced
something like this? I'm just curious.
Cheers, Brett.
--
View this message in context:
http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
Sent from the Lucene - General mailing list archive at Nabble.com.
--
View this message in context:
http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
Sent from the Lucene - General mailing list archive at Nabble.com.