It would seem that .doc files carry about 30KB of overhead (not including
pictures, graphs, metadata, etc.) on top of the plain text, and .pdf files
about 3KB.
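
For a rough sense of scale, here is a back-of-the-envelope sketch in Python
using the figures from this thread (145MB of originals, a 1.7MB index, ~1100
documents). The function names are made up for illustration, and treating
all 1100 documents as a single file type is purely a hypothetical
assumption, since the real batch is a mix of types.

```python
def index_ratio(index_mb, source_mb):
    """Index size as a fraction of the original document size."""
    return index_mb / source_mb


def container_overhead_mb(n_docs, overhead_kb_per_doc):
    """Total format overhead in MB, assuming a fixed per-file overhead."""
    return n_docs * overhead_kb_per_doc / 1024


# Figures reported in the thread.
ratio = index_ratio(1.7, 145)
print(f"index ratio: {ratio:.2%}")

# Hypothetical: every one of the ~1100 files is a .doc (about 30KB each)
# or every one is a .pdf (about 3KB each). Real batches are mixed.
doc_overhead = container_overhead_mb(1100, 30)
pdf_overhead = container_overhead_mb(1100, 3)
print(f"overhead if all .doc: {doc_overhead:.1f} MB")
print(f"overhead if all .pdf: {pdf_overhead:.1f} MB")
```

Even under the all-.doc assumption, the per-file overhead alone would only
account for part of the 145MB, so pictures, attachments, and other
non-text content presumably make up much of the rest.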

Otis Gospodnetic wrote:
> 
> 
> Hi Brett,
> 
> Try creating a simple MS Word document with just a single character in it. 
> Save it as .doc and check the size.  Export to PDF and check the size.  I
> don't know exactly how big those docs will be, but I bet they'll be many,
> many times larger than that one-byte character.  Open up your index with
> Luke to see what's in it.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: pof <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>> Subject: Index Ratio
>> 
>> 
>> Hi, I just completed a batch test index of ~1100 documents of various file
>> types, and I noticed that the original documents take up about 145MB but
>> my index is only 1.7MB?? I remember reading somewhere that the typical
>> compression rate is about 20-30% or something, but mine is a little over
>> 1%! I'm not complaining or anything; it just struck me as odd, especially
>> as I have a lot of archive files and emails with attachments that I parse
>> as well. Has anyone else experienced something like this? I'm just
>> curious.
>> 
>> Cheers. Brett.
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
> 
> 
> 

