You mention documents of various file types. It really depends on what those types are. For example, the amount of text found in a PowerPoint file is slim pickings. Ratios with office-type apps tend to be pretty fluffy. I have seen considerably better than 20-30% when extracting text from such formats, some down to the ratio you're talking about.
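
If you want to see where the ratio comes from, it is worth measuring it per file type rather than for the whole batch. Here is a rough Python sketch of the arithmetic, using the 145MB and 1.7MB figures from your message. The helper names (dir_size, index_ratio) are just ones I made up for illustration, they are not part of Lucene, and the directory walk assumes the source documents and the index each sit in their own folder.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def index_ratio(source_bytes, index_bytes):
    """Return the index size as a percentage of the source size."""
    if source_bytes <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_bytes / source_bytes


# Figures taken from the original message: 145MB of documents, 1.7MB index.
MB = 1024 * 1024
ratio = index_ratio(145 * MB, 1.7 * MB)
print(f"{ratio:.2f}%")  # prints "1.17%"
```

Pointing dir_size at a folder holding only one file type, and at the index built from just those files, should show which formats are pulling the overall number down.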

C
On Jun 24, 2009, at 5:47 PM, pof wrote:


Hi, I just completed a batch test index of ~1100 documents of various file types, and I noticed that the original documents take up about 145MB but my index is only 1.7MB. I remember reading somewhere that the typical compression rate is about 20-30% or something, but mine is a little over 1%! I'm not complaining or anything. It just struck me as odd, especially as I have a lot of archive files and emails with attachments that I parse as well. Has anyone else experienced something like this? I'm just curious.

Cheers. Brett.
--
View this message in context: 
http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
Sent from the Lucene - General mailing list archive at Nabble.com.