You mention documents of various file types. It really depends on
what those types are. For example, the amount of text found in a
PowerPoint file is slim pickings. Size ratios for office-type formats
tend to be pretty fluffy, since much of each file is formatting rather
than indexable text. I have seen considerably better than 20-30% when
extracting text from such formats, some down to the ratio you're
talking about.
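
For what it's worth, the figures in the message quoted below (about
145MB of documents, 1.7MB of index) work out to roughly 1.2%. Here is a
minimal Python sketch for checking the ratio on your own data. The
helper names dir_size_bytes and index_ratio are my own for
illustration and are not part of Lucene; the sketch only compares
sizes on disk and says nothing about what was actually extracted.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def index_ratio(original_size, index_size):
    """Index size as a fraction of the original document size.

    Both arguments must be in the same unit (bytes, MB, ...).
    """
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return index_size / original_size


# Figures taken from the quoted message: ~145MB of documents, 1.7MB index.
print(f"{index_ratio(145.0, 1.7):.2%}")  # prints 1.17%
```

To run it against real directories, pass dir_size_bytes(docs_dir) and
dir_size_bytes(index_dir) to index_ratio, where docs_dir and index_dir
are whatever paths you use for the source documents and the index.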
C
On Jun 24, 2009, at 5:47 PM, pof wrote:
Hi, I just completed a batch test index of ~1100 documents of various
file types and I noticed that the original documents take up about
145MB but my index is only 1.7MB?? I remember reading somewhere that
the typical compression rate is about 20-30% or something, but mine is
a little over 1%!
I'm not complaining or anything. It just struck me as odd, especially
as I have a lot of archive files and emails with attachments that I
parse as well. Has anyone else experienced something like this? I'm
just curious.
Cheers. Brett.
--
View this message in context:
http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
Sent from the Lucene - General mailing list archive at Nabble.com.