Re: Index Ratio

pof Wed, 24 Jun 2009 21:07:45 -0700

Most of these files are of type .doc, .pdf and .msg. There are also some .eml,
.txt, .htm, .docx and so on, to a lesser extent. I did consider the
fact that the plain text makes up only a small percentage of each of these
proprietary file types, but still the ratio did seem small.
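
For reference, the arithmetic behind the "a little over 1%" figure can be
checked with a short sketch like the one below. It is plain Python, not
anything Lucene-specific, and the helper names (dir_size_bytes, index_ratio)
are made up for illustration. Pointing dir_size_bytes at a directory of
extracted text instead of the original files would probably be a fairer
basis for comparison with the 20-30% figure, though that is an assumption
on my part.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def index_ratio(index_size, source_size):
    """Index size as a percentage of source size (both in the same unit)."""
    if source_size <= 0:
        raise ValueError("source_size must be positive")
    return 100.0 * index_size / source_size


# Figures from this thread: a 1.7MB index built from 145MB of originals.
print(round(index_ratio(1.7, 145), 2))  # prints 1.17
```

The same function works on byte counts, for example
index_ratio(dir_size_bytes(index_dir), dir_size_bytes(docs_dir)), where
index_dir and docs_dir are whatever paths apply locally.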



Chris Collins wrote:
> 
> You mention documents of various file types.  It really depends on
> what those types are.  For example the amount of text found in a
> PowerPoint file is slim pickings.  Ratios with office type apps tend to
> be pretty fluffy.  I have seen considerably better than 20-30% when
> extracting text from such formats, some down to the ratio you're talking
> of.
> 
> C
> On Jun 24, 2009, at 5:47 PM, pof wrote:
> 
>>
>> Hi, I just completed a batch test index of ~1100 documents of various file
>> types and I noticed that the original documents take up about 145MB but my
>> index is only 1.7MB? I remember reading somewhere that the typical
>> compression rate is about 20-30% or something, but mine is a little over 1%!
>> I'm not complaining or anything; it just struck me as odd, especially as I
>> have a lot of archive files and emails with attachments that I parse as
>> well. Has anyone else experienced something like this? I'm just curious.
>>
>> Cheers. Brett.
>> -- 
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
> 
> 

