Most of these files are of type .doc, .pdf and .msg. There are also some .eml, .txt, .htm, .docx and so on, to a lesser extent. I did consider the fact that the plain text makes up only a small percentage of each of these proprietary file types, but the ratio still seemed small.
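
For anyone who wants to check the same figure on their own data, here is a rough sketch of the arithmetic in Python. It only compares byte counts (about 145MB of source documents against a 1.7MB index, the numbers from the original post). The helper names `dir_size` and `index_ratio` are made up for illustration and are not part of Lucene.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def index_ratio(source_bytes, index_bytes):
    """Return the index size as a percentage of the source size."""
    if source_bytes <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_bytes / source_bytes


# Figures quoted in the thread: ~145MB of documents, 1.7MB index.
MB = 1024 * 1024
ratio = index_ratio(145 * MB, 1.7 * MB)
print(f"index is {ratio:.2f}% of the source size")  # index is 1.17% of the source size
```

On real directories, something like `index_ratio(dir_size("docs"), dir_size("index"))` would give the same percentage, where `docs` and `index` are placeholder paths. Note that this says nothing about how much extractable text the documents contain, which is the quantity the 20-30% rule of thumb seems to refer to.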

Chris Collins wrote:
>
> You mention documents of various file types. It really depends on
> what those types are. For example, the amount of text found in a
> PowerPoint file is slim pickins. Ratios with office-type apps tend to
> be pretty fluffy. I have seen considerably better than 20-30% when
> extracting text from such formats, and some down to the ratio you're
> talking of.
>
> C
>
> On Jun 24, 2009, at 5:47 PM, pof wrote:
>
>> Hi, I just completed a batch test index of ~1100 documents of
>> various file types and I noticed that the original documents take up
>> about 145MB but my index is only 1.7MB?? I remember reading somewhere
>> that the typical compression rate is about 20-30% or something, but
>> mine is a little over 1%! I'm not complaining or anything. It just
>> struck me as odd, especially as I have a lot of archive files and
>> emails with attachments that I parse as well. Has anyone else
>> experienced something like this? I'm just curious.
>>
>> Cheers. Brett.
>> --
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196644.html
Sent from the Lucene - General mailing list archive at Nabble.com.
