It would seem that .doc files carry about 30KB of overhead (not including pictures, graphs, metadata, etc.) on top of the plain text, and .pdf files about 3KB.
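
For what it's worth, here is a rough back-of-the-envelope check of those numbers in Python. It is only a sketch: the document count and sizes are the figures from the original message, the 30KB/3KB values are my rough measurements above, and the function names are made up for illustration. The "all .doc" and "all .pdf" cases are hypothetical bounds, since the real batch is a mix of file types.

```python
# Figures taken from the original message in this thread.
NUM_DOCS = 1100      # "~1100 documents"
SOURCE_MB = 145.0    # size of the original documents
INDEX_MB = 1.7       # size of the resulting index

# Rough per-file format overhead measured above (assumption: 1 KB = 1024 bytes).
DOC_OVERHEAD_KB = 30
PDF_OVERHEAD_KB = 3


def index_ratio(index_mb, source_mb):
    """Index size as a fraction of the source size."""
    return index_mb / source_mb


def format_overhead_mb(num_docs, overhead_kb):
    """Total format overhead in MB if every file carried the same overhead."""
    return num_docs * overhead_kb / 1024


if __name__ == "__main__":
    ratio = index_ratio(INDEX_MB, SOURCE_MB)
    avg_kb = SOURCE_MB * 1024 / NUM_DOCS
    print(f"index/source ratio: {ratio:.2%}")
    print(f"average source file size: {avg_kb:.0f} KB")
    # Hypothetical bounds: the real batch mixes many file types.
    print(f"overhead if all .doc: {format_overhead_mb(NUM_DOCS, DOC_OVERHEAD_KB):.1f} MB")
    print(f"overhead if all .pdf: {format_overhead_mb(NUM_DOCS, PDF_OVERHEAD_KB):.1f} MB")
```

So container overhead alone could account for a noticeable share of the 145MB, before counting images, attachments and archive contents that never reach the index as text.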

Otis Gospodnetic wrote:
>
> Hi Brett,
>
> Try creating a simple MS Word document with just a single character in it.
> Save it as .doc and check the size. Export to PDF and check the size. I
> don't know exactly how big those docs will be, but I bet they'll be many,
> many times larger than that one byte character. Open up your index with
> Luke to see what's in it.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: pof <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>> Subject: Index Ratio
>>
>> Hi, I just completed a batch test index of ~1100 documents of various file
>> types and I noticed that the original documents take up about 145MB but my
>> index is only 1.7MB?? I remember reading somewhere that the typical
>> compression rate is about 20-30% or something, but mine is a little over 1%!
>> I'm not complaining or anything It just struck me a odd especially as I have
>> a lot of archive files and emails with attachments that I parse as well. Has
>> anyone else experienced something like this, I'm just curious.
>>
>> Cheers. Brett.
>> --
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>

--
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
Sent from the Lucene - General mailing list archive at Nabble.com.
