Three randomly selected documents:

.doc = 125KB, plain text = 761 bytes (0.59%)
.pdf = 372KB, plain text = 12.9KB (3.49%)
.eml = 171KB, plain text = 2KB (1.15%)

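For anyone who wants to repeat the arithmetic, here is a minimal sketch of how those percentages come about: extracted plain-text size divided by original file size. The function name `text_ratio_percent` and the 1024-byte KB are my own assumptions, and because the sizes quoted above are rounded, the recomputed values may differ slightly from the posted ones.

```python
KB = 1024  # assumption: "KB" in the figures above means 1024 bytes


def text_ratio_percent(text_bytes: float, original_bytes: float) -> float:
    """Return the extracted plain-text size as a percentage of the original size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * text_bytes / original_bytes


# extension -> (original size, extracted text size), both in bytes,
# taken from the rounded figures quoted in the sample
samples = {
    ".doc": (125 * KB, 761),
    ".pdf": (372 * KB, 12.9 * KB),
    ".eml": (171 * KB, 2 * KB),
}

for ext, (original, text) in samples.items():
    print(f"{ext}: {text_ratio_percent(text, original):.2f}%")

# Same calculation for the whole batch from the original question:
# a 1.7MB index built from about 145MB of documents (units cancel out).
print(f"index/corpus: {text_ratio_percent(1.7, 145):.2f}%")
```

The same function works on real byte counts (for example from `os.path.getsize`) if you dump the filter output to plain text files and compare them with the originals.
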
Even though this is a small sample, it shows my index compression of 1-2% to be plausible. I'm checking out the Luke index toolbox now.

Chris Collins wrote:
>
> There are other factors too, such as how broad the vocabulary of
> the content is and which analyzers you use. Have you tried running your
> filters to generate just plain text files and comparing the size
> of the text with the original?
>
> C
>
>
> On Jun 24, 2009, at 9:28 PM, pof wrote:
>
>>
>> It would seem that .doc files have about 30KB overhead (not including
>> pictures, graphs, meta data etc) on top of the plain text and about
>> 3KB for .pdfs.
>>
>> Otis Gospodnetic wrote:
>>>
>>> Hi Brett,
>>>
>>> Try creating a simple MS Word document with just a single character
>>> in it. Save it as .doc and check the size. Export to PDF and check
>>> the size. I don't know exactly how big those docs will be, but I bet
>>> they'll be many, many times larger than that one byte character.
>>> Open up your index with Luke to see what's in it.
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>> ----- Original Message ----
>>>> From: pof <[email protected]>
>>>> To: [email protected]
>>>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>>>> Subject: Index Ratio
>>>>
>>>> Hi, I just completed a batch test index of ~1100 documents of
>>>> various file types and I noticed that the original documents take
>>>> up about 145MB but my index is only 1.7MB?? I remember reading
>>>> somewhere that the typical compression rate is about 20-30% or
>>>> something, but mine is a little over 1%! I'm not complaining or
>>>> anything. It just struck me as odd, especially as I have a lot of
>>>> archive files and emails with attachments that I parse as well.
>>>> Has anyone else experienced something like this? I'm just curious.
>>>>
>>>> Cheers. Brett.
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>>>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24197002.html
Sent from the Lucene - General mailing list archive at Nabble.com.
