Checked out the index with Luke, and yes, all the text has been indexed 100% correctly. I have to say, wow, Luke is a great little tool; I am majorly impressed. Thanks, guys, for all your suggestions and insight.
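For anyone who wants to redo the arithmetic on the three sample documents quoted below, here is a minimal sketch. It assumes 1 KB = 1024 bytes, and the names `samples` and `text_ratio` are made up for illustration; they are not part of Lucene or Luke.

```python
# Sizes as reported in the thread, converted to bytes.
# Assumption: 1 KB = 1024 bytes; the thread does not say which convention was used.
KB = 1024

# Hypothetical table: extension -> (original size in bytes, extracted plain text in bytes)
samples = {
    ".doc": (125 * KB, 761),
    ".pdf": (372 * KB, 12.9 * KB),
    ".eml": (171 * KB, 2 * KB),
}


def text_ratio(original_size, text_size):
    """Return the extracted text size as a percentage of the original size."""
    return 100.0 * text_size / original_size


for ext, (original, text) in samples.items():
    print(f"{ext}: {text_ratio(original, text):.2f}%")

# Whole batch: about 145 MB of source documents versus a 1.7 MB index.
# Both values are in MB, so the units cancel out.
print(f"index: {text_ratio(145, 1.7):.2f}%")
```

The results should land close to the percentages quoted below; small differences are expected because the sizes in the thread are rounded.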

pof wrote:
>
> Three randomly selected documents
>
> .doc = 125KB Plain text = 761 bytes (0.59%)
> .pdf = 372KB Plain text = 12.9KB (3.49%)
> .eml = 171KB Plain text = 2KB (1.15%)
>
> Even though this is a small sample, it shows my index compression of 1-2%
> to be plausible. I'm checking out Luke index toolbox now.
>
> Chris Collins wrote:
>>
>> There are other factors too, such as how broad the vocabulary of the
>> content is and which analyzers you use. Have you tried running your
>> filters to generate just plain text files and comparing the size of
>> the text to the original?
>>
>> C
>>
>> On Jun 24, 2009, at 9:28 PM, pof wrote:
>>
>>> It would seem that .doc files have about 30KB overhead (not including
>>> pictures, graphs, meta data etc) on top of the plain text and about
>>> 3KB for .pdfs.
>>>
>>> Otis Gospodnetic wrote:
>>>>
>>>> Hi Brett,
>>>>
>>>> Try creating a simple MS Word document with just a single character
>>>> in it. Save it as .doc and check the size. Export to PDF and check
>>>> the size. I don't know exactly how big those docs will be, but I bet
>>>> they'll be many, many times larger than that one byte character.
>>>> Open up your index with Luke to see what's in it.
>>>>
>>>> Otis
>>>> --
>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>
>>>> ----- Original Message ----
>>>>> From: pof <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>>>>> Subject: Index Ratio
>>>>>
>>>>> Hi, I just completed a batch test index of ~1100 documents of
>>>>> various file types and I noticed that the original documents take
>>>>> up about 145MB but my index is only 1.7MB?? I remember reading
>>>>> somewhere that the typical compression rate is about 20-30% or
>>>>> something, but mine is a little over 1%!
>>>>> I'm not complaining or anything. It just struck me as odd,
>>>>> especially as I have a lot of archive files and emails with
>>>>> attachments that I parse as well. Has anyone else experienced
>>>>> something like this? I'm just curious.
>>>>>
>>>>> Cheers. Brett.
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>>>>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
>>> Sent from the Lucene - General mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24197200.html
Sent from the Lucene - General mailing list archive at Nabble.com.
