Hi Digy,

Your suggestion worked like a charm - it fixed the problem and the indexing code is now working great (and giving me the same great results every time)!
Thank you so very much!!

Jennifer

-----Original Message-----
From: Digy [mailto:digyd...@gmail.com]
Sent: Thursday, June 23, 2011 10:54 AM
To: lucene-net-user@lucene.apache.org
Subject: RE: [Lucene.Net] Advice for troubleshooting inconsistent number of terms added to "contents" field?

Although I have been a Lucene.Net user for many years, I have never used or tested the HTMLParser in the demo. Looking at the code, I see a lot of threading-related code, so there might be a synchronization bug.

I would recommend grabbing HTMLStripCharFilter.cs from
https://github.com/synhershko/Lucene.Net.Contrib/blob/master/Lucene.Net.Contrib/Analysis/HTMLStripCharFilter.cs
(ported to C# by synhershko) and using an analyzer something like this:

public class HtmlStripAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(
            new StandardTokenizer(
                new HTMLStripCharFilter(Lucene.Net.Analysis.CharReader.Get(reader))));
    }
}

DIGY

-----Original Message-----
From: Jennifer Wilson [mailto:jennifer.wil...@researchintegrations.com]
Sent: Thursday, June 23, 2011 8:04 PM
To: lucene-net-user@lucene.apache.org
Subject: [Lucene.Net] Advice for troubleshooting inconsistent number of terms added to "contents" field?

Hi all,

I'm writing to ask for advice about troubleshooting what seems like a strange error. When I index my test set of files, the number of terms added to the Lucene index in my "contents" field changes each time I build a fresh index. Investigating further reveals that this discrepancy appears to be the result of only SOME of my files having had the "contents" field populated during indexing. Out of the 80 files, it sometimes adds the "contents" for only 4 of the files, sometimes for 7, sometimes 15, and once none of the files had their "contents" added. However, the other fields like ID, filename, filepath, etc. are correctly added for ALL files EVERY TIME...
it is only the "contents" that is experiencing this problem.

(Note: To determine this, I clear the index by creating a new index over the old one and committing the changes. I visually verify the index is cleared using Luke. I then run the indexing on my 80 files. I re-open Luke and view the Term count for the "contents" field. I then pick a word that I know exists in every file, like "the", and search on that word [contents:the]. I am assuming the number of resulting documents is the number of files that actually had their "contents" field added.)

So, I'm really baffled and can't figure out where the process is going wrong. Can anyone offer any advice on troubleshooting this error?

Below is some information about the specifics of my project that may shed some light...

I am using Visual Studio 2008 C# and created my indexing code in a Windows Forms project. I built the DLLs from Apache-Lucene.Net-2.9.2-src. The files I am indexing are .aspx files, so I am using the Lucene.Net.Demo.Html.HTMLParser to remove the tagging within the file before sending it into my analyzer. I've provided some snippets of code below to show some (possibly?) relevant details...

The code for the CreateIndexWriter() method:
-----------------------------------------------------------------------------------------
public void CreateIndexWriter()
{
    // Create Lucene IndexWriter
    DirectoryInfo dirInfo;
    Boolean createNewIndex = true;

    // Assign directory info to dirInfo variable.
    dirInfo = new DirectoryInfo(PATHINDEX);

    // Determine if index directory exists
    if (Directory.Exists(PATHINDEX))
    {
        // Index directory exists. Assign boolean createNewIndex
        // to false so that IndexWriter will add to existing
        // index.
        createNewIndex = false;
    }

    Analyzer analyzer = new MyAnalyzer();

    // Create Index Writer
    writer = new IndexWriter(FSDirectory.Open(dirInfo), analyzer, createNewIndex, IndexWriter.MaxFieldLength.UNLIMITED);
}
=========================================================================================

Class definition of my custom analyzer:
-----------------------------------------------------------------------------------------
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Create the tokenizer
        TokenStream result = new StandardTokenizer(reader);

        // Add in filters
        result = new StandardFilter(result);  // first normalize the StandardTokenizer
        result = new LowerCaseFilter(result); // makes sure everything is lower case

        // Return the built token stream.
        return result;
    }
}
}
=========================================================================================

The indexFile method that calls the BuildDocument method and then adds the Document to the index:
-----------------------------------------------------------------------------------------
private void indexFile(FileInfo f)
{
    // Build Lucene Document record for file
    Document doc = BuildDocument(f);

    // Add Lucene Document to the Lucene Index
    writer.AddDocument(doc);
}
=========================================================================================

The portion of the code in the BuildDocument method that adds the "contents" (it is taken directly from the Apache-Lucene.Net-2.9.2-src.src.Demo.IndexHtml example):
-----------------------------------------------------------------------------------------
protected Document BuildDocument(FileInfo f)
{
    ...
    System.IO.FileStream fis = new System.IO.FileStream(f.FullName, System.IO.FileMode.Open, System.IO.FileAccess.Read);
    HTMLParser parser = new HTMLParser(fis);

    // Add the main text of the file as a field named "contents". Use a field that is
    // indexed (i.e.
    // searchable), tokenized with the word position information preserved,
    // but the original text should not be stored.
    doc.Add(new Field("contents", parser.GetReader()));
    ...
=========================================================================================

Any advice would be very welcome!

Thank you in advance,
Jennifer