Hi All
I have a series of documents to index and search, and could do with some pointers on how best to achieve the desired results.

The process flow is relatively straightforward. A queue of documents already exists. The application will pick the next document in a folder queue, initially index that single document in a RAMDirectory(), and display the document to the user for adjustment. Once the user selects 'Save', the amended document will be committed to a Lucene FSDirectory index. (I'm glossing over a few details here, and I'm aware of what needs to be done with IndexWriter and the various indexes, locks etc.)

The document has various parts which will become fields, as follows:

Document ID
Title
Introduction
Paragraph1 .... ParagraphN
Conclusion
Keyword(s), again keyword1 ... keyword

There may be anything from 1 to N paragraphs; 40 paragraphs is generally towards the maximum. Each paragraph has a specific purpose and will have its own field for search purposes, i.e. it may be required later to search paragraph 3 in all documents for a given term. So, for example, legal precedents and cases which the document may refer to will always be in paragraph 3 and only in paragraph 3.

Once the initial indexing has been done to RAMDirectory(), I would like to show the user how many instances of the keyword terms are contained in the document in total (Title, Introduction, Paragraph(s), Conclusion), for example: C#(46), VB.Net(14), ASP(22), JQuery(11). Also, it would be really useful, if feasible, to show other terms from the total document which the user could add to the Keywords or ignore, e.g. Microsoft(88), Google(109) etc. (I used developer terms rather than the actual application use case, as hopefully everyone would be familiar with the examples.)
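To make the counts I'm after concrete, here is a rough plain-Java sketch of the expected output. It does not use Lucene at all. The class and method names (KeywordCounts, termCounts, keywordCounts, otherTerms), the sample field text and the tokenising regex are all made up for illustration; I assume the real version would take its terms from whatever analyzer the index uses rather than from a regex.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain Java, no Lucene classes involved.
class KeywordCounts {

    // Assumed tokenisation: letters/digits, optionally joined by inner dots
    // ("VB.Net") and followed by '#' or '+' ("C#"), so sentence-ending
    // full stops are not swallowed into the token.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:\\.[\\p{L}\\p{N}]+)*[#+]*");

    // Counts every token across all field values, ignoring case.
    static Map<String, Integer> termCounts(Map<String, String> fields) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String text : fields.values()) {
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total occurrences of each keyword over the whole document.
    static Map<String, Integer> keywordCounts(Map<String, String> fields, List<String> keywords) {
        Map<String, Integer> all = termCounts(fields);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String keyword : keywords) {
            result.put(keyword, all.getOrDefault(keyword.toLowerCase(Locale.ROOT), 0));
        }
        return result;
    }

    // Every remaining term with its count: candidates the user could
    // promote to keywords or ignore.
    static Map<String, Integer> otherTerms(Map<String, String> fields, List<String> keywords) {
        Map<String, Integer> others = termCounts(fields);
        for (String keyword : keywords) {
            others.remove(keyword.toLowerCase(Locale.ROOT));
        }
        return others;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Title", "C# and VB.Net compared");
        fields.put("Paragraph1", "Microsoft ships C# and VB.Net. Microsoft also ships ASP.");
        fields.put("Conclusion", "C# wins.");
        List<String> keywords = List.of("C#", "VB.Net", "ASP");

        System.out.println(keywordCounts(fields, keywords)); // prints {C#=3, VB.Net=2, ASP=1}
        System.out.println(otherTerms(fields, keywords));
    }
}
```

The second line of output would need stop-word filtering and sorting by count before it is useful to show to a user; I have left that out to keep the sketch short.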
Bonus (great if anyone would like to answer, but I'm reasonably OK with this): When searching the entire document for keywords, I will also be using synonyms. So in the example above, if the keyword is "Java" and the document (title, introduction, para1...n, conclusion) mentions "Java" twice, "JavaBeans" once and "J2EE" six times, then the total count will show as Java(9). (I have already developed a technical synonym dictionary rather than using WordNet or alternatives, so I'm covered for creating the synonym terms.)

Many thanks in advance, and hopefully the example above is reasonably self-explanatory.

Kieran Logan
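
P.S. In case the Java(9) arithmetic is unclear, this is roughly the folding I mean, again as plain Java with no Lucene. SynonymTotals, fold and the synonym map here are hypothetical stand-ins for my own dictionary, and the per-term counts are assumed to have been computed already.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: folds per-term counts into per-keyword totals.
class SynonymTotals {

    // synonymToKeyword maps a lower-cased synonym to its keyword;
    // terms with no entry are counted under their own name.
    static Map<String, Integer> fold(Map<String, Integer> termCounts,
                                     Map<String, String> synonymToKeyword) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : termCounts.entrySet()) {
            String term = entry.getKey().toLowerCase(Locale.ROOT);
            String keyword = synonymToKeyword.getOrDefault(term, term);
            totals.merge(keyword, entry.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("java", 2);
        counts.put("javabeans", 1);
        counts.put("j2ee", 6);

        Map<String, String> synonyms = Map.of("javabeans", "java", "j2ee", "java");

        System.out.println(fold(counts, synonyms)); // prints {java=9}
    }
}
```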