I am trying to think through the approach I should take to build my blog using Lucene (see the thread above, "Searching Textile Documents").

As you can see from that thread, I am thinking of making the body of my "Document" a coded form of text which ultimately gets translated into HTML. So I am currently thinking through what a parser would have to do to translate from my form of text (probably not Textile as mentioned in that thread, but something I can customise to my own needs). I am planning that the lexical tokenizer will produce a series of tokens containing (amongst others) tokens of <WORDS> or <URL> type (URLs will NOT be broken into WORDS). I think I can then make the parser, depending on which non-terminal you invoke, parse the article bodies so that it either produces the HTML, or a list of embedded URLs, or a list of embedded WORDS. (I am currently considering using ANTLR rather than JavaCC, but that is not fixed in stone.)

I would like to use Lucene in two ways (actually several more, but two related to this question):

i)  Search for words (suitably stopped and filtered) based on the <WORDS> tokens passed from the lexical tokenizer.
ii) List the <URL>s referred to in a given document.

I have been trying to understand the demo which parses and loads HTML documents. So, some questions:

a) The HTML document demo doesn't seem to use the Analyzer to determine what gets indexed, but rather controls the process by what happens inside the Document.add(Field...) calls. Is this the better way of doing things, rather than somehow interfacing the Analyzer into my grammar parser? If I do need to interface into the parser, how?

b) I assume I will want to create two separate Document Fields, the first holding the original content (a String named content) and being of type Text("content", String content), and the second for the URLs being of type UnStored("url", String urls), where the URLs are those from the article body. The HTML document demo seems to do something equivalent via a complex multithreading mechanism. I think one way I could do it would be simply to parse twice - is there any reason why the demo took this complex route rather than the simple one of parsing twice?

c) An alternative way of doing the whole thing might be to make the Document.add calls inside the lexical tokenizer. Since I can add a field with the same name multiple times, and the javadoc says the new content is just appended, are there any downsides to this approach? Can I still store the whole document this way - by adding all lexical tokens under one field name (presumably of type Keyword, field name "content") and using field name "urls" for a field of type (? I can't see how to set a field to indexed, not stored, not tokenized).

--
Alan Chandler
http://www.chandlerfamily.org.uk
Open Source. It's the difference between trust and antitrust.
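
P.S. To make (b) more concrete, here is a rough, untested sketch of the kind of Document construction I have in mind, assuming the Field.Text and Field.UnStored convenience methods and the IndexWriter(path, analyzer, create) constructor as I read them in the javadoc. The class name, method name and index path are just placeholders for illustration.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BlogIndexer {
    // body is the coded article text; urls is the space-separated list of
    // URLs my parser pulled out of that body on a second pass
    public static void index(String body, String urls) throws IOException {
        Document doc = new Document();
        // stored, indexed and tokenized - searchable and retrievable for display
        doc.add(Field.Text("content", body));
        // indexed and tokenized, but not stored
        doc.add(Field.UnStored("urls", urls));

        IndexWriter writer = new IndexWriter("/tmp/blog-index", new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    }
}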
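
P.P.S. And for (c), roughly what I am picturing if the tokenizer drives the add calls itself - again untested, and assuming the same field name can be added repeatedly as the javadoc suggests, and that the public Field(name, value, store, index, token) constructor is the right way to get an indexed, not-stored, not-tokenized field (if I have misread that constructor, that is really part of my question). The callback names below are just illustrative.

// called from the lexical tokenizer as each token is recognised
void onWordToken(Document doc, String word) {
    // same field name added many times; for searching, Lucene treats the
    // values as if they had been appended into one "content" field
    doc.add(Field.UnStored("content", word));
}

void onUrlToken(Document doc, String url) {
    // store=false, index=true, token=false: each URL indexed as a single
    // untokenized term, nothing stored
    doc.add(new Field("urls", url, false, true, false));
}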
As you can see from that thread, I am thinking of making the body of my "Document" a coded form of text which ultimately gets translated into html. So I am currently thinking through what a parser might have to do to translate from my form of text (probably not textile as mentioned in that thread - but something which I can customise to my own needs). I am planing that the lexical tokenizer will produce a series of tokens that contain (amongst others) tokens of the form <WORDS> or<URL>types (URLs will NOT be broken into WORDS). I think I can then make the parser, depending on which non terminal element you call parse the article bodies in such a way that the parser will will either produce the html or produce a list of embedded URLs, or a list of embedded WORDS. (I am currently considering using Antlr rather than JavaCC - but that is not fixed in stone) I would like to use lucene in two (actually several more, but two related to this question) ways. i Search for words (suitably stopped and filtered) based on the <WORDS> tokens passed from the lexical tokenizer ii List the <URL,s> refered to in a given document. I have been trying to understand the demo which parses and loads html documents So some questions a) The HTML document demo doesn't seem to use the Analyser to determine what gets indexed, but to rather control the process by determining what happens inside the Document.add Field calls. Is this the better way of doing things, rather than somehow interfacing the Analyzer into my grammer parser? If I need to interface into the parser, how? b) I assume I will want to create two separate Document Fields with the original content (a String named content) being the first of type Text("content", String content), and the second for the urls being of type Unstored("url", String urls), where the urls are those from the article body. The HTML document demo seems to do something equivalent via a complex mechanism of multithreading. I think that one way I could do it would be simply re-parse twice - is there any reason why the demo took this complex route rather than the simple one of parsing twice?. c) An alternative way of doing the whole thing - might be to do Document.add calls inside the lexical tokenizer. Since I can add the field with the same name multiple times, the javadoc says the new content is just appended. Are there any downsides to this approach? Can I still store the whole document this way - by adding all lexical tokens to one of the field names (presumably of type keyword - field name "content") and using field name "urls" of type (? can't see how to set the field type to indexed, not stored, not tokenized) -- Alan Chandler http://www.chandlerfamily.org.uk Open Source. It's the difference between trust and antitrust. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]