Dear fellow Java/Lucene developers: I have a question on creating an index from an XML document for the purpose of searching using the Lucene API in Java.
I am searching Shakespeare's "Hamlet", which I have as an XML document. I want to include commentary on each scene and would like to make this section searchable for the user as well. At present, however, I search through a set of <SPEECH> tags, each of which represents a particular character's dialogue.

With my new arrangement, each scene, which is composed of several characters' respective dialogues, will be enclosed in a pair of <SCENE></SCENE> tags and will have a set of <SCENE-COMMENTARY></SCENE-COMMENTARY> tags at the top, which will provide the commentary for the scene that follows.

How would I modify my indexing code (which follows the XML document) to create a searchable index that allows the user to search the <SCENE-COMMENTARY> section just as easily as the text contained in the <SPEECH> tags? Once I have accomplished this, I would like to search the commentary text and display the results to the user in the same way as I do for the <SPEECH> tags. I have also listed the code for searching through the current index.

Thanks in advance to everyone who replies.

Sincerely,
Fayyaz

Here is the XML snippet for the play:

<PLAY> <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE> <SCENE> <SCENE-COMMENTARY>Here is where I will include commentary on the scene that follows, which I would also like to make searchable to the user.</SCENE-COMMENTARY> <SPEECH> <REFERENCE>ACT 1, SCENE 1</REFERENCE> <SPEAKER>LORD POLONIUS</SPEAKER> <LINES>Yet here, Laertes! aboard, aboard, for shame! The wind sits in the shoulder of your sail, And you are stay'd for. There; my blessing with thee! And these few precepts in thy memory See thou character. Give thy thoughts no tongue, Nor any unproportioned thought his act. Be thou familiar, but by no means vulgar. Those friends thou hast, and their adoption tried, Grapple them to thy soul with hoops of steel; But do not dull thy palm with entertainment Of each new-hatch'd, unfledged comrade.
Beware Of entrance to a quarrel, but being in, Bear't that the opposed may beware of thee. Give every man thy ear, but few thy voice; Take each man's censure, but reserve thy judgment. Costly thy habit as thy purse can buy, But not express'd in fancy; rich, not gaudy; For the apparel oft proclaims the man, And they in France of the best rank and station Are of a most select and generous chief in that. Neither a borrower nor a lender be; For loan oft loses both itself and friend, And borrowing dulls the edge of husbandry. This above all: to thine ownself be true, And it must follow, as the night the day, Thou canst not then be false to any man. Farewell: my blessing season this in thee!</LINES> </SPEECH> <SPEECH> <SPEAKER>HAMLET</SPEAKER> <REFERENCE>ACT 1, SCENE 2</REFERENCE> <LINES>To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; No more; and by a sleep to say we end The heart-ache and the thousand natural shocks That flesh is heir to, 'tis a consummation Devoutly to be wish'd. To die, to sleep; To sleep: perchance to dream: ay, there's the rub; For in that sleep of death what dreams may come When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despised love, the law's delay, The insolence of office and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would fardels bear, To grunt and sweat under a weary life, But that the dread of something after death, The undiscover'd country from whose bourn No traveller returns, puzzles the will And makes us rather bear those ills we have Than fly to others that we know not of? 
Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought, And enterprises of great pith and moment With this regard their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! Nymph, in thy orisons Be all my sins remember'd.</LINES> </SPEECH> <SPEECH> <REFERENCE>ACT 1, SCENE 3</REFERENCE> <SPEAKER>HAMLET</SPEAKER> <LINES>To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; No more; and by a sleep to say we end The heart-ache and the thousand natural shocks That flesh is heir to, 'tis a consummation Devoutly to be wish'd. To die, to sleep; To sleep: perchance to dream: ay, there's the rub; For in that sleep of death what dreams may come When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despised love, the law's delay, The insolence of office and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would fardels bear, To grunt and sweat under a weary life, But that the dread of something after death, The undiscover'd country from whose bourn No traveller returns, puzzles the will And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought, And enterprises of great pith and moment With this regard their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! 
Nymph, in thy orisons Be all my sins remember'd.</LINES> </SPEECH> <SPEECH> <REFERENCE>ACT 1, SCENE 4</REFERENCE> <SPEAKER>HAMLET</SPEAKER> <LINES>To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep; No more; and by a sleep to say we end The heart-ache and the thousand natural shocks That flesh is heir to, 'tis a consummation Devoutly to be wish'd. To die, to sleep; To sleep: perchance to dream: ay, there's the rub; For in that sleep of death what dreams may come When we have shuffled off this mortal coil, Must give us pause: there's the respect That makes calamity of so long life; For who would bear the whips and scorns of time, The oppressor's wrong, the proud man's contumely, The pangs of despised love, the law's delay, The insolence of office and the spurns That patient merit of the unworthy takes, When he himself might his quietus make With a bare bodkin? who would fardels bear, To grunt and sweat under a weary life, But that the dread of something after death, The undiscover'd country from whose bourn No traveller returns, puzzles the will And makes us rather bear those ills we have Than fly to others that we know not of? Thus conscience does make cowards of us all; And thus the native hue of resolution Is sicklied o'er with the pale cast of thought, And enterprises of great pith and moment With this regard their currents turn awry, And lose the name of action.--Soft you now! The fair Ophelia! 
Nymph, in thy orisons Be all my sins remember'd.</LINES> </SPEECH> </SCENE> </PLAY>

Here is my indexing code:

package hamlet;

import java.io.InputStream;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

public class HamletHandler extends DefaultHandler implements DocumentHandler {

    //the directory that stores xml files
    private final String dataDir = "c:\\dataD";
    //the directory that is used to store lucene index
    private final String indexDir = "c:\\indexD";
    private StringBuffer elementBuffer=new StringBuffer();
    private HashMap attributeMap;
    private Document doc;
    static IndexWriter indexWriter;

    public Document getDocument(InputStream is) throws DocumentHandlerException {
        // TODO Auto-generated method stub
        SAXParserFactory spf=SAXParserFactory.newInstance();
        try{
            SAXParser parser=spf.newSAXParser();
            parser.parse(is, this);
        }
        catch(IOException e){
            throw new DocumentHandlerException("Cannot parse XML document", e);
        }
        catch(ParserConfigurationException e){
            throw new DocumentHandlerException("Cannot parse XML document", e);
        }
        catch(SAXException e){
            throw new DocumentHandlerException("Cannot parse XML document", e);
        }
        return doc;
    }

    public void startDocument(){
        //doc=new Document();
    }

    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException{
        if(qName.equals("SPEECH")){
            doc=new Document();
        }
        elementBuffer.setLength(0);
        //attributeMap.clear();
        if(atts.getLength()>0){
            attributeMap=new HashMap();
            for(int i=0; i<atts.getLength(); i++){
                attributeMap.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public void characters(char[] text, int start, int length){
        elementBuffer.append(text, start, length);
    }

    public void endElement(String uri, String localName, String qName) throws SAXException{
        try {
            if(qName.equals("REFERENCE")){
                Field reference = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
                doc.add(reference);
            }
            else if(qName.equals("SPEAKER")){
                Field speaker = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
                speaker.setBoost(2.0f);
                doc.add(speaker);
            }
            else if(qName.equals("LINES")){
                Field lines = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
                lines.setBoost(1.0f);
                doc.add(lines);
                indexWriter.addDocument(doc);
            }
            else{
                return;
            }
        } catch (CorruptIndexException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception{
        File index=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
        Directory fsDirectory = FSDirectory.getDirectory(index);
        Analyzer analyzer = new StandardAnalyzer();
        indexWriter = new IndexWriter(fsDirectory, analyzer, true);
        HamletHandler handler=new HamletHandler();
        Document doc=handler.getDocument(new FileInputStream(new File(args[0])));
        int numIndexed=indexWriter.docCount();
        System.out.println(numIndexed);
        indexWriter.optimize();
        indexWriter.close();
    }
}

and here is my searcher code:

package search;

/*
 * Searcher.java
 *
 * Created on August 6, 2007, 8:46 PM
 *
 * To change this template, choose Tools | Template Manager
 * and open the
 * template in the editor.
 */

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;

/**
 *
 *
 */
public class Searcher {

    /** Creates a new instance of Searcher */

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception{
        Searcher searchDoc=new Searcher();
        File indexDir=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
        String q="SLINGS AND ARROWS";
        String s="think~";
        if (s.contains("?") || s.contains("*")){
            System.out.println("this is a wildcard search");
        }
        else if (s.contains("~")){
            System.out.println("this is a fuzzy search");
        }
        else {
            System.out.println("this is a normal search");
        }
        if(!indexDir.exists() || !indexDir.isDirectory()){
            throw new Exception(indexDir + " does not exist or is not a directory.");
        }
        //searchDoc.wildSearch(indexDir);
        searchDoc.search(indexDir, q);
        //searchDoc.fuzzySearch(indexDir);
    }

    public List search(File indexDir, String q) throws Exception {
        List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is=new IndexSearcher(fsDir);
        Analyzer analyser = new StandardAnalyzer();
        Query parser=new QueryParser("LINES", analyser).parse(q);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
        Highlighter highlighter = new Highlighter(formatter, scorer);
        Highlighter high = new Highlighter(formatter, scorer);
        Fragmenter fragmenter = new NullFragmenter();
        Fragmenter fragment = new SimpleFragmenter(250);
        highlighter.setTextFragmenter(fragmenter);
        high.setTextFragmenter(fragment);
        for(int i=0; i<hits.length(); i++){
            Document doc=hits.doc(i);
            String lns = doc.get("LINES");
            TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
            CachingTokenFilter filter = new CachingTokenFilter(lines);
            String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
            String highlight = high.getBestFragment(filter, lns);
            SearchResult resultBean = new SearchResult();
            resultBean.setReference(hits.doc(i).get("REFERENCE"));
            resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
            resultBean.setHitResult(highlight);
            resultBean.setQuote(highlightedLines);
            searchResult.add(resultBean);
            System.out.println(resultBean.getReference());
            System.out.println(resultBean.getNarrator());
            System.out.println(resultBean.getHitResult());
            System.out.println("");
            System.out.println(resultBean.getQuote());
            System.out.println("");
        }
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + q + "':");
        return searchResult;
    }

    public List wildSearch(File indexDir) throws Exception {
        List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);
        IndexReader ir = IndexReader.open(fsDir);
        Analyzer analyser = new StandardAnalyzer();
        Query parser=new WildcardQuery(new Term("LINES", "the*"));
        parser=parser.rewrite(ir);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
        Highlighter highlighter = new Highlighter(formatter, scorer);
        Highlighter high = new Highlighter(formatter, scorer);
        Fragmenter fragmenter = new NullFragmenter();
        Fragmenter fragment = new SimpleFragmenter(250);
        highlighter.setTextFragmenter(fragmenter);
        high.setTextFragmenter(fragment);
        for(int i=0; i<hits.length(); i++){
            Document doc=hits.doc(i);
            String lns = doc.get("LINES");
            TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
            CachingTokenFilter filter = new CachingTokenFilter(lines);
            String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
            String highlight = high.getBestFragment(filter, lns);
            SearchResult resultBean = new SearchResult();
            resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
            resultBean.setHitResult(highlight);
            resultBean.setQuote(highlightedLines);
            searchResult.add(resultBean);
            System.out.println(resultBean.getNarrator());
            System.out.println(resultBean.getHitResult());
            System.out.println("");
            System.out.println(resultBean.getQuote());
            System.out.println("");
        }
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + "':");
        return searchResult;
    }

    public List fuzzySearch(File indexDir) throws Exception {
        List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);
        IndexReader ir = IndexReader.open(fsDir);
        Analyzer analyser = new StandardAnalyzer();
        Query parser=new FuzzyQuery(new Term("LINES", "the~"));
        parser=parser.rewrite(ir);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
        Highlighter highlighter = new Highlighter(formatter, scorer);
        Highlighter high = new Highlighter(formatter, scorer);
        Fragmenter fragmenter = new NullFragmenter();
        Fragmenter fragment = new SimpleFragmenter(250);
        highlighter.setTextFragmenter(fragmenter);
        high.setTextFragmenter(fragment);
        for(int i=0; i<hits.length(); i++){
            Document doc=hits.doc(i);
            String lns = doc.get("LINES");
            TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
            CachingTokenFilter filter = new CachingTokenFilter(lines);
            String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
            String highlight = high.getBestFragment(filter, lns);
            SearchResult resultBean = new SearchResult();
            resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
            resultBean.setHitResult(highlight);
            resultBean.setQuote(highlightedLines);
            searchResult.add(resultBean);
            System.out.println(resultBean.getNarrator());
            System.out.println(resultBean.getHitResult());
            System.out.println("");
            System.out.println(resultBean.getQuote());
            System.out.println("");
        }
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + "':");
        return searchResult;
    }
}

--
View this message in context: http://www.nabble.com/Creating-an-index-from-an-XML-file-using-Lucene-in-Java-tp18678779p18678779.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
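
For the parsing side of the question, one possible direction is to treat each <SCENE-COMMENTARY> element as its own record, alongside one record per <SPEECH>. The following is only a rough sketch under stated assumptions, not a tested solution for the index above. It deliberately uses plain JDK classes instead of Lucene so that it runs on its own. The class name SceneRecordCollector and the map keys TYPE and COMMENTARY are hypothetical names chosen for illustration. The idea would be that each returned map becomes one Lucene Document, with COMMENTARY added as a stored, tokenized field in the same way LINES is added in the indexing code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Lucene): collects one record per
// <SCENE-COMMENTARY> element and one record per <SPEECH> element.
class SceneRecordCollector extends DefaultHandler {
    private final List<Map<String, String>> records = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> speech; // non-null only while inside <SPEECH>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // discard whitespace seen between elements
        if (qName.equals("SPEECH")) {
            speech = new LinkedHashMap<>();
            speech.put("TYPE", "speech"); // assumed marker field
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (qName.equals("SCENE-COMMENTARY")) {
            // The commentary becomes a record of its own.
            Map<String, String> commentary = new LinkedHashMap<>();
            commentary.put("TYPE", "commentary");
            commentary.put("COMMENTARY", value);
            records.add(commentary);
        } else if (speech != null && (qName.equals("REFERENCE")
                || qName.equals("SPEAKER") || qName.equals("LINES"))) {
            speech.put(qName, value);
        } else if (qName.equals("SPEECH")) {
            // Emit the speech once the whole element has been read.
            records.add(speech);
            speech = null;
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the collected records in document order.
    static List<Map<String, String>> parse(String xml) throws Exception {
        SceneRecordCollector handler = new SceneRecordCollector();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.records;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample in the shape of the snippet above; the texts are placeholders.
        String xml = "<PLAY><TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE><SCENE>"
                + "<SCENE-COMMENTARY>Commentary on the scene that follows.</SCENE-COMMENTARY>"
                + "<SPEECH><REFERENCE>ACT 1, SCENE 1</REFERENCE>"
                + "<SPEAKER>LORD POLONIUS</SPEAKER>"
                + "<LINES>Neither a borrower nor a lender be;</LINES></SPEECH>"
                + "</SCENE></PLAY>";
        for (Map<String, String> record : parse(xml)) {
            System.out.println(record);
        }
    }
}
```

With records split this way, the search side could presumably run the same kind of query against a COMMENTARY field that the searcher above runs against LINES, for example by passing "COMMENTARY" as the default field to QueryParser. Whether commentary should live in its own document or be attached to each speech in the scene is a design choice that depends on how results are meant to be displayed.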