I have been trying to parse and index different portions of an HTML page using Tika and Lucene. For example, I would like to index the text within the <Title>, <H1>, <H2> and <A> tags of an HTML page as separate fields and give each of them a different boost. I am using Tika to parse the HTML and then creating a Lucene Document object with the fields that need to be indexed. However, I could not find anything in Tika that extracts the text of specific tags out of the box.
My code looks something like this:

    InputStream is = new FileInputStream(f);
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler(-1);  // -1 disables the write limit
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);
    try {
        parser.parse(is, handler, metadata, context);
    } finally {
        is.close();
    }

    Document doc = new Document();
    // Body text goes into a single catch-all "contents" field
    doc.add(new Field("contents", handler.toString(), Field.Store.NO, Field.Index.ANALYZED));
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        // textualMetadataFields is defined elsewhere in my code
        if (textualMetadataFields.contains(name)) {
            doc.add(new Field("contents", value, Field.Store.NO, Field.Index.ANALYZED));
        }
        // Field.Index has no YES constant; NOT_ANALYZED stores the value as a single term
        doc.add(new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }

Stepping into Tika's HTML parsing code, I found that the org.apache.tika.parser.html.HtmlHandler class is what fills in the metadata object. I have the following questions:

· Do I need to write a custom HTML handler to extract the text within specific elements of an HTML page?
· Is there a class in Tika that can pull out the text within the HTML tags one specifies and fill in the metadata object accordingly?

Can someone please provide code samples for the solutions you propose?

Thanks,
Amg
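
For reference, here is a minimal sketch of the custom-handler idea from the first question. It is not a confirmed Tika feature. It relies only on the fact that Tika reports parsed HTML as XHTML SAX events to whatever ContentHandler it is given. The class name TagTextCollector and its collect helper are hypothetical, and the demo helper feeds it well-formed XHTML through the JDK's own SAX parser so that the sketch runs without Tika on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text inside selected elements, keyed by tag name. */
class TagTextCollector extends DefaultHandler {
    private final Set<String> wanted;  // lower-case tag names to capture
    private final Map<String, List<String>> textByTag = new LinkedHashMap<>();
    // One buffer per currently open wanted element, innermost first
    private final Deque<StringBuilder> open = new ArrayDeque<>();

    TagTextCollector(Set<String> wanted) {
        this.wanted = wanted;
    }

    Map<String, List<String>> getTextByTag() {
        return textByTag;
    }

    // localName is empty when the parser is not namespace-aware, so fall back to qName
    private static String tagName(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (wanted.contains(tagName(localName, qName))) {
            open.push(new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside nested wanted elements (e.g. <h1><a>..</a></h1>) counts for each of them
        for (StringBuilder buffer : open) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = tagName(localName, qName);
        if (wanted.contains(name) && !open.isEmpty()) {
            String text = open.pop().toString().trim();
            if (!text.isEmpty()) {
                textByTag.computeIfAbsent(name, k -> new ArrayList<>()).add(text);
            }
        }
    }

    /** Demo helper (not part of Tika): runs the handler over a well-formed XHTML string. */
    static Map<String, List<String>> collect(String xhtml, Set<String> tags) {
        TagTextCollector handler = new TagTextCollector(tags);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.getTextByTag();
    }
}
```

With Tika, the assumption is that an instance such as new TagTextCollector(Set.of("title", "h1", "h2", "a")) could be passed to parser.parse(is, handler, metadata, context) in place of the BodyContentHandler, provided the HtmlMapper in use lets those elements through. Each entry of getTextByTag() could then be added to the Document as its own Field and boosted individually. Note that BodyContentHandler forwards only the body, so text from the title element in the head would need the handler to see the whole document.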