Your problem has more to do with Tika, so please post it to the Tika user group. If you only need to deal with HTML, you are better off using a dedicated HTML parser: http://www.findbestopensource.com/search/?query=%22html+parser%22
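If you do stay with Tika: parser.parse(...) reports the document to whatever org.xml.sax.ContentHandler you pass in, so one possible approach is a small custom handler that collects the text seen inside the elements you care about. Below is a minimal sketch of that idea, not a definitive implementation. The class name TagTextHandler and its getText/parse helpers are made up for illustration and are not part of Tika. The demo feeds well-formed XHTML through the JDK's own SAX parser so that it runs without extra jars; whether a given tag actually reaches your handler under Tika depends on the HtmlMapper in use.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a Tika class): collects text per element name. */
class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted;                       // tags to capture
    private final Map<String, Integer> depth = new HashMap<>();          // open count per tag
    private final Map<String, StringBuilder> text = new HashMap<>();     // captured text per tag

    TagTextHandler(String... tags) {
        wanted = new HashSet<>(Arrays.asList(tags));
    }

    // Namespace-aware sources fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (wanted.contains(n)) {
            depth.merge(n, 1, Integer::sum);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (wanted.contains(n)) {
            depth.merge(n, -1, Integer::sum);
            // Separate the text of consecutive elements with the same name.
            text.computeIfAbsent(n, k -> new StringBuilder()).append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text belongs to every wanted element that is currently open.
        for (Map.Entry<String, Integer> e : depth.entrySet()) {
            if (e.getValue() > 0) {
                text.computeIfAbsent(e.getKey(), k -> new StringBuilder())
                    .append(ch, start, length);
            }
        }
    }

    /** Text collected for one tag, or "" if the tag never appeared. */
    String getText(String tag) {
        StringBuilder sb = text.get(tag);
        return sb == null ? "" : sb.toString().trim();
    }

    /** Demo helper: runs the handler over well-formed XHTML with the JDK SAX parser. */
    static TagTextHandler parse(String xhtml, String... tags) throws Exception {
        TagTextHandler handler = new TagTextHandler(tags);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>My Page</title></head><body>"
            + "<h1>Intro <a href=\"x.html\">link</a></h1><p>skip me</p>"
            + "<h2>Part A</h2><h2>Part B</h2></body></html>";
        TagTextHandler h = parse(page, "title", "h1", "h2", "a");
        for (String tag : new String[] {"title", "h1", "h2", "a"}) {
            System.out.println(tag + ": " + h.getText(tag));
        }
    }
}
```

In your code you would pass an instance of such a handler to parser.parse(is, handler, metadata, context) in place of the BodyContentHandler, then add one Field per tag from getText(...) and set a boost on each field. If you also want the full body text for the "contents" field, Tika's TeeContentHandler can forward the same events to both handlers.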
On Tue, Jan 11, 2011 at 7:24 AM, amg qas <amg...@gmail.com> wrote:

> I have been trying to parse & index different portions of an HTML page using
> Tika & Lucene. For eg. I would like to index text within <Title>, <H1>,
> <H2>, <A> tags of a HTML page separately and provide a different boost to
> each of them. I am using Tika for HTML parsing and creating a Document object
> with the appropriate fields that need to be indexed. However I could not find
> anything within Tika which would help me index the tags I want right out of
> the box.
>
> My code looks something like this :
>
>     InputStream is = new FileInputStream(f);
>     Parser parser = new AutoDetectParser();
>     ContentHandler handler = new BodyContentHandler(-1);
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);
>
>     try {
>         parser.parse(is, handler, metadata, context);
>     } finally {
>         is.close();
>     }
>
>     Document doc = new Document();
>     doc.add(new Field("contents", handler.toString(),
>             Field.Store.NO, Field.Index.ANALYZED));
>
>     for (String name : metadata.names()) {
>         String value = metadata.get(name);
>
>         if (textualMetadataFields.contains(name)) {
>             doc.add(new Field("contents", value,
>                     Field.Store.NO, Field.Index.ANALYZED));
>         }
>
>         doc.add(new Field(name, value, Field.Store.YES, Field.Index.YES));
>     }
>
> Stepping into Tika's HTML parsing code I found that it is
> org.apache.tika.parser.html.HtmlHandler class that fills up metadata object.
> I have the following questions :
> · Do I need to write a custom HTML handler to extract text within specific
>   elements of a HTML page ?
> · Is there some class in Tika which can parse out text within different
>   HTML tags that one specifies and fill up metadata object accordingly ? Can
>   someone please provide code samples for solutions that you propose ?
>
> Thanks,
> Amg