Hi Linh,

You can specify a custom HtmlMapper to control which elements the HTML parser keeps or filters out.
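
As a rough sketch (not necessarily what the commit linked in this message does), one way to keep every tag is to register Tika's IdentityHtmlMapper in the ParseContext before calling HtmlParser. This assumes a Tika version that ships these classes in org.apache.tika.parser.html; the class name KeepAllTags and the sample HTML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class, for illustration only.
public class KeepAllTags {

    // Parses an HTML string and returns the XHTML that Tika emits.
    public static String parse(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Without this line the parser falls back to its default mapper,
        // which drops elements such as <div> and <span>.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Serializes the SAX events back to markup so the kept tags are visible.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input.
        String html = "<html><body><div class=\"note\"><span>hi</span></div></body></html>";
        System.out.println(parse(html));
    }
}
```

If you only need some of the extra tags, you can implement HtmlMapper yourself instead and return null from mapSafeElement for the elements you want dropped.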

See https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639 for an example.

Julien

On Monday, 3 November 2014, Linh Tang <[email protected]> wrote:

> Dear All,
>
> I am Phuong Linh.
> I am using Tika to extract content from HTML files for search, but HtmlParser
> cannot parse all the tags of the HTML. (I fetch the HTML pages with Nutch, then
> use Tika to extract the important information, and then use Solr to search.)
> Can you tell me what I can do to parse all the HTML tags?
>
> Thanks in advance!
>
> Regards,
> Tang Thi Phuong Linh.
> --
> P.Linh
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
