Hi Linh

You can specify a mapper (an HtmlMapper set in the ParseContext) to control which elements and attributes the HTML parser keeps and which it filters out.

See
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example.
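
For reference, here is a minimal sketch of the idea, which may differ from the linked commit. It assumes a Tika release whose HTML parser lives in org.apache.tika.parser.html and that the Tika parsers jar is on the classpath. The class name KeepAllHtmlTags and the sample markup are made up for illustration. It registers Tika's IdentityHtmlMapper in the ParseContext so that HtmlParser passes every element and attribute through instead of applying its default filtering.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical class name, for illustration only.
public class KeepAllHtmlTags {

    // Parses the given HTML and returns the XHTML that Tika emits.
    // When keepAllTags is true, IdentityHtmlMapper replaces the default mapper.
    static String parse(String html, boolean keepAllTags) throws Exception {
        ParseContext context = new ParseContext();
        if (keepAllTags) {
            // HtmlParser looks up its mapper in the ParseContext by class.
            context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
        }
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page.
        String html = "<html><body><div class=\"main\">"
                + "<span>Hello</span></div></body></html>";
        System.out.println("Default mapper:");
        System.out.println(parse(html, false));
        System.out.println("Identity mapper:");
        System.out.println(parse(html, true));
    }
}
```

If you only want a few extra elements rather than everything, you could instead write your own HtmlMapper implementation and register it the same way.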

Julien

On Monday, 3 November 2014, Linh Tang <[email protected]> wrote:

> Dear All,
>
> I am Phuong Linh,
> I am using Tika to extract content from HTML files for search, but HtmlParser
> does not parse all the HTML tags. (I fetch the HTML pages with Nutch, then use
> Tika to extract the important information, and then use Solr for search.)
> Can you tell me what I can do to parse all the HTML tags?
>
> Thanks in advance!
>
> Regards,
> Tang Thi Phuong Linh.
> --
> P.Linh
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
