meta equiv with single quotes not accepted
------------------------------------------
Key: NUTCH-1006
URL: https://issues.apache.org/jira/browse/NUTCH-1006
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Fix For: 2.0
As posted by Alex F:
The metaPattern regex in org.apache.nutch.parse.html.HtmlParser does not
match pages that use single quotes in <meta http-equiv....>
Example:
<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
We encountered several pages that quote the attribute this way, and
Nutch 1.2 was not able to handle them.
Is there a fallback, or would it be better to use the following regex,
which accepts both single and double quotes?
"<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
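
To illustrate the difference, here is a minimal standalone sketch rather than the actual HtmlParser code. It compares the proposed regex against a double-quote-only variant. The class name MetaPatternDemo, the helper metaAttributes, the DOUBLE_ONLY pattern, and the use of Pattern.CASE_INSENSITIVE are assumptions made for this example; the real metaPattern in Nutch may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class MetaPatternDemo {

    // Assumption: a pattern that only tolerates an optional double quote,
    // standing in for the behaviour described in this issue.
    static final Pattern DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // The regex proposed in this issue: an optional single or double quote.
    // Case-insensitive matching is assumed so that "Content-Type" matches.
    static final Pattern PROPOSED = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Returns the attribute text captured by group 1, or null if no match.
    static String metaAttributes(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";

        System.out.println("double-only, single quotes: " + metaAttributes(DOUBLE_ONLY, single));
        System.out.println("proposed,    single quotes: " + metaAttributes(PROPOSED, single));
        System.out.println("proposed,    double quotes: " + metaAttributes(PROPOSED, dbl));
    }
}
```

With the single-quoted example from this report, the double-quote-only pattern finds no match, while the proposed pattern captures the whole attribute list, from which the charset can then be extracted.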
See this thread:
http://lucene.472066.n3.nabble.com/Character-encoding-on-Html-Pages-td3034850.html