meta equiv with single quotes not accepted
------------------------------------------

                 Key: NUTCH-1006
                 URL: https://issues.apache.org/jira/browse/NUTCH-1006
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.2, 1.3, 2.0
            Reporter: Markus Jelsma
             Fix For: 2.0


As posted by Alex F:

The regex metaPattern in org.apache.nutch.parse.html.HtmlParser does not
match pages that use single quotes in <meta http-equiv....>

  Example:
  <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>

  We came across a couple of pages quoted this way, and Nutch 1.2 was not
able to handle them.

Is there any fallback, or would it be good to use the following regex,
which accepts either single or regular (double) quotes?

  "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
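
To illustrate the difference, here is a minimal standalone sketch that runs the
proposed regex against the single-quoted example from this report. It is not
the HtmlParser code: the class name MetaEquivQuotes, the helper
metaAttributes, the double-quote-only comparison pattern, and the use of
Pattern.CASE_INSENSITIVE are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class MetaEquivQuotes {
    // Regex proposed in this report. CASE_INSENSITIVE is an assumption, needed
    // here so that the lowercase "content-type" matches "Content-Type".
    static final Pattern PROPOSED = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Comparison pattern that only allows an optional double quote.
    // Assumed for illustration; check HtmlParser for the actual metaPattern.
    static final Pattern DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Returns the attribute text inside the first matching <meta ...> tag,
    // or null when the pattern finds no match.
    static String metaAttributes(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";

        // The double-quote-only pattern finds nothing in the single-quoted tag.
        System.out.println(metaAttributes(DOUBLE_ONLY, single));
        // The proposed pattern captures the tag's attributes.
        System.out.println(metaAttributes(PROPOSED, single));
    }
}
```

With the proposed pattern, group 1 holds the full attribute text of the tag,
so a later charset lookup on that group would still see charset=iso-8859-1.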


See this thread:
http://lucene.472066.n3.nabble.com/Character-encoding-on-Html-Pages-td3034850.html

