Sebastian Nagel created NUTCH-1767:
--------------------------------------

             Summary: remove special treatment of "params" in relative links
                 Key: NUTCH-1767
                 URL: https://issues.apache.org/jira/browse/NUTCH-1767
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2.1, 1.8
            Reporter: Sebastian Nagel
            Priority: Minor
             Fix For: 2.3, 1.9


[RFC 1808|http://www.ietf.org/rfc/rfc1808.txt] specified that path elements of 
URLs may contain so-called params introduced by ";", e.g. ";type=a". If the base 
URL contains a path param while the link target does not, the params are 
transferred to the target:
{quote}
Step 5: 
 a) if the embedded URL's <params> is non-empty, we skip to
     step 7; otherwise, it inherits the <params> of the base URL (if any)
{quote}
This behaviour was implemented with NUTCH-436. Later (NUTCH-1115) it was 
made optional and configurable via the property {{parser.fix.embeddedparams}}. 
NUTCH-797 made the changes of both issues inactive for 1.x (not applied to 2.x) 
with reference to RFC 3986.
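
To make the behaviour in question concrete, here is a minimal sketch of the param transfer as summarized above. It is not the Nutch code: the class {{ParamInheritanceSketch}} and the method {{inheritParams}} are hypothetical names, and the sketch follows the one-sentence summary in this issue rather than the complete RFC 1808 algorithm.

```java
// Hypothetical illustration only; not taken from Nutch's parse-html or parse-tika.
final class ParamInheritanceSketch {

    /**
     * If the base URL carries ";params" and the relative target has none,
     * copy the base params into the target (in front of any query or fragment).
     */
    static String inheritParams(String base, String target) {
        int start = base.indexOf(';');
        if (start < 0 || target.indexOf(';') >= 0) {
            return target; // nothing to inherit, or the target has its own params
        }
        // The base params end at the query, the fragment, or the end of the string.
        String params = base.substring(start, endOfPath(base, start));
        int split = endOfPath(target, 0);
        return target.substring(0, split) + params + target.substring(split);
    }

    /** Index of the first '?' or '#' at or after 'from', or the string length. */
    private static int endOfPath(String s, int from) {
        int end = s.length();
        int q = s.indexOf('?', from);
        int f = s.indexOf('#', from);
        if (q >= 0) end = Math.min(end, q);
        if (f >= 0) end = Math.min(end, f);
        return end;
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;type=a?q";
        System.out.println(inheritParams(base, "g?y")); // g;type=a?y
        System.out.println(inheritParams(base, "g;x")); // g;x
    }
}
```

The string returned by the sketch would still have to be resolved against the base URL in the usual way.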

[RFC 3986|http://tools.ietf.org/html/rfc3986], which obsoletes RFC 1808, does not 
mention params, and the examples given in sect. 5.4 "Reference Resolution Examples" 
contradict RFC 1808. Also, 
[Wikipedia|http://en.wikipedia.org/w/index.php?title=URI_scheme&oldid=604656593]
 states:
{quote}
Historically, each segment was specified to contain parameters separated from 
it using a semicolon (";"), though this was rarely used in practice and current 
specifications allow but no longer specify such semantics.
{quote}
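
For comparison, a few of the RFC 3986 sect. 5.4 examples can be reproduced with plain {{java.net.URI}}, which applies no special treatment to params. This is only a sketch: the class name {{Rfc3986ExamplesSketch}} is made up for illustration, and {{java.net.URI}} is documented against the older RFC 2396, so only references whose result agrees with RFC 3986 are shown.

```java
import java.net.URI;

// Hypothetical demo class; base URL and expected results are from RFC 3986, sect. 5.4.1.
final class Rfc3986ExamplesSketch {

    /** Resolves a relative reference against a base URL without any param handling. */
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q";
        // The params of the base (";p") are dropped together with its last path segment.
        System.out.println(resolve(base, "g"));   // http://a/b/c/g
        System.out.println(resolve(base, "g;x")); // http://a/b/c/g;x
        // ";x" is treated as an ordinary path segment, not as replacement params.
        System.out.println(resolve(base, ";x"));  // http://a/b/c/;x
    }
}
```

RFC 1808 lists {{http://a/b/c/d;x}} as the result for the reference {{;x}}, which is one of the places where the two documents disagree.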

Accordingly, any special treatment of "params" in relative links should be 
removed from Nutch. At first glance, this would include:
* 2.x parse-html and parse-tika
** remove fixEmbeddedParams(...)
** change unit tests to follow examples from RFC 3986
* 1.x
** remove unused fixEmbeddedParams(...) from parse-html
** remove property {{parser.fix.embeddedparams}} from nutch-default.xml




--
This message was sent by Atlassian JIRA
(v6.2#6252)
