[ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954
 ] 

Sebastian Nagel commented on NUTCH-1323:
----------------------------------------

After a small test crawl on http://si.draagle.com:
# usage is cumbersome because you have to think carefully about the steps in 
which URLs get normalized. This is because AjaxNormalizer acts as a flip-flop: 
hashbang URLs are escaped, escaped ones are unescaped. If URLs are normalized 
during parsing and then again during CrawlDb update, you end up with the 
hashbang URL again.
# relative hashbang links are not resolved correctly. The outlink of
{noformat}
 base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
 <a href="#!static/draagle_pogoji_uporabe.html">
{noformat}
should be
{noformat}
http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
and not
{noformat}
http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
# the outlinks of a single page with an escaped base URL may contain both 
escaped and unescaped URLs at the same time, resulting from
** a relative link without hashbang, e.g., {{<a href="#search">}}
** a global link with hashbang
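
The resolution expected in item 2 could be sketched roughly as follows: unescape the base URL first, resolve the link against the hashbang form, and escape the result again. This is only an illustration and not the code of the attached patch. The class and method names ({{HashbangResolver}}, {{resolveOutlink}}) are made up for this example, and percent-encoding of the fragment is left out for brevity.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch or of the attached patch.
class HashbangResolver {
    static final String ESCAPED = "_escaped_fragment_=";

    // "http://host/?_escaped_fragment_=x" -> "http://host/#!x"
    // Simplification: the fragment is copied verbatim (no percent-decoding).
    static String unescape(String url) {
        int i = url.indexOf(ESCAPED);
        if (i <= 0) {
            return url;
        }
        // The character right before the parameter is '?' or '&'; drop it.
        String head = url.substring(0, i - 1);
        String fragment = url.substring(i + ESCAPED.length());
        return head + "#!" + fragment;
    }

    // "http://host/#!x" -> "http://host/?_escaped_fragment_=x"
    // Simplification: the fragment is copied verbatim (no percent-encoding).
    static String escape(String url) {
        int i = url.indexOf("#!");
        if (i < 0) {
            return url;
        }
        String head = url.substring(0, i);
        String fragment = url.substring(i + 2);
        String separator = head.contains("?") ? "&" : "?";
        return head + separator + ESCAPED + fragment;
    }

    // Resolve a link against the hashbang form of the base, then escape
    // the result only if it is itself a hashbang URL.
    static String resolveOutlink(String base, String href) {
        String plainBase = unescape(base);
        String resolved = URI.create(plainBase).resolve(href).toString();
        return escape(resolved);
    }
}
```

With the base from item 2, {{resolveOutlink}} applied to {{#!static/draagle_pogoji_uporabe.html}} gives the single-parameter URL shown above. A plain fragment link such as {{#search}} comes out as {{http://si.draagle.com/#search}}, which is neither a hashbang nor an escaped URL, so it would need separate handling.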

If I understand correctly:
* URLs with escaped fragments are used
** in crawlDb, segments, linkDb (URL acts as key)
** for fetching
* unescaped hashbang URLs
** are used in the index (and shown to the user)
** may appear in outlinks, redirects, and seeds

Couldn't we tie the decision whether to escape or unescape to the current 
normalizer scope?
* if URL contains #!
  and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }
  => escape
* if URL contains _escaped_fragment_=
  and scope is index
  => unescape
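
A minimal sketch of this scope-bound rule, to show that it is no longer a flip-flop: normalizing twice in an escaping scope gives the same result as normalizing once. The class name {{ScopedAjaxNormalizer}} and the scope strings are placeholders chosen for this example. They follow the scope names used above and have not been checked against the scope constants Nutch actually defines. Percent-encoding of the fragment is again left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the AjaxNormalizer from the attached patch.
class ScopedAjaxNormalizer {
    static final String ESCAPED = "_escaped_fragment_=";

    // Assumed scope names, taken from the proposal above.
    static final Set<String> ESCAPE_SCOPES =
        new HashSet<>(Arrays.asList("inject", "fetcher", "outlink", "crawldb"));
    static final String UNESCAPE_SCOPE = "index";

    static String normalize(String url, String scope) {
        if (ESCAPE_SCOPES.contains(scope) && url.contains("#!")) {
            return escape(url);
        }
        if (UNESCAPE_SCOPE.equals(scope) && url.contains(ESCAPED)) {
            return unescape(url);
        }
        // Any other combination leaves the URL untouched.
        return url;
    }

    // "http://host/#!x" -> "http://host/?_escaped_fragment_=x"
    static String escape(String url) {
        int i = url.indexOf("#!");
        String head = url.substring(0, i);
        String fragment = url.substring(i + 2);
        String separator = head.contains("?") ? "&" : "?";
        return head + separator + ESCAPED + fragment;
    }

    // "http://host/?_escaped_fragment_=x" -> "http://host/#!x"
    static String unescape(String url) {
        int i = url.indexOf(ESCAPED);
        if (i <= 0) {
            return url;
        }
        // Drop the '?' or '&' that precedes the parameter.
        String head = url.substring(0, i - 1);
        String fragment = url.substring(i + ESCAPED.length());
        return head + "#!" + fragment;
    }
}
```

An outlink normalized during parsing and again during CrawlDb update stays escaped under this rule, and only the index scope turns it back into the hashbang form.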

                
> AjaxNormalizer
> --------------
>
>                 Key: NUTCH-1323
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1323
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1323-1.6-1.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URL's, converting them 
> to _escaped_fragment_ URL's and back to an AJAX URL.
> https://developers.google.com/webmasters/ajax-crawling/
