[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954 ]
Sebastian Nagel commented on NUTCH-1323:
----------------------------------------
After a small test crawl on http://si.draagle.com:
# usage is cumbersome because you have to think carefully about the steps in
which URLs are normalized. This is because AjaxNormalizer acts as a flip-flop:
hashbang URLs are escaped, and escaped ones are unescaped. If URLs are normalized
during parsing and then again during CrawlDb update, you get the hashbang URL back.
# relative hashbang links are not resolved correctly. The outlink of
{noformat}
base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
<a href="#!static/draagle_pogoji_uporabe.html">
{noformat}
should be
{noformat}
http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
and not
{noformat}
http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
{noformat}
# the outlink set of a single page with an escaped base URL may contain escaped
and unescaped URLs at the same time, as a result of
** a relative link without hashbang, e.g., {{<a href="#search">}}
** a global link with hashbang
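The second item can be illustrated with a small sketch. This is not the code from the attached patch: the class and method names (HashbangOutlinks, escape, unescape, resolveOutlink, naiveOutlink) are made up for illustration, and the percent-encoding of special characters in the fragment is left out. The idea is to unescape the base URL first, resolve the link against it, and escape only the result.

```java
import java.net.URI;

/** Illustrative sketch only; names are hypothetical, not taken from the patch. */
final class HashbangOutlinks {
    private static final String HASHBANG = "#!";
    private static final String ESCAPED = "_escaped_fragment_=";

    /** Rewrites "...#!frag" as "...?_escaped_fragment_=frag" ("&" if a query exists). */
    static String escape(String url) {
        int i = url.indexOf(HASHBANG);
        if (i < 0) {
            return url; // no hashbang, nothing to do
        }
        String head = url.substring(0, i);
        String frag = url.substring(i + HASHBANG.length());
        char sep = head.indexOf('?') >= 0 ? '&' : '?';
        return head + sep + ESCAPED + frag; // assumption: no percent-encoding
    }

    /** Rewrites "...?_escaped_fragment_=frag" (or "&...") as "...#!frag". */
    static String unescape(String url) {
        int i = url.indexOf(ESCAPED);
        if (i < 1) {
            return url;
        }
        char sep = url.charAt(i - 1);
        if (sep != '?' && sep != '&') {
            return url; // not a query parameter
        }
        return url.substring(0, i - 1) + HASHBANG + url.substring(i + ESCAPED.length());
    }

    /** Unescape the base, resolve the link against it, then escape the result. */
    static String resolveOutlink(String base, String href) {
        URI pageUrl = URI.create(unescape(base));
        return escape(pageUrl.resolve(href).toString());
    }

    /** Resolves against the escaped base as-is; in this sketch it doubles the parameter. */
    static String naiveOutlink(String base, String href) {
        return escape(URI.create(base).resolve(href).toString());
    }

    public static void main(String[] args) {
        String base = "http://si.draagle.com/?_escaped_fragment_=browse/group/root/";
        String href = "#!static/draagle_pogoji_uporabe.html";
        System.out.println(resolveOutlink(base, href));
        System.out.println(naiveOutlink(base, href));
    }
}
```

With the base and link from the example above, resolveOutlink yields the expected outlink with a single _escaped_fragment_ parameter, while naiveOutlink yields the URL with two of them.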
If I understand correctly:
* URLs with escaped fragments are used
** in crawlDb, segments, linkDb (URL acts as key)
** for fetching
* unescaped hashbang URLs
** are used in the index (and shown to the user)
** may appear in outlinks, redirects, and seeds
Couldn't we bind the decision whether to (un)escape to the current normalizer
scope:
* if URL contains #!
and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }
=> escape
* if URL contains _escaped_fragment_=
and scope is index
=> unescape
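A rough sketch of what such a scope-bound decision could look like. Everything here is an assumption made for illustration: the Scope enum merely mirrors the wording above rather than the scope constants Nutch actually defines, ScopedAjaxNormalizer is a hypothetical name, and percent-encoding of the fragment is omitted.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch only; scope names follow the comment, not Nutch's constants. */
final class ScopedAjaxNormalizer {
    enum Scope { INJECT, FETCHER, OUTLINK, CRAWLDB, INDEX, OTHER }

    private static final String HASHBANG = "#!";
    private static final String ESCAPED = "_escaped_fragment_=";

    // Scopes in which hashbang URLs are escaped (CRAWLDB is the uncertain one).
    private static final Set<Scope> ESCAPE_SCOPES =
        EnumSet.of(Scope.INJECT, Scope.FETCHER, Scope.OUTLINK, Scope.CRAWLDB);

    static String normalize(String url, Scope scope) {
        if (ESCAPE_SCOPES.contains(scope) && url.contains(HASHBANG)) {
            return escape(url);
        }
        if (scope == Scope.INDEX && url.contains(ESCAPED)) {
            return unescape(url);
        }
        return url; // every other combination leaves the URL untouched
    }

    private static String escape(String url) {
        int i = url.indexOf(HASHBANG);
        String head = url.substring(0, i);
        char sep = head.indexOf('?') >= 0 ? '&' : '?';
        return head + sep + ESCAPED + url.substring(i + HASHBANG.length());
    }

    private static String unescape(String url) {
        int i = url.indexOf(ESCAPED);
        if (i < 1 || (url.charAt(i - 1) != '?' && url.charAt(i - 1) != '&')) {
            return url; // not a query parameter
        }
        return url.substring(0, i - 1) + HASHBANG + url.substring(i + ESCAPED.length());
    }

    public static void main(String[] args) {
        String hashbang = "http://si.draagle.com/#!browse/group/root/";
        String afterParse = normalize(hashbang, Scope.OUTLINK);
        String afterUpdate = normalize(afterParse, Scope.CRAWLDB);
        System.out.println(afterUpdate);                         // still escaped
        System.out.println(normalize(afterUpdate, Scope.INDEX)); // hashbang again
    }
}
```

Because each scope allows only one direction, normalizing first in the outlink scope and then in the crawldb scope leaves the URL escaped instead of flipping it back; only the index scope restores the hashbang form.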
> AjaxNormalizer
> --------------
>
> Key: NUTCH-1323
> URL: https://issues.apache.org/jira/browse/NUTCH-1323
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1323-1.6-1.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URL's, converting them
> to _escaped_fragment_ URL's and back to an AJAX URL.
> https://developers.google.com/webmasters/ajax-crawling/