[
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274488#comment-13274488
]
behnam nikbakht commented on NUTCH-1323:
----------------------------------------
Hi,
When I try to crawl a dynamic URL like this:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain
AjaxNormalizer must convert it to:
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html?_escaped_fragment_=mountain
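The conversion described above can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the class and method names are made up for this example, and percent-escaping of special characters in the fragment is left out for brevity.

```java
// Hypothetical helper, not part of Nutch or of NUTCH-1323-1.6-1.patch.
final class EscapedFragmentSketch {

  // Rewrites "...#!fragment" to "...?_escaped_fragment_=fragment".
  // If the URL already has a query string, "&" is used instead of "?".
  // Assumption: the fragment needs no percent-escaping (true for "mountain").
  static String toEscapedFragment(String url) {
    int idx = url.indexOf("#!");
    if (idx < 0) {
      return url; // not an AJAX URL, leave it unchanged
    }
    String base = url.substring(0, idx);
    String fragment = url.substring(idx + 2);
    String separator = base.indexOf('?') >= 0 ? "&" : "?";
    return base + separator + "_escaped_fragment_=" + fragment;
  }

  public static void main(String[] args) {
    System.out.println(toEscapedFragment(
        "http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain"));
  }
}
```

The sketch only works if the "#!" part is still present when it runs, which is why the order and behaviour of the other normalizers matters.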
But there is a problem:
other normalizers remove the # fragment from URLs, based on rules in regex-normalize.xml.
Also, in
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
there is a line that removes the ref:
if (url.getRef() != null) {
...
To fix this, I tested changing it to:
if (url.getRef() != null) { // keep the ref instead of removing it
file = file + "#" + url.getRef();
changed = true;
}
With this change, and with the # rules removed from regex-normalize.xml, the plugin works correctly.
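For reference, the effect of keeping the ref can be tried outside Nutch with plain java.net.URL. This is a standalone sketch under my own assumptions, not the actual BasicURLNormalizer code: the class and method names are invented, and only the getRef() handling mirrors the change above.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical standalone sketch; BasicURLNormalizer itself does more than this.
final class KeepRefSketch {

  // Rebuilds the URL from its parts and appends the ref (the part after '#')
  // instead of dropping it, so "#!mountain" is preserved.
  static String rebuildKeepingRef(String urlString) {
    try {
      URL url = new URL(urlString);
      String file = url.getFile(); // path plus query, without the ref
      if (url.getRef() != null) {
        file = file + "#" + url.getRef(); // keep the ref
      }
      return new URL(url.getProtocol(), url.getHost(), url.getPort(), file).toString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(rebuildKeepingRef(
        "http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain"));
  }
}
```

Without the getRef() branch, the rebuilt URL would end at gallery.html and AjaxNormalizer would have nothing left to convert.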
> AjaxNormalizer
> --------------
>
> Key: NUTCH-1323
> URL: https://issues.apache.org/jira/browse/NUTCH-1323
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1323-1.6-1.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URL's, converting them
> to _escaped_fragment_ URL's and back to an AJAX URL.
> https://developers.google.com/webmasters/ajax-crawling/