[ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274488#comment-13274488
 ] 

behnam nikbakht commented on NUTCH-1323:
----------------------------------------

Hi,

When I try to crawl a dynamic URL such as
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain
AjaxNormalizer should convert it to
http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html?_escaped_fragment_=mountain
but there is a problem:
the other normalizers strip the # fragment from URLs, based on the rules in regex-normalize.xml, so the #! part is lost.
The same thing happens in
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
where this block removes the ref:
if (url.getRef() != null) {
...
In my tests, the fix is to change that block to:
if (url.getRef() != null) {                 // keep the ref instead of removing it
       file = file + "#" + url.getRef();
       changed = true;
}
With this change, and with the fragment-removing rules deleted from regex-normalize.xml, the plugin works correctly.
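
To illustrate why the ref has to be appended explicitly, here is a minimal standalone sketch. It is not the BasicURLNormalizer code; the class name RefPreservingSketch and the method rebuild are made up for this example. It relies only on java.net.URL, where getFile() returns the path plus query and never the ref, so a URL rebuilt from getFile() alone loses everything after the #.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration, not part of Nutch.
final class RefPreservingSketch {

    // Rebuilds a URL string from its parts, optionally keeping the ref.
    static String rebuild(String spec, boolean keepRef) {
        URL url;
        try {
            url = new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a valid URL: " + spec, e);
        }

        String file = url.getFile();            // path + query, without the ref
        if (keepRef && url.getRef() != null) {
            file = file + "#" + url.getRef();   // same idea as the change proposed above
        }

        StringBuilder sb = new StringBuilder();
        sb.append(url.getProtocol()).append("://").append(url.getHost());
        if (url.getPort() != -1) {
            sb.append(':').append(url.getPort());
        }
        sb.append(file);
        return sb.toString();
    }
}
```

Calling rebuild(..., false) on the gallery URL above drops "#!mountain", while rebuild(..., true) returns the URL unchanged, so a later normalizer can still see the #! marker.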
                
> AjaxNormalizer
> --------------
>
>                 Key: NUTCH-1323
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1323
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1323-1.6-1.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URLs, converting them 
> to _escaped_fragment_ URLs and back to AJAX URLs.
> https://developers.google.com/webmasters/ajax-crawling/
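
The two-way conversion described in the issue can be sketched roughly as follows. This is not the code from the attached patch; the class AjaxUrlSketch and its methods are hypothetical. It assumes the escaping rules of the AJAX crawling scheme linked above (bytes %00-20, %23, %25-26, %2B and %7F-FF are percent-encoded), and it assumes that _escaped_fragment_ is the last query parameter when converting back.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the NUTCH-1323 patch.
final class AjaxUrlSketch {
    private static final String HASH_BANG = "#!";
    private static final String ESCAPED = "_escaped_fragment_=";

    // http://host/page#!state  ->  http://host/page?_escaped_fragment_=state
    static String toEscapedFragment(String url) {
        int idx = url.indexOf(HASH_BANG);
        if (idx < 0) {
            return url;                               // not an AJAX URL
        }
        String base = url.substring(0, idx);
        String fragment = url.substring(idx + HASH_BANG.length());
        String sep = base.contains("?") ? "&" : "?";  // append to an existing query
        return base + sep + ESCAPED + escape(fragment);
    }

    // http://host/page?_escaped_fragment_=state  ->  http://host/page#!state
    static String toAjaxUrl(String url) {
        int idx = url.indexOf(ESCAPED);
        if (idx < 1) {
            return url;
        }
        char sep = url.charAt(idx - 1);
        if (sep != '?' && sep != '&') {
            return url;                               // part of some other parameter name
        }
        String base = url.substring(0, idx - 1);
        String fragment = url.substring(idx + ESCAPED.length());
        return base + HASH_BANG + unescape(fragment);
    }

    // Percent-encodes only the bytes the scheme requires.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (b <= 0x20 || b == 0x23 || b == 0x25 || b == 0x26 || b == 0x2B || b >= 0x7F) {
                sb.append(String.format("%%%02X", b));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    // Decodes %XX sequences; assumes well-formed, ASCII-only input.
    private static String unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For the gallery URL in the comment above, toEscapedFragment yields the ?_escaped_fragment_=mountain form and toAjaxUrl maps it back to the #!mountain form. Either direction only works if the fragment has not already been stripped by an earlier normalizer.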

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
