[
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1323:
---------------------------------
Attachment: NUTCH-1323-1.8.patch
Updated patch for trunk.
The normalizer now relies on SCOPE_INDEXER; otherwise the other rules are
tried. This solves the problem of cumbersome usage. This new patch does not
solve the problem of relative URLs, but as far as I know relative URLs never
make it to the normalizers anyway. To confirm, I did a test crawl of the
http://si.draagle.com/ homepage (with the crazy cookie thing, really, check it
out!). Here's the output of readdb:
{code}
Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval seconds;Retry interval days;Score;Signature;Metadata
"http://si.draagle.com/";6;"db_notmodified";Tue Apr 22 11:57:29 CEST 2014;Tue Mar 11 10:55:11 CET 2014;0;3628800.0;42.0;0.0;"c44af84abaf0042685a03bf2ecfd2927";"Content-Type:text/html|||_pst_:success(1), lastModified=0|||_rs_:25|||"
"http://si.draagle.com/?_escaped_fragment_=/basket/show/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/?_escaped_fragment_=/browse/group/root/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/?_escaped_fragment_=/login/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/draagle_pogoji_uporabe.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/profiles.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/tvspot.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.apta-medica.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.draagle.si/bolezni/index.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.medicina-danes.si/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.novartisoncology.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.orlkotnik.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.zobozdravstvolavtar.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
{code}
I think this patch is nearly ready. Any other things to worry about?
> AjaxNormalizer
> --------------
>
> Key: NUTCH-1323
> URL: https://issues.apache.org/jira/browse/NUTCH-1323
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URLs, converting them
> to _escaped_fragment_ URLs and back to AJAX URLs.
> https://developers.google.com/webmasters/ajax-crawling/
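
For readers who have not opened the patch, here is a rough, standalone sketch
of the two-way conversion described in the issue. It is not the code from
NUTCH-1323-1.8.patch. The class name AjaxUrlSketch, its method names and the
local SCOPE_INDEXER string are hypothetical, and the scope dispatch in
normalize() is only one assumed reading of "relies on SCOPE_INDEXER": convert
back to the AJAX URL when indexing, and to the _escaped_fragment_ form in any
other scope. The set of escaped bytes follows the linked AJAX crawling scheme
(%00..20, %23, %25..26, %2B, %7F..FF). The meta fragment case (an empty
_escaped_fragment_ value) is not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; NOT the implementation attached to NUTCH-1323. */
final class AjaxUrlSketch {
  // Local stand-in for the scope constant named in the comment (assumption).
  static final String SCOPE_INDEXER = "indexer";
  private static final String HASHBANG = "#!";
  private static final String ESCAPED = "_escaped_fragment_=";

  private AjaxUrlSketch() {}

  /** Assumed dispatch: AJAX form for the indexer, crawlable form elsewhere. */
  static String normalize(String url, String scope) {
    return SCOPE_INDEXER.equals(scope) ? toAjaxUrl(url) : toEscapedFragment(url);
  }

  /** http://host/#!/login/ becomes http://host/?_escaped_fragment_=/login/ */
  static String toEscapedFragment(String url) {
    int pos = url.indexOf(HASHBANG);
    if (pos < 0) {
      return url; // not an AJAX URL, leave untouched
    }
    String base = url.substring(0, pos);
    String fragment = url.substring(pos + HASHBANG.length());
    // Append to an existing query string if there is one.
    char sep = base.indexOf('?') >= 0 ? '&' : '?';
    return base + sep + ESCAPED + escape(fragment);
  }

  /** Reverse direction: http://host/?_escaped_fragment_=/login/ to #! form. */
  static String toAjaxUrl(String url) {
    int pos = url.indexOf(ESCAPED);
    if (pos < 1) {
      return url;
    }
    char sep = url.charAt(pos - 1);
    if (sep != '?' && sep != '&') {
      return url; // the parameter name is only part of something else
    }
    String base = url.substring(0, pos - 1);
    return base + HASHBANG + unescape(url.substring(pos + ESCAPED.length()));
  }

  // Percent-encode only the bytes the crawling scheme asks for.
  private static String escape(String fragment) {
    StringBuilder sb = new StringBuilder();
    for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
      int c = b & 0xff;
      if (c <= 0x20 || c == 0x23 || c == 0x25 || c == 0x26 || c == 0x2B || c >= 0x7F) {
        sb.append(String.format("%%%02X", c));
      } else {
        sb.append((char) c);
      }
    }
    return sb.toString();
  }

  // Decode %XX sequences; '+' is left alone, unlike java.net.URLDecoder.
  private static String unescape(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int i = 0;
    while (i < s.length()) {
      if (s.charAt(i) == '%' && i + 2 < s.length()
          && Character.digit(s.charAt(i + 1), 16) >= 0
          && Character.digit(s.charAt(i + 2), 16) >= 0) {
        out.write((Character.digit(s.charAt(i + 1), 16) << 4)
            | Character.digit(s.charAt(i + 2), 16));
        i += 3;
      } else {
        int cp = s.codePointAt(i);
        byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        out.write(raw, 0, raw.length);
        i += Character.charCount(cp);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

Under these assumptions, a link such as http://si.draagle.com/#!/login/ (the
hashbang form is inferred here, the crawl output above only shows the escaped
form) would be stored as http://si.draagle.com/?_escaped_fragment_=/login/ and
turned back into the hashbang URL in the indexer scope.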