[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955
 ] 

Lewis John McGibbney edited comment on NUTCH-1465 at 1/23/14 2:38 PM:
----------------------------------------------------------------------

Hey [~tejasp]. Again, great work! Some minor comments

* Class level Javadoc in SitemapProcessor would be more legible if it used 
format something similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the 
content and merging
 * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 *
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 *  <li>Sitemaps are considered as "remote seed lists". Crawl administrators 
can prepare a
 *     list of sitemap links and get only those sitemap pages. This suits well 
for targeted
 *     crawl of specific hosts.</li>
 *  <li>For open web crawl, it is not possible to track each host and get the 
sitemap links
 *     manually. Nutch would automatically get the sitemaps for all the hosts 
seen in the
 *     crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 *      https://wiki.apache.org/nutch/SitemapFeature </o>
 */
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+          LOG.info("Exception for url " + key.toString() + " : " + 
StringUtils.stringifyException(e)); 
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String 
u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
      if(!u.startsWith("http://";) && !u.startsWith("https://";)) {
        // We received a hostname here so let's make a URL
        url = "http://"; + u + "/";
        isHost = true;
      }
{code}

Thats about it from me mate. This looks like an excellent addition to Nutch 
again. I made a trvial update to the wiki page to drop in some links and 
background to your work on this one.

I should probably add, on local tests this works fine for me. E.g. injecting 
sitemap file and from Hostdb.  


was (Author: lewismc):
Hey [~tejasp]. Again, great work! Some minor comments

* Class level Javadoc in SitemapProcessor would be more legible if it used 
format something similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the 
content and merging
 * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 *
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 *  <li>Sitemaps are considered as "remote seed lists". Crawl administrators 
can prepare a
 *     list of sitemap links and get only those sitemap pages. This suits well 
for targeted
 *     crawl of specific hosts.</li>
 *  <li>For open web crawl, it is not possible to track each host and get the 
sitemap links
 *     manually. Nutch would automatically get the sitemaps for all the hosts 
seen in the
 *     crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 *      https://wiki.apache.org/nutch/SitemapFeature </o>
 */
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+          LOG.info("Exception for url " + key.toString() + " : " + 
StringUtils.stringifyException(e)); 
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String 
u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
      if(!u.startsWith("http://";) && !u.startsWith("https://";)) {
        // We received a hostname here so let's make a URL
        url = "http://"; + u + "/";
        isHost = true;
      }
{code}

Thats about it from me mate. This looks like an excellent addition to Nutch 
again. I made a trvial update to the wiki page to drop in some links and 
background to your work on this one.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to