[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955
]
Lewis John McGibbney edited comment on NUTCH-1465 at 1/23/14 2:38 PM:
----------------------------------------------------------------------
Hey [~tejasp]. Again, great work! Some minor comments
* Class level Javadoc in SitemapProcessor would be more legible if it used
format something similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
* <p>Performs Sitemap processing by fetching sitemap links, parsing the
content and merging
* the urls from Sitemap (with the metadata) with the existing crawldb.</p>
*
* <p>There are two use cases supported in Nutch's Sitemap processing:</p>
* <ol>
* <li>Sitemaps are considered as "remote seed lists". Crawl administrators
can prepare a
* list of sitemap links and get only those sitemap pages. This suits well
for targeted
* crawl of specific hosts.</li>
* <li>For open web crawl, it is not possible to track each host and get the
sitemap links
* manually. Nutch would automatically get the sitemaps for all the hosts
seen in the
* crawls and inject the urls from sitemap to the crawldb.</li>
* </ol>
* <p>For more details see:
* https://wiki.apache.org/nutch/SitemapFeature </o>
*/
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+ LOG.info("Exception for url " + key.toString() + " : " +
StringUtils.stringifyException(e));
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String
u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
if(!u.startsWith("http://") && !u.startsWith("https://")) {
// We received a hostname here so let's make a URL
url = "http://" + u + "/";
isHost = true;
}
{code}
Thats about it from me mate. This looks like an excellent addition to Nutch
again. I made a trvial update to the wiki page to drop in some links and
background to your work on this one.
I should probably add, on local tests this works fine for me. E.g. injecting
sitemap file and from Hostdb.
was (Author: lewismc):
Hey [~tejasp]. Again, great work! Some minor comments
* Class level Javadoc in SitemapProcessor would be more legible if it used
format something similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
* <p>Performs Sitemap processing by fetching sitemap links, parsing the
content and merging
* the urls from Sitemap (with the metadata) with the existing crawldb.</p>
*
* <p>There are two use cases supported in Nutch's Sitemap processing:</p>
* <ol>
* <li>Sitemaps are considered as "remote seed lists". Crawl administrators
can prepare a
* list of sitemap links and get only those sitemap pages. This suits well
for targeted
* crawl of specific hosts.</li>
* <li>For open web crawl, it is not possible to track each host and get the
sitemap links
* manually. Nutch would automatically get the sitemaps for all the hosts
seen in the
* crawls and inject the urls from sitemap to the crawldb.</li>
* </ol>
* <p>For more details see:
* https://wiki.apache.org/nutch/SitemapFeature </o>
*/
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+ LOG.info("Exception for url " + key.toString() + " : " +
StringUtils.stringifyException(e));
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String
u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
if(!u.startsWith("http://") && !u.startsWith("https://")) {
// We received a hostname here so let's make a URL
url = "http://" + u + "/";
isHost = true;
}
{code}
Thats about it from me mate. This looks like an excellent addition to Nutch
again. I made a trvial update to the wiki page to drop in some links and
background to your work on this one.
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)