[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204 ]
Tejas Patil commented on NUTCH-1465:
------------------------------------

Hi [~wastl-nagel],

Thanks a lot for your comments. The first two were straightforward and I agree with them.

Re "hacky way": For hosts from the HostDb, we don't know which protocol they belong to. In the code I check whether http:// is a match and, if that was a bad guess, try https://. I didn't handle the ftp:// and file:/ schemes. By "hacky" I meant this trial-and-error approach of guessing until a suitable match is found and we can create a homepage URL for the host. I have thought about your comment and will have a better (yet still hacky) way in the coming patch.

Re "concurrency": I had thought about this and searched the internet for the internals of MultithreadedMapper. All I could find is that it has an internal thread pool and that each input record is handed over to a thread in this pool to run map() over it. I wrote this code to check whether thread safety is ensured in MultithreadedMapper:

{noformat}
private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
  // Instance field written and read back by foo(); a mismatch means
  // another thread overwrote it in between.
  private String myurl = null;

  @Override
  public void map(Text key, Writable value, Context context)
      throws IOException, InterruptedException {
    if (value instanceof Text) {
      String url = key.toString();
      if (!foo(url).equals(url)) {
        LOG.warn("Race condition found !!!");
      }
    }
  }

  private String foo(String url) {
    myurl = url;
    // Threads with an odd id sleep so that other threads get a chance
    // to overwrite myurl before it is read back.
    if (Thread.currentThread().getId() % 2 == 1) {
      try {
        Thread.sleep(10000);
      } catch (InterruptedException e) {
        LOG.warn(e.getMessage());
      }
    }
    return myurl;
  }
}
{noformat}

I ran it multiple times with the number of threads set to 10, 100, 1000 and 2000 but never hit the race condition. Is the code snippet above a good way to reveal a race condition? It won't be a formal conclusion, more of an experimental one. How do I reach a concrete conclusion on whether MultithreadedMapper is thread safe or not?
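One thing the probe above depends on: it can only report a race if several threads call map() on the same SitemapMapper object. If each worker thread were given its own mapper instance, myurl would never be shared and the warning could never fire, whatever the thread count. Which of the two MultithreadedMapper does is best checked in the Hadoop source. The following is a minimal standalone sketch of that distinction, not Nutch or Hadoop code; the class SharedFieldRaceDemo, the helper names, and the example URLs are hypothetical. It uses latches instead of sleep() so that the interleaving is forced rather than left to chance.

```java
import java.util.concurrent.CountDownLatch;

class SharedFieldRaceDemo {

  /** Mirrors the probe: store the argument in a field, pause, read it back. */
  static class Probe {
    private String myurl = null;

    String foo(String url, Runnable pause) {
      myurl = url;
      pause.run(); // window in which another thread may overwrite myurl
      return myurl;
    }
  }

  /**
   * Runs two threads through foo(). With shareInstance == true both threads
   * use one Probe object; otherwise each thread gets its own.
   * Returns true if the first thread read back a value it did not write.
   */
  static boolean raceObserved(boolean shareInstance) throws InterruptedException {
    Probe first = new Probe();
    Probe second = shareInstance ? first : new Probe();
    CountDownLatch firstWrote = new CountDownLatch(1);
    CountDownLatch secondWrote = new CountDownLatch(1);
    boolean[] mismatch = new boolean[1];

    Thread a = new Thread(() -> {
      String url = "http://a.example/";
      // Pause inside foo() until the other thread has done its write.
      String result = first.foo(url, () -> {
        firstWrote.countDown();
        awaitQuietly(secondWrote);
      });
      mismatch[0] = !result.equals(url);
    });
    Thread b = new Thread(() -> {
      awaitQuietly(firstWrote); // write only after thread a has written
      second.foo("http://b.example/", () -> { });
      secondWrote.countDown();
    });

    a.start();
    b.start();
    a.join();
    b.join();
    return mismatch[0];
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("shared instance:    race observed = " + raceObserved(true));
    System.out.println("separate instances: race observed = " + raceObserved(false));
  }
}
```

With a shared Probe the mismatch shows up on every run; with one Probe per thread it never does. So a probe of this kind that never fires is evidence about instance sharing as much as about thread safety, and it remains experimental evidence either way.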
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)