[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204
 ] 

Tejas Patil commented on NUTCH-1465:
------------------------------------

Hi [~wastl-nagel],
Thanks a lot for your comments. The first two were straightforward and I agree 
with them.

Re "hacky way": For hosts from the HostDb, we don't know which protocol they 
belong to. In the code I was checking whether http:// is a match and, if that 
was a bad guess, trying https:// next. I didn't handle the ftp:// and file:/ 
schemes. By "hacky" I meant this trial-and-error approach of trying schemes 
until a suitable match is found and we can create a homepage URL for the host. 
I have thought about your comment and will have a better (yet still hacky) way 
in the coming patch.
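
To make the trial-and-error idea concrete, here is a minimal sketch in plain 
Java. It is not the code from the patch: the class name HomepageGuesser, the 
guess() method and the isMatch predicate are hypothetical, and what counts as 
a "match" is left to the caller.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

final class HomepageGuesser {
  // Schemes are tried in this order; ftp:// and file:/ are not handled.
  private static final List<String> SCHEMES = List.of("http://", "https://");

  /**
   * Returns the first candidate homepage URL accepted by isMatch, or empty
   * if no scheme produces a match. isMatch is a hypothetical check supplied
   * by the caller.
   */
  static Optional<String> guess(String host, Predicate<String> isMatch) {
    for (String scheme : SCHEMES) {
      String candidate = scheme + host + "/";
      if (isMatch.test(candidate)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Stand-in for the real check: a fixed set of known homepage URLs.
    Set<String> known = Set.of("https://example.org/");
    System.out.println(guess("example.org", known::contains));
    System.out.println(guess("example.com", known::contains));
  }
}
```

Keeping the schemes in one ordered list means ftp:// or file:/ could be added 
later without touching the loop.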

Re "concurrency": I had thought about this and searched the internet for the 
internals of MultithreadedMapper. All I could find is that it has an internal 
thread pool and each input record is handed over to a thread in this pool, 
which runs map() over it. I wrote this code to check whether thread safety is 
ensured in MultithreadedMapper:

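{noformat}
  // Nested in the job class; LOG is assumed to be the enclosing class's logger.
  private static class SitemapMapper
      extends Mapper<Text, Writable, Text, CrawlDatum> {
    // Instance field: shared only if several threads use the same mapper object.
    private String myurl = null;

    @Override
    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        String url = key.toString();
        // If another thread overwrote myurl in the meantime, foo() returns its url.
        if (!foo(url).equals(url)) {
          LOG.warn("Race condition found !!!");
        }
      }
    }

    private String foo(String url) {
      myurl = url;
      // Stall odd-numbered threads for 10 s to widen the window for a race.
      if (Thread.currentThread().getId() % 2 == 1) {
        try {
          Thread.sleep(10000);
        } catch (InterruptedException e) {
          LOG.warn(e.getMessage());
        }
      }
      return myurl;
    }
  }
{noformat}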

I ran it multiple times with the thread count set to 10, 100, 1000 and 2000 
but never hit the race condition. Is the code snippet above a good way to 
reveal a race condition? It won't give a formal conclusion, only an 
experimental one. How do I reach a concrete conclusion on whether 
MultithreadedMapper is thread safe?
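
One thing the check above depends on is whether the threads actually share a 
single mapper object. If each thread were handed its own instance, the field 
myurl would never be shared and the warning could not fire, however many 
threads are used. The following plain-Java sketch (no Hadoop involved) 
illustrates the difference. SharedFieldRaceCheck, Worker and countMismatches 
are hypothetical names, and the field is declared volatile only to make the 
demo's outcome deterministic. It says nothing by itself about how 
MultithreadedMapper behaves.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class SharedFieldRaceCheck {

  /** Hypothetical stand-in for the mapper: writes a field, waits, reads it back. */
  static final class Worker {
    // volatile only so that the outcome of this demo is deterministic
    private volatile String myurl;

    String foo(String url, CountDownLatch allWritten) throws InterruptedException {
      myurl = url;
      allWritten.countDown();
      allWritten.await(); // no thread reads until every thread has written
      return myurl;
    }
  }

  /** Runs one foo() call per thread and counts calls that got another thread's url back. */
  static int countMismatches(int threads, Supplier<Worker> workerForThread) {
    CountDownLatch allWritten = new CountDownLatch(threads);
    AtomicInteger mismatches = new AtomicInteger();
    Thread[] pool = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      String url = "http://example.org/" + i;
      Worker worker = workerForThread.get();
      pool[i] = new Thread(() -> {
        try {
          if (!worker.foo(url, allWritten).equals(url)) {
            mismatches.incrementAndGet();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    for (Thread t : pool) {
      t.start();
    }
    for (Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        throw new IllegalStateException(e);
      }
    }
    return mismatches.get();
  }

  public static void main(String[] args) {
    Worker shared = new Worker();
    System.out.println("shared instance:     " + countMismatches(8, () -> shared));
    System.out.println("instance per thread: " + countMismatches(8, Worker::new));
  }
}
```

With one shared Worker, every call except the one whose write landed last 
reports a mismatch. With one Worker per thread, none do. The latch replaces 
the sleep-based stall, so the interleaving is forced rather than hoped for.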

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
