[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848569#comment-13848569
 ] 

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

Hi [~tejasp]... nice logic.
Some notes here from my observations of the crawler commons code (and possibly 
the sitemap standards as well):
{code}
    /** According to the specs, 50K URLs per Sitemap is the max */
    private static final int MAX_URLS = 50000;

    /** Sitemap docs must be limited to 10MB (10,485,760 bytes) */
    public static int MAX_BYTES_ALLOWED = 10485760;
{code}
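For illustration only, here is a minimal sketch of how a sitemap job might apply these two limits before handing a document to the parser. The class SitemapLimits and its method names are hypothetical and are not part of crawler commons or Nutch; only the two constant values come from the snippet above.

```java
/** Hypothetical helper for illustration; not part of crawler commons or Nutch. */
final class SitemapLimits {

    /** Max URLs per sitemap, value taken from the snippet quoted above. */
    static final int MAX_URLS = 50000;

    /** Max sitemap size in bytes (10MB), value taken from the snippet quoted above. */
    static final int MAX_BYTES_ALLOWED = 10485760;

    private SitemapLimits() {
        // static helpers only
    }

    /** Returns true if the raw sitemap document does not exceed the size limit. */
    static boolean withinSizeLimit(long contentLengthBytes) {
        return contentLengthBytes >= 0 && contentLengthBytes <= MAX_BYTES_ALLOWED;
    }

    /** Returns true if the number of URLs does not exceed the per-sitemap limit. */
    static boolean withinUrlLimit(int urlCount) {
        return urlCount >= 0 && urlCount <= MAX_URLS;
    }

    /** Returns true only if both the size and the URL count are within limits. */
    static boolean isAcceptable(long contentLengthBytes, int urlCount) {
        return withinSizeLimit(contentLengthBytes) && withinUrlLimit(urlCount);
    }
}
```

A job could call SitemapLimits.isAcceptable(bytes, urls) and skip or log any sitemap that fails the check, rather than failing the whole task.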
I would be inclined to agree with your preference to introduce the new 
MR SiteMapMRJob as in *B* above. It generally sounds much cleaner, with 
the changes being less scattered and hence affecting fewer areas of the 
existing codebase.
Also, given that the HostDB has been coming along nicely in 1.X, I think this 
would be an excellent use of the CC SiteMap code.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
