[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path

Jorge Luis Betancourt Gonzalez (JIRA) Wed, 07 Jun 2017 06:33:35 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040881#comment-16040881
 ]


Jorge Luis Betancourt Gonzalez commented on NUTCH-2392:
-------------------------------------------------------

In this case, Nutch detects a relative URL and does the work needed to make it 
"fetchable", which here means turning it into a full (absolute) URL. But the 
issue is not limited to relative URLs: you can also find totally different URLs 
that serve the same content thanks to the "magic" of some CMSs. One case that 
I've found quite often is the presence or absence of {{index.php}} in URLs with 
exactly the same content. I've also seen this issue with OCS (Open Conference 
Systems), https://pkp.sfu.ca/ocs/.
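
To make the "relative URL turned into a full URL" step concrete, here is a 
minimal sketch using only the JDK's java.net.URI. It is not Nutch's own code; 
the class name, the method name and the page http://example.com/Level1/Level2/page.html 
are assumptions made up for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: resolves a link found on a page
// against that page's URL using the JDK's java.net.URI.
class RelativeUrlSketch {

    // Returns the absolute URL for href as seen from the page at base.
    static String resolve(String base, String href) {
        // resolve() merges the two paths; normalize() collapses "." and ".." segments.
        return URI.create(base).resolve(href).normalize().toString();
    }

    public static void main(String[] args) {
        // Assumed page that contains <a href="../../index.html">.
        String page = "http://example.com/Level1/Level2/page.html";
        System.out.println(resolve(page, "../../index.html"));
    }
}
```

Note that java.net.URI happens to collapse the ../ segments while resolving, so 
this sketch yields the short form of the URL. That is a property of this JDK 
class, not a statement about what Nutch's parser or normalizers do.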

Can you provide the exact URLs that you've found? Are both URLs being indexed 
in Solr? Even if both URLs are fetched, they should be deduplicated later on: 
even when the URLs are totally different, they should have the same 
signature/digest, which is calculated from the extracted text; see 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextMD5Signature.java.
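
As a rough illustration of the idea (not the actual TextMD5Signature code), the 
sketch below hashes the extracted text with the JDK's MessageDigest and keeps 
only the first URL seen for each digest. The class name, method names and 
sample text are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of digest-based deduplication; not Nutch's implementation.
class TextDigestSketch {

    // MD5 of the extracted text, as a lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Maps each digest to the first URL that produced it; later URLs with
    // the same text are treated as duplicates and dropped.
    static Map<String, String> firstUrlPerDigest(Map<String, String> textByUrl) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : textByUrl.entrySet()) {
            kept.putIfAbsent(md5Hex(e.getValue()), e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> textByUrl = new LinkedHashMap<>();
        // Both URLs were fetched and parsed; the extracted text is identical.
        textByUrl.put("http://example.com/index.html", "Welcome to example.com");
        textByUrl.put("http://example.com/Level1/Level2/../../index.html", "Welcome to example.com");
        System.out.println(firstUrlPerDigest(textByUrl).values());
    }
}
```

The digest can only be computed once the text is available, so in this sketch 
both URLs still have to be fetched and parsed before the duplicate is detected.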

The problem is that you need to actually fetch and parse both URLs to know 
that they are duplicates; we need to assume that both URLs are different until 
proven otherwise :).

> Get same pages multiple times if URL contains relative path
> -----------------------------------------------------------
>
>                 Key: NUTCH-2392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2392
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.13
>         Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1
>            Reporter: Jayesh Shende
>            Priority: Critical
>              Labels: features
>             Fix For: 1.14
>
>   Original Estimate: 60h
>  Remaining Estimate: 60h
>
> When a website has relative URLs on different pages that point to the same 
> HTML document, Nutch fetches that document more than once. For example, at 
> the first depth I fetched the contents of http://example.com/index.html; a 
> few depths later I got a link (constructed by Nutch from a relative path in 
> an anchor tag), http://example.com/Level1/Level2/../../index.html. In this 
> case Nutch fetches the same HTML document twice, treating the two URLs as 
> different when they are not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
