[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078012#comment-14078012 ]

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-nutchgora #1102 (See [https://builds.apache.org/job/Nutch-nutchgora/1102/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java

use same id when indexing and deleting redirects

Key: NUTCH-1708
URL: https://issues.apache.org/jira/browse/NUTCH-1708
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Fix For: 2.3, 1.9
Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch, NUTCH-1708-trunk-v2.patch

Redirect targets are indexed using representative URL
* in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect pair.
* NutchField url is filled by basic indexing filter with repr URL
* id field used as unique key is filled from url per solrindex-mapping.xml

Deletion of redirects is done in IndexerMapReduce.reduce() by key which is the URL of the redirect source. If the source URL is chosen as repr URL a redirect target may get erroneously deleted.

Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that same URL is deleted and added:
{code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
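The collision described in the issue can be reproduced with a toy model. The sketch below is not Nutch code: `ToyIndex` and `replay` are hypothetical names, the index is a plain map keyed by the `id` field, and the delete is deliberately issued after the add to model the unfavourable ordering (the issue does not state which order the index backend applies).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the id collision between indexing and deleting redirects. */
public class RedirectIdCollision {

    /** Hypothetical stand-in for an index keyed by the unique "id" field. */
    static final class ToyIndex {
        final Map<String, String> docs = new LinkedHashMap<>();
        final List<String> log = new ArrayList<>();

        void add(String id, String content) {
            docs.put(id, content);
            log.add("add " + id);
        }

        void delete(String id) {
            docs.remove(id);
            log.add("delete " + id);
        }
    }

    /**
     * Replays one redirect pair: the target document is added under
     * idForTarget, then the redirect source is deleted by its own URL.
     * Issuing the delete last is an assumption made for illustration.
     */
    static ToyIndex replay(String sourceUrl, String idForTarget) {
        ToyIndex index = new ToyIndex();
        index.add(idForTarget, "content fetched from the redirect target");
        index.delete(sourceUrl);
        return index;
    }

    public static void main(String[] args) {
        String source = "http://wiki.apache.org/nutch";
        String target = "http://wiki.apache.org/nutch/";

        // Case 1: the repr URL equals the source URL and is used as id.
        ToyIndex colliding = replay(source, source);
        System.out.println(colliding.log + " -> remaining ids " + colliding.docs.keySet());

        // Case 2: the id is the URL the document was actually fetched from.
        ToyIndex safe = replay(source, target);
        System.out.println(safe.log + " -> remaining ids " + safe.docs.keySet());
    }
}
```

In the first case the add and the delete address the same id, so the target document is lost; in the second case the delete of the redirect source leaves the target document untouched.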
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078042#comment-14078042 ]

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-trunk #2724 (See [https://builds.apache.org/job/Nutch-trunk/2724/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068374#comment-14068374 ]

Julien Nioche commented on NUTCH-1708:
--

I like the approach and this would be the best way of solving the issue. +1 to commit

Re: field orig in 2.x: sounds like a duplicate of 'id' indeed. Let's do its removal separately.

Thanks
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968270#comment-13968270 ]

Markus Jelsma commented on NUTCH-1708:
--

Yes, that seems reasonable, but we still need to get rid of the repr_url. To me it makes little sense to have such strange behaviour in index-basic.
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968762#comment-13968762 ]

Sebastian Nagel commented on NUTCH-1708:

??need to get rid of the repr_url??
Not necessarily:
# if we use for field 'id' the URL by which a document has been accessed (with any possible status), everything (indexing, updating, deletion) should work -- those IDs are in sync with CrawlDb and may never appear twice.
# then we are free to fill the field 'url' with something prettier: the repr URL (usually shorter), with punycode decoded (without the ugly {{xn--}}), showing letters instead of percent-encoded sequences, etc. Since field 'url' is tokenized, decoding the content makes more sense. If in doubt, we could make it configurable which of these denormalization steps are applied.
# finally, we achieve the same behaviour in 1.x and 2.x
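The proposal in the comment above, to fill 'id' and 'url' independently, could look roughly like the following. This is a sketch under stated assumptions, not the committed patch: `buildDoc` and `prettify` are hypothetical helpers, the document is modelled as a plain map, and only the punycode step is shown (percent-decoding is left out). `java.net.IDN.toUnicode` is the standard JDK call for converting a punycoded host name back to Unicode.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of separate 'id' and 'url' fields; not the Nutch implementation. */
public class SeparateIdAndUrl {

    /** Returns the URL with a punycoded host shown in Unicode; all other parts stay as they are. */
    static String prettify(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return url;
            }
            // Locate the host after the scheme separator and swap in its Unicode form.
            int schemeEnd = url.indexOf("://");
            int hostStart = url.indexOf(host, schemeEnd < 0 ? 0 : schemeEnd + 3);
            if (hostStart < 0) {
                return url;
            }
            return url.substring(0, hostStart)
                    + IDN.toUnicode(host)
                    + url.substring(hostStart + host.length());
        } catch (URISyntaxException e) {
            return url; // unparsable input is shown unchanged
        }
    }

    /**
     * Builds a hypothetical index document: 'id' is the URL the document
     * was fetched from (the CrawlDb key), 'url' is a display form of the
     * representative URL.
     */
    static Map<String, String> buildDoc(String fetchedUrl, String reprUrl) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", fetchedUrl);
        doc.put("url", prettify(reprUrl));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(buildDoc(
                "http://xn--bcher-kva.example/buch/",
                "http://xn--bcher-kva.example/buch"));
    }
}
```

Because deletions are keyed by the fetched URL as well, the 'id' of an added document can no longer coincide with the 'id' of a deleted redirect source, whatever the 'url' field displays.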
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951566#comment-13951566 ]

Sebastian Nagel commented on NUTCH-1708:

Hi [~markus17], another way would be to fill the fields 'id' and 'url' differently, as is done in 2.x. This approach would also allow decoding punycoded IDNs, see NUTCH-1321.
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876288#comment-13876288 ]

Markus Jelsma commented on NUTCH-1708:
--

Hi Sebastian - we've had issues with that before and tracked it down to the representative URL being indexed in index-basic as well. We chose to remove that completely from our custom indexing filter; in my opinion a URL must be indexed by its real URL, not some representative URL. Indexing representative URLs also causes duplicates, which may or may not be removed by Nutch's new deduplication code because the signatures are usually not the same.