[jira] [Resolved] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1708.

    Resolution: Fixed
    Fix Version/s: 2.3

Committed to trunk and 2.x, r1614375.

use same id when indexing and deleting redirects

                Key: NUTCH-1708
                URL: https://issues.apache.org/jira/browse/NUTCH-1708
            Project: Nutch
         Issue Type: Bug
         Components: indexer
  Affects Versions: 1.7
           Reporter: Sebastian Nagel
            Fix For: 2.3, 1.9
        Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch, NUTCH-1708-trunk-v2.patch

Redirect targets are indexed using representative URL
* in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect pair.
* NutchField url is filled by basic indexing filter with repr URL
* id field used as unique key is filled from url per solrindex-mapping.xml

Deletion of redirects is done in IndexerMapReduce.reduce() by key which is the URL of the redirect source. If the source URL is chosen as repr URL a redirect target may get erroneously deleted.

Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that same URL is deleted and added:
{code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
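The id mismatch described in NUTCH-1708 can be illustrated with a small, self-contained sketch. It does not use any Nutch classes: RedirectIdSketch, ToyIndex and indexRedirect are hypothetical names, the toy index is a plain map, and the add-then-delete ordering is an assumption chosen to expose the problem. The "consistent" variant only shows one possible way to use the same id for add and delete; it is not necessarily what the committed patch does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectIdSketch {

    /** Toy stand-in for a search index: unique id -> document content. */
    static final class ToyIndex {
        final Map<String, String> docs = new LinkedHashMap<>();

        void add(String id, String content) {
            docs.put(id, content);
        }

        void delete(String id) {
            docs.remove(id);
        }
    }

    /**
     * Simulates indexing one redirect pair.
     * The target document is added first, then the redirect source is deleted
     * by its own URL (ASSUMED ordering, chosen to expose the problem).
     *
     * @param idFromReprUrl true: id is taken from the representative URL
     *                      (behaviour described in the issue);
     *                      false: id is the URL the target is stored under
     *                      (hypothetical remedy)
     */
    static ToyIndex indexRedirect(String sourceUrl, String targetUrl,
                                  String reprUrl, boolean idFromReprUrl) {
        ToyIndex index = new ToyIndex();
        String targetId = idFromReprUrl ? reprUrl : targetUrl;
        index.add(targetId, "content fetched from " + targetUrl);
        // Deletion always uses the redirect source URL as key.
        index.delete(sourceUrl);
        return index;
    }

    public static void main(String[] args) {
        String source = "http://wiki.apache.org/nutch";
        String target = "http://wiki.apache.org/nutch/";
        // ASSUMPTION: the source URL was chosen as representative URL.
        String repr = source;

        ToyIndex mismatched = indexRedirect(source, target, repr, true);
        ToyIndex consistent = indexRedirect(source, target, repr, false);

        System.out.println("id from repr URL: " + mismatched.docs.keySet());
        System.out.println("id from key URL:  " + consistent.docs.keySet());
    }
}
```

With the seed from the issue, the first variant ends with an empty index because the delete for the redirect source hits the id under which the target was just added. The second variant keeps the target document, since the deleted id and the added id differ.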
[Nutch Wiki] Update of IndexStructure by SebastianNagel
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The IndexStructure page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=20&rev2=21

Comment:
Add field 'id' (cf. NUTCH-1708)

  The index structure formed after indexing is shown below :

- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''|| '''version'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||-2 '''version'''||
  || || || || || || '''1.x''' || '''2.x''' ||
+ || id || YES || Indexed, Un-Tokenized || [[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]] || '''URL''' used as '''ID''' to update and delete documents || X || X ||
  ||boost|| YES || Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. || ? || ? ||
  ||digest || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. || ? || ? ||
  ||lang|| YES || Un-Tokenized|| language-identifier || Add a '''lang''', language field to a document.|| ? || ? ||
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078012#comment-14078012 ]

Hudson commented on NUTCH-1708:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #1102 (See [https://builds.apache.org/job/Nutch-nutchgora/1102/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078042#comment-14078042 ]

Hudson commented on NUTCH-1708:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2724 (See [https://builds.apache.org/job/Nutch-trunk/2724/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java