[jira] [Resolved] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1708.


   Resolution: Fixed
Fix Version/s: 2.3

Committed to trunk and 2.x, r1614375.

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch, 
 NUTCH-1708-trunk-v2.patch


 Redirect targets are indexed using the representative (repr) URL:
 * In Fetcher, the repr URL is determined by URLUtil.chooseRepr() and stored in 
 CrawlDatum (CrawlDb). The repr URL is either the source or the target URL of 
 the redirect pair.
 * The NutchField url is filled by the basic indexing filter with the repr URL.
 * The id field, used as unique key, is filled from url per solrindex-mapping.xml.
 Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is 
 the URL of the redirect source. If the source URL is chosen as the repr URL, a 
 redirect target may get erroneously deleted.
 Test crawl with seed {{http://wiki.apache.org/nutch}}, which redirects to 
 {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates 
 that the same URL is deleted and added:
 {code}
 delete  http://wiki.apache.org/nutch
 add http://wiki.apache.org/nutch
 {code}
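
 The collision described above can be illustrated with a small, self-contained
 sketch. It is not Nutch code: the class RedirectIdSketch, the methods
 operations() and collides(), and the flag useKeyAsId are hypothetical names
 introduced only for illustration. The "fixed" branch assumes, based on the
 issue title, that the added document's id is taken from its own record key
 (the target URL) instead of the repr URL; the actual patch may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class RedirectIdSketch {

    /**
     * Returns the index operations for one redirect pair as "action id" strings.
     * useKeyAsId=false: the added document's id is the representative URL.
     * useKeyAsId=true: the added document's id is its own record key (target URL).
     * Both modes are a simplified model, not the real IndexerMapReduce logic.
     */
    static List<String> operations(String source, String target,
                                   String reprUrl, boolean useKeyAsId) {
        List<String> ops = new ArrayList<>();
        // The redirect source is always deleted by its record key (its own URL).
        ops.add("delete " + source);
        // The redirect target is added under either the repr URL or its own key.
        String id = useKeyAsId ? target : reprUrl;
        ops.add("add " + id);
        return ops;
    }

    /** True if some id is both deleted and added, i.e. the operations collide. */
    static boolean collides(List<String> ops) {
        for (String op : ops) {
            if (op.startsWith("delete ")
                    && ops.contains("add " + op.substring("delete ".length()))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String source = "http://wiki.apache.org/nutch";
        String target = "http://wiki.apache.org/nutch/";
        String repr = source; // assumption: the source URL was chosen as repr URL

        List<String> before = operations(source, target, repr, false);
        List<String> after = operations(source, target, repr, true);
        System.out.println(before + " collides=" + collides(before));
        System.out.println(after + " collides=" + collides(after));
    }
}
```

 With the source chosen as repr URL, the first mode deletes and adds the same
 id, matching the DummyIndexWriter output; in the second mode the two ids
 differ, so deleting the redirect source cannot remove the target document.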



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[Nutch Wiki] Update of IndexStructure by SebastianNagel

2014-07-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The IndexStructure page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=20&rev2=21

Comment:
Add field 'id' (cf. NUTCH-1708)

  
  The index structure formed after indexing is shown below : 
  
- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''|| '''version'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||<-2> '''version'''||
  || || || || || || '''1.x''' || '''2.x''' ||
+ ||  id  ||  YES   ||  Indexed, Un-Tokenized   || [[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]] || '''URL''' used as '''ID''' to update and delete documents || X || X ||
  ||boost|| YES ||  Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. || ?  || ? ||
  ||digest  ||  YES ||  Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||  ?  || ? ||
  ||lang||  YES ||  Un-Tokenized||  language-identifier || Add a '''lang''', language field to a document.||  ?  || ? ||


[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078012#comment-14078012
 ] 

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-nutchgora #1102 (See [https://builds.apache.org/job/Nutch-nutchgora/1102/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java




[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078042#comment-14078042
 ] 

Hudson commented on NUTCH-1708:
---

SUCCESS: Integrated in Nutch-trunk #2724 (See [https://builds.apache.org/job/Nutch-trunk/2724/])
NUTCH-1708 use same id when indexing and deleting redirects (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1614375)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema-solr4.xml
* /nutch/trunk/conf/schema.xml
* /nutch/trunk/conf/solrindex-mapping.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* /nutch/trunk/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java

