[Nutch Wiki] Update of "IndexStructure" by PeterCiuffetti

Apache Wiki Thu, 16 Jul 2015 08:15:29 -0700

The "IndexStructure" page has been changed by PeterCiuffetti:
https://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=22&rev2=23

  = The Index Structure =
+ The index structure formed after indexing is shown below :
+ ||'''Field Name''' ||'''Stored''' ||'''Index''' ||'''Plugin/Class''' ||'''Comment''' ||||'''version''' ||
+ || || || || || ||'''1.x''' ||'''2.x''' ||
+ ||id ||YES ||Indexed, Un-Tokenized ||[[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]] ||'''URL''' used as '''ID''' to update and delete documents ||X ||X ||
+ ||boost ||YES ||Not Indexed ||various scoring plugins ||Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. ||? ||? ||
+ ||digest ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||? ||? ||
+ ||lang ||YES ||Un-Tokenized ||language-identifier ||Add a '''lang''', language field to a document. ||? ||? ||
+ ||segment ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||? ||? ||
+ ||tstamp ||YES ||Tokenized || /!\ NEEDS COMMENT /!\ ||Adds a '''timestamp''' field of the most recent time this document was fetched ||? ||? ||
+ ||cc:license ||YES ||Indexed, Tokenized ||creativecommons ||Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url ||? ||? ||
+ ||cc:meta ||YES ||Indexed, Tokenized ||creativecommons ||Adds the license location as '''cc:meta=xxx''' ||? ||? ||
+ ||cc:type ||YES ||Indexed,Tokenized ||creativecommons ||Adds the work type as '''cc:type=xxx''' ||? ||? ||
+ ||anchor ||NO ||Tokenized ||index-anchor ||Indexing filter that indexes all inbound '''anchor text''' for a document. ||? ||? ||
+ ||title ||YES ||Tokenized ||index-basic ||Adds basic searchable '''title field''' to a document. Also indexed by index-more ||? ||? ||
+ ||host ||NO ||Tokenized ||index-basic ||Adds basic searchable '''hostname field''' to a document. ||? ||? ||
+ ||url ||YES ||Tokenized ||index-basic ||Adds basic searchable '''URL field''' to a document. ||? ||? ||
+ ||content ||NO ||Tokenized ||index-basic ||Adds basic searchable '''content field''' to a document. ||? ||? ||
+ ||lastModified ||NO ||Indexed, Un-Tokenized ||index-more ||Adds some time related meta info in the form of '''last-modified''' if present. ||? ||? ||
+ ||date ||NO ||Indexed, Un-Tokenized ||index-more ||Index date as last-modified, or, if that's not present, uses fetch time. ||? ||? ||
+ ||contentLength ||NO ||Indexed, Un-Tokenized ||index-more || /!\ NEEDS COMMENT /!\ ||? ||? ||
+ ||type ||NO ||Indexed, Un-Tokenized ||index-more ||Adds contentType, primaryType, subType (all mime-types) ||? ||? ||
+ ||primaryType ||NO ||Indexed, Un-Tokenized ||index-more ||primaryType (mime-type) ||? ||? ||
+ ||subType ||NO ||Indexed, Un-Tokenized ||index-more ||subType (mime-type) ||? ||? ||
+ ||tld ||YES ||Un-Tokenized / NotStored(based on conf) ||tld ||Adds a '''top level domain''' field to the document. ||? ||? ||
+ ||subcollection ||YES ||Tokenized ||subcollection ||For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html''' ||? ||? ||
+ ||urlmeta ||NO ||Indexed, Un-Tokenized ||urlmeta ||Adds any specified '''url metadata tags''' to the document in the index. ||? ||? ||
  
- The index structure formed after indexing is shown below : 
  
+ 
+ 
- ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||<-2> '''version'''||
- || || || || || || '''1.x''' || '''2.x''' ||
- ||      id      ||      YES   ||      Indexed, Un-Tokenized   || [[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]]  || '''URL''' used as '''ID''' to update and delete documents || X || X ||
- ||    boost    ||     YES     ||      Not Indexed     || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. || ?  || ? ||
- ||    digest  ||      YES     ||      Not Indexed     || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||  ?  || ? ||
- ||    lang    ||      YES     ||      Un-Tokenized    ||      language-identifier || Add a '''lang''', language field to a document.||  ?  || ? ||
- ||    segment ||              YES     ||      Not Indexed     || org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||  ?  || ? ||
- ||    tstamp  ||      YES     ||      Tokenized       || /!\ NEEDS COMMENT /!\ || Adds a '''timestamp''' field of the most recent time this document was fetched ||  ?  || ? ||
- ||    cc:license      ||      YES     ||      Indexed, Tokenized      || creativecommons || Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url||  ?  || ? ||
- ||    cc:meta ||      YES     ||      Indexed, Tokenized      ||      creativecommons || Adds the license location as '''cc:meta=xxx''' ||  ?  || ? ||
- ||    cc:type ||      YES     ||      Indexed,Tokenized       ||      creativecommons || Adds the work type as '''cc:type=xxx'''||  ?  || ? ||
- ||    anchor  ||      NO      ||      Tokenized       ||      index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.||  ?  || ? ||
- ||    title   ||      YES     ||      Tokenized       ||      index-basic     || Adds basic searchable '''title field''' to a document. Also indexed by index-more ||  ?  || ? ||
- ||    host    ||      NO      ||      Tokenized       ||      index-basic     || Adds basic searchable '''hostname field''' to a document. ||  ?  || ? ||
- ||    url     ||      YES     ||      Tokenized       ||      index-basic || Adds basic searchable '''URL field''' to a document. ||  ?  || ? ||
- ||    content         ||      NO      ||      Tokenized       ||      index-basic     || Adds basic searchable '''content field''' to a document. ||  ?  || ? ||
- ||    lastModified    ||      NO      ||      Indexed, Un-Tokenized   ||      index-more || Adds some time related meta info in the form of '''last-modified''' if present. ||  ?  || ? ||
- ||    date    ||      NO      ||      Indexed, Un-Tokenized   ||      index-more || Index date as last-modified, or, if that's not present, uses fetch time. ||  ?  || ? ||
- ||    contentLength   ||      NO      ||      Indexed, Un-Tokenized   ||      index-more || /!\ NEEDS COMMENT /!\ ||  ?  || ? ||
- ||    type    ||      NO      ||      Indexed, Un-Tokenized   ||      index-more      || Adds contentType, primaryType, subType (all mime-types) ||  ?  || ? ||
- ||    primaryType     ||      NO      ||      Indexed, Un-Tokenized   ||      index-more      ||      primaryType (mime-type) ||  ?  || ? ||
- ||    subType         ||      NO      ||      Indexed, Un-Tokenized   ||      index-more      ||      subType (mime-type) ||  ?  || ? ||
- ||      tld             ||     YES      || Un-Tokenized / NotStored(based on conf) || tld || Adds a '''top level domain''' field to the document.  ||  ?  || ? ||
- ||      subcollection   ||    YES || Tokenized || subcollection || For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html'''   ||  ?  || ? ||
- ||    urlmeta ||      NO      ||      Indexed, Un-Tokenized   ||      urlmeta         || Adds any specified '''url metadata tags''' to the document in the index.||  ?  || ? ||
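
The comment on the date row ("Index date as last-modified, or, if that's not present, uses fetch time") describes a simple fallback. The sketch below illustrates that rule only. It is not the index-more plugin's code: the function name, the parameter names, and the use of `None` for a missing Last-Modified value are all assumptions made for this example.

```python
from typing import Optional


def index_date(last_modified_ms: Optional[int], fetch_time_ms: int) -> int:
    """Pick the value for the 'date' field (hypothetical helper).

    Assumption: timestamps are epoch milliseconds and a missing
    Last-Modified value is represented as None.
    """
    # Prefer the document's own Last-Modified time when it is known.
    if last_modified_ms is not None:
        return last_modified_ms
    # Otherwise fall back to the time the document was fetched.
    return fetch_time_ms
```

Under these assumptions, a document fetched without a Last-Modified header gets its fetch time as its date.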
  ----
- Jira Issues about indexing and IndexingFilterPlugins are 
+ Jira Issues about indexing and IndexingFilterPlugins are
  
   * [[http://issues.apache.org/jira/browse/NUTCH-422|index-extra plugin]]
   * [[https://issues.apache.org/jira/browse/NUTCH-940|index-static plugin]]
-  * [[index-replace plugin]]
+  * [[IndexReplace|index-replace plugin]]
  
  ----
+ The index plugins to include are :
  
- The index plugins to include are : 
+  . index-(anchor | basic | more | static | replace ) | tld | subcollection | creativecommons | language-identifier | urlmeta
  
-  index-(anchor | basic | more | static | replace ) | tld | subcollection | creativecommons | language-identifier | urlmeta
-
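
The plugin list on the page reads like a fragment of a regular expression over plugin names. The sketch below checks which names such an expression would accept. It rests on two assumptions: the spaces in the wiki text are there for readability only, and a name must match the whole expression. The helper names are hypothetical and are not part of Nutch.

```python
import re

# Expression copied from the page. Assumption: the spaces are cosmetic.
WIKI_EXPR = (
    "index-(anchor | basic | more | static | replace ) | tld | subcollection | "
    "creativecommons | language-identifier | urlmeta"
)


def compile_plugin_filter(expr: str) -> re.Pattern:
    """Strip all whitespace from the expression and compile it (hypothetical helper)."""
    return re.compile(re.sub(r"\s+", "", expr))


def is_included(plugin_id: str, pattern: re.Pattern) -> bool:
    """Return True when the whole plugin name matches the expression."""
    return pattern.fullmatch(plugin_id) is not None


if __name__ == "__main__":
    pattern = compile_plugin_filter(WIKI_EXPR)
    for name in ("index-basic", "tld", "index-extra"):
        print(name, is_included(name, pattern))
```

With these assumptions, `index-basic` and `tld` are accepted, while `index-extra` is rejected because it does not appear in the list.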
