Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do
[...]
Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(revision 158818)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(working copy)
@@ -77,8 +77,9 @@
   /** Returns a new token stream for text from the named field. */
   public TokenStream tokenStream(String fieldName, Reader reader) {
     Analyzer analyzer;
-    if ("url".equals(fieldName) || ("anchor".equals(fieldName)))
-      analyzer = ANCHOR_ANALYZER;
+    if ("url".equals(fieldName) || ("anchor".equals(fieldName))
+        || ("host".equals(fieldName)) || ("title".equals(fieldName)))
+      analyzer = ANCHOR_ANALYZER;
     else
       analyzer = CONTENT_ANALYZER;
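
For anyone who wants to check the field routing without building Nutch, here is a minimal standalone sketch of the selection logic in the patch above. It is an illustration only: the class FieldAnalyzerDispatch and the enum AnalyzerKind are hypothetical names that stand in for the real Analyzer instances, so no Lucene jar is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the field check in the patched tokenStream().
// AnalyzerKind is an assumption of this sketch; the real code assigns
// ANCHOR_ANALYZER or CONTENT_ANALYZER instead.
class FieldAnalyzerDispatch {

    enum AnalyzerKind { ANCHOR, CONTENT }

    // Fields the patch routes to the anchor analyzer.
    private static final Set<String> ANCHOR_FIELDS =
        new HashSet<>(Arrays.asList("url", "anchor", "host", "title"));

    // Any other field name (including null) falls through to the content analyzer,
    // matching the else branch in the patch.
    static AnalyzerKind select(String fieldName) {
        return ANCHOR_FIELDS.contains(fieldName) ? AnalyzerKind.ANCHOR : AnalyzerKind.CONTENT;
    }

    public static void main(String[] args) {
        for (String field : new String[] {"url", "anchor", "host", "title", "content"}) {
            System.out.println(field + " -> " + select(field));
        }
    }
}
```

If "url" should use the content analyzer instead, as asked below, the change in this sketch would be to drop "url" from ANCHOR_FIELDS.
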
Could somebody confirm/deny my analysis in the previous post, that the use of ANCHOR_ANALYZER for "url" is wrong, and the CONTENT_ANALYZER should be used instead?
-- 
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
