Bug in existing version of NutchDocumentAnalyzer (Re: [Nutch-dev] Adding title and site to scoring)

Andrzej Bialecki Wed, 23 Mar 2005 13:42:27 -0800

Piotr Kosiorowski wrote:

Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do

[...]

Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java       (revision 158818)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java       (working copy)
@@ -77,8 +77,9 @@
   /** Returns a new token stream for text from the named field. */
   public TokenStream tokenStream(String fieldName, Reader reader) {
     Analyzer analyzer;
-    if ("url".equals(fieldName) || ("anchor".equals(fieldName)))
-      analyzer = ANCHOR_ANALYZER;
+    if ("url".equals(fieldName) || ("anchor".equals(fieldName))
+                || ("host".equals(fieldName)) || ("title".equals(fieldName)))
+            analyzer = ANCHOR_ANALYZER;
     else
       analyzer = CONTENT_ANALYZER;

Could somebody confirm or deny my analysis in the previous post: that using ANCHOR_ANALYZER for the "url" field is wrong, and that CONTENT_ANALYZER should be used instead?
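
To make the routing under discussion easier to see, here is a minimal standalone sketch of the field-to-analyzer dispatch as it would look with the patch applied. It does not use Lucene: the analyzers are represented only by their names as strings, and the class name FieldAnalyzerDispatch, the set ANCHOR_FIELDS, and the method analyzerNameFor are hypothetical names introduced for illustration, not part of Nutch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FieldAnalyzerDispatch {
    // Fields that the patched tokenStream() routes to ANCHOR_ANALYZER.
    // (Hypothetical helper set; the real code uses chained equals() checks.)
    private static final Set<String> ANCHOR_FIELDS =
        new HashSet<>(Arrays.asList("url", "anchor", "host", "title"));

    // Returns the name of the analyzer the patched code would pick.
    // A null field name falls through to CONTENT_ANALYZER, matching the
    // null-safe "literal".equals(fieldName) style used in the patch.
    static String analyzerNameFor(String fieldName) {
        return ANCHOR_FIELDS.contains(fieldName)
            ? "ANCHOR_ANALYZER"
            : "CONTENT_ANALYZER";
    }

    public static void main(String[] args) {
        String[] fields = {"url", "anchor", "host", "title", "content"};
        for (String field : fields) {
            System.out.println(field + " -> " + analyzerNameFor(field));
        }
    }
}
```

In this sketch, trying out the alternative raised above (CONTENT_ANALYZER for "url") would amount to removing "url" from ANCHOR_FIELDS; whether that is the right behaviour is exactly the open question.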

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
