Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do
[...]
Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===================================================================
--- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(revision 158818)
+++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(working copy)
@@ -77,8 +77,9 @@
   /** Returns a new token stream for text from the named field. */
   public TokenStream tokenStream(String fieldName, Reader reader) {
     Analyzer analyzer;
-    if ("url".equals(fieldName) || ("anchor".equals(fieldName)))
-      analyzer = ANCHOR_ANALYZER;
+    if ("url".equals(fieldName) || ("anchor".equals(fieldName))
+        || ("host".equals(fieldName)) || ("title".equals(fieldName)))
+      analyzer = ANCHOR_ANALYZER;
     else
       analyzer = CONTENT_ANALYZER;
Could somebody confirm or deny my analysis in the previous post: that using ANCHOR_ANALYZER for "url" is wrong, and that CONTENT_ANALYZER should be used instead?
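
To make the effect of the patch concrete, here is a minimal stand-alone sketch of the field dispatch after the change. It is only an illustration: the class FieldAnalyzerChoice and the method analyzerFor are hypothetical names, and plain strings stand in for the real analyzer objects so that it runs without Lucene or Nutch on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical stand-in for the field dispatch in
 * NutchDocumentAnalyzer.tokenStream() after the patch.
 * Strings replace the real Analyzer instances (an assumption made
 * so the sketch has no external dependencies).
 */
public class FieldAnalyzerChoice {

    // Fields that the patched condition routes to ANCHOR_ANALYZER.
    private static final Set<String> ANCHOR_FIELDS =
        new HashSet<>(Arrays.asList("url", "anchor", "host", "title"));

    /** Returns the name of the analyzer the patched condition would select. */
    public static String analyzerFor(String fieldName) {
        // Any other field, including null, falls through to CONTENT_ANALYZER,
        // matching the behaviour of the chained "x".equals(fieldName) tests.
        return ANCHOR_FIELDS.contains(fieldName)
            ? "ANCHOR_ANALYZER"
            : "CONTENT_ANALYZER";
    }

    public static void main(String[] args) {
        for (String f : new String[] {"url", "anchor", "host", "title", "content"}) {
            System.out.println(f + " -> " + analyzerFor(f));
        }
    }
}
```

If the analysis about "url" turns out to be right, the only change needed in this sketch would be to drop "url" from ANCHOR_FIELDS.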
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
