[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129 ]
Doğacan Güney commented on NUTCH-443:
-------------------------------------

Andrzej: Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most
> cases this is a HashMap, there is no predictable way to get the first entry
> added to the map ... I propose also that we should use a specialized class
> instead of general-purpose Map; and then we can record in that class which
> entry was the first.

ParseUtil.getFirstParseEntry is only a convenience method used by plugins to get the first (and only) entry in a map when it knows that it will create a one-entry parse map (with the original URL as the key), and it is mostly used in a plugin's main method to get the parse and print it. It is not used in any core part of Nutch.

Anyway, it is very incorrectly named. What we meant was ParseUtil.getOnlyParseEntry. Hmm, that doesn't make any sense either :D

Instead of creating a specialized class, how about removing the method and just using parseMap.get(key)? Most plugins will use it like parseMap.get(content.getUrl()).

> Also, the naming of some methods seems a bit awkward - why should we insist
> that we createSingleEntryMap while we create an ordinary Map, and we don't use
> this special-case knowledge later? I suggest to simply name it createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces
> ObjectWritable when classes are known in advance, and provides a more
> compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

> Changes to MapWritable must preserve old code values, at most adding some new
> ones - otherwise the new code will get confused when working with older data.

I see your point but I am not sure how to fix this. Since this patch removes the FetcherOutput class, what to put there instead of it?
I guess we can just keep FetcherOutput as it is, and update its javadoc to reflect the fact that it is not used anymore.

> CrawlDbReducer, TODO item: this should be the time stored under
> Nutch.FETCH_TIME_KEY, no?
>
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch,
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch,
> NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch,
> parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser
> can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
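
For reference, the getFirstParseEntry versus parseMap.get(key) question from the comment can be sketched in plain Java with no Nutch dependencies. This is only an illustration under stated assumptions: ParseMapLookup, FakeParse, fill, getByUrl and firstKey are hypothetical names that do not appear in the patch, FakeParse stands in for Nutch's Parse, and the URLs are made up. It contrasts taking the "first" entry of a map, for which HashMap promises no particular order, with looking an entry up by its URL key.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseMapLookup {

    // Hypothetical stand-in for Nutch's Parse; it only carries some text.
    static final class FakeParse {
        final String text;

        FakeParse(String text) {
            this.text = text;
        }
    }

    // Adds one entry per URL, in the order given.
    static void fill(Map<String, FakeParse> parseMap, String... urls) {
        for (String url : urls) {
            parseMap.put(url, new FakeParse("parsed text of " + url));
        }
    }

    // Keyed lookup: the result is the same for every Map implementation.
    static FakeParse getByUrl(Map<String, FakeParse> parseMap, String url) {
        return parseMap.get(url);
    }

    // "First entry" lookup: only well defined when the map has a defined
    // iteration order (LinkedHashMap does, HashMap does not promise one).
    static String firstKey(Map<String, FakeParse> parseMap) {
        return parseMap.keySet().iterator().next();
    }

    public static void main(String[] args) {
        String feedUrl = "http://example.com/feed.rss"; // made-up URLs
        String itemUrl = "http://example.com/item/1";

        Map<String, FakeParse> hashed = new HashMap<>();
        Map<String, FakeParse> linked = new LinkedHashMap<>();
        fill(hashed, feedUrl, itemUrl);
        fill(linked, feedUrl, itemUrl);

        // Insertion order is preserved here, so this is the feed URL.
        System.out.println("LinkedHashMap first key: " + firstKey(linked));
        // No ordering guarantee here; the result depends on hashing.
        System.out.println("HashMap first key:       " + firstKey(hashed));
        // Keyed access does not depend on ordering at all.
        System.out.println("By key: " + getByUrl(hashed, feedUrl).text);
    }
}
```

With keyed access the caller never needs to know which entry was added first, which is the reason for dropping the helper in favour of parseMap.get(content.getUrl()). A specialized class that records the first entry would only be needed if some caller cannot know the key in advance.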