[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129
 ] 

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Andrzej:

Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most 
> cases this is a HashMap, there is no predictable way to get the first entry 
> added to the map ... I propose also that we should use a specialized class 
> instead of general-purpose Map; and then we can record in that class which 
> entry was the first. 

ParseUtil.getFirstParseEntry is only a convenience method. A plugin uses it to 
get the first (and only) entry in a map when it knows that it will create a 
one-entry parse map (with the original url as the key). It is mostly used in a 
plugin's main method to get the parse and print it, and it is not used in any 
core part of Nutch. 

Anyway, the name is misleading. What we meant was 
ParseUtil.getOnlyParseEntry. Hmm, that doesn't make any sense either :D

Instead of creating a specialized class, how about removing the method and just 
using parseMap.get(key)? Most plugins will use it like 
parseMap.get(content.getUrl()). 
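
A minimal sketch of the lookup-by-key idea follows. It assumes nothing about 
the real Nutch classes: FakeParse is a hypothetical stand-in for Parse, and the 
URLs are made up. The point is only that get(url) does not depend on the 
iteration order of a HashMap, while "first entry" does.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseMapLookup {
    // Hypothetical stand-in for Nutch's Parse; it only carries some text.
    static final class FakeParse {
        final String text;
        FakeParse(String text) { this.text = text; }
    }

    // Fetch the parse stored under the original URL; null if there is none.
    static FakeParse parseFor(Map<String, FakeParse> parseMap, String url) {
        return parseMap.get(url);
    }

    public static void main(String[] args) {
        Map<String, FakeParse> parseMap = new HashMap<>();
        String url = "http://example.com/feed.rss"; // made-up URL
        parseMap.put(url + "#item1", new FakeParse("first item"));
        parseMap.put(url, new FakeParse("feed page"));

        // Independent of insertion order and of HashMap iteration order.
        System.out.println(parseFor(parseMap, url).text);
    }
}
```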

> Also, the naming of some methods seems a bit awkward - why should we insist 
> that we createSingleEntryMap while we create an ordinary Map, and we don't 
> use this special-case knowledge later? I suggest to simply name it 
> createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces 
> ObjectWritable when classes are known in advance, and provides a more 
> compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in 
Indexer?)

> Changes to MapWritable must preserve old code values, at most adding some new 
> ones - otherwise the new code will get 
> confused when working with older data.

I see your point, but I am not sure how to fix this. Since this patch removes 
the FetcherOutput class, what should take its place there? I guess we can just 
keep FetcherOutput as it is and update its javadoc to reflect the fact that it 
is no longer used.
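
To illustrate the compatibility concern, here is a rough sketch of a 
code-to-class table in which a retired class keeps its code and new classes 
are only appended. ClassCodeTable, the numeric codes, and "SomeNewType" are all 
hypothetical; they are not the actual values or API of Nutch's MapWritable.

```java
import java.util.HashMap;
import java.util.Map;

public class ClassCodeTable {
    private final Map<Integer, String> codeToClass = new HashMap<>();

    // Register a class name under a fixed code; a code is never reassigned.
    void register(int code, String className) {
        if (codeToClass.containsKey(code)) {
            throw new IllegalArgumentException("code already in use: " + code);
        }
        codeToClass.put(code, className);
    }

    // Resolve a code read from (possibly old) data; null if unknown.
    String classFor(int code) {
        return codeToClass.get(code);
    }

    // Hypothetical table: codes below are invented for illustration.
    static ClassCodeTable example() {
        ClassCodeTable t = new ClassCodeTable();
        t.register(1, "FetcherOutput"); // retired, kept so old data still decodes
        t.register(2, "ParseData");     // unchanged code
        t.register(3, "SomeNewType");   // new entries are appended only
        return t;
    }

    public static void main(String[] args) {
        System.out.println(example().classFor(1));
    }
}
```

If the retired entry were dropped and the later codes shifted down, data 
written with the old table would resolve to the wrong class, which is the 
confusion described in the quoted comment.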

> CrawlDbReducer, TODO item: this should be the time stored under 
> Nutch.FETCH_TIME_KEY, no?
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.



> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, 
> parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html



