[
https://issues.apache.org/jira/browse/NUTCH-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273318#comment-13273318
]
Markus Jelsma commented on NUTCH-1366:
--------------------------------------
Cool!
This indeed does not apply to trunk, and I wouldn't know how to implement such a
thing in trunk. Data from various sources must be grouped together. This may
apply to NutchGora for now, but I am not sure it will stay that way. I do not
know a lot about NutchGora, but did you consider the (not yet ported) WebGraph
code or the LinkDB data source? If you're positive you can get those different
data sources in the same mapper, then this is a very good improvement.
> speed up indexing by eliminating the indexreducer
> -------------------------------------------------
>
> Key: NUTCH-1366
> URL: https://issues.apache.org/jira/browse/NUTCH-1366
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Ferdy Galema
> Fix For: nutchgora
>
> Attachments: NUTCH-1366.patch
>
>
> Currently the indexer in Nutchgora consists of both mappers and reducers, but
> the reduce code does not actually iterate over any (grouped/sorted) values.
> It simply indexes individual key/value (String/WebPage) pairs. By moving this
> indexing code to the mapper we can eliminate the reduce step, making the
> indexing job much faster (no more unnecessary spilling to disk/network and
> no CPU wasted on sorting).
> Note that this is not (directly) applicable to trunk, because trunk uses a quite
> different approach: different types of input are combined into a single value
> in the reducer. Although I think it is possible to implement a similar
> optimization there, I am not sure how to do it. So if anyone wants this for
> trunk too, feel free to implement a similar patch.
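
For illustration only, here is a minimal plain-Java sketch of the idea in the
description above. It is not the attached patch and not Nutch code: the class
MapOnlyIndexSketch and its methods are hypothetical, and a TreeMap stands in
for the sort/shuffle that the framework performs before a reduce phase. It
assumes, as the description states, that each key/value pair is indexed on its
own. In a real Hadoop job, a map-only run is normally requested with
Job.setNumReduceTasks(0).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the optimization; every name in this class is hypothetical.
class MapOnlyIndexSketch {

    // Stand-in for building an index document from one key/value pair.
    static String toDocument(String url, String page) {
        return url + " => " + page;
    }

    // Old shape: pairs are sorted by key (the TreeMap models the shuffle/sort),
    // then the "reducer" indexes each pair individually without combining values.
    static List<String> indexWithReducer(Map<String, String> pages) {
        Map<String, String> shuffled = new TreeMap<>(pages);
        List<String> index = new ArrayList<>();
        for (Map.Entry<String, String> e : shuffled.entrySet()) {
            index.add(toDocument(e.getKey(), e.getValue()));
        }
        return index;
    }

    // New shape: the same per-pair indexing runs in the map step,
    // so no sort and no reduce phase are needed.
    static List<String> indexInMapper(Map<String, String> pages) {
        List<String> index = new ArrayList<>();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            index.add(toDocument(e.getKey(), e.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://b.example/", "page B");
        pages.put("http://a.example/", "page A");
        System.out.println(indexWithReducer(pages));
        System.out.println(indexInMapper(pages));
    }
}
```

Both methods produce the same set of documents; only the order differs, which
does not matter when every pair is indexed independently. The sketch does not
cover the trunk case, where values from several inputs have to be combined per
key.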