[
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Patrick Mézard updated NUTCH-2793:
----------------------------------
Comment: was deleted
(was: PR sent here https://github.com/apache/nutch/pull/534)
> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
> Key: NUTCH-2793
> URL: https://issues.apache.org/jira/browse/NUTCH-2793
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.17
> Reporter: Patrick Mézard
> Priority: Major
>
> Reasons are discussed in
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
> and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot
> generate a unique output name per reducer.
> It seems achievable, though, because IndexWriters initializes each writer
> with calls to two open functions:
> * The first passes the general configuration and a "name"
> * The second passes the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile, which returns a
> distinct name per task and so does exactly what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file
> name. This is a breaking change because it modifies the output name, but it
> allows the indexer to work in distributed mode.
> A PR will follow the ticket creation.
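
To illustrate the naming idea in the description above, here is a minimal, self-contained sketch of deriving one output file name per reducer partition. It is not the Nutch or Hadoop code: the class UniqueCsvName, the method uniqueName, the base name "part" and the ".csv" extension are all hypothetical, and the "base-r-NNNNN" pattern is only assumed to resemble what getUniqueFile returns for reduce tasks.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or Hadoop.
class UniqueCsvName {

  /**
   * Builds a per-task file name from a base name, the reducer partition
   * number (zero-padded to five digits) and an extension.
   * The "-r-" marker for reduce tasks is an assumption of this sketch.
   */
  static String uniqueName(String base, int partition, String extension) {
    return String.format(Locale.ROOT, "%s-r-%05d%s", base, partition, extension);
  }

  public static void main(String[] args) {
    // Each reducer gets its own file, so concurrent tasks do not
    // overwrite a single shared "nutch.csv".
    for (int partition = 0; partition < 3; partition++) {
      System.out.println(uniqueName("part", partition, ".csv"));
    }
  }
}
```

With a fixed name such as "nutch.csv", every reducer would target the same file; deriving the name from the task's partition is what keeps the outputs separate.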