Patrick Mézard created NUTCH-2793:
-------------------------------------
Summary: CSV indexer does not work in distributed mode
Key: NUTCH-2793
URL: https://issues.apache.org/jira/browse/NUTCH-2793
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.17
Reporter: Patrick Mézard
The reasons are discussed in
https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
and the comments that follow.
To summarize, the indexer interface is not aware of tasks, so it cannot generate
a unique output name per reducer.
But it seems achievable, because IndexWriters initializes each writer with calls
to two open functions:
* The first passes the general configuration and a "name"
* The second passes the indexer parameters
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214
Fortunately, "name" is generated by calling getUniqueFile, which does exactly
what we want:
[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file name.
This is a breaking change because it modifies the output name, but it allows the
indexer to work in distributed mode.
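
To illustrate why a per-task name avoids collisions, here is a minimal sketch
in plain Java. It does not use the Nutch or Hadoop APIs: the class
UniqueCsvName and its method uniqueName are hypothetical, and the
"base-type-partition" pattern is only assumed to resemble what getUniqueFile
produces. The actual change may look different.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Hadoop.
final class UniqueCsvName {

    // Builds a name such as "part-r-00003.csv" from a base name, a task
    // type character ('m' for map, 'r' for reduce), the task partition
    // number and a file extension. The pattern is an assumption made for
    // this illustration.
    static String uniqueName(String base, char taskType, int partition,
                             String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s",
                base, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // With a fixed name, every reducer would target "nutch.csv".
        // With a per-task name, each reducer gets its own file.
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(uniqueName("part", 'r', partition, ".csv"));
        }
    }
}
```

Each reducer ends up with a distinct output file, at the cost of no longer
producing a single "nutch.csv".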
A PR will follow the creation of this ticket.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)