[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Mézard updated NUTCH-2793:
----------------------------------
    Comment: was deleted

(was: PR sent here https://github.com/apache/nutch/pull/534)

> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>
> Reasons are discussed in
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
> and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with
> calls to two open functions:
> * The first passing the general configuration and a "name"
> * The second passing the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile, which does exactly
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file
> name. This is a breaking change because it modifies the output name, but it
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
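
The naming idea in the proposal above can be illustrated with a minimal, standalone sketch. It is not the code from the PR and does not call Nutch or Hadoop. The class `UniqueCsvName` and the method `uniqueFile` are hypothetical, and the `name-r-00000` layout is an assumption modeled on the familiar `part-r-00000` style of Hadoop output files. The point is only that a name derived from the task partition differs per reducer, whereas a fixed "nutch.csv" would be the same for every reducer.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only: not part of Nutch or Hadoop.
class UniqueCsvName {

    // Builds a per-task file name such as "part-r-00003.csv".
    // taskType: 'm' for a map task, 'r' for a reduce task (assumed convention).
    // partition: the task's partition number, zero-padded to 5 digits.
    static String uniqueFile(String name, char taskType, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s", name, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // Each reducer gets its own output name, so the files cannot collide.
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(uniqueFile("part", 'r', partition, ".csv"));
        }
    }
}
```

With a fixed name, all three reducers in the loop above would target the same file. With the partition in the name, each one writes to a distinct file, which is the property the ticket relies on.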