[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130918#comment-17130918 ]
ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------

sebastian-nagel commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438267053

##########
File path: src/plugin/indexer-csv/README.md
##########
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | "
 maxfieldlength | Max. length of a single field value in characters | 4096
 maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
 header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter

Review comment:
Sorry, I've mixed two points together:
- the description would also need a change, as it will not be a path on the local filesystem when running in distributed mode
- there is also the open question of how to allow two index writers to write output to the filesystem:
  - in local mode this would require that the `outpath` of each writer points to a different directory
  - in distributed mode we could use `outpath` to write into distinct output directories or distinct subdirectories of one job-specific output directory

> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.18
>
>
> Reasons are discussed in
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
> and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with
> calls to 2 open functions:
> * One passing the general configuration and a "name"
> * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile, which does exactly
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file
> name. This is a breaking change because it modifies the output name, but it
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.
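
To make the naming idea concrete, here is a minimal sketch in plain Java, with no Hadoop or Nutch dependencies. It assumes a per-task file name made of a base name, an `-r-` marker and a zero-padded partition number, which only imitates the kind of name `getUniqueFile` is expected to return. The class `CsvOutputNames`, both helper methods and the writer ids are hypothetical and are not part of Nutch.

```java
import java.util.Locale;

public class CsvOutputNames {

  // Hypothetical helper: builds a per-task file name from a base name,
  // the reducer partition number and an extension.
  // Assumption: the "-r-" marker and 5-digit padding imitate Hadoop-style
  // names; the real name would come from getUniqueFile.
  static String uniqueFileName(String baseName, int partition, String extension) {
    return String.format(Locale.ROOT, "%s-r-%05d%s", baseName, partition, extension);
  }

  // Hypothetical helper: gives every index writer its own subdirectory below
  // outpath, so two writers (and all their tasks) never share a file.
  static String outputFile(String outpath, String writerId, int partition) {
    return outpath + "/" + writerId + "/" + uniqueFileName("nutch", partition, ".csv");
  }

  public static void main(String[] args) {
    // Two writers, two reducer tasks each: four distinct output files.
    String[] writerIds = {"csv_writer_1", "csv_writer_2"}; // made-up ids
    for (String writerId : writerIds) {
      for (int partition = 0; partition < 2; partition++) {
        System.out.println(outputFile("csvindexwriter", writerId, partition));
      }
    }
  }
}
```

With this scheme, `outpath` acts as a parent directory rather than a single file location, which would cover both the local and the distributed case raised in the review comment, provided each writer is given a distinct id.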