[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638670#comment-17638670 ]
Paul Escobar commented on NUTCH-2793:
-------------------------------------

There is a problem in local mode:

Issue: The indexer was moved out of the main loop of the bin/crawl script to prevent nutch.csv from being overwritten, but the overwrite still happens: only the last part of the parsed documents ends up in the file.

Cause: If the -Dmapreduce.job.reduces parameter is greater than 1, the indexer runs more than once and overwrites the nutch.csv file.

Workaround: Run the indexer with one reducer: -Dmapreduce.job.reduces=1, or do the same from the bin/crawl script: NUM_TASKS=1

Feasible fix: Change CSVIndexWriter.java:

    if (fs.exists(csvLocalOutFile)) {
      // clean-up
      LOG.warn("Removing existing output path {}", csvLocalOutFile);
      fs.delete(csvLocalOutFile, true);
    }

and try to append data instead of deleting and re-creating the file, at least in local mode.

> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.20
>
>
> Reasons are discussed in
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
> and following comments.
> To summarize, the indexer interface is not aware of tasks so it cannot
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initialize each writer with
> calls to 2 open functions:
> * One passing the general configuration and a "name"
> * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as CSVIndexWriter output file
> name. This is a breaking change because it modifies the output name but
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
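
For illustration, the "append instead of delete" idea from the comment could look roughly like the sketch below. It is only a sketch under stated assumptions: it uses plain java.nio.file rather than the Hadoop FileSystem API that the writer actually goes through, and the class and method names (AppendingCsvSketch, appendRows) are hypothetical and do not exist in Nutch. It also ignores the CSV header, which a real fix would have to avoid writing more than once.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical helper, not part of Nutch: appends CSV rows instead of replacing the file. */
public class AppendingCsvSketch {

  /** Appends the given rows to csvFile, creating the file first if it does not exist. */
  public static void appendRows(Path csvFile, List<String> rows) throws IOException {
    // CREATE + APPEND keeps rows written by an earlier writer instance,
    // unlike a delete-then-create sequence, which discards them.
    Files.write(csvFile, rows, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  public static void main(String[] args) throws IOException {
    Path out = Files.createTempDirectory("csvindexer").resolve("nutch.csv");
    // Two calls stand in for two indexer runs writing to the same local file.
    appendRows(out, List.of("doc1,First title"));
    appendRows(out, List.of("doc2,Second title"));
    System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
  }
}
```

Appending only makes sense when all writers share one local file and run one after another; for distributed mode, the unique per-task file name proposed in the issue description remains the more natural approach.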