Hello Sebastian,

I got it: the CSV indexer needs a single task to run properly. I tested it and it worked. Thank you for the advice.
I tried to comment on this Jira issue, but I don't have access, and unfortunately I don't know how to get it. I think that if a committer changed this part of CSVIndexerWriter.java:

    if (fs.exists(csvLocalOutFile)) {
      // clean-up
      LOG.warn("Removing existing output path {}", csvLocalOutFile);
      fs.delete(csvLocalOutFile, true);
    }

so that it appends data instead of deleting and re-creating the file, the issue would be fixed, at least in local mode.

Thanks again,

On Thu, 24 Nov 2022 at 7:38, Sebastian Nagel (<wastl.na...@googlemail.com>) wrote:

> Hi Paul,
>
> > the indexer was writing the
> > documents info in the file (nutch.csv) twice,
>
> Yes, I see. And now I know what I overlooked:
>
>   .../bin/nutch index -Dmapreduce.job.reduces=2
>
> You need to run the CSV indexer with only a single reducer.
> In order to do so, please pass the option
>   --num-tasks 1
> to the script bin/crawl.
>
> Alternatively, you could change
>   NUM_TASKS=2
> in bin/crawl to
>   NUM_TASKS=1
>
> This is related to why, for now, you can't run the CSV indexer
> in (pseudo)distributed mode, see my previous note:
>
> > A final note: the CSV indexer only works in local mode, it does not yet
> > work in distributed mode (on a real Hadoop cluster). It was initially
> > thought for debugging, not for larger production set up.
>
> The issue is described here:
>   https://issues.apache.org/jira/browse/NUTCH-2793
>
> It's a tough one because a solution requires a change of the IndexWriter
> interface. Index writers are plugins and do not know from which reducer
> task they are run and to which path on a distributed or parallelized system
> they have to write. On Hadoop, writing the output is done in two steps:
> write to a local file and then "commit" the output to the final location
> on the distributed file system.
>
> But yes, I should have another look at this issue, which has been stalled
> for quite some time. Also because it's now clear that you might run into
> issues even in local mode.
>
> Thanks for reporting the issue! If you can, please also comment on the
> Jira issue!
>
> Best,
> Sebastian

--
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
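
P.S. To make the append idea above a bit more concrete, here is a minimal sketch. It is not the Nutch code: it uses plain java.nio instead of the Hadoop FileSystem API that the index writer works with, and the class name AppendSketch and the helper appendRows are hypothetical names made up for the illustration. It also assumes that the writers run one after the other, which I have only observed in local mode.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class AppendSketch {

  // Hypothetical helper (not part of Nutch): create the CSV file if it is
  // missing, otherwise append to it, instead of deleting an existing file.
  static void appendRows(Path csvFile, List<String> rows) throws IOException {
    Files.write(csvFile, rows, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  public static void main(String[] args) throws IOException {
    Path out = Files.createTempFile("nutch", ".csv");
    // Two writers in sequence, standing in for two reduce tasks in local mode.
    appendRows(out, List.of("id1,title1"));
    appendRows(out, List.of("id2,title2"));
    // Rows from both writers are still in the file.
    System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    Files.delete(out);
  }
}
```

Two caveats I can see with this approach: if a header line is written, it would have to be written only once, and a second indexing run would append to the output of the previous run unless the file is removed once before the job starts.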