Hello Sebastian,

I got it: the CSV indexer needs a single task to run properly. I tested it
and it worked. Thank you for the advice.

I tried to comment on the Jira issue, but I don't have access and,
unfortunately, I don't know how to get it.

I think if a committer changed this part of CSVIndexerWriter.java:

if (fs.exists(csvLocalOutFile)) {
   // clean-up
   LOG.warn("Removing existing output path {}", csvLocalOutFile);
   fs.delete(csvLocalOutFile, true);
}

to append data to the existing file instead of deleting and re-creating it,
the issue would be fixed, at least in local mode.
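
A minimal sketch of the append idea follows. It uses only the plain JDK file
API, not the Hadoop FileSystem object (fs) from the snippet above, so it is
an illustration and not a patch for CSVIndexerWriter. The class name
CsvAppendSketch and the method appendRows are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical helper, not part of Nutch: it only illustrates
// "append instead of delete and create" with the plain JDK file API.
final class CsvAppendSketch {

    // Appends the rows to csvFile, creating the file first if it is missing.
    // Rows already in the file are kept.
    static void appendRows(Path csvFile, List<String> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(csvFile,
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND)) {
            for (String row : rows) {
                out.write(row);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("nutch", ".csv");
        appendRows(csv, List.of("id,title"));      // first writer
        appendRows(csv, List.of("doc1,Example"));  // second writer keeps earlier rows
        System.out.println(Files.readAllLines(csv));
    }
}
```

Two caveats with this approach: the CSV header would have to be written only
once, and rows from an earlier indexing run would stay in the file unless it
is removed before the job starts. Whether the same can be done through the
Hadoop FileSystem used by the index writer depends on the file system
implementation, which I have not checked.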

Thanks again,


On Thu, Nov 24, 2022 at 7:38, Sebastian Nagel (<wastl.na...@googlemail.com>)
wrote:

> Hi Paul,
>
>  > the indexer was writing the
>  > documents info in the file (nutch.csv) twice,
>
> Yes, I see. And now I know what I've overlooked:
>
>   .../bin/nutch index -Dmapreduce.job.reduces=2
>
> You need to run the CSV indexer with only a single reducer.
> In order to do so, please pass the option
>    --num-tasks 1
> to the script bin/crawl.
>
> Alternatively, you could change
>    NUM_TASKS=2
> in bin/crawl to
>    NUM_TASKS=1
>
> This is related to why, for now, you can't run the CSV indexer
> in (pseudo-)distributed mode; see my previous note:
>
>  > A final note: the CSV indexer only works in local mode, it does not yet
>  > work in distributed mode (on a real Hadoop cluster). It was initially
>  > intended for debugging, not for a larger production setup.
>
> The issue is described here:
>    https://issues.apache.org/jira/browse/NUTCH-2793
>
> It's a tough one because a solution requires a change to the IndexWriter
> interface. Index writers are plugins and do not know from which reducer
> task they are run or to which path on a distributed or parallelized system
> they have to write. On Hadoop, writing the output is done in two steps:
> write to a local file and then "commit" the output to the final location
> on the distributed file system.
>
> But yes, I should have another look at this issue, which has been stalled
> for quite some time, also because it's now clear that you might run into
> issues even in local mode.
>
> Thanks for reporting the issue! If you can, please also comment on the
> Jira issue!
>
> Best,
> Sebastian
>
>
>
>

-- 
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404
