Hello Paul,

> I tried to comment on this jira issue, but I don't have access,
unfortunately I don't know how to do it.

Due to too much spam, it is no longer possible to create an account
yourself, but we can do that for you if you wish.

Regards,
Markus

On Thu, Nov 24, 2022 at 22:46, Paul Escobar <
paul.escobar.mos...@gmail.com> wrote:

> Hello Sebastian,
>
> I got it: the CSV indexer needs a single reduce task to run properly. I
> tested it and it worked. Thank you for the advice.
>
> I tried to comment on this jira issue, but I don't have access,
> unfortunately I don't know how to do it.
>
> I think if a committer changed CSVIndexerWriter.java:
>
> if (fs.exists(csvLocalOutFile)) {
>    // clean-up
>    LOG.warn("Removing existing output path {}", csvLocalOutFile);
>    fs.delete(csvLocalOutFile, true);
> }
>
> to append data instead of deleting and recreating the file, the issue
> would be fixed, at least in local mode.
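The append-instead-of-truncate behavior Paul suggests can be sketched with plain java.nio as a stand-in for Hadoop's FileSystem API (which offers an analogous fs.append(path), though append support depends on the underlying file system). The class, file, and row names below are hypothetical, not Nutch's actual code:

```java
// Hypothetical sketch of the suggested fix: open the CSV output file in
// append mode instead of deleting and recreating it, so a second reducer
// task does not clobber the rows written by the first.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class CsvAppendSketch {
    static void writeRows(Path csvOut, List<String> rows) throws IOException {
        // CREATE + APPEND: create the file if absent, otherwise keep the
        // existing contents and add new rows at the end (no truncation).
        Files.write(csvOut, rows,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path csvOut = Files.createTempFile("nutch", ".csv");
        writeRows(csvOut, List.of("url1,title1"));  // first reducer task
        writeRows(csvOut, List.of("url2,title2"));  // second reducer task
        // Both tasks' rows survive in the output file.
        System.out.println(Files.readAllLines(csvOut));
    }
}
```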
>
> Thanks again,
>
>
> On Thu, Nov 24, 2022 at 7:38, Sebastian Nagel (<
> wastl.na...@googlemail.com>)
> wrote:
>
> > Hi Paul,
> >
> >  > the indexer was writing the
> >  > documents info in the file (nutch.csv) twice,
> >
> > Yes, I see. And now I know what I've overlooked:
> >
> >   .../bin/nutch index -Dmapreduce.job.reduces=2
> >
> > You need to run the CSV indexer with only a single reducer.
> > In order to do so, please pass the option
> >    --num-tasks 1
> > to the script bin/crawl.
> >
> > Alternatively, you could change
> >    NUM_TASKS=2
> > in bin/crawl to
> >    NUM_TASKS=1
> >
> > This is related to why, for now, you can't run the CSV indexer
> > in (pseudo-)distributed mode; see my previous note:
> >
> >  > A final note: the CSV indexer only works in local mode, it does not
> >  > yet work in distributed mode (on a real Hadoop cluster). It was
> >  > initially intended for debugging, not for larger production setups.
> >
> > The issue is described here:
> >    https://issues.apache.org/jira/browse/NUTCH-2793
> >
> > It's a tough one because a solution requires a change to the IndexWriter
> > interface. Index writers are plugins: they do not know which reducer
> > task they are run from, nor which path on a distributed or parallelized
> > system they have to write to. On Hadoop, writing the output is done in
> > two steps: write to a local file, then "commit" the output to its final
> > location on the distributed file system.
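The two-step "write then commit" pattern Sebastian describes can be illustrated with a minimal, self-contained sketch. This uses plain java.nio instead of Hadoop's actual OutputCommitter machinery, and the directory layout and names are illustrative only:

```java
// Hedged sketch of Hadoop-style two-step output: each task writes to an
// isolated temporary attempt directory, and the output is only moved
// ("committed") to the final location once the task succeeds. With this
// scheme, two reducer tasks never overwrite each other's part files.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TwoStepCommitSketch {
    // Step 1: each reducer task writes to its own temporary path.
    static Path writeTaskOutput(Path workDir, int taskId, String data)
            throws IOException {
        Path attemptDir = workDir.resolve("_temporary")
                                 .resolve("attempt_" + taskId);
        Files.createDirectories(attemptDir);
        Path part = attemptDir.resolve(String.format("part-r-%05d.csv", taskId));
        Files.writeString(part, data);
        return part;
    }

    // Step 2: on task success, move the file into the final output directory.
    static Path commitTaskOutput(Path workDir, Path tempFile) throws IOException {
        Path finalFile = workDir.resolve(tempFile.getFileName());
        return Files.move(tempFile, finalFile);
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("csvindex");
        Path t0 = writeTaskOutput(workDir, 0, "url1,title1\n");
        Path t1 = writeTaskOutput(workDir, 1, "url2,title2\n");
        commitTaskOutput(workDir, t0);
        commitTaskOutput(workDir, t1);
        // Each task's output lands in its own part file; nothing is lost.
    }
}
```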
> >
> > But yes, I should have another look at this issue, which has been
> > stalled for quite some time, especially now that it's clear you might
> > run into issues even in local mode.
> >
> > Thanks for reporting the issue! If you can, please also comment on the
> > Jira issue!
> >
> > Best,
> > Sebastian
> >
>
> --
> Paul Escobar Mossos
> skype: paulescom
> phone: +57 1 3006815404
>
