[GitHub] [nutch] sebastian-nagel commented on pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

GitBox Fri, 12 Jun 2020 01:37:25 -0700


sebastian-nagel commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643149619

Thanks for the exhaustive listing. I have only a few points to add.

> I assumed that NutchAction writes in a given reducer are serialized. It is
not clear to me if this is correct or not.

The MapReduce framework takes care of data serialization and concurrency
issues: the reduce() method is never called concurrently within one task.
Tasks, however, run in parallel, and that's why every task needs its own output
(part-r-nnnnn). The name of the output file (the number nnnnn) is also
determined by the framework. That's important to avoid duplicated output if a
task is restarted.

> writers have distinct output "directories" and the active reducer defines
a unique output file name, so the combination of both should be unique.

I think we need 3 components:
- the task-specific file or folder (part-r-nnnnn)
- a unique folder per index writer (e.g., the name or a path defined in
index-writers.xml)
- a job-specific output location, so that you do not have to change
index-writers.xml when you run another indexing job

In short, the path of a task output might look like:
`job-output/indexer-csv-1/part-r-00000.csv`
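
To illustrate how the three components combine, here is a minimal sketch in plain Java. The class `TaskOutputPaths` and its methods are hypothetical and not part of Nutch or Hadoop; in a real job the partition number would be supplied by the MapReduce framework rather than passed in by hand.

```java
/** Hypothetical helper for illustration only; not part of Nutch or Hadoop. */
final class TaskOutputPaths {

    private TaskOutputPaths() {
    }

    /** Task-specific file name in the usual reducer style, e.g. part-r-00000.csv. */
    static String partName(int partition, String extension) {
        return String.format("part-r-%05d%s", partition, extension);
    }

    /**
     * Joins the job-specific output location, the per-writer folder and the
     * task-specific file name with "/" as separator.
     */
    static String taskOutput(String jobOutput, String writerId, int partition,
            String extension) {
        return String.join("/", jobOutput, writerId, partName(partition, extension));
    }

    public static void main(String[] args) {
        // Assumed values: "job-output" from the command line, "indexer-csv-1"
        // from index-writers.xml, partition 0 from the framework.
        System.out.println(taskOutput("job-output", "indexer-csv-1", 0, ".csv"));
    }
}
```

Because each component varies independently (per job, per writer, per task), two tasks or two writers never end up with the same path.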

> getUniqueFile

You mean
[ParseOutputFormat::getUniqueFile](https://github.com/apache/nutch/blob/59d0d9532abdac409e123ab103a506cfb0df790a/src/java/org/apache/nutch/parse/ParseOutputFormat.java#L120)?
ParseOutputFormat or
[FetcherOutputFormat](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java)
are good examples as they write output into multiple segment subdirectories.
Hence, there are no plugins involved which determine whether there is output
written to the filesystem or not.

> Maybe implement a fallback of the previous method to the new one with a
dummy argument

That could be done using default method implementations in Java 8
interfaces. Note: Nutch now requires Java 8, but it started with Java 1.4 and
there is still a lot of code that does not use Java 8 features.
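
A minimal sketch of such a fallback, assuming a simplified interface: `SketchWriter`, `LegacyWriter`, `FileWriterSketch` and the `open` signatures below are hypothetical and do not reproduce the actual Nutch IndexWriter API. The new two-argument method gets a default implementation that delegates to the previous one, so existing writers compile unchanged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for an index writer interface. */
interface SketchWriter {

    /** Previous method: every existing implementation already provides it. */
    void open(String name);

    /**
     * New method with an additional output location. The default falls back
     * to the previous method and ignores the extra argument.
     */
    default void open(String name, String outputDir) {
        open(name);
    }
}

/** Writer that does not write to the filesystem; implements only the old method. */
final class LegacyWriter implements SketchWriter {
    final List<String> log = new ArrayList<>();

    @Override
    public void open(String name) {
        log.add("open(" + name + ")");
    }
}

/** Writer that needs the output location and therefore overrides the new method. */
final class FileWriterSketch implements SketchWriter {
    final List<String> log = new ArrayList<>();

    @Override
    public void open(String name) {
        // Assumption for this sketch: use the current directory if no location is given.
        open(name, ".");
    }

    @Override
    public void open(String name, String outputDir) {
        log.add("open(" + name + ", " + outputDir + ")");
    }
}
```

Callers can then always use the two-argument form: writers that never touch the filesystem silently ignore the location, while file-based writers such as indexer-csv can make use of it.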

Also, to keep the indexer usable: most index writers (solr, elasticsearch,
etc.) do not write output to the filesystem, so if nothing is written to the
filesystem, IndexingJob should not require an output location as a
command-line argument.
