sebastian-nagel commented on pull request #534: URL: https://github.com/apache/nutch/pull/534#issuecomment-643149619
Thanks for the exhaustive listing. I have only a few points to add.

> I assumed that NutchAction writes in a given reducer are serialized. It is not clear to me if this is correct or not.

The MapReduce framework takes care of data serialization and concurrency issues: the reduce() method is never called concurrently within one task. Tasks do run in parallel, and that's why every task needs its own output (part-r-nnnnn). The name of the output file (the number in nnnnn) is also determined by the framework. That's important to avoid duplicated output if a task is restarted.

> writers have distinct output "directories" and the active reducer defines a unique output file name, so the combination of both should be unique.

I think we need 3 components:
- the task-specific file or folder (part-r-nnnnn)
- a unique folder per index writer (e.g. the name or a path defined in index-writers.xml)
- a job-specific output location (you do not want to change index-writers.xml for that every time you run another indexing job)

In short, the path of a task output might look like: `job-output/indexer-csv-1/part-r-00000.csv`

> getUniqueFile

You mean [ParseOutputFormat::getUniqueFile](https://github.com/apache/nutch/blob/59d0d9532abdac409e123ab103a506cfb0df790a/src/java/org/apache/nutch/parse/ParseOutputFormat.java#L120)? ParseOutputFormat or [FetcherOutputFormat](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java) are good examples as they write output into multiple segment subdirectories. Hence, there are no plugins involved which determine whether output is written to the filesystem or not.

> Maybe implement a fallback of the previous method to the new one with a dummy argument

That could be done using default method implementations in Java 8 interfaces. Note: Nutch now requires Java 8, but it started with Java 1.4 and there is still a lot of code that does not use Java 8 features.
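To illustrate the default-method idea, here is a minimal sketch of one possible reading of it. The interface `SketchWriter`, its `open` methods and both implementing classes are hypothetical names invented for this example; they are not Nutch's actual IndexWriter API. The new two-argument method gets a default body that delegates to the old one, so writers that only implement the old method keep compiling.

```java
public class DefaultMethodFallback {

  /** Hypothetical interface for illustration only; not Nutch's IndexWriter. */
  interface SketchWriter {
    /** Pre-existing method that every current implementation provides. */
    String open(String name);

    /**
     * Newer variant with an extra argument. The default body delegates to
     * the old method and ignores outputDir, so existing writers need no change.
     */
    default String open(String name, String outputDir) {
      return open(name);
    }
  }

  /** Legacy writer: implements only the old method. */
  static class LegacyWriter implements SketchWriter {
    @Override
    public String open(String name) {
      return "legacy:" + name;
    }
  }

  /** Updated writer: overrides the new method to use the output directory. */
  static class FileWriterSketch implements SketchWriter {
    @Override
    public String open(String name) {
      // Old entry point falls back to the new one with a dummy argument.
      return open(name, ".");
    }

    @Override
    public String open(String name, String outputDir) {
      return outputDir + "/" + name;
    }
  }

  public static void main(String[] args) {
    // The caller always uses the new signature; legacy writers still work.
    System.out.println(new LegacyWriter().open("indexer-csv-1", "job-output"));
    System.out.println(new FileWriterSketch().open("indexer-csv-1", "job-output"));
  }
}
```

With this shape the caller can switch to the new signature at once, while each writer plugin is migrated independently.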
Also, to keep the indexer usable: most index writers (Solr, Elasticsearch, etc.) do not write output to the filesystem, so if nothing is written to the filesystem, IndexingJob should not require an output location as a command-line argument.

