[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

ASF GitHub Bot (Jira) Fri, 12 Jun 2020 01:45:10 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134061#comment-17134061
 ]


ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------

sebastian-nagel commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643149619


   Thanks for the exhaustive listing. I have only a few points to add.
   
   > I assumed that NutchAction writes in a given reducer are serialized. It is 
not clear to me if this is correct or not.
   
   The MapReduce framework takes care of data serialization and concurrency 
issues: the reduce() method is never called concurrently within one task. 
Tasks, however, run in parallel, and that's why every task needs its own output 
(part-r-nnnnn). The name of the output file (the number nnnnn) is also 
determined by the framework. That's important to avoid duplicated output if a 
task is restarted.
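   
   As a rough illustration of why a restart does not duplicate output: the 
file name depends only on the task's partition number, not on the attempt. The 
sketch below is plain Java, not Hadoop code. `PartNameSketch` and `partName` 
are hypothetical names, and the `part-r-%05d` pattern is an assumption that 
simply mirrors the part-r-nnnnn naming mentioned above.

```java
import java.util.Locale;

// Hypothetical sketch, not Hadoop's implementation: derives an output file
// name from the reduce task's partition number only.
final class PartNameSketch {

    // The attempt number is accepted but deliberately ignored, so a
    // restarted attempt of the same task maps to the same file name.
    static String partName(int partition, int attempt) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    public static void main(String[] args) {
        System.out.println(partName(0, 0));  // first attempt of task 0
        System.out.println(partName(0, 1));  // restarted attempt, same name
        System.out.println(partName(12, 0)); // a different task, different name
    }
}
```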
   
   > writers have distinct output "directories" and the active reducer defines 
a unique output file name, so the combination of both should be unique.
   
   I think we need 3 components:
   - the task-specific file or folder (part-r-nnnnn)
   - a unique folder per index writer (e.g. the name or a path defined in 
index-writers.xml)
   - a job-specific output location - you do not want to change the 
index-writers.xml for that if you run another indexing job
   
   In short, the path of a task output might look like: 
`job-output/indexer-csv-1/part-r-00000.csv`
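   
   A minimal sketch of how the three components could be combined, using 
only `java.nio.file`. The class and method names (`TaskOutputPathSketch`, 
`taskOutput`) are hypothetical and not part of Nutch; the argument values are 
taken from the example path above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not Nutch code: joins the three path components.
final class TaskOutputPathSketch {

    // jobOutput:    job-specific output location (e.g. a command-line argument)
    // writerFolder: unique folder per index writer (e.g. from index-writers.xml)
    // taskFile:     task-specific file name determined by the framework
    static Path taskOutput(String jobOutput, String writerFolder, String taskFile) {
        return Paths.get(jobOutput, writerFolder, taskFile);
    }

    public static void main(String[] args) {
        Path p = taskOutput("job-output", "indexer-csv-1", "part-r-00000.csv");
        System.out.println(p);
    }
}
```

   Two writers or two tasks then only collide if they share all three 
components.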
   
   > getUniqueFile
   
   You mean 
[ParseOutputFormat::getUniqueFile](https://github.com/apache/nutch/blob/59d0d9532abdac409e123ab103a506cfb0df790a/src/java/org/apache/nutch/parse/ParseOutputFormat.java#L120)?
 ParseOutputFormat or 
[FetcherOutputFormat](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java)
 are good examples as they write output into multiple segment subdirectories. 
However, no plugins are involved there which determine whether output is 
written to the filesystem or not.
   
   > Maybe implement a fallback of the previous method to the new one with a 
dummy argument
   
   That could be done using default method implementations in Java 8 
interfaces. Note: Nutch now requires Java 8, but it started with Java 1.4, and 
a lot of code still does not use Java 8 features.
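   
   A minimal sketch of that idea, assuming a simplified interface. 
`SketchWriter`, `LegacyWriter` and `FileAwareWriter` are hypothetical names and 
the signatures are not those of Nutch's IndexWriter. The new two-argument 
method gets a default body that falls back to the previous one, so existing 
writers keep compiling, while a file-based writer can route the previous 
method to the new one with a dummy argument.

```java
// Hypothetical interface, not Nutch's actual IndexWriter API.
interface SketchWriter {
    // "Previous" method: existing implementations only provide this one.
    void open(String name);

    // "New" method with an additional task-specific output location. The
    // default body ignores the new argument and calls the previous method.
    default void open(String name, String taskOutput) {
        open(name);
    }
}

// An existing writer that knows nothing about the new method.
final class LegacyWriter implements SketchWriter {
    String opened;

    @Override
    public void open(String name) {
        opened = name;
    }
}

// A writer that uses the new argument; the previous method falls back to
// the new one with a dummy location (the current directory).
final class FileAwareWriter implements SketchWriter {
    String opened;

    @Override
    public void open(String name) {
        open(name, ".");
    }

    @Override
    public void open(String name, String taskOutput) {
        opened = taskOutput + "/" + name;
    }
}

final class DefaultMethodSketch {
    public static void main(String[] args) {
        LegacyWriter legacy = new LegacyWriter();
        legacy.open("nutch.csv", "job-output/indexer-csv-1");
        System.out.println(legacy.opened);

        FileAwareWriter aware = new FileAwareWriter();
        aware.open("part-r-00000.csv", "job-output/indexer-csv-1");
        System.out.println(aware.opened);
    }
}
```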
   
   Also, to keep the indexer usable: most index writers (Solr, Elasticsearch, 
etc.) do not write output to the filesystem, so if nothing is written to the 
filesystem, IndexingJob should not require an output location as a 
command-line argument.
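   
   One way such a check could look, as a sketch under stated assumptions: 
the `writesToFilesystem()` flag and the names `OutputArgSketch`, 
`SketchIndexWriter` and `checkOutputArg` are hypothetical and do not exist in 
Nutch. The output location is only demanded if at least one active writer 
declares that it writes files.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the IndexingJob implementation.
final class OutputArgSketch {

    // Stand-in for an index writer that can say whether it writes files.
    interface SketchIndexWriter {
        boolean writesToFilesystem();
    }

    // Throws only if an output location is needed but was not given.
    static void checkOutputArg(List<SketchIndexWriter> writers, String outputDir) {
        boolean needed = writers.stream().anyMatch(SketchIndexWriter::writesToFilesystem);
        if (needed && outputDir == null) {
            throw new IllegalArgumentException(
                "an output location is required by at least one index writer");
        }
    }

    public static void main(String[] args) {
        SketchIndexWriter solrLike = () -> false; // sends documents over the network
        SketchIndexWriter csvLike = () -> true;   // writes files

        checkOutputArg(Arrays.asList(solrLike), null);                 // accepted
        checkOutputArg(Arrays.asList(solrLike, csvLike), "job-output"); // accepted
        System.out.println("both configurations accepted");
    }
}
```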


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.18
>
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot 
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initialize each writer with 
> calls to 2 open functions:
>  * One passing the general configuration and a "name"
>  * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name but 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
