[ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133325#comment-17133325
 ] 

ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------

pmezard commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-642730043


   OK, there is a lot to unpack. Let me try to restate my naive
understanding of the issue, how I intended to fix it, and what is wrong with it.
   
   What I saw is that indexing to CSV worked locally but failed in a distributed
setup (with only 3 nodes). The reduce step emitted errors when writing data to
GCS. In the end, the output contained roughly a third of the expected
dataset. I assumed I had 3 reducers overwriting each other with only one winner
at the end (or a mix of winning output blocks). So I thought "if only I could
map the CSVIndexWriter output file to a reducer to separate each reducer's
output, that would solve the issue".
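
   To make that assumption concrete, here is a toy model of it. This is only a
sketch: a `Map` stands in for the object store (real GCS behavior is more
subtle), `OverwriteSketch` and `runReducers` are made-up names, and the
`part-r-NNNNN` pattern merely imitates the usual per-reducer naming.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a Map stands in for an object store in which a later
// write to the same key replaces the earlier one.
class OverwriteSketch {

  // Simulates numReducers reducers each writing one object.
  // With perReducerName == false they all target the same fixed file name.
  static Map<String, String> runReducers(int numReducers, boolean perReducerName) {
    Map<String, String> store = new HashMap<>();
    for (int r = 0; r < numReducers; r++) {
      String key = perReducerName
          ? String.format("out/part-r-%05d.csv", r)
          : "out/nutch.csv";
      store.put(key, "rows from reducer " + r);
    }
    return store;
  }

  public static void main(String[] args) {
    // Shared name: a single object survives, holding one reducer's rows.
    System.out.println(runReducers(3, false));
    // Per-reducer names: one object per reducer.
    System.out.println(runReducers(3, true));
  }
}
```

   With a shared name only one of the three outputs survives, which would match
the "roughly a third" I observed, assuming the reducers produce similar amounts
of data.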
   
   What you are saying is:
   - In addition to distributed mode requiring the writers' output to be
separated, there is a lot of complexity involved in dealing with eventually
consistent object stores (I will assume that GCS works roughly like S3).
Ideally we would like the reducers' output to appear in the outpath only if the
tasks or jobs succeed, which involves the committer logic you referenced. But in
an initial implementation we may not care about that. If the indexing fails,
partial output will be left in the outpath and such is life (I am OK with that).
   - I assumed that NutchAction writes in a given reducer are serialized. It is
not clear to me whether this is correct.
   - Exchanges introduce additional complexity in that a single NutchAction can
be handled by more than one writer. I do not see what the issue would be,
assuming each writer's output is kept separate. If I have 2 writers with
outpaths set to "out1" and "out2", in a reducer generating a "part-r-0001", the
actions would go to either "out1/part-r-0001" or "out2/part-r-0001", or both. I
do not see overlapping writes there.
   - Same reasoning with `there is also the open question how to allow two 
index writers writing output the filesystem:`. Again I assume the writers have 
distinct output "directories" and the active reducer defines a unique output 
file name, so the combination of both should be unique.
   - About `"name" was just an arbitrary name not a file name indicating a
task-specific output path`, maybe, but does anything prevent it from being used
that way? `getUniqueFile` seems suitable here.
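
   The uniqueness argument in the points above can be checked with a small
sketch. Everything in it is hypothetical: `OutputPathSketch`, `uniqueFile` and
`allPaths` are illustration names, not Nutch or Hadoop APIs, and the name
format only imitates the `part-r-NNNNN` convention rather than reproducing what
`getUniqueFile` actually returns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these helpers are illustrations, not Nutch or Hadoop APIs.
class OutputPathSketch {

  // Stand-in for a task-unique name, imitating the "part-r-NNNNN" convention.
  static String uniqueFile(String name, int partition) {
    return String.format("%s-r-%05d", name, partition);
  }

  // Builds one path per (writer outpath, reducer) pair.
  static List<String> allPaths(List<String> writerOutpaths, int numReducers) {
    List<String> paths = new ArrayList<>();
    for (String outpath : writerOutpaths) {
      for (int r = 0; r < numReducers; r++) {
        paths.add(outpath + "/" + uniqueFile("part", r));
      }
    }
    return paths;
  }

  public static void main(String[] args) {
    List<String> paths = allPaths(List.of("out1", "out2"), 3);
    Set<String> distinct = new HashSet<>(paths);
    System.out.println(paths);
    // No overlap as long as the writers' outpaths differ from each other.
    System.out.println("overlap: " + (distinct.size() != paths.size()));
  }
}
```

   The paths stay distinct only because the outpaths differ; two writers
configured with the same outpath would collide again.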
   
   With this current understanding, I would now implement it like:
   - Kill `open(Configuration cfg, String name)` method, if possible (I haven't 
checked the code yet).
   - Refactor `open(IndexWriterParams params)` into `open(IndexWriterParams 
params, String name)`, where `name` would be the same thing passed to the other 
method.
   - In CSVIndexWriter, use `name` directly and drop the `filename` kludge I 
introduced.
   - Maybe keep the previous method as a fallback that delegates to the new one
with a dummy argument.
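
   Roughly, the plan above could look like the sketch below. It is not the
actual Nutch code: `SketchIndexWriter` and `SketchCsvWriter` are stand-ins, a
plain `Map<String, String>` replaces `IndexWriterParams`, and the "outpath"
key, its "out" default and the "part-dummy" placeholder are assumptions made
for the example.

```java
import java.util.Map;

// Stand-in for the index writer interface; only the open() variants are modeled.
interface SketchIndexWriter {

  // Proposed signature: writer parameters plus the task-specific name.
  void open(Map<String, String> params, String name);

  // Fallback: the previous single-argument call delegates with a dummy name.
  default void open(Map<String, String> params) {
    open(params, "part-dummy");
  }
}

// Stand-in for a CSV writer that derives its output file from `name`.
class SketchCsvWriter implements SketchIndexWriter {

  String outputFile;

  @Override
  public void open(Map<String, String> params, String name) {
    // "outpath" and its default value are assumptions for this sketch.
    String outpath = params.getOrDefault("outpath", "out");
    outputFile = outpath + "/" + name + ".csv";
  }

  public static void main(String[] args) {
    SketchCsvWriter writer = new SketchCsvWriter();
    writer.open(Map.of("outpath", "out1"), "part-r-00002");
    System.out.println(writer.outputFile);
    // The old call shape still works through the fallback.
    writer.open(Map.of("outpath", "out1"));
    System.out.println(writer.outputFile);
  }
}
```

   The default method keeps existing callers compiling, at the cost of giving
them a name that is not unique per task.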
   
   How far off am I?




> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.18
>
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks so it cannot 
> generate unique output name per reducers.
> But it seems achievable because IndexWriters initialize each writer with 
> calls to 2 open functions:
>  * One passing the general configuration and a "name"
>  * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name but 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.



