[ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134198#comment-17134198
 ] 

ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------

pmezard commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643256957


   Thank you for the details.
   
   I wonder whether each index writer's output path could default to its
   identifier in index-writers.xml. The identifier is unique by construction,
   so this would remove one configuration setting. Drawbacks:
   
   - The identifier is arbitrary and may violate the path constraints of
     filesystems or object stores. I am not sure how hard that is to detect,
     or whether it is a real problem in practice.
   - The identifiers are a bit ugly, like `indexer_csv_1`. Maybe we can change
     them, or maybe that is not an issue.
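
On the first drawback, here is a minimal sketch of what detection might look like, assuming a conservative allow-list of characters is acceptable. The class name, method name, and the allowed character set are all hypothetical and not part of Nutch. Real filesystems and object stores each have their own rules, so this only illustrates the idea.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks whether an index writer
// identifier could be used verbatim as a single path component.
final class WriterIdPathCheck {
    // Assumption: a conservative allow-list of letters, digits, '.', '_' and '-'.
    // Actual stores may accept more characters or fewer.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9._-]{1,255}");

    static boolean isSafePathComponent(String id) {
        // "." and ".." match the allow-list but are not usable as output names.
        if (id == null || id.equals(".") || id.equals("..")) {
            return false;
        }
        return SAFE.matcher(id).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"indexer_csv_1", "csv/out", "..", ""}) {
            System.out.println("'" + id + "' -> " + isSafePathComponent(id));
        }
    }
}
```

An identifier that fails such a check could be rejected at startup or replaced by an explicitly configured path.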



> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.18
>
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot 
> generate a unique output name per reducer.
> But it seems achievable, because IndexWriters initializes each writer with 
> calls to two open functions:
>  * The first passes the general configuration and a "name"
>  * The second passes the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name, but it 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.
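
To illustrate why a per-task name avoids collisions between reducers, here is a small standalone sketch. It does not call Hadoop. It only mimics a `name-type-partition` naming pattern similar to what `getUniqueFile` is understood to produce. The exact format, the class name, and the sample inputs are assumptions.

```java
import java.util.Locale;

// Hypothetical illustration, independent of Hadoop and Nutch.
final class TaskFileNames {
    // Builds "<name>-<taskType>-<partition padded to 5 digits><extension>".
    // The pattern is an assumption modelled on Hadoop-style part file names.
    static String uniqueFile(String name, char taskType, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s", name, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // Each reducer has its own partition number, so each gets its own file.
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(uniqueFile("nutch", 'r', partition, ".csv"));
        }
    }
}
```

With a fixed name such as "nutch.csv", every reducer would target the same file. Deriving the name from the task partition gives each reducer a distinct output.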



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
