[jira] [Updated] (NUTCH-2793) CSV indexer does not work in distributed mode

Jira Wed, 10 Jun 2020 05:13:04 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Patrick Mézard updated NUTCH-2793:
----------------------------------
    Description: 
Reasons are discussed in 
https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
 and following comments.

To summarize, the indexer interface is not aware of tasks, so it cannot generate 
a unique output name per reducer.

But it seems achievable because IndexWriters initializes each writer with calls 
to two open functions:
 * The first passes the general configuration and a "name"
 * The second passes the indexer parameters

[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]

Fortunately, "name" is generated by calling getUniqueFile, which does exactly 
what we want:

[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]

I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
name. This is a breaking change because it modifies the output name, but it 
allows the indexer to work in distributed mode.
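
As a rough illustration of the proposal (not the actual patch), the sketch 
below shows how a writer could keep the "name" received in the first open call 
and use it as its output file name in the second. Everything in it is 
hypothetical: the class and method names, the "outpath" parameter with its 
"csvindexwriter" default, and the uniqueName helper, which only imitates the 
"part-r-00000" style of names that Hadoop gives to reduce task outputs.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class UniqueCsvNameSketch {

    // Hypothetical stand-in for a per-task unique name:
    // base name, "r" for a reduce task, zero-padded partition number.
    static String uniqueName(String base, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", base, partition);
    }

    // Hypothetical writer mirroring the two-step initialization described above.
    static class CsvWriterSketch {
        private String name = "nutch.csv"; // fixed name, identical for every task
        private String outputFile;

        // First open call: receives the task-unique name and remembers it.
        void open(String name) {
            this.name = name;
        }

        // Second open call: receives the indexer parameters and builds the path.
        void open(Map<String, String> params) {
            // "outpath" and its default are assumptions made for this sketch.
            String dir = params.getOrDefault("outpath", "csvindexwriter");
            outputFile = dir + "/" + name;
        }

        String outputFile() {
            return outputFile;
        }
    }

    // Runs both open calls for one reduce partition and returns the output path.
    static String outputFileFor(int partition) {
        CsvWriterSketch writer = new CsvWriterSketch();
        writer.open(uniqueName("part", partition));
        writer.open(new HashMap<String, String>());
        return writer.outputFile();
    }

    public static void main(String[] args) {
        // Each simulated reducer gets its own file instead of a shared "nutch.csv".
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(outputFileFor(partition));
        }
    }
}
```

With the fixed name, every reducer would target the same path; in the sketch 
each partition resolves to a distinct file, which is the property the proposal 
relies on.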

A PR will follow the ticket creation.

  was:
Reasons are discussed in 
https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
 and following comments.

To summarize, the indexer interface is not aware of tasks so it cannot generate 
unique output name per reducers.

But it seems achievable because IndexWriters initialize each writer with calls 
to 2 open functions:
 * One passing the general configuration and a "name"
 * The second to pass indexer parameters

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214

Fortunately, "name" is generated by calling getUniqueFile which does exactly 
what we want:

[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]

I propose we use it instead of "nutch.csv" as CSVIndexWriter output file name. 
This is breaking change because it modifies the output name but allows the 
indexer to work in distributed mode.

PR will follow the ticket creation.


> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot 
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with 
> calls to two open functions:
>  * The first passes the general configuration and a "name"
>  * The second passes the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name, but it 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
