[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

Paul Escobar (Jira) Fri, 25 Nov 2022 05:59:05 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638670#comment-17638670
 ]


Paul Escobar commented on NUTCH-2793:
-------------------------------------

There is also a problem in local mode:

Issue: The indexer was moved out of the main loop of the bin/crawl script to 
prevent the file nutch.csv from being overwritten, but it still happens: the 
file contains only the last part of the parsed documents.

Cause: If the -Dmapreduce.job.reduces parameter is greater than 1, the indexer 
runs more than once and each run overwrites the nutch.csv file.

Workaround: Run the indexer with one reducer: -Dmapreduce.job.reduces=1, or 
set the equivalent in the bin/crawl script: NUM_TASKS=1

Feasible fix: Change this block in CSVIndexWriter.java:
 
    if (fs.exists(csvLocalOutFile)) {
      // clean-up
      LOG.warn("Removing existing output path {}", csvLocalOutFile);
      fs.delete(csvLocalOutFile, true);
    }
 
and try to append data to the file instead of deleting and recreating it, at 
least in local mode.
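
To illustrate the append idea, here is a minimal sketch. It is not the plugin 
code: it uses plain java.nio instead of the Hadoop FileSystem API that 
CSVIndexWriter uses, and the class name AppendingCsvSketch, the method 
appendRows and the sample header and rows are all made up for the example. It 
assumes the writers run one after another, as in the local-mode case above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical illustration only: not part of Nutch.
public class AppendingCsvSketch {

    /**
     * Appends rows to csvFile. The header is written only when the file
     * does not exist yet or is empty, so repeated calls extend the file
     * instead of replacing it.
     */
    public static void appendRows(Path csvFile, String header, List<String> rows)
            throws IOException {
        boolean needsHeader = !Files.exists(csvFile) || Files.size(csvFile) == 0;
        // CREATE + APPEND: create the file if missing, otherwise write at the end.
        try (BufferedWriter out = Files.newBufferedWriter(csvFile,
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            if (needsHeader) {
                out.write(header);
                out.newLine();
            }
            for (String row : rows) {
                out.write(row);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("nutch", ".csv");
        // Two calls stand in for two indexer runs writing to the same file.
        appendRows(csv, "id,title", List.of("1,first"));
        appendRows(csv, "id,title", List.of("2,second"));
        Files.readAllLines(csv, StandardCharsets.UTF_8).forEach(System.out::println);
        Files.delete(csv);
    }
}
```

Note the trade-off: once the file is appended to rather than deleted, a 
nutch.csv left over from an earlier crawl is no longer cleaned up 
automatically, and writers running at the same time could interleave their 
rows.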
 

 

> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>             Fix For: 1.20
>
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot 
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initialize each writer with 
> calls to 2 open functions:
>  * One passing the general configuration and a "name"
>  * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name, but it 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
