[
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130783#comment-17130783
]
ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------
sebastian-nagel commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438197577
##########
File path:
src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
##########
@@ -192,7 +189,7 @@ protected int find(String value, int start) {
@Override
public void open(Configuration conf, String name) throws IOException {
Review comment:
This method has been deprecated since the switch to the XML-based index writer
configuration (see
[NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480) and [the wiki
page
IndexWriters](https://cwiki.apache.org/confluence/display/NUTCH/IndexWriters)).
"name" was just an arbitrary name, not a file name indicating a task-specific
output path. We would need a method that takes both the IndexWriterParams and
the output path. This would require changes in the [IndexWriter
interface](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriter.java)
and also the classes
[IndexWriters](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java)
and
[IndexerMapReduce](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java).
I'm also not sure whether the output path alone is sufficient. We'll
eventually need an
[OutputCommitter](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputCommitter.html)
and need to think about situations where we have multiple index writers (e.g., via
[exchanges](https://cwiki.apache.org/confluence/display/NUTCH/Exchanges)). See
also the [discussion in
NUTCH-1541](https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768).
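A minimal sketch of what such a method could look like, written in plain Java so that it runs without Nutch or Hadoop on the classpath. Everything in it is hypothetical: `PathAwareIndexWriter`, `CsvSketchWriter`, the `id,title` header, and the use of `Map<String, String>` and `java.nio.file.Path` as stand-ins for `IndexWriterParams` and the Hadoop `Path`. It is not the existing `IndexWriter` interface, only an illustration of passing both arguments in one call.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical interface, not part of Nutch: open() receives params AND a task-specific path. */
interface PathAwareIndexWriter {
  void open(Map<String, String> params, Path taskOutputPath) throws IOException;
  void write(String line) throws IOException;
  void close() throws IOException;
}

/** Hypothetical CSV writer that writes to the file it is handed instead of a fixed name. */
class CsvSketchWriter implements PathAwareIndexWriter {
  private BufferedWriter out;

  @Override
  public void open(Map<String, String> params, Path taskOutputPath) throws IOException {
    // Create the parent directory if the path has one.
    Path parent = taskOutputPath.toAbsolutePath().getParent();
    if (parent != null) {
      Files.createDirectories(parent);
    }
    out = Files.newBufferedWriter(taskOutputPath, StandardCharsets.UTF_8);
    // "header" mirrors the README parameter; the column names are made up.
    if (Boolean.parseBoolean(params.getOrDefault("header", "true"))) {
      out.write("id,title");
      out.newLine();
    }
  }

  @Override
  public void write(String line) throws IOException {
    out.write(line);
    out.newLine();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }

  public static void main(String[] args) throws IOException {
    Path taskFile = Files.createTempDirectory("csv-sketch").resolve("part-r-00000.csv");
    PathAwareIndexWriter writer = new CsvSketchWriter();
    writer.open(Map.of("header", "true"), taskFile);
    writer.write("1,Example");
    writer.close();
    Files.readAllLines(taskFile, StandardCharsets.UTF_8).forEach(System.out::println);
  }
}
```

The sketch leaves out commit and cleanup handling, which is where an OutputCommitter would come in.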
##########
File path: src/plugin/indexer-csv/README.md
##########
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | "
maxfieldlength | Max. length of a single field value in characters | 4096
maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
Review comment:
Is this still a "local filesystem" path? Possibly we could use the outpath to
overcome the problem of multiple index writers.
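One possible reading of this suggestion, sketched in plain Java: each configured writer gets its own subdirectory, named by its `outpath`, below a shared job output directory. The helper `WriterOutputPaths`, the directory names, and the use of `java.nio.file.Path` in place of the Hadoop `Path` are all assumptions made for illustration, not existing Nutch code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Nutch. */
class WriterOutputPaths {

  /** Places a task's file under jobOutputDir/outpath/. */
  static Path resolve(Path jobOutputDir, String outpath, String taskFileName) {
    return jobOutputDir.resolve(outpath).resolve(taskFileName).normalize();
  }

  public static void main(String[] args) {
    Path job = Paths.get("index-output");
    // Two writers with different outpath values, running in the same reduce task.
    Path first = resolve(job, "csv-full", "part-r-00000");
    Path second = resolve(job, "csv-titles", "part-r-00000");
    System.out.println(first);
    System.out.println(second);
    // The two writers end up with different target files.
    System.out.println(!first.equals(second));
  }
}
```

Note that `Path.resolve` returns an absolute argument unchanged, so in this sketch an absolute `outpath` would bypass the job output directory.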
----------------------------------------------------------------
> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
> Key: NUTCH-2793
> URL: https://issues.apache.org/jira/browse/NUTCH-2793
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.17
> Reporter: Patrick Mézard
> Priority: Major
>
> Reasons are discussed in
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
> and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with
> calls to two open functions:
> * One passing the general configuration and a "name"
> * A second passing the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file
> name. This is a breaking change because it modifies the output name, but it
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.
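
To illustrate why a per-task name avoids collisions between reducers, here is a small stand-alone Java sketch. It does not call Hadoop; it only imitates the `<name>-<m|r>-<partition><extension>` pattern that `getUniqueFile` is understood to produce, with the partition zero-padded to five digits. The class `UniqueTaskFileNames`, its method signature, and the `.csv` extension are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch only: imitates the assumed naming pattern, without using Hadoop. */
class UniqueTaskFileNames {

  /** Builds a file name that differs for every task partition. */
  static String uniqueFile(String name, boolean isMapTask, int partition, String extension) {
    char taskType = isMapTask ? 'm' : 'r';
    // Zero-pad the partition number to five digits, e.g. 3 becomes "00003".
    return String.format(Locale.ROOT, "%s-%c-%05d%s", name, taskType, partition, extension);
  }

  public static void main(String[] args) {
    Set<String> names = new HashSet<>();
    for (int reducer = 0; reducer < 4; reducer++) {
      String file = uniqueFile("part", false, reducer, ".csv");
      names.add(file);
      System.out.println(file);
    }
    // Four reducers yield four distinct names, so no task overwrites another's file.
    System.out.println(names.size() == 4);
  }
}
```

With a fixed name such as "nutch.csv", every reducer would target the same file, which is the collision described above.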