[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2022-11-25 Thread Paul Escobar (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638670#comment-17638670 ]

Paul Escobar commented on NUTCH-2793:
-

There is a problem in local mode:

Issue: The indexer was moved out of the main loop of the bin/crawl script to 
prevent the file nutch.csv from being overwritten, but it still happens: only 
the last part of the parsed documents ends up in the file.

Cause: If the -Dmapreduce.job.reduces parameter is greater than 1, the indexer 
runs more than once and overwrites the nutch.csv file.

Workaround: Run the indexer with one reducer, -Dmapreduce.job.reduces=1, or do 
the same from the bin/crawl script with NUM_TASKS=1.

Possible fix: Change this block in CSVIndexWriter.java:

    if (fs.exists(csvLocalOutFile)) {
      // clean-up
      LOG.warn("Removing existing output path {}", csvLocalOutFile);
      fs.delete(csvLocalOutFile, true);
    }

so that it appends data instead of deleting and re-creating the file, at 
least in local mode.
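
The difference between delete-and-recreate and append can be sketched as follows. This is only an illustration under assumptions: it uses `java.nio.file` on the local filesystem rather than the Hadoop `FileSystem` API that the plugin works with, and `CsvAppendSketch`, `appendRows` and `readRows` are hypothetical names, not Nutch code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical helper, not part of Nutch. */
final class CsvAppendSketch {

  /** Appends rows to csvFile, creating the file first if it is missing. */
  static void appendRows(Path csvFile, List<String> rows) {
    try {
      // CREATE + APPEND keeps rows written by an earlier run of the writer.
      Files.write(csvFile, rows, StandardCharsets.UTF_8,
          StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads all rows back; only used to inspect the result. */
  static List<String> readRows(Path csvFile) {
    try {
      return Files.readAllLines(csvFile, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this approach a second call to `appendRows` on the same file adds rows after the existing ones. A real fix would probably also have to make sure the CSV header row is written only once.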
 

 

> CSV indexer does not work in distributed mode
> -
>
> Key: NUTCH-2793
> URL: https://issues.apache.org/jira/browse/NUTCH-2793
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Affects Versions: 1.17
> Reporter: Patrick Mézard
> Priority: Major
> Fix For: 1.20
>
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot 
> generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with 
> calls to 2 open functions:
>  * One passing the general configuration and a "name"
>  * The second passing the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile, which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name, but it 
> allows the indexer to work in distributed mode.
> PR will follow the ticket creation.
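
To make the idea of a per-reducer output name concrete, here is a minimal sketch. It assumes a naming scheme of the form `prefix-r-NNNNN`, as Hadoop's `FileOutputFormat.getUniqueFile` produces for reduce tasks. The helper `UniqueNameSketch.uniqueFile` is a hypothetical stand-in that only mimics that scheme and does not call Hadoop.

```java
import java.util.Locale;

/** Hypothetical stand-in mimicking a per-reducer naming scheme; not Hadoop or Nutch code. */
final class UniqueNameSketch {

  /** Builds a name such as "part-r-00002.csv" for reduce partition 2. */
  static String uniqueFile(String prefix, int partition, String extension) {
    // Zero-padded partition number keeps names distinct and sortable.
    return String.format(Locale.ROOT, "%s-r-%05d%s", prefix, partition, extension);
  }
}
```

Because the partition number is part of the name, two reducers of the same job would never write to the same file, unlike with a fixed "nutch.csv".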



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-15 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135578#comment-17135578 ]

ASF GitHub Bot commented on NUTCH-2793:
---

sebastian-nagel commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643976911


   Yes, we could use the identifier but as we already have the param "outpath" 
- why not use it? The other constraints should be documented.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> CSV indexer does not work in distributed mode
> -
>
> Key: NUTCH-2793
> URL: https://issues.apache.org/jira/browse/NUTCH-2793
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
> Affects Versions: 1.17
> Reporter: Patrick Mézard
> Priority: Major
> Fix For: 1.18


[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-12 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134198#comment-17134198 ]

ASF GitHub Bot commented on NUTCH-2793:
---

pmezard commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643256957


   Thank you for the details.
   
   One thing I wonder is whether the index-writer-specific path could be 
defined as the writer's identifier in index-writers.xml, at least by default. 
It would be unique by construction, which reduces the amount of configuration 
a bit. Drawbacks:
   
   - The identifier may be arbitrary and not compatible with the path 
constraints of filesystems or object stores. Not sure how hard it would be to 
detect that, or if it is a real problem in practice.
   - Said identifiers are a bit ugly, like `indexer_csv_1`. Maybe we can change 
them. Or maybe that's not an issue.





[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-12 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134061#comment-17134061 ]

ASF GitHub Bot commented on NUTCH-2793:
---

sebastian-nagel commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-643149619


   Thanks for the exhaustive listing. I have only a few points to add.
   
   > I assumed that NutchAction writes in a given reducer are serialized. It is 
not clear to me if this is correct or not.
   
   The MapReduce framework takes care of data serialization and concurrency 
issues: the reduce() method is never called concurrently within one task. 
Tasks run in parallel, and that's why every task needs its own output 
(part-r-n). The name of the output file (the number in n) is also 
determined by the framework, which is important to avoid duplicated output if 
a task is restarted.
   
   > writers have distinct output "directories" and the active reducer defines 
a unique output file name, so the combination of both should be unique.
   
   I think we need 3 components:
   - the task-specific file or folder (part-r-n)
   - a unique folder per index writer (eg. the name or a path defined in 
index-writers.xml)
   - a job-specific output location - you do not want to change the 
index-writers.xml for that if you run another indexing job
   
   In short, the path of a task output might look like: 
`job-output/indexer-csv-1/part-r-0.csv`
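
   As a sketch only, the three components could be combined like this. 
`TaskOutputPathSketch` and its parameter names are hypothetical and plain 
strings stand in for Hadoop `Path` objects:

```java
/** Hypothetical sketch: compose a task output path from the three components. */
final class TaskOutputPathSketch {

  /**
   * @param jobOutput job-specific output location, chosen per indexing job
   * @param writerDir unique folder per index writer, e.g. from index-writers.xml
   * @param taskFile  task-specific file name determined by the framework
   */
  static String taskOutputPath(String jobOutput, String writerDir, String taskFile) {
    return String.join("/", jobOutput, writerDir, taskFile);
  }
}
```

   Calling `taskOutputPath("job-output", "indexer-csv-1", "part-r-0.csv")` 
gives the path shown above. Only the first argument would change between two 
indexing jobs.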
   
   > getUniqueFile
   
   You mean 
[ParseOutputFormat::getUniqueFile](https://github.com/apache/nutch/blob/59d0d9532abdac409e123ab103a506cfb0df790a/src/java/org/apache/nutch/parse/ParseOutputFormat.java#L120)?
 ParseOutputFormat or 
[FetcherOutputFormat](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java)
 are good examples as they write output into multiple segment subdirectories. 
Hence, there are no plugins involved which determine whether there is output 
written to the filesystem or not.
   
   > Maybe implement a fallback of the previous method to the new one with a 
dummy argument
   
   That could be done using default method implementations in Java 8 
interfaces. Note: Nutch now requires Java 8, but it started with Java 1.4 and 
there is still a lot of code not using Java 8 features.
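
   A minimal sketch of that idea, under assumptions: `SketchIndexWriter` is a 
simplified, hypothetical stand-in for the real `IndexWriter` interface, and a 
plain `Map` replaces `IndexWriterParams`. It shows one possible direction 
only, where the new name-aware `open` defaults to the existing one so that 
writers which do not write to the filesystem need no changes.

```java
import java.util.Map;

/** Hypothetical, simplified interface; the real IndexWriter has more methods. */
interface SketchIndexWriter {

  /** Existing-style method: indexer parameters only. */
  void open(Map<String, String> params);

  /** New method with a task-specific name; by default the name is ignored. */
  default void open(Map<String, String> params, String name) {
    open(params);
  }
}

/** A writer that predates the new method and does not override it. */
final class LegacyWriterSketch implements SketchIndexWriter {
  boolean opened;

  @Override
  public void open(Map<String, String> params) {
    opened = true;
  }
}

/** A file-based writer that overrides the new method to use the name. */
final class FileWriterSketch implements SketchIndexWriter {
  String outputName;

  @Override
  public void open(Map<String, String> params) {
    open(params, "nutch.csv"); // dummy name when no task-specific name is given
  }

  @Override
  public void open(Map<String, String> params, String name) {
    outputName = name;
  }
}
```

   The caller can then always use the two-argument `open`: the legacy writer 
silently ignores the name, while the file-based writer picks it up.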
   
   Also, to keep the indexer usable: most index writers (Solr, Elasticsearch, 
etc.) do not write output to the filesystem, so if nothing is written to the 
filesystem, IndexingJob should not require an output location as a 
command-line argument.





[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-11 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133325#comment-17133325 ]

ASF GitHub Bot commented on NUTCH-2793:
---

pmezard commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-642730043


   OK, there is a lot to unpack. Let me try to rephrase what was my naive 
understanding of the issue, how I intended to fix it and what is wrong about it.
   
   What I saw is indexing to csv worked locally but failed in a distributed 
setup (with only 3 nodes). The reduce step emitted errors when writing data to 
GCS. At the end, there was something containing roughly a third of the expected 
dataset. I assumed I had 3 reducers overwriting each other with only one winner 
at the end (or a mix of winning output blocks). So I thought "if only I could 
map the CSVIndexWriter output file to a reducer to separate each reducer 
output, that would solve the issue".
   
   What you are saying is:
   - In addition to distributed mode requiring the writers' output to be 
separated, there is a lot of complexity involved in dealing with eventually 
consistent object stores (I will assume that GCS works roughly like S3). 
Ideally we would like reducer output to appear in the outpath only if the 
tasks or jobs succeed, which involves the committer logic you referenced. But 
in an initial implementation we may not care about that. If the indexing fails, 
partial output will be left in outpath and such is life (I am OK with that).
   - I assumed that NutchAction writes in a given reducer are serialized. It is 
not clear to me if this is correct or not.
   - Exchanges introduce additional complexity in that a single NutchAction can 
be handled by more than one writer. I do not see what the issue would be with 
this, assuming each writer's output is separated. If I have 2 writers with 
outpath set to "out1" and "out2", in a reducer generating a "part-r-0001", the 
actions would go either in "out1/part-r-0001" or "out2/part-r-0001" or both. I 
do not see overlapping writes there.
   - Same reasoning with `there is also the open question how to allow two 
index writers writing output to the filesystem:`. Again I assume the writers 
have distinct output "directories" and the active reducer defines a unique 
output file name, so the combination of both should be unique.
   - About `"name" was just an arbitrary name not a file name indicating a 
task-specific output path`: maybe, but does anything prevent it from being 
used that way? `getUniqueFile` seems suitable here.
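
   The reasoning about exchanges and distinct writer directories can be 
sketched as follows. This is an illustration under assumptions, not Nutch 
code: `ExchangeRoutingSketch.targets` is a hypothetical helper and plain 
strings stand in for paths.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: where one action ends up when several writers handle it. */
final class ExchangeRoutingSketch {

  /** Returns one target file per selected writer for the current reducer. */
  static List<String> targets(List<String> writerOutpaths, String taskFile) {
    List<String> result = new ArrayList<>();
    for (String outpath : writerOutpaths) {
      // Same task file name, but below each writer's own directory.
      result.add(outpath + "/" + taskFile);
    }
    return result;
  }
}
```

   With writers configured for "out1" and "out2" and a reducer producing 
"part-r-0001", the targets are "out1/part-r-0001" and "out2/part-r-0001". They 
never collide as long as the outpaths themselves are distinct.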
   
   With this current understanding, I would now implement it like:
   - Kill `open(Configuration cfg, String name)` method, if possible (I haven't 
checked the code yet).
   - Refactor `open(IndexWriterParams params)` into `open(IndexWriterParams 
params, String name)`, where `name` would be the same thing passed to the other 
method.
   - In CSVIndexWriter, use `name` directly and drop the `filename` kludge I 
introduced.
   - Maybe implement a fallback of the previous method to the new one with a 
dummy argument.
   
   How far am I?




[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130921#comment-17130921 ]

ASF GitHub Bot commented on NUTCH-2793:
---

sebastian-nagel commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-642132183


   > Is it OK to just change the interface and implement what you suggest?
   
   Yes, that's ok. We'll put a notice about a breaking change in the release 
notes, so that users having their own indexer plugin know they have to adapt it.
   
   > Should it be best-effort to keep things BC?
   
   We could try to only extend the IndexWriter interface and provide default 
do-nothing implementations for newly added methods as most index writers do not 
write data to the filesystem.
   





[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130918#comment-17130918 ]

ASF GitHub Bot commented on NUTCH-2793:
---

sebastian-nagel commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438267053



##
File path: src/plugin/indexer-csv/README.md
##
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | 
 maxfieldlength | Max. length of a single field value in characters | 4096
 maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
 header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter

Review comment:
   Sorry, I mixed two points together:
   - the description would also need a change, as it will not be a path on the 
local filesystem if running in distributed mode
   - there is also the open question how to allow two index writers writing 
output to the filesystem:
     - in local mode this would require that each `outpath` points to a 
different directory
     - in distributed mode we could use `outpath` to write into distinct output 
directories or distinct subdirectories of one job-specific output directory







[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130906#comment-17130906 ]

ASF GitHub Bot commented on NUTCH-2793:
---

pmezard commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438258817



##
File path: src/plugin/indexer-csv/README.md
##
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | 
 maxfieldlength | Max. length of a single field value in characters | 4096
 maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
 header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter

Review comment:
   Sorry, I did not understand that, could you elaborate?







[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130905#comment-17130905 ]

ASF GitHub Bot commented on NUTCH-2793:
---

pmezard commented on pull request #534:
URL: https://github.com/apache/nutch/pull/534#issuecomment-642122887


   What are the backward compatibility requirements for nutch? Is it OK to just 
change the interface and implement what you suggest? Should it be best-effort 
to keep things BC? Or is it impossible to implement such a change at this point?





[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130783#comment-17130783 ]

ASF GitHub Bot commented on NUTCH-2793:
---

sebastian-nagel commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438197577



##
File path: src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
##
@@ -192,7 +189,7 @@ protected int find(String value, int start) {
 
   @Override
   public void open(Configuration conf, String name) throws IOException {

Review comment:
   This method has been deprecated since the switch to the XML-based index 
writer configuration (see 
[NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480) and [the wiki 
page 
IndexWriters](https://cwiki.apache.org/confluence/display/NUTCH/IndexWriters)). 
"name" was just an arbitrary name not a file name indicating a task-specific 
output path. We would need a method which takes both the IndexWriterParams and 
the output path. This would require changes in the [IndexWriter 
interface](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriter.java)
 and also in the classes 
[IndexWriters](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java)
 and 
[IndexerMapReduce](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java).
 I'm also not sure whether the output path alone is sufficient. We'll 
eventually need an 
[OutputCommitter](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputCommitter.html)
 and need to think about situations where we have multiple index writers (e.g. 
via [exchanges](https://cwiki.apache.org/confluence/display/NUTCH/Exchanges)). 
See also the [discussion in 
NUTCH-1541](https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768).

##
File path: src/plugin/indexer-csv/README.md
##
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | 
 maxfieldlength | Max. length of a single field value in characters | 4096
 maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
 header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter

Review comment:
   Still "local filesystem"? Possibly we could use the outpath to overcome the 
problem of multiple index writers.






> CSV indexer does not work in distributed mode
> -
>
> Key: NUTCH-2793
> URL: https://issues.apache.org/jira/browse/NUTCH-2793
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.17
>Reporter: Patrick Mézard
>Priority: Major
>
> Reasons are discussed in 
> https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768
>  and following comments.
> To summarize, the indexer interface is not aware of tasks so it cannot 
> generate unique output name per reducers.
> But it seems achievable because IndexWriters initialize each writer with 
> calls to 2 open functions:
>  * One passing the general configuration and a "name"
>  * The second to pass indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly 
> what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file 
> name. This is a breaking change because it modifies the output name, but it 
> allows the indexer to work in distributed mode.
> A PR will follow the ticket creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Patrick Mézard (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130607#comment-17130607
 ] 

Patrick Mézard commented on NUTCH-2793:
---

PR sent here: https://github.com/apache/nutch/pull/534






[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130593#comment-17130593
 ] 

ASF GitHub Bot commented on NUTCH-2793:
---

pmezard opened a new pull request #534:
URL: https://github.com/apache/nutch/pull/534


   Before the change, the output file name was hard-coded to "nutch.csv".
   When running in distributed mode, multiple reducers would clobber each
   other's output.
   
   After the change, the file name is taken from the first open(cfg, name)
   initialization call, where name is a unique file name generated by
   IndexerOutputFormat, which is derived from Hadoop's FileOutputFormat. The
   CSV files are now named like part-r-000xx.
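
   To make the difference concrete, here is a minimal, self-contained sketch
   of the two naming schemes. It is not the code from the pull request and it
   does not use Hadoop or Nutch classes: the class UniqueCsvNames and the
   helpers uniqueName and write are hypothetical, and the temporary directory
   only stands in for the configured outpath. The part-r-NNNNN pattern is
   assumed to match what FileOutputFormat produces for reducer output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class UniqueCsvNames {

  // Hypothetical helper mimicking the part-r-NNNNN pattern. In Nutch the name
  // is handed to the writer by IndexerOutputFormat, not computed like this.
  static String uniqueName(int reducer) {
    return String.format("part-r-%05d", reducer);
  }

  // Hypothetical writer: creates or truncates outDir/name and writes the rows.
  static Path write(Path outDir, String name, List<String> rows) throws IOException {
    Files.createDirectories(outDir);
    return Files.write(outDir.resolve(name), rows);
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the configured outpath.
    Path outDir = Files.createTempDirectory("csvindexwriter");

    // Old behaviour: both "reducers" write nutch.csv, so the second call
    // truncates the file and only its rows survive.
    write(outDir, "nutch.csv", List.of("doc-a"));
    write(outDir, "nutch.csv", List.of("doc-b"));
    System.out.println("nutch.csv: " + Files.readAllLines(outDir.resolve("nutch.csv")));

    // New behaviour: each "reducer" writes to its own file, nothing is lost.
    for (int reducer = 0; reducer < 2; reducer++) {
      Path p = write(outDir, uniqueName(reducer), List.of("doc-" + (char) ('a' + reducer)));
      System.out.println(p.getFileName() + ": " + Files.readAllLines(p));
    }
  }
}
```

   The trade-off is the one named in the ticket: consumers that expected a
   single nutch.csv now have to read one CSV file per reducer.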







