[ 
https://issues.apache.org/jira/browse/NUTCH-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582971#comment-16582971
 ] 

ASF GitHub Bot commented on NUTCH-2635:
---------------------------------------

sebastian-nagel opened a new pull request #376: NUTCH-2635 Generator writes 
unneeded temporary output
URL: https://github.com/apache/nutch/pull/376
 
 
   - output is written to MultipleOutputs, skip context.write(...)
   - fix comment wrapping

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Generator writes unneeded temporary output
> ------------------------------------------
>
>                 Key: NUTCH-2635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2635
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> Generator writes the temporary output of the Selector job/step twice (see 
> [line 
> 516|https://github.com/apache/nutch/blob/branch-1.15/src/java/org/apache/nutch/crawl/Generator.java#L516]).
>  This is not a big issue when generating small fetch lists, but it may be one 
> when working on large data. The temporary output looks like:
> {noformat}
> % tree -h generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> |-- [4.0K]  fetchlist-1
> |   `-- [ 25M]  part-r-00000
> `-- [ 77M]  part-r-00000
> 1 directory, 2 files
> % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000: ASCII text
> % file 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000: 
> Apache Hadoop Sequence file version 6
> {noformat}
> The unneeded output is plain text, which explains its larger size compared to 
> the Hadoop Sequence file.
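
A minimal sketch for spotting the duplicate output in a listing like the one above, assuming the generate-temp directory is on the local filesystem. It sums the sizes of part-r-* files found directly under the directory, outside the fetchlist-* subdirectories. The class and method names (TempOutputCheck, topLevelPartBytes) are hypothetical and not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: reports how many bytes of
// part-r-* files sit directly under a generate-temp-* directory.
public class TempOutputCheck {

    // Sums the sizes of regular files matching part-r-* directly under dir.
    // Subdirectories such as fetchlist-1 are not descended into.
    static long topLevelPartBytes(Path dir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-r-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TempOutputCheck <generate-temp-dir>");
            return;
        }
        long bytes = topLevelPartBytes(Paths.get(args[0]));
        System.out.println("top-level part-r-* bytes: " + bytes);
    }
}
```

A non-zero result for a Generator temp directory would correspond to the extra plain-text part-r-00000 file described in the issue.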



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
