[jira] [Commented] (NUTCH-1028) Log parser keys

Julien Nioche (JIRA) Tue, 09 Aug 2011 05:08:00 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081585#comment-13081585
 ]


Julien Nioche commented on NUTCH-1028:
--------------------------------------

In distributed mode you can follow the progress of the parsing on the Hadoop 
job tracker, which also has a counter for the number of documents successfully 
parsed.
Of course you won't see that in local mode, but if you want to parse large 
segments then using (pseudo-)distributed mode would be a good option anyway: 
you'd potentially have more than one mapper or reducer at work and would 
leverage the multiple cores that your machine certainly has, not to mention 
the benefits of replicated storage etc.
Your suggestion is good though, and it makes sense to have consistent 
behaviour across the various jobs.

> Log parser keys
> ---------------
>
>                 Key: NUTCH-1028
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1028
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Trivial
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1028-1.4-1.patch
>
>
> The parser can take ages (many hours) to complete. During this time the only 
> output is an error or warning if it's unable to parse something (which is 
> very common). Sometimes the parser can run for several hours without any 
> output: this is scary. I propose to add a LOG.info to the mapper and write 
> the key when parsing, similar to the fetcher.
> Thoughts?
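
The proposal quoted above (log the key of each record as it is parsed) can be
sketched roughly as follows. This is a minimal, standalone illustration and
not the attached patch: the class name ParseProgressSketch, the stand-in
parse() method, the log message wording and the example URLs are all
assumptions, and it uses java.util.logging on a plain Map rather than the
Hadoop mapper API so that it runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch: log each key before parsing it, and count successes.
public class ParseProgressSketch {
    private static final Logger LOG =
        Logger.getLogger(ParseProgressSketch.class.getName());

    /** Stand-in for a real parser (assumption): fails on empty content. */
    static boolean parse(String content) {
        return content != null && !content.isEmpty();
    }

    /** Logs every key before parsing it; returns the number of successes. */
    static int parseAll(Map<String, String> records) {
        int parsed = 0;
        for (Map.Entry<String, String> record : records.entrySet()) {
            // The progress line the issue asks for: one INFO entry per key.
            LOG.info("Parsing: " + record.getKey());
            if (parse(record.getValue())) {
                parsed++;
            } else {
                // Failures were already the only visible output before.
                LOG.warning("Unable to parse: " + record.getKey());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("http://example.com/a", "<html>a</html>");
        records.put("http://example.com/b", "");
        records.put("http://example.com/c", "<html>c</html>");
        System.out.println("parsed=" + parseAll(records));
    }
}
```

In a real job the same two ideas would live in the map method: an INFO line
per key gives visible progress in local mode, while the success count plays
the role of the job tracker counter mentioned in the comment.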

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
