[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

Chris A. Mattmann (JIRA) Mon, 06 Apr 2015 18:19:55 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482356#comment-14482356 ]


Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

Lewis, I'm interested: what specific problems can it lead to? Asitang has been 
trying this out and it has been working on small, focused crawls, as long as we 
have downstream support, e.g., checking for crawl_parse, and then also, in 
the later jobs, making sure that there is (or isn't) a parse_text folder in the 
segments. How would that not solve the issues?

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>
> If you run ./bin/crawl with a parsing fetcher, e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows:
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.
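
The check proposed in the issue description could look roughly like the sketch below. It is only an illustration, not the actual change made to bin/crawl: the helper names segment_is_parsed and parse_action are hypothetical, and it assumes the segment directories sit on the local filesystem. On HDFS the directory test would need something like `hadoop fs -test -d` instead.

```shell
#!/usr/bin/env bash
# Sketch only: assumes segments are stored on the local filesystem.
# segment_is_parsed and parse_action are hypothetical helper names,
# not functions that exist in the Nutch crawl script.

# Succeed (exit status 0) if the segment already has a crawl_parse
# directory, i.e. the fetcher has already parsed it.
segment_is_parsed() {
  local segment="$1"
  [ -d "$segment/crawl_parse" ]
}

# Print "skip" if the segment is already parsed, "parse" otherwise,
# so the caller can decide whether to run the separate parse step.
parse_action() {
  local segment="$1"
  if segment_is_parsed "$segment"; then
    echo "skip"
  else
    echo "parse"
  fi
}
```

A crawl loop could call parse_action on each segment path and run the parse job only when it prints "parse", which would avoid the "Segment already parsed!" exception when fetcher.parse is true.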



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
