[ 
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355
 ] 

Dennis Kubes commented on NUTCH-247:
------------------------------------

We could move the code to a utility class, but if we want it to run before 
the job is submitted, it would still need to be called from Fetcher.

I thought we wanted a single rule: nobody can run the fetcher, whether or not 
they are fetching over http, unless they first set an agent name. Too many 
people were using the default fetcher settings, so hundreds of Nutch bots were 
crawling the internet and some people were getting upset about the amount of 
traffic being generated.  I agree that this setting sometimes isn't needed, but 
I don't see how this would grow into multiple sets of checking rules.  If we 
move the check to an http-specific class, such as the HTML parser, then the 
job would already have begun by the time it runs.

The simplest solution IMO, if we want to stop all fetching unless an agent 
name is set, is a single method inside Fetcher that performs that one check 
and fails with an error.  A more elaborate solution would be some type of 
extension point that supports pre-job configuration checking.
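
A minimal sketch of the single check described above, under stated 
assumptions: a plain java.util.Properties object stands in for the real job 
configuration, the class and method names (AgentNameCheck, requireAgentName) 
are hypothetical, and the property key 'http.agent.name' is assumed rather 
than taken from this thread. It is not the attached patch, only an 
illustration of failing before any job is submitted.

```java
import java.util.Properties;

// Hypothetical helper, not part of Nutch: a plain Properties object stands in
// for the real job configuration.
class AgentNameCheck {

    // Assumed property key for the advertised agent name.
    static final String AGENT_NAME_KEY = "http.agent.name";

    // Throws if no agent name is configured, so the caller can abort
    // before any job is submitted.
    static void requireAgentName(Properties conf) {
        String agentName = conf.getProperty(AGENT_NAME_KEY);
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "Fetcher: no agent name set in '" + AGENT_NAME_KEY
                + "'; refusing to fetch.");
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        try {
            requireAgentName(conf);          // no agent name set yet
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        conf.setProperty(AGENT_NAME_KEY, "MyCrawler");
        requireAgentName(conf);              // passes without throwing
        System.out.println("Agent name set; job could be submitted.");
    }
}
```

Calling such a method as the first step of the fetcher's entry point would 
stop the run before job submission, regardless of which protocol is being 
fetched.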

> robot parser to restrict.
> -------------------------
>
>                 Key: NUTCH-247
>                 URL: https://issues.apache.org/jira/browse/NUTCH-247
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Dennis Kubes
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: agent-names.patch, agent-names3.patch.txt
>
>
> If the agent name and the robots agents are not properly configured, the 
> Robot rule parser uses LOG.severe to log the problem but also fixes it. 
> Later on, the fetcher thread checks for severe errors and stops if there is one.
> RobotRulesParser:
> if (agents.size() == 0) {
>       agents.add(agentName);
>       LOG.severe("No agents listed in 'http.robots.agents' property!");
>     } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
>       agents.add(0, agentName);
>       LOG.severe("Agent we advertise (" + agentName
>                  + ") not listed first in 'http.robots.agents' property!");
>     }
> Fetcher.FetcherThread:
>  if (LogFormatter.hasLoggedSevere())     // something bad happened
>             break;  
> I suggest using warn or something similar instead of severe to log this 
> problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

