Ryan Stokes created SOLR-12026:
----------------------------------

Summary: SimplePostTool with robots.txt
Key: SOLR-12026
URL: https://issues.apache.org/jira/browse/SOLR-12026
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SimplePostTool
Affects Versions: 7.2
Reporter: Ryan Stokes

[First issue here, apologies in advance for missteps.]

Three things which could improve working with robots.txt:

# When fetching the robots.txt that corresponds to a URL, the port is ignored, so the request defaults to :80. If nothing is listening on :80, the robots.txt fetch fails and the page is fetched anyway. isDisallowedByRobots() could include url.getPort() when constructing strRobot. This helps when testing your robots.txt on a non-standard port, such as during development.
# Disallow directives are applied regardless of User-agent. parseRobotsTxt() could let rules addressed specifically to SimplePostTool-crawler override a Disallow. This would help when indexing your own site, which you have explicitly allowed SimplePostTool to index. I don't know if that's a good practice, but it would help in testing.
# The User-agent header sent when fetching robots.txt is not "SimplePostTool-crawler" but shows as "Java/<version>". The code in readPageFromUrl() that sets the header correctly could be reused in isDisallowedByRobots().
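
The three suggestions could be sketched roughly as follows. This is an illustration only, not the actual SimplePostTool source: the class RobotsSketch and the helpers robotsUrlFor(), parseDisallows() and applyUserAgent() are hypothetical names, and the rule that a group naming the crawler replaces the "*" group is an assumption about the desired behavior.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these helpers exist in SimplePostTool.
class RobotsSketch {
    static final String USER_AGENT = "SimplePostTool-crawler";

    // Point 1: keep an explicit port when building the robots.txt URL.
    static String robotsUrlFor(String pageUrl) {
        try {
            URL url = new URL(pageUrl);
            int port = url.getPort(); // -1 when the URL has no explicit port
            String hostPort = (port == -1) ? url.getHost() : url.getHost() + ":" + port;
            return url.getProtocol() + "://" + hostPort + "/robots.txt";
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Point 2: collect Disallow paths per User-agent group. If a group names
    // the given agent, its rules are used instead of the "*" group (assumed policy).
    static List<String> parseDisallows(String robotsTxt, String agent) {
        List<String> general = new ArrayList<>();
        List<String> specific = new ArrayList<>();
        boolean matchesAll = false, matchesAgent = false;
        boolean sawAgentGroup = false, inAgentLines = false;
        for (String raw : robotsTxt.split("\\R")) {
            int hash = raw.indexOf('#'); // drop trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A User-agent line after a rule line starts a new group.
                if (!inAgentLines) { matchesAll = false; matchesAgent = false; }
                inAgentLines = true;
                if (value.equals("*")) matchesAll = true;
                else if (value.equalsIgnoreCase(agent)) { matchesAgent = true; sawAgentGroup = true; }
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (matchesAll) general.add(value);
                    if (matchesAgent) specific.add(value);
                }
            }
        }
        return sawAgentGroup ? specific : general;
    }

    // Point 3: send the crawler's User-Agent instead of the default "Java/<version>".
    static void applyUserAgent(URLConnection conn) {
        conn.setRequestProperty("User-Agent", USER_AGENT);
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("http://localhost:8983/docs/index.html"));
        String robots = "User-agent: *\nDisallow: /\n\nUser-agent: " + USER_AGENT + "\nDisallow:\n";
        System.out.println(parseDisallows(robots, USER_AGENT));
    }
}
```

With a robots.txt that disallows everything for "*" but contains an empty Disallow for SimplePostTool-crawler, parseDisallows() returns an empty list for the crawler, so a site that explicitly allows the tool would be indexed while other agents remain blocked.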