Ryan Stokes created SOLR-12026:
----------------------------------

             Summary: Improve SimplePostTool robots.txt handling
                 Key: SOLR-12026
                 URL: https://issues.apache.org/jira/browse/SOLR-12026
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SimplePostTool
    Affects Versions: 7.2
            Reporter: Ryan Stokes


[First issue here, apologies in advance for missteps.]

Three changes that could improve how SimplePostTool works with robots.txt:
 # When fetching the robots.txt that corresponds to a URL, the port is 
ignored and the request defaults to :80.  If nothing is listening on :80, 
the robots.txt fetch fails and the page is fetched anyway.  
isDisallowedByRobots() could include url.getPort() when constructing 
strRobot (see the first sketch below).  This helps when testing your 
robots.txt on a non-standard port, such as during development.
 # Disallow directives are applied regardless of User-agent.  
parseRobotsTxt() could honor User-agent groups, so that a group addressed 
to SimplePostTool-crawler overrides a global Disallow (see the second 
sketch below).  This would help when indexing your own site, which you've 
explicitly allowed to be indexed by SimplePostTool.  I don't know whether 
that's a good practice, but it would help in testing.
 # The User-agent header sent when fetching robots.txt is not 
"SimplePostTool-crawler" but shows up as "Java/<version>".  The code that 
sets the header correctly in readPageFromUrl() could be reused in 
isDisallowedByRobots() (see the third sketch below).
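
For the first item, a minimal sketch of how isDisallowedByRobots() could 
preserve a non-default port when building strRobot.  The class and method 
names here are hypothetical; url.getPort() returns -1 when the URL carries 
no explicit port, in which case nothing needs to be appended:

{code:java}
import java.net.MalformedURLException;
import java.net.URL;

public class RobotsUrlSketch {
  // Build the robots.txt URL for a page, preserving any explicit port.
  // url.getPort() returns -1 when no port is present, so the scheme's
  // default applies and nothing is appended.
  static String robotsUrlFor(URL url) {
    int port = url.getPort();
    return url.getProtocol() + "://" + url.getHost()
        + (port == -1 ? "" : ":" + port) + "/robots.txt";
  }

  public static void main(String[] args) throws MalformedURLException {
    // A development server on a non-standard port now resolves correctly:
    System.out.println(robotsUrlFor(new URL("http://localhost:8983/docs/index.html")));
    // prints http://localhost:8983/robots.txt
    // instead of http://localhost/robots.txt
  }
}
{code}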
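
For the second item, a sketch of a User-agent-aware parseRobotsTxt(), 
assuming the current implementation collects every Disallow line into a 
single list.  It gathers rules for the global "*" group and for a 
SimplePostTool-crawler group separately and returns the crawler-specific 
group when one exists, following the robots.txt convention that the most 
specific matching group wins.  This is a sketch of the proposed behavior, 
not the tool's current code:

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsTxtSketch {
  // Collect Disallow rules per User-agent group and prefer the group
  // addressed to SimplePostTool-crawler over the global "*" group.
  static List<String> parseRobotsTxt(InputStream is) throws IOException {
    List<String> global = new ArrayList<>();
    List<String> specific = new ArrayList<>();
    boolean inGlobal = false, inSpecific = false, sawSpecific = false;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(is, StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        int hash = line.indexOf('#');            // strip comments
        if (hash >= 0) line = line.substring(0, hash);
        line = line.trim();
        if (line.isEmpty()) continue;
        String lower = line.toLowerCase(Locale.ROOT);
        if (lower.startsWith("user-agent:")) {
          String agent = lower.substring("user-agent:".length()).trim();
          inGlobal = agent.equals("*");
          inSpecific = agent.startsWith("simpleposttool-crawler");
          sawSpecific |= inSpecific;
        } else if (lower.startsWith("disallow:")) {
          String path = line.substring("disallow:".length()).trim();
          if (path.isEmpty()) continue;          // empty Disallow = allow all
          if (inSpecific) specific.add(path);
          else if (inGlobal) global.add(path);
        }
      }
    }
    // The most specific matching group wins; a present-but-empty specific
    // group means nothing is disallowed for this crawler.
    return sawSpecific ? specific : global;
  }
}
{code}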
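
For the third item, a sketch of fetching robots.txt through an 
HttpURLConnection so the crawler's User-Agent header can be set; 
url.openStream() offers no way to set headers, which is why the JVM 
default "Java/<version>" is sent.  "SimplePostTool-crawler" here stands in 
for whatever agent string readPageFromUrl() already uses (it may carry a 
version suffix):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsFetchSketch {
  // Open robots.txt with an explicit User-Agent instead of the JVM default.
  static InputStream openRobotsStream(URL robotsUrl) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
    conn.setRequestProperty("User-Agent", "SimplePostTool-crawler");
    return conn.getInputStream();
  }
}
{code}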


