[General] Webboard: Links without specific protocol

bar Tue, 31 Jan 2017 07:58:46 -0800

Author: Julien D.
Email: jul...@clustaar.com
Message:
> Hello,
> 
> > Hello,
> > 
> > I couldn't find any information on this subject.
> > As people start using HTTPS, I get more and more problems when crawling
> > with links that don't use a specific protocol.
> > 
> > Let's take this example of a link from http://www.example.com/page-a.html :
> > <a href="//www.example.com/page-b.html">text</a>
> > 
> > Will be seen as : http://www.example.com/www.example.com/page-b.html
> > And of course will cause a 404 error.
> > 
> > Any idea on how to get the right links ?
> > 
> > Thanks.
> 
> The crawler stores full URLs in the database.
> But you can remove the protocol at search time,
> using the search template language functionality.
> 
> In 3.4.x use regex_substr:
> http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
> 
> In 3.3.x use the EREG template operator:
> http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
>


Hello Alexander,

Thanks for the answer.
However, the problem occurs during the indexing phase: the crawler tries to
index http://www.example.com/www.example.com/page-b.html (which does not exist)
instead of http://www.example.com/page-b.html

Can I prevent those 404 errors ?

Thanks !
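
For reference, a link such as //www.example.com/page-b.html is what RFC 3986 calls a network-path reference: it keeps only the scheme of the page it appears on and replaces the host and path. The sketch below is not mnoGoSearch code. It uses Python's standard urllib.parse.urljoin to show the resolution I would expect, next to the one I am seeing. The helper name resolve_link is hypothetical, and the guess that the leading "//" gets dropped before resolution is only an assumption about what produces the wrong URL.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve href against the page it was found on (RFC 3986 rules)."""
    return urljoin(base_url, href)


page = "http://www.example.com/page-a.html"

# Expected: the "//" form inherits the scheme of the base page.
print(resolve_link(page, "//www.example.com/page-b.html"))
# http://www.example.com/page-b.html

# The same link found on an HTTPS page resolves to HTTPS.
print(resolve_link("https://www.example.com/page-a.html",
                   "//www.example.com/page-b.html"))
# https://www.example.com/page-b.html

# Assumption: if the leading "//" is lost, the href is treated as a
# relative path, which reproduces the URL that returns 404.
print(resolve_link(page, "www.example.com/page-b.html"))
# http://www.example.com/www.example.com/page-b.html
```

The third call reproduces the wrong URL from my example, which is why I suspect the link is being handled as a relative path rather than as a scheme-relative URL.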

Reply: <http://www.mnogosearch.org/board/message.php?id=21810>

_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
