[General] Webboard: Links without specific protocol

bar Wed, 25 Jan 2017 03:13:33 -0800

Author: Alexander Barkov
Email: 
Message:
Hello,

> Hello,
> 
> I couldn't find any information on this subject.
> As more sites move to HTTPS, I run into more and more crawling problems
> with links that don't specify a protocol.
> 
> Take this example of a link found on http://www.example.com/page-a.html:
> <a href="//www.example.com/page-b.html">text</a>
> 
> It is resolved as http://www.example.com/www.example.com/page-b.html,
> which of course causes a 404 error.
> 
> Any idea how to get the correct links?
> 
> Thanks.
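
For reference, a link that starts with "//" is a protocol-relative
(network-path) reference: under RFC 3986, section 5, it keeps its own host
and inherits only the scheme of the page it appears on. The sketch below
uses Python's standard urllib.parse.urljoin purely to illustrate the
expected result; it is not mnoGoSearch code, and the helper name
resolve_link is made up for this example.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve an href against the URL of the page it was found on.

    resolve_link is a hypothetical helper name; urljoin implements the
    reference-resolution rules of RFC 3986.
    """
    return urljoin(base_url, href)


if __name__ == "__main__":
    href = "//www.example.com/page-b.html"
    # The link inherits "http" from the page it was found on.
    print(resolve_link("http://www.example.com/page-a.html", href))
    # -> http://www.example.com/page-b.html
    # The same link found on an HTTPS page inherits "https" instead.
    print(resolve_link("https://www.example.com/page-a.html", href))
    # -> https://www.example.com/page-b.html
```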


The crawler stores full URLs, including the protocol, in the database.
You can, however, strip the protocol at search time
using the search template language.

In 3.4.x use regex_substr:
http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions

In 3.3.x use the EREG template operator:
http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
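
The exact template syntax for regex_substr and EREG is described on the
pages linked above. As a rough illustration of the kind of pattern
involved, here is a small Python sketch (not mnoGoSearch template code)
that strips a leading scheme from a stored URL and leaves a
protocol-relative link. The function name strip_scheme and the pattern
itself are assumptions made for this example.

```python
import re

# A scheme per RFC 3986: a letter followed by letters, digits, "+", "-"
# or ".", then ":". The lookahead strips it only when "//" follows.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:(?=//)")


def strip_scheme(url: str) -> str:
    """Return the URL without its leading scheme (hypothetical helper)."""
    return SCHEME_RE.sub("", url, count=1)


if __name__ == "__main__":
    print(strip_scheme("http://www.example.com/page-b.html"))
    # -> //www.example.com/page-b.html
    print(strip_scheme("https://www.example.com/page-b.html"))
    # -> //www.example.com/page-b.html
```

A URL that is already protocol-relative passes through unchanged, so the
same pattern can be applied to every stored URL.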


Reply: <http://www.mnogosearch.org/board/message.php?id=21809>

_______________________________________________
General mailing list
http://lists.mnogosearch.org/listinfo/general