[General] Webboard: Links without specific protocol

bar Wed, 01 Feb 2017 02:50:27 -0800

Author: Alexander Barkov
Email: 
Message:
Hello  Julien,

> > Hello,
> > 
> > > Hello,
> > > 
> > > I couldn't find any information on this subject.
> > > As people start using HTTPS, I get more and more problems when crawling
> > > with links that don't use a specific protocol.
> > > 
> > > Let's take this example of a link from http://www.example.com/page-a.html:
> > > <a href="//www.example.com/page-b.html">text</a>
> > > 
> > > Will be seen as : http://www.example.com/www.example.com/page-b.html
> > > And of course will cause a 404 error.
> > > 
> > > Any idea on how to get the right links ?
> > > 
> > > Thanks.
> > 
> > The crawler stores full URLs in the database.
> > But you can remove the protocol at search time,
> > using the search template language functionality.
> > 
> > In 3.4.x use regex_substr:
> > http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
> > 
> > In 3.3.x use the EREG template operator:
> > http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
> > 
> 
> Hello Alexander,
> 
> Thanks for the answer.
> However, the problem occurs during the indexing phase: the crawler tries to index
> http://www.example.com/www.example.com/page-b.html (which does not exist)
> instead of http://www.example.com/page-b.html
> 
> Can I prevent those 404 errors ?
> 
> Thanks !


Oops, you are right: this is not supported yet. I thought it was.
It should be easy to add. Which version are you using?
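
For reference, here is a minimal sketch of how a protocol-relative link such as the one in the example is expected to resolve: the link keeps its own host and path and inherits only the scheme from the page it appears on. This is an illustration in Python using the standard urllib.parse.urljoin, not mnoGoSearch code; the helper name resolve_link is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL.

    Hypothetical helper for illustration only. urljoin treats an href
    starting with "//" as a network-path reference: the scheme comes
    from page_url, the host and path come from the href.
    """
    return urljoin(page_url, href)


if __name__ == "__main__":
    href = "//www.example.com/page-b.html"

    # Link found on an HTTP page: the result stays on HTTP.
    print(resolve_link("http://www.example.com/page-a.html", href))

    # The same link found on an HTTPS page: the result uses HTTPS.
    print(resolve_link("https://www.example.com/page-a.html", href))
```

With the link from the example, this yields http://www.example.com/page-b.html rather than http://www.example.com/www.example.com/page-b.html.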


Reply: <http://www.mnogosearch.org/board/message.php?id=21811>

_______________________________________________
General mailing list
[email protected]
http://lists.mnogosearch.org/listinfo/general
