[General] Webboard: Links without specific protocol

2017-02-06 Thread bar
Author: Alexander Barkov
Email: 
Message:
Hello Julien,

> Hello Alexander,
> 
> Thanks for the answer.
> However, the problem occurs on the indexing phase : the crawler tries to index
> http://www.example.com/www.example.com/page-b.html (which does not exist)
> instead of http://www.example.com/page-b.html
> 
> Can I prevent those 404 errors ?
> 
> Thanks !

I have added support for protocol-relative URLs to the next release, 3.4.2. I
hope to make it available for download this week.

Note that the database structure in 3.4.2 is slightly different from 3.4.1,
so a full re-crawl will be needed. I hope that won't be a serious problem.
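
For illustration only, here is a minimal Python sketch of how a protocol-relative link is expected to resolve against the page it appears on, and how treating the same text as a plain relative path gives the doubled host name reported in this thread. It uses Python's standard urllib.parse module and is not mnoGoSearch code; the helper name resolve_link is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve href against base_url (hypothetical helper, not mnoGoSearch code)."""
    return urljoin(base_url, href)


base = "http://www.example.com/page-a.html"

# A protocol-relative link starts with "//": it keeps the scheme of the
# base page and takes the host and path from the link itself.
print(resolve_link(base, "//www.example.com/page-b.html"))

# Without the leading "//", the same text is a plain relative path,
# so it is appended to the directory of the base page.
print(resolve_link(base, "www.example.com/page-b.html"))
```

The first call yields http://www.example.com/page-b.html, the second yields http://www.example.com/www.example.com/page-b.html. On a page served over https, the first link would resolve to an https URL instead.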



___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Links without specific protocol

2017-02-05 Thread bar
Author: Alexander Barkov
Email: 
Message:

> 
> Hello Alexander,
> 
> I currently use 3.4.1.
> 
> Is there a new release I am not aware of ?
> 
> Thank you for your quick answers !

No, 3.4.1 is the latest.




[General] Webboard: Links without specific protocol

2017-02-01 Thread bar
Author: Julien D.
Email: jul...@clustaar.com
Message:
> Hello Julien,
> 
> > > Hello,
> > > 
> > > > Hello,
> > > > 
> > > > I couldn't find any information on this subject.
> > > > As people start using HTTPS, I get more and more problems when crawling with
> > > > links that don't use a specific protocol.
> > > > 
> > > > Let's take this example of a link from http://www.example.com/page-a.html :
> > > > <a href="//www.example.com/page-b.html">text</a>
> > > > 
> > > > Will be seen as : http://www.example.com/www.example.com/page-b.html
> > > > And of course will cause a 404 error.
> > > > 
> > > > Any idea on how to get the right links ?
> > > > 
> > > > Thanks.
> > > 
> > > The crawler stores full URLs in the database.
> > > But you can remove the protocol at search time,
> > > using the search template language functionality.
> > > 
> > > In 3.4.x use regex_substr:
> > > http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
> > > 
> > > In 3.3.x use the EREG template operator:
> > > http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
> > > 
> > 
> > Hello Alexander,
> > 
> > Thanks for the answer.
> > However, the problem occurs on the indexing phase : the crawler tries to index
> > http://www.example.com/www.example.com/page-b.html (which does not exist)
> > instead of http://www.example.com/page-b.html
> > 
> > Can I prevent those 404 errors ?
> > 
> > Thanks !
> 
> Oops. This is not supported yet, indeed. I thought it was.
> It should be easy to add this. Which version are you using?
> 

Hello Alexander,

I currently use 3.4.1.

Is there a new release I am not aware of?

Thank you for your quick answers!



[General] Webboard: Links without specific protocol

2017-01-31 Thread bar
Author: Julien D.
Email: jul...@clustaar.com
Message:
> Hello,
> 
> > Hello,
> > 
> > I couldn't find any information on this subject.
> > As people start using HTTPS, I get more and more problems when crawling with
> > links that don't use a specific protocol.
> > 
> > Let's take this example of a link from http://www.example.com/page-a.html :
> > <a href="//www.example.com/page-b.html">text</a>
> > 
> > Will be seen as : http://www.example.com/www.example.com/page-b.html
> > And of course will cause a 404 error.
> > 
> > Any idea on how to get the right links ?
> > 
> > Thanks.
> 
> The crawler stores full URLs in the database.
> But you can remove the protocol at search time,
> using the search template language functionality.
> 
> In 3.4.x use regex_substr:
> http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
> 
> In 3.3.x use the EREG template operator:
> http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
> 

Hello Alexander,

Thanks for the answer.
However, the problem occurs in the indexing phase: the crawler tries to index
http://www.example.com/www.example.com/page-b.html (which does not exist)
instead of http://www.example.com/page-b.html

Can I prevent those 404 errors?

Thanks!



[General] Webboard: Links without specific protocol

2017-01-25 Thread bar
Author: Alexander Barkov
Email: 
Message:
Hello,

> Hello,
> 
> I couldn't find any information on this subject.
> As people start using HTTPS, I get more and more problems when crawling with 
> links that don't use a specific protocol.
> 
> Let's take this example of a link from http://www.example.com/page-a.html :
> <a href="//www.example.com/page-b.html">text</a>
> 
> Will be seen as : http://www.example.com/www.example.com/page-b.html
> And of course will cause a 404 error.
> 
> Any idea on how to get the right links ?
> 
> Thanks.

The crawler stores full URLs in the database.
But you can remove the protocol at search time,
using the search template language functionality.

In 3.4.x use regex_substr:
http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions

In 3.3.x use the EREG template operator:
http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
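
As a rough illustration of the idea of removing the protocol when a stored URL is displayed, here is a small Python sketch using a regular expression. This is not mnoGoSearch template syntax (see the links above for regex_substr and EREG); the function name strip_scheme and the pattern are assumptions made for this example.

```python
import re

# Matches a URL scheme followed by "://" at the start of the string,
# e.g. "http://" or "https://".
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def strip_scheme(url: str) -> str:
    """Return url without its leading scheme (hypothetical helper)."""
    return SCHEME_PREFIX.sub("", url, count=1)


print(strip_scheme("http://www.example.com/page-b.html"))
print(strip_scheme("https://www.example.com/page-b.html"))
```

Both calls print www.example.com/page-b.html. This only changes how a stored URL is shown; it does not change which URLs the crawler fetches.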




[General] Webboard: Links without specific protocol

2017-01-23 Thread bar
Author: Julien D.
Email: jul...@clustaar.com
Message:
Hello,

I couldn't find any information on this subject.
As people start using HTTPS, I get more and more problems when crawling with 
links that don't use a specific protocol.

Let's take this example of a link from http://www.example.com/page-a.html:
<a href="//www.example.com/page-b.html">text</a>

It will be seen as: http://www.example.com/www.example.com/page-b.html
And of course this will cause a 404 error.

Any idea on how to get the right links?

Thanks.
