Re: Extract all links / URLs? [62022]

Vlastimil Brom Fri, 21 Feb 2014 05:39:13 -0800

Dirk:
--------------------------------------------------------------------------------
Hi,


Many thanks.

Better to have some false positives and catch all links and URLs than to miss
one of them and have no false positives.

But I have no idea about those expressions, unfortunately.

OK, extracting from this now (I added an extra ":" so that the source is shown as plain text):

http:://well.me/dfdfddddf  200  ok  text/html  1  1  1  nginx  00:00.799  utf-8
http:://well.me/999        200  ok  text/html  2  1  1  nginx  00:00.285  utf-8
http:://well.me/456        200  ok  text/html  2  1  2  nginx  00:00.323  utf-8
http:://well.me/8887kku    200  ok  text/html  2  1  1  nginx  00:00.311  utf-8

extracts that:

http:://well.me/dfdfddddf
00.799
http:://well.me/999
00.285
http:://well.me/456
00.323
http:://well.me/8887kku
00.311

Maybe one could change that.

Many thanks again.
--------------------------------------------------------------------------------


Hi,
well, these matched numbers are the false positives mentioned earlier ... :-)
You may try the following modified pattern:
cite:
--------------------------------------------------------------------------------
((news|http|ftp|https):\/\/)?[\w\-_]+(\.[\w]+)*?(\.[a-z]{2,3})([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
--------------------------------------------------------------------------------


This should ensure the presence of a top-level domain consisting of 2-3 letters;
again, make sure to test it on your data, I only did so in a very limited way.
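
As a rough illustration of how the pattern above could be tried out outside the
editor, here is a minimal Python sketch. It assumes Python's re module (PSPad's
own regex engine may behave differently), it uses ordinary "http://" URLs in the
sample instead of the "http:://" form shown in the post, and the names
URL_PATTERN, extract_urls and sample are made up for this example.

```python
import re

# The modified pattern from above, joined into a single expression.
URL_PATTERN = re.compile(
    r"((news|http|ftp|https):\/\/)?[\w\-_]+(\.[\w]+)*?(\.[a-z]{2,3})"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)


def extract_urls(text):
    """Return the full text of every match (group 0), in order of appearance."""
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


# Hypothetical sample in the same tab-separated shape as the report above.
sample = (
    "http://well.me/dfdfddddf\t200\tok\ttext/html\t1\t1\t1\tnginx\t00:00.799\tutf-8\n"
    "http://well.me/999\t200\tok\ttext/html\t2\t1\t1\tnginx\t00:00.285\tutf-8\n"
)

for url in extract_urls(sample):
    print(url)
```

On this sample only the two URLs are printed; a value such as "00.799" no longer
matches, because ".799" is not a suffix of 2-3 letters.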

Alternatively, if you know the form of the URLs you want to match, it might be
workable to write a simpler pattern from scratch - a large part of this version
seems to deal with the query part after the "?".
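
A simpler pattern of that kind might look like the following Python sketch. It
rests on an assumption that may not hold for your data: every wanted URL starts
with "http://" or "https://" and ends at the next whitespace character. The
names SIMPLE_URL and extract_simple are made up for this example.

```python
import re

# Assumption: URLs always carry an explicit http/https scheme and
# contain no whitespace; everything up to the next whitespace is taken.
SIMPLE_URL = re.compile(r"https?://\S+")


def extract_simple(text):
    """Return all substrings that look like http(s) URLs."""
    return SIMPLE_URL.findall(text)


print(extract_simple("http://well.me/456\t200\tok\ttext/html"))
```

Such a pattern cannot pick up bare numbers like "00.323", but it also misses
links written without a scheme, so it is only suitable when the input is that
regular.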

hth,
  vbr

-- 
<http://forum.pspad.com/read.php?2,62001,62022>
PSPad freeware editor http://www.pspad.com
