Re: Extract all links / URLs? [62031]

Vlastimil Brom Fri, 21 Feb 2014 10:00:58 -0800

Dirk:
--------------------------------------------------------------------------------
...
Maybe it is easier to extract the www. URLs in a first step and, in a second
step, all of the remaining URLs containing http and similar.


Thank you very much.
--------------------------------------------------------------------------------


Hi,
it may be difficult to solve this in general with a single regex, but chances
are that the format of your input data allows some assumptions which would make
the extraction simpler.

E.g. if your URLs were always at the beginning of a line, were guaranteed to
start with either http or www, and were followed by some whitespace, a naive
pattern might simply be

^(https?://|www\.)\S{3,}

This would still give some false positives such as www.abc (which could be
handled separately), but these are probably not very likely. The pattern also
assumes that a URL never contains spaces.
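
For illustration only, here is how the naive pattern could be tried out in
Python rather than in the editor itself. This is a minimal sketch: the function
name and the sample text are made up, and re.MULTILINE is needed so that ^
matches at the start of every line and not only at the start of the text.

```python
import re

# The naive pattern from above; the group is made non-capturing so that the
# whole URL is returned. re.MULTILINE makes ^ match at each line start.
URL_AT_LINE_START = re.compile(r"^(?:https?://|www\.)\S{3,}", re.MULTILINE)


def extract_leading_urls(text):
    """Return every URL that starts a line in text (hypothetical helper)."""
    return [match.group(0) for match in URL_AT_LINE_START.finditer(text)]


# Made-up sample input: one URL per line start, one URL in mid-line.
sample = (
    "http://www.pspad.com PSPad home page\n"
    "www.example.org some description\n"
    "see also http://example.com/inline\n"
    "www.abc\n"
)

print(extract_leading_urls(sample))
# ['http://www.pspad.com', 'www.example.org', 'www.abc']
```

The URL in the third line is skipped because it is not at the beginning of the
line, and www.abc shows up as the kind of false positive mentioned above.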

However, if there is no such regularity, searching in multiple steps seems
simplest.
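
A two-step search along the lines Dirk suggested might look roughly like the
following Python sketch. The names, the sample text and the list of schemes
(http, https, ftp) are assumptions for the example; the lookbehind in the first
pattern keeps a www. that already belongs to an http:// URL from being reported
twice.

```python
import re

# Step 1: bare "www." URLs. The lookbehind rejects a "www." that directly
# follows "/", "." or a word character, e.g. the one inside "http://www...".
WWW_URL = re.compile(r"(?<![/\w.])www\.\S{3,}")

# Step 2: URLs with an explicit scheme (the scheme list is an assumption).
SCHEME_URL = re.compile(r"\b(?:https?|ftp)://\S{3,}")


def extract_urls_in_steps(text):
    """Collect www. URLs first, then scheme-prefixed URLs (hypothetical helper)."""
    www_urls = WWW_URL.findall(text)
    scheme_urls = SCHEME_URL.findall(text)
    return www_urls + scheme_urls


# Made-up sample input with URLs anywhere in the line.
sample_text = (
    "Visit www.example.org or http://www.pspad.com today; "
    "files are at ftp://files.example.net/pub"
)

print(extract_urls_in_steps(sample_text))
# ['www.example.org', 'http://www.pspad.com', 'ftp://files.example.net/pub']
```

As with the single pattern, this still assumes that URLs contain no spaces, and
punctuation directly after a URL would be picked up as part of it.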

hth,
   vbr

-- 
<http://forum.pspad.com/read.php?2,62001,62031>
PSPad freeware editor http://www.pspad.com
