[Nutch-dev] Nutch - Filtering (REGEX)

simon_ece Thu, 03 May 2007 00:36:41 -0700

Hi all,
I am new to Nutch. I would like to crawl a particular site and get only the
URLs that match the following pattern. I don't want to list any other URLs
from the crawled site.


Site to be crawled, e.g.: www.example.com
^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

I can crawl the site and get all the matching URLs from it,
but I don't know how to filter out the rest and keep only these particular URLs.
Kindly post your suggestions.
Thanks & Regards
Simon
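
One way to reason about this kind of filtering is as an ordered list of
accept (+) and reject (-) regex rules in which the first matching rule decides,
which is the layout used by Nutch filter files such as conf/regex-urlfilter.txt.
The following is a minimal Python sketch of that idea, not Nutch's own code.
It assumes the backslash-escaped parentheses in the pattern above were meant as
plain grouping (as written, \( and \) match literal parentheses), and it escapes
the dot in example.com. The sample URLs and the names RULES and filter_url are
hypothetical, made up for illustration.

```python
import re

# Hypothetical rule list: "+" accepts, "-" rejects, first match wins.
# ASSUMPTION: the parentheses from the original pattern are treated as
# plain groups, and the dot in "example.com" is escaped.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)example\.com/([a-zA-Z]*)-([a-z0-9]*)-.*"
          r"-([0-9]*-[A-Za-z0-9]*)\.html$"),
    ("-", r"."),  # catch-all: reject every URL not accepted above
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def filter_url(url):
    """Return the URL if the first matching rule accepts it, else None."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: treat the URL as rejected


# Hypothetical sample URLs, not taken from the original post.
urls = [
    "http://www.example.com/news-abc123-some-title-42-xyz.html",
    "http://www.example.com/about.html",
    "http://other.org/news-abc123-some-title-42-xyz.html",
]

kept = [u for u in urls if filter_url(u) is not None]
for u in kept:
    print(u)
```

In this sketch only the first sample URL is kept. One caveat worth checking:
if rules this strict are applied while crawling, pages that merely link to the
wanted URLs are rejected as well, so the crawler may never discover them. It may
be necessary to crawl with a looser filter and apply the strict pattern only
when listing results.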
-- 
View this message in context: 
http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3685035.html#a10300328
Sent from the Nutch - Dev mailing list archive at Nabble.com.

