There's a Java library called HtmlCleaner. You might want to give that a shot.
Btw, I'm working on quite a similar project, so if you like, email me and we can
maybe join forces.
Andreas

On 06/06/2011, at 11:01 AM, Base wrote:

> hi all,
> 
> I am working on an app that will parse web pages to do some NLP and
> statistics.  I am able to parse the HTML using several different tools
> (enlive, HTML parser, etc.).  However, I would like to discard all the
> rest of the junk in the web page that is not pertinent (i.e. ads).
> Does anyone have any experience doing this?  Any tips on how to do
> this - or even better, tools that you can recommend?   I have been
> digging around on this for a while now and am stuck!
> 
> Thanks!
> 
> Base
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
