There's a Java library called HtmlCleaner that you might want to try. By the way, I'm working on a very similar project, so if you like, email me and maybe we can join forces.

Andreas
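For a rough idea of what "discarding the junk" can look like before reaching for a dedicated library, here is a minimal sketch using only Python's standard `html.parser`. It is not HtmlCleaner and not anyone's recommended method in this thread; the `SKIP_TAGS` set, the `MainTextExtractor` class and the `extract_text` helper are all hypothetical names chosen for illustration, and the assumption that ads and navigation live inside those tags will not hold for every page.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is treated as non-pertinent "junk".
# Real pages often put ads in plain <div>s, so this list is only a starting point.
SKIP_TAGS = {"script", "style", "noscript", "iframe",
             "nav", "aside", "header", "footer", "form"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the kept text fragments joined by single spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_text(page)` on an HTML string returns the text that sits outside the skipped elements. The sketch assumes reasonably well-formed markup: an unclosed `<header>` or `<aside>` would cause everything after it to be dropped, which is one reason a tolerant cleaner such as HtmlCleaner is worth trying on real-world pages.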
On 06/06/2011, at 11:01 AM, Base wrote:

> Hi all,
>
> I am working on an app that will parse web pages to do some NLP and
> statistics. I am able to parse the HTML using several different tools
> (enlive, HTML parser, etc.). However, I would like to discard all the
> rest of the junk in the web page that is not pertinent (i.e. ads).
> Does anyone have any experience doing this? Any tips on how to do
> this, or even better, tools that you can recommend? I have been
> digging around on this for a while now and am stuck!
>
> Thanks!
>
> Base
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en