Hello everybody,

I will be presenting clustering techniques for Nutch output at ApacheCon NA 2016 next week. I hope to see you there! Links to the event [1] and the presentation [2] are below.

In addition, we are planning to contribute our toolkit to Nutch, since clustering is a useful post-processing step for a crawler. As of now, our algorithms run on top of Apache Spark (its distributed matrices and GraphX are really helpful). Let us know your thoughts; the details are in the presentation [2]. The source code and wiki are at [3].

Oh, and I forgot to introduce myself: I am Thamme Gowda, a grad student at the University of Southern California (USC) and a research assistant to Dr. Chris Mattmann. Prior to starting my graduate studies, I was building http://datoin.com as a tech co-founder.

I am excited to be at ApacheCon.

[1] http://sched.co/6OJN
[2] http://schd.ws/hosted_files/apachecon2016/11/Apache%20Con%20Slides-Nutch-Clustering.pdf
[3] https://github.com/uscdataScience/autoextractor/wiki

Best,
Thamme

--
*Thamme Gowda N.*
@thammegowda <https://twitter.com/thammegowda>

