> Do you plan to release the testing / evaluation part?

The GitHub repository of the paper [1] contains scripts and instructions to reproduce the results we present in our evaluation. I could not include the dataset we used because it is not publicly available, but it should be reasonably easy to request it or to build a small one manually. (I have not yet included a license because I am not sure how licensing works for the text of the paper, but everything else will be MIT.)

> Should we put a link to BlogForever on the companies page?

At the moment the BlogForever web site is not really up to date, and we still have a bit of work to do (mostly the connection to our back-end) before putting the crawler into production. The first instance will likely be hosted on CERN servers to monitor high-energy physics related blogs. I suggest waiting for this one to be up and running before adding a link; we will get back to you when it is!

Cheers,
Olivier

[1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication

On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:
>
> Interesting project!
>
> It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.
>
> I thought Scrapely could have been a nice comparison. Your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed for working with very little example data, so I expect your approach would compare favorably. I see you favor using id and class attributes - something we are considering for Scrapely too, as it currently relies exclusively on HTML structure. Do you plan to release the testing / evaluation part?
>
> Should we put a link to BlogForever on the companies page <http://scrapy.org/companies/>?
>
> Good luck with the conference submission!
>
> Shane
>
> On 1 February 2014 16:52, <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I'm very happy to announce the release of the BlogForever crawler! Our work is entirely based on Scrapy, and we wanted to thank you for the amazing work you did on this framework, without which we could not have accomplished a fraction of what we did during the last 6 months.
>>
>> The crawler targets web blogs, and is able to automatically extract blog post articles, titles, authors and comments.
>> It's open source and comes with tests and documentation: <https://github.com/BlogForever/crawler>. We've also written and submitted a paper for the WIMS14 conference where we present our algorithm for content extraction and a high-level overview of the crawler architecture, available at <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>>
>> I believe that the following parts of our project might be useful for other applications:
>>
>> - The content extractor interface is similar to that of Scrapely <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too late to include it in our evaluation). It's very fast and robust: we reached a 93% success rate on blog article extraction over 2300 blog posts.
>>
>> - To crawl blogs mixed up with other resources (such as a wiki or a forum), we use a simple machine-learning based priority predictor to favor crawling URLs with links to blog posts. This gets the most out of a limited number of page downloads; otherwise the crawl might get stuck in irrelevant portions of the blog.
>>
>> - We use PhantomJS to do JavaScript rendering, take screenshots and fake some user interaction to deal with Disqus comments. We have a pool of reusable browsers, which allows us to take full advantage of the processors (with JavaScript rendering on, the crawling bottleneck is CPU).
>>
>> If you take the time to read the paper (8 pages) or the code, don't hesitate to send comments or feedback!
>>
>> Regards,
>> Olivier Blanvillain
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/groups/opt_out.
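
For readers curious about the priority predictor idea mentioned in the announcement (favoring URLs likely to lead to blog posts when the number of page downloads is limited), here is a minimal sketch in plain Python. It is not the BlogForever implementation: the feature names, the hand-set `WEIGHTS`, and the functions `features`, `score` and `crawl_order` are all hypothetical, and the hand-written linear scorer merely stands in for a predictor that would be learned from crawl data.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical weights chosen by hand for illustration; a real predictor
# would learn them from labeled crawl data.
WEIGHTS = {"has_date": 2.0, "has_post_word": 1.5, "is_forum_or_wiki": -3.0}


def features(url):
    """Turn a URL into a few binary features (all illustrative)."""
    path = urlparse(url).path.lower()
    parts = [p for p in path.split("/") if p]
    return {
        # A four-digit path segment such as /2014/ often marks a dated post.
        "has_date": any(p.isdigit() and len(p) == 4 for p in parts),
        "has_post_word": any(w in path for w in ("post", "blog", "archive")),
        "is_forum_or_wiki": any(w in path for w in ("forum", "wiki")),
    }


def score(url):
    """Linear score: higher means more likely to lead to blog posts."""
    return sum(WEIGHTS[name] for name, on in features(url).items() if on)


def crawl_order(urls, budget):
    """Return the `budget` URLs to fetch first, best score first."""
    # heapq is a min-heap, so scores are negated; the index breaks ties.
    heap = [(-score(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(budget, len(heap)))]


if __name__ == "__main__":
    candidates = [
        "http://example.org/forum/thread/12",
        "http://example.org/2014/02/hello",
        "http://example.org/about",
        "http://example.org/blog/archive",
    ]
    for url in crawl_order(candidates, budget=2):
        print(url)
```

In a Scrapy spider, a score like this could be rounded to an integer and passed as the `priority` argument of `Request`, so that the scheduler dequeues the more promising URLs first.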
