Re: Announcing the BlogForever crawler

olivierblanvillain Tue, 04 Feb 2014 08:10:45 -0800

> Do you plan to release the testing / evaluation part?

The GitHub repository of the paper [1] contains scripts and instructions to
reproduce the results we present in our evaluation. I could not include the
dataset we used because it's not publicly available, but it should be
reasonably easy to request it or manually build a small one. (I've not yet
included a license because I'm not really sure how it works with the text of
the paper, but everything else will be MIT)



> Should we put a link to BlogForever on the companies page?

At the moment the BlogForever web site is not really up to date, and we still
have a bit of work to do (mostly the connection to our back-end) before putting
the crawler in production. The first instance will likely be hosted on CERN
servers to monitor high-energy physics related blogs. I suggest waiting for this
one to be up and running before adding a link. We will get back to you when it
is!

Cheers,
Olivier

[1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication


On Monday, February 3, 2014 11:51:56 PM UTC+1, shane wrote:
>
> Interesting project!
>
> It's nice to see the bits on Scrapy in your paper - thanks! We're 
> delighted it was so useful for the BlogForever crawler. It's great to see 
> your crawler released as open source too.
>
> I thought Scrapely could have been a nice comparison. Your approach takes
> better advantage of the fact that you have many examples (from the feeds),
> whereas Scrapely is designed for working with very little example data, so I
> expect your approach would compare favorably. I see you favor using id and
> class attributes - something we are considering for Scrapely too, as it
> currently relies exclusively on HTML structure. Do you plan to release the
> testing / evaluation part?
>
> Should we put a link to BlogForever on the companies page
> <http://scrapy.org/companies/>?
>
> Good luck with the conference submission!
>
> Shane
>
>
>
> On 1 February 2014 16:52, <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I'm very happy to announce the release of the BlogForever crawler! Our work
>> is entirely based on Scrapy, and we wanted to thank you for the amazing work
>> you did on this framework, without which we could not have accomplished a
>> fraction of what we did during the last 6 months.
>>
>> The crawler targets web blogs, and is able to automatically extract blog post
>> articles, titles, authors and comments. It's open source and comes with tests
>> and documentation: <https://github.com/BlogForever/crawler>. We've also
>> written and submitted a paper for the WIMS14 conference where we present our
>> algorithm for content extraction and a high-level overview of the crawler
>> architecture, available at
>> <https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.
>>
>> I believe that the following parts of our project might be useful for other
>> applications:
>>
>> - The content extractor interface is similar to that of Scrapely
>>   <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too late
>>   to include it in our evaluation). It's very fast and robust: we reached a
>>   93% success rate on blog article extraction over 2300 blog posts.
>>
>> - To crawl blogs mixed up with other resources (such as a wiki or a forum), we
>>   use a simple machine-learning based priority predictor to favor crawling
>>   URLs with links to blog posts. This lets us get the best out of a limited
>>   number of page downloads; the crawl might otherwise get stuck in irrelevant
>>   portions of the blog.
>>
>> - We use PhantomJS to do JavaScript rendering, take screenshots and fake some
>>   user interaction to deal with Disqus comments. We have a pool of reusable
>>   browsers, which lets us take full advantage of the available processors
>>   (with JavaScript rendering on, the crawling bottleneck is CPU).
>>
>> If you take the time to read the paper (8 pages) or the code, don't hesitate
>> to send comments or feedback!
>>
>> Regards,
>> Olivier Blanvillain
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
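For readers who want a concrete picture of the priority predictor mentioned in
the announcement, here is a minimal Python sketch. It is not the BlogForever
implementation: the paper describes the real predictor, and everything below
(the URL-token features, the averaging rule, and the names PriorityPredictor
and Frontier) is an assumption made for illustration only. The idea shown is
simply "learn from already crawled pages which kinds of URLs tend to link to
blog posts, then download the most promising URLs first".

```python
import heapq
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_tokens(url):
    """Split the URL path into lowercase alphabetic tokens (digits are dropped)."""
    return [t for t in re.split(r"[^a-z]+", urlparse(url).path.lower()) if t]


class PriorityPredictor:
    """Hypothetical predictor: average number of post links seen per URL token."""

    def __init__(self):
        self.total = defaultdict(float)  # token -> sum of observed post links
        self.count = defaultdict(int)    # token -> number of observations

    def observe(self, url, post_links_found):
        """Record how many blog-post links a crawled page turned out to contain."""
        for token in url_tokens(url):
            self.total[token] += post_links_found
            self.count[token] += 1

    def predict(self, url):
        """Score an unseen URL as the mean score of its known tokens (0.0 if none)."""
        scores = [self.total[t] / self.count[t]
                  for t in url_tokens(url) if t in self.count]
        return sum(scores) / len(scores) if scores else 0.0


class Frontier:
    """Queue of pending URLs that always returns the highest-scored one first."""

    def __init__(self, predictor):
        self.predictor = predictor
        self.heap = []
        self.counter = 0  # tie-breaker: equal scores come out in insertion order

    def push(self, url):
        score = self.predictor.predict(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def pop(self):
        return heapq.heappop(self.heap)[2]
```

In a Scrapy spider the same score could be rounded to an integer and passed as
the priority argument of a Request, since Scrapy schedules requests with a
higher priority first.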

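The pool of reusable browsers from the announcement can be pictured along the
following lines. Again, this is only a sketch under stated assumptions, not the
crawler's code: FakeBrowser is a hypothetical stand-in for a real PhantomJS
process, and BrowserPool and render_all are names invented here. What it shows
is the reuse pattern itself: a fixed number of browsers is created once, each
page borrows one, and the browser goes back to the pool afterwards instead of
being restarted.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical stand-in for a headless browser process."""

    def __init__(self, ident):
        self.ident = ident
        self.pages_rendered = 0

    def render(self, url):
        self.pages_rendered += 1
        return f"<html><!-- {url} rendered by browser {self.ident} --></html>"


class BrowserPool:
    """Fixed-size pool; browsers are created once and handed out repeatedly."""

    def __init__(self, size, factory=FakeBrowser):
        self._idle = queue.Queue()
        for ident in range(size):
            self._idle.put(factory(ident))

    @contextmanager
    def browser(self):
        instance = self._idle.get()   # blocks until a browser is free
        try:
            yield instance
        finally:
            self._idle.put(instance)  # hand it back for the next page


def render_all(pool, urls, workers=4):
    """Render every URL with a borrowed browser; results keep the input order."""
    def render_one(url):
        with pool.browser() as browser:
            return browser.render(url)

    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(render_one, urls))
```

With real browsers running as separate processes, the worker threads mostly
wait on them, so a pool size close to the number of CPU cores would be a
natural starting point when rendering is the bottleneck.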
