Hello all,

It has been 4 months of development: 30 authors, more than 80 issues closed, 225 commits, 177 files changed, 6740 insertions and 4134 deletions. New and old faces have been seen over the past months reporting and fixing issues, discussing and helping get new features into shape. Pretty amazing work; thanks to everyone who contributed in one way or another to make Scrapy 0.24 possible!
I'd like to take this opportunity to ask for help with the scrapy.org website. Its design is old (it hasn't changed much since 2008!) and we would like to give it a proper makeover, with a fresher, modern look, maybe including a snippet of simple, self-contained code that shows the power of Scrapy. Anyone out there who would like to become famous for designing the new scrapy.org website? :)

Check out the Release Notes <http://doc.scrapy.org/en/latest/news.html#id1>, from which I would like to highlight the now simpler top-level imports and selector shortcuts:

    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def parse(self, response):
            for href in response.xpath('//a/@href').extract():
                yield scrapy.Request(href)
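For anyone curious how these pieces fit together in a complete spider, here is a slightly longer, untested sketch combining the new top-level objects (scrapy.Item, scrapy.Field, scrapy.Spider, scrapy.Request) with the response.css()/response.xpath() shortcuts; the PostItem/BlogSpider names, the blog.example.com URL and the selectors are made up for illustration:

    import urlparse

    import scrapy

    class PostItem(scrapy.Item):
        # illustrative item with two fields
        title = scrapy.Field()
        url = scrapy.Field()

    class BlogSpider(scrapy.Spider):
        name = 'blog'
        start_urls = ['http://blog.example.com/']  # made-up site

        def parse(self, response):
            # response.css() / response.xpath() are shortcuts for
            # response.selector.css() / response.selector.xpath()
            for post in response.css('div.post'):
                yield PostItem(
                    title=post.xpath('.//h2/a/text()').extract()[0],
                    url=post.xpath('.//h2/a/@href').extract()[0],
                )
            # follow "next page" links (selector is hypothetical)
            for href in response.xpath('//a[@rel="next"]/@href').extract():
                yield scrapy.Request(urlparse.urljoin(response.url, href))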
Last but not least, the credits:

A.J. Welch (1):
      Generalize the file pipeline log messages so they are not specific to downloading images.

Alex Cepoi (2):
      improvements to scrapy check/contracts
      fix contracts tests

Alexander Chekunkov (5):
      test for RFPDupeFilter.request_fingerprint overriding
      added note about RFPDupeFilter.request_fingerprint overriding to the settings documentation
      added short RFPDupeFilter.request_fingerprint interface description
      DOWNLOADER setting
      DOWNLOADER setting

Alexey Bezhan (6):
      Clarify MapCompose documentation
      Fix some typos, whitespace and small errors in docs
      Add a note about reporting security issues
      Bind telnet console and webservice to 127.0.0.1 by default
      Fix PEP8 warnings in project template files
      Fix PEP8 warnings in spider templates

Ana Sabina Uban (1):
      Fixed SgmlLinkExtractor constructor to properly handle both string and list parameters (attrs, tags, deny_extensions)

Benoit Blanchon (3):
      BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag
      BaseSgmlLinkExtractor: Added unit test of a link with an inner tag
      BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag

Breno Colom (1):
      Update scrapy command line doc with additional scrapy parse options

Cameron Lane (2):
      [#744] Ensure domain is not None before building regex
      [#744] Test for allowed domains including NoneTypes

Capi Etheriel (4):
      fixes dynamic itemclass example usage of type()
      Running lucasdemarchi/codespell to fix typos in docs
      Running lucasdemarchi/codespell to fix typos in SEPs
      Running lucasdemarchi/codespell to fix typos in code

Carlos Rivera (1):
      grammatical issue

Cash Costello (1):
      Added missing word in practices.rst

Claudio Salazar (4):
      Fixed XXE flaw in sitemap reader
      Fixed XML selector against XXE attacks
      Added test against XXE attacks for Sitemap
      Added resolve_entities to kwargs in SafeXMLParser

Daniel Graña (45):
      Merge 0.22.0 release notes
      bump version to 0.23
      fix 0.22.0 release date
      Update Ubuntu installation instructions
      fix apt-get line
      replace warning about updating package lists by a note on package upgrade
      show ubuntu setup instructions as literal code
      replace unencodeable codepoints with html entities. fixes #562 and #285
      Fix wrong checks on subclassing of deprecated classes. closes #581
      test inspect.stack failure
      localhost666 can resolve under certain circumstances
      Add 0.22.1 release notes
      fix a reference to unexistent engine.slots. closes #593
      Add 0.22.2 release notes
      try to restore pypy tests
      Run testsuite with py.test
      cleanup toplevel namespace
      Add basic top-level shortcuts
      remove .re() shortcut
      update docs
      update spider templates
      Remove "sel" shortcut from scrapy shell
      document shortcuts in TextResponse class
      Ammend example nesting selectors
      Restore and deprecate "sel" shortcut
      limit Twisted support to pre-14.0.0 while #718 is fixed
      fix tests after changes introduced by scrapy/w3lib#21
      force installation of w3lib and queuelib for trunk env
      Avoid IPython warning. thanks @bryant1410. closes #623
      sort spiders in "scrapy list" cmd. closes #736
      Add a LevelDB cache backend
      add leveldb cache backend docs
      indent parsed-literal as part of ordered list
      Upload sdist and wheel packages to pypi using travis-ci deploys
      Add bumpversion config
      Revert "limit Twisted support to pre-14.0.0 while #718 is fixed"
      hold a reference to backwards compatible _contextFactory
      Restore compatibility with Settings.overrides while still deprecating it
      recognize jl extension as jsonlines exporter and update docs
      promote LxmlLinkExtractor as default in docs
      address latest comments
      No need to keep extracted links as instance attribute. fixes #763
      Add 0.24.0 release notes
      Bump version: 0.23.0 → 0.24.0
      set 0.24.0 release date

Denys Butenko (5):
      Resolved issue #546. Output format parsing from filename extension.
      Added back `-t` option. If `--output-format` not defined parse from extension `--output`
      Fix default value. Add import os for crawl.
      Added more verbose error message for unrecognized output format.
      PEP8.

Edwin O Marshall (32):
      Converted sep-001 to rst format
      converted sep 002 to rst
      - decided that removing files would cause conflicts on merge
      - readded file to prevent future merge conflicts
      converted sep 3 for #629
      sep 4 for #629
      sep 11 for #629
      - sep 15 for #629
      sep 6 for #629
      - sep 10 for #629
      - didn't like the way blockquotes rendered
      - trying to separate quote context
      - changing indentation so contexts are recognized
      - given that it'sa block quote, quotation marks seem redundant
      - removing trac file again to see if merges play well together
      - removed trac file
      - removed trac file
      - removed trac file
      - removed track file
      removed trac file
      removed trac file
      - removed trac file
      converted sep 7 for #629
      sep 12 for #629
      - converted sep 18
      converted sep 16
      converted sep 13
      converted sep 5
      - convertd sep 8
      converted sep 9
      converted sep017
      sep 14 for #629

Irhine (2):
      add encoding utf-8 to the first line
      support i18n by using utf-8 coding template files

Julia Medina (34):
      New doc: clickdata in Formrequest.from_response
      New tests: clickdata's nr in Formrequest.from_response
      FormRequest doc improvements
      More appropriate assert in FormRequest test
      Tests for loading download handlers
      Fix minor typo in DownloaderHandlers comment
      Doc for disabling download handler
      Minor fixes in LoadTestCase in test_downloader_handlers
      Trial functionality for running tests with pytest
      Add py33 environment to allowed failures in travis-ci
      Support doctest and __init__.py test discover in pytest
      Ignore files with import errors on pytest test discover
      Change function name so it does not mess up with pytest autodiscover
      Fix httpcache doctest that assumed dictionary order
      Ensure spiders module reload between spider manager tests
      New tox env: docs
      Ignore known broken links in docs linkcheck
      Fix broken links in documentation
      sep#19 proposed changes
      New SettingsAttribute class
      Settings priorities dictionary
      New set and setdict method using SettingsAttribute in Settings
      Deprecate CrawlerSettings, as its functionality is replicable by Settings class
      Settings and SettingsAtribute tests
      Fix and extend the documentation of the new Settings api
      Settings topic updated
      Fix settings repr on the logs of the shell and tutorial docs topics
      setmodule helper method on Settings class
      Update get_crawler method in utils/test.py with new Settings interface
      get_project_settings now returns a Settings instance
      Change command's default_settings population in cmdline.py
      Change how settings are overriden in ScrapyCommands
      Fix settings usage in runspider and crawl commands
      Fix settings usage across tests

Mikhail Korobov (18):
      fix typos in news.rst and remove (not released yet) header
      Handle cases when inspect.stack() fails
      testing PIL dependency is removed because there is a new mitmproxy version
      TST Improved twisted installation in tox.ini for Python 3.3
      reduce code duplication in test_spidermiddleware_httperror
      scrapy.utils.test.docrawl function
      Fix for #612 + integration-style tests for HttpErrorMiddleware
      TST fix file descriptor leak and a bad variable name in get_testlog
      make scrapy.version_info a tuple of integers
      remove unused import
      use "import scrapy" in templates
      DOC use top-level shortcuts in docs
      suggest scrapy.Selector in deprecation warnings
      TST fix tests that became broken after adding top-level imports and switching to py.test.
      fix scrapy.version_info when SCRAPY_VERSION_FROM_GIT is set
      response.selector, response.xpath(), response.css() and response.re()
      DOC selectors.rst cleanup
      add utf8 encoding header to spider templates

Nikita Nikishin (1):
      Fixed #441.

Nikolaos-Digenis Karagiannis (5):
      downloaderMW doc typo (spiderMW doc copy remnant)
      SpiderMW doc typo: SWP request, response
      ItemLoader doc: missing args in replace_value()
      document spider.closed() shortcut
      Document signal "request_scheduled"

Pablo Hoffman (11):
      make 'basic' the default template spider in genspider, and added info with next steps to startproject. closes #488
      add SEP-021 (Add-ons) - work in progress
      remove references to deprecated scrapy-developers list
      rename attribute to match conventions used for XXX_DEBUG settings (in autothrottle and cookies mw)
      remove no longer used setting: MAIL_DEBUG
      remove unused setting: DOWNLOADER_DEBUG
      signals doc: make argument order more consistent with code (although it doesn't matter in practice)
      add Julia to SEP-019 authors
      crate release notes for 0.24 and #699 to it
      minor change to request_scheduled signal doc
      doc: use |version| substitution in ubuntu packages

Paul Brown (1):
      fixed typo

Paul Tremberth (18):
      Disable smart strings in lxml XPath evaluations
      Make lxml smart strings functionality customizable
      Add testcase to check is default Selector doesnt return smart strings
      Use assertTrue/False
      RegexLinkExtractor: encode URL unicode value when creating Links
      Offsite: add 2 stats counters
      Always enable offsite stats + refactor test to initialize crawler
      Fix tests for Travis-CI build
      CrawSpider: support process_links as generator
      Fix HtmlParserLinkExtractor and tests after #485 merge
      Docs: 4-space indent for final spider example
      DupeFilter: add setting for verbose logging + stats counter for filtered requests
      Remove _log_level attribute as per comments
      Support case-insensitive domains in url_is_from_any_domain()
      Add tests for start requests, filtered and non-filtered
      Check pending start_requests before calling _spider_idle() in engine (fixes #706)
      Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528)
      Add doc on LxmlLinkExtractor class

Rafal Jagoda (1):
      add response arg to item_dropped signal handlers #710

Rendaw (1):
      Elaborated request priority value.

Rolando Espinoza (8):
      Ignore None's values when using the ItemLoader.
      Unused re import and PEP8 minor edits.
      Expose current crawler in the scrapy shell.
      PEP8 minor edits.
      Updated shell docs with the crawler reference and fixed the actual shell output.
      Updated the tutorial crawl output with latest output.
      DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead Dbm.
      DOC Use pipelines module name instead of pipieline following default project files.

Rolando Espinoza La fuente (1):
      Alow to disable a downloader handler just like any other component.

Ruben Vereecken (2):
      Added content-type check as per issue #193
      Redefined test for #193

deed02392 (1):
      Update httperror.py

ncp1113 (1):
      for loops have to have a : at the end of the line

nyov (2):
      better call to parent class
      update a link reference

stray-leone (1):
      modify the version of scrapy ubuntu package

tpeng (3):
      add message when raise IngoreReques; fix item_scraped document
      set the exit code to non-zero when contracts fails
      print spider name even it has no contract tests when -v is specified

tracicot (1):
      Correct typos