Hello Scrapy developers,

I'm really pleased with the Python 3 support in the upcoming Scrapy 1.1
release. I'm planning to introduce this release in a blog article and in a
book I'm currently writing.

I have a question about a limitation in handling non-ASCII URLs. The
release notes for 1.1 (http://doc.scrapy.org/en/master/news.html#news-betapy3)
say:

> * Scrapy has problems handling non-ASCII URLs in Python 3

This limitation seems big enough to make Japanese users like me
hesitate to use Scrapy 1.1 on Python 3. However, when I tested simple spiders
that crawl non-ASCII URLs
(https://gist.github.com/orangain/3724b86a5dc5b2a279f9), I didn't run into any
problems. So my question is:

* What exactly does the limitation mean?

More specifically:

* In my understanding, "non-ASCII URLs" means URLs that contain
percent-encoded non-ASCII characters. Is this right? Or does it mean URLs
that contain non-ASCII characters without percent-encoding?
* What kinds of problems will occur?
* In which components will the problems occur?
* Under what conditions will the problems occur?
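
To make the two readings of "non-ASCII URL" concrete, here is a minimal sketch
using only the Python 3 standard library (not Scrapy itself). The example.com
URL is a made-up placeholder, and I'm assuming UTF-8 is the encoding used for
percent-encoding:

```python
from urllib.parse import quote, unquote

# Hypothetical URL that contains raw (unencoded) non-ASCII characters.
raw_url = "http://example.com/日本語"

# The same URL with the non-ASCII characters percent-encoded as UTF-8 bytes.
# ":" and "/" are kept as-is so the URL structure is preserved.
encoded_url = quote(raw_url, safe=":/")

print(encoded_url)
# Decoding the percent-encoded form gives back the raw form.
print(unquote(encoded_url) == raw_url)
```

The first form is a str with characters outside ASCII; the second is pure
ASCII. My question is which of these two forms the release note refers to.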

I've looked through the following issues, but I couldn't find a clear answer
to my question.

HTML entity causes UnicodeEncodeError in LxmlLinkExtractor · Issue #998 ·
scrapy/scrapy
https://github.com/scrapy/scrapy/issues/998

Speedup & fix URL parsing · Issue #1306 · scrapy/scrapy
https://github.com/scrapy/scrapy/issues/1306

Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode
character · Issue #1403 · scrapy/scrapy
https://github.com/scrapy/scrapy/issues/1403

Exception in LxmLinkExtractor.extract_links 'ascii' codec can't encode
character · Issue #1405 · scrapy/scrapy
https://github.com/scrapy/scrapy/issues/1405

PY3: add back 3 URL normalization tests by redapple · Pull Request #1664 ·
scrapy/scrapy
https://github.com/scrapy/scrapy/pull/1664

get_base_url fails for non-ascii URLs in Python 3 · Issue #1783 ·
scrapy/scrapy
https://github.com/scrapy/scrapy/issues/1783

Best,

orangain