Re: [Python-Dev] csv module TODO list

2005-01-05 Thread Martin v. Löwis
Andrew McNamara wrote:
There's a bunch of jobs we (CSV module maintainers) have been putting
off - attached is a list (in no particular order): 

* unicode support (this will probably uglify the code considerably).
Can you please elaborate on that? What needs to be done, and how is
that going to be done? It might be possible to avoid considerable
uglification.
Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread M.-A. Lemburg
Martin v. Löwis wrote:
Andrew McNamara wrote:
There's a bunch of jobs we (CSV module maintainers) have been putting
off - attached is a list (in no particular order):
* unicode support (this will probably uglify the code considerably).

Can you please elaborate on that? What needs to be done, and how is
that going to be done? It might be possible to avoid considerable
uglification.
Indeed. The trick is to convert to Unicode early and to use Unicode
literals instead of string literals in the code.
Note that the only real-life Unicode format in use is UTF-16
(with BOM mark) written by Excel. Note that there's no standard
for specifying the encoding in CSV files, so this is also the only
feasible format.
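
As a rough illustration of "convert to Unicode early", here is a minimal sketch in present-day Python 3, where the csv module operates on text (str) rather than bytes. That behaviour postdates this thread. The helper name read_utf16_csv and the sample data are made up for the example.

```python
import csv
import io

def read_utf16_csv(raw_bytes):
    """Decode UTF-16 bytes (the BOM selects the byte order), then parse as CSV."""
    # The "utf-16" codec consumes the BOM and picks the byte order from it.
    text = raw_bytes.decode("utf-16")
    # newline="" leaves line endings untouched so the csv module can handle them.
    return list(csv.reader(io.StringIO(text, newline="")))

# Hypothetical sample data: encoding with "utf-16" prepends a BOM.
sample = "name,city\r\nLöwis,Berlin\r\n".encode("utf-16")
rows = read_utf16_csv(sample)
```

Decoding everything up front keeps the parser itself free of any encoding logic, at the cost of holding the whole file in memory.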
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source  (#1, Jan 05 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] 2.3.5 schedule, and something I'd like to get in

2005-01-05 Thread Ronald Oussoren
On 5-jan-05, at 9:33, Martin v. Löwis wrote:
Bob Ippolito wrote:
It doesn't for reasons I care not to explain in depth, again. Search
the pythonmac-sig archives for longer explanations. The gist is that
you specifically do not want to link directly to the framework at all
when building extensions.
Because an Apple-built extension then may pick up a user-installed
Python? Why can this problem not be solved by adding -F options,
as Jack Jansen proposed?
It gets worse when you have a user-installed python 2.3 and a
user-installed python 2.4. Those will both be installed as
/Library/Frameworks/Python.framework. This means that you cannot use
the -F flag to select which one you want to link to; '-framework
Python' will only link to the Python that was installed most recently.

This is an issue on Mac OS X 10.2.
Ronald


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread Andrew McNamara
 Andrew McNamara wrote:
 There's a bunch of jobs we (CSV module maintainers) have been putting
 off - attached is a list (in no particular order):
 * unicode support (this will probably uglify the code considerably).
 
Martin v. Löwis wrote:
 Can you please elaborate on that? What needs to be done, and how is
 that going to be done? It might be possible to avoid considerable
 uglification.

I'm not altogether sure there. The parsing state machine is all written in
C, and deals with signed chars - I expect we'll need two versions of that
(or one version that's compiled twice using pre-processor macros). Quite
a large job. Suggestions gratefully received.

M.-A. Lemburg wrote:
Indeed. The trick is to convert to Unicode early and to use Unicode
literals instead of string literals in the code.

Yes, although it would be nice to also retain the 8-bit versions as well.

Note that the only real-life Unicode format in use is UTF-16
(with BOM mark) written by Excel. Note that there's no standard
for specifying the encoding in CSV files, so this is also the only
feasible format.

Yes - that's part of the problem I hadn't really thought about yet - the
csv module currently interacts directly with files as iterators, but it's 
clear that we'll need to decode as we go.
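
A sketch of what "decode as we go" could look like, assuming modern Python 3 and the standard codecs incremental decoder. The generator name decoded_lines is hypothetical, and this is not how the csv module itself was eventually changed.

```python
import codecs
import csv

def decoded_lines(chunks, encoding):
    """Yield decoded text lines from an iterable of byte chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    for chunk in chunks:
        # The decoder buffers incomplete code units between calls.
        pending += decoder.decode(chunk)
        # Emit every complete line and keep the trailing partial line.
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line + "\n"
    # Flush whatever the decoder still holds at end of input.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending

# Hypothetical input: UTF-16 data arriving in awkward 3-byte chunks.
data = 'a,b\r\nc,"d\ne"\r\n'.encode("utf-16")
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
rows = list(csv.reader(decoded_lines(chunks, "utf-16")))
```

Only one chunk plus one partial line is held at a time, so the memory cost stays flat regardless of file size.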

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread M.-A. Lemburg
Andrew McNamara wrote:
Andrew McNamara wrote:
There's a bunch of jobs we (CSV module maintainers) have been putting
off - attached is a list (in no particular order):
* unicode support (this will probably uglify the code considerably).

Martin v. Löwis wrote:
Can you please elaborate on that? What needs to be done, and how is
that going to be done? It might be possible to avoid considerable
uglification.

I'm not altogether sure there. The parsing state machine is all written in
C, and deals with signed chars - I expect we'll need two versions of that
(or one version that's compiled twice using pre-processor macros). Quite
a large job. Suggestions gratefully received.
M.-A. Lemburg wrote:
Indeed. The trick is to convert to Unicode early and to use Unicode
literals instead of string literals in the code.

Yes, although it would be nice to also retain the 8-bit versions as well.
You can do so by using latin-1 as default encoding. Works great !
Note that the only real-life Unicode format in use is UTF-16
(with BOM mark) written by Excel. Note that there's no standard
for specifying the encoding in CSV files, so this is also the only
feasible format.
Yes - that's part of the problem I hadn't really thought about yet - the
csv module currently interacts directly with files as iterators, but it's 
clear that we'll need to decode as we go.
Depends on your needs: CSV files tend to be small enough
to do the decoding in one call in memory.
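
The latin-1 suggestion works because Latin-1 maps byte value N to code point N, so any 8-bit input survives a decode/encode round trip unchanged. A small check in Python 3 syntax (the variable names are illustrative only):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding cannot fail...
text = all_bytes.decode("latin-1")

# ...and encoding the result gives back the original bytes exactly.
round_tripped = text.encode("latin-1")
```

The round trip is lossless, but it still costs one decode and one encode pass over the data.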
--
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread Andrew McNamara
 Yes, although it would be nice to also retain the 8-bit versions as well.

You can do so by using latin-1 as default encoding. Works great !

Yep, although that means we wear the cost of decoding and encoding for
all 8 bit input.

What does the _sre.c code do?

Depends on your needs: CSV files tend to be small enough
to do the decoding in one call in memory.

We are routinely dealing with multi-gigabyte csv files - which is why the
original 2001 vintage csv module was written as a C state machine. 

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread M.-A. Lemburg
Andrew McNamara wrote:
Yes, although it would be nice to also retain the 8-bit versions as well.
You can do so by using latin-1 as default encoding. Works great !
Yep, although that means we wear the cost of decoding and encoding for
all 8 bit input.
Right, but it makes the code very clean and straightforward.
Again, it depends on what you need. If performance is critical
then you probably need a C version written using the same trick
as _sre.c...
What does the _sre.c code do?
It comes in two versions: one for 8-bit the other for Unicode.
Depends on your needs: CSV files tend to be small enough
to do the decoding in one call in memory.
We are routinely dealing with multi-gigabyte csv files - which is why the
original 2001 vintage csv module was written as a C state machine. 
I see, but are you sure that the typical Python user will have
the same requirements to make it worth the effort (and
complexity) ?
I've written a few CSV parsers and writers myself over the years
and the requirements were different every time, in terms
of being flexible in the parsing phase, the interfaces and
the performance needs. Haven't yet found a one fits all
solution and don't really expect to any more :-)
--
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] csv module TODO list

2005-01-05 Thread Andrew McNamara
 Yep, although that means we wear the cost of decoding and encoding for
 all 8 bit input.

Right, but it makes the code very clean and straightforward.

I agree it makes for a very clean solution, and 99% of the time I'd
choose that option.

Again, it depends on what you need. If performance is critical
then you probably need a C version written using the same trick
as _sre.c...

 What does the _sre.c code do?

It comes in two versions: one for 8-bit the other for Unicode.

That's what I thought. I think the motivations here are similar to those
that drove the _sre developers.

 We are routinely dealing with multi-gigabyte csv files - which is why the
 original 2001 vintage csv module was written as a C state machine. 

I see, but are you sure that the typical Python user will have
the same requirements to make it worth the effort (and
complexity) ?

This is open source, so I scratch my own itch (and that of my employers) - 
we need fast csv parsing more than we need unicode... 8-)

Okay, assuming we go the "produce two versions via evil macro tricks"
path, it's still not quite the same situation as _sre.c, which only has
to deal with the internal unicode representation.

One way to approach this would be to add an encoding keyword argument
to the readers and writers. If given, the parser would decode the input
stream to the internal representation before passing it through the
unicode state machine, which would yield tuples of unicode objects.

That leaves us with a bit of a problem where the source is already unicode
(eg, a list of unicode strings)... hmm.
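
One possible shape for the encoding keyword idea, sketched in Python 3 on top of today's text-based csv module. The function name unicode_reader and its signature are assumptions made for illustration, not an API the csv module provides. Passing encoding=None is one way to cover the case where the source is already unicode.

```python
import csv
import io

def unicode_reader(source, encoding=None, **fmtparams):
    """Hypothetical wrapper: yield each row as a tuple of str.

    With `encoding`, `source` must be a binary file object and is decoded
    on the fly. Without it, `source` is taken to be an iterable of str
    lines that need no decoding.
    """
    if encoding is not None:
        # newline="" leaves line endings for the csv module to interpret.
        source = io.TextIOWrapper(source, encoding=encoding, newline="")
    for row in csv.reader(source, **fmtparams):
        yield tuple(row)

# Encoded input: decoded as it is read.
raw = io.BytesIO("é,b\r\n".encode("utf-8"))
from_bytes = list(unicode_reader(raw, encoding="utf-8"))

# Already-decoded input: a plain list of str lines, no encoding given.
from_text = list(unicode_reader(["x,y\r\n"]))
```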

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/


Re: [Python-Dev] 2.3.5 schedule, and something I'd like to get in

2005-01-05 Thread Michael Hudson
Martin v. Löwis writes:

 Bob Ippolito wrote:
 It doesn't for reasons I care not to explain in depth, again.
 Search the pythonmac-sig archives for longer explanations. The
 gist is that you specifically do not want to link directly to the
 framework at all when building extensions.

 Because an Apple-built extension then may pick up a user-installed
 Python? Why can this problem not be solved by adding -F options,
 as Jack Jansen proposed?

 This is not the wrong way to do it.

 I'm not convinced.

Martin, can you please believe that Jack, Bob, Ronald et al know what
they are talking about here?

Cheers,
mwh

-- 
  Q: Isn't it okay to just read Slashdot for the links?
  A: No. Reading Slashdot for the links is like having just one hit
 off the crack pipe.
 -- http://www.cs.washington.edu/homes/klee/misc/slashdot.html#faq


Re: [Python-Dev] 2.3.5 schedule, and something I'd like to get in

2005-01-05 Thread Bob Ippolito
On Jan 5, 2005, at 3:33 AM, Martin v. Löwis wrote:
Bob Ippolito wrote:
It doesn't for reasons I care not to explain in depth, again. Search
the pythonmac-sig archives for longer explanations. The gist is that
you specifically do not want to link directly to the framework at all
when building extensions.
Because an Apple-built extension then may pick up a user-installed
Python? Why can this problem not be solved by adding -F options,
as Jack Jansen proposed?
This is not the wrong way to do it.
I'm not convinced.
Then you haven't done the appropriate research by searching 
pythonmac-sig.  Do you even own a Mac?

-bob


Re: [Python-Dev] ast branch pragmatics

2005-01-05 Thread Guido van Rossum
 I think it would be easier to create a new branch from the current
 head, integrate the small number of changed files from ast-branch, and
 work with that branch instead.  The idea is that it's an end-run
 around doing an automatic CVS merge and relying on someone to manually
 merge the changes.
 
 At the same time, since there is a groundswell of support for
 finishing the AST work, I'd like to propose that we stop making
 compiler / bytecode changes until it is done.  Every change to
 compile.c or the bytecode ends up creating a new incompatibility that
 needs to be merged.
 
 If these two plans sound good, I'll get started on the new branch.

+1

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] 2.3.5 schedule, and something I'd like to get in

2005-01-05 Thread Bob Ippolito
On Jan 5, 2005, at 18:46, Martin v. Löwis wrote:
Bob Ippolito wrote:
I just dug up some information I had written on this particular topic
but never published, if you're interested:
http://bob.pythonmac.org/archives/2005/01/05/versioned-frameworks-considered-harmful/
Interesting. I don't get the part why -undefined dynamic_lookup
is a good idea (and this is indeed what bothered me most to begin
with).
As you say, explicitly specifying the target .dylib should work as
well, and it also does not require 10.3.
Without -undefined dynamic_lookup, your Python extensions are bound to
a specific Python installation location (i.e. the system 2.3.0 and a
user-installed 2.3.4). This tends to be quite a problem. With
-undefined dynamic_lookup, they are not.

Just search for "version mismatch" on pythonmac-sig:
http://www.google.com/search?q=%22version+mismatch%22+pythonmac-sig+site:mail.python.org&ie=UTF-8&oe=UTF-8

-bob


RE: [Python-Dev] an idea for improving struct.unpack api

2005-01-05 Thread Raymond Hettinger
[Ilya Sandler]
 A problem:
 
 The current struct.unpack api works well for unpacking C-structures where
 everything is usually unpacked at once, but it becomes inconvenient when
 unpacking binary files where things often have to be unpacked field by
 field. Then one has to keep track of offsets, slice the strings, call
 struct.calcsize(), etc...

Yes.  That bites.


 Eg. with a current api unpacking  of a record which consists of a
 header followed by a variable  number of items would go like this
 
  hdr_fmt=
  item_fmt=
  item_size=calcsize(item_fmt)
  hdr_size=calcsize(hdr_fmt)
  hdr=unpack(hdr_fmt, rec[0:hdr_size]) #rec is the record to unpack
  offset=hdr_size
  for i in range(hdr[0]): #assume 1st field of header is a counter
      item=unpack( item_fmt, rec[ offset: offset+item_size])
      offset+=item_size
 
 which is quite inconvenient...
 
 
 A  solution:
 
 We could have an optional offset argument for
 
 unpack(format, buffer, offset=None)
 
 the offset argument is an object which contains a single integer field
 which gets incremented inside unpack() to point to the next byte.
 
 so with a new API the above code could be written as
 
  offset=struct.Offset(0)
  hdr=unpack(, offset)
  for i in range(hdr[0]):
 item=unpack( , rec, offset)
 
 When an offset argument is provided, unpack() should allow some bytes to
 be left unpacked at the end of the buffer.
 
 
 Does this suggestion make sense? Any better ideas?
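
For readers who want to try the idea, here is a minimal sketch in Python 3. It assumes struct.unpack_from(), which was added to the standard library after this thread and takes a plain integer offset. The Offset class and the unpack_at helper are hypothetical stand-ins for the proposed struct.Offset, and the format strings are invented because the ones in the post above did not survive.

```python
import struct

class Offset:
    """Hypothetical mutable cursor, standing in for the proposed struct.Offset."""
    def __init__(self, value=0):
        self.value = value

def unpack_at(fmt, buffer, offset):
    """Unpack `fmt` starting at offset.value, then advance the offset past it."""
    fields = struct.unpack_from(fmt, buffer, offset.value)
    offset.value += struct.calcsize(fmt)
    return fields

# Made-up record: a one-byte item count, then that many little-endian uint16s.
rec = struct.pack("<B3H", 3, 10, 20, 30)

offset = Offset(0)
(count,) = unpack_at("<B", rec, offset)
items = [unpack_at("<H", rec, offset)[0] for _ in range(count)]
```

Because unpack_from() ignores trailing bytes, the "allow some bytes to be left unpacked" requirement comes for free.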

Rather than alter struct.unpack(), I suggest making a separate class
that tracks the offset and encapsulates some of the logic that typically
surrounds unpacking:

r = StructReader(rec)
hdr = r('')
for item in r.getgroups('', times=rec[0]):
   . . .

It would be especially nice if it handled the more complex case where
the next offset is determined in-part by the data being read (see the
example in section 11.3 of the tutorial):

r = StructReader(open('myfile.zip', 'rb'))
for i in range(3):  # show the first 3 file headers
    fields = r.getgroup('LLLHH', offset=14)
    crc32, comp_size, uncomp_size, filenamesize, extra_size = fields
    filename = r.getgroup('c', offset=16, times=filenamesize)
    extra = r.getgroup('c', times=extra_size)
    r.advance(comp_size)
    print filename, hex(crc32), comp_size, uncomp_size

If you come up with something, I suggest posting it as an ASPN recipe
and then announcing it on comp.lang.python.  That ought to generate some
good feedback based on other people's real world issues with
struct.unpack().
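
A rough sketch of what such a StructReader might look like, in Python 3 and limited to an in-memory bytes buffer rather than an open file. The class is hypothetical: the method names follow the examples above, but the behaviour is a guess at the intent, built on struct.unpack_from() and struct.calcsize().

```python
import struct

class StructReader:
    """Hypothetical reader that tracks its own offset into a bytes buffer."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def __call__(self, fmt):
        """Unpack one `fmt` at the current position and move past it."""
        fields = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return fields

    def getgroups(self, fmt, times):
        """Yield `times` consecutive records, each unpacked with `fmt`."""
        for _ in range(times):
            yield self(fmt)

    def advance(self, nbytes):
        """Skip `nbytes` without unpacking them."""
        self.pos += nbytes

# Made-up record: a one-byte item count, then that many little-endian uint16s.
rec = struct.pack("<B3H", 3, 10, 20, 30)

r = StructReader(rec)
(count,) = r("<B")
items = [value for (value,) in r.getgroups("<H", times=count)]
```

Keeping the offset inside the reader means callers never touch calcsize() or slice the buffer themselves.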


Raymond Hettinger
