Re: [Numpy-discussion] Comparison changes
On Sat, 2014-01-25 at 00:18, Nathaniel Smith wrote:

On 25 Jan 2014 00:05, Sebastian Berg sebast...@sipsolutions.net wrote:

Hi all, in https://github.com/numpy/numpy/pull/3514 I proposed some changes to the comparison operators. This includes:

1. Comparison with None will broadcast in the future, so that `arr == None` will actually compare all elements to None. (A FutureWarning for now)

2. I added that == and != will give a FutureWarning when an error was raised. In the future they should not silence these errors anymore (for example, shape mismatches).

This can just be a DeprecationWarning, because the only change is to raise more errors.

Right, that is already the case.

3. We used to use PyObject_RichCompareBool for equality, which includes an identity check. I propose not to do that identity check, since we have elementwise equality (returning an object array for objects would be nice in some ways, but I think that is only an option for a dedicated function). The reason is that, for example,

a = np.array([np.array([1, 2, 3]), 1])
b = np.array([np.array([1, 2, 3]), 1])

a == b will happen to work only if it happens that `a[0] is b[0]`. This currently has no deprecation, since the logic is in the inner loop and I am not sure whether it is easy to add a warning there.

Surely any environment where we can call PyObject_RichCompareBool is an environment where we can issue a warning...?

Right, I suppose an extra identity check, comparing its result with the elementwise result, is indeed no problem. So I think I will add that.

- Sebastian

-n

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
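[Editor's note: the identity shortcut in PyObject_RichCompareBool can be seen in plain Python, since container equality uses the same fast path; a minimal illustration:]

```python
# PyObject_RichCompareBool short-circuits on identity: if two operands
# are the very same object, they are reported equal without calling __eq__.
nan = float('nan')
assert nan != nan          # IEEE 754: NaN is unequal to itself
assert [nan] == [nan]      # same object inside: list equality hits the shortcut
```

This is exactly why `a == b` in point 3 "happens to work" only when `a[0] is b[0]`: the identity check masks the fact that elementwise comparison of the contained arrays would not yield a single bool.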
Re: [Numpy-discussion] Catching out-of-memory error before it happens
On 24 January 2014 23:09, Dinesh Vadhia dineshbvad...@hotmail.com wrote:

Francesc: Thanks. I looked at numexpr a few years back but it didn't support array slicing/indexing. Has that changed?

No, but you can do it yourself:

big_array = np.empty(2)
piece = big_array[30:-50]
ne.evaluate('sqrt(piece)')

Here, creating piece does not increase memory use, as slicing shares the original data (well, actually, it adds a mere 80 bytes, the overhead of an array).
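[Editor's note: the claim that slicing adds only a view, not a copy, can be checked directly with plain NumPy; the sizes below are illustrative, not from the original post:]

```python
import numpy as np

big_array = np.empty(1_000_000)   # ~8 MB of float64
piece = big_array[30:-50]         # basic slicing returns a view

# No data were copied: the slice's buffer belongs to the original array,
# so passing `piece` to numexpr adds essentially no memory overhead.
assert piece.base is big_array
assert piece.size == 1_000_000 - 80
```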
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker chris.bar...@noaa.gov wrote:

Thanks for poking into this all. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us room to change it, and backward compatibility is less of an issue because it's broken already -- do we need to preserve bug-for-bug compatibility? Maybe, but I suspect in this case, not -- code that works fine on py3 with the 'S' type is probably only lucky that it hasn't encountered the issues yet. And no matter how you slice it, code being ported to py3 needs to deal with text handling issues.

But here is where we stand. The 'S' dtype:
- was designed for one-byte-per-char text data.
- was mapped to the py2 string type.
- used the classic C null-terminated approach.
- can be used for arbitrary bytes (as the py2 string type can), but not quite, as it truncates null bytes -- so it is really a bad idea to use it that way.

Under py3, the 'S' type maps to the py3 bytes type, because that's the closest to the py2 string type. But it also does some inconsistent things with encoding, and does treat a lot of other things as text. The py3 bytes type does not have the same text handling as the py2 string type, so things like:

s = 'a string'
np.array((s,), dtype='S')[0] == s

give you False, rather than True as on py2. This is because a py3 string is translated to the 'S' type (presumably with the default encoding -- another thing that is maybe not a good idea), but indexing returns a bytes object, which does not compare equal to a py3 string. You can work around this with various calls to encode() and decode(), and/or by using b'a string', but that is ugly and kludgy, and doesn't work well with the py3 text model.

The py2 -> py3 transition separated bytes and strings: strings are unicode, and bytes are not to be used for text (directly). While there is some text-related functionality still in bytes, the core devs are quite clear that that is for special cases only, and not for general text processing.
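[Editor's note: the py2/py3 difference described above is easy to reproduce; a minimal sketch of the behavior on Python 3 with NumPy:]

```python
import numpy as np

s = 'a string'
elem = np.array((s,), dtype='S')[0]

# On py3 the element comes back as bytes, not str ...
assert isinstance(elem, bytes)
assert elem == b'a string'
# ... so comparing it to the original py3 string is False, not True.
assert not (elem == s)
```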
I don't think numpy should fight this, but rather embrace the py3 text model. The most natural way to do that is to use the existing 'U' dtype for text -- really the best solution for most cases (like the above case). However, there is a use case for a more efficient way to deal with text. There are a couple of ways to go about that that have been brought up here:

1: Have a more efficient unicode dtype: variable length, multiple encoding options, etc. This is a fine idea that would support better text handling in numpy, and _maybe_ better interaction with external libraries (HDF, etc...).

2: Have a one-byte-per-char text dtype. This would be much easier to implement and fit into the current numpy model, and would satisfy a lot of common use cases for scientific data sets.

We could certainly do both, but I'd like to see (2) get done sooner rather than later.

This is pretty much my sense of things at the moment. I think (1) is needed in the long term but that (2) is a quick fix that solves most problems in the short term.

A related issue is whether numpy needs a dtype analogous to py3 bytes -- I'm still not sure of the use case there, so can't comment -- would it need to be fixed length (fitting into the numpy data model better) or variable length, or ...? Some folks are (apparently) using the current 'S' type in this way, but I think that's ripe for errors, due to the null bytes issue. Though maybe there is a null-bytes-are-special binary format that isn't text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple of choices:

(1) Deprecate it, so that it stays around for backward compatibility, but encourage people to use 'U' for text, or one of the new dtypes that are yet to be implemented (maybe 's' for a one-byte-per-char dtype), and to use either uint8 or the new bytes dtype that is yet to be implemented for binary data.

(2) Fix it -- in this case, I think we need to be clear what it is:

-- A one-byte-per-char text type?
If so, it should map to a py3 string, and have a defined encoding (ascii or latin-1, probably), or even better a settable encoding -- but only for one-byte-per-char encodings. I don't think utf-8 is a good idea here, as a utf-8 encoded string is of unknown byte length. (There is some room for debate here: as the 'S' type is fixed length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as long as it doesn't truncate in the middle of a character.)

I think we should make it a one-character encoded type compatible with str in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of PEP 393, where it is effectively a UCS-1, but ascii might be a bit more flexible because it is a subset of utf-8 and might serve better in python 2.

-- A bytes type? In which case, we should clean out all the automatic conversions to and from text that are in it now. I'm not sure what to
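[Editor's note: the fixed-width concern is concrete -- latin-1 is exactly one byte per character, while utf-8 is not, so naive byte truncation can split a character. A small sketch; the string 'Åre' is just an illustrative example:]

```python
s = 'Åre'

# latin-1: one byte per character, round-trips exactly.
b1 = s.encode('latin-1')
assert len(b1) == len(s) == 3
assert b1.decode('latin-1') == s

# utf-8: 'Å' takes two bytes, so byte length != character length,
# and chopping to a fixed byte width can cut a character in half.
b8 = s.encode('utf-8')
assert len(b8) == 4
try:
    b8[:1].decode('utf-8')   # truncated mid-character
    raise AssertionError('expected UnicodeDecodeError')
except UnicodeDecodeError:
    pass
```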
Re: [Numpy-discussion] Numpy arrays vs typed memoryviews
I think I have said this before, but it's worth a repeat: pickle (including cPickle) is a slow hog! That might be the real overhead you are seeing -- you just haven't noticed it yet. I saw this some years ago when I worked on shared memory arrays for NumPy (cf. my account on GitHub). Shared memory really did not help to speed up the IPC, because the entire overhead was dominated by pickle. (Shared memory is a fine way of saving RAM, though.)

multiprocessing.Queue uses pickle for serialization, and is therefore not the right tool for numerical parallel computing with Cython or NumPy. To use multiprocessing efficiently with NumPy, we need a new Queue type that knows about NumPy arrays (and/or Cython memoryviews) and treats them as special cases. Getting rid of pickle altogether is the important part, not facilitating its use even further.

It is easy to make a Queue type for Cython or NumPy arrays using a duplex pipe and a couple of mutexes. Or you can use shared memory as a ring buffer, with atomic compare-and-swap on the first bytes as spinlocks. It is not difficult to get the overhead of queuing arrays down to little more than a memcpy. I've been wanting to do this for a while, so maybe it is time to start a new toy project :)

Sturla

Neal Hughes hughes.n...@gmail.com wrote:

I like Cython a lot. My only complaint is that I have to keep switching between the numpy array support and typed memoryviews. Both have their advantages, but neither can do everything I need. Memoryviews have the clean syntax and seem to work better in cdef classes and in inline functions. But memoryviews can't be pickled and so can't be passed between processes. Also, there seems to be a high overhead on converting between memoryviews and Python numpy arrays. Where this overhead is a problem, or where I need to use Python's multiprocessing module, I tend to switch to numpy arrays. If memoryviews could be converted to Python objects quickly, and pickled, I would have no need for the old numpy array support.
Wondering if these problems will ever be addressed, or if I am missing something completely.

--
You received this message because you are subscribed to the Google Groups cython-users group. To unsubscribe from this group and stop receiving emails from it, send an email to cython-users+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
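[Editor's note: Sturla's pickle-free approach can be approximated today with the standard library. `multiprocessing.shared_memory` (added in Python 3.8, years after this thread) lets two processes view one buffer as a NumPy array with no serialization at all. A minimal in-process sketch of the mechanism -- not the queue design he describes:]

```python
import numpy as np
from multiprocessing import shared_memory

# Create a shared block and view it as a float64 array (no copy, no pickle).
shm = shared_memory.SharedMemory(create=True, size=3 * 8)
arr = np.ndarray((3,), dtype=np.float64, buffer=shm.buf)
arr[:] = [1.0, 2.0, 3.0]

# A second process would attach by name; shown here in-process for brevity.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((3,), dtype=np.float64, buffer=attached.buf)
result = view.tolist()

# Release the array views before closing, then free the block.
del arr, view
attached.close()
shm.close()
shm.unlink()

assert result == [1.0, 2.0, 3.0]
```

Only the block's name (a short string) would need to cross a Queue; the array data itself is never pickled, which is exactly the overhead being complained about above.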
Re: [Numpy-discussion] Comparison changes
On Sat, 25 Jan 2014 01:05:15 +0100, Sebastian Berg wrote:

1. Comparison with None will broadcast in the future, so that `arr == None` will actually compare all elements to None. (A FutureWarning for now)

This is a very useful change in behavior -- thanks!

Stéfan
Re: [Numpy-discussion] Text array dtype for numpy
On 24 January 2014 22:43, Chris Barker chris.bar...@noaa.gov wrote:

Oscar, cool stuff, thanks! I'm wondering though what the use case really is.

The use case is precisely the use case for dtype='S' on py2, except that it also works on py3.

The py3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems -- where you can't avoid it. And you might choose different encodings based on different needs.

Exactly. But what you're missing is that storing text in a numpy array means putting the text into bytes, and the encoding needs to be specified. My proposal involves explicitly specifying the encoding. This is the key point about the Python 3 text model: it is not that encoding isn't automatic (e.g. when you print() or call file.write() with a text file); the point is that there must never be ambiguity about the encoding that is used when encode/decode occurs.

So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with a one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.

Because users want to store text in a numpy array and use less than 4 bytes per character. You expressed a desire for this. The only difference between this and your latin-1 suggestion is that this one has an explicit encoding that is visible to the user, and that you can choose that encoding to be anything that your Python installation supports.
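[Editor's note: the "never ambiguity about the encoding" point is easy to demonstrate -- the same bytes mean different text (or nothing at all) under different encodings, so the encoding has to travel with the bytes:]

```python
data = b'\xe9'

# Under latin-1 these bytes are the character 'é' ...
assert data.decode('latin-1') == '\u00e9'

# ... but under utf-8 they are not even a valid sequence.
try:
    data.decode('utf-8')
    raise AssertionError('expected UnicodeDecodeError')
except UnicodeDecodeError:
    pass
```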
Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc.) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.

Perhaps there is a need for a bytes dtype as well. But note that you can use textarray with encoding='ascii' to satisfy many of these use cases. So h5py and pytables could expose an interface that stores text as bytes but has a clearly labelled (and enforced) encoding.

If we want a more efficient and compact unicode implementation, then the py3 one is a good place to start -- it's pretty slick! Though maybe harder to do in numpy, as text in numpy probably wouldn't be immutable.

It's not a good fit for numpy, because numpy arrays expose their memory buffer. More on this below, but if there were to be something as drastic as the FSR, then it would be better to think about how to make an ndarray type that is completely different, has an opaque memory buffer, and can handle arbitrary-length text strings.

To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.

This scares me right there -- is it text or bytes??? We really don't want something that is both.

I believe that there is a conceptual misunderstanding about what a numpy array is here. A numpy array is a clever view onto a memory buffer. A numpy array always has two interfaces: one that describes a memory buffer and one that delivers Python objects representing the abstract quantities described by each portion of the memory buffer. The dtype specifies three things:
1) How many bytes of the buffer are used.
2) What kind of abstract object this part of the buffer represents.
3) The mapping from the bytes in this segment of the buffer to the abstract object.
As an example:

>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='u4')
>>> a
array([1, 2, 3], dtype=uint32)
>>> a.tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

So what is this array? Is it bytes or is it integers? It is both. The array is a view onto a memory buffer, and the dtype is the encoding that describes the meaning of the bytes in different segments. In this case the dtype is 'u4'. This tells us that we need 4 bytes per segment, that each segment represents an integer, and that the mapping from byte segments to integers is the unsigned little-endian mapping.

How can we do the same thing with text? We need a way to map text to fixed-width bytes. Mapping text to bytes is done with text encodings. So we need a dtype that incorporates a text encoding in order to define the relationship between the bytes in the array's memory buffer and the abstract entity that is a sequence of Unicode characters. Using dtype='U' doesn't get around this:

>>> a = np.array(['qwe'], dtype='U')
>>> a
array(['qwe'], dtype='U3')
>>> a[0]  # text
'qwe'
>>> a.tostring()  # bytes
b'q\x00\x00\x00w\x00\x00\x00e\x00\x00\x00'

In my proposal you'd get the same by using 'utf-32-le' as the encoding for your text array. The idea is that the array has an encoding. It stores strings as bytes. The
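[Editor's note: the equivalence claimed here -- that a 'U' array's buffer is just UTF-32 encoded text -- can be checked directly; the byte order follows the machine, which is little-endian on most current hardware:]

```python
import sys
import numpy as np

a = np.array(['qwe'], dtype='U3')
codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

# The array's raw buffer is exactly the UTF-32 encoding of the text ...
assert a.tobytes() == 'qwe'.encode(codec)
# ... and decoding the buffer recovers the text.
assert a.tobytes().decode(codec) == 'qwe'
```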
[Numpy-discussion] ANN: numexpr 2.3 (final) released
== Announcing Numexpr 2.3 ==

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It has multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use computational engine for projects that don't want to adopt other solutions requiring heavier dependencies.

Numexpr is already being used in a series of packages (PyTables, pandas, BLZ...) to help do computations faster.

What's new
==

The repository has been migrated to https://github.com/pydata/numexpr. All new tickets and PRs should be directed there.

Also, a `conj()` function for computing the conjugate of complex arrays has been added. Thanks to David Menéndez. See PR #125.

Finally, we fixed a DeprecationWarning derived from using ``oa_ndim == 0`` and ``op_axes == NULL`` with `NpyIter_AdvancedNew()` and NumPy 1.8. Thanks to Mark Wiebe for advice on how to fix this properly.

Many thanks to Christoph Gohlke and Ilan Schnell for their help during the testing of this release in all kinds of possible combinations of platforms and MKL.

In case you want to know in more detail what has changed in this version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at GitHub: https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr

Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

-- Francesc Alted
[Numpy-discussion] ANN: python-blosc 1.2.0 released
= Announcing python-blosc 1.2.0 =

What is new?

This release adds support for the multiple compressors added in the Blosc 1.3 series. The new compressors are:

* lz4 (http://code.google.com/p/lz4/): A very fast compressor/decompressor. Could be thought of as a replacement for the original BloscLZ, but it can behave better in some scenarios.
* lz4hc (http://code.google.com/p/lz4/): This is a variation of LZ4 that achieves a much better compression ratio at the cost of being much slower at compressing. Decompression speed is unaffected (and sometimes better than when using LZ4 itself!), so this is very good for read-only datasets.
* snappy (http://code.google.com/p/snappy/): A very fast compressor/decompressor. Could be thought of as a replacement for the original BloscLZ, but it can behave better in some scenarios.
* zlib (http://www.zlib.net/): This is a classic. It achieves very good compression ratios, at the cost of speed. However, decompression speed is still pretty good, so it is a good candidate for read-only datasets.

Selecting the compressor is just a matter of specifying the new `cname` parameter in the compression functions. For example::

    arr = numpy.arange(N, dtype=numpy.int64)
    out = blosc.pack_array(arr, cname='lz4')

Just to give an overview of the differences between the compressors in the new Blosc, here is the output of the included compress_ptr.py benchmark (https://github.com/ContinuumIO/python-blosc/blob/master/bench/compress_ptr.py) that compresses/decompresses NumPy arrays with different data distributions::

    Creating different NumPy arrays with 10**7 int64/float64 elements:
    *** np.copy() Time for memcpy(): 0.030 s

    *** the arange linear distribution ***
    *** blosclz *** Time for comp/decomp: 0.013/0.022 s. Compr ratio: 136.83
    *** lz4     *** Time for comp/decomp: 0.009/0.031 s. Compr ratio: 137.19
    *** lz4hc   *** Time for comp/decomp: 0.103/0.021 s. Compr ratio: 165.12
    *** snappy  *** Time for comp/decomp: 0.012/0.045 s. Compr ratio: 20.38
    *** zlib    *** Time for comp/decomp: 0.243/0.056 s. Compr ratio: 407.60

    *** the linspace linear distribution ***
    *** blosclz *** Time for comp/decomp: 0.031/0.036 s. Compr ratio: 14.27
    *** lz4     *** Time for comp/decomp: 0.016/0.033 s. Compr ratio: 19.68
    *** lz4hc   *** Time for comp/decomp: 0.188/0.020 s. Compr ratio: 78.21
    *** snappy  *** Time for comp/decomp: 0.020/0.032 s. Compr ratio: 11.72
    *** zlib    *** Time for comp/decomp: 0.290/0.048 s. Compr ratio: 90.90

    *** the random distribution ***
    *** blosclz *** Time for comp/decomp: 0.083/0.025 s. Compr ratio: 4.35
    *** lz4     *** Time for comp/decomp: 0.022/0.034 s. Compr ratio: 4.65
    *** lz4hc   *** Time for comp/decomp: 1.803/0.039 s. Compr ratio: 5.61
    *** snappy  *** Time for comp/decomp: 0.028/0.023 s. Compr ratio: 4.48
    *** zlib    *** Time for comp/decomp: 3.146/0.073 s. Compr ratio: 6.17

That means that Blosc in combination with LZ4 can compress at speeds that can be up to 3x faster than a pure memcpy operation. Decompression is a bit slower (but still on the same order as memcpy()), probably because writing to memory is slower than reading. This was using an Intel Core i5-3380M CPU @ 2.90GHz, running Python 3.3 and Linux 3.7.10, but YMMV (and it will vary!).

For more info, you can have a look at the release notes in:

https://github.com/ContinuumIO/python-blosc/wiki/Release-notes

More docs and examples are available in the documentation site: http://blosc.pydata.org

What is it?
===

python-blosc (http://blosc.pydata.org/) is a Python wrapper for the Blosc compression library. Blosc (http://blosc.org) is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Whether this is achieved or not depends on the data compressibility, the number of cores in the system, and other factors.
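[Editor's note: the benchmark's central pattern -- linear, low-entropy data compresses far better than random data -- can be reproduced with the stdlib zlib codec, one of the codecs Blosc 1.3 wraps. A small sketch, not the python-blosc API itself:]

```python
import zlib
import numpy as np

n = 100_000
linear = np.arange(n, dtype=np.int64).tobytes()
rng = np.random.default_rng(0)
random = rng.integers(0, 2**62, n, dtype=np.int64).tobytes()

ratio_linear = len(linear) / len(zlib.compress(linear))
ratio_random = len(random) / len(zlib.compress(random))

# Low-entropy (linear) data compresses far better than random data,
# mirroring the arange-vs-random gap in the benchmark output above.
assert ratio_linear > 2 * ratio_random
```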
See a series of benchmarks conducted for many different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks.

Blosc works well for compressing numerical arrays that contain data with relatively low entropy, like sparse data, time series, grids with regularly spaced values, etc.

There is also a handy command-line tool for Blosc called Bloscpack (https://github.com/esc/bloscpack) that allows you to compress large binary datafiles on disk. Although the format for Bloscpack has not stabilized yet, it allows you to effectively use Blosc from your favorite shell.

Installing
==

python-blosc is in the PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix

Download sources

The sources are managed through github
[Numpy-discussion] ANN: BLZ 0.6.1 has been released
Announcing BLZ 0.6 series
=

What it is
--

BLZ is a chunked container for numerical data. Chunking allows for efficient enlarging/shrinking of the data container. In addition, it can also be compressed for reducing memory/disk needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data.

The main objects in BLZ are `barray` and `btable`. `barray` is meant for storing multidimensional homogeneous datasets efficiently. `barray` objects provide the foundations for building `btable` objects, where each column is made of a single `barray`. Facilities are provided for iterating, filtering and querying `btables` in an efficient way. You can find more info about `barray` and `btable` in the tutorial: http://blz.pydata.org/blz-manual/tutorial.html

BLZ can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too), either from memory or from disk. In the future, it is planned to use Numba as the computational kernel and to provide better Blaze (http://blaze.pydata.org) integration.

What's new
--

BLZ has been branched off from the Blaze project (http://blaze.pydata.org). BLZ was meant as a persistent format and library for I/O in Blaze. BLZ in Blaze is based on the previous carray 0.5, and this is why this new version is labeled 0.6.

BLZ supports completely transparent storage on disk in addition to memory. That means that *everything* that can be done with the in-memory container can be done using the disk as well. The advantage of a disk-based container is that the addressable space is much larger than just your available memory. Also, as BLZ is based on a chunked and compressed data layout built on the super-fast Blosc compression library, the data access speed is very good.
The format chosen for the persistence layer is based on the 'bloscpack' library and is described in the "Persistent format for BLZ" chapter of the user manual ('docs/source/persistence-format.rst'). More about Bloscpack here: https://github.com/esc/bloscpack

You may want to know more about BLZ in this blog entry: http://continuum.io/blog/blz-format

In this version, support for Blosc 1.3 has been added, meaning that a new `cname` parameter has been added to the `bparams` class, so that you can select your preferred compressor from 'blosclz', 'lz4', 'lz4hc', 'snappy' and 'zlib'. Also, many bugs have been fixed, providing a much smoother experience.

CAVEAT: The BLZ/bloscpack format is still evolving, so don't rely on forward compatibility of the format, at least until 1.0, when the internal format will be declared frozen.

Resources
-

Visit the main BLZ site repository at: http://github.com/ContinuumIO/blz
Read the online docs at: http://blz.pydata.org/blz-manual/index.html
Home of the Blosc compressor: http://www.blosc.org
User's mail list: blaze-...@continuum.io

Enjoy!

Francesc Alted
Continuum Analytics, Inc.