Hello community,

here is the log from the commit of package python-dask for openSUSE:Factory checked in at 2018-10-11 11:58:21

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-dask (Old)
 and      /work/SRC/openSUSE:Factory/.python-dask.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-dask" Thu Oct 11 11:58:21 2018 rev:10 rq:640983 version:0.19.4 Changes: -------- --- /work/SRC/openSUSE:Factory/python-dask/python-dask.changes 2018-10-09 15:53:29.190331138 +0200 +++ /work/SRC/openSUSE:Factory/.python-dask.new/python-dask.changes 2018-10-11 11:58:23.913798749 +0200 @@ -1,0 +2,23 @@ +Wed Oct 10 01:49:52 UTC 2018 - Arun Persaud <[email protected]> + +- update to version 0.19.4: + * Array + + Implement apply_gufunc(..., axes=..., keepdims=...) (:pr:`3985`) + Markus Gonser + * Bag + + Fix typo in datasets.make_people (:pr:`4069`) Matthew Rocklin + * Dataframe + + Added percentiles options for dask.dataframe.describe method + (:pr:`4067`) Zhenqing Li + + Add DataFrame.partitions accessor similar to Array.blocks + (:pr:`4066`) Matthew Rocklin + * Core + + Pass get functions and Clients through scheduler keyword + (:pr:`4062`) Matthew Rocklin + * Documentation + + Fix Typo on hpc example. (missing = in kwarg). (:pr:`4068`) + Matthias Bussonier + + Extensive copy-editing: (:pr:`4065`), (:pr:`4064`), (:pr:`4063`) + Miguel Farrajota + +------------------------------------------------------------------- Old: ---- dask-0.19.3.tar.gz New: ---- dask-0.19.4.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ python-dask.spec ++++++ --- /var/tmp/diff_new_pack.Vpd5bP/_old 2018-10-11 11:58:24.925797463 +0200 +++ /var/tmp/diff_new_pack.Vpd5bP/_new 2018-10-11 11:58:24.945797438 +0200 @@ -22,7 +22,7 @@ # python(2/3)-distributed has a dependency loop with python(2/3)-dask %bcond_with test_distributed Name: python-dask -Version: 0.19.3 +Version: 0.19.4 Release: 0 Summary: Minimal task scheduling abstraction License: BSD-3-Clause ++++++ dask-0.19.3.tar.gz -> dask-0.19.4.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/PKG-INFO new/dask-0.19.4/PKG-INFO --- old/dask-0.19.3/PKG-INFO 2018-10-05 20:57:35.000000000 +0200 +++ new/dask-0.19.4/PKG-INFO 2018-10-09 21:27:57.000000000 +0200 @@ -1,12 +1,11 @@ -Metadata-Version: 1.2 +Metadata-Version: 2.1 Name: dask -Version: 0.19.3 +Version: 0.19.4 Summary: Parallel PyData with Task Scheduling Home-page: http://github.com/dask/dask/ -Author: Matthew Rocklin -Author-email: [email protected] +Maintainer: Matthew Rocklin +Maintainer-email: [email protected] License: BSD -Description-Content-Type: UNKNOWN Description: Dask ==== @@ -45,3 +44,9 @@ Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.* +Provides-Extra: dataframe +Provides-Extra: array +Provides-Extra: bag +Provides-Extra: distributed +Provides-Extra: delayed +Provides-Extra: complete diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/_version.py new/dask-0.19.4/dask/_version.py --- old/dask-0.19.3/dask/_version.py 2018-10-05 20:57:35.000000000 +0200 +++ new/dask-0.19.4/dask/_version.py 2018-10-09 21:27:57.000000000 +0200 @@ -11,8 +11,8 @@ { "dirty": false, "error": null, - "full-revisionid": "2e98e50a9055cab1a5d04d777f4e59702318a0ca", - "version": "0.19.3" + "full-revisionid": "bbae2d8a03b5b018e019f9fd2b90004fe6b601ac", + "version": "0.19.4" } ''' # END VERSION_JSON diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/array/gufunc.py new/dask-0.19.4/dask/array/gufunc.py --- old/dask-0.19.3/dask/array/gufunc.py 
2018-09-26 23:49:35.000000000 +0200 +++ new/dask-0.19.4/dask/array/gufunc.py 2018-10-09 21:03:29.000000000 +0200 @@ -55,6 +55,100 @@ return ins, outs +def _validate_normalize_axes(axes, axis, keepdims, input_coredimss, output_coredimss): + """ + Validates logic of `axes`/`axis`/`keepdims` arguments and normalize them. + Refer to [1]_ for details + + Arguments + --------- + axes: List of tuples + axis: int + keepdims: bool + input_coredimss: List of Tuple of dims + output_coredimss: List of Tuple of dims + + Returns + ------- + input_axes: List of tuple of int + output_axes: List of tuple of int + + References + ---------- + .. [1] https://docs.scipy.org/doc/numpy/reference/ufuncs.html#optional-keyword-arguments + """ + nin = len(input_coredimss) + nout = 1 if not isinstance(output_coredimss, list) else len(output_coredimss) + + if axes is not None and axis is not None: + raise ValueError("Only one of `axis` or `axes` keyword arguments should be given") + if axes and not isinstance(axes, list): + raise ValueError("`axes` has to be of type list") + + output_coredimss = output_coredimss if nout > 1 else [output_coredimss] + filtered_core_dims = list(filter(len, input_coredimss)) + nr_outputs_with_coredims = len([True for x in output_coredimss if len(x) > 0]) + + if keepdims: + if nr_outputs_with_coredims > 0: + raise ValueError("`keepdims` can only be used for scalar outputs") + output_coredimss = len(output_coredimss) * [filtered_core_dims[0]] + + core_dims = input_coredimss + output_coredimss + if axis is not None: + if not isinstance(axis, int): + raise ValueError("`axis` argument has to be an integer value") + if filtered_core_dims: + cd0 = filtered_core_dims[0] + if len(cd0) != 1: + raise ValueError("`axis` can be used only, if one core dimension is present") + for cd in filtered_core_dims: + if cd0 != cd: + raise ValueError("To use `axis`, all core dimensions have to be equal") + + # Expand dafaults or axis + if axes is None: + if axis is not None: + axes = [(axis,) if cd else tuple() for cd in core_dims] + else: + axes = [tuple(range(-len(icd), 0)) for icd in core_dims] + elif not isinstance(axes, list): + raise ValueError("`axes` argument has to be a list") + axes = [(a,) if isinstance(a, int) else a for a in axes] + + if (((nr_outputs_with_coredims == 0) and (nin != len(axes)) and (nin + nout != len(axes))) or + ((nr_outputs_with_coredims > 0) and (nin + nout != len(axes)))): + raise ValueError("The number of `axes` entries is not equal the number of input and output arguments") + + # Treat outputs + output_axes = axes[nin:] + output_axes = output_axes if output_axes else [tuple(range(-len(ocd), 0)) for ocd in output_coredimss] + input_axes = axes[:nin] + + # Assert we have as many axes as output core dimensions + for idx, (iax, icd) in enumerate(zip(input_axes, input_coredimss)): + if len(iax) != len(icd): + raise ValueError("The number of `axes` entries for argument #{} is not equal " + "the number of respective input core dimensions in signature" + .format(idx)) + if not keepdims: + for idx, (oax, ocd) in enumerate(zip(output_axes, output_coredimss)): + if len(oax) != len(ocd): + raise ValueError("The number of `axes` entries for argument #{} is not equal " + "the number of respective output core dimensions in signature" + .format(idx)) + else: + if input_coredimss: + icd0 = input_coredimss[0] + for icd in input_coredimss: + if icd0 != icd: + raise ValueError("To use `keepdims`, all core dimensions have to be equal") + iax0 = input_axes[0] + output_axes = [iax0 for _ in 
output_coredimss] + + return input_axes, output_axes + + def apply_gufunc(func, signature, *args, **kwargs): """ Apply a generalized ufunc or similar python function to arrays. @@ -83,6 +177,30 @@ According to the specification of numpy.gufunc signature [2]_ *args : numeric Input arrays or scalars to the callable function. + axes: List of tuples, optional, keyword only + A list of tuples with indices of axes a generalized ufunc should operate on. + For instance, for a signature of ``"(i,j),(j,k)->(i,k)"`` appropriate for + matrix multiplication, the base elements are two-dimensional matrices + and these are taken to be stored in the two last axes of each argument. The + corresponding axes keyword would be ``[(-2, -1), (-2, -1), (-2, -1)]``. + For simplicity, for generalized ufuncs that operate on 1-dimensional arrays + (vectors), a single integer is accepted instead of a single-element tuple, + and for generalized ufuncs for which all outputs are scalars, the output + tuples can be omitted. + axis: int, optional, keyword only + A single axis over which a generalized ufunc should operate. This is a short-cut + for ufuncs that operate over a single, shared core dimension, equivalent to passing + in axes with entries of (axis,) for each single-core-dimension argument and ``()`` for + all others. For instance, for a signature ``"(i),(i)->()"``, it is equivalent to passing + in ``axes=[(axis,), (axis,), ()]``. + keepdims: bool, optional, keyword only + If this is set to True, axes which are reduced over will be left in the result as + a dimension with size one, so that the result will broadcast correctly against the + inputs. This option can only be used for generalized ufuncs that operate on inputs + that all have the same number of core dimensions and with outputs that have no core + dimensions , i.e., with signatures like ``"(i),(i)->()"`` or ``"(m,m)->()"``. + If used, the location of the dimensions in the output can be controlled with axes + and axis. output_dtypes : Optional, dtype or list of dtypes, keyword only Valid numpy dtype specification or list thereof. If not given, a call of ``func`` with a small set of data @@ -113,7 +231,7 @@ >>> def stats(x): ... return np.mean(x, axis=-1), np.std(x, axis=-1) >>> a = da.random.normal(size=(10,20,30), chunks=(5, 10, 30)) - >>> mean, std = da.apply_gufunc(stats, "(i)->(),()", a, output_dtypes=2*(a.dtype,)) + >>> mean, std = da.apply_gufunc(stats, "(i)->(),()", a) >>> mean.compute().shape (10, 20) @@ -122,7 +240,7 @@ ... return np.einsum("i,j->ij", x, y) >>> a = da.random.normal(size=( 20,30), chunks=(10, 30)) >>> b = da.random.normal(size=(10, 1,40), chunks=(5, 1, 40)) - >>> c = da.apply_gufunc(outer_product, "(i),(j)->(i,j)", a, b, output_dtypes=a.dtype, vectorize=True) + >>> c = da.apply_gufunc(outer_product, "(i),(j)->(i,j)", a, b, vectorize=True) >>> c.compute().shape (10, 20, 30, 40) @@ -131,6 +249,9 @@ .. [1] http://docs.scipy.org/doc/numpy/reference/ufuncs.html .. 
[2] http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html """ + axes = kwargs.pop("axes", None) + axis = kwargs.pop("axis", None) + keepdims = kwargs.pop("keepdims", False) output_dtypes = kwargs.pop("output_dtypes", None) output_sizes = kwargs.pop("output_sizes", None) vectorize = kwargs.pop("vectorize", None) @@ -140,14 +261,18 @@ ## Signature if not isinstance(signature, str): raise TypeError('`signature` has to be of type string') - core_input_dimss, core_output_dimss = _parse_gufunc_signature(signature) + input_coredimss, output_coredimss = _parse_gufunc_signature(signature) ## Determine nout: nout = None for functions of one direct return; nout = int for return tuples - nout = None if not isinstance(core_output_dimss, list) else len(core_output_dimss) + nout = None if not isinstance(output_coredimss, list) else len(output_coredimss) ## Determine and handle output_dtypes if output_dtypes is None: - output_dtypes = apply_infer_dtype(func, args, kwargs, "apply_gufunc", "output_dtypes", nout) + if vectorize: + tempfunc = np.vectorize(func, signature=signature) + else: + tempfunc = func + output_dtypes = apply_infer_dtype(tempfunc, args, kwargs, "apply_gufunc", "output_dtypes", nout) if isinstance(output_dtypes, (tuple, list)): if nout is None: @@ -171,26 +296,41 @@ if output_sizes is None: output_sizes = {} + ## Axes + input_axes, output_axes = _validate_normalize_axes(axes, axis, keepdims, input_coredimss, output_coredimss) + # Main code: ## Cast all input arrays to dask args = [asarray(a) for a in args] - if len(core_input_dimss) != len(args): + if len(input_coredimss) != len(args): ValueError("According to `signature`, `func` requires %d arguments, but %s given" - % (len(core_output_dimss), len(args))) + % (len(input_coredimss), len(args))) + + ## Axes: transpose input arguments + transposed_args = [] + for arg, iax, input_coredims in zip(args, input_axes, input_coredimss): + shape = arg.shape + iax = tuple(a if a < 0 else a - len(shape) for a in iax) + tidc = tuple(i for i in range(-len(shape) + 0, 0) if i not in iax) + iax + + transposed_arg = arg.transpose(tidc) + transposed_args.append(transposed_arg) + args = transposed_args ## Assess input args for loop dims input_shapes = [a.shape for a in args] input_chunkss = [a.chunks for a in args] - num_loopdims = [len(s) - len(cd) for s, cd in zip(input_shapes, core_input_dimss)] + num_loopdims = [len(s) - len(cd) for s, cd in zip(input_shapes, input_coredimss)] max_loopdims = max(num_loopdims) if num_loopdims else None - _core_input_shapes = [dict(zip(cid, s[n:])) for s, n, cid in zip(input_shapes, num_loopdims, core_input_dimss)] - core_shapes = merge(output_sizes, *_core_input_shapes) + core_input_shapes = [dict(zip(icd, s[n:])) for s, n, icd in zip(input_shapes, num_loopdims, input_coredimss)] + core_shapes = merge(*core_input_shapes) + core_shapes.update(output_sizes) loop_input_dimss = [tuple("__loopdim%d__" % d for d in range(max_loopdims - n, max_loopdims)) for n in num_loopdims] - input_dimss = [l + c for l, c in zip(loop_input_dimss, core_input_dimss)] + input_dimss = [l + c for l, c in zip(loop_input_dimss, input_coredimss)] - loop_output_dims = max(loop_input_dimss, key=len) if loop_input_dimss else set() + loop_output_dims = max(loop_input_dimss, key=len) if loop_input_dimss else tuple() ## Assess input args for same size and chunk sizes ### Collect sizes and chunksizes of all dims in all arrays @@ -198,12 +338,12 @@ chunksizess = {} for dims, shape, chunksizes in zip(input_dimss, input_shapes, 
input_chunkss): for dim, size, chunksize in zip(dims, shape, chunksizes): - _dimsizes = dimsizess.get(dim, []) - _dimsizes.append(size) - dimsizess[dim] = _dimsizes - _chunksizes = chunksizess.get(dim, []) - _chunksizes.append(chunksize) - chunksizess[dim] = _chunksizes + dimsizes = dimsizess.get(dim, []) + dimsizes.append(size) + dimsizess[dim] = dimsizes + chunksizes_ = chunksizess.get(dim, []) + chunksizes_.append(chunksize) + chunksizess[dim] = chunksizes_ ### Assert correct partitioning, for case: for dim, sizes in dimsizess.items(): #### Check that the arrays have same length for same dimensions or dimension `1` @@ -237,19 +377,18 @@ loop_output_chunks = tmp.chunks dsk = tmp.__dask_graph__() keys = list(flatten(tmp.__dask_keys__())) - _anykey = keys[0] - name, token = _anykey[0].split('-') + name, token = keys[0][0].split('-') ### *) Treat direct output if nout is None: - core_output_dimss = [core_output_dimss] + output_coredimss = [output_coredimss] output_dtypes = [output_dtypes] ## Split output leaf_arrs = [] - for i, cod, odt in zip(count(0), core_output_dimss, output_dtypes): - core_output_shape = tuple(core_shapes[d] for d in cod) - core_chunkinds = len(cod) * (0,) + for i, ocd, odt, oax in zip(count(0), output_coredimss, output_dtypes, output_axes): + core_output_shape = tuple(core_shapes[d] for d in ocd) + core_chunkinds = len(ocd) * (0,) output_shape = loop_output_shape + core_output_shape output_chunks = loop_output_chunks + core_output_shape leaf_name = "%s_%d-%s" % (name, i, token) @@ -259,6 +398,21 @@ chunks=output_chunks, shape=output_shape, dtype=odt) + + ### Axes: + if keepdims: + slices = len(leaf_arr.shape) * (slice(None),) + len(oax) * (np.newaxis,) + leaf_arr = leaf_arr[slices] + + tidcs = [None] * len(leaf_arr.shape) + for i, oa in zip(range(-len(oax), 0), oax): + tidcs[oa] = i + j = 0 + for i in range(len(tidcs)): + if tidcs[i] is None: + tidcs[i] = j + j += 1 + leaf_arr = leaf_arr.transpose(tidcs) leaf_arrs.append(leaf_arr) return leaf_arrs if nout else leaf_arrs[0] # Undo *) from above @@ -281,8 +435,35 @@ signature : String, keyword only Specifies what core dimensions are consumed and produced by ``func``. According to the specification of numpy.gufunc signature [2]_ - output_dtypes : dtype or list of dtypes, keyword only - dtype or list of output dtypes. + axes: List of tuples, optional, keyword only + A list of tuples with indices of axes a generalized ufunc should operate on. + For instance, for a signature of ``"(i,j),(j,k)->(i,k)"`` appropriate for + matrix multiplication, the base elements are two-dimensional matrices + and these are taken to be stored in the two last axes of each argument. The + corresponding axes keyword would be ``[(-2, -1), (-2, -1), (-2, -1)]``. + For simplicity, for generalized ufuncs that operate on 1-dimensional arrays + (vectors), a single integer is accepted instead of a single-element tuple, + and for generalized ufuncs for which all outputs are scalars, the output + tuples can be omitted. + axis: int, optional, keyword only + A single axis over which a generalized ufunc should operate. This is a short-cut + for ufuncs that operate over a single, shared core dimension, equivalent to passing + in axes with entries of (axis,) for each single-core-dimension argument and ``()`` for + all others. For instance, for a signature ``"(i),(i)->()"``, it is equivalent to passing + in ``axes=[(axis,), (axis,), ()]``. 
+ keepdims: bool, optional, keyword only + If this is set to True, axes which are reduced over will be left in the result as + a dimension with size one, so that the result will broadcast correctly against the + inputs. This option can only be used for generalized ufuncs that operate on inputs + that all have the same number of core dimensions and with outputs that have no core + dimensions , i.e., with signatures like ``"(i),(i)->()"`` or ``"(m,m)->()"``. + If used, the location of the dimensions in the output can be controlled with axes + and axis. + output_dtypes : Optional, dtype or list of dtypes, keyword only + Valid numpy dtype specification or list thereof. + If not given, a call of ``func`` with a small set of data + is performed in order to try to automatically determine the + output dtypes. output_sizes : dict, optional, keyword only Optional mapping from dimension names to sizes for outputs. Only used if new core dimensions (not found on inputs) appear on outputs. @@ -330,6 +511,9 @@ self.pyfunc = pyfunc self.signature = kwargs.pop("signature", None) self.vectorize = kwargs.pop("vectorize", False) + self.axes = kwargs.pop("axes", None) + self.axis = kwargs.pop("axis", None) + self.keepdims = kwargs.pop("keepdims", False) self.output_sizes = kwargs.pop("output_sizes", None) self.output_dtypes = kwargs.pop("output_dtypes", None) self.allow_rechunk = kwargs.pop("allow_rechunk", False) @@ -359,6 +543,9 @@ self.signature, *args, vectorize=self.vectorize, + axes=self.axes, + axis=self.axis, + keepdims=self.keepdims, output_sizes=self.output_sizes, output_dtypes=self.output_dtypes, allow_rechunk=self.allow_rechunk or kwargs.pop("allow_rechunk", False), @@ -374,8 +561,35 @@ signature : String Specifies what core dimensions are consumed and produced by ``func``. According to the specification of numpy.gufunc signature [2]_ - output_dtypes : dtype or list of dtypes, keyword only - dtype or list of output dtypes. + axes: List of tuples, optional, keyword only + A list of tuples with indices of axes a generalized ufunc should operate on. + For instance, for a signature of ``"(i,j),(j,k)->(i,k)"`` appropriate for + matrix multiplication, the base elements are two-dimensional matrices + and these are taken to be stored in the two last axes of each argument. The + corresponding axes keyword would be ``[(-2, -1), (-2, -1), (-2, -1)]``. + For simplicity, for generalized ufuncs that operate on 1-dimensional arrays + (vectors), a single integer is accepted instead of a single-element tuple, + and for generalized ufuncs for which all outputs are scalars, the output + tuples can be omitted. + axis: int, optional, keyword only + A single axis over which a generalized ufunc should operate. This is a short-cut + for ufuncs that operate over a single, shared core dimension, equivalent to passing + in axes with entries of (axis,) for each single-core-dimension argument and ``()`` for + all others. For instance, for a signature ``"(i),(i)->()"``, it is equivalent to passing + in ``axes=[(axis,), (axis,), ()]``. + keepdims: bool, optional, keyword only + If this is set to True, axes which are reduced over will be left in the result as + a dimension with size one, so that the result will broadcast correctly against the + inputs. This option can only be used for generalized ufuncs that operate on inputs + that all have the same number of core dimensions and with outputs that have no core + dimensions , i.e., with signatures like ``"(i),(i)->()"`` or ``"(m,m)->()"``. 
+ If used, the location of the dimensions in the output can be controlled with axes + and axis. + output_dtypes : Optional, dtype or list of dtypes, keyword only + Valid numpy dtype specification or list thereof. + If not given, a call of ``func`` with a small set of data + is performed in order to try to automatically determine the + output dtypes. output_sizes : dict, optional, keyword only Optional mapping from dimension names to sizes for outputs. Only used if new core dimensions (not found on inputs) appear on outputs. @@ -418,7 +632,7 @@ .. [1] http://docs.scipy.org/doc/numpy/reference/ufuncs.html .. [2] http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html """ - _allowedkeys = {"vectorize", "output_sizes", "output_dtypes", "allow_rechunk"} + _allowedkeys = {"vectorize", "axes", "axis", "keepdims", "output_sizes", "output_dtypes", "allow_rechunk"} if set(_allowedkeys).issubset(kwargs.keys()): raise TypeError("Unsupported keyword argument(s) provided") diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/array/tests/test_gufunc.py new/dask-0.19.4/dask/array/tests/test_gufunc.py --- old/dask-0.19.3/dask/array/tests/test_gufunc.py 2018-09-26 23:49:35.000000000 +0200 +++ new/dask-0.19.4/dask/array/tests/test_gufunc.py 2018-10-09 21:03:29.000000000 +0200 @@ -8,7 +8,7 @@ import numpy as np from dask.array.core import Array -from dask.array.gufunc import _parse_gufunc_signature, apply_gufunc,gufunc, as_gufunc +from dask.array.gufunc import _parse_gufunc_signature, _validate_normalize_axes, apply_gufunc,gufunc, as_gufunc # Copied from `numpy.lib.test_test_function_base.py`: @@ -34,6 +34,80 @@ _parse_gufunc_signature('(x)->(x),') +def test_apply_gufunc_axes_input_validation_01(): + def foo(x): + return np.mean(x, axis=-1) + + a = da.random.normal(size=(20, 30), chunks=30) + + with pytest.raises(ValueError): + apply_gufunc(foo, "(i)->()", a, axes=0) + + apply_gufunc(foo, "(i)->()", a, axes=[0]) + apply_gufunc(foo, "(i)->()", a, axes=[(0,)]) + apply_gufunc(foo, "(i)->()", a, axes=[0, tuple()]) + apply_gufunc(foo, "(i)->()", a, axes=[(0,), tuple()]) + + with pytest.raises(ValueError): + apply_gufunc(foo, "(i)->()", a, axes=[(0, 1)]) + + with pytest.raises(ValueError): + apply_gufunc(foo, "(i)->()", a, axes=[0, 0]) + + +def test__validate_normalize_axes_01(): + with pytest.raises(ValueError): + _validate_normalize_axes([(1, 0)], None, False, [('i', 'j')], ('j',)) + + with pytest.raises(ValueError): + _validate_normalize_axes([0, 0], None, False, [('i', 'j')], ('j',)) + + with pytest.raises(ValueError): + _validate_normalize_axes([(0,), 0], None, False, [('i', 'j')], ('j',)) + + i, o = _validate_normalize_axes([(1, 0), 0], None, False, [('i', 'j')], ('j',)) + assert i == [(1, 0)] + assert o == [(0,)] + + +def test__validate_normalize_axes_02(): + i, o = _validate_normalize_axes(None, 0, False, [('i', ), ('i', )], ()) + assert i == [(0,), (0,)] + assert o == [()] + + i, o = _validate_normalize_axes(None, 0, False, [('i',)], ('i',)) + assert i == [(0,)] + assert o == [(0,)] + + i, o = _validate_normalize_axes(None, 0, True, [('i',), ('i',)], ()) + assert i == [(0,), (0,)] + assert o == [(0,)] + + with pytest.raises(ValueError): + _validate_normalize_axes(None, (0,), False, [('i',), ('i',)], ()) + + with pytest.raises(ValueError): + _validate_normalize_axes(None, 0, False, [('i',), ('j',)], ()) + + with pytest.raises(ValueError): + _validate_normalize_axes(None, 0, False, [('i',), ('j',)], ('j',)) + + +def 
test__validate_normalize_axes_03(): + i, o = _validate_normalize_axes(None, 0, True, [('i',)], ()) + assert i == [(0,)] + assert o == [(0,)] + + with pytest.raises(ValueError): + _validate_normalize_axes(None, 0, True, [('i',)], ('i',)) + + with pytest.raises(ValueError): + _validate_normalize_axes([(0, 1), (0, 1)], None, True, [('i', 'j')], ('i', 'j')) + + with pytest.raises(ValueError): + _validate_normalize_axes([(0,), (0,)], None, True, [('i',), ('j',)], ()) + + def test_apply_gufunc_01(): def stats(x): return np.mean(x, axis=-1), np.std(x, axis=-1) @@ -223,7 +297,7 @@ def foo(x): return np.mean(x, axis=-1) - gufoo = gufunc(foo, signature="(i)->()", output_dtypes=float, vectorize=True) + gufoo = gufunc(foo, signature="(i)->()", axis=-1, keepdims=False, output_dtypes=float, vectorize=True) y = gufoo(x) valy = y.compute() @@ -236,7 +310,7 @@ def test_as_gufunc(): x = da.random.normal(size=(10, 5), chunks=(2, 5)) - @as_gufunc("(i)->()", output_dtypes=float, vectorize=True) + @as_gufunc("(i)->()", axis=-1, keepdims=False, output_dtypes=float, vectorize=True) def foo(x): return np.mean(x, axis=-1) @@ -336,3 +410,149 @@ assert_eq(z0, dx + dy) assert_eq(z1, dx - dy) + + [email protected]('keepdims', [False, True]) +def test_apply_gufunc_axis_01(keepdims): + def mymedian(x): + return np.median(x, axis=-1) + + a = np.random.randn(10, 5) + da_ = da.from_array(a, chunks=2) + + m = np.median(a, axis=0, keepdims=keepdims) + dm = apply_gufunc(mymedian, "(i)->()", da_, axis=0, keepdims=keepdims, allow_rechunk=True) + assert_eq(m, dm) + + +def test_apply_gufunc_axis_02(): + def myfft(x): + return np.fft.fft(x, axis=-1) + + a = np.random.randn(10, 5) + da_ = da.from_array(a, chunks=2) + + m = np.fft.fft(a, axis=0) + dm = apply_gufunc(myfft, "(i)->(i)", da_, axis=0, allow_rechunk=True) + assert_eq(m, dm) + + +def test_apply_gufunc_axis_02b(): + def myfilter(x, cn=10, axis=-1): + y = np.fft.fft(x, axis=axis) + y[cn:-cn] = 0 + nx = np.fft.ifft(y, axis=axis) + return np.real(nx) + + a = np.random.randn(3, 6, 4) + da_ = da.from_array(a, chunks=2) + + m = myfilter(a, axis=1) + dm = apply_gufunc(myfilter, "(i)->(i)", da_, axis=1, allow_rechunk=True) + assert_eq(m, dm) + + +def test_apply_gufunc_axis_03(): + def mydiff(x): + return np.diff(x, axis=-1) + + a = np.random.randn(3, 6, 4) + da_ = da.from_array(a, chunks=2) + + m = np.diff(a, axis=1) + dm = apply_gufunc(mydiff, "(i)->(i)", da_, axis=1, output_sizes={'i': 5}, allow_rechunk=True) + assert_eq(m, dm) + + [email protected]('axis', [-2, -1, None]) +def test_apply_gufunc_axis_keepdims(axis): + def mymedian(x): + return np.median(x, axis=-1) + + a = np.random.randn(10, 5) + da_ = da.from_array(a, chunks=2) + + m = np.median(a, axis=-1 if not axis else axis, keepdims=True) + dm = apply_gufunc(mymedian, "(i)->()", da_, axis=axis, keepdims=True, allow_rechunk=True) + assert_eq(m, dm) + + [email protected]('axes', [[0, 1], [(0,), (1,)]]) +def test_apply_gufunc_axes_01(axes): + def mystats(x, y): + return np.std(x, axis=-1) * np.mean(y, axis=-1) + + a = np.random.randn(10, 5) + b = np.random.randn(5, 6) + da_ = da.from_array(a, chunks=2) + db_ = da.from_array(b, chunks=2) + + m = np.std(a, axis=0) * np.mean(b, axis=1) + dm = apply_gufunc(mystats, "(i),(j)->()", da_, db_, axes=axes, allow_rechunk=True) + assert_eq(m, dm) + + +def test_apply_gufunc_axes_02(): + def matmul(x, y): + return np.einsum("...ij,...jk->...ik", x, y) + + a = np.random.randn(3, 2, 1) + b = np.random.randn(3, 7, 5) + + da_ = da.from_array(a, chunks=2) + db = da.from_array(b, chunks=3) + + m 
= np.einsum("jiu,juk->uik", a, b) + dm = apply_gufunc(matmul, "(i,j),(j,k)->(i,k)", da_, db, axes=[(1, 0), (0, -1), (-2, -1)], allow_rechunk=True) + assert_eq(m, dm) + + [email protected](LooseVersion(np.__version__) < '1.12.0', + reason="`np.vectorize(..., signature=...)` not supported yet") +def test_apply_gufunc_axes_two_kept_coredims(): + a = da.random.normal(size=( 20, 30), chunks=(10, 30)) + b = da.random.normal(size=(10, 1, 40), chunks=(5, 1, 40)) + + def outer_product(x, y): + return np.einsum("i,j->ij", x, y) + + c = apply_gufunc(outer_product, "(i),(j)->(i,j)", a, b, vectorize=True) + assert c.compute().shape == (10, 20, 30, 40) + + [email protected](LooseVersion(np.__version__) < '1.12.0', + reason="Additional kwargs for this version not supported") +def test_apply_gufunc_via_numba_01(): + numba = pytest.importorskip('numba') + + @numba.guvectorize([(numba.float64[:], numba.float64[:], numba.float64[:])], '(n),(n)->(n)') + def g(x, y, res): + for i in range(x.shape[0]): + res[i] = x[i] + y[i] + + a = da.random.normal(size=(20, 30), chunks=30) + b = da.random.normal(size=(20, 30), chunks=30) + + x = a + b + y = g(a, b, axis=0) + + assert_eq(x, y) + + [email protected](LooseVersion(np.__version__) < '1.12.0', + reason="Additional kwargs for this version not supported") +def test_apply_gufunc_via_numba_02(): + numba = pytest.importorskip('numba') + + @numba.guvectorize([(numba.float64[:], numba.float64[:])], '(n)->()') + def mysum(x, res): + res[0] = 0. + for i in range(x.shape[0]): + res[0] += x[i] + + a = da.random.normal(size=(20, 30), chunks=5) + + x = a.sum(axis=0, keepdims=True) + y = mysum(a, axis=0, keepdims=True, allow_rechunk=True) + + assert_eq(x, y) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/base.py new/dask-0.19.4/dask/base.py --- old/dask-0.19.3/dask/base.py 2018-10-05 20:48:56.000000000 +0200 +++ new/dask-0.19.4/dask/base.py 2018-10-09 18:48:12.000000000 +0200 @@ -825,10 +825,13 @@ else: if get in named_schedulers.values(): _warnned_on_get[0] = True - warnings.warn("The get= keyword has been deprecated. " - "Please use the scheduler= keyword instead with the " - "name of the desired scheduler " - "like 'threads' or 'processes'") + warnings.warn( + "The get= keyword has been deprecated. " + "Please use the scheduler= keyword instead with the name of " + "the desired scheduler like 'threads' or 'processes'\n" + " x.compute(scheduler='threads') \n" + "or with a function that takes the graph and keys\n" + " x.compute(scheduler=my_scheduler_function)") def get_scheduler(get=None, scheduler=None, collections=None, cls=None): @@ -851,7 +854,11 @@ return get if scheduler is not None: - if scheduler.lower() in named_schedulers: + if callable(scheduler): + return scheduler + elif "Client" in type(scheduler).__name__ and hasattr(scheduler, 'get'): + return scheduler.get + elif scheduler.lower() in named_schedulers: return named_schedulers[scheduler.lower()] elif scheduler.lower() in ('dask.distributed', 'distributed'): from distributed.worker import get_client diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/dataframe/core.py new/dask-0.19.4/dask/dataframe/core.py --- old/dask-0.19.3/dask/dataframe/core.py 2018-10-05 20:48:56.000000000 +0200 +++ new/dask-0.19.4/dask/dataframe/core.py 2018-10-09 21:03:29.000000000 +0200 @@ -20,7 +20,7 @@ from .. import array as da from .. 
import core -from ..utils import partial_by_order, Dispatch +from ..utils import partial_by_order, Dispatch, IndexCallable from .. import threaded from ..compatibility import apply, operator_div, bind_method, string_types, Iterator from ..context import globalmethod @@ -896,10 +896,47 @@ """ Purely label-location based indexer for selection by label. >>> df.loc["b"] # doctest: +SKIP - >>> df.loc["b":"d"] # doctest: +SKIP""" + >>> df.loc["b":"d"] # doctest: +SKIP + """ from .indexing import _LocIndexer return _LocIndexer(self) + def _partitions(self, index): + if not isinstance(index, tuple): + index = (index,) + from ..array.slicing import normalize_index + index = normalize_index(index, (self.npartitions,)) + index = tuple(slice(k, k + 1) if isinstance(k, Number) else k + for k in index) + name = 'blocks-' + tokenize(self, index) + new_keys = np.array(self.__dask_keys__(), dtype=object)[index].tolist() + + divisions = [self.divisions[i] for _, i in new_keys] + [self.divisions[new_keys[-1][1] + 1]] + dsk = {(name, i): tuple(key) for i, key in enumerate(new_keys)} + + return new_dd_object(merge(dsk, self.dask), name, self._meta, divisions) + + @property + def partitions(self): + """ Slice dataframe by partitions + + This allows partitionwise slicing of a Dask Dataframe. You can perform normal + Numpy-style slicing but now rather than slice elements of the array you + slice along partitions so, for example, ``df.partitions[:5]`` produces a new + Dask Dataframe of the first five partitions. + + Examples + -------- + >>> df.partitions[0] # doctest: +SKIP + >>> df.partitions[:3] # doctest: +SKIP + >>> df.partitions[::10] # doctest: +SKIP + + Returns + ------- + A Dask DataFrame + """ + return IndexCallable(self._partitions) + # Note: iloc is implemented only on DataFrame def repartition(self, divisions=None, npartitions=None, freq=None, force=False): @@ -1458,19 +1495,22 @@ return DataFrame(dask, keyname, meta, quantiles[0].divisions) @derived_from(pd.DataFrame) - def describe(self, split_every=False): + def describe(self, split_every=False, percentiles=None): # currently, only numeric describe is supported num = self._get_numeric_data() if self.ndim == 2 and len(num.columns) == 0: raise ValueError("DataFrame contains only non-numeric data.") elif self.ndim == 1 and self.dtype == 'object': raise ValueError("Cannot compute ``describe`` on object dtype.") - + if percentiles is None: + percentiles = [0.25, 0.5, 0.75] + else: + percentiles = list(set(sorted(percentiles + [0.5]))) stats = [num.count(split_every=split_every), num.mean(split_every=split_every), num.std(split_every=split_every), num.min(split_every=split_every), - num.quantile([0.25, 0.5, 0.75]), + num.quantile(percentiles), num.max(split_every=split_every)] stats_names = [(s._name, 0) for s in stats] diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/dataframe/methods.py new/dask-0.19.4/dask/dataframe/methods.py --- old/dask-0.19.3/dask/dataframe/methods.py 2018-10-05 20:48:56.000000000 +0200 +++ new/dask-0.19.4/dask/dataframe/methods.py 2018-10-09 21:03:29.000000000 +0200 @@ -121,7 +121,7 @@ typ = pd.DataFrame if isinstance(count, pd.Series) else pd.Series part1 = typ([count, mean, std, min], index=['count', 'mean', 'std', 'min']) - q.index = ['25%', '50%', '75%'] + q.index = ['{0:g}%'.format(l * 100) for l in q.index.tolist()] part3 = typ([max], index=['max']) return pd.concat([part1, q, part3]) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' 
'--exclude=.svnignore' old/dask-0.19.3/dask/dataframe/tests/test_dataframe.py new/dask-0.19.4/dask/dataframe/tests/test_dataframe.py --- old/dask-0.19.3/dask/dataframe/tests/test_dataframe.py 2018-10-04 23:25:10.000000000 +0200 +++ new/dask-0.19.4/dask/dataframe/tests/test_dataframe.py 2018-10-09 21:03:29.000000000 +0200 @@ -288,6 +288,9 @@ assert_eq(s.describe(), ds.describe()) assert_eq(df.describe(), ddf.describe()) + test_quantiles = [0.25, 0.75] + assert_eq(df.describe(percentiles=test_quantiles), + ddf.describe(percentiles=test_quantiles)) assert_eq(s.describe(), ds.describe(split_every=2)) assert_eq(df.describe(), ddf.describe(split_every=2)) @@ -3279,3 +3282,17 @@ a = ddf.map_partitions(lambda x, y: x, big) assert any(big is v for v in a.dask.values()) + + +def test_partitions_indexer(): + df = pd.DataFrame({'x': range(10)}) + ddf = dd.from_pandas(df, npartitions=5) + + assert_eq(ddf.partitions[0], ddf.get_partition(0)) + assert_eq(ddf.partitions[3], ddf.get_partition(3)) + assert_eq(ddf.partitions[-1], ddf.get_partition(4)) + + assert ddf.partitions[:3].npartitions == 3 + assert ddf.x.partitions[:3].npartitions == 3 + + assert ddf.x.partitions[::2].compute().tolist() == [0, 1, 4, 5, 8, 9] diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/datasets.py new/dask-0.19.4/dask/datasets.py --- old/dask-0.19.3/dask/datasets.py 2018-10-05 20:48:54.000000000 +0200 +++ new/dask-0.19.4/dask/datasets.py 2018-10-09 17:02:36.000000000 +0200 @@ -135,8 +135,8 @@ 'telephone': field('person.telephone'), 'address': {'address': field('address.address'), 'city': field('address.city')}, - 'credt-card': {'number': field('payment.credit_card_number'), - 'expiration-date': field('payment.credit_card_expiration_date')}, + 'credit-card': {'number': field('payment.credit_card_number'), + 'expiration-date': field('payment.credit_card_expiration_date')}, } return _make_mimesis({'locale': locale}, schema, npartitions, records_per_partition, seed) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/tests/test_base.py new/dask-0.19.4/dask/tests/test_base.py --- old/dask-0.19.3/dask/tests/test_base.py 2018-10-05 20:48:54.000000000 +0200 +++ new/dask-0.19.4/dask/tests/test_base.py 2018-10-09 18:48:12.000000000 +0200 @@ -811,7 +811,7 @@ assert dsk == dict(y.dask) # but they aren't return dask.get(dsk, keys) - with dask.config.set(array_optimize=None, get=my_get): + with dask.config.set(array_optimize=None, scheduler=my_get): y.compute() @@ -856,3 +856,14 @@ with dask.config.set(scheduler='threads'): assert get_scheduler(scheduler='threads') is dask.threaded.get assert get_scheduler() is None + + +def test_callable_scheduler(): + called = [False] + + def get(dsk, keys, *args, **kwargs): + called[0] = True + return dask.get(dsk, keys) + + assert delayed(lambda: 1)().compute(scheduler=get) == 1 + assert called[0] diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask/tests/test_distributed.py new/dask-0.19.4/dask/tests/test_distributed.py --- old/dask-0.19.3/dask/tests/test_distributed.py 2018-09-19 13:54:36.000000000 +0200 +++ new/dask-0.19.4/dask/tests/test_distributed.py 2018-10-09 18:48:12.000000000 +0200 @@ -176,3 +176,11 @@ a = da.ones((3, 3), chunks=c) z = zarr.zeros_like(a, chunks=c) a.to_zarr(z) + + +def test_scheduler_equals_client(loop): + with cluster() as (s, [a, b]): + with Client(s['address'], loop=loop) as client: + x = delayed(lambda: 
1)() + assert x.compute(scheduler=client) == 1 + assert client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.story(x.key)) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask.egg-info/PKG-INFO new/dask-0.19.4/dask.egg-info/PKG-INFO --- old/dask-0.19.3/dask.egg-info/PKG-INFO 2018-10-05 20:57:35.000000000 +0200 +++ new/dask-0.19.4/dask.egg-info/PKG-INFO 2018-10-09 21:27:57.000000000 +0200 @@ -1,12 +1,11 @@ -Metadata-Version: 1.2 +Metadata-Version: 2.1 Name: dask -Version: 0.19.3 +Version: 0.19.4 Summary: Parallel PyData with Task Scheduling Home-page: http://github.com/dask/dask/ -Author: Matthew Rocklin -Author-email: [email protected] +Maintainer: Matthew Rocklin +Maintainer-email: [email protected] License: BSD -Description-Content-Type: UNKNOWN Description: Dask ==== @@ -45,3 +44,9 @@ Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.* +Provides-Extra: dataframe +Provides-Extra: array +Provides-Extra: bag +Provides-Extra: distributed +Provides-Extra: delayed +Provides-Extra: complete diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/dask.egg-info/SOURCES.txt new/dask-0.19.4/dask.egg-info/SOURCES.txt --- old/dask-0.19.3/dask.egg-info/SOURCES.txt 2018-10-05 20:57:35.000000000 +0200 +++ new/dask-0.19.4/dask.egg-info/SOURCES.txt 2018-10-09 21:27:57.000000000 +0200 @@ -252,7 +252,6 @@ docs/source/index.rst docs/source/install.rst docs/source/logos.rst -docs/source/machine-learning.rst docs/source/optimize.rst docs/source/presentations.rst docs/source/remote-data-services.rst diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/api.rst new/dask-0.19.4/docs/source/api.rst --- old/dask-0.19.3/docs/source/api.rst 2018-09-30 16:48:25.000000000 +0200 +++ new/dask-0.19.4/docs/source/api.rst 2018-10-09 15:05:48.000000000 +0200 @@ -5,7 +5,7 @@ - The :doc:`Dask Array API <array-api>` follows the Numpy API - The :doc:`Dask Dataframe API <dataframe-api>` follows the Pandas API -- The `Dask-ML API <https://ml.dask.org/en/latest/modules/api.html>`_ follows the Scikit-Learn API and other related machine learning libraries +- The `Dask-ML API <https://ml.dask.org/modules/api.html>`_ follows the Scikit-Learn API and other related machine learning libraries - The :doc:`Dask Bag API <bag-api>` follows the map/filter/groupby/reduce API common in PySpark, PyToolz, and the Python standard library - The :doc:`Dask Delayed API <delayed-api>` wraps general Python code - The :doc:`Real-time Futures API <futures>` follows the `concurrent.futures <https://docs.python.org/3/library/concurrent.futures.html>`_ API from the standard library. 
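
A side note for reviewers of the scheduler-keyword change in dask/base.py above: ``get_scheduler`` now also accepts a plain callable, as well as Client-like objects exposing a ``get`` method, via ``scheduler=``. Below is a minimal sketch of the behaviour exercised by the new ``test_callable_scheduler`` test, assuming only dask itself is installed; the helper name ``my_scheduler`` and the toy computation are illustrative only:

    import dask
    from dask import delayed

    calls = []

    def my_scheduler(dsk, keys, *args, **kwargs):
        # A scheduler is any callable taking a graph and keys; here we just
        # record the call and defer to dask's synchronous get.
        calls.append(True)
        return dask.get(dsk, keys)

    assert delayed(lambda: 1)().compute(scheduler=my_scheduler) == 1
    assert calls  # the custom callable was actually used

Passing a ``distributed.Client`` instance works the same way: the new branch in ``get_scheduler`` dispatches to the client's ``get`` method.
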
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/changelog.rst new/dask-0.19.4/docs/source/changelog.rst --- old/dask-0.19.3/docs/source/changelog.rst 2018-10-05 20:56:12.000000000 +0200 +++ new/dask-0.19.4/docs/source/changelog.rst 2018-10-09 21:25:03.000000000 +0200 @@ -1,6 +1,37 @@ Changelog ========= +0.19.4 / 2018-10-09 +------------------- + +Array ++++++ + +- Implement ``apply_gufunc(..., axes=..., keepdims=...)`` (:pr:`3985`) `Markus Gonser`_ + +Bag ++++ + +- Fix typo in datasets.make_people (:pr:`4069`) `Matthew Rocklin`_ + +Dataframe ++++++++++ + +- Added `percentiles` options for `dask.dataframe.describe` method (:pr:`4067`) `Zhenqing Li`_ +- Add DataFrame.partitions accessor similar to Array.blocks (:pr:`4066`) `Matthew Rocklin`_ + +Core +++++ + +- Pass get functions and Clients through scheduler keyword (:pr:`4062`) `Matthew Rocklin`_ + +Documentation ++++++++++++++ + +- Fix Typo on hpc example. (missing `=` in kwarg). (:pr:`4068`) `Matthias Bussonier`_ +- Extensive copy-editing: (:pr:`4065`), (:pr:`4064`), (:pr:`4063`) `Miguel Farrajota`_ + + 0.19.3 / 2018-10-05 ------------------- @@ -1460,3 +1491,5 @@ .. _`Jeremy Chan`: https://github.com/convexset .. _`Eric Wolak`: https://github.com/epall .. _`Miguel Farrajota`: https://github.com/farrajota +.. _`Zhenqing Li`: https://github.com/DigitalPig +.. _`Matthias Bussonier`: https://github.com/Carreau diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/dataframe-api.rst new/dask-0.19.4/docs/source/dataframe-api.rst --- old/dask-0.19.3/docs/source/dataframe-api.rst 2018-09-26 15:00:28.000000000 +0200 +++ new/dask-0.19.4/docs/source/dataframe-api.rst 2018-10-09 17:02:36.000000000 +0200 @@ -55,6 +55,7 @@ DataFrame.ndim DataFrame.nlargest DataFrame.npartitions + DataFrame.partitions DataFrame.pow DataFrame.quantile DataFrame.query diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/index.rst new/dask-0.19.4/docs/source/index.rst --- old/dask-0.19.3/docs/source/index.rst 2018-09-30 16:48:25.000000000 +0200 +++ new/dask-0.19.4/docs/source/index.rst 2018-10-09 15:05:48.000000000 +0200 @@ -171,7 +171,7 @@ dataframe.rst delayed.rst futures.rst - machine-learning.rst + Machine Learning <https://ml.dask.org> api.rst **Scheduling** diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/machine-learning.rst new/dask-0.19.4/docs/source/machine-learning.rst --- old/dask-0.19.3/docs/source/machine-learning.rst 2018-09-30 16:48:25.000000000 +0200 +++ new/dask-0.19.4/docs/source/machine-learning.rst 1970-01-01 01:00:00.000000000 +0100 @@ -1,11 +0,0 @@ -Machine Learning -================ - -Dask facilitates machine learning, statistics, and optimization workloads in a -variety of ways. Generally Dask tries to support other high-quality solutions -within the PyData ecosystem rather than reinvent new systems. Dask makes it -easier to scale single-machine libraries like Scikit-Learn where possible and -makes using distributed libraries like XGBoost or Tensorflow more comfortable -for everyday users. - -See the separate `Dask-ML documentation <https://ml.dask.org/en/latest>`_ for more information. 
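
For reviewers unfamiliar with the headline Array change summarised in the changelog above: ``apply_gufunc`` now takes ``axes=``, ``axis=`` and ``keepdims=`` keywords analogous to NumPy's generalized-ufunc arguments. A minimal usage sketch in the spirit of the new ``test_apply_gufunc_axis_*`` tests; the array size, chunking and reducing function are illustrative only:

    import numpy as np
    import dask.array as da

    def mymedian(x):
        # Reduces the core dimension "(i)", which apply_gufunc passes last.
        return np.median(x, axis=-1)

    a = da.random.normal(size=(10, 5), chunks=(2, 5))

    # Reduce over axis 0 instead of the last axis and keep it as size 1;
    # allow_rechunk is needed because axis 0 spans several chunks here.
    m = da.apply_gufunc(mymedian, "(i)->()", a,
                        axis=0, keepdims=True, allow_rechunk=True)
    assert m.compute().shape == (1, 5)
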
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/setup/cloud.rst new/dask-0.19.4/docs/source/setup/cloud.rst --- old/dask-0.19.3/docs/source/setup/cloud.rst 2018-09-24 15:51:28.000000000 +0200 +++ new/dask-0.19.4/docs/source/setup/cloud.rst 2018-10-09 16:10:28.000000000 +0200 @@ -1,9 +1,8 @@ Cloud Deployments ================= -To get started running Dask on common Cloud providers -like Amazon, Google, or Microsoft -we currently recommend deploying +To get started running Dask on common Cloud providers like Amazon, +Google, or Microsoft, we currently recommend deploying :doc:`Dask with Kubernetes and Helm <kubernetes-helm>`. All three major cloud vendors now provide managed Kubernetes services. @@ -14,7 +13,7 @@ ----------- You may want to install additional libraries in your Jupyter and worker images -to access the object stores of each cloud +to access the object stores of each cloud: - `s3fs <https://s3fs.readthedocs.io/>`_ for Amazon's S3 - `gcsfs <https://gcsfs.readthedocs.io/>`_ for Google's GCS diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/setup/hpc.rst new/dask-0.19.4/docs/source/setup/hpc.rst --- old/dask-0.19.3/docs/source/setup/hpc.rst 2018-10-05 20:48:54.000000000 +0200 +++ new/dask-0.19.4/docs/source/setup/hpc.rst 2018-10-09 17:02:32.000000000 +0200 @@ -37,7 +37,7 @@ from dask_jobqueue import PBSCluster cluster = PBSCluster(cores=36, - memory"100GB", + memory="100GB", project='P48500028', queue='premium', walltime='02:00:00') diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/setup/kubernetes-helm.rst new/dask-0.19.4/docs/source/setup/kubernetes-helm.rst --- old/dask-0.19.3/docs/source/setup/kubernetes-helm.rst 2018-09-24 15:51:28.000000000 +0200 +++ new/dask-0.19.4/docs/source/setup/kubernetes-helm.rst 2018-10-09 16:10:28.000000000 +0200 @@ -1,18 +1,18 @@ Kubernetes and Helm =================== -It is easy to launch a Dask cluster and Jupyter notebook server on cloud +It is easy to launch a Dask cluster and a Jupyter notebook server on cloud resources using Kubernetes_ and Helm_. .. _Kubernetes: https://kubernetes.io/ .. _Helm: https://helm.sh/ This is particularly useful when you want to deploy a fresh Python environment -on Cloud services, like Amazon Web Services, Google Compute Engine, or +on Cloud services like Amazon Web Services, Google Compute Engine, or Microsoft Azure. If you already have Python environments running in a pre-existing Kubernetes -cluster then you may prefer the :doc:`Kubernetes native<kubernetes-native>` +cluster, then you may prefer the :doc:`Kubernetes native<kubernetes-native>` documentation, which is a bit lighter weight. @@ -21,17 +21,17 @@ This document assumes that you have a Kubernetes cluster and Helm installed. -If this is not the case then you might consider setting up a Kubernetes cluster -either on one of the common cloud providers like Google, Amazon, or -Microsoft's. We recommend the first part of the documentation in the guide +If this is not the case, then you might consider setting up a Kubernetes cluster +on one of the common cloud providers like Google, Amazon, or +Microsoft. We recommend the first part of the documentation in the guide `Zero to JupyterHub <http://zero-to-jupyterhub.readthedocs.io/en/latest/>`_ -that focuses on Kubernetes and Helm. You do not need to follow all of these -instructions. 
JupyterHub is not necessary to deploy Dask: +that focuses on Kubernetes and Helm (you do not need to follow all of these +instructions). Also, JupyterHub is not necessary to deploy Dask: - `Creating a Kubernetes Cluster <https://zero-to-jupyterhub.readthedocs.io/en/v0.4-doc/create-k8s-cluster.html>`_ - `Setting up Helm <https://zero-to-jupyterhub.readthedocs.io/en/v0.4-doc/setup-helm.html>`_ -Alternatively you may want to experiment with Kubernetes locally using +Alternatively, you may want to experiment with Kubernetes locally using `Minikube <https://kubernetes.io/docs/getting-started-guides/minikube/>`_. @@ -45,7 +45,7 @@ helm repo update -Now you can launch Dask on your Kubernetes cluster using the Dask Helm_ chart:: +Now, you can launch Dask on your Kubernetes cluster using the Dask Helm_ chart:: helm install stable/dask @@ -56,7 +56,7 @@ Verify Deployment ----------------- -This might make a minute to deploy. You can check on the status with +This might take a minute to deploy. You can check its status with ``kubectl``:: kubectl get pods @@ -82,9 +82,9 @@ Notice the name ``bald-eel``. This is the name that Helm has given to your particular deployment of Dask. You could, for example, have multiple -Dask-and-Jupyter clusters running at once and each would be given a different -name. You will use this name to refer to your deployment in the future. You -can list all active helm deployments with:: +Dask-and-Jupyter clusters running at once, and each would be given a different +name. Note that you will need to use this name to refer to your deployment in the future. +Additionally, you can list all active helm deployments with:: helm list @@ -95,7 +95,7 @@ Connect to Dask and Jupyter --------------------------- -When we ran ``kubectl get services`` we saw some externally visible IPs +When we ran ``kubectl get services``, we saw some externally visible IPs: .. code-block:: bash @@ -105,8 +105,8 @@ bald-eel-scheduler LoadBalancer 10.11.245.241 35.202.201.129 8786:31166/TCP,80:31626/TCP 2m kubernetes ClusterIP 10.11.240.1 <none> 443/TCP 48m -We can navigate to these from any web browser. One is the Dask diagnostic -dashboard. The other is the Jupyter server. You can log into the Jupyter +We can navigate to these services from any web browser. Here, one is the Dask diagnostic +dashboard, and the other is the Jupyter server. You can log into the Jupyter notebook server with the password, ``dask``. You can create a notebook and create a Dask client from there. The @@ -131,12 +131,12 @@ Configure Environment --------------------- -By default the Helm deployment launches three workers using two cores each and +By default, the Helm deployment launches three workers using two cores each and a standard conda environment. We can customize this environment by creating a small yaml file that implements a subset of the values in the -`dask helm chart values.yaml file <https://github.com/dask/helm-chart/blob/master/dask/values.yaml>`_ +`dask helm chart values.yaml file <https://github.com/dask/helm-chart/blob/master/dask/values.yaml>`_. -For example we can increase the number of workers, and include extra conda and +For example, we can increase the number of workers, and include extra conda and pip packages to install on the both the workers and Jupyter server (these two environments should be matched). 
@@ -168,13 +168,13 @@ - name: EXTRA_PIP_PACKAGES value: s3fs dask-ml --upgrade -This config file overrides configuration for number and size of workers and the +This config file overrides the configuration for the number and size of workers and the conda and pip packages installed on the worker and Jupyter containers. In -general we will want to make sure that these two software environments match. +general, we will want to make sure that these two software environments match. Update your deployment to use this configuration file. Note that *you will not -use helm install* for this stage. That would create a *new* deployment on the -same Kubernetes cluster. Instead you will upgrade your existing deployment by +use helm install* for this stage: that would create a *new* deployment on the +same Kubernetes cluster. Instead, you will upgrade your existing deployment by using the current name:: helm upgrade bald-eel stable/dask -f config.yaml @@ -188,10 +188,10 @@ Check status and logs --------------------- -For standard issues you should be able to see worker status and logs using the -Dask dashboard (in particular see the worker links from the ``info/`` page). -However if your workers aren't starting you can check on the status of pods and -their logs with the following commands +For standard issues, you should be able to see the worker status and logs using the +Dask dashboard (in particular, you can see the worker links from the ``info/`` page). +However, if your workers aren't starting, you can check the status of pods and +their logs with the following commands: .. code-block:: bash @@ -228,15 +228,15 @@ ... -Delete Helm deployment ----------------------- +Delete a Helm deployment +------------------------ You can always delete a helm deployment using its name:: helm delete bald-eel --purge Note that this does not destroy any clusters that you may have allocated on a -Cloud service, you will need to delete those explicitly. +Cloud service (you will need to delete those explicitly). Avoid the Jupyter Server diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/setup/python-advanced.rst new/dask-0.19.4/docs/source/setup/python-advanced.rst --- old/dask-0.19.3/docs/source/setup/python-advanced.rst 2018-09-24 15:51:28.000000000 +0200 +++ new/dask-0.19.4/docs/source/setup/python-advanced.rst 2018-10-09 16:10:28.000000000 +0200 @@ -1,7 +1,7 @@ Python API (advanced) ===================== -In some rare cases experts may want to create ``Scheduler`` and ``Worker`` +In some rare cases, experts may want to create ``Scheduler`` and ``Worker`` objects explicitly in Python manually. This is often necessary when making tools to automatically deploy Dask in custom settings. @@ -11,7 +11,7 @@ Scheduler --------- -Start the Scheduler, provide the listening port (defaults to 8786) and Tornado +To start the Scheduler, provide the listening port (defaults to 8786) and Tornado IOLoop (defaults to ``IOLoop.current()``) .. code-block:: python @@ -27,7 +27,7 @@ loop.start() Alternatively, you may want the IOLoop and scheduler to run in a separate -thread. In that case you would replace the ``loop.start()`` call with the +thread. In this case, you would replace the ``loop.start()`` call with the following: .. code-block:: python @@ -39,7 +39,7 @@ Worker ------ -On other nodes start worker processes that point to the URL of the scheduler. +On other nodes, start worker processes that point to the URL of the scheduler. .. 
code-block:: python @@ -55,8 +55,8 @@ Alternatively, replace ``Worker`` with ``Nanny`` if you want your workers to be managed in a separate process by a local nanny process. This allows workers to -restart themselves in case of failure, provides some additional monitoring, and -is useful when coordinating many workers that should live in different -processes to avoid the GIL_. +restart themselves in case of failure. Also, it provides some additional monitoring, +and is useful when coordinating many workers that should live in different +processes in order to avoid the GIL_. .. _GIL: https://docs.python.org/3/glossary.html#term-gil diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.3/docs/source/spark.rst new/dask-0.19.4/docs/source/spark.rst --- old/dask-0.19.3/docs/source/spark.rst 2018-09-30 16:48:25.000000000 +0200 +++ new/dask-0.19.4/docs/source/spark.rst 2018-10-09 15:05:48.000000000 +0200 @@ -98,7 +98,7 @@ - Dask allows you to specify arbitrary task graphs for more complex and custom systems that are not part of the standard set of collections. -.. _dask-ml: https://ml.dask.org/en/latest +.. _dask-ml: https://ml.dask.org Reasons you might choose Spark

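
A closing note on the two Dataframe additions in this update, the ``percentiles=`` option for ``describe`` and the ``DataFrame.partitions`` accessor: here is a short sketch mirroring the new tests in test_dataframe.py, where the column name ``x`` and the partition counts are illustrative only:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': range(10)})
    ddf = dd.from_pandas(df, npartitions=5)

    # describe() now accepts custom percentiles; the median is always added.
    print(ddf.describe(percentiles=[0.25, 0.75]).compute())

    # Partition-wise slicing, analogous to Array.blocks.
    assert ddf.partitions[:3].npartitions == 3
    assert ddf.x.partitions[::2].compute().tolist() == [0, 1, 4, 5, 8, 9]
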