Hello community,

here is the log from the commit of package python-dask for openSUSE:Factory checked in at 2018-09-11 17:17:52

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-dask (Old)
 and      /work/SRC/openSUSE:Factory/.python-dask.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-dask" Tue Sep 11 17:17:52 2018 rev:7 rq:634440 version:0.19.1 Changes: -------- --- /work/SRC/openSUSE:Factory/python-dask/python-dask.changes 2018-09-04 22:56:24.821050827 +0200 +++ /work/SRC/openSUSE:Factory/.python-dask.new/python-dask.changes 2018-09-11 17:17:59.183346691 +0200 @@ -1,0 +2,31 @@ +Sat Sep 8 04:33:17 UTC 2018 - Arun Persaud <[email protected]> + +- update to version 0.19.1: + * Array + + Don't enforce dtype if result has no dtype (:pr:`3928`) Matthew + Rocklin + + Fix NumPy issubtype deprecation warning (:pr:`3939`) Bruce Merry + + Fix arg reduction tokens to be unique with different arguments + (:pr:`3955`) Tobias de Jong + + Coerce numpy integers to ints in slicing code (:pr:`3944`) Yu + Feng + + Linalg.norm ndim along axis partial fix (:pr:`3933`) Tobias de + Jong + * Dataframe + + Deterministic DataFrame.set_index (:pr:`3867`) George Sakkis + + Fix divisions in read_parquet when dealing with filters #3831 + #3930 (:pr:`3923`) (:pr:`3931`) @andrethrill + + Fixing returning type in categorical.as_known (:pr:`3888`) + Sriharsha Hatwar + + Fix DataFrame.assign for callables (:pr:`3919`) Tom Augspurger + + Include partitions with no width in repartition (:pr:`3941`) + Matthew Rocklin + + Don't constrict stage/k dtype in dataframe shuffle (:pr:`3942`) + Matthew Rocklin + * Documentation + + DOC: Add hint on how to render task graphs horizontally + (:pr:`3922`) Uwe Korn + + Add try-now button to main landing page (:pr:`3924`) Matthew + Rocklin + +------------------------------------------------------------------- Old: ---- dask-0.19.0.tar.gz New: ---- dask-0.19.1.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ python-dask.spec ++++++ --- /var/tmp/diff_new_pack.klQLhF/_old 2018-09-11 17:17:59.943345525 +0200 +++ /var/tmp/diff_new_pack.klQLhF/_new 2018-09-11 17:17:59.947345519 +0200 @@ -22,7 +22,7 @@ # python(2/3)-distributed has a dependency loop with python(2/3)-dask %bcond_with test_distributed Name: python-dask -Version: 0.19.0 +Version: 0.19.1 Release: 0 Summary: Minimal task scheduling abstraction License: BSD-3-Clause ++++++ dask-0.19.0.tar.gz -> dask-0.19.1.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/PKG-INFO new/dask-0.19.1/PKG-INFO --- old/dask-0.19.0/PKG-INFO 2018-08-30 18:41:43.000000000 +0200 +++ new/dask-0.19.1/PKG-INFO 2018-09-06 14:15:04.000000000 +0200 @@ -1,11 +1,12 @@ -Metadata-Version: 2.1 +Metadata-Version: 1.2 Name: dask -Version: 0.19.0 +Version: 0.19.1 Summary: Parallel PyData with Task Scheduling Home-page: http://github.com/dask/dask/ -Maintainer: Matthew Rocklin -Maintainer-email: [email protected] +Author: Matthew Rocklin +Author-email: [email protected] License: BSD +Description-Content-Type: UNKNOWN Description: Dask ==== @@ -44,9 +45,3 @@ Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.* -Provides-Extra: complete -Provides-Extra: bag -Provides-Extra: array -Provides-Extra: delayed -Provides-Extra: distributed -Provides-Extra: dataframe diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/_version.py new/dask-0.19.1/dask/_version.py --- old/dask-0.19.0/dask/_version.py 2018-08-30 18:41:43.000000000 +0200 +++ new/dask-0.19.1/dask/_version.py 2018-09-06 14:15:04.000000000 +0200 @@ -11,8 +11,8 @@ { "dirty": 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/creation.py new/dask-0.19.1/dask/array/creation.py
--- old/dask-0.19.0/dask/array/creation.py  2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/creation.py  2018-09-06 13:45:35.000000000 +0200
@@ -943,7 +943,7 @@
 
     result = result.map_blocks(
         wrapped_pad_func,
-        token="pad",
+        name="pad",
         dtype=result.dtype,
         pad_func=mode,
         iaxis_pad_width=pad_width[d],
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/linalg.py new/dask-0.19.1/dask/array/linalg.py
--- old/dask-0.19.0/dask/array/linalg.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/linalg.py    2018-09-06 13:45:35.000000000 +0200
@@ -111,6 +111,7 @@
             " 2. Have only one column of blocks\n\n"
             "Note: This function (tsqr) supports QR decomposition in the case of\n"
             "tall-and-skinny matrices (single column chunk/block; see qr)"
+            "Current shape: {},\nCurrent chunksize: {}".format(data.shape, data.chunksize)
         )
 
     token = '-' + tokenize(data, compute_svd)
@@ -1081,9 +1082,6 @@
 
 @wraps(np.linalg.norm)
 def norm(x, ord=None, axis=None, keepdims=False):
-    if x.ndim > 2:
-        raise ValueError("Improper number of dimensions to norm.")
-
     if axis is None:
         axis = tuple(range(x.ndim))
     elif isinstance(axis, Number):
@@ -1091,6 +1089,9 @@
     else:
         axis = tuple(axis)
 
+    if len(axis) > 2:
+        raise ValueError("Improper number of dimensions to norm.")
+
     if ord == "fro":
         ord = None
         if len(axis) == 1:
@@ -1104,6 +1105,8 @@
     elif ord == "nuc":
         if len(axis) == 1:
             raise ValueError("Invalid norm order for vectors.")
+        if x.ndim > 2:
+            raise NotImplementedError("SVD based norm not implemented for ndim > 2")
 
         r = svd(x)[1][None].sum(keepdims=keepdims)
     elif ord == np.inf:
@@ -1111,29 +1114,41 @@
         if len(axis) == 1:
             r = r.max(axis=axis, keepdims=keepdims)
         else:
-            r = r.sum(axis=axis[1], keepdims=keepdims).max(keepdims=keepdims)
+            r = r.sum(axis=axis[1], keepdims=True).max(axis=axis[0], keepdims=True)
+            if keepdims is False:
+                r = r.squeeze(axis=axis)
     elif ord == -np.inf:
         r = abs(r)
         if len(axis) == 1:
             r = r.min(axis=axis, keepdims=keepdims)
         else:
-            r = r.sum(axis=axis[1], keepdims=keepdims).min(keepdims=keepdims)
+            r = r.sum(axis=axis[1], keepdims=True).min(axis=axis[0], keepdims=True)
+            if keepdims is False:
+                r = r.squeeze(axis=axis)
     elif ord == 0:
         if len(axis) == 2:
             raise ValueError("Invalid norm order for matrices.")
-        r = (r != 0).astype(r.dtype).sum(axis=0, keepdims=keepdims)
+        r = (r != 0).astype(r.dtype).sum(axis=axis, keepdims=keepdims)
     elif ord == 1:
         r = abs(r)
         if len(axis) == 1:
             r = r.sum(axis=axis, keepdims=keepdims)
         else:
-            r = r.sum(axis=axis[0], keepdims=keepdims).max(keepdims=keepdims)
+            r = r.sum(axis=axis[0], keepdims=True).max(axis=axis[1], keepdims=True)
+            if keepdims is False:
+                r = r.squeeze(axis=axis)
     elif len(axis) == 2 and ord == -1:
-        r = abs(r).sum(axis=axis[0], keepdims=keepdims).min(keepdims=keepdims)
+        r = abs(r).sum(axis=axis[0], keepdims=True).min(axis=axis[1], keepdims=True)
+        if keepdims is False:
+            r = r.squeeze(axis=axis)
     elif len(axis) == 2 and ord == 2:
+        if x.ndim > 2:
+            raise NotImplementedError("SVD based norm not implemented for ndim > 2")
         r = svd(x)[1][None].max(keepdims=keepdims)
     elif len(axis) == 2 and ord == -2:
+        if x.ndim > 2:
+            raise NotImplementedError("SVD based norm not implemented for ndim > 2")
         r = svd(x)[1][None].min(keepdims=keepdims)
     else:
         if len(axis) == 2:
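Illustrative usage (not part of the patch), mirroring the new ``test_norm_any_slice`` test below: non-SVD norms over an axis pair now agree with NumPy on arrays with more than two dimensions.

    import numpy as np
    import dask.array as da

    a = np.random.random((4, 5, 3))
    d = da.from_array(a, chunks=(2, 2, 2))

    # Matrix norm over one axis pair of a 3-d array; previously dask
    # raised "Improper number of dimensions to norm." for ndim > 2.
    expected = np.linalg.norm(a, ord=np.inf, axis=(1, 2))
    result = da.linalg.norm(d, ord=np.inf, axis=(1, 2)).compute()
    assert np.allclose(expected, result)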
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/reductions.py new/dask-0.19.1/dask/array/reductions.py
--- old/dask-0.19.0/dask/array/reductions.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/reductions.py    2018-09-06 13:45:35.000000000 +0200
@@ -614,7 +614,8 @@
                          "got '{0}'".format(axis))
 
     # Map chunk across all blocks
-    name = 'arg-reduce-chunk-{0}'.format(tokenize(chunk, axis))
+    name = 'arg-reduce-{0}'.format(tokenize(axis, x, chunk,
+                                            combine, split_every))
     old = x.name
     keys = list(product(*map(range, x.numblocks)))
     offsets = list(product(*(accumulate(operator.add, bd[:-1], 0)
@@ -714,7 +715,8 @@
 
     m = x.map_blocks(func, axis=axis, dtype=dtype)
 
-    name = '%s-axis=%d-%s' % (func.__name__, axis, tokenize(x, dtype))
+    name = '{0}-{1}'.format(func.__name__, tokenize(func, axis, binop,
+                                                    ident, x, dtype))
     n = x.numblocks[axis]
     full = slice(None, None, None)
     slc = (full,) * axis + (slice(-1, None),) + (full,) * (x.ndim - axis - 1)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/slicing.py new/dask-0.19.1/dask/array/slicing.py
--- old/dask-0.19.0/dask/array/slicing.py   2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/slicing.py   2018-09-06 13:45:35.000000000 +0200
@@ -69,7 +69,7 @@
         return np.asanyarray(nonzero)
     elif np.issubdtype(index_array.dtype, np.integer):
         return index_array
-    elif np.issubdtype(index_array.dtype, float):
+    elif np.issubdtype(index_array.dtype, np.floating):
         int_index = index_array.astype(np.intp)
         if np.allclose(index_array, int_index):
             return int_index
@@ -391,7 +391,7 @@
             ind = index - chunk_boundaries[i - 1]
         else:
             ind = index
-        return {i: ind}
+        return {int(i): int(ind)}
 
     assert isinstance(index, slice)
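For reference, a small example (not part of the patch) of the deprecation that the ``issubdtype`` change works around — recent NumPy warns when the builtin ``float`` is passed as the second argument, so the abstract type ``np.floating`` is used instead:

    import numpy as np

    idx = np.array([0.0, 1.0, 2.0])

    # np.issubdtype(idx.dtype, float) triggers a FutureWarning on
    # NumPy >= 1.14; np.floating matches all float widths explicitly.
    assert np.issubdtype(idx.dtype, np.floating)
    assert not np.issubdtype(np.array([1, 2]).dtype, np.floating)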
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/tests/test_array_core.py new/dask-0.19.1/dask/array/tests/test_array_core.py
--- old/dask-0.19.0/dask/array/tests/test_array_core.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/tests/test_array_core.py    2018-09-06 13:45:35.000000000 +0200
@@ -3644,3 +3644,8 @@
         da.argmax(Y, axis=0).compute()
 
     assert not record
+
+
+def test_3925():
+    x = da.from_array(np.array(['a', 'b', 'c'], dtype=object), chunks=-1)
+    assert (x[0] == x[0]).compute(scheduler='sync')
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/tests/test_linalg.py new/dask-0.19.1/dask/array/tests/test_linalg.py
--- old/dask-0.19.0/dask/array/tests/test_linalg.py 2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/tests/test_linalg.py 2018-09-06 13:45:35.000000000 +0200
@@ -659,10 +659,6 @@
     [(5,), (2,), 0],
     [(5,), (2,), (0,)],
     [(5, 6), (2, 2), None],
-    [(5, 6), (2, 2), 0],
-    [(5, 6), (2, 2), 1],
-    [(5, 6), (2, 2), (0, 1)],
-    [(5, 6), (2, 2), (1, 0)],
 ])
 @pytest.mark.parametrize("norm", [
     None,
@@ -685,6 +681,40 @@
     assert_eq(a_r, d_r)
 
 
[email protected]
[email protected]("shape, chunks", [
+    [(5,), (2,)],
+    [(5, 3), (2, 2)],
+    [(4, 5, 3), (2, 2, 2)],
+    [(4, 5, 2, 3), (2, 2, 2, 2)],
+    [(2, 5, 2, 4, 3), (2, 2, 2, 2, 2)],
+])
[email protected]("norm", [
+    None,
+    1,
+    -1,
+    np.inf,
+    -np.inf,
+])
[email protected]("keepdims", [
+    False,
+    True,
+])
+def test_norm_any_slice(shape, chunks, norm, keepdims):
+    a = np.random.random(shape)
+    d = da.from_array(a, chunks=chunks)
+
+    for firstaxis in range(len(shape)):
+        for secondaxis in range(len(shape)):
+            if firstaxis != secondaxis:
+                axis = (firstaxis, secondaxis)
+            else:
+                axis = firstaxis
+            a_r = np.linalg.norm(a, ord=norm, axis=axis, keepdims=keepdims)
+            d_r = da.linalg.norm(d, ord=norm, axis=axis, keepdims=keepdims)
+            assert_eq(a_r, d_r)
+
+
 @pytest.mark.parametrize("shape, chunks, axis", [
     [(5,), (2,), None],
     [(5,), (2,), 0],
@@ -730,9 +760,30 @@
 
     # Need one chunk on last dimension for svd.
     if norm == "nuc" or norm == 2 or norm == -2:
-        d = d.rechunk((d.chunks[0], d.shape[1]))
+        d = d.rechunk({-1: -1})
 
     a_r = np.linalg.norm(a, ord=norm, axis=axis, keepdims=keepdims)
     d_r = da.linalg.norm(d, ord=norm, axis=axis, keepdims=keepdims)
 
     assert_eq(a_r, d_r)
+
+
[email protected]("shape, chunks, axis", [
+    [(3, 2, 4), (2, 2, 2), (1, 2)],
+    [(2, 3, 4, 5), (2, 2, 2, 2), (-1, -2)],
+])
[email protected]("norm", [
+    "nuc",
+    2,
+    -2
+])
[email protected]("keepdims", [
+    False,
+    True,
+])
+def test_norm_implemented_errors(shape, chunks, axis, norm, keepdims):
+    a = np.random.random(shape)
+    d = da.from_array(a, chunks=chunks)
+    if len(shape) > 2 and len(axis) == 2:
+        with pytest.raises(NotImplementedError):
+            da.linalg.norm(d, ord=norm, axis=axis, keepdims=keepdims)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/tests/test_optimization.py new/dask-0.19.1/dask/array/tests/test_optimization.py
--- old/dask-0.19.0/dask/array/tests/test_optimization.py  2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/tests/test_optimization.py  2018-09-06 13:45:35.000000000 +0200
@@ -273,3 +273,15 @@
 
     assert dask.get(a, y.__dask_keys__()) == dask.get(b, y.__dask_keys__())
     assert len(a) < len(b)
+
+
+def test_gh3937():
+    # test for github issue #3937
+    x = da.from_array([1, 2, 3.], (2,))
+    x = da.concatenate((x, [x[-1]]))
+    y = x.rechunk((2,))
+    # This will produce Integral type indices that are not ints (np.int64), failing
+    # the optimizer
+    y = da.coarsen(np.sum, y, {0: 2})
+    # How to trigger the optimizer explicitly?
+    y.compute()
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/array/tests/test_reductions.py new/dask-0.19.1/dask/array/tests/test_reductions.py
--- old/dask-0.19.0/dask/array/tests/test_reductions.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/array/tests/test_reductions.py    2018-09-06 13:45:35.000000000 +0200
@@ -6,7 +6,7 @@
 import dask.array as da
 from dask.array.utils import assert_eq as _assert_eq, same_keys
 from dask.core import get_deps
-from dask.context import set_options
+import dask.config as config
 
 
 def assert_eq(a, b):
@@ -139,7 +139,7 @@
     assert_eq(dfunc(a, 0), func(x, 0))
     assert_eq(dfunc(a, 1), func(x, 1))
     assert_eq(dfunc(a, 2), func(x, 2))
-    with set_options(split_every=2):
+    with config.set(split_every=2):
         assert_eq(dfunc(a), func(x))
         assert_eq(dfunc(a, 0), func(x, 0))
         assert_eq(dfunc(a, 1), func(x, 1))
@@ -368,7 +368,7 @@
 
 def test_tree_reduce_set_options():
     x = da.from_array(np.arange(242).reshape((11, 22)), chunks=(3, 4))
-    with set_options(split_every={0: 2, 1: 3}):
+    with config.set(split_every={0: 2, 1: 3}):
         assert_max_deps(x.sum(), 2 * 3)
         assert_max_deps(x.sum(axis=0), 2)
@@ -487,3 +487,14 @@
               da.topk(a, 5, axis=1, split_every=2))
     assert_eq(a.argtopk(5, axis=1, split_every=2),
               da.argtopk(a, 5, axis=1, split_every=2))
+
+
[email protected]('func', [da.cumsum, da.cumprod,
+                                  da.argmin, da.argmax,
+                                  da.min, da.max,
+                                  da.nansum, da.nanmax])
+def test_regres_3940(func):
+    a = da.ones((5,2), chunks=(2,2))
+    assert func(a).name != func(a + 1).name
+    assert func(a, axis=0).name != func(a).name
+    assert func(a, axis=0).name != func(a, axis=1).name
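The arg-reduction token fix is easy to observe directly; a quick check (not part of the patch) in the spirit of ``test_regres_3940`` above:

    import dask.array as da

    a = da.ones((5, 2), chunks=(2, 2))

    # Distinct inputs and distinct axes now yield distinct graph keys,
    # so unrelated reductions can no longer collide in one graph.
    assert da.argmax(a).name != da.argmax(a + 1).name
    assert da.argmax(a, axis=0).name != da.argmax(a, axis=1).name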
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/categorical.py new/dask-0.19.1/dask/dataframe/categorical.py
--- old/dask-0.19.0/dask/dataframe/categorical.py   2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/categorical.py   2018-09-06 13:45:35.000000000 +0200
@@ -184,7 +184,7 @@
             Keywords to pass on to the call to `compute`.
         """
         if self.known:
-            return self
+            return self._series
         categories = self._property_map('categories').unique().compute(**kwargs)
         return self.set_categories(categories.values)
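Usage sketch (not part of the patch): with this fix, ``as_known()`` on an already-known categorical returns the underlying Series rather than the accessor object, matching ``test_return_type_known_categories`` below.

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({"A": ["a", "b", "c"]})
    df["A"] = df["A"].astype("category")
    ddf = dd.from_pandas(df, npartitions=2)

    # Previously the known branch handed back the accessor itself;
    # it now always returns a dask Series.
    known = ddf.A.cat.as_known()
    assert isinstance(known, dd.Series)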
""" if self.known: - return self + return self._series categories = self._property_map('categories').unique().compute(**kwargs) return self.set_categories(categories.values) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/core.py new/dask-0.19.1/dask/dataframe/core.py --- old/dask-0.19.0/dask/dataframe/core.py 2018-08-30 18:28:02.000000000 +0200 +++ new/dask-0.19.1/dask/dataframe/core.py 2018-09-06 13:45:35.000000000 +0200 @@ -2527,6 +2527,9 @@ pd.compat.isidentifier(c))) return list(o) + def _ipython_key_completions_(self): + return self.columns.tolist() + @property def ndim(self): """ Return dimensionality """ @@ -2678,6 +2681,9 @@ callable(v) or pd.api.types.is_scalar(v)): raise TypeError("Column assignment doesn't support type " "{0}".format(type(v).__name__)) + if callable(v): + kwargs[k] = v(self) + pairs = list(sum(kwargs.items(), ())) # Figure out columns of the output @@ -4078,8 +4084,9 @@ else: d[(out1, k)] = (methods.boundary_slice, (name, i - 1), low, b[j], False) low = b[j] + if len(a) == i + 1 or a[i] < a[i + 1]: + j += 1 i += 1 - j += 1 c.append(low) k += 1 @@ -4113,7 +4120,7 @@ while c[i] < b[j]: tmp.append((out1, i)) i += 1 - if last_elem and c[i] == b[-1] and (b[-1] != b[-2] or j == len(b) - 1) and i < k: + while last_elem and c[i] == b[-1] and (b[-1] != b[-2] or j == len(b) - 1) and i < k: # append if last split is not included tmp.append((out1, i)) i += 1 diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/io/parquet.py new/dask-0.19.1/dask/dataframe/io/parquet.py --- old/dask-0.19.0/dask/dataframe/io/parquet.py 2018-08-30 18:28:02.000000000 +0200 +++ new/dask-0.19.1/dask/dataframe/io/parquet.py 2018-09-06 13:45:35.000000000 +0200 @@ -285,12 +285,15 @@ if index_names and infer_divisions is not False: index_name = meta.index.name - minmax = fastparquet.api.sorted_partitioned_columns(pf) + try: + # is https://github.com/dask/fastparquet/pull/371 available in + # current fastparquet installation? 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/io/parquet.py new/dask-0.19.1/dask/dataframe/io/parquet.py
--- old/dask-0.19.0/dask/dataframe/io/parquet.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/io/parquet.py    2018-09-06 13:45:35.000000000 +0200
@@ -285,12 +285,15 @@
 
     if index_names and infer_divisions is not False:
         index_name = meta.index.name
-        minmax = fastparquet.api.sorted_partitioned_columns(pf)
+        try:
+            # is https://github.com/dask/fastparquet/pull/371 available in
+            # current fastparquet installation?
+            minmax = fastparquet.api.sorted_partitioned_columns(pf, filters)
+        except TypeError:
+            minmax = fastparquet.api.sorted_partitioned_columns(pf)
 
         if index_name in minmax:
-            divisions = (list(minmax[index_name]['min']) +
-                         [minmax[index_name]['max'][-1]])
-            divisions = [divisions[i] for i, rg in enumerate(pf.row_groups)
-                         if rg in rgs] + [divisions[-1]]
+            divisions = minmax[index_name]
+            divisions = divisions['min'] + [divisions['max'][-1]]
         else:
             if infer_divisions is True:
                 raise ValueError(
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/io/tests/test_parquet.py new/dask-0.19.1/dask/dataframe/io/tests/test_parquet.py
--- old/dask-0.19.0/dask/dataframe/io/tests/test_parquet.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/io/tests/test_parquet.py    2018-09-06 13:45:35.000000000 +0200
@@ -819,6 +819,52 @@
     assert len(ddf2) > 0
 
 
+def test_divisions_read_with_filters(tmpdir):
+    check_fastparquet()
+    tmpdir = str(tmpdir)
+    #generate dataframe
+    size = 100
+    categoricals = []
+    for value in ['a', 'b', 'c', 'd']:
+        categoricals += [value] * int(size / 4)
+    df = pd.DataFrame({'a': categoricals,
+                       'b': np.random.random(size=size),
+                       'c': np.random.randint(1, 5, size=size)})
+    d = dd.from_pandas(df, npartitions=4)
+    #save it
+    d.to_parquet(tmpdir, partition_on=['a'], engine='fastparquet')
+    #read it
+    out = dd.read_parquet(tmpdir,
+                          engine='fastparquet',
+                          filters=[('a', '==', 'b')])
+    #test it
+    expected_divisions = (25, 49)
+    assert out.divisions == expected_divisions
+
+
+def test_divisions_are_known_read_with_filters(tmpdir):
+    check_fastparquet()
+    tmpdir = str(tmpdir)
+    #generate dataframe
+    df = pd.DataFrame({'unique': [0, 0, 1, 1, 2, 2, 3, 3],
+                       'id': ['id1', 'id2',
+                              'id1', 'id2',
+                              'id1', 'id2',
+                              'id1', 'id2']},
+                      index=[0, 0, 1, 1, 2, 2, 3, 3])
+    d = dd.from_pandas(df, npartitions=2)
+    #save it
+    d.to_parquet(tmpdir, partition_on=['id'], engine='fastparquet')
+    #read it
+    out = dd.read_parquet(tmpdir,
+                          engine='fastparquet',
+                          filters=[('id', '==', 'id1')])
+    #test it
+    assert out.known_divisions
+    expected_divisions = (0, 2, 3)
+    assert out.divisions == expected_divisions
+
+
 def test_read_from_fastparquet_parquetfile(tmpdir):
     check_fastparquet()
     fn = str(tmpdir)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/partitionquantiles.py new/dask-0.19.1/dask/dataframe/partitionquantiles.py
--- old/dask-0.19.0/dask/dataframe/partitionquantiles.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/partitionquantiles.py    2018-09-06 13:45:35.000000000 +0200
@@ -436,7 +436,7 @@
     qs = np.linspace(0, 1, npartitions + 1)
     token = tokenize(df, qs, upsample)
     if random_state is None:
-        random_state = hash(token) % np.iinfo(np.int32).max
+        random_state = int(token, 16) % np.iinfo(np.int32).max
     state_data = random_state_data(df.npartitions, random_state)
 
     df_keys = df.__dask_keys__()
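The seed change above is the heart of the deterministic ``set_index`` fix (:pr:`3867`): ``hash()`` of a string is salted per interpreter on Python 3, while parsing the hex token is stable across processes. A minimal illustration (not part of the patch):

    import numpy as np
    from dask.base import tokenize

    token = tokenize([1, 2, 3])                     # deterministic hex digest
    seed = int(token, 16) % np.iinfo(np.int32).max

    # ``seed`` is identical in every worker process, whereas
    # ``hash(token)`` varies with PYTHONHASHSEED, which previously let
    # different processes compute different divisions.
    assert 0 <= seed < np.iinfo(np.int32).max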
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/shuffle.py new/dask-0.19.1/dask/dataframe/shuffle.py
--- old/dask-0.19.0/dask/dataframe/shuffle.py   2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/shuffle.py   2018-09-06 13:45:35.000000000 +0200
@@ -456,12 +456,9 @@
     c = ind._values
     typ = np.min_scalar_type(npartitions * 2)
 
-    npartitions, k, stage = [np.array(x, dtype=np.min_scalar_type(x))[()]
-                             for x in [npartitions, k, stage]]
-
     c = np.mod(c, npartitions).astype(typ, copy=False)
-    c = np.floor_divide(c, k ** stage, out=c)
-    c = np.mod(c, k, out=c)
+    np.floor_divide(c, k ** stage, out=c)
+    np.mod(c, k, out=c)
 
     indexer, locations = groupsort_indexer(c.astype(np.int64), k)
     df2 = df.take(indexer)
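Roughly what the retained lines compute, as a standalone sketch (values are illustrative, not from the patch): only the hashed codes ``c`` live in a compact dtype, while ``k`` and ``stage`` stay plain Python ints, so ``k ** stage`` is evaluated at full precision rather than in the constricted dtype.

    import numpy as np

    npartitions, k, stage = 300, 32, 1
    c = np.random.randint(0, 2**31, size=8).astype(np.int64)

    typ = np.min_scalar_type(npartitions * 2)       # e.g. uint16
    c = np.mod(c, npartitions).astype(typ, copy=False)
    np.floor_divide(c, k ** stage, out=c)           # k ** stage as a Python int
    np.mod(c, k, out=c)                             # bucket within this stage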
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/tests/test_categorical.py new/dask-0.19.1/dask/dataframe/tests/test_categorical.py
--- old/dask-0.19.0/dask/dataframe/tests/test_categorical.py   2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/tests/test_categorical.py   2018-09-06 13:45:35.000000000 +0200
@@ -271,6 +271,14 @@
     assert_eq(left, pd.Index(right) if isinstance(right, np.ndarray) else right)
 
 
+def test_return_type_known_categories():
+    df = pd.DataFrame({"A": ['a', 'b', 'c']})
+    df['A'] = df['A'].astype('category')
+    dask_df = dd.from_pandas(df, 2)
+    ret_type = dask_df.A.cat.as_known()
+    assert isinstance(ret_type, dd.core.Series)
+
+
 class TestCategoricalAccessor:
 
     @pytest.mark.parametrize('series', cat_series)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/tests/test_dataframe.py new/dask-0.19.1/dask/dataframe/tests/test_dataframe.py
--- old/dask-0.19.0/dask/dataframe/tests/test_dataframe.py 2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/tests/test_dataframe.py 2018-09-06 13:45:35.000000000 +0200
@@ -925,6 +925,13 @@
         d.assign(foo=d_unknown.a)
 
 
+def test_assign_callable():
+    df = dd.from_pandas(pd.DataFrame({"A": range(10)}), npartitions=2)
+    a = df.assign(B=df.A.shift())
+    b = df.assign(B=lambda x: x.A.shift())
+    assert_eq(a, b)
+
+
 def test_map():
     assert_eq(d.a.map(lambda x: x + 1), full.a.map(lambda x: x + 1))
     lk = dict((v, v + 1) for v in full.a.values)
@@ -2718,6 +2725,16 @@
         assert_eq(df[cols], ddf[cols])
 
 
+def test_ipython_completion():
+    df = pd.DataFrame({'a': [1], 'b': [2]})
+    ddf = dd.from_pandas(df, npartitions=1)
+
+    completions = ddf._ipython_key_completions_()
+    assert 'a' in completions
+    assert 'b' in completions
+    assert 'c' not in completions
+
+
 def test_diff():
     df = pd.DataFrame(np.random.randn(100, 5), columns=list('abcde'))
     ddf = dd.from_pandas(df, 5)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/tests/test_multi.py new/dask-0.19.1/dask/dataframe/tests/test_multi.py
--- old/dask-0.19.0/dask/dataframe/tests/test_multi.py  2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/tests/test_multi.py  2018-09-06 13:45:35.000000000 +0200
@@ -1308,3 +1308,29 @@
     joined = ddf2.join(ddf2, rsuffix='r')
     assert joined.divisions == (1, 1)
     joined.compute()
+
+
+def test_repartition_repeated_divisions():
+    df = pd.DataFrame({'x': [0, 0, 0, 0]})
+    ddf = dd.from_pandas(df, npartitions=2).set_index('x')
+
+    ddf2 = ddf.repartition(divisions=(0, 0), force=True)
+    assert_eq(ddf2, df.set_index('x'))
+
+
+def test_multi_duplicate_divisions():
+    df1 = pd.DataFrame({'x': [0, 0, 0, 0]})
+    df2 = pd.DataFrame({'x': [0]})
+
+    ddf1 = dd.from_pandas(df1, npartitions=2).set_index('x')
+    ddf2 = dd.from_pandas(df2, npartitions=1).set_index('x')
+    assert ddf1.npartitions == 2
+    assert len(ddf1) == len(df1)
+
+    r1 = ddf1.merge(ddf2, how='left', left_index=True, right_index=True)
+
+    sf1 = df1.set_index('x')
+    sf2 = df2.set_index('x')
+    r2 = sf1.merge(sf2, how='left', left_index=True, right_index=True)
+
+    assert_eq(r1, r2)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask/dataframe/tests/test_shuffle.py new/dask-0.19.1/dask/dataframe/tests/test_shuffle.py
--- old/dask-0.19.0/dask/dataframe/tests/test_shuffle.py    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/dask/dataframe/tests/test_shuffle.py    2018-09-06 13:45:35.000000000 +0200
@@ -1,9 +1,11 @@
 import os
+import sys
 import pandas as pd
 import pytest
 import pickle
 import numpy as np
 import string
+import multiprocessing as mp
 from copy import copy
 
 import pandas.util.testing as tm
@@ -358,6 +360,28 @@
         ddf.set_index('y', divisions=['a', 'b', 'd', 'c'], sorted=True)
 
 
[email protected]
[email protected](sys.version_info < (3, 4),
+                    reason="multiprocessing spawn only after Py3.4")
+def test_set_index_consistent_divisions():
+    # See https://github.com/dask/dask/issues/3867
+    df = pd.DataFrame({'x': np.random.random(100),
+                       'y': np.random.random(100) // 0.2},
+                      index=np.random.random(100))
+    ddf = dd.from_pandas(df, npartitions=4)
+    ddf = ddf.clear_divisions()
+
+    ctx = mp.get_context('spawn')
+    pool = ctx.Pool(processes=8)
+    results = [pool.apply_async(_set_index, (ddf, 'x')) for _ in range(100)]
+    divisions_set = set(result.get() for result in results)
+    assert len(divisions_set) == 1
+
+
+def _set_index(df, *args, **kwargs):
+    return df.set_index(*args, **kwargs).divisions
+
+
 @pytest.mark.parametrize('shuffle', ['disk', 'tasks'])
 def test_set_index_reduces_partitions_small(shuffle):
     df = pd.DataFrame({'x': np.random.random(100)})
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/dask.egg-info/PKG-INFO new/dask-0.19.1/dask.egg-info/PKG-INFO
--- old/dask-0.19.0/dask.egg-info/PKG-INFO  2018-08-30 18:41:43.000000000 +0200
+++ new/dask-0.19.1/dask.egg-info/PKG-INFO  2018-09-06 14:15:04.000000000 +0200
@@ -1,11 +1,12 @@
-Metadata-Version: 2.1
+Metadata-Version: 1.2
 Name: dask
-Version: 0.19.0
+Version: 0.19.1
 Summary: Parallel PyData with Task Scheduling
 Home-page: http://github.com/dask/dask/
-Maintainer: Matthew Rocklin
-Maintainer-email: [email protected]
+Author: Matthew Rocklin
+Author-email: [email protected]
 License: BSD
+Description-Content-Type: UNKNOWN
 Description: Dask
         ====
@@ -44,9 +45,3 @@
 Classifier: Programming Language :: Python :: 3.6
 Classifier: Programming Language :: Python :: 3.7
 Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
-Provides-Extra: complete
-Provides-Extra: bag
-Provides-Extra: array
-Provides-Extra: delayed
-Provides-Extra: distributed
-Provides-Extra: dataframe
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/docs/source/_static/main-page.css new/dask-0.19.1/docs/source/_static/main-page.css
--- old/dask-0.19.0/docs/source/_static/main-page.css   2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/docs/source/_static/main-page.css   2018-09-06 13:45:35.000000000 +0200
@@ -22,10 +22,10 @@
     border-radius: 0.3rem;
 }
 .navbar li:hover {
-    background-color: #ECB172;
+    background-color: #FDA061;
 }
 .navbar li .nav-link{
-    color: #ECB172;
+    color: #FDA061;
 }
 .navbar li:hover .nav-link{
     color: #212529;
@@ -36,11 +36,11 @@
 }
 
 .dropdown-item {
-    color: #ECB172;
+    color: #FDA061;
 }
 
 .dropdown-item:hover {
-    background-color: #ECB172D0;
+    background-color: #FDA061D0;
 }
 
 .hero {
@@ -56,15 +56,26 @@
 
 
 .outline-dask {
-    color: #ECB172;
+    color: #FDA061;
     background-color: transparent;
-    border-color: #ECB172;
+    border-color: #FDA061;
 }
+
 .outline-dask:hover {
     color: #212529;
-    background-color: #ECB172;
-    border-color: #ECB172;
+    background-color: #FDA061;
+    border-color: #FDA061;
+}
+
+.solid-dask {
+    color: #212529;
+    background-color: #FDA061;
+}
+
+.solid-dask:hover {
+    color: #212529;
+    background-color: #EC9050;
 }
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/docs/source/changelog.rst new/dask-0.19.1/docs/source/changelog.rst
--- old/dask-0.19.0/docs/source/changelog.rst   2018-08-30 18:39:37.000000000 +0200
+++ new/dask-0.19.1/docs/source/changelog.rst   2018-09-06 14:12:47.000000000 +0200
@@ -1,7 +1,7 @@
 Changelog
 =========
 
-0.19.1 / YYYY-MM-DD
+0.19.2 / YYYY-MM-DD
 -------------------
 
 Array
@@ -25,6 +25,35 @@
 
 -
 
+0.19.1 / 2018-09-06
+-------------------
+
+Array
++++++
+
+- Don't enforce dtype if result has no dtype (:pr:`3928`) `Matthew Rocklin`_
+- Fix NumPy issubtype deprecation warning (:pr:`3939`) `Bruce Merry`_
+- Fix arg reduction tokens to be unique with different arguments (:pr:`3955`) `Tobias de Jong`_
+- Coerce numpy integers to ints in slicing code (:pr:`3944`) `Yu Feng`_
+- Linalg.norm ndim along axis partial fix (:pr:`3933`) `Tobias de Jong`_
+
+Dataframe
++++++++++
+
+- Deterministic DataFrame.set_index (:pr:`3867`) `George Sakkis`_
+- Fix divisions in read_parquet when dealing with filters #3831 #3930 (:pr:`3923`) (:pr:`3931`) `@andrethrill`_
+- Fixing returning type in categorical.as_known (:pr:`3888`) `Sriharsha Hatwar`_
+- Fix DataFrame.assign for callables (:pr:`3919`) `Tom Augspurger`_
+- Include partitions with no width in repartition (:pr:`3941`) `Matthew Rocklin`_
+- Don't constrict stage/k dtype in dataframe shuffle (:pr:`3942`) `Matthew Rocklin`_
+
+Documentation
++++++++++++++
+
+- DOC: Add hint on how to render task graphs horizontally (:pr:`3922`) `Uwe Korn`_
+- Add try-now button to main landing page (:pr:`3924`) `Matthew Rocklin`_
+
+
 0.19.0 / 2018-08-29
 -------------------
 
@@ -32,7 +61,7 @@
 +++++
 
 - Fix argtopk split_every bug (:pr:`3810`) `Guido Imperiale`_
-- Ensure result computing dask.array.isnull(`) always gives a numpy array (:pr:`3825`) `Stephan Hoyer`_
+- Ensure result computing dask.array.isnull() always gives a numpy array (:pr:`3825`) `Stephan Hoyer`_
 - Support concatenate for scipy.sparse in dask array (:pr:`3836`) `Matthew Rocklin`_
 - Fix argtopk on 32-bit systems. (:pr:`3823`) `Elliott Sales de Andrade`_
 - Normalize keys in rechunk (:pr:`3820`) `Matthew Rocklin`_
@@ -1366,3 +1395,8 @@
 .. _`Hans Moritz Günther`: https://github.com/hamogu
 .. _`@rtobar`: https://github.com/rtobar
 .. _`Julia Signell`: https://github.com/jsignell
+.. _`Sriharsha Hatwar`: https://github.com/Sriharsha-hatwar
+.. _`Bruce Merry`: https://github.com/bmerry
+.. _`Joe Hamman`: https://github.com/jhamman
+.. _`Robert Sare`: https://github.com/rmsare
+.. _`Jeremy Chan`: https://github.com/convexset
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/docs/source/graphviz.rst new/dask-0.19.1/docs/source/graphviz.rst
--- old/dask-0.19.0/docs/source/graphviz.rst    2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/docs/source/graphviz.rst    2018-09-06 13:45:35.000000000 +0200
@@ -18,6 +18,10 @@
 except that rather than computing the result, they produce an
 image of the task graph.
 
+By default the task graph is rendered from top to bottom.
+In the case that you prefer to visualize it from left to right, pass
+``rankdir="LR"`` as a keyword argument to ``.visualize``.
+
 .. code-block:: python
 
    import dask.array as da
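The new hint in action — a minimal example (not part of the patch), assuming the optional ``graphviz`` dependency is installed:

    import dask.array as da

    x = da.ones((15, 15), chunks=(5, 5))
    y = (x + x.T).sum(axis=1)

    # Default rendering is top-to-bottom; rankdir="LR" lays the task
    # graph out left-to-right instead.
    y.visualize(filename='graph-lr.svg', rankdir="LR")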
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/dask-0.19.0/docs/source/index.html new/dask-0.19.1/docs/source/index.html
--- old/dask-0.19.0/docs/source/index.html  2018-08-30 18:28:02.000000000 +0200
+++ new/dask-0.19.1/docs/source/index.html  2018-09-06 13:45:35.000000000 +0200
@@ -67,6 +67,7 @@
             enabling performance at scale for the tools you love
           </p>
           <a class="btn outline-dask btn-lg" href="docs.html">Learn More</a>
+          <a class="btn solid-dask btn-lg" href="https://mybinder.org/v2/gh/dask/dask-examples/master" role="button">Try Now »</a>
         </div>
         <div class="product-device box-shadow d-none d-md-block"></div>
         <div class="product-device product-device-2 box-shadow d-none d-md-block"></div>
