-----Original Message-----
From: Tobias Pfeiffer [t...@preferred.jp]
Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
To: Kevin Burton
Cc: Ganelin, Ilya; user@spark.apache.org
Subject: Re: quickly counting the number of rows in a partition?
Hi,

On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya wrote:
> Use the mapPartitions function. It returns an iterator to each partition.
> Then just get that length by converting to an array.
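In PySpark, that suggestion looks roughly like this (a sketch only, assuming
a live SparkContext sc; it builds a list per partition just to take its
length):

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitions(
...     lambda it: [len(list(it))]).collect()
[250, 250, 250, 250]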
On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton wrote:
> Doesn't that just read in all the values? The count isn't pre-computed?
> It's not the end of the world if it's not, but it would be faster.
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: ((idx, len(list(it))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, but
you get the idea.)
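A more economical variant counts the iterator without materializing the
partition as a list (again only a sketch, with the same SparkContext sc):

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: [(idx, sum(1 for _ in it))]).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

Here sum(1 for _ in it) consumes the iterator one element at a time, so a
partition is never held in memory as a whole.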
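Hi again,

On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer wrote:
> If you think of
>   items.map(x => /* throw exception */).count()
> then even though the count you want to get does not necessarily require
> the evaluation of the function in map() (i.e., the number is the same), you
> may n…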
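The quoted point is easy to check: map() cannot change the number of
elements, yet count() on the mapped RDD still evaluates the mapped function.
A one-liner like the following (again assuming a SparkContext sc) therefore
fails, with the tasks raising ZeroDivisionError, instead of returning 10:

>>> sc.parallelize(xrange(0, 10)).map(lambda x: 1 / 0).count()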