-----Original Message-----
From: Tobias Pfeiffer [t...@preferred.jp]
Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
To: Kevin Burton
Cc: Ganelin, Ilya; user@spark.apache.org
Subject: Re: quickly counting the number of rows in a partition?
Hi,

On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya wrote:
> Use the mapPartitions function. It returns an iterator to each partition.
> Then just get that length by converting to an array.
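In PySpark, that suggestion looks roughly like this (a sketch only, assuming
a live SparkContext sc; it builds a list per partition just to take its
length):

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitions(
...     lambda it: [len(list(it))]).collect()
[250, 250, 250, 250]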
On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton wrote:
> Doesn't that just read in all the values? The count isn't pre-computed?
> It's not the end of the world if it's not, but it would be faster.
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: ((idx, len(list(it))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, but
you get the idea.)
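A more economical variant counts the iterator without materializing the
partition as a list (again only a sketch, with the same SparkContext sc):

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: [(idx, sum(1 for _ in it))]).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

Here sum(1 for _ in it) consumes the iterator one element at a time, so a
partition is never held in memory as a whole.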
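Hi again,

On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer wrote:
> If you think of
>   items.map(x => /* throw exception */).count()
> then even though the count you want to get does not necessarily require
> the evaluation of the function in map() (i.e., the number is the same), you
> may n…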
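The quoted point is easy to check: map() cannot change the number of
elements, yet count() on the mapped RDD still evaluates the mapped function.
A one-liner like the following (again assuming a SparkContext sc) therefore
fails, with the tasks raising ZeroDivisionError, instead of returning 10:

>>> sc.parallelize(xrange(0, 10)).map(lambda x: 1 / 0).count()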