Re: New Feature Request

2015-08-05 Thread Sandeep Giri
Hi Jonathan,

Does that guarantee a result? I do not see that it is really optimized.

Hi Carsten,


How does the following code work?

data.filter(qualifying_function).take(n).count() >= n


Also, as I understand it, in both of the approaches you mentioned the
qualifying function will be executed on the whole dataset, even if the value
was already found in the first element of the RDD:


   - data.filter(qualifying_function).take(n).count() >= n
   - val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())

Is that right, or am I missing something?


Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com
Phone: +1-253-397-1945 (Office)



Re: New Feature Request

2015-08-05 Thread Sean Owen
I don't think countApprox is appropriate here unless approximation is OK.
But more generally, counting everything matching a filter requires applying
the filter to the whole data set, which seems like the thing to be avoided
here.

The take approach is better since it would stop after finding n matching
elements (it might do a little extra work given partitioning and
buffering). It would not filter the whole data set.

The only downside there is that it would copy n elements to the driver.
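
To illustrate the short-circuiting described above, here is a minimal local
sketch in plain Scala, not Spark code. It assumes that the elements of a
partition are consumed through a lazy Iterator, and the names LazyTakeSketch
and existsAtLeast are hypothetical. It counts how often the predicate runs
when filter is followed by take(n):

```scala
object LazyTakeSketch {
  // Returns (whether at least n elements matched, how many times the predicate ran).
  // The Iterator stands in for the elements of a single partition (an assumption
  // made for illustration only).
  def existsAtLeast[T](partition: Iterator[T], n: Int)(f: T => Boolean): (Boolean, Int) = {
    var calls = 0
    // Wrap the predicate so that every invocation is counted.
    val counted: T => Boolean = x => { calls += 1; f(x) }
    // filter and take on an Iterator are both lazy: elements are only pulled
    // while size consumes the result, and take(n) stops pulling after n matches.
    val found = partition.filter(counted).take(n).size
    (found >= n, calls)
  }
}
```

With a million elements and a predicate that matches every 1000th one, asking
for n = 2 should stop shortly after the second match instead of evaluating the
predicate on every element. If there are fewer than n matches, the whole
iterator is still scanned.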


New Feature Request

2015-07-31 Thread Sandeep Giri
Dear Spark Dev Community,

I am wondering if there is already a function to solve my problem. If not,
then should I work on this?

Say you just want to check if a word exists in a huge text file. I could
not find better ways than those mentioned here:
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6

So, I am proposing that we add a function called exists to RDD with the
following signature:

# Returns true if n elements exist that satisfy our criteria.
# The qualifying function would receive the element and its index and return
# true or false.
def exists(qualifying_function, n):
 


Regards,
Sandeep Giri,


Re: New Feature Request

2015-07-31 Thread Carsten Schnober
Hi,
the RDD class does not have an exists() method (in the Scala API), but
the functionality you need seems easy to reproduce with the existing methods:

val containsNMatchingElements =
data.filter(qualifying_function).take(n).count() >= n

Note: I am not sure whether the intermediate take(n) really increases
performance, but the idea is to arbitrarily reduce the number of
elements in the RDD before counting because we are not interested in the
full count.

If you need to check specifically whether there is at least one matching
occurrence, it is probably preferable to use isEmpty() instead of
count() and check whether the result is false:

val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())
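
A hedged sketch of the two checks above as reusable helpers, assuming
spark-core is on the classpath. The object and method names (RddExists,
existsAtLeast, existsAny) are hypothetical. Note that RDD.take(n) returns a
local Array, so the sketch reads its size with .length instead of calling
count() on it:

```scala
import org.apache.spark.rdd.RDD

object RddExists {
  // True if at least n elements of rdd satisfy f.
  // take(n) returns a local Array[T] on the driver, hence .length.
  def existsAtLeast[T](rdd: RDD[T], n: Int)(f: T => Boolean): Boolean =
    rdd.filter(f).take(n).length >= n

  // Special case for a single match: isEmpty() only has to find one element.
  def existsAny[T](rdd: RDD[T])(f: T => Boolean): Boolean =
    !rdd.filter(f).isEmpty()
}
```

For example, RddExists.existsAtLeast(data, n)(qualifying_function) would
replace the first expression, and RddExists.existsAny(data)(qualifying_function)
the second.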

Best,
Carsten




-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello !

You could try something like this:

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
}

It would work for large datasets and large values of n.

Have a nice day,

Jonathan


