Re: New Feature Request
Hi Jonathan,

Does that guarantee a result? I do not see how it is really optimized.

Hi Carsten,

How does the following code work?

    data.filter(qualifying_function).take(n).length >= n

Also, as I understand it, both approaches you mentioned execute the qualifying function on the whole dataset, even if a match was already found in the first element of the RDD:

- data.filter(qualifying_function).take(n).length >= n
- val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())

Is that right, or am I missing something?

Regards,
Sandeep Giri
+1 347 781 4573 (US), +91-953-899-8962 (IN)
Phone: +1-253-397-1945 (Office)
www.KnowBigData.com

On Fri, Jul 31, 2015 at 3:37 PM, Jonathan Winandy jonathan.wina...@gmail.com wrote:

> Hello!
>
> You could try something like this:
>
>     import org.apache.spark.rdd.RDD
>
>     def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
>       rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
>     }
>
> It would work for large datasets and large values of n.
>
> Have a nice day,
> Jonathan

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: New Feature Request
I don't think countApprox is appropriate here unless approximation is OK. More generally, counting everything that matches a filter requires applying the filter to the whole data set, which seems like the thing to avoid here.

The take approach is better, since it stops after finding n matching elements (it might do a little extra work because of partitioning and buffering). It does not filter the whole data set. The only downside is that it copies n elements to the driver.

On Wed, Aug 5, 2015 at 10:34 AM, Sandeep Giri sand...@knowbigdata.com wrote:

> Hi Jonathan,
>
> Does that guarantee a result? I do not see how it is really optimized.
>
> Hi Carsten,
>
> How does the following code work?
>
>     data.filter(qualifying_function).take(n).length >= n
>
> Also, as I understand it, both approaches you mentioned execute the qualifying function on the whole dataset, even if a match was already found in the first element of the RDD:
>
> - data.filter(qualifying_function).take(n).length >= n
> - val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())
>
> Is that right, or am I missing something?
>
> Regards,
> Sandeep Giri
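The point about take stopping early can be sketched in plain Python, without Spark. This is only a simplified model under stated assumptions: `partitions` is a list of ordinary lists standing in for RDD partitions, `take_matching` and `is_spark` are hypothetical names, and partitions are scanned strictly one after another (a real cluster may scan several at once, which is where the "little extra work" would come from).

```python
from itertools import islice


def take_matching(partitions, qualifying_function, n):
    """Collect up to n matching elements, scanning partitions in order.

    `partitions` is a list of iterables standing in for RDD partitions
    (an assumption of this sketch; no Spark is involved).
    """
    taken = []
    for partition in partitions:
        if len(taken) >= n:
            break  # enough matches: later partitions are never touched
        # Lazy filter: the predicate runs only when an element is pulled.
        lazy_matches = (x for x in partition if qualifying_function(x))
        # Pull only as many matches as are still missing.
        taken.extend(islice(lazy_matches, n - len(taken)))
    return taken


calls = 0


def is_spark(word):
    """Hypothetical qualifying function that also counts its invocations."""
    global calls
    calls += 1
    return word == "spark"


partitions = [["hello", "spark", "world"], ["spark"] * 1000, ["x"] * 1000]
result = take_matching(partitions, is_spark, n=2)
print(len(result) >= 2, calls)  # True 4
```

In this model the predicate runs 4 times out of 2003 elements: three times in the first partition (which yields one match) and once in the second. The n collected elements are what would be copied to the driver.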
New Feature Request
Dear Spark Dev Community,

I am wondering whether there is already a function that solves my problem. If not, should I work on this?

Say you just want to check whether a word exists in a huge text file. I could not find better ways than those mentioned here:
http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6

So I propose adding a function called exists to RDD with the following signature:

    # Returns true if n elements exist that satisfy our criteria.
    # The qualifying function receives the element and its index and returns true or false.
    def exists(qualifying_function, n):

Regards,
Sandeep Giri
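To make the proposed semantics concrete, here is a minimal local sketch in plain Python. It is not an RDD method and not part of any Spark API: `data` is assumed to be an ordinary iterable, and the extra `data` parameter and the default `n=1` are additions for illustration only.

```python
from itertools import islice


def exists(data, qualifying_function, n=1):
    """Return True if at least n elements of `data` qualify.

    qualifying_function receives (element, index), as in the proposal.
    `data` is any iterable; this is a local sketch, not an RDD method.
    """
    matches = (
        element
        for index, element in enumerate(data)
        if qualifying_function(element, index)
    )
    # Pull at most n matches, so the scan stops at the n-th hit.
    return sum(1 for _ in islice(matches, n)) == n


words = "to be or not to be".split()
print(exists(words, lambda w, i: w == "be", n=2))  # True
print(exists(words, lambda w, i: w == "be", n=3))  # False
```

Because the filter is a generator and `islice` stops after n items, elements after the n-th match are never passed to the qualifying function.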
Re: New Feature Request
Hi,

the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to reproduce with the existing methods:

    val containsNMatchingElements = data.filter(qualifying_function).take(n).length >= n

Note: I am not sure whether the intermediate take(n) really increases performance, but the idea is to cap the number of elements that are fetched before counting, because we are not interested in the full count.

If you need to check specifically whether there is at least one matching occurrence, it is probably preferable to use isEmpty() instead of counting and check whether the result is false:

    val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())

Best,
Carsten

On 31.07.2015 at 11:11, Sandeep Giri wrote:

> Dear Spark Dev Community,
>
> I am wondering whether there is already a function that solves my problem. If not, should I work on this?
>
> Say you just want to check whether a word exists in a huge text file. I could not find better ways than those mentioned here:
> http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6
>
> So I propose adding a function called exists to RDD with the following signature:
>
>     # Returns true if n elements exist that satisfy our criteria.
>     # The qualifying function receives the element and its index and returns true or false.
>     def exists(qualifying_function, n):
>
> Regards,
> Sandeep Giri

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources (AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL): www.kdsl.tu-darmstadt.de
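The preference for an emptiness check over a full count can be illustrated locally. The sketch below uses plain Python lists rather than an RDD, and `make_counter` is a hypothetical helper that only exists to record how often the predicate runs; it assumes that an emptiness check needs to find just one element.

```python
def make_counter(target):
    """Build a hypothetical predicate plus a dict recording how often it ran."""
    stats = {"calls": 0}

    def qualifying_function(word):
        stats["calls"] += 1
        return word == target

    return qualifying_function, stats


words = ["spark"] + ["filler"] * 999

# Full count: the predicate runs on every element before we compare.
count_pred, count_stats = make_counter("spark")
by_count = sum(1 for w in words if count_pred(w)) >= 1

# Emptiness-style check: any() stops at the first match.
any_pred, any_stats = make_counter("spark")
by_any = any(any_pred(w) for w in words)

print(by_count, count_stats["calls"])  # True 1000
print(by_any, any_stats["calls"])      # True 1
```

Both checks give the same answer, but only the second one can stop as soon as a match is found.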
Re: New Feature Request
Hello!

You could try something like this:

    import org.apache.spark.rdd.RDD

    def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
      rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
    }

It would work for large datasets and large values of n.

Have a nice day,
Jonathan

On 31 July 2015 at 11:29, Carsten Schnober schno...@ukp.informatik.tu-darmstadt.de wrote:

> Hi,
>
> the RDD class does not have an exists() method (in the Scala API), but the functionality you need seems easy to reproduce with the existing methods:
>
>     val containsNMatchingElements = data.filter(qualifying_function).take(n).length >= n
>
> Note: I am not sure whether the intermediate take(n) really increases performance, but the idea is to cap the number of elements that are fetched before counting, because we are not interested in the full count.
>
> If you need to check specifically whether there is at least one matching occurrence, it is probably preferable to use isEmpty() instead of counting and check whether the result is false:
>
>     val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())
>
> Best,
> Carsten