Query building time is significant because, although the query is simple, it is a long one: almost 4,000 characters on its own.
Task deserialization takes an inordinate amount of time (0.9s) when I run your test, and building the query itself takes several seconds. I would recommend using a JOIN (a broadcast join if your data set is small enough) when the alternative is a massive IN statement.

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński [via Apache Spark Developers List] <ml-node+s1001551n21307...@n3.nabble.com> wrote:

> Hi,
> I'm trying to run queries with many values in IN operator.
>
> The result is that for more than 10K values IN operator is getting slower.
>
> For example this code is running about 20 seconds.
>
> df = spark.range(0,100000,1,1)
> df.where('id in ({})'.format(','.join(map(str,range(100000))))).count()
>
> Any ideas how to improve this ?
> Is it a bug ?
> --
> Maciek Bryński
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Pyspark-SQL-Very-slow-IN-operator-tp21309.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.