Hi,

I see a big difference in performance between the same query expressed via Spark SQL and via CURL. With CURL the query runs in less than a second; in Spark SQL it takes 15 seconds. The index/type I am querying contains 1M documents. Can you please explain why there is such a big difference in performance? Are there any ways to tune the performance of Elasticsearch + Spark SQL?

Environment (everything is running on the same box):

    Elasticsearch 1.4.4
    elasticsearch-hadoop 2.1.0.BUILD-SNAPSHOT
    Spark 1.3.0

CURL:

    curl -XPOST "http://localhost:9200/summary/intervals/_search" -d'
    {
      "query" : {
        "filtered" : {
          "query" : { "match_all" : {} },
          "filter" : {
            "bool" : {
              "must" : [
                { "term" : { "User" : "Robert Greene" } },
                { "term" : { "DataStore" : "PROD_HK_HR" } },
                { "term" : { "EventAffectedCount" : 56 } }
              ]
            }
          }
        }
      }
    }'

Spark:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._  // adds esDF to SQLContext

    val sparkConf = new SparkConf().setAppName("Test1")

    // increasing scroll size to 5000 from the default 50 improved performance by 2.5 times
    sparkConf.set("es.scroll.size", "5000")

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // expose the Elasticsearch index/type as a temporary table
    val intv = sqlContext.esDF("summary/intervals")
    intv.registerTempTable("INTERVALS")

    val intv2 = sqlContext.sql("select EventCount, Hour " +
      "from intervals " +
      "where User = 'Robert Greene' " +
      "and DataStore = 'PROD_HK_HR' " +
      "and EventAffectedCount = 56 ")

    intv2.show(1000)