never mind! I had a trailing space at the end of my data, which was not showing up in manual testing.
thanks

________________________________
From: jeff saremi <jeffsar...@hotmail.com>
Sent: Tuesday, June 20, 2017 2:48:06 PM
To: user@spark.apache.org
Subject: Bizzare diff in behavior between scala REPL and sparkSQL UDF

I have this function which does regex matching in Scala. When I test it in the REPL I get the expected results. When I use it as a UDF in Spark SQL I get completely incorrect results.

Function:

    import scala.util.matching.Regex

    class UrlFilter(filters: Seq[String]) extends Serializable {
      val regexFilters = filters.map(new Regex(_))
      regexFilters.foreach(println)

      def matches(s: String): Boolean = {
        if (s == null || s.isEmpty) return false
        regexFilters.exists(f => {
          print("matching " + f + " against " + s)
          s match {
            case f() => { println("; matched! returning true"); true }
            case _ => { println("; did NOT match. returning false"); false }
          }
        })
      }
    }

Instantiating it with a pattern like:

    ^[^:]+://[^.]*\.company[0-9]*9\.com$

(matches a URL that has "company" in the name and a number that ends in the digit 9)

Test it in the Scala REPL:

    scala> val filters = Source.fromFile("D:\\cosmos-modules\\testdata\\fakefilters.txt").getLines.toList
    scala> val urlFilter = new UrlFilter(filters)
    scala> urlFilter.matches("ftp://ftp.company9.com")
    matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com; matched! returning true
    res2: Boolean = true

Use it in Spark SQL:

    val urlFilter = new UrlFilter(filters)
    sqlContext.udf.register("filterListMatch", (url: String) => urlFilter.matches(url))
    val nonMatchingUrlsDf = sqlContext.sql("SELECT url FROM distinctUrls WHERE NOT filterListMatch(url)")

Look at the debug prints in the console:

    matching ^[^:]+://[^.]*\.company[0-9]*9\.com$ against ftp://ftp.company9.com ; did NOT match. returning false

I have repeated this several times to make sure I'm comparing apples to apples.

I am using Spark 1.6 and Scala 2.10.5 with Java 1.8.

thanks
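
For anyone who hits the same symptom: the trailing space mentioned in the follow-up at the top of this thread is enough to explain the difference. A Scala Regex used as an extractor in a pattern match has to match the whole input string, so a URL with a space after ".com" fails against a pattern ending in `\.com$`. The Spark debug line above also shows a space before the semicolon, which is consistent with that. The sketch below is only an illustration: the object name `TrailingSpaceDemo` and the helper `matchesTrimmed` are hypothetical and not part of the original code.

```scala
import scala.util.matching.Regex

object TrailingSpaceDemo {
  // Pattern quoted in the original message.
  val filter: Regex = new Regex("""^[^:]+://[^.]*\.company[0-9]*9\.com$""")

  // Same matching style as UrlFilter.matches: the Regex extractor
  // succeeds only if the entire input string matches the pattern.
  def matches(s: String): Boolean = s match {
    case filter() => true
    case _        => false
  }

  // Hypothetical helper (an assumption, not in the original):
  // strip surrounding whitespace before matching.
  def matchesTrimmed(s: String): Boolean =
    s != null && matches(s.trim)

  def main(args: Array[String]): Unit = {
    println(matches("ftp://ftp.company9.com"))         // true
    println(matches("ftp://ftp.company9.com "))        // false: trailing space
    println(matchesTrimmed("ftp://ftp.company9.com ")) // true
  }
}
```

One possible way to apply this in the UDF would be to trim inside the registered function, for example `(url: String) => url != null && urlFilter.matches(url.trim)`. Cleaning the data at load time may be preferable if the whitespace is not meaningful.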