Had a few quick questions... Just wondering: is Spark SQL currently expected to be thread-safe on master?
Doing a simple Hadoop file -> RDD -> SchemaRDD -> write-Parquet pipeline fails in the reflection code if I run those steps from a thread pool (a sketch of what I'm doing is in the P.S. below). Also, SparkSqlSerializer seems to create a new Kryo instance each time it wants to serialize anything. When my SchemaRDD contained any non-primitive types, I got a huge speedup by using the ResourcePools from Chill to provide the KryoSerializer instances instead (also sketched in the P.S.). I can open an RB for this, unless there is some reason not to re-use them?

====

With the distinct count operator there are no map-side operations, and there is a test that checks for this. Is there any reason not to do a map-side combine into a set and then merge the sets later, similar to the approximate distinct count operator? (Rough sketch in the P.S. as well.)

====

Another thing while I'm mailing: the 1.0.1 docs have a section like:

"// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface."

That sounds great: we have lots of data in Thrift, so via Scrooge (https://github.com/twitter/scrooge) we ultimately end up with instances of traits that implement Product. However, the reflection code appears to look for the class's constructor and derive the field types from its parameters, so those traits don't seem to work? (Example of what I mean in the P.S.)

Ian.
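
P.S. A few sketches to make the above concrete. The paths, class names, and record types in these are made up; they are only meant to show the shape of what I mean, not exact code.

First, roughly how I'm driving the Hadoop file -> RDD -> SchemaRDD -> Parquet jobs from a thread pool (1.0-style API; Record and the paths are invented):

  import java.util.concurrent.Executors
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Hypothetical record type; the real ones come from our Thrift/Scrooge schemas.
  case class Record(key: String, count: Int)

  object ParquetWriteRepro {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("parquet-write-repro"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

      val pool = Executors.newFixedThreadPool(4)
      (1 to 4).foreach { i =>
        pool.submit(new Runnable {
          def run(): Unit = {
            // hadoop file -> RDD -> SchemaRDD -> parquet, one job per thread
            val records = sc.textFile(s"/data/input/part-$i")
              .map(_.split('\t'))
              .map(fields => Record(fields(0), fields(1).toInt))
            records.saveAsParquetFile(s"/data/output/part-$i")   // this is where the reflection code blows up
          }
        })
      }
      pool.shutdown()
    }
  }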
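Second, what I mean by pooling serializer instances with Chill's ResourcePool. The sketch is written against the plain KryoSerializer rather than SparkSqlSerializer itself, just to show the pattern; the class and helper names are mine:

  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}
  import com.twitter.chill.ResourcePool

  // Keep a small pool of serializer instances instead of building a new Kryo
  // (and re-registering everything with it) on every serialize/deserialize call.
  class KryoSerializerPool(size: Int, conf: SparkConf)
      extends ResourcePool[SerializerInstance](size) {
    private val serializer = new KryoSerializer(conf)
    def newInstance(): SerializerInstance = serializer.newInstance()
  }

  object KryoSerializerPool {
    // Borrow an instance, run the work, and always return the instance to the pool.
    def withSerializer[T](pool: KryoSerializerPool)(work: SerializerInstance => T): T = {
      val ser = pool.borrow()
      try work(ser) finally pool.release(ser)
    }
  }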
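Third, the map-side combine idea for distinct count, written at the plain RDD level (not the actual SQL operator) just to show what I mean by combining into sets and merging them later:

  import scala.collection.mutable
  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // pair-RDD functions

  // Count distinct values per key: build a per-key set on the map side,
  // merge the sets on the reduce side, then take the set sizes.
  def distinctCountPerKey(sc: SparkContext): Unit = {
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
    val counts = pairs
      .combineByKey(
        (v: Int) => mutable.HashSet(v),                                      // create the per-key set
        (set: mutable.HashSet[Int], v: Int) => set += v,                     // map-side combine
        (s1: mutable.HashSet[Int], s2: mutable.HashSet[Int]) => s1 ++= s2)   // merge the sets later
      .mapValues(_.size)
    counts.collect().foreach(println)   // (a,2) and (b,1)
  }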
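Finally, on the 22-field note: this is what I read the docs' workaround to be, a hand-rolled class implementing Product (only two made-up fields here for brevity). It has a primary constructor whose parameters the reflection code can presumably inspect; the Scrooge-generated traits implement Product too, but have no such constructor, which is where we get stuck:

  // An ordinary class (not a case class) implementing Product, so it isn't
  // limited to 22 fields. Only two fields shown to keep the sketch short.
  class WideRecord(val id: Long, val name: String) extends Product with Serializable {
    def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
    def productArity: Int = 2
    def productElement(n: Int): Any = n match {
      case 0 => id
      case 1 => name
      case _ => throw new IndexOutOfBoundsException(n.toString)
    }
  }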