Could you share which data types are optimized in the in-memory columnar storage, and how they are optimized?
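
To make the dictionary-encoding remark quoted below concrete, here is a minimal pure-Python sketch of the general technique. It is not Spark's implementation: the function names and the fixed 2-byte code width are assumptions made for illustration only. It shows why a string column with few distinct values shrinks, while a column of mostly unique strings does not.

```python
def dictionary_encode(values):
    """Replace each string with a small integer code into a dictionary
    of distinct values (illustrative helper, not a Spark API)."""
    dictionary = []   # distinct values, in first-seen order
    index = {}        # value -> position in `dictionary`
    codes = []        # one integer code per input row
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


def raw_size(values):
    """Bytes needed to store every string as UTF-8, ignoring length prefixes."""
    return sum(len(value.encode("utf-8")) for value in values)


def encoded_size(dictionary, codes, code_bytes=2):
    """Bytes for the dictionary plus one fixed-width code per row.
    The 2-byte code width is an assumption for this sketch."""
    return raw_size(dictionary) + code_bytes * len(codes)


# Low cardinality: 3 distinct values repeated over 3000 rows.
low_cardinality = ["California", "Texas", "Ohio"] * 1000
# High cardinality: 3000 rows, every value distinct.
high_cardinality = ["user-%06d" % i for i in range(3000)]

for name, column in [("low", low_cardinality), ("high", high_cardinality)]:
    dictionary, codes = dictionary_encode(column)
    assert dictionary_decode(dictionary, codes) == column
    print(name, raw_size(column), encoded_size(dictionary, codes))
```

With the repeated values, the encoded form is much smaller than the raw strings. With all-distinct values, the dictionary is as large as the column itself and the codes are pure overhead, which is the case where an encoder would be expected to fall back to storing the strings directly.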

On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust <mich...@databricks.com> wrote:

> You'll probably only get good compression for strings when dictionary
> encoding works. We don't optimize decimals in the in-memory columnar
> storage, so you are paying expensive serialization there likely.
>
> On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Flat data of types String, Int and couple of decimal(14,4)
>>
>> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Is this nested data or flat data?
>>>
>>> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> The storage tab shows the RDD resides fully in memory (10 partitions)
>>>> with zero disk usage. Tasks for subsequent select on this table in cache
>>>> shows minimal overheads (GC, queueing, shuffle write etc. etc.), so
>>>> overhead is not issue. However, it is still twice as slow as reading
>>>> uncached table.
>>>>
>>>> I have spark.rdd.compress = true,
>>>> spark.sql.inMemoryColumnarStorage.compressed = true,
>>>> spark.serializer = org.apache.spark.serializer.KryoSerializer
>>>>
>>>> Something that may be of relevance ...
>>>>
>>>> The underlying table is Parquet, 10 partitions totaling ~350 MB. For
>>>> mapPartition phase of query on uncached table shows input size of 351 MB.
>>>> However, after the table is cached, the storage shows the cache size as
>>>> 12GB. So the in-memory representation seems much bigger than on-disk, even
>>>> with the compression options turned on. Any thoughts on this ?
>>>>
>>>> mapPartition phase same query for cache table shows input size of 12GB
>>>> (full size of cache table) and takes twice the time as mapPartition for
>>>> uncached query.
>>>>
>>>> Thanks,
>>>>
>>>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>>> Check the storage tab. Does the table actually fit in memory?
>>>>> Otherwise you are rebuilding column buffers in addition to reading the
>>>>> data off of the disk.
>>>>>
>>>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Spark 1.2
>>>>>>
>>>>>> Data stored in parquet table (large number of rows)
>>>>>>
>>>>>> Test 1
>>>>>>
>>>>>> select a, sum(b), sum(c) from table
>>>>>>
>>>>>> Test
>>>>>>
>>>>>> sqlContext.cacheTable()
>>>>>> select a, sum(b), sum(c) from table - "seed cache" First time slow
>>>>>> since loading cache ?
>>>>>> select a, sum(b), sum(c) from table - Second time it should be
>>>>>> faster as it should be reading from cache, not HDFS. But it is slower
>>>>>> than test1
>>>>>>
>>>>>> Any thoughts? Should a different query be used to seed cache ?
>>>>>>
>>>>>> Thanks,