Typically the complaint I hear from users is that the data blows up in size, since Spark SQL's in-memory columnar compression is nowhere near as well fleshed out as Parquet's. As a result I often see people reading Parquet data stored in Tachyon or something similar to get memory speeds without the size blowup.
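To make the trade-off concrete, here is a rough spark-shell style sketch of the two approaches (Spark 1.x SQLContext APIs; the table name and paths are made up, and sc / sqlContext are assumed to exist as they do in the shell):

    // Option A: Spark SQL's built-in in-memory columnar cache.
    // Simple, but the cached form currently ends up larger than the Parquet files.
    sqlContext.read.parquet("hdfs:///warehouse/events").registerTempTable("events")
    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()  // first scan materializes the cache

    // Option B: keep the data Parquet-encoded and serve it out of Tachyon,
    // trading extra decode CPU for Parquet's better compression.
    sqlContext.read.parquet("tachyon://master:19998/warehouse/events")
      .registerTempTable("events_tachyon")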
As you suggest, though, this isn't optimal, as you end up pretty CPU-bound doing the decoding. For this use case you could probably optimize things if the Parquet library had a way to read directly from byte buffers instead of going through the Hadoop FS API. That way you could just mmap the data directly into the process (a bare-bones sketch of the mmap side is included after the quoted thread below).

On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson <[email protected]> wrote:

> Why do you think the Parquet format will be a better in-memory representation?
> My guess is that it won't be -- it wasn't really designed for that, and I
> think a specialized format designed for in-memory use would probably work a
> lot better.
>
> On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <[email protected]> wrote:
>
> > Can anyone help answer this question?
> > Thanks!
> >
> > From: Wangchangchun (A)
> > Sent: 29 August 2015 11:03
> > To: '[email protected]'
> > Subject: [implement a memory parquet ]
> >
> > Hi, everyone,
> > Can somebody help me answer a question?
> >
> > In Spark SQL, in-memory data is stored in an object model named InternalRow.
> > If you cache a table, Spark converts the InternalRows into in-memory columnar
> > storage. We think that this in-memory columnar storage is not an efficient
> > format, so we want to try storing the data in memory using the Parquet format.
> > That is, we want to implement an in-memory Parquet format and store Spark SQL
> > cache tables in it.
> >
> > Is this feasible? If it is feasible, can someone give me some advice on how
> > to implement it? Which Parquet APIs should I use?
>
> --
> Alex Levenson
> @THISWILLWORK
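For what it's worth, the mmap half of that idea is just plain NIO; the open question is whether the Parquet readers can consume the resulting ByteBuffer rather than a Hadoop stream. A bare-bones sketch, with a made-up path to a Parquet file sitting on a ramdisk or other in-memory filesystem:

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    val channel = FileChannel.open(Paths.get("/mnt/ramdisk/part-00000.parquet"),
                                   StandardOpenOption.READ)
    // A MappedByteBuffer backed by the OS page cache; handing this to a
    // ByteBuffer-aware Parquet reader would avoid copying the bytes through
    // the Hadoop FS API on every scan.
    val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())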
