Typically the complaint I hear from users is that the data blows up in size, since Spark SQL's in-memory columnar compression is nowhere near as well fleshed out as Parquet's. As a result I often see people reading Parquet data stored in Tachyon or something similar to get memory speeds without the size blowup.
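To make the trade-off concrete, here is a rough spark-shell style sketch of the two approaches (Spark 1.x SQLContext APIs; the table name and paths are made up, and sc / sqlContext are assumed to exist as they do in the shell):

    // Option A: Spark SQL's built-in in-memory columnar cache.
    // Simple, but the cached form currently ends up larger than the Parquet files.
    sqlContext.read.parquet("hdfs:///warehouse/events").registerTempTable("events")
    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()  // first scan materializes the cache

    // Option B: keep the data Parquet-encoded and serve it out of Tachyon,
    // trading extra decode CPU for Parquet's better compression.
    sqlContext.read.parquet("tachyon://master:19998/warehouse/events")
      .registerTempTable("events_tachyon")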
As you suggest, though, this isn't optimal, as you end up pretty CPU-bound doing the decoding. For this use case you could probably optimize things if the Parquet library had a way to read directly from byte buffers instead of going through the Hadoop FS API. That way you could just mmap the data directly into the process (a bare-bones sketch of the mmap side is included after the quoted thread below).

On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson <[email protected]> wrote:

> Why do you think the Parquet format will be a better in-memory representation?
> My guess is that it won't be -- it wasn't really designed for that, and I
> think a specialized format designed for in-memory use would probably work a
> lot better.
>
> On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <[email protected]> wrote:
>
> > Can anyone help answer this question?
> > Thanks!
> >
> > From: Wangchangchun (A)
> > Sent: 29 August 2015 11:03
> > To: '[email protected]'
> > Subject: [implement a memory parquet ]
> >
> > Hi, everyone,
> > Can somebody help me answer a question?
> >
> > In Spark SQL, in-memory data is stored in an object model named InternalRow.
> > If you cache a table, Spark converts the InternalRows into in-memory columnar
> > storage. We think that this in-memory columnar storage is not an efficient
> > format, so we want to try storing the data in memory using the Parquet format.
> > That is, we want to implement an in-memory Parquet format and store Spark SQL
> > cache tables in it.
> >
> > Is this feasible? If it is feasible, can someone give me some advice on how
> > to implement it? Which Parquet APIs should I use?
>
> --
> Alex Levenson
> @THISWILLWORK
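For what it's worth, the mmap half of that idea is just plain NIO; the open question is whether the Parquet readers can consume the resulting ByteBuffer rather than a Hadoop stream. A bare-bones sketch, with a made-up path to a Parquet file sitting on a ramdisk or other in-memory filesystem:

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    val channel = FileChannel.open(Paths.get("/mnt/ramdisk/part-00000.parquet"),
                                   StandardOpenOption.READ)
    // A MappedByteBuffer backed by the OS page cache; handing this to a
    // ByteBuffer-aware Parquet reader would avoid copying the bytes through
    // the Hadoop FS API on every scan.
    val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())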
