Currently parquet-mr only supports reading Parquet records through the Hadoop FS (HDFS) API. That overhead could be eliminated by a set of APIs that read Parquet data directly from in-memory byte buffers. That way, you could simply mmap the target file and read it directly, without going through the heavy HDFS API stack.
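
For concreteness, here is a minimal sketch of the mmap half of that idea in plain Java NIO (the file path is hypothetical, and the final decoding step is only a comment, since a byte-buffer reading API does not exist in parquet-mr yet):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapSketch {
        public static void main(String[] args) throws IOException {
            // Hypothetical local Parquet file; with mmap no HDFS client is involved.
            try (FileChannel channel = FileChannel.open(
                     Paths.get("/data/events.parquet"), StandardOpenOption.READ)) {
                long size = channel.size(); // files over 2 GB need several mappings
                MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

                // Sanity check: a Parquet file ends with the 4-byte magic "PAR1".
                byte[] magic = new byte[4];
                buf.position((int) (size - 4));
                buf.get(magic);
                System.out.println("Footer magic: "
                    + new String(magic, StandardCharsets.US_ASCII));

                // A byte-buffer based reader API (not yet in parquet-mr) would
                // decode the footer and column chunks straight out of `buf` here.
            }
        }
    }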

Cheng

On 9/1/15 11:48 AM, Wangchangchun (A) wrote:
Hi Michael Armbrust,
I cannot understand the statements below; could you explain them in more detail?



As you suggest though, this isn't optimal as you end up pretty CPU bound doing 
the decoding.  For this use case, you could probably optimize things if the 
parquet library has a way to read directly from byte buffers instead of through 
the Hadoop FS API.  That way you could just mmap the data directly into the 
process.




From: Michael Armbrust [mailto:[email protected]]
Sent: September 1, 2015 10:22
To: [email protected]
Cc: Wangchangchun (A)
Subject: Re: Re: [implement a memory parquet]

Typically the complaint I hear from users is that the data blows up in size,
since Spark SQL's columnar compression isn't nearly as well fleshed out as
Parquet's.  As a result I often see people reading Parquet data stored in
Tachyon or similar to get memory speeds without the data size blowup.
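
To make the two options concrete, here is a rough sketch in Spark 1.5-era Java (the paths and table names are made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class CacheVsTachyon {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("cache-vs-tachyon"));
            SQLContext sqlContext = new SQLContext(sc);

            // Option 1: Spark SQL's built-in in-memory columnar cache. Its
            // compression is much less fleshed out than Parquet's, so the cached
            // data can be considerably larger than the on-disk Parquet file.
            DataFrame df = sqlContext.read().parquet("hdfs:///data/events.parquet");
            df.registerTempTable("events");
            sqlContext.cacheTable("events");

            // Option 2: keep the data Parquet-encoded on a memory-speed store such
            // as Tachyon; the size blowup goes away, but every scan pays the CPU
            // cost of decoding Parquet.
            DataFrame fromTachyon =
                sqlContext.read().parquet("tachyon://master:19998/data/events.parquet");
            fromTachyon.registerTempTable("events_in_tachyon");

            sc.stop();
        }
    }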

As you suggest though, this isn't optimal as you end up pretty CPU bound doing 
the decoding.  For this use case, you could probably optimize things if the 
parquet library has a way to read directly from byte buffers instead of through 
the Hadoop FS API.  That way you could just mmap the data directly into the 
process.
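
For contrast, the read path parquet-mr offers today always goes through a Hadoop Path, even for a local file. A minimal sketch using the bundled example object model (the path is hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class HadoopFsReadSketch {
        public static void main(String[] args) throws Exception {
            // Even a local file is resolved through the Hadoop FileSystem layer.
            Path file = new Path("file:///data/events.parquet");
            try (ParquetReader<Group> reader =
                     ParquetReader.builder(new GroupReadSupport(), file).build()) {
                Group record;
                while ((record = reader.read()) != null) {
                    System.out.println(record);
                }
            }
        }
    }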

On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson <[email protected]> wrote:
Why do you think the Parquet format will be a better in-memory representation?
My guess is that it won't be -- it wasn't really designed for that, and I
think a specialized format designed for in-memory use would probably work a
lot better.

On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <[email protected]> wrote:

Can anyone help answer this question?
Thanks!

From: Wangchangchun (A)
Sent: August 29, 2015 11:03
To: '[email protected]'
Subject: [implement a memory parquet]

Hi, everyone,
Can somebody help me with a question?

In Spark SQL, in-memory data is stored in an object model named InternalRow.
If you cache a table, Spark converts the InternalRow data into its in-memory
columnar storage format.
We think this in-memory columnar storage is not an efficient format, and we
want to try storing the data in memory using the Parquet format.
That is, we want to implement an in-memory Parquet format and store Spark SQL
cached tables in it.

Is this feasible? If it is, can someone give me some advice on how to
implement it? Which Parquet APIs should I use?
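
For reference, the write surface parquet-mr exposes today also takes only a Hadoop Path, so an in-memory variant would need new APIs. A minimal sketch of writing with the bundled example object model (the schema and path are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.GroupWriteSupport;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class WriteSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical two-column schema.
            MessageType schema = MessageTypeParser.parseMessageType(
                "message event { required int64 id; required binary name (UTF8); }");

            Configuration conf = new Configuration();
            GroupWriteSupport.setSchema(schema, conf);

            // The writer only accepts a Hadoop Path -- there is no in-memory target.
            ParquetWriter<Group> writer = new ParquetWriter<Group>(
                new Path("file:///tmp/events.parquet"), conf, new GroupWriteSupport());
            try {
                SimpleGroupFactory factory = new SimpleGroupFactory(schema);
                for (long i = 0; i < 10; i++) {
                    writer.write(factory.newGroup()
                        .append("id", i)
                        .append("name", "row-" + i));
                }
            } finally {
                writer.close();
            }
        }
    }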


--
Alex Levenson
@THISWILLWORK

