Not the cost of decoding, but the cost of going through all the HDFS API calls.

On 9/1/15 3:05 PM, Wangchangchun (A) wrote:
Hi Cheng Lian,
You mean that if we mmap the Parquet file into memory and read it through a new 
API that reads Parquet data directly from in-memory byte buffers, that might be 
feasible?
Why can the cost of decoding be eliminated by using this method?

-----Original Message-----
From: Cheng Lian [mailto:[email protected]]
Sent: September 1, 2015 14:43
To: [email protected]
Subject: Re: Re: Re: [implement a memory parquet ]

Currently parquet-mr only supports reading Parquet records through the HDFS API. 
This cost can be eliminated if we have a set of APIs that read Parquet data 
directly from in-memory byte buffers. That way, you can simply mmap the target 
file and read it directly without going through the heavy HDFS API stack.
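
For illustration, here is a minimal Scala sketch of the mmap side using only the 
standard java.nio API; the file path is made up, and the byte-buffer-based 
Parquet reader mentioned in the comment is hypothetical, since parquet-mr does 
not expose one today (paste into a Scala/Spark shell):

    import java.nio.channels.FileChannel
    import java.nio.channels.FileChannel.MapMode
    import java.nio.file.{Paths, StandardOpenOption}

    // Map the target Parquet file straight into an off-heap MappedByteBuffer.
    val channel = FileChannel.open(
      Paths.get("/data/part-00000.parquet"),   // illustrative path
      StandardOpenOption.READ)
    val buffer = channel.map(MapMode.READ_ONLY, 0, channel.size())

    // A byte-buffer-based read path would consume `buffer` here instead of
    // going through the Hadoop FS stack, e.g. (hypothetical API):
    //   val rows = ParquetByteBufferReader.read(buffer)
    println(s"mapped ${buffer.capacity()} bytes")
    channel.close()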

Cheng

On 9/1/15 11:48 AM, Wangchangchun (A) wrote:
Hi Michael Armbrust,
I cannot understand the statements below; can you explain them in more detail?



As you suggest though, this isn't optimal as you end up pretty CPU bound doing 
the decoding.  For this use case, you could probably optimize things if the 
parquet library has a way to read directly from byte buffers instead of through 
the Hadoop FS API.  That way you could just mmap the data directly into the 
process.




From: Michael Armbrust [mailto:[email protected]]
Sent: September 1, 2015 10:22
To: [email protected]
Cc: Wangchangchun (A)
Subject: Re: Re: [implement a memory parquet ]

Typically the complaint I hear from users is that the data blows up in size, 
since we don't have nearly as well fleshed out columnar compression in Spark 
SQL as parquet does.  As a result I often see people reading parquet data 
stored in tachyon or something to get memory speeds without the data size 
blowup.
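
As a concrete example of that pattern, a short spark-shell (Scala) sketch; the 
tachyon:// URI, host, and path are illustrative and assume the Tachyon Hadoop 
client jar is on the classpath, with sqlContext predefined by the 1.x shell:

    // Read Parquet that lives in Tachyon to get memory-speed I/O without
    // first converting the data into Spark SQL's columnar cache format.
    val events = sqlContext.read.parquet("tachyon://tachyon-master:19998/warehouse/events")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()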

As you suggest though, this isn't optimal as you end up pretty CPU bound doing 
the decoding.  For this use case, you could probably optimize things if the 
parquet library has a way to read directly from byte buffers instead of through 
the Hadoop FS API.  That way you could just mmap the data directly into the 
process.

On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson 
<[email protected]<mailto:[email protected]>> 
wrote:
Why do you think parquet format will be a better in memory representation?
My guess is that it won't be -- it wasn't really designed for that,
and I think a specialized format designed for in memory use would
probably work a lot better.

On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <
[email protected]<mailto:[email protected]>> wrote:

Can anyone help answer this question for me?
Thanks!

From: Wangchangchun (A)
Sent: August 29, 2015 11:03
To: '[email protected]<mailto:[email protected]>'
Subject: [implement a memory parquet ]

Hi everyone,
Can somebody help answer a question for me?

In Spark SQL, in-memory data is stored in an object model named InternalRow.
If you cache a table, Spark converts the InternalRow data into its in-memory
columnar storage. We think that this in-memory columnar storage is not an
efficient format, so we want to try storing the data in memory using the
Parquet format.
That is, we want to implement an in-memory Parquet format and store Spark SQL
cache tables in it.

Is this feasible? If it is feasible, can someone give me some advice on how to
implement it? Which Parquet APIs should I use?
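
For reference, the existing cache path being described looks like this in a 
spark-shell (Scala) session; the path and table name are illustrative:

    // Caching materializes the table in Spark SQL's own in-memory columnar
    // format (built from InternalRow), not in Parquet format.
    val df = sqlContext.read.parquet("/warehouse/events")   // illustrative path
    df.registerTempTable("events")
    sqlContext.cacheTable("events")   // builds the in-memory columnar cache
    sqlContext.sql("SELECT COUNT(*) FROM events").show()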

--
Alex Levenson
@THISWILLWORK

