Hi Cheng Lian,

Do you mean that it might be feasible to mmap a Parquet file into memory and read it through a new API that reads Parquet data directly from in-memory byte buffers? Why can the cost of decoding be eliminated by this method?
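To make the question concrete, here is a minimal sketch of the mmap part in plain Java NIO. The InMemoryParquetReader in the final comment is a hypothetical name; parquet-mr has no such byte-buffer API today, which is exactly what this thread is about:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapParquetSketch {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/tmp/part-00000.parquet"); // invented example path

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file into this process's address space. This avoids
            // copying bytes through the Hadoop FS / HDFS API stack on every read.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Hypothetical: parquet-mr would need a reader that accepts a
            // ByteBuffer instead of a Hadoop Path. Note that decoding the column
            // chunks (decompression, RLE/dictionary decoding) still happens when
            // records are materialized; mmap only removes the file-system API
            // overhead, not the CPU cost of decoding.
            // InMemoryParquetReader reader = new InMemoryParquetReader(buffer);
        }
    }
}

The mapping itself is cheap; what I am unsure about is what a reader on top of the buffer would cost.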
-----Original Message-----
From: Cheng Lian [mailto:[email protected]]
Sent: September 1, 2015 14:43
To: [email protected]
Subject: Re: Re: Re: [implement a memory parquet]

Currently parquet-mr only supports reading Parquet records through the HDFS API. This cost can be eliminated if we have a set of APIs that read Parquet data directly from in-memory byte buffers. In this way, you can simply mmap the target file and read it directly without going through the heavy HDFS API stack.

Cheng

On 9/1/15 11:48 AM, Wangchangchun (A) wrote:
> Hi Michael Armbrust,
> I cannot understand the statements below; could you explain them in more detail?
>
> As you suggest though, this isn't optimal as you end up pretty CPU bound
> doing the decoding. For this use case, you could probably optimize things if
> the parquet library has a way to read directly from byte buffers instead of
> through the Hadoop FS API. That way you could just mmap the data directly
> into the process.
>
> From: Michael Armbrust [mailto:[email protected]]
> Sent: September 1, 2015 10:22
> To: [email protected]
> Cc: Wangchangchun (A)
> Subject: Re: Re: [implement a memory parquet]
>
> Typically the complaint I hear from users is that the data blows up in size,
> since we don't have nearly as well fleshed out columnar compression in Spark
> SQL as parquet does. As a result I often see people reading parquet data
> stored in Tachyon or something to get memory speeds without the data size
> blowup.
>
> As you suggest though, this isn't optimal as you end up pretty CPU bound
> doing the decoding. For this use case, you could probably optimize things if
> the parquet library has a way to read directly from byte buffers instead of
> through the Hadoop FS API. That way you could just mmap the data directly
> into the process.
>
> On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson
> <[email protected]<mailto:[email protected]>> wrote:
> Why do you think parquet format will be a better in-memory representation?
> My guess is that it won't be -- it wasn't really designed for that,
> and I think a specialized format designed for in-memory use would
> probably work a lot better.
>
> On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <
> [email protected]<mailto:[email protected]>> wrote:
>
>> Can anyone help to answer this question?
>> Thanks!
>>
>> From: Wangchangchun (A)
>> Sent: August 29, 2015 11:03
>> To: '[email protected]<mailto:[email protected]>'
>> Subject: [implement a memory parquet]
>>
>> Hi everyone,
>> Can somebody help me answer a question?
>>
>> In Spark SQL, in-memory data is stored in an object model named InternalRow.
>> If you cache a table, Spark converts InternalRow into in-memory
>> columnar storage. We think this in-memory columnar storage is not an
>> efficient format, and we want to try storing the data in memory in the
>> Parquet format instead.
>> That is, we want to implement an in-memory Parquet format and store
>> Spark SQL cached tables in it.
>>
>> Is this feasible? If it is feasible, can someone give me some advice
>> about how to implement it? Which Parquet APIs should I use?
>>
>
> --
> Alex Levenson
> @THISWILLWORK
>
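For reference, the cache-table pattern I have in mind, sketched with the Spark 1.x Java API (the table name and file path are invented for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CacheTableSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-table-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read Parquet from disk; records are decoded into InternalRow.
        DataFrame df = sqlContext.read().parquet("/data/events.parquet");
        df.registerTempTable("events");

        // cacheTable converts the rows into Spark SQL's in-memory columnar
        // format. Our proposal is to keep the cached bytes in Parquet form
        // instead, trading Parquet's better columnar compression against
        // the CPU cost of decoding on every scan.
        sqlContext.cacheTable("events");

        sqlContext.sql("SELECT COUNT(*) FROM events").show();
    }
}

Keeping the cached bytes in Parquet form would shrink the cache, but every scan would then pay the decoding cost Michael describes above.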
