Hi Cheng Lian,

Do you mean that it might be feasible to mmap a Parquet file into memory and read it through a new API that reads Parquet data directly from in-memory byte buffers? Why can the cost of decoding be eliminated by this method?
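To make the question concrete, here is a minimal sketch of the mmap part in plain Java NIO. The InMemoryParquetReader in the final comment is a hypothetical name; parquet-mr has no such byte-buffer API today, which is exactly what this thread is about:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapParquetSketch {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/tmp/part-00000.parquet"); // invented example path

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file into this process's address space. This avoids
            // copying bytes through the Hadoop FS / HDFS API stack on every read.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Hypothetical: parquet-mr would need a reader that accepts a
            // ByteBuffer instead of a Hadoop Path. Note that decoding the column
            // chunks (decompression, RLE/dictionary decoding) still happens when
            // records are materialized; mmap only removes the file-system API
            // overhead, not the CPU cost of decoding.
            // InMemoryParquetReader reader = new InMemoryParquetReader(buffer);
        }
    }
}

The mapping itself is cheap; what I am unsure about is what a reader on top of the buffer would cost.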
-----Original Message-----
From: Cheng Lian [mailto:[email protected]]
Sent: September 1, 2015 14:43
To: [email protected]
Subject: Re: Re: Re: [implement a memory parquet]

Currently parquet-mr only supports reading Parquet records through the HDFS API. This cost can be eliminated if we have a set of APIs that read Parquet data directly from in-memory byte buffers. In this way, you can simply mmap the target file and read it directly without going through the heavy HDFS API stack.

Cheng

On 9/1/15 11:48 AM, Wangchangchun (A) wrote:
> Hi Michael Armbrust,
> I cannot understand the statements below; could you explain them in more detail?
>
> As you suggest though, this isn't optimal as you end up pretty CPU bound
> doing the decoding. For this use case, you could probably optimize things if
> the parquet library has a way to read directly from byte buffers instead of
> through the Hadoop FS API. That way you could just mmap the data directly
> into the process.
>
> From: Michael Armbrust [mailto:[email protected]]
> Sent: September 1, 2015 10:22
> To: [email protected]
> Cc: Wangchangchun (A)
> Subject: Re: Re: [implement a memory parquet]
>
> Typically the complaint I hear from users is that the data blows up in size,
> since we don't have nearly as well fleshed out columnar compression in Spark
> SQL as parquet does. As a result I often see people reading parquet data
> stored in Tachyon or something to get memory speeds without the data size
> blowup.
>
> As you suggest though, this isn't optimal as you end up pretty CPU bound
> doing the decoding. For this use case, you could probably optimize things if
> the parquet library has a way to read directly from byte buffers instead of
> through the Hadoop FS API. That way you could just mmap the data directly
> into the process.
>
> On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson
> <[email protected]<mailto:[email protected]>> wrote:
> Why do you think parquet format will be a better in-memory representation?
> My guess is that it won't be -- it wasn't really designed for that,
> and I think a specialized format designed for in-memory use would
> probably work a lot better.
>
> On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <
> [email protected]<mailto:[email protected]>> wrote:
>
>> Can anyone help to answer this question?
>> Thanks!
>>
>> From: Wangchangchun (A)
>> Sent: August 29, 2015 11:03
>> To: '[email protected]<mailto:[email protected]>'
>> Subject: [implement a memory parquet]
>>
>> Hi everyone,
>> Can somebody help me answer a question?
>>
>> In Spark SQL, in-memory data is stored in an object model named InternalRow.
>> If you cache a table, Spark converts InternalRow into in-memory
>> columnar storage. We think this in-memory columnar storage is not an
>> efficient format, and we want to try storing the data in memory in the
>> Parquet format instead.
>> That is, we want to implement an in-memory Parquet format and store
>> Spark SQL cached tables in it.
>>
>> Is this feasible? If it is feasible, can someone give me some advice
>> about how to implement it? Which Parquet APIs should I use?
>>
>
> --
> Alex Levenson
> @THISWILLWORK
>
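For reference, the cache-table pattern I have in mind, sketched with the Spark 1.x Java API (the table name and file path are invented for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CacheTableSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-table-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read Parquet from disk; records are decoded into InternalRow.
        DataFrame df = sqlContext.read().parquet("/data/events.parquet");
        df.registerTempTable("events");

        // cacheTable converts the rows into Spark SQL's in-memory columnar
        // format. Our proposal is to keep the cached bytes in Parquet form
        // instead, trading Parquet's better columnar compression against
        // the CPU cost of decoding on every scan.
        sqlContext.cacheTable("events");

        sqlContext.sql("SELECT COUNT(*) FROM events").show();
    }
}

Keeping the cached bytes in Parquet form would shrink the cache, but every scan would then pay the decoding cost Michael describes above.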
