Currently parquet-mr only supports reading Parquet records through the HDFS
API. This cost could be eliminated if we had a set of APIs that read
Parquet data directly from in-memory byte buffers. That way, you could
simply mmap the target file and read it directly without going through
the heavy HDFS API stack.
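A minimal sketch of the mmap half of that idea, using plain java.nio; the
file path is made up, and the byte-buffer-aware Parquet reader it would feed
is hypothetical rather than an existing parquet-mr API:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapParquetFile {
        public static void main(String[] args) throws Exception {
            try (FileChannel channel = FileChannel.open(
                    Paths.get("/data/part-00000.parquet"), StandardOpenOption.READ)) {
                // Map the whole file into the process address space: no copy,
                // no Hadoop FS call chain.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // A byte-buffer-aware Parquet reader (hypothetical) could now
                // decode the footer and pages straight out of 'buffer'.
            }
        }
    }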
Cheng
On 9/1/15 11:48 AM, Wangchangchun (A) wrote:
Hi Michael Armbrust,
I don't understand the statements below; could you explain them in more detail?
As you suggest though, this isn't optimal as you end up pretty CPU bound doing
the decoding. For this use case, you could probably optimize things if the
parquet library has a way to read directly from byte buffers instead of through
the Hadoop FS API. That way you could just mmap the data directly into the
process.
From: Michael Armbrust [mailto:[email protected]]
Sent: September 1, 2015 10:22
To: [email protected]
Cc: Wangchangchun (A)
Subject: Re: Re: [implement a memory parquet]
Typically the complaint I hear from users is that the data blows up in size,
since Spark SQL's columnar compression is nowhere near as well fleshed out as
Parquet's. As a result, I often see people reading Parquet data stored in
Tachyon or something similar to get memory speeds without the data size
blowup.
As you suggest though, this isn't optimal as you end up pretty CPU bound doing
the decoding. For this use case, you could probably optimize things if the
parquet library has a way to read directly from byte buffers instead of through
the Hadoop FS API. That way you could just mmap the data directly into the
process.
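To make that concrete, here is a sketch of what a byte-buffer-backed input
might look like; ByteBufferInput is a hypothetical class, not an existing
parquet-mr API. The point is only that random access over a mapped buffer is
cheap compared to going through the Hadoop FS layer:

    import java.io.EOFException;
    import java.nio.ByteBuffer;

    /** Hypothetical random-access view over an mmapped Parquet file. */
    public class ByteBufferInput {
        private final ByteBuffer buffer;  // e.g. a MappedByteBuffer from FileChannel.map

        public ByteBufferInput(ByteBuffer buffer) {
            this.buffer = buffer.duplicate();
        }

        public long length() {
            return buffer.capacity();
        }

        /** Read 'length' bytes at 'position' without touching the Hadoop FS API. */
        public byte[] readFully(long position, int length) throws EOFException {
            if (position + length > buffer.capacity()) {
                throw new EOFException("read past end of buffer");
            }
            byte[] out = new byte[length];
            ByteBuffer view = buffer.duplicate();
            view.position((int) position);  // single mapped regions are limited to 2 GB here
            view.get(out);
            return out;
        }
    }

A Parquet reader that accepted something like this instead of an
org.apache.hadoop.fs.Path could read the footer and column chunks by offset
directly out of the mapped region.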
On Mon, Aug 31, 2015 at 6:46 PM, Alex Levenson <[email protected]> wrote:
Why do you think the Parquet format will be a better in-memory representation?
My guess is that it won't be -- it wasn't really designed for that, and I
think a specialized format designed for in-memory use would probably work a
lot better.
On Mon, Aug 31, 2015 at 6:05 PM, Wangchangchun (A) <[email protected]> wrote:
Can anyone help answer this question?
Thanks!
From: Wangchangchun (A)
Sent: August 29, 2015 11:03
To: '[email protected]'
Subject: [implement a memory parquet]
Hi, everyone,
Can somebody help me with a question?
In Spark SQL, in-memory data is stored in an object model called InternalRow.
If you cache a table, Spark converts the InternalRow data into its built-in
in-memory columnar storage. We think that this in-memory columnar storage is
not very efficient, and we want to try storing the data in memory using the
Parquet format. That is, we want to implement an in-memory Parquet format and
store Spark SQL cached tables in it.
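For reference, this is the existing cache path being described, as a minimal
sketch against the Spark 1.x Java API (the path and table name are made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class CacheTableExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("cache-table-example");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Rows are materialized as InternalRow; cacheTable converts them to
            // Spark SQL's built-in in-memory columnar format -- the storage this
            // proposal would replace with Parquet-formatted buffers.
            DataFrame df = sqlContext.read().parquet("/data/events.parquet");
            df.registerTempTable("events");
            sqlContext.cacheTable("events");
        }
    }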
Is this feasible? If it is, can someone give me some advice on how to
implement it? Which Parquet APIs should I use?
--
Alex Levenson
@THISWILLWORK