Ah, yes. Nested schemas should be avoided if you want the best memory usage.

From: Prithish <prith...@gmail.com>
Sent: Wednesday, November 16, 2016 12:48 AM
To: Takeshi Yamamuro <linguin....@gmail.com>
Cc: Shreya Agarwal <shrey...@microsoft.com>; user@spark.apache.org
Subject: Re: AVRO File size when caching in-memory

It's something like the schema shown below (with several additional levels/sublevels):

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro 
<linguin....@gmail.com> wrote:
Hi,

What's the schema as interpreted by Spark?
The compression logic used by Spark's in-memory caching depends on the column types.
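
For example, you could check it from the spark-shell like this (the input path is just a placeholder):

  // Load the Avro file and print the schema Spark infers for it
  val df = spark.read.format("com.databricks.spark.avro").load("/path/to/data.avro")
  df.printSchema()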

// maropu


On Wed, Nov 16, 2016 at 5:26 PM, Prithish 
<prith...@gmail.com> wrote:
Thanks for your response.

I did some more tests, and I am seeing that when I use a flatter structure for
my AVRO, the cache memory usage is close to the CSV. But when I use a few levels
of nesting, the cache memory usage blows up. This is really critical for
planning the cluster we will be using. To avoid a larger cluster, it looks like
we will have to keep the structure as flat as possible.
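
As a rough sketch of the kind of flattening I mean (the column names here are just illustrative):

  import org.apache.spark.sql.functions.col

  // Pull a few of the nested "story" fields up to top-level columns
  val flat = df.select(
    col("sentAt"),
    col("ip"),
    col("story.lang").alias("story_lang"),
    col("story.myapp.id").alias("story_myapp_id"),
    col("story.loc.lat").alias("story_loc_lat"),
    col("story.loc.long").alias("story_loc_long"))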

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal 
<shrey...@microsoft.com> wrote:
(Adding user@spark back to the discussion)

Well, the CSV vs. AVRO difference might be simpler to explain. CSV leaves a lot
of room for compression, whereas Avro and Parquet are already compact and just
store extra schema info, AFAIK. Avro and Parquet are both going to make your
data smaller: Parquet through compressed columnar storage, and Avro through its
binary data format.
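
If you want to compare the on-disk footprint directly, one option is to write the same DataFrame out in each format and compare the output sizes (the paths are placeholders):

  // Write identical data in all three formats, then compare the directory sizes
  df.write.mode("overwrite").csv("/tmp/cmp_csv")
  df.write.mode("overwrite").parquet("/tmp/cmp_parquet")
  df.write.mode("overwrite").format("com.databricks.spark.avro").save("/tmp/cmp_avro")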

Next, about the 62 KB becoming 1224 KB: I actually do not see such a massive
blow-up. The Avro file you shared is 28 KB on my system and becomes 53.7 KB when
cached in memory deserialized and 52.9 KB when cached in memory serialized. I
get the exact same numbers with Parquet as well. This is expected behavior, if I
am not wrong.
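
For reference, I cached it both ways with something like this (a minimal sketch):

  import org.apache.spark.storage.StorageLevel

  // Deserialized in-memory cache
  df.persist(StorageLevel.MEMORY_ONLY)
  df.count()  // force materialization, then check the Storage tab

  // Serialized in-memory cache
  df.unpersist(blocking = true)
  df.persist(StorageLevel.MEMORY_ONLY_SER)
  df.count()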

In fact, now that I think about it, even larger blow-ups might be valid, since
your data must have been deserialized from the compressed Avro format, making it
bigger. The order of magnitude of the difference would depend on the type of
data you have and how compressible it was.

The purpose of these formats is to store data in persistent storage in a way
that is fast to read back, not to reduce in-memory cache usage.

Maybe others here have more info to share.

Regards,
Shreya

From: Prithish <prith...@gmail.com>
Sent: Tuesday, November 15, 2016 11:04 PM
To: Shreya Agarwal <shrey...@microsoft.com>
Subject: Re: AVRO File size when caching in-memory

I did another test and am noting my observations here. These were done with the
same data in AVRO and CSV formats.

- AVRO: 62 KB on disk; 1224 KB in memory after caching
- CSV: 690 KB on disk; 290 KB in memory after caching

I'm guessing that Spark's caching is not able to compress when the source is
Avro, but that may be a premature conclusion. Waiting to hear your observations.
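
If it is relevant, you could also confirm that in-memory columnar compression is enabled (it should be on by default, as far as I know):

  // Compression flag for Spark's in-memory columnar cache (default: true)
  spark.conf.get("spark.sql.inMemoryColumnarStorage.compressed")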

On Wed, Nov 16, 2016 at 12:14 PM, Prithish 
<prith...@gmail.com> wrote:
Thanks for your response.

I have attached the code (which I ran in the spark-shell) as well as a sample
Avro file. After you run this code, the data is cached in memory, and you can go
to the Storage tab in the Spark UI (localhost:4040) to see the size it uses. In
this example the size is small, but in my actual scenario the source file is
30 GB and the in-memory size comes to around 800 GB. I am trying to understand
whether this is expected when using Avro or not.
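
In case the list strips the attachment, the gist of the code is something like this (the file path and table name are placeholders):

  // Read the sample file with the databricks spark-avro reader
  val df = spark.read.format("com.databricks.spark.avro").load("/path/to/sample.avro")
  df.createOrReplaceTempView("events")

  // Cache the table and force materialization, then check localhost:4040 -> Storage
  spark.catalog.cacheTable("events")
  spark.table("events").count()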

On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal 
<shrey...@microsoft.com> wrote:
I haven't ever used Avro. But if you can send over a quick code sample, I can
run it, see if I can repro the issue, and maybe debug.

From: Prithish [mailto:prith...@gmail.com]
Sent: Tuesday, November 15, 2016 8:44 PM
To: Jörn Franke <jornfra...@gmail.com>
Cc: User <user@spark.apache.org>
Subject: Re: AVRO File size when caching in-memory

Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish 
<prith...@gmail.com> wrote:
I am using Spark 2.0.1 and the databricks avro library 3.0.1, running on the
latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke
<jornfra...@gmail.com> wrote:
Which Spark version? Are you using Tungsten?

> On 14 Nov 2016, at 10:05, Prithish
> <prith...@gmail.com> wrote:
>
> Can someone please explain why this happens?
>
> When I read a 600 KB AVRO file and cache it in memory (using cacheTable), it
> shows up as 11 MB (Storage tab in the Spark UI). I have tried this with
> different file sizes, and the in-memory size is always proportionally larger.
> I thought Spark compressed the data when using cacheTable.

--
---
Takeshi Yamamuro
