Thanks for your response.

I did some more tests and I am seeing that when I have a flatter structure
for my AVRO data, the cache memory use is close to the CSV. But when I use a
few levels of nesting, the cache memory usage blows up. This is really
critical for planning the cluster we will be using. To avoid needing a larger
cluster, it looks like we will have to keep the structure as flat as
possible.
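
For reference, this is roughly how I am comparing the two layouts in the
spark-shell (the paths and table names below are placeholders, not my actual
files; both files hold the same records, only the schema nesting differs):

  // spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
  val flatDf = spark.read.format("com.databricks.spark.avro").load("/tmp/flat.avro")
  flatDf.createOrReplaceTempView("flat_table")
  spark.catalog.cacheTable("flat_table")
  spark.table("flat_table").count()     // force the cache to materialize

  val nestedDf = spark.read.format("com.databricks.spark.avro").load("/tmp/nested.avro")
  nestedDf.createOrReplaceTempView("nested_table")
  spark.catalog.cacheTable("nested_table")
  spark.table("nested_table").count()

  // compare the "Size in Memory" column on the Storage tab (localhost:4040)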

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com>
wrote:

> (Adding user@spark back to the discussion)
>
>
>
> Well, the CSV vs. AVRO difference might be simpler to explain. CSV has a lot
> of scope for compression. On the other hand, Avro and Parquet are already
> compressed and just store extra schema info, AFAIK. Avro and Parquet are
> both going to make your data smaller: Parquet through compressed columnar
> storage, and Avro through its binary data format.
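>
> For example, a quick way to compare the on-disk footprint of the three
> formats from the spark-shell (df here stands for whatever DataFrame you are
> testing with; the paths are placeholders):
>
>   df.write.option("header", "true").csv("/tmp/size_test/csv")
>   df.write.format("com.databricks.spark.avro").save("/tmp/size_test/avro")
>   df.write.parquet("/tmp/size_test/parquet")
>
>   // then compare the output directory sizes, e.g. du -sh /tmp/size_test/*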
>
>
>
> Next, talking about the 62 KB becoming 1224 KB: I actually do not see such a
> massive blow-up. The Avro file you shared is 28 KB on my system and becomes
> 53.7 KB when cached in memory deserialized and 52.9 KB when cached in memory
> serialized. Exact same numbers with Parquet as well. This is expected
> behavior, if I am not wrong.
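>
> If it helps, this is roughly how I looked at the two cache modes (the path
> is a placeholder for the file you sent):
>
>   import org.apache.spark.storage.StorageLevel
>
>   val deser = spark.read.format("com.databricks.spark.avro").load("/tmp/sample.avro")
>   deser.persist(StorageLevel.MEMORY_ONLY)       // "in memory deserialized"
>   deser.count()
>
>   val ser = spark.read.format("com.databricks.spark.avro").load("/tmp/sample.avro")
>   ser.persist(StorageLevel.MEMORY_ONLY_SER)     // "in memory serialized"
>   ser.count()
>
>   // both entries and their sizes show up on the Storage tab of the Spark UI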
>
>
>
> In fact, now that I think about it, even larger blow-ups might be valid,
> since your data must have been deserialized from the compressed Avro format,
> making it bigger. The order of magnitude of the difference in size would
> depend on the type of data you have and how well it was compressible.
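>
> One more thing worth checking is that in-memory columnar compression is
> actually enabled. As far as I know it is on by default, but you can confirm
> from the spark-shell (the second argument is only a fallback in case the key
> was never set explicitly):
>
>   spark.conf.get("spark.sql.inMemoryColumnarStorage.compressed", "true")
>   spark.conf.get("spark.sql.inMemoryColumnarStorage.batchSize", "10000")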
>
>
>
> The purpose of these formats is to store data to persistent storage in a
> way that's faster to read from, not to reduce cache-memory usage.
>
>
>
> Maybe others here have more info to share.
>
>
>
> Regards,
>
> Shreya
>
>
>
> Sent from my Windows 10 phone
>
>
>
> *From: *Prithish <prith...@gmail.com>
> *Sent: *Tuesday, November 15, 2016 11:04 PM
> *To: *Shreya Agarwal <shrey...@microsoft.com>
> *Subject: *Re: AVRO File size when caching in-memory
>
>
> I did another test and am noting my observations here. These were done with
> the same data in Avro and CSV formats.
>
> In AVRO, the file size on disk was 62 KB and after caching, the in-memory
> size is 1224 KB.
> In CSV, the file size on disk was 690 KB and after caching, the in-memory
> size is 290 KB.
>
> I'm guessing that Spark caching is not able to compress when the source is
> Avro. Not sure if this is just a premature conclusion on my part. Waiting to
> hear your observations.
>
> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I have attached the code (that I ran using the spark-shell) as well as a
>> sample Avro file. After you run this code, the data is cached in memory and
>> you can go to the "Storage" tab in the Spark UI (localhost:4040) and see
>> the size it uses. In this example the size is small, but in my actual
>> scenario the source file size is 30 GB and the in-memory size comes to
>> around 800 GB. I am trying to understand whether or not this is expected
>> when using Avro.
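>>
>> In case the attachment does not come through, the steps it goes through are
>> roughly these (a sketch of the same steps, not the exact attached code; the
>> path is a placeholder):
>>
>>   val df = spark.read.format("com.databricks.spark.avro").load("/tmp/sample.avro")
>>   df.createOrReplaceTempView("avro_table")
>>   spark.catalog.cacheTable("avro_table")
>>   spark.table("avro_table").count()   // materialize the cache
>>
>>   // then check "Size in Memory" on the Storage tab (localhost:4040)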
>>
>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com>
>> wrote:
>>
>>> I haven’t ever used Avro. But if you can send over a quick code sample,
>>> I can run it and see if I can repro the issue and maybe debug it.
>>>
>>>
>>>
>>> *From:* Prithish [mailto:prith...@gmail.com]
>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>> *Cc:* User <user@spark.apache.org>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>>
>>>
>>> Anyone?
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>
>>> I am using Spark 2.0.1 and the Databricks Avro library 3.0.1. I am running
>>> this on the latest AWS EMR release.
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>> Spark version? Are you using Tungsten?
>>>
>>>
>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>> >
>>> > Can someone please explain why this happens?
>>> >
>>> > When I read a 600 KB AVRO file and cache this in memory (using
>>> > cacheTable), it shows up as 11 MB (Storage tab in the Spark UI). I have
>>> > tried this with different file sizes, and the in-memory size is always
>>> > proportional. I thought Spark compresses when using cacheTable.
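>>> >
>>> > For reference, the cached size can also be read programmatically from the
>>> > spark-shell instead of from the UI (a rough sketch; "avro_table" is just a
>>> > placeholder for the cached table name):
>>> >
>>> >   spark.catalog.cacheTable("avro_table")
>>> >   spark.table("avro_table").count()   // materialize the cache
>>> >   sc.getRDDStorageInfo.foreach(info =>
>>> >     println(info.name + " -> " + info.memSize + " bytes in memory"))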
>>>
>>>
>>>
>>>
>>>
>>
>>
>
