Thanks for your response. I ran some more tests and am seeing that when my Avro schema has a flatter structure, the cache memory use is close to the CSV's. But when I use a few levels of nesting, the cache memory usage blows up. This is critical for planning the cluster we will be using. To avoid needing a larger cluster, it looks like we will have to keep the structure as flat as possible.
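As a side note, the nesting effect can be illustrated outside of Spark entirely. The sketch below (plain Python, with made-up field names) holds the same four leaf values flat versus two levels deep; every extra nested container adds its own per-object overhead once the data is deserialized into objects. Spark's actual in-memory columnar representation is different, so this only illustrates the general mechanism, not Spark's exact numbers:

```python
import sys

def deep_size(obj, seen=None):
    """Recursively sum sys.getsizeof over an object graph."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_size(i, seen) for i in obj)
    return size

# Same four leaf values, two shapes (hypothetical fields).
flat = {"id": 1, "name": "a", "city": "x", "zip": "12345"}
nested = {"id": 1, "profile": {"name": "a",
                               "address": {"city": "x", "zip": "12345"}}}

# nested is strictly larger: two extra dict containers plus their keys.
print(deep_size(flat), deep_size(nested))
```

Multiply that per-record overhead by millions of rows and the gap between a flat and a nested layout becomes significant.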
On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:

> (Adding user@spark back to the discussion)
>
> Well, the CSV vs AVRO difference might be simpler to explain. CSV has a lot of scope for compression. On the other hand, Avro and Parquet are already compressed and just store extra schema info, afaik. Avro and Parquet both make your data smaller, Parquet through compressed columnar storage, and Avro through its binary data format.
>
> Next, regarding the 62kb becoming 1224kb: I actually do not see such a massive blow-up. The Avro file you shared is 28kb on my system and becomes 53.7kb when cached in memory deserialized, and 52.9kb when cached in memory serialized. Exact same numbers with Parquet as well. This is expected behavior, if I am not wrong.
>
> In fact, now that I think about it, even larger blow-ups might be valid, since your data must have been deserialized from the compressed Avro format, making it bigger. The order of magnitude of the difference in size would depend on the type of data you have and how compressible it was.
>
> The purpose of these formats is to store data to persistent storage in a way that's faster to read from, not to reduce cache-memory usage.
>
> Maybe others here have more info to share.
>
> Regards,
> Shreya
>
> Sent from my Windows 10 phone
>
> *From:* Prithish <prith...@gmail.com>
> *Sent:* Tuesday, November 15, 2016 11:04 PM
> *To:* Shreya Agarwal <shrey...@microsoft.com>
> *Subject:* Re: AVRO File size when caching in-memory
>
> I did another test and am noting my observations here. These were done with the same data in Avro and CSV formats.
>
> In Avro, the file size on disk was 62kb and, after caching, the in-memory size is 1224kb.
> In CSV, the file size on disk was 690kb and, after caching, the in-memory size is 290kb.
>
> I'm guessing that Spark caching is not able to compress when the source is Avro.
> Not sure if this is just my immature conclusion. Waiting to hear your observations.
>
> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I have attached the code (that I ran using the Spark shell) as well as a sample Avro file. After you run this code, the data is cached in memory, and you can go to the "Storage" tab in the Spark UI (localhost:4040) and see the size it uses. In this example the size is small, but in my actual scenario the source file size is 30GB and the in-memory size comes to around 800GB. I am trying to understand whether this is expected when using Avro or not.
>>
>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>
>>> I haven't ever used Avro. But if you can send over a quick code sample, I can run it and see if I can repro the issue and maybe debug.
>>>
>>> *From:* Prithish [mailto:prith...@gmail.com]
>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>> *Cc:* User <user@spark.apache.org>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>> Anyone?
>>>
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>
>>> I am using Spark 2.0.1 and the Databricks Avro library 3.0.1. I am running this on the latest AWS EMR release.
>>>
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> Spark version? Are you using Tungsten?
>>>
>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>> >
>>> > Can someone please explain why this happens?
>>> >
>>> > When I read a 600kb Avro file and cache it in memory (using cacheTable), it shows up as 11mb (Storage tab in the Spark UI). I have tried this with different file sizes, and the in-memory size is always proportionate. I thought Spark compresses when using cacheTable.