Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
Hi Tomas,

One option is to cache your table as Parquet files into Alluxio (which can
serve as an in-memory distributed caching layer for Spark in your case).

The code on the Spark side would look like this:

> df.write.parquet("alluxio://master:19998/data.parquet")
> df = sqlContext.read.parquet("alluxio://master:19998/data.parquet")

(See more details in the documentation:
http://www.alluxio.org/docs/1.8/en/compute/Spark.html)

This would require running Alluxio as a separate service (ideally colocated
with the Spark servers), of course, but it also enables data sharing across
Spark jobs.
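
For the spark-thrift use case, the full round trip might look roughly like the
sketch below (PySpark; a sketch only, assuming an Alluxio master at
master:19998 and a SparkSession named spark -- the Alluxio path and view name
are illustrative, not from the thread):

> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Write the hot day of data to Alluxio as Parquet files.
> hot = spark.table("events").where("day_registered = 20190102")
> hot.write.mode("overwrite").parquet("alluxio://master:19998/events/20190102.parquet")
>
> # Read it back and register a view so it can be queried with Spark SQL
> # in the same session.
> cached = spark.read.parquet("alluxio://master:19998/events/20190102.parquet")
> cached.createOrReplaceTempView("event_jan_01_hot")
> spark.sql("select count(*) from event_jan_01_hot").show()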

- Bin




On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos wrote:

> Hello,
>
> I'm using the spark-thrift server and I'm searching for the best-performing
> solution for querying a hot set of data. I'm processing records with a nested
> structure, containing subtypes and arrays. One record takes up several KB.
>
> I tried to make some improvement with cache table:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK, so data which
> doesn't fit into memory will be spilled to disk (I assume also in columnar
> format?).
> I cached 1 day of data (1 M records), and according to the Spark UI storage
> tab none of the data was cached in memory and everything was spilled to disk.
> The size of the data was *5.7 GB.*
> Typical queries took ~ 20 sec.
>
> Then I tried to store the data in Parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as select * from event_jan_01;
>
> The whole Parquet table took up only *178 MB*,
> and typical queries took 5-10 sec.
>
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk while nothing stayed in
> memory?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
>
>


Re: cache table vs. parquet table performance

2019-01-16 Thread Jörn Franke
I believe the in-memory solution misses the storage indexes that Parquet / ORC
have.

The in-memory solution is more suitable if you frequently iterate over the
whole data set.
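
For what it's worth, the Parquet-side pruning can be observed in the physical
plan. A minimal PySpark sketch (not from the thread), assuming a SparkSession
named spark and the Parquet path from the original mail:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With a filter on a Parquet source, the physical plan typically lists the
# predicate under "PushedFilters", i.e. it can be checked against Parquet
# row-group statistics before rows are materialized.
events = spark.read.parquet("/tmp/events/jan/02")
events.where("day_registered = 20190102").explain()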

> On 15.01.2019 at 19:20, Tomas Bartalos wrote:
> 
> Hello,
> 
> I'm using the spark-thrift server and I'm searching for the best-performing
> solution for querying a hot set of data. I'm processing records with a nested
> structure, containing subtypes and arrays. One record takes up several KB.
> 
> I tried to make some improvement with cache table:
> cache table event_jan_01 as select * from events where day_registered = 
> 20190102;
> 
> If I understood correctly, the data should be stored in in-memory columnar
> format with storage level MEMORY_AND_DISK, so data which doesn't fit into
> memory will be spilled to disk (I assume also in columnar format?).
> I cached 1 day of data (1 M records), and according to the Spark UI storage
> tab none of the data was cached in memory and everything was spilled to disk.
> The size of the data was 5.7 GB.
> Typical queries took ~ 20 sec.
> 
> Then I tried to store the data in Parquet format:
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as
> select * from event_jan_01;
>
> The whole Parquet table took up only 178 MB,
> and typical queries took 5-10 sec.
> 
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk with nothing staying in memory?
> 
> Spark version: 2.4.0
> 
> Best regards,
> Tomas
> 


Re: cache table vs. parquet table performance

2019-01-16 Thread Todd Nist
Hi Tomas,

Have you considered using something like https://www.alluxio.org/ for your
cache?  It seems like a possible solution for what you're trying to do.

-Todd

On Tue, Jan 15, 2019 at 11:24 PM 大啊  wrote:

> Hi, Tomas.
> Thanks for your question, it gave me something to think about. But caching
> usually works best for smaller data sets; caching large data consumes too
> much memory or disk space.
> Spilling the cached data in Parquet format might be a good improvement.
>
> At 2019-01-16 02:20:56, "Tomas Bartalos"  wrote:
>
> Hello,
>
> I'm using the spark-thrift server and I'm searching for the best-performing
> solution for querying a hot set of data. I'm processing records with a nested
> structure, containing subtypes and arrays. One record takes up several KB.
>
> I tried to make some improvement with cache table:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK, so data which
> doesn't fit into memory will be spilled to disk (I assume also in columnar
> format?).
> I cached 1 day of data (1 M records), and according to the Spark UI storage
> tab none of the data was cached in memory and everything was spilled to disk.
> The size of the data was *5.7 GB.*
> Typical queries took ~ 20 sec.
>
> Then I tried to store the data in Parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as select * from event_jan_01;
>
> The whole Parquet table took up only *178 MB*,
> and typical queries took 5-10 sec.
>
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk while nothing stayed in
> memory?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
>
>
>
>
>


cache table vs. parquet table performance

2019-01-15 Thread Tomas Bartalos
Hello,

I'm using the spark-thrift server and I'm searching for the best-performing
solution for querying a hot set of data. I'm processing records with a nested
structure, containing subtypes and arrays. One record takes up several KB.

I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered =
20190102;


If I understood correctly, the data should be stored in *in-memory columnar*
format with storage level MEMORY_AND_DISK, so data which doesn't fit into
memory will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records), and according to the Spark UI storage tab
none of the data was cached in memory and everything was spilled to disk.
The size of the data was *5.7 GB.*
Typical queries took ~ 20 sec.
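
(For reference, the same caching can be expressed through the DataFrame API,
where the storage level is explicit. A minimal sketch, not from the thread,
assuming a SparkSession named spark; MEMORY_ONLY is used here only to show
that the level can be chosen:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cache the hot day in memory only; cached batches that do not fit are
# recomputed rather than written to disk.
hot = spark.table("events").where("day_registered = 20190102")
hot.persist(StorageLevel.MEMORY_ONLY)
hot.createOrReplaceTempView("event_jan_01")
hot.count()  # materialize the cache

# Check whether Spark reports the view as cached.
print(spark.catalog.isCached("event_jan_01"))
)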

Then I tried to store the data in Parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
as select * from event_jan_01;

The whole Parquet table took up only *178 MB*,
and typical queries took 5-10 sec.
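
(The same step expressed through the DataFrame API might look roughly like the
sketch below, assuming a SparkSession named spark; the format and path mirror
the CREATE TABLE statement above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the cached day out as an external Parquet-backed table at the same
# location used in the CREATE TABLE statement above.
(spark.table("event_jan_01")
    .write
    .format("parquet")
    .mode("overwrite")
    .option("path", "/tmp/events/jan/02")
    .saveAsTable("event_jan_01_par"))
)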

Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk while nothing stayed in
memory?

Spark version: 2.4.0

Best regards,
Tomas