Hi,

I have a DataFrame whose schema looks like this:

+-------------+----------------------------+
| col_name    |         data_type          |
+-------------+----------------------------+
| obj_id      | string                     |
| type        | string                     |
| name        | string                     |
| metric_name | string                     |
| value       | double                     |
| ts          | timestamp                  |
+-------------+----------------------------+

It is working fine, and I can store it to parquet with:

df.saveAsParquetFile("/user/data/metrics")

I would like to leverage Parquet partitioning, as described here:
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

I would like the resulting directory layout to look something like this:

usr
|__ data
      |__ metrics
            |__ type=Virtual Machine
                  |__ objId=1234
                        |__ metricName=CPU Demand
                              |__ yyyymmdd
                                    |__ data.parquet
                        |__ metricName=CPU Utilization
                              |__ yyyymmdd
                                    |__ data.parquet
                  |__ objId=5678
                        |__ metricName=CPU Demand
                              |__ yyyymmdd
                                    |__ data.parquet
            |__ type=Application
                  |__ objId=0009
                        |__ metricName=Response Time
                              |__ yyyymmdd
                                    |__ data.parquet
                        |__ metricName=Slow Response
                              |__ yyyymmdd
                                    |__ data.parquet
                  |__ objId=0303
                        |__ metricName=Response Time
                              |__ yyyymmdd
                                    |__ data.parquet
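
The yyyymmdd level is not a column in my schema today, so I assume I would
first need to derive it from ts, something like this (assuming Spark 1.5+,
where date_format is available; dfWithDay is just a placeholder name):

import org.apache.spark.sql.functions.date_format

// derive a yyyymmdd day column from the ts timestamp so it can be used
// as a partition column
val dfWithDay = df.withColumn("day", date_format(df("ts"), "yyyyMMdd"))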


What is the correct way to achieve this? I can do something like:

df.map { case Row(nodeType: String, objId: String, name: String,
                  metricName: String, value: Double, ts: java.sql.Timestamp) =>
  ...
  // construct the path for this record (floorToDay is a date-truncation helper I'd write)
  val path =
    s"/usr/data/metrics/type=$nodeType/objId=$objId/metricName=$metricName/${floorToDay(ts)}"
  // save the single record as parquet at that path (pseudocode)
  saveAsParquet(path, ...)
  ...
}
Is this the right approach, or is there a better one? As written it would save
every row as an individual file, and I will receive multiple entries for a
given type, objId, and metric combination in a given day.
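
Or is DataFrameWriter.partitionBy the intended way to get this layout? A rough
sketch of what I have in mind, assuming Spark 1.4+ and the derived day column
from above (I have not verified how the directories come out):

// write one set of parquet files per partition-key combination, letting Spark
// create the key=value directories; the directory names follow the column
// names, so they would be obj_id=.../metric_name=... unless I rename the
// columns first
dfWithDay.write
  .partitionBy("type", "obj_id", "metric_name", "day")
  .parquet("/usr/data/metrics")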

TIA for the assistance.

-Todd
