Hi, I have a DataFrame whose schema looks like this:
```
+-------------+-----------+
| col_name    | data_type |
+-------------+-----------+
| obj_id      | string    |
| type        | string    |
| name        | string    |
| metric_name | string    |
| value       | double    |
| ts          | timestamp |
+-------------+-----------+
```

It is working fine, and I can store it to parquet with:

```scala
df.saveAsParquetFile("/user/data/metrics")
```

I would like to leverage parquet partitioning, as referenced here: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

I would like to end up with a directory layout something like this:

```
usr
|__ data
    |__ metrics
        |__ type=Virtual Machine
        |   |__ objId=1234
        |   |   |__ metricName=CPU Demand
        |   |   |   |__ yyyymmdd
        |   |   |       |__ data.parquet
        |   |   |__ metricName=CPU Utilization
        |   |       |__ yyyymmdd
        |   |           |__ data.parquet
        |   |__ objId=5678
        |       |__ metricName=CPU Demand
        |           |__ yyyymmdd
        |               |__ data.parquet
        |__ type=Application
            |__ objId=0009
            |   |__ metricName=Response Time
            |   |   |__ yyyymmdd
            |   |       |__ data.parquet
            |   |__ metricName=Slow Response
            |       |__ yyyymmdd
            |           |__ data.parquet
            |__ objId=0303
                |__ metricName=Response Time
                    |__ yyyymmdd
                        |__ data.parquet
```

What is the correct way to achieve this? I can do something like:

```scala
df.map { case row @ Row(objId: String, nodeType: String, name: String, metricName: String,
                        value: Double, ts: java.sql.Timestamp) =>
  ...
  // construct the partition path for this row
  val path = s"/usr/data/metrics/type=$nodeType/objId=$objId/metricName=$metricName/${floorToDay(ts)}"
  // save the single record as parquet (pseudocode; floorToDay is a helper I would write)
  saveAsParquet(path, row)
  ...
}
```

Is this the right approach, or is there a more optimal one? My concern is that this would save every row as an individual file, and I will receive multiple entries for a given metric, type, and objId combination on a given day.

TIA for the assistance.

-Todd
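P.S. Digging a bit further, it looks like Spark 1.4's `DataFrameWriter` may support this directly via `partitionBy`. Here is a minimal sketch of what I have in mind, assuming Spark 1.4+; the `day` column and the `toDay` helper are my own inventions, and I rename the columns so the partition directories match the layout above:

```scala
import org.apache.spark.sql.functions.udf

// derive a yyyyMMdd string from the timestamp ("day" and toDay are my own names)
val toDay = udf((ts: java.sql.Timestamp) =>
  new java.text.SimpleDateFormat("yyyyMMdd").format(ts))

df.withColumn("day", toDay(df("ts")))
  // rename so the partition directories match the desired layout
  .withColumnRenamed("obj_id", "objId")
  .withColumnRenamed("metric_name", "metricName")
  .write
  .partitionBy("type", "objId", "metricName", "day")
  .parquet("/user/data/metrics")
```

If I understand partition discovery correctly, `sqlContext.read.parquet("/user/data/metrics")` should then pick up `type`, `objId`, `metricName`, and `day` as partition columns automatically. Is this the recommended route?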