Hello.

I am trying to get a simple solution working with Spark SQL: writing
streaming data to persistent tables using a HiveContext.  Writing to a
persistent non-partitioned table works well - I update the table from Spark
Streaming, and the output is available via the Hive Thrift/JDBC server.

I create a table that looks like the following:

0: jdbc:hive2://localhost:10000> describe windows_event;
+--------------------------+---------------------+----------+
|         col_name         |      data_type      | comment  |
+--------------------------+---------------------+----------+
| target_entity            | string              | NULL     |
| target_entity_type       | string              | NULL     |
| date_time_utc            | timestamp           | NULL     |
| machine_ip               | string              | NULL     |
| event_id                 | string              | NULL     |
| event_data               | map<string,string>  | NULL     |
| description              | string              | NULL     |
| event_record_id          | string              | NULL     |
| level                    | string              | NULL     |
| machine_name             | string              | NULL     |
| sequence_number          | string              | NULL     |
| source                   | string              | NULL     |
| source_machine_name      | string              | NULL     |
| task_category            | string              | NULL     |
| user                     | string              | NULL     |
| additional_data          | map<string,string>  | NULL     |
| windows_event_time_bin   | timestamp           | NULL     |
| # Partition Information  |                     |          |
| # col_name               | data_type           | comment  |
| windows_event_time_bin   | timestamp           | NULL     |
+--------------------------+---------------------+----------+
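For context, the table is a Parquet table partitioned on
windows_event_time_bin.  The statement below is only a sketch of the kind of
DDL that produces a table like this, issued through the HiveContext (which I
call hiveContext here); the column list is abbreviated and my exact DDL
differs:

    // Sketch only: Parquet table partitioned on windows_event_time_bin.
    // Column list abbreviated; the real table has all columns shown above.
    hiveContext.sql("""
      CREATE TABLE IF NOT EXISTS windows_event (
        target_entity STRING,
        target_entity_type STRING,
        date_time_utc TIMESTAMP,
        event_id STRING,
        event_data MAP<STRING, STRING>
      )
      PARTITIONED BY (windows_event_time_bin TIMESTAMP)
      STORED AS PARQUET
    """)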


However, when I create a partitioned table and write data using the
following:

    import org.apache.spark.sql.SaveMode
    import hiveContext.implicits._  // for rdd.toDF(); assumes a HiveContext named hiveContext is in scope

    hiveWindowsEvents.foreachRDD( rdd => {
      val eventsDataFrame = rdd.toDF()
      eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
    })

The data is written as though the table is not partitioned: everything lands
directly under /user/hive/warehouse/windows_event/ (e.g. file.gz.parquet)
rather than in windows_event_time_bin=... partition directories.  Because
the data does not follow the partition layout, it is not accessible through
the table (and is not partitioned).
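For comparison, the variant I would have expected to need declares the
partition column explicitly via partitionBy on the DataFrameWriter.  This is
just a sketch - I have not verified how it interacts with the pre-existing
Hive table:

    hiveWindowsEvents.foreachRDD( rdd => {
      val eventsDataFrame = rdd.toDF()
      // Sketch: declare the partition column explicitly on the writer.
      eventsDataFrame.write
        .mode(SaveMode.Append)
        .partitionBy("windows_event_time_bin")
        .saveAsTable("windows_event")
    })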

Is there a straightforward way to write to partitioned tables using Spark
SQL?  I understand that read performance for partitioned data is far better
- are there other optimizations that would serve better than partitioning?

Regards,

Bryan Jeffrey
