Susan,

I did give that a shot -- I'm seeing a number of oddities:

(1) 'partitionBy' appears to accept only lowercase alphanumeric field names.
It works for 'machinename', but not for 'machineName' or 'machine_name' (see
the sketch below).
(2) When partitioning with map columns included in the data, I hit odd
string conversion issues.
(3) When partitioning without the map columns, I see frequent out-of-memory
errors.
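
As a workaround for (1), here's a rough sketch of what I'm trying: renaming
the partition column to an all-lowercase name before the write.  The rename
is my own assumption, not a confirmed fix:

    import org.apache.spark.sql.SaveMode

    hiveWindowsEvents.foreachRDD( rdd => {
      // Rename the underscore-separated column to a name partitionBy accepts.
      val eventsDataFrame = rdd.toDF()
        .withColumnRenamed("windows_event_time_bin", "windowseventtimebin")
      eventsDataFrame.write
        .mode(SaveMode.Append)
        .partitionBy("windowseventtimebin")
        .saveAsTable("windows_event")
    })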

I'll update this thread when I've got a more concrete example of the problems.

Regards,

Bryan Jeffrey



On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com> wrote:

> Have you tried partitionBy?
>
> Something like
>
> hiveWindowsEvents.foreachRDD( rdd => {
>       val eventsDataFrame = rdd.toDF()
>       eventsDataFrame.write.mode(SaveMode.Append)
>         .partitionBy("windows_event_time_bin").saveAsTable("windows_event")
>     })
>
>
>
> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
> wrote:
>
>> Hello.
>>
>> I am trying to get a simple solution working using Spark SQL.  I am
>> writing streaming data to persistent tables using a HiveContext.  Writing
>> to a persistent non-partitioned table works well - I update the table from
>> Spark Streaming, and the output is available via Hive Thrift/JDBC.
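>>
>> For reference, the context setup looks roughly like this.  This is an
>> assumed sketch - sc is the existing SparkContext, and the implicits import
>> is what makes rdd.toDF() below compile:
>>
>>     // Assumed setup (not shown elsewhere in this mail): a HiveContext over
>>     // the existing SparkContext, plus the implicits enabling rdd.toDF().
>>     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>     import hiveContext.implicits._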
>>
>> I create a table that looks like the following:
>>
>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>> +--------------------------+---------------------+----------+
>> |         col_name         |      data_type      | comment  |
>> +--------------------------+---------------------+----------+
>> | target_entity            | string              | NULL     |
>> | target_entity_type       | string              | NULL     |
>> | date_time_utc            | timestamp           | NULL     |
>> | machine_ip               | string              | NULL     |
>> | event_id                 | string              | NULL     |
>> | event_data               | map<string,string>  | NULL     |
>> | description              | string              | NULL     |
>> | event_record_id          | string              | NULL     |
>> | level                    | string              | NULL     |
>> | machine_name             | string              | NULL     |
>> | sequence_number          | string              | NULL     |
>> | source                   | string              | NULL     |
>> | source_machine_name      | string              | NULL     |
>> | task_category            | string              | NULL     |
>> | user                     | string              | NULL     |
>> | additional_data          | map<string,string>  | NULL     |
>> | windows_event_time_bin   | timestamp           | NULL     |
>> | # Partition Information  |                     |          |
>> | # col_name               | data_type           | comment  |
>> | windows_event_time_bin   | timestamp           | NULL     |
>> +--------------------------+---------------------+----------+
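>>
>> For completeness, here is a DDL sketch matching the schema above.  It is
>> reconstructed from the describe output - the actual statement I ran may
>> differ, and STORED AS PARQUET is an assumption based on the .parquet files
>> mentioned below:
>>
>>     hiveContext.sql("""
>>       CREATE TABLE windows_event (
>>         target_entity STRING,
>>         target_entity_type STRING,
>>         date_time_utc TIMESTAMP,
>>         machine_ip STRING,
>>         event_id STRING,
>>         event_data MAP<STRING,STRING>,
>>         description STRING,
>>         event_record_id STRING,
>>         level STRING,
>>         machine_name STRING,
>>         sequence_number STRING,
>>         source STRING,
>>         source_machine_name STRING,
>>         task_category STRING,
>>         `user` STRING, -- backquoted since user can be a reserved word
>>         additional_data MAP<STRING,STRING>
>>       )
>>       PARTITIONED BY (windows_event_time_bin TIMESTAMP)
>>       STORED AS PARQUET
>>     """)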
>>
>>
>> However, when I create a partitioned table and write data using the
>> following:
>>
>>     hiveWindowsEvents.foreachRDD( rdd => {
>>       val eventsDataFrame = rdd.toDF()
>>       eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>     })
>>
>> The data is written as though the table were not partitioned (so everything
>> is written to /user/hive/warehouse/windows_event/file.gz.parquet).  Because
>> the data does not follow the partition layout, it is not accessible (and
>> not partitioned).
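>>
>> For comparison, a working partitioned write should produce per-partition
>> subdirectories rather than bare files - roughly the following, where the
>> encoding of the timestamp value is my assumption:
>>
>>     /user/hive/warehouse/windows_event/windows_event_time_bin=<value>/part-....gz.parquet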
>>
>> Is there a straightforward way to write to partitioned tables using Spark
>> SQL?  I understand that read performance for partitioned data is far
>> better - are there other performance optimizations that might serve better
>> than partitioning?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>
>
