Deenar,

This worked perfectly - I moved the metastore to SQL Server and things are working well.

Regards,

Bryan Jeffrey

On Thu, Oct 29, 2015 at 8:14 AM, Deenar Toraskar <deenar.toras...@gmail.com>
wrote:

> Hi Bryan
>
> For your use case you don't need multiple metastores. The default
> metastore uses embedded Derby
> <https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-Local/EmbeddedMetastoreDatabase(Derby)>,
> which cannot be shared amongst multiple processes. Just switch to a
> metastore database that supports multiple connections, e.g. networked Derby or MySQL.
> See https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
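>
> For example, a minimal hive-site.xml sketch along these lines on each
> driver's classpath points both processes at the same MySQL-backed
> metastore (the host, database name, and credentials below are placeholders):
>
>   <configuration>
>     <!-- Shared, networked metastore database; placeholder connection details -->
>     <property>
>       <name>javax.jdo.option.ConnectionURL</name>
>       <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
>     </property>
>     <property>
>       <name>javax.jdo.option.ConnectionDriverName</name>
>       <value>com.mysql.jdbc.Driver</value>
>     </property>
>     <property>
>       <name>javax.jdo.option.ConnectionUserName</name>
>       <value>hiveuser</value>
>     </property>
>     <property>
>       <name>javax.jdo.option.ConnectionPassword</name>
>       <value>hivepassword</value>
>     </property>
>   </configuration>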
>
> Deenar
>
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
> On 29 October 2015 at 00:56, Bryan <bryan.jeff...@gmail.com> wrote:
>
>> Yana,
>>
>> My basic use case is that I want to process streaming data and publish
>> it to a persistent Spark table. After that I want to make the published
>> data (results) available via JDBC and Spark SQL to drive a web API. That
>> would seem to require two drivers starting separate HiveContexts (one for
>> Spark SQL/JDBC, one for streaming).
>>
>> Is there a way to share a HiveContext between the driver for the Thrift
>> Spark SQL instance and the streaming Spark driver? Or is there a better
>> way to do this?
>>
>> An alternate option might be to create the table in two separate
>> metastores and simply use the same storage location for the data. That
>> seems very hacky though, and likely to result in maintenance issues.
>>
>> Regards,
>>
>> Bryan Jeffrey
>> ------------------------------
>> From: Yana Kadiyska <yana.kadiy...@gmail.com>
>> Sent: 10/28/2015 8:32 PM
>> To: Bryan Jeffrey <bryan.jeff...@gmail.com>
>> Cc: Susan Zhang <suchenz...@gmail.com>; user <user@spark.apache.org>
>> Subject: Re: Spark -- Writing to Partitioned Persistent Table
>>
>> For this issue in particular (ERROR XSDB6: Another instance of Derby
>> may have already booted the database /spark/spark-1.4.1/metastore_db) --
>> I think it depends on where you start your application and the Hive Thrift
>> server from. I've run into a similar issue running a driver app first, which
>> would create a directory called metastore_db. If I then try to start the
>> Spark shell from the same directory, I see this exception. So it is like
>> SPARK-9776, but it's not so much that the two are in the same process (as
>> the bug resolution states); rather, I think you can't run two drivers that
>> start a HiveContext from the same directory.
>>
>>
>> On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> One issue I'm seeing is that I start the Thrift server (for JDBC access)
>>> via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
>>> spark://master:7077 --hiveconf "spark.cores.max=2"
>>>
>>> After about 40 seconds the Thrift server is started and available on
>>> default port 10000.
>>>
>>> I then submit my application - and the application throws the following
>>> error:
>>>
>>> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
>>> with class loader
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
>>> see the next exception for details.
>>>         at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>>         at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>>         ... 86 more
>>> Caused by: java.sql.SQLException: Another instance of Derby may have
>>> already booted the database /spark/spark-1.4.1/metastore_db.
>>>         at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>>>         at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
>>>         at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
>>>         at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>>>         ... 83 more
>>> Caused by: ERROR XSDB6: Another instance of Derby may have already
>>> booted the database /spark/spark-1.4.1/metastore_db.
>>>
>>> This also happens if I do the opposite (submit the application first,
>>> and then start the Thrift server).
>>>
>>> It looks similar to the following issue -- but not quite the same:
>>> https://issues.apache.org/jira/browse/SPARK-9776
>>>
>>> It seems like this set of steps works fine if the metastore database is
>>> not yet created - but once it exists this happens every time.  Is this a
>>> known issue? Is there a workaround?
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
>>> wrote:
>>>
>>>> Susan,
>>>>
>>>> I did give that a shot -- I'm seeing a number of oddities:
>>>>
>>>> (1) 'partitionBy' appears to only accept lowercase alphanumeric
>>>> field names.  It will work for 'machinename', but not 'machineName' or
>>>> 'machine_name' (a possible workaround is sketched below).
>>>> (2) When partitioning with maps included in the data I get odd string
>>>> conversion issues.
>>>> (3) When partitioning without maps I see frequent out-of-memory issues.
>>>>
>>>> I'll update this email when I've got a more concrete example of
>>>> problems.
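>>>>
>>>> For (1), a hedged sketch of the workaround I'm experimenting with:
>>>> normalize the DataFrame's column names to lowercase alphanumerics before
>>>> writing. The partition column name below is just 'windows_event_time_bin'
>>>> with the underscores stripped, and it assumes 'import
>>>> hiveContext.implicits._' is in scope for toDF():
>>>>
>>>> import org.apache.spark.sql.SaveMode
>>>>
>>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>>   val eventsDataFrame = rdd.toDF()
>>>>   // Rename every column to lowercase alphanumerics only, per the
>>>>   // restriction observed in (1), e.g. machineName -> machinename.
>>>>   val normalized = eventsDataFrame.toDF(
>>>>     eventsDataFrame.columns.map(_.toLowerCase.replaceAll("[^a-z0-9]", "")): _*)
>>>>   normalized.write.mode(SaveMode.Append)
>>>>     .partitionBy("windowseventtimebin")
>>>>     .saveAsTable("windows_event")
>>>> })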
>>>>
>>>> Regards,
>>>>
>>>> Bryan Jeffrey
>>>>
>>>>
>>>>
>>>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com>
>>>> wrote:
>>>>
>>>>> Have you tried partitionBy?
>>>>>
>>>>> Something like
>>>>>
>>>>>     hiveWindowsEvents.foreachRDD( rdd => {
>>>>>       val eventsDataFrame = rdd.toDF()
>>>>>       eventsDataFrame.write.mode(SaveMode.Append)
>>>>>         .partitionBy("windows_event_time_bin")
>>>>>         .saveAsTable("windows_event")
>>>>>     })
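>>>>>
>>>>> With partitionBy, the appended data should land under per-partition
>>>>> subdirectories of the table location rather than as flat files, roughly
>>>>> like this (path and file names are illustrative):
>>>>>
>>>>>   /user/hive/warehouse/windows_event/windows_event_time_bin=<bin value>/part-r-00000.gz.parquet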
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <
>>>>> bryan.jeff...@gmail.com> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I am trying to get a simple solution working with Spark SQL.  I am
>>>>>> writing streaming data to persistent tables using a HiveContext.  Writing
>>>>>> to a persistent non-partitioned table works well - I update the table
>>>>>> using Spark streaming, and the output is available via Hive Thrift/JDBC.
>>>>>>
>>>>>> I create a table that looks like the following:
>>>>>>
>>>>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>>>>> +--------------------------+---------------------+----------+
>>>>>> |         col_name         |      data_type      | comment  |
>>>>>> +--------------------------+---------------------+----------+
>>>>>> | target_entity            | string              | NULL     |
>>>>>> | target_entity_type       | string              | NULL     |
>>>>>> | date_time_utc            | timestamp           | NULL     |
>>>>>> | machine_ip               | string              | NULL     |
>>>>>> | event_id                 | string              | NULL     |
>>>>>> | event_data               | map<string,string>  | NULL     |
>>>>>> | description              | string              | NULL     |
>>>>>> | event_record_id          | string              | NULL     |
>>>>>> | level                    | string              | NULL     |
>>>>>> | machine_name             | string              | NULL     |
>>>>>> | sequence_number          | string              | NULL     |
>>>>>> | source                   | string              | NULL     |
>>>>>> | source_machine_name      | string              | NULL     |
>>>>>> | task_category            | string              | NULL     |
>>>>>> | user                     | string              | NULL     |
>>>>>> | additional_data          | map<string,string>  | NULL     |
>>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>>> | # Partition Information  |                     |          |
>>>>>> | # col_name               | data_type           | comment  |
>>>>>> | windows_event_time_bin   | timestamp           | NULL     |
>>>>>> +--------------------------+---------------------+----------+
>>>>>>
>>>>>>
>>>>>> However, when I create a partitioned table and write data using the
>>>>>> following:
>>>>>>
>>>>>>     hiveWindowsEvents.foreachRDD( rdd => {
>>>>>>       val eventsDataFrame = rdd.toDF()
>>>>>>       eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>>>>>     })
>>>>>>
>>>>>> The data is written as though the table is not partitioned (so
>>>>>> everything is written to
>>>>>> /user/hive/warehouse/windows_event/file.gz.parquet).  Because the data
>>>>>> does not follow the partition schema, it is not accessible (and not
>>>>>> partitioned).
>>>>>>
>>>>>> Is there a straightforward way to write to partitioned tables using
>>>>>> Spark SQL?  I understand that the read performance for partitioned data
>>>>>> is far better - are there other performance improvements that might be
>>>>>> better to use instead of partitioning?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Bryan Jeffrey
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
