All,

One issue I'm seeing is that I start the thrift server (for JDBC access) via the following:

    /spark/spark-1.4.1/sbin/start-thriftserver.sh --master spark://master:7077 --hiveconf "spark.cores.max=2"
After about 40 seconds the Thrift server is started and available on the default port 10000. I then submit my application, and it throws the following error:

    Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721, see the next exception for details.
            at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
            at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
            ... 86 more
    Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db.
            at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
            at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
            at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
            at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
            ... 83 more
    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db.

This also happens if I do the opposite (submit the application first, and then start the thrift server).

It looks similar to the following issue -- but not quite the same: https://issues.apache.org/jira/browse/SPARK-9776

It seems like this set of steps works fine if the metastore database has not yet been created, but once it exists this happens every time. Is this a known issue? Is there a workaround?

Regards,

Bryan Jeffrey
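For context, a minimal sketch of the application side that can run into this, assuming the application creates its own HiveContext and no hive-site.xml points it at an external metastore (the object name and query below are illustrative, not taken from the original post):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Illustrative skeleton of a submitted application (Spark 1.4.x).
    object StreamingApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("streaming-app"))

        // With no external metastore configured, constructing a HiveContext
        // boots an embedded Derby database under ./metastore_db of the
        // driver's working directory. If the driver runs from
        // /spark/spark-1.4.1, that is the same metastore_db the Thrift
        // server has already opened.
        val hiveContext = new HiveContext(sc)

        hiveContext.sql("show tables").collect().foreach(println)

        sc.stop()
      }
    }

Embedded Derby only allows one JVM to hold a database open at a time, so whichever of the two processes (Thrift server or application) starts second fails with ERROR XSDB6. The usual way to let several processes share a metastore is to point hive-site.xml at a standalone metastore database (e.g. MySQL or PostgreSQL) instead of embedded Derby.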
On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Susan,
>
> I did give that a shot -- I'm seeing a number of oddities:
>
> (1) partitionBy appears to accept only alphanumeric, lower-case field names.
> It will work for 'machinename', but not for 'machineName' or 'machine_name'.
> (2) When partitioning with maps included in the data I get odd string
> conversion issues.
> (3) When partitioning without maps I see frequent out-of-memory issues.
>
> I'll update this email when I've got a more concrete example of the problems.
>
> Regards,
>
> Bryan Jeffrey
>
> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com> wrote:
>
>> Have you tried partitionBy?
>>
>> Something like
>>
>> hiveWindowsEvents.foreachRDD( rdd => {
>>   val eventsDataFrame = rdd.toDF()
>>   eventsDataFrame.write.mode(SaveMode.Append)
>>     .partitionBy("windows_event_time_bin")
>>     .saveAsTable("windows_event")
>> })
>>
>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
>>
>>> Hello.
>>>
>>> I am working to get a simple solution working using Spark SQL. I am
>>> writing streaming data to persistent tables using a HiveContext. Writing
>>> to a persistent non-partitioned table works well: I update the table using
>>> Spark Streaming, and the output is available via Hive Thrift/JDBC.
>>>
>>> I create a table that looks like the following:
>>>
>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>> describe windows_event;
>>> +--------------------------+---------------------+----------+
>>> |         col_name         |      data_type      | comment  |
>>> +--------------------------+---------------------+----------+
>>> | target_entity            | string              | NULL     |
>>> | target_entity_type       | string              | NULL     |
>>> | date_time_utc            | timestamp           | NULL     |
>>> | machine_ip               | string              | NULL     |
>>> | event_id                 | string              | NULL     |
>>> | event_data               | map<string,string>  | NULL     |
>>> | description              | string              | NULL     |
>>> | event_record_id          | string              | NULL     |
>>> | level                    | string              | NULL     |
>>> | machine_name             | string              | NULL     |
>>> | sequence_number          | string              | NULL     |
>>> | source                   | string              | NULL     |
>>> | source_machine_name      | string              | NULL     |
>>> | task_category            | string              | NULL     |
>>> | user                     | string              | NULL     |
>>> | additional_data          | map<string,string>  | NULL     |
>>> | windows_event_time_bin   | timestamp           | NULL     |
>>> | # Partition Information  |                     |          |
>>> | # col_name               | data_type           | comment  |
>>> | windows_event_time_bin   | timestamp           | NULL     |
>>> +--------------------------+---------------------+----------+
>>>
>>> However, when I create a partitioned table and write data using the
>>> following:
>>>
>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>   val eventsDataFrame = rdd.toDF()
>>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>> })
>>>
>>> the data is written as though the table were not partitioned (everything
>>> is written to /user/hive/warehouse/windows_event/file.gz.parquet). Because
>>> the data does not follow the partition scheme, it is not accessible (and
>>> not partitioned).
>>>
>>> Is there a straightforward way to write to partitioned tables using
>>> Spark SQL? I understand that read performance for partitioned data is far
>>> better - are there other performance improvements that might be better to
>>> use instead of partitioning?
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
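Putting the pieces of this thread together, a minimal end-to-end sketch of the partitioned write Susan suggested might look like the following. The cut-down case class, function name, and column list are illustrative assumptions (the real table has many more columns); the key detail is calling partitionBy on the DataFrameWriter before saveAsTable, using a column name that matches the table's partition column:

    import java.sql.Timestamp

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.dstream.DStream

    // Illustrative, cut-down event type; the real schema has many more fields.
    case class WindowsEvent(targetEntity: String, machineName: String, windowsEventTimeBin: Timestamp)

    def writePartitioned(sc: SparkContext, hiveWindowsEvents: DStream[WindowsEvent]): Unit = {
      // One HiveContext on the driver; the foreachRDD body also runs on the
      // driver, so it can be reused across batches.
      val hiveContext = new HiveContext(sc)
      import hiveContext.implicits._

      hiveWindowsEvents.foreachRDD { rdd =>
        // Name the columns explicitly so they match the Hive table's
        // lower-case, underscore-separated column names.
        val eventsDataFrame = rdd.toDF("target_entity", "machine_name", "windows_event_time_bin")

        // partitionBy controls the on-disk layout: data lands under
        // .../windows_event/windows_event_time_bin=<value>/ rather than in a
        // single flat directory of files.
        eventsDataFrame.write
          .mode(SaveMode.Append)
          .partitionBy("windows_event_time_bin")
          .saveAsTable("windows_event")
      }
    }

Note that Bryan's follow-up above reports partitionBy misbehaving with non-lower-case (and possibly underscore-containing) column names and with map-typed columns, so the column naming here is an untested assumption rather than a confirmed fix.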