I personally use Scala to do something similar. For example, here I extract
data from an Oracle table and store it in an ORC table in Hive. This is
compiled via sbt and run with spark-submit.
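
For reference, a minimal build.sbt along the lines I use is below (the Spark
and Scala versions and the paths are illustrative, adjust them to your
environment):

name := "ETL_scratchpad_dummy"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.6.1" % "provided"
)

The Oracle JDBC driver is not on the Spark classpath by default, so I pass it
at submit time with something like:

spark-submit --class ETL_scratchpad_dummy --jars /path/to/ojdbc6.jar \
  target/scala-2.10/etl_scratchpad_dummy_2.10-1.0.jar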

It is similar to your code but in Scala. Note that I do not enclose my
column names in double quotes.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object ETL_scratchpad_dummy {
  def main(args: Array[String]) {
  val conf = new SparkConf().
               setAppName("ETL_scratchpad_dummy").
               set("spark.driver.allowMultipleContexts", "true")
  val sc = new SparkContext(conf)
  // Create a single SQL context based on HiveContext so we can talk to the Hive metastore
  val sqlContext = new HiveContext(sc)
  import sqlContext.implicits._
  println("\nStarted at")
  sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
  sqlContext.sql("use oraclehadoop")
  var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
  var _username : String = "scratchpad"
  var _password : String = "xxxxxxxx"

  // Get data from Oracle table scratchpad.dummy
  val d = sqlContext.read.format("jdbc").options(
    Map("url" -> _ORACLEserver,
        "dbtable" -> """(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
                         to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
                         RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)""",
        "user" -> _username,
        "password" -> _password)).load()

   d.registerTempTable("tmp")
  //
  // Need to create and populate target ORC table oraclehadoop.dummy
  //
  HiveContext.sql("use oraclehadoop")
  //
  // Drop and create table dummy
  //
  HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
  var sqltext : String = ""
  sqltext = """
  CREATE TABLE oraclehadoop.dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
  )
  CLUSTERED BY (ID) INTO 256 BUCKETS
  STORED AS ORC
  TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="10000" )
  """
  sqlContext.sql(sqltext)
  //
  // Put data in Hive table. Clean up is already done
  //
  sqltext = """
  INSERT INTO TABLE oraclehadoop.dummy
  SELECT
          ID
        , CLUSTERED
        , SCATTERED
        , RANDOMISED
        , RANDOM_STRING
        , SMALL_VC
        , PADDING
  FROM tmp
  """
  sqlContext.sql(sqltext)
  println ("\nFinished at"); sqlContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
").collect.foreach(println)
  sys.exit()
 }
}
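
On the double-quoting itself: as far as I know, when Spark builds the query
for a JDBC source it asks a JdbcDialect how to quote identifiers, and the
default dialect wraps column names in double quotes, which looks like what you
are seeing in your Spark SQL log. One avenue that may be worth testing (a
sketch only, I have not run it against the SAS driver, and the jdbc:sasiom URL
prefix is a placeholder you would need to match to your actual connection URL)
is to register a custom dialect that leaves identifiers unquoted:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for the SAS JDBC driver: leave column names unquoted.
// canHandle must match the JDBC URL your connection actually uses.
object SasNoQuoteDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sasiom")
  override def quoteIdentifier(colName: String): String = colName
}

// Register the dialect before creating the JDBC DataFrame
JdbcDialects.registerDialect(SasNoQuoteDialect)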

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Mich,
>
> Thanks for the response. If you look at my programs, I am not writing my
> queries with the column names enclosed in a pair of "". The driver in my
> Spark program is generating such a query, with the column names in "",
> which I do not want. On the other hand, I am using the same driver in my
> pure Java program (attached), and there the same driver generates a proper
> SQL query without the "".
>
> Pure Java log:
>
> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
> 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
> Spark SQL log:
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
> Please find the complete program and full logs attached in the thread below.
> Thank you.
>
> Regards,
> Ajay
>
>
> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Assuming I understood your question: in Spark SQL (that is, when you log in
>> to the Spark SQL shell with something like spark-sql --master
>> spark://<HOST_NAME>:7077) you do not need double quotes around column names
>> for the SQL to work.
>>
>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>> hello from Mich
>>
>> Anything between a pair of "" will be interpreted as literal text, NOT as a
>> column name.
>>
>> In Spark SQL you do not need double quotes. So simply
>>
>> spark-sql> select prod_id, cust_id from sales limit 2;
>> 17      28017
>> 18      10419
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi again, has anyone in this group tried to access a SAS dataset through
>>> Spark SQL? Thank you.
>>>
>>> Regards,
>>> Ajay
>>>
>>>
>>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hi Spark Users,
>>>>
>>>> I hope everyone here is doing great.
>>>>
>>>> I am trying to read data from SAS through Spark SQL and write it into
>>>> HDFS. Initially, I started with a pure Java program; please find the
>>>> program and logs in the attached file sas_pure_java.txt. My program ran
>>>> successfully and returned the data from SAS. Please note the highlighted
>>>> part in the log.
>>>>
>>>> My SAS dataset has 4 rows.
>>>>
>>>> The program ran successfully, so my output is:
>>>>
>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>>>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>
>>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next);
>>>> time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>>
>>>> 1,'2016-01-01','2016-01-31'
>>>>
>>>> 2,'2016-02-01','2016-02-29'
>>>>
>>>> 3,'2016-03-01','2016-03-31'
>>>>
>>>> 4,'2016-04-01','2016-04-30'
>>>>
>>>>
>>>> Please find the full logs attached to this email in file
>>>> sas_pure_java.txt.
>>>>
>>>> _______________________
>>>>
>>>>
>>>> Now I am trying to do the same via Spark SQL. Please find my program
>>>> and logs attached to this email in file sas_spark_sql.txt .
>>>>
>>>> The connection to the SAS dataset is established successfully, but please
>>>> note the highlighted log below.
>>>>
>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>
>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result
>>>> set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>> Please find the full logs attached to this email in file
>>>>  sas_spark_sql.txt
>>>>
>>>> I am using the same driver in both the pure Java and Spark SQL programs,
>>>> but the query generated in Spark SQL has quotes around the column names
>>>> (highlighted above).
>>>> So my resulting output for that query is like this:
>>>>
>>>> +-----+--------+------+
>>>> |  _c0|     _c1|   _c2|
>>>> +-----+--------+------+
>>>> |SR_NO|start_dt|end_dt|
>>>> |SR_NO|start_dt|end_dt|
>>>> |SR_NO|start_dt|end_dt|
>>>> |SR_NO|start_dt|end_dt|
>>>> +-----+--------+------+
>>>>
>>>> Since both programs are using the same driver, com.sas.rio.MVADriver, the
>>>> expected output should be the same as my pure Java program's output. But
>>>> something else is happening behind the scenes.
>>>>
>>>> Any insights on this issue? Thanks for your time.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Ajay
>>>>
>>>
>>
