John Ferguson created SPARK-6666:
------------------------------------

             Summary: org.apache.spark.sql.jdbc.JDBCRDD  does not escape/quote 
column names
                 Key: SPARK-6666
                 URL: https://issues.apache.org/jira/browse/SPARK-6666
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
         Environment:  
            Reporter: John Ferguson
            Priority: Critical


Is there a way to have JDBC DataFrames use quoted/escaped column names?  Right 
now, it looks like it "sees" the names correctly in the schema created but does 
not escape them in the SQL it creates when they are not compliant:

org.apache.spark.sql.jdbc.JDBCRDD
....
private val columnList: String = {
val sb = new StringBuilder()
columns.foreach(x => sb.append(",").append(x))
if (sb.length == 0) "1" else sb.substring(1)
}


If you see value in this, I would take a shot at adding the quoting (escaping) 
of column names here.  If you don't do it, some drivers... like postgresql's 
will simply drop case all names when parsing the query.  As you can see in the 
TL;DR below that means they won't match the schema I am given.

TL;DR:
-------- 
I am able to connect to a Postgres database in the shell (with driver 
referenced):

   val jdbcDf = 
sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")

In fact when I run:

   jdbcDf.registerTempTable("sp500")
   val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI 
FROM sp500")

and

   val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share")))

The values come back as expected.  However, if I try:

   jdbcDf.show

Or if I try
   
   val all = sqlContext.sql("SELECT * FROM sp500")
   all.show

I get errors about column names not being found.  In fact the error includes a 
mention of column names all lower cased.  For now I will change my schema to be 
more restrictive.  Right now it is, per a Stack Overflow poster, not ANSI 
compliant by doing things that are allowed by ""'s in pgsql, MySQL and 
SQLServer.  BTW, our users are giving us tables like this... because various 
tools they already use support non-compliant names.  In fact, this is mild 
compared to what we've had to support.

Currently the schema in question uses mixed case, quoted names with special 
characters and spaces:

CREATE TABLE sp500
(
"Symbol" text,
"Name" text,
"Sector" text,
"Price" double precision,
"Dividend Yield" double precision,
"Price/Earnings" double precision,
"Earnings/Share" double precision,
"Book Value" double precision,
"52 week low" double precision,
"52 week high" double precision,
"Market Cap" double precision,
"EBITDA" double precision,
"Price/Sales" double precision,
"Price/Book" double precision,
"SEC Filings" text
) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to