[ 
https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598655#comment-14598655
 ] 

Justin McCarthy commented on SPARK-6666:
----------------------------------------

The lack of quoting also leads to errors when a column name collides with a 
SQL reserved word.

Many common and useful words are reserved keywords yet are frequently used as 
column identifiers: 
http://www.postgresql.org/docs/9.0/static/sql-keywords-appendix.html

Here's SQL-99's take on quoted identifiers: 
http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier

The fix could be as simple as:

{code:title=JdbcRDD.scala}
private val columnList: String =
  if (columns.isEmpty) "1"
  else "\"" + columns.mkString("\",\"") + "\""
{code}
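A slightly more defensive variant would also handle the one character that is special inside a delimited identifier: per the SQL-99 grammar linked above, an embedded double quote is escaped by doubling it. A minimal sketch, assuming PostgreSQL-style double-quote delimiters (`quoteIdentifier` and the standalone `columnList` below are illustrative names, not existing Spark API):

```scala
// Sketch: build a SQL-99 delimited-identifier column list.
// An embedded " inside a delimited identifier is escaped by doubling it ("").
def quoteIdentifier(name: String): String =
  "\"" + name.replace("\"", "\"\"") + "\""

def columnList(columns: Seq[String]): String =
  if (columns.isEmpty) "1" // no columns requested: select the constant 1
  else columns.map(quoteIdentifier).mkString(",")
```

For example, `columnList(Seq("Earnings/Share", "52 week low"))` yields `"Earnings/Share","52 week low"`, which PostgreSQL will match case-sensitively against the schema.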

> org.apache.spark.sql.jdbc.JDBCRDD  does not escape/quote column names
> ---------------------------------------------------------------------
>
>                 Key: SPARK-6666
>                 URL: https://issues.apache.org/jira/browse/SPARK-6666
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment:  
>            Reporter: John Ferguson
>            Priority: Critical
>
> Is there a way to have JDBC DataFrames use quoted/escaped column names?  
> Right now, it "sees" the names correctly in the schema it creates, but does 
> not escape them in the SQL it generates when they are not compliant:
> org.apache.spark.sql.jdbc.JDBCRDD
> ....
> private val columnList: String = {
>   val sb = new StringBuilder()
>   columns.foreach(x => sb.append(",").append(x))
>   if (sb.length == 0) "1" else sb.substring(1)
> }
> If you see value in this, I would take a shot at adding the quoting 
> (escaping) of column names here.  If the quoting is not done, some drivers, 
> like postgresql's, will simply lower-case all names when parsing the query.  
> As the TL;DR below shows, that means they won't match the schema I am given.
> TL;DR:
> -------- 
> I am able to connect to a Postgres database in the shell (with driver 
> referenced):
>    val jdbcDf = 
> sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")
> In fact when I run:
>    jdbcDf.registerTempTable("sp500")
>    val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI 
> FROM sp500")
> and
>    val avgEPSProg = jdbcDf.agg(avg(jdbcDf.col("Earnings/Share")))
> The values come back as expected.  However, if I try:
>    jdbcDf.show
> Or if I try
>    
>    val all = sqlContext.sql("SELECT * FROM sp500")
>    all.show
> I get errors about column names not being found; the error even shows the 
> column names all lower-cased.  For now I will change my schema to be more 
> restrictive.  Right now it is, per a Stack Overflow poster, not ANSI 
> compliant: it does things that are only allowed via ""-quoting in pgsql, 
> MySQL and SQL Server.  BTW, our users are giving us tables like this because 
> various tools they already use support non-compliant names.  In fact, this is 
> mild compared to what we've had to support.
> Currently the schema in question uses mixed case, quoted names with special 
> characters and spaces:
> CREATE TABLE sp500
> (
> "Symbol" text,
> "Name" text,
> "Sector" text,
> "Price" double precision,
> "Dividend Yield" double precision,
> "Price/Earnings" double precision,
> "Earnings/Share" double precision,
> "Book Value" double precision,
> "52 week low" double precision,
> "52 week high" double precision,
> "Market Cap" double precision,
> "EBITDA" double precision,
> "Price/Sales" double precision,
> "Price/Book" double precision,
> "SEC Filings" text
> ) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
