[
https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598655#comment-14598655
]
Justin McCarthy commented on SPARK-6666:
----------------------------------------
The lack of quoting also leads to errors when a column name collides with a
SQL reserved word. Many very common and useful words are frequently used as
column identifiers:
http://www.postgresql.org/docs/9.0/static/sql-keywords-appendix.html
Here's SQL-99's take on quoted identifiers:
http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier
Fix could be as simple as:
{code:title=JdbcRDD.scala}
private val columnList: String =
  if (columns.isEmpty) "1"
  else "\"" + columns.mkString("\",\"") + "\""
{code}
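A slightly more defensive sketch of the same idea (an illustration, not the actual Spark patch; {{quoteIdentifier}} is a hypothetical helper name) would also double any embedded double-quote characters, which is how SQL-99 delimited identifiers escape a literal {{"}}:

```scala
object QuoteColumns {
  // Quote an identifier per SQL-99: wrap in double quotes and
  // double any embedded double-quote characters.
  def quoteIdentifier(name: String): String =
    "\"" + name.replace("\"", "\"\"") + "\""

  // Mirrors the shape of JDBCRDD's columnList: "1" when no
  // columns are projected, otherwise a comma-separated list.
  def columnList(columns: Seq[String]): String =
    if (columns.isEmpty) "1"
    else columns.map(quoteIdentifier).mkString(",")

  def main(args: Array[String]): Unit = {
    println(columnList(Seq("Earnings/Share", "52 week low")))
    println(columnList(Nil))
  }
}
```

Note that the correct quote character varies by database (e.g. backticks in MySQL by default); a JDBC driver reports its own via {{java.sql.DatabaseMetaData.getIdentifierQuoteString}}.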
> org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
> ---------------------------------------------------------------------
>
> Key: SPARK-6666
> URL: https://issues.apache.org/jira/browse/SPARK-6666
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Environment:
> Reporter: John Ferguson
> Priority: Critical
>
> Is there a way to have JDBC DataFrames use quoted/escaped column names?
> Right now it appears to read the names correctly into the schema it creates,
> but it does not escape them in the SQL it generates when they are not compliant:
> org.apache.spark.sql.jdbc.JDBCRDD
> ....
> private val columnList: String = {
>   val sb = new StringBuilder()
>   columns.foreach(x => sb.append(",").append(x))
>   if (sb.length == 0) "1" else sb.substring(1)
> }
> If you see value in this, I would take a shot at adding the quoting
> (escaping) of column names here. Without it, some drivers, such as
> PostgreSQL's, will fold all unquoted names to lower case when parsing the
> query, so (as the TL;DR below shows) they no longer match the schema I am given.
> TL;DR:
> --------
> I am able to connect to a Postgres database in the shell (with driver
> referenced):
> val jdbcDf =
> sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")
> In fact when I run:
> jdbcDf.registerTempTable("sp500")
> val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI
> FROM sp500")
> and
> val avgEPSProg = jdbcDf.agg(avg(jdbcDf.col("Earnings/Share")))
> The values come back as expected. However, if I try:
> jdbcDf.show
> Or if I try
>
> val all = sqlContext.sql("SELECT * FROM sp500")
> all.show
> I get errors about column names not being found; the error even mentions the
> column names folded to lower case. For now I will change my schema to be
> more restrictive. Per a Stack Overflow poster, the current schema is not
> ANSI compliant: it relies on names that are only legal when double-quoted in
> PostgreSQL, MySQL, and SQL Server. BTW, our users are giving us tables like
> this because various tools they already use allow non-compliant names; in
> fact, this is mild compared to what we've had to support.
> Currently the schema in question uses mixed case, quoted names with special
> characters and spaces:
> CREATE TABLE sp500
> (
> "Symbol" text,
> "Name" text,
> "Sector" text,
> "Price" double precision,
> "Dividend Yield" double precision,
> "Price/Earnings" double precision,
> "Earnings/Share" double precision,
> "Book Value" double precision,
> "52 week low" double precision,
> "52 week high" double precision,
> "Market Cap" double precision,
> "EBITDA" double precision,
> "Price/Sales" double precision,
> "Price/Book" double precision,
> "SEC Filings" text
> )
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)