[ 
https://issues.apache.org/jira/browse/SPARK-11777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010819#comment-15010819
 ] 

Stanislav Hadjiiski commented on SPARK-11777:
---------------------------------------------

I would like to refer you to the Cloudera documentation:
http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/impala_hadoop.html#intro_metastore_unique_1
http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/impala_refresh.html

Here are some quotes as well:
{quote}
For DDL and DML issued through Hive, or changes made manually to files in HDFS, 
you still use the REFRESH statement (when new data files are added to existing 
tables) or the INVALIDATE METADATA statement (for entirely new tables, or after 
dropping a table, performing an HDFS rebalance operation, or deleting data 
files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the 
tables tracked by the metastore. If you know that only specific tables have 
been changed outside of Impala, you can issue REFRESH table_name for each 
affected table to only retrieve the latest metadata for those tables.
{quote}
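
In other words, after a change made outside of Impala one of two statements 
has to be issued against an impalad, depending on the kind of change. A rough 
sketch of the choice (the helper function is purely for illustration):
{quote}
// Sketch only: which Impala statement to issue after a change made
// outside of Impala (e.g. through Hive or Spark's saveAsTable).
def impalaCatalogStatement(table: String, tableIsNewOrDropped: Boolean): String =
  if (tableIsNewOrDropped)
    s"INVALIDATE METADATA $table"  // entirely new or dropped tables
  else
    s"REFRESH $table"              // new data files in an existing table
{quote}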

{quote}
A metadata update for an impalad instance is required if:
* A metadata change occurs.
* and the change is made through Hive.
* and the change is made to a database to which clients such as the Impala 
shell or ODBC directly connect.
{quote}

This is neither "Not an issue" nor a "question about Impala"; it is Spark not 
forcing the metastore to update. It works as expected when a table is created 
(the metastore's metadata is invalidated and Impala is then made aware of the 
new table), but it does not work as expected on overwrite. The refresh of the 
metastore should either be triggered automatically by the saveAsTable method 
(when overwriting data), or another way to do this should be provided to the 
user. Refreshing the metadata should be done right after the HDFS contents 
change; it should not be left for when the data is read from an external 
client (such as Impala or JDBC), which could be limited to SELECT queries only 
(for security reasons).
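
Today the only way to do this from the Spark side seems to be to open a plain 
JDBC connection to an impalad and issue the REFRESH manually after the write. 
A minimal sketch, assuming the Cloudera Impala JDBC driver is on the 
classpath; the driver class name, host, port and database below are 
placeholders and depend on the cluster:
{quote}
import java.sql.DriverManager

// ... after df.write.mode(SaveMode.Overwrite).saveAsTable("db_name.table") ...

// Placeholders: adjust driver class, host, port and database for the cluster.
Class.forName("com.cloudera.impala.jdbc41.Driver")
val conn = DriverManager.getConnection("jdbc:impala://impalad-host:21050/db_name")
try {
  conn.createStatement().execute("REFRESH db_name.table")
} finally {
  conn.close()
}
{quote}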

If you are still sure this is "not a Spark issue", could you please elaborate 
on what kind of issue it is then?

> HiveContext.saveAsTable does not update the metastore on overwrite
> ------------------------------------------------------------------
>
>                 Key: SPARK-11777
>                 URL: https://issues.apache.org/jira/browse/SPARK-11777
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Stanislav Hadjiiski
>
> Consider the following code:
> {quote}
> import org.apache.spark.sql.SaveMode
> case class Bean(cdata: String)
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
> val df = hiveContext.createDataFrame(Bean("test10") :: Bean("test20") :: Nil)
> df.write.mode(SaveMode.Overwrite).saveAsTable("db_name.table")
> {quote}
> This works as expected - if the table does not exist it is created, otherwise 
> its content is replaced. However, only in the first case is the data 
> accessible through Impala (i.e. outside of the Spark environment). To get it 
> working after overwriting, a
> {quote}
> REFRESH db_name.table
> {quote}
> should be issued in impala-shell. Neither
> {quote}
> hiveContext.refreshTable("db_name.table")
> {quote}
> nor
> {quote}
> hiveContext.sql("REFRESH TABLE db_name.table")
> {quote}
> fixes the issue. The same applies if the {{default}} database is used (and 
> {{db_name.}} is omitted everywhere)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
