[
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717737#comment-14717737
]
Yin Huai commented on SPARK-10287:
----------------------------------
We need to add the following release note: "The JSON data source will not
automatically load new files that are created by other applications (i.e., files
that are not inserted into the dataset through Spark SQL). [SPARK-10287]".
> After processing a query using JSON data, Spark SQL continuously refreshes
> metadata of the table
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-10287
> URL: https://issues.apache.org/jira/browse/SPARK-10287
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Assignee: Yin Huai
> Priority: Critical
> Labels: releasenotes
> Fix For: 1.5.1
>
>
> I have a partitioned JSON table with 1824 partitions.
> {code}
> val df = sqlContext.read.format("json").load("aPartitionedJsonData")
> val columnStr = df.schema.map(_.name).mkString(",")
> println(s"columns: $columnStr")
> val hash = df
>   .selectExpr(s"hash($columnStr) as hashValue")
>   .groupBy()
>   .sum("hashValue")
>   .head()
>   .getLong(0)
> {code}
> It looks like for JSON we refresh the metadata whenever buildScan is called. For a
> partitioned table, buildScan is called once per partition, so this table ends up
> being refreshed 1824 times.
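The fix described above amounts to listing the table's files once and reusing the
cached result, instead of re-listing on every buildScan call. A minimal sketch of
that caching pattern in plain Scala (this is not the actual Spark internals;
MetadataCache, scanWithoutCache, and scanWithCache are hypothetical stand-ins):

```scala
object MetadataCache {
  // Hypothetical stand-in for the expensive file-listing / metadata refresh.
  // The counter tracks how many times the refresh actually runs.
  private var refreshCount = 0

  // Before the fix: every per-partition scan triggers a fresh refresh.
  def scanWithoutCache(partition: Int): Int = {
    refreshCount += 1
    refreshCount
  }

  // After the fix: the listing is computed once (lazily) and reused for
  // every subsequent partition scan.
  private lazy val cachedListing: Int = {
    refreshCount += 1
    refreshCount
  }
  def scanWithCache(partition: Int): Int = cachedListing
}
```

With 1824 partitions, the first variant refreshes 1824 times while the second
refreshes once, which is the behavior change the release note documents.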
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]