Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/195#issuecomment-38324790
Hey Andre, I haven't had a chance to look at this closely, but I do really
like the idea of this functionality. Some thoughts:
- Right now it looks like this is exposing catalyst `DataType` to end
users. While this maybe okay, the original intention was to keep the catalyst
API purely internal to leave things flexible in the future. Therefore, we may
want to create wrapper classes in the public API that don't expose so many
details (for example dataTypes have ClassTags that talk about the underlying
JVM datatype that we use during execution). Another option (maybe in addition
to a programatic api) would be use use HiveQL strings to describe schema (there
is already a converter). At the very least we need to mark this method
experimental though. A task for me right now is to create a Java API for Spark
SQL, and part of this will be deciding how to expose schema to end users.
- I looked the test that is failing. I'm confused how
`InsertIntoParquetTable` is sneaking into the Hive test cases. Is something
getting mixed up during analysis? Are we just failing to reset some state in
the reset function of TestHive?
- The above aside, I think the issue here is the second parameter list to
`InsertIntoParquetTable`. It is correct to keep the spark context in the
second list so it is not consider part of the case class as far as equality is
concerned, but arguments in other parameter lists are not included in the
'magic' tree copying and so need to be added manually `override def
otherCopyArgs = sc :: Nil`.
- You may already handle these cases correctly, but a few things that came
to mind.
- What happens to existing RDDs when a table is appended to? Is there an
a case when recomputing will rescan the directory looking for files and cause
the RDD to change? Will this always happen?
- What happens if multiple users append to a parquet "table", stored on
HDFS for example, at the same time?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---