[GitHub] spark pull request: Spark parquet improvements

marmbrus Fri, 21 Mar 2014 14:13:30 -0700

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38324790
  
    Hey Andre, I haven't had a chance to look at this closely, but I do really 
like the idea of this functionality. Some thoughts:
     - Right now it looks like this is exposing catalyst `DataType` to end 
users.  While this maybe okay, the original intention was to keep the catalyst 
API purely internal to leave things flexible in the future.  Therefore, we may 
want to create wrapper classes in the public API that don't expose so many 
details (for example dataTypes have ClassTags that talk about the underlying 
JVM datatype that we use during execution).  Another option (maybe in addition 
to a programatic api) would be use use HiveQL strings to describe schema (there 
is already a converter).  At the very least we need to mark this method 
experimental though.  A task for me right now is to create a Java API for Spark 
SQL, and part of this will be deciding how to expose schema to end users.
     - I looked the test that is failing.  I'm confused how 
`InsertIntoParquetTable` is sneaking into the Hive test cases.  Is something 
getting mixed up during analysis?  Are we just failing to reset some state in 
the reset function of TestHive?
     - The above aside, I think the issue here is the second parameter list to 
`InsertIntoParquetTable`.  It is correct to keep the spark context in the 
second list so it is not consider part of the case class as far as equality is 
concerned, but arguments in other parameter lists are not included in the 
'magic' tree copying and so need to be added manually `override def 
otherCopyArgs = sc :: Nil`.
     - You may already handle these cases correctly, but a few things that came 
to mind.
      - What happens to existing RDDs when a table is appended to?  Is there an 
a case when recomputing will rescan the directory looking for files and cause 
the RDD to change?  Will this always happen?
      - What happens if multiple users append to a parquet "table", stored on 
HDFS for example, at the same time?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Spark parquet improvements

Reply via email to