Github user AndreSchumacher commented on the pull request:

    https://github.com/apache/spark/pull/195#issuecomment-38695808
  
    @marmbrus @pwendell  Thanks a lot for the detailed comments. Here are some 
thoughts regarding the points you raised above:
    * Exposing ``DataType`` to users is probably not a good idea, at least not the internal JVM type. Some kind of matching against data types should be supported (IMHO). Maybe more thought could go into which parts of ``DataType`` should be exposed? Also, there is currently no notion of a primitive type as in Parquet (the closest thing to that is ``NativeType``; maybe that could be renamed or extended?). I will hold off on adding schema strings until the other APIs have stabilized somewhat. OK?
    * I have no idea why that one test failed (and I can't find the error output in Jenkins anymore). I cannot reproduce the failure locally, although other failures show up there that do not appear in Jenkins; I believe those are unrelated, so I don't trust my local installation. Once this build has completed I will try to find the root cause.
    * I added ``otherCopyArgs`` to the two operators; please let me know if I missed anything else.
    * About the INSERT INTO TABLE operation without overwrite:
        - I tried to find out how Hive deals with this situation. Unless the Metastore somehow locks access to tables, I don't see many checks at the HDFS level (though I may have just missed them). Data is written to temporary files and then moved to its final destination, possibly replacing the destination files if they already exist. We could try to implement something similar, which would mitigate (but not avoid) concurrent write access problems.
        - Also, data could be saved in directories only accessible to the user 
that created them. But that would be kind of a strong limitation.
        - Currently the child RDD is returned unchanged, which may differ from the RDD that could be built from the Parquet files in the destination directory. We could change this to always do a scan after an INSERT INTO TABLE so that the actual resulting RDD is returned. Maybe that would be less confusing to users?
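
    To make the first point concrete, here is a minimal sketch of what matching against exposed data types could look like. The type names below are a toy stand-in mirroring the Catalyst naming (``DataType``/``NativeType``), not the real classes, and ``toParquetPrimitive`` is a hypothetical helper, just to illustrate how a primitive/non-primitive distinction would let users (or the Parquet converter) pattern-match safely:

    ```scala
    // Toy model for illustration only -- not the actual Catalyst hierarchy.
    sealed trait DataType
    sealed trait NativeType extends DataType          // "primitive" types
    case object IntegerType extends NativeType
    case object StringType extends NativeType
    final case class ArrayType(elementType: DataType) extends DataType

    // Hypothetical mapping from a Catalyst-style type to a Parquet primitive
    // type name; nested types return None because they need Parquet group types.
    def toParquetPrimitive(dt: DataType): Option[String] = dt match {
      case IntegerType => Some("INT32")
      case StringType  => Some("BINARY") // Parquet stores strings as BINARY/UTF8
      case _           => None
    }
    ```

    Because the trait is sealed, the compiler can check match exhaustiveness, which is one argument for exposing some curated part of the hierarchy rather than none at all.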
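
    For the write-to-temp-then-move idea, a rough sketch of the mechanism (using plain ``java.nio`` here just to show the shape; a real implementation would go through Hadoop's ``FileSystem`` API, and ``writeThenMove`` is an invented name):

    ```scala
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Path, StandardCopyOption}

    // Write data to a temporary file in the target directory, then move it to
    // its final name. The move publishes the file in a single step, so readers
    // never observe a partially written file (where the filesystem supports
    // atomic moves -- HDFS rename has different semantics, hence the caveat
    // above that this mitigates but does not avoid concurrent writes).
    def writeThenMove(dir: Path, name: String, data: String): Path = {
      val tmp = Files.createTempFile(dir, s".$name", ".tmp")
      Files.write(tmp, data.getBytes(StandardCharsets.UTF_8))
      // REPLACE_EXISTING mirrors "replacing destination files if they exist".
      Files.move(tmp, dir.resolve(name),
        StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING)
    }

    val dir = Files.createTempDirectory("parquet-sketch")
    writeThenMove(dir, "part-00000", "some rows")
    ```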

