Github user AndreSchumacher commented on the pull request:
https://github.com/apache/spark/pull/195#issuecomment-38695808
@marmbrus @pwendell Thanks a lot for the detailed comments. Here are some
thoughts regarding the points you raised above:
* Exposing ``DataType`` to users is probably not a good idea, at least not
the internal JVM type. Some kind of matching against data types should be
supported (IMHO). Maybe more thought could go into which part of ``DataType``
should be exposed? Also it seems that currently there is no notion of a
primitive type as in Parquet (the closest thing to that is a ``NativeType``;
maybe that could be renamed or extended?). I will hold off on adding schema
strings until the other APIs have stabilized somewhat. OK?
* I have no idea why that one test failed (and can't find the error output
anymore in Jenkins). Locally I cannot reproduce this failure (but other
failures that do not occur in Jenkins show up, which I believe are unrelated,
so I don't trust my local installation). Once this build has completed I will
try to find the root cause.
* I added otherCopyArgs to the two operators; please let me know if I
missed anything else.
* About the INSERT INTO TABLE operation without overwrite:
- I tried to find out how Hive deals with this situation. Unless the
Metastore somehow locks access to tables, I don't see many checks at the HDFS
level (but I may have just missed them). Data is written to temporary files and
then moved to its final destination, possibly renaming the destination files if
they already exist. We could try to implement something similar, which could
mitigate (but not avoid) the concurrent write access situation.
- Also, data could be saved in directories only accessible to the user
that created them. But that would be kind of a strong limitation.
- Currently the child RDD is returned unchanged, which may be different
from the RDD that can be built from the Parquet files in the destination
directory. This could be changed to always doing a scan after an INSERT INTO
TABLE so that the actual RDD is returned. Maybe that would be less confusing to
users?
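
To make the "write to temporary files, then move" idea above more concrete, here is a minimal sketch in Python against a local filesystem. It is only an illustration under stated assumptions: it is not how Hive or this PR implements the step, the function name `write_then_move` and the numbered-backup naming scheme are hypothetical, and HDFS rename semantics differ from those of a local filesystem.

```python
import os
import tempfile


def write_then_move(dest_path, data, keep_existing=True):
    """Write bytes to a temp file, then move it to dest_path.

    Hypothetical helper: if dest_path already exists and keep_existing is
    true, the old file is first renamed to dest_path + ".<n>" (an assumed
    naming scheme). Returns the backup path, or None if nothing was renamed.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file in the destination directory so the final move
    # stays on one filesystem and can be a plain rename.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())

        backup_path = None
        if keep_existing and os.path.exists(dest_path):
            n = 0
            backup_path = f"{dest_path}.{n}"
            while os.path.exists(backup_path):
                n += 1
                backup_path = f"{dest_path}.{n}"
            # Not atomic with the check above: another writer can still
            # slip in here, so this mitigates but does not avoid races.
            os.rename(dest_path, backup_path)

        # Readers see either the old file or the complete new one,
        # never a partially written file.
        os.replace(tmp_path, dest_path)
        return backup_path
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without some form of locking (for example in the Metastore), two concurrent writers can still interleave between the existence check and the rename, which is why this approach only narrows the window for conflicts.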