On 4/9/15 3:09 AM, Michael Armbrust wrote:
Back to the user list so everyone can see the result of the discussion...

    Ah. It all makes sense now. The issue is that when I created the
    parquet files, I included an unnecessary directory name
    (data.parquet) below the partition directories. It’s just a
    leftover from when I started with Michael’s sample code and it
    only made sense before I added the partition directories. I
    probably thought it was some magic name that was required when
    spark scanned for parquet files. The structure looks something
    like this:

    drwxr-xr-x   - user supergroup          0 2015-04-02 13:17
    hdfs://host/tablename/date=20150302/sym=A/data.parquet/...

    If I just move all the files up a level (there goes a day of
    work), the existing code should work fine. Whether it’s useful to
    handle intermediate non-partition directories or whether that just
    creates some extra risk I can’t say, since I’m new to all the
    technology in this whole stack.
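The "move all the files up a level" fix could be scripted; here is a minimal sketch using hypothetical local paths as a stand-in for HDFS (on a real cluster the same loop would shell out to `hdfs dfs -mv` instead of using `shutil`):

```python
# Sketch: flatten the leftover "data.parquet" level under each
# partition directory, e.g.
#   tablename/date=20150302/sym=A/data.parquet/part-00000.parquet
# becomes
#   tablename/date=20150302/sym=A/part-00000.parquet
# Local filesystem only; the glob pattern assumes the date=/sym=
# layout from this thread.
import glob
import os
import shutil

def flatten_extra_level(table_root, extra="data.parquet"):
    """Move files out of a leftover subdirectory up into each partition dir."""
    for part in glob.glob(os.path.join(table_root, "date=*", "sym=*")):
        extra_dir = os.path.join(part, extra)
        if os.path.isdir(extra_dir):
            for name in os.listdir(extra_dir):
                shutil.move(os.path.join(extra_dir, name), part)
            os.rmdir(extra_dir)  # remove the now-empty directory
```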


I'm mixed here. There is always a tradeoff between "silently" ignoring structure that people might not be aware of (and thus might be a bug) and "just working". Having this as an option at least seems reasonable. I'd be curious whether anyone has other thoughts?
Take the following directory name as an example:

   /path/to/partition/a=1/random/b=foo

One possible approach would be to grab both "a=1" and "b=foo", then either report "random" by throwing an exception or ignore it with a WARN log.
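The behavior described above could be sketched roughly like this (this is an illustration of the proposal, not Spark's actual partition-discovery implementation; the function name and `strict` flag are made up):

```python
# Collect "key=value" path segments as partition columns; for any
# non-partition directory that appears *between* partition directories
# (like "random" in /path/to/partition/a=1/random/b=foo), either raise
# an exception (strict) or emit a warning and ignore it (lenient).
import warnings

def parse_partition_path(path, strict=True):
    """Extract partition key/value pairs from a directory path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
        elif partitions:
            # We are past the base path but this segment is not key=value.
            if strict:
                raise ValueError(f"unexpected directory: {segment!r}")
            warnings.warn(f"ignoring non-partition directory: {segment!r}")
    return partitions
```

In lenient mode, `parse_partition_path("/path/to/partition/a=1/random/b=foo", strict=False)` yields `{"a": "1", "b": "foo"}` with a warning about `random`; in strict mode the same path raises.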

    Unfortunately, it takes many minutes (even with mergeSchema=false)
    to create the RDD. It appears that the whole data store will still
    be recursively traversed (even with mergeSchema=false, a manually
    specified schema, and a partition spec [which I can’t pass in
    through a public API]) so that all of the metadata FileStatuses
    can be cached. In my case I’m going to have years of data, so
    there’s no way that will be feasible.

    Should I just explicitly load the partitions I want instead of
    using partition discovery? Is there any plan to have a less
    aggressive version of support for partitions, where metadata is
    only cached for partitions that are used in queries?
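Explicitly loading chosen partitions amounts to building the partition paths yourself instead of letting discovery walk the whole table. A minimal sketch, assuming the date=/sym= layout from earlier in the thread (the base path and column values are hypothetical):

```python
# Enumerate explicit partition directories for the given date/sym
# values, rather than scanning the entire table root.
def partition_paths(base, dates, syms):
    """Build the partition directory paths for the given date/sym values."""
    return [f"{base}/date={d}/sym={s}" for d in dates for s in syms]

paths = partition_paths("hdfs://host/tablename",
                        ["20150302", "20150303"], ["A", "B"])
# The resulting paths could then be passed to a multi-path Parquet
# loader (e.g. sqlContext.parquetFile(*paths) in the Spark 1.3-era
# API), so only those partitions' metadata is touched.
```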


We improved the speed here in 1.3.1, so I'd be curious whether that helps. We definitely need to continue to speed things up here, though. We have to enumerate all the partitions so we know what data to read when a query comes in, but I do think we can parallelize that enumeration or something along those lines.
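As an illustration of the parallelization idea (not how Spark does it): the per-partition listing calls are independent, so they can be issued concurrently. This sketch uses the local filesystem; against HDFS the listing call would go through the HDFS client instead of `os.listdir`.

```python
# Fetch directory listings for many partition directories concurrently.
# Each listing is independent, so a thread pool can overlap the
# round-trips instead of walking partitions one at a time.
import os
from concurrent.futures import ThreadPoolExecutor

def list_partitions_parallel(dirs, workers=8):
    """Return {partition_dir: [entries]} using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(dirs, pool.map(os.listdir, dirs)))
```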
