Back to the user list so everyone can see the result of the discussion...

> Ah. It all makes sense now. The issue is that when I created the Parquet
> files, I included an unnecessary directory name (data.parquet) below the
> partition directories. It’s just a leftover from when I started with
> Michael’s sample code, and it only made sense before I added the partition
> directories. I probably thought it was some magic name that was required
> when Spark scanned for Parquet files. The structure looks something like
> this:
>
>
>
> drwxr-xr-x   - user supergroup          0 2015-04-02 13:17
> hdfs://host/tablename/date=20150302/sym=A/data.parquet/...
>
> If I just move all the files up a level (there goes a day of work), the
> existing code should work fine. Whether it’s useful to handle intermediate
> non-partition directories or whether that just creates some extra risk I
> can’t say, since I’m new to all the technology in this whole stack.
>

I'm torn here.  There is always a tradeoff between "silently" ignoring
structure that people might not be aware of (and thus might be a bug) and
"just working".  Having this as an option certainly seems reasonable, at the
very least.  I'd be curious whether anyone has other thoughts.
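
For reference, partition discovery expects the Parquet files to sit directly
under the partition directories, so the fixed layout and the code that reads
it would look roughly like this (just a sketch against the 1.3 DataFrame API;
the paths and column names below are illustrative):

  hdfs://host/tablename/date=20150302/sym=A/part-00000.parquet
  hdfs://host/tablename/date=20150302/sym=B/part-00000.parquet

  // Point at the table root; date and sym are picked up as partition columns.
  val df = sqlContext.parquetFile("hdfs://host/tablename")
  df.printSchema()  // should show date and sym alongside the data columns
  df.filter("date = 20150302 AND sym = 'A'").count()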


>   Unfortunately, it takes many minutes (even with mergeSchema=false) to
> create the RDD. It appears that the whole data store will still be
> recursively traversed (even with mergeSchema=false, a manually specified
> schema, and a partition spec [which I can’t pass in through a public API])
> so that all of the metadata FileStatuses can be cached. In my case I’m
> going to have years of data, so there’s no way that will be feasible.
>
>
>
> Should I just explicitly load the partitions I want instead of using
> partition discovery? Is there any plan for a less aggressive version of
> partition support, where metadata is only cached for partitions that are
> used in queries?
>

We improved the speed here in 1.3.1, so I'd be curious whether that helps.
We definitely need to continue to speed things up here, though.  We have to
enumerate all the partitions so we know what data to read when a query
comes in, but I do think we could parallelize that enumeration.
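
In the meantime, explicitly loading just the partitions you need should be a
workable stopgap.  Roughly (again only a sketch against the 1.3 API, with
made-up paths; the partition columns have to be added back by hand because
discovery is skipped when you point at a leaf directory):

  import org.apache.spark.sql.functions.lit

  val a = sqlContext.parquetFile("hdfs://host/tablename/date=20150302/sym=A")
            .withColumn("date", lit(20150302)).withColumn("sym", lit("A"))
  val b = sqlContext.parquetFile("hdfs://host/tablename/date=20150302/sym=B")
            .withColumn("date", lit(20150302)).withColumn("sym", lit("B"))
  val df = a.unionAll(b)  // only these two directories get listed and read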
