Irrespective of naming, know that deep directory trees are performance killers
when listing files on S3 and setting up jobs. You may well be better off
keeping the files in a single directory and using a glob like 2016-03-11-*
as the pattern to find them.
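For example (a rough sketch from the spark-shell; spark.read.text is just a
stand-in for whatever input format you actually use):

// one flat prefix, files selected by a date glob instead of a
// year/month/day directory walk
val df = spark.read.text("s3://bucket-company/path/2016-03-11-*")

A single listing against one prefix is far cheaper than recursing into a deep tree.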



On 28 Nov 2016, at 04:18, Prasanna Santhanam <t...@apache.org> wrote:

I've been toying around with Spark SQL lately, trying to move some workloads
over from Hive. In the Hive world the partitions below are recovered by an
ALTER TABLE RECOVER PARTITIONS statement.

Paths:
s3://bucket-company/path/2016/03/11
s3://bucket-company/path/2016/03/12
s3://bucket-company/path/2016/03/13

Whereas Spark ignores these unless the partition information is in the
key=value format below:

s3://bucket-company/path/year=2016/month=03/day=11
s3://bucket-company/path/year=2016/month=03/day=12
s3://bucket-company/path/year=2016/month=03/day=13
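With that layout, plain partition discovery also works when reading the path
directly. A sketch, assuming the data were Parquet (any built-in file source
discovers the columns the same way):

// year, month and day come back as partition columns on the DataFrame
val df = spark.read.parquet("s3://bucket-company/path")
df.printSchema()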

The code for this is in ddl.scala:
https://github.com/apache/spark/blob/ddd02f50bb7458410d65427321efc75da5e65224/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L589
If my DDL already expresses the partition information, why does Spark ignore
it and enforce this separator?
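The only workaround I can see is registering each location by hand, along
these lines (a sketch; one statement per day, which gets verbose fast):

// map an existing yyyy/mm/dd path onto an explicit partition spec
spark.sql("""
  ALTER TABLE test_tbl ADD PARTITION (year='2016', month='03', day='11')
  LOCATION 's3://bucket-company/path/2016/03/11'
""")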

DDL:
CREATE EXTERNAL TABLE test_tbl
(
   column1 STRING,
   column2 STRUCT<...>
)
PARTITIONED BY (year STRING, month STRING, day STRING)
LOCATION 's3://bucket-company/path';
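After creating the table I kick off the recovery from the spark-shell like
this (sketch; I believe MSCK REPAIR TABLE test_tbl routes to the same command):

// walks the table LOCATION and registers whatever partitions it can parse
spark.sql("ALTER TABLE test_tbl RECOVER PARTITIONS")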

Thanks,




