On 29 Nov 2016, at 05:19, Prasanna Santhanam 
<t...@apache.org> wrote:

On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran 
<ste...@hortonworks.com> wrote:

Irrespective of naming, know that deep directory trees are performance killers 
when listing files on S3 and setting up jobs. You might actually be better off 
having the files in the same directory and using a pattern like 2016-03-11-* 
as the pattern to find files.
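The flat-layout idea above can be sketched as follows — a minimal, hypothetical illustration using Python's `fnmatch` and made-up object keys under a `logs/` prefix (a real job would list keys through the S3 API, but the filtering idea is the same):

```python
from fnmatch import fnmatch

# Hypothetical object keys in a single flat S3 "directory":
# a date prefix in the filename replaces a deep dt=YYYY-MM-DD partition tree.
keys = [
    "logs/2016-03-10-part-0000.parquet",
    "logs/2016-03-11-part-0000.parquet",
    "logs/2016-03-11-part-0001.parquet",
    "logs/2016-03-12-part-0000.parquet",
]

# One listing over a single prefix, then a client-side glob filter,
# instead of walking one directory per date.
pattern = "logs/2016-03-11-*"
matched = [k for k in keys if fnmatch(k, pattern)]
print(matched)
```

With this layout a day's files are selected by one prefix listing plus a glob, rather than many per-directory LIST calls against S3.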

Thanks Bharat and Steve - I've generally followed the partitioned table format 
over the flat structure since it aids WHERE clause filtering 
(predicate pushdown?). Performance-wise, that helps write-once, query-many-times 
kinds of workloads. Changing this in our production application that dumps 
these files is cumbersome. Is there a configuration that would override this 
restriction for Spark? Does it make sense to have one?

If it's done, leave it alone. Just be aware that S3 doesn't like deep directories 
that much, as listing is fairly slow.
