On 29 Nov 2017, at 21:45, Lalwani, Jayesh <jayesh.lalw...@capitalone.com> wrote:

AWS announced at re:Invent that they are launching S3 Select. This can allow 
Spark to push down predicates to S3, rather than reading the entire file into 
memory. Are there any plans to update Spark to use S3 Select?


  1.  ORC and Parquet don't read the whole file into memory anyway, except in 
the special case that the file is gzipped.
  2.  Hadoop's s3a <= 2.7 doesn't handle the aggressive seeks of those columnar 
formats that well, as it does a GET from the current position to end-of-file 
and has to abort the TCP connection if the seek is backwards.
  3.  Hadoop 2.8+ with spark.hadoop.fs.s3a.experimental.fadvise=random switches 
to random IO and only does smaller GET reads of the data requested (actually 
min(min-read-length, buffer-size)). This delivers a ~3x performance boost in 
TPC-DS benchmarks (example below).
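
To make point 3 concrete, here's a minimal sketch of turning the random-IO 
policy on from Spark; the app name, bucket, path, and column names are all 
placeholders:

import org.apache.spark.sql.SparkSession

// Sketch only: needs Hadoop 2.8+ s3a on the classpath; names are placeholders.
val spark = SparkSession.builder()
  .appName("s3a-random-io")
  // Spark passes "spark.hadoop.*" options down to the Hadoop configuration,
  // so this sets fs.s3a.experimental.fadvise=random on the S3AFileSystem.
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()
import spark.implicits._

// With random IO, the columnar readers' seeks become bounded GETs rather than
// position-to-EOF streams that must be aborted on every backward seek.
val df = spark.read.parquet("s3a://my-bucket/events/")
df.filter($"year" === 2017).select($"id", $"amount").show()

The same option can equally go into spark-defaults.conf, or into core-site.xml 
as fs.s3a.experimental.fadvise.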


I don't yet know how much more efficient the new mechanism will be against 
columnar data, given those facts. You'd need to do experiments.

The place to implement this would be through predicate pushdown from the file 
format to the FS. ORC & Parquet support predicate pushdown, so they'd need to 
recognise when the underlying store could do some of the work for them, open 
the store input stream differently, and use a whole new (undefined?) API to 
issue the queries. Most likely: s3a would add a way to specify a predicate to 
select on in open(), as well as the expected file type. This would need the 
underlying mechanism to also support those formats though, which the 
announcement doesn't.
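
Purely to illustrate the shape such an API could take (nothing below exists 
today; the trait and method names are invented for this sketch), it might be an 
extra open variant that carries the predicate and the expected format:

import org.apache.hadoop.fs.{FSDataInputStream, Path}

// HYPOTHETICAL: no such interface exists in Hadoop. This only sketches how a
// format reader might hand a predicate to the store at open() time.
trait SelectCapableFileSystem {
  /**
   * Open `path` and ask the store to evaluate `selectSql` server-side.
   * `format` names the on-wire layout ("csv", "json", ...) so the store knows
   * how to parse the object; the returned stream carries only matching rows.
   */
  def openWithSelect(path: Path, format: String, selectSql: String): FSDataInputStream
}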

Someone could do something more immediately through a modified CSV data source 
which did the pushdown. However, if you are using CSV for your datasets, 
there's something fundamental w.r.t. your data storage policy you need to look 
at. It works sometimes as an exchange format, though I prefer Avro there due to 
its schemas and support for more complex structures.  As a format you run 
queries over? No.
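
That said, for anyone who wants to experiment, here is a minimal sketch of what 
such a CSV source would end up calling under the hood, going straight at the 
S3 Select operation in the AWS Java SDK (assuming SDK 1.11.x+; bucket, key, and 
query are placeholders):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model._
import scala.io.Source

// Sketch only: select two columns of the rows matching a predicate from a CSV
// object, so only the matching bytes travel over the wire.
val s3 = AmazonS3ClientBuilder.defaultClient()

val request = new SelectObjectContentRequest()
request.setBucketName("my-bucket")                 // placeholder
request.setKey("data/events.csv")                  // placeholder
request.setExpression("SELECT s._1, s._3 FROM S3Object s WHERE s._2 = 'US'")
request.setExpressionType(ExpressionType.SQL)

val inputSer = new InputSerialization()
val csvIn = new CSVInput()
csvIn.setFileHeaderInfo(FileHeaderInfo.NONE)       // columns addressed as _1, _2, ...
inputSer.setCsv(csvIn)
inputSer.setCompressionType(CompressionType.NONE)
request.setInputSerialization(inputSer)

val outputSer = new OutputSerialization()
outputSer.setCsv(new CSVOutput())
request.setOutputSerialization(outputSer)

val result = s3.selectObjectContent(request)
Source.fromInputStream(result.getPayload.getRecordsInputStream)
  .getLines()
  .foreach(println)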
