I couldn't get S3 Select to work with a limit (it's not designed for that). I have used ClickHouse with S3 Select with Arrow support (for Snowflake), and it works great.
On Sat, 13 Feb 2021, 5:24 pm Rémi Dettai, <rdet...@gmail.com> wrote:

> Thank you Daniel for taking the time to go through the slides!
>
> S3 Select is an interesting beast, but I think the benefit we could draw
> from it in this use case is pretty limited:
> - for now Buzz focuses on Parquet data, which already allows efficient
>   projection (it uses HTTP Range requests to download only the relevant
>   parts of the files), and once supported by DataFusion, we might even
>   push down filters to skip downloading entire row groups.
> - S3 Select can only output CSV and JSON, so in the cases where you have
>   to bring back a lot of data, it would actually amplify the volume of
>   data fetched from S3 and make the deserialization more expensive.
>
> There are still some situations where S3 Select would definitely be
> beneficial, but it would be quite hard to automatically identify those
> and let S3 Select kick in accordingly.
>
> Have you used S3 Select at scale? Does it provide good and consistent
> latencies?
>
> On Wed, 10 Feb 2021 at 19:35, Daniël Heres <danielhe...@gmail.com> wrote:
>
> > Thanks for sharing the slides Rémi! That looks really cool.
> >
> > One question I have after this: do you plan to use S3 Select
> > (https://aws.amazon.com/blogs/aws/s3-glacier-select/)? It seems it would
> > fit your architecture nicely, and I think it shouldn't be too hard to
> > create the query from the filters/projection in the datasource scan
> > method to spend less time in Lambda.
> >
> > On Wed, Feb 10, 2021, 18:44 Rémi Dettai <rdet...@gmail.com> wrote:
> >
> > > Thanks for the notes Andy. Here is the slide deck I presented, for
> > > further reference:
> > >
> > > https://docs.google.com/presentation/d/1uZ5PbazC1zCX24k0Hh-UItddIh9BRvD5GL7NUDgc9eQ/edit?usp=sharing
> > >
> > > If anyone wants to see how it works in practice and does not have an
> > > AWS account to try it out, feel free to reach out to me and I can walk
> > > you through it!
> > >
> > > On Wed, 10 Feb 2021 at 18:37, Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > > Attendees
> > > >
> > > > - Andy Grove
> > > > - Benjamin Blodgett
> > > > - Marc Prud’Hommeaux
> > > > - Mike Seddon
> > > > - Jorge Leitao
> > > > - Andrew Lamb
> > > > - Fernando Herrera
> > > > - Neville Dipale
> > > > - Remi Dettai
> > > >
> > > > (Please let me know if I have misspelled anyone’s names)
> > > >
> > > > Topics Discussed
> > > >
> > > > - Discussion of Jorge’s proposal to redesign the Arrow crate to
> > > >   resolve safety violations (following on from mailing list
> > > >   discussion)
> > > > - Mike has a PR up to implement a large number of Postgres string
> > > >   functions that needs reviewing
> > > > - Remi gave a short presentation about his Buzz project, which
> > > >   provides serverless compute using Arrow and DataFusion
> > > >
> > > > Planned for next time:
> > > >
> > > > - Marc Prud’Hommeaux to give a presentation/demo on his use of Arrow
> > > > - Andy Grove to give a presentation/demo on Ballista, which provides
> > > >   distributed query execution using DataFusion
> > > >
> > > > On Wed, Feb 10, 2021 at 8:56 AM Andy Grove <andygrov...@gmail.com>
> > > > wrote:
> > > >
> > > > > A quick reminder that the bi-weekly Arrow Rust sync call starts
> > > > > about an hour from now. Everyone is welcome.
> > > > >
> > > > > https://meet.google.com/ctp-yujs-aee
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.