I started a conversation with the DMLC developers, who maintain C++11 implementations of both S3 and Azure filesystems:
https://github.com/dmlc/dmlc-core/issues/273

On Thu, Jun 22, 2017 at 9:18 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> If you want to use pure Python, you should probably just use the s3fs
> package. We should be able to get better throughput using C++ (using
> multithreading to make multiple requests for larger reads) -- the AWS
> C++ SDK probably has everything we need to make a really strong
> implementation.
>
> Dato/Turi created an S3 file source implementation in C++,
> https://github.com/turi-code/SFrame/blob/master/oss_src/fileio/s3_fstream.hpp,
> which is BSD licensed and does not depend on the (quite large) AWS C++
> SDK, so that might not be a bad place to start.
>
> On Thu, Jun 22, 2017 at 9:01 AM, Colin Nichols <co...@bam-x.com> wrote:
>
>> I am using a pa.PythonFile() wrapping the file-like object provided by
>> the s3fs package. I am able to write parquet files directly to S3 this
>> way. I am not reading with pyarrow (I read gzipped CSVs in plain
>> Python), but I imagine it would work much the same.
>>
>> -- sent from my phone --
>>
>> > On Jun 22, 2017, at 00:54, Kevin Moore <ke...@quiltdata.io> wrote:
>> >
>> > Has anyone started looking into how to read data sets from S3? I
>> > started looking into it and wondered if anyone has a design in mind.
>> >
>> > We could implement an S3FileSystem class in pyarrow/filesystem.py.
>> > The filesystem components could probably be written against the AWS
>> > Python SDK.
>> >
>> > The HDFS file system and file classes, however, are implemented at
>> > least partially in Cython and C++. Is there an advantage to doing
>> > that for S3 too?
>> >
>> > Thanks,
>> >
>> > Kevin
>> >
>> > ----
>> > Kevin Moore
>> > CEO, Quilt Data, Inc.
>> > ke...@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
>> > (415) 497-7895
>> >
>> > Data packages for fast, reproducible data science
>> > quiltdata.com
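For reference, a minimal sketch of the approach Colin describes -- wrapping
the s3fs file object in pa.PythonFile and writing a parquet file through
it. The bucket and key below are hypothetical, and the sketch assumes AWS
credentials are already configured in the environment:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    # s3fs picks up credentials from the environment / ~/.aws
    fs = s3fs.S3FileSystem()

    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])

    # 'my-bucket/data/example.parquet' is a hypothetical key
    with fs.open('my-bucket/data/example.parquet', 'wb') as raw:
        sink = pa.PythonFile(raw, mode='w')
        pq.write_table(table, sink)
        sink.close()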
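And a minimal sketch of the S3FileSystem class Kevin proposes, written
against boto3 (the AWS Python SDK). The method names loosely follow the
pyarrow/filesystem.py conventions; the pagination, s3:// path parsing, and
error handling a real implementation would need are omitted:

    import io
    import boto3
    from botocore.exceptions import ClientError

    class S3FileSystem:
        """Filesystem-like wrapper over boto3 (sketch only)."""

        def __init__(self, **client_kwargs):
            # client_kwargs (e.g. region_name) pass straight to boto3
            self.client = boto3.client('s3', **client_kwargs)

        def ls(self, bucket, prefix=''):
            # List keys under a prefix (first page only; real code
            # would paginate)
            resp = self.client.list_objects_v2(Bucket=bucket, Prefix=prefix)
            return [obj['Key'] for obj in resp.get('Contents', [])]

        def exists(self, bucket, key):
            # HEAD the object; a missing key surfaces as ClientError
            try:
                self.client.head_object(Bucket=bucket, Key=key)
                return True
            except ClientError:
                return False

        def open(self, bucket, key):
            # Download the whole object into a file-like buffer; large
            # reads would want ranged GETs plus the multithreading Wes
            # mentions
            body = self.client.get_object(Bucket=bucket, Key=key)['Body']
            return io.BytesIO(body.read())

The BytesIO this returns can be handed straight to pq.read_table(), which
accepts any file-like object.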