[ 
https://issues.apache.org/jira/browse/ARROW-10758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10758:
------------------------------------
    Summary: [C++] Arrow Dataset Loading CSV format file from S3  (was: Arrow 
Dataset Loading CSV format file from S3)

> [C++] Arrow Dataset Loading CSV format file from S3
> ---------------------------------------------------
>
>                 Key: ARROW-10758
>                 URL: https://issues.apache.org/jira/browse/ARROW-10758
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Lynch
>            Priority: Major
>
> I am using `S3FileSystem` along with `CsvFileFormat` in Arrow dataset to load 
> all CSV files under an S3 bucket. 
> The main test code is below:
>  
> {code:cpp}
> auto format = std::make_shared<CsvFileFormat>();
> string output_path;
> auto s3_file_system = arrow::fs::FileSystemFromUri("s3://test-csv-bucket", 
> &output_path).ValueOrDie();
> FileSystemFactoryOptions options;
> options.partition_base_dir = output_path;
> arrow::fs::FileSelector _file_selector;
> ASSERT_OK_AND_ASSIGN(auto factory,
>                      FileSystemDatasetFactory::Make(s3_file_system, 
> _file_selector, format, options));
> ASSERT_OK_AND_ASSIGN(auto schema, factory->Inspect());
> ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish(schema));
> {code}
> But when calling `ASSERT_OK_AND_ASSIGN(auto schema, 
> factory->Inspect());`, the process aborts while reading a file from the S3 
> bucket. The stack trace is as follows:
>  
>  
> {code:java}
> __pthread_kill 0x00007fff70dc033a
> pthread_kill 0x00007fff70e7ce60
> abort 0x00007fff70d47808
> malloc_vreport 0x00007fff70e3d50b
> malloc_report 0x00007fff70e4040f
> Aws::Free(void*) AWSMemory.cpp:97
> std::__1::enable_if<std::is_polymorphic<std::__1::basic_iostream<char, 
> std::__1::char_traits<char> > >::value, void>::type 
> Aws::Delete<std::__1::basic_iostream<char, std::__1::char_traits<char> > 
> >(std::__1::basic_iostream<char, std::__1::char_traits<char> >*) 
> AWSMemory.h:119
> Aws::Utils::Stream::ResponseStream::ReleaseStream() ResponseStream.cpp:62
> Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:54
> Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:53
> Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
> Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
> arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long 
> long, void*) s3fs.cc:724
> arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long 
> long) s3fs.cc:735
> arrow::dataset::OpenReader(arrow::dataset::FileSource const&, 
> arrow::dataset::CsvFileFormat const&, 
> std::__1::shared_ptr<arrow::dataset::ScanOptions> const&, arrow::MemoryPool*) 
> file_csv.cc:119
> arrow::dataset::CsvFileFormat::Inspect(arrow::dataset::FileSource const&) 
> const file_csv.cc:182
> arrow::dataset::FileSystemDatasetFactory::InspectSchemas(arrow::dataset::InspectOptions)
>  discovery.cc:219
> arrow::dataset::DatasetFactory::Inspect(arrow::dataset::InspectOptions) 
> discovery.cc:41
> {code}
>  
> Does Arrow dataset support reading CSV/Parquet/IPC files from `S3FileSystem`?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
