This is a folder that contains some Parquet files. What do you mean? Can
ParquetDatasetFactory only be used for a single file, while
FileSystemDatasetFactory is used for folders? Or can you tell me how to use
ParquetDatasetFactory correctly? What do I need to make sure of? For example,
what should I watch out for with the metadata_path parameter? An example would
be best. The reason I want to use ParquetDatasetFactory is that the
FileSystemDatasetFactory process seems to be as follows:


```
FileSystemDatasetFactory ---> get a dataset
dataset->GetFragments() ---> get fragments for the Parquet files in the folder
for each fragment ---> construct a ScannerBuilder ---> Finish() ---> get a Scanner
scanner->ToTable() ---> get a Table (reads the file into memory)

// I want to filter some columns before ToTable(), but it seems that only
// the resulting Table has a ColumnNames() function.
```
Is this the wrong way to do it?
My ultimate goal is to use Arrow to read Parquet files from S3 for TensorFlow
training.





------------------ Original Message ------------------
From: "dev" <weston.p...@gmail.com>;
Date: Saturday, April 9, 2022, 11:38 AM
To: "dev" <dev@arrow.apache.org>;
Subject: Re: construct dataset for s3 by ParquetDatasetFactory failed



Is `iceberg-test/warehouse/test/metadata` a parquet file? I only ask
because there is no extension. The commented-out
FileSystemDatasetFactory is only accessing bucket_uri so it would
potentially succeed even if the metadata file did not exist.
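To illustrate the difference: as I understand it (please correct me if wrong), the metadata_path argument to ParquetDatasetFactory::Make must name an actual Parquet metadata file (such as a `_metadata` summary file written by Spark or by parquet::WriteMetaDataFile), not a directory. A hedged sketch of that usage, with an early existence check so the failure is explicit; the path is illustrative only:

```cpp
// Sketch only: assumes an existing "_metadata" summary file at the given
// path; the path below is hypothetical.
#include <memory>
#include <string>

#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Result<std::shared_ptr<ds::Dataset>> OpenViaMetadata(
    std::shared_ptr<fs::FileSystem> filesystem) {
  const std::string metadata_path = "bucket/dataset/_metadata";  // a file, not a dir

  // Fail early with a clear error if the metadata file is missing.
  ARROW_ASSIGN_OR_RAISE(auto info, filesystem->GetFileInfo(metadata_path));
  if (info.type() != fs::FileType::File) {
    return arrow::Status::IOError("no metadata file at ", metadata_path);
  }

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::ParquetFactoryOptions options;
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::ParquetDatasetFactory::Make(metadata_path, filesystem, format, options));
  return factory->Finish();
}
```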

On Fri, Apr 8, 2022 at 1:48 AM 1057445597 <1057445...@qq.com.invalid> wrote:
>
> I want to use ParquetDatasetFactory to create a dataset for S3, but it failed!
> The error message is as follows:
>
>
> /build/apache-arrow-7.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on
> an error: IOError: Path does not exist 'iceberg-test/warehouse/test/metadata'
> /lib/x86_64-linux-gnu/libarrow.so.700(+0x10430bb)[0x7f4ee6fe50bb]
> /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f4ee6fe52fd]
> /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17e)[0x7f4ee7104a2e]
> ./example(+0xd97d)[0x564087f3e97d] ./example(+0x8bc2)[0x564087f39bc2]
> ./example(+0x94c8)[0x564087f3a4c8] ./example(+0x9fb4)[0x564087f3afb4]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f4ee572b0b3]
> ./example(+0x69fe)[0x564087f379fe] Aborted (core dumped)
>
>
> In the following code snippet there is a commented-out line that uses
> FileSystemDatasetFactory to create the dataset, and it works well. Can't a
> dataset be created through a ParquetDatasetFactory?
>
&gt;
&gt;
> std::shared_ptr<ds::Dataset> GetDatasetFromS3(const std::string& access_key,
>                                               const std::string& secret_key,
>                                               const std::string& endpoint_override,
>                                               const std::string& bucket_uri) {
>   EnsureS3Initialized();
>
>   S3Options s3Options = S3Options::FromAccessKey(access_key, secret_key);
>   s3Options.endpoint_override = endpoint_override;
>   s3Options.scheme = "http";
>
>   std::shared_ptr<S3FileSystem> s3fs = S3FileSystem::Make(s3Options).ValueOrDie();
>
>   std::string path;
>   std::stringstream ss;
>   ss << "s3://" << access_key << ":" << secret_key
>      << "@" << K_METADATA_PATH
>      << "?scheme=http&endpoint_override=" << endpoint_override;
>   auto fs = arrow::fs::FileSystemFromUri(ss.str(), &path).ValueOrDie();
>   // auto fileInfo = fs->GetFileInfo().ValueOrDie();
>
>   auto format = std::make_shared<ParquetFileFormat>();
>
>   // FileSelector selector;
>   // selector.base_dir = bucket_uri;
>
>   // FileSystemFactoryOptions options;
>   ds::ParquetFactoryOptions options;
>
>   std::string metadata_path = bucket_uri;
>
>   ds::FileSource source(bucket_uri, s3fs);
>   // auto factory = ds::ParquetDatasetFactory::Make(source, bucket_uri, fs, format, options).ValueOrDie();
>   auto factory = ds::ParquetDatasetFactory::Make(path, fs, format, options).ValueOrDie();
>
>   // auto factory = FileSystemDatasetFactory::Make(s3fs, selector, format, options).ValueOrDie();
>   return factory->Finish().ValueOrDie();
> }
