Thank you for your previous reply. I still have some questions to ask.



I found that the RecordBatchReader reads fewer rows at a time than each 
row group contains, meaning that a row group needs to be read across two 
ReadNext() calls. So what is the default batch size for RecordBatchReader?
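
For context, here is how I create the reader (a minimal sketch; I am 
assuming ScannerBuilder::BatchSize() is the knob that controls how many 
rows each batch holds):

```cpp
#include <arrow/dataset/api.h>

// Sketch: build a RecordBatchReader from a dataset, overriding the batch
// size. I am guessing the default comes from arrow::dataset::ScanOptions::batch_size.
arrow::Result<std::shared_ptr<arrow::RecordBatchReader>> MakeReader(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // A larger batch size should mean fewer ReadNext() calls per row group.
  ARROW_RETURN_NOT_OK(builder->BatchSize(1 << 20));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToRecordBatchReader();
}
```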


Also, do you have any advice if I have to follow row group boundaries? I have 
a lot of parquet files stored on S3. If I convert the scanner to a 
RecordBatchReader, I can just loop on ReadNext(); but if I want to read row 
group by row group, I find I have to call dataset->GetFragments(), iterate 
through the fragments, call SplitByRowGroups() to split each fragment again, 
construct a scanner for each of the resulting fragments, and call that 
scanner's ToTable() to read the data. The sketch below shows what I mean.
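
Roughly like this (a sketch of my current approach; I am assuming the 
ScannerBuilder(schema, fragment, scan_options) constructor is the right way 
to scan a single fragment):

```cpp
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>

// Sketch: read a dataset one parquet row group at a time.
arrow::Status ScanPerRowGroup(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (auto maybe_fragment : fragments) {
    ARROW_ASSIGN_OR_RAISE(auto fragment, maybe_fragment);
    auto parquet_fragment =
        std::static_pointer_cast<arrow::dataset::ParquetFileFragment>(fragment);
    // One sub-fragment per row group (no filtering, hence literal(true)).
    ARROW_ASSIGN_OR_RAISE(
        auto row_groups,
        parquet_fragment->SplitByRowGroups(arrow::compute::literal(true)));
    for (const auto& rg : row_groups) {
      auto options = std::make_shared<arrow::dataset::ScanOptions>();
      arrow::dataset::ScannerBuilder builder(dataset->schema(), rg, options);
      ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());
      ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
      // ... use `table` (one row group's worth of data) ...
    }
  }
  return arrow::Status::OK();
}
```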


Finally, is there a performance difference between ToTable() and ReadNext()?
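
For reference, these are the two patterns I am comparing (a minimal sketch; 
in practice I would use one or the other, not both):

```cpp
#include <arrow/dataset/api.h>

// Sketch: the two consumption patterns, given the same scanner.
arrow::Status Consume(const std::shared_ptr<arrow::dataset::Scanner>& scanner) {
  // Pattern 1: materialize the whole scan into one Table.
  ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());

  // Pattern 2: stream the scan batch by batch.
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` ...
  }
  return arrow::Status::OK();
}
```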





------------------ Original Message ------------------
From: "dev" <weston.p...@gmail.com>;
Date: Mon, Apr 11, 2022 (Monday) 4:23 PM
To: "dev" <dev@arrow.apache.org>;
Subject: Re: construct dataset for s3 by ParquetDatasetFactory failed



ParquetDatasetFactory should only be used when you have a "_metadata"
file that describes which files are in your dataset. Some dataset
creators (e.g. Dask) can create this file. This saves time because
you do not have to list directories to find all the files in your
dataset. This is described in the python docs[1] this way:

> Some processing frameworks such as Dask (optionally) use a _metadata file
> with partitioned datasets which includes information about the schema and
> the row group metadata of the full dataset. Using such a file can give a more
> efficient creation of a parquet Dataset, since it does not need to infer the
> schema and crawl the directories for all Parquet files (this is especially the
> case for filesystems where accessing files is expensive). The
> parquet_dataset() function allows us to create a Dataset from a partitioned
> dataset with a _metadata file:

You can only use ParquetDatasetFactory if you have one of these
"_metadata" files.

> The reason I want to use ParquetDatasetFactory is because using the
> FileSystemDatasetFactory process seems to be as follows
> ...
> I want to filter some columns before ToTable(), but it seems that only
> the Table struct has the ColumnNames() function

To get a list of columns in your dataset before you load the dataset
you can use the FileSystemDatasetFactory to create a Dataset and then
access the arrow::dataset::Dataset::schema property[2].
You can then pass the list of columns you want to read when you create
the scanner.
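
For example (a minimal sketch; error handling elided, and the two-column
projection is just an illustration):

```cpp
#include <arrow/dataset/api.h>

// Sketch: inspect the schema before loading any data, then project columns.
arrow::Result<std::shared_ptr<arrow::dataset::Scanner>> MakeProjectedScanner(
    const std::shared_ptr<arrow::dataset::DatasetFactory>& factory) {
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // The full column list is available before any data is read.
  std::vector<std::string> names = dataset->schema()->field_names();
  names.resize(2);  // keep only the first two columns, as an illustration
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project(names));
  return builder->Finish();
}
```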

> FileSystemDatasetFactory ---> get a dataset
> dataset->GetFragments ---> get fragments for parquet files in the folder
> for fragment in fragments ---> construct a scanner builder ---> Finish() ---> get a scanner
> scanner->ToTable() ---> get a table (read the file to memory)

You should not have to call dataset->GetFragments. You should not
create a scanner from a fragment. Instead you can create a scanner
from the dataset.

There are a few examples. This example[3] shows how to do projection.
In the C++ API the selection of columns is sometimes called
"projection". In the example I linked the code is loading all columns
AND one extra dynamic column (b_large). However, you can also use the
same approach to load fewer columns. You can see that `names` and
`exprs` are created. These vectors define which columns will be
loaded. To load fewer columns you would only add the columns you want
to these vectors.
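
In outline (a sketch based on that example; "b" is a placeholder column
name and `scanner_builder` an already-created ScannerBuilder):

```cpp
// Sketch: select columns via parallel vectors of expressions and output
// names, then hand them to ScannerBuilder::Project.
std::vector<std::string> names;
std::vector<arrow::compute::Expression> exprs;
names.push_back("b");
exprs.push_back(arrow::compute::field_ref("b"));
ARROW_RETURN_NOT_OK(scanner_builder->Project(exprs, names));
```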

[1] https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets
[2] https://github.com/apache/arrow/blob/e453ffeff233c358ec934a53a33b8b4b1d4e299b/cpp/src/arrow/dataset/dataset.h#L151
[3] https://github.com/apache/arrow/blob/e453ffeff233c358ec934a53a33b8b4b1d4e299b/cpp/examples/arrow/dataset_documentation_example.cc#L244

On Sun, Apr 10, 2022 at 4:25 PM 1057445597 <1057445...@qq.com.invalid> wrote:
>
> This is a folder that contains some parquet files. What do you mean? Do you
> mean that ParquetDatasetFactory can only be used for a file, while
> FileSystemDatasetFactory can be used for folders? Or can you tell me how to
> use ParquetDatasetFactory correctly? What do I need to make sure of? For
> example, what should I notice about the metadata_path parameter? It's best to
> have an example. The reason I want to use ParquetDatasetFactory is because
> using the FileSystemDatasetFactory process seems to be as follows
>
>
> ```
> FileSystemDatasetFactory ---> get a dataset
> dataset->GetFragments ---> get fragments for parquet files in the folder
> for fragment in fragments ---> construct a scanner builder ---> Finish() ---> get a scanner
> scanner->ToTable() ---> get a table (read the file to memory)
>
>
> // I want to filter some columns before ToTable(), but it seems that only
> // the Table struct has the ColumnNames() function
> ```
> Is this the wrong way?
> My ultimate goal is to use arrow to read S3 parquet files for tensorflow training
>
>
> ------------------ Original Message ------------------
> From: "dev" <weston.p...@gmail.com>;
> Date: Sat, Apr 9, 2022 (Saturday) 11:38 AM
> To: "dev" <dev@arrow.apache.org>;
> Subject: Re: construct dataset for s3 by ParquetDatasetFactory failed
>
>
> Is `iceberg-test/warehouse/test/metadata` a parquet file? I only ask
> because there is no extension. The commented out
> FileSystemDatasetFactory is only accessing bucket_uri so it would
> potentially succeed even if the metadata file did not exist.
>
> On Fri, Apr 8, 2022 at 1:48 AM 1057445597 <1057445...@qq.com.invalid> wrote:
> >
> > I want to use ParquetDatasetFactory to create a dataset for s3, but failed! The error message is as follows
> >
> >
> > /build/apache-arrow-7.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: IOError: Path does not exist 'iceberg-test/warehouse/test/metadata'
> > /lib/x86_64-linux-gnu/libarrow.so.700(+0x10430bb)[0x7f4ee6fe50bb]
> > /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f4ee6fe52fd]
> > /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17e)[0x7f4ee7104a2e]
> > ./example(+0xd97d)[0x564087f3e97d] ./example(+0x8bc2)[0x564087f39bc2]
> > ./example(+0x94c8)[0x564087f3a4c8] ./example(+0x9fb4)[0x564087f3afb4]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f4ee572b0b3]
> > ./example(+0x69fe)[0x564087f379fe] Aborted (core dumped)
> >
> >
> > In the code snippet below there is a commented-out line that uses
> > FileSystemDatasetFactory to create the dataset, and that works well.
> > Can't a dataset be created through a ParquetDatasetFactory?
> >
> >
> > ```
> > std::shared_ptr<ds::Dataset> GetDatasetFromS3(const std::string& access_key,
> >                                               const std::string& secret_key,
> >                                               const std::string& endpoint_override,
> >                                               const std::string& bucket_uri) {
> >   EnsureS3Initialized();
> >
> >   S3Options s3Options = S3Options::FromAccessKey(access_key, secret_key);
> >   s3Options.endpoint_override = endpoint_override;
> >   s3Options.scheme = "http";
> >
> >   std::shared_ptr<S3FileSystem> s3fs = S3FileSystem::Make(s3Options).ValueOrDie();
> >
> >   std::string path;
> >   std::stringstream ss;
> >   ss << "s3://" << access_key << ":" << secret_key
> >      << "@" << K_METADATA_PATH
> >      << "?scheme=http&endpoint_override=" << endpoint_override;
> >   auto fs = arrow::fs::FileSystemFromUri(ss.str(), &path).ValueOrDie();
> >   // auto fileInfo = fs->GetFileInfo().ValueOrDie();
> >
> >   auto format = std::make_shared<ParquetFileFormat>();
> >
> >   // FileSelector selector;
> >   // selector.base_dir = bucket_uri;
> >
> >   // FileSystemFactoryOptions options;
> >   ds::ParquetFactoryOptions options;
> >
> >   std::string metadata_path = bucket_uri;
> >
> >   ds::FileSource source(bucket_uri, s3fs);
> >   // auto factory = ds::ParquetDatasetFactory::Make(source, bucket_uri, fs, format, options).ValueOrDie();
> >   auto factory = ds::ParquetDatasetFactory::Make(path, fs, format, options).ValueOrDie();
> >
> >   // auto factory = FileSystemDatasetFactory::Make(s3fs, selector, format, options).ValueOrDie();
> >   return factory->Finish().ValueOrDie();
> > }
> > ```
