mapleFU commented on issue #37630:
URL: https://github.com/apache/arrow/issues/37630#issuecomment-1729672858

   ```c++
   Result<std::vector<std::shared_ptr<FileFragment>>>
   ParquetDatasetFactory::CollectParquetFragments(const Partitioning& partitioning) {
     std::vector<std::shared_ptr<FileFragment>> fragments(paths_with_row_group_ids_.size());
   
     size_t i = 0;
     for (const auto& e : paths_with_row_group_ids_) {
       const auto& path = e.first;
       auto metadata_subset = metadata_->Subset(e.second);
   
       auto row_groups = Iota(metadata_subset->num_row_groups());
   
       auto partition_expression =
           partitioning.Parse(StripPrefix(path, options_.partition_base_dir))
               .ValueOr(compute::literal(true));
   
       ARROW_ASSIGN_OR_RAISE(
           auto fragment,
           format_->MakeFragment({path, filesystem_}, std::move(partition_expression),
                                 physical_schema_, std::move(row_groups)));
   
       RETURN_NOT_OK(fragment->SetMetadata(metadata_subset, manifest_));
       fragments[i++] = std::move(fragment);
     }
   
     return fragments;
   }
   ```
   
   I noticed that the metadata and manifest are `shared_ptr`s that are shared
   among fragments. When users have very many columns, could this become a huge cost?
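
To make the concern concrete, here is a minimal sketch using hypothetical stand-in types (`FakeMetadata` and `FakeFragment` are not Arrow classes, and the helper names are invented for illustration). It assumes the cost in question is memory. Fragments that share one `shared_ptr` only add a reference count each, while fragments that each hold their own copy keep one full metadata object alive per fragment, so memory grows with fragments times columns. Whether `Subset` actually produces such a per-path copy is something the snippet suggests but that I have not verified.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for Parquet file metadata; not an Arrow type.
struct FakeMetadata {
  std::vector<std::string> column_names;  // one entry per column
};

// Hypothetical stand-in for a fragment that holds on to metadata.
struct FakeFragment {
  std::shared_ptr<FakeMetadata> metadata;
};

// Every fragment points at the same metadata object: no per-fragment copy.
std::vector<FakeFragment> MakeSharing(const std::shared_ptr<FakeMetadata>& md,
                                      std::size_t n) {
  std::vector<FakeFragment> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(FakeFragment{md});
  }
  return out;
}

// Every fragment owns its own copy: memory grows as n * number of columns.
std::vector<FakeFragment> MakeCopying(const FakeMetadata& md, std::size_t n) {
  std::vector<FakeFragment> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(FakeFragment{std::make_shared<FakeMetadata>(md)});
  }
  return out;
}

// Counts how many distinct metadata objects the fragments keep alive.
std::size_t CountDistinct(const std::vector<FakeFragment>& fragments) {
  std::set<const FakeMetadata*> seen;
  for (const auto& f : fragments) {
    seen.insert(f.metadata.get());
  }
  return seen.size();
}
```

Calling `CountDistinct` on the result of `MakeSharing` versus `MakeCopying` shows the difference: the first keeps a single metadata object alive regardless of the fragment count, the second keeps one per fragment.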


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

