timothydijamco commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2627468938

   Superficially (in the context of "scan node") it looks safe from what I can 
tell from the code:
   
   * The `Fragment`s whose `physical_schema_`s we would delete are **first 
created** at the beginning of `MakeScanNode` 
([here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1007),
 where `dataset->GetFragments` is called).
   * These `Fragment`s **do not propogate into** the `ExecNode` that is created 
at the end of `MakeScanNode` (since we're passing a generator of only 
*`ExecBatch`es* to `acero::MakeExecNode`).
   * So it seems like we have to make sure that any usages of `Fragment` within 
the long chain of generator logic between <when the `Fragment`s are created> 
and <when they have finally been turned into just `ExecBatch`es> don't need to 
use `physical_schema_` _before_ we clear `physical_schema_`.
   
   Where is `Fragment` used in this long chain of generator logic?
    * A call to `Fragment::ScanBatchesAsync` 
[here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L307)
        * In your PR this currently happens before you trigger the clearing of 
any fields of `Fragment` (i.e. before we would clear `physical_schema_`) 
anyways, so this call won't be a problem
    * A call to `Fragment::partition_expression` 
[here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1048)
        * Today, this (happens to) not use `physical_schema_` — the behavior of 
this method for `Fragment` and all its subclasses is to return a 
`partition_expression_` value that was passed to the constructor of `Fragment`.
     * A call to `Fragment::ToString` 
[here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1059)
       * Today, this (happens to) not use `physical_schema_` — the behavior of 
this method is to return a `type_name()` value that is overriden by each 
subclass to return a constant.
   
   In summary, it seems that today, clearing `physical_schema_` 
[here](https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/scanner.cc#L323)
 (as you're doing in your PR) would be safe, although doing it 
[here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1060)
 (i.e. after the last time the `Fragment`s are used) is safest.
   
   ---
   
   I locally added the below check to the end of your "Parquet cached metadata" 
test 
[here](https://github.com/apache/arrow/pull/45330/files#diff-d88654840d0432223c1617e8fd9289db0f4e6fff6b34e9f062861ef8eec724fcL349)
 and it runs OK. I think this exercises that `physical_schema_` is re-populated 
and usable after previously clearing it and then doing another read (because 
`CountRows` calls into `ParquetFileFragment::TestRowGroups` which makes use of 
`physical_schema_` 
[here](https://github.com/apache/arrow/blob/f8723722341df31c0091c91ec451319ded58c214/cpp/src/arrow/dataset/file_parquet.cc#L927))
   ```
     // ...
     auto predicate = compute::equal(compute::field_ref("x"), 
compute::literal(0));
     ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*test_schema));
     ASSERT_FINISHES_OK_AND_EQ(std::make_optional<int64_t>(1),
                               pq_fragment->CountRows(predicate, options));
   }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to