alamb opened a new issue, #9146:
URL: https://github.com/apache/arrow-datafusion/issues/9146

   ### Describe the bug
   
   Reported in Discord:
https://discord.com/channels/885562378132000778/1166447479609376850/1204466794165706802
   
   My node requests only one column (defined as an expression, as described for 
`UserDefinedLogicalNodeCore`), but the Parquet reader still scans all the columns.
   
   The expected behavior is:
   ```
   MyNode(col1)
    Parquet(col1)
   ```
   
   However, the actual behavior is:
   ```
   MyNode(col1)
    Parquet(col1,col2)
   ```
   
   Projection pushdown works with predefined nodes like `Filter`, but not with my 
custom node.
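The narrowing I expected can be modeled with a tiny self-contained sketch (plain Rust, not the real DataFusion `LogicalPlan`/optimizer API; `MyNode`, `optimize`, and `narrow` are all hypothetical names): the pushdown asks each node which columns its expressions reference, then filters the scan's projection down to that set.

```rust
// Toy model of projection pushdown: the optimizer collects the columns a
// node's expressions reference and narrows the scan below it to that set.
// This mirrors the behavior I expected, not DataFusion's actual types.

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    // A custom node whose expressions reference only the columns in `exprs`.
    MyNode { exprs: Vec<String>, input: Box<Plan> },
    // A table scan that reads the columns listed in `projection`.
    Scan { projection: Vec<String> },
}

fn optimize(plan: Plan) -> Plan {
    match plan {
        Plan::MyNode { exprs, input } => {
            // Push the columns this node needs down into its input.
            let input = narrow(*input, &exprs);
            Plan::MyNode { exprs, input: Box::new(input) }
        }
        scan => scan,
    }
}

fn narrow(plan: Plan, required: &[String]) -> Plan {
    match plan {
        Plan::Scan { projection } => Plan::Scan {
            // Keep only the columns the parent actually references.
            projection: projection
                .into_iter()
                .filter(|c| required.contains(c))
                .collect(),
        },
        other => optimize(other),
    }
}

fn main() {
    let plan = Plan::MyNode {
        exprs: vec!["col1".to_string()],
        input: Box::new(Plan::Scan {
            projection: vec!["col1".to_string(), "col2".to_string()],
        }),
    };
    // After optimization the scan should read only col1.
    println!("{:?}", optimize(plan));
}
```

In this model the scan below `MyNode(col1)` ends up reading only `col1`; with my real custom node, DataFusion leaves the `TableScan` projection untouched.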
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   Logical Plan
   ```
   TableScan: ?table? projection=[project_id, user_id, created_at, event_id, 
event, str_0, str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, 
str_10, str_11, str_12, str_13, str_14, str_15, str_16, str_17, str_18, str_19, 
str_20, str_21, str_22, str_23, str_24, ts_0, i_8, i_16, i_32, i_64, ts, bool, 
bool_nullable, string, decimal, group, v, string_dict]
    Sort: date_trunc(Utf8("day"), created_at) AS created_at ASC NULLS LAST
      PartitionedAggregatePartial: , agg: Count { filter: None, groups: 
Some([(Alias(Alias { expr: ScalarFunction(ScalarFunction { func_def: 
BuiltIn(DateTrunc), args: [Literal(Utf8("day")), Column(Column { relation: 
None, name: "created_at" })] }), relation: None, name: "created_at" }), 
SortField { data_type: Timestamp(Nanosecond, None) })]), predicate: Column { 
relation: None, name: "event" }, partition_col: Column { relation: None, name: 
"user_id" }, distinct: false } as "0_0"
        Filter: project_id = Int64(1) AND created_at >= 
TimestampNanosecond(1706966073340870000, None) AND created_at <= 
TimestampNanosecond(1707225273340870000, None) AND event = UInt16(6)
          Projection: project_id, user_id, created_at, event
            TableScan: ?table? projection=[project_id, user_id, created_at, 
event_id, event]
   ```
   
   
   Physical Plan
   ```
   SortPreservingMergeExec: [date_trunc(day, created_at@1) ASC NULLS LAST], 
metrics=[]
     SortExec: expr=[date_trunc(day, created_at@1) ASC NULLS LAST], metrics=[]
       SegmentedAggregatePartialExec, metrics=[]
         SortExec: expr=[project_id@0 ASC NULLS LAST,user_id@1 ASC NULLS LAST], 
metrics=[]
           CoalesceBatchesExec: target_batch_size=8192, metrics=[]
             RepartitionExec: partitioning=Hash([project_id@0, user_id@1], 12), 
input_partitions=12, metrics=[]
               ProjectionExec: expr=[project_id@0 as project_id, user_id@1 as 
user_id, created_at@2 as created_at, event@4 as event], metrics=[]
                 CoalesceBatchesExec: target_batch_size=8192, metrics=[]
                   FilterExec: project_id@0 = 1 AND created_at@2 >= 
1706966073340870000 AND created_at@2 <= 1707225273340870000 AND event@4 = 6, 
metrics=[]
                     RepartitionExec: partitioning=RoundRobinBatch(12), 
input_partitions=1, metrics=[]
                       ParquetExec: file_groups={1 group: 
[[Users/maximbogdanov/user_files/store/tables/events/0/0.parquet]]}, 
projection=[project_id, user_id, created_at, event_id, event], 
output_orderings=[[project_id@0 DESC NULLS LAST], [user_id@1 DESC NULLS LAST]], 
predicate=project_id@0 = 1 AND created_at@2 >= 1706966073340870000 AND 
created_at@2 <= 1707225273340870000 AND event@4 = 6, 
pruning_predicate=project_id_min@0 <= 1 AND 1 <= project_id_max@1 AND 
created_at_max@2 >= 1706966073340870000 AND created_at_min@3 <= 
1707225273340870000 AND event_min@4 <= 6 AND 6 <= event_max@5, 
metrics=[num_predicate_creation_errors=0]
   ```

