piotrkai commented on issue #40196:
URL: https://github.com/apache/arrow/issues/40196#issuecomment-2499587477
It does not work with columns either. I am trying to skip the first N rows of a huge dataset, but I cannot get it to work with pyarrow.
Example code:
```
import pyarrow.compute as pc

scan_columns = {}
scan_columns['column_name'] = pc.field('column_name')
# Synthesize a dataset-wide index from the special scanner fields.
scan_columns['__exp_index'] = pc.add(
    pc.multiply(pc.field('__fragment_index'), pc.scalar(10000)),
    pc.field('__batch_index'))
filter = None
if self.start_dataset_index > 0:
    filter = pc.field('__exp_index') >= pc.scalar(self.start_dataset_index)
    self.start_dataset_index = 0
batches = self.dataset.scanner(columns=scan_columns, filter=filter,
                               batch_size=self.batch_size).to_batches()
```
I am getting pretty much the same error:
```
File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
File "pyarrow/_dataset.pyx", line 3557, in
pyarrow._dataset.Scanner.from_dataset
File "pyarrow/_dataset.pyx", line 3475, in
pyarrow._dataset.Scanner._make_scan_options
File "pyarrow/_dataset.pyx", line 3422, in
pyarrow._dataset._populate_builder
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(__fragment_index) in
column_name: int64
```
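In the meantime, a minimal sketch of a Python-side fallback, assuming one can afford to stream and discard the leading batches (the `batches_skipping_rows` helper below is only illustrative, not from pyarrow, and it does not avoid the I/O for the skipped region the way a pushed-down filter would):
```
import pyarrow.dataset as ds

def batches_skipping_rows(dataset: ds.Dataset, skip_rows: int, batch_size: int):
    """Yield record batches from `dataset`, dropping the first `skip_rows` rows.

    Unlike a pushed-down filter, this still reads the skipped batches;
    it just counts rows as they stream in and slices off the overlap.
    """
    remaining = skip_rows
    for batch in dataset.scanner(batch_size=batch_size).to_batches():
        if remaining >= batch.num_rows:
            # The whole batch lies before the requested offset: discard it.
            remaining -= batch.num_rows
            continue
        if remaining > 0:
            # The offset falls inside this batch: drop only the leading rows.
            batch = batch.slice(remaining)
            remaining = 0
        yield batch
```
Usage in the code above would then be something like `batches = batches_skipping_rows(self.dataset, self.start_dataset_index, self.batch_size)`.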