alamb commented on issue #13261:
URL: https://github.com/apache/datafusion/issues/13261#issuecomment-2460335893

   > I agree with the assessment that the information must be coming from the file reader itself.
   
   I also agree with this assessment
   
   In general, I am not sure a SQL-level solution will work well. Some challenges:
   - `ctx.read_parquet("foo.parquet")` may read the file in parallel, 
interleaving the rows
   - `ctx.read_parquet("<directory>")` can read more than one file and the row 
off set / position are per file
   
   However, the DataFrame API you sketch out above seems reasonable and relatively small in scope
   
   The other systems I know of that support Delete Vectors (e.g. Vertica) basically have:
   1. A special flag on the scan node (`ParquetExec` in DataFusion) that says to emit positions (in addition to potentially applying filters, etc)
   2. A special operator that knows how to take a stream of positions and encode them as whatever delete vector format there is (a sketch of this step is below)
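
   To make step 2 concrete, here is a minimal, self-contained sketch. It is plain Rust, not DataFusion code: `encode_delete_vector` and the bitmap layout are made up for illustration, and a real implementation would likely use a compressed bitmap (e.g. roaring) in whatever per-file format the table format defines.

   ```rust
   /// Illustration only: pack the deleted row positions for a single file into
   /// a plain bitmap "delete vector" (one bit per row, set bit = row deleted).
   fn encode_delete_vector(num_rows: usize, deleted: impl IntoIterator<Item = u64>) -> Vec<u64> {
       let mut bitmap = vec![0u64; (num_rows + 63) / 64];
       for pos in deleted {
           let pos = pos as usize;
           assert!(pos < num_rows, "row position out of range for this file");
           bitmap[pos / 64] |= 1u64 << (pos % 64);
       }
       bitmap
   }

   fn main() {
       // Pretend the scan reported that rows 2, 3 and 10 of a 16-row file
       // matched the DELETE predicate.
       let dv = encode_delete_vector(16, [2u64, 3, 10]);
       assert_eq!(dv[0], (1u64 << 2) | (1 << 3) | (1 << 10));
       println!("delete vector words: {:#x?}", dv);
   }
   ```

   Because the positions are per file (see the challenges above), such an operator would also need to group incoming positions by file path before encoding.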
   
   
   
   So in DataFusion this might look more like adding a `TableProvider::delete_from` method, similar to https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.insert_into
   
   Each table provider would then implement that API in whatever way makes sense for its format (which would likely involve positions, as you describe)
   
   This would allow DataFusion to handle the planning of `DELETE` statements.
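
   A very rough, self-contained sketch of that extension point is below. To be clear, `delete_from` is hypothetical (no such method exists on `TableProvider` today), the types here are placeholder stand-ins rather than the real DataFusion ones, and the real method would presumably be `async` like `insert_into`:

   ```rust
   use std::sync::Arc;

   // Placeholder stand-ins so the sketch compiles on its own; in DataFusion these
   // would be the real session state, `ExecutionPlan`, and `Result` types that
   // `TableProvider::insert_into` already uses.
   struct SessionState;
   trait ExecutionPlan {}
   type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

   /// Hypothetical extension point (sketch only): `positions` is a plan that
   /// produces the rows to delete (e.g. file path + row position), and the
   /// provider returns a plan that applies the delete in its own format (for
   /// example by writing a delete vector) and reports the rows affected,
   /// analogous to `insert_into`.
   trait DeleteFrom {
       fn delete_from(
           &self,
           state: &SessionState,
           positions: Arc<dyn ExecutionPlan>,
       ) -> Result<Arc<dyn ExecutionPlan>>;
   }

   fn main() {}
   ```

   The planner would then lower `DELETE FROM t WHERE ...` into a scan that emits positions for the matching rows, feed that into `delete_from`, and let each provider decide how those positions become delete vectors (or rewritten files).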
   

